72 results for "M. Tamer Özsu"
Search Results
2. Space-Efficient Subgraph Search Over Streaming Graph With Timing Order Constraint
- Authors
- Lei Zou, Youhuan Li, M. Tamer Özsu, and Dongyan Zhao
- Subjects
- Theoretical computer science, Computational theory and mathematics, Servers, Graphs (abstract data types), Concurrent computing, Data modeling, Information systems
- Published
- 2022
3. sGrapp: Butterfly Approximation in Streaming Graphs
- Authors
- M. Tamer Özsu and Aida Sheshbolouki
- Subjects
- General computer science, Discrete mathematics
- Abstract
We study the fundamental problem of butterfly (i.e., (2,2)-biclique) counting in bipartite streaming graphs. Similar to triangles in unipartite graphs, enumerating butterflies is crucial to understanding the structure of bipartite graphs. This benefits many applications where studying the cohesion of graph-shaped data is of particular interest, such as analyzing the structure of computational graphs or algorithm input graphs, as well as dynamic phenomena and analytic tasks over complex real graphs. Butterfly counting is computationally expensive, and known techniques do not scale to large graphs; the problem is even harder in streaming graphs. In this article, following a data-driven methodology, we first conduct an empirical analysis to uncover temporal organizing principles of butterflies in real streaming graphs, and then introduce sGrapp, an approximate adaptive window-based algorithm for counting butterflies, as well as its optimized version sGrapp-x. sGrapp is designed to operate efficiently and effectively over any graph stream with any temporal behavior. Experimental studies of sGrapp and sGrapp-x show superior performance in terms of both accuracy and efficiency.
- Published
- 2022
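sGrapp itself approximates the butterfly count over a stream; as a point of reference, the exact count on a small static bipartite graph can be computed by summing, over every pair of same-side vertices, the number of ways to choose two of their common neighbours. A minimal sketch (the function name and edge format are illustrative, not from the paper):

```python
from itertools import combinations
from math import comb

def count_butterflies(edges):
    """Exactly count butterflies ((2,2)-bicliques) in a bipartite graph.

    edges: iterable of (u, v) pairs, u from the left side, v from the right.
    Each pair of left vertices with c common right neighbours contributes
    C(c, 2) butterflies, and every butterfly is counted exactly once.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    total = 0
    for a, b in combinations(adj, 2):
        common = len(adj[a] & adj[b])
        total += comb(common, 2)
    return total

# A complete 2x2 bipartite graph is exactly one butterfly.
print(count_butterflies([("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]))  # 1
```

This pairwise enumeration is quadratic in the number of left vertices, which is exactly why exact counting does not scale and approximation is attractive on streams.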
4. Optimizing Multi-Query Evaluation in Federated RDF Systems
- Authors
- Zhiwei Xu, Qi Ge, Dongyan Zhao, Lei Zou, M. Tamer Özsu, and Peng Peng
- Subjects
- Theoretical computer science, Computational complexity theory, Joins, Query optimization, Query rewriting, SPARQL, RDF, Scalability
- Abstract
This paper revisits the classical problem of multiple-query optimization in federated RDF systems. We propose a heuristic query-rewriting-based approach to optimize the evaluation of multiple queries. This approach takes advantage of SPARQL 1.1 to share common computation among multiple queries while considering the cost of both query evaluation and data shipment. Although we prove that finding the optimal rewriting for multiple queries is NP-complete, we propose a heuristic rewriting algorithm with a bounded approximation ratio. Furthermore, we propose an efficient method that uses the interconnection topology between RDF sources to filter out irrelevant sources, and we utilize features of SPARQL 1.1 to optimize multiple joins of intermediate matches. Extensive experimental studies show that the proposed techniques are effective, efficient, and scalable.
- Published
- 2021
5. Optimizing Differentially-Maintained Recursive Queries on Dynamic Graphs
- Authors
- Khaled Ammar, Siddhartha Sahu, Semih Salihoglu, and M. Tamer Özsu
- Subjects
- Databases, Distributed, parallel, and cluster computing
- Abstract
Differential computation (DC) is a highly general incremental computation/view maintenance technique that can maintain the output of an arbitrary, possibly recursive, dataflow computation upon changes to its base inputs. As such, it is a promising technique for graph database management systems (GDBMSs) that support continuous recursive queries over dynamic graphs. Although DC can be highly efficient for maintaining these queries, it can require a prohibitively large amount of memory. This paper studies how to reduce the memory overhead of DC with the goal of increasing the scalability of systems that adopt it. We propose a suite of optimizations based on dropping the differences of operators, either completely or partially, and recomputing these differences when necessary. We propose deterministic and probabilistic data structures to keep track of the dropped differences. Extensive experiments demonstrate that the optimizations can improve the scalability of a DC-based continuous query processor.
- Published
- 2022
6. aeSpTV: An Adaptive and Efficient Framework for Sparse Tensor-Vector Product Kernel on a High-Performance Computing Platform
- Authors
- Guoqing Xiao, Chubo Liu, Albert Y. Zomaya, M. Tamer Özsu, Yuedan Chen, and Tao Li
- Subjects
- Computer science, Mathematical statistics, Parallel computing, Supercomputing, Data structures, Tensors, Sparse matrices, Sunway TaihuLight
- Abstract
Multi-dimensional, large-scale, sparse data, which can be neatly represented by sparse tensors, are increasingly used in applications such as data analysis and machine learning. A high-performance sparse tensor-vector product (SpTV), one of the most fundamental operations on sparse tensors, is necessary for improving the efficiency of related applications. In this article, we propose aeSpTV, an adaptive and efficient SpTV framework on the Sunway TaihuLight supercomputer, to solve several challenges of optimizing SpTV on high-performance computing platforms. First, to map SpTV to the Sunway architecture and tame the expensive memory access latency and parallel write conflicts caused by the intrinsic irregularity of SpTV, we introduce an adaptive SpTV parallelization. Second, to co-execute with this parallelization while ensuring high efficiency, we design a sparse tensor data structure named CSSoCR. Third, on top of the adaptive parallelization and the novel data structure, we present an autotuner that chooses the best-fitting tensor partitioning method for aeSpTV, using variance analysis from mathematical statistics to achieve load balance. Fourth, to further leverage the computing power of Sunway, we propose customized optimizations for aeSpTV. Experimental results show that aeSpTV scales well with both thread-level and process-level parallelism on Sunway, achieving a maximum of 195.69 GFLOPS on 128 processes, and that the partitioning autotuner and the other optimization techniques deliver substantial gains.
- Published
- 2020
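The kernel the framework accelerates, sparse tensor-times-vector, is simple to state sequentially: every nonzero contributes its value times the vector entry indexed along the contracted mode. A minimal COO-format sketch (names and layout are illustrative; aeSpTV's actual implementation is a tuned parallel Sunway kernel, not this):

```python
from collections import defaultdict

def sptv(coo, vec, mode):
    """Sparse tensor-times-vector along one mode (a sequential COO sketch).

    coo  : list of (index_tuple, value) nonzeros of an order-N tensor.
    vec  : dense vector multiplied along dimension `mode`.
    Returns the order-(N-1) result as a dict {reduced_index_tuple: value}.
    """
    out = defaultdict(float)
    for idx, val in coo:
        # Drop the contracted mode from the index and accumulate.
        reduced = idx[:mode] + idx[mode + 1:]
        out[reduced] += val * vec[idx[mode]]
    return dict(out)

# A 2x2x2 tensor with two nonzeros, contracted along mode 2:
t = [((0, 0, 0), 2.0), ((0, 0, 1), 3.0)]
print(sptv(t, [10.0, 100.0], mode=2))  # {(0, 0): 320.0}
```

The accumulation into `out[reduced]` is precisely where parallel write conflicts arise when many nonzeros reduce into the same output entry, which motivates the paper's conflict-taming parallelization.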
7. The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation
- Authors
- Ruihong Wang, Jianguo Wang, Stratos Idreos, M. Tamer Özsu, and Walid G. Aref
- Subjects
- Databases, Memory structures, General engineering
- Abstract
Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box; instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lower cost of ownership. This paper makes the case that MD can fuel the next wave of innovation in database systems. We observe that MD revives the great "shared what" debate in the database community. We envision that distributed shared-memory databases (DSM-DB, for short), which have not received much attention before, can be promising in the future with MD. We present a list of challenges and opportunities that can inspire next steps in system design, making the case for DSM-DB.
- Published
- 2022
8. Internet Technology Outlook
- Authors
- M. Tamer Özsu and Latifur Khan
- Subjects
- Computer networks and communications, Cognitive computing, The Internet, Multimedia
- Published
- 2020
9. Distributed Database Systems: The Case for NewSQL
- Authors
- M. Tamer Özsu, Ricardo Jiménez-Peris, and Patrick Valduriez
- Subjects
- SQL, NoSQL, NewSQL, Distributed databases, Data warehouses, Operational databases, Scalability, Spanner
- Abstract
Until a decade ago, the database world was all SQL, distributed, sometimes replicated, and fully consistent. Then web and cloud applications emerged that need to deal with complex big data, and NoSQL came in to address their requirements, trading consistency for scalability and availability. NewSQL is the latest technology in the big data management landscape, combining the scalability and availability of NoSQL with the consistency and usability of SQL. By blending capabilities previously available only in different kinds of database systems, such as fast data ingestion and SQL queries, and by providing online analytics over operational data, NewSQL opens up new opportunities in many application domains where real-time decisions are critical. NewSQL may also simplify data management by removing the traditional separation between NoSQL and SQL (ingest data fast, query it with SQL), as well as between the operational database and the data warehouse/data lake (no more ETLs!). However, a hard problem is scaling out transactions in mixed operational and analytical (HTAP) workloads over big data, possibly coming from different data stores (HDFS, SQL, NoSQL); today, only a few NewSQL systems have solved it. In this paper, we make the case for NewSQL, introducing its basic principles from distributed database systems and illustrating them with Spanner and LeanXcale, two of the most advanced systems in terms of scalable transaction management.
- Published
- 2021
10. The future is big graphs: a Community View on Graph Processing Systems
- Authors
- Jan Hidders, Mohamed Ragab, Angela Bonifati, Hannes Voigt, Maciej Besta, Alexandru Iosup, Ana Lucia Varbanescu, Petra Selmer, Tim Hegeman, Joshua Shinavier, Katja Hose, Stefania Dumbrava, Vasiliki Kalavri, Matei Ripeanu, Gábor Szárnyas, Eiko Yoneki, Da Yan, Semih Salihoglu, Renzo Angles, Eric Peukert, Olaf Hartig, Christian Schulz, Marcelo Arenas, Riccardo Tommasini, Nikolay Yakovets, Alexandru Uta, Stefan Plantikow, Juan F. Sequeda, M. Tamer Özsu, Emanuele Della Valle, Peter Boncz, Khaled Ammar, Wim Martens, Hsiang-Yun Wu, Bernhard Haslhofer, Sherif Sakr, Adriana Iamnitchi, Khuzaima Daudjee, Walid G. Aref, Antonino Tumeo, and Hugo Kapp
- Subjects
- Big graphs, Graph processing systems, Databases, Distributed, parallel, and cluster computing, Data science
- Abstract
Graphs are, by nature, "unifying abstractions" that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed? (Communications of the ACM, 64(9))
- Published
- 2021
11. Message from the General Chairs: DSC 2020
- Authors
- M. Tamer Özsu, Hui Xiong, and Qing Li
- Subjects
- The computing profession, Cyberspace, China
- Abstract
It is our pleasure to welcome you to the Fifth IEEE International Conference on Data Science in Cyberspace (IEEE DSC 2020) in Hong Kong, P. R. China. IEEE DSC is a premier international conference dedicated to all aspects of data science in cyberspace, covering both theory and systems.
- Published
- 2020
12. Regular Path Query Evaluation on Streaming Graphs
- Authors
- M. Tamer Özsu, Anil Pacaci, and Angela Bonifati
- Subjects
- Databases, Streaming graphs, Regular path queries, Semantics, Paths (graph theory)
- Abstract
We study persistent query evaluation over streaming graphs, which is becoming increasingly important. We focus on navigational queries that determine whether there exists a path between two entities that satisfies a user-specified constraint. We adopt the Regular Path Query (RPQ) model, which specifies navigational patterns with labeled constraints. We propose deterministic algorithms to efficiently evaluate persistent RPQs under both arbitrary and simple path semantics in a uniform manner. Experimental analysis on real and synthetic streaming graphs shows that the proposed algorithms can process up to tens of thousands of edges per second and efficiently answer RPQs that are commonly used in real-world workloads. A shorter version of this paper was accepted for publication at the 2020 International Conference on Management of Data (SIGMOD 2020).
- Published
- 2020
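Under the arbitrary-path semantics the abstract mentions, an RPQ on a static labelled graph can be evaluated by a BFS over the product of graph vertices and query-automaton states: a pair (source, node) is an answer whenever the search reaches node in an accepting state. A toy sketch under that reading (the paper's streaming machinery is omitted; all names here are illustrative):

```python
from collections import deque

def rpq(edges, dfa, start, accept):
    """Evaluate a Regular Path Query on a static labelled graph.

    edges  : list of (u, label, v) triples.
    dfa    : dict mapping (state, label) -> state for the query regex.
    Returns the set of (source, target) pairs connected by a path whose
    label sequence the DFA accepts (BFS over node-state product pairs).
    """
    adj = {}
    nodes = set()
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))
        nodes.update((u, v))
    results = set()
    for src in nodes:
        seen = {(src, start)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in accept:
                results.add((src, node))
            for label, nxt in adj.get(node, []):
                s2 = dfa.get((state, label))
                if s2 is not None and (nxt, s2) not in seen:
                    seen.add((nxt, s2))
                    queue.append((nxt, s2))
    return results

# Query a+ (one or more 'a'-labelled edges): DFA 0 -a-> 1 -a-> 1.
dfa = {(0, "a"): 1, (1, "a"): 1}
edges = [("x", "a", "y"), ("y", "a", "z"), ("y", "b", "w")]
print(sorted(rpq(edges, dfa, start=0, accept={1})))  # [('x', 'y'), ('x', 'z'), ('y', 'z')]
```

Because each (node, state) pair is visited at most once per source, the search terminates even on cyclic graphs.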
13. Scalable Mining of Maximal Quasi-Cliques: An Algorithm-System Codesign Approach
- Authors
- Jalal Khalil, Zhe Jiang, Guimu Guo, M. Tamer Özsu, and Da Yan
- Subjects
- Social and information networks, Distributed, parallel, and cluster computing, Parallel algorithms, Load balancing, Scalability, Discrete mathematics
- Abstract
Given a user-specified minimum degree threshold $\gamma$, a $\gamma$-quasi-clique is a subgraph $g=(V_g,E_g)$ in which each vertex $v\in V_g$ connects to at least a $\gamma$ fraction of the other vertices (i.e., $\lceil \gamma\cdot(|V_g|-1)\rceil$ vertices) in $g$. The quasi-clique is one of the most natural definitions of a dense structure, useful for finding communities in social networks and discovering significant biomolecular structures and pathways. However, mining maximal quasi-cliques is notoriously expensive. In this paper, we design parallel algorithms for mining maximal quasi-cliques on G-thinker, a recent distributed framework targeting divide-and-conquer graph mining algorithms that decomposes mining into compute-intensive tasks to fully utilize CPU cores. We found that directly using G-thinker results in a straggler problem, due to (i) drastic load imbalance among tasks and (ii) the difficulty of predicting a task's running time and its growth with task-subgraph size. We address these challenges by redesigning G-thinker's execution engine to prioritize long-running tasks, and by using a novel timeout strategy to decompose the workloads of long-running tasks and improve load balancing. While this system redesign applies to many other expensive dense-subgraph mining problems, this paper verifies the idea by adapting the state-of-the-art quasi-clique algorithm, Quick, to the redesigned G-thinker. We improve Quick by integrating new pruning rules and fixing some missed boundary cases that could lead to missing results. Extensive experiments verify that our new solution scales well with the number of CPU cores, achieving a 201$\times$ runtime speedup when mining a graph with 3.77M vertices and 16.5M edges in a 16-node cluster. Note: Guimu Guo and Da Yan are parallel first authors; this is the full version of the PVLDB 2021 paper with the same title.
- Published
- 2020
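The abstract's definition translates directly into a membership check: every vertex of the candidate subgraph must be adjacent to at least $\lceil \gamma\cdot(|V_g|-1)\rceil$ of the others. A small illustrative checker (this is only the predicate, not the mining algorithm):

```python
from math import ceil

def is_quasi_clique(vertices, edges, gamma):
    """Check whether the induced subgraph is a gamma-quasi-clique:
    each vertex must connect to at least ceil(gamma * (|V|-1)) of the
    other vertices, matching the definition in the abstract."""
    vs = set(vertices)
    deg = {v: 0 for v in vs}
    for u, v in edges:
        if u in vs and v in vs:
            deg[u] += 1
            deg[v] += 1
    need = ceil(gamma * (len(vs) - 1))
    return all(d >= need for d in deg.values())

# A 4-cycle: every vertex has degree 2 of a possible 3.
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_quasi_clique([1, 2, 3, 4], cycle, 0.6))  # True: 2 >= ceil(1.8)
print(is_quasi_clique([1, 2, 3, 4], cycle, 0.9))  # False: 2 < ceil(2.7)
```

At gamma = 1 the predicate degenerates to the ordinary clique test, which is why mining maximal quasi-cliques inherits (and exceeds) the hardness of maximal clique enumeration.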
14. Experimental analysis of distributed graph systems
- Authors
- M. Tamer Özsu and Khaled Ammar
- Subjects
- Theoretical computer science, Graph processing systems, PageRank, Scalability, Usability
- Abstract
This paper evaluates eight parallel graph processing systems: Hadoop, HaLoop, Vertica, Giraph, GraphLab (PowerGraph), Blogel, Flink Gelly, and GraphX (Spark), over four very large datasets (Twitter, World Road Network, UK 200705, and ClueWeb) using four workloads (PageRank, WCC, SSSP, and K-hop). The main objective is to perform an independent scale-out study by experimentally analyzing the performance, usability, and scalability (using up to 128 machines) of these systems. In addition to performance results, we discuss our experiences in using these systems and suggest some system tuning heuristics that lead to better performance.
- Published
- 2018
15. ViewDF: Declarative incremental view maintenance for streaming data
- Authors
- Lukasz Golab, M. Tamer Özsu, and Yuke Yang
- Subjects
- Materialized views, Streaming data, Incremental view maintenance, Pattern matching
- Abstract
We present ViewDF: a flexible and declarative framework for incremental maintenance of materialized views (i.e., results of continuous queries) over streaming data. The main component of the proposed framework is the View Delta Function (ViewDF), which declaratively specifies how to update a materialized view when a new batch of data arrives. We describe and experimentally evaluate a prototype system based on this idea, which allows users to write ViewDFs directly and automatically translates common classes of streaming queries into ViewDFs. Our approach generalizes existing work on incremental view maintenance and enables new optimizations for views that are common in stream analytics, including those with pattern matching and sliding windows.
- Published
- 2017
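The idea of a delta function, folding each new batch into a stored view instead of recomputing the view from scratch, can be illustrated with the simplest possible materialized view, a count per key. This is only a toy in the spirit of the framework; ViewDF's actual delta functions are declarative specifications, and the class below is invented for illustration:

```python
class CountView:
    """A materialized count-per-key view maintained incrementally:
    apply(batch) plays the role of a delta function, merging a new
    batch of keys into the stored view rather than rebuilding it."""

    def __init__(self):
        self.view = {}  # key -> running count (the materialized view)

    def apply(self, batch):
        # The delta: per-key increments from the new batch only.
        for key in batch:
            self.view[key] = self.view.get(key, 0) + 1
        return self.view

v = CountView()
v.apply(["a", "b", "a"])       # first batch arrives
print(v.apply(["b"]))          # {'a': 2, 'b': 2}
```

The point of the design is that each `apply` touches only the keys present in the batch, so maintenance cost tracks batch size rather than view size.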
16. Correction to: Principles of Distributed Database Systems
- Authors
- Patrick Valduriez and M. Tamer Özsu
- Subjects
- Distributed databases, Distributed computing
- Published
- 2020
17. Message from the ICDE 2019 Chairs
- Authors
- M. Tamer Özsu and Christian Søndergaard Jensen
- Published
- 2019
18. Message from the ICDE 2019 Chairs
- Authors
- Divesh Srivastava, M. Tamer Özsu, Lionel M. Ni, Xuemin Lin, Wenfei Fan, and Christian S. Jensen
- Subjects
- Multimedia, Computer science
- Abstract
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
- Published
- 2019
19. Graph-Based RDF Data Management
- Authors
- M. Tamer Özsu and Lei Zou
- Subjects
- Graph databases, RDF, RDF Schema, SPARQL, Linked data, Information retrieval
- Abstract
The increasing size of RDF data requires efficient systems to store and query them. There have been efforts to map RDF data to a relational representation, and a number of systems follow this approach. We have been investigating an alternative: maintaining the native graph model to represent RDF data, and utilizing graph database techniques (such as structure-aware indexes and graph matching algorithms) to address RDF data management. Since 2009, we have been developing a set of graph-based RDF data management systems that follow this approach: gStore, gStore-D, and gAnswer. The first two are designed to support efficient SPARQL query evaluation in centralized and distributed/parallel environments, respectively, while the last aims to provide an easy-to-use natural-language question answering interface for users to access an RDF repository. In this paper, we give an overview of these systems and discuss our design philosophy.
- Published
- 2017
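The graph-matching view of SPARQL evaluation that these systems take can be shown in miniature: a basic graph pattern is a set of triple patterns with variables, and evaluation finds all variable bindings under which every pattern matches some data triple. A toy in-memory matcher (illustrative only; gStore evaluates such patterns with structure-aware indexes, not this nested loop):

```python
def match_bgp(triples, patterns):
    """Match a SPARQL-style basic graph pattern against RDF triples
    by backtracking over triple patterns. Variables are strings
    starting with '?'. Returns the list of all complete bindings."""

    def unify(pattern, triple, binding):
        b = dict(binding)
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if b.get(p, t) != t:   # conflicting earlier binding
                    return None
                b[p] = t
            elif p != t:               # constant position must match
                return None
        return b

    results = [{}]
    for pattern in patterns:
        results = [b2 for b in results for t in triples
                   if (b2 := unify(pattern, t, b)) is not None]
    return results

data = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
# Whom does someone Alice knows, know?
q = [("alice", "knows", "?x"), ("?x", "knows", "?y")]
print(match_bgp(data, q))  # [{'?x': 'bob', '?y': 'carol'}]
```

Joining the patterns left to right, carrying bindings forward, is the same shape as the subgraph matching the abstract describes; the engineering is in pruning the candidate triples per pattern.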
20. A survey of RDF data management systems
- Author
- M. Tamer Özsu
- Subjects
- RDF, RDF Schema, SPARQL, Linked data, Information retrieval
- Abstract
RDF is increasingly being used to encode data for the semantic web and data exchange. There have been a large number of works that address RDF data management following different approaches. In this paper we provide an overview of these works. This review considers centralized solutions (what are referred to as warehousing approaches), distributed solutions, and the techniques that have been developed for querying linked data. In each category, further classifications are provided that would assist readers in understanding the identifying characteristics of different approaches.
- Published
- 2016
21. T-thinker
- Authors
- Guimu Guo, M. Tamer Özsu, Mashiur Rahman Chowdhury, Weida Tan, Da Yan, and John C. S. Lui
- Subjects
- Divide-and-conquer algorithms, Parallel computing, Network communication, Graphs (abstract data types)
- Abstract
Many computationally expensive problems are solved by divide-and-conquer algorithms: a problem over a big dataset can be recursively divided into independent tasks over smaller subsets of the dataset. We present a distributed general-purpose framework called T-thinker, which effectively utilizes the CPU cores in a cluster by properly decomposing an expensive problem into smaller independent tasks for parallel computation. T-thinker overlaps CPU processing with network communication, and its superior performance is verified on a re-engineered version of the graph mining system G-thinker, available at http://cs.uab.edu/yanda/gthinker/.
- Published
- 2019
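The recursive decomposition the abstract describes, splitting a problem over a big dataset into independent tasks that a pool of workers evaluates in parallel, can be sketched in a few lines. This is a single-machine toy with an invented example workload (summing squares); T-thinker itself is a distributed framework, and none of the names below come from it:

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 4  # tasks at or below this size are computed directly

def solve(data):
    """Divide-and-conquer over independent tasks: recursively split
    the input into halves until each piece is small enough, then let
    a worker pool evaluate the independent subtasks in parallel."""
    if len(data) <= THRESHOLD:
        return sum(x * x for x in data)      # base case: compute directly
    mid = len(data) // 2
    halves = [data[:mid], data[mid:]]        # two independent subtasks
    with ThreadPoolExecutor(max_workers=2) as pool:
        return sum(pool.map(solve, halves))  # combine subtask results

print(solve(list(range(10))))  # 285
```

The key property making parallelism safe here is that the subtasks share no state, which is exactly the independence T-thinker exploits to keep cluster CPU cores busy.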
22. GSI: GPU-friendly Subgraph Isomorphism
- Authors
- Lei Zou, Lin Hu, Li Zeng, M. Tamer Özsu, and Fan Zhang
- Subjects
- Subgraph isomorphism, GPUs, Massively parallel computing, Data structures
- Abstract
Subgraph isomorphism is a well-known NP-hard problem widely used in many applications, such as social network analysis and querying knowledge graphs. Due to its inherent hardness, its performance is often a bottleneck in real-world applications. We address this by designing an efficient subgraph isomorphism algorithm that leverages features of the GPU architecture, such as massive parallelism and the memory hierarchy. Existing GPU-based solutions adopt a two-step output scheme, performing the same join process twice in order to write intermediate results concurrently; they also lack GPU-architecture-aware optimizations that would allow scaling to large graphs. In this paper, we propose a GPU-friendly subgraph isomorphism algorithm, GSI. Unlike existing edge-join-based GPU solutions, we propose a Prealloc-Combine strategy based on a vertex-oriented framework, which avoids the joining-twice overhead of existing solutions. We also propose a GPU-friendly data structure, called PCSR, to represent edge-labeled graphs. Extensive experiments on both synthetic and real graphs show that GSI outperforms state-of-the-art algorithms by up to several orders of magnitude and scales to graphs with hundreds of millions of edges.
- Published
- 2019
- Full Text
- View/download PDF
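The abstract above mentions representing an edge-labeled graph in a GPU-friendly layout (PCSR). A simplified, CPU-side sketch of the underlying idea is a CSR index partitioned by edge label, so that neighbors under a given label can be scanned contiguously; the class below is illustrative only and omits the GPU-specific layout of the paper.

```python
# Label-partitioned CSR sketch for an edge-labeled graph: one
# (offsets, targets) pair per label, so neighbors(v, label) is a
# contiguous slice. A CPU-side illustration, not the paper's PCSR.
from collections import defaultdict

class LabelCSR:
    def __init__(self, num_vertices, edges):
        # edges: iterable of (src, label, dst) triples
        self.offsets = {}   # label -> offset array, length num_vertices + 1
        self.targets = {}   # label -> flat neighbor array
        by_label = defaultdict(lambda: defaultdict(list))
        for s, l, d in edges:
            by_label[l][s].append(d)
        for l, adj in by_label.items():
            off, tgt = [0] * (num_vertices + 1), []
            for v in range(num_vertices):
                off[v] = len(tgt)
                tgt.extend(sorted(adj.get(v, [])))
            off[num_vertices] = len(tgt)
            self.offsets[l], self.targets[l] = off, tgt

    def neighbors(self, v, label):
        off, tgt = self.offsets[label], self.targets[label]
        return tgt[off[v]:off[v + 1]]

g = LabelCSR(4, [(0, "a", 1), (0, "a", 2), (1, "b", 3), (2, "a", 3)])
print(g.neighbors(0, "a"))  # [1, 2]
```

The contiguous per-label slices are what make such a layout amenable to coalesced memory access on a GPU.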
23. Main-Memory Hash Joins on Modern Processor Architectures
- Author
-
Gustavo Alonso, Cagri Balkesen, Jens Teubner, and M. Tamer Özsu
- Subjects
Hash join ,Out-of-order execution ,Computer science ,Hash function ,Joins ,Parallel computing ,Simultaneous multithreading ,Hash table ,Computer Science Applications ,Instruction set ,Computational Theory and Mathematics ,Overhead (computing) ,Tuple ,Information Systems - Abstract
Existing main-memory hash join algorithms for multi-core machines can be classified into two camps. Hardware-oblivious hash join variants do not depend on hardware-specific parameters. Rather, they consider qualitative characteristics of modern hardware and are expected to achieve good performance on any technologically similar platform. The assumption behind these algorithms is that hardware is now good enough at hiding its own limitations (through automatic hardware prefetching, out-of-order execution, or simultaneous multi-threading (SMT)) to make hardware-oblivious algorithms competitive without the overhead of careful tuning to the underlying hardware. Hardware-conscious implementations, such as (parallel) radix join, aim to maximally exploit a given architecture by tuning the algorithm parameters (e.g., hash table sizes) to the particular features of the architecture. The assumption here is that explicit parameter tuning yields enough performance advantage to warrant the effort required. This paper compares the two approaches under a wide range of workloads (relative table sizes, tuple sizes, effects of sorted data, etc.) and configuration parameters (VM page sizes, number of threads, number of cores, SMT, SIMD, prefetching, etc.). The results show that hardware-conscious algorithms generally outperform hardware-oblivious ones. However, on specific workloads and special architectures with aggressive simultaneous multi-threading, hardware-oblivious algorithms are competitive. The main conclusion of the paper is that, on existing multi-core architectures, it is still important to carefully tailor algorithms to the underlying hardware to get the necessary performance. However, processor developments may require revisiting this conclusion in the future.
- Published
- 2015
- Full Text
- View/download PDF
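The hardware-conscious radix join described in the abstract above can be sketched in a few lines: both inputs are first partitioned on low-order hash bits so that each per-partition hash table is small (cache-resident in the real algorithm), then build and probe run partition by partition. This is a single-threaded illustration; the number of radix bits is an assumed tuning knob, not a value from the paper.

```python
# Radix join sketch: partition on low-order hash bits, then do a
# per-partition build/probe hash join. Single-pass, single-threaded
# illustration of the partitioning idea only.
def radix_join(R, S, radix_bits=4):
    # R, S: lists of (key, payload) tuples; returns (key, r, s) matches.
    mask = (1 << radix_bits) - 1
    parts_r = [[] for _ in range(1 << radix_bits)]
    parts_s = [[] for _ in range(1 << radix_bits)]
    for t in R:
        parts_r[hash(t[0]) & mask].append(t)
    for t in S:
        parts_s[hash(t[0]) & mask].append(t)
    out = []
    for pr, ps in zip(parts_r, parts_s):
        table = {}
        for k, v in pr:                 # build phase
            table.setdefault(k, []).append(v)
        for k, v in ps:                 # probe phase
            for rv in table.get(k, []):
                out.append((k, rv, v))
    return out

print(radix_join([(1, "r1"), (2, "r2")], [(1, "s1"), (3, "s3")]))
# [(1, 'r1', 's1')]
```

A hardware-oblivious variant would simply build one global hash table over R and probe it with S, skipping the partitioning pass.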
24. Time Constrained Continuous Subgraph Search over Streaming Graphs
- Author
-
Dongyan Zhao, Lei Zou, Youhuan Li, and M. Tamer Özsu
- Subjects
FOS: Computer and information sciences ,020203 distributed computing ,Theoretical computer science ,Time constrained ,Computer science ,Concurrency ,Databases (cs.DB) ,02 engineering and technology ,Graph ,Computer Science - Databases ,020204 information systems ,Server ,0202 electrical engineering, electronic engineering, information engineering ,Isomorphism - Abstract
The growing popularity of dynamic applications such as social networks provides a promising way to detect valuable information in real time. Efficient analysis of high-speed data from dynamic applications is of great significance, and data from these applications can naturally be modeled as a streaming graph. In this paper, we study subgraph (isomorphism) search over streaming graph data that obeys timing-order constraints on the occurrence of edges in the stream. We propose a data structure and algorithm to efficiently answer subgraph searches, introduce optimizations that greatly reduce the space cost, and propose concurrency management to improve system throughput. Extensive experiments on real network traffic data and synthetic social streaming data confirm the efficiency and effectiveness of our solution.
- Published
- 2018
- Full Text
- View/download PDF
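The timing-order constraint mentioned in the abstract above can be illustrated with a minimal check: a candidate subgraph match is valid only if the timestamps of its matched stream edges respect the order the query prescribes. The representation below, (earlier, later) pairs over hypothetical query-edge ids, is a simplified stand-in for the paper's constraint model.

```python
# Minimal timing-order check: match_ts maps each query edge id to the
# timestamp of the stream edge matched to it; `order` lists required
# (earlier, later) pairs. Illustrative only.
def respects_timing(match_ts, order):
    return all(match_ts[a] < match_ts[b] for a, b in order)

ts = {"e1": 10, "e2": 15, "e3": 12}
print(respects_timing(ts, [("e1", "e2"), ("e1", "e3")]))  # True
print(respects_timing(ts, [("e2", "e3")]))                # False
```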
25. Database Administrator (DBA)
- Author
-
M. Tamer Özsu
- Subjects
Database ,Computer science ,Database administrator ,computer.software_genre ,computer - Published
- 2018
- Full Text
- View/download PDF
26. Database
- Author
-
M. Tamer Özsu
- Published
- 2017
- Full Text
- View/download PDF
27. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey
- Author
-
Amine Mhedhbi, Siddhartha Sahu, Jimmy Lin, Semih Salihoglu, and M. Tamer Özsu
- Subjects
FOS: Computer and information sciences ,Computer science ,business.industry ,020207 software engineering ,Databases (cs.DB) ,02 engineering and technology ,computer.software_genre ,Data science ,Graph ,Visualization ,Software ,Computer Science - Databases ,Hardware and Architecture ,020204 information systems ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,business ,computer ,Information Systems ,Data integration - Abstract
Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and white papers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants’ responses to our questions highlighting common patterns and challenges. Based on our interviews and survey of the rest of our sources, we were able to answer some new questions that were raised by participants’ responses to our online survey and understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large, scalability and visualization are undeniably the most pressing challenges faced by participants, and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research.
- Published
- 2017
28. An experimental comparison of pregel-like graph processing systems
- Author
-
Xingfang Wang, Tianqi Jin, Minyang Han, Khuzaima Daudjee, M. Tamer Özsu, and Khaled Ammar
- Subjects
Distributed minimum spanning tree ,Connected component ,Theoretical computer science ,PageRank ,Computer science ,law ,Shortest path problem ,General Engineering ,Graph (abstract data type) ,Graph ,law.invention - Abstract
The introduction of Google's Pregel generated much interest in the field of large-scale graph data processing, inspiring the development of Pregel-like systems such as Apache Giraph, GPS, Mizan, and GraphLab, all of which have appeared in the past two years. To gain an understanding of how Pregel-like systems perform, we conduct a study to experimentally compare Giraph, GPS, Mizan, and GraphLab on equal ground by considering graph and algorithm agnostic optimizations and by using several metrics. The systems are compared with four different algorithms (PageRank, single source shortest path, weakly connected components, and distributed minimum spanning tree) on up to 128 Amazon EC2 machines. We find that the system optimizations present in Giraph and GraphLab allow them to perform well. Our evaluation also shows Giraph 1.0.0's considerable improvement since Giraph 0.1 and identifies areas of improvement for all systems.
- Published
- 2014
- Full Text
- View/download PDF
29. Distributed data management using MapReduce
- Author
-
Beng Chin Ooi, Sai Wu, M. Tamer Özsu, and Feng Li
- Subjects
General Computer Science ,Database ,SIMPLE (military communications protocol) ,Computer science ,Interface (Java) ,business.industry ,Data management ,Document clustering ,computer.software_genre ,USable ,Theoretical Computer Science ,Scalability ,Data analysis ,business ,computer ,Implementation - Abstract
MapReduce is a framework for processing and managing large-scale datasets in a distributed cluster, and it has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed toward making it more usable and efficient for supporting database-centric operations. In this article, we aim to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on the support of distributed data management and processing using the MapReduce framework.
- Published
- 2014
- Full Text
- View/download PDF
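The map/reduce interface described in the abstract above is easy to show in miniature: the user supplies a map function and a reduce function, and the framework handles the shuffle (grouping of intermediate keys). The snippet below simulates that shuffle in-process with a word count, the canonical example; it is a sketch of the programming model, not of any distributed runtime.

```python
# The MapReduce programming model in miniature: user-defined map and
# reduce functions, with the framework's shuffle simulated by grouping
# intermediate pairs by key.
from collections import defaultdict

def map_fn(doc):                      # emit (word, 1) per occurrence
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(word, counts):          # sum the counts per word
    return word, sum(counts)

def run_mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)        # "shuffle": group values by key
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_mapreduce(["a b a", "b c"], map_fn, reduce_fn))
# {'a': 2, 'b': 2, 'c': 1}
```

Much of the research the article surveys concerns making database-style operators (joins, indexes, iterative queries) efficient on top of exactly this narrow interface.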
30. gStore: a graph-based SPARQL query engine
- Author
-
M. Tamer Özsu, Lei Chen, Ruizhe Huang, Xuchuan Shen, Lei Zou, and Dongyan Zhao
- Subjects
Information retrieval ,Graph database ,Computer science ,RDF Schema ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,InformationSystems_DATABASEMANAGEMENT ,computer.file_format ,computer.software_genre ,RDF/XML ,Named graph ,Hardware and Architecture ,SPARQL ,Graph (abstract data type) ,RDF ,computer ,Information Systems ,RDF query language ,computer.programming_language - Abstract
We address efficient processing of SPARQL queries over RDF datasets. The proposed techniques, incorporated into the gStore system, handle, in a uniform and scalable manner, SPARQL queries with wildcards and aggregate operators over dynamic RDF datasets. Our approach is graph based. We store RDF data as a large graph and also represent a SPARQL query as a query graph. Thus, the query answering problem is converted into a subgraph matching problem. To achieve efficient and scalable query processing, we develop an index, together with effective pruning rules and efficient search algorithms. We propose techniques that use this infrastructure to answer aggregation queries. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solutions.
- Published
- 2013
- Full Text
- View/download PDF
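The core idea in the abstract above, that a SPARQL query is itself a graph and answering it amounts to subgraph matching over the data graph, can be sketched with a naive matcher over triple patterns. The indexes and pruning rules that make gStore scale are omitted; this is only the graph-based view of evaluation.

```python
# Naive basic-graph-pattern matching: data is a set of (s, p, o)
# triples, a pattern is a list of triple patterns where strings
# starting with '?' are variables. Illustrative; no indexes or pruning.
def match_bgp(data, pattern):
    def unify(tp, triple, binding):
        b = dict(binding)
        for q, v in zip(tp, triple):
            if q.startswith("?"):
                if b.get(q, v) != v:
                    return None       # conflicting variable binding
                b[q] = v
            elif q != v:
                return None           # constant mismatch
        return b

    results = [{}]
    for tp in pattern:                # extend bindings pattern by pattern
        results = [b2 for b in results for t in data
                   if (b2 := unify(tp, t, b)) is not None]
    return results

data = {("alice", "knows", "bob"), ("bob", "knows", "carol")}
print(match_bgp(data, [("?x", "knows", "?y"), ("?y", "knows", "?z")]))
# [{'?x': 'alice', '?y': 'bob', '?z': 'carol'}]
```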
31. Answering pattern match queries in large graph databases via graph embedding
- Author
-
Lei Chen, M. Tamer Özsu, Lei Zou, and Dongyan Zhao
- Subjects
Factor-critical graph ,Theoretical computer science ,Computer science ,Graph embedding ,Strength of a graph ,computer.software_genre ,Simplex graph ,Graph power ,Reachability ,Clique-width ,Pattern matching ,Graph property ,Computer Science::Databases ,Complement graph ,Distance-hereditary graph ,Graph database ,Voltage graph ,Directed graph ,Graph ,Graph bandwidth ,Hardware and Architecture ,Shortest path problem ,Graph (abstract data type) ,Null graph ,Graph factorization ,computer ,MathematicsofComputing_DISCRETEMATHEMATICS ,Information Systems - Abstract
The growing popularity of graph databases has generated interesting data management problems, such as subgraph search, shortest path query, reachability verification, and pattern matching. Among these, a pattern match query is more flexible compared with a subgraph search and more informative compared with a shortest path or a reachability query. In this paper, we address distance-based pattern match queries over a large data graph G. Due to the huge search space, we adopt a filter-and-refine framework to answer a pattern match query over a large graph. We first find a set of candidate matches by a graph embedding technique and then evaluate these to find the exact matches. Extensive experiments confirm the superiority of our method.
- Published
- 2011
- Full Text
- View/download PDF
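The filter step of the filter-and-refine framework described above can be illustrated with a simple landmark embedding: each vertex is embedded by its shortest-path distances to a few landmark vertices, and by the triangle inequality the coordinate-wise difference is a lower bound on the true distance, so candidates that violate the query's distance threshold can be discarded without any graph search. This is a generic landmark technique sketched under our own assumptions, not the paper's specific embedding.

```python
# Filter step via a landmark embedding: |emb[u][i] - emb[v][i]| lower-
# bounds dist(u, v), so pairs whose bound exceeds the threshold are
# filtered out; survivors still need exact refinement.
from collections import deque

def bfs_dist(adj, src):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def embed(adj, landmarks):
    tables = [bfs_dist(adj, l) for l in landmarks]
    return {v: tuple(t[v] for t in tables) for v in adj}

def may_be_within(emb, u, v, threshold):
    # True means "cannot be ruled out", not "is within threshold".
    return max(abs(a - b) for a, b in zip(emb[u], emb[v])) <= threshold

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
emb = embed(adj, landmarks=[0, 3])
print(may_be_within(emb, 0, 3, 1))  # False: the lower bound is 3
```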
32. Walking Without a Map: Ranking-Based Traversal for Querying Linked Data
- Author
-
Olaf Hartig and M. Tamer Özsu
- Subjects
Result set ,Computer science ,Computer Sciences ,Rank (computer programming) ,02 engineering and technology ,Linked data ,computer.software_genre ,Ranking (information retrieval) ,Data link ,Tree traversal ,Datavetenskap (datalogi) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,A priori and a posteriori ,020201 artificial intelligence & image processing ,Data mining ,Baseline (configuration management) ,computer - Abstract
The traversal-based approach to executing queries over Linked Data on the WWW fetches data by traversing data links and is thus able to make use of up-to-date data from initially unknown data sources. While the downside of this approach is the delay before the query engine completes a query execution, user-perceived response time may be improved significantly by returning elements of the result set as soon as possible. To this end, the query engine requires a traversal strategy that enables it to fetch result-relevant data as early as possible. The challenge for such a strategy is that the query engine does not know a priori which of the data sources discovered during query execution will contain result-relevant data. In this paper, we investigate 14 different approaches to ranking traversal steps, which yield a variety of traversal strategies. We experimentally study their impact on response times and compare them to a baseline that resembles a breadth-first traversal. While our experiments show that some of the approaches can achieve noteworthy improvements over the baseline in a significant number of cases, we also observe that every approach has a non-negligible chance of producing response times that are worse than the baseline.
- Published
- 2016
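The difference between the breadth-first baseline and a ranking-based traversal described above is essentially the choice of frontier data structure: a FIFO queue versus a priority queue ordered by a pluggable scoring function. The sketch below shows the latter; the toy link map and scoring function are assumptions for illustration, not the paper's ranking approaches.

```python
# Ranking-based traversal in miniature: pop the highest-scoring
# discovered URI first instead of FIFO order. `score` is pluggable;
# the paper studies 14 ranking variants, this one is a toy.
import heapq

def ranked_traversal(links, score, start, limit=10):
    # links: uri -> list of uris found in that source's data
    seen, heap, order = {start}, [(-score(start), start)], []
    while heap and len(order) < limit:
        _, uri = heapq.heappop(heap)   # max-score via negated key
        order.append(uri)
        for nxt in links.get(uri, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return order

links = {"a": ["b", "c"], "c": ["d"]}
# toy score: prefer later letters of the alphabet
print(ranked_traversal(links, score=ord, start="a"))  # ['a', 'c', 'd', 'b']
```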
33. Executing queries over schemaless RDF databases
- Author
-
Khuzaima Daudjee, Olaf Hartig, M. Tamer Özsu, and Güneş Aluç
- Subjects
Information retrieval ,Web search query ,business.industry ,Computer science ,View ,Data management ,RDF Schema ,InformationSystems_DATABASEMANAGEMENT ,computer.file_format ,Linked data ,Query optimization ,Query language ,Query expansion ,Web query classification ,Schema (psychology) ,Web application ,SPARQL ,Sargable ,Tuple ,RDF ,business ,computer ,Semantic Web ,RDF query language ,computer.programming_language - Abstract
Recent advances in Linked Data Management and the Semantic Web have led to a rapid increase in both the quantity as well as the variety of Web applications that rely on the SPARQL interface to query RDF data. Thus, RDF data management systems are increasingly exposed to workloads that are far more diverse and dynamic than what these systems were designed to handle. The problem is that existing systems rely on a workload-oblivious physical representation that has a fixed schema, which is not suitable for diverse and dynamic workloads. To address these issues, we propose a physical representation that is schemaless. The resulting flexibility enables an RDF dataset to be clustered based purely on the workload, which is key to achieving good performance through optimized I/O and cache utilization. Consequently, given a workload, we develop techniques to compute a good clustering of the database. We also design a new query evaluation model, namely, schemaless-evaluation that leverages this workload-aware clustering of the database whereby, with high probability, each tuple in the result set of a query is expected to be contained in at most one cluster. Our query evaluation model exploits this property to achieve better performance while ensuring fast generation of query plans without being hindered by the lack of a fixed physical schema.
- Published
- 2015
- Full Text
- View/download PDF
34. Extending DBMSs with satellite databases
- Author
-
Christian Plattner, Gustavo Alonso, and M. Tamer Özsu
- Subjects
snapshot isolation ,dynamic satellite creation ,Database ,Computer science ,satellite databases ,extending database functionality ,computer.software_genre ,Centralized database ,Single system image ,Hardware and Architecture ,Middleware ,Middleware (distributed applications) ,Scalability ,Satellite ,Snapshot isolation ,Layer (object-oriented design) ,computer ,Information Systems - Abstract
The VLDB Journal, 17 (4), ISSN:1066-8888, ISSN:0949-877X
- Published
- 2006
- Full Text
- View/download PDF
35. Incremental click-stream tree model: Learning from new users for web page prediction
- Author
-
Ş. G. Öğüdücü and M. Tamer Özsu
- Subjects
Web analytics ,Web server ,Information Systems and Management ,Information retrieval ,Computer science ,business.industry ,Recommender system ,computer.software_genre ,Data structure ,Hardware and Architecture ,Web page ,Incremental build model ,Data mining ,business ,computer ,Software ,Decision tree model ,Clickstream ,Information Systems - Abstract
Predicting the next request of a user has gained importance as Web-based activity increases, in order to guide Web users during their visits to Web sites. Previously proposed recommendation methods use data collected over time to extract usage patterns. However, these patterns may change over time, because each day new log entries are added to the database and old entries are deleted. Thus, it is highly desirable to update the recommendation model incrementally. In this paper, we propose a new model for modeling and predicting Web user sessions that attempts to reduce online recommendation time while retaining predictive accuracy. Since the model is very easy to modify, it is updated during the recommendation process. The incremental algorithm yields better prediction accuracy as well as shorter online recommendation time. A performance evaluation of the Incremental Click-Stream Tree model over two different Web server access logs indicates that the proposed incremental model yields a significant speed-up of recommendation time and an improvement in prediction accuracy.
- Published
- 2006
- Full Text
- View/download PDF
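The shape of a click-stream tree and its incremental update can be sketched simply: each session path is inserted into a counted trie, and the next-page recommendation is the most frequent child of the node reached by the current session prefix. This is a bare-bones illustration under our own simplifications, not the paper's full model.

```python
# Counted trie over click sessions: insert() is the incremental update,
# recommend() returns the most frequent next page for a session prefix.
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

class ClickTree:
    def __init__(self):
        self.root = Node()

    def insert(self, session):        # incremental update, one session
        node = self.root
        for page in session:
            node = node.children.setdefault(page, Node())
            node.count += 1

    def recommend(self, prefix):
        node = self.root
        for page in prefix:           # walk down the current prefix
            node = node.children.get(page)
            if node is None:
                return None
        if not node.children:
            return None
        return max(node.children, key=lambda p: node.children[p].count)

t = ClickTree()
for s in [["home", "news"], ["home", "news"], ["home", "shop"]]:
    t.insert(s)
print(t.recommend(["home"]))  # news
```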
36. An Adaptive Data-Shipping Architecture for Client Caching Data Management Systems
- Author
-
M. Tamer Özsu, Ronald C. Unrau, and Kaladhar Voruganti
- Subjects
Information Systems and Management ,Data shipping ,Computer science ,business.industry ,Distributed computing ,Data management ,Data structure ,Hardware and Architecture ,Pointer swizzling ,Robustness (computer science) ,Cache ,Architecture ,business ,Software ,Information Systems ,Data transmission - Abstract
Data-shipping is an important form of data distribution architecture where data objects are retrieved from the server, and are cached and operated upon at the client nodes. This architecture reduces network latency and increases resource utilization at the client. Object database management systems (ODBMS), file-systems, mobile data management systems, multi-tiered Web-server systems and hybrid query-shipping/data-shipping architectures all use some variant of the data-shipping. Despite a decade of research, there is still a lack of consensus amongst the proponents of ODBMSs as to the type of data shipping architectures and algorithms that should be used. The absence of both robust (with respect to performance) algorithms, and a comprehensive performance study comparing the competing algorithms are the key reasons for this lack of agreement. In this paper we address both of these problems. We first present an adaptive data-shipping architecture which utilizes adaptive data transfer, cache consistency and recovery algorithms to improve the robustness (with respect to performance) of a data-shipping ODBMS. We then present a comprehensive performance study which evaluates the competing client-server architectures and algorithms. The study verifies the robustness of the new adaptive data-shipping architecture, provides new insights into the performance of the different competing algorithms, and helps to overturn some existing notions about some of the algorithms.
- Published
- 2004
- Full Text
- View/download PDF
37. VIEWS OR POINTS OF VIEW ON IMAGES
- Author
-
Vincent Oria and M. Tamer Özsu
- Subjects
Information retrieval ,Image View ,Computer science ,View ,Representation (systemics) ,computer.software_genre ,Object (computer science) ,Computer Graphics and Computer-Aided Design ,Database design ,Computer Science Applications ,Data independence ,Method ,Relational database management system ,Computer Vision and Pattern Recognition ,Data mining ,computer - Abstract
Images, like other multimedia data, need to be described, as it is difficult to grasp their semantics from the raw data. With the emergence of standards like MPEG-7, multimedia data will increasingly be produced together with semantic descriptors. But a description of multimedia data is just an interpretation, a point of view on the data, and different interpretations can exist for the same multimedia data. In this paper we explore the use of view techniques to define and manage different points of view on images. Views have been widely used in relational database management systems to extend modeling capabilities and to provide logical data independence. Since our image model is defined on an object-oriented model, we first propose a powerful object-oriented view mechanism based on the distinction between class and type. The object view is then used in the image view definition. The image view mechanism exploits the separation between the physical representation of a real-world object in an image and the real object itself, to allow different interpretations of an image region. Finally, we discuss the implementation of the image view mechanisms on existing object models.
- Published
- 2003
- Full Text
- View/download PDF
38. On type systems for object-oriented database programming languages
- Author
-
M. Tamer Özsu, Yuri Leontiev, and Duane Szafron
- Subjects
Theoretical computer science ,General Computer Science ,Computer science ,Functional logic programming ,Programming language ,computer.software_genre ,Data type ,Theoretical Computer Science ,Unit type ,Type safety ,Programming paradigm ,Object type ,Object-relational mapping ,First-generation programming language ,computer - Abstract
The concept of an object-oriented database programming language (OODBPL) is appealing because it has the potential of combining the advantages of object orientation and database programming to yield a powerful and universal programming language design. A uniform and consistent combination of object orientation and database programming, however, is not straightforward. Since one of the main components of an object-oriented programming language is its type system, one of the first problems that arises during an OODBPL design is related to the development of a uniform, consistent, and theoretically sound type system that is sufficiently expressive to satisfy the combined needs of object orientation and database programming.The purpose of this article is to answer two questions: "What are the requirements that a modern type system for an object-oriented database programming language should satisfy?" and "Are there any type systems developed to-date that satisfy these requirements?". In order to answer the first question, we compile the set of requirements that an OODBPL type system should satisfy. We then use this set of requirements to evaluate more than 30 existing type systems. The result of this extensive analysis shows that while each of the requirements is satisfied by at least one type system, no type system satisfies all of them. It also enables identification of the mechanisms that lie behind the strengths and weaknesses of the current type systems.
- Published
- 2002
- Full Text
- View/download PDF
39. Processing SPARQL Queries Over Distributed RDF Graphs
- Author
-
Lei Zou, M. Tamer Özsu, Peng Peng, Lei Chen, and Dongyan Zhao
- Subjects
FOS: Computer and information sciences ,Computer science ,RDF Schema ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,02 engineering and technology ,computer.software_genre ,RDF/XML ,Computer Science::Digital Libraries ,Named graph ,Computer Science - Databases ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,SPARQL ,RDF ,Cwm ,Computer Science::Databases ,computer.programming_language ,Computer Science::Information Retrieval ,InformationSystems_DATABASEMANAGEMENT ,Databases (cs.DB) ,computer.file_format ,Blank node ,Graph ,Computer Science - Distributed, Parallel, and Cluster Computing ,Hardware and Architecture ,Graph (abstract data type) ,020201 artificial intelligence & image processing ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Data mining ,computer ,Information Systems ,RDF query language - Abstract
We propose techniques for processing SPARQL queries over a large RDF graph in a distributed environment. We adopt a "partial evaluation and assembly" framework. Answering a SPARQL query Q is equivalent to finding subgraph matches of the query graph Q over the RDF graph G. Based on properties of subgraph matching over a distributed graph, we introduce local partial matches as partial answers in each fragment of the RDF graph G. For assembly, we propose two methods: centralized and distributed assembly. We analyze our algorithms both theoretically and experimentally. Extensive experiments over both real and benchmark RDF repositories of billions of triples confirm that our method is superior to the state-of-the-art methods in both performance and scalability.
- Published
- 2014
40. Reachable subwebs for traversal-based query execution
- Author
-
Olaf Hartig and M. Tamer Özsu
- Subjects
Vertex (graph theory) ,Theoretical computer science ,Computer science ,Linked data ,computer.software_genre ,law.invention ,Tree traversal ,PageRank ,law ,Benchmark (computing) ,Graph (abstract data type) ,Data mining ,Semantic Web ,computer - Abstract
Traversal-based approaches to execute queries over data on the Web have recently been studied. These approaches make use of up-to-date data from initially unknown data sources and, thus, enable applications to tap the full potential of the Web. While existing work focuses primarily on implementation techniques, a principled analysis of subwebs that are reachable by such approaches is missing. Such an analysis may help to gain new insight into the problem of optimizing the response time of traversal-based query engines. Furthermore, a better understanding of characteristics of such subwebs may also inform approaches to benchmark these engines. This paper provides such an analysis. In particular, we identify typical graph-based properties of query-specific reachable subwebs and quantify their diversity. Furthermore, we investigate whether vertex scoring methods (e.g., PageRank) are able to predict query-relevance of data sources when applied to such subwebs.
- Published
- 2014
- Full Text
- View/download PDF
41. R-Store: A scalable distributed system for supporting real-time analytics
- Author
-
Beng Chin Ooi, Gang Chen, M. Tamer Özsu, and Feng Li
- Subjects
Distributed database ,Database ,Computer science ,business.industry ,Online analytical processing ,Big data ,InformationSystems_DATABASEMANAGEMENT ,computer.software_genre ,Data modeling ,Metadata ,Data cube ,Data access ,Analytics ,Online transaction processing ,business ,computer - Abstract
It is widely recognized that OLTP and OLAP queries have different data access patterns, processing needs, and requirements. Hence, OLTP and OLAP queries are typically handled by two different systems, and data are periodically extracted from the OLTP system, transformed, and loaded into the OLAP system for analysis. Given that big data can provide enterprises with useful insights from vast amounts of data, effective and timely decisions derived from real-time analytics are important. It is therefore desirable to provide real-time OLAP querying support, where OLAP queries read the latest data while OLTP queries create the new versions. In this paper, we propose R-Store, a scalable distributed system for supporting real-time OLAP by extending the MapReduce framework. We extend an open-source distributed key/value system, HBase, as the underlying storage system that stores the data cube and real-time data. When real-time data are updated, they are streamed to a streaming MapReduce engine, namely Hstreaming, for updating the cube on an incremental basis. Based on the metadata stored in the storage system, the MapReduce jobs for OLAP queries use either the data cube or the OLTP database, or both. We propose techniques to efficiently scan real-time data in the storage system, and design an adaptive algorithm to process real-time queries based on our proposed cost model. The main objectives are to ensure the freshness of answers and low processing latency. Experiments conducted on the TPC-H data set demonstrate the effectiveness and efficiency of our approach.
- Published
- 2014
- Full Text
- View/download PDF
42. A temporal approach to managing schema evolution in object database systems
- Author
-
Iqbal A. Goralwalla, Duane Szafron, M. Tamer Özsu, and Randal J. Peters
- Subjects
Information Systems and Management ,Database ,Schema migration ,Computer science ,Star schema ,Schema (psychology) ,Semi-structured model ,Database schema ,Object model ,computer.software_genre ,computer ,Conceptual schema ,Schema evolution - Abstract
The issues of schema evolution and temporal object models are generally considered to be orthogonal and are handled independently. However, to properly model applications that need incremental design and experimentation, the evolutionary histories of the schema objects should be traceable rather than corrective so that historical queries can be supported. In this paper we propose a method for managing schema changes, and propagating these changes to object instances by exploiting the functionality of a temporal object model. The result is a uniform treatment of schema evolution and temporal support for many object database management systems applications that require both.
- Published
- 1998
- Full Text
- View/download PDF
43. Diversified Stress Testing of RDF Data Management Systems
- Author
-
Olaf Hartig, Güneş Aluç, Khuzaima Daudjee, and M. Tamer Özsu
- Subjects
Database ,Computer science ,business.industry ,Data management ,RDF Schema ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,InformationSystems_DATABASEMANAGEMENT ,Linked data ,computer.file_format ,computer.software_genre ,Query language ,Test suite ,SPARQL ,RDF ,business ,computer ,RDF query language ,computer.programming_language - Abstract
The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF data continue to be published across heterogeneous domains and integrated at Web-scale such as in the Linked Open Data (LOD) cloud, RDF data management systems are being exposed to queries that are far more diverse and workloads that are far more varied. The first contribution of our work is an in-depth experimental analysis that shows existing SPARQL benchmarks are not suitable for testing systems for diverse queries and varied workloads. To address these shortcomings, our second contribution is the Waterloo SPARQL Diversity Test Suite (WatDiv) that provides stress testing tools for RDF data management systems. Using WatDiv, we have been able to reveal issues with existing systems that went unnoticed in evaluations using earlier benchmarks. Specifically, our experiments with five popular RDF data management systems show that they cannot deliver good performance uniformly across workloads. For some queries, there can be as much as five orders of magnitude difference between the query execution time of the fastest and the slowest system while the fastest system on one query may unexpectedly time out on another query. By performing a detailed analysis, we pinpoint these problems to specific types of queries and workloads.
- Published
- 2014
- Full Text
- View/download PDF
44. The Semantic Web – ISWC 2014
- Author
-
Ahmet Soylu, Ali Hasnain, Stefan Bischof, Ruben Verborgh, Syed Sibte Raza Abidi, Konstantinos Tarabanis, Ricardo Usbeck, Dimitrios Ntalaperas, Roman Kontchakov, Paolo Merialdo, Oscar Corcho, Sandro Coelho, Christopher Brewster, Deborah McGuinness, Diego Calvanese, Jan Wielemaker, Helena Deus, Ziqi Zhang, Erik Mannens, Sören Auer, Antony Williams, Riccardo ROSATI, Sebastian Rudolph, Zhibin Quan, Michael Granitzer, Stefan Schlobach, Maria-Esther Vidal, Markus Krötzsch, Olaf Hartig, Eduardo Veas, Ben De Meester, M. Tamer Özsu, Axel Polleres, Stefan Decker, Carole Goble, Karen Karapetyan, Alasdair Gray, and Guohui Xiao
- Subjects
Computer science ,business.industry ,Interoperability ,Comparative historical research ,Cloud computing ,Linked data ,Layer (object-oriented design) ,business ,Data science ,Semantic Web ,Digital history ,Field (computer science) - Abstract
We present the Dutch Ships and Sailors Linked Data Cloud. This heterogeneous dataset brings together four curated datasets on Dutch Maritime history as five-star linked data. The individual datasets use separate datamodels, designed in close collaboration with maritime historical researchers. The individual models are mapped to a common interoperability layer, allowing for analysis of the data on the general level. We present the datasets, modeling decisions, internal links and links to external data sources. We show ways of accessing the data and present a number of examples of how the dataset can be used for historical research. The Dutch Ships and Sailors Linked Data Cloud is a potential hub dataset for digital history research and a prime example of the benefits of Linked Data for this field.
- Published
- 2014
- Full Text
- View/download PDF
45. Distributed and parallel database systems
- Author
-
M. Tamer Özsu and Patrick Valduriez
- Subjects
Distributed design patterns ,General Computer Science ,Distributed database ,Computer science ,Parallel database ,Distributed computing ,Distributed concurrency control ,InformationSystems_DATABASEMANAGEMENT ,Data-intensive computing ,Database theory ,Replication (computing) ,Database tuning ,Theoretical Computer Science - Abstract
The maturation of database management system (DBMS) technology has coincided with significant developments in distributed computing and parallel processing technologies. The end result is the development of distributed database management systems and parallel database management systems that are now the dominant data management tools for highly data-intensive applications. With the emergence of cloud computing, distributed and parallel database systems have started to converge. In this chapter, we present an overview of the distributed DBMS and parallel DBMS technologies, highlight the unique characteristics of each, and indicate the similarities between them. We also discuss the new challenges and emerging solutions.
- Published
- 1996
- Full Text
- View/download PDF
46. Efficient core decomposition in massive networks
- Author
-
Shumo Chu, Yiping Ke, James Cheng, and M. Tamer Özsu
- Subjects
Factor-critical graph ,Theoretical computer science ,Computer science ,Strength of a graph ,Degeneracy (graph theory) ,Simplex graph ,law.invention ,Graph power ,law ,Line graph ,Complement graph ,Distance-hereditary graph ,Clique ,Discrete mathematics ,Degree (graph theory) ,Quartic graph ,Approximation algorithm ,Graph theory ,Butterfly graph ,Graph ,Vertex (geometry) ,Hypercube graph ,Graph bandwidth ,Path (graph theory) ,Cycle graph ,Level structure ,k-edge-connected graph ,Algorithm design ,Multiple edges ,Centrality ,Null graph ,Graph factorization ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
The k-core of a graph is the largest subgraph in which every vertex is connected to at least k other vertices within the subgraph. Core decomposition finds the k-core of the graph for every possible k. Past studies have shown important applications of core decomposition such as in the study of the properties of large networks (e.g., sustainability, connectivity, centrality, etc.), for solving NP-hard problems efficiently in real networks (e.g., maximum clique finding, densest subgraph approximation, etc.), and for large-scale network fingerprinting and visualization. The k-core is a well-accepted concept partly because there exists a simple and efficient algorithm for core decomposition, which recursively removes the lowest-degree vertices and their incident edges. However, this algorithm requires random access to the graph and hence assumes the entire graph can be kept in main memory. Nevertheless, real-world networks such as online social networks have become exceedingly large in recent years and still keep growing at a steady rate. In this paper, we propose the first external-memory algorithm for core decomposition in massive graphs. When the memory is large enough to hold the graph, our algorithm achieves performance comparable to the in-memory algorithm. When the graph is too large to be kept in memory, our algorithm requires only O(k_max) scans of the graph, where k_max is the largest core number of the graph. We demonstrate the efficiency of our algorithm on real networks with up to 52.9 million vertices and 1.65 billion edges.
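The simple in-memory peeling algorithm that the abstract contrasts with the external-memory approach can be sketched in a few lines. This is an illustrative sketch of the classic recursive-removal method (not the paper's external-memory algorithm); the linear min-degree scan keeps it simple at the cost of efficiency on large graphs.

```python
from collections import defaultdict

def core_decomposition(edges):
    """Compute core numbers by repeatedly peeling a minimum-degree
    vertex; the core number of a vertex is the (monotonically
    non-decreasing) degree threshold at the time it is removed."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {v: len(ns) for v, ns in adj.items()}
    core, k = {}, 0
    while degree:
        v = min(degree, key=degree.get)  # lowest-degree remaining vertex
        k = max(k, degree[v])            # core numbers never decrease
        core[v] = k
        for u in adj[v]:                 # remove v's incident edges
            if u in degree:
                degree[u] -= 1
                adj[u].discard(v)
        del degree[v]
    return core
```

For example, on a triangle {1, 2, 3} with a pendant vertex 4 attached to 3, the triangle vertices get core number 2 and the pendant gets 1.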
- Published
- 2011
- Full Text
- View/download PDF
47. Distributed Data Management in 2020?
- Author
-
M. Tamer Özsu, Patrick Valduriez, Beng Chin Ooi, Bettina Kemme, Ricardo Jiménez-Peris, and Serge Abiteboul
- Subjects
Database server ,Informática ,Distributed database ,Relational database ,Computer science ,business.industry ,Data management ,05 social sciences ,02 engineering and technology ,050905 science studies ,Database testing ,Replication (computing) ,World Wide Web ,Distributed data store ,0202 electrical engineering, electronic engineering, information engineering ,Relational model ,020201 artificial intelligence & image processing ,0509 other social sciences ,business ,Database transaction - Abstract
Work on distributed data management commenced shortly after the introduction of the relational model in the mid-1970s. The 1970s and 1980s were very active periods for the development of distributed relational database technology, and claims were made that within the following ten years centralized databases would be an “antique curiosity” and most organizations would move toward distributed database managers [1]. That prediction has certainly come true, and all commercial DBMSs today are distributed.
- Published
- 2011
48. Distributed XML Query Processing
- Author
-
Patrick Kling and M. Tamer Özsu
- Subjects
XML framework ,XML Encryption ,XML database ,Database ,XML Schema Editor ,Computer science ,Streaming XML ,Efficient XML Interchange ,XML Signature ,XML validation ,computer.file_format ,computer.software_genre ,computer - Abstract
Over the past decade, XML has become a commonly used format for storing and exchanging data in a wide variety of systems. Due to this widespread use, the problem of effectively and efficiently managing XML collections has attracted significant attention in both the research community and commercial products. One can claim that the centralized management and querying of XML data (i.e., data residing in one system) is now a well understood problem. Unfortunately, centralized techniques are limited in their scalability when presented with large collections and heavy query workloads.
- Published
- 2010
- Full Text
- View/download PDF
49. Dynamic Skyline Queries in Large Graphs
- Author
-
M. Tamer Özsu, Dongyan Zhao, Lei Zou, and Lei Chen
- Subjects
Skyline ,Speedup ,Euclidean space ,Computer science ,InformationSystems_DATABASEMANAGEMENT ,Rule-based system ,Query optimization ,computer.software_genre ,Metric space ,Data point ,Data mining ,Graph property ,Algorithm ,computer ,Computer Science::Databases - Abstract
Given a set of query points, a dynamic skyline query reports all data points that are not dominated by other data points according to the distances between data points and query points. In this paper, we study dynamic skyline queries in a large graph (DSG-query for short). Although dynamic skylines have been studied in Euclidean space [14], road networks [5], and metric spaces [3,6], there is no previous work on dynamic skylines over large graphs. We employ a filter-and-refine framework that can answer a DSG-query efficiently. We propose a novel pruning rule based on graph properties to derive candidates for a DSG-query that are guaranteed not to introduce false negatives. In the refinement step, with a carefully designed index structure, we compute shortest-path distances between vertices in O(H) time, where H is the maximum number of hops between any two vertices. Extensive experiments demonstrate that our methods outperform existing algorithms by orders of magnitude.
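The dominance relation underlying a dynamic skyline can be made concrete with a brute-force sketch: each data point is mapped to its vector of graph distances to the query points, and a point survives if no other point's vector dominates it. This is an illustrative baseline assuming unweighted graphs and BFS hop distances, not the filter-and-refine method of the paper.

```python
from collections import deque

def hop_distances(adj, source):
    """Unweighted shortest-path (hop) distances from source via BFS."""
    dist = {source: 0}
    q = deque([source])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def dynamic_skyline(adj, data_points, query_points):
    """Brute-force dynamic skyline over a graph: keep every data
    point whose distance vector to the query points is not
    dominated (<= in every coordinate, < in at least one)."""
    dists = {q: hop_distances(adj, q) for q in query_points}
    vec = {p: tuple(dists[q][p] for q in query_points) for p in data_points}

    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and \
               any(x < y for x, y in zip(a, b))

    return [p for p in data_points
            if not any(dominates(vec[o], vec[p])
                       for o in data_points if o != p)]
```

On a path 1-2-3-4-5 with a vertex 6 attached to 3 and query points {1, 5}, vertex 6 (distance vector (3, 3)) is dominated by vertex 3 (vector (2, 2)) and drops out of the skyline.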
- Published
- 2010
- Full Text
- View/download PDF
50. Mining frequent itemsets in time-varying data streams
- Author
-
M. Tamer Özsu and Yingying Tao
- Subjects
Data stream ,Data stream mining ,Computer science ,Data mining ,computer.software_genre ,computer ,Task (project management) - Abstract
Mining frequent itemsets in data streams is beneficial to many real-world applications but is also a challenging task since data streams are unbounded and have high arrival rates. Moreover, the distribution of data streams can change over time, which makes the task of maintaining frequent itemsets even harder. In this paper, we propose a false-negative oriented algorithm, called TWIM, that can find most of the frequent itemsets, detect distribution changes, and update the mining results accordingly. Experimental results show that our algorithm performs as well as other false-negative algorithms on data streams without distribution change, and has the ability to detect changes over time-varying data streams in real time with a high accuracy rate.
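The "false-negative oriented" guarantee style can be illustrated with a simplified lossy-counting-flavoured sketch for single frequent items (not TWIM itself, and restricted to items rather than itemsets): counts are periodically pruned, so maintained counts only ever underestimate the truth, and reporting at the full support threshold means every reported item is genuinely frequent while borderline frequent items may be missed.

```python
def frequent_items(stream, support, error):
    """Single-pass frequent-item sketch. Pruning at bucket
    boundaries bounds memory; because pruning can only lower a
    count, any item whose maintained count reaches support * n
    is guaranteed to be truly frequent (no false positives,
    possible false negatives)."""
    counts, n, bucket = {}, 0, 1
    width = max(1, int(1 / error))  # items per pruning bucket
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % width == 0:
            # Drop items whose count has not outgrown the bucket id.
            counts = {i: c for i, c in counts.items() if c > bucket}
            bucket += 1
    return {i for i, c in counts.items() if c >= support * n}
```

For instance, over the stream `['a', 'a', 'b'] * 20` with support 0.5, only `'a'` (40 of 60 occurrences) is reported.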
- Published
- 2009
- Full Text
- View/download PDF