Scalable Pattern Matching in Metadata Graphs via Constraint Checking
- Author
Geoffrey Sanders, Hassan H. Halawa, Tahsin Reza, Roger Pearce, and Matei Ripeanu
- Subjects
FOS: Computer and information sciences, Theoretical computer science, Computer science, Subgraph isomorphism problem, Incremental search, Graph, Computer Science Applications, Vertex (geometry), Constraint (information theory), Set (abstract data type), Computer Science - Distributed, Parallel, and Cluster Computing (cs.DC), Computational Theory and Mathematics, Hardware and Architecture, Modeling and Simulation, Scalability, Core (graph theory), Pattern matching, Software, Mathematics of Computing - Discrete Mathematics
- Abstract
Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: they do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of existing techniques are not suitable for today's graph processing infrastructures, which rely on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize.

We present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in a match has to meet a set of constraints implicitly specified by the search template. These constraints can be verified independently and are typically less expensive to compute than searching for the full template. The pipeline we propose generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, thus reducing the background graph to a subgraph that is the union of all template matches, i.e., the complete set of vertices and edges that participate in at least one match. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration, match counting, or computing vertex/edge centrality. Furthermore, constraint checking algorithms admit a vertex-centric formulation, which makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. This technique (i) enables highly scalable pattern matching in metadata (labeled) graphs; (ii) supports arbitrary patterns with 100% precision; (iii) enables trade-offs between precision and time-to-solution, while always selecting all vertices and edges that participate in matches, thus offering 100% recall; and (iv) supports a set of popular data analytics scenarios.

We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs, respectively, at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than those used in the past for similar problems.

This article serves two purposes. First, it synthesizes the knowledge accumulated during a long-term project [Reza et al. 2017, 2018; Tripoul et al. 2018]. Second, it presents new system features, usage scenarios, optimizations, and comparisons with related work that strengthen the confidence that pattern matching based on iterative pruning via constraint checking is an effective and scalable approach in practice. The new contributions include the following: (i) We demonstrate the ability of the constraint checking approach to efficiently support two additional search scenarios that often emerge in practice, interactive incremental search and exploratory search. (ii) We empirically compare our solution with two additional state-of-the-art systems, Arabesque [Teixeira et al. 2015] and TriAD [Gurajada et al. 2014]. (iii) We show the ability of our solution to accommodate a more diverse range of datasets with varying properties, e.g., scale, skewness, label distribution, and match frequency. (iv) We introduce or extend a number of system features (e.g., work aggregation, load balancing, and the ability to cap the generated traffic) and design optimizations, and demonstrate their advantages with respect to improving performance and scalability. (v) We present bottleneck analysis and insights into artifacts that influence performance. (vi) We present a theoretical complexity argument that motivates the performance gains we observe.
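The iterative pruning idea described in the abstract can be illustrated with a minimal, self-contained sketch. The snippet below implements only local (label and neighborhood) constraint checks and iterates them to a fixed point over an in-memory labeled graph; it is a toy illustration under assumed inputs, not the HavoqGT-based implementation described in the article, and it omits the non-local constraints the authors rely on to reach 100% precision. All names (e.g., prune_by_constraints) are hypothetical.

```python
from collections import defaultdict

def prune_by_constraints(vertices, edges, labels, t_vertices, t_edges, t_labels):
    """Toy sketch: iteratively remove background vertices that cannot satisfy the
    local (label + neighborhood) constraints implied by the search template.
    Vertices of true matches always survive (full recall); some non-matching
    vertices may also survive, since only local constraints are checked here."""
    # Undirected adjacency for the background graph and for the template.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    t_adj = defaultdict(set)
    for a, b in t_edges:
        t_adj[a].add(b)
        t_adj[b].add(a)

    # Initial candidates: background vertices whose label matches the template vertex.
    cand = {tv: {v for v in vertices if labels[v] == t_labels[tv]} for tv in t_vertices}

    changed = True
    while changed:  # iterate the constraint checks until a fixed point is reached
        changed = False
        for tv in t_vertices:
            survivors = set()
            for v in cand[tv]:
                # Local constraint: for every template neighbor of tv, v must still
                # have at least one neighbor that is a surviving candidate for it.
                if all(adj[v] & cand[tn] for tn in t_adj[tv]):
                    survivors.add(v)
            if survivors != cand[tv]:
                cand[tv] = survivors
                changed = True

    # Reduced graph: union of surviving candidates, plus (as a coarse
    # approximation) all background edges between surviving vertices.
    kept_vertices = set().union(*cand.values()) if cand else set()
    kept_edges = [(u, v) for u, v in edges if u in kept_vertices and v in kept_vertices]
    return cand, kept_vertices, kept_edges


# Toy example: triangle template 'a'-'b'-'c' against a 4-vertex background graph.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
labels = {0: "a", 1: "b", 2: "c", 3: "b"}
cand, kept_v, kept_e = prune_by_constraints(
    vertices, edges, labels,
    t_vertices=["x", "y", "z"],
    t_edges=[("x", "y"), ("y", "z"), ("z", "x")],
    t_labels={"x": "a", "y": "b", "z": "c"},
)
# Vertex 3 is pruned (its only neighbor has no surviving 'a'-labeled candidate);
# kept_v == {0, 1, 2}, the union of all matches of the triangle template.
```

In the article's pipeline such checks are expressed in a vertex-centric form and run as distributed, asynchronous iterations, with additional non-local constraints applied afterwards; the sketch above only conveys the pruning-to-fixed-point structure.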
- Published
- 2021