Descriptor: "Inverted index" / Publisher: acm - Searchworks@Jio Institute Digital Library Search Results

1. IR From Bag-of-words to BERT and Beyond through Practical Experiments

Author: Sean MacAvaney, Nicola Tonellotto, and Craig Macdonald
Subjects: Information retrieval, Bag-of-words model, Scripting language, Computer science, Search engine indexing, Learning to rank, Language model, Inverted index, Session (web analytics), Ranking (information retrieval)
Abstract: The task of ad hoc search is undergoing a renaissance, sparked by advances in natural language processing. In particular, pre-trained contextualized language models (such as BERT and T5) have consistently been shown to be a highly effective foundation upon which to build ranking models. These models are equipped with a far deeper understanding of language than bag-of-words (BoW) models. Applying these techniques to new tasks can be tricky, however, as they require knowledge of deep learning frameworks, and significant scripting and data munging. In this full-day tutorial, we build up from foundational retrieval principles to the latest neural ranking techniques. We first provide foundational background on classical bag-of-words methods. We then show how feature-based Learning to Rank methods can be used to re-rank these results. Finally, we cover contemporary approaches, such as BERT, doc2query, and dense retrieval. Throughout the process, we demonstrate how these can be easily applied experimentally to new search tasks in a declarative style of conducting experiments exemplified by the PyTerrier and OpenNIR search toolkits. This tutorial is interactive in nature for participants. It is broken into sessions, each of which mixes explanatory presentation with hands-on activities using prepared Jupyter notebooks running on the Google Colab platform. These activities give participants experience applying the techniques covered in the tutorial on the TREC COVID benchmark test collection. The tutorial is broken into four sessions. In the first session, we cover foundational retrieval concepts, including inverted indexing, retrieval, and scoring. We also demonstrate how evaluation can be conducted in a declarative fashion within PyTerrier, encapsulating ideas such as significance testing and multiple-testing correction, as promoted as IR best practices. In the second session, we build upon the core retrieval concepts to demonstrate how to re-write queries (e.g., using RM3) and re-rank documents (e.g., using learning-to-rank). In the third session, we introduce contextualized language models, such as BERT, and show how they can be utilized for document re-ranking (e.g., using Vanilla/monoBERT and EPIC). Finally, in session four, we move beyond re-ranking and cover approaches that modify documents (e.g., DeepCT) as well as efforts to replace the traditional inverted index with an embedding-based index (e.g., ANCE, ColBERT, and ColBERT-PRF). By the end of the tutorial, participants will have experience conducting IR experiments from classical bag-of-words models to contemporary BERT models and beyond.
Published: 2021
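The inverted indexing and retrieval covered in the tutorial's first session can be illustrated with a minimal sketch. This is not PyTerrier or OpenNIR code; the function names, the whitespace tokenizer, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for a real tokenizer.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def conjunctive_query(index, query):
    """Return the sorted doc ids that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    doc_sets = [{doc_id for doc_id, _ in index.get(term, [])} for term in terms]
    return sorted(set.intersection(*doc_sets))


# Hypothetical toy collection.
docs = {
    1: "inverted index for text retrieval",
    2: "neural ranking with bert",
    3: "bert re-ranking over an inverted index",
}
index = build_inverted_index(docs)
matches = conjunctive_query(index, "inverted index")
```

Each posting list keeps term frequencies so that a scoring function could be layered on top of the same structure.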

2. Top-k Tree Similarity Join

Author: Wenjie Zhang, Jianhua Wang, and Jianye Yang
Subjects: Structure (mathematical logic), Tree (data structure), Task (computing), Similarity (network science), Computer science, Join (sigma algebra), K-tree, Inverted index, Upper and lower bounds, Algorithm
Abstract: Tree similarity join is useful for analyzing tree-structured data. The traditional threshold-based tree similarity join requires users to specify a similarity threshold, which is usually difficult. To remedy this issue, we advocate the problem of top-k tree similarity join. Given a collection of trees and a parameter k, the top-k tree similarity join aims to find the k tree pairs with minimum tree edit distance (TED). Although we show that this problem can be resolved by utilizing the threshold-based join, the efficiency is unsatisfactory. In this paper, we propose an efficient algorithm, namely TopKTJoin, which generates the candidate tree pairs incrementally using an inverted index. We also derive a TED lower bound for the unseen tree pairs. Together with the TED value of the k-th best join result seen so far, this bound lets us terminate the algorithm early without missing any correct results. To further improve the efficiency, we propose two optimization techniques in terms of index structure and verification mechanism. We conduct comprehensive performance studies on real and synthetic datasets. The experimental results demonstrate that TopKTJoin significantly outperforms the baseline method.
Published: 2021
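The early-termination idea in this abstract (compare a lower bound for unseen pairs against the k-th best distance found so far) can be sketched as follows. This is not TopKTJoin: as a stand-in, it joins strings under Levenshtein distance instead of trees under TED, uses the length difference as the lower bound, and sorts all pairs up front rather than generating candidates incrementally from an inverted index. All names are hypothetical.

```python
import heapq
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (stand-in for tree edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def top_k_join(items, k):
    """Return the k closest pairs as (distance, i, j) and the number of pairs verified."""
    def lower_bound(pair):
        # The length difference never exceeds the edit distance.
        return abs(len(items[pair[0]]) - len(items[pair[1]]))

    # Visit candidate pairs in order of increasing lower bound.
    pairs = sorted(combinations(range(len(items)), 2), key=lower_bound)
    best = []  # max-heap on distance, stored as (-distance, i, j)
    verified = 0
    for i, j in pairs:
        # Stop once no unseen pair can beat the k-th best result so far.
        if len(best) == k and lower_bound((i, j)) >= -best[0][0]:
            break
        dist = edit_distance(items[i], items[j])
        verified += 1
        if len(best) < k:
            heapq.heappush(best, (-dist, i, j))
        elif dist < -best[0][0]:
            heapq.heapreplace(best, (-dist, i, j))
    return sorted((-neg, i, j) for neg, i, j in best), verified


# Hypothetical input: 5 strings give 10 candidate pairs.
items = ["abc", "abd", "abcdefgh", "xyz", "abcdefgx"]
results, verified = top_k_join(items, k=2)
```

On this toy input the loop stops before all ten pairs are verified, because every remaining pair has a lower bound no smaller than the current k-th best distance.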

3. Construction and Application on Parallel Corpus for College Japanese Translation Teaching

Author: Xiaoling Yu
Subjects: Collocation, Computer science, Process (engineering), Teaching method, Carry (arithmetic), InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, User management, Inverted index, Translation (geometry), ComputingMilieux_COMPUTERSANDEDUCATION, Artificial intelligence, Function (engineering), Natural language processing
Abstract: The bilingual parallel corpus has the closest relationship with translation teaching. It provides abundant teaching resources and convenient teaching methods, which is conducive to forming relatively stable translation skills through extensive practice. To carry out college Japanese translation teaching based on a parallel corpus, three subsystems, including parallel corpus function, corpus management, query statistics and user management, as well as several functional frameworks under each subsystem, are designed; an inverted index file for the parallel corpus is designed to improve retrieval efficiency; and specific application strategies are proposed: select language materials in a standardized manner, increase language input in the teaching process, cultivate students' technical awareness, apply students' typical mistakes or translated works to teaching feedback, mark up the aligned corpus carefully, carry out corpus-based translation collocation teaching, and cultivate students' autonomous learning ability.
Published: 2021

4. NIL: large-scale detection of large-variance clones

Author: Tasuku Nakagawa, Yoshiki Higo, and Shinji Kusumoto
Subjects: Source code, Similarity (geometry), Computer science, ComputingMilieux_LEGALASPECTSOFCOMPUTING, Inverted index, Longest common subsequence problem, Fragment (logic), Software_SOFTWAREENGINEERING, Clone (algebra), Scalability, Code (cryptography), Algorithm
Abstract: A code clone (in short, clone) is a code fragment that is identical or similar to other code fragments in source code. Clones generated by a large number of changes to copy-and-pasted code fragments are called large-variance (modifications are scattered) or large-gap (modifications are in one place) clones. It is difficult for general clone detection techniques to detect such clones and thus specialized techniques are necessary. In addition, with the rapid growth of software development, scalable clone detectors that can detect clones in large codebases are required. However, there are no existing techniques for quickly detecting large-variance or large-gap clones in large codebases. In this paper, we propose a scalable clone detection technique that can detect large-variance clones from large codebases and describe its implementation, called NIL. NIL is a token-based clone detector that efficiently identifies clone candidates using an N-gram representation of token sequences and an inverted index. Then, NIL verifies the clone candidates by measuring their similarity based on the longest common subsequence between their token sequences. We evaluate NIL in terms of large-variance clone detection accuracy, general Type-1, Type-2, and Type-3 clone detection accuracy, and scalability. Our experimental results show that NIL has higher accuracy in terms of large-variance clone detection, equivalent accuracy in terms of general clone detection, and the shortest execution time for inputs of various sizes (1–250 MLOC) compared to existing state-of-the-art tools.
Published: 2021
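The two-stage pipeline described in this abstract (N-gram inverted index for candidate filtering, then longest-common-subsequence verification) can be sketched roughly as below. This is not NIL's implementation: the thresholds `filter_ratio` and `verify_ratio`, the N-gram size, and the similarity formulas are assumptions chosen only to make the example run.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Set of token n-grams in a fragment."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def detect_clones(fragments, n=3, filter_ratio=0.1, verify_ratio=0.7):
    """Return sorted (id_a, id_b) pairs judged to be clones (hypothetical thresholds)."""
    grams = {fid: ngrams(tokens, n) for fid, tokens in fragments.items()}
    index = defaultdict(set)  # inverted index: n-gram -> fragment ids
    for fid, gram_set in grams.items():
        for gram in gram_set:
            index[gram].add(fid)

    clones = []
    for fid, gram_set in grams.items():
        shared = defaultdict(int)  # other fragment id -> number of shared n-grams
        for gram in gram_set:
            for other in index[gram]:
                if other > fid:  # consider each unordered pair once
                    shared[other] += 1
        for other, count in shared.items():
            # Filtering: require enough shared n-grams relative to the smaller fragment.
            if count / min(len(gram_set), len(grams[other])) < filter_ratio:
                continue
            # Verification: LCS-based similarity over the token sequences.
            a, b = fragments[fid], fragments[other]
            if lcs_length(a, b) / min(len(a), len(b)) >= verify_ratio:
                clones.append((fid, other))
    return sorted(clones)


# Hypothetical token sequences; "b" is "a" with an inserted statement.
fragments = {
    "a": "int x = 0 ; x = x + 1 ; return x ;".split(),
    "b": "int x = 0 ; log ( x ) ; x = x + 1 ; return x ;".split(),
    "c": "while ( true ) { step ( ) ; }".split(),
}
clone_pairs = detect_clones(fragments)
```

The inverted index means the expensive LCS check only runs on pairs that share at least some N-grams, which is the source of the scalability the abstract refers to.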

5. Faster Index Reordering with Bipartite Graph Partitioning

Author: Matthias Petri, Alistair Moffat, and Joel Mackenzie
Subjects: Range (mathematics), Theoretical computer science, Computer science, Heuristic, Computation, Compression (functional analysis), Bipartite graph, Document clustering, Reference implementation, Inverted index
Abstract: We revisit the Bipartite Graph Partitioning approach to document reordering (Dhulipala et al., KDD 2016), and consider a range of algorithmic and heuristic refinements that lead to faster computation of index-minimizing document orderings. Our final implementation executes approximately four times faster than the reference implementation we commence with, and obtains the same, or slightly better, compression effectiveness on three large text collections.
Published: 2021
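Why document reordering affects index size can be shown with a small sketch. It does not implement Bipartite Graph Partitioning; it only compares a simple gap-based cost (the number of bits needed to write each docid gap in binary, an assumed proxy for compressed size) under two hand-picked orderings of a hypothetical index.

```python
def gap_bits(doc_ids):
    """Bits needed to write each gap between consecutive doc ids in binary."""
    total, prev = 0, -1
    for doc_id in sorted(doc_ids):
        total += (doc_id - prev).bit_length()
        prev = doc_id
    return total


def index_bits(index, new_id):
    """Total gap cost of an index after renumbering documents via new_id."""
    return sum(gap_bits(new_id[d] for d in postings) for postings in index.values())


# Hypothetical index: term -> doc ids under the original numbering.
index = {"apple": [0, 4, 8], "pear": [1, 5, 9]}

identity = {d: d for d in range(10)}
# Hand-picked ordering that places documents sharing a term next to each other.
clustered = {0: 0, 4: 1, 8: 2, 1: 3, 5: 4, 9: 5}

cost_before = index_bits(index, identity)
cost_after = index_bits(index, clustered)
```

Clustering similar documents shrinks the gaps inside each posting list, so the same postings cost fewer bits; a reordering algorithm searches for an ordering with this property automatically.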

6. Exploiting Intel Optane persistent memory for full text search

Author: Shoaib Akram
Subjects: Hardware_MEMORYSTRUCTURES, Computer science, Search engine indexing, Full text search, Parallel computing, Inverted index, Hash table, Tree (data structure), Search engine, sort, Overhead (computing)
Abstract: In our information-driven societies, full-text search is ubiquitous. Search is memory-intensive. Quickly searching massive corpora requires building indices, which consumes big volatile heaps. Search is storage I/O-intensive. Limited main memory necessitates writing large partial indices on non-volatile storage, where they finally live in merged form. These indices reside in memory, in full or in part, during query evaluation. Memory and I/O intensity make it hard to index and search content rapidly and efficiently. On the hardware side, the recently introduced Intel Optane DC persistent memory (PM) offers byte-addressability, high capacity, and non-volatility. This paper evaluates and exploits Optane PM for text indexing and search on multicore platforms. We identify essential structures in inverted indices (hash table, merge tree, and key-value store), where they reside (memory or storage), and key operations over them (sort, flush, and merge). We allocate index structures in DRAM, Optane PM, and block storage by modifying an existing search engine. We then evaluate a myriad of hybrid memory and storage configurations. Our findings include: (1) careful placement of index structures across DRAM, Optane PM, and SSD, speeds up indexing with a single core compared to a high-performance baseline, but does not scale to many cores, (2) crash-consistent indexing with Optane PM is feasible without incurring a high overhead, and (3) the tail latency of the longest multi-term conjunctive queries is lower with a PM-backed index than an SSD-backed one. This paper opens up persistent memory to a practical role in full-text search.
Published: 2021

7. Match Plan Generation in Web Search with Parameterized Action Reinforcement Learning

Author: Hui Xue, Wei Cheng, Sihao Chen, Haidong Wang, Ziyan Luo, Linfeng Zhao, Qi Chen, Lintao Zhang, Chuanjie Liu, and Mao Yang
Subjects: Sequence, Computer science, Partially observable Markov decision process, Parameterized complexity, Machine learning, Inverted index, Task (project management), Web page, Reinforcement learning, Artificial intelligence, Block (data storage)
Abstract: To achieve good result quality and short query response time, search engines use specific match plans on an inverted index to help retrieve a small set of relevant documents from billions of web pages. A match plan is composed of a sequence of match rules, which contain discrete match rule types and continuous stopping quotas. Currently, match plans are manually designed by experts based on several years of experience, which makes it difficult to deal with heterogeneous queries and varying data distributions. In this work, we formulate match plan generation as a Partially Observable Markov Decision Process (POMDP) with a parameterized action space, and propose a novel reinforcement learning algorithm, Parameterized Action Soft Actor-Critic (PASAC), to effectively enhance the exploration in both spaces. In our scenario, we also discover a skew prioritizing issue of the original Prioritized Experience Replay (PER) and introduce Stratified Prioritized Experience Replay (SPER) to address it. We are the first group to generalize this task for all queries as a learning problem with zero prior knowledge and successfully apply deep reinforcement learning in the real web search environment. Our approach greatly outperforms the well-designed production match plans, with over 70% reduction of index block accesses while the quality of documents remains almost unchanged, and 9% reduction of query response time even with model inference cost. Our method also beats the baselines on some open-source benchmarks.
Published: 2021

8. Fast Disjunctive Candidate Generation Using Live Block Filtering

Author: Michał Siedlaczek, Torsten Suel, and Antonio Mallia
Subjects: Theoretical computer science, Exploit, Computer science, Inverted index, Ranking (information retrieval), Set (abstract data type), Simple (abstract algebra), SIMD, Focus (optics), Block (data storage)
Abstract: A lot of research has focused on the efficiency of search engine query processing, and in particular on disjunctive top-k queries that return the highest scoring k results that contain at least one of the query terms. Disjunctive top-k queries over simple ranking functions are commonly used to retrieve an initial set of candidate results that are then reranked by more complex, often machine-learned rankers. Many optimized top-k algorithms have been proposed, including MaxScore, WAND, BMW, and JASS. While the fastest methods achieve impressive results on top-10 and top-100 queries, they tend to become much slower for the larger k commonly used for candidate generation. In this paper, we focus on disjunctive top-k queries for larger k. We propose new algorithms that achieve much faster query processing for values of k up to thousands or tens of thousands. Our algorithms build on top of the live-block filtering approach of Dimopoulos et al., and exploit the SIMD capabilities of modern CPUs. We also perform a detailed experimental comparison of our methods with the fastest known approaches, and release a full model implementation of our methods and of the underlying live-block mechanism, which will allow others to design and experiment with additional methods under the live-block approach.
Published: 2021

9. Optimizing Continuous kNN Queries over Large-Scale Spatial-Textual Data Streams

Author: Rong Yang and Baoning Niu
Subjects: Exploit, Matching (graph theory), Computer science, Data stream mining, Construct (python library), Inverted index, k-nearest neighbors algorithm, Data mining, Scale (map), Block (data storage)
Abstract: The continuous k-Nearest Neighbor queries over spatial-textual data streams (abbr. CkQST) retrieve and continuously monitor at most k nearest-neighbor (abbr. kNN) objects that are closest to a user-specified location and contain all the user-specified keywords; this is the core operation of numerous location-based publish/subscribe systems. Such a system usually holds a massive number of subscribed CkQST, which are evaluated simultaneously whenever new objects arrive and old objects expire. The usual approach to evaluating CkQST is to construct a spatial-textual hybrid index over the subscribed queries and match incoming objects using the filtering capabilities of the index. For CkQST, the minimal spatial search range covering kNN objects changes frequently with the arrival and expiration of qualified objects, and the cost of updating the index is prohibitively high. To efficiently evaluate CkQST, we extend the Quad-tree with an inverted index, and exploit it with three techniques, i.e., a memory-based cost model, a block-based ordered inverted index, and an adaptive insertion strategy. The experiments on comprehensive datasets demonstrate the effectiveness and efficiency of our proposed techniques.
Published: 2020

10. Enhancing FSCS-ART through Test Input Quantization and Inverted Lists

Author: Rubing Huang, Muhammad Ashfaq, and Michael Omari
Subjects: Quantization (physics), Software, Computer science, k-means clustering, Random testing, Centroid, Cluster analysis, Inverted index, Algorithm, Domain (software engineering)
Abstract: Fixed-size-candidate-set adaptive random testing (FSCS-ART) is an ART technique well known for its failure-detection effectiveness and its use in testing many real-life applications. However, it faces substantial computational overhead, with O(n^2) time cost for generating n test inputs, which becomes worse for high-dimensional input domains (the dimension being the number of inputs a program takes). As real-life programs generally have low failure rates and high-dimensional input domains, it is vital to reduce the computational overhead while preserving the failure-detection effectiveness for efficient software testing. In this work, we adopt a Quantization and InVerted File structure approach to enhance the original FSCS-ART; the result is called QIVFSCS-ART. The proposed method preprocesses the software input domain by partitioning it into discrete cells using K-means clustering on a uniform random dataset. After this, the quantized form of each executed test input is stored in the inverted list of its cell’s center, called the centroid. Results show that the proposed method significantly relieves the computational overhead of FSCS-ART while preserving its failure-detection effectiveness, especially for high-dimensional software input domains.
Published: 2020
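The quadratic cost mentioned in this abstract comes from the baseline FSCS-ART loop, which can be sketched as follows. This shows only the original technique, not the quantization and inverted-list enhancement; the unit-hypercube input domain, the candidate-set size `k`, and the function name are assumptions.

```python
import random


def fscs_art(n, dim, k=10, seed=0):
    """Generate n test inputs in the unit hypercube with baseline FSCS-ART."""
    if n <= 0:
        return []
    rng = random.Random(seed)
    executed = [[rng.random() for _ in range(dim)]]  # first input is purely random

    def min_sq_dist(candidate):
        # Squared distance to the nearest already-executed input.
        return min(
            sum((a - b) ** 2 for a, b in zip(candidate, done)) for done in executed
        )

    while len(executed) < n:
        candidates = [[rng.random() for _ in range(dim)] for _ in range(k)]
        # Pick the candidate farthest from its nearest executed input.
        executed.append(max(candidates, key=min_sq_dist))
    return executed


tests = fscs_art(n=20, dim=3, k=10, seed=42)
```

Every new input compares each of the `k` candidates against all previously executed inputs, so generating `n` inputs takes on the order of `k * n^2` distance computations. That nearest-neighbor search is the step an inverted-list structure would aim to shorten.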

11. Index Obfuscation for Oblivious Document Retrieval in a Trusted Execution Environment

Author: Yifan Qiao, Shiyu Ji, Timothy Sherwood, Alvin Oliver Glova, Tao Yang, and Jinjin Shao
Subjects: Structure (mathematical logic), Information retrieval, Index (publishing), Trusted hardware, Computer science, Association (object-oriented programming), Obfuscation, Document retrieval, Inverted index, Masking (Electronic Health Record)
Abstract: This paper studies privacy-aware inverted index design and document retrieval for multi-keyword document search in a trusted hardware execution environment such as Intel SGX. Previous work uses time-consuming oblivious computing techniques to avoid leaking memory access patterns and thereby preserve privacy in such an environment. This paper proposes an efficiency-enhanced design that obfuscates the inverted index structure with posting bucketing and document ID masking, which aims to hide document-term association and avoid access pattern leakage. This paper describes privacy-aware oblivious document retrieval during online query processing based on such an index. Both privacy and efficiency analyses are provided, followed by evaluation results comparing the proposed designs with multiple baselines.
Published: 2020

12. Embedding-based Retrieval in Facebook Search

Author: Shuying Sun, Giuseppe Ottaviano, Ashish Sharma, Jui-Ting Huang, Philip Pronin, Janani Padmanabhan, Li Xia, David Zhang, and Linjun Yang
Subjects: Social graph, Information retrieval, Computer science, Deep learning, Context (language use), Inverted index, Computer Science - Information Retrieval, Personalized search, Search engine, Embedding, Artificial intelligence, Information Retrieval (cs.IR)
Abstract: Search in social networks such as Facebook poses different challenges than in classical web search: besides the query text, it is important to take into account the searcher's context to provide relevant results. Their social graph is an integral part of this context and is a unique aspect of Facebook search. While embedding-based retrieval (EBR) has been applied in web search engines for years, Facebook search was still mainly based on a Boolean matching model. In this paper, we discuss the techniques for applying EBR to a Facebook Search system. We introduce the unified embedding framework developed to model semantic embeddings for personalized search, and the system to serve embedding-based retrieval in a typical search system based on an inverted index. We discuss various tricks and experiences on end-to-end optimization of the whole system, including ANN parameter tuning and full-stack optimization. Finally, we present our progress on two selected advanced topics about modeling. We evaluated EBR on verticals for Facebook Search with significant metrics gains observed in online A/B experiments. We believe this paper will provide useful insights and experiences to help people develop embedding-based retrieval systems in search engines. 9 pages, 3 figures, 3 tables; to be published in KDD '20.
Published: 2020

13. Efficiency Implications of Term Weighting for Passage Retrieval

Author: Luke Gallagher, Jamie Callan, Zhuyun Dai, and Joel Mackenzie
Subjects: Computer science, Natural language understanding, Machine learning, Inverted index, Weighting, Ranking, Language model, Artificial intelligence
Abstract: Language model pre-training has spurred a great deal of attention for tasks involving natural language understanding, and has been successfully applied to many downstream tasks with impressive results. Within information retrieval, many of these solutions are too costly to stand on their own, requiring multi-stage ranking architectures. Recent work has begun to consider how to "backport" salient aspects of these computationally expensive models to previous stages of the retrieval pipeline. One such instance is DeepCT, which uses BERT to re-weight term importance in a given context at the passage level. This process, which is computed offline, results in an augmented inverted index with re-weighted term frequency values. In this work, we conduct an investigation of query processing efficiency over DeepCT indexes. Using a number of candidate generation algorithms, we reveal how term re-weighting can impact query processing latency, and explore how DeepCT can be used as a static index pruning technique to accelerate query processing without harming search effectiveness.
Published: 2020

14. JASSjr: The Minimalistic BM25 Search Engine for Teaching and Learning Information Retrieval

Author: Andrew Trotman and Kat Lilly
Subjects: Search engine, Information retrieval, Index (publishing), Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Code (cryptography), Inverted index
Abstract: We present JASSjr, a minimalistic trec_eval compatible BM25-ranking search engine that can index small TREC data sets such as the Wall Street Journal collection. We do this for several reasons. First, to demonstrate how a term-at-a-time (TAAT) search engine works. Second, to demonstrate that a straightforward and competitive search engine with indexer can be written in under 600 lines of documented code. Third, as a way of providing a simple code-base for teaching Information Retrieval. We present two index-compatible versions (one in C/C++, the other in Java) that compile and run on MacOS, Linux, and Windows. Our code is released under the 2-clause BSD licence, and we provide several suggestions for extensions which might be used as exercises in an Information Retrieval course.
Published: 2020
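Term-at-a-time (TAAT) processing with BM25 scoring, as named in this abstract, can be sketched in a few lines. This is not JASSjr's code: the BM25 variant, the parameter values `k1` and `b`, the whitespace tokenizer, and all function names are assumptions for illustration, and the collection is assumed to be non-empty.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return (postings, lengths): term -> [(doc_id, tf)], plus document lengths."""
    postings = defaultdict(list)
    lengths = []
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        lengths.append(len(tokens))
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    return postings, lengths


def bm25_taat(query, postings, lengths, k1=0.9, b=0.4, top_k=10):
    """Score one posting list at a time, accumulating partial scores per document."""
    n_docs = len(lengths)
    avg_len = sum(lengths) / n_docs
    accumulators = defaultdict(float)
    for term in query.lower().split():
        plist = postings.get(term, [])
        if not plist:
            continue
        idf = math.log(n_docs / len(plist))  # assumed IDF form
        for doc_id, tf in plist:
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            accumulators[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    # Highest score first; ties broken by doc id.
    return sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy collection.
docs = [
    "the quick brown fox",
    "the lazy dog",
    "quick quick fox jumps over the lazy dog",
]
postings, lengths = build_index(docs)
ranking = bm25_taat("quick fox", postings, lengths)
```

The defining feature of TAAT is the accumulator table: each query term's posting list is processed in full before the next term is touched, and final ranking happens only after all terms are done.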

15. Context-Aware Term Weighting For First Stage Passage Retrieval

Author: Jamie Callan and Zhuyun Dai
Subjects: Computer science, Context (language use), Inverted index, Weighting, Term (time), Key (cryptography), Stage (hydrology), Artificial intelligence, Natural language processing
Abstract: Term frequency is a common method for identifying the importance of a term in a document. But term frequency ignores how a term interacts with its text context, which is key to estimating document-specific term weights. This paper proposes a Deep Contextualized Term Weighting framework (DeepCT) that maps the contextualized term representations from BERT into context-aware term weights for passage retrieval. The new, deep term weights can be stored in an ordinary inverted index for efficient retrieval. Experiments on two datasets demonstrate that DeepCT greatly improves the accuracy of first-stage passage retrieval algorithms.
Published: 2020
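One way to picture how learned term weights can live in an ordinary inverted index is to quantize them into integer pseudo term frequencies. The sketch below assumes the per-passage weights are already available as floats (in the paper they would come from the BERT-based model, which is not reproduced here); the scale factor of 100, the function name, and the sample weights are assumptions.

```python
from collections import defaultdict


def weights_to_index(passage_weights, scale=100):
    """Turn {passage_id: {term: weight}} into term -> [(passage_id, pseudo_tf)]."""
    index = defaultdict(list)
    for pid in sorted(passage_weights):
        for term, weight in passage_weights[pid].items():
            pseudo_tf = round(weight * scale)  # assumed quantization rule
            if pseudo_tf > 0:  # terms quantized to zero are not indexed
                index[term].append((pid, pseudo_tf))
    return dict(index)


# Hypothetical predicted weights for two passages.
passage_weights = {
    0: {"index": 0.82, "the": 0.001, "search": 0.40},
    1: {"index": 0.10, "bert": 0.95},
}
weighted_index = weights_to_index(passage_weights)
```

Because the postings still hold integer frequencies, an unmodified bag-of-words ranker such as BM25 can consume them directly; terms whose quantized weight is zero drop out of the index entirely.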

16. Workload-Aware Column Imprints

Author: Noah Slavitch
Subjects: Data access, Range query (data structures), Computer science, Data mining, Cache, Inverted index, Data structure, Column (database)
Abstract: In-memory columnar databases use indexes to accelerate highly selective queries. The additional storage requirement of indexes becomes prohibitive when kept in memory. For example, an inverted index requires as much space as the column itself. Column Imprints (CI) have been proposed as a space-efficient structure that supports range queries. We examine the limitations of CI and we suggest three enhancements for in-memory databases. We propose a workload-aware approach which considers recent data access patterns when constructing CI. We optimize the histogram towards reducing false positives and cache misses for highly selective queries. We propose efficient algorithms to construct our data structures. Preliminary experiments confirm that our workload-aware imprints 1) reduce the cache lines scanned by anywhere from 30% to 50% when compared to the original CI, and 2) have significantly smaller storage requirements.
Published: 2020
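The basic column imprint idea that this work builds on can be sketched as one small bit vector per block of values, with one bit per histogram bin. The sketch below covers only that basic structure, not the workload-aware enhancements; the block size, the bin boundaries, and the function names are assumptions.

```python
from bisect import bisect_right


def build_imprints(column, bounds, block_size=8):
    """One bitmask per block; bit i is set if the block holds a value in bin i."""
    imprints = []
    for start in range(0, len(column), block_size):
        mask = 0
        for value in column[start:start + block_size]:
            mask |= 1 << bisect_right(bounds, value)
        imprints.append(mask)
    return imprints


def range_query(column, imprints, bounds, lo, hi, block_size=8):
    """Return (row ids with lo <= value <= hi, number of blocks actually scanned)."""
    query_mask = 0
    for bin_id in range(bisect_right(bounds, lo), bisect_right(bounds, hi) + 1):
        query_mask |= 1 << bin_id
    rows, scanned = [], 0
    for block, mask in enumerate(imprints):
        if mask & query_mask:  # block may contain qualifying values
            scanned += 1
            start = block * block_size
            for row, value in enumerate(column[start:start + block_size], start):
                if lo <= value <= hi:  # filter out false positives
                    rows.append(row)
    return rows, scanned


# Hypothetical column, split into blocks of 4 values, with bins
# (<10), [10, 40), [40, 70), (>=70).
column = [1, 2, 3, 4, 50, 51, 52, 53, 5, 60, 7, 8, 90, 91, 92, 93]
bounds = [10, 40, 70]
imprints = build_imprints(column, bounds, block_size=4)
rows, scanned = range_query(column, imprints, bounds, lo=50, hi=60, block_size=4)
```

A block is read only when its imprint shares a bit with the query mask, so the choice of bin boundaries directly controls how many blocks are scanned unnecessarily.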

17. Context-Aware Document Term Weighting for Ad-Hoc Search

Author: Jamie Callan and Zhuyun Dai
Subjects: Search engine, Information retrieval, Index (publishing), Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Search engine indexing, Embedding, Context (language use), Relevance (information retrieval), Inverted index, Representation (mathematics), Term (time), Weighting
Abstract: Bag-of-words document representations play a fundamental role in modern search engines, but their power is limited by the shallow frequency-based term weighting scheme. This paper proposes HDCT, a context-aware document term weighting framework for document indexing and retrieval. It first estimates the semantic importance of a term in the context of each passage. These fine-grained term weights are then aggregated into a document-level bag-of-words representation, which can be stored in a standard inverted index for efficient retrieval. This paper also proposes two approaches that enable training HDCT without relevance labels. Experiments show that an index using HDCT weights significantly improves retrieval accuracy compared to typical term-frequency and state-of-the-art embedding-based indexes.
Published: 2020

18. IIU

Author: Tae Jun Ham, Jae-Yeon Won, Shivam Bharuka, Jun Heo, Yejin Lee, Jaeyoung Jang, and Jae W. Lee
Subjects: Computer science, Search engine indexing, Full text search, Memory bandwidth, Parallel computing, Inverted index, Data structure, Search engine, Overhead (computing), Throughput (business)
Abstract: The inverted index serves as a fundamental data structure for efficient search across various applications such as full-text search engines, document analytics, and other information retrieval systems. The storage requirement and query load for these structures have been growing at a rapid rate. Thus, an ideal indexing system should maintain a small index size with a low query processing time. Previous works have mainly focused on using CPUs and GPUs to exploit query parallelism while utilizing state-of-the-art compression schemes to fit the index in memory. However, scaling parallelism to maximally utilize memory bandwidth on these architectures is still challenging. In this work, we present IIU, a novel inverted index processing unit, to optimize the query performance while maintaining a low memory overhead for index storage. To this end, we co-design the indexing scheme and hardware accelerator so that the accelerator can process a highly compressed inverted index at a high throughput. In addition, IIU provides flexible interconnects between modules to take advantage of both intra- and inter-query parallelism. Our evaluation using a cycle-level simulator demonstrates that IIU provides an average of 13.8× query latency reduction and 5.4× throughput improvement across different query types, while reducing the average energy consumption by 18.6×, compared to Apache Lucene, a production-grade full-text search framework.
Published: 2020

19. Influence constraint based Top-k spatial keyword preference query

Author: Xin Li, Xiangfu Meng, Cai Pan, and Zhiguang Chu
Subjects: Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Filter (signal processing), Inverted index, Object (computer science), k-nearest neighbors algorithm, Feature (computer vision), R-tree, Relevance (information retrieval), Pruning (decision trees), Data mining
Abstract: The traditional Top-k spatial keyword preference query processing mode usually selects the range and nearest neighbor as the spatial constraints. It focuses on the influence of the distance between a spatial object and a feature object on the query result. However, the distance between feature objects and the query results and the influence of textual relevance between the feature objects and the query keywords are also crucial to query results. Therefore, we propose a Threshold Inverted File Algorithm (TAIFA), which is based on influence constraints, to improve the query. To filter out feature objects irrelevant to the query keywords, TAIFA uses inverted files to store feature objects. In addition, the cost of query processing is reduced by setting an upper-limit score for each node of the R*-tree. As a result, response speed is greatly improved and the query results are more relevant to consumer needs. Further, the performance of TAIFA is evaluated through analysis and experimental verification. The experimental results demonstrate that, by pruning the irrelevant spatial objects, the query responds an order of magnitude faster than related algorithms.
Published: 2019
Full Text: View/download PDF

20. A Data-Driven Approach for Multi-level Packing Problems in Manufacturing Industry

Author: Lei Chen, Xialiang Tong, Mingxuan Yuan, and Jia Zeng
Subjects: Mathematical optimization, Optimization problem, Heuristic (computer science), Bin packing problem, Computer science, Heuristic, 02 engineering and technology, Approximate string matching, Inverted index, Dynamic programming, Packing problems, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Heuristics
Abstract: The bin packing problem is one of the most fundamental optimization problems. Owing to its hardness as a combinatorial optimization problem class and its wide range of applications in different domains, different variations of the problem have emerged and many heuristics have been proposed for obtaining approximate solutions. In this paper, we solve a Multi-Level Bin Packing (MLBP) problem in a real make-to-order industry scenario. Existing solutions are not applicable to the problem because: 1. the final packing may consist of multiple levels of sub-packings; 2. the geometric shapes of objects as well as the packing constraints may be unknown. We design an automatic packing framework which extracts the packing knowledge from historical records to support packing without geometric shape and constraint information. Furthermore, we propose a dynamic programming approach to find the optimal solution for normal-size problems, and a heuristic multi-level fuzzy-matching algorithm for large-size problems. An inverted index is used to accelerate strategy search. The proposed auto packing framework has been deployed in Huawei Process & Engineering System to assist the packing engineers. It reduces the execution time for processing 5,000 packing orders to about 8 minutes, with an average successful packing rate of 80.54%, which relieves packing workers of at least 30% of their workload.
Published: 2019
Full Text: View/download PDF

21. Semantic Product Search

Author: Weitian (Allen) Ding, Vijai Mohan, Vihan Lakshman, Choon Hui Teo, Ankit Shingavi, Priyanka Nigam, Yiwei Song, Bing Yin, and Hao Gu
Subjects: Matching (statistics), business.industry, Computer science, Deep learning, Hash function, Semantic search, Inverted index, computer.software_genre, Ranking (information retrieval), Tokenization (data security), Ranking, Artificial intelligence, business, computer, Natural language processing, Semantic matching
Abstract: We study the problem of semantic matching in product search, that is, given a customer query, retrieve all semantically related products from the catalog. Pure lexical matching via an inverted index falls short in this respect due to several factors: a) lack of understanding of hypernyms, synonyms, and antonyms, b) fragility to morphological variants (e.g. "woman" vs. "women"), and c) sensitivity to spelling errors. To address these issues, we train a deep learning model for semantic matching using customer behavior data. Much of the recent work on large-scale semantic search using deep learning focuses on ranking for web search. In contrast, semantic matching for product search presents several novel challenges, which we elucidate in this paper. We address these challenges by a) developing a new loss function that has an inbuilt threshold to differentiate between random negative examples, impressed but not purchased examples, and positive examples (purchased items), b) using average pooling in conjunction with n-grams to capture short-range linguistic patterns, c) using hashing to handle out of vocabulary tokens, and d) using a model parallel training architecture to scale across 8 GPUs. We present compelling offline results that demonstrate at least 4.7% improvement in Recall@100 and 14.5% improvement in mean average precision (MAP) over baseline state-of-the-art semantic search methods using the same tokenization method. Moreover, we present results and discuss learnings from online A/B tests which demonstrate the efficacy of our method.
Published: 2019
Full Text: View/download PDF

22. Accelerated Query Processing Via Similarity Score Prediction

Author: Alistair Moffat, J. Shane Culpepper, Matthias Petri, Joel Mackenzie, and Daniel Beck
Subjects: Computer science, 05 social sciences, Process (computing), Value (computer science), 02 engineering and technology, Information retrieval applications, Inverted index, computer.software_genre, Term (time), Similarity (network science), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Information system, Data mining, Pruning (decision trees), 0509 other social sciences, 050904 information & library sciences, computer
Abstract: Processing top-k bag-of-words queries is critical to many information retrieval applications, including web-scale search. In this work, we consider algorithmic properties associated with dynamic pruning mechanisms. Such algorithms maintain a score threshold (the k-th highest similarity score identified so far) so that low-scoring documents can be bypassed, allowing fast top-k retrieval with no loss in effectiveness. In standard pruning algorithms the score threshold is initialized to the lowest possible value. To accelerate processing, we make use of term- and query-dependent features to predict the final value of that threshold, and then employ the predicted value right from the commencement of processing. Because of the asymmetry associated with prediction errors (if the estimated threshold is too high the query will need to be re-executed in order to assure the correct answer), the prediction process must be risk-sensitive. We explore techniques for balancing those factors, and provide detailed experimental results that show the practical usefulness of the new approach.
Published: 2019
Full Text: View/download PDF
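
Illustration (not part of the catalog record): the threshold mechanism described in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: the function names `top_k` and `top_k_with_prediction`, the dictionary-based postings layout, and the per-term score upper bounds are all hypothetical, and the predicted threshold is simply passed in rather than learned from term- and query-dependent features.

```python
import heapq


def top_k(postings, upper_bounds, query_terms, k, threshold=float("-inf")):
    """Return up to k (score, doc_id) pairs, highest score first.

    postings:     term -> {doc_id: score contribution}   (assumed layout)
    upper_bounds: term -> largest contribution of that term in any document
    threshold:    starting score threshold; -inf is the standard initialization
    """
    # Every document containing at least one query term is a candidate.
    candidates = set()
    for term in query_terms:
        candidates.update(postings.get(term, {}))

    heap = []  # min-heap holding the best (score, doc_id) pairs seen so far
    for doc in sorted(candidates):
        matching = [t for t in query_terms if doc in postings.get(t, {})]
        # Cheap upper bound: if it cannot beat the threshold, bypass the document.
        if sum(upper_bounds[t] for t in matching) <= threshold:
            continue
        score = sum(postings[t][doc] for t in matching)
        if score <= threshold:
            continue
        heapq.heappush(heap, (score, doc))
        if len(heap) > k:
            heapq.heappop(heap)
        if len(heap) == k:
            # The k-th highest score found so far becomes the new threshold.
            threshold = max(threshold, heap[0][0])
    return sorted(heap, reverse=True)


def top_k_with_prediction(postings, upper_bounds, query_terms, k, predicted):
    """Start from a predicted threshold; re-execute if it was too high."""
    results = top_k(postings, upper_bounds, query_terms, k, threshold=predicted)
    if len(results) < k:
        # Overestimate: fewer than k documents survived, so run again safely.
        return top_k(postings, upper_bounds, query_terms, k), True
    return results, False
```

A prediction that is too low costs only some extra scoring, whereas one that is too high forces a second pass, which is the asymmetry the abstract refers to.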

23. Query-Task Mapping

Author: Benno Stein, Matthias Hagen, Michael Völske, and Ehsan Fatehifar
Subjects: Computer science, Information needs, 02 engineering and technology, computer.software_genre, Inverted index, Task (project management), 020204 information systems, Trie, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), Key (cryptography), 020201 artificial intelligence & image processing, Word2vec, Data mining, computer, Word (computer architecture)
Abstract: Several recent task-based search studies aim at splitting query logs into sets of queries for the same task or information need. We address the natural next step: mapping a currently submitted query to an appropriate task in an already task-split log. This query-task mapping can, for instance, enhance query suggestions, which makes the efficiency of the mapping, besides accuracy, a key objective. Our main contributions are three large benchmark datasets and preliminary experiments with four query-task mapping approaches: (1) a Trie-based approach, (2) MinHash LSH, (3) word mover's distance in a Word2Vec setup, and (4) an inverted index-based approach. The experiments show that the fast and accurate inverted index-based method forms a strong baseline.
Published: 2019
Full Text: View/download PDF

24. JOSIE

Author: Dong Deng, Erkang Zhu, Renée J. Miller, and Fatemeh Nargesian
Subjects: Matching (graph theory), Computer science, Intersection (set theory), Nearest neighbor search, 02 engineering and technology, String searching algorithm, Inverted index, Column (database), Set (abstract data type), Search algorithm, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Algorithm
Abstract: We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are designed for and evaluated over sets of relatively small size (average set size rarely much over 100 and maximum set size in the low thousands) with modest dictionary sizes (the total number of distinct values in all sets is only a few million). We observe that modern data lakes typically have massive set sizes (with maximum set sizes that may be tens of millions) and dictionaries that include hundreds of millions of distinct values. Our new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets. We show that JOSIE completely outperforms the state-of-the-art overlap set similarity search techniques on data lakes. More surprisingly, we also consider a state-of-the-art approximate algorithm and show that our new exact search algorithm performs almost as well, and even in some cases better, on real data lakes.
Published: 2019
Full Text: View/download PDF
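
Illustration (not part of the catalog record): the problem formulation in the abstract above, treating columns as sets and ranking them by overlap with a query column, can be sketched with a plain inverted index. This is a naive baseline that probes every posting list of the query; it is not the JOSIE algorithm, which the abstract says works to minimize exactly these reads and probes. The names `build_index` and `top_k_overlap` and the sample data layout are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(sets):
    """Map each distinct value to the ids of the sets (columns) containing it."""
    index = defaultdict(list)
    for set_id, values in sets.items():
        for value in set(values):  # count each distinct value once per set
            index[value].append(set_id)
    return index


def top_k_overlap(index, query, k):
    """Return the k sets sharing the most distinct values with the query."""
    counts = Counter()
    for value in set(query):
        # One inverted index probe per distinct query value.
        for set_id in index.get(value, ()):
            counts[set_id] += 1
    # Highest overlap first; ties broken by set id for a deterministic order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]
```

The cost of this baseline grows with the total length of the probed posting lists, which hints at why very large sets and dictionaries call for a more careful strategy.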

25. RobustiQ

Author: Wei Zhao, Fuhao Zou, Jincai Chen, Yuan-Fang Li, Ping Lu, and Wei Chen
Subjects: Speedup, Robustness (computer science), Computer science, Nearest neighbor search, Lookup table, Search engine indexing, Scalability, 0202 electrical engineering, electronic engineering, information engineering, Codebook, 020201 artificial intelligence & image processing, 02 engineering and technology, Inverted index, Algorithm
Abstract: GPU-based methods represent the state of the art in approximate nearest neighbor (ANN) search, as they are scalable (billion-scale), accurate (high recall) as well as efficient (sub-millisecond query speed). Faiss, the representative GPU-based ANN system, achieves considerably faster query speed than the representative CPU-based systems. The query accuracy of Faiss critically depends on the number of indexing regions, which in turn is dependent on the amount of available memory. At the same time, query speed deteriorates dramatically with the increase in the number of partition regions. Thus, it can be observed that Faiss suffers from a lack of robustness, in that the fine-grained partitioning of datasets is achieved at the expense of search speed, and vice versa. In this paper, we introduce a new GPU-based ANN search method, Robust Quantization (RobustiQ), that addresses the robustness limitations of existing GPU-based methods in a holistic way. We design a novel hierarchical indexing structure using vector and bilayer line quantization. This indexing structure, together with our indexing and encoding methods, allows RobustiQ to avoid the need for maintaining a large lookup table, and hence reduces not only memory consumption but also query complexity. Our extensive evaluation on two public billion-scale benchmark datasets, SIFT1B and DEEP1B, shows that RobustiQ consistently obtains a 2-3× speedup over Faiss while achieving better query accuracy for different codebook sizes. Compared to the best CPU-based ANN systems, RobustiQ achieves even more pronounced average speedups of 51.8× and 11× respectively.
Published: 2019
Full Text: View/download PDF

26. A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index

Author: Yusen Li, Gang Wang, Xiaoguang Liu, Xinyu Liu, Rebecca J. Stones, and Zhaohua Zhang
Subjects: Index (economics), business.industry, Intersection (set theory), Computer science, 020207 software engineering, 02 engineering and technology, Inverted index, Search engine, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Partition (number theory), Local search (optimization), business, Algorithm
Abstract: Search engines encounter a time vs. space trade-off: search responsiveness (i.e., a short query response time) comes at the cost of increased index storage. We propose a hybrid method which uses both (a) the recently published mapping-matrix-style index BitFunnel (BF) for search efficiency, and (b) the state-of-the-art Partitioned Elias-Fano (PEF) inverted-index compression method. We use this proposed hybrid method to minimize time while satisfying a fixed space constraint, and to minimize space while satisfying a fixed time constraint. Each document is stored using either BF or PEF, and we use a local search strategy to find an approximately optimal BF-PEF partition. Since performing full experiments on each candidate BF-PEF partition is impractically slow, we use a regression model to predict the time and space costs resulting from candidate partitions (space accuracy 97.6%; time accuracy 95.2%). Compared with a hybrid mathematical index (Ottaviano et al., 2015), the time cost is reduced by up to 47% without significantly exceeding its size. Compared with three mathematical encoding methods, the hybrid BF-PEF index performs list intersection around 16% to 76% faster (without significantly increasing the index size). Compared with BF, the index size is reduced by 45% while maintaining an intersection time comparable to that of BF.
Published: 2019
Full Text: View/download PDF

27. Fast Dictionary-Based Compression for Inverted Indexes

Author: Giulio Ermanno Pibiri, Matthias Petri, and Alistair Moffat
Subjects: decoding, Settore INF/01 - Informatica, Degree (graph theory), Computer science, Decoding, Compression, Inverted index, efficiency, compression, Integer sequence, Efficiency, Data_CODINGANDINFORMATIONTHEORY, Compression (functional analysis), Index compression, Range (statistics), Algorithm, Decoding methods
Abstract: Dictionary-based compression schemes provide fast decoding operation, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off balance between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
Published: 2019
Full Text: View/download PDF

28. The Potential of Learned Index Structures for Index Compression

Author: Oosterhuis, H., Culpepper, J.S., de Rijke, M., Koopman, B., Trotman, A., Thomas, P., Information and Language Processing Syst (IVI, FNWI), and Communication
Subjects: FOS: Computer and information sciences, Information retrieval, Computer science, Search engine indexing, 02 engineering and technology, Data structure, Inverted index, Computer Science - Information Retrieval, Term (time), Identifier, Index (publishing), Position (vector), 020204 information systems, Index compression, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Information Retrieval (cs.IR)
Abstract: Inverted indexes are vital in providing fast keyword-based search. For every term in the document collection, a list of identifiers of documents in which the term appears is stored, along with auxiliary information such as term frequency and position offsets. While very effective, inverted indexes have large memory requirements for web-sized collections. Recently, the concept of learned index structures was introduced, where machine-learned models replace common index structures such as B-tree indexes, hash indexes, and Bloom filters. These learned index structures require less memory, and can be computationally much faster than their traditional counterparts. In this paper, we consider whether such models may be applied to conjunctive Boolean querying. First, we investigate how a learned model can replace document postings of an inverted index, and then evaluate the compromises such an approach might have. Second, we evaluate the potential gains that can be achieved in terms of memory requirements. Our work shows that learned models have great potential in inverted indexing, and this direction seems to be a promising area for future research. Comment: Will appear in the proceedings of ADCS'18
Published: 2018
Full Text: View/download PDF
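
Illustration (not part of the catalog record): the conventional structure that the abstract above takes as its starting point, a per-term list of document identifiers answering conjunctive Boolean queries, can be sketched as below. This is a minimal sketch assuming whitespace tokenization and integer document ids; the function names are hypothetical, auxiliary data such as term frequencies and positions is left out, and nothing here reflects the learned models studied in the paper.

```python
def build_inverted_index(docs):
    """Map each term to a sorted list of ids of documents containing it."""
    index = {}
    for doc_id in sorted(docs):  # ascending ids keep posting lists sorted
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


def intersect(a, b):
    """Intersect two sorted posting lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive_query(index, terms):
    """Return ids of documents that contain every query term (Boolean AND)."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
        if not result:
            break
    return result
```

The memory cost the abstract mentions comes from storing one such posting list per term across the whole collection.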

29. Large-Scale Image Retrieval with Elasticsearch

Author: Fabio Carrara, Fabrizio Falchi, Paolo Bolettieri, Giuseppe Amato, and Claudio Gennaro
Subjects: business.industry, Computer science, similarity search, 02 engineering and technology, Machine learning, computer.software_genre, Inverted index, Convolutional neural network, Regional maximum activations of convolutions, image retrieval, content-based image retrieval, Product quantization: mean average precision, Elasticsearch, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, elasticsearch, business, retrieval, computer, Image retrieval
Abstract: Content-Based Image Retrieval in large archives through the use of visual features has become a very attractive research topic in recent years. The cause of this strong impulse in this area of research is certainly to be attributed to the use of Convolutional Neural Network (CNN) activations as features and their outstanding performance. However, practically all the available image retrieval systems are implemented in main memory, limiting their applicability and preventing their usage in big-data applications. In this paper, we propose to transform CNN features into textual representations and index them with the well-known full-text retrieval engine Elasticsearch. We validate our approach on a novel CNN feature, namely Regional Maximum Activations of Convolutions. A preliminary experimental evaluation, conducted on the standard benchmark INRIA Holidays, shows the effectiveness and efficiency of the proposed approach and how it compares to state-of-the-art main-memory indexes.
Published: 2018
Full Text: View/download PDF

30. Torch

Author: Xiaolin Qin, Qizhi Liu, Sheng Wang, Zizhe Xie, Zhifeng Bao, and J. Shane Culpepper
Subjects: Computer science, Nearest neighbor search, Search engine indexing, 02 engineering and technology, Similarity measure, Inverted index, Vertex (geometry), Set (abstract data type), Search engine, Similarity (network science), Ranking, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Trajectory, 020201 artificial intelligence & image processing, Algorithm
Abstract: This paper presents a new trajectory search engine called Torch for querying road network trajectory data. Torch is able to efficiently process two typical types of queries (similarity search and Boolean search), and supports a wide variety of trajectory similarity functions. Additionally, we propose a new similarity function, LORS, in Torch to measure similarity in a more effective and efficient manner. Indexing and search in Torch work as follows. First, each raw vehicle trajectory is transformed to a set of road segments (edges) and a set of crossings (vertices) on the road network. Then a lightweight edge and vertex index called LEVI is built. Given a query, a filtering framework over LEVI is used to dynamically prune the trajectory search space based on the similarity measure imposed. Finally, the result set (ranked or Boolean) is returned. Extensive experiments on real trajectory datasets verify the effectiveness and efficiency of Torch.
Published: 2018
Full Text: View/download PDF

31. Processing Class-Constraint K-NN Queries with MISP

Author: Evica Milchevski, Fabian Neffgen, and Sebastian Michel
Subjects: Structure (mathematical logic), Class (computer programming), Theoretical computer science, Computer science, Nearest neighbor search, 020207 software engineering, 02 engineering and technology, Type (model theory), Inverted index, Constraint (information theory), Index (publishing), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Pruning (decision trees)
Abstract: In this work, we consider processing k-nearest-neighbor (k-NN) queries, with the additional requirement that the result objects are of a specific type. To solve this problem, we propose an approach based on a combination of an inverted index and a state-of-the-art similarity search index structure for efficiently pruning the search space early on. Furthermore, we provide a cost model and an extensive experimental study that analyzes the performance of the proposed index structure under different configurations, with the aim of finding the most efficient one for the dataset being searched.
Published: 2018
Full Text: View/download PDF

32. ZigZag

Author: Yang Li, Lingfeng Deng, Chen Li, and Wenhai Li
Subjects: Theoretical computer science, Correctness, Computer science, 02 engineering and technology, Filter (higher-order function), Inverted index, Data set, External storage, Similarity (network science), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Vector space model, 020201 artificial intelligence & image processing, Pruning (decision trees)
Abstract: In this paper we study the problem of supporting similarity queries on a large number of records using a vector space model, where each record is a bag of tokens. We consider similarity functions that incorporate non-negative global token weights as well as record-specific token degrees. We develop a family of algorithms based on an inverted index for large data sets, especially for the case of using external storage such as hard disks or flash drives, and present pruning techniques based on various bounds to improve their performance. We formally prove the correctness of these techniques, and show how to achieve better pruning power by iteratively tightening these bounds to exactly filter dissimilar records. We conduct an extensive experimental study using real, large-scale data sets based on different storage platforms, including memory, hard disks, and flash drives. The results show that these algorithms and techniques can efficiently support similarity queries on large data sets.
Published: 2018
Full Text: View/download PDF

33. Research on Ship Image Retrieval Based on BoVW Model under Hadoop Platform

Author: Rong Hu, Zhiqiang Guo, Bangpei Zhu, and Jie Yang
Subjects: Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Visual dictionary, Inverted index, Image (mathematics), Term (time), Word lists by frequency, Bag-of-words model in computer vision, Key (cryptography), Computer vision, Artificial intelligence, business, Image retrieval
Abstract: Image data is one of the key data in the ship's navigation record. Ship scene reappearance depends on its efficient retrieval. Recently, the exponential growth of the number of images makes the traditional single-machine image retrieval method gradually show the problem of inefficiency. In this paper, the image retrieval method based on the Bag of Visual Words (BoVW) model under the Hadoop platform is proposed and the distributed image retrieval is realized. Firstly, this paper takes BoVW model as the research object. Based on the Hadoop platform, the construction method of traditional visual dictionary is improved and the word frequency vectors are weighted by Term Frequency-Inverse Document Frequency (TF-IDF). Then the inverted index is generated in parallel for image retrieval. Experimental results show this method doubled the efficiency of visual dictionary construction while maintaining the original retrieval results and effectively improved the efficiency of image retrieval.
Published: 2018
Full Text: View/download PDF

34. Class-constraint similarity queries

Author: Sebastian Michel, Agma J. M. Traina, and Jessica A. de Souza
Subjects: Class (computer programming), Theoretical computer science, Similarity (network science), Computer science, 020204 information systems, Bounded function, Metric (mathematics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 02 engineering and technology, Inverted index, k-nearest neighbors algorithm
Abstract: Similarity searching is a widely applied concept on multimedia or complex data, such as images, videos, and time series, among others. Therefore, it is important to look at the execution of specific query types, e.g., constrained k-nearest neighbor that is directly based on bounded regions. In this paper, we present the Class-Constraint k-Nearest Neighbor (CCkNN) query, which goes beyond the traditional constrained k-nearest neighbor, because our CCkNN works for any specific categories of data points. The proposed CCkNN aims at accelerating the process of class-constraint similarity query execution by taking advantage of performing queries on multiple metric access methods regarding the class dimensions of the objects of each index. Additionally, this strategy identifies which index is more appropriate for running class-constraint k-nearest neighbor queries. Experimental results based on several datasets, including synthetic and real ones, show that our strategy can reduce the number of distance calculations by up to two orders of magnitude while keeping a high-quality retrieval, according to the classes of the objects queried.
Published: 2018
Full Text: View/download PDF

35. A Unified Processing Paradigm for Interactive Location-based Web Search

Author: Shixun Huang, Zhifeng Bao, Rui Zhang, and Sheng Wang
Subjects: Information retrieval, Interactive search, Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Search engine indexing, 02 engineering and technology, Inverted index, Point data, Monotone polygon, Robustness (computer science), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Information system, 020201 artificial intelligence & image processing
Abstract: This paper studies location-based web search and aims to build a unified processing paradigm for two purposes: (1) efficiently support each of the various types of location-based queries (kNN query, top-k spatial-textual query, etc.) on two major forms of geo-tagged data, i.e., spatial point data such as geo-tagged web documents, and spatial trajectory data such as a sequence of geo-tagged travel blogs by a user; (2) support interactive search to provide quick response for a query session, within which a user usually keeps refining her query by either issuing different query types or specifying different constraints (e.g., adding a keyword and/or location, changing the choice of k, etc.) until she finds the desired results. To achieve this goal, we first propose a general Top-k query called the Monotone Aggregate Spatial Keyword (MASK) query, which is able to cover most types of location-based web search. Next, we develop a unified indexing scheme (called Textual-Grid-Point Inverted Index) and query processing paradigm (called ETAIL Algorithm) to answer a single MASK query efficiently. Furthermore, we extend ETAIL to provide interactive search for multiple queries within one query session, by exploiting the commonality of textual and/or spatial dimensions among queries. Last, extensive experiments on four real datasets verify the robustness and efficiency of our approach.
Published: 2018
Full Text: View/download PDF

36. Index Compression Using Byte-Aligned ANS Coding and Two-Dimensional Contexts

Author: Alistair Moffat and Matthias Petri
Subjects: Computer science, Code word, Byte, 02 engineering and technology, Inverted index, Numeral system, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Information system, Probability distribution, 020201 artificial intelligence & image processing, Entropy encoding, Algorithm, Coding (social sciences)
Abstract: We examine approaches used for block-based inverted index compression, such as the OptPFOR mechanism, in which fixed-length blocks of postings data are compressed independently of each other. Building on previous work in which asymmetric numeral systems (ANS) entropy coding is used to represent each block, we explore a number of enhancements: (i) the use of two-dimensional conditioning contexts, with two aggregate parameters used in each block to categorize the distribution of symbol values that underlies the ANS approach, rather than just one; (ii) the use of a byte-friendly strategic mapping from symbols to ANS codeword buckets; and (iii) the use of a context merging process to combine similar probability distributions. Collectively, these improvements yield superior compression for index data, outperforming the reference point set by the Interp mechanism, and hence representing a significant step forward. We describe experiments using the 426 GiB gov2 collection and a new large collection of publicly-available news articles to demonstrate that claim, and provide query evaluation throughput rates compared to other block-based mechanisms.
Published: 2018
Full Text: View/download PDF

37. A Closed Frag-Shells Cubing Algorithm on High Dimensional and Non-Hierarchical Data Sets

Author: Dingsheng Wan, Yuelong Zhu, Shanshan Tang, and Qun Zhao
Subjects: Multidimensional analysis, Computer science, Online analytical processing, InformationSystems_DATABASEMANAGEMENT, 020206 networking & telecommunications, 02 engineering and technology, computer.file_format, Inverted index, Data segment, Data cube, 0202 electrical engineering, electronic engineering, information engineering, Bitmap index, Bitmap, 020201 artificial intelligence & image processing, Cube, computer, Algorithm, Computer Science::Databases
Abstract: For high-dimensional and non-hierarchical large data sets, this paper proposes an improved CFSC (Closed Frag-Shells Cube) method based on the Frag-Shells method. When the Data Cube is generated, the high-dimensional data is divided into several low-dimensional data fragments by using the idea of partitioning cubes into dimension attributes. For each dimension data segment, the closed cubes are computed using the closed cube calculation. A query bitmap is added to each fragment, and a query index table of closed segments is constructed by using bitmap index technology to reduce the storage space occupied by the result set and to increase the query efficiency. An application to the multidimensional analysis of water conservancy census data shows that this method can effectively reduce the storage space of the cube data and improve the efficiency of OLAP (online analytical processing) queries.
Published: 2018
Full Text: View/download PDF

38. Efficient In-Memory, List-Based Text Inversion

Author: David Hawking and Bodo Billerbeck
Subjects: Hardware_MEMORYSTRUCTURES, Speedup, Computer science, Search engine indexing, 020207 software engineering, 02 engineering and technology, Linked list, Parallel computing, Inverted index, 020204 information systems, Virtual memory, Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, Memory footprint, Indexer, Page table
Abstract: When building a large inverted file index on a system with effectively unlimited memory, performance may be constrained by RAM latency. Optimising speed requires an understanding of the non-uniform memory access characteristics of modern systems. We address three main techniques for improving the performance of an in-memory, list-based inverted file indexer: list chunking, in-chunk postings compression, and use of virtual memory "Large Pages". We compare performance of dynamic chunking schemes capable of adapting to the Zipf-like distribution of term frequencies. Using a data set with 8.5 billion word occurrences, we find that the techniques are cumulative. Chunking almost halves the memory required for linked lists, while dramatically reducing the number of cache-line reads required to traverse the lists; in-chunk compression further halves the memory footprint, though it does not make much difference to speed; large pages reduce the inefficiency of page table walks and speed up both phases of index building.
Published: 2017
Full Text: View/download PDF
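
Illustrative note on entry 38: the list chunking mentioned in the abstract can be pictured with a small sketch. The class name `ChunkedPostings`, the doubling growth policy, and the capacity limits below are assumptions made for illustration and are not taken from the paper; the sketch only shows the general idea of storing several postings per list node instead of one.

```python
from array import array


class ChunkedPostings:
    """Postings list stored as a sequence of chunks (hypothetical layout).

    Each chunk holds several doc ids contiguously, so traversing the list
    touches far fewer separate nodes than a one-posting-per-node linked list.
    """

    def __init__(self, first_capacity=4, max_capacity=1024):
        self.chunks = []          # one array of doc ids per chunk
        self.capacities = []      # capacity assigned to each chunk
        self.first_capacity = first_capacity
        self.max_capacity = max_capacity

    def append(self, doc_id):
        # Start a new chunk when there is none yet or the last one is full.
        if not self.chunks or len(self.chunks[-1]) == self.capacities[-1]:
            if self.chunks:
                # Assumed policy: double the capacity so frequent terms
                # end up with large chunks and rare terms stay small.
                capacity = min(self.capacities[-1] * 2, self.max_capacity)
            else:
                capacity = self.first_capacity
            self.chunks.append(array("I"))
            self.capacities.append(capacity)
        self.chunks[-1].append(doc_id)

    def __iter__(self):
        for chunk in self.chunks:
            yield from chunk


def build_index(docs):
    """Build a term -> ChunkedPostings map from a list of text documents."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index.setdefault(term, ChunkedPostings()).append(doc_id)
    return index
```

With five short documents that all contain the term "a", the postings for "a" occupy two chunks (capacities 4 and 8) rather than five separate nodes.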

39. Where's Waldo?

Author: Barak Pat and Yaron Kanza
Subjects: Information retrieval, Computer science, Microblogging, 02 engineering and technology, Inverted index, Ranking (information retrieval), Index (publishing), 020204 information systems, Online search, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Social media, Relevance (information retrieval), Cluster analysis
Abstract: The myriad geotagged posts in the social media constitute a vibrant information source that can be used to support geosocial search, that is, a search for geographic locations based on user activities in online social networks and microblogging platforms. Unlike a traditional geographic search, the results of a geosocial search are not restricted to predefined entities, and may reflect events, sentiments, and other matters that are expressed in the social media. A search for "jogging", for instance, will indicate popular jogging places. A search for "4-th of July Fireworks" would point out places where people watch the spectacle and tweet about it. Yet, geosocial search is different from ordinary Web search because there is no natural partition of the space into documents. There is a need to find new ways to effectively rank, filter, and present results. In this paper, we introduce a novel two-step search process of first, quickly finding relevant areas by using an arbitrarily indexed partition of the space, and second, applying clustering to the geotagged posts in the discovered areas, to present more accurate results. We propose and compare four different ranking measures for evaluating the relevance of an area to a given query. Our experiments, over a dataset of more than 40 million geotagged posts, illustrate the effectiveness of geosocial search, e.g., for finding events, or in a search based on a sentiment, in comparison to ordinary geographic search. Online search is supported by a partition-aware inverted index. Using the index, results are retrieved in a fraction of a second over millions of posts, even on a single standard machine.
Published: 2017
Full Text: View/download PDF
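
Illustrative note on entry 39: a partition-aware inverted index can be sketched as a map from each term to the spatial cells in which it occurs. The uniform latitude/longitude grid, the cell size, and the use of a raw post count as the ranking measure are all assumptions for illustration; the abstract does not specify the partition or the four ranking measures it compares.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to a grid cell id (assumed uniform grid)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def build_cell_index(posts, cell_deg=0.01):
    """Build term -> {cell: number of posts in that cell containing the term}."""
    index = defaultdict(lambda: defaultdict(int))
    for lat, lon, text in posts:
        cell = cell_of(lat, lon, cell_deg)
        for term in set(text.lower().split()):
            index[term][cell] += 1
    return index


def top_cells(index, term, k=3):
    """Return the k cells with the most matching posts (assumed measure)."""
    cells = index.get(term.lower(), {})
    # Sort by descending count, then by cell id for a deterministic order.
    return sorted(cells.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

In a two-step process like the one the abstract describes, the cells returned by `top_cells` would then be passed to a clustering step over the posts they contain.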

40. An Empirical Analysis of Pruning Techniques

Author: Leif Azzopardi, Ruey-Cheng Chen, and Falk Scholer
Subjects: QA75, Point (typography), Relation (database), Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Search engine indexing, 02 engineering and technology, computer.software_genre, Inverted index, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data mining, Pruning (decision trees), computer, Retrievability
Abstract: Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid out the foundations for investigating the relation between retrieval performance and retrieval bias. While various factors influencing retrievability have been examined, showing how the retrieval model may influence bias, no prior work has examined the impact of the index (and how it is optimized) on retrieval bias. Intuitively, how the documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the retrieval bias of a system changes as the inverted index is optimized for efficiency through static index pruning. In our analysis, we consider four pruning methods and examine how they affect performance and bias on the TREC GOV2 Collection. Our results show that the relationship between these factors is varied and complex - and very much dependent on the pruning algorithm. We find that more pruning results in relatively little change or a slight decrease in bias up to a point, and then a dramatic increase. The increase in bias corresponds to a sharp decrease in early precision such as NDCG@10 and is also indicative of a large decrease in MAP. The findings suggest that the impact of pruning algorithms can be quite varied - but retrieval bias could be used to guide the pruning process. Further work is required to determine precisely which documents are most affected and how this impacts upon performance.
Published: 2017
Full Text: View/download PDF

41. ANS-Based Index Compression

Author: Matthias Petri and Alistair Moffat
Subjects: business.industry, Computer science, Pattern recognition, Data_CODINGANDINFORMATIONTHEORY, 02 engineering and technology, Inverted index, Blocking (statistics), Set (abstract data type), 020204 information systems, Compression (functional analysis), 0202 electrical engineering, electronic engineering, information engineering, Range (statistics), 020201 artificial intelligence & image processing, Artificial intelligence, Entropy encoding, business, Decoding methods
Abstract: Techniques for effectively representing the postings lists associated with inverted indexes have been studied for many years. Here we combine the recently developed "asymmetric numeral systems" (ANS) approach to entropy coding and a range of previous index compression methods, including VByte, Simple, and Packed. The ANS mechanism allows each of them to provide markedly improved compression effectiveness, at the cost of slower decoding rates. Using the 426 GB GOV2 collection, we show that the combination of blocking and ANS-based entropy-coding against a set of 16 magnitude-based probability models yields compression effectiveness superior to most previous mechanisms, while still providing reasonable decoding speed.
Published: 2017
Full Text: View/download PDF
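
Illustrative note on entry 41: VByte, one of the baseline methods named in the abstract, is simple enough to sketch. The code below is not the ANS-based scheme the paper proposes. It uses one common byte convention (low-order 7 bits first, high bit set when more bytes follow), which is an assumption, since VByte variants differ in how they flag the final byte. Doc ids are stored as gaps between consecutive ids so that most values fit in a single byte.

```python
def to_gaps(doc_ids):
    """Convert a strictly increasing doc id list into gaps (first gap from 0)."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: rebuild doc ids by running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("VByte encodes non-negative integers only")
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)                      # high bit clear: last byte
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(value)
            value, shift = 0, 0
    return numbers
```

For the postings `[3, 7, 130, 131, 1000]` the gaps are `[3, 4, 123, 1, 869]`, which encode to 6 bytes, since only the last gap needs two bytes.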

42. Indexable Bayesian Personalized Ranking for Efficient Top-k Recommendation

Author: Hady W. Lauw and Dung D. Le
Subjects: Information retrieval, Computer science, Hash function, Search engine indexing, Bayesian probability, 02 engineering and technology, 010501 environmental sciences, Inverted index, 01 natural sciences, k-nearest neighbors algorithm, Ranking (information retrieval), Ranking, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Representation (mathematics), 0105 earth and related environmental sciences
Abstract: Top-k recommendation seeks to deliver a personalized recommendation list of k items to a user. The dual objectives are (1) accuracy in identifying the items a user is likely to prefer, and (2) efficiency in constructing the recommendation list in real time. One direction towards retrieval efficiency is to formulate retrieval as approximate k nearest neighbor (kNN) search aided by indexing schemes, such as locality-sensitive hashing, spatial trees, and inverted index. These schemes, applied on the output representations of recommendation algorithms, speed up the retrieval process by automatically discarding a large number of potentially irrelevant items when given a user query vector. However, many previous recommendation algorithms produce representations that may not necessarily align well with the structural properties of these indexing schemes, eventually resulting in a significant loss of accuracy post-indexing. In this paper, we introduce Indexable Bayesian Personalized Ranking (IBPR) that learns from ordinal preference to produce representation that is inherently compatible with the aforesaid indices. Experiments on publicly available datasets show superior performance of the proposed model compared to state-of-the-art methods on top-k recommendation retrieval task, achieving significant speedup while maintaining high accuracy.
Published: 2017
Full Text: View/download PDF

43. Quantization in Append-Only Collections

Author: Jimmy Lin, Matt Crane, and Salman Mohammed
Subjects: Search engine, Computer science, 020204 information systems, Quantization (signal processing), 0202 electrical engineering, electronic engineering, information engineering, Append, 020201 artificial intelligence & image processing, 02 engineering and technology, Data mining, computer.software_genre, Inverted index, computer
Abstract: Quantization, the pre-calculation and conversion to integers of term/document weights in an inverted index, is a well studied aspect of search engines that substantially improves retrieval efficiency. Previous work has considered the impact of quantization on effectiveness-efficiency tradeoffs in retrieval, for example, exploring the relationship between collection size and quantization range in static web collections. We extend previous work to append-only collections and examine whether quantization settings derived from prior time periods can be applied to future time periods. Experiments confirm that previous results generalize to a collection with different characteristics and with a different ranking function, and that in an append-only collection, we can use previous quantization settings in future time periods without substantial losses in either effectiveness or efficiency.
Published: 2017
Full Text: View/download PDF
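
Illustrative note on entry 43: quantization as defined in the abstract (pre-calculating term/document weights and converting them to integers) can be sketched with a uniform linear mapping. The formula, the function names, and the clamping of out-of-range weights are assumptions for illustration, not the settings studied in the paper. The sketch fits the quantizer on weights from an earlier period and reuses it for later weights, which mirrors the append-only question the abstract raises.

```python
def fit_quantizer(weights, bits=8):
    """Derive quantization settings from weights seen in an earlier period."""
    return {"lo": min(weights), "hi": max(weights), "levels": (1 << bits) - 1}


def quantize(weight, settings):
    """Map a real-valued weight to an integer in [1, levels].

    Weights outside the fitted range are clamped (an assumed policy for
    documents appended after the settings were derived).
    """
    lo, hi, levels = settings["lo"], settings["hi"], settings["levels"]
    if hi == lo:
        return 1
    clamped = min(max(weight, lo), hi)
    fraction = (clamped - lo) / (hi - lo)
    return 1 + int(fraction * (levels - 1))
```

With `bits=2` and earlier weights ranging from 0.0 to 4.0, a weight of 2.0 maps to 2, and a later weight of 10.0 is clamped to the top level, 3.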

44. Upper Bound Approximation for BlockMaxWand

Author: Nicola Tonellotto and Craig Macdonald
Subjects: Rank (linear algebra), Computer science, 02 engineering and technology, Inverted index, Upper and lower bounds, Term (time), Weighting, BlockMaxWand, Search engine, Upper-bounds approximations, Wand, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Pruning (decision trees), Algorithm, Block (data storage)
Abstract: BlockMaxWand is a recent advance on the Wand dynamic pruning technique, which allows efficient retrieval without any effectiveness degradation to rank K. However, while BMW uses docid-sorted indices, it relies on recording the upper bound of the term weighting model scores for each block of postings in the inverted index. Such a requirement can be disadvantageous in situations such as when an index must be updated. In this work, we examine the appropriateness of upper-bound approximations, which have previously been shown suitable for Wand, in providing efficient retrieval for BMW. Experiments on the ClueWeb12 category B13 corpus using 5000 queries from a real search engine’s query log demonstrate that BMW still provides benefits w.r.t. Wand when approximate upper bounds are used, and that, if approximations on upper bounds are tight, BMW with approximate upper bounds can provide efficiency gains w.r.t. Wand with exact upper bounds, in particular for queries of short to medium length.
Published: 2017
Full Text: View/download PDF
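
Illustrative note on entry 44: the role of per-block upper bounds can be sketched for a single postings list. This is not the BlockMaxWand algorithm itself, which coordinates several lists. The sketch only shows that a block can be skipped when its recorded bound cannot exceed the current threshold, and that a looser bound stays safe as long as it never underestimates the true block maximum. The `loosen` function, which rounds bounds up to a coarse grid, is a hypothetical stand-in and not the approximation examined in the paper.

```python
import math


def block_maxes(postings, block_size):
    """Exact maximum score of each block of (doc_id, score) postings."""
    return [
        max(score for _, score in postings[start:start + block_size])
        for start in range(0, len(postings), block_size)
    ]


def loosen(bounds, step):
    """Hypothetical approximation: round every bound up to a multiple of step."""
    return [math.ceil(bound / step) * step for bound in bounds]


def scan(postings, bounds, block_size, threshold):
    """Return doc ids scoring above threshold and the number of blocks read.

    A block is skipped without being read when its bound is not above
    the threshold, since no posting inside it can qualify.
    """
    hits, blocks_read = [], 0
    for block, bound in enumerate(bounds):
        if bound <= threshold:
            continue
        blocks_read += 1
        start = block * block_size
        for doc_id, score in postings[start:start + block_size]:
            if score > threshold:
                hits.append(doc_id)
    return hits, blocks_read
```

On a six-posting list with blocks of two and a threshold of 1.0, the exact bounds allow one of three blocks to be skipped, while bounds rounded up to multiples of 2.0 return the same documents but read all three blocks. This is the trade-off between bound tightness and efficiency that the abstract refers to.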

45. BitFunnel

Author: Sameh Elnikety, Mihaela Curmei, Yuxiong He, Alex Clemmer, Michael Hopcroft, Bob Goodwin, and Dan Luu
Subjects: Information retrieval, Intersection (set theory), Computer science, business.industry, Search engine indexing, 020207 software engineering, Cloud computing, 02 engineering and technology, Bloom filter, computer.software_genre, Inverted index, Signature (logic), Search engine, Index (publishing), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Data mining, business, computer, Block (data storage)
Abstract: Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index. The driving factor behind the shift away from the inverted index was operational cost savings. This paper describes algorithmic innovations and changes in the cloud computing landscape that led us to reconsider and eventually field a technology that was once considered unusable. The BitFunnel algorithm directly addresses four fundamental limitations in bit-sliced block signatures. At the same time, our mapping of the algorithm onto a cluster offers opportunities to avoid other costs associated with signatures. We show these innovations yield a significant efficiency gain versus classic bit-sliced signatures and then compare BitFunnel with Partitioned Elias-Fano Indexes, MG4J, and Lucene.
Published: 2017
Full Text: View/download PDF
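
Illustrative note on entry 45: a classic bit-sliced signature file, the structure BitFunnel starts from, can be sketched with Python integers used as bit vectors. This is not the BitFunnel algorithm. The signature width, the number of hash positions per term, and the SHA-256-based hashing are assumptions for illustration. Each slice holds one bit per document, and a query ANDs the slices selected by its terms, so matches may include false positives but never miss a document that contains every query term.

```python
import hashlib


def bit_positions(term, width, k):
    """Pick k signature bit positions for a term (assumed hashing scheme)."""
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % width)
    return positions


def build_slices(docs, width=64, k=3):
    """Build `width` bit slices; bit d of slice p is set if doc d sets bit p."""
    slices = [0] * width
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            for position in bit_positions(term, width, k):
                slices[position] |= 1 << doc_id
    return slices


def query(slices, terms, num_docs, k=3):
    """Return candidate doc ids: the AND of all slices the query terms select."""
    result = (1 << num_docs) - 1          # start with every document
    width = len(slices)
    for term in terms:
        for position in bit_positions(term.lower(), width, k):
            result &= slices[position]
    return [d for d in range(num_docs) if (result >> d) & 1]
```

Because candidates can contain false positives, a real system would either verify them against the documents or tune the width and `k` to keep the error rate acceptable.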

46. An Experimental Study of Bitmap Compression vs. Inverted List Compression

Author: Yannis Papakonstantinou, Jianguo Wang, Steven Swanson, and Chunbin Lin
Subjects: Lossless compression, Computer science, Intersection (set theory), Data_CODINGANDINFORMATIONTHEORY, 02 engineering and technology, computer.file_format, Inverted index, computer.software_genre, 020204 information systems, Compression (functional analysis), 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Bitmap, 020201 artificial intelligence & image processing, Data mining, computer, Data compression
Abstract: Bitmap compression has been studied extensively in the database area and many efficient compression schemes were proposed, e.g., BBC, WAH, EWAH, and Roaring. Inverted list compression is also a well-studied topic in the information retrieval community and many inverted list compression algorithms were developed as well, e.g., VB, PforDelta, GroupVB, Simple8b, and SIMDPforDelta. We observe that they essentially solve the same problem, i.e., how to store a collection of sorted integers with as few as possible bits and support query processing as fast as possible. Due to historical reasons, bitmap compression and inverted list compression were developed as two separated lines of research in the database area and information retrieval area. Thus, a natural question is: Which one is better between bitmap compression and inverted list compression? To answer the question, we present the first comprehensive experimental study to compare a series of 9 bitmap compression methods and 12 inverted list compression methods. We compare these 21 algorithms on synthetic datasets with different distributions (uniform, zipf, and markov) as well as 8 real-life datasets in terms of the space overhead, decompression time, intersection time, and union time. Based on the results, we provide many lessons and guidelines that can be used for practitioners to decide which technique to adopt in future systems and also for researchers to develop new algorithms.
Published: 2017
Full Text: View/download PDF
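
Illustrative note on entry 46: the observation that bitmaps and inverted lists store the same thing, a set of sorted integers, can be shown with uncompressed versions of both. None of the compression schemes named in the abstract is implemented here. The sketch only contrasts intersection by merging two sorted lists with intersection by a bitwise AND of two bitmaps, using a Python integer as the bitmap (an assumption made for brevity).

```python
def intersect_sorted(a, b):
    """Intersect two sorted integer lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def to_bitmap(ids):
    """Represent a set of non-negative integers as bits of one integer."""
    bitmap = 0
    for value in ids:
        bitmap |= 1 << value
    return bitmap


def from_bitmap(bitmap):
    """List the positions of set bits in increasing order."""
    out, position = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(position)
        bitmap >>= 1
        position += 1
    return out
```

Both representations give the same answer: `intersect_sorted(a, b)` and `from_bitmap(to_bitmap(a) & to_bitmap(b))` return the same list. They differ in space and time cost, depending on how dense the sets are.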

47. Frag-shells cube based on hierarchical dimension encoding tree

Author: Yuelong Zhu, Jianguo Yao, Dingsheng Wan, and Shanshan Tang
Subjects: Theoretical computer science, Computer science, Online analytical processing, 020206 networking & telecommunications, Cube (algebra), 02 engineering and technology, Inverted index, Data cube, Tree (data structure), Dimension (vector space), Encoding (memory), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Tuple
Abstract: Pre-computation of a data cube can greatly improve the performance of OLAP (online analytical processing). There are many effective data cube pre-computation methods, but in practice, choosing a pre-computation method appropriate to the characteristics of the data set plays a crucial role in improving the efficiency of data cube pre-computation. In view of the high-dimensional and hierarchical-dimension characteristics of water census data, this paper proposes a Frag-Shells cube based on a hierarchical dimension encoding tree. To improve retrieval efficiency and reduce redundant information in hierarchical dimensions, we propose the hierarchical dimension encoding tree (HDE-Tree) to index the hierarchical dimensions in the Frag-Shells cube. To increase the efficiency of cube construction, an improved Frag-Shells cube calculation method is used to compute the tuples of non-hierarchical dimension fragments. To compress the data cube, the TID-List compression method is used to decrease the storage cost of the inverted index in each tuple. Experiments show that the Frag-Shells cube based on the hierarchical dimension encoding tree can reduce the construction time and storage cost of data cubes that have high-dimensional and hierarchical dimensions.
Published: 2017
Full Text: View/download PDF

48. Efficient & Effective Selective Query Rewriting with Efficiency Predictions

Author: Nicola Tonellotto, Craig Macdonald, and Iadh Ounis
Subjects: Information retrieval, Web search query, Computer science, 02 engineering and technology, computer.software_genre, Inverted index, Query optimization, Ranking (information retrieval), Query expansion, Search engine, Web query classification, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Sargable, Query by Example, Data mining, Query Rewriting, computer, Boolean conjunctive query, computer.programming_language
Abstract: To enhance effectiveness, a user's query can be rewritten internally by the search engine in many ways, for example by applying proximity, or by expanding the query with related terms. However, approaches that benefit effectiveness often have a negative impact on efficiency, which has impacts upon the user satisfaction, if the query is excessively slow. In this paper, we propose a novel framework for using the predicted execution time of various query rewritings to select between alternatives on a per-query basis, in a manner that ensures both effectiveness and efficiency. In particular, we propose the prediction of the execution time of ephemeral (e.g., proximity) posting lists generated from uni-gram inverted index posting lists, which are used in establishing the permissible query rewriting alternatives that may execute in the allowed time. Experiments examining both the effectiveness and efficiency of the proposed approach demonstrate that a 49% decrease in mean response time (and 62% decrease in 95th-percentile response time) can be attained without significantly hindering the effectiveness of the search engine.
Published: 2017

49. A semantic-based question answering system for indonesian translation of Quran

Author: Ria Hari Gusmita, Husni Teja Sukmana, Syopiansyah Jaya Putra, and Khodijah Hulliyah
Subjects: Information retrieval, business.industry, Computer science, Process (engineering), Factoid, Ontology (information science), computer.software_genre, Inverted index, Named-entity recognition, Semantic similarity, Question answering, Artificial intelligence, business, computer, Natural language processing, Interpreter
Abstract: This paper presents work on developing a semantic-based question answering system (QAS) for the Indonesian Translation of the Quran (ITQ). The research is motivated by shortcomings of previously built QAS that were caused by keyword-based retrieval. Instead of keeping that retrieval method, we shifted to a semantic approach in which retrieval is performed using a semantic similarity measurement. To do so, we built an ontology of ITQ to obtain the concepts as well as the verses in which they appear. We applied three factoid question types in the QAS: Who, Where, and When. Furthermore, a weighted vector is generated for each concept that belongs to the respective expected answer type (also called a named entity group), i.e. Person, Location, and Time, in order to feed the semantic interpreter for the user question. From 222 concepts defined from the ontology, we clustered them into 77, 24, and 6 concepts for Person, Location, and Time respectively. Since we found that texts in ITQ have some particular characteristics, we developed our own modules to deal with them, including inverted index generation and named entity recognition. Answer extraction is conducted by applying feature extraction in order to score the answer candidates. The system is evaluated using two data sets of questions and answers: the first is intended to measure the effectiveness of the semantic approach compared with keyword-based retrieval, and the second aims to assess system performance with regard to the appearance of concepts in ITQ.
Published: 2016
Full Text: View/download PDF

50. Capturing complex behaviour for predicting distant future trajectories

Author: Vaibhav Kulkarni, Benoît Garbinato, Bertil Chapuis, and Arielle Moro
Subjects: Mobility model, business.industry, Computer science, Search engine indexing, 020206 networking & telecommunications, Time horizon, 02 engineering and technology, computer.software_genre, Machine learning, Object (computer science), Inverted index, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Trajectory, Global Positioning System, Artificial intelligence, Data mining, business, Mobile device, computer
Abstract: We put forth a system, to predict distant-future positions of multiple moving entities and index the forecasted trajectories, in order to answer predictive queries involving long time horizons. Today, the proliferation of mobile devices with GPS functionality and internet connectivity has led to a rapid development of location-based services, accounting for user mobility prediction as a key paradigm. Mobility prediction is already playing a major role in traffic management, urban planning and location-based advertising, which demand accurate and long time horizon forecasting of user movements. Existing prediction methodologies either use motion patterns or techniques based on frequently visited places for predicting the next move. However, when it comes to distant-future, human mobility is too complex to be represented by such statistical functions. Therefore, the existing techniques are not well suited to answer distant-future queries with a satisfactory level of accuracy. To tackle this problem, we introduce a novel spatial object, 'Representative Trajectory', which embodies the movements of users amongst their zones of interest. We propose means to empirically evaluate the quality of this object and dynamically adapt its extraction method based on user mobility behaviour. We rely on an inverted index to store the predicted trajectories that scales well with the number of moving entities. Our evaluation results show that the technique achieves more than 70% accurate predictions with the best extraction technique. This shows that longer query time horizons do not necessarily demand complex spatial indexing schemes, which have to be rebalanced as they grow and which is a constantly experienced problem while answering predictive queries.
Published: 2016
Full Text: View/download PDF

209 results on '"Inverted index"'
