Back to Search Start Over

Probabilistic static pruning of inverted files

Authors :
Álvaro Barreiro
Roi Blanco
Source :
ACM Transactions on Information Systems. 28:1-33
Publication Year :
2010
Publisher :
Association for Computing Machinery (ACM), 2010.

Abstract

Information retrieval (IR) systems typically compress their indexes in order to increase their efficiency. Static pruning is a form of lossy data compression: it removes from the index, data that is estimated to be the least important to retrieval performance, according to some criterion. Generally, pruning criteria are derived from term weighting functions, which assign weights to terms according to their contribution to a document's contents. Usually, document-term occurrences that are assigned a low weight are ruled out from the index. The main assumption is that those entries contribute little to the document content. We present a novel pruning technique that is based on a probabilistic model of IR. We employ the Probability Ranking Principle as a decision criterion over which posting list entries are to be pruned. The proposed approach requires the estimation of three probabilities, combining them in such a way that we gather all the necessary information to apply the aforementioned criterion. We evaluate our proposed pruning technique on five TREC collections and various retrieval tasks, and show that in almost every situation it outperforms the state of the art in index pruning. The main contribution of this work is proposing a pruning technique that stems directly from the same source as probabilistic retrieval models, and hence is independent of the final model used for retrieval.

Details

ISSN :
15582868 and 10468188
Volume :
28
Database :
OpenAIRE
Journal :
ACM Transactions on Information Systems
Accession number :
edsair.doi...........503df6193a45b54d9b2a319cd8627ba1
Full Text :
https://doi.org/10.1145/1658377.1658378