Back to Search
Start Over
Probabilistic static pruning of inverted files
- Source :
- ACM Transactions on Information Systems. 28:1-33
- Publication Year :
- 2010
- Publisher :
- Association for Computing Machinery (ACM), 2010.
-
Abstract
- Information retrieval (IR) systems typically compress their indexes in order to increase their efficiency. Static pruning is a form of lossy data compression: it removes from the index, data that is estimated to be the least important to retrieval performance, according to some criterion. Generally, pruning criteria are derived from term weighting functions, which assign weights to terms according to their contribution to a document's contents. Usually, document-term occurrences that are assigned a low weight are ruled out from the index. The main assumption is that those entries contribute little to the document content. We present a novel pruning technique that is based on a probabilistic model of IR. We employ the Probability Ranking Principle as a decision criterion over which posting list entries are to be pruned. The proposed approach requires the estimation of three probabilities, combining them in such a way that we gather all the necessary information to apply the aforementioned criterion. We evaluate our proposed pruning technique on five TREC collections and various retrieval tasks, and show that in almost every situation it outperforms the state of the art in index pruning. The main contribution of this work is proposing a pruning technique that stems directly from the same source as probabilistic retrieval models, and hence is independent of the final model used for retrieval.
- Subjects :
- Computer science
Divergence-from-randomness model
Probabilistic logic
Statistical model
Lossy compression
computer.software_genre
General Business, Management and Accounting
Computer Science Applications
Ranking (information retrieval)
Weighting
Principal variation search
Data mining
Pruning (decision trees)
computer
Information Systems
Subjects
Details
- ISSN :
- 15582868 and 10468188
- Volume :
- 28
- Database :
- OpenAIRE
- Journal :
- ACM Transactions on Information Systems
- Accession number :
- edsair.doi...........503df6193a45b54d9b2a319cd8627ba1
- Full Text :
- https://doi.org/10.1145/1658377.1658378