1. IR From Bag-of-words to BERT and Beyond through Practical Experiments
- Author
- Sean MacAvaney, Nicola Tonellotto, and Craig Macdonald
- Subjects
Information retrieval, Bag-of-words model, Scripting language, Computer science, Search engine indexing, Learning to rank, Language model, Inverted index, Session (web analytics), Ranking (information retrieval)
- Abstract
The task of ad hoc search is undergoing a renaissance, sparked by advances in natural language processing. In particular, pre-trained contextualized language models (such as BERT and T5) have consistently been shown to be a highly effective foundation upon which to build ranking models. These models are equipped with a far deeper understanding of language than bag-of-words (BoW) models. Applying these techniques to new tasks can be tricky, however, as they require knowledge of deep learning frameworks, and significant scripting and data munging. In this full-day tutorial, we build up from foundational retrieval principles to the latest neural ranking techniques. We first provide foundational background on classical bag-of-words methods. We then show how feature-based Learning to Rank methods can be used to re-rank these results. Finally, we cover contemporary approaches, such as BERT, doc2query, and dense retrieval. Throughout the process, we demonstrate how these techniques can be easily applied to new search tasks using a declarative style of experimentation, exemplified by the PyTerrier and OpenNIR search toolkits. The tutorial is interactive in nature. It is broken into four sessions, each of which mixes explanatory presentation with hands-on activities using prepared Jupyter notebooks running on the Google Colab platform. These activities give participants experience applying the techniques covered in the tutorial on the TREC COVID benchmark test collection. In the first session, we cover foundational retrieval concepts, including inverted indexing, retrieval, and scoring. We also demonstrate how evaluation can be conducted in a declarative fashion within PyTerrier, encapsulating ideas such as significance testing and multiple-testing correction, which are promoted as IR best practices.
In the second session, we build upon the core retrieval concepts to demonstrate how to re-write queries (e.g., using RM3) and re-rank documents (e.g., using learning-to-rank). In the third session, we introduce contextualized language models, such as BERT, and show how they can be utilized for document re-ranking (e.g., using Vanilla/monoBERT and EPIC). Finally, in session four, we move beyond re-ranking and cover approaches that modify documents (e.g., DeepCT) as well as efforts to replace the traditional inverted index with an embedding-based index (e.g., ANCE, ColBERT, and ColBERT-PRF). By the end of the tutorial, participants will have experience conducting IR experiments from classical bag-of-words models to contemporary BERT models and beyond.
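The foundational concepts named in the abstract (inverted indexing, retrieval, and scoring) can be illustrated with a minimal sketch in plain Python. This is not the tutorial's own code and does not use PyTerrier or OpenNIR; the toy documents, the whitespace tokenizer, the helper names (`build_index`, `bm25_search`), and the BM25 parameter values are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace (no stemming or stopwords).
    return text.lower().split()

def build_index(docs):
    """Build an inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_len

def bm25_search(query, index, doc_len, k1=1.2, b=0.75):
    """Score documents with BM25; return (doc_id, score) pairs, best first."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # query term does not occur in the collection
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length-normalized term frequency saturation.
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy collection, loosely themed on TREC COVID.
docs = {
    "d1": "coronavirus transmission in enclosed spaces",
    "d2": "vaccine trials for coronavirus",
    "d3": "inverted index compression techniques",
}
index, doc_len = build_index(docs)
results = bm25_search("coronavirus vaccine", index, doc_len)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Only documents that share at least one term with the query receive a score, which is the property that lets an inverted index skip most of the collection at query time.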
- Published
- 2021