1. Semantically enhanced term frequency based on word embeddings for Arabic information retrieval
- Author
Abdelkader El Mahdaouy, Said Ouatik El Alaoui, and Eric Gaussier. Affiliations: Analyse de données, Modélisation et Apprentissage automatique [Grenoble] (AMA), Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Centre National de la Recherche Scientifique (CNRS), Université Grenoble Alpes [2016-2019] (UGA [2016-2019]); laboratoire informatique et modélisation (LIM), Faculté des sciences Dhar El Mahras.
- Subjects
Arabic Information Retrieval, Word embedding, Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Term Mismatch, Context (language use), Semantics, 030507 speech-language pathology & audiology, 03 medical and health sciences, Relevance (information retrieval), Distributed Representation of word Vectors, Semantic matching, Context model, Information retrieval, Semantically Enhanced Term Frequency, 05 social sciences, Term (time), IR models, [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], Artificial intelligence, 0509 other social sciences, 050904 information & library sciences, 0305 other medical science, Natural language processing, Word (computer architecture)
- Abstract
Traditional Information Retrieval (IR) models are based on the bag-of-words paradigm, in which relevance scores are computed from exact keyword matches. Although these models already perform well, it has been shown that most cases of dissatisfaction with relevance are due to term mismatch between queries and documents. In this paper, we introduce a novel method for computing term frequency based on semantic similarities, using distributed representations of words in a vector space (word embeddings). Our main goal is to allow distinct but semantically related terms to match each other and contribute to the relevance scores. Arabic documents are thus retrieved beyond the bag-of-words paradigm, based on semantic similarities between word vectors. Results on standard Arabic TREC data sets show significant improvement over the baseline bag-of-words models.
- Published
- 2016
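- Illustrative sketch (not from the paper)
The abstract describes letting distinct but semantically related terms contribute to term frequency, without stating the formula. The following is only a minimal sketch of that general idea under stated assumptions: cosine similarity as the similarity measure, a cutoff `threshold`, and the function names `cosine` and `semantic_tf` are all hypothetical choices made here for illustration, and the toy English vectors stand in for real Arabic word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_tf(query_term, doc_terms, vectors, threshold=0.7):
    """Hypothetical 'soft' term frequency of query_term in a document.

    Exact matches count 1.0, as in plain bag-of-words term frequency.
    A different term adds its cosine similarity to the query term, but only
    when that similarity reaches `threshold` (an assumed cutoff, not a value
    taken from the paper). Terms without a vector can only match exactly.
    """
    total = 0.0
    for term in doc_terms:
        if term == query_term:
            total += 1.0
        elif query_term in vectors and term in vectors:
            similarity = cosine(vectors[query_term], vectors[term])
            if similarity >= threshold:
                total += similarity
    return total


if __name__ == "__main__":
    # Toy 2-dimensional vectors, invented for this example only.
    toy_vectors = {
        "car": [1.0, 0.0],
        "automobile": [0.8, 0.6],
        "banana": [0.0, 1.0],
    }
    document = ["car", "automobile", "banana"]
    print(semantic_tf("car", document, toy_vectors))
```

In this sketch the resulting value could replace the raw term count inside a standard bag-of-words scoring function; how the paper itself integrates the enhanced frequency into its IR models is not specified in the abstract.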