
Cleaning statistical language models

Authors :
Jourani, Reda
Langlois, David
Smaïli, Kamel
Daoudi, Khalid
Aboutajdine, Driss
Geometry and Statistics in acquisition data (GeoStat)
Inria Bordeaux - Sud-Ouest
Institut National de Recherche en Informatique et en Automatique (Inria)
Analysis, perception and recognition of speech (PAROLE)
INRIA Lorraine
Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)
Laboratoire de Recherche en Informatique et Télécommunications [Rabat] (GSCM-LRIT)
Université Mohammed V de Rabat [Agdal] (UM5)
University of Mohammed V
Source :
3rd International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia
Publication Year :
2010
Publisher :
HAL CCSD, 2010.

Abstract

In this paper, we describe how to decide that an n-gram is actually impossible in a language. We use decision rules on a POS-tagged corpus. These rules are based on statistics and phonological criteria. In terms of statistical language modeling, deciding that an n-gram is impossible amounts to assigning it a zero probability. We redistribute the released probability mass over the possible n-grams. To do this, we define a new formulation of P(w|h). We apply the principle of impossible events to bigrams. Then we use the list of impossible bigrams to build a list of impossible trigrams. The new trigram model improves on the baseline model by 5.53% in terms of perplexity.
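
The abstract does not give the paper's reformulation of P(w|h) or its rule for deriving impossible trigrams, so the following is only a minimal sketch under two stated assumptions: the released mass is redistributed by proportional renormalization over the remaining words, and a trigram is treated as impossible when either of its two bigrams is on the impossible list. The function names `clean_bigram_model` and `is_impossible_trigram` are hypothetical.

```python
def clean_bigram_model(probs, impossible):
    """Zero out impossible bigrams and renormalize each history's distribution.

    probs: dict mapping a history word h to a dict {w: P(w|h)}.
    impossible: set of (h, w) pairs judged impossible.
    ASSUMPTION: proportional renormalization; the paper's own
    formulation of P(w|h) is not stated in this record.
    """
    cleaned = {}
    for h, dist in probs.items():
        # Keep only the successors that are still considered possible.
        kept = {w: p for w, p in dist.items() if (h, w) not in impossible}
        mass = sum(kept.values())
        if mass == 0:
            raise ValueError(f"no possible successor left for history {h!r}")
        # Impossible successors get probability 0; the others share the
        # released mass in proportion to their original probabilities.
        cleaned[h] = {w: (kept[w] / mass if w in kept else 0.0) for w in dist}
    return cleaned


def is_impossible_trigram(trigram, impossible_bigrams):
    """ASSUMPTION: a trigram is impossible if it contains an impossible bigram."""
    w1, w2, w3 = trigram
    return (w1, w2) in impossible_bigrams or (w2, w3) in impossible_bigrams
```

For instance, with P(cat|the) = 0.5, P(of|the) = 0.25 and P(the|the) = 0.25, marking ("the", "the") as impossible leaves 0.75 of the mass, which this sketch rescales so that the two remaining successors sum to 1 again.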

Details

Language :
English
Database :
OpenAIRE
Journal :
3rd International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia
Accession number :
edsair.dedup.wf.001..9eb693abde929d5c489583358cf2ae5b