Cleaning statistical language models
- Source :
- 3d. International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia
- Publication Year :
- 2010
- Publisher :
- HAL CCSD, 2010.
Abstract
- In this paper, we describe how to decide whether an n-gram is actually impossible in a language. We apply decision rules to a POS-tagged corpus. These rules are based on statistical and phonological criteria. In terms of statistical language modeling, deciding that an n-gram is impossible amounts to assigning it zero probability. We redistribute the released probability mass over the possible n-grams. To do this, we define a new formulation of P(w|h). We apply the principle of impossible events to bigrams, then use the list of impossible bigrams to build a list of impossible trigrams. The new trigram model outperforms the baseline model by 5.53% in terms of perplexity.
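The abstract does not give the paper's new formulation of P(w|h), so the following is only a minimal sketch of the general idea, under two assumptions that are not taken from the source: the released mass is redistributed by plain renormalization over the remaining words, and a trigram is treated as impossible whenever either of its two bigrams is. The function names `clean_distribution` and `is_impossible_trigram` are hypothetical.

```python
def clean_distribution(base_probs, history, impossible_bigrams):
    """Zero out impossible continuations of `history` and renormalize the rest.

    base_probs: dict mapping each word w to a baseline P(w | history).
    impossible_bigrams: set of (history, word) pairs judged impossible.
    Assumption: simple renormalization, not the paper's actual formula.
    """
    kept = {w: p for w, p in base_probs.items()
            if (history, w) not in impossible_bigrams}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("every continuation was marked impossible")
    # Possible words share the full probability mass; impossible ones get 0.
    return {w: (kept[w] / total if w in kept else 0.0) for w in base_probs}


def is_impossible_trigram(w1, w2, w3, impossible_bigrams):
    """Assumed rule: a trigram is impossible if it contains an impossible bigram."""
    return (w1, w2) in impossible_bigrams or (w2, w3) in impossible_bigrams


# Toy example with made-up probabilities and a made-up impossible bigram.
base = {"cat": 0.5, "the": 0.3, "runs": 0.2}
impossible = {("the", "the")}
cleaned = clean_distribution(base, "the", impossible)
```

In this toy case the mass originally given to the impossible bigram is shared among the remaining words in proportion to their baseline probabilities, so the cleaned distribution still sums to one.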
- Subjects :
- [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Details
- Language :
- English
- Database :
- OpenAIRE
- Journal :
- 3d. International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia
- Accession number :
- edsair.dedup.wf.001..9eb693abde929d5c489583358cf2ae5b