Contemporaneous text as side-information in statistical language modeling

Authors :
Woosung Kim
Sanjeev Khudanpur
Source :
Computer Speech & Language. 18:143-162
Publication Year :
2004
Publisher :
Elsevier BV, 2004.

Abstract

We propose new methods to exploit contemporaneous text, such as on-line news articles, to improve language models for automatic speech recognition and other natural language processing applications. In particular, we investigate the use of text from a resource-rich language to sharpen language models for processing a news story or article in a language with scarce linguistic resources. We demonstrate that even with fairly crude cross-language information retrieval and simple machine translation, one can construct story-specific Chinese language models which exploit cues from a side-corpus of English newswire to significantly improve the performance of language models estimated from a static Chinese corpus. Our investigations cover cases when the amount of available Chinese text is small, and a case when a large Chinese text corpus is available. We examine the effectiveness of our techniques both when the side-corpus contains English documents that are near-translations of the Chinese documents being processed, and when the English side-corpus is merely from contemporaneous and independent news sources. We present experimental results for automatic transcription of speech from the Mandarin Broadcast News corpus.

Details

ISSN :
0885-2308
Volume :
18
Database :
OpenAIRE
Journal :
Computer Speech & Language
Accession number :
edsair.doi...........651901b88b0956be9ea1c4415995786a
Full Text :
https://doi.org/10.1016/j.csl.2003.09.001