Using ASR-Generated Text for Spoken Language Modeling

Authors :: Nicolas Hervé
Valentin Pelloin
Benoit Favre
Franck Dary
Antoine Laurent
Sylvain Meignier
Laurent Besacier
Institut National de l'Audiovisuel (INA)
Laboratoire d'Informatique de l'Université du Mans (LIUM)
Le Mans Université (UM)
Laboratoire d'Informatique et Systèmes (LIS)
Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS)
Naver Labs Europe [Meylan]
ANR-19-CE23-0004, AISSPER, Intelligence artificielle pour la compréhension du langage parlé contrôlée sémantiquement (2019)
Source :: Proceedings of BigScience Episode #5--Workshop on Challenges & Perspectives in Creating Large Language Models, May 2022, virtual+Dublin, Ireland. pp.17-25, ⟨10.18653/v1/2022.bigscience-1.2⟩
Publication Year :: 2022
Publisher :: HAL CCSD, 2022.
Abstract: This paper aims to improve spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR to 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. The new models (FlauBERT-Oral) are shared with the community and are evaluated not only in terms of word prediction accuracy but also on two downstream tasks: classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral is better than its initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be useful for improving spoken language modeling.

Subjects :: [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing
[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]
[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing
[INFO]Computer Science [cs]
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]

Details

Language :: English
Database :: OpenAIRE
Journal :: Proceedings of BigScience Episode #5--Workshop on Challenges & Perspectives in Creating Large Language Models, May 2022, virtual+Dublin, Ireland. pp.17-25, ⟨10.18653/v1/2022.bigscience-1.2⟩
Accession number :: edsair.doi.dedup.....1af6293cdfa7c325127c8c0e96bbc6e3
Full Text :: https://doi.org/10.18653/v1/2022.bigscience-1.2