Incorporation of Iterative Self-supervised Pre-training in the Creation of the ASR System for the Tatar Language
- Source :
- Text, Speech, and Dialogue ISBN: 9783030835262, TSD
- Publication Year :
- 2021
- Publisher :
- Springer International Publishing, 2021.
Abstract
- In this paper, we study an iterative self-supervised pre-training procedure for a Tatar-language speech recognition system. The complete recipe starts from a base pre-trained model (the multilingual XLSR model or the LibriSpeech (English) wav2vec 2.0 Base model); the next step is "source" self-supervised pre-training on collected unlabeled Tatar data (mostly broadcast audio); the resulting model is then used for additional "target" self-supervised pre-training on the annotated corpus (the target domain, without using the labels); and the final step is to fine-tune the model on the annotated corpus with the labels. To conduct the experiments, we prepared a 328-h unlabeled and a 129-h annotated audio corpus. Experiments on three datasets (two proprietary ones and the publicly available Common Voice as the third) showed that the first, "source" pre-training step lets ASR models achieve on average 24.3% lower WER, and source plus target pre-training 33.3% lower WER, than a simply fine-tuned base model. The resulting accuracy is WER 5.37% on the Common Voice (read speech) test set, 4.65% on the private TatarCorpus (clean read speech), and 22.6% on the spontaneous-speech dataset collected from TV shows; all of these are the best published results on these datasets. Additionally, we show that a multilingual base model can be beneficial in the plain fine-tuning case (10.5% lower WER there), but applying the self-supervised pre-training steps eliminates this difference.
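The abstract reports both absolute WERs and relative reductions ("24.3% lower WER"). As a minimal sketch of how such figures are typically computed — assuming the standard definition of WER as word-level edit distance divided by reference length, and treating the example numbers below as purely illustrative, not taken from the paper:

```python
# Minimal WER via word-level Levenshtein distance (standard definition;
# the paper does not specify its scoring tool, so this is an assumption).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

def relative_wer_reduction(baseline: float, improved: float) -> float:
    """E.g. a return value of 0.243 corresponds to '24.3% lower WER'."""
    return (baseline - improved) / baseline
```

For instance, a hypothetical baseline WER of 10% improved to 7.5% would be a 25% relative reduction, which is how the 24.3% and 33.3% figures in the abstract should be read.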
Details
- ISBN :
- 978-3-030-83526-2
- ISBNs :
- 9783030835262
- Database :
- OpenAIRE
- Journal :
- Text, Speech, and Dialogue ISBN: 9783030835262, TSD
- Accession number :
- edsair.doi...........774eb995c5dee86f5ca78f9a0bc74c72
- Full Text :
- https://doi.org/10.1007/978-3-030-83527-9_41