Incorporation of Iterative Self-supervised Pre-training in the Creation of the ASR System for the Tatar Language

Authors :
Ilnur Muhametzyanov
Aidar Khusainov
Dzhavdet Suleymanov
Source :
Text, Speech, and Dialogue (TSD), ISBN: 9783030835262
Publication Year :
2021
Publisher :
Springer International Publishing, 2021.

Abstract

In this paper, we study an iterative self-supervised pre-training procedure for a Tatar-language speech recognition system. The complete recipe starts from a base pre-trained model (the multilingual XLSR model or the LibriSpeech (English) Wav2Vec 2.0 Base model); the next step is "source" self-supervised pre-training on collected unlabeled Tatar data (mostly broadcast audio); the resulting model is then used for an additional "target" self-supervised pre-training step on the annotated corpus (target domain, without using the labels); and the final step is to fine-tune the model on the annotated corpus with labels. To conduct the experiments, we prepared a 328-hour unlabeled corpus and a 129-hour annotated audio corpus. Experiments on three datasets (two proprietary ones and the publicly available Common Voice) show that the first, "source" pre-training step lets ASR models achieve on average 24.3% lower WER, and the combination of source and target pre-training 33.3% lower WER, than a simply fine-tuned base model. The resulting accuracy is WER 5.37% on the Common Voice (read speech) test set, 4.65% on the private TatarCorpus (clean read speech), and 22.6% on the spontaneous-speech dataset collected from TV shows; all of these are the best published results on these datasets. Additionally, we show that using a multilingual base model can be beneficial when fine-tuning directly (10.5% lower WER in this case), but applying the self-supervised pre-training steps eliminates this difference.
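The four-stage recipe described above can be sketched as a training plan. This is a minimal illustrative outline only: the function name, dictionary layout, and field names are hypothetical (the paper does not publish code here); the stage order and data descriptions follow the abstract.

```python
# Hypothetical sketch of the iterative pre-training recipe from the abstract.
# Stage order and corpus sizes come from the paper; everything else
# (function name, dict keys) is an assumption for illustration.

def build_training_plan(base_model: str) -> list[dict]:
    """Return the ordered stages of the iterative pre-training recipe."""
    return [
        {"stage": "base",                # start from a pre-trained checkpoint
         "model": base_model,            # e.g. "XLSR" or "Wav2Vec 2.0 Base"
         "data": None,
         "objective": None},
        {"stage": "source-pretrain",     # self-supervised, no labels used
         "data": "328 h unlabeled Tatar audio (mostly broadcast)",
         "objective": "self-supervised"},
        {"stage": "target-pretrain",     # target-domain audio, labels ignored
         "data": "129 h annotated corpus (audio only)",
         "objective": "self-supervised"},
        {"stage": "fine-tune",           # supervised fine-tuning with labels
         "data": "129 h annotated corpus (with transcripts)",
         "objective": "supervised"},
    ]

for step in build_training_plan("XLSR"):
    print(step["stage"])
```

Each stage initializes from the checkpoint produced by the previous one, which is what makes the pre-training "iterative" rather than a single pass over pooled data.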

Details

ISBN :
978-3-030-83526-2
ISBNs :
9783030835262
Database :
OpenAIRE
Journal :
Text, Speech, and Dialogue (TSD), ISBN: 9783030835262
Accession number :
edsair.doi...........774eb995c5dee86f5ca78f9a0bc74c72
Full Text :
https://doi.org/10.1007/978-3-030-83527-9_41