Incorporation of Iterative Self-supervised Pre-training in the Creation of the ASR System for the Tatar Language
- Source :
- Text, Speech, and Dialogue ISBN: 9783030835262, TSD
- Publication Year :
- 2021
- Publisher :
- Springer International Publishing, 2021.
Abstract
- In this paper, we study an iterative self-supervised pre-training procedure for a Tatar-language speech recognition system. The complete recipe starts from a base pre-trained model (the multilingual XLSR model or the LibriSpeech (English) wav2vec 2.0 Base model); the next step is "source" self-supervised pre-training on collected unlabeled Tatar data (mostly broadcast audio); the resulting model is then used for additional "target" self-supervised pre-training on the annotated corpus (the target domain, without using the labels); and the final step is to fine-tune the model on the annotated corpus with the labels. To conduct the experiments, we prepared a 328-h unlabeled and a 129-h annotated audio corpus. Experiments on three datasets (two proprietary ones and the publicly available Common Voice as the third) showed that the first, "source" pre-training step lets ASR models achieve on average 24.3% lower WER, and source plus target pre-training 33.3% lower WER, than a simply fine-tuned base model. The resulting accuracy is WER 5.37% on the Common Voice (read speech) test set, 4.65% on the private TatarCorpus (clean read speech), and 22.6% on the spontaneous-speech dataset collected from TV shows; all of these are the best published results on these datasets. Additionally, we show that a multilingual base model can be beneficial in the plain fine-tuning case (10.5% lower WER there), but applying the self-supervised pre-training steps eliminates this difference.
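The abstract reports both absolute WERs and relative reductions ("24.3% lower WER"). As a minimal sketch of how such figures are typically computed — assuming the standard definition of WER as word-level edit distance divided by reference length, and treating the example numbers below as purely illustrative, not taken from the paper:

```python
# Minimal WER via word-level Levenshtein distance (standard definition;
# the paper does not specify its scoring tool, so this is an assumption).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

def relative_wer_reduction(baseline: float, improved: float) -> float:
    """E.g. a return value of 0.243 corresponds to '24.3% lower WER'."""
    return (baseline - improved) / baseline
```

For instance, a hypothetical baseline WER of 10% improved to 7.5% would be a 25% relative reduction, which is how the 24.3% and 33.3% figures in the abstract should be read.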
Details
- ISBN :
- 978-3-030-83526-2
- ISBNs :
- 9783030835262
- Database :
- OpenAIRE
- Journal :
- Text, Speech, and Dialogue ISBN: 9783030835262, TSD
- Accession number :
- edsair.doi...........774eb995c5dee86f5ca78f9a0bc74c72
- Full Text :
- https://doi.org/10.1007/978-3-030-83527-9_41