Pronunciation Dictionary-Free Multilingual Speech Synthesis Using Learned Phonetic Representations

Authors :
Liu, Chang
Ling, Zhen-Hua
Chen, Ling-Hui
Source :
IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2023, Vol. 31 Issue: 1 p3706-3716, 11p
Publication Year :
2023

Abstract

This article presents a multilingual speech synthesis approach that leverages learned phonetic representations to eliminate the need for pronunciation dictionaries in target languages. The learned phonetic representations consist of unsupervised phonetic representations (UPRs) and supervised phonetic representations (SPRs). UPRs are extracted with a pre-trained wav2vec 2.0 model, while segment-level SPRs are derived from the speech data of target languages with a language-independent automatic speech recognition (LI-ASR) model trained with a connectionist temporal classification (CTC) loss. An acoustic model that uses UPRs and SPRs as intermediate representations is then designed, comprising a UPR predictor, an SPR predictor, and a representation-to-mel-spectrogram (RTM) converter. The two predictors generate UPRs and SPRs, respectively, from text. The RTM converter first combines UPRs with SPRs using a Transformer-based encoder and then feeds the merged representations into a decoder to produce mel-spectrograms. Because large training corpora are difficult to collect for every language in multilingual speech synthesis, the parameters of both predictors and of the RTM converter can be pre-trained on non-target languages to further improve model performance. Experimental results on six target languages demonstrate that our method outperforms approaches that directly predict mel-spectrograms from character or phoneme sequences, and that pre-training the acoustic model on a multilingual corpus further improves the performance of the synthetic speech.
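
The abstract does not say how segment boundaries are obtained from the CTC-trained LI-ASR model. As a minimal sketch of one common possibility, assuming greedy frame-level CTC outputs are collapsed into runs, the Python below groups consecutive identical labels and drops blanks. The function name `ctc_segments`, the blank index `0`, and the example label ids are all hypothetical and are not taken from the article.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical)


def ctc_segments(frame_ids, blank=BLANK):
    """Collapse frame-level CTC labels into (label, start, end) segments.

    `start` is inclusive and `end` is exclusive, both in frame units.
    Runs of the blank symbol are skipped, so they separate segments.
    """
    segments = []
    pos = 0
    for label, run in groupby(frame_ids):
        length = len(list(run))
        if label != blank:
            segments.append((label, pos, pos + length))
        pos += length
    return segments


# Toy frame-level argmax output: ten frames, three non-blank segments.
frames = [0, 3, 3, 0, 0, 5, 5, 5, 3, 0]
print(ctc_segments(frames))  # [(3, 1, 3), (5, 5, 8), (3, 8, 9)]
```

Each resulting segment could then index into frame-level features to pool one vector per segment; whether the article pools, and how, is not stated in the abstract.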

Details

Language :
English
ISSN :
2329-9290
Volume :
31
Issue :
1
Database :
Supplemental Index
Journal :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Type :
Periodical
Accession number :
ejs64350264
Full Text :
https://doi.org/10.1109/TASLP.2023.3313424