Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion.

Authors :
Kang, Xiao
Huang, Hao
Hu, Ying
Huang, Zhihua
Source :
Digital Signal Processing. Sep2021, Vol. 116, pN.PAG-N.PAG. 1p.
Publication Year :
2021

Abstract

• CTC loss is used to guide the VQ-VAE to learn pure content representations.
• Experiments show generated speech with better naturalness and similarity.
• Thorough analysis provides useful insight into representation disentangling.

The vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that a VQ-VAE can disentangle the content and speaker representations of speech by using a content encoder and a speaker encoder, which suits the VC task of making a source speaker's speech sound like that of a target speaker without changing the linguistic content. However, the converted speech is unsatisfactory because, without linguistic supervision for the content encoder, it is difficult to disentangle pure content representations from the acoustic features. To address this issue within the VQ-VAE framework, a connectionist temporal classification (CTC) loss is proposed that guides the content encoder to learn pure content representations by means of an auxiliary network. Because the CTC loss is not affected by the sequence length of the content encoder's output, adding linguistic supervision to the content encoder becomes much easier. The resulting non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both the objective and subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech compared with the traditional VQ-VAE method. [ABSTRACT FROM AUTHOR]
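
The abstract's point that the CTC loss does not depend on the length of the encoder output can be illustrated with a minimal sketch of the standard CTC forward recursion. This is not the authors' implementation: the function name `ctc_neg_log_likelihood`, the three-symbol label set, and the uniform per-frame probabilities are all assumptions made for illustration, and the auxiliary network that would produce the per-frame probabilities is not shown.

```python
import math

NEG_INF = float("-inf")

def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC negative log-likelihood of `target` given per-frame log-probabilities.

    log_probs: list of T frames, each a list of log-probabilities over labels.
    target:    list of label indices without blanks.
    (Hypothetical helper for illustration only.)
    """
    # Interleave blanks: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s] = log-probability of all partial alignments ending in ext[s]
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new_alpha = [NEG_INF] * S
        for s in range(S):
            candidates = [alpha[s]]                 # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])     # skip over a blank
            new_alpha[s] = _logsumexp(candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or the trailing blank.
    final = [alpha[S - 1]] + ([alpha[S - 2]] if S > 1 else [])
    return -_logsumexp(final)

# Assumed toy setup: labels {0: blank, 1, 2}, uniform frame probabilities.
uniform = [math.log(1.0 / 3.0)] * 3
loss_2_frames = ctc_neg_log_likelihood([uniform] * 2, target=[1])
loss_3_frames = ctc_neg_log_likelihood([uniform] * 3, target=[1])
```

The same one-label target is scored against a 2-frame and a 3-frame output without any explicit alignment or length matching, which is the property the abstract relies on when attaching linguistic supervision to the content encoder.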

Subjects

Subjects :
*LINGUISTIC change
*CLASSIFICATION

Details

Language :
English
ISSN :
10512004
Volume :
116
Database :
Academic Search Index
Journal :
Digital Signal Processing
Publication Type :
Periodical
Accession number :
151560583
Full Text :
https://doi.org/10.1016/j.dsp.2021.103110