A Voice Cloning Method Based on the Improved HiFi-GAN Model.
- Source :
- Computational Intelligence & Neuroscience. 10/11/2022, p1-12. 12p.
- Publication Year :
- 2022
Abstract
- Voice cloning adapts a source Text to Speech (TTS) model to synthesize a personal voice from a few speech samples of the target speaker, providing a specific TTS service. Although the Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot account for the voice information of the entire utterance, which reduces the similarity of the cloned voice. As a vocoder, WaveNet sacrifices speech generation speed. To balance model parameters, inference speed, and voice quality, this paper proposes a voice cloning method based on an improved HiFi-GAN. (1) To improve the feature representation ability of the speaker encoder, the x-vector is used as the embedding vector that characterizes the target speaker. (2) To improve the performance of the HiFi-GAN vocoder, the input Mel spectrum is processed by a competitive multiscale convolution strategy. (3) One-dimensional depth-wise separable convolution replaces all standard one-dimensional convolutions, significantly reducing the model parameters and increasing the inference speed. The improved HiFi-GAN model reduces the number of vocoder model parameters by about 68.58% and increases inference speed by 11.84% on the GPU and 30.99% on the CPU. Voice quality also improves marginally: MOS increases by 0.13 and PESQ by 0.11. The improved HiFi-GAN model exhibits outstanding performance and remarkable compatibility in the voice cloning task. Combined with the x-vector embedding, the proposed model achieves the highest score across all models and test sets. [ABSTRACT FROM AUTHOR]
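To illustrate why point (3) of the abstract reduces the parameter count, the following minimal sketch compares the number of weights in a standard one-dimensional convolution with a depth-wise separable one (a per-channel depth-wise convolution followed by a 1x1 point-wise convolution). The function names and the channel and kernel sizes are assumptions chosen for illustration only; they are not taken from the paper, and the sketch does not attempt to reproduce the reported 68.58% reduction, which applies to the whole vocoder.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameters of a standard 1-D convolution: one kernel per (input, output) channel pair."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameters of a depth-wise separable 1-D convolution.

    Depth-wise step: one kernel per input channel.
    Point-wise step: a 1x1 convolution that mixes channels.
    """
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer shape, chosen only to make the comparison concrete.
c_in, c_out, kernel = 512, 256, 7

standard = standard_conv1d_params(c_in, c_out, kernel)
separable = separable_conv1d_params(c_in, c_out, kernel)
reduction = 1.0 - separable / standard

print(f"standard:  {standard} parameters")
print(f"separable: {separable} parameters")
print(f"reduction for this single layer: {reduction:.2%}")
```

With biases ignored, the ratio of separable to standard parameters is 1/c_out + 1/kernel, so the saving for a single layer grows with both the kernel size and the number of output channels. The saving for a full model depends on how much of it consists of such layers.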
- Subjects :
- *SPEECH
*SPEECH synthesis
*VOCODER
*HUMAN voice
*SPEED
Details
- Language :
- English
- ISSN :
- 16875265
- Database :
- Academic Search Index
- Journal :
- Computational Intelligence & Neuroscience
- Publication Type :
- Academic Journal
- Accession number :
- 159594582
- Full Text :
- https://doi.org/10.1155/2022/6707304