Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Authors :: Ellinas, Nikolaos
Christidou, Myrsini
Vioni, Alexandra
Sung, June Sig
Chalamandaris, Aimilios
Tsiakoulis, Pirros
Mastorocostas, Paris
Source :: Speech Communication. 146:22-31
Publication Year :: 2023
Publisher :: Elsevier BV, 2023.
Abstract: In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
Comment: Final published version available at: Speech Communication. arXiv admin note: substantial text overlap with arXiv:2111.10168
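Illustrative sketch (not part of the catalog record or the paper): the abstract describes discretizing phoneme-level F0 and duration into labels through F0 normalization, speaker-independent clustering, and balanced clustering for duration, but it does not name the algorithms used. The Python below is one possible reading under stated assumptions: per-speaker z-scoring stands in for F0 normalization, a plain 1-D k-means stands in for the clustering, and equal-frequency (quantile) binning stands in for balanced duration clustering. The function names `normalize_f0`, `kmeans_1d`, and `balanced_bins` are hypothetical.

```python
import numpy as np

def normalize_f0(f0, speaker_ids):
    """Z-score phoneme-level F0 within each speaker (assumed normalization)."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        # Guard against a speaker whose F0 values are all identical.
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out

def kmeans_1d(values, k, iters=50):
    """Lloyd's k-means in one dimension (assumed clustering algorithm).

    Returns integer labels ordered from lowest to highest centroid,
    together with the sorted centroids.
    """
    values = np.asarray(values, dtype=float)
    # Deterministic initialisation: centroids at evenly spaced quantiles.
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = values[labels == j].mean()
    centroids = np.sort(centroids)
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids

def balanced_bins(values, k):
    """Equal-frequency binning, one way to obtain balanced duration labels."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.arange(1, k) / k)
    return np.searchsorted(edges, values, side="right")

# Toy data: two speakers with different pitch ranges (values are made up).
f0_hz = [100, 110, 120, 130, 200, 220, 240, 260]
speakers = ["a"] * 4 + ["b"] * 4
durations_ms = [40, 55, 60, 75, 80, 95, 120, 150]

# Pooling the normalized values across speakers gives shared F0 labels.
f0_labels, f0_centroids = kmeans_1d(normalize_f0(f0_hz, speakers), k=2)
duration_labels = balanced_bins(durations_ms, k=4)
```

Because F0 is normalized per speaker before the pooled clustering, the same label refers to the same relative pitch level for every speaker, which is the kind of speaker-independent labelling the abstract refers to.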

Subjects :: FOS: Computer and information sciences
Computer Science - Machine Learning
Sound (cs.SD)
Linguistics and Language
Computer Science - Computation and Language
Communication
Computer Science - Sound
Language and Linguistics
Machine Learning (cs.LG)
Computer Science Applications
Audio and Speech Processing (eess.AS)
Modeling and Simulation
FOS: Electrical engineering, electronic engineering, information engineering
Computer Vision and Pattern Recognition
Computation and Language (cs.CL)
Software
Electrical Engineering and Systems Science - Audio and Speech Processing

Details

ISSN :: 01676393
Volume :: 146
Database :: OpenAIRE
Journal :: Speech Communication
Accession number :: edsair.doi.dedup.....ff5e7f452b9d8d1e543e3a8f34e2e1fb
