Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language

Authors :
Vasile Păiș
Verginica Barbu Mititelu
Elena Irimia
Radu Ion
Dan Tufiș
Source :
Applied Sciences, Vol 14, Iss 19, p 9043 (2024)
Publication Year :
2024
Publisher :
MDPI AG, 2024.

Abstract

This paper introduces USPDATRO, a Romanian speech dataset constructed from open data that focuses on under-represented voice types (children, young and old people, and female voices). The paper covers the methodology behind the dataset construction, specific details regarding the dataset, and an evaluation of existing Romanian Automatic Speech Recognition (ASR) systems with different architectures. Results indicate that more under-represented speech content is needed in the training of ASR systems. Our approach can be extended to other low-resourced languages, as long as open data are available.

Details

Language :
English
ISSN :
2076-3417
Volume :
14
Issue :
19
Database :
Directory of Open Access Journals
Journal :
Applied Sciences
Publication Type :
Academic Journal
Accession number :
edsdoj.539571b54c6a47e392c2dac310b500ad
Document Type :
article
Full Text :
https://doi.org/10.3390/app14199043