1. KMSAV: Korean multi-speaker spontaneous audiovisual dataset
- Author
Kiyoung Park, Changhan Oh, and Sunghee Dong
- Subjects
audiovisual data, dataset, multimodal data, multi-speaker spontaneous data, speech recognition, Telecommunication, TK5101-6720, Electronics, TK7800-8360
- Abstract
Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data, supplemented with an additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.
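For reference, the character error rate (CER) reported above is the character-level edit distance between a hypothesis and its reference transcript divided by the number of reference characters. The sketch below is a minimal, generic illustration of that metric; the function name and example strings are hypothetical and are not part of the KMSAV corpus or its accompanying framework.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level Levenshtein distance / number of reference characters."""
    ref, hyp = reference, hypothesis
    # dp[j] holds the edit distance between ref[:i] and hyp[:j] (single-row DP).
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag = dp[0]
        dp[0] = i
        for j in range(1, len(hyp) + 1):
            prev_row = dp[j]
            dp[j] = min(
                dp[j] + 1,                               # deletion
                dp[j - 1] + 1,                           # insertion
                prev_diag + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
            )
            prev_diag = prev_row
    return dp[-1] / max(len(ref), 1)

# Example: two substitutions in an 11-character reference -> CER of about 0.18 (18%).
print(char_error_rate("hello world", "hellx warld"))
```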
- Published
- 2024