Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning
- Authors
R. Vinotha, D. Hepsiba, L. D. Vijay Anand, J. Andrew, and R. Jennifer Eunice
- Subjects
SepFormer-SEGAN, HAN, Transformer, Conformer, DSR, Dysarthric speech enhancement, Medicine, Science
- Abstract
Dysarthria, a motor speech disorder that impacts articulation and speech clarity, presents significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes a groundbreaking approach to enhance the accuracy of Dysarthric Speech Recognition (DSR). A primary innovation lies in the integration of the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), an advanced generative adversarial network tailored for Dysarthric Speech Enhancement (DSE), as a front-end processing stage for DSR systems. The S-SEGAN combines SEGAN's adversarial learning with the SepFormer's speech-separation capabilities, yielding significant performance improvements. Furthermore, a multistage transfer-learning approach is employed to train and assess the DSR models at both the word and sentence levels: the models are first trained on a large speech corpus (LibriSpeech) and then fine-tuned on dysarthric speech data, both isolated and augmented. Evaluations demonstrate significant DSR accuracy improvements with DSE integration. The Dysarthric Speech (DS) baseline models (without DSE), Transformer and Conformer, achieved Word Recognition Accuracy (WRA) percentages of 68.60% and 69.87%, respectively. Introducing a Hierarchical Attention Network (HAN) into the Transformer and Conformer architectures improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. The Transformer model with DSE + DSR for isolated words achieves a WRA of 73.40%, while the Conformer model reaches 74.33%. Notably, the T-HAN and C-HAN models with DSE + DSR demonstrate even more substantial gains, with WRAs of 75.73% and 76.87%, respectively. Augmenting the word data further boosts performance, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. Remarkably, the T-HAN and C-HAN models with DSE + DSR and augmented words attain WRAs of 82.13% and 84.07%, respectively, with C-HAN performing best among all proposed models.
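Below is a minimal sketch of the multistage transfer-learning recipe the abstract describes, assuming a generic PyTorch CTC-based ASR model: stage 1 pretrains on a large clean corpus such as LibriSpeech, and stage 2 fine-tunes on dysarthric speech passed through a speech-enhancement front end standing in for the S-SEGAN. The `TinyASR` model, the `train_stage` helper, the data loaders, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the multistage transfer-learning pipeline:
# stage 1 pretrains an ASR model on LibriSpeech-style data, stage 2
# fine-tunes on dysarthric speech that is first passed through a
# speech-enhancement front end (an analogue of the paper's S-SEGAN).
import torch
import torch.nn as nn

class TinyASR(nn.Module):
    """Minimal Transformer encoder + CTC head; a small stand-in for the
    Transformer/Conformer DSR models evaluated in the paper."""
    def __init__(self, n_feats=80, n_tokens=32, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_tokens)

    def forward(self, feats):                    # feats: (B, T, n_feats)
        return self.head(self.encoder(self.proj(feats)))

def train_stage(model, loader, epochs, lr, enhancer=None):
    """One transfer-learning stage: optionally enhance inputs, then train.
    `loader` is assumed to yield (feats, feat_lens, tokens, tok_lens)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    model.train()
    for _ in range(epochs):
        for feats, feat_lens, tokens, tok_lens in loader:
            if enhancer is not None:             # DSE front end (S-SEGAN analogue)
                with torch.no_grad():
                    feats = enhancer(feats)
            # CTC expects (T, B, C) log-probabilities.
            log_probs = model(feats).log_softmax(-1).transpose(0, 1)
            loss = ctc(log_probs, tokens, feat_lens, tok_lens)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: large clean corpus; stage 2: enhanced dysarthric speech.
# `librispeech_loader`, `dysarthric_loader`, and `segan_enhancer` are
# placeholders for the paper's data pipeline and trained S-SEGAN.
model = TinyASR()
# train_stage(model, librispeech_loader, epochs=10, lr=1e-4)
# train_stage(model, dysarthric_loader, epochs=5, lr=1e-5, enhancer=segan_enhancer)
```

The same two-stage call pattern covers both word-level and sentence-level DSR in the abstract's setup; only the fine-tuning data (isolated versus augmented dysarthric speech) changes between experiments.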
- Published
2024