1. CochleaSpecNet: An Attention-Based Dual Branch Hybrid CNN-GRU Network for Speech Emotion Recognition Using Cochleagram and Spectrogram
- Author
Atkia Anika Namey, Khadija Akter, Md. Azad Hossain, and M. Ali Akber Dewan
- Subjects
Speech emotion, cochleagram, spectrogram, hybrid network, multi-head attention, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
As one of the main communication media, speech carries essential information about a person's emotional state. Accurate emotion recognition is crucial for enhancing human-machine interaction, which underscores the importance of a robust Speech Emotion Recognition (SER) system. An SER system classifies a speaker's emotional state from their utterances into categories such as sad, happy, neutral, angry, surprise, and calm. This research introduces a novel SER approach that uses cochleagram and spectrogram features to capture relevant speech patterns for the classifier network. The network is a hybrid model that combines Convolutional Neural Networks (CNN) for feature extraction with Gated Recurrent Units (GRU) to handle temporal dependencies. To further improve performance, a multi-head attention mechanism is incorporated after the GRU layer. Despite increasing interest in SER, there is a notable lack of studies using Bangla language datasets, revealing a significant gap in current research. To address this gap, the model is evaluated on the augmented BanglaSER (Bangla Speech Emotion Recognition) dataset, where it achieves an accuracy of 92.04% in categorizing five distinct emotions: angry, surprise, happy, neutral, and sad. To further evaluate the model, the English-language RAVDESS (Ryerson Audio-Visual Database of Emotional Speech) dataset is also used. On this dataset the model achieves 82.40% accuracy in classifying eight emotions, which include fear, disgust, and calm along with the five emotions of BanglaSER. Moreover, a comparative analysis of the proposed model against existing SER approaches is carried out to demonstrate its stability and robustness. Feeding two distinct features into the attention-guided hybrid neural network demonstrates the efficacy of the proposed SER system, offering a promising approach for precise and efficient emotion categorization from speech signals.
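The architecture outlined in the abstract (two input branches, CNN feature extraction, a GRU, then multi-head attention) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names `ConvBranch` and `DualBranchSER`, the layer sizes, the fusion by concatenation, the frequency and time averaging, and the input shapes are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn as nn


class ConvBranch(nn.Module):
    """CNN feature extractor for one time-frequency input of shape (B, 1, F, T).

    Hypothetical helper: the channel widths and depth are illustrative assumptions.
    """

    def __init__(self, channels=(16, 32)):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
                nn.MaxPool2d(2),  # halves both the frequency and time axes
            ]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)
        self.out_channels = in_ch

    def forward(self, x):
        x = self.net(x)           # (B, C, F', T')
        x = x.mean(dim=2)         # average over frequency -> (B, C, T')
        return x.transpose(1, 2)  # sequence layout for the GRU: (B, T', C)


class DualBranchSER(nn.Module):
    """Dual-branch CNN-GRU classifier with multi-head attention after the GRU.

    Assumption: both inputs share the same number of time frames so the
    branch outputs can be concatenated frame by frame.
    """

    def __init__(self, n_classes=5, hidden=64, heads=4):
        super().__init__()
        self.coch_branch = ConvBranch()
        self.spec_branch = ConvBranch()
        fused_dim = self.coch_branch.out_channels + self.spec_branch.out_channels
        self.gru = nn.GRU(fused_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, coch, spec):
        a = self.coch_branch(coch)                 # (B, T', 32)
        b = self.spec_branch(spec)                 # (B, T', 32)
        fused = torch.cat([a, b], dim=-1)          # (B, T', 64)
        seq, _ = self.gru(fused)                   # (B, T', hidden)
        attended, _ = self.attn(seq, seq, seq)     # self-attention over time
        return self.classifier(attended.mean(dim=1))  # (B, n_classes) logits


if __name__ == "__main__":
    model = DualBranchSER(n_classes=5)
    coch = torch.randn(2, 1, 64, 100)   # assumed cochleagram: 64 bands, 100 frames
    spec = torch.randn(2, 1, 128, 100)  # assumed spectrogram: 128 bins, 100 frames
    print(model(coch, spec).shape)      # torch.Size([2, 5])
```

Averaging over the frequency axis lets the two branches accept different numbers of frequency bins while still producing frame-aligned sequences. Setting `n_classes=8` would match the eight-emotion RAVDESS setting mentioned in the abstract.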
- Published
- 2024