1. Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
- Authors
Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, and Tao Zhang
- Subjects
Computer Science - Multimedia; Computer Science - Artificial Intelligence; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly; the resulting joint audio-text feature is then late-fused with the other unimodal features. To address data insufficiency and class imbalance, we apply multiple rounds of multi-model voting for data mining. Moreover, to improve the quality of the audio features, we preprocess the audio with speech source separation. Our model ranks **2nd** in both MER2024-SEMI and MER2024-NOISE, validating the effectiveness of our method.
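The abstract describes the pipeline only at a high level. Below is a minimal PyTorch sketch of the general idea it outlines: a joint audio-text encoder trained early, whose pooled feature/logits are later combined with other unimodal branches at the score level. All module names, dimensions, layer counts, and the averaging scheme are illustrative assumptions, not the authors' actual architecture or configuration.

```python
import torch
import torch.nn as nn


class EarlyAudioTextFusion(nn.Module):
    """Sketch of early fusion: audio and text are encoded jointly, so the
    emotion loss shapes a shared audio-text representation from the start.
    Dimensions and layer sizes are placeholders, not the paper's settings."""

    def __init__(self, audio_dim=1024, text_dim=768, hidden_dim=512, num_classes=6):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # A small joint encoder over the concatenated audio and text tokens.
        self.joint_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, text_feat):
        # audio_feat: (B, T_a, audio_dim); text_feat: (B, T_t, text_dim)
        tokens = torch.cat(
            [self.audio_proj(audio_feat), self.text_proj(text_feat)], dim=1
        )
        joint = self.joint_encoder(tokens).mean(dim=1)  # pooled joint feature
        return joint, self.classifier(joint)


def late_fuse(joint_logits, unimodal_logits_list, weights=None):
    """Late fusion sketch: weighted average of the joint audio-text logits
    with logits from other unimodal branches (e.g., a video model)."""
    logits = [joint_logits] + list(unimodal_logits_list)
    if weights is None:
        weights = [1.0 / len(logits)] * len(logits)
    return sum(w * l for w, l in zip(weights, logits))


if __name__ == "__main__":
    model = EarlyAudioTextFusion()
    audio = torch.randn(2, 50, 1024)   # dummy audio features
    text = torch.randn(2, 20, 768)     # dummy text features
    _, joint_logits = model(audio, text)
    video_logits = torch.randn(2, 6)   # stand-in for a separate unimodal branch
    fused = late_fuse(joint_logits, [video_logits])
    print(fused.shape)  # torch.Size([2, 6])
```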
- Published
2024