Conditional selection with CNN augmented transformer for multimodal affective analysis.
- Author
Wang, Jianwen; Wang, Shiping; Xiao, Shunxin; Lin, Renjie; Dong, Mianxiong; Guo, Wenzhong
- Subjects
TRANSFORMER models, AFFECT (Psychology), SIGNAL detection, MULTISENSOR data fusion
- Abstract
The attention mechanism has been a successful approach to multimodal affective analysis in recent years. Despite these advances, two significant challenges remain in fusing language with its nonverbal context. The first is generating sparse attention coefficients for the acoustic and visual modalities, which helps locate critical emotional semantics. The second is fusing complementary cross‐modal representations to construct optimal combinations of salient features from multiple modalities. A Conditional Transformer Fusion Network is proposed to handle these problems. First, the authors equip the transformer module with CNN layers to enhance the detection of subtle signal patterns in nonverbal sequences. Second, sentiment words are utilised as context conditions to guide the computation of cross‐modal attention. As a result, the located nonverbal features are not only salient but also directly complementary to the sentiment words. Experimental results show that the authors' method achieves state‐of‐the‐art performance on several multimodal affective analysis datasets. [ABSTRACT FROM AUTHOR]
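The abstract gives no implementation details, so the following is only a rough, hypothetical illustration of the two ideas it names: a convolution over nonverbal frames before attention, and sentiment words conditioning cross-modal attention. It is not the authors' Conditional Transformer Fusion Network. The function names (`conv1d_same`, `conditional_cross_attention`), the fixed smoothing kernel, the additive conditioning (mean sentiment-word embedding added to every query), the absence of learned projections, and the shared feature width for both modalities are all assumptions made for this sketch.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def conv1d_same(seq: Matrix, kernel: List[float]) -> Matrix:
    """Apply one shared 1D kernel along time to every feature channel.

    Zero padding keeps the output length equal to the input length (odd kernel
    assumed). Like deep-learning "conv" layers, this is a cross-correlation.
    """
    assert len(kernel) % 2 == 1, "odd kernel length keeps the sequence length"
    half = len(kernel) // 2
    steps, width = len(seq), len(seq[0])
    out = []
    for t in range(steps):
        row = []
        for d in range(width):
            acc = 0.0
            for j, w in enumerate(kernel):
                src = t + j - half
                if 0 <= src < steps:  # positions outside the sequence count as zero
                    acc += w * seq[src][d]
            row.append(acc)
        out.append(row)
    return out


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_cross_attention(
    text: Matrix, sentiment_mask: List[bool], nonverbal: Matrix
) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention from language tokens to nonverbal frames.

    Hypothetical conditioning: the mean embedding of the tokens flagged as
    sentiment words is added to every query. Keys and values are the nonverbal
    features themselves (no learned projections), and both modalities are
    assumed to share the same feature width.
    """
    width = len(text[0])
    sentiment_rows = [row for row, flag in zip(text, sentiment_mask) if flag]
    if sentiment_rows:
        cond = [sum(col) / len(sentiment_rows) for col in zip(*sentiment_rows)]
    else:
        cond = [0.0] * width  # no sentiment words: plain cross-modal attention
    scale = 1.0 / math.sqrt(width)
    contexts, weights = [], []
    for token in text:
        query = [a + b for a, b in zip(token, cond)]
        scores = [scale * sum(q * k for q, k in zip(query, frame)) for frame in nonverbal]
        w = softmax(scores)
        # Context vector: attention-weighted sum of the nonverbal frames
        contexts.append([sum(wi * frame[d] for wi, frame in zip(w, nonverbal)) for d in range(width)])
        weights.append(w)
    return contexts, weights


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three toy token embeddings
    is_sentiment = [False, True, False]          # token 1 is flagged as a sentiment word
    frames = [[0.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 0.0]]  # four toy nonverbal frames
    feats = conv1d_same(frames, [0.25, 0.5, 0.25])
    _, attn = conditional_cross_attention(text, is_sentiment, feats)
    for row in attn:
        print([round(w, 3) for w in row])
```

With these toy values, every token's largest attention weight falls on the second frame, whose smoothed feature points in the same direction as the sentiment-word embedding; with `is_sentiment` set to all `False`, the first token instead attends most to the last frame. In a trained model the kernel, the query/key/value projections, and the form of conditioning would be learned rather than fixed as here.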
- Published
- 2024