Multi-scale network with shared cross-attention for audio–visual correlation learning.
- Source :
- Neural Computing & Applications. Sep 2023, Vol. 35, Issue 27, p20173-20187. 15p.
- Publication Year :
- 2023
Abstract
- Cross-modal audio–visual correlation learning aims to capture and understand semantic correspondences between audio and video. It poses two challenges: (i) audio and visual feature sequences belong to different feature spaces, and (ii) semantic mismatch between audio and visual sequences is inevitable. To address these challenges, existing works focus mainly on efficiently extracting discriminative features while ignoring the abundant granular features of the audio and visual modalities. In this work, we introduce the multi-scale network with shared cross-attention (MSNSCA) for audio–visual correlation learning, a supervised representation learning framework that captures semantic audio–visual correspondences by integrating a multi-scale feature extraction module and a shared cross-attention module into an end-to-end trainable deep network. MSNSCA extracts more effective multi-granularity audio–visual features with strong audio–visual semantic matching capability. Experiments on various audio–visual learning tasks, including audio–visual matching and retrieval on benchmark datasets, demonstrate the effectiveness of the proposed MSNSCA model. [ABSTRACT FROM AUTHOR]
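The cross-attention the abstract refers to is, in the general form used across the literature, scaled dot-product attention where queries come from one modality and keys/values from the other. The sketch below is illustrative only, not the authors' MSNSCA implementation; the variable names and toy dimensions are assumptions for demonstration:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query vector (e.g. an
    audio feature) attends over key/value vectors from the other
    modality (e.g. visual features). All vectors share dimension d;
    keys and values have equal length."""
    d = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weight-averaged combination of the values.
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended

# Toy example: two audio queries attend over three visual tokens.
audio_q  = [[1.0, 0.0], [0.0, 1.0]]
visual_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
visual_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attended = cross_attention(audio_q, visual_k, visual_v)
```

In a "shared" variant, the same attention parameters are reused for both directions (audio attending to video and video attending to audio), which is presumably what the title's shared cross-attention denotes; the plain function above has no learned parameters and so is trivially shared across directions.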
- Subjects :
- *SUPERVISED learning
*FEATURE extraction
*AUDIOVISUAL materials
Details
- Language :
- English
- ISSN :
- 0941-0643
- Volume :
- 35
- Issue :
- 27
- Database :
- Academic Search Index
- Journal :
- Neural Computing & Applications
- Publication Type :
- Academic Journal
- Accession number :
- 170899802
- Full Text :
- https://doi.org/10.1007/s00521-023-08817-1