
Multi-scale network with shared cross-attention for audio–visual correlation learning.

Authors :
Zhang, Jiwei
Yu, Yi
Tang, Suhua
Li, Wei
Wu, Jianming
Source :
Neural Computing & Applications. Sep2023, Vol. 35 Issue 27, p20173-20187. 15p.
Publication Year :
2023

Abstract

Cross-modal audio–visual correlation learning has been an interesting research topic, which aims to capture and understand semantic correspondences between audio and video. We face two challenges during audio–visual correlation learning: (i) audio and visual feature sequences belong to different feature spaces, and (ii) semantic mismatches between audio and visual sequences inevitably occur. To address these challenges, existing works mainly focus on how to efficiently extract discriminative features, while ignoring the abundant granular features of the audio and visual modalities. In this work, we introduce the multi-scale network with shared cross-attention (MSNSCA) for audio–visual correlation learning, a supervised representation learning framework that captures semantic audio–visual correspondences by integrating a multi-scale feature extraction module with shared cross-attention into an end-to-end trainable deep network. MSNSCA extracts more effective granular audio–visual features with strong audio–visual semantic matching capability. Experiments on various audio–visual learning tasks, including audio–visual matching and retrieval on benchmark datasets, demonstrate the effectiveness of the proposed MSNSCA model. [ABSTRACT FROM AUTHOR]
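The shared cross-attention idea described above, in which the same attention parameters attend in both directions (audio queries over visual context, and vice versa), can be sketched minimally as follows. This is an illustrative sketch only: all dimensions, variable names, and the single-head formulation are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64  # shared embedding dimension (illustrative choice)

# Shared projection matrices: the SAME W_q, W_k, W_v are used for both
# attention directions, which is the "shared" aspect sketched here.
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

def cross_attend(query_seq, context_seq):
    """Scaled dot-product attention from query_seq over context_seq,
    using the shared projection weights defined above."""
    Q = query_seq @ W_q
    K = context_seq @ W_k
    V = context_seq @ W_v
    scores = Q @ K.T / np.sqrt(d)       # (len_q, len_ctx) attention logits
    return softmax(scores, axis=-1) @ V  # context summarized per query step

# Toy feature sequences standing in for extracted modality features.
audio = rng.standard_normal((10, d))   # 10 audio frames
visual = rng.standard_normal((8, d))   # 8 video frames

audio_attended = cross_attend(audio, visual)   # audio attends to visual
visual_attended = cross_attend(visual, audio)  # visual attends to audio

print(audio_attended.shape, visual_attended.shape)  # (10, 64) (8, 64)
```

Each modality's sequence is enriched with context from the other, while weight sharing keeps both directions in a common attention space, one way to encourage the cross-modal semantic alignment the abstract targets.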

Details

Language :
English
ISSN :
0941-0643
Volume :
35
Issue :
27
Database :
Academic Search Index
Journal :
Neural Computing & Applications
Publication Type :
Academic Journal
Accession number :
170899802
Full Text :
https://doi.org/10.1007/s00521-023-08817-1