
Hierarchical cross-modal contextual attention network for visual grounding.

Authors :
Xu, Xin
Lv, Gang
Sun, Yining
Hu, Yuxia
Nian, Fudong
Source :
Multimedia Systems. Aug2023, Vol. 29 Issue 4, p2073-2083. 11p.
Publication Year :
2023

Abstract

This paper addresses the task of visual grounding (VG), which aims to localize the image region referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous work has rarely explored hierarchical semantics or the cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN comprises a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. The approach not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN. [ABSTRACT FROM AUTHOR]
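The abstract describes two guided attention directions (visual-guided text and text-guided visual) followed by Transformer-based fusion. The sketch below is a minimal illustration of that general idea in PyTorch, based only on the abstract: the class name, feature dimensions, layer counts, and the fusion step are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class GuidedContextualAttention(nn.Module):
    """One guided-attention direction: self-attention over the target modality
    (intra-modality context) followed by cross-attention in which the target
    queries the guiding modality (inter-modality context)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # target: (B, N_target, dim) features of the modality being refined
        # guide:  (B, N_guide, dim)  features of the guiding modality
        out, _ = self.self_attn(target, target, target)
        target = self.norm1(target + out)
        out, _ = self.cross_attn(target, guide, guide)
        return self.norm2(target + out)


# Toy usage (assumed shapes): text-guided visual attention, visual-guided text
# attention, then a small Transformer encoder fusing the concatenated tokens.
vis = torch.randn(2, 196, 256)   # flattened visual patch features
txt = torch.randn(2, 20, 256)    # query-sentence token embeddings
vis_refined = GuidedContextualAttention()(vis, txt)
txt_refined = GuidedContextualAttention()(txt, vis)
fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
fused = fusion(torch.cat([vis_refined, txt_refined], dim=1))  # (2, 216, 256)
```

In this reading, the same block is applied in both directions and the refined token sequences are concatenated for joint multi-modal fusion; how the hierarchical semantics are built and how the grounded region is predicted are not specified in the abstract and are left out of the sketch.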

Subjects

Subjects :
*DEEP learning

Details

Language :
English
ISSN :
09424962
Volume :
29
Issue :
4
Database :
Academic Search Index
Journal :
Multimedia Systems
Publication Type :
Academic Journal
Accession number :
164947959
Full Text :
https://doi.org/10.1007/s00530-023-01097-8