MAVA: Multi-Level Adaptive Visual-Textual Alignment by Cross-Media Bi-Attention Mechanism.

Authors :
Peng, Yuxin
Qi, Jinwei
Zhuo, Yunkan
Source :
IEEE Transactions on Image Processing. 2020, Vol. 29, p2728-2741. 14p.
Publication Year :
2020

Abstract

Rapidly developing information technology has led to fast growth in visual and textual content, which makes it highly challenging to correlate images and sentences and to perform cross-media retrieval between them. Existing methods mainly explore cross-media correlation either from global-level instances, i.e., whole images and sentences, or from local-level fine-grained patches, i.e., discriminative image regions and key words. Both ignore the complementary information carried by the relations between local-level fine-grained patches. Relation understanding is highly important for learning cross-media correlation: people focus not only on the alignment between discriminative image regions and key words, but also on their relations within the visual and textual context. Therefore, in this paper, we propose the Multi-level Adaptive Visual-textual Alignment (MAVA) approach with the following contributions. First, we propose a cross-media multi-pathway fine-grained network that extracts not only local fine-grained patches (discriminative image regions and key words), but also visual relations between image regions and textual relations from the context of sentences, which contain complementary information for exploiting fine-grained characteristics within different media types. Second, we propose a visual-textual bi-attention mechanism that distinguishes fine-grained information of different saliency at both the local and relation levels, providing more discriminative hints for correlation learning. Third, we propose cross-media multi-level adaptive alignment to explore global, local and relation alignments. An adaptive alignment strategy is further proposed to enhance the matched pairs of different media types and to adaptively discard misalignments, so as to learn more precise cross-media correlation. Extensive experiments on image-sentence matching are conducted on 2 widely-used cross-media datasets, namely Flickr-30K and MS-COCO, and comparisons with 10 state-of-the-art methods verify the effectiveness of the proposed MAVA approach. [ABSTRACT FROM AUTHOR]
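
The local-level idea in the abstract (weight matched region-word pairs by saliency and discard misalignments) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the cosine similarity measure, the fixed `threshold`, and the use of softmax over the kept similarities as a stand-in for attention are all assumptions made here for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aligned_similarity(regions, words, threshold=0.5):
    """Hypothetical local alignment score between an image and a sentence.

    regions: list of image-region feature vectors (assumed precomputed)
    words:   list of word feature vectors in the same embedding space
    threshold: assumed cutoff below which a word-region match is
               treated as a misalignment and discarded
    """
    # For each word, keep the similarity of its best-matching region.
    best = [max(cosine(w, r) for r in regions) for w in words]
    # Discard misaligned words (best match below the cutoff).
    kept = [s for s in best if s >= threshold]
    if not kept:
        return 0.0
    # Weight the surviving matches so stronger matches count more.
    weights = softmax(kept)
    return sum(w * s for w, s in zip(weights, kept))
```

For example, with `regions = [[1, 0], [0, 1]]` and `words = [[1, 0], [-1, 0]]`, the second word has no region above the cutoff and is dropped, so the score depends only on the first word's match.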

Details

Language :
English
ISSN :
10577149
Volume :
29
Database :
Academic Search Index
Journal :
IEEE Transactions on Image Processing
Publication Type :
Academic Journal
Accession number :
170078149
Full Text :
https://doi.org/10.1109/TIP.2019.2952085