1. Heterogeneous Graph Fusion Network for cross-modal image-text retrieval.
- Author
- Qin, Xueyang; Li, Lishuang; Pang, Guangyao; and Hao, Fei
- Subjects
- *PERFORMANCE standards, *VISUAL cryptography
- Abstract
Exploring the semantic correspondence of image-text pairs is significant as it bridges vision and language. Most prior works focus on global semantic alignment or local semantic alignment, by developing a fine neural network that facilitates the corresponding alignment but neglects the semantic information and relative position information between image regions, or text words, which will lead to a non-meaningful alignment. To this end, a Heterogeneous Graph Fusion Network (HGFN) is proposed to explore the correlation score of vision-language for improving the accuracy of cross-modal image-text retrieval in this paper. Specifically, we first construct an undirected fully-connected graph based on the semantic or relative position information for each image, as well as a textual graph with neighborhood information of the text. Then, we present a graph fusion module to integrate the features of heterogeneous graphs into a unified hybrid representation, in which the graph convolutional network is utilized to gather neighborhood information to alleviate potentially non-meaningful alignment. In addition, we also propose a novel "Dynamic top-K negative" strategy for the selection of negative examples in the training process. Experimental results demonstrate that HGFN achieves comparable performance with state-of-the-art approaches on the Flickr30K and MSCOCO datasets.
• Cross-modal image-text retrieval problems can be treated as graph-graph matching.
• A graph fusion module is designed to fuse the visual graph and textual graph.
• A novel strategy for selecting negative examples is proposed.
• Competitive with state-of-the-art performance on two standard benchmarks.
[ABSTRACT FROM AUTHOR]
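The abstract names a "Dynamic top-K negative" strategy but does not say how it is defined. As a rough illustration only, the sketch below shows one common way top-K negative selection can be combined with a hinge-based triplet loss: for each image query, the K most similar non-matching texts are treated as negatives. The loss form, the `margin` value, the image-to-text-only direction, and the `k_schedule` helper are all assumptions made for this example, not the authors' method.

```python
def top_k_negative_loss(sim, k, margin=0.2):
    """Hinge-based triplet loss over the k hardest negatives per image query.

    sim[i][j] is the similarity between image i and text j; matched pairs
    are assumed to sit on the diagonal. Only the image-to-text direction
    is covered here (an assumption made to keep the sketch short).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Non-matching texts for image i, hardest (most similar) first.
        negatives = sorted((sim[i][j] for j in range(n) if j != i), reverse=True)
        for neg in negatives[:k]:
            total += max(0.0, margin + neg - pos)
    return total / n


def k_schedule(epoch, k_start=4, k_min=1):
    """Hypothetical schedule: shrink k by one per epoch, never below k_min."""
    return max(k_min, k_start - epoch)


if __name__ == "__main__":
    sim = [
        [0.9, 0.8, 0.1],
        [0.2, 0.7, 0.6],
        [0.3, 0.4, 0.5],
    ]
    for epoch in range(3):
        k = min(k_schedule(epoch), len(sim) - 1)
        print(epoch, k, top_k_negative_loss(sim, k))
```

With a large K the loss averages over many negatives, which tends to be more stable early in training; with K = 1 it reduces to hardest-negative mining. Whether HGFN varies K in this way is not stated in the abstract.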
- Published
- 2024