1. Semi-supervised cross-modal retrieval with graph-based semantic alignment network.
- Author
-
Zhang, Lei; Chen, Leiting; Ou, Weihua; and Zhou, Chuan
- Subjects
- *MODAL logic, *ARTIFICIAL neural networks, *SUPERVISED learning, *GENERATIVE adversarial networks, *MAP design
- Abstract
Semi-supervised cross-modal retrieval is an eclectic paradigm that learns common representations by exploiting the underlying semantic information in both labeled and unlabeled data. Most existing methods ignore the rich semantic information of text data and are unable to fully utilize the text data in common representation learning. Moreover, they consider only the correlation between data with the same semantic label and ignore the correlation between data with different semantic labels. In this paper, we propose a novel semi-supervised cross-modal retrieval method, called Graph-based Semantic Alignment Network (GSAN), which learns common representations by aligning the features of different modalities with the semantic embeddings of text data. First, we design a Deep Supervised Semantic Encoding (DSSE) module to train a semantic projector and a label predictor, which extract semantic embeddings and predicted labels from unlabeled text data. Then, a GAN-based Bidirectional Fusion (GBF) module is designed to learn the mapping networks of the two modalities (image and text). To make the mapping networks generate semantically discriminative and modality-invariant representations, we use the semantic information extracted by DSSE to construct a Graph-based Triplet Constraint (GTC), which pulls the feature embeddings of semantically matched image-text pairs closer together and pushes mismatched ones apart. By fully exploiting semantic information, our approach matches the performance of state-of-the-art methods while using less labeled data. In addition, since only the mapping networks trained in the GBF module are used to generate common representations at the inference stage, our approach is efficient and time-saving in real-world applications. Extensive experiments on four widely used datasets show the effectiveness of GSAN.
• We focus on exploring the semantic information carried by the corresponding text data.
• The semantics of text data serve as the supervision signal in unsupervised learning.
• The graph-based triplet constraint forces the common representations to cluster well.
• The triplet loss is used to constrain the adversarial learning.
• Experiments in both supervised and semi-supervised settings show the effectiveness.
[ABSTRACT FROM AUTHOR]
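The triplet constraint mentioned in the abstract can be illustrated with a generic margin-based triplet loss. The sketch below is not the authors' GTC formulation, which this record does not specify. It assumes squared Euclidean distance, a fixed margin, and precomputed embeddings, and the function name `triplet_margin_loss` and the toy vectors are hypothetical.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss (an assumption, not the paper's exact GTC).

    Each argument is an array of shape (n, d): anchor embeddings (e.g. images),
    semantically matched embeddings (e.g. their texts), and mismatched ones.
    """
    # Squared Euclidean distance between each anchor and its matched sample.
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    # Squared Euclidean distance between each anchor and its mismatched sample.
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    # Penalize triplets where the mismatched sample is not at least
    # `margin` farther from the anchor than the matched sample.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Toy embeddings in a shared 2-D space (illustrative values only).
image_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
matched_text = np.array([[0.9, 0.1], [0.1, 0.9]])
mismatched_text = np.array([[0.0, 1.0], [1.0, 0.0]])

# Well-aligned pairs incur no penalty; swapping the roles incurs a large one.
loss_aligned = triplet_margin_loss(image_emb, matched_text, mismatched_text)
loss_swapped = triplet_margin_loss(image_emb, mismatched_text, matched_text)
```

In this toy case the aligned arrangement yields zero loss because every mismatched text is already more than the margin farther away than the matched one, while the swapped arrangement is penalized heavily.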
- Published
- 2022