Improving visual-semantic embeddings by learning semantically-enhanced hard negatives for cross-modal information retrieval.

Authors :
Gong, Yan
Cosma, Georgina
Source :
Pattern Recognition. May 2023, Vol. 137.
Publication Year :
2023

Abstract

• Existing Visual Semantic Embedding (VSE) networks are trained with a hard negatives loss function that learns an objective margin between the similarity of relevant and irrelevant image–description embedding pairs, ignoring the semantic differences between the irrelevant pairs.
• We propose a novel Semantically Enhanced Hard negatives Loss function (LSEH) for cross-modal information retrieval that considers the semantic differences between irrelevant training pairs.
• The proposed LSEH function dynamically adjusts the learning objectives of VSE networks, making their learning flexible and efficient.
• Experiments with various benchmark datasets and VSE networks revealed that the proposed LSEH function reduces their training epochs by approximately 50% and also improves their retrieval performance.

Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their descriptions and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained with a hard negatives loss function that learns an objective margin between the similarity of relevant and irrelevant image–description embedding pairs. However, the objective margin is set as a fixed hyperparameter and therefore ignores the semantic differences between irrelevant image–description pairs. To address the challenge of measuring the optimal similarities between image–description pairs before the VSE networks are trained, this paper presents a novel approach with two main parts: (1) finding the underlying semantics of image descriptions; and (2) a novel semantically-enhanced hard negatives loss function, in which the learning objective is dynamically determined based on the optimal similarity scores between irrelevant image–description pairs. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks applied to three benchmark datasets for cross-modal information retrieval tasks. The results revealed that the proposed methods achieved the best performance and can also be adopted by existing and future VSE networks. [ABSTRACT FROM AUTHOR]
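To make the idea concrete, the sketch below shows how a fixed-margin hard-negatives triplet loss could be turned into one with a per-pair, semantics-aware margin. It is only an illustration, not the paper's exact formulation: the function name, the linear margin adjustment (base_margin, sem_weight), and the assumption that description-to-description semantic similarities (cap_sem_sim) are precomputed offline (e.g. by latent semantic analysis over the raw captions) are all assumptions made for this example.

import torch
import torch.nn.functional as F

def semantically_enhanced_hard_negatives_loss(img_emb, cap_emb, cap_sem_sim,
                                              base_margin=0.2, sem_weight=0.1):
    """Hard-negatives triplet loss with a semantics-adjusted margin (illustrative sketch).

    img_emb, cap_emb : (B, D) image / description embeddings from a VSE network.
    cap_sem_sim      : (B, B) precomputed semantic similarity between descriptions,
                       e.g. cosine similarity of latent-semantic caption vectors.
    """
    # Cosine similarity between every image and every description in the batch.
    img_emb = F.normalize(img_emb, dim=1)
    cap_emb = F.normalize(cap_emb, dim=1)
    scores = img_emb @ cap_emb.t()                 # (B, B) image-description similarities
    positives = scores.diag().view(-1, 1)          # relevant pairs lie on the diagonal

    # Dynamic margin: an irrelevant pair whose descriptions are semantically close
    # gets a smaller margin than a pair whose descriptions are unrelated
    # (assumed linear adjustment around a fixed base margin).
    margins = base_margin + sem_weight * (1.0 - cap_sem_sim)

    diag_mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)

    # Margin violations for description retrieval (rows) and image retrieval (columns).
    cost_cap = (margins + scores - positives).clamp(min=0).masked_fill(diag_mask, 0)
    cost_img = (margins + scores - positives.t()).clamp(min=0).masked_fill(diag_mask, 0)

    # Keep only the hardest negative per image / per description,
    # as in standard hard-negatives VSE training.
    return cost_cap.max(dim=1)[0].sum() + cost_img.max(dim=0)[0].sum()

Because cap_sem_sim would be derived once from the descriptions before training, the only change to an existing VSE training loop under this sketch is replacing the fixed scalar margin with a matrix of per-pair margins.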

Details

Language :
English
ISSN :
0031-3203
Volume :
137
Database :
Academic Search Index
Journal :
Pattern Recognition
Publication Type :
Academic Journal
Accession number :
161766928
Full Text :
https://doi.org/10.1016/j.patcog.2022.109272