
An Attentive Fourier-Augmented Image-Captioning Transformer

Authors :
Raymond Ian Osolo
Zhan Yang
Jun Long
Source :
Applied Sciences, Vol 11, Iss 18, p 8354 (2021)
Publication Year :
2021
Publisher :
MDPI AG, 2021.

Abstract

Many vision–language models that output natural language, such as image-captioning models, use image features merely to ground the captions; most of the model's performance can be attributed to the language model, which does the heavy lifting. This phenomenon has persisted even with the emergence of transformer-based architectures as the preferred base of recent state-of-the-art vision–language models. In this paper, we make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. This is achieved by performing a 1D Fourier transformation on the image features, first along the hidden dimension and then along the sequence dimension. These extracted features, alongside the region-proposal image features, yield a richer image representation that can then be queried to produce the associated captions, which showcase a deeper understanding of image–object–location relationships than similar models. Extensive experiments on the MSCOCO benchmark dataset demonstrate CIDEr-D, BLEU-1, and BLEU-4 scores of 130, 80.5, and 39, respectively.
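The two-stage Fourier transformation described in the abstract can be sketched as follows. This is a minimal illustration only, assuming an FNet-style mixing in which the real part of the transform is kept; the function name, feature shapes, and the choice to retain the real part are assumptions, not details confirmed by the abstract.

```python
import numpy as np

def fourier_augment(features: np.ndarray) -> np.ndarray:
    """Sketch of Fourier-based feature mixing: apply a 1D FFT along the
    hidden dimension, then along the sequence dimension, and keep the
    real part (an FNet-style convention, assumed here).

    features: array of shape (seq_len, hidden_dim), e.g. region-proposal
    image features from an object detector.
    """
    # 1D FFT over the hidden dimension (last axis) ...
    mixed = np.fft.fft(features, axis=-1)
    # ... then a 1D FFT over the sequence dimension (first axis)
    mixed = np.fft.fft(mixed, axis=0)
    # Keep only the real component so downstream layers stay real-valued
    return mixed.real

# Example: 36 region proposals, each with a 512-dimensional feature vector
feats = np.random.randn(36, 512)
out = fourier_augment(feats)  # same shape as the input, (36, 512)
```

In the paper's architecture, such Fourier-mixed features would presumably be combined with the original region-proposal features before being attended over by the caption decoder; the exact fusion mechanism is not specified in the abstract.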

Details

Language :
English
ISSN :
2076-3417
Volume :
11
Issue :
18
Database :
Directory of Open Access Journals
Journal :
Applied Sciences
Publication Type :
Academic Journal
Accession number :
edsdoj.b503f698a4a419ba043567d3f6f3faa
Document Type :
article
Full Text :
https://doi.org/10.3390/app11188354