
Enhancing image captioning performance based on EfficientNet-B0 model and transformer encoder-decoder.

Authors :
Joshi, Abhisht
Alkhayyat, Ahmed
Gunwant, Harsh
Tripathi, Abhay
Sharma, Moolchand
Source :
AIP Conference Proceedings. 2024, Vol. 3919 Issue 1, p1-14. 14p.
Publication Year :
2024

Abstract

In recent years, advances in natural language processing and computer vision have converged to enable automatic image caption generation. Image captioning is the process of creating a textual description for an image. Captioning an image requires recognizing the significant objects, their properties, and their relationships within the image; additionally, it must produce sentences that are syntactically and semantically accurate. Deep learning approaches can address the complexities and difficulties associated with image captioning. This paper describes a joint model capable of automatically captioning images using EfficientNet-B0 and a transformer with multi-head attention. The model aggregates EfficientNet and a transformer into a single encoder-decoder architecture. The encoder uses EfficientNet-B0, a convolutional neural network, to generate a detailed representation of the input image by embedding it into a fixed-length vector. The decoder employs a transformer with a multi-head attention mechanism that selectively concentrates attention on particular regions of the image while predicting the sentence. The proposed model was trained on the Flickr8k dataset to maximize the probability of the target description phrase given the training images, and was evaluated using BLEU N-gram (N = 1, 2, 3, 4), METEOR, and CIDEr scores. Our studies show that the proposed model can produce captions for images automatically. [ABSTRACT FROM AUTHOR]
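As a rough illustration of the encoder-decoder pairing the abstract describes, the sketch below wires a frozen EfficientNet-B0 feature extractor to a minimal transformer-style decoder in TensorFlow/Keras. This is not the authors' implementation: the hyperparameters (embed_dim, num_heads, vocab_size, seq_len), the single decoder block, and the frozen backbone are all illustrative assumptions.

    # Minimal sketch (assumed TensorFlow/Keras; not the paper's code) of an
    # EfficientNet-B0 encoder feeding a transformer-style decoder with
    # multi-head attention. All dimensions below are illustrative guesses.
    from tensorflow import keras
    from tensorflow.keras import layers

    embed_dim, num_heads, vocab_size, seq_len = 256, 4, 10000, 25

    # Encoder: EfficientNet-B0 maps a 224x224 image to a 7x7 grid of 1280-d
    # features, flattened to 49 region vectors and projected to a
    # fixed-length embedding per region.
    cnn = keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3))
    cnn.trainable = False  # use the CNN as a frozen feature extractor
    image_in = keras.Input(shape=(224, 224, 3))
    grid = cnn(image_in)                        # (batch, 7, 7, 1280)
    grid = layers.Reshape((-1, 1280))(grid)     # (batch, 49, 1280)
    enc_out = layers.Dense(embed_dim, activation="relu")(grid)

    # Decoder: causal self-attention over the partial caption, then
    # cross-attention so each output position attends to image regions.
    tokens_in = keras.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(tokens_in)
    self_attn = layers.MultiHeadAttention(num_heads, key_dim=embed_dim)
    cross_attn = layers.MultiHeadAttention(num_heads, key_dim=embed_dim)
    x = layers.LayerNormalization()(x + self_attn(x, x, use_causal_mask=True))
    x = layers.LayerNormalization()(x + cross_attn(x, enc_out))
    x = layers.Dense(embed_dim * 2, activation="relu")(x)
    logits = layers.Dense(vocab_size)(x)        # next-token scores per position

    model = keras.Model([image_in, tokens_in], logits)
    model.compile(
        optimizer="adam",
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

During training, tokens_in would hold the caption shifted right and the loss target the caption shifted left; at inference, captions are decoded token by token. The paper's actual block count, dimensions, and positional-encoding scheme may differ from this sketch.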

Details

Language :
English
ISSN :
0094-243X
Volume :
3919
Issue :
1
Database :
Academic Search Index
Journal :
AIP Conference Proceedings
Publication Type :
Conference
Accession number :
176251261
Full Text :
https://doi.org/10.1063/5.0184395