Enhancing image captioning performance based on EfficientNet-B0 model and transformer encoder-decoder.
- Source :
- AIP Conference Proceedings. 2024, Vol. 3919, Issue 1, p1-14. 14p.
- Publication Year :
- 2024
Abstract
- In recent years, advances in natural language processing and computer vision have converged to enable automatic image caption generation. Image captioning is the process of creating a textual description for an image. Captioning an image requires recognizing the significant objects in it, along with their attributes and their relationships, and it must produce sentences that are syntactically and semantically correct. Deep learning approaches can address the complexities and difficulties associated with image captioning. This paper describes a joint model that automatically captions images using EfficientNet-B0 and a transformer with multi-head attention. The model combines an EfficientNet encoder with a transformer decoder. The encoder uses EfficientNet-B0, a convolutional neural network, to produce a detailed representation of the input image by embedding it into a fixed-length vector. The decoder employs a transformer whose multi-head attention mechanism selectively concentrates on certain regions of the image to predict the sentence. The proposed model was trained on the large Flickr8k dataset and evaluated with BLEU N-gram (N=1,2,3,4), METEOR, and CIDEr scores, which assess the probability of the target description given the training images. Our studies show that the proposed model can produce captions for images automatically. [ABSTRACT FROM AUTHOR]
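The core mechanism the abstract names, multi-head attention letting the decoder focus on image regions, can be sketched in a few lines. This is not the authors' implementation: it is a minimal NumPy illustration in which decoder token embeddings act as queries over CNN-style image-region features, and all dimensions, names, and the random projection weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(queries, keys, values, num_heads, rng):
    """Scaled dot-product attention split across several heads.

    queries: (t_q, d_model); keys/values: (t_k, d_model).
    The projection matrices are random stand-ins for learned weights.
    """
    t_q, d_model = queries.shape
    t_k = keys.shape[0]
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # Hypothetical parameters, drawn randomly purely for illustration.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )

    # Project and split into heads: (num_heads, seq_len, d_head).
    q = (queries @ w_q).reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    k = (keys @ w_k).reshape(t_k, num_heads, d_head).transpose(1, 0, 2)
    v = (values @ w_v).reshape(t_k, num_heads, d_head).transpose(1, 0, 2)

    # Each head scores every query against every image region.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, t_q, t_k)
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    heads = weights @ v                                  # (heads, t_q, d_head)

    # Concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(t_q, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
image_regions = rng.standard_normal((49, 64))   # e.g. a 7x7 feature map, flattened
partial_caption = rng.standard_normal((5, 64))  # embeddings of 5 caption tokens
out, attn = multi_head_attention(partial_caption, image_regions,
                                 image_regions, num_heads=8, rng=rng)
```

Here `out` has shape `(5, 64)` (one context vector per caption token) and `attn` has shape `(8, 5, 49)`: for each head and each token, a distribution over the 49 image regions, which is what lets the decoder "look at" different parts of the image while generating each word.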
Details
- Language :
- English
- ISSN :
- 0094-243X
- Volume :
- 3919
- Issue :
- 1
- Database :
- Academic Search Index
- Journal :
- AIP Conference Proceedings
- Publication Type :
- Conference
- Accession number :
- 176251261
- Full Text :
- https://doi.org/10.1063/5.0184395