Optimal transformers based image captioning using beam search.
- Source :
- Multimedia Tools & Applications; May 2024, Vol. 83 Issue 16, p47963-47977, 15p
- Publication Year :
- 2024
Abstract
- Image captioning is the process of generating textual descriptions of given images. It spans two major fields of deep learning: computer vision and natural language processing. This paper presents an image captioning model that uses a Convolutional Neural Network (CNN) for feature extraction and a transformer architecture to generate sequences from the resulting feature vectors. For feature extraction, the paper uses several CNN architectures: Xception, InceptionV3, ResNet50V2, VGG19, DenseNet201, ResNet152V2, EfficientNetV2B3, and EfficientNetV2B0. The proposed method takes advantage of the transformer model for faster processing, and beam search is implemented to get the top N most probable sequences for each image. The architecture is trained on the Flickr8k dataset, and the authors report that the model outperforms existing methods, achieving a BLEU_4 score of 0.2184 on Flickr8k. [ABSTRACT FROM AUTHOR]
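The abstract mentions beam search for retrieving the top N most probable sequences. The following is a minimal sketch of that general technique, not the authors' implementation. The names `beam_search`, `toy_next_log_probs`, and `TOY_MODEL`, along with the toy probabilities, are hypothetical and stand in for a trained transformer decoder conditioned on CNN image features.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, cumulative log-probability)


def beam_search(
    next_log_probs: Callable[[Sequence[str]], Dict[str, float]],
    start_token: str = "<start>",
    end_token: str = "<end>",
    beam_width: int = 3,
    max_len: int = 10,
) -> List[Hypothesis]:
    """Return up to `beam_width` highest-scoring sequences, best first."""
    beams: List[Hypothesis] = [([start_token], 0.0)]
    finished: List[Hypothesis] = []

    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for tok, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + log_p))

        # Keep only the `beam_width` best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    # Hypotheses that hit max_len without an end token are still returned.
    finished.extend(beams)
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


# Hypothetical stand-in for a caption decoder: the next-token distribution
# depends only on the previous token. A real model would use the image too.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}


def toy_next_log_probs(tokens: Sequence[str]) -> Dict[str, float]:
    return {tok: math.log(p) for tok, p in TOY_MODEL[tokens[-1]].items()}


results = beam_search(toy_next_log_probs, beam_width=2)
for tokens, score in results:
    print(" ".join(tokens), round(math.exp(score), 4))
```

In this toy setup, greedy decoding would commit to "a" at the first step because it is the likelier single token, whereas a beam of width 2 also keeps "the" alive and ends up ranking "the dog" highest overall. Scores are summed log-probabilities with no length normalization, which is a simplification assumed here.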
Details
- Language :
- English
- ISSN :
- 13807501
- Volume :
- 83
- Issue :
- 16
- Database :
- Complementary Index
- Journal :
- Multimedia Tools & Applications
- Publication Type :
- Academic Journal
- Accession number :
- 177079320
- Full Text :
- https://doi.org/10.1007/s11042-023-17359-6