Optimal transformers based image captioning using beam search.

Authors :: Shetty, Ashish
Kale, Yatharth
Patil, Yogeshwar
Patil, Rajeshwar
Sharma, Sanjeev
Source :: Multimedia Tools & Applications; May 2024, Vol. 83 Issue 16, p47963-47977, 15p
Publication Year :: 2024
Abstract: Image captioning is the process of generating textual descriptions of given images. It spans two major fields of deep learning: computer vision and natural language processing. This paper presents an image captioning model that uses a Convolutional Neural Network (CNN) for feature extraction and a transformer architecture to generate sequences from the resulting feature vectors. For feature extraction, this paper uses different CNN architectures, including Xception, InceptionV3, ResNet50V2, VGG19, DenseNet201, ResNet152V2, EfficientNetV2B3, and EfficientNetV2B0. The proposed method takes advantage of the transformer model for faster processing, and beam search is implemented to obtain the top N most probable sequences for each image. The architecture is trained on the Flickr8k dataset, and the model outperforms existing methods, achieving a BLEU_4 score of 0.2184 on Flickr8k. [ABSTRACT FROM AUTHOR]
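
The beam search step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `beam_search`, the callback `step_fn`, and the lookup table `TOY_MODEL` are hypothetical names introduced here, and the toy table stands in for a transformer decoder conditioned on CNN image features. The sketch only assumes that some model returns next-token probabilities for a partial caption.

```python
import math
from typing import Callable, Dict, List, Tuple

def beam_search(
    step_fn: Callable[[List[str]], Dict[str, float]],
    start_token: str = "<start>",
    end_token: str = "<end>",
    beam_width: int = 3,
    max_len: int = 20,
) -> List[Tuple[List[str], float]]:
    """Return up to beam_width (tokens, log_prob) pairs, best first."""
    beams = [([start_token], 0.0)]   # unfinished hypotheses
    finished = []                    # hypotheses that emitted end_token
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

# Hypothetical stand-in for a captioning decoder: next-token
# probabilities depend only on the previous token.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}

def toy_step(tokens: List[str]) -> Dict[str, float]:
    return TOY_MODEL[tokens[-1]]

captions = beam_search(toy_step, beam_width=2)
for tokens, log_prob in captions:
    print(" ".join(tokens), round(math.exp(log_prob), 2))
```

In this toy setting, greedy decoding would pick "a" first (probability 0.6) and end with a caption of total probability 0.30, whereas a beam of width 2 also keeps "the" alive and finds "the dog" with total probability 0.36. That is the motivation for keeping several candidate sequences per image rather than a single one.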