Image Captioning using Pretrained Language Models and Image Segmentation

Authors :: Bianco, S
Ferrario, G
Napoletano, P
Publication Year :: 2022
Abstract: Large-scale pre-trained language models that have learned cross-modal representations from image-text pairs are becoming popular for vision-language tasks, because fine-tuning them on a specific task yields state-of-the-art results. Existing methods take features of image regions as input, but these regions are extracted by an object detection model that does not handle overlapping, noisy, and ambiguous regions, which inevitably results in less meaningful features. In this paper we propose a new way to extract region features based on image segmentation, with the goal of reducing overlap and noise. Our method is motivated by the observation that image segmentation can remove irrelevant pixels by using a binary mask to extract only the object of interest.
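
The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_masked_region` and the choices to zero out background pixels and crop to the mask's bounding box are assumptions made for illustration, using NumPy only.

```python
import numpy as np


def extract_masked_region(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only the pixels selected by a binary mask, cropped to its extent.

    Hypothetical helper (not from the paper). `image` has shape (H, W, C)
    and `mask` has shape (H, W), with nonzero entries marking the object.
    """
    mask = mask.astype(bool)
    if not mask.any():
        raise ValueError("mask selects no pixels")

    # Zero out every pixel outside the object; the image dtype is preserved.
    masked = image * mask[..., None]

    # Crop to the tightest box that contains the mask.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return masked[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

Unlike a plain bounding-box crop from an object detector, the result contains no pixels from neighbouring objects or background inside the box, so features computed on it describe the object alone.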
