Auxiliary feature extractor and dual attention-based image captioning.
- Author
Zhao, Qian and Wu, Guichang
- Abstract
Currently, most image caption generation models use object features output by pre-trained detectors as input for multimodal learning over images and text. However, such downscaled object features cannot capture the global contextual relationships of an image, leading to information loss and limited semantic understanding. In addition, relying on a single object feature extraction framework limits the accuracy of feature extraction, resulting in inaccurate descriptive sentences. To address these problems, we propose the feature-augmented (FA) module, which forms the front-end of the encoder architecture. This module uses the multimodal pre-trained model CLIP (contrastive language-image pre-training) as an auxiliary feature extractor, exploiting its rich semantic features to supplement missing information such as scene context and object relations, so that the encoder better captures the information needed for the captioning task. In addition, we add channel attention to the self-attention in the encoder, compressing the spatial dimensions of the feature map so that the model can concentrate on more valuable information. We validate the effectiveness of the proposed method on datasets such as MS-COCO, analyze the contribution of each component, and demonstrate a marked performance improvement over other state-of-the-art models.
- Published
2024
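
The abstract describes two mechanisms: fusing CLIP's global image embedding with detector region features at the front of the encoder (the FA module), and pairing self-attention with a channel attention that pools over spatial positions before re-weighting feature channels. The PyTorch sketch below is a minimal, hypothetical illustration of those two ideas; the module names, dimensions, and the squeeze-and-excitation-style gate are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of CLIP-augmented features plus a dual-attention encoder
# layer, loosely following the abstract. All names and sizes are assumptions.
import torch
import torch.nn as nn

class FeatureAugment(nn.Module):
    """Project detector region features and a CLIP global feature into a
    shared space and prepend the global feature as a context token (FA idea)."""
    def __init__(self, obj_dim=2048, clip_dim=512, d_model=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, d_model)
        self.clip_proj = nn.Linear(clip_dim, d_model)

    def forward(self, obj_feats, clip_feat):
        # obj_feats: (B, N, obj_dim) region features from a pre-trained detector
        # clip_feat: (B, clip_dim) global image embedding from CLIP
        obj = self.obj_proj(obj_feats)                     # (B, N, d_model)
        ctx = self.clip_proj(clip_feat).unsqueeze(1)       # (B, 1, d_model)
        return torch.cat([ctx, obj], dim=1)                # global context + regions

class DualAttentionLayer(nn.Module):
    """Self-attention followed by a channel gate that squeezes the token
    (spatial) dimension and re-weights channels of the attended features."""
    def __init__(self, d_model=512, n_heads=8, reduction=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.channel_gate = nn.Sequential(
            nn.Linear(d_model, d_model // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(d_model // reduction, d_model),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)              # token-level self-attention
        gate = self.channel_gate(attn_out.mean(dim=1))     # squeeze token dimension
        return self.norm(x + attn_out * gate.unsqueeze(1)) # channel-wise re-weighting

# Usage with dummy tensors standing in for detector and CLIP outputs.
obj_feats = torch.randn(2, 36, 2048)   # e.g. 36 region features per image
clip_feat = torch.randn(2, 512)        # CLIP image embedding
x = FeatureAugment()(obj_feats, clip_feat)
print(DualAttentionLayer()(x).shape)   # torch.Size([2, 37, 512])
```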