
COME: Clip-OCR and Master ObjEct for text image captioning.

Authors :
Lv, Gang
Sun, Yining
Nian, Fudong
Zhu, Maofei
Tang, Wenliang
Hu, Zhenzhen
Source :
Image & Vision Computing. Aug 2023, Vol. 136.
Publication Year :
2023

Abstract

Text image captioning aims to understand the scene text in images in order to generate image captions. The key challenge of this task is to understand the OCR tokens of scene text accurately and comprehensively. Because scene text carries both visual and textual features, accurately expressing the multimodal semantic features of OCR tokens is difficult. Additionally, since scene text cannot exist independently of specific objects and is always associated with its surroundings, establishing a scene graph centered on OCR tokens is an important approach to understanding their relationships with other objects in the image. In this paper, we propose a novel model named Clip-OCR and Master ObjEct (dubbed COME) for text image captioning. First, we introduce a CLIP-OCR module to enhance the multimodal representation of OCR tokens: we separate the OCR representation into visual and textual items and draw matching pairs together through contrastive learning. With the assistance of the CLIP-OCR module, we achieve correlation alignment between the two modalities. Next, we propose the concept of a master object for each OCR token and use it to purify the OCR-oriented scene graph. The master object is defined as the object to which the OCR text is attached; it bridges the semantic relationship between the OCR tokens and the image. We treat the master object as a proxy that connects OCR tokens to other regions in the image. By identifying the master object for each OCR token, we build a purified scene graph around it and then enrich the visual embedding with a Graph Convolution Network (GCN). Furthermore, we cluster the OCR tokens and append the hierarchical information to the input embedding to provide a complete representation. Experiments on the TextCaps validation and test sets demonstrate the effectiveness of the proposed framework.

• A new framework, COME, for text image captioning that understands the OCR tokens of scene text.
• A CLIP-OCR module to enhance the multimodal representation of OCR tokens.
• The concept of a master object to purify the relationships of OCR tokens.
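To make the two core ideas of the abstract concrete, here is a minimal sketch (not the authors' code) of CLIP-style contrastive alignment between the visual and textual embeddings of OCR tokens. The module names, feature dimensions, and the symmetric InfoNCE loss are illustrative assumptions; the paper only states that the two modalities are separated and aligned by contrastive learning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipOcrAlign(nn.Module):
    """Hypothetical CLIP-OCR-style module: project the two OCR modalities
    into a shared space and align matched pairs contrastively."""
    def __init__(self, vis_dim=2048, txt_dim=300, embed_dim=256, temperature=0.07):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)  # visual appearance of the token
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # recognized text (e.g., word vectors)
        self.temperature = temperature

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (N, vis_dim), txt_feats: (N, txt_dim) for N OCR tokens
        v = F.normalize(self.vis_proj(vis_feats), dim=-1)
        t = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = v @ t.t() / self.temperature  # (N, N) cross-modal similarities
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: the matching visual/textual pair of each token
        # sits on the diagonal; off-diagonal pairs are pushed apart.
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
        return loss, v, t
```

Likewise, a single graph-convolution layer over the purified scene graph might look as follows. This assumes a binary adjacency matrix with an edge between each OCR token and its master object; the normalization scheme and layer design are assumptions, since the abstract only says a GCN enriches the visual embedding.

```python
class GraphConv(nn.Module):
    """One mean-aggregation GCN layer over the master-object scene graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        # feats: (N, in_dim) node features (OCR tokens and object regions)
        # adj:   (N, N) 0/1 adjacency; nonzero where an OCR token is
        #        attached to its (assumed) master object
        adj = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        msg = (adj / deg) @ feats  # average each node's neighborhood
        return F.relu(self.linear(msg))
```

In this reading, the master object acts as the hub node of the graph, so message passing routes contextual information from surrounding regions to each OCR token through the object it is attached to.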

Subjects

Subjects :
*CONVOLUTIONAL neural networks

Details

Language :
English
ISSN :
0262-8856
Volume :
136
Database :
Academic Search Index
Journal :
Image & Vision Computing
Publication Type :
Academic Journal
Accession number :
169333352
Full Text :
https://doi.org/10.1016/j.imavis.2023.104751