
Searching for memory-lighter architectures for OCR-augmented image captioning.

Authors :
Gallardo-García, Rafael
Beltrán-Martínez, Beatriz
Hernández-Gracidas, Carlos
Vilariño-Ayala, Darnes
Pinto, David
Singh, Vivek
Source :
Journal of Intelligent & Fuzzy Systems; 2022, Vol. 42 Issue 5, p4399-4410, 12p
Publication Year :
2022

Abstract

Current state-of-the-art image captioning systems that can read text in images and integrate it into the generated descriptions require high processing power and memory, which limits their sustainability and usability, as they demand expensive and highly specialized hardware. The present work introduces two alternative versions (L-M4C and L-CNMT) of top architectures from the TextCaps challenge, adapted to achieve near-state-of-the-art performance while being memory-lighter than the original architectures. This is mainly achieved by using distilled or smaller pre-trained models in the text- and OCR-embedding modules. On the one hand, a distilled version of BERT reduces the size of the text-embedding module (the distilled model has 59% fewer parameters); on the other hand, the OCR context processor in both architectures uses Global Vectors (GloVe) instead of FastText pre-trained vectors, which can reduce the memory used by the OCR-embedding module by up to 94%. Two of the three models presented in this work surpass the challenge baseline (M4C-Captioner) on the evaluation and test sets, and our best lighter architecture reaches a CIDEr score of 88.24 on the test set, 7.25 points above the baseline model. [ABSTRACT FROM AUTHOR]
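The memory saving from swapping FastText for GloVe comes largely from the size of the embedding tables themselves: FastText stores subword (character n-gram) bucket vectors in addition to word vectors, while GloVe stores word vectors only. A minimal back-of-the-envelope sketch, using illustrative vocabulary sizes and dimensions (assumptions, not figures reported in the paper):

```python
# Rough memory comparison of FastText-style vs GloVe-style embedding tables.
# All sizes below are illustrative assumptions, not the paper's numbers.

def table_bytes(rows: int, dim: int, bytes_per_float: int = 4) -> int:
    """Memory of a dense embedding matrix with `rows` vectors of size `dim`."""
    return rows * dim * bytes_per_float

# FastText-style table: ~2M word vectors plus ~2M subword n-gram bucket
# vectors, 300 dimensions each (assumed sizes).
fasttext_bytes = table_bytes(2_000_000 + 2_000_000, 300)

# GloVe-style table: ~400k word vectors, 300 dimensions (assumed sizes).
glove_bytes = table_bytes(400_000, 300)

reduction = 1 - glove_bytes / fasttext_bytes
print(f"FastText table: {fasttext_bytes / 1e9:.2f} GB")  # → 4.80 GB
print(f"GloVe table:    {glove_bytes / 1e9:.2f} GB")     # → 0.48 GB
print(f"Reduction:      {reduction:.0%}")                # → 90%
```

Under these assumed sizes the embedding table shrinks by roughly 90%, the same order of magnitude as the up-to-94% reduction the abstract reports for the OCR-embedding module.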

Details

Language :
English
ISSN :
1064-1246
Volume :
42
Issue :
5
Database :
Complementary Index
Journal :
Journal of Intelligent & Fuzzy Systems
Publication Type :
Academic Journal
Accession number :
156139423
Full Text :
https://doi.org/10.3233/JIFS-219230