Multimodal Food Image Classification with Large Language Models.

Authors :
Kim, Jun-Hwa
Kim, Nam-Ho
Jo, Donghyeok
Won, Chee Sun
Source :
Electronics (2079-9292); Nov 2024, Vol. 13 Issue 22, p4552, 10p
Publication Year :
2024

Abstract

In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically, semantic textual descriptions generated by the LLM are encoded and combined with image features obtained from a transformer-based architecture to improve food image classification. Our approach employs a cross-attention mechanism to effectively fuse visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what can be achieved with visual features alone. [ABSTRACT FROM AUTHOR]
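The fusion step named in the abstract can be illustrated with a minimal sketch of single-head cross-attention in NumPy. This is not the authors' implementation: the abstract does not specify the attention direction, the number of heads, the token counts, or the feature dimension, so everything here is an assumption made for illustration. In particular, the sketch assumes that image tokens act as queries over text tokens, that the fused result is added back to the image tokens as a residual, and that random arrays stand in for the real image-encoder and text-encoder outputs. The function name `cross_attention_fuse` and all weight names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(img_tokens, txt_tokens, w_q, w_k, w_v):
    """Fuse text information into image tokens (assumed direction).

    img_tokens: (n_img, d) visual features, used as queries.
    txt_tokens: (n_txt, d) encoded text description, used as keys/values.
    Returns the fused tokens (n_img, d) and the attention map (n_img, n_txt).
    """
    q = img_tokens @ w_q                      # (n_img, d)
    k = txt_tokens @ w_k                      # (n_txt, d)
    v = txt_tokens @ w_v                      # (n_txt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)           # each image token attends over text
    fused = img_tokens + attn @ v             # residual connection (assumption)
    return fused, attn


# Placeholder inputs: random arrays stand in for real encoder outputs.
rng = np.random.default_rng(0)
d = 16                                        # hypothetical feature dimension
img_tokens = rng.normal(size=(49, d))         # e.g. a 7x7 grid of patch tokens
txt_tokens = rng.normal(size=(12, d))         # e.g. 12 text tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = cross_attention_fuse(img_tokens, txt_tokens, w_q, w_k, w_v)
pooled = fused.mean(axis=0)                   # pooled feature for a classifier head
```

In a trained model the projection matrices would be learned and the pooled vector would feed a classification layer; here they are random and serve only to show the shapes and data flow.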

Details

Language :
English
ISSN :
20799292
Volume :
13
Issue :
22
Database :
Complementary Index
Journal :
Electronics (2079-9292)
Publication Type :
Academic Journal
Accession number :
181168373
Full Text :
https://doi.org/10.3390/electronics13224552