Align vision-language semantics by multi-task learning for multi-modal summarization.

Authors :
Cui, Chenhao
Liang, Xinnian
Wu, Shuangzhi
Li, Zhoujun
Source :
Neural Computing & Applications. May 2024, pp. 1-14.
Publication Year :
2024

Abstract

Most current multi-modal summarization methods are cascaded: an off-the-shelf object detector first extracts visual features, which are then fused with language representations so that the decoder can generate the text summary. However, the cascaded approach uses separate encoders for the different modalities, which makes it hard to learn a joint vision-and-language representation. These methods also ignore the semantic alignment between paragraphs and images, which is crucial to a precise summary. To tackle these issues, we propose ViL-Sum, which jointly models paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. ViL-Sum has two components for learning and aligning multi-modal semantics. The first is a joint multi-modal encoder. The second is a pair of tasks for multi-task learning: image reordering and image selection. Specifically, the joint multi-modal encoder converts images into visual embeddings and attaches them to the text embeddings to form the encoder input. The reordering task guides the model to learn paragraph-level semantic alignment, and the selection task guides the model to select summary-related images for the final summary. Experimental results show that ViL-Sum outperforms current state-of-the-art methods on most automatic and manual evaluation metrics. Further analysis shows that the two tasks and the joint multi-modal encoder effectively guide the model to learn reasonable paragraph-image and summary-image relations. [ABSTRACT FROM AUTHOR]
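
The two tasks named in the abstract can be illustrated with a minimal sketch of how their training targets might be built. This is not the authors' implementation: the abstract gives no formulas, so the function names, the weighted-sum loss, and the weights `w_reorder` and `w_select` are all hypothetical and chosen only for illustration.

```python
import random
from typing import List, Sequence, Tuple


def make_reordering_example(
    images: Sequence[str], rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Shuffle the images; the target for each shuffled slot is its original index.

    Hypothetical helper: a model trained on this target has to recover where
    each image belonged among the paragraphs.
    """
    order = list(range(len(images)))
    rng.shuffle(order)
    shuffled = [images[i] for i in order]
    return shuffled, order


def make_selection_labels(num_images: int, summary_image_ids: Sequence[int]) -> List[int]:
    """Binary target per image: 1 if the image appears in the reference summary."""
    chosen = set(summary_image_ids)
    return [1 if i in chosen else 0 for i in range(num_images)]


def multitask_loss(
    summary_loss: float,
    reorder_loss: float,
    select_loss: float,
    w_reorder: float = 1.0,  # assumed weight, not taken from the paper
    w_select: float = 1.0,   # assumed weight, not taken from the paper
) -> float:
    """Assumed combination: a weighted sum of the three per-task losses."""
    return summary_loss + w_reorder * reorder_loss + w_select * select_loss


if __name__ == "__main__":
    rng = random.Random(0)
    shuffled, targets = make_reordering_example(["img0", "img1", "img2"], rng)
    print(shuffled, targets)
    print(make_selection_labels(3, [1]))
    print(multitask_loss(2.0, 0.5, 0.25))
```

A weighted sum is a common way to combine losses in multi-task learning; whether ViL-Sum weights its tasks this way is not stated in the abstract.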

Details

Language :
English
ISSN :
0941-0643
Database :
Academic Search Index
Journal :
Neural Computing & Applications
Publication Type :
Academic Journal
Accession number :
177270549
Full Text :
https://doi.org/10.1007/s00521-024-09908-3