Complex visual question answering based on uniform form and content.
- Source :
- Applied Intelligence; Mar2024, Vol. 54 Issue 6, p4602-4620, 19p
- Publication Year :
- 2024
Abstract
- Complex visual question answering holds the potential to enhance artificial intelligence proficiency in understanding natural language, stimulate advances in computer vision technologies, and expand the range of practical applications. However, achieving desirable answers is often hindered by factors such as inconsistent form and content between pre-training and fine-tuning tasks, and the involvement of external knowledge. In this paper, we propose a complex visual question answering model based on uniform form and content, which aims to achieve better feature consistency and enhance model performance. To guide the question answering task and compensate for inconsistencies in form between pre-training and downstream tasks, an encoding and decoding model is employed to generate auto-prompt tuning templates with masks. Moreover, an intermediate process between pre-training and the downstream task, designed to resemble the downstream task, helps to further bridge the content gap between the two modalities. On this foundation, we propose a novel APT-CVQA model that incorporates a hybrid architecture and a joint loss function combining cross-entropy and SimCLR. On the complex-scenario KR-VQA dataset, the accuracy of our model surpasses the best compared model by 2.45%. On the universal dataset GQA, our model performs 6.87% better than the best of the compared models. The whole process is divided into three phases: Phase-1 generates auto-prompt tuning templates, Phase-2 produces intermediate pre-trained checkpoints, and Phase-3 performs fine-tuning. [ABSTRACT FROM AUTHOR]
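The abstract describes a joint objective that sums a cross-entropy term (answer classification) with a SimCLR-style contrastive term. A minimal sketch of such a joint loss is shown below; the function names, the NT-Xent formulation, and the weighting hyperparameter `alpha` are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: joint cross-entropy + SimCLR (NT-Xent) loss.
# All names and the alpha weighting are assumptions, not the paper's code.
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()  # stabilize before exponentiation
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def nt_xent(z1, z2, tau=0.5):
    """SimCLR NT-Xent loss over a batch of paired embeddings z1, z2."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity space
    sim = z @ z.T / tau
    n = z1.shape[0]
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity from the softmax
    # The positive partner of index i is i + n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(2 * n), pos].mean()

def joint_loss(logits, label, z1, z2, alpha=0.5):
    """Weighted sum of the two terms; alpha is an assumed hyperparameter."""
    return cross_entropy(logits, label) + alpha * nt_xent(z1, z2)
```

In a SimCLR-style setup, `z1` and `z2` would hold two views (e.g. two augmentations or two modality encodings) of the same batch of examples, so matched rows form positive pairs and all other rows serve as negatives.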
Details
- Language :
- English
- ISSN :
- 0924669X
- Volume :
- 54
- Issue :
- 6
- Database :
- Complementary Index
- Journal :
- Applied Intelligence
- Publication Type :
- Academic Journal
- Accession number :
- 177625400
- Full Text :
- https://doi.org/10.1007/s10489-024-05383-4