Counting-based visual question answering with serial cascaded attention deep learning.

Authors :
MeshuWelde, Tesfayee
Liao, Lejian
Source :
Pattern Recognition. Dec2023, Vol. 144, pN.PAG-N.PAG. 1p.
Publication Year :
2023

Abstract

• To design and develop a counting-based VQA model using heuristic-assisted deep learning to find counting answers for given simple and complex questions, which provides helpful information for web sources or social media applications.
• To propose an optimal weighted feature fusion process, in which the image features extracted by Visual Geometry Group 16 (VGG16) and the text features extracted by a text convolutional neural network (TCNN) are combined, with the optimal weight determined by a novel improved tuna swarm optimization (ITSO) approach.
• To formulate the novel heuristic algorithm, ITSO, with a new formulation for the random value used in the position-updating process. It is used to optimize parameters such as the number of epochs in the LSTM, the number of hidden neurons in the recurrent neural network (RNN), and the weights, so as to minimize the root mean square error (RMSE).
• To present the novel serial cascaded recurrent neural network with attention mechanism-based long short-term memory (SCRAM-LSTM) model, in which the RNN and the attention-based LSTM are serially cascaded to estimate the final counting answers. Further, to attain optimal answers, the hyper-parameters are tuned by the ITSO approach.

Counting-based questions play a major part in Visual Question Answering (VQA), and the most challenging factor is counting the different objects present in the images. Recently, more attention has been paid to designing count-aided VQA models. A VQA system responds to questions with appropriate answers, yet it must also be able to answer complex questions. Earlier models still struggle to count the various objects within images because they fail to select suitable features and lack fine-grained representations.
To preserve the image representation, this paper proposes a new VQA model based on a heuristic-assisted, serially cascaded deep learning approach. First, standard image and text data are gathered and pre-processed. Next, features are extracted from both the image and the text data: deep image features are obtained using Visual Geometry Group 16 (VGG16), and text features are extracted using a Text Convolutional Neural Network (TCNN). The optimal weighted fused features are then obtained, with the fusion weights tuned by the Improved Tuna Swarm Optimization (ITSO) algorithm. Finally, the counting answers for the given queries are produced by the Serial Cascaded Recurrent Neural Network with Attention Mechanism-based Long Short-Term Memory (SCRAM-LSTM). Performance is evaluated with a range of metrics and compared against conventional models. The findings indicate that the proposed model offers superior performance in estimating the appropriate answers. The proposed work can therefore be used in applications such as helping blind or visually impaired people obtain information, integration with image retrieval systems, and search engines, and more generally in vision-and-language systems. [ABSTRACT FROM AUTHOR]
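
The weighted fusion step and the RMSE objective mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not give the fusion formula, so a single scalar weight `w` is applied to the image features and `1 - w` to the text features before concatenation; `predict_count` is a hypothetical stand-in for the SCRAM-LSTM model; and plain random search takes the place of ITSO, whose update rules are not described here.

```python
import math
import random


def fuse(image_feats, text_feats, w):
    """Assumed fusion: scale image features by w, text features by (1 - w), then concatenate."""
    return [w * v for v in image_feats] + [(1.0 - w) * t for t in text_feats]


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def predict_count(fused):
    """Hypothetical stand-in for the counting model: sums the fused features."""
    return sum(fused)


def search_weight(samples, trials=200, seed=0):
    """Random search over w in [0, 1]; a simple placeholder for ITSO, not ITSO itself."""
    rng = random.Random(seed)
    targets = [count for _, _, count in samples]
    best_w, best_err = 0.5, float("inf")
    for _ in range(trials):
        w = rng.random()
        preds = [predict_count(fuse(img, txt, w)) for img, txt, _ in samples]
        err = rmse(preds, targets)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


if __name__ == "__main__":
    # Toy samples: (image features, text features, true count).
    toy = [([2.0, 1.0], [1.0], 2.5), ([4.0, 0.0], [2.0], 3.5)]
    w, err = search_weight(toy)
    print(f"best weight={w:.3f}, RMSE={err:.4f}")
```

In this toy setup the search simply keeps the weight with the lowest RMSE on the samples; a population-based optimizer such as ITSO would replace the sampling loop while the fusion and objective stay the same.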

Details

Language :
English
ISSN :
00313203
Volume :
144
Database :
Academic Search Index
Journal :
Pattern Recognition
Publication Type :
Academic Journal
Accession number :
171367572
Full Text :
https://doi.org/10.1016/j.patcog.2023.109850