Descriptor: "Visual Question Answering" / Language: english - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Visual Question Answering"' showing total 455 results

1. Multi-stage reasoning on introspecting and revising bias for visual question answering.

Author: An-An, L., Zimu, Lu, Ning, Xu, Min, Liu, Chenggang, Yan, Bolun, Zheng, Bo, Lv, Yulong, Duan, Zhuang, Shao, and Xuanya, Li
Subjects: ARTIFICIAL intelligence, ATTENTIONAL bias, VOCABULARY, FORECASTING, LANGUAGE & languages
Abstract: Visual Question Answering (VQA) is a task that involves predicting an answer to a question depending on the content of an image. However, recent VQA methods have relied more on language priors between the question and answer rather than the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. However, the bias can be divided into two categories: good bias and bad bias. Good bias can benefit answer prediction, while bad bias may associate the models with unrelated information. Therefore, instead of excluding good and bad bias indiscriminately as in existing debiasing methods, we propose a bias discrimination module to distinguish them. Additionally, bad bias may reduce the model's reliance on image content during answer reasoning, so that little attention is paid to updating image features. To tackle this, we leverage Markov theory to construct a Markov field with image regions and question words as nodes. This helps with feature updating for both image regions and question words, thereby facilitating more accurate and comprehensive reasoning about both the image content and question. To verify the effectiveness of our network, we evaluate it on the VQA v2 and VQA-CP v2 datasets and conduct extensive quantitative and qualitative studies. Experimental results show that our network achieves significant performance against the previous state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

2. Learning to enhance aerial video captioning with visual question answering.

Author: Al Mehmadi, Shima M., Bazi, Yakoub, Al Rahhal, Mohamad M., and Zuair, Mansour
Subjects: *DRONE aircraft, *REMOTE sensing, *VIDEOS
Abstract: The utilization of Unmanned Aerial Vehicles (UAV) in remote sensing (RS) has witnessed a significant surge, offering valuable insights into Earth dynamics and human activities. However, this has led to a substantial increase in the volume of video data, rendering manual screening and analysis impractical. Consequently, there is a pressing need for the development of automated interpretation models for these aerial videos. In this paper, we propose a novel approach that leverages visual dialogue to enhance aerial video captioning. Our model adopts an encoder-decoder architecture, integrating a Visual Question Answering (VQA) task before the captioning task. The VQA task aims to enrich the captioning process by soliciting additional information about the image content. Specifically, our video encoder utilizes ViT-L/16, while the decoder employs Generative Pre-trained Transformer-2 (Distill-GPT-2). To validate our model, we introduce a novel benchmark dataset named CapERA-VQA, comprising videos accompanied by sets of questions, answers, and captions. Through experimental validation, we demonstrate the effectiveness of our proposed approach in enhancing the automated captioning of aerial videos. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

3. DCF–VQA: Counterfactual Structure Based on Multi–Feature Enhancement

Author: Yang Guan, Ji Cheng, Liu Xiaoming, Zhang Ziming, and Wang Chen
Subjects: visual question answering, multi-feature enhancement, counterfactual, discrete cosine transform, Mathematics, QA1-939, Electronic computers. Computer science, QA75.5-76.95
Abstract: Visual question answering (VQA) is a pivotal topic at the intersection of computer vision and natural language processing. This paper addresses the challenges of linguistic bias and bias fusion within invalid regions encountered in existing VQA models due to insufficient representation of multi-modal features. To overcome those issues, we propose a multi-feature enhancement scheme. This scheme involves the fusion of one or more features with the original ones, incorporating discrete cosine transform (DCT) features into the counterfactual reasoning framework. This approach harnesses fine-grained information and spatial relationships within images and questions, enabling a more refined understanding of the indirect relationship between images and questions. Consequently, it effectively mitigates linguistic bias and bias fusion within invalid regions in the model. Extensive experiments are conducted on multiple datasets, including VQA2 and VQA-CP2, employing various baseline models and fusion techniques, resulting in promising and robust performance.
Published: 2024
Full Text: View/download PDF

4. Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering

Author: Zhongjian Hu, Peng Yang, Fengyuan Liu, Yuan Meng, and Xingyu Liu
Subjects: visual question answering, knowledge-based visual question answering, large language model, knowledge injection, Electronic computers. Computer science, QA75.5-76.95
Abstract: Previous works employ Large Language Models (LLMs) such as GPT-3 for knowledge-based Visual Question Answering (VQA). We argue that the inferential capacity of an LLM can be enhanced through knowledge injection. Although methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they may have some limitations, such as the possibility of not being able to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled “Prompting Large Language Models with Knowledge-Injection” (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge injection. Unlike earlier approaches, we adopt the LLM for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. In comparison to existing baselines, our approach exhibits accuracy improvements of over 1.3 and 1.7 on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA, respectively.
Published: 2024
Full Text: View/download PDF

5. Integrating IoT and visual question answering in smart cities: Enhancing educational outcomes

Author: Tian Gao and Guanqi Wang
Subjects: Smart cities, IoT framework, Visual question answering, Large language models, Smart education technology, Engineering (General). Civil engineering (General), TA1-2040
Abstract: Emerging as a paradigmatic shift in urban development, smart cities harness the potential of advanced information and communication technologies to seamlessly integrate urban functions, optimize resource allocation, and improve the effectiveness of city management. Within the domain of smart education, the imperative application of Visual Question Answering (VQA) technology encounters significant limitations at the prevailing stage, particularly the absence of a robust Internet of Things (IoT) framework and the inadequate incorporation of large pre-trained language models (LLMs) within contemporary smart education paradigms, especially in addressing zero-shot VQA scenarios, which pose considerable challenges. In response to these constraints, this paper introduces an IoT-based smart city framework that is designed to refine the functionality and efficacy of educational systems. This framework is delineated into four cardinal layers: the data collection layer, data transmission layer, data management layer, and application layer. Furthermore, we introduce the innovative TeachVQA methodology at the application layer, synergizing VQA technology with extensive pre-trained language models, thereby considerably enhancing the dissemination and assimilation of educational content. Evaluative metrics in the VQAv2 and OKVQA datasets substantiate that the TeachVQA methodology not only outperforms existing VQA approaches, but also underscores its profound potential and practical relevance in the educational sector.
Published: 2024
Full Text: View/download PDF

6. Vision transformer-based visual language understanding of the construction process

Author: Bin Yang, Binghan Zhang, Yilong Han, Boda Liu, Jiniming Hu, and Yiming Jin
Subjects: Intelligent construction, Computer vision, Vision transformer, Natural language processing, Visual question answering, Engineering (General). Civil engineering (General), TA1-2040
Abstract: The widespread implementation of surveillance systems on construction sites has led to the accumulation of vast amounts of visual data, highlighting the need for an effective semantic analysis methodology. Natural language, as the most intuitive mode of expression, can significantly enhance the interpretability of such data. The adoption of multi-modality models promotes the interaction between surveillance video and textual data, thereby enabling managers to swiftly comprehend on-site dynamics. This study introduces a Visual Question Answering (VQA) approach for the construction industry and presents a specialized dataset to address the unique requirements of on-site management. Utilizing a Vision Transformer (ViT) architecture, the proposed model conducts feature extraction, fusion and interaction between visual and textual features. An additional projection layer is added to establish a transfer learning strategy that is optimized for construction site data. This novel approach facilitates rapid alignment of visual and language features in the model and is validated through ablation studies. The proposed approach achieves a testing accuracy of 83.8%, effectively converting image data from construction sites into natural language descriptions that enhance the analysis of construction processes. Compared to existing methods, this approach does not rely on object detection and allows for the direct extraction of deep-level semantic information from the on-site images. This study further discusses the feasibility of applying VQA within the architecture, engineering and construction (AEC) industry, examines its limitations, and offers suggestions for viable future directions of development.
Published: 2024
Full Text: View/download PDF

7. Multimodal attention-driven visual question answering for Malayalam.

Author: Kovath, Abhishek Gopinath, Nayyar, Anand, and Sikha, O. K.
Subjects: *DEEP learning, *CONVOLUTIONAL neural networks, *NATURAL languages, *QUESTION answering systems, *GENOMES, *TOURISM
Abstract: Visual question answering is a challenging task that necessitates sophisticated reasoning over the visual elements to provide an accurate answer to a question. The majority of state-of-the-art VQA models are only applicable to English questions. However, applications such as visual assistance and tourism necessitate the incorporation of multilingual VQA systems. This paper presents an effective deep learning framework for Malayalam visual question answering (MVQA), which can answer a specific natural language question about an image in Malayalam. As there is no available dataset in English–Malayalam VQA, an MVQA dataset was created by translating English question–answer pairs from the visual genome dataset. The paper proposes an attention-driven MVQA model on the developed dataset. The proposed MVQA model uses a deep learning-based co-attention mechanism to jointly learn the attention for images and Malayalam questions. A second-order multimodal factorized high-order pooling is used for multimodal feature fusion. Different VQA models using combinations of classical CNNs and RNNs were experimented with on the developed MVQA dataset, and the performance was compared against the proposed attention-driven model. Experimental results show that the proposed attention-driven MVQA model achieves state-of-the-art results as compared to other models for MVQA on the custom Malayalam VQA dataset. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

8. HRVQA: A Visual Question Answering benchmark for high-resolution aerial images.

Author: Li, Kun, Vosselman, George, and Yang, Michael Ying
Subjects: *COMPUTER vision, *URBAN planning, *SOURCE code, *QUESTION answering systems, *SCARCITY, *PIXELS
Abstract: Visual question answering (VQA) is an important and challenging multimodal task in computer vision and photogrammetry. Recently, efforts have been made to bring the VQA task to aerial images, due to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, the development of VQA in this domain is restricted by the huge variation in the appearance, scale, and orientation of the concepts in aerial images, along with the scarcity of well-annotated datasets. In this paper, we introduce a new dataset, HRVQA, which provides a collection of 53,512 aerial images of 1024 × 1024 pixels and semi-automatically generated 1,070,240 QA pairs. To benchmark the understanding capability of VQA models for aerial images, we evaluate the recent methods on the HRVQA dataset. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially the specific attribute-related questions. Our method achieves superior performance in comparison to the previous state-of-the-art approaches. The dataset and the source code are released at https://hrvqa.nl/. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

9. ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese.

Author: Tran, Khiem Vinh, Phan, Hao Phu, Van Nguyen, Kiet, and Nguyen, Ngan Luu Thuy
Abstract: In recent years, visual question answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made remarkable progress on large-scale datasets, with a primary focus on resource-rich languages like English. To address this, we introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese while mitigating biases. The dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), each question annotated to specify the type of reasoning involved. Leveraging this dataset, we conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion that identifies objects in images based on questions. The architecture effectively employs transformers to enable simultaneous reasoning over textual and visual data, merging both modalities at an early model stage. The experimental findings demonstrate that our proposed model achieves state-of-the-art performance across four evaluation metrics. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

10. Sign-based image criteria for social interaction visual question answering.

Author: Chuganskaya, Anfisa A, Kovalev, Alexey K, and Panov, Aleksandr I
Subjects: QUESTION answering systems, SOCIAL interaction, ARTIFICIAL intelligence, PSYCHOLOGICAL research, MACHINE learning
Abstract: Multi-modal tasks have started to play a significant role in the research on artificial intelligence. A particular example of that domain is visual–linguistic tasks, such as visual question answering. The progress of modern machine learning systems is determined, among other things, by the data on which these systems are trained. Most modern visual question answering data sets contain limited types of questions that can be answered either by directly accessing the image itself or by using external data. At the same time, insufficient attention is paid to the issues of social interactions between people, which limits the scope of visual question answering systems. In this paper, we propose criteria by which images suitable for social interaction visual question answering can be selected for composing such questions, based on psychological research. We believe this should serve the progress of visual question answering systems. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

11. Vision transformer-based visual language understanding of the construction process.

Author: Yang, Bin, Zhang, Binghan, Han, Yilong, Liu, Boda, Hu, Jiniming, and Jin, Yiming
Subjects: TRANSFORMER models, NATURAL language processing, FEATURE extraction, COMPUTER vision, VIDEO surveillance, BUILDING sites
Abstract: The widespread implementation of surveillance systems on construction sites has led to the accumulation of vast amounts of visual data, highlighting the need for an effective semantic analysis methodology. Natural language, as the most intuitive mode of expression, can significantly enhance the interpretability of such data. The adoption of multi-modality models promotes the interaction between surveillance video and textual data, thereby enabling managers to swiftly comprehend on-site dynamics. This study introduces a Visual Question Answering (VQA) approach for the construction industry and presents a specialized dataset to address the unique requirements of on-site management. Utilizing a Vision Transformer (ViT) architecture, the proposed model conducts feature extraction, fusion and interaction between visual and textual features. An additional projection layer is added to establish a transfer learning strategy that is optimized for construction site data. This novel approach facilitates rapid alignment of visual and language features in the model and is validated through ablation studies. The proposed approach achieves a testing accuracy of 83.8%, effectively converting image data from construction sites into natural language descriptions that enhance the analysis of construction processes. Compared to existing methods, this approach does not rely on object detection and allows for the direct extraction of deep-level semantic information from the on-site images. This study further discusses the feasibility of applying VQA within the architecture, engineering and construction (AEC) industry, examines its limitations, and offers suggestions for viable future directions of development. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

12. Advancing surgical VQA with scene graph knowledge.

Author: Yuan, Kun, Kattel, Manasi, Lavanchy, Joël L., Navab, Nassir, Srivastav, Vinkle, and Padoy, Nicolas
Abstract: Purpose: The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with natural language capabilities is emerging as a necessity. Our work aims to advance visual question answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question–condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. Methods: First, we propose a surgical scene graph-based dataset, SSG-VQA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. We then propose SSG-VQA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module, which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Results: Our comprehensive analysis shows that our SSG-VQA dataset provides a more complex, diverse, geometrically grounded, unbiased and surgical action-oriented dataset compared to existing surgical VQA datasets and SSG-VQA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation in the current surgical VQA systems is the lack of scene knowledge to answer complex queries. Conclusion: We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. We point out that the bottleneck of the current surgical visual question–answer model lies in learning the encoded representation rather than decoding the sequence. Our SSG-VQA dataset provides a diagnostic benchmark to test the scene understanding and reasoning capabilities of the model. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-VQA. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

13. DCF-VQA: COUNTERFACTUAL STRUCTURE BASED ON MULTI-FEATURE ENHANCEMENT.

Author: GUAN YANG, CHENG JI, XIAOMING LIU, ZIMING ZHANG, and CHEN WANG
Subjects: DISCRETE cosine transforms, NATURAL language processing, COMPUTER vision, COUNTERFACTUALS (Logic)
Abstract: Visual question answering (VQA) is a pivotal topic at the intersection of computer vision and natural language processing. This paper addresses the challenges of linguistic bias and bias fusion within invalid regions encountered in existing VQA models due to insufficient representation of multi-modal features. To overcome those issues, we propose a multi-feature enhancement scheme. This scheme involves the fusion of one or more features with the original ones, incorporating discrete cosine transform (DCT) features into the counterfactual reasoning framework. This approach harnesses fine-grained information and spatial relationships within images and questions, enabling a more refined understanding of the indirect relationship between images and questions. Consequently, it effectively mitigates linguistic bias and bias fusion within invalid regions in the model. Extensive experiments are conducted on multiple datasets, including VQA2 and VQA-CP2, employing various baseline models and fusion techniques, resulting in promising and robust performance. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

14. Learning a Mixture of Conditional Gating Blocks for Visual Question Answering.

Author: Sun, Qiang, Fu, Yan-Wei, and Xue, Xiang-Yang
Subjects: CONVOLUTIONAL neural networks, TRANSFORMER models, ARTIFICIAL neural networks, TURING test, RESEARCH personnel
Abstract: As a Turing test in multimedia, visual question answering (VQA) aims to answer the textual question with a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of the neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the transformers of the VQA tasks through all the stages in an end-to-end manner remains relatively unexplored and highly nontrivial. Typically, due to the large computation cost of transformers, researchers are inclined to only apply transformers on the extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer to the transformer as it can effectively increase the model capacity and require fewer transformer layers for the VQA task. In particular, we name the dynamics in the Transformer as Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with conditional ResNeXt block (cResNeXt). Thus, a novel model, mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the Transformer, convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special examples of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG has achieved the state-of-the-art performance on these benchmark datasets. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

15. Enhancing machine vision: the impact of a novel innovative technology on video question-answering.

Author: Dan, Songjian and Feng, Wei
Subjects: *QUESTION answering systems, *COMPUTER vision, *LANGUAGE models, *TECHNOLOGICAL innovations, *NATURAL language processing, *ARTIFICIAL intelligence
Abstract: The robot video question-answering system is an artificial intelligence application that integrates computer vision and natural language processing technologies. Recently, it has received widespread attention, especially with the rapid development of large language models (LLMs). The core technical challenge lies in the application of visual question answering (VQA). However, visual question answering currently faces several challenges. Firstly, the acquisition of human annotations is costly, and secondly, existing models require expensive retraining when replacing a particular module. We propose the VLM2LLM model, which significantly improves the performance of multimodal question-answering tasks by integrating visual-language matching and large-scale language models. Specifically, it overcomes the limitations of requiring massive computational resources for training and inference in previous models. Furthermore, it allows for the upgrading of our LLM version according to the latest research advancements and needs. The results demonstrate that the VLM2LLM model achieves the highest accuracy compared to other state-of-the-art models on three datasets: QAv2, A-OKVQA, and OK-VQA. We hope that the VLM2LLM model can drive advancements in the field of robot video question-answering and provide innovative solutions for a wider range of application domains. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

16. TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention.

Author: Koshti, Dipali, Gupta, Ashutosh, Kalla, Mukesh, and Sharma, Arvind
Subjects: *TRANSFORMER models, *POWER transformers, *FEATURE extraction, *NATURAL languages, *IMAGE representation
Abstract: Understanding multiple modalities and relating them is an easy task for humans. But for machines, this is a stimulating task. One such multi-modal reasoning task is visual question answering, which demands the machine to produce an answer for the natural language query asked based on the given image. Although plenty of work is done in this field, there is still a challenge of improving the answer prediction ability of the model and breaching human accuracy. A novel model for answering image-based questions based on a transformer has been proposed. The proposed model is a fully Transformer-based architecture that utilizes the power of a transformer for extracting language features as well as for performing joint understanding of question and image features. The proposed VQA model utilizes F-RCNN for image feature extraction. The retrieved language features and object-level image features are fed to a decoder inspired by the Bidirectional Encoder Representations from Transformers (BERT) architecture, which jointly learns the image characteristics directed by the question characteristics, so that rich representations of the image features are obtained. Extensive experimentation has been carried out to observe the effect of various hyperparameters on the performance of the model. The experimental results demonstrate that the model's ability to predict the answer increases with the increase in the number of layers in the transformer's encoder and decoder. The proposed model improves upon the previous models and is highly scalable due to the introduction of the BERT. Our best model reports 72.31% accuracy on the test-standard split of the VQAv2 dataset. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

17. EarthVQANet: Multi-task visual question answering for remote sensing image understanding.

Author: Wang, Junjue, Ma, Ailong, Chen, Zihang, Zheng, Zhuo, Wan, Yuting, Zhang, Liangpei, and Zhong, Yanfei
Subjects: *SURFACE of the earth, *SEMANTICS, *URBAN planning, *HURRICANE Harvey, 2017, *HUMAN settlements
Abstract: Monitoring and managing Earth's surface resources is critical to human settlements, encompassing essential tasks such as city planning, disaster assessment, etc. To accurately recognize the categories and locations of geographical objects and reason about their spatial or semantic relations, we propose a multi-task framework named EarthVQANet, which jointly addresses segmentation and visual question answering (VQA) tasks. EarthVQANet contains a hierarchical pyramid network for segmentation and semantic-guided attention for VQA, in which the segmentation network aims to generate pixel-level visual features and high-level object semantics, and semantic-guided attention performs effective interactions between visual features and language features for relational modeling. For accurate relational reasoning, we design an adaptive numerical loss that incorporates distance sensitivity for counting questions and mines hard-easy samples for classification questions, balancing the optimization. Experimental results on the EarthVQA dataset (city planning for Wuhan, Changzhou, and Nanjing in China), RSVQA dataset (basic statistics for general objects), and FloodNet dataset (disaster assessment for Texas, USA, after it was struck by Hurricane Harvey) show that EarthVQANet surpasses 11 general and remote sensing VQA methods. EarthVQANet simultaneously achieves segmentation and reasoning, providing a solid benchmark for various remote sensing applications. Data is available at http://rsidea.whu.edu.cn/EarthVQA.htm [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

18. Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering.

Author: Yan, Feng, Li, Zhe, Silamu, Wushour, and Li, Yanbing
Subjects: QUESTION answering systems, DESIGN
Abstract: Existing visual question answering (VQA) methods tend to focus excessively on visual objects in images, neglecting the understanding of implicit knowledge within the images, thus limiting the comprehension of image content. Furthermore, current mainstream VQA methods employ a bottom-up attention mechanism, which was initially proposed in 2017 and has become a bottleneck in visual question answering. In order to address the aforementioned issues and improve the ability to understand images, we have made the following improvements and innovations: (1) We utilize an OCR model to detect and extract scene text in the images, further enriching the understanding of image content. We also introduce the descriptive information from the images to enhance the model's comprehension of the images. (2) We have made improvements to the bottom-up attention model by obtaining two region features from the images and concatenating them to form the final visual feature, which better represents the image. (3) We design an extensible deep co-attention model, which includes self-attention units and co-attention units. This model can incorporate both image description information and scene text into the model, and it can be extended with other knowledge to further enhance the model's reasoning ability. (4) Experimental results demonstrate that our best single model achieves an overall accuracy of 74.38% on the VQA 2.0 test set. To the best of our knowledge, without using external datasets for pretraining, our model has reached a state-of-the-art level. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

19. A focus fusion attention mechanism integrated with image captions for knowledge graph-based visual question answering.

Author: Ma, Mingyang, Tohti, Turdi, Liang, Yi, Zuo, Zicheng, and Hamdulla, Askar
Abstract: Visual question answering tasks based on the knowledge graph are dedicated to integrating rich information in the knowledge graph to deal with complex questions that cannot be solved by image features alone while focusing on improving the performance of fundamental visual question answering tasks. The core of this task is to achieve effective cross-modal information fusion and resolve the semantic gap between images and text, thereby predicting answers more accurately. However, current visual question answering methods face challenges such as sparse information, single fusion features, and excessive computational burden. Given the sparsity of image regions related to questions in visual question answering tasks, traditional fusion methods such as linear pooling and cross-attention, while capable of effectively handling interactions between different modalities, engage the question with the entire image globally. This introduces unnecessary noise and increases computational complexity. To solve these problems, we propose a focus fusion attention mechanism (FFAM) integrated with image captions, effectively reducing noise and computational burden by focusing on the top-k high-relevance areas. In addition, we adopt the advanced BLIP-2 model to generate image captions and introduce them as a new modality into the fusion process, breaking through the limitation of relying solely on features generated by the image encoder. Although introducing the knowledge graph increases the possibility of model processing complexity and noise, our method still shows powerful effects. On the F-VQA dataset, our model improved by 2.57% compared to the baseline model without the knowledge graph and achieved an accuracy of 86.35% with the knowledge graph. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
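
Note on record 19: the abstract describes focusing attention on the top-k high-relevance image areas instead of attending to the whole image. The following is a minimal sketch of generic top-k attention under stated assumptions, not the FFAM implementation from the paper. The function name `topk_attention`, the dot-product scoring, and the softmax over the kept regions are all illustrative choices that the abstract does not specify.

```python
import math


def topk_attention(query, regions, k):
    """Attend only to the k regions most relevant to the query.

    query   : list of floats (hypothetical question embedding)
    regions : list of equal-length float lists (hypothetical region features)
    k       : number of regions to keep (1 <= k <= len(regions))
    Returns (weights, fused), where weights has one entry per region and is
    zero for every region outside the top k.
    """
    if not 1 <= k <= len(regions):
        raise ValueError("k must be between 1 and the number of regions")

    # Relevance score: plain dot product (an assumption, not from the paper).
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]

    # Indices of the k highest-scoring regions; ties keep their original order.
    keep = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax restricted to the kept regions (shifted by the max for stability).
    top = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(regions))]

    # Weighted sum of region features; discarded regions contribute nothing.
    dim = len(regions[0])
    fused = [sum(weights[i] * regions[i][d] for i in range(len(regions)))
             for d in range(dim)]
    return weights, fused
```

Because the discarded regions receive exactly zero weight, they can be skipped in the weighted sum, which is where a reduction in noise and computation would come from in a scheme of this kind.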

20. Dual modality prompt learning for visual question-grounded answering in robotic surgery

Author: Yue Zhang, Wanshu Fan, Peixi Peng, Xin Yang, Dongsheng Zhou, and Xiaopeng Wei
Subjects: Prompt learning, Visual prompt, Textual prompt, Grounding-answering, Visual question answering, Drawing. Design. Illustration, NC1-1940, Computer applications to medicine. Medical informatics, R858-859.7, Computer software, QA76.75-76.765
Abstract: With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of the VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
Published: 2024
Full Text: View/download PDF

21. Graph neural networks for visual question answering: a systematic review.

Author: Yusuf, Abdulganiyu Abdu, Feng, Chong, Mao, Xianling, Ally Duma, Ramadhani, Abood, Mohammed Salah, and Chukkol, Abdulrahman Hamman Adama
Subjects: GRAPH neural networks, COMPUTER vision, NATURAL language processing, IMAGE representation, ISOMORPHISM (Mathematics), NEUROLINGUISTICS
Abstract: Recently, visual question answering (VQA) has gained considerable interest within the computer vision and natural language processing (NLP) research areas. The VQA task involves answering a question about an image, which requires both language and vision understanding. Effectively extracting visual representations from images, textual embedding from questions, and bridging the semantic disparity between image and question representations pose fundamental challenges in VQA. Lately, an increasing number of studies are focusing on utilizing graph neural networks (GNNs) to enhance the performance of VQA tasks. The ability to handle graph-structured data is a major advantage of GNNs for VQA tasks, which allows better representation of relationships between objects and regions in an image. These relationships include both spatial and semantic relationships. This paper systematically reviews various graph neural network-based studies for image-based VQA. Fifty-four related publications written between 2018 and Jan. 2023 were carefully synthesized for this review. The review is structured into three perspectives: the various graph neural network techniques and models that have been applied for VQA, a comparison of the models' performance, and existing challenges. After analyzing these papers, 45 different models were identified, grouped into four different GNN techniques. These are Graph Convolution Network (GCN), Graph Attention Network (GAT), Graph Isomorphism Network (GIN) and Graph Neural Network (GNN). Also, the performance of these models is compared based on accuracy, datasets, subtasks, feature representation and fusion techniques. Lastly, the study provided some possible suggestions to mitigate still existing challenges for future research in visual question answering. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

22. Learning the Meanings of Function Words From Grounded Language Using a Visual Question Answering Model.

Author: Portelance, Eva, Frank, Michael C., and Jurafsky, Dan
Subjects: *MACHINE learning, *SEMANTICS, *STATISTICAL learning, *ARTIFICIAL neural networks
Abstract: Interpreting a seemingly simple function word like "or," "behind," or "more" can require logical, numerical, and relational reasoning. How are such words learned by children? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural‐network‐based visual question answering models apparently can learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learned by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spatial and numerical reasoning. Furthermore, we find that these models can learn the meanings of logical connectives "and" and "or" without any prior knowledge of logical reasoning as well as early evidence that they are sensitive to alternative expressions when interpreting language. Finally, we show that word learning difficulty is dependent on the frequency of models' input. Our findings offer proof‐of‐concept evidence that it is possible to learn the nuanced interpretations of function words in a visually grounded context by using non‐symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

23. Diagram Perception Networks for Textbook Question Answering via Joint Optimization.

Author: Ma, Jie, Liu, Jun, Chai, Qi, Wang, Pinghui, and Tao, Jing
Subjects: *QUESTION answering systems, *CONVOLUTIONAL neural networks, *TEXTBOOKS
Abstract: Textbook question answering requires a system to answer questions with or without diagrams accurately, given multimodal contexts that include rich paragraphs and diagrams. Existing methods usually utilize a pipelined way to extract the most relevant paragraph from multimodal contexts and only employ convolutional neural networks to comprehend diagram semantics under the supervision of answer labels. The former will result in error accumulation, while the latter will lead to poor diagram understanding. To provide a remedy for the above issues, we propose an end-to-end DIagraM Perception network for textbook question answering (DIMP), which is jointly optimized by the supervision of relation predicting, diagram classification, and question answering. Specifically, knowledge extracting is regarded as a sequence classification task and optimized through the supervision of answer labels to alleviate error accumulation. To capture diagram semantics effectively, DIMP uses an explicit relation-aware method that first parses a diagram into several graphs under specific relations and then grasps the information propagation within them. Evaluation on two benchmark datasets shows that our method achieves competitive or better results without large data pre-training and constructing auxiliary tasks compared with current state-of-the-art methods. We provide comprehensive ablation studies and thorough analyses to determine what factors contribute to this success. We also make in-depth analyses for relational graph learning and joint optimization. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

24. Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering.

Author: YING-JIA LIN, CHING-SHAN TSENG, and HUNG-YU KAO
Subjects: EXPLANATION
Abstract: Recent studies leveraging object detection as the preliminary step for Visual Question Answering (VQA) ignore the relationships between different objects inside an image based on the textual question. In addition, the previous VQA models work like black-box functions, which means it is difficult to explain why a model provides such answers to the corresponding inputs. To address the issues above, we propose a new model structure to strengthen the representations for different objects and provide explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then create relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator to help our model produce reasonable explanations. Simultaneously, the generated explanations are incorporated with the visual features using a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method performs better on the VQA task than several baselines. In addition, incorporation with the explanation generator can provide reasonable explanations along with the predicted answers. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

25. Dual modality prompt learning for visual question-grounded answering in robotic surgery.

Author: Zhang, Yue, Fan, Wanshu, Peng, Peixi, Yang, Xin, Zhou, Dongsheng, and Wei, Xiaopeng
Subjects: SURGICAL robots, VISUAL learning, LANGUAGE models, MODAL logic
Abstract: With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of the VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

26. IMCN: Improved modular co-attention networks for visual question answering.

Author: Liu, Cheng, Wang, Chao, and Peng, Yan
Subjects: PROBLEM solving
Abstract: Many existing Visual Question Answering (VQA) methods use traditional attention mechanisms to focus on each region of the input image and each word of the input question and achieve good performance. However, the most obvious limitation of traditional attention mechanisms is that the module always generates a weighted average based on a specific query. When none of the regions and words satisfies the query, the generated vectors, which are noisy information, may lead to incorrect predictions. In this paper, we propose an Improved Modular Co-attention Network (IMCN) by incorporating the Attention on Attention (AoA) module into the self-attention module and the co-attention module to solve this problem. AoA adds another attention process by using element-wise multiplication on the information vector and the attention gate, which are both generated from the attention result and the current context. With AoA, the attended information obtained by the model is more useful. We also introduce an Improved Multimodal Fusion Network (IMFN), which leverages various branches to achieve hierarchical fusion, to fuse visual features and textual features for further improvements. We conduct extensive experiments on the VQA-v2 dataset to verify the effectiveness of the proposed modules, and experimental results demonstrate that our model outperforms the existing methods. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
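
Note on record 26: the abstract states that AoA multiplies an information vector element-wise with an attention gate, both generated from the attention result and the current context. The sketch below illustrates that one step under stated assumptions: both vectors come from a linear map of the concatenated context and attention result, and the gate is passed through a sigmoid. The names `attention_on_attention`, `W_info`, `b_info`, `W_gate`, and `b_gate` are hypothetical, and this is not the IMCN code.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def attention_on_attention(query, attended, W_info, b_info, W_gate, b_gate):
    """Gate an attention result with a second, element-wise attention step.

    query    : current context vector (length d_q)
    attended : result of an ordinary attention module (length d_v)
    W_info, W_gate : weight matrices with d_out rows and d_q + d_v columns
    b_info, b_gate : bias vectors of length d_out
    """
    # Both the information vector and the gate see the same input:
    # the context concatenated with the attention result.
    ctx = list(query) + list(attended)

    # Information vector: a linear map of the concatenated input.
    info = [v + b for v, b in zip(matvec(W_info, ctx), b_info)]

    # Attention gate: the same kind of map squashed into (0, 1).
    gate = [sigmoid(v + b) for v, b in zip(matvec(W_gate, ctx), b_gate)]

    # Element-wise product: the gate decides how much of each
    # information component is passed on.
    return [g * i for g, i in zip(gate, info)]
```

When the gate saturates near zero the output is suppressed even though ordinary attention would still have returned a weighted average, which is the failure case the abstract describes for queries that match nothing.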

27. Relational reasoning and adaptive fusion for visual question answering.

Author: Shen, Xiang, Han, Dezhi, Zong, Liang, Guo, Zihan, and Hua, Jie
Subjects: PSEUDOPOTENTIAL method
Abstract: Visual relationship modeling plays an indispensable role in visual question answering (VQA). VQA models need to fully understand the visual scene and positional relationships within the image to answer complex reasoning questions involving visual object relationships. Accurate reasoning and an understanding of the relationships between different visual objects are particularly crucial. However, most reasoning models used in current VQA tasks only use simple attention mechanisms to model visual object relationships and ignore the potential for effective modeling using rich visual object features during the learning process. This work proposes an effective visual object Relationship Reasoning and Adaptive Fusion (RRAF) model to address the shortcomings of existing VQA model research. RRAF can simultaneously model visual objects' position, appearance, and semantic features and uses an adaptive fusion mechanism to achieve fine-grained multimodal reasoning and fusion. Specifically, we designed an effective image encoder to model and learn the relationship between the position and appearance features of visual objects. In addition, in the co-attention module, we employ semantic information from the question to focus on critical visual objects. Finally, we use an adaptive fusion mechanism to reassign weights and fuse different modalities of features to effectively predict the answer. Experimental results show that the RRAF model outperforms current state-of-the-art methods on the VQA 2.0 and GQA datasets, especially in visual object counting problems. We also conducted extensive ablation experiments to demonstrate the effectiveness of the RRAF model, achieving an overall accuracy of 71.33% and 57.83% on the VQA 2.0 and GQA datasets, respectively. Code is available at https://github.com/shenxiang-vqa/RRAF. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

28. Survey of Multimodal Medical Question Answering.

Author: Demirhan, Hilmi and Zadrozny, Wlodek
Subjects: *ARTIFICIAL intelligence, *BIBLIOMETRICS, *COMPUTER vision, *MEDICAL sciences
Abstract: Multimodal medical question answering (MMQA) is a vital area bridging healthcare and Artificial Intelligence (AI). This survey methodically examines the MMQA research published in recent years. We collect academic literature through Google Scholar, applying bibliometric analysis to the publications and datasets used in these studies. Our analysis uncovers the increasing interest in MMQA over time, with diverse domains such as natural language processing, computer vision, and large language models contributing to the research. The AI methods used in multimodal question answering in the medical domain are a prominent focus, accompanied by applicability of MMQA to the medical field. MMQA in the medical field has its unique challenges due to the sensitive nature of medicine as a science dealing with human health. The survey reveals MMQA research to be in an exploratory stage, discussing different methods, datasets, and potential business models. Future research is expected to focus on application development by big tech companies, such as MedPalm. The survey aims to provide insights into the current state of multimodal medical question answering, highlighting the growing interest from academia and industry. The identified research gaps and trends will guide future investigations and encourage collaborative efforts to advance this transformative field. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

29. Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering.

Author: Lu, Qiwen, Chen, Shengbo, and Zhu, Xiaoke
Subjects: LANGUAGE & languages
Abstract: Language bias stands as a noteworthy concern in visual question answering (VQA), wherein models tend to rely on spurious correlations between questions and answers for prediction. This prevents the models from effectively generalizing, leading to a decrease in performance. In order to address this bias, we propose a novel modality fusion collaborative de-biasing algorithm (CoD). In our approach, bias is considered as the model's neglect of information from a particular modality during prediction. We employ a collaborative training approach to facilitate mutual modeling between different modalities, achieving efficient feature fusion and enabling the model to fully leverage multimodal knowledge for prediction. Our experiments on various datasets, including VQA-CP v2, VQA v2, and VQA-VS, using different validation strategies, demonstrate the effectiveness of our approach. Notably, employing a basic baseline model resulted in an accuracy of 60.14% on VQA-CP v2. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

30. Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering.

Author: YAN WANG, PEIZE LI, QINGYI SI, HANWEN ZHANG, WENYU ZANG, ZHENG LIN, and PENG FU
Abstract: Knowledge-based visual question answering not only needs to answer the questions based on images but also incorporates external knowledge to study reasoning in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture the question-related and semantics-rich vision-language connections. Most existing solutions model simple intra-modality relation or represent cross-modality relation using a single vector, which makes it difficult to effectively model complex connections between visual features and question features. Thus, we propose a cross-modality multiple relations learning model, aiming to better enrich cross-modality representations and construct advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that represent the rich cross-modality relations. The various cross-modality relations link the textual question to the related visual objects. These multi-modality triplets efficiently align the visual objects and corresponding textual answers. Second, to encourage multiple relations to better align with different semantic relations, we further formulate a novel global-local loss. The global loss enables the visual objects and corresponding textual answers close to each other through cross-modality relations in the vision-language space, and the local loss better preserves semantic diversity among multiple relations. Experimental results on the Outside Knowledge VQA and Knowledge-Routed Visual Question Reasoning datasets demonstrate that our model outperforms the state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

31. Knowledge enhancement and scene understanding for knowledge-based visual question answering.

Author: Su, Zhenqiang and Gou, Gang
Subjects: IMAGE retrieval, SEMANTICS
Abstract: Knowledge-based visual question answering calls for not only paying attention to the visual content of images but also the support of relevant outside knowledge for improved question and answer thinking. The semantics of the questions should not be overlooked since knowledge retrieval relies on more than just visual information. This paper first proposed a question-based semantic retrieval strategy to compensate for the absence of image retrieval knowledge in order to better combine visual and knowledge information. Secondly, image caption is added to help the model better achieve scene understanding. Finally, modal knowledge is represented and accumulated through the triplets. Experimental results on the OK-VQA dataset show that the proposed method achieves an improvement of 4.24% and 1.90% over the two baseline methods, respectively, which proves the effectiveness of this method. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

32. Improving VQA via Dual-Level Feature Embedding Network.

Author: Song, Yaru, Xu, Huahu, and Fang, Dikai
Subjects: INTERACTIVE learning
Abstract: Visual Question Answering (VQA) has sparked widespread interest as a crucial task in integrating vision and language. VQA primarily uses attention mechanisms to associate relevant visual regions with input questions and thereby answer questions effectively. The detection-based features extracted by the object detection network aim to acquire the visual attention distribution on a predetermined detection frame and provide object-level insights to answer questions about foreground objects more effectively. However, they cannot answer questions about background forms without detection boxes due to the lack of fine-grained details, which is the advantage of grid-based features. In this paper, we propose a Dual-Level Feature Embedding (DLFE) network, which effectively integrates grid-based and detection-based image features in a unified architecture to realize the complementary advantages of both features. Specifically, in DLFE, firstly, a novel Dual-Level Self-Attention (DLSA) module is proposed to mine the intrinsic properties of the two features, where Positional Relation Attention (PRA) is designed to model the position information. Then, we propose a Feature Fusion Attention (FFA) to address the semantic noise caused by the fusion of two features and construct an alignment graph to enhance and align the grid and detection features. Finally, we use co-attention to learn the interactive features of the image and question and answer questions more accurately. Our method has significantly improved compared to the baseline, increasing accuracy from 66.01% to 70.63% on the test-std dataset of VQA 1.0 and from 66.24% to 70.91% for the test-std dataset of VQA 2.0. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

33. Toward Unsupervised Visual Reasoning: Do Off-the-Shelf Features Know How to Reason?

Author: Monika Wysoczanska, Tom Monnier, Tomasz Trzcinski, and David Picard
Subjects: Visual reasoning, visual question answering, representation learning, object-centric representations, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: Recent advances in visual representation learning allowed for the construction of a plethora of powerful features that are ready to use for numerous downstream tasks. Contrary to existing representation evaluations typically based on image or pixel-wise classification tasks, the goal of this work is to assess how well these features preserve meaningful information about the objects contained in a given image, such as their spatial locations, their visual properties, or their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. Our underlying assumption is that reasoning performances are strongly correlated with the quality of visual representations. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module of limited capacity and trained on the frozen visual representations to be evaluated in a spirit similar to standard feature evaluations relying on shallow networks. This involves constraining the complexity of the reasoning module as well as the size of its input. Using the proposed evaluation framework, we compare two types of visual representations, namely dense local features, and object-centric ones, against the performances of a perfect image representation using the ground truth. We make three key findings: 1) all considered, visual representations are far from extracting perfect visual information from a reasoning standpoint, 2) object-centric features better preserve the critical information necessary to perform basic reasoning, and 3) neither of the two types of visual representation prevents learning spurious correlations when confronted with a smaller training set. These findings stand in opposition to the excellent performances obtained by such off-the-shelf representations in typical evaluation protocols.
Published: 2024
Full Text: View/download PDF

34. Survey of Multimodal Medical Question Answering

Author: Hilmi Demirhan and Wlodek Zadrozny
Subjects: multimodal medical question answering, natural language processing, visual question answering, medical question answering, survey, artificial intelligence, Neurosciences. Biological psychiatry. Neuropsychiatry, RC321-571, Computer applications to medicine. Medical informatics, R858-859.7
Abstract: Multimodal medical question answering (MMQA) is a vital area bridging healthcare and Artificial Intelligence (AI). This survey methodically examines the MMQA research published in recent years. We collect academic literature through Google Scholar, applying bibliometric analysis to the publications and datasets used in these studies. Our analysis uncovers the increasing interest in MMQA over time, with diverse domains such as natural language processing, computer vision, and large language models contributing to the research. The AI methods used in multimodal question answering in the medical domain are a prominent focus, accompanied by applicability of MMQA to the medical field. MMQA in the medical field has its unique challenges due to the sensitive nature of medicine as a science dealing with human health. The survey reveals MMQA research to be in an exploratory stage, discussing different methods, datasets, and potential business models. Future research is expected to focus on application development by big tech companies, such as MedPalm. The survey aims to provide insights into the current state of multimodal medical question answering, highlighting the growing interest from academia and industry. The identified research gaps and trends will guide future investigations and encourage collaborative efforts to advance this transformative field.
Published: 2023
Full Text: View/download PDF

35. Beyond chat-GPT: a BERT-AO approach to custom question answering system

Author: Sophia, J. Jinu and Jacob, T. Prem
Published: 2024
Full Text: View/download PDF

36. Debiased Visual Question Answering via the perspective of question types.

Author: Huai, Tianyu, Yang, Shuwen, Zhang, Junhang, Zhao, Jiabao, and He, Liang
Subjects: *COUNTERFACTUALS (Logic), *ANTILOCK brake systems in automobiles, *ANNOTATIONS
Abstract: Visual Question Answering (VQA) aims to answer questions according to the given image. However, current VQA models tend to rely solely on textual information from the questions and ignore the visual information in the images to get answers, which is caused by bias that is generated during the training phase. Previous studies have shown that bias in VQA is mainly caused by the text modality, and our analysis suggests that question type is a crucial factor in bias formation. To address this bias, we proposed a self-supervised method including the Against Biased Samples (ABS) module that performs targeted debiasing by selecting samples that are prone to bias, and the Shuffle Question Types (SQT) module that constructs negative samples by randomly replacing the question types of the samples selected by the ABS, to interrupt the shortcuts from question type to answer. Our approach mitigates the question-to-answer bias without using external annotations, overcoming the language prior problem. Additionally, we designed a new objective function for negative samples. Experimental results indicate that our method outperforms both self-supervised-based and supervised-based state-of-the-art approaches, achieving 70.36% accuracy on the VQA-CP v2 dataset. • We propose a framework that uses image–question pairs to construct counterfactual samples. • SQT and ABS can target data that is prone to bias in the training set to de-bias. • We introduce a novel loss function based on the constructed negative samples. • Our method can achieve state-of-the-art performance on the benchmark VQA-CP v2 dataset. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

37. VL-Few: Vision Language Alignment for Multimodal Few-Shot Meta Learning.

Author: Ma, Han, Fan, Baoyu, Ng, Benjamin K., and Lam, Chan-Tong
Subjects: VISION, STIMULUS generalization, NEUROLINGUISTICS
Abstract: Complex tasks in the real world involve different modal models, such as visual question answering (VQA). However, traditional multimodal learning requires a large amount of aligned data, such as image text pairs, and constructing a large amount of training data is a challenge for multimodal learning. Therefore, we propose VL-Few, which is a simple and effective method to solve the multimodal few-shot problem. VL-Few (1) proposes the modal alignment, which aligns visual features into language space through a lightweight model network and improves the multimodal understanding ability of the model; (2) adopts few-shot meta learning in the multimodal problem, which constructs a few-shot meta task pool to improve the generalization ability of the model; (3) proposes semantic alignment to enhance the semantic understanding ability of the model for the task, context, and demonstration; (4) proposes task alignment that constructs training data into the target task form and improves the task understanding ability of the model; (5) proposes generation alignment, which adopts the token-level training and multitask fusion loss to improve the generation ability of the model. Our experimental results show the effectiveness of VL-Few for multimodal few-shot problems. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

38. OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement.

Author: Yan, Feng, Silamu, Wushouer, Chai, Yachuang, and Li, Yanbing
Abstract: Most visual question answering (VQA) models cannot understand the scene text in the image. Poor text reading ability is a significant reason for the poor performance of current VQA models. To solve this problem, we designed a co-attention model that incorporates the scene text features in images. We detect and obtain the OCR tokens in the image through an OCR model, which is conducive to further understanding the image. We design a model based on a co-attention mechanism, including a question self-attention unit, a question-guided image visual attention unit and a question-guided image OCR token attention unit. The redundant question information is filtered by the question self-attention module. The question-guided attention module is used to obtain the final visual features and OCR token features in the image. The information from the question text features, visual image features and OCR token features in the image is fused. We design a classifier which can get an answer from the fixed answer set or directly copy the text detected by the OCR model as the final answer so that the model can answer questions about the text in the image. The experimental results show that our model is improved. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

39. VL-Meta: Vision-Language Models for Multimodal Meta-Learning.

Author: Ma, Han, Fan, Baoyu, Ng, Benjamin K., and Lam, Chan-Tong
Subjects: *MACHINE learning, *LANGUAGE models, *MULTIMODAL user interfaces, *ARTIFICIAL intelligence, *TASK performance, *LEARNING ability, *STIMULUS generalization
Abstract: Multimodal learning is a promising area in artificial intelligence (AI) that can make the model understand different kinds of data. Existing works re-train a new model based on pre-trained models, which requires much data, computation power, and time. However, this is difficult to achieve in low-resource or small-sample situations. Therefore, we propose VL-Meta, Vision Language Models for Multimodal Meta Learning. It (1) presents the vision-language mapper and multimodal fusion mapper, which are light model structures that use existing pre-trained models to map images into the language feature space and save training data, computation power, and time; (2) constructs the meta-task pool, which needs only a small amount of data to construct enough training data and improves the generalization of the model in learning the data knowledge and task knowledge; (3) proposes the token-level training that can align inputs with the outputs during training to improve the model performance; and (4) adopts the multi-task fusion loss to learn the different abilities for the models. It achieves a good performance on the Visual Question Answering (VQA) task, which shows the feasibility and effectiveness of the model. This solution can help blind or visually impaired individuals obtain visual information. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

40. Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering.

Author: Jiang, Jingjing, Liu, Ziyi, and Zheng, Nanning
Subjects: *LANGUAGE models
Abstract: Benefiting from large-scale pretrained vision language models (VLMs), the performance of visual question answering (VQA) has approached human oracles. However, finetuning such models on limited data often suffers from overfitting and poor generalization issues, leading to a lack of model robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to the ability of models to defend against visual and linguistic input variations, as well as shortcut learning involved in inputs. Generally, the representations obtained by pretrained VLMs inevitably contain irrelevant and redundant information for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose Correlation Information Bottleneck (CIB), which seeks a tradeoff between compression and redundancy in representations by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
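
Note on the record above: the abstract describes minimizing the mutual information between inputs and representations while maximizing it between outputs and representations. As a hedged reference point, this matches the general shape of the classical information bottleneck objective written below. This is the standard textbook form, not the paper's CIB bound; the symbols X (input), Y (output), Z (representation) and the trade-off weight β are assumptions introduced here.

```latex
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(X; Z), \qquad \beta > 0
```

A larger β favors stronger compression of the input, while a smaller β favors retaining information that predicts the output.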

41. Self-supervised knowledge distillation in counterfactual learning for VQA.

Author: Bi, Yandong, Jiang, Huajie, Zhang, Hanfu, Hu, Yongli, and Yin, Baocai
Subjects: *COUNTERFACTUALS (Logic), *LEARNING modules
Abstract: As a popular cross-modal reasoning task, Visual Question Answering (VQA) has achieved great progress in recent years. However, the issue of language bias has always affected the reliability of VQA models. To address this problem, counterfactual learning methods are proposed to learn more robust features to mitigate the bias problem. However, current counterfactual learning approaches mainly focus on generating synthesized samples and assigning answers to them, neglecting the relationship between factual and original data, which hinders robust feature learning for effective reasoning. To overcome this limitation, we propose a Self-supervised Knowledge Distillation approach in Counterfactual Learning for VQA, dubbed as VQA-SkdCL, which utilizes a self-supervised constraint to make good use of the hidden knowledge in the factual samples, enhancing the robustness of VQA models. We demonstrate the effectiveness of the proposed approach on VQA v2, VQA-CP v1, and VQA-CP v2 datasets and our approach achieves excellent performance. • We study the relations of factual and original samples in counterfactual learning. • We design a self-supervised knowledge distillation module to learn robust features. • Our method is model-agnostic, end-to-end trainable, and easy to implement. • Extensive experiments on three datasets show the effectiveness of our approach. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

42. MSGeN: Multimodal Selective Generation Network for Grounded Explanations.

Author: Li, Dingbang, Chen, Wenzhou, and Lin, Xin
Subjects: NATURAL languages, EXPLANATION
Abstract: Modern models have shown impressive capabilities in visual reasoning tasks. However, the interpretability of their decision-making processes remains a challenge, causing uncertainty in their reliability. In response, we present the Multimodal Selective Generation Network (MSGeN), a novel approach to enhancing interpretability and transparency in visual reasoning. MSGeN can generate explanations that seamlessly integrate diverse modal information, providing a comprehensive and intuitive understanding of its decisions. The model consists of five collaborative components: (1) the Multimodal Encoder, which encodes and fuses input data; (2) the Reasoner, which is responsible for generating stepwise inference states; (3) the Selector, which is utilized for selecting the modality for each step's explanation; (4) the Speaker, which generates natural language descriptions; and (5) the Pointer, which produces visual cues. These components work harmoniously to generate explanations enriched with natural language context and visual cues. Our extensive experimentation demonstrates that MSGeN surpasses existing multimodal explanation generation models across various metrics, including BLEU, METEOR, ROUGE, CIDEr, SPICE, and Grounding. We also show detailed visual examples highlighting MSGeN's ability to generate comprehensive and coherent explanations, showcasing its effectiveness through practical case studies. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

43. Semi-Supervised Implicit Augmentation for Data-Scarce VQA.

Author: Dodla, Bhargav, Hegde, Kartik, and Rajagopalan, A. N.
Subjects: OPEN-ended questions, DATA augmentation, LANGUAGE models, QUESTION answering systems, PROBLEM solving
Abstract: Vision-language models (VLMs) have demonstrated increasing potency in solving complex vision-language tasks in the recent past. Visual question answering (VQA) is one of the primary downstream tasks for assessing the capability of VLMs, as it helps in gauging the multimodal understanding of a VLM in answering open-ended questions. The vast contextual information learned during the pretraining stage in VLMs can be utilised effectively to finetune the VQA model for specific datasets. In particular, special types of VQA datasets, such as OK-VQA, A-OKVQA (outside knowledge-based), and ArtVQA (domain-specific), have a relatively smaller number of images and corresponding question-answer annotations in the training set. Such datasets can be categorised as data-scarce. This hinders the effective learning of VLMs due to the low information availability. We introduce SemIAug (Semi-Supervised Implicit Augmentation), a model and dataset agnostic strategy specially designed to address the challenges faced by limited data availability in the domain-specific VQA datasets. SemIAug uses the annotated image-question data present within the chosen dataset and augments it with meaningful new image-question associations. We show that SemIAug improves the VQA performance on data-scarce datasets without the need for additional data or labels. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

44. Multimodal Bi-direction Guided Attention Networks for Visual Question Answering.

Author: Cai, Linqin, Xu, Nuoying, Tian, Hang, Chen, Kejia, and Fan, Haodu
Subjects: QUESTION answering systems, NATURAL language processing, COMPUTER vision, VISUAL learning
Abstract: Visual question answering (VQA) has become a research hotspot in the computer vision and natural language processing fields. A core problem in VQA is how to fuse multi-modal features from images and questions. This paper proposes a Multimodal Bi-direction Guided Attention Network (MBGAN) for VQA by combining visual relationships and attention to achieve more refined feature fusion. Specifically, self-attention is used to extract image features and text features, and guided attention is applied to obtain the correlation between each image area and the related question. To obtain the relative position relationship of different objects, position attention is further introduced to realize relationship correlation modeling and enhance the matching ability of multi-modal features. Given an image and a natural language question, the proposed MBGAN learns visual relation inference and question attention networks in parallel to achieve the fine-grained fusion of the visual features and the textual features; the final answers can then be obtained accurately through model stacking. MBGAN achieves 69.41% overall accuracy on the VQA-v1 dataset, 70.79% overall accuracy on the VQA-v2 dataset, and 68.79% overall accuracy on the COCO-QA dataset, which shows that the proposed MBGAN outperforms most of the state-of-the-art models. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

45. A visual questioning answering approach to enhance robot localization in indoor environments.

Author: Diego Peña-Narvaez, Juan, Martín, Francisco, Miguel Guerrero, José, and Pérez-Rodríguez, Rodrigo
Subjects: LANGUAGE models, ROBOTS, VISUAL perception
Abstract: Navigating robots with precision in complex environments remains a significant challenge. In this article, we present an innovative approach to enhance robot localization in dynamic and intricate spaces like homes and offices. We leverage Visual Question Answering (VQA) techniques to integrate semantic insights into traditional mapping methods, formulating a novel position hypothesis generation to assist localization methods, while also addressing challenges related to mapping accuracy and localization reliability. Our methodology combines a probabilistic approach with the latest advances in Monte Carlo Localization methods and Visual Language models. The integration of our hypothesis generation mechanism results in more robust robot localization compared to existing approaches. Experimental validation demonstrates the effectiveness of our approach, surpassing state-of-the-art multi-hypothesis algorithms in both position estimation and particle quality. This highlights the potential for accurate self-localization, even in symmetric environments with large corridor spaces. Furthermore, our approach exhibits a high recovery rate from deliberate position alterations, showcasing its robustness. By merging visual sensing, semantic mapping, and advanced localization techniques, we open new horizons for robot navigation. Our work bridges the gap between visual perception, semantic understanding, and traditional mapping, enabling robots to interact with their environment through questions and enrich their map with valuable insights. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

46. Design of knowledge incorporated VQA based on spatial GCNN with structured sentence embedding and linking algorithm.

Author: Koshti, Dipali, Gupta, Ashutosh, and Kalla, Mukesh
Subjects: *CONVOLUTIONAL neural networks, *FEATURE extraction, *COMPUTER vision, *COMMON sense, *ALGORITHMS
Abstract: Visual Question Answering (VQA) is a computer vision task that requires a system to infer an answer to a text-based question about an image. Prior approaches did not take into account an image's positional information or the questions' grammatical and semantic relationships during image and question featurization, which leads to incorrect answers. To overcome this issue, a CNN-Graph based LSTM with optimized BP featurization technique is introduced for feature extraction of the image as well as the question. The position of the subjects in the image is determined using a CNN with a dropout layer and optimized momentum backpropagation during the extraction of image features, without losing any image data. Then, using a graph-based LSTM with loopy backpropagation, the questions' syntactic and semantic dependencies are retrieved. However, due to their lack of external knowledge about the input image, existing approaches are unable to respond to common sense knowledge-based (open domain) questions. As a result, the proposed Spatial GCNN knowledge retrieval with PDB model and Spatial Graph Convolutional Neural Network, which retrieve external data from Wikidata, are used to address the open domain problem. Then the Probabilistic Discriminative Bayesian model-based attention mechanism predicts the answer by referring to all concepts in the question. Thus, the proposed method answers open domain questions with a high accuracy of 88.30%. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

47. ConfigILM: A general purpose configurable library for combining image and language models for visual question answering

Author: Leonard Hackel, Kai Norman Clasen, and Begüm Demir
Subjects: Visual question answering, Natural language processing, Machine learning, Open source, Python, Image analysis, Computer software, QA76.75-76.765
Abstract: ConfigILM is an open-source Python library for rapid iterative development of image-language models for visual question answering in PyTorch. It provides a convenient implementation for seamlessly combining image and language models from two popular PyTorch libraries, timm and huggingface. These libraries allow a variety of configurations of models without additional implementation effort. The monolithic interface provided by ConfigILM simplifies the exchange of components of a considered model and offers possibilities for developing new image-language models based on recombining the selected encoders. Additionally, the library provides pre-built and throughput-optimized PyTorch dataloaders. We also provide a guideline document that contains installation instructions, tutorial examples, and a complete discussion of the monolithic interface to the library. ConfigILM is released under the MIT License, encouraging its use in academic and commercial environments. The source code and documentation of ConfigILM are available at https://github.com/lhackel-tub/ConfigILM.
Published: 2024
Full Text: View/download PDF

48. The Potential of a Visual Dialogue Agent In a Tandem Automated Audio Description System for Videos.

Author: Stangl, Abigale, Ihorn, Shasta, Siu, Yue-Ting, Bodi, Aditya, Castanon, Mar, Narins, Lothar D, and Yoon, Ilmi
Subjects: SOUND systems, LOW vision, VIDEO production & direction, ARTIFICIAL intelligence, MACHINE learning
Abstract: The relentless pace of video production exacerbates the digital accessibility gap that individuals who are blind or low vision (BLV) face on a daily basis, resulting in disproportionate exclusion from community opportunities and risk management. Whereas previous automated audio description (AD) systems provide single-tool approaches for delivering minimum viable description (MVD) or delivering on-demand visual question answering (VQA), we present a tandem AI-based AD tool that combines MVD and on-demand VQA. A user study with 26 BLV individuals explored how the tandem system may be used under the conditions of delivering MVD and/or on-demand VQA with AI-only or human-in-the-loop support. When each tool was used in isolation, AI-only conditions scored significantly lower in both user enjoyment and comprehension. When used in tandem, AI-only conditions matched outcomes delivered with human-in-the-loop, which suggests that AI-only AD tools may be most effective when both types of tools are used in tandem. A multimodal analysis of interactions with the tandem system revealed areas for system improvement in terms of the timing of AD delivery and accurate content delivery. We discuss how the use of both types of tools in a tandem system can mitigate some of the digital frictions that have plagued efforts in machine learning and automated tools for accessibility. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

49. Multi-modal co-attention relation networks for visual question answering.

Author: Guo, Zihan and Han, Dezhi
Subjects: *PROBLEM solving, *COMPUTER vision
Abstract: Current mainstream visual question answering (VQA) models only model object-level visual representations and ignore the relationships between visual objects. To solve this problem, we propose a Multi-Modal Co-Attention Relation Network (MCARN) that combines co-attention and visual object relation reasoning. MCARN can model visual representations at both the object level and the relation level, and stacking its visual relation reasoning module can further improve the accuracy of the model on Number questions. Inspired by MCARN, we propose two models, RGF-CA and Cos-Sin+CA, which combine co-attention with the relative geometry features of visual objects, and achieve excellent comprehensive performance and higher accuracy on Other questions, respectively. Extensive experiments and ablation studies based on the benchmark dataset VQA 2.0 prove the effectiveness of our models, and also verify the synergy of co-attention and visual object relation reasoning in the VQA task. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

50. VQAPT: A New visual question answering model for personality traits in social media images.

Author: Biswas, Kunal, Shivakumara, Palaiahnakote, Pal, Umapada, Liu, Cheng-Lin, and Lu, Yue
Subjects: *PERSONALITY, *TEXT recognition, *HUMAN facial recognition software, *MULTIPLE personality, *TRANSFORMER models, *EMPLOYEE reviews, *NATURAL language processing, *SOCIAL media
Abstract: • A new VQA for Big-Five-Factors personality traits identification. • A deep model for integrating text and person/face recognition-based features. • A dynamic Text-Object graph and a CLIP Transformer encoder have been explored. • A new dataset has been created, which will be released to the public for research. Visual Question Answering (VQA) for personality trait images on social media is challenging because of multiple emotions and actions with complex backgrounds in social media images. This work aims at developing a new VQA model for different personality traits (VQAPT) identification in a single image. This work considers the Big Five Factors (BFF) for personality traits, namely Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. VQA is proposed based on the observation that multiple personality traits can be seen in a single image. We propose a model integrating text recognition and person/face recognition to derive the unique relationship between the text and the person's action in the image. Furthermore, a dynamic text-object graph for personality traits identification is constructed according to the query. For understanding a query, we explore the Contrastive Language-Image Pre-trained (CLIP) transformer encoder in this work. Since it is the first work of its kind, we have created a new dataset under this work for evaluation and the dataset is available publicly as mentioned in Section 4. The effectiveness of the proposed method is also evaluated on two benchmark datasets, namely TextVQA for VQA and PTI for personality traits identification. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
