677 results for "Visual Question Answering"
Search Results
2. WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
- Author
-
Chen, Pingyi, Zhu, Chenglu, Zheng, Sunyi, Li, Honglin, and Yang, Lin
- Published
- 2025
- Full Text
- View/download PDF
3. Overview of the Trauma THOMPSON Challenge at MICCAI 2023
- Author
-
Zhuo, Yupeng, Kirkpatrick, Andrew W., Couperus, Kyle, Tran, Oanh, Beck, Jonah, DeVane, DeAnna, Candelore, Ross, McKee, Jessica, Colombo, Christopher, Gorbatkin, Chad, Birch, Eleanor, Duerstock, Bradley, and Wachs, Juan
- Published
- 2025
- Full Text
- View/download PDF
4. The Trauma THOMPSON Challenge Report MICCAI 2023
- Author
-
Zhuo, Yupeng, Kirkpatrick, Andrew W., Couperus, Kyle, Tran, Oanh, and Wachs, Juan
- Published
- 2025
- Full Text
- View/download PDF
5. Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
- Author
-
Wang, Haibo, and Ge, Weifeng
- Published
- 2025
- Full Text
- View/download PDF
6. Multi-stage reasoning on introspecting and revising bias for visual question answering.
- Author
-
Liu, An-An, Lu, Zimu, Xu, Ning, Liu, Min, Yan, Chenggang, Zheng, Bolun, Lv, Bo, Duan, Yulong, Shao, Zhuang, and Li, Xuanya
- Subjects
ARTIFICIAL intelligence, ATTENTIONAL bias, VOCABULARY, FORECASTING, LANGUAGE & languages
- Abstract
Visual Question Answering (VQA) is a task that involves predicting an answer to a question based on the content of an image. However, recent VQA methods have relied more on language priors between the question and answer than on the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. However, the bias can be divided into two categories: good bias and bad bias. Good bias can benefit answer prediction, while bad bias may associate the model with unrelated information. Therefore, instead of excluding good and bad bias indiscriminately as existing debiasing methods do, we propose a bias discrimination module to distinguish them. Additionally, bad bias may reduce the model's reliance on image content during answer reasoning and thus leave image features largely un-updated. To tackle this, we leverage Markov theory to construct a Markov field with image regions and question words as nodes. This helps update the features of both image regions and question words, thereby facilitating more accurate and comprehensive reasoning about both the image content and the question. We evaluate our network on the VQA v2 and VQA-CP v2 datasets and conduct extensive quantitative and qualitative studies to verify its effectiveness. Experimental results show that our network achieves significant performance gains over previous state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
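Editor's note: the debiasing idea in entry 6 above is easier to see in code. The sketch below is a generic question-only bias branch with a learned gate that decides how much of the language prior to keep, in the spirit of a bias-discrimination module; it is not the authors' implementation, the Markov-field feature updating is omitted, and all module names and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class GatedBiasVQA(nn.Module):
    """Toy VQA head with a question-only branch whose contribution is gated.

    The gate plays the role of a bias discriminator: it can let a helpful
    language prior through ("good bias") or suppress it ("bad bias").
    """
    def __init__(self, q_dim=512, v_dim=512, hidden=512, num_answers=3000):
        super().__init__()
        self.fusion = nn.Sequential(            # joint question+image branch
            nn.Linear(q_dim + v_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers))
        self.q_only = nn.Sequential(            # language-prior branch
            nn.Linear(q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers))
        self.gate = nn.Sequential(              # per-sample scalar in [0, 1]
            nn.Linear(q_dim, 1), nn.Sigmoid())

    def forward(self, q_feat, v_feat):
        joint_logits = self.fusion(torch.cat([q_feat, v_feat], dim=-1))
        prior_logits = self.q_only(q_feat)
        g = self.gate(q_feat)                   # how much of the prior to keep
        return joint_logits + g * prior_logits

# Dummy batch: 4 questions/images already encoded to 512-d features.
model = GatedBiasVQA()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 3000])
```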
7. Learning to enhance aerial video captioning with visual question answering.
- Author
-
Al Mehmadi, Shima M., Bazi, Yakoub, Al Rahhal, Mohamad M., and Zuair, Mansour
- Subjects
DRONE aircraft, REMOTE sensing, VIDEOS
- Abstract
The utilization of Unmanned Aerial Vehicles (UAV) in remote sensing (RS) has witnessed a significant surge, offering valuable insights into Earth dynamics and human activities. However, this has led to a substantial increase in the volume of video data, rendering manual screening and analysis impractical. Consequently, there is a pressing need for the development of automated interpretation models for these aerial videos. In this paper, we propose a novel approach that leverages visual dialogue to enhance aerial video captioning. Our model adopts an encoder-decoder architecture, integrating a Visual Question Answering (VQA) task before the captioning task. The VQA task aims to enrich the captioning process by soliciting additional information about the image content. Specifically, our video encoder utilizes ViT-L/16, while the decoder employs Generative Pre-trained Transformer-2 (Distill-GPT-2). To validate our model, we introduce a novel benchmark dataset named CapERA-VQA, comprising videos accompanied by sets of questions, answers, and captions. Through experimental validation, we demonstrate the effectiveness of our proposed approach in enhancing the automated captioning of aerial videos. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
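Editor's note: entry 7 above describes an encoder-decoder pipeline (ViT-L/16 visual encoder, DistilGPT-2 text decoder) with a VQA stage that feeds the captioner. The sketch below only shows the generic wiring, with visual tokens acting as the memory of a Transformer text decoder; it uses random tensors in place of real ViT features and a plain nn.TransformerDecoder instead of DistilGPT-2, and the dimensions and vocabulary size are made up.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Minimal encoder-decoder: visual tokens -> autoregressive text decoder."""
    def __init__(self, vis_dim=1024, d_model=256, vocab=8000, n_layers=2):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # bridge ViT dim -> decoder dim
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, vis_tokens, token_ids):
        memory = self.vis_proj(vis_tokens)            # (B, N_patches, d_model)
        tgt = self.embed(token_ids)                   # (B, T, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                      # next-token logits

# Stand-ins: 4 clips x 197 ViT-L/16 patch tokens (1024-d), 12-token captions.
vis = torch.randn(4, 197, 1024)
caps = torch.randint(0, 8000, (4, 12))
print(TinyCaptioner()(vis, caps).shape)   # torch.Size([4, 12, 8000])
```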
8. DCF–VQA: Counterfactual Structure Based on Multi–Feature Enhancement
- Author
-
Yang Guan, Ji Cheng, Liu Xiaoming, Zhang Ziming, and Wang Chen
- Subjects
visual question answering, multi-feature enhancement, counterfactual, discrete cosine transform, Mathematics, QA1-939, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Visual question answering (VQA) is a pivotal topic at the intersection of computer vision and natural language processing. This paper addresses the challenges of linguistic bias and bias fusion within invalid regions encountered in existing VQA models due to insufficient representation of multi-modal features. To overcome those issues, we propose a multi-feature enhancement scheme. This scheme involves the fusion of one or more features with the original ones, incorporating discrete cosine transform (DCT) features into the counterfactual reasoning framework. This approach harnesses fine-grained information and spatial relationships within images and questions, enabling a more refined understanding of the indirect relationship between images and questions. Consequently, it effectively mitigates linguistic bias and bias fusion within invalid regions in the model. Extensive experiments are conducted on multiple datasets, including VQA2 and VQA-CP2, employing various baseline models and fusion techniques, resulting in promising and robust performance.
- Published
- 2024
- Full Text
- View/download PDF
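Editor's note: entry 8 above (and its duplicate record, entry 18) augments the usual visual features with discrete cosine transform (DCT) features before counterfactual reasoning. The snippet below shows just the feature-enhancement step under my own assumptions: per-region DCT coefficients are computed, truncated to the low-frequency corner, and concatenated to the original features. The region size, number of kept coefficients, and fusion-by-concatenation are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.fft import dctn

def dct_region_features(region: np.ndarray, keep: int = 8) -> np.ndarray:
    """2-D DCT of a grayscale region, keeping the low-frequency keep x keep block."""
    coeffs = dctn(region, norm="ortho")
    return coeffs[:keep, :keep].ravel()

def enhance(visual_feats: np.ndarray, regions: list) -> np.ndarray:
    """Concatenate original region features with their DCT descriptors."""
    dct_feats = np.stack([dct_region_features(r) for r in regions])
    return np.concatenate([visual_feats, dct_feats], axis=1)

# 36 detected regions: 2048-d appearance features plus 32x32 grayscale crops.
rng = np.random.default_rng(0)
feats = rng.normal(size=(36, 2048))
crops = [rng.normal(size=(32, 32)) for _ in range(36)]
print(enhance(feats, crops).shape)   # (36, 2112) = 2048 + 8*8
```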
9. Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering
- Author
-
Zhongjian Hu, Peng Yang, Fengyuan Liu, Yuan Meng, and Xingyu Liu
- Subjects
visual question answering, knowledge-based visual question answering, large language model, knowledge injection, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Previous works employ Large Language Models (LLMs) such as GPT-3 for knowledge-based Visual Question Answering (VQA). We argue that the inferential capacity of the LLM can be enhanced through knowledge injection. Although methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they may have some limitations, such as the possibility of not being able to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled “Prompting Large Language Models with Knowledge-Injection” (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge injection. Unlike earlier approaches, we adopt the LLM for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. In comparison to existing baselines, our approach achieves accuracy improvements of over 1.3 and 1.7 points on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA, respectively.
- Published
- 2024
- Full Text
- View/download PDF
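Editor's note: the PLLMKI pipeline in entry 9 above prompts an open LLM with hints produced by a vanilla VQA model plus injected knowledge. The helper below only illustrates the prompt-assembly step; the template wording, the field names, and the idea of passing top-k candidate answers with confidences are my assumptions, not the authors' exact prompt.

```python
def build_knowledge_prompt(question: str,
                           caption: str,
                           candidates: list,
                           knowledge: list) -> str:
    """Assemble a knowledge-injected prompt for an open LLM.

    candidates: (answer, confidence) pairs from a vanilla VQA model.
    knowledge:  free-text facts produced by a knowledge-enhancement step.
    """
    cand_str = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    facts = "\n".join(f"- {k}" for k in knowledge)
    return (
        "Answer the visual question using the context below.\n"
        f"Image caption: {caption}\n"
        f"Candidate answers from a VQA model: {cand_str}\n"
        f"Relevant knowledge:\n{facts}\n"
        f"Question: {question}\n"
        "Answer with a single word or short phrase:"
    )

prompt = build_knowledge_prompt(
    question="What sport can be played on this field?",
    caption="A grassy field with white goal posts.",
    candidates=[("soccer", 0.61), ("rugby", 0.22), ("football", 0.10)],
    knowledge=["Goal posts with a crossbar and nets are used in soccer."],
)
print(prompt)
```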
10. Integrating IoT and visual question answering in smart cities: Enhancing educational outcomes
- Author
-
Tian Gao and Guanqi Wang
- Subjects
Smart cities, IoT framework, Visual question answering, Large language models, Smart education technology, Engineering (General). Civil engineering (General), TA1-2040
- Abstract
Emerging as a paradigmatic shift in urban development, smart cities harness the potential of advanced information and communication technologies to seamlessly integrate urban functions, optimize resource allocation, and improve the effectiveness of city management. Within the domain of smart education, the imperative application of Visual Question Answering (VQA) technology encounters significant limitations at the prevailing stage, particularly the absence of a robust Internet of Things (IoT) framework and the inadequate incorporation of large pre-trained language models (LLMs) within contemporary smart education paradigms, especially in addressing zero-shot VQA scenarios, which pose considerable challenges. In response to these constraints, this paper introduces an IoT-based smart city framework that is designed to refine the functionality and efficacy of educational systems. This framework is delineated into four cardinal layers: the data collection layer, data transmission layer, data management layer, and application layer. Furthermore, we introduce the innovative TeachVQA methodology at the application layer, synergizing VQA technology with extensive pre-trained language models, thereby considerably enhancing the dissemination and assimilation of educational content. Evaluative metrics in the VQAv2 and OKVQA datasets substantiate that the TeachVQA methodology not only outperforms existing VQA approaches, but also underscores its profound potential and practical relevance in the educational sector.
- Published
- 2024
- Full Text
- View/download PDF
11. Vision transformer-based visual language understanding of the construction process
- Author
-
Bin Yang, Binghan Zhang, Yilong Han, Boda Liu, Jiniming Hu, and Yiming Jin
- Subjects
Intelligent construction, Computer vision, Vision transformer, Natural language processing, Visual question answering, Engineering (General). Civil engineering (General), TA1-2040
- Abstract
The widespread implementation of surveillance systems on construction sites has led to the accumulation of vast amounts of visual data, highlighting the need for an effective semantic analysis methodology. Natural language, as the most intuitive mode of expression, can significantly enhance the interpretability of such data. The adoption of multi-modality models promotes the interaction between surveillance video and textual data, thereby enabling managers to swiftly comprehend on-site dynamics. This study introduces a Visual Question Answering (VQA) approach for the construction industry and presents a specialized dataset to address the unique requirements of on-site management. Utilizing a Vision Transformer (ViT) architecture, the proposed model conducts feature extraction, fusion and interaction between visual and textual features. An additional projection layer is added to establish a transfer learning strategy that is optimized for construction site data. This novel approach facilitates rapid alignment of visual and language features in the model and is validated through ablation studies. The proposed approach achieves a testing accuracy of 83.8%, effectively converting image data from construction sites into natural language descriptions that enhance the analysis of construction processes. Compared to existing methods, this approach does not rely on object detection and allows for the direct extraction of deep-level semantic information from the on-site images. This study further discusses the feasibility of applying VQA within the architecture, engineering and construction (AEC) industry, examines its limitations, and offers suggestions for viable future directions of development.
- Published
- 2024
- Full Text
- View/download PDF
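Editor's note: entries 11 and 16 are the same construction-site VQA study; its transfer-learning recipe adds a projection layer on top of a ViT backbone so visual and language features align quickly on construction data. Below is a schematic version using pre-extracted features and invented dimensions; the real model performs feature extraction, fusion, and interaction inside the transformer itself.

```python
import torch
import torch.nn as nn

class ProjectionVQA(nn.Module):
    """Frozen-backbone VQA head: project ViT and text features into a shared
    space, fuse them, and classify over a fixed answer set."""
    def __init__(self, vis_dim=768, txt_dim=768, shared=512, num_answers=100):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, shared)   # the added projection layer
        self.txt_proj = nn.Linear(txt_dim, shared)
        self.classifier = nn.Sequential(
            nn.LayerNorm(shared), nn.GELU(), nn.Linear(shared, num_answers))

    def forward(self, vis_feat, txt_feat):
        fused = self.vis_proj(vis_feat) * self.txt_proj(txt_feat)  # elementwise fusion
        return self.classifier(fused)

model = ProjectionVQA()
# Only these new layers are trained; backbone features arrive pre-computed.
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
logits = model(torch.randn(8, 768), torch.randn(8, 768))
print(logits.shape)   # torch.Size([8, 100])
```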
12. Multimodal attention-driven visual question answering for Malayalam.
- Author
-
Kovath, Abhishek Gopinath, Nayyar, Anand, and Sikha, O. K.
- Subjects
DEEP learning, CONVOLUTIONAL neural networks, NATURAL languages, QUESTION answering systems, GENOMES, TOURISM
- Abstract
Visual question answering is a challenging task that necessitates sophisticated reasoning over visual elements to provide an accurate answer to a question. The majority of state-of-the-art VQA models are only applicable to English questions. However, applications such as visual assistance and tourism necessitate the incorporation of multilingual VQA systems. This paper presents an effective deep learning framework for Malayalam visual question answering (MVQA), which can answer a natural language question about an image in Malayalam. As there is no available English–Malayalam VQA dataset, an MVQA dataset was created by translating English question–answer pairs from the Visual Genome dataset. The paper proposes an attention-driven MVQA model on the developed dataset. The proposed MVQA model uses a deep learning-based co-attention mechanism to jointly learn the attention for images and Malayalam questions. Second-order multimodal factorized high-order pooling is used for multimodal feature fusion. Different VQA models using combinations of classical CNNs and RNNs were evaluated on the developed MVQA dataset, and their performance was compared against the proposed attention-driven model. Experimental results show that the proposed attention-driven MVQA model achieves state-of-the-art results compared to other MVQA models on the custom Malayalam VQA dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
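Editor's note: entry 12 above fuses image and Malayalam question features with second-order multimodal factorized high-order pooling. The sketch below implements plain multimodal factorized bilinear (MFB) pooling, the basic building block of that family: project both modalities to out_dim x k dimensions, take an elementwise product, sum-pool over k, then power- and L2-normalize. The dimensions are arbitrary and the higher-order (MFH) cascading is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Multimodal factorized bilinear pooling (one block of an MFH cascade)."""
    def __init__(self, q_dim=1024, v_dim=2048, out_dim=1000, factor_k=5):
        super().__init__()
        self.k = factor_k
        self.q_proj = nn.Linear(q_dim, out_dim * factor_k)
        self.v_proj = nn.Linear(v_dim, out_dim * factor_k)

    def forward(self, q_feat, v_feat):
        joint = self.q_proj(q_feat) * self.v_proj(v_feat)            # (B, out_dim * k)
        joint = joint.view(q_feat.size(0), -1, self.k).sum(-1)       # sum-pool over k
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)   # power normalization
        return F.normalize(joint, dim=-1)                            # L2 normalization

fusion = MFBFusion()
z = fusion(torch.randn(4, 1024), torch.randn(4, 2048))
print(z.shape)   # torch.Size([4, 1000])
```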
13. HRVQA: A Visual Question Answering benchmark for high-resolution aerial images.
- Author
-
Li, Kun, Vosselman, George, and Yang, Michael Ying
- Subjects
COMPUTER vision, URBAN planning, SOURCE code, QUESTION answering systems, SCARCITY, PIXELS
- Abstract
Visual question answering (VQA) is an important and challenging multimodal task in computer vision and photogrammetry. Recently, efforts have been made to bring the VQA task to aerial images, due to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, the development of VQA in this domain is restricted by the huge variation in the appearance, scale, and orientation of the concepts in aerial images, along with the scarcity of well-annotated datasets. In this paper, we introduce a new dataset, HRVQA, which provides a collection of 53,512 aerial images of 1024 × 1024 pixels and semi-automatically generated 1,070,240 QA pairs. To benchmark the understanding capability of VQA models for aerial images, we evaluate the recent methods on the HRVQA dataset. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially the specific attribute-related questions. Our method achieves superior performance in comparison to the previous state-of-the-art approaches. The dataset and the source code are released at https://hrvqa.nl/. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese.
- Author
-
Tran, Khiem Vinh, Phan, Hao Phu, Van Nguyen, Kiet, and Nguyen, Ngan Luu Thuy
- Abstract
In recent years, visual question answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made remarkable progress on large-scale datasets, with a primary focus on resource-rich languages like English. To address this, we introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese while mitigating biases. The dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), each question annotated to specify the type of reasoning involved. Leveraging this dataset, we conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion that identifies objects in images based on questions. The architecture effectively employs transformers to enable simultaneous reasoning over textual and visual data, merging both modalities at an early model stage. The experimental findings demonstrate that our proposed model achieves state-of-the-art performance across four evaluation metrics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
15. Sign-based image criteria for social interaction visual question answering.
- Author
-
Chuganskaya, Anfisa A, Kovalev, Alexey K, and Panov, Aleksandr I
- Subjects
QUESTION answering systems, SOCIAL interaction, ARTIFICIAL intelligence, PSYCHOLOGICAL research, MACHINE learning
- Abstract
The multi-modal tasks have started to play a significant role in the research on artificial intelligence. A particular example of that domain is visual–linguistic tasks, such as visual question answering. The progress of modern machine learning systems is determined, among other things, by the data on which these systems are trained. Most modern visual question answering data sets contain limited type questions that can be answered either by directly accessing the image itself or by using external data. At the same time, insufficient attention is paid to the issues of social interactions between people, which limits the scope of visual question answering systems. In this paper, we propose criteria by which images suitable for social interaction visual question answering can be selected for composing such questions, based on psychological research. We believe this should serve the progress of visual question answering systems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. Vision transformer-based visual language understanding of the construction process.
- Author
-
Yang, Bin, Zhang, Binghan, Han, Yilong, Liu, Boda, Hu, Jiniming, and Jin, Yiming
- Subjects
TRANSFORMER models, NATURAL language processing, FEATURE extraction, COMPUTER vision, VIDEO surveillance, BUILDING sites
- Abstract
The widespread implementation of surveillance systems on construction sites has led to the accumulation of vast amounts of visual data, highlighting the need for an effective semantic analysis methodology. Natural language, as the most intuitive mode of expression, can significantly enhance the interpretability of such data. The adoption of multi-modality models promotes the interaction between surveillance video and textual data, thereby enabling managers to swiftly comprehend on-site dynamics. This study introduces a Visual Question Answering (VQA) approach for the construction industry and presents a specialized dataset to address the unique requirements of on-site management. Utilizing a Vision Transformer (ViT) architecture, the proposed model conducts feature extraction, fusion and interaction between visual and textual features. An additional projection layer is added to establish a transfer learning strategy that is optimized for construction site data. This novel approach facilitates rapid alignment of visual and language features in the model and is validated through ablation studies. The proposed approach achieves a testing accuracy of 83.8%, effectively converting image data from construction sites into natural language descriptions that enhance the analysis of construction processes. Compared to existing methods, this approach does not rely on object detection and allows for the direct extraction of deep-level semantic information from the on-site images. This study further discusses the feasibility of applying VQA within the architecture, engineering and construction (AEC) industry, examines its limitations, and offers suggestions for viable future directions of development. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Advancing surgical VQA with scene graph knowledge.
- Author
-
Yuan, Kun, Kattel, Manasi, Lavanchy, Joël L., Navab, Nassir, Srivastav, Vinkle, and Padoy, Nicolas
- Abstract
Purpose: The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with natural language capabilities is emerging as a necessity. Our work aims to advance visual question answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question–condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. Methods: First, we propose a surgical scene graph-based dataset, SSG-VQA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. We then propose SSG-VQA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module, which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Results: Our comprehensive analysis shows that our SSG-VQA dataset provides a more complex, diverse, geometrically grounded, unbiased and surgical action-oriented dataset compared to existing surgical VQA datasets and SSG-VQA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation in the current surgical VQA systems is the lack of scene knowledge to answer complex queries. Conclusion: We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. We point out that the bottleneck of the current surgical visual question–answer model lies in learning the encoded representation rather than decoding the sequence. Our SSG-VQA dataset provides a diagnostic benchmark to test the scene understanding and reasoning capabilities of the model. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-VQA. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
18. DCF-VQA: Counterfactual Structure Based on Multi-Feature Enhancement.
- Author
-
Guan Yang, Cheng Ji, Xiaoming Liu, Ziming Zhang, and Chen Wang
- Subjects
DISCRETE cosine transforms, NATURAL language processing, COMPUTER vision, COUNTERFACTUALS (Logic)
- Abstract
Visual question answering (VQA) is a pivotal topic at the intersection of computer vision and natural language processing. This paper addresses the challenges of linguistic bias and bias fusion within invalid regions encountered in existing VQA models due to insufficient representation of multi-modal features. To overcome those issues, we propose a multi-feature enhancement scheme. This scheme involves the fusion of one or more features with the original ones, incorporating discrete cosine transform (DCT) features into the counterfactual reasoning framework. This approach harnesses fine-grained information and spatial relationships within images and questions, enabling a more refined understanding of the indirect relationship between images and questions. Consequently, it effectively mitigates linguistic bias and bias fusion within invalid regions in the model. Extensive experiments are conducted on multiple datasets, including VQA2 and VQA-CP2, employing various baseline models and fusion techniques, resulting in promising and robust performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. Learning a Mixture of Conditional Gating Blocks for Visual Question Answering.
- Author
-
Sun, Qiang, Fu, Yan-Wei, and Xue, Xiang-Yang
- Subjects
CONVOLUTIONAL neural networks, TRANSFORMER models, ARTIFICIAL neural networks, TURING test, RESEARCH personnel
- Abstract
As a Turing test in multimedia, visual question answering (VQA) aims to answer the textual question with a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of the neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, it is relatively less touched and very nontrivial to exploit dynamics in the transformers of the VQA tasks through all the stages in an end-to-end manner. Typically, due to the large computation cost of transformers, researchers are inclined to only apply transformers on the extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer to the transformer as it can effectively increase the model capacity and require fewer transformer layers for the VQA task. In particular, we name the dynamics in the Transformer as Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with conditional ResNeXt block (cResNeXt). Thus, a novel model mixture of conditional gating blocks (McG) is proposed for VQA, which keeps the best of the Transformer, convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special examples of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG has achieved the state-of-the-art performance on these benchmark datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
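Editor's note: entry 19's conditional multi-head self-attention (cMHSA) makes a transformer block dynamic by conditioning it on the question. One simple way to get that flavour, sketched below, is to let the question embedding produce per-channel scale and shift parameters (FiLM-style) applied around a standard self-attention block. This is my reading of a "question-guided dynamic layer", not the authors' exact parameterization.

```python
import torch
import torch.nn as nn

class QuestionGatedSelfAttention(nn.Module):
    """Self-attention over visual tokens, modulated by the question embedding."""
    def __init__(self, d_model=512, n_heads=8, q_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.film = nn.Linear(q_dim, 2 * d_model)   # produces (gamma, beta)

    def forward(self, vis_tokens, q_embed):
        gamma, beta = self.film(q_embed).chunk(2, dim=-1)   # (B, d_model) each
        x = self.norm(vis_tokens)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # The question decides how strongly each channel of the update is used.
        return vis_tokens + gamma.unsqueeze(1) * attn_out + beta.unsqueeze(1)

block = QuestionGatedSelfAttention()
out = block(torch.randn(2, 36, 512), torch.randn(2, 512))
print(out.shape)   # torch.Size([2, 36, 512])
```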
20. Enhancing machine vision: the impact of a novel innovative technology on video question-answering.
- Author
-
Dan, Songjian and Feng, Wei
- Subjects
QUESTION answering systems, COMPUTER vision, LANGUAGE models, TECHNOLOGICAL innovations, NATURAL language processing, ARTIFICIAL intelligence
- Abstract
The robot video question-answering system is an artificial intelligence application that integrates computer vision and natural language processing technologies. Recently, it has received widespread attention, especially with the rapid development of large language models (LLMs). The core technical challenge lies in the application of visual question answering (VQA). However, visual question answering currently faces several challenges. Firstly, the acquisition of human annotations is costly, and secondly, existing models require expensive retraining when replacing a particular module. We propose the VLM2LLM model, which significantly improves the performance of multimodal question-answering tasks by integrating visual-language matching and large-scale language models. Specifically, it overcomes the limitations of requiring massive computational resources for training and inference in previous models. Furthermore, it allows for the upgrading of our LLM version according to the latest research advancements and needs. The results demonstrate that the VLM2LLM model achieves the highest accuracy compared to other state-of-the-art models on three datasets: QAv2, A-OKVQA, and OK-VQA. We hope that the VLM2LLM model can drive advancements in the field of robot video question-answering and provide innovative solutions for a wider range of application domains. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
21. TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention.
- Author
-
Koshti, Dipali, Gupta, Ashutosh, Kalla, Mukesh, and Sharma, Arvind
- Subjects
TRANSFORMER models, POWER transformers, FEATURE extraction, NATURAL languages, IMAGE representation
- Abstract
Understanding multiple modalities and relating them is an easy task for humans. But for machines, this is a stimulating task. One such multi-modal reasoning task is Visual question answering which demands the machine to produce an answer for the natural language query asked based on the given image. Although plenty of work is done in this field, there is still a challenge of improving the answer prediction ability of the model and breaching human accuracy. A novel model for answering image-based questions based on a transformer has been proposed. The proposed model is a fully Transformer-based architecture that utilizes the power of a transformer for extracting language features as well as for performing joint understanding of question and image features. The proposed VQA model utilizes F-RCNN for image feature extraction. The retrieved language features and object-level image features are fed to a decoder inspired by the Bi-Directional Encoder Representation Transformer - BERT architecture that learns jointly the image characteristics directed by the question characteristics and rich representations of the image features are obtained. Extensive experimentation has been carried out to observe the effect of various hyperparameters on the performance of the model. The experimental results demonstrate that the model's ability to predict the answer increases with the increase in the number of layers in the transformer's encoder and decoder. The proposed model improves upon the previous models and is highly scalable due to the introduction of the BERT. Our best model reports 72.31% accuracy on the test-standard split of the VQAv2 dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
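Editor's note: entry 21 above couples F-RCNN object features with a BERT-style decoder so that image features are "directed by the question characteristics". The core interaction can be written as one cross-attention step, shown below with dummy tensors; 36 regions of 2048-d features is the usual bottom-up-attention shape, and everything else is illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class QuestionGuidedVisionAttention(nn.Module):
    """Question tokens attend over detected object features (cross-attention)."""
    def __init__(self, txt_dim=768, obj_dim=2048, d_model=768, n_heads=8):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, d_model)   # map F-RCNN features to model dim
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, q_tokens, obj_feats):
        objs = self.obj_proj(obj_feats)
        attended, weights = self.cross(query=q_tokens, key=objs, value=objs)
        return self.norm(q_tokens + attended), weights   # residual + attention map

layer = QuestionGuidedVisionAttention()
q = torch.randn(2, 14, 768)      # 14 question word-piece embeddings
v = torch.randn(2, 36, 2048)     # 36 detected object regions
fused, attn = layer(q, v)
print(fused.shape, attn.shape)   # torch.Size([2, 14, 768]) torch.Size([2, 14, 36])
```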
22. EarthVQANet: Multi-task visual question answering for remote sensing image understanding.
- Author
-
Wang, Junjue, Ma, Ailong, Chen, Zihang, Zheng, Zhuo, Wan, Yuting, Zhang, Liangpei, and Zhong, Yanfei
- Subjects
SURFACE of the earth, SEMANTICS, URBAN planning, HURRICANE Harvey, 2017, HUMAN settlements
- Abstract
Monitoring and managing Earth's surface resources is critical to human settlements, encompassing essential tasks such as city planning, disaster assessment, etc. To accurately recognize the categories and locations of geographical objects and reason about their spatial or semantic relations, we propose a multi-task framework named EarthVQANet, which jointly addresses segmentation and visual question answering (VQA) tasks. EarthVQANet contains a hierarchical pyramid network for segmentation and semantic-guided attention for VQA, in which the segmentation network aims to generate pixel-level visual features and high-level object semantics, and semantic-guided attention performs effective interactions between visual features and language features for relational modeling. For accurate relational reasoning, we design an adaptive numerical loss that incorporates distance sensitivity for counting questions and mines hard-easy samples for classification questions, balancing the optimization. Experimental results on the EarthVQA dataset (city planning for Wuhan, Changzhou, and Nanjing in China), RSVQA dataset (basic statistics for general objects), and FloodNet dataset (disaster assessment for Texas in America, hit by Hurricane Harvey) show that EarthVQANet surpasses 11 general and remote sensing VQA methods. EarthVQANet simultaneously achieves segmentation and reasoning, providing a solid benchmark for various remote sensing applications. Data is available at http://rsidea.whu.edu.cn/EarthVQA.htm [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering.
- Author
-
Yan, Feng, Li, Zhe, Silamu, Wushour, and Li, Yanbing
- Subjects
QUESTION answering systems, DESIGN
- Abstract
Existing visual question answering (VQA) methods tend to focus excessively on visual objects in images, neglecting the understanding of implicit knowledge within the images, thus limiting the comprehension of image content. Furthermore, current mainstream VQA methods employ a bottom-up attention mechanism, which was initially proposed in 2017 and has become a bottleneck in visual question answering. In order to address the aforementioned issues and improve the ability to understand images, we have made the following improvements and innovations: (1) We utilize an OCR model to detect and extract scene text in the images, further enriching the understanding of image content, and we introduce descriptive information from the images to enhance the model's comprehension of them. (2) We improve the bottom-up attention model by obtaining two region features from the images and concatenating them to form the final visual feature, which better represents the image. (3) We design an extensible deep co-attention model, which includes self-attention units and co-attention units. This model can incorporate both image description information and scene text, and it can be extended with other knowledge to further enhance the model's reasoning ability. (4) Experimental results demonstrate that our best single model achieves an overall accuracy of 74.38% on the VQA 2.0 test set. To the best of our knowledge, without using external datasets for pretraining, our model has reached a state-of-the-art level. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. A focus fusion attention mechanism integrated with image captions for knowledge graph-based visual question answering.
- Author
-
Ma, Mingyang, Tohti, Turdi, Liang, Yi, Zuo, Zicheng, and Hamdulla, Askar
- Abstract
Visual question answering tasks based on the knowledge graph are dedicated to integrating rich information in the knowledge graph to deal with complex questions that cannot be solved by image features alone, while focusing on improving the performance of fundamental visual question answering tasks. The core of this task is to achieve effective cross-modal information fusion and resolve the semantic gap between images and text, thereby predicting answers more accurately. However, current visual question answering methods face challenges such as sparse information, single fusion features, and excessive computational burden. Given the sparsity of image regions related to questions in visual question answering tasks, traditional fusion methods such as linear pooling and cross-attention, while capable of effectively handling interactions between different modalities, engage the question with the entire image globally, which introduces unnecessary noise and increases computational complexity. To solve these problems, we propose a focus fusion attention mechanism (FFAM) integrated with image captions, effectively reducing noise and computational burden by focusing on the top-k high-relevance areas. In addition, we adopt the advanced BLIP-2 model to generate image captions and introduce it as a new modality into the fusion process, breaking through the limitation of relying solely on features generated by the image encoder. Although introducing the knowledge graph increases the potential for processing complexity and noise, our method still shows strong results. On the F-VQA dataset, our model improved by 2.57% compared to the baseline model without the knowledge graph and achieved an accuracy of 86.35% with the knowledge graph. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
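Editor's note: the focus fusion attention in entry 24 above keeps only the top-k question-relevant image regions before fusing, to cut noise and computation. A bare-bones version of that selection step is below; the scoring function, the value of k, and the shapes are placeholders, and the BLIP-2 caption modality is left out entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKFocusAttention(nn.Module):
    """Attend only over the k regions most relevant to the question."""
    def __init__(self, q_dim=512, v_dim=512, k=5):
        super().__init__()
        self.k = k
        self.score = nn.Bilinear(q_dim, v_dim, 1)   # relevance of each region to the question

    def forward(self, q_feat, regions):
        B, N, _ = regions.shape
        q_rep = q_feat.unsqueeze(1).expand(-1, N, -1)
        scores = self.score(q_rep.reshape(B * N, -1), regions.reshape(B * N, -1))
        scores = scores.view(B, N)                                # (B, N)
        top_val, top_idx = scores.topk(self.k, dim=1)             # keep k best regions
        top_regions = torch.gather(
            regions, 1, top_idx.unsqueeze(-1).expand(-1, -1, regions.size(-1)))
        weights = F.softmax(top_val, dim=1).unsqueeze(-1)         # (B, k, 1)
        return (weights * top_regions).sum(dim=1)                 # fused visual vector

focus = TopKFocusAttention()
fused = focus(torch.randn(3, 512), torch.randn(3, 36, 512))
print(fused.shape)   # torch.Size([3, 512])
```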
25. Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training
- Author
-
Su, Tongkun, Li, Jun, Zhang, Xi, Jin, Haibo, Chen, Hao, Wang, Qiong, Lv, Faqin, Zhao, Baoliang, and Hu, Ying
- Published
- 2024
- Full Text
- View/download PDF
26. Region-Specific Retrieval Augmentation for Longitudinal Visual Question Answering: A Mix-and-Match Paradigm
- Author
-
Yung, Ka-Wai, Sivaraj, Jayaram, Stoyanov, Danail, Loukogeorgakis, Stavros, and Mazomenos, Evangelos B.
- Published
- 2024
- Full Text
- View/download PDF
27. Can LLMs’ Tuning Methods Work in Medical Multimodal Domain?
- Author
-
Chen, Jiawei, Jiang, Yue, Yang, Dingkang, Li, Mingcheng, Wei, Jinjie, Qian, Ziyun, and Zhang, Lihua
- Published
- 2024
- Full Text
- View/download PDF
28. Overview of the ImageCLEF 2024: Multimedia Retrieval in Medical Applications
- Author
-
Ionescu, Bogdan, Müller, Henning, Drăgulinescu, Ana-Maria, Rückert, Johannes, Ben Abacha, Asma, García Seco de Herrera, Alba, Bloch, Louise, Brüngel, Raphael, Idrissi-Yaghir, Ahmad, Schäfer, Henning, Schmidt, Cynthia Sabrina, Pakull, Tabea M. G., Damm, Hendrik, Bracke, Benjamin, Friedrich, Christoph M., Andrei, Alexandra-Georgiana, Prokopchuk, Yuri, Karpenka, Dzmitry, Radzhabov, Ahmedkhan, Kovalev, Vassili, Macaire, Cécile, Schwab, Didier, Lecouteux, Benjamin, Esperança-Rodier, Emmanuelle, Yim, Wen-Wai, Fu, Yujuan, Sun, Zhaoyi, Yetisgen, Meliha, Xia, Fei, Hicks, Steven A., Riegler, Michael A., Thambawita, Vajira, Storås, Andrea, Halvorsen, Pål, Heinrich, Maximilian, Kiesel, Johannes, Potthast, Martin, and Stein, Benno
- Published
- 2024
- Full Text
- View/download PDF
29. CHIC: Corporate Document for Visual Question Answering
- Author
-
Mahamoud, Ibrahim Souleiman, Coustaty, Mickaël, Joseph, Aurélie, d’Andecy, Vincent Poulain, and Ogier, Jean-Marc
- Published
- 2024
- Full Text
- View/download PDF
30. CircuitVQA: A Visual Question Answering Dataset for Electrical Circuit Images
- Author
-
Mehta, Rahul, Singh, Bhavyajeet, Varma, Vasudeva, and Gupta, Manish
- Published
- 2024
- Full Text
- View/download PDF
31. VQA-PDF: Purifying Debiased Features for Robust Visual Question Answering Task
- Author
-
Bi, Yandong, Jiang, Huajie, Liu, Jing, Liu, Mengting, Hu, Yongli, and Yin, Baocai
- Published
- 2024
- Full Text
- View/download PDF
32. Image Understanding Through Visual Question Answering: A Review from Past Research
- Author
-
Yanda, Nagamani, Tagore Babu, J., Aswin Kumar, K., Taraka Rama Rao, M., Ranjith Varma, K. V., and Rahul Babu, N.
- Published
- 2024
- Full Text
- View/download PDF
33. IIU: Independent Inference Units for Knowledge-Based Visual Question Answering
- Author
-
Li, Yili, Yu, Jing, Gai, Keke, and Xiong, Gang
- Published
- 2024
- Full Text
- View/download PDF
34. Experiential Questioning for VQA
- Author
-
Gómez Blanco, Ruben, Pérez Peinador, Adrián, Sanjuan Espejo, Adrián, Sánchez-Ruiz, Antonio A., and Díaz-Agudo, Belén
- Published
- 2024
- Full Text
- View/download PDF
35. Increasing Interpretability in Outside Knowledge Visual Question Answering
- Author
-
Upravitelev, Max, Krauss, Christopher, and Kuhlmann, Isabelle
- Published
- 2024
- Full Text
- View/download PDF
36. Generating Type-Related Instances and Metric Learning to Overcoming Language Priors in VQA
- Author
-
Sun, Chongxiang, Yang, Ying, Yu, Zhengtao, Guo, Chenliang, and Zhao, Jia
- Published
- 2024
- Full Text
- View/download PDF
37. GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering
- Author
-
Li, Yi-Ting, Lin, Ying-Jia, Yeh, Chia-Jen, Lin, Chun-Yi, and Kao, Hung-Yu
- Published
- 2024
- Full Text
- View/download PDF
38. A Balanced Counting Visual Question Answering Dataset
- Author
-
Nuseir, Aya, Vannahme, Moritz, and Ebner, Marc
- Published
- 2024
- Full Text
- View/download PDF
39. Evaluation of Systematic Errors in Visual Question Answering
- Author
-
Nuseir, Aya, Vannahme, Moritz, and Ebner, Marc
- Published
- 2024
- Full Text
- View/download PDF
40. Advancing Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications with ImageCLEF 2024
- Author
-
Ionescu, Bogdan, Müller, Henning, Drăgulinescu, Ana Maria, Idrissi-Yaghir, Ahmad, Radzhabov, Ahmedkhan, Herrera, Alba Garcia Seco de, Andrei, Alexandra, Stan, Alexandru, Storås, Andrea M., Abacha, Asma Ben, Lecouteux, Benjamin, Stein, Benno, Macaire, Cécile, Friedrich, Christoph M., Schmidt, Cynthia Sabrina, Schwab, Didier, Esperança-Rodier, Emmanuelle, Ioannidis, George, Adams, Griffin, Schäfer, Henning, Manguinhas, Hugo, Coman, Ioan, Schöler, Johanna, Kiesel, Johannes, Rückert, Johannes, Bloch, Louise, Potthast, Martin, Heinrich, Maximilian, Yetisgen, Meliha, Riegler, Michael A., Snider, Neal, Halvorsen, Pål, Brüngel, Raphael, Hicks, Steven A., Thambawita, Vajira, Kovalev, Vassili, Prokopchuk, Yuri, and Yim, Wen-Wai
- Published
- 2024
- Full Text
- View/download PDF
41. Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
- Author
-
Lerner, Paul, Ferret, Olivier, and Guinaudeau, Camille
- Published
- 2024
- Full Text
- View/download PDF
42. Can Machines and Humans Use Negation When Describing Images?
- Author
-
Sato, Yuri, and Mineshima, Koji
- Published
- 2024
- Full Text
- View/download PDF
43. Weakly-Supervised Grounding for VQA with Dual Visual-Linguistic Interaction
- Author
-
Liu, Yi, Pan, Junwen, Wang, Qilong, Chen, Guanlin, Nie, Weiguo, Zhang, Yudong, Gao, Qian, Hu, Qinghua, and Zhu, Pengfei
- Published
- 2024
- Full Text
- View/download PDF
44. Visual Question Answering – VizWiz Challenge
- Author
-
Ranković, Tamara, Janković, Eva, and Slivka, Jelena
- Published
- 2024
- Full Text
- View/download PDF
45. VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
- Author
-
Liu, Yang, Tan, Ying, Luo, Jingzhou, and Chen, Weixing
- Published
- 2024
- Full Text
- View/download PDF
46. Enhancing Image Comprehension for Computer Science Visual Question Answering
- Author
-
Wang, Hongyu, Qiang, Pengpeng, Tan, Hongye, and Hu, Jingchang
- Published
- 2024
- Full Text
- View/download PDF
47. Syntax Tree Constrained Graph Network for Visual Question Answering
- Author
-
Su, Xiangrui, Zhang, Qi, Shi, Chongyang, Liu, Jiachang, and Hu, Liang
- Published
- 2024
- Full Text
- View/download PDF
48. Dual modality prompt learning for visual question-grounded answering in robotic surgery
- Author
-
Yue Zhang, Wanshu Fan, Peixi Peng, Xin Yang, Dongsheng Zhou, and Xiaopeng Wei
- Subjects
Prompt learning, Visual prompt, Textual prompt, Grounding-answering, Visual question answering, Drawing. Design. Illustration, NC1-1940, Computer applications to medicine. Medical informatics, R858-859.7, Computer software, QA76.75-76.765
- Abstract
With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of the VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
- Published
- 2024
- Full Text
- View/download PDF
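Editor's note: entry 48 above injects learnable prompts into both the visual and textual streams of a surgical VQA model. The snippet below shows only the mechanical part of prompt learning, namely prepending trainable prompt tokens to each modality's token sequence before encoding; the complementary prompters, the iterative fusion strategy, and the grounding head of the paper are not reproduced, and all sizes are invented.

```python
import torch
import torch.nn as nn

class DualPromptEncoderInput(nn.Module):
    """Prepend learnable prompt tokens to visual and textual token sequences."""
    def __init__(self, d_model=512, n_vis_prompts=8, n_txt_prompts=8):
        super().__init__()
        self.vis_prompts = nn.Parameter(torch.randn(n_vis_prompts, d_model) * 0.02)
        self.txt_prompts = nn.Parameter(torch.randn(n_txt_prompts, d_model) * 0.02)

    def forward(self, vis_tokens, txt_tokens):
        B = vis_tokens.size(0)
        vp = self.vis_prompts.unsqueeze(0).expand(B, -1, -1)
        tp = self.txt_prompts.unsqueeze(0).expand(B, -1, -1)
        # Downstream encoders see [prompts; original tokens] for each modality.
        return torch.cat([vp, vis_tokens], dim=1), torch.cat([tp, txt_tokens], dim=1)

prompter = DualPromptEncoderInput()
vis, txt = prompter(torch.randn(2, 49, 512), torch.randn(2, 20, 512))
print(vis.shape, txt.shape)   # torch.Size([2, 57, 512]) torch.Size([2, 28, 512])
```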
49. Graph neural networks for visual question answering: a systematic review.
- Author
-
Yusuf, Abdulganiyu Abdu, Feng, Chong, Mao, Xianling, Ally Duma, Ramadhani, Abood, Mohammed Salah, and Chukkol, Abdulrahman Hamman Adama
- Subjects
GRAPH neural networks, COMPUTER vision, NATURAL language processing, IMAGE representation, ISOMORPHISM (Mathematics), NEUROLINGUISTICS
- Abstract
Recently, visual question answering (VQA) has gained considerable interest within the computer vision and natural language processing (NLP) research areas. The VQA task involves answering a question about an image, which requires both language and vision understanding. Effectively extracting visual representations from images, textual embedding from questions, and bridging the semantic disparity between image and question representations pose fundamental challenges in VQA. Lately, an increasing number of studies are focusing on utilizing graph neural networks (GNNs) to enhance the performance of VQA tasks. The ability to handle graph-structured data is a major advantage of GNNs for VQA tasks, which allows better representation of relationships between objects and regions in an image. These relationships include both spatial and semantic relationships. This paper systematically reviews various graph neural network-based studies for image-based VQA. Fifty-four related publications written between 2018 and January 2023 were carefully synthesized for this review. The review is structured into three perspectives: the various graph neural network techniques and models that have been applied to VQA, a comparison of model performance, and existing challenges. After analyzing these papers, 45 different models were identified, grouped into four different GNN techniques. These are Graph Convolution Network (GCN), Graph Attention Network (GAT), Graph Isomorphism Network (GIN) and Graph Neural Network (GNN). Also, the performance of these models is compared based on accuracy, datasets, subtasks, feature representation and fusion techniques. Lastly, the study provides some suggestions for mitigating the remaining challenges in future visual question answering research. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
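Editor's note: the survey in entry 49 above groups VQA graph models into GCN, GAT, GIN, and generic GNN families. For orientation, here is the standard single GCN layer (symmetric-normalized adjacency followed by a linear map and ReLU) applied to object-region nodes; it is textbook GCN, not a model from any specific surveyed paper, and the graph and feature sizes are made up.

```python
import numpy as np

def gcn_layer(A: np.ndarray, X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN layer: H = ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

rng = np.random.default_rng(0)
A = (rng.random((36, 36)) > 0.8).astype(float)     # relationships between 36 regions
A = np.maximum(A, A.T)                             # make the graph undirected
X = rng.normal(size=(36, 2048))                    # region appearance features
W = rng.normal(size=(2048, 512)) * 0.02            # layer weights
print(gcn_layer(A, X, W).shape)                    # (36, 512)
```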
50. Learning the Meanings of Function Words From Grounded Language Using a Visual Question Answering Model.
- Author
-
Portelance, Eva, Frank, Michael C., and Jurafsky, Dan
- Subjects
MACHINE learning, SEMANTICS, STATISTICAL learning, ARTIFICIAL neural networks
- Abstract
Interpreting a seemingly simple function word like "or," "behind," or "more" can require logical, numerical, and relational reasoning. How are such words learned by children? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural‐network‐based visual question answering models apparently can learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learned by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spatial and numerical reasoning. Furthermore, we find that these models can learn the meanings of logical connectives and and or without any prior knowledge of logical reasoning as well as early evidence that they are sensitive to alternative expressions when interpreting language. Finally, we show that word learning difficulty is dependent on the frequency of models' input. Our findings offer proof‐of‐concept evidence that it is possible to learn the nuanced interpretations of function words in a visually grounded context by using non‐symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF