778 results on '"Visual Question Answering"'
Search Results
2. CooKie: commonsense knowledge-guided mixture-of-experts framework for fine-grained visual question answering
- Author
Wang, Chao, Yang, Jianming, Zhou, Yang, and Yue, Xiaodong
- Published
- 2025
- Full Text
- View/download PDF
3. ENVQA: Improving Visual Question Answering model by enriching the visual feature
- Author
Chowdhury, Souvik and Soni, Badal
- Published
- 2025
- Full Text
- View/download PDF
4. R-VQA: A robust visual question answering model
- Author
Chowdhury, Souvik and Soni, Badal
- Published
- 2025
- Full Text
- View/download PDF
5. Unbiased VQA via modal information interaction and question transformation
- Author
Peng, Dahe and Li, Zhixin
- Published
- 2025
- Full Text
- View/download PDF
6. Diff-ZsVQA: Zero-shot Visual Question Answering with Frozen Large Language Models Using Diffusion Model
- Author
Xu, Quanxing, Li, Jian, Tian, Yuhao, Zhou, Ling, Zhang, Feifei, and Huang, Rubing
- Published
- 2025
- Full Text
- View/download PDF
7. Vision-BioLLM: Large vision language model for visual dialogue in biomedical imagery
- Author
AlShibli, Ahmad, Bazi, Yakoub, Rahhal, Mohamad Mahmoud Al, and Zuair, Mansour
- Published
- 2025
- Full Text
- View/download PDF
8. Robust data augmentation and contrast learning for debiased visual question answering
- Author
Ning, Ke and Li, Zhixin
- Published
- 2025
- Full Text
- View/download PDF
9. VCF: An effective Vision-Centric Framework for Visual Question Answering
- Author
Wang, Fengjuan, Peng, Longkun, Cao, Shan, Yang, Zhaoqilin, Zhang, Ruonan, and An, Gaoyun
- Published
- 2025
- Full Text
- View/download PDF
10. Low-shot Visual Anomaly Detection with Multimodal Large Language Models
- Author
Schiele, Tobias, Kern, Daria, DeSilva, Anjali, and Klauck, Ulrich
- Published
- 2024
- Full Text
- View/download PDF
11. Automatic Construction Safety Report Using Visual Question Answering and Segmentation Model
- Author
Tran, Dai Quoc, Jeon, Yuntae, Son, Seongwoo, Kulinan, Almo Senja, Lee, Changjun, Park, Seunghee, di Prisco, Marco, Series Editor, Chen, Sheng-Hong, Series Editor, Vayas, Ioannis, Series Editor, Kumar Shukla, Sanjay, Series Editor, Sharma, Anuj, Series Editor, Kumar, Nagesh, Series Editor, Wang, Chien Ming, Series Editor, Cui, Zhen-Dong, Series Editor, Lu, Xinzheng, Series Editor, Francis, Adel, editor, Miresco, Edmond, editor, and Melhado, Silvio, editor
- Published
- 2025
- Full Text
- View/download PDF
12. Evaluating Large Language Models in Cybersecurity Knowledge with Cisco Certificates
- Author
Keppler, Gustav, Kunz, Jeremy, Hagenmeyer, Veit, Elbez, Ghada, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Horn Iwaya, Leonardo, editor, Kamm, Liina, editor, Martucci, Leonardo, editor, and Pulls, Tobias, editor
- Published
- 2025
- Full Text
- View/download PDF
13. Evaluating VQA Models’ Consistency in the Scientific Domain
- Author
Quan, Khanh-An C., Guinaudeau, Camille, Satoh, Shin’ichi, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Ide, Ichiro, editor, Kompatsiaris, Ioannis, editor, Xu, Changsheng, editor, Yanai, Keiji, editor, Chu, Wei-Ta, editor, Nitta, Naoko, editor, Riegler, Michael, editor, and Yamasaki, Toshihiko, editor
- Published
- 2025
- Full Text
- View/download PDF
14. VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving
- Author
Liu, Yibo, Yang, Zheyuan, Wu, Guile, Ren, Yuan, Lin, Kejian, Liu, Bingbing, Liu, Yang, Shan, Jinjun, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
- Full Text
- View/download PDF
15. Fully Authentic Visual Question Answering Dataset from Online Communities
- Author
Chen, Chongyan, Liu, Mengchen, Codella, Noel, Li, Yunsheng, Yuan, Lu, Gurari, Danna, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
- Full Text
- View/download PDF
16. Integrating Vision-Tool to Enhance Visual-Question-Answering in Special Domains
- Author
Le, Nguyen-Khang, Nguyen, Dieu-Hien, Nguyen, Le Minh, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Hadfi, Rafik, editor, Anthony, Patricia, editor, Sharma, Alok, editor, Ito, Takayuki, editor, and Bai, Quan, editor
- Published
- 2025
- Full Text
- View/download PDF
17. Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-consistency Training
- Author
Tan, Cheng, Wei, Jingxuan, Gao, Zhangyang, Sun, Linzhuang, Li, Siyuan, Guo, Ruifeng, Yu, Bihui, Li, Stan Z., Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
- Full Text
- View/download PDF
18. Towards Open-Ended Visual Quality Comparison
- Author
Wu, Haoning, Zhu, Hanwei, Zhang, Zicheng, Zhang, Erli, Chen, Chaofeng, Liao, Liang, Li, Chunyi, Wang, Annan, Sun, Wenxiu, Yan, Qiong, Liu, Xiaohong, Zhai, Guangtao, Wang, Shiqi, Lin, Weisi, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
- Full Text
- View/download PDF
19. WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
- Author
Chen, Pingyi, Zhu, Chenglu, Zheng, Sunyi, Li, Honglin, Yang, Lin, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
- Full Text
- View/download PDF
20. Overview of the Trauma THOMPSON Challenge at MICCAI 2023
- Author
Zhuo, Yupeng, Kirkpatrick, Andrew W., Couperus, Kyle, Tran, Oanh, Beck, Jonah, DeVane, DeAnna, Candelore, Ross, McKee, Jessica, Colombo, Christopher, Gorbatkin, Chad, Birch, Eleanor, Duerstock, Bradley, Wachs, Juan, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bao, Rina, editor, Grant, Ellen, editor, Kirkpatrick, Andrew, editor, Wachs, Juan, editor, and Ou, Yangming, editor
- Published
- 2025
- Full Text
- View/download PDF
21. The Trauma THOMPSON Challenge Report MICCAI 2023
- Author
Zhuo, Yupeng, W. Kirkpatrick, Andrew, Couperus, Kyle, Tran, Oanh, Wachs, Juan, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bao, Rina, editor, Grant, Ellen, editor, Kirkpatrick, Andrew, editor, Wachs, Juan, editor, and Ou, Yangming, editor
- Published
- 2025
- Full Text
- View/download PDF
22. Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
- Author
Wang, Haibo, Ge, Weifeng, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
- Full Text
- View/download PDF
23. Chatting with interactive memory for text-based person retrieval
- Author
He, Chen, Li, Shenshen, Wang, Zheng, Chen, Hua, Shen, Fumin, and Xu, Xing
- Abstract
Text-based person retrieval aims to match a specific pedestrian image with textual descriptions. Traditional approaches have largely focused on utilizing a “single-shot” query with text description. They may not align well with real-world scenarios and cannot fully encapsulate detailed cues since users may employ multiple and partial queries to describe a pedestrian. To overcome this discrepancy, we introduce a novel model termed Chatting with Interactive Memory (CIM) for the text-based person retrieval task. Our CIM model facilitates a more nuanced and interactive search process by allowing users to engage in multiple rounds of dialogue, providing a more comprehensive description of the person of interest. The proposed CIM model is structured around two pivotal components: (1) The Interactive Retrieval Module, leveraging interactive memory to dynamically process dialogue and enhance image retrieval, and (2) The Q&A Module, crafted to simulate real user interactions. Our extensive evaluations on three widely-used datasets CUHK-PEDES, ICFG-PEDES, and RSTPReid illustrate the superior performance of the proposed CIM framework, significantly improving the precision and user engagement in text-based person retrieval tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
24. Modular dual-stream visual fusion network for visual question answering.
- Author
Xue, Lixia, Wang, Wenhao, Wang, Ronggui, and Yang, Juan
- Subjects
- INFORMATION storage & retrieval systems, SYSTEMS theory, NOISE
- Abstract
Region features extracted by object detection networks have been pivotal to advances in visual question answering (VQA). However, lacking global context, these features may yield inaccurate answers to questions that demand such information. Conversely, grid features provide detailed global context but falter on questions requiring high-level semantic insight because they lack semantic richness. This paper therefore proposes MDVFN, an improved attention-based modular dual-stream visual fusion network that fuses region features with grid features so that grid features contribute global context while region features supplement high-level semantic information. Specifically, we design a visual cross attention (VCA) module in the attention network that interactively fuses the two visual features to enhance them before the question features guide attention. To reduce the semantic noise generated by the interaction of the two image features in the VCA module, targeted optimizations are applied: before fusion, visual position information is embedded into each feature, and a visual fusion graph is used to constrain the fusion process. Additionally, to combine text information, grid features, and region features, we propose a modality-mixing network. To validate our model, we conducted extensive experiments on the VQA-v2 and GQA benchmark datasets, which demonstrate that MDVFN outperforms the most advanced methods; for instance, it achieved accuracies of 72.16% and 72.03% on VQA-v2 and GQA, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
25. Visual question answering on blood smear images using convolutional block attention module powered object detection.
- Author
Lubna, A., Kalady, Saidalavi, and Lijiya, A.
- Subjects
- BLOOD cell count, LEUKOCYTE count, CONVOLUTIONAL neural networks, LEUKOCYTES, ERYTHROCYTES
- Abstract
Among the vital characteristics that determine a person's health are the shape and number of the red blood cells, white blood cells and platelets present in the blood. Any abnormality in these characteristics indicates diseases such as anaemia, leukaemia or thrombocytosis. Blood cells are conventionally counted through microscopic studies with suitable chemical reagents applied to the blood, methods that involve manual labour and are time-consuming and costly, requiring highly skilled medical professionals. This paper proposes a novel scheme to analyse an individual's blood sample with a visual question answering (VQA) system, which accepts a blood smear image as input and very quickly answers questions pertaining to the sample (e.g., the number of blood cells or the nature of abnormalities) without requiring the services of a skilled medical professional. In VQA, the computer generates textual answers to questions about an input image; solving this difficult problem requires visual understanding, question comprehension and deductive reasoning. The proposed approach exploits a convolutional neural network for question categorisation and an object detector with an attention mechanism for visual comprehension. Experiments were conducted with two types of attention, (1) a convolutional block attention module and (2) a squeeze-and-excitation network, which facilitate fast and reliable results. A VQA dataset was created for this study owing to the unavailability of a public dataset, and the proposed system exhibited an accuracy of 94% for numeric-response and yes/no questions and a BLEU score of 0.91. The attention-based object recognition model used for counting blood characteristics achieves accuracies of 97%, 100% and 98% for red blood cell, white blood cell and platelet counts, respectively, improvements of 1%, 0.06% and 1.61% over the state-of-the-art model. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
26. Graph neural networks in vision-language image understanding: a survey
- Author
Senior, Henry, Slabaugh, Gregory, Yuan, Shanxin, and Rossi, Luca
- Subjects
- GRAPH neural networks, COMPUTER vision, ARTIFICIAL intelligence, IMAGE retrieval, INFORMATION retrieval
- Abstract
2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes further than identifying the objects in an image, and instead, it attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines, becoming a core architectural component, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and we provide a taxonomy of graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of future potential developments. To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of their architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
27. Application of Generative Artificial Intelligence Models for Accurate Prescription Label Identification and Information Retrieval for the Elderly in Northern East of Thailand.
- Author
Thetbanthad, Parinya, Sathanarugsawait, Benjaporn, and Praneetpolgrang, Prasong
- Subjects
- GENERATIVE artificial intelligence, LANGUAGE models, OPTICAL character recognition, HEALTH facilities, MEDICATION therapy management
- Abstract
This study introduces a novel AI-driven approach to support elderly patients in Thailand with medication management, focusing on accurate drug label interpretation. Two model architectures were explored: a Two-Stage Optical Character Recognition (OCR) and Large Language Model (LLM) pipeline combining EasyOCR with Qwen2-72b-instruct and a Uni-Stage Visual Question Answering (VQA) model using Qwen2-72b-VL. Both models operated in a zero-shot capacity, utilizing Retrieval-Augmented Generation (RAG) with DrugBank references to ensure contextual relevance and accuracy. Performance was evaluated on a dataset of 100 diverse prescription labels from Thai healthcare facilities, using RAG Assessment (RAGAs) metrics to assess Context Recall, Factual Correctness, Faithfulness, and Semantic Similarity. The Two-Stage model achieved high accuracy (94%) and strong RAGAs scores, particularly in Context Recall (0.88) and Semantic Similarity (0.91), making it well-suited for complex medication instructions. In contrast, the Uni-Stage model delivered faster response times, making it practical for high-volume environments such as pharmacies. This study demonstrates the potential of zero-shot AI models in addressing medication management challenges for the elderly by providing clear, accurate, and contextually relevant label interpretations. The findings underscore the adaptability of AI in healthcare, balancing accuracy and efficiency to meet various real-world needs. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
28. An Adaptive Multimodal Fusion Network Based on Multilinear Gradients for Visual Question Answering.
- Author
Zhao, Chengfang, Tang, Mingwei, Zheng, Yanxi, and Ran, Chaocong
- Subjects
- NATURAL language processing, COMPUTER vision, ARTIFICIAL intelligence, FEATURE extraction, IMAGE analysis, QUESTION answering systems, MULTIMODAL user interfaces
- Abstract
As an interdisciplinary field of natural language processing and computer vision, Visual Question Answering (VQA) has emerged as a prominent research focus in artificial intelligence. The core of the VQA task is to combine natural language understanding and image analysis to infer answers by extracting meaningful features from textual and visual inputs. However, most current models struggle to fully capture the deep semantic relationships between images and text owing to their limited capacity to comprehend feature interactions, which constrains their performance. To address these challenges, this paper proposes an innovative Trilinear Multigranularity and Multimodal Adaptive Fusion algorithm (TriMMF) that is designed to improve the efficiency of multimodal feature extraction and fusion in VQA tasks. Specifically, the TriMMF consists of three key modules: (1) an Answer Generation Module, which generates candidate answers by extracting fused features and leveraging question features to focus on critical regions within the image; (2) a Fine-grained and Coarse-grained Interaction Module, which achieves multimodal interaction between question and image features at different granularities and incorporates implicit answer information to capture complex multimodal correlations; and (3) an Adaptive Weight Fusion Module, which selectively integrates coarse-grained and fine-grained interaction features based on task requirements, thereby enhancing the model's robustness and generalization capability. Experimental results demonstrate that the proposed TriMMF significantly outperforms existing methods on the VQA v1.0 and VQA v2.0 datasets, achieving state-of-the-art performance in question–answer accuracy. These findings indicate that the TriMMF effectively captures the deep semantic associations between images and text. The proposed approach provides new insights into multimodal interaction and fusion research, combining domain adaptation techniques to address a broader range of cross-domain visual question answering tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
29. BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering
- Author
Md. Shalha Mucha Bhuyan, Eftekhar Hossain, Khaleda Akhter Sathi, Md. Azad Hossain, and M. Ali Akber Dewan
- Subjects
- Visual question answering, multimodal deep learning, large language model, natural language processing, multi-head attention mechanism, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Visual Question Answering (VQA) is a challenging problem of Artificial Intelligence (AI) that requires an understanding of natural language and computer vision to respond to inquiries based on visual content within images. Research on VQA has gained immense traction due to its wide range of applications in aiding visually impaired individuals, enhancing human-computer interaction, facilitating content-based image retrieval systems, etc. While there has been extensive research on VQA, most were predominantly focused on English, often overlooking the complexity associated with low-resource languages, especially in Bengali. To facilitate research in this arena, we have developed a large scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of the Large Language Model (LLM). Our BVQA dataset encompasses around 17,800 diverse open-ended QA Pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges due to the complex nature of their architectures and adaptations for particular languages. To overcome this challenge, we proposed Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode the visual and textual information. Furthermore, our method utilizes a multi-head attention mechanism to generate three distinct vision-language representations and fuses them using a sophisticated gating mechanism to answer the query regarding an image. Extensive experiments on BVQA dataset show that the proposed method outperformed the existing baseline across various answer categories. The benchmark and source code is available at https://github.com/eftekhar-hossain/Bengali-VQA.
- Published
- 2025
- Full Text
- View/download PDF
30. ViOCRVQA: novel benchmark dataset and VisionReader for visual question answering by understanding Vietnamese text in images
- Author
Pham, Huy Quang, Nguyen, Thang Kien-Bao, Van Nguyen, Quan, Tran, Dan Quang, Nguyen, Nghia Hieu, Van Nguyen, Kiet, and Nguyen, Ngan Luu-Thuy
- Abstract
Optical Character Recognition-Visual Question Answering (OCR-VQA) is the task of answering questions about text contained in images, and it has been developed extensively for English in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition-Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. All the images in this dataset contain text, and the questions concern information relevant to the text in the images. We adapt ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset. Furthermore, we introduce a novel approach, called VisionReader, which achieved 41.16% in EM and 69.90% in F1-score on the test set. The results show that the OCR system plays an important role in VQA models on the ViOCRVQA dataset, and that the objects in the image also help improve model performance. We provide open access to our dataset for further research on the OCR-VQA task in Vietnamese; the code for the proposed method, along with the models used in the experimental evaluation, is also available. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
31. Integrating IoT and visual question answering in smart cities: Enhancing educational outcomes
- Author
Tian Gao and Guanqi Wang
- Subjects
- Smart cities, IoT framework, Visual question answering, Large language models, Smart education technology, Engineering (General). Civil engineering (General), TA1-2040
- Abstract
Emerging as a paradigmatic shift in urban development, smart cities harness the potential of advanced information and communication technologies to seamlessly integrate urban functions, optimize resource allocation, and improve the effectiveness of city management. Within the domain of smart education, the imperative application of Visual Question Answering (VQA) technology encounters significant limitations at the prevailing stage, particularly the absence of a robust Internet of Things (IoT) framework and the inadequate incorporation of large pre-trained language models (LLMs) within contemporary smart education paradigms, especially in addressing zero-shot VQA scenarios, which pose considerable challenges. In response to these constraints, this paper introduces an IoT-based smart city framework that is designed to refine the functionality and efficacy of educational systems. This framework is delineated into four cardinal layers: the data collection layer, data transmission layer, data management layer, and application layer. Furthermore, we introduce the innovative TeachVQA methodology at the application layer, synergizing VQA technology with extensive pre-trained language models, thereby considerably enhancing the dissemination and assimilation of educational content. Evaluative metrics in the VQAv2 and OKVQA datasets substantiate that the TeachVQA methodology not only outperforms existing VQA approaches, but also underscores its profound potential and practical relevance in the educational sector.
- Published
- 2024
- Full Text
- View/download PDF
32. Integrating deep learning for visual question answering in Agricultural Disease Diagnostics: Case Study of Wheat Rust
- Author
Akash Nanavaty, Rishikesh Sharma, Bhuman Pandita, Ojasva Goyal, Srinivas Rallapalli, Murari Mandal, Vaibhav Kumar Singh, Pratik Narang, and Vinay Chamola
- Subjects
- Deep learning, Plant Disease, Visual question answering, Wheat rust, Medicine, Science
- Abstract
Abstract This paper presents a novel approach to agricultural disease diagnostics through the integration of Deep Learning (DL) techniques with Visual Question Answering (VQA) systems, specifically targeting the detection of wheat rust. Wheat rust is a pervasive and destructive disease that significantly impacts wheat production worldwide. Traditional diagnostic methods often require expert knowledge and time-consuming processes, making rapid and accurate detection challenging. We drafted a new, WheatRustDL2024 dataset (7998 images of healthy and infected leaves) specifically designed for VQA in the context of wheat rust detection and utilized it to retrieve the initial weights on the federated learning server. This dataset comprises high-resolution images of wheat plants, annotated with detailed questions and answers pertaining to the presence, type, and severity of rust infections. Our dataset also contains images collected from various sources and successfully highlights a wide range of conditions (different lighting, obstructions in the image, etc.) in which a wheat image may be taken, therefore making a generalized universally applicable model. The trained model was federated using Flower. Following extensive analysis, the chosen central model was ResNet. Our fine-tuned ResNet achieved an accuracy of 97.69% on the existing data. We also implemented the BLIP (Bootstrapping Language-Image Pre-training) methods that enable the model to understand complex visual and textual inputs, thereby improving the accuracy and relevance of the generated answers. The dual attention mechanism, combined with BLIP techniques, allows the model to simultaneously focus on relevant image regions and pertinent parts of the questions. We also created a custom dataset (WheatRustVQA) with our augmented dataset containing 1800 augmented images and their associated question-answer pairs. The model fetches an answer with an average BLEU score of 0.6235 on our testing partition of the dataset. This federated model is lightweight and can be seamlessly integrated into mobile phones, drones, etc. without any hardware requirement. Our results indicate that integrating deep learning with VQA for agricultural disease diagnostics not only accelerates the detection process but also reduces dependency on human experts, making it a valuable tool for farmers and agricultural professionals. This approach holds promise for broader applications in plant pathology and precision agriculture and can consequently address food security issues.
- Published
- 2024
- Full Text
- View/download PDF
33. PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model
- Author
XUE Di, LI Xin, LIU Mingshuai
- Subjects
- visual question answering, prompt engineering, large language model, cross-modal, Electronic computers. Computer science, QA75.5-76.95
- Abstract
To address the problems of insufficient model input information and poor reasoning performance in knowledge-based visual question answering (VQA), this paper constructs PTCR, a knowledge-based VQA framework built on a large language model (LLM), which consists of four parts: answer candidate generation, targeted image description, autonomous chain-of-thought (CoT) construction, and prompted LLM inference. The PTCR framework uses the LLM to guide a multimodal large language model to generate targeted image descriptions, solving the incomplete coverage of previous image captions. It improves the model's reasoning ability by guiding the LLM to autonomously generate CoTs, which provide the thinking process of similar problems during reasoning; it also introduces answer-option rearrangement to eliminate the LLM's position bias over answer choices during reasoning, and reduces random reasoning error by means of majority voting. Experimental results show that the accuracy of the CogVLM model enhanced by the PTCR framework improves by 16.7 and 13.3 percentage points on the OK-VQA and A-OKVQA datasets, respectively. Compared with Prophet, the accuracy of the PTCR framework improves by 3.4 and 5.0 percentage points on OK-VQA and A-OKVQA, respectively. Ablation experiments demonstrate that the methods used in this paper, such as targeted image descriptions and autonomous chains of thought, are all effective in improving accuracy. Overall, the PTCR framework improves the performance of knowledge-based VQA.
- Published
- 2024
- Full Text
- View/download PDF
34. MKGFA: Multimodal Knowledge Graph Construction and Fact-Assisted Reasoning for VQA.
- Author
Wang, Longbao, Zhang, Jinhao, Zhang, Libing, Zhang, Shuai, Xu, Shufang, Yu, Lin, and Gao, Hongmin
- Subjects
- KNOWLEDGE graphs, FIRST-order logic, KNOWLEDGE representation (Information theory), DECISION making, MULTIMODAL user interfaces
- Abstract
Knowledge-based visual question answering relies on open-ended external knowledge and a fine-grained comprehension of both the visual content of images and semantic information. Existing methods for utilizing knowledge have the following limitations: (1) Language pre-training methods output answers in the form of plain text, which only understand shallow visual content; (2) The knowledge retrieved by image objects as labels is represented as first-order logic, making it difficult to infer complex questions. To address the above problems, this paper integrates visual-textual multimodal information, accumulates domain-specific and external multi-modal knowledge, introduces and supplements external objective facts, and proposes a multimodal knowledge graph construction and fact-assisted reasoning network (MKGFA). The network consists of three parts: the multimodal knowledge graph construction module (MKGC), the objective fact-assisted reasoning module (FAR), and the answer inference module. The MKGC engages in the coarse-to-fine-grained learning of triplet representations for multimodal knowledge units. The FAR establishes deep cross-modal relations between visual objects and factual words for correlating real answers. The answer inference module makes the final decision based on the results of both. Among them, the former two modules employ a pre-training and fine-tuning strategy, systematically accumulating foundational and domain-specific knowledge. Compared with the state-of-the-arts, MKGFA achieves 1.09% and 0.7% higher accuracy on the two challenging OKVQA and KRVQA datasets, respectively. The experimental results demonstrate the complementary advantages of the integration of the two modules. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. A visual question answering model based on image captioning.
- Author
Zhou, Kun, Liu, Qiongjie, and Zhao, Dexin
- Abstract
Image captioning and visual question answering are two important tasks in artificial intelligence that have been widely applied and greatly facilitate daily life. The two tasks share many similarities and rely on essentially the same knowledge and techniques: both are cross-modal tasks involving computer vision and natural language processing, so they can be studied within the same model, with image captioning results used to enhance the visual question answering output. However, current research on the two tasks has largely been conducted independently, and the accuracy of visual question answering still needs to be improved. This paper therefore proposes IC-VQA, a visual question answering model based on image captioning. The model first performs image captioning: it obtains rich visual information by constructing geometric relations between objects and utilizing mesh information, and then generates question-specific caption sentences with an Attention + Transformer framework. It then performs visual question answering, fusing the previously generated caption sentences to answer the question through an Attention + LSTM framework, which significantly improves answer accuracy. Experiments on the VQA1.0 and VQA2.0 datasets yield overall accuracies of 70.1 and 70.85, respectively, significantly closing the gap with humans. This demonstrates the effectiveness of the IC-VQA model and shows that the accuracy of visual question answering can indeed be improved by fusing question-relevant caption sentences. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. KTMN: Knowledge-driven Two-stage Modulation Network for visual question answering.
- Author
Shi, Jingya, Han, Dezhi, Chen, Chongqing, and Shen, Xiang
- Abstract
Existing visual question answering (VQA) methods introduce the Transformer as the backbone architecture for intra- and inter-modal interactions, demonstrating its effectiveness in dependency relationship modeling and information alignment. However, the Transformer’s inherent attention mechanisms tend to be affected by irrelevant information and do not utilize the positional information of objects in the image during the modelling process, which hampers its ability to adequately focus on key question words and crucial image regions during answer inference. Considering this issue is particularly pronounced on the visual side, this paper designs a Knowledge-driven Two-stage Modulation self-attention mechanism to optimize the internal interaction modeling of image sequences. In the first stage, we integrate textual context knowledge and the geometric knowledge of visual objects to modulate and optimize the query and key matrices. This effectively guides the model to focus on visual information relevant to the context and geometric knowledge during the information selection process. In the second stage, we design an information comprehensive representation to apply a secondary modulation to the interaction results from the first modulation. This further guides the model to fully consider the overall context of the image during inference, enhancing its global understanding of the image content. On this basis, we propose a Knowledge-driven Two-stage Modulation Network (KTMN) for VQA, which enables fine-grained filtering of redundant image information while more precisely focusing on key regions. Finally, extensive experiments conducted on the datasets VQA v2 and CLEVR yielded Overall accuracies of 71.36% and 99.20%, respectively, providing ample validation of the proposed method’s effectiveness and rationality. Source code is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. Exploring and exploiting model uncertainty for robust visual question answering.
- Author
Zhang, Xuesong, He, Jun, Zhao, Jia, Hu, Zhenzhen, Yang, Xun, Li, Jia, and Hong, Richang
- Abstract
Visual Question Answering (VQA) methods have been widely demonstrated to exhibit bias in answering questions due to the distribution differences of answer samples between training and testing, resulting in performance degradation. While numerous efforts have demonstrated promising results in overcoming language bias, broader implications (e.g., the trustworthiness of current VQA model predictions) of the problem remain unexplored. In this paper, we aim to provide a different viewpoint on the problem from the perspective of model uncertainty. In a series of empirical studies on the VQA-CP v2 dataset, we find that current VQA models are often biased towards giving obviously incorrect answers with high confidence, i.e., being overconfident, which indicates high uncertainty. In light of this observation, we: (1) design a novel metric for monitoring model overconfidence, and (2) propose a model calibration method to address the overconfidence issue, thereby making the model more reliable and better at generalization. The calibration method explicitly imposes constraints on model predictions to make the model less confident during training. It has the advantage of being model-agnostic and computationally efficient. Experiments demonstrate that VQA approaches exhibiting overconfidence are usually negatively impacted in terms of generalization, and fortunately their performance and trustworthiness can be boosted by the adoption of our calibration method. Code is available at [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. Beyond Words: ESC‐Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors.
- Author
Chowdhury, Souvik and Soni, Badal
- Subjects
- NATURAL language processing, COMPUTER vision
- Abstract
Language prior is a pressing problem in the VQA domain, where a model tends to provide the most frequent related answer. Several kinds of methods have been adopted to mitigate the language prior issue, for example, ensemble approaches, balanced-data approaches, modified evaluation strategies, and modified training frameworks. In this article, we propose a VQA model, the "Ensemble of Spatial and Channel Attention Network (ESC-Net)," that overcomes the language bias problem by improving the visual features. We use regional and global image features together with an ensemble of combined channel and spatial attention mechanisms to improve the visual features. The model is a simpler and more effective solution to language bias than existing methods. Extensive experiments show a remarkable performance improvement of 18% on the VQA-CP v2 dataset compared with current state-of-the-art (SOTA) models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. Integrating IoT and visual question answering in smart cities: Enhancing educational outcomes.
- Author
Gao, Tian and Wang, Guanqi
- Subjects
- LANGUAGE models, SMART cities, INFORMATION & communication technologies, URBAN growth, INTERNET of things
- Abstract
Emerging as a paradigmatic shift in urban development, smart cities harness the potential of advanced information and communication technologies to seamlessly integrate urban functions, optimize resource allocation, and improve the effectiveness of city management. Within the domain of smart education, the imperative application of Visual Question Answering (VQA) technology encounters significant limitations at the prevailing stage, particularly the absence of a robust Internet of Things (IoT) framework and the inadequate incorporation of large pre-trained language models (LLMs) within contemporary smart education paradigms, especially in addressing zero-shot VQA scenarios, which pose considerable challenges. In response to these constraints, this paper introduces an IoT-based smart city framework that is designed to refine the functionality and efficacy of educational systems. This framework is delineated into four cardinal layers: the data collection layer, data transmission layer, data management layer, and application layer. Furthermore, we introduce the innovative TeachVQA methodology at the application layer, synergizing VQA technology with extensive pre-trained language models, thereby considerably enhancing the dissemination and assimilation of educational content. Evaluative metrics in the VQAv2 and OKVQA datasets substantiate that the TeachVQA methodology not only outperforms existing VQA approaches, but also underscores its profound potential and practical relevance in the educational sector. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Research on an implicit knowledge-enhanced knowledge retrieval strategy for KB-VQA (基于隐式知识增强的 KB-VQA 知识检索策略研究).
- Author
郑洪岩, 王慧, 刘昊, 张志平, 杨晓娟, and 孙涛
- Published
- 2024
- Full Text
- View/download PDF
41. Adversarial Sample Synthesis for Visual Question Answering.
- Author
Li, Chuanhao, Jing, Chenchen, Li, Zhen, Wu, Yuwei, and Jia, Yunde
- Subjects
- LINGUISTIC change, GENERALIZATION, SAMPLING methods, MEMORIZATION, LANGUAGE & languages
- Abstract
Language prior is a major obstacle to improving the generalization of visual question answering (VQA) models. Recent work has revealed that synthesizing extra training samples to balance training sets is a promising way to alleviate language priors. However, most existing methods synthesize extra samples in a manner independent of training processes, neglecting the fact that the language priors memorized by VQA models change during training and thus resulting in insufficient synthesized samples. In this article, we propose an adversarial sample synthesis method, which synthesizes different adversarial samples by adversarial masking at different training epochs to cope with the changing memorized language priors. The basic idea behind our method is to use adversarial masking to synthesize adversarial samples that will cause the model to give wrong answers. To this end, we design a generative module to carry out adversarial masking by attacking the VQA model and introduce a bias-oriented objective to supervise the training of the generative module. We couple the sample synthesis with the training process of the VQA model, which ensures that the synthesized samples at different training epochs are beneficial to the VQA model. We incorporated the proposed method into three VQA models including UpDn, LMH, and LXMERT and conducted experiments on three datasets including VQA-CP v1, VQA-CP v2, and VQA v2. Experimental results demonstrate a large improvement from our method, such as a 16.22% gain in overall accuracy with LXMERT on VQA-CP v2. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. Integrating deep learning for visual question answering in Agricultural Disease Diagnostics: Case Study of Wheat Rust.
- Author
Nanavaty, Akash, Sharma, Rishikesh, Pandita, Bhuman, Goyal, Ojasva, Rallapalli, Srinivas, Mandal, Murari, Singh, Vaibhav Kumar, Narang, Pratik, and Chamola, Vinay
- Subjects
- FEDERATED learning, WHEAT rusts, DEEP learning, AGRICULTURE, PLANT diseases
- Abstract
This paper presents a novel approach to agricultural disease diagnostics through the integration of Deep Learning (DL) techniques with Visual Question Answering (VQA) systems, specifically targeting the detection of wheat rust. Wheat rust is a pervasive and destructive disease that significantly impacts wheat production worldwide. Traditional diagnostic methods often require expert knowledge and time-consuming processes, making rapid and accurate detection challenging. We drafted a new, WheatRustDL2024 dataset (7998 images of healthy and infected leaves) specifically designed for VQA in the context of wheat rust detection and utilized it to retrieve the initial weights on the federated learning server. This dataset comprises high-resolution images of wheat plants, annotated with detailed questions and answers pertaining to the presence, type, and severity of rust infections. Our dataset also contains images collected from various sources and successfully highlights a wide range of conditions (different lighting, obstructions in the image, etc.) in which a wheat image may be taken, therefore making a generalized universally applicable model. The trained model was federated using Flower. Following extensive analysis, the chosen central model was ResNet. Our fine-tuned ResNet achieved an accuracy of 97.69% on the existing data. We also implemented the BLIP (Bootstrapping Language-Image Pre-training) methods that enable the model to understand complex visual and textual inputs, thereby improving the accuracy and relevance of the generated answers. The dual attention mechanism, combined with BLIP techniques, allows the model to simultaneously focus on relevant image regions and pertinent parts of the questions. We also created a custom dataset (WheatRustVQA) with our augmented dataset containing 1800 augmented images and their associated question-answer pairs. The model fetches an answer with an average BLEU score of 0.6235 on our testing partition of the dataset. This federated model is lightweight and can be seamlessly integrated into mobile phones, drones, etc. without any hardware requirement. Our results indicate that integrating deep learning with VQA for agricultural disease diagnostics not only accelerates the detection process but also reduces dependency on human experts, making it a valuable tool for farmers and agricultural professionals. This approach holds promise for broader applications in plant pathology and precision agriculture and can consequently address food security issues. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. A PTCR external-knowledge visual question answering framework based on large language models (基于大语言模型的 PTCR 外部知识型视觉问答框架).
- Author
薛 迪, 李 欣, and 刘明帅
- Published
- 2024
- Full Text
- View/download PDF
44. A Picture May Be Worth a Hundred Words for Visual Question Answering †.
- Author
Hirota, Yusuke, Garcia, Noa, Otani, Mayu, Chu, Chenhui, and Nakashima, Yuta
- Subjects
- RECOGNITION (Psychology), LANGUAGE models, DATA augmentation, DECISION making, STATISTICAL bias
- Abstract
How far can textual representations go in understanding images? In image understanding, effective representations are essential. Deep visual features from object recognition models currently dominate various tasks, especially Visual Question Answering (VQA). However, these conventional features often struggle to capture image details in ways that match human understanding, and their decision processes lack interpretability. Meanwhile, the recent progress in language models suggests that descriptive text could offer a viable alternative. This paper investigated the use of descriptive text as an alternative to deep visual features in VQA. We propose to process description–question pairs rather than visual features, utilizing a language-only Transformer model. We also explored data augmentation strategies to enhance training set diversity and mitigate statistical bias. Extensive evaluation shows that textual representations using approximately a hundred words can effectively compete with deep visual features on both the VQA 2.0 and VQA-CP v2 datasets. Our qualitative experiments further reveal that these textual representations enable clearer investigation of VQA model decision processes, thereby improving interpretability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. Multi-stage reasoning on introspecting and revising bias for visual question answering.
- Author
Liu, An-An, Lu, Zimu, Xu, Ning, Liu, Min, Yan, Chenggang, Zheng, Bolun, Lv, Bo, Duan, Yulong, Shao, Zhuang, and Li, Xuanya
- Subjects
- ARTIFICIAL intelligence, ATTENTIONAL bias, VOCABULARY, FORECASTING, LANGUAGE & languages
- Abstract
Visual Question Answering (VQA) is the task of predicting an answer to a question based on the content of an image. However, recent VQA methods have relied more on language priors between the question and answer than on the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. However, bias can be divided into two categories: good bias and bad bias. Good bias can benefit answer prediction, while bad bias may associate the model with unrelated information. Therefore, instead of excluding good and bad bias indiscriminately as existing debiasing methods do, we propose a bias discrimination module to distinguish them. Additionally, bad bias may reduce the model's reliance on image content during answer reasoning, so that image features are barely updated. To tackle this, we leverage Markov theory to construct a Markov field with image regions and question words as nodes, which helps update the features of both image regions and question words and thereby facilitates more accurate and comprehensive reasoning about both the image content and the question. We evaluate our network on the VQA v2 and VQA-CP v2 datasets and conduct extensive quantitative and qualitative studies to verify its effectiveness. Experimental results show that our network achieves significant performance gains over previous state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering.
- Author
Hao, Dongze, Wang, Qunbo, Zhu, Xinxin, and Liu, Jing
- Subjects
- VISUAL learning, COUNTERFACTUALS (Logic), DATA modeling, ANNOTATIONS, NOISE
- Abstract
Despite most state-of-the-art models having achieved amazing performance in Visual Question Answering (VQA), they usually utilize biases to answer the question. Recently, some studies synthesize counterfactual training samples to help the model to mitigate the biases. However, these synthetic samples need extra annotations and often contain noises. Moreover, these methods simply add synthetic samples to the training data to train the model with the cross-entropy loss, which cannot make the best use of synthetic samples to mitigate the biases. In this article, to mitigate the biases in VQA more effectively, we propose a Hierarchical Counterfactual Contrastive Learning (HCCL) method. Firstly, to avoid introducing noises and extra annotations, our method automatically masks the unimportant features in original pairs to obtain positive samples and create mismatched question-image pairs as negative samples. Then our method uses feature-level and answer-level contrastive learning to make the original sample close to positive samples in the feature space, while away from negative samples in both feature and answer spaces. In this way, the VQA model can learn the robust multimodal features and focus on both visual and language information to produce the answer. Our HCCL method can be adopted in different baselines, and the experimental results on VQA v2, VQA-CP, and GQA-OOD datasets show that our method is effective in mitigating the biases in VQA, which improves the robustness of the VQA model. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
47. Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
- Author
Tomohisa Seki, Yoshimasa Kawazoe, Hiromasa Ito, Yu Akagi, Toru Takiguchi, and Kazuhiko Ohe
- Subjects
- large language model, electrocardiography, visual question answering, hallucination, zero-shot learning, Diseases of the circulatory (Cardiovascular) system, RC666-701
- Abstract
Large Language Models (LLM) are increasingly multimodal, and Zero-Shot Visual Question Answering (VQA) shows promise for image interpretation. If zero-shot VQA can be applied to a 12-lead electrocardiogram (ECG), a prevalent diagnostic tool in the medical field, the potential benefits to the field would be substantial. This study evaluated the diagnostic performance of zero-shot VQA with multimodal LLMs on 12-lead ECG images. The results revealed that multimodal LLM tended to make more errors in extracting and verbalizing image features than in describing preconditions and making logical inferences. Even when the answers were correct, erroneous descriptions of image features were common. These findings suggest a need for improved control over image hallucination and indicate that performance evaluation using the percentage of correct answers to multiple-choice questions may not be sufficient for performance assessment in VQA tasks.
- Published
- 2025
- Full Text
- View/download PDF
48. Adaptive sparse triple convolutional attention for enhanced visual question answering
- Author
Wang, Ronggui, Chen, Hong, Yang, Juan, and Xue, Lixia
- Published
- 2025
- Full Text
- View/download PDF
49. SAFFNet: self-attention based on Fourier frequency domain filter network for visual question answering
- Author
Shi, Jingya, Han, Dezhi, Chen, Chongqing, and Shen, Xiang
- Published
- 2025
- Full Text
- View/download PDF
50. Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI
- Author
Lee, Gyeonggeon and Zhai, Xiaoming
- Published
- 2025
- Full Text
- View/download PDF