
Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations

Authors:
Zhu, Kangyu
Qin, Ziyuan
Yi, Huahui
Jiang, Zekun
Lao, Qicheng
Zhang, Shaoting
Li, Kang
Publication Year:
2025

Abstract

While mainstream vision-language models (VLMs) have advanced rapidly in understanding image-level information, they still lack the ability to focus on specific regions designated by humans. Rather, they typically rely on large volumes of high-quality image-text paired data to learn and generate posterior attention maps. To address this critical issue, we propose leveraging visual prompts: simple visual markers in various forms that guide and enhance the formation of region-specific attention. Thus, we introduce MedVP, a pioneering framework that integrates medical entity extraction, visual prompt generation, and dataset adaptation for visual-prompt-guided fine-tuning. Our approach outperforms recent state-of-the-art large models across multiple medical VQA datasets. Extensive experiments and human evaluation are conducted to analyze the impact of different visual prompt forms and how they contribute to performance improvement. The results demonstrate both the effectiveness and clinical significance of our approach.

Comment: Accepted to NAACL 2025 Main Conference
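As a rough illustration of the pipeline the abstract describes (entity extraction, visual prompt generation, and assembly of prompt-marked VQA samples), here is a minimal Python sketch. Every function name, the toy entity vocabulary, and the dummy bounding box are assumptions for illustration only, not the authors' implementation; in the paper, extraction and localization would be handled by dedicated models.

from PIL import Image, ImageDraw

def extract_entities(question: str) -> list[str]:
    # Placeholder entity extractor; MedVP uses a learned medical entity
    # extraction step rather than a fixed vocabulary like this one.
    vocab = ["lung", "heart", "liver", "lesion", "nodule"]
    return [w for w in vocab if w in question.lower()]

def locate(image: Image.Image, entity: str) -> tuple[int, int, int, int]:
    # Placeholder grounding step returning a bounding box (x0, y0, x1, y1).
    # In practice this would come from a grounding/detection model;
    # here we just return a dummy central region.
    w, h = image.size
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)

def draw_visual_prompt(image: Image.Image, box, form: str = "rectangle") -> Image.Image:
    # Render the visual prompt in one of several forms (the paper studies
    # multiple marker variations; two are shown here).
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    if form == "rectangle":
        draw.rectangle(box, outline="red", width=3)
    elif form == "ellipse":
        draw.ellipse(box, outline="red", width=3)
    return marked

def prepare_vqa_sample(image: Image.Image, question: str):
    # Produce the (marked image, question) pair that would feed
    # visual-prompt-guided fine-tuning of a VLM.
    for entity in extract_entities(question):
        image = draw_visual_prompt(image, locate(image, entity))
    return image, question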

Details

Database:
arXiv
Publication Type:
Report
Accession Number:
edsarx.2501.02385
Document Type:
Working Paper