Vision-Language Alignment Learning Under Affinity and Divergence Principles for Few-Shot Out-of-Distribution Generalization.

Authors :
Zhu, Lin
Yin, Weihan
Yang, Yiyao
Wu, Fan
Zeng, Zhaoyu
Gu, Qinying
Wang, Xinbing
Zhou, Chenghu
Ye, Nanyang
Source :
International Journal of Computer Vision. Sep 2024, Vol. 132, Issue 9, p3375-3407. 33p.
Publication Year :
2024

Abstract

Recent advances in fine-tuning large-scale vision-language pre-trained models (VL-PTMs) have shown promising results for rapid adaptation to downstream tasks. However, prior work often lacks a comprehensive investigation of out-of-distribution (OOD) generalization. Fine-tuning carries a risk of overfitting, especially on few-shot OOD datasets where significant distribution shifts arise between the few-shot training examples and the test sets. Previous research on the robustness of fine-tuning to distribution shifts does not distinguish between different characteristics of those shifts and may not effectively handle noisy data with spurious correlations. To address these challenges, we propose Vision-Language Alignment Learning under Affinity and Divergence Principles (VLAD) to adapt VL-PTMs for robust few-shot OOD generalization with theoretical guarantees. Built upon the large-scale pre-trained vision-language foundation model CLIP, we leverage frozen language embeddings as invariant anchors that protect against distribution shifts, while using adapter layers to fine-tune the pre-trained visual features for improved vision-language alignment. In addition, we introduce affinity and divergence principles that further mitigate overfitting during vision-language alignment by increasing class discrimination and suppressing non-causal features. More importantly, we offer theoretical evidence for the superiority of general language knowledge in achieving more robust OOD generalization, and our analysis shows that the proposed regularization loss yields a tighter upper bound on the OOD generalization error. Extensive experiments and ablation studies on diverse datasets substantiate our approach and validate the theoretical findings. The code is available at https://github.com/LinLLLL/VLAD.
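The abstract's core mechanism can be illustrated with a minimal PyTorch sketch: frozen text embeddings serve as invariant class anchors, a small residual adapter refines frozen visual features, and the training objective combines an affinity term (pull each image toward its class anchor) with a divergence-style term (suppress similarity to wrong anchors). This is not the authors' implementation; the module names, residual mixing weight, and hyperparameters below are illustrative assumptions, and random tensors stand in for actual CLIP encoder outputs.

```python
# Hedged sketch of the VLAD idea (hypothetical names; not the authors' code).
import torch
import torch.nn.functional as F

class VisualAdapter(torch.nn.Module):
    """Residual bottleneck adapter over frozen image features (assumed design)."""
    def __init__(self, dim=512, hidden=128, alpha=0.2):
        super().__init__()
        self.down = torch.nn.Linear(dim, hidden)
        self.up = torch.nn.Linear(hidden, dim)
        self.alpha = alpha  # residual mixing weight (assumption)

    def forward(self, x):
        h = self.up(F.relu(self.down(x)))
        return self.alpha * h + (1 - self.alpha) * x  # keep pre-trained signal

def alignment_loss(img_feats, text_anchors, labels, lam=0.5, tau=0.07):
    """Affinity: pull images toward their (frozen) class text anchor.
    Divergence-style penalty: push images away from the other anchors.
    lam and tau are illustrative hyperparameters."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(text_anchors, dim=-1)    # [C, D], frozen anchors
    sims = img @ txt.t() / tau                 # [B, C] cosine similarities
    affinity = F.cross_entropy(sims, labels)   # increase class discrimination
    # Mask out the correct class, then penalize residual similarity to the rest.
    wrong = sims.scatter(1, labels[:, None], float('-inf'))
    divergence = torch.logsumexp(wrong, dim=1).mean()
    return affinity + lam * divergence

# Toy usage with stand-in features (real use would take CLIP encoder outputs).
B, C, D = 8, 4, 512
adapter = VisualAdapter(D)
img_feats = adapter(torch.randn(B, D))         # adapted visual features
text_anchors = torch.randn(C, D)               # frozen language embeddings
labels = torch.randint(0, C, (B,))
loss = alignment_loss(img_feats, text_anchors, labels)
loss.backward()                                # only adapter params receive grads
print(float(loss))
```

Because the text anchors carry no gradient and only the adapter is trainable, the language side stays fixed under distribution shift while the visual side adapts, which is the division of labor the abstract describes.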

Details

Language :
English
ISSN :
0920-5691
Volume :
132
Issue :
9
Database :
Academic Search Index
Journal :
International Journal of Computer Vision
Publication Type :
Academic Journal
Accession number :
179277899
Full Text :
https://doi.org/10.1007/s11263-024-02036-4