Efficient image analysis with triple attention vision transformer.
- Source :
- Pattern Recognition, Jun 2024, Vol. 150
- Publication Year :
- 2024
Abstract
- This paper introduces TrpViT, a novel triple attention vision transformer that efficiently captures both local and global features. The proposed architecture addresses global information acquisition by employing three complementary attention mechanisms within a single attention block: Window, Dilated, and Channel attention. This attention block extracts spatially local features while expanding the receptive field to capture richer global context. By integrating the attention block with convolution, a new C-C-T-T architecture is formed. We rigorously evaluate TrpViT, demonstrating state-of-the-art performance on various computer vision tasks, including image classification, 2D and 3D object detection, instance segmentation, and low-level image colorization. Notably, TrpViT achieves strong accuracy across all parameter scales, highlighting its computational efficiency and effectiveness.
  • A Triple Attention Vision Transformer captures both global and local features.
  • TrpViT integrates convolution into the transformer to provide inductive bias.
  • TrpViT compensates for non-local information using multiple complementary attentions.
  • TrpViT achieves state-of-the-art results across both high-level and low-level tasks. [ABSTRACT FROM AUTHOR]
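The abstract only names the three attention branches, so the sketch below is an illustrative reading rather than the authors' implementation: it assumes the Window branch attends within non-overlapping local windows, the Dilated branch attends among pixels that share the same stride phase (a sparse, long-range pattern), and the Channel branch applies squeeze-and-excitation-style gating over channels. Only the attention block is sketched, not the full C-C-T-T stack, and all identifiers (TripleAttentionBlock, window_size, dilation, etc.) are hypothetical.

```python
# Illustrative sketch of a triple-attention block (window + dilated + channel),
# assuming PyTorch; the branch designs below are assumptions, not the paper's code.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows*B, ws*ws, C) over non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(win, ws, B, H, W, C):
    """Inverse of window_partition."""
    x = win.view(B, H // ws, W // ws, ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


class TripleAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads=4, window_size=7, dilation=2):
        super().__init__()
        self.ws, self.dil = window_size, dilation
        self.norm = nn.LayerNorm(dim)
        self.win_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dil_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Channel branch: squeeze-and-excitation-style gating (an assumption).
        self.ch_gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim), nn.Sigmoid()
        )
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W must be divisible by window_size and dilation.
        B, H, W, C = x.shape
        y = self.norm(x)

        # 1) Window attention: self-attention inside each local window.
        w = window_partition(y, self.ws)
        w, _ = self.win_attn(w, w, w)
        out_win = window_reverse(w, self.ws, B, H, W, C)

        # 2) Dilated attention: pixels sharing the same (i % dil, j % dil) phase
        #    attend to each other, enlarging the receptive field sparsely.
        d = self.dil
        z = y.view(B, H // d, d, W // d, d, C).permute(0, 2, 4, 1, 3, 5)
        z = z.reshape(B * d * d, (H // d) * (W // d), C)
        z, _ = self.dil_attn(z, z, z)
        z = z.view(B, d, d, H // d, W // d, C).permute(0, 3, 1, 4, 2, 5)
        out_dil = z.reshape(B, H, W, C)

        # 3) Channel attention: a pooled descriptor gates each channel.
        g = y.mean(dim=(1, 2))                              # (B, C)
        out_ch = y * self.ch_gate(g)[:, None, None, :]

        # Fuse the three complementary branches and add the residual.
        return x + self.proj(torch.cat([out_win, out_dil, out_ch], dim=-1))


# Quick shape check: 28 is divisible by both window_size=7 and dilation=2.
block = TripleAttentionBlock(dim=64, num_heads=4, window_size=7, dilation=2)
x = torch.randn(2, 28, 28, 64)
print(block(x).shape)  # torch.Size([2, 28, 28, 64])
```

In the paper's C-C-T-T layout, blocks like this would presumably occupy the two transformer stages, with convolutional stages before them supplying the inductive bias noted in the highlights; that pairing is inferred from the abstract, not from the published code.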
Details
- Language :
- English
- ISSN :
- 0031-3203
- Volume :
- 150
- Database :
- Academic Search Index
- Journal :
- Pattern Recognition
- Publication Type :
- Academic Journal
- Accession Number :
- 175963884
- Full Text :
- https://doi.org/10.1016/j.patcog.2024.110357