COLAFormer: Communicating local–global features with linear computational complexity.
- Source :
- Pattern Recognition, Jan 2025, Vol. 157
- Publication Year :
- 2025
Abstract
- Local and sparse attention effectively reduce the high computational cost of global self-attention, but they suffer from missing global dependencies and coarse feature capture, respectively. While some subsequent models employ interaction techniques for better classification performance, we observe that the computation and memory of these models grow rapidly as resolution increases, making them costly to apply to downstream tasks with large input resolutions. In response, we propose an effective backbone network based on a novel attention mechanism called Concatenating glObal tokens in Local Attention (COLA), which has linear computational complexity. COLA is straightforward to implement: it incorporates global information into local attention by concatenation. We introduce a learnable condensing feature (LCF) module to capture high-quality global information. LCF has the following properties: (1) it performs a clustering-like function, aggregating image patches into a smaller number of tokens based on similarity; (2) the number of aggregated tokens is constant regardless of image size, which makes it a linear-complexity operator. Based on COLA, we build COLAFormer, which achieves global dependency and fine-grained feature capture with linear computational complexity and demonstrates impressive performance across various vision tasks. Specifically, our COLAFormer-S achieves 84.5% classification accuracy, surpassing other advanced models by 0.4% with similar or lower resource consumption. Furthermore, COLAFormer-S achieves better object detection performance while consuming only 1/4 of the resources of other state-of-the-art models. The code and models will be made publicly available.
- Highlights:
• We introduce a novel backbone called COLAFormer, based on Concatenating glObal tokens in Local Attention (COLA). COLA lets local attention and sparse attention communicate in a concatenating manner; it achieves linear complexity and contributes significantly to downstream tasks.
• To provide high-quality global information for COLA, we introduce an effective downsampling module, the learnable condensing feature (LCF) module, which performs a clustering-like function while downsampling tokens.
• Extensive experiments across various vision tasks demonstrate that COLAFormer achieves impressive classification accuracy on ImageNet-1K and strikes a balance between performance and resource consumption on downstream tasks with large input resolutions, such as object detection. [ABSTRACT FROM AUTHOR]
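The mechanism described in the abstract can be sketched in a few lines. The following is a minimal NumPy toy, not the authors' implementation: it assumes single-head attention, 1-D token windows, and random rather than learned condensing queries. All names (`lcf`, `cola_attention`, `w_cond`) are illustrative. The key points it demonstrates are that LCF produces a constant number `m` of aggregated tokens via a clustering-like soft assignment (linear in the token count `n`), and that COLA concatenates those global tokens onto each local window's keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lcf(x, w_cond):
    """Learnable condensing feature (LCF) sketch.

    x: (n, d) patch tokens; w_cond: (m, d) condensing queries.
    Softly assigns every patch to one of m aggregated tokens by
    similarity, so the output size is constant in n -- a
    linear-complexity operator.
    """
    assign = softmax(w_cond @ x.T, axis=-1)  # (m, n) soft assignment
    return assign @ x                        # (m, d) global tokens

def cola_attention(x, w_cond, window=4):
    """COLA sketch: local attention with concatenated global tokens.

    x: (n, d) tokens, n divisible by `window` (1-D windows for brevity).
    """
    n, d = x.shape
    g = lcf(x, w_cond)  # constant-size global tokens
    out = np.empty_like(x)
    for s in range(0, n, window):
        q = x[s:s + window]                                   # local queries
        kv = np.concatenate([x[s:s + window], g], axis=0)     # local + global
        attn = softmax(q @ kv.T / np.sqrt(d), axis=-1)
        out[s:s + window] = attn @ kv
    return out
```

Because each window attends to `window + m` keys and `m` is fixed, total cost is O(n·(window + m)·d), i.e. linear in the number of tokens, in contrast to the quadratic cost of global self-attention.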
Details
- Language :
- English
- ISSN :
- 0031-3203
- Volume :
- 157
- Database :
- Academic Search Index
- Journal :
- Pattern Recognition
- Publication Type :
- Academic Journal
- Accession number :
- 179603374
- Full Text :
- https://doi.org/10.1016/j.patcog.2024.110870