FlexFormer: Flexible Transformer for efficient visual recognition.
- Source :
- Pattern Recognition Letters, May 2023, Vol. 169, p. 95-101. 7p.
- Publication Year :
- 2023
Abstract
- • Conv-MSA, based on contextual-query dot-products, improves fine-grained feature representation for self-attention. • A Tanh-Softmax hybrid nonlinearization method in linear self-attention enables fast convergence on visual recognition tasks. • The FlexFormer model achieves state-of-the-art recognition accuracy on several benchmarks. • The FlexFormer model works with fewer parameters and higher computational efficiency. Vision Transformers have shown overwhelming superiority over convolutional neural networks in the computer vision community. Nevertheless, the understanding of multi-head self-attention, the de facto core ingredient of Transformers, is still limited, which has led to surging interest in explaining its underlying mechanism. A notable theory holds that, unlike high-frequency-sensitive convolutions, self-attention behaves like a generalized spatial smoothing and blurs high spatial-frequency signals as depth increases. In this paper, we design a Conv-MSA structure to extract efficient local contextual information and remedy this inherent drawback of self-attention. Accordingly, a flexible Transformer structure named FlexFormer, whose computational complexity is linear in the input image size, is proposed. Experimental results on several visual recognition benchmarks show that FlexFormer achieves state-of-the-art results on visual recognition tasks with fewer parameters and higher computational efficiency. [ABSTRACT FROM AUTHOR]
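- The abstract pairs a Tanh-Softmax hybrid nonlinearization with linear self-attention, but the full paper text is not included in this record, so the exact formulation is not given here. The PyTorch sketch below only illustrates one common way such a combination could be realized: tanh applied to queries, softmax applied to keys over the token axis, and the key-value aggregation computed first so the cost grows linearly with the number of tokens. The class name `HybridLinearAttention`, the tensor shapes, and the placement of the two nonlinearities are assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn as nn


class HybridLinearAttention(nn.Module):
    """Illustrative linear self-attention with a tanh/softmax hybrid
    nonlinearization. A sketch of the idea named in the abstract; the
    paper's actual Conv-MSA / FlexFormer formulation may differ."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0, "dim must be divisible by heads"
        self.heads = heads
        self.head_dim = dim // heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split into heads: (batch, heads, n_tokens, head_dim).
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, n, self.heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Hybrid nonlinearization (assumed placement): tanh keeps queries
        # signed and bounded; softmax over the token axis turns keys into
        # normalized aggregation weights.
        q = torch.tanh(q)
        k = k.softmax(dim=-2)

        # Linear-complexity trick: contract keys with values first,
        # an O(n * d^2) step, instead of forming the O(n^2 * d)
        # attention score matrix.
        context = torch.einsum("bhnd,bhne->bhde", k, v)
        out = torch.einsum("bhnd,bhde->bhne", q, context)

        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Usage: 196 tokens (a 14x14 feature map) with embedding dim 256.
attn = HybridLinearAttention(dim=256, heads=8)
tokens = torch.randn(2, 196, 256)
print(attn(tokens).shape)  # torch.Size([2, 196, 256])
```
- The key design point is the order of operations: because the softmax-normalized keys are contracted with the values before the queries are applied, doubling the token count doubles the cost rather than quadrupling it, which is what makes complexity linear in image size.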
Details
- Language :
- English
- ISSN :
- 0167-8655
- Volume :
- 169
- Database :
- Academic Search Index
- Journal :
- Pattern Recognition Letters
- Publication Type :
- Academic Journal
- Accession number :
- 163308878
- Full Text :
- https://doi.org/10.1016/j.patrec.2023.03.028