FlexFormer: Flexible Transformer for efficient visual recognition.
- Source :
- Pattern Recognition Letters, May 2023, Vol. 169, p. 95-101. 7p.
- Publication Year :
- 2023
Abstract
- • Conv-MSA, based on contextual-query dot-products, improves fine-grained feature representation for self-attention. • A Tanh-Softmax hybrid nonlinearization method in linear self-attention enables fast convergence on visual recognition tasks. • The FlexFormer model achieves state-of-the-art recognition accuracy on several benchmarks. • The FlexFormer model works with fewer parameters and higher computational efficiency. Vision Transformers have shown overwhelming superiority over convolutional neural networks in the computer vision community. Nevertheless, the understanding of multi-head self-attention, the de facto core ingredient of Transformers, is still limited, which has led to surging interest in explaining its underlying mechanism. A notable theory holds that, unlike high-frequency-sensitive convolutions, self-attention behaves like a generalized spatial smoothing and blurs high spatial-frequency signals as depth increases. In this paper, we design a Conv-MSA structure to extract efficient local contextual information and remedy this inherent drawback of self-attention. Accordingly, a flexible Transformer structure named FlexFormer, whose computational complexity is linear in the input image size, is proposed. Experimental results on several visual recognition benchmarks show that FlexFormer achieves state-of-the-art results on visual recognition tasks with fewer parameters and higher computational efficiency. [ABSTRACT FROM AUTHOR]
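- The abstract pairs a Tanh-Softmax hybrid nonlinearization with linear self-attention, but the full paper text is not included in this record, so the exact formulation is not given here. The PyTorch sketch below only illustrates one common way such a combination could be realized: tanh applied to queries, softmax applied to keys over the token axis, and the key-value aggregation computed first so the cost grows linearly with the number of tokens. The class name `HybridLinearAttention`, the tensor shapes, and the placement of the two nonlinearities are assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn as nn


class HybridLinearAttention(nn.Module):
    """Illustrative linear self-attention with a tanh/softmax hybrid
    nonlinearization. A sketch of the idea named in the abstract; the
    paper's actual Conv-MSA / FlexFormer formulation may differ."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0, "dim must be divisible by heads"
        self.heads = heads
        self.head_dim = dim // heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split into heads: (batch, heads, n_tokens, head_dim).
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, n, self.heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Hybrid nonlinearization (assumed placement): tanh keeps queries
        # signed and bounded; softmax over the token axis turns keys into
        # normalized aggregation weights.
        q = torch.tanh(q)
        k = k.softmax(dim=-2)

        # Linear-complexity trick: contract keys with values first,
        # an O(n * d^2) step, instead of forming the O(n^2 * d)
        # attention score matrix.
        context = torch.einsum("bhnd,bhne->bhde", k, v)
        out = torch.einsum("bhnd,bhde->bhne", q, context)

        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Usage: 196 tokens (a 14x14 feature map) with embedding dim 256.
attn = HybridLinearAttention(dim=256, heads=8)
tokens = torch.randn(2, 196, 256)
print(attn(tokens).shape)  # torch.Size([2, 196, 256])
```
- The key design point is the order of operations: because the softmax-normalized keys are contracted with the values before the queries are applied, doubling the token count doubles the cost rather than quadrupling it, which is what makes complexity linear in image size.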
Details
- Language :
- English
- ISSN :
- 0167-8655
- Volume :
- 169
- Database :
- Academic Search Index
- Journal :
- Pattern Recognition Letters
- Publication Type :
- Academic Journal
- Accession number :
- 163308878
- Full Text :
- https://doi.org/10.1016/j.patrec.2023.03.028