
DSCAFormer: Lightweight Vision Transformer With Dual-Branch Spatial Channel Aggregation

Authors :
Jinfeng Li
Peng Wu
Renjie Xu
Xiaoming Zhang
Zhi Han
Source :
IEEE Access, Vol 12, Pp 75272-75288 (2024)
Publication Year :
2024
Publisher :
IEEE, 2024.

Abstract

Vision Transformer (ViT) models achieve strong performance and are widely used in the vision field, but their computational cost and model size often limit their deployment, making the study of lightweight models critical. Existing lightweight ViT models tend to rely weakly on high-frequency information and lack an effective multi-frequency fusion mechanism, so they typically compensate by demanding more computation and larger model scale. To address this, this paper proposes the DSCAFormer model, a novel architecture that combines a local convolutional block with attention-like algorithms and a global spatial transformer block. These components use a channel-splitting strategy with a gating mechanism to merge local and global information, forming a multi-frequency spatial feature extractor that effectively integrates both local and global information from images. Furthermore, a channel aggregation method is introduced to enhance the extraction of spatial information within the channel space, enabling context-aware spatial feature perception and the allocation of multi-feature computation. Comprehensive experiments on classification, detection, and instance segmentation demonstrate the DSCAFormer model’s scalability and competitiveness. For instance, on the ImageNet-1K dataset, DSCAFormer models with 2.5M, 4.3M, and 7.4M parameters achieve top-1 accuracies of 72.5%, 76.7%, and 79.5%, respectively, at 0.4G, 0.7G, and 1.2G FLOPs. These results outperform the MobileViTv2-0.5/0.75/1.0 models by 2.3%, 1.1%, and 1.4% in accuracy while reducing FLOPs by 20%, 30%, and 33%. DSCAFormer also shows competitive performance in downstream tasks.
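To make the dual-branch idea in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of a channel-split block with gated fusion. It is not the authors' implementation: the module names, split ratio, depthwise convolution for the local branch, plain multi-head self-attention for the global branch, and the sigmoid gate are all assumptions standing in for the paper's local attention-like convolutional block, global spatial transformer block, and channel aggregation method, which are described only in the full text.

```python
# Illustrative sketch only; all design details beyond "split channels, process
# locally and globally, gate, and merge" are assumptions, not the paper's code.
import torch
import torch.nn as nn


class DualBranchSplitBlock(nn.Module):
    def __init__(self, channels: int, local_ratio: float = 0.5, num_heads: int = 4):
        super().__init__()
        self.c_local = int(channels * local_ratio)   # channels routed to the local branch
        self.c_global = channels - self.c_local      # channels routed to the global branch

        # Local branch: depthwise conv as a stand-in for the
        # "local convolutional block with attention-like algorithms".
        self.local_branch = nn.Sequential(
            nn.Conv2d(self.c_local, self.c_local, 3, padding=1, groups=self.c_local),
            nn.BatchNorm2d(self.c_local),
            nn.GELU(),
        )
        # Global branch: multi-head self-attention over spatial tokens as a
        # stand-in for the "global spatial transformer block".
        self.attn = nn.MultiheadAttention(self.c_global, num_heads, batch_first=True)
        # Gating mechanism that weights the merged features channel-wise.
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_local, x_global = torch.split(x, [self.c_local, self.c_global], dim=1)

        y_local = self.local_branch(x_local)              # local / high-frequency detail

        tokens = x_global.flatten(2).transpose(1, 2)      # (B, H*W, C_global)
        y_global, _ = self.attn(tokens, tokens, tokens)   # global / low-frequency context
        y_global = y_global.transpose(1, 2).reshape(b, self.c_global, h, w)

        merged = torch.cat([y_local, y_global], dim=1)
        return x + self.fuse(self.gate(merged) * merged)  # gated fusion with residual


if __name__ == "__main__":
    block = DualBranchSplitBlock(64)
    print(block(torch.randn(1, 64, 14, 14)).shape)  # torch.Size([1, 64, 14, 14])
```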

Details

Language :
English
ISSN :
2169-3536
Volume :
12
Database :
Directory of Open Access Journals
Journal :
IEEE Access
Publication Type :
Academic Journal
Accession number :
edsdoj.f6ed0f17632a4cc5b38affde0e31af27
Document Type :
article
Full Text :
https://doi.org/10.1109/ACCESS.2024.3406555