A data efficient transformer based on Swin Transformer.

Authors :
Yao, Dazhi
Shao, Yunxue
Source :
Visual Computer, Apr 2024, Vol. 40, Issue 4, pp. 2589-2598.
Publication Year :
2024

Abstract

Almost all Vision Transformer-based models require pre-training on massive datasets at considerable computational cost. If researchers lack the data to train a Vision Transformer-based model, or lack GPUs powerful enough to process millions of labeled images, such models offer no advantage over CNNs. Swin Transformer addresses these problems with shifted window-based self-attention, which has linear computational complexity. Although Swin Transformer significantly reduces computing costs and works well on mid-size datasets, it still performs poorly when trained on a small dataset. In this paper, we propose a hierarchical, data-efficient Transformer based on Swin Transformer, which we call ESwin Transformer. We mainly redesign the patch embedding and patch merging modules of Swin Transformer, applying only simple convolutional components to them, which significantly improves performance when the model is trained on a small dataset. Our empirical results show that ESwin Transformer trained on CIFAR10/CIFAR100 with no extra data for 300 epochs achieves 97.17%/83.78% accuracy and outperforms Swin Transformer and DeiT with the same training time. [ABSTRACT FROM AUTHOR]
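The abstract states that the authors rework Swin Transformer's patch embedding and patch merging with simple convolutional components, but it does not give the exact layout. The PyTorch sketch below is one plausible reading, not the authors' specification: a small convolutional stem used as the patch embedding, producing the same H/4 x W/4 token grid that Swin's standard linear patch embedding feeds to stage 1. The class name ConvPatchEmbed, the layer counts, kernel sizes, and the embed_dim default are all assumptions for illustration.

```python
# Hedged sketch of a convolutional patch embedding for a Swin-style model.
# The abstract only says "unsophisticated convolutional components" were
# added to the patch embedding/merging modules; everything concrete here
# (two 3x3 stride-2 convs, BatchNorm, channel widths) is an assumption.
import torch
import torch.nn as nn


class ConvPatchEmbed(nn.Module):
    """Conv stem replacing Swin's single strided linear projection.

    Two stride-2 convolutions give an overall 4x downsampling, so the
    output is an (H/4 x W/4) grid of embed_dim-dimensional tokens,
    matching what Swin's first stage expects.
    """

    def __init__(self, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, embed_dim, H/4, W/4)
        x = self.stem(x)
        # Flatten the spatial grid into a token sequence: (B, H/4*W/4, embed_dim)
        return x.flatten(2).transpose(1, 2)


if __name__ == "__main__":
    # CIFAR-sized input (32x32) yields an 8x8 = 64-token sequence.
    tokens = ConvPatchEmbed()(torch.randn(2, 3, 32, 32))
    print(tokens.shape)  # torch.Size([2, 64, 96])
```

A conv stem of this kind injects a local inductive bias before self-attention, which is one commonly cited reason such hybrids train better on small datasets than a pure patch-projection front end; whether the paper's modules take exactly this form would need the full text to confirm.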

Details

Language :
English
ISSN :
0178-2789
Volume :
40
Issue :
4
Database :
Academic Search Index
Journal :
Visual Computer
Publication Type :
Academic Journal
Accession number :
176465116
Full Text :
https://doi.org/10.1007/s00371-023-02939-2