
Representation Learning Based on Vision Transformer.

Authors :
Ran, Ruisheng
Gao, Tianyu
Hu, Qianwei
Zhang, Wenfeng
Peng, Shunshun
Fang, Bin
Source :
International Journal of Pattern Recognition & Artificial Intelligence; Jun 2024, Vol. 38, Issue 7, p1-23, 23p
Publication Year :
2024

Abstract

In recent years, with the rapid development of information technology, the volume of image data has grown exponentially. However, these data typically contain a large amount of redundant information. To extract effective features from images and reduce redundancy, a representation learning method based on the Vision Transformer (ViT) is proposed; to the best of our knowledge, this is the first application of the Transformer to zero-shot learning (ZSL). The method adopts a symmetric encoder–decoder structure, in which the encoder incorporates the Multi-Head Self-Attention (MSA) mechanism of ViT to reduce the dimensionality of image features, eliminate redundant information, and lower the computational burden, thereby extracting features effectively; the decoder is used to reconstruct the image data. We evaluate the representation learning capability of the proposed method on a variety of tasks, including data visualization, image reconstruction, face recognition, and ZSL. Comparisons with state-of-the-art representation learning methods validate the effectiveness of the method in the field of representation learning. [ABSTRACT FROM AUTHOR]
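
The record does not include the authors' code; the following is a minimal, illustrative sketch in PyTorch of the symmetric encoder–decoder idea the abstract describes: MSA-based Transformer blocks encode patch embeddings, a linear projection reduces their dimensionality, and a mirrored decoder reconstructs the patches. All names, layer sizes, and the training objective here are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ViTAutoencoderSketch(nn.Module):
        """Hypothetical sketch of a symmetric encoder-decoder:
        MSA blocks encode patch embeddings, a linear bottleneck
        compresses them, and a mirrored decoder reconstructs."""
        def __init__(self, patch_dim=768, latent_dim=128, num_heads=4, depth=2):
            super().__init__()
            enc_layer = nn.TransformerEncoderLayer(
                d_model=patch_dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
            # Bottleneck: reduces feature dimensionality, discarding redundancy
            self.to_latent = nn.Linear(patch_dim, latent_dim)
            self.from_latent = nn.Linear(latent_dim, patch_dim)
            dec_layer = nn.TransformerEncoderLayer(
                d_model=patch_dim, nhead=num_heads, batch_first=True)
            self.decoder = nn.TransformerEncoder(dec_layer, num_layers=depth)

        def forward(self, patches):
            # patches: (batch, num_patches, patch_dim)
            z = self.to_latent(self.encoder(patches))   # compressed representation
            recon = self.decoder(self.from_latent(z))   # reconstructed patches
            return z, recon

    x = torch.randn(8, 196, 768)   # e.g. 14x14 patches of a 224x224 image
    model = ViTAutoencoderSketch()
    z, recon = model(x)
    loss = nn.functional.mse_loss(recon, x)   # reconstruction objective

Under these assumptions, training with the reconstruction loss yields the compressed representation z, which could then serve the downstream tasks the abstract lists (data visualization, face recognition, ZSL).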

Details

Language :
English
ISSN :
0218-0014
Volume :
38
Issue :
7
Database :
Complementary Index
Journal :
International Journal of Pattern Recognition & Artificial Intelligence
Publication Type :
Academic Journal
Accession number :
178279126
Full Text :
https://doi.org/10.1142/S0218001424590043