Representation Learning Based on Vision Transformer.
- Author
Ran, Ruisheng, Gao, Tianyu, Hu, Qianwei, Zhang, Wenfeng, Peng, Shunshun, and Fang, Bin
- Subjects
TRANSFORMER models, INFORMATION technology, IMAGE reconstruction, CHANNEL coding, DATA visualization
- Abstract
In recent years, with the rapid development of information technology, the volume of image data has grown exponentially. However, these datasets typically contain a large amount of redundant information. To extract effective features and reduce redundancy in images, a representation learning method based on the Vision Transformer (ViT) is proposed; to the best of our knowledge, this is the first application of the Transformer to zero-shot learning (ZSL). The method adopts a symmetric encoder–decoder structure, in which the encoder incorporates the Multi-Head Self-Attention (MSA) mechanism of ViT to reduce the dimensionality of image features, eliminate redundant information, and decrease the computational burden, thereby extracting features effectively, while the decoder reconstructs the image data. We evaluated the representation learning capability of the proposed method on a variety of tasks, including data visualization, image reconstruction, face recognition, and ZSL. Comparisons with state-of-the-art representation learning methods validate the effectiveness of the method in the field of representation learning. [ABSTRACT FROM AUTHOR]
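The symmetric encoder–decoder design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the module names, dimensions, and the choice of PyTorch's built-in Transformer layers are illustrative assumptions. It shows the general idea: patch tokens pass through an MSA-based encoder, a linear bottleneck reduces the per-token dimensionality (discarding redundancy), and a decoder maps the low-dimensional codes back to pixels.

```python
# Hypothetical sketch of a symmetric ViT-style autoencoder (illustrative, not the paper's code).
import torch
import torch.nn as nn

class ViTAutoencoder(nn.Module):
    def __init__(self, img_size=32, patch_size=8, in_ch=3,
                 embed_dim=64, latent_dim=16, heads=4, depth=2):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        patch_dim = in_ch * patch_size ** 2
        # Patch embedding: split the image into patches and project each to embed_dim
        self.patchify = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        enc_layer = nn.TransformerEncoderLayer(embed_dim, heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)   # MSA-based encoder
        # Bottleneck: reduce per-token dimensionality (the low-dimensional representation)
        self.to_latent = nn.Linear(embed_dim, latent_dim)
        self.from_latent = nn.Linear(latent_dim, embed_dim)
        dec_layer = nn.TransformerEncoderLayer(embed_dim, heads, dim_feedforward=128, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, depth)   # symmetric decoder
        self.to_pixels = nn.Linear(embed_dim, patch_dim)
        self.patch_size, self.img_size, self.in_ch = patch_size, img_size, in_ch

    def forward(self, x):
        b = x.shape[0]
        tokens = self.patchify(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        z = self.to_latent(self.encoder(tokens))                         # low-dim codes
        h = self.decoder(self.from_latent(z))
        patches = self.to_pixels(h)                                      # (B, N, patch_dim)
        # Fold the per-patch pixel predictions back into an image
        p, s = self.patch_size, self.img_size // self.patch_size
        img = patches.view(b, s, s, self.in_ch, p, p).permute(0, 3, 1, 4, 2, 5)
        return z, img.reshape(b, self.in_ch, self.img_size, self.img_size)

model = ViTAutoencoder()
x = torch.randn(2, 3, 32, 32)
z, recon = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective
```

Training such a model with the reconstruction loss alone yields the encoder's bottleneck output `z` as a compact representation that can then feed downstream tasks such as visualization, face recognition, or ZSL.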
- Published
- 2024