Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition.

Authors :: Zhang, Xiaoyan; Cui, Yujie; Huo, Yongkai
Source :: Visual Computer; Aug2023, Vol. 39 Issue 8, p3247-3257, 11p
Publication Year :: 2023
Abstract: 360° video action recognition is one of the most promising fields given the popularity of omnidirectional cameras. To obtain a more precise understanding of actions in panoramic scenes, we propose a deformable patch embedding-based temporal shift module-enhanced vision transformer model (DS-ViT), which aims to simultaneously eliminate the distortion effects caused by equirectangular projection (ERP) and model the temporal relationships among video frames. Panoramic action recognition is a practical but challenging domain because of the lack of panoramic feature extraction methods. With deformable patch embedding, our scheme adaptively learns the position offsets between different pixels, which effectively captures the distorted features. The temporal shift module facilitates temporal information exchange by shifting part of the channels, adding zero parameters. Thanks to the powerful encoder, DS-ViT can efficiently learn the distorted features from the ERP inputs. Simulation results on the recent EgoK360 dataset show that our proposed solution outperforms the state-of-the-art two-stream solution by 9.29% in action accuracy and 8.18% in activity accuracy. [ABSTRACT FROM AUTHOR]
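
Illustrative note: the abstract states that the temporal shift module exchanges temporal information by shifting part of the channels without adding parameters. The sketch below shows one commonly used form of that idea in plain Python; it is not the authors' implementation. The function name `temporal_shift`, the `fold_div` parameter, and the choice to shift one group of channels forward in time and another group backward are assumptions made for illustration, since the abstract does not give the shift ratio or direction.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames: list of T per-frame feature vectors, each a list of C numbers.
    fold_div: hypothetical parameter; C // fold_div channels move in each direction.
    Returns a new list of the same shape; no learnable parameters are involved.
    """
    t = len(frames)
    c = len(frames[0])
    fold = c // fold_div
    out = [[0.0] * c for _ in range(t)]  # vacated positions stay zero-padded
    for i in range(t):
        for j in range(c):
            if j < fold:
                src = i - 1      # first group: take the value from the previous frame
            elif j < 2 * fold:
                src = i + 1      # second group: take the value from the next frame
            else:
                src = i          # remaining channels are left in place
            if 0 <= src < t:
                out[i][j] = frames[src][j]
    return out


if __name__ == "__main__":
    # Three frames with four channels each; fold_div=4 shifts one channel each way.
    demo = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for row in temporal_shift(demo, fold_div=4):
        print(row)
```

After the shift, each frame's feature vector mixes channels from its neighbours, so a subsequent per-frame layer sees some temporal context even though the shift itself only moves data.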
