Cascading spatio-temporal attention network for real-time action detection.

Authors :: Yang, Jianhua
Wang, Ke
Li, Ruifeng
Perner, Petra
Source :: Machine Vision & Applications. Nov2023, Vol. 34 Issue 6, p1-12. 12p.
Publication Year :: 2023
Abstract: Accurately detecting human actions in video has many applications, such as video surveillance and somatosensory games. In this paper, we propose a spatial-aware attention module (SAM) and a temporal-aware attention module (TAM) for spatio-temporal action detection in videos. SAM first concatenates the feature maps of consecutive frames along the channel dimension and then uses a dilated convolutional layer followed by a sigmoid function to generate a spatial attention map. The resulting attention map contains spatial information from consecutive frames, so it helps the detector focus on salient spatial features and localize action instances more accurately across those frames. TAM uses several fully connected layers to generate a temporal attention map. This map models the temporal association of each spatial feature, which lets it capture the temporal association of action instances and thereby improves the detector's ability to track actions. To evaluate the effectiveness of SAM and TAM, we build an efficient and strong anchor-free action detector, the cascading spatio-temporal attention network, equipped with a 2D backbone and the SAM and TAM modules. Extensive experiments on two benchmarks, JHMDB and UCF101-24, demonstrate the effectiveness of SAM and TAM. [ABSTRACT FROM AUTHOR]
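
The two attention maps described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The abstract does not give kernel sizes, dilation rates, layer widths, or the input to the fully connected layers, so the 3x3 kernel, the dilation of 2, the channel-averaged per-frame descriptor, the two-layer ReLU/sigmoid stack, and all function and parameter names below are assumptions chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(frames, weights, dilation=2):
    """SAM-style map: concatenate frames on the channel axis, apply one
    dilated 3x3 convolution with a single output channel, then a sigmoid.

    frames:  K feature maps, each shaped [C][H][W]
    weights: hypothetical conv kernel shaped [K*C][3][3]
    returns: [H][W] attention map with values between 0 and 1
    """
    # Concatenate the K frames along the channel axis -> [K*C][H][W].
    stacked = [channel for frame in frames for channel in frame]
    height, width = len(stacked[0]), len(stacked[0][0])
    attention = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for c, channel in enumerate(stacked):
                for ky in range(3):
                    for kx in range(3):
                        # Dilation spreads the 3x3 taps apart.
                        yy = y + (ky - 1) * dilation
                        xx = x + (kx - 1) * dilation
                        # Taps outside the map count as zero (zero padding).
                        if 0 <= yy < height and 0 <= xx < width:
                            total += weights[c][ky][kx] * channel[yy][xx]
            attention[y][x] = sigmoid(total)
    return attention


def temporal_attention(frames, w1, w2):
    """TAM-style map: at each spatial position, pass a length-K descriptor
    (one value per frame) through two fully connected layers.

    w1: assumed hidden-layer weights shaped [hidden][K]
    w2: assumed output-layer weights shaped [K][hidden]
    returns: [K][H][W], one weight per frame at each position
    """
    num_frames, num_channels = len(frames), len(frames[0])
    height, width = len(frames[0][0]), len(frames[0][0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(num_frames)]
    for y in range(height):
        for x in range(width):
            # Assumption: summarize each frame at (y, x) by its channel mean.
            desc = [
                sum(frames[k][c][y][x] for c in range(num_channels)) / num_channels
                for k in range(num_frames)
            ]
            hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w1]
            for k, row in enumerate(w2):
                out[k][y][x] = sigmoid(sum(w * h for w, h in zip(row, hidden)))
    return out


# Toy input: K=2 frames, C=3 channels, 4x5 feature maps, random weights.
random.seed(0)
K, C, H, W, HIDDEN = 2, 3, 4, 5, 4
frames = [[[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)]
           for _ in range(C)] for _ in range(K)]
conv_w = [[[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
          for _ in range(K * C)]
w1 = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(HIDDEN)]
w2 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(K)]

spatial_map = spatial_attention(frames, conv_w)
temporal_map = temporal_attention(frames, w1, w2)
print(len(spatial_map), len(spatial_map[0]))
print(len(temporal_map), len(temporal_map[0]), len(temporal_map[0][0]))
```

In a detector, maps like these would typically rescale the backbone features before the detection head. How and in what order the paper applies them is not stated in the abstract, so that step is left out here.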