1. Diverse Features Fusion Network for video-based action recognition.
- Author
- Deng, Haoyang; Kong, Jun; Jiang, Min; and Liu, Tianshan
- Subjects
- *HUMAN activity recognition, *CONVOLUTIONAL neural networks, *FEATURE extraction, *MULTISENSOR data fusion, *STREAMING video & television
- Abstract
The two-stream convolutional network has been proved to be one milestone in the study of video-based action recognition. Lots of recent works modify internal structure of two-stream convolutional network directly and put top-level features into a 2D/3D convolution fusion module or a simpler one. However, these fusion methods cannot fully utilize features and the way fusing only top-level features lacks rich vital details. To tackle these issues, a novel network called Diverse Features Fusion Network (DFFN) is proposed. The fusion stream of DFFN contains two types of uniquely designed modules, the diverse compact bilinear fusion (DCBF) module and the channel-spatial attention (CSA) module, to distill and refine diverse compact spatiotemporal features. The DCBF modules use the diverse compact bilinear algorithm to fuse features extracted from multiple layers of the base network that are called diverse features in this paper. Further, the CSA module leverages channel attention and multi-size spatial attention to boost key information as well as restraining the noise of fusion features. We evaluate our three-stream network DFFN on three public challenging video action benchmarks: UCF101, HMDB51 and Something-Something V1. Experiment results indicate that our method achieves state-of-the-art performance.
• Diverse compact bilinear fusion modules fuse spatial features and temporal features.
• Diverse compact bilinear fusion modules fuse diverse features from different layers.
• The multi-size spatial-channel attention module can adjust the attention of network.
• The proposed network achieves state-of-the-art performance.
[ABSTRACT FROM AUTHOR]
- Published
- 2021