An interactive network based on transformer for multimodal crowd counting.

Authors :: Yu, Ying; Cai, Zhen; Miao, Duoqian; Qian, Jin; Tang, Hong
Source :: Applied Intelligence; Oct2023, Vol. 53 Issue 19, p22602-22614, 13p
Publication Year :: 2023
Abstract: Crowd counting is the task of estimating the total number of pedestrians in an image. Most existing research addresses scenes with good visibility, such as parks, squares, and bright shopping malls during the day, while complex scenes in darkness have received little attention. To study this problem, we propose an interactive network based on Transformer for multimodal crowd counting. First, sliding convolutional encoding is applied to the image to obtain better encoded features. The features are then extracted by the designed primary interaction network, and channel token attention is used to modulate them. Next, the FGAF-MLP fuses high-level and low-level semantic features to enhance the feature representation and fully fuse the data from the different modalities, improving the accuracy of the method. To verify the effectiveness of our method, we conducted extensive ablation experiments on the latest multimodal benchmark, RGBT-CC, which confirmed the complementarity of the modalities and the effectiveness of the model components. We also verified our method on the ShanghaiTechRGBD benchmark. The experimental results show that the proposed method improves the mean absolute error and mean squared error on the RGBT-CC benchmark by more than 10%. [ABSTRACT FROM AUTHOR]
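
Note: The two error metrics the abstract reports can be illustrated with a minimal sketch, assuming the usual crowd-counting definitions (MAE as the mean of absolute per-image count errors, MSE as the mean of squared per-image count errors). The counts below are hypothetical and are not results from the paper. Many crowd-counting papers report the square root of this quantity under the name MSE; the abstract does not say which convention is used, so both are shown.

```python
import math

def mae(pred, true):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mse(pred, true):
    """Mean squared error between predicted and ground-truth counts."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root of the mean squared error (often reported as 'MSE' in this field)."""
    return math.sqrt(mse(pred, true))

def relative_improvement(baseline, new):
    """Fractional error reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

# Hypothetical per-image counts, for illustration only.
true_counts = [100, 250, 40, 80]
baseline_pred = [120, 220, 50, 70]   # assumed baseline model output
proposed_pred = [110, 240, 45, 75]   # assumed improved model output

mae_base = mae(baseline_pred, true_counts)
mae_new = mae(proposed_pred, true_counts)
print(f"MAE: {mae_base} -> {mae_new}, "
      f"improvement {relative_improvement(mae_base, mae_new):.1%}")
print(f"RMSE: {rmse(baseline_pred, true_counts):.2f} -> "
      f"{rmse(proposed_pred, true_counts):.2f}")
```

Under these definitions, an "improvement of more than 10%" means `relative_improvement` exceeds 0.10 for the metric in question.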