Transformer for multiple object tracking: Exploring locality to vision.

Authors :: Wu, Shan
Hadachi, Amnir
Lu, Chaoru
Vivet, Damien
Source :: Pattern Recognition Letters. Jun 2023, Vol. 170, p70-76. 7p.
Publication Year :: 2023
Highlights:
• Substituting the FFNs of the transformer encoders with locality modules enriches the local contexts in multi-object tracking.
• Quantitative and qualitative results show the strength of locality in transformer-based MOT systems.
• The performance of pedestrian tracking is boosted when the model is trained on a mixture of detection and tracking datasets.

Abstract: Multi-object tracking (MOT) is a critical task in various domains, such as traffic analysis, surveillance, and autonomous vehicles. The joint-detection-and-tracking paradigm, which has been extensively researched, is faster and more convenient to train and deploy than the classic tracking-by-detection paradigm while achieving state-of-the-art performance. This paper explores the possibilities of enhancing the MOT system by combining the prevailing convolutional neural network (CNN) with a novel vision transformer technique, locality. Transformers adopted for computer vision tasks have several deficiencies: while they are good at modeling global information over a long embedding, they lack a locality mechanism for learning local features. This can cause small objects to be overlooked, which may raise security issues. We combine the TransTrack MOT system with the locality mechanism inspired by LocalViT and find that the locality-enhanced system outperforms the baseline TransTrack by 5.3% MOTA on the MOT17 dataset. [ABSTRACT FROM AUTHOR]
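
The locality idea described in the abstract can be illustrated with a minimal sketch. It assumes a LocalViT-style design in which the encoder's feed-forward network (FFN) is replaced by a 1x1 expansion, a 3x3 depthwise convolution over the token grid, and a 1x1 projection. The function names (`pointwise`, `depthwise_conv3x3`, `locality_ffn`), the ReLU activation, the zero padding, the omission of the residual connection, and the use of plain Python lists are illustrative assumptions, not the authors' implementation.

```python
def relu(tokens):
    """Element-wise ReLU over a list of token vectors."""
    return [[max(0.0, v) for v in tok] for tok in tokens]


def pointwise(tokens, weight):
    """1x1 convolution, i.e. the same linear map applied to every token.

    tokens: N x C_in nested lists; weight: C_in x C_out nested lists.
    """
    c_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(c_out)]
        for tok in tokens
    ]


def depthwise_conv3x3(grid, kernels):
    """3x3 depthwise convolution with zero padding.

    grid: H x W x C nested lists; kernels: one 3x3 kernel per channel.
    Each channel is filtered independently, so tokens only mix spatially.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernels[ch][dy + 1][dx + 1] * grid[yy][xx][ch]
                out[y][x][ch] = acc
    return out


def locality_ffn(tokens, h, w, w_in, kernels, w_out):
    """Hypothetical locality module standing in for a transformer FFN.

    tokens: (h * w) x C token sequence in row-major order.
    Returns a sequence of the same length; the residual connection
    is left to the caller.
    """
    if len(tokens) != h * w:
        raise ValueError("token count must equal h * w")
    hidden = relu(pointwise(tokens, w_in))                   # 1x1 expansion
    grid = [hidden[r * w:(r + 1) * w] for r in range(h)]     # sequence -> 2D grid
    mixed = depthwise_conv3x3(grid, kernels)                 # local spatial mixing
    flat = relu([tok for row in mixed for tok in row])       # 2D grid -> sequence
    return pointwise(flat, w_out)                            # 1x1 projection
```

With a 3x3 grid, one channel, identity 1x1 weights, and an all-ones kernel, a single active corner token spreads only to its three spatial neighbours, whereas a plain per-token FFN would leave every other token untouched. In practice such a module would be written with a deep learning framework's convolution layers rather than nested lists.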