
Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Authors:
Qian, Yang
Sun, Yinan
Kargarandehkordi, Ali
Azizian, Parnian
Mutlu, Onur Cezmi
Surabhi, Saimourya
Chen, Pingyi
Jabbar, Zain
Wall, Dennis Paul
Washington, Peter
Publication Year:
2024

Abstract

The increasing variety and quantity of tagged multimedia content on online platforms offer a unique opportunity to advance the field of human action recognition. In this study, we use 283,582 unique, unlabeled TikTok video clips, collected under 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating Masked Autoencoders (MAE) with Vision Transformers (ViT), pre-trained on this diverse collection of unstructured videos. Fine-tuned on established action recognition benchmarks, our model achieves state-of-the-art results with the ViT-giant backbone: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2. These results highlight the potential of unstructured, unlabeled videos as a source of diverse and dynamic content for training foundation models. Our investigation confirms that while initial increases in pre-training data volume significantly enhance model performance, the gains diminish as the dataset continues to grow. Our findings emphasize two critical principles of self-supervised learning for computer vision: (1) additional pre-training data can yield diminishing benefits for some datasets, and (2) quality matters more than quantity, especially when building foundation models.

Comment: 10 pages
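
To make the two-stage recipe in the abstract concrete (self-supervised masked-autoencoder pre-training on unlabeled clips, then supervised fine-tuning on a labeled benchmark), the sketch below uses the VideoMAE classes available in Hugging Face transformers. This is an illustrative sketch, not the authors' code: the base-size configuration, the ~90% masking ratio, the random tensors standing in for real video batches, and the UCF101 class count are assumptions, and the paper's VideoMAE V2 ViT-giant setup is substantially larger.

```python
# Illustrative sketch (not the authors' code): MAE-style pre-training on
# unlabeled video clips, then supervised fine-tuning for action recognition.
# Configuration size, masking ratio, and random stand-in data are assumptions.
import torch
from transformers import (
    VideoMAEConfig,
    VideoMAEForPreTraining,
    VideoMAEForVideoClassification,
)

config = VideoMAEConfig()  # ViT-Base defaults; the paper scales up to ViT-giant

# --- Stage 1: self-supervised pre-training on unlabeled clips ---------------
pretrainer = VideoMAEForPreTraining(config)

# One clip: (batch, frames, channels, height, width); real code would decode
# and normalize frames sampled from the unlabeled video collection.
pixel_values = torch.randn(1, config.num_frames, 3, config.image_size, config.image_size)

# Tokens are space-time tubelets; mask roughly 90% of them (typical for video MAE).
patches_per_frame = (config.image_size // config.patch_size) ** 2
seq_length = (config.num_frames // config.tubelet_size) * patches_per_frame
bool_masked_pos = torch.rand(1, seq_length) < 0.9

outputs = pretrainer(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)
outputs.loss.backward()  # reconstruction loss on the masked tubelets

# --- Stage 2: supervised fine-tuning on a labeled benchmark -----------------
config.num_labels = 101  # e.g. UCF101 has 101 action classes
classifier = VideoMAEForVideoClassification(config)
# In practice the encoder weights from stage 1 are loaded here before training
# with a standard cross-entropy loss on (clip, label) pairs.
logits = classifier(pixel_values=pixel_values).logits  # shape: (1, 101)
```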

Details

Database:
arXiv
Publication Type:
Report
Accession Number:
edsarx.2402.08875
Document Type:
Working Paper