Scaling 4D Representations

Authors :: Carreira, João
Gokay, Dilara
King, Michael
Zhang, Chuhan
Rocco, Ignacio
Mahendran, Aravindh
Keck, Thomas Albert
Heyward, Joseph
Koppula, Skanda
Pot, Etienne
Erdogan, Goker
Hasson, Yana
Yang, Yi
Greff, Klaus
Le Moing, Guillaume
van Steenkiste, Sjoerd
Zoran, Daniel
Hudson, Drew A.
Vélez, Pedro
Polanía, Luisa
Friedman, Luke
Duvarney, Chris
Goroshin, Ross
Allen, Kelsey
Walker, Jacob
Kabra, Rishabh
Aboussouan, Eric
Sun, Jennifer
Kipf, Thomas
Doersch, Carl
Pătrăucean, Viorica
Damen, Dima
Luc, Pauline
Sajjadi, Mehdi S. M.
Zisserman, Andrew
Publication Year :: 2024
Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks (action classification, ImageNet classification, etc.). In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M all the way to by far the largest reported self-supervised video model, at 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.