
UVIS: Unsupervised Video Instance Segmentation

Authors:
Huang, Shuaiyi
Suri, Saksham
Gupta, Kamal
Rambhatla, Sai Saketh
Lim, Ser-Nam
Shrivastava, Abhinav
Publication Year:
2024

Abstract

Video instance segmentation (VIS) requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on mask, box, or category-label annotations, we propose UVIS, a novel framework that performs video instance segmentation without any video annotations or dense label-based pretraining. Our key insight is to leverage the dense shape prior of the self-supervised vision foundation model DINO and the open-set recognition ability of the image-caption-supervised vision-language model CLIP. UVIS consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design: a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YoutubeVIS-2019, YoutubeVIS-2021, and Occluded VIS. UVIS achieves 21.1 AP on YoutubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework.

Comment: CVPR 2024 Workshop
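
As a rough illustration of the query-based tracking step, the sketch below shows one way a tracking memory bank could link per-frame instance queries into temporally consistent tracks. The class name TrackingMemoryBank, the cosine-distance cost, the Hungarian assignment, and the EMA update rule are all assumptions made here for illustration; the abstract does not specify these details.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


class TrackingMemoryBank:
    """Hypothetical tracking memory bank: one embedding per active track,
    matched to per-frame instance queries and refreshed by an exponential
    moving average (EMA). Illustrative only; the paper's exact matching
    cost and update rule are not given in the abstract."""

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.track_embeds: list[np.ndarray] = []  # one L2-normalized vector per track

    @staticmethod
    def _normalize(x: np.ndarray) -> np.ndarray:
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    def match_and_update(self, frame_queries: np.ndarray) -> list[int]:
        """Assign each frame-level query (N, D) a track id, opening a new
        track for every unmatched query."""
        frame_queries = self._normalize(frame_queries)
        if not self.track_embeds:
            self.track_embeds = [q.copy() for q in frame_queries]
            return list(range(len(frame_queries)))

        memory = np.stack(self.track_embeds)      # (T, D)
        cost = 1.0 - frame_queries @ memory.T     # cosine distance, (N, T)
        rows, cols = linear_sum_assignment(cost)  # Hungarian matching

        ids = [-1] * len(frame_queries)
        for r, c in zip(rows, cols):
            ids[r] = c
            # EMA update keeps each track embedding temporally consistent.
            self.track_embeds[c] = self._normalize(
                self.momentum * self.track_embeds[c]
                + (1.0 - self.momentum) * frame_queries[r]
            )
        for i, t in enumerate(ids):               # start new tracks
            if t == -1:
                self.track_embeds.append(frame_queries[i].copy())
                ids[i] = len(self.track_embeds) - 1
        return ids
```

In use, one instance of the bank would persist across a clip, with match_and_update called once per frame on the decoder's instance queries; queries left unmatched by the assignment simply open new tracks.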

Details

Database:
arXiv
Publication Type:
Report
Accession Number:
edsarx.2406.06908
Document Type:
Working Paper