1. Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- Authors
Yang, Antoine; Nagrani, Arsha; Seo, Paul Hongsuck; Miech, Antoine; Pont-Tuset, Jordi; Laptev, Ivan; Sivic, Josef; Schmid, Cordelia
Affiliations: Models of visual object recognition and scene understanding (WILLOW), Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL), Institut National de Recherche en Informatique et en Automatique (Inria), Centre National de la Recherche Scientifique (CNRS), Inria de Paris; Google Research; DeepMind (London); Czech Institute of Informatics, Robotics and Cybernetics (CIIRC), Czech Technical University in Prague (CTU)
Funding: The work was partially funded by a Google gift; the French government under the management of the Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute, PaRis Artificial Intelligence Research InstitutE); the Louis Vuitton ENS Chair on Artificial Intelligence; and the European Regional Development Fund under project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468).
- Subjects
Dense Video Captioning; Video Captioning; Video Paragraph Captioning; Video Understanding; Vision and Language; Computer Vision; Language Model; Pretraining; Few-Shot Learning; FOS: Computer and information sciences; arXiv: Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), Machine Learning (cs.LG), Artificial Intelligence (cs.AI); HAL: [INFO] Computer Science [cs], [INFO.INFO-CV] Computer Vision and Pattern Recognition [cs.CV], [INFO.INFO-CL] Computation and Language [cs.CL], [INFO.INFO-LG] Machine Learning [cs.LG], [INFO.INFO-AI] Artificial Intelligence [cs.AI]; ACM: I.2 Artificial Intelligence, I.2.6 Learning, I.2.7 Natural Language Processing, I.2.10 Vision and Scene Understanding, I.5 Pattern Recognition
- Abstract
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at https://antoyang.github.io/vid2seq.html.
Comment: CVPR 2023 camera-ready; project webpage: https://antoyang.github.io/vid2seq.html; 18 pages; 6 figures.
- Published
- 2023
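To make the abstract's description concrete, below is a minimal, hypothetical Python sketch of how timestamped events can be serialized into a single output sequence with discrete time tokens, and how transcribed speech sentences can serve as pseudo event annotations for pretraining. The bin count, token names, and helper functions are illustrative assumptions and are not taken from the Vid2Seq release.

```python
NUM_TIME_BINS = 100  # assumption: the video timeline is quantized into 100 relative time tokens


def time_to_token(t_seconds: float, video_duration: float) -> str:
    """Map an absolute timestamp to a discrete time token such as <time_10>."""
    k = min(int(t_seconds / video_duration * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return f"<time_{k}>"


def events_to_target_sequence(events, video_duration: float) -> str:
    """Serialize (start, end, caption) events into one token sequence so that
    event boundaries and text are predicted jointly by the language model."""
    pieces = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        pieces += [time_to_token(start, video_duration),
                   time_to_token(end, video_duration),
                   caption]
    return " ".join(pieces)


# Pseudo-labels for pretraining: each transcribed speech sentence contributes its
# time span as pseudo event boundaries and its text as a pseudo event caption.
speech_sentences = [
    (12.0, 25.5, "add the chopped onions to the pan"),
    (40.2, 58.0, "stir in the tomato sauce"),
]
print(events_to_target_sequence(speech_sentences, video_duration=120.0))
# <time_10> <time_21> add the chopped onions to the pan <time_33> <time_48> stir in the tomato sauce
```

At inference time, the predicted sequence can be parsed back into timestamped captions by inverting the time-token mapping, which is what allows a single decoder to perform dense video captioning end to end under these assumptions.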