Uncovering the Temporal Context for Video Question Answering.

Authors :: Zhu, Linchao
Xu, Zhongwen
Yang, Yi
Hauptmann, Alexander
Source :: International Journal of Computer Vision. Sep 2017, Vol. 124 Issue 3, p409-421. 13p.
Publication Year :: 2017
Abstract: In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder-decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the question form of 'fill-in-the-blank', and collect our Video Context QA dataset consisting of 109,895 video clips with a total duration of more than 1000 h from existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines. [ABSTRACT FROM AUTHOR]