Language-Attention Modular-Network for Relational Referring Expression Comprehension in Videos
- Publication Year :
- 2022
Abstract
- A referring expression (RE) in the video domain describes a video using a natural language expression. Relational RE comprehension in videos localizes an object in relation to a distinguishing context object. Unlike object grounding in videos using REs, little work has been done on relational REs in videos. In this paper, we focus on (1) relational RE comprehension for videos and (2) demonstrating the significance of attention for the task. We propose a novel modular-network-based approach for relational RE comprehension in highly ambiguous video settings. We show the significance of language attention in the modular approach by: (1) using two different networks, i.e., modATN, consisting of an attention mechanism, visual modules, and a natural language expression input, and modSTR, consisting of visual modules and a structured input (subject, subject adjective, object, object adjective, action, relation); and (2) introducing a new dataset with structured REs for the relational RE comprehension task in modSTR. Finally, we propose an optimised modular network that outperforms and shows significant improvements over the baseline networks.
- Part of Proceedings ISBN 978-1-6654-9062-7
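- The abstract contrasts modATN, which soft-parses a free-form expression through language attention, with modSTR, which receives a pre-structured expression. The sketch below illustrates that contrast only; it is not the authors' implementation, and the module count, dimensions, field names, and the bidirectional-LSTM-plus-softmax parser are assumptions for illustration.

```python
# Minimal sketch (assumptions, not the paper's code) of the two input forms:
# modSTR consumes a structured RE, modATN derives per-module phrase embeddings
# from the raw expression via language attention.
import torch
import torch.nn as nn

# modSTR-style input: the expression is already decomposed into fields.
structured_re = {
    "subject": "dog", "subject_adjective": "brown",
    "object": "ball", "object_adjective": "red",
    "action": "chasing", "relation": "behind",
}

class LanguageAttention(nn.Module):
    """Soft-parses a free-form expression into one embedding per visual module."""
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=256, n_modules=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # One attention head per visual module (e.g. subject / relation / object).
        self.attn = nn.Linear(2 * hidden, n_modules)

    def forward(self, token_ids):                    # (B, T)
        h, _ = self.lstm(self.embed(token_ids))      # (B, T, 2*hidden)
        weights = self.attn(h).softmax(dim=1)        # attention over words, per module
        # Weighted sum of word states -> one phrase embedding per module.
        return torch.einsum("btm,bth->bmh", weights, h)   # (B, n_modules, 2*hidden)

# modATN-style input: raw token ids of the natural-language expression.
tokens = torch.randint(0, 10000, (1, 12))
module_embs = LanguageAttention()(tokens)   # would be fed to the visual modules
print(module_embs.shape)                    # torch.Size([1, 3, 512])
```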
Details
- Database :
- OAIster
- Notes :
- English
- Publication Type :
- Electronic Resource
- Accession number :
- edsoai.on1372250995
- Document Type :
- Electronic Resource
- Full Text :
- https://doi.org/10.1109/ICPR56361.2022.9956465