Vision-language navigation: a survey and taxonomy.

Authors :: Wu, Wansen; Chang, Tao; Li, Xinmeng; Yin, Quanjun; Hu, Yue
Source :: Neural Computing & Applications. Mar 2024, Vol. 36 Issue 7, p3291-3316. 26p.
Publication Year :: 2024
Abstract: Vision-language navigation (VLN) tasks require an agent to follow language instructions from a human guide to navigate in previously unseen environments using visual observations. This challenging field, involving problems in natural language processing (NLP), computer vision (CV), robotics, etc., has spawned many excellent works focusing on various VLN tasks. This paper provides a comprehensive survey and an insightful taxonomy of these tasks based on the different characteristics of language instructions. Depending on whether navigation instructions are given once or multiple times, we divide the tasks into two categories, i.e., single-turn and multiturn tasks. We subdivide single-turn tasks into goal-oriented and route-oriented tasks based on whether the instructions designate a single goal location or specify a sequence of multiple locations. We subdivide multiturn tasks into interactive and passive tasks based on whether the agent is allowed to ask questions. These tasks require different agent capabilities and entail various model designs. We identify the progress made on these tasks and examine the limitations of the existing VLN models and task settings. Hopefully, a well-designed taxonomy of the task family enables comparisons among different approaches across papers concerning the same tasks and clarifies the advances made in these tasks. Furthermore, we discuss several open issues in this field and some promising directions for future research, including the incorporation of knowledge into VLN models and transferring them to the real physical world. [ABSTRACT FROM AUTHOR]