Temporal Moment Localization via Natural Language by Utilizing Video Question Answers as a Special Variant and Bypassing NLP for Corpora.
- Source :
- IEEE Transactions on Circuits & Systems for Video Technology. Sep 2022, Vol. 32, Issue 9, p6174-6185. 12p.
- Publication Year :
- 2022
Abstract
- Temporal moment localization using natural language (TMLNL) is an emerging problem in computer vision: localizing a specific moment inside a long, untrimmed video. The goal of TMLNL is to retrieve the moment of the video that is most substantially related to the input query. Previous research focused on the visual portion of TMLNL, such as objects, backgrounds, and other visual attributes, while the textual portion was largely handled with natural language processing (NLP) techniques. A long query requires sufficient context to properly localize moments within a long untrimmed video; because queries were not handled adequately, performance deteriorated, especially for longer queries. In this paper, we treat the TMLNL challenge as a special variant of video question answering (VQA) that gives the visual elements equal consideration through our proposed VQA joint visual-textual framework (JVTF). We also handle complex, long input queries without employing NLP by refining coarsely grained representations into finely grained representations of distinct granularities. To address the equal importance of videos and queries, we further propose a novel bidirectional context predictor network (BCPN). The BCPN recovers the missing context of long input queries through a query handler (QH) and helps the JVTF find the most relevant moment. Whereas increasing the number of encoding layers in transformers, LSTMs, and other NLP techniques previously caused repetition of words, our QH reduces the repetition of word locations. The output of the BCPN is combined with the JVTF's guided attention to further improve the final result. Through extensive experiments on three benchmark datasets, we show that the proposed BCPN outperforms state-of-the-art methods by 2.65% at $\mathrm{IoU}=0.3$, 2.49% at $\mathrm{IoU}=0.5$, and 2.06% at $\mathrm{IoU}=0.7$. [ABSTRACT FROM AUTHOR]
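- The improvements above are reported at fixed temporal IoU thresholds, the standard evaluation for moment localization. The abstract does not spell the metric out, so the snippet below is only a minimal sketch of how thresholded top-1 recall is commonly computed; the function names and example segments are illustrative and not taken from the paper.

  def temporal_iou(pred, gt):
      # Temporal IoU between two (start, end) segments, in seconds.
      inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
      union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
      return inter / union if union > 0 else 0.0

  def recall_at_iou(predictions, ground_truths, threshold):
      # Fraction of queries whose top-1 predicted moment reaches the IoU threshold.
      hits = sum(temporal_iou(p, g) >= threshold
                 for p, g in zip(predictions, ground_truths))
      return hits / len(ground_truths)

  # Hypothetical predictions and ground-truth moments, for illustration only.
  preds = [(5.0, 12.0), (30.0, 41.0)]
  gts = [(6.0, 13.0), (28.0, 36.0)]
  for t in (0.3, 0.5, 0.7):
      print(f"R@1 at IoU={t}: {recall_at_iou(preds, gts, t):.2f}")

  A prediction counts as correct only if its overlap with the annotated moment, divided by their combined span, meets the threshold; the paper's reported gains are improvements in this thresholded accuracy over prior methods.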
Details
- Language :
- English
- ISSN :
- 1051-8215
- Volume :
- 32
- Issue :
- 9
- Database :
- Academic Search Index
- Journal :
- IEEE Transactions on Circuits & Systems for Video Technology
- Publication Type :
- Academic Journal
- Accession number :
- 158914522
- Full Text :
- https://doi.org/10.1109/TCSVT.2022.3162650