1. Fusion of Multimodal Embeddings for Ad-Hoc Video Search
- Author
- Benoit Huet, Chong-Wah Ngo, Danny Francis, and Phuong Anh Nguyen
- Subjects
- Vocabulary, Information retrieval, Computer science, Deep learning, Feature extraction, TRECVID, Visualization, Artificial intelligence, Natural language, Semantic gap
- Abstract
- The challenge of Ad-Hoc Video Search (AVS) originates from free-form (i.e., no pre-defined vocabulary) and free-style (i.e., natural language) query descriptions. Bridging the semantic gap between AVS queries and videos is highly difficult, as evidenced by the low retrieval accuracy of AVS benchmarking in TRECVID. In this paper, we study a new method to fuse multimodal embeddings that have been derived from completely disjoint datasets. The method is tested on two datasets for two distinct tasks: on MSR-VTT for unique video retrieval and on V3C1 for multiple-video retrieval.
- Published
- 2019