
Research on a Multimodal-Fusion-Based Video Captioning Model for Urban Road Scenes.

Authors :
李铭兴
徐成
李学伟
刘宏哲
闫晨阳
廖文森
Source :
Application Research of Computers / Jisuanji Yingyong Yanjiu. Feb2023, Vol. 40 Issue 2, p607-640. 6p.
Publication Year :
2023

Abstract

Multimodal fusion is one solution to the problem that urban road video captioning typically considers only visual information and ignores equally important audio information. Existing Transformer-based multimodal fusion algorithms suffer from poor fusion performance between modalities and high computational complexity. To improve the interaction between multimodal information, this paper proposes a new Transformer-based model, Multimodal Attention Bottleneck for Video Captioning (MABVC). First, pre-trained I3D and VGGish networks extract the visual and audio features of a video, and the extracted features are fed into a Transformer model. Then, the decoder trains on the information of the two modalities separately and performs multimodal fusion. Finally, the model processes the decoder outputs and generates text captions that people can understand. Comparison experiments were conducted on the MSR-VTT and MSVD datasets and the self-built BUUISE dataset, and the model was validated with standard evaluation metrics. The experimental results show that the video captioning model based on multimodal attention fusion improves noticeably on all metrics. The model still achieves good results on traffic-scene datasets and has strong prospects for application in the intelligent-driving industry. [ABSTRACT FROM AUTHOR]
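The abstract does not include code, but the fusion idea it names (a small set of shared "attention bottleneck" tokens through which the visual and audio streams exchange information) can be sketched minimally. The following is a hypothetical NumPy illustration, not the authors' implementation: single-head attention, random features standing in for I3D/VGGish outputs, and all dimensions chosen arbitrarily.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def bottleneck_fusion(visual, audio, bottleneck, num_layers=2):
    """Each layer: every modality self-attends jointly with the shared
    bottleneck tokens; the modalities never attend to each other directly,
    so cross-modal interaction flows only through the few bottleneck
    tokens, keeping the cross-modal cost low."""
    for _ in range(num_layers):
        new_bottlenecks, new_streams = [], []
        for stream in (visual, audio):
            x = np.concatenate([stream, bottleneck], axis=0)
            x = attend(x, x, x)
            new_streams.append(x[:len(stream)])
            new_bottlenecks.append(x[len(stream):])
        visual, audio = new_streams
        # Average the per-modality bottleneck updates to share them.
        bottleneck = np.mean(new_bottlenecks, axis=0)
    return visual, audio, bottleneck

rng = np.random.default_rng(0)
d = 16
vis = rng.normal(size=(10, d))  # stand-in for I3D clip features
aud = rng.normal(size=(6, d))   # stand-in for VGGish segment features
btl = rng.normal(size=(4, d))   # a few shared bottleneck tokens
v, a, b = bottleneck_fusion(vis, aud, btl)
```

Because the two streams interact only through the 4 bottleneck tokens, the cross-modal attention cost scales with the bottleneck size rather than with the product of the two sequence lengths, which is the complexity advantage the abstract alludes to.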

Details

Language :
Chinese
ISSN :
10013695
Volume :
40
Issue :
2
Database :
Academic Search Index
Journal :
Application Research of Computers / Jisuanji Yingyong Yanjiu
Publication Type :
Academic Journal
Accession number :
162018092
Full Text :
https://doi.org/10.19734/j.issn.1001-3695.2022.06.0275