
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Authors :
Zhu, Qiushi
Zhou, Long
Zhang, Ziqiang
Liu, Shujie
Jiao, Binxing
Zhang, Jie
Dai, Lirong
Jiang, Daxin
Li, Jinyu
Wei, Furu
Publication Year :
2022

Abstract

Although speech is a simple and effective way for humans to communicate with the outside world, realistic speech interaction involves multimodal information, e.g., vision and text. How to design a unified framework that integrates different modalities and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens produced by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that the proposed VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.

Comment: 11 pages, Accepted by IEEE Transactions on Multimedia
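The abstract describes the architecture only at a high level: modality-dependent front-ends feed a shared, modality-independent backbone, which is trained to predict unified tokens at masked positions. The PyTorch sketch below illustrates that training objective conceptually; the `ToyVATLM` class, the `masked_prediction_loss` helper, all feature dimensions, and the use of a plain Transformer encoder are illustrative assumptions, not the authors' implementation.

```python
# Conceptual sketch of unified masked prediction across modalities.
# NOT the VATLM implementation; sizes and module choices are assumptions.
import torch
import torch.nn as nn


class ToyVATLM(nn.Module):
    def __init__(self, dim=256, vocab=500, n_layers=4):
        super().__init__()
        # Three simple modality-dependent front-ends (shapes are illustrative).
        self.audio_frontend = nn.Linear(80, dim)      # e.g. log-mel frames -> dim
        self.video_frontend = nn.Linear(512, dim)     # e.g. lip-ROI features -> dim
        self.text_frontend = nn.Embedding(1000, dim)  # e.g. phoneme/text ids -> dim
        # Learned embedding that replaces the input at masked positions.
        self.mask_emb = nn.Parameter(torch.zeros(dim))
        # One shared, modality-independent backbone.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Prediction head over the unified token vocabulary.
        self.head = nn.Linear(dim, vocab)

    def forward(self, feats, modality, mask):
        if modality == "audio":
            x = self.audio_frontend(feats)
        elif modality == "video":
            x = self.video_frontend(feats)
        else:
            x = self.text_frontend(feats)
        # Corrupt masked positions, then encode with the shared backbone.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.backbone(x))


def masked_prediction_loss(logits, unified_tokens, mask):
    # Only the masked positions contribute to the cross-entropy objective.
    return nn.functional.cross_entropy(logits[mask], unified_tokens[mask])


model = ToyVATLM()
audio = torch.randn(2, 50, 80)             # (batch, frames, feature_dim)
targets = torch.randint(0, 500, (2, 50))   # unified tokens from a (hypothetical) tokenizer
mask = torch.rand(2, 50) < 0.3             # randomly chosen masked positions
loss = masked_prediction_loss(model(audio, "audio", mask), targets, mask)
loss.backward()
```

Because every modality is mapped to the same unified token targets, the same loss can in principle be applied to visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text, which is the resource-sharing idea the abstract emphasizes.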

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2211.11275
Document Type :
Working Paper
Full Text :
https://doi.org/10.1109/TMM.2023.3275873