
End-to-end audiovisual speech activity detection with bimodal recurrent neural models.

Authors :
Tao, Fei
Busso, Carlos
Source :
Speech Communication. Oct 2019, Vol. 113, p25-35. 11p.
Publication Year :
2019

Abstract

• This study proposes an end-to-end framework based on the bimodal recurrent neural network (BRNN) for AV-SAD that explicitly captures the temporal dynamics between acoustic and visual features.
• The acoustic and visual features are directly learned from the data during training, creating a powerful end-to-end system.
• The experimental evaluations on the CRSS-4English-14 corpus (over 60 h) demonstrate the benefits of the proposed approach, which leads to statistically significant performance improvements over state-of-the-art methods.
• This is the first end-to-end system for audiovisual speech activity detection.

Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea, proposing the bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 h of recordings collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements of up to 1.2% under practical scenarios over an audio-only VAD baseline implemented with a deep neural network (DNN). The proposed approach achieves a 92.7% F1-score when evaluated using the sensors of a portable tablet in a noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high-definition camera and a close-talking microphone). [ABSTRACT FROM AUTHOR]
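
The abstract describes a bimodal recurrent architecture in which each modality is encoded by its own recurrent stream and the two streams are fused to produce frame-level speech/non-speech decisions. The sketch below illustrates that general idea, assuming PyTorch; the layer types, dimensions, GRU cells, and single fusion layer are illustrative assumptions, not the authors' exact BRNN implementation.

```python
# Minimal sketch of a bimodal recurrent model for AV-SAD (assumed PyTorch).
# Sizes and layer choices are hypothetical, for illustration only.
import torch
import torch.nn as nn

class BimodalRNN(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=1024, hidden=128):
        super().__init__()
        # Modality-specific recurrent encoders learn features from each stream.
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True)
        # Fusion layer models the joint audiovisual temporal dynamics.
        self.fusion_rnn = nn.GRU(2 * hidden, hidden, batch_first=True)
        # Frame-level speech / non-speech decision.
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, audio, visual):
        # audio: (batch, frames, audio_dim); visual: (batch, frames, visual_dim),
        # with the two streams synchronized frame by frame.
        a, _ = self.audio_rnn(audio)
        v, _ = self.visual_rnn(visual)
        fused, _ = self.fusion_rnn(torch.cat([a, v], dim=-1))
        # Per-frame speech probability.
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)

# Usage example: 2 utterances, 100 synchronized audiovisual frames each.
model = BimodalRNN()
probs = model(torch.randn(2, 100, 40), torch.randn(2, 100, 1024))
print(probs.shape)  # torch.Size([2, 100])
```

In an end-to-end variant such as the one evaluated in the paper, the fixed-dimension inputs above would be replaced by feature extractors operating directly on the raw waveform and video frames, trained jointly with the recurrent layers.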

Details

Language :
English
ISSN :
0167-6393
Volume :
113
Database :
Academic Search Index
Journal :
Speech Communication
Publication Type :
Academic Journal
Accession number :
138590734
Full Text :
https://doi.org/10.1016/j.specom.2019.07.003