Multi-modal and deep learning for robust speech recognition
- Author
James R. Glass, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science; and Feng, Xue, Ph. D., Massachusetts Institute of Technology
- Abstract
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017. Cataloged from the PDF version of the thesis. Includes bibliographical references (pages 105-115).

Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, system performance can degrade dramatically when noise and reverberation are present. In this thesis, speech denoising and model adaptation for robust speech recognition were studied, and four novel methods were introduced to improve ASR robustness. First, we developed an ASR system that uses multi-channel information from microphone arrays via accurate speaker tracking with Kalman filtering and subsequent beamforming. The system was evaluated on the publicly available Reverb Challenge corpus and placed second (out of 49 submitted systems) in the recognition task on real data. Second, we explored a speech feature denoising and dereverberation method based on deep denoising autoencoders (DDA). The method was evaluated on the CHiME2-WSJ0 corpus and achieved a 16% to 25% absolute improvement in word error rate (WER) compared to the baseline. Third, we developed a method to incorporate heterogeneous multi-modal data into a deep neural network (DNN) based acoustic model. Our experiments on a noisy vehicle-based speech corpus demonstrated that WERs can be reduced by 6.3% relative to the baseline system. Finally, we explored the use of a low-dimensional environmentally aware feature derived from the total acoustic variability space. Two extraction methods are presented: one via linear discriminant analysis (LDA) projection, and the other via a bottleneck deep neural network (BN-DNN). Our evaluations showed that adapting ASR systems with the proposed feature significantly improved ASR performance. We also demonstrated that the proposed feature yielded promising results on environment identification tasks.

By Xue Feng, Ph. D.
- Published
- 2018
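
The deep denoising autoencoder (DDA) approach summarized in the abstract trains a network to map noisy speech features to their clean counterparts. The sketch below is only a minimal illustration of that idea, not the thesis implementation: it uses a single hidden layer rather than a deep network, synthetic feature pairs in place of a parallel noisy/clean corpus such as CHiME2-WSJ0, and plain NumPy gradient descent; all function and variable names are illustrative assumptions.

```python
# Minimal sketch (illustrative only, not the thesis code): a single-hidden-layer
# denoising autoencoder that maps noisy feature frames to clean targets using
# mean-squared error and full-batch gradient descent in NumPy.
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic_pairs(n_frames=2000, n_dims=40, noise_std=0.5):
    """Stand-in for parallel noisy/clean feature frames from a real corpus."""
    clean = rng.normal(size=(n_frames, n_dims))
    noisy = clean + noise_std * rng.normal(size=(n_frames, n_dims))
    return noisy, clean

def train_denoising_autoencoder(noisy, clean, hidden=256, lr=0.1, epochs=200):
    n, d = noisy.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        # Forward pass: tanh hidden layer, linear output layer.
        h = np.tanh(noisy @ W1 + b1)
        out = h @ W2 + b2
        err = out - clean                   # gradient of 0.5*MSE w.r.t. output
        # Backward pass: manual gradients for both layers.
        gW2 = h.T @ err / n
        gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1.0 - h ** 2)  # tanh derivative
        gW1 = noisy.T @ dh / n
        gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def denoise(params, noisy):
    """Apply the trained mapping to (possibly unseen) noisy frames."""
    W1, b1, W2, b2 = params
    return np.tanh(noisy @ W1 + b1) @ W2 + b2

if __name__ == "__main__":
    noisy, clean = make_synthetic_pairs()
    params = train_denoising_autoencoder(noisy, clean)
    mse = np.mean((denoise(params, noisy) - clean) ** 2)
    print(f"reconstruction MSE after training: {mse:.4f}")
```

In a front-end of this kind, the learned mapping would be applied to noisy test features before they are passed to the acoustic model for recognition.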