End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks
- Source :
- IEEE/ACM Transactions on Audio, Speech, and Language Processing; September 2018, Vol. 26, Issue 9, p1570-1584, 15p
- Publication Year :
- 2018
Abstract
- A speech enhancement model maps noisy speech to clean speech. In the training stage, an objective function is adopted to optimize the model parameters. However, in the existing literature, the model optimization criterion is often inconsistent with the criterion used to evaluate the enhanced speech. For example, speech intelligibility is most often evaluated with the short-time objective intelligibility (STOI) measure, while the frame-based mean square error (MSE) between estimated and clean speech is widely used to optimize the model. Because of this inconsistency, there is no guarantee that the trained model provides optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization criterion and the evaluation criterion. Because the optimization is utterance-based, temporal correlation information over long speech segments, or even over the entire utterance, can be taken into account to directly optimize perception-based objective functions. As an example, we implemented the proposed FCN enhancement framework to optimize the STOI measure. Experimental results show that test speech processed by the proposed approach achieves a higher STOI than conventional MSE-optimized speech, owing to the consistency between the training and evaluation targets. Moreover, by integrating STOI into model optimization, the intelligibility of the enhanced speech, as judged by human subjects and by an automatic speech recognition system, is also substantially improved compared with speech generated under the minimum-MSE criterion.
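The mismatch the abstract describes can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: `tiny_fcn` is a hypothetical toy network built only from convolutions (so it accepts waveforms of any length), and `segment_correlation_loss` is a simplified correlation-based stand-in for an utterance-level objective. It is not the STOI measure, which additionally involves band decomposition, normalization, and clipping steps omitted here.

```python
import numpy as np


def conv1d_same(x, kernel):
    # Zero-padded 1-D convolution; the output has the same length as x
    # as long as the kernel is not longer than the signal.
    return np.convolve(x, kernel, mode="same")


def tiny_fcn(noisy, kernels):
    # Hypothetical toy "fully convolutional" stack: convolution + tanh only.
    # With no dense layer, the same weights apply to any utterance length.
    h = noisy
    for k in kernels:
        h = np.tanh(conv1d_same(h, k))
    return h


def mse_loss(est, clean):
    # Conventional sample-wise mean square error.
    return float(np.mean((est - clean) ** 2))


def segment_correlation_loss(est, clean, seg_len=256, eps=1e-8):
    # Simplified stand-in for a perception-motivated utterance-level
    # objective (NOT STOI): negative mean correlation between estimated
    # and clean waveform segments, averaged over the whole utterance.
    n_seg = len(clean) // seg_len
    e = est[: n_seg * seg_len].reshape(n_seg, seg_len)
    c = clean[: n_seg * seg_len].reshape(n_seg, seg_len)
    e = e - e.mean(axis=1, keepdims=True)
    c = c - c.mean(axis=1, keepdims=True)
    num = (e * c).sum(axis=1)
    den = np.linalg.norm(e, axis=1) * np.linalg.norm(c, axis=1) + eps
    return float(-(num / den).mean())


# Usage: the same kernels process utterances of different lengths.
rng = np.random.default_rng(0)
kernels = [rng.normal(scale=0.1, size=11) for _ in range(3)]
short_utt = rng.normal(size=4000)
long_utt = rng.normal(size=16000)
print(tiny_fcn(short_utt, kernels).shape, tiny_fcn(long_utt, kernels).shape)

# The two criteria can disagree: a half-amplitude copy of the clean signal
# has nonzero MSE but a (near-)perfect correlation score.
print(mse_loss(0.5 * long_utt, long_utt))
print(segment_correlation_loss(0.5 * long_utt, long_utt))
```

In this toy setting, a rescaled copy of the clean signal is penalized by MSE but not by the correlation-based objective, which is one simple way in which a training criterion and an evaluation criterion can rank the same output differently.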
Details
- Language :
- English
- ISSN :
- 2329-9290
- Volume :
- 26
- Issue :
- 9
- Database :
- Supplemental Index
- Journal :
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- Publication Type :
- Periodical
- Accession number :
- ejs45709190
- Full Text :
- https://doi.org/10.1109/TASLP.2018.2821903