1. Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition
- Author
Emmanuel Vincent, Sunit Sivasankaran, Dominique Fohr
Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria); Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Inria, Université de Lorraine (UL), Centre National de la Recherche Scientifique (CNRS); Grid'5000
ANR-16-CE33-0006, VOCADOM, Commande vocale robuste adaptée à la personne et au contexte pour l'autonomie à domicile (2016)
This work was made with the support of the French National Research Agency, in the framework of the project VOCADOM "Robust voice command adapted to the user and to the context for AAL" (ANR-16-CE33-0006). Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr), and using the EXPLOR centre, hosted by the University of Lorraine.
- Subjects
Multichannel speech separation; WSJ0-2mix reverberated; Signal processing; Noise measurement; Artificial neural network; Computer science; Speech recognition; Word error rate; 020206 networking & telecommunications; 02 engineering and technology; Speech processing; Signal-to-noise ratio; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; Audio and Speech Processing (eess.AS); [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]; 0202 electrical engineering, electronic engineering, information engineering; FOS: Electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Adaptive beamformer; Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We investigate the effect of speaker localization on the performance of speech recognition systems in a multispeaker, multichannel environment. Given the speaker location information, speech separation is performed in three stages. In the first stage, a simple delay-and-sum (DS) beamformer enhances the signal impinging from the speaker location. In the second stage, a neural network uses the enhanced signal to estimate a time-frequency mask corresponding to the localized speaker. In the third stage, this mask is used to compute the second-order statistics and to derive an adaptive beamformer. We generated a multichannel, multispeaker, reverberated, noisy dataset inspired by the well-studied WSJ0-2mix and studied the performance of the proposed pipeline in terms of the word error rate (WER). An average WER of $29.4$% was achieved using the ground truth localization information and $42.4$% using the localization information estimated via GCC-PHAT. The signal-to-interference ratio (SIR) between the speakers has the larger impact on ASR performance: a SIR increase of $15$ dB reduces the WER by $59$% relative. By contrast, increasing the spatial distance to $50^\circ$ or more improves the WER by only $23$% relative.
Comment: Submitted to ICASSP 2020
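The localization and first-stage enhancement mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `gcc_phat_delay` and `delay_and_sum` are hypothetical, and the sketch assumes a two-microphone setup, integer-sample delays, and circular shifts for alignment. It shows how GCC-PHAT can estimate the delay between two channels and how a delay-and-sum beamformer can then align and average them.

```python
import numpy as np


def gcc_phat_delay(x, ref, max_delay):
    """Estimate the delay of x relative to ref, in samples, using GCC-PHAT."""
    n = len(x) + len(ref)  # zero-pad so the circular correlation does not wrap
    X = np.fft.rfft(x, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = X * np.conj(R)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that the lags run from -max_delay to +max_delay.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay


def delay_and_sum(channels, delays):
    """Advance each channel by its integer delay and average the result.

    Assumption: np.roll shifts circularly, so the last few samples of a
    shifted channel are not meaningful in this simplified sketch.
    """
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)


# Synthetic example: the second microphone receives the source 5 samples late.
rng = np.random.default_rng(0)
mic1 = rng.standard_normal(4096)
mic2 = np.concatenate((np.zeros(5), mic1[:-5]))

delay = gcc_phat_delay(mic2, mic1, max_delay=16)
enhanced = delay_and_sum([mic1, mic2], [0, delay])
```

In a real array, the delays would follow from the estimated direction of arrival and the microphone geometry, and fractional delays would typically be applied in the frequency domain rather than with integer shifts.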
- Published
- 2019