Back to Search Start Over

Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition

Authors :
Emmanuel Vincent
Sunit Sivasankaran
Dominique Fohr
Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH)
Inria Nancy - Grand Est
Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD)
Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Grid'5000
ANR-16-CE33-0006,VOCADOM,Commande vocale robuste adaptée à la personne et au contexte pour l'autonomie à domicile(2016)
This work was made with the support of the French National Research Agency, in the framework of the project VOCADOM 'Robust voice commandadapted to the user and to the context for AAL' (ANR-16-CE33-0006). Experiments presented in this paper were carried out using the Grid’5000testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr) and using the EXPLOR centre, hosted by the University of Lorraine.
Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)
Source :
EUSIPCO, 28th European Signal Processing Conference, 28th European Signal Processing Conference, Jan 2021, Amsterdam, Netherlands, EUSIPCO 2020-28th European Signal Processing Conference, EUSIPCO 2020-28th European Signal Processing Conference, Jan 2021, Amsterdam / Virtual, Netherlands, EUSIPCO 2020-28th European Signal Processing Conference, Jan 2021, Amsterdam / Virtual, Netherlands. ⟨10.23919/Eusipco47968.2020.9287541⟩
Publication Year :
2019
Publisher :
arXiv, 2019.

Abstract

We investigate the effect of speaker localization on the performance of speech recognition systems in a multispeaker, multichannel environment. Given the speaker location information, speech separation is performed in three stages. In the first stage, a simple delay-and-sum (DS) beamformer is used to enhance the signal impinging from the speaker location which is then used to estimate a time-frequency mask corresponding to the localized speaker using a neural network. This mask is used to compute the second order statistics and to derive an adaptive beamformer in the third stage. We generated a multichannel, multispeaker, reverberated, noisy dataset inspired from the well studied WSJ0-2mix and study the performance of the proposed pipeline in terms of the word error rate (WER). An average WER of $29.4$% was achieved using the ground truth localization information and $42.4$% using the localization information estimated via GCC-PHAT. The signal-to-interference ratio (SIR) between the speakers has a higher impact on the ASR performance, to the extent of reducing the WER by $59$% relative for a SIR increase of $15$ dB. By contrast, increasing the spatial distance to $50^\circ$ or more improves the WER by $23$% relative only<br />Comment: Submitted to ICASSP 2020

Details

Database :
OpenAIRE
Journal :
EUSIPCO, 28th European Signal Processing Conference, 28th European Signal Processing Conference, Jan 2021, Amsterdam, Netherlands, EUSIPCO 2020-28th European Signal Processing Conference, EUSIPCO 2020-28th European Signal Processing Conference, Jan 2021, Amsterdam / Virtual, Netherlands, EUSIPCO 2020-28th European Signal Processing Conference, Jan 2021, Amsterdam / Virtual, Netherlands. ⟨10.23919/Eusipco47968.2020.9287541⟩
Accession number :
edsair.doi.dedup.....4bb4e8839d0cfbc2c42a417127427871
Full Text :
https://doi.org/10.48550/arxiv.1910.11114