
Multichannel Speech Separation with Recurrent Neural Networks from High-Order Ambisonics Recordings

Authors :
Emmanuel Vincent
Alexandre Guerin
Romain Serizel
Lauréline Perotin
Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH)
Inria Nancy - Grand Est
Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD)
Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Orange Labs [Cesson-Sévigné]
Orange Labs
Perotin, Lauréline
Source :
43rd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2018), Apr 2018, Calgary, Canada
Publication Year :
2018
Publisher :
IEEE, 2018.

Abstract

We present a source separation system for high-order ambisonics (HOA) content. We derive a multichannel spatial filter from a mask estimated by a long short-term memory (LSTM) recurrent neural network. Assuming that the directions of arrival of the directional sources are known, we feed the LSTM one channel of the mixture together with the outputs of basic HOA beamformers. In our experiments, the speech of interest is corrupted either by diffuse noise or by an equally loud competing speaker. We show that adding the output of a beamformer steered toward the competing speech as an input, in addition to that of the beamformer steered toward the target speech, brings significant improvements in terms of word error rate.
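As an illustration of how a time-frequency mask can be turned into a multichannel spatial filter, the following is a minimal NumPy sketch. It assumes a multichannel Wiener filter built from mask-weighted spatial covariance matrices, which is one common choice; the abstract does not state which filter the authors use. The function name `mask_to_filter_output`, the array layout, and the regularization constant `eps` are hypothetical, and the mask is taken as given rather than produced by an LSTM.

```python
import numpy as np

def mask_to_filter_output(X, mask, ref=0, eps=1e-6):
    """Apply a mask-derived multichannel Wiener filter (illustrative sketch).

    X    : complex array (F, T, C), STFT of the multichannel mixture
           (for HOA content, C would be the number of ambisonics channels).
    mask : real array (F, T) in [0, 1], estimated target presence.
    ref  : index of the reference channel at which the target is estimated.
    Returns a complex array (F, T) with the filtered target estimate.
    """
    F, T, C = X.shape
    noise_mask = 1.0 - mask

    # Mask-weighted spatial covariance matrices, one (C, C) matrix per frequency.
    Rs = np.einsum('ft,ftc,ftd->fcd', mask, X, X.conj())
    Rs /= mask.sum(axis=1)[:, None, None] + eps
    Rn = np.einsum('ft,ftc,ftd->fcd', noise_mask, X, X.conj())
    Rn /= noise_mask.sum(axis=1)[:, None, None] + eps

    # Multichannel Wiener filter: w = (Rs + Rn)^{-1} Rs e_ref, with a small
    # diagonal loading term (an assumption) to keep the system well conditioned.
    Rx = Rs + Rn + eps * np.eye(C)
    w = np.linalg.solve(Rx, Rs[:, :, ref:ref + 1])[..., 0]  # shape (F, C)

    # Filter output w^H x for every time-frequency bin.
    return np.einsum('fc,ftc->ft', w.conj(), X)
```

With a mask of all ones the filter reduces, up to the regularization, to selecting the reference channel; with a mask of all zeros the output is zero. Intermediate masks yield a filter that weights the channels according to the estimated target and interference statistics.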

Details

Database :
OpenAIRE
Journal :
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Accession number :
edsair.doi.dedup.....a1981cb166dd3b63bab9b647618fb309
Full Text :
https://doi.org/10.1109/icassp.2018.8461370