Author: "Romain Serizel" / Publisher: hal ccsd - Searchworks@Jio Institute Digital Library Search Results

Your search for "Romain Serizel" returned 31 results

1. Barlow Twins self-supervised learning for robust speaker recognition

Author: Mohammad Mohammadamini, Driss Matrouf, Jean-Francois Bonastre, Sandipana Dowerah, Romain Serizel, Denis Jouvet, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), and ANR-18-CE33-0014,ROBOVOX,ROBOVOX - Identification vocale robuste pour les robots de sécurité mobiles(2018)
Subjects: Speaker recognition, Barlow Twins, [INFO]Computer Science [cs], Robustness, ResNet
Abstract: International audience; Acoustic noise is a major challenge for speaker recognition systems. State-of-the-art speaker recognition systems are based on deep neural network speaker embeddings produced by an x-vector extractor. A noise-robust x-vector extractor is in high demand for such systems. In this paper, we introduce the Barlow Twins self-supervised loss function to speaker recognition. The Barlow Twins objective function optimizes two criteria. First, it increases the similarity between two versions of the same signal (i.e., the clean signal and its augmented noisy version) to make the speaker embedding invariant to acoustic noise. Second, it reduces the redundancy between the dimensions of the x-vectors, which improves the overall quality of the speaker embeddings. In our research, the Barlow Twins objective function is integrated with a ResNet-based speaker embedding system. In the proposed system, the Barlow Twins objective function is computed at the embedding layer and optimized jointly with the speaker classifier loss function. Experimental results on the Fabiole corpus show a 22% relative gain in terms of EER in clean environments and an 18% improvement in the presence of noise with low SNR and reverberation.
Published: 2022
Full Text: View/download PDF
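
Note: the two criteria named in the abstract of record 1 can be illustrated with a minimal sketch of a Barlow Twins style loss. This is not the authors' implementation. It assumes the formulation commonly used for Barlow Twins (a cross-correlation matrix between the standardized embeddings of the two views), and the function names and the weight `lam=0.005` are hypothetical choices made for illustration. The joint optimization with the speaker classifier loss mentioned in the abstract is not shown.

```python
import math


def standardize(batch):
    """Normalize each embedding dimension to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 when a dimension is constant, to avoid dividing by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def barlow_twins_loss(emb_clean, emb_noisy, lam=0.005):
    """Sketch of a Barlow Twins style loss for two batches of embeddings.

    emb_clean, emb_noisy: lists of equal-length embedding vectors, one pair per
    utterance (clean view and augmented noisy view).
    lam: assumed weight of the redundancy-reduction term (illustrative value).
    """
    za, zb = standardize(emb_clean), standardize(emb_noisy)
    n, d = len(za), len(za[0])
    invariance = 0.0  # pushes diagonal correlations towards 1 (noise invariance)
    redundancy = 0.0  # pushes off-diagonal correlations towards 0 (decorrelation)
    for i in range(d):
        for j in range(d):
            c_ij = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            if i == j:
                invariance += (1.0 - c_ij) ** 2
            else:
                redundancy += c_ij ** 2
    return invariance + lam * redundancy
```

With identical views whose dimensions are uncorrelated, both terms vanish. When two dimensions carry the same information, only the redundancy term is nonzero.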

2. A Comprehensive Exploration of Noise Robustness and Noise Compensation in ResNet and TDNN-based Speaker Recognition Systems

Author: Mohammad MohammadAmini, Driss Matrouf, Jean-Francois Bonastre, Sandipana Dowerah, Romain Serizel, Denis Jouvet, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), and ANR-18-CE33-0014,ROBOVOX,ROBOVOX - Identification vocale robuste pour les robots de sécurité mobiles(2018)
Subjects: Speaker recognition, Reverberation, Additive noise, Robustness, ResNet, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; In this paper, a comprehensive exploration of noise robustness and noise compensation of ResNet and TDNN speaker recognition systems is presented. First, the robustness of the TDNN and ResNet systems in the presence of noise, reverberation, and both distortions combined is explored. Our experimental results show that in all cases the ResNet system is more robust than the TDNN. Noise compensation is then performed with a denoising autoencoder (DAE) over the x-vectors extracted from both systems. We explored two scenarios: 1) compensation of artificial noise with artificial data, 2) compensation of real noise with artificial data. The second case is the most desirable scenario, because it makes noise compensation affordable without requiring real data to train denoising techniques. The experimental results show that in the first scenario noise compensation gives a significant improvement with the TDNN, while the improvement with the ResNet is not significant. In the second scenario, we achieved a 15% improvement in EER on the VoiCes Eval challenge with both the TDNN and ResNet systems. In most cases the performance of the ResNet without compensation is superior to that of the TDNN with noise compensation.
Published: 2022

3. Distributed speech separation in spatially unconstrained microphone arrays

Author: Romain Serizel, Nicolas Furnon, Slim Essid, Irina Illina, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Télécom ParisTech, Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, This work was made with the support of the French National Research Agency, in the framework of the project DiSCogs 'Distant speech communication with heterogeneous unconstrained microphone arrays' (ANR-17-CE23-0026-01)., Grid'5000, ANR-17-CE23-0026,DiSCogs,Antennes acoustiques hétérogènes et non contraintes pour la communication parlée(2017), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Institut Polytechnique de Paris (IP Paris), Département Images, Données, Signal (IDS), Signal, Statistique et Apprentissage (S2A), and Institut Mines-Télécom [Paris] (IMT)-Télécom Paris-Institut Mines-Télécom [Paris] (IMT)-Télécom Paris
Subjects: Signal Processing (eess.SP), Microphone array, Computer science, Microphone, business.industry, Process (computing), Pattern recognition, 02 engineering and technology, Speech processing, Signal, Microphone arrays, Distributed processing, Recurrent neural network, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Electrical Engineering and Systems Science - Signal Processing, business, Speech separation
Abstract: Submitted to ICASSP 2020; Speech separation with several speakers is a challenging task because of the non-stationarity of speech and the strong signal similarity between interfering sources. Current state-of-the-art solutions can separate the different sources well using sophisticated deep neural networks, which are very tedious to train. When several microphones are available, spatial information can be exploited to design much simpler algorithms to discriminate speakers. We propose a distributed algorithm that can process spatial information in a spatially unconstrained microphone array. The algorithm relies on a convolutional recurrent neural network that can exploit the signal diversity from the distributed nodes. In the typical case of a meeting room, this algorithm can capture an estimate of each source in a first step and propagate it over the microphone array in order to increase the separation performance in a second step. We show that this approach performs even better when the number of sources and nodes increases. We also study the influence of a mismatch in the number of sources between the training and testing conditions.
Published: 2021
Full Text: View/download PDF

4. DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays

Author: Romain Serizel, Irina Illina, Nicolas Furnon, Slim Essid, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, This work was made with the support of the French National Research Agency, in the framework of the project DiSCogs (ANR-17-CE23-0026-01). Experiments presented in this paper were partially carried out using the Grid5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000)., Grid'5000, ANR-17-CE23-0026,DiSCogs,Antennes acoustiques hétérogènes et non contraintes pour la communication parlée(2017), Institut Polytechnique de Paris (IP Paris), Département Images, Données, Signal (IDS), Télécom ParisTech, Signal, Statistique et Apprentissage (S2A), and Institut Mines-Télécom [Paris] (IMT)-Télécom Paris-Institut Mines-Télécom [Paris] (IMT)-Télécom Paris
Subjects: Signal Processing (eess.SP), Microphone array, Acoustics and Ultrasonics, Noise measurement, Artificial neural network, Computer science, Microphone, Noise reduction, Speech recognition, Context (language use), Speech processing, Speech enhancement, Computational Mathematics, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Computer Science::Sound, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, Electrical Engineering and Systems Science - Signal Processing
Abstract: Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals, which are pre-filtered target estimations. We study the performance of this algorithm under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking advantage of their spatial coverage in the room. We also propose to use the compressed signals to convey not only the target estimation but also the noise estimation, in order to exploit the acoustic diversity recorded throughout the microphone array., Submitted to TASLP
Published: 2020
Full Text: View/download PDF

5. Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes

Author: Romain Serizel, Hakan Erdogan, Justin Salamon, Nicolas Turpault, John R. Hershey, Scott Wisdom, Eduardo Fonseca, Prem Seetharaman, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), Google Inc, Research at Google, Universitat Pompeu Fabra [Barcelona] (UPF), Descript, Inc., Adobe Research, Part of this work was made with the support of the French National Research Agency, in the framework of the project LEAUDS 'Learning to understand audio scenes' (ANR-18-CE23-0020) and the French region Grand-Est. High Performance Computing resources were partially provided by the EXPLOR centre hosted by the University de Lorraine., Grid'5000, ANR-18-CE23-0020,LEAUDS,Apprentissage statistique pour la compréhension de scènes audio(2018), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: FOS: Computer and information sciences, Sound localization, Sound (cs.SD), Reverberation, Soundscape, Computer science, Speech recognition, 02 engineering and technology, Computer Science - Sound, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), Robustness (computer science), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Sound (geography), synthetic soundscapes, geography, Signal processing, geography.geographical_feature_category, Event (computing), Sound event detection, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Benchmark (computing), sound separation, 020201 artificial intelligence & image processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: International audience; We propose a benchmark of state-of-the-art sound event detection (SED) systems. We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time-related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We show that the localization in time of sound events is still a problem for SED systems. We also show that reverberation and non-target sound events severely degrade the performance of the SED systems. In the latter case, sound separation seems like a promising solution.
Published: 2020
Full Text: View/download PDF

6. Joint NN-Supported Multichannel Reduction of Acoustic Echo, Reverberation and Noise

Author: Emmanuel Vincent, Guillaume Carbajal, Romain Serizel, Eric Humbert, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Télécom ParisTech, Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Reverberation, Sound (cs.SD), Acoustics and Ultrasonics, Computer science, Noise reduction, Acoustics, Expectation-maximization, Recurrent neural network, Machine Learning (stat.ML), Background noise, Computer Science - Sound, Machine Learning (cs.LG), Reduction (complexity), 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Statistics - Machine Learning, Audio and Speech Processing (eess.AS), Distortion, Computer Science (miscellaneous), FOS: Electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Acoustic echo, Echo (computing), Joint distortion reduction, Computational Mathematics, Noise, Nonlinear distortion, Computer Science::Sound, 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: International audience; We consider the problem of simultaneous reduction of acoustic echo, reverberation and noise. In real scenarios, these distortion sources may occur simultaneously and reducing them implies combining the corresponding distortion-specific filters. As these filters interact with each other, they must be jointly optimized. We propose to model the target and residual signals after linear echo cancellation and dereverberation using a multichannel Gaussian modeling framework and to jointly represent their spectra by means of a neural network. We develop an iterative block-coordinate ascent algorithm to update all the filters. We evaluate our system on real recordings of acoustic echo, reverberation and noise acquired with a smart speaker in various situations. The proposed approach outperforms in terms of overall distortion a cascade of the individual approaches and a joint reduction approach which does not rely on a spectral model of the target and residual signals.
Published: 2020
Full Text: View/download PDF

7. DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays

Author: Irina Illina, Nicolas Furnon, Slim Essid, Romain Serizel, Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Grid'5000, Mésocentre EXPLOR, ANR-17-CE23-0026,DiSCogs,Antennes acoustiques hétérogènes et non contraintes pour la communication parlée(2017), Télécom ParisTech-Institut Mines-Télécom [Paris] (IMT), and ANR-17-CE23-0026,DiSCogs,Distant speech communication with heterogeneous unconstrained microphone arrays(2018)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), microphone arrays, Microphone, Computer science, Computer Science - Artificial Intelligence, Real-time computing, Speech enhancement, Context (language use), 02 engineering and technology, Signal, Computer Science - Sound, Distributed processing, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Artificial neural network, Node (networking), Wiener filter, 020206 networking & telecommunications, Filter (signal processing), Artificial Intelligence (cs.AI), Computer Science::Sound, symbols, 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multichannel processing is widely used for speech enhancement, but several limitations appear when trying to deploy these solutions in the real world. Distributed sensor arrays comprising several devices, each with a few microphones, are a viable alternative that makes it possible to exploit the many microphone-equipped devices we use in everyday life. In this context, we propose to extend the distributed adaptive node-specific signal estimation approach to a neural network framework. At each node, local filtering is performed to send one signal to the other nodes, where a mask is estimated by a neural network in order to compute a global multichannel Wiener filter. In an array of two nodes, we show that this additional signal can be efficiently taken into account to predict the masks and leads to better speech enhancement performance than when the mask estimation relies only on the local signals., Submitted to ICASSP2020
Published: 2020

8. Sound event detection in synthetic domestic environments

Author: Romain Serizel, Nicolas Turpault, Justin Salamon, Ankit Shah, Serizel, Romain, APPEL À PROJETS GÉNÉRIQUE 2018 - Apprentissage statistique pour la compréhension de scènes audio - - LEAUDS2018 - ANR-18-CE23-0020 - AAPG2018 - VALID, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Language Technologies Institute [Pittsburgh] (LTI), Carnegie Mellon University [Pittsburgh] (CMU), Adobe Research, Grid'5000, ANR-18-CE23-0020,LEAUDS,LEARNING TO UNDERSTAND AUDIO SCENES(2018), ANR-18-CE23-0020,LEAUDS,Apprentissage statistique pour la compréhension de scènes audio(2018), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)
Subjects: semi-supervised learning, [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], Soundscape, Computer science, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, 02 engineering and technology, 030204 cardiovascular system & hematology, Machine learning, computer.software_genre, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 03 medical and health sciences, 0302 clinical medicine, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Impact model, 0202 electrical engineering, electronic engineering, information engineering, business.industry, 020206 networking & telecommunications, [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], synthetic data, Sound event detection, weakly labeled data, Artificial intelligence, business, computer
Abstract: International audience; We present a comparative analysis of the performance of state-of-the-art sound event detection systems. In particular, we study the robustness of the systems to noise and signal degradation, which is known to impact model generalization. Our analysis is based on the results of task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on, in addition to real-world recordings, a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics. Our results show that while overall systems exhibit significant improvements compared to previous work, they still suffer from biases that could prevent them from generalizing to real-world scenarios.
Published: 2020

9. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Author: Ankit Shah, Justin Salamon, Nicolas Turpault, Romain Serizel, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), Language Technologies Institute [Pittsburgh] (LTI), Carnegie Mellon University [Pittsburgh] (CMU), Adobe Research, Grid'5000, Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Soundscape, Synthetic data, Computer science, 02 engineering and technology, Semi-supervised learning, Machine learning, computer.software_genre, Task (project management), [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 030507 speech-language pathology & audiology, 03 medical and health sciences, Domestic environment, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Weakly labeled data, business.industry, 020206 networking & telecommunications, Sound event detection, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Labeled data, Artificial intelligence, 0305 other medical science, business, computer
Abstract: International audience; This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a follow-up to Task 4 of DCASE 2018 and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly labeled synthesized data. The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which combines part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year, and describes the latter in more detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year's winning system by about 10 percentage points in terms of F-measure.
Published: 2019

10. Semi-supervised triplet loss based learning of ambient audio embeddings

Author: Nicolas Turpault, Emmanuel Vincent, Romain Serizel, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Région Grand Est., Grid'5000, ANR-18-CE23-0020,LEAUDS,Apprentissage statistique pour la compréhension de scènes audio(2018), Turpault, Nicolas, and APPEL À PROJETS GÉNÉRIQUE 2018 - Apprentissage statistique pour la compréhension de scènes audio - - LEAUDS2018 - ANR-18-CE23-0020 - AAPG2018 - VALID
Subjects: [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], Computer Science::Machine Learning, triplet loss, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Computer science, Sample (statistics), weak labels, 02 engineering and technology, Semi-supervised learning, 010501 environmental sciences, 01 natural sciences, Task (project management), [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Statistics::Machine Learning, audio tagging, Triplet loss, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, 0105 earth and related environmental sciences, Training set, prototypical network, business.industry, Supervised learning, Pattern recognition, [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], Transformation (function), ComputingMethodologies_PATTERNRECOGNITION, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 020201 artificial intelligence & image processing, Artificial intelligence, audio embedding, business, Feature learning
Abstract: International audience; Deep neural networks are particularly useful to learn relevant representations from data. Recent studies have demonstrated the potential of unsupervised representation learning for ambient sound analysis using various flavors of the triplet loss. They have compared this approach to supervised learning. However, in real situations, it is common to have a small labeled dataset and a large unlabeled one. In this paper, we combine unsupervised and supervised triplet loss based learning into a semi-supervised representation learning approach. We propose two flavors of this approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor, or by selecting the nearest sample in the training set. We compare our approach to supervised and unsupervised representation learning as well as the ratio between the amount of labeled and unlabeled data. We evaluate all the above approaches on an audio tagging task using the DCASE 2018 Task 4 dataset, and we show the impact of this ratio on the tagging performance.
Published: 2019

11. CRNN-based multiple DoA estimation using acoustic intensity features for Ambisonics recordings

Author: Romain Serizel, Emmanuel Vincent, Alexandre Guerin, Laureline Perotin, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Orange Labs [Cesson-Sévigné], Orange Labs, Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)
Subjects: Computer science, audio source localization, 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Signal, direction of arrival, layerwise relevance propagation, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Artificial neural network, business.industry, Ambisonics, acoustic intensity, Direction of arrival, 020206 networking & telecommunications, Pattern recognition, Grid, Sound intensity, first-order Ambisonics, Visualization, Time–frequency analysis, Signal Processing, Artificial intelligence, business, convolutional recurrent neural network
Abstract: International audience; Localizing audio sources is challenging in real reverberant environments, especially when several sources are active. We propose to use a neural network built from stacked convolutional and recurrent layers in order to estimate the direction of arrival of multiple sources from a first-order Ambisonics recording. It returns the directions of arrival (over a discrete grid) of a known number of sources. We propose to use features derived from the acoustic intensity vector as inputs. We analyze the behavior of the neural network by means of a visualization technique called layerwise relevance propagation. This analysis highlights which parts of the input signal are relevant in a given situation. We also conduct experiments to evaluate the performance of our system in various environments, from simulated rooms to real recordings, with one or two speech sources. The results show that the proposed features significantly improve performance with respect to raw Ambisonics inputs.
Published: 2019

12. Regression versus classification for neural network based audio source localization

Author: Alexandre Défossez, Alexandre Guerin, Emmanuel Vincent, Romain Serizel, Laureline Perotin, Perotin, Lauréline, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'informatique de l'école normale supérieure (LIENS), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS), Orange Labs [Cesson-Sévigné], Orange Labs, IEEE, Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), 
Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), and Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris)
Subjects: Mean squared error, soft target, Computer science, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, [INFO.INFO-NE] Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], law.invention, 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, law, training criterion, 0202 electrical engineering, electronic engineering, information engineering, Cartesian coordinate system, angular loss, Cost-sensitive classification, Artificial neural network, Angular distance, Probabilistic logic, Spherical coordinate system, 020206 networking & telecommunications, Grid, Regression, Direction-of-arrival, 0305 other medical science, Algorithm
Abstract: International audience; We compare the performance of regression and classification neural networks for single-source direction-of-arrival estimation. Since the output space is continuous and structured, regression seems more appropriate. However, classification on a discrete spherical grid is widely believed to perform better and is predominantly used in the literature. For regression, we propose two ways to account for the spherical geometry of the output space based either on the angular distance between spherical coordinates or on the mean squared error between Cartesian coordinates. For classification, we propose two alternatives to the classical one-hot encoding framework: we derive a Gibbs distribution from the squared angular distance between grid points and use the corresponding probabilities either as soft targets or as cross-entropy weights that retain a clear probabilistic interpretation. We show that regression on Cartesian coordinates is generally more accurate, except when localized interference is present, in which case classification appears to be more robust.
Published: 2019

13. CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector

Author: Emmanuel Vincent, Romain Serizel, Laureline Perotin, Alexandre Guerin, Orange Labs [Cesson-Sévigné], Orange Labs, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), Perotin, Lauréline, Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Reverberation, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Ambisonics, Computer science, Microphone, acoustic intensity, Direction of arrival, 020206 networking & telecommunications, CRNN, 02 engineering and technology, Sound intensity, direction of arrival, Azimuth, first-order Ambisonics (FOA), 030507 speech-language pathology & audiology, 03 medical and health sciences, Noise, Recurrent neural network, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science, Algorithm
Abstract: International audience; We present a source localization system for first-order Ambisonics (FOA) contents based on a stacked convolutional and recurrent neural network (CRNN). We propose to use as input to the CRNN the FOA acoustic intensity vector, which is easy to compute and closely linked to the sound direction of arrival (DoA). The system estimates the DoA of a point source in both azimuth and elevation. We conduct an experimental evaluation in configurations including reverberation, noise, and various speaker w.r.t. microphone orientations. The results show that the proposed architecture and input allow the network to return accurate location estimates in realistic conditions compared to another recent CRNN-based system.
Published: 2018

14. Multichannel Audio Modeling with Elliptically Stable Tensor Decomposition

Author: Romain Serizel, Mathieu Fontaine, Roland Badeau, Umut Simsekli, Fabian-Robert Stöter, Antoine Liutkus, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Scientific Data Management (ZENITH), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Inria Sophia Antipolis - Méditerranée (CRISAM), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Deville Y., Gannot S., Mason R., Plumbley M., Ward D., ANR-15-CE38-0003,KAMoulox,Démixage en ligne de larges archives sonores(2015), ANR-16-CE23-0014,FBIMATRIX,Méthodes distribuées et parallèles de Monte-Carlo par chaînes de Markov pour l'Inférence Bayésienne de 
modèles à factorisation de tenseurs(2016), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), and Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Inria Sophia Antipolis - Méditerranée (CRISAM)
Subjects: Computer science, Estimation theory, Noise reduction, Gaussian, Monte Carlo method, 020206 networking & telecommunications, 02 engineering and technology, Data_CODINGANDINFORMATIONTHEORY, Non-negative matrix factorization, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Noise, symbols.namesake, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Computer Science::Sound, 0202 electrical engineering, electronic engineering, information engineering, symbols, Spectrogram, 0305 other medical science, Algorithm
Abstract: International audience; This paper introduces a new method for multichannel speech enhancement based on a versatile modeling of the residual noise spectrogram. Such a model has already been presented before in the single channel case where the noise component is assumed to follow an alpha-stable distribution for each time-frequency bin, whereas the speech spectrogram, supposed to be more regular, is modeled as Gaussian. In this paper, we describe a multichannel extension of this model, as well as a Monte Carlo Expectation-Maximisation algorithm for parameter estimation. In particular, a multichannel extension of the Itakura-Saito nonnegative matrix factorization is exploited to estimate the spectral parameters for speech, and a Metropolis-Hastings algorithm is proposed to estimate the noise contribution. We evaluate the proposed method in a challenging multichannel denoising application and compare it to other state-of-the-art algorithms.
Published: 2018

15. Multiple-input neural network-based residual echo suppression

Author: Romain Serizel, Eric Humbert, Emmanuel Vincent, Guillaume Carbajal, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), Invoxia [Issy-les-Moulineaux], Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), and CARBAJAL, Guillaume
Subjects: Computer science, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, neural network, Speech recognition, [INFO.INFO-NE] Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], residual echo suppression, [INFO] Computer Science [cs], [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Residual, Reduction (complexity), 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, ComputerSystemsOrganization_SPECIAL-PURPOSEANDAPPLICATION-BASEDSYSTEMS, [INFO]Computer Science [cs], Hardware_ARITHMETICANDLOGICSTRUCTURES, Artificial neural network, business.industry, Deep learning, Echo (computing), deep learning, Filter design, ComputingMethodologies_PATTERNRECOGNITION, Nonlinear distortion, Artificial intelligence, Loudspeaker, 0305 other medical science, business
Abstract: International audience; A residual echo suppressor (RES) aims to suppress the residual echo in the output of an acoustic echo canceler (AEC). Spectral-based RES approaches typically estimate the magnitude spectra of the near-end speech and the residual echo from a single input, that is either the far-end speech or the echo computed by the AEC, and derive the RES filter coefficients accordingly. These single inputs do not always suffice to discriminate the near-end speech from the remaining echo. In this paper, we propose a neural network-based approach that directly estimates the RES filter coefficients from multiple inputs, including the AEC output, the far-end speech, and/or the echo computed by the AEC. We evaluate our system on real recordings of acoustic echo and near-end speech acquired in various situations with a smart speaker. We compare it to two single-input spectral-based approaches in terms of echo reduction and near-end speech distortion.
Published: 2018

16. Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification

Author: Romain Serizel, Gael Richard, Victor Bisot, Slim Essid, Laboratoire Traitement et Communication de l'Information (LTCI), Télécom ParisTech-Institut Mines-Télécom [Paris] (IMT)-Centre National de la Recherche Scientifique (CNRS), Département Traitement du Signal et des Images (TSI), Télécom ParisTech-Centre National de la Recherche Scientifique (CNRS), Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine 
(UL)
Subjects: Acoustics and Ultrasonics, Computer science, 02 engineering and technology, Machine learning, computer.software_genre, Non-negative matrix factorization, Matrix decomposition, Task (project management), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, business.industry, Feature learning, Matrix Factorization, 020206 networking & telecommunications, Pattern recognition, Acoustic Scene Classification, Computational Mathematics, Principal component analysis, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Unsupervised learning, Spectrogram, 020201 artificial intelligence & image processing, Artificial intelligence, business, Focus (optics), computer
Abstract: International audience; In this paper, we study the usefulness of various matrix factorization methods for learning features to be used for the specific Acoustic Scene Classification (ASC) problem. A common way of addressing ASC has been to engineer features capable of capturing the specificities of acoustic environments. Instead, we show that better representations of the scenes can be automatically learned from time-frequency representations using matrix factorization techniques. We mainly focus on extensions including sparse, kernel-based, convolutive and a novel supervised dictionary learning variant of Principal Component Analysis and Nonnegative Matrix Factorization. An experimental evaluation is performed on two of the largest ASC datasets available in order to compare and discuss the usefulness of these methods for the task. We show that the unsupervised learning methods provide better representations of acoustic scenes than the best conventional hand-crafted features on both datasets. Furthermore, the introduction of a novel nonnegative supervised matrix factorization model and of deep neural networks trained on spectrograms allows us to reach further improvements.
Published: 2017

17. Multiview approaches to event detection and scene analysis

Author: Fabio Antonacci, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Sanjeel Parekh, Romain Serizel, Augusto Sarti, Signal, Statistique et Apprentissage (S2A), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris-Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Département Images, Données, Signal (IDS), Télécom ParisTech, Technicolor R & I [Cesson Sévigné], Technicolor, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Dipartimento di Elettronica e Informazione, Politecnico di Milano [Milan] (POLIMI), Tuomas Virtanen, Mark D. Plumbley, and Dan Ellis
Subjects: Beamforming, Microphone array, Computer science, media_common.quotation_subject, Audio source separation, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Context (language use), Multichannel audio, Multimodal scene analysis, 02 engineering and technology, Representation learning, Presentation, Physics and Astronomy (all), Engineering (all), [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio source localization and tracking, Data fusion, Joint audiovisual scene analysis, Matrix factorization, Multichannel Wiener filtering, Multiview scene analysis, Tensor factorization, Computer Science (all), 0202 electrical engineering, electronic engineering, information engineering, Selection (linguistics), Computer vision, Wiener filtering, media_common, business.industry, Event (computing), [SCCO.NEUR]Cognitive science/Neuroscience, 020206 networking & telecommunications, Sensor fusion, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Multichannel, 020201 artificial intelligence & image processing, Artificial intelligence, business, Feature learning, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; This chapter addresses sound scene and event classification in multiview settings, that is, settings where the observations are obtained from multiple sensors, each sensor contributing a particular view of the data (e.g., audio microphones, video cameras, etc.). We briefly introduce some of the techniques that can be exploited to effectively combine the data conveyed by the different views under analysis for a better interpretation. We first provide a high-level presentation of generic methods that are particularly relevant in the context of multiview and multimodal sound scene analysis. Then, we more specifically present a selection of techniques used for audiovisual event detection and microphone array-based scene analysis.
Published: 2017

18. Machine listening techniques as a complement to video image analysis in forensics

Author: Victor Bisot, Romain Serizel, Slim Essid, Gael Richard, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Traitement et Communication de l'Information (LTCI), Télécom ParisTech-Institut Mines-Télécom [Paris] (IMT)-Centre National de la Recherche Scientifique (CNRS), Département Traitement du Signal et des Images (TSI), Télécom ParisTech-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine 
(UL), and Centre National de la Recherche Scientifique (CNRS)-Télécom ParisTech
Subjects: Machine listening, Multimedia, Computer science, media_common.quotation_subject, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020206 networking & telecommunications, ComputerApplications_COMPUTERSINOTHERSYSTEMS, 02 engineering and technology, computer.software_genre, Video image, Motion (physics), Complement (complexity), Sound recording and reproduction, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Quality (business), 0305 other medical science, Hidden Markov model, computer, Mobile device, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, media_common
Abstract: International audience; Video is now one of the major sources of information for forensics. However, video documents can originate from various recording devices (CCTV, mobile devices, etc.) with inconsistent quality and can sometimes be recorded in challenging light or motion conditions. Therefore, the amount of information that can be extracted relying solely on the video image can vary to a great extent. Most videos, however, include an audio recording as well. Machine listening can then become a valuable complement to video image analysis in challenging scenarios. In this paper, the authors present a brief overview of some machine listening techniques and their application to the analysis of video documents for forensics. The applicability of these techniques to forensics problems is then discussed in the light of machine listening system performance.
Published: 2016

19. Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification

Author: Slim Essid, Romain Serizel, Gael Richard, Signal, Statistique et Apprentissage (S2A), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris-Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Département Traitement du Signal et des Images (TSI), Télécom ParisTech-Centre National de la Recherche Scientifique (CNRS), European Project: 607480,EC:FP7:SEC,FP7-SEC-2013-1,LASIE(2014), HAL, TelecomParis, and LArge Scale Information Exploitation of Forensic Data - LASIE - - EC:FP7:SEC2014-05-01 - 2017-10-31 - 607480 - VALID
Subjects: feature learning, speaker identification, Point (typography), Computer science, Group (mathematics), Speech recognition, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), 020206 networking & telecommunications, 02 engineering and technology, Speaker recognition, Identity (music), Compensation (engineering), Speaker diarisation, speaker variability, 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer Science::Sound, spectrogram factorisation, 0202 electrical engineering, electronic engineering, information engineering, Session (computer science), 0305 other medical science, Feature learning, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing, Nonnegative matrix factorisation
Abstract: International audience; This paper presents a feature learning approach for speaker identification that is based on nonnegative matrix factorisation. Recent studies have shown that with such models, the dictionary atoms can represent the speaker identity well. The approaches proposed so far focused only on speaker variability and not on session variability. However, this latter point is a crucial aspect in the success of the I-vector approach that is now the state of the art in speaker identification. This paper proposes a method that relies on group nonnegative matrix factorisation and that is inspired by the I-vector training procedure. By doing so, the proposed approach intends to capture both the speaker variability and the session variability. Results on a small corpus show that the proposed approach can be competitive with I-vectors.
Published: 2016

20. Acoustic scene classification with matrix factorization for unsupervised feature learning

Author: Romain Serizel, Gael Richard, Slim Essid, Victor Bisot, Signal, Statistique et Apprentissage (S2A), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris-Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Département Traitement du Signal et des Images (TSI), Télécom ParisTech-Centre National de la Recherche Scientifique (CNRS), and European Project: 607480,EC:FP7:SEC,FP7-SEC-2013-1,LASIE(2014)
Subjects: business.industry, Computer science, Speech recognition, Pooling, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, matrix factorization, unsupervised feature learning, Non-negative matrix factorization, Matrix decomposition, 030507 speech-language pathology & audiology, 03 medical and health sciences, Kernel (linear algebra), Kernel (image processing), [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], Principal component analysis, Acoustic scene classification, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0305 other medical science, business, Feature learning, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, ComputingMilieux_MISCELLANEOUS
Abstract: In this paper we study the use of unsupervised feature learning for acoustic scene classification (ASC). The acoustic environment recordings are represented by time-frequency images from which we learn features in an unsupervised manner. After a set of preprocessing and pooling steps, the images are decomposed using matrix factorization methods. By decomposing the data on a learned dictionary, we use the projection coefficients as features for classification. An experimental evaluation is done on a large ASC dataset to study popular matrix factorization methods such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) as well as some of their extensions, including sparse, kernel-based and convolutive variants. The results show that the compared variants lead to a significant improvement over the state-of-the-art results in ASC.
Published: 2016

21. Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition

Author: Romain Serizel, Diego Giuliani, Laboratoire Traitement et Communication de l'Information (LTCI), Télécom ParisTech-Institut Mines-Télécom [Paris] (IMT)-Centre National de la Recherche Scientifique (CNRS), Département Traitement du Signal et des Images (TSI), Télécom ParisTech-Centre National de la Recherche Scientifique (CNRS), and Fondazione Bruno Kessler [Trento, Italy] (FBK)
Subjects: Single pass, Computer science, Speech recognition, Word error rate, 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Vocal tract length normalisation, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, children's speech recognition, Image warping, Hidden Markov model, Artificial neural network, business.industry, automatic speech recognition, 020206 networking & telecommunications, Pattern recognition, deep neural networks, Mel-frequency cepstrum, Artificial intelligence, 0305 other medical science, business, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Vocal tract, Decoding methods
Abstract: International audience; This paper introduces approaches based on vocal tract length normalisation (VTLN) techniques for hybrid deep neural network (DNN)-hidden Markov model (HMM) automatic speech recognition when targeting children's and adults' speech. VTLN is first investigated by training a DNN-HMM system using mel frequency cepstral coefficients (MFCCs) normalised with standard VTLN. Then, MFCC-derived acoustic features are combined with the VTLN warping factors to obtain an augmented set of features as input to a DNN. In this latter, novel approach, the warping factors are obtained with a separate DNN and the decoding can be operated in a single pass, whereas the standard VTLN approach requires two decoding passes. Both VTLN-based approaches are shown to improve phone error rate performance, with up to 20% relative improvement, compared to a baseline trained on a mixture of children's and adults' speech.
Published: 2014

22. Low-rank Approximation Based Multichannel Wiener Filter Algorithms for Noise Reduction with Application in Cochlear Implants

Author: Jan Wouters, Romain Serizel, Marc Moonen, Bas van Dijk, Fondazione Bruno Kessler [Trento, Italy] (FBK), Department of Electrical Engineering - K.U.Leuven (ESAT/SCD-COSIC), Faculty of Engineering of the K.U. Leuven, Cochlear CTCE, and UZ Leuven - campus Gasthuisberg
Subjects: Acoustics and Ultrasonics, Noise measurement, Noise (signal processing), Speech recognition, Wiener filter, Salt-and-pepper noise, Speech processing, Gradient noise, Computational Mathematics, symbols.namesake, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Gaussian noise, Computer Science::Sound, Computer Science (miscellaneous), symbols, Median filter, Electrical and Electronic Engineering, Algorithm, Mathematics
Abstract: International audience; This paper presents low-rank approximation based multichannel Wiener filter algorithms for noise reduction in speech plus noise scenarios, with application in cochlear implants. In a single speech source scenario, the frequency-domain autocorrelation matrix of the speech signal is often assumed to be a rank-1 matrix, which then allows different rank-1 approximation based noise reduction filters to be derived. In practice, however, the rank of the autocorrelation matrix of the speech signal is usually greater than one. Firstly, the link between the different rank-1 approximation based noise reduction filters and the original speech distortion weighted multichannel Wiener filter is investigated when the rank of the autocorrelation matrix of the speech signal is indeed greater than one. Secondly, in low input signal-to-noise-ratio scenarios, due to noise non-stationarity, the estimation of the autocorrelation matrix of the speech signal can be problematic and the noise reduction filters can deliver unpredictable noise reduction performance. An eigenvalue decomposition based filter and a generalized eigenvalue decomposition based filter are introduced that include a more robust rank-1, or more generally rank-R, approximation of the autocorrelation matrix of the speech signal. These noise reduction filters are demonstrated to deliver better noise reduction performance, especially in low input signal-to-noise-ratio scenarios. The filters are especially useful in cochlear implants, where more speech distortion and hence a more aggressive noise reduction can be tolerated.
Published: 2014

23. A Speech Distortion Weighting Based Approach to Integrated Active Noise Control and Noise Reduction in Hearing Aids

Author: Romain Serizel, Marc Moonen, Jan Wouters, and Søren Holdt Jensen
Subjects: Engineering, Noise temperature, Noise measurement, business.industry, Speech recognition, Salt-and-pepper noise, Noise figure, symbols.namesake, Noise, Control and Systems Engineering, Control theory, Gaussian noise, Signal Processing, symbols, Effective input noise temperature, Computer Vision and Pattern Recognition, Electrical and Electronic Engineering, business, Software, Active noise control, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing
Abstract: This paper presents weighted approaches for integrated active noise control and noise reduction in hearing aids. The unweighted integrated active noise control and noise reduction scheme introduced in previous work does not allow a trade-off between the active noise control and the noise reduction. In some circumstances it will, however, be useful to emphasize one of the functional blocks. Changing the original optimisation problem to a constrained optimisation problem leads to a scheme based on a weighted mean squared error criterion that makes it possible to focus either on the active noise control or on the noise reduction. It is similarly possible to derive a scheme that focuses either on reducing the speech distortion or on reducing the residual noise at the eardrum. In a single speech source scenario and when the number of sound sources (speech plus noise sources) is less than or equal to the number of microphones, it is possible to derive a simple formula for the output signal-to-noise ratio of the latter scheme. It can then be shown that this scheme delivers a constant signal-to-noise ratio at the eardrum for any weighting factor.
Published: 2013

24. Rank-1 Approximation Based Multichannel Wiener Filtering Algorithms For Noise Reduction In Cochlear Implants

Author: Bas van Dijk, Marc Moonen, Romain Serizel, and Jan Wouters
Subjects: Speech enhancement, symbols.namesake, Signal-to-noise ratio, Autocorrelation matrix, Noise reduction, Wiener filter, symbols, Speech processing, Algorithm, Eigendecomposition of a matrix, Matrix decomposition, Mathematics, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing
Abstract: This paper presents multichannel Wiener filtering-based algorithms for noise reduction in cochlear implants. In a single speech scenario, the autocorrelation matrix of the speech signal can be approximated by a rank-1 matrix. It is then possible to derive noise reduction filters that deliver improved signal-to-noise ratio performance. The link between these different filters is investigated here and an eigenvalue decomposition based algorithm is demonstrated to be more stable at low input signal-to-noise ratio compared to previous algorithms.
Published: 2013

25. Output SNR analysis of integrated active noise control and noise reduction in hearing aids under a single speech source scenario

Author: Søren Holdt Jensen, Romain Serizel, Jan Wouters, and Marc Moonen
Subjects: Noise temperature, Engineering, Noise measurement, business.industry, Noise figure, Noise floor, symbols.namesake, Noise, Control and Systems Engineering, Gaussian noise, Signal Processing, Electronic engineering, symbols, Effective input noise temperature, Computer Vision and Pattern Recognition, Electrical and Electronic Engineering, business, Software, Active noise control, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing
Abstract: This paper analyses the output signal-to-noise ratio for a standard noise reduction scheme based on the multichannel Wiener filter and for an integrated active noise control and noise reduction scheme based on the filtered-X multichannel Wiener filter, both applied in a hearing aid framework that includes the effects of signal leakage through an open fitting and secondary path effects. In previous work, integrating noise reduction and active noise control has been shown to compensate for the effects of signal leakage and secondary path effects. These experimental results are now verified theoretically. The output signal-to-noise ratios are derived under a single speech source scenario. Theoretical results are then compared to simulations for a single noise source scenario and a multiple noise source scenario.
Published: 2011

26. Accuracy Constraint Determination in Fixed-Point System Design

Author: Daniel Menard, Olivier Sentieys, Romain Serizel, Romuald Rocher, Energy Efficient Computing ArchItectures with Embedded Reconfigurable Resources (CAIRN), Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-ARCHITECTURE (IRISA-D3), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom 
Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
Subjects: General Computer Science, Noise (signal processing), Computer science, business.industry, 020208 electrical & electronic engineering, Real-time computing, Design flow, lcsh:Electronics, Process (computing), lcsh:TK7800-8360, 020206 networking & telecommunications, 02 engineering and technology, Fixed point, Constraint (information theory), [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Control and Systems Engineering, 0202 electrical engineering, electronic engineering, information engineering, Systems design, business, Algorithm, Encoder, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Digital signal processing, Computer Science(all)
Abstract: International audience; Most digital signal processing applications are specified and designed with floating-point arithmetic but are finally implemented using fixed-point architectures. Thus, the design flow requires a floating-point to fixed-point conversion stage which optimizes the implementation cost under execution time and accuracy constraints. This accuracy constraint is linked to the application performance, and the determination of this constraint is one of the key issues of the conversion process. In this paper, a method is proposed to determine the accuracy constraint from the application performance. The fixed-point system is modeled with an infinite precision version of the system and a single noise source located at the system output. Then, an iterative approach for optimizing the fixed-point specification under the application performance constraint is defined and detailed. Finally, the efficiency of our approach is demonstrated by experiments on an MP3 encoder.
Published: 2008

27. Analysis of scientific challenges in ambient sound recognition in real environments

Author: Turpault, Nicolas, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL), Université de Lorraine, Emmanuel Vincent, Romain Serizel, Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Sound event detection, Sound event tagging, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], [INFO]Computer Science [cs], Detection and classification of ambient sound and events
Abstract: We are constantly surrounded by ambient sounds. From a car passing by to a bird's song, or from the running water in the shower to the sound of a keyboard, ambient sounds are everywhere. Humans without hearing loss unconsciously recognize them and make many decisions using the information provided by ambient sounds in their everyday life (reacting to a baby crying or to an alarm, for example). In recent years, research interest in automatic ambient sound analysis has grown rapidly. Ambient sound analysis is a difficult problem because of the complexity of the sound scenes and their lack of apparent structure. The sound events constituting these sound scenes are varied, and multiple events can appear simultaneously. To recognize sound events automatically, machine learning methods are usually used, in particular deep learning methods due to their good performance on a variety of tasks including ambient sound analysis. These methods require a training dataset containing the sound events to be recognized. Ideally, the dataset contains labels indicating the type of events and their time positions in the audio clips (strong labels). In recent years, some strongly annotated datasets designed for ambient sound analysis have appeared, but they usually contain only a small amount of data and are rarely recorded in real conditions. Strong annotations are expensive to collect, making it difficult to acquire a large-scale strongly labeled dataset. However, collecting data without labels, or with partial labels indicating the presence of some events without their time information (weak labels), is easier. This thesis fits in this context. We propose to address the problem of sound event recognition in domestic environments using unlabeled and weakly labeled data. Our goal is to analyze the different problems that can appear in a real-world scenario of sound event recognition in a domestic environment, with applications to assisted living and smart homes. To analyze this problem, we organized a domestic sound event detection task in an international ambient sound analysis challenge. We defined this task in such a way that it allows us to analyze the different problems appearing in a real-world scenario. We collected, annotated and shared a dataset designed for this analysis. From 2018 to 2020, we organized three evaluation campaigns to allow for a detailed analysis of the systems submitted by participants and a continuous improvement of the task definition. Then, we focus on the problem of learning systems using both labeled and unlabeled training data (semi-supervised learning). The analysis concentrates on learning a representation which could be useful for a variety of tasks in sound event detection or tagging. Finally, we analyze the impact of weak labels in the training dataset of a sound event recognition system to understand whether this is the main problem of a sound event recognition system and to provide advice for the labelling of real-world data.
Published: 2021
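
Note: the distinction the abstract above draws between strong labels (event type plus time position) and weak labels (event presence only) can be illustrated with a minimal Python sketch. The event names, timings and the helper `to_weak_labels` are hypothetical and chosen only for illustration; they are not taken from the thesis or its dataset.

```python
from typing import List, Set, Tuple

# A strong label gives the event class plus its onset and offset in seconds.
StrongLabel = Tuple[str, float, float]


def to_weak_labels(strong: List[StrongLabel]) -> Set[str]:
    """Collapse strong labels to weak labels by discarding time positions."""
    return {event for event, _onset, _offset in strong}


# Hypothetical annotations for one domestic audio clip.
clip_strong: List[StrongLabel] = [
    ("baby_cry", 0.5, 3.2),
    ("alarm", 2.0, 4.0),   # overlaps with the first event
    ("baby_cry", 6.1, 8.0),
]

# The weak label only says which event classes occur somewhere in the clip.
clip_weak = to_weak_labels(clip_strong)
print(sorted(clip_weak))
```

Going from strong to weak labels is a lossy operation: the two separate occurrences of the same event and the overlap between events can no longer be recovered, which is one reason training on weak labels is harder.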

28. Apprentissage profond bout-en-bout pour le rehaussement de la parole

Author: Carbajal, Guillaume, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Université de Lorraine, Emmanuel Vincent, Romain Serizel, and Éric Humbert [Co-encadrant]
Subjects: Reverberation, Speech enhancement, Rehaussement de la parole, Réseaux de neurones profonds, Background noise, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Réduction conjointe de distorsions, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Réverbération, Deep neural networks, Joint reduction of distortions, [INFO]Computer Science [cs], Bruit ambiant, Écho acoustique, Acoustic echo
Abstract: This PhD falls within the development of hands-free telecommunication systems, more specifically smart speakers in domestic environments. The user interacts with another speaker at a far-end point and is typically a few meters away from this kind of system. The microphones are likely to capture sounds from the environment that are added to the user's voice, such as background noise, acoustic echo and reverberation. These types of distortion degrade speech quality, intelligibility and listening comfort for the far-end speaker, and must be reduced. Filtering methods can reduce each of these types of distortion individually. Reducing all of them implies combining the corresponding filtering methods. Because these methods interact with each other, which can degrade the user's speech, they must be jointly optimized. First, we introduce an acoustic echo reduction approach which combines an echo cancellation filter with a residual echo postfilter designed to adapt to the echo cancellation filter. To do so, we propose to estimate the postfilter coefficients using the short-term spectra of multiple known signals, including the output of the echo cancellation filter, as inputs to a neural network. We show that this approach improves the performance and the robustness of the postfilter in terms of echo reduction, while limiting speech degradation, in several scenarios in real conditions. Second, we describe a joint approach for multichannel reduction of echo, reverberation and noise. We propose to simultaneously model the target speech and the undesired residual signals after echo cancellation and dereverberation in a probabilistic framework, and to jointly represent their short-term spectra by means of a recurrent neural network. We develop a block-coordinate ascent algorithm to update the echo cancellation and dereverberation filters, as well as the postfilter that reduces the undesired residual signals.
We evaluate our approach on real recordings in different conditions. We show that it improves speech quality and the reduction of echo, reverberation and noise compared to a cascade of individual filtering methods and to another joint reduction approach. Finally, we present an online version of our approach which is suitable for time-varying acoustic conditions. We evaluate the perceptual quality achieved on real examples where the user moves during the conversation.
Published: 2020

29. End-to-end deep learning for speech enhancement

Author: Carbajal, Guillaume, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Université de Lorraine, Emmanuel Vincent, Romain Serizel, and Éric Humbert [Co-encadrant]
Subjects: Reverberation, Speech enhancement, Rehaussement de la parole, Réseaux de neurones profonds, Background noise, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Réduction conjointe de distorsions, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Réverbération, Deep neural networks, Joint reduction of distortions, [INFO]Computer Science [cs], Bruit ambiant, Écho acoustique, Acoustic echo
Abstract: This PhD falls within the development of hands-free telecommunication systems, more specifically smart speakers in domestic environments. The user interacts with another speaker at a far-end point and is typically a few meters away from this kind of system. The microphones are likely to capture sounds from the environment that are added to the user's voice, such as background noise, acoustic echo and reverberation. These types of distortion degrade speech quality, intelligibility and listening comfort for the far-end speaker, and must be reduced. Filtering methods can reduce each of these types of distortion individually. Reducing all of them implies combining the corresponding filtering methods. Because these methods interact with each other, which can degrade the user's speech, they must be jointly optimized. First, we introduce an acoustic echo reduction approach which combines an echo cancellation filter with a residual echo postfilter designed to adapt to the echo cancellation filter. To do so, we propose to estimate the postfilter coefficients using the short-term spectra of multiple known signals, including the output of the echo cancellation filter, as inputs to a neural network. We show that this approach improves the performance and the robustness of the postfilter in terms of echo reduction, while limiting speech degradation, in several scenarios in real conditions. Second, we describe a joint approach for multichannel reduction of echo, reverberation and noise. We propose to simultaneously model the target speech and the undesired residual signals after echo cancellation and dereverberation in a probabilistic framework, and to jointly represent their short-term spectra by means of a recurrent neural network. We develop a block-coordinate ascent algorithm to update the echo cancellation and dereverberation filters, as well as the postfilter that reduces the undesired residual signals.
We evaluate our approach on real recordings in different conditions. We show that it improves speech quality and the reduction of echo, reverberation and noise compared to a cascade of individual filtering methods and to another joint reduction approach. Finally, we present an online version of our approach which is suitable for time-varying acoustic conditions. We evaluate the perceptual quality achieved on real examples where the user moves during the conversation.
Published: 2020

30. Localization and enhancement of speech from the Ambisonics format

Author: Perotin, Lauréline, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Orange Labs [Cesson-Sévigné], Orange Labs, Université de Lorraine, Emmanuel Vincent, Romain Serizel, and Alexandre Guérin (Orange Labs)
Subjects: Ambisonics, Séparation de sources, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Parole, Speech, Réseaux de neurones, Ambisonie, Source separation, Source localization, Neural networks, Localisation de sources
Abstract: This work was conducted in the fast-growing context of hands-free voice command. In domestic environments, smart devices are usually placed in a fixed position, while the human speaker gives orders from anywhere, not necessarily next to the device, nor even facing it. This adds difficulties compared to the problem of near-field voice command (typically for mobile phones): strong reverberation, early reflections on furniture around the device, and surrounding noises can degrade the signal. Moreover, other speakers may interfere, which makes understanding the target speaker quite difficult. In order to facilitate speech recognition in such adverse conditions, several preprocessing methods are introduced here. We use a spatialized audio format suitable for audio scene analysis: the Ambisonics format. We first propose a sound source localization method that relies on a convolutional and recurrent neural network. We define an input feature vector inspired by the acoustic intensity vector which improves the localization performance, in particular in real conditions involving several speakers and a microphone array laid on a table. We exploit the visualization technique called layerwise relevance propagation (LRP) to highlight the time-frequency zones that correlate positively with the network output. This analysis is of paramount importance to establish the validity of a neural network. In addition, it shows that the neural network essentially relies on time-frequency zones where direct sound dominates reverberation and background noise. We then present a method to enhance the voice of the main speaker and ease its recognition. We adopt a mask-based beamforming framework based on a time-frequency mask estimated by a neural network. To deal with the situation of multiple speakers with similar loudness, we first use a wideband beamformer to enhance the target speaker thanks to the associated localization information.
We show that this additional information is not enough for the network when two speakers are close to each other. However, if we also give an enhanced version of the interfering speaker as input to the network, it returns much better masks. The filters generated from those masks greatly improve speech recognition performance. We evaluate this algorithm in various environments, including real ones, with a black-box automatic speech recognition system. Finally, we combine the proposed localization and enhancement systems and evaluate the robustness of the latter to localization errors in real environments.
Published: 2019
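
Note: the abstract above mentions input features "inspired by the acoustic intensity vector". As a rough illustration of why that quantity carries direction information, the Python sketch below encodes a single plane wave in first-order Ambisonics and recovers its direction from the textbook active intensity, Re{conj(W) · [X, Y, Z]}. The encoding convention (unit gain on W, no format-specific scaling), the function names and the test values are all assumptions made for this sketch; the actual features used in the thesis are not specified in the abstract and may differ.

```python
import cmath
import math
from typing import Tuple


def encode_plane_wave(
    s: complex, azimuth: float, elevation: float
) -> Tuple[complex, complex, complex, complex]:
    """Encode one time-frequency coefficient s as first-order Ambisonics.

    Assumes a single plane wave and unit gain on W. Real Ambisonics
    formats apply convention-specific scaling factors, ignored here.
    Angles are in radians.
    """
    w = s
    x = s * math.cos(azimuth) * math.cos(elevation)
    y = s * math.sin(azimuth) * math.cos(elevation)
    z = s * math.sin(elevation)
    return w, x, y, z


def intensity_vector(
    w: complex, x: complex, y: complex, z: complex
) -> Tuple[float, float, float]:
    """Active intensity up to a constant factor: Re{conj(W) * [X, Y, Z]}."""
    wc = w.conjugate()
    return (wc * x).real, (wc * y).real, (wc * z).real


def direction_of_arrival(ix: float, iy: float, iz: float) -> Tuple[float, float]:
    """Azimuth and elevation (radians) of the direction the vector points to."""
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


# Hypothetical source: arbitrary complex amplitude, 40 degrees azimuth,
# -10 degrees elevation.
s = cmath.rect(0.8, 1.3)
w, x, y, z = encode_plane_wave(s, math.radians(40.0), math.radians(-10.0))
az, el = direction_of_arrival(*intensity_vector(w, x, y, z))
print(math.degrees(az), math.degrees(el))
```

With a single plane wave and no noise, the recovered angles match the encoded ones regardless of the phase of `s`. With reverberation, noise or several simultaneous speakers, individual time-frequency bins give unreliable directions, which is consistent with the abstract's observation that the network relies mostly on zones where direct sound dominates.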

31. Localisation et rehaussement de sources de parole au format Ambisonique: analyse de scènes sonores pour faciliter la commande vocale

Author: Perotin, Lauréline, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Orange Labs [Cesson-Sévigné], Orange Labs, Université de Lorraine, Emmanuel Vincent, Romain Serizel, and Alexandre Guérin (Orange Labs)
Subjects: Ambisonics, Séparation de sources, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Parole, Speech, Réseaux de neurones, Ambisonie, Source separation, Source localization, Neural networks, Localisation de sources
Abstract: This work was conducted in the fast-growing context of hands-free voice command. In domestic environments, smart devices are usually placed in a fixed position, while the human speaker gives orders from anywhere, not necessarily next to the device, nor even facing it. This adds difficulties compared to the problem of near-field voice command (typically for mobile phones): strong reverberation, early reflections on furniture around the device, and surrounding noises can degrade the signal. Moreover, other speakers may interfere, which makes understanding the target speaker quite difficult. In order to facilitate speech recognition in such adverse conditions, several preprocessing methods are introduced here. We use a spatialized audio format suitable for audio scene analysis: the Ambisonics format. We first propose a sound source localization method that relies on a convolutional and recurrent neural network. We define an input feature vector inspired by the acoustic intensity vector which improves the localization performance, in particular in real conditions involving several speakers and a microphone array laid on a table. We exploit the visualization technique called layerwise relevance propagation (LRP) to highlight the time-frequency zones that correlate positively with the network output. This analysis is of paramount importance to establish the validity of a neural network. In addition, it shows that the neural network essentially relies on time-frequency zones where direct sound dominates reverberation and background noise. We then present a method to enhance the voice of the main speaker and ease its recognition. We adopt a mask-based beamforming framework based on a time-frequency mask estimated by a neural network. To deal with the situation of multiple speakers with similar loudness, we first use a wideband beamformer to enhance the target speaker thanks to the associated localization information.
We show that this additional information is not enough for the network when two speakers are close to each other. However, if we also give an enhanced version of the interfering speaker as input to the network, it returns much better masks. The filters generated from those masks greatly improve speech recognition performance. We evaluate this algorithm in various environments, including real ones, with a black-box automatic speech recognition system. Finally, we combine the proposed localization and enhancement systems and evaluate the robustness of the latter to localization errors in real environments.
Published: 2019

31 results on '"Romain Serizel"'

1. Barlow Twins self-supervised learning for robust speaker recognition

2. A Comprehensive Exploration of Noise Robustness and Noise Compensation in ResNet and TDNN-based Speaker Recognition Systems

3. Distributed speech separation in spatially unconstrained microphone arrays

4. DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays

5. Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes

6. Joint NN-Supported Multichannel Reduction of Acoustic Echo, Reverberation and Noise

7. DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays

8. Sound event detection in synthetic domestic environments

9. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

10. Semi-supervised triplet loss based learning of ambient audio embeddings

11. CRNN-based multiple DoA estimation using acoustic intensity features for Ambisonics recordings

12. Regression versus classification for neural network based audio source localization

13. CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector

14. Multichannel Audio Modeling with Elliptically Stable Tensor Decomposition

15. Multiple-input neural network-based residual echo suppression

16. Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification

17. Multiview approaches to event detection and scene analysis

18. Machine listening techniques as a complement to video image analysis in forensics

19. Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification

20. Acoustic scene classification with matrix factorization for unsupervised feature learning

21. Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition

22. Low-rank Approximation Based Multichannel Wiener Filter Algorithms for Noise Reduction with Application in Cochlear Implants

23. A Speech Distortion Weighting Based Approach to Integrated Active Noise Control and Noise Reduction in Hearing Aids

24. Rank-1 Approximation Based Multichannel Wiener Filtering Algorithms For Noise Reduction In Cochlear Implants

25. Output SNR analysis of integrated active noise control and noise reduction in hearing aids under a single speech source scenario

26. Accuracy Constraint Determination in Fixed-Point System Design

27. Analysis of scientific challenges in ambient sound recognition in real environments

28. Apprentissage profond bout-en-bout pour le rehaussement de la parole

29. End-to-end deep learning for speech enhancement

30. Localization and enhancement of speech from the Ambisonics format

31. Localisation et rehaussement de sources de parole au format Ambisonique: analyse de scènes sonores pour faciliter la commande vocale
