1. Sound Event Detection and Separation: a Benchmark on DESED Synthetic Soundscapes
- Author
Romain Serizel, Hakan Erdogan, Justin Salamon, Nicolas Turpault, John R. Hershey, Scott Wisdom, Eduardo Fonseca, Prem Seetharaman
- Affiliations
Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est; Institut National de Recherche en Informatique et en Automatique (Inria); Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Centre National de la Recherche Scientifique (CNRS); Université de Lorraine (UL); Google Inc, Research at Google; Universitat Pompeu Fabra [Barcelona] (UPF); Descript, Inc.; Adobe Research; Grid'5000
- Funding
Part of this work was made with the support of the French National Research Agency, in the framework of the project LEAUDS 'Learning to understand audio scenes' (ANR-18-CE23-0020) and the French region Grand-Est. High Performance Computing resources were partially provided by the EXPLOR centre hosted by the University de Lorraine.
Project: ANR-18-CE23-0020, LEAUDS, Apprentissage statistique pour la compréhension de scènes audio (2018)
- Subjects
FOS: Computer and information sciences; Sound localization; Sound (cs.SD); Reverberation; Soundscape; Computer science; Speech recognition; 02 engineering and technology; Computer Science - Sound; [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing; Audio and Speech Processing (eess.AS); Robustness (computer science); FOS: Electrical engineering, electronic engineering, information engineering; 0202 electrical engineering, electronic engineering, information engineering; Sound (geography); synthetic soundscapes; geography; Signal processing; geography.geographical_feature_category; Event (computing); Sound event detection; [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]; Benchmark (computing); sound separation; 020201 artificial intelligence & image processing; Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We propose a benchmark of state-of-the-art sound event detection (SED) systems. We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 under time-related modifications (the time position of an event and the length of clips), and we study the impact of non-target sound events and reverberation. We show that localizing sound events in time is still a problem for SED systems. We also show that reverberation and non-target sound events severely degrade the performance of SED systems. In the latter case, sound separation seems like a promising solution.
- Published
- 2020