The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Authors :
Wu, Shilong
Wang, Chenxi
Chen, Hang
Dai, Yusheng
Zhang, Chenyue
Wang, Ruoyu
Lan, Hongbo
Du, Jun
Lee, Chin-Hui
Chen, Jingdong
Watanabe, Shinji
Siniscalchi, Sabato Marco
Scharenborg, Odette
Wang, Zhong-Qiu
Pan, Jia
Gao, Jianqing
Publication Year :
2023

Abstract

Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.

Comment: 5 pages, 4 figures

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2309.08348
Document Type :
Working Paper