1. End-to-End Neural Diarization for Unknown Number of Speakers with Multi-Scale Decoder.
- Author
- Myat Aye Aye Aung, Win Pa Pa, and Hay Mar Soe Naing
- Subjects
- ORAL communication, LINGUISTIC context, SPEECH, ERROR rates, BROADCAST journalism
- Abstract
Speaker diarization is crucial for enhancing speech communication across various domains, including broadcast news, meetings, and conferences featuring multiple speakers. Nevertheless, real-time diarization applications face persistent challenges due to overlapping speech and varying acoustic conditions. To address these challenges, End-to-End Neural Diarization (EEND) has demonstrated superior performance compared to traditional clustering-based methods. Conventional neural techniques often rely on fixed datasets, which can hinder their ability to generalize across different speech patterns and real-world environments. Therefore, this research proposes an EEND model that uses a Multi-Scale approach to compute the optimal weights needed to generate speaker labels across multiple scales. The Multi-Scale Diarization Decoder (MSDD) approach accommodates a flexible number of speakers, supports overlap-aware diarization, and integrates a pre-trained speaker embedding model. The investigation covered different languages and datasets, namely the proposed Myanmar M-Diarization dataset and the English AMI meeting corpus. Notably, many benchmark multi-speaker datasets for speaker diarization include no more than 8 speakers per recording and have a fixed number of speakers per recording. Hence, this study developed its own dataset featuring up to 15 speakers, with a flexible number of speakers per recording. Furthermore, the study demonstrates language independence, underscoring the model's efficacy across diverse linguistic contexts. Comparative analysis revealed that the proposed model outperformed clustering baseline methods (i-vectors and x-vectors) and single-scale EEND approaches in both languages in terms of Diarization Error Rate (DER). Additionally, the proposed M-Diarization dataset included audio of varying lengths and scenarios with an overlap ratio of 10%. The model was validated on the M-Diarization dataset, demonstrating its capability to handle flexible speaker counts and audio durations efficiently.
This experiment marks the first implementation of an EEND with a Multi-Scale approach on a fixed-speaker English language corpus and the variable-speaker M-Diarization dataset. On the M-Diarization dataset at an overlap ratio of 3.31%, the reported error rates were 44.63% for i-vectors, 47.38% for x-vectors, 19% for the EEND single-scale approach, and 4.37% for the EEND MSDD approach. The experimental outcomes indicate that the proposed method significantly enhances diarization performance, particularly in scenarios involving varying numbers of speakers and diverse audio conversation lengths. [ABSTRACT FROM AUTHOR]
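For readers unfamiliar with the metric used in the abstract, DER is conventionally defined as the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The following is a minimal frame-level sketch of that definition, not the authors' scoring pipeline. It assumes that hypothesis speaker labels have already been mapped to reference labels and that no forgiveness collar is applied; the function name `frame_level_der` and the toy frames are hypothetical.

```python
from typing import Hashable, Sequence, Set


def frame_level_der(
    reference: Sequence[Set[Hashable]],
    hypothesis: Sequence[Set[Hashable]],
) -> float:
    """Frame-level DER sketch (hypothetical helper, not from the paper).

    Each element is the set of speakers active in one frame, so overlapping
    speech is represented by a set with more than one speaker.
    Assumption: hypothesis labels are already mapped to reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        missed += max(0, n_ref - n_hyp)             # reference speakers with no hypothesis slot
        false_alarm += max(0, n_hyp - n_ref)        # hypothesis speakers beyond the reference count
        confusion += min(n_ref, n_hyp) - n_correct  # matched slots carrying the wrong label
        total += n_ref

    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Toy example: five frames, one of them with overlapping speakers A and B.
ref_frames = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hyp_frames = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
der = frame_level_der(ref_frames, hyp_frames)
```

In the toy example, frame 2 contributes one confusion error, frame 3 one missed speaker, and frame 5 one false alarm, against five reference speaker-frames, so `der` is 0.6.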
- Published
- 2024