16 results for "Diarization error rate"
Search Results
2. End-to-End Neural Diarization for Unknown Number of Speakers with Multi-Scale Decoder.
- Author
-
Myat Aye Aye Aung, Win Pa Pa, and Hay Mar Soe Naing
- Subjects
ORAL communication, LINGUISTIC context, SPEECH, ERROR rates, BROADCAST journalism - Abstract
Speaker diarization is crucial for enhancing speech communication across various domains, including broadcast news, meetings, and conferences featuring multiple speakers. Nevertheless, real-time diarization applications face persistent challenges due to overlapping speech and varying acoustic conditions. To address these challenges, End-to-End Neural Diarization (EEND) has demonstrated superior performance compared to traditional clustering-based methods. Conventional neural techniques often rely on fixed datasets, which can hinder their ability to generalize across different speech patterns and real-world environments. Therefore, this research proposes an EEND model utilizing a Multi-Scale approach to compute the optimal weights needed to generate speaker labels across multiple scales. The Multi-Scale Diarization Decoder (MSDD) approach accommodates a flexible number of speakers, supports overlap-aware diarization, and integrates a pre-trained speaker embedding model. The investigation covered different languages and datasets: the proposed Myanmar M-Diarization dataset and the English AMI meeting corpus. Notably, many benchmark multi-speaker diarization datasets include no more than 8 speakers per audio file, with a fixed number of speakers per recording. Hence, this study developed its own dataset featuring up to 15 speakers and a flexible number of speakers per recording. Furthermore, the study demonstrates language independence, underscoring its efficacy across diverse linguistic contexts. Comparative analysis revealed that the proposed model outperformed clustering baseline methods (i-vectors and x-vectors) and single-scale EEND approaches in both languages in terms of Diarization Error Rate (DER). Additionally, the proposed M-Diarization dataset includes audio of varying lengths and scenarios with an overlap ratio of 10%. The model was validated on the M-Diarization dataset, demonstrating its capability to handle flexible speaker counts and audio durations efficiently.
This experiment marks the first implementation of an EEND with a Multi-Scale approach on a fixed-speaker English-language corpus and the variable-speaker M-Diarization dataset. It achieved notable DER results on the M-Diarization dataset (overlap ratio 3.31%): 44.63% for i-vectors, 47.38% for x-vectors, 19% for the single-scale EEND approach, and 4.37% for the EEND MSDD approach. The experimental outcomes clearly indicate that the proposed method significantly enhances diarization performance, particularly in scenarios involving varying numbers of speakers and diverse conversation lengths. [ABSTRACT FROM AUTHOR]
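All of the DER percentages quoted in these results follow the same standard definition: the fraction of total speech time that is either missed, falsely detected, or attributed to the wrong speaker. A minimal sketch (the example durations are made up for illustration):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed speech + false alarm + speaker confusion) / total speech.
    All arguments are durations in seconds."""
    return (missed + false_alarm + confusion) / total_speech

# Example: 30 s missed, 20 s false alarm, 50 s confused over 1000 s of speech
der = diarization_error_rate(30.0, 20.0, 50.0, 1000.0)
print(f"DER = {der:.1%}")  # DER = 10.0%
```

In practice, scoring tools also apply a forgiveness collar around reference speaker boundaries before accumulating these durations.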
- Published
- 2024
- Full Text
- View/download PDF
3. Speech Enhancement for Multimodal Speaker Diarization System
- Author
-
Rehan Ahmad, Syed Zubair, and Hani Alquhayz
- Subjects
Multimodal speaker diarization, LSTM, audio-visual synchronization, additive white Gaussian noise, Gaussian mixture model, diarization error rate, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
A speaker diarization system identifies speaker-homogeneous regions in recordings where multiple speakers are present; it answers the question 'who spoke when?'. Datasets for speaker diarization usually consist of telephone, meeting, TV/talk show, broadcast news and other multi-speaker recordings. In this paper, we present the performance of our proposed multimodal speaker diarization system under noisy conditions. Two types of noise, additive white Gaussian noise (AWGN) and realistic environmental noise, are used to evaluate the system. To mitigate the effect of noise, we propose adding an LSTM-based speech enhancement block to our diarization pipeline. This block is trained on a synthesized dataset with more than 100 noise types to enhance the noisy speech. The enhanced speech is then used in a multimodal speaker diarization system which utilizes a pre-trained audio-visual synchronization model to find the active speaker. High-confidence active speaker segments are then used to train the speaker-specific clusters on the enhanced speech. A subset of the AMI corpus consisting of 5.4 h of recordings is used in this analysis. For AWGN, the LSTM model's performance improvement is comparable with a Wiener filter, while for realistic environmental noise the LSTM model improves significantly over the Wiener filter in terms of diarization error rate (DER).
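The AWGN evaluation condition described above is typically synthesized by adding white Gaussian noise at a chosen signal-to-noise ratio. A minimal sketch of that corruption step (pure-Python lists stand in for sampled audio; real pipelines use array libraries):

```python
import math
import random

def add_awgn(signal, snr_db):
    """Add white Gaussian noise to a list of samples at a target SNR in dB:
    noise power = signal power / 10^(SNR/10)."""
    power = sum(s * s for s in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + random.gauss(0.0, sigma) for s in signal]

# One second of a 440 Hz tone at 16 kHz, corrupted at 10 dB SNR
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_awgn(clean, snr_db=10.0)
```

The enhancement block then learns to map `noisy` back toward `clean`.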
- Published
- 2020
- Full Text
- View/download PDF
4. Investigating the Effect of Varying Window Sizes in Speaker Diarization for Meetings Domain
- Author
-
Naik, Nirali, Mankad, Sapan H., Thakkar, Priyank, Howlett, Robert James, Series editor, Jain, Lakhmi C., Series editor, Satapathy, Suresh Chandra, editor, and Joshi, Amit, editor
- Published
- 2018
- Full Text
- View/download PDF
5. Unsupervised deep feature embeddings for speaker diarization.
- Author
-
AHMAD, Rehan and ZUBAIR, Syed
- Subjects
GAUSSIAN mixture models, EMBEDDINGS (Mathematics), ERROR rates - Abstract
Speaker diarization aims to determine "who spoke when?" in multi-speaker recordings. In this paper, we propose learning a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. These embeddings are learned by a deep autoencoder trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames. The learned embeddings are then used in Gaussian mixture model-based hierarchical clustering for diarization. The results show that these unsupervised embeddings are better than MFCCs at reducing the diarization error rate. Experiments conducted on the popular subset of the AMI meeting corpus consisting of 5.4 h of recordings show that the new embeddings decrease the average diarization error rate by 2.96%; for individual recordings, a maximum improvement of 8.05% is achieved. [ABSTRACT FROM AUTHOR]
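The hierarchical clustering step this abstract relies on can be illustrated with a simple bottom-up (agglomerative) sketch: merge the closest pair of clusters until the closest remaining pair exceeds a distance threshold. This average-linkage, centroid-distance toy stands in for the paper's GMM-based variant; the threshold and the 2-D "embeddings" are illustrative.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative_cluster(vectors, threshold):
    """Repeatedly merge the two closest clusters (by centroid distance)
    until the closest pair is farther apart than `threshold`."""
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        centroids = [[sum(c) / len(c) for c in zip(*cl)] for cl in clusters]
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: euclidean(centroids[ab[0]], centroids[ab[1]]))
        if euclidean(centroids[i], centroids[j]) > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

# Two tight groups of frame embeddings -> two speaker clusters
frames = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
print(len(agglomerative_cluster(frames, threshold=1.0)))  # 2
```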
- Published
- 2019
- Full Text
- View/download PDF
6. Real-Time Implementation of Speaker Diarization System on Raspberry PI3 Using TLBO Clustering Algorithm
- Author
-
Karim Dabbabi, Adnen Cherif, and Salah Hajji
- Subjects
Diarization error rate, Computer science, Applied Mathematics, Pattern recognition, Execution time, Raspberry Pi, Speaker diarisation, Signal Processing, Artificial intelligence, Cluster analysis, Classifier (UML) - Abstract
In recent years, extensive research has been performed on various possible implementations of speaker diarization systems. These systems require efficient clustering algorithms to improve their real-time performance. Teaching–learning-based optimization (TLBO) is one such clustering algorithm, able to solve the optimal-clustering problem in a reasonable time. In this paper, a real-time implementation of a speaker diarization (SD) system on a Raspberry Pi 3 (RPi 3) using the TLBO technique as classifier is presented. The system was evaluated on a broadcast radio dataset (NDTV), and the experimental tests showed that the technique achieves acceptable performance in terms of diarization error rate (DER = 21.90% and 35% in single- and cross-show diarization, respectively), accuracy (87.30%), and real-time factor (RTF = 2.40). We also tested the TLBO technique on a 2.4 GHz Intel Core i5 processor using the REPERE corpus, obtaining improved results in execution time (xRT) and DER in both single- and cross-show speaker diarization (0.08 and 0.095, and 18.50% and 26.30%, respectively).
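The real-time factor quoted above is simply processing time divided by audio duration; values below 1.0 mean faster than real time. A minimal sketch (the example processing times are made up to reproduce the reported factors):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """xRT = processing time / audio duration.
    xRT < 1.0 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

print(real_time_factor(144.0, 60.0))  # 2.4  -> slower than real time (RPi 3 regime)
print(real_time_factor(4.8, 60.0))    # 0.08 -> faster than real time (desktop regime)
```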
- Published
- 2020
- Full Text
- View/download PDF
8. Active correction for speaker diarization with human in the loop
- Author
-
Anthony Larcher, Sylvain Meignier, Loïc Barrault, Yevhenii Prokopalo, Meysam Shamsi, Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), The University of Sheffield [Sheffield, U.K.], and ANR-17-CHR2-0004,ALLIES,Autonomous Lifelong learnIng intelLigent Systems(2017)
- Subjects
Diarization error rate, Active learning (machine learning), Computer science, Machine learning, Clustering, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Speaker diarization, Human-in-the-loop, Artificial intelligence, Cluster analysis, Baseline (configuration management) - Abstract
State-of-the-art diarization systems now achieve decent performance, but that performance is often not good enough to deploy them without any human supervision. In this paper we propose a framework that solicits a human in the loop to correct the clustering by answering simple questions. After defining the nature of the questions, we propose an algorithm to list those questions and two stopping criteria that are necessary to limit the workload on the human in the loop. Experiments performed on the ALLIES dataset show that limited interaction with a human expert can lead to considerable improvement, up to 36.5% relative diarization error rate (DER), compared to a strong baseline.
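Note that the 36.5% figure is a relative, not absolute, DER reduction. A minimal sketch of the distinction (the 20% / 12.7% DER values are made up for illustration):

```python
def relative_der_improvement(baseline_der, corrected_der):
    """Relative improvement = (baseline - corrected) / baseline."""
    return (baseline_der - corrected_der) / baseline_der

# A baseline DER of 20% corrected down to 12.7% is a 7.3-point absolute
# reduction, but a 36.5% *relative* reduction:
print(f"{relative_der_improvement(0.20, 0.127):.1%}")  # 36.5%
```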
- Published
- 2021
9. Age-Invariant Speaker Embedding for Diarization of Cognitive Assessments
- Author
-
Ka-Ho Wong, Man-Wai Mak, Timothy Kwok, Helen Meng, and Sean Shensheng Xu
- Subjects
Diarization error rate, Speaker diarisation, Training set, Artificial neural network, Computer science, Speech recognition, Embedding, Cognition, Invariant (computer science), Test data - Abstract
This paper investigates an age-invariant speaker embedding approach to speaker diarization, an essential step towards automatic cognitive assessment from speech. Studies have shown that incorporating speaker traits (e.g., age, gender) can improve speaker diarization performance. However, we found that age information in the speaker embeddings is detrimental to speaker diarization if there is a severe mismatch between the age distributions in the training and test data. To minimize the detrimental effect of age mismatch, an adversarial training strategy is introduced to remove age variability from the utterance-level speaker embeddings. Evaluations on an interactive dialog dataset for Montreal cognitive assessments (MoCA) show that the adversarial training strategy can produce age-invariant embeddings and reduce the diarization error rate (DER) by 4.33%. The approach also outperforms the conventional method even with less training data.
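The adversarial objective used in such approaches typically combines the main speaker loss with a penalized age-classifier loss, so the embedding is pushed to keep speaker identity while shedding age information. A minimal sketch of that trade-off (the weight `lam` and the loss values are illustrative, not from the paper):

```python
def adversarial_embedding_loss(speaker_loss, age_loss, lam=0.5):
    """Combined objective: minimise the speaker-classification loss while
    *maximising* the age classifier's loss (gradient-reversal style), so the
    embedding stops encoding age. `lam` trades off the two terms."""
    return speaker_loss - lam * age_loss

# If the age classifier gets worse (higher age_loss), the combined loss drops:
print(adversarial_embedding_loss(1.0, 0.4))  # lower than speaker_loss alone
```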
- Published
- 2021
- Full Text
- View/download PDF
10. End-to-end speaker segmentation for overlap-aware resegmentation
- Author
-
Antoine Laurent, Hervé Bredin, Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse 1 Capitole (UT1), Université Fédérale Toulouse Midi-Pyrénées-Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse - Jean Jaurès (UT2J)-Université Toulouse III - Paul Sabatier (UT3), Université Fédérale Toulouse Midi-Pyrénées-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique (Toulouse) (Toulouse INP), Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse 1 Capitole (UT1), Université Fédérale Toulouse Midi-Pyrénées, Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), HPC resources of IDRIS under the allocation AD011012177 made by GENCI (Grand Équipement National de Calcul Intensif), ANR-16-CE92-0025,PLUMCOT,Identification non-supervisée des personnages de films et séries télévisées(2016), and ANR-19-CE38-0012,GEM,Mesure de l'égalité entre les sexes dans les médias(2019)
- Subjects
speaker segmentation, Sound (cs.SD), Computer science, Speech recognition, overlapped speech detection, Computation and Language (cs.CL), End-to-end principle, Audio and Speech Processing (eess.AS), Segmentation, resegmentation, Diarization error rate, Voice activity detection, Speaker diarisation, Temporal resolution, speaker diarization, Change detection - Abstract
Speaker segmentation consists of partitioning a conversation between one or more speakers into speaker turns. While usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection), we propose to train an end-to-end segmentation model that addresses the task directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16 ms). Experiments on multiple speaker diarization datasets show that our model can be used with great success for both voice activity detection and overlapped speech detection. Our model can also be used as a post-processing step to detect and correctly assign overlapped speech regions. Relative diarization error rate improvement over the best considered baseline (VBx) reaches 17% on AMI, 13% on DIHARD 3, and 13% on VoxConverse. (Comment: Camera-ready version for Interspeech 2021 with significantly better voice activity detection, overlapped speech detection, and speaker diarization results. The code used for results reported in v1 contained a small bug that has now been fixed.)
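The chunking arithmetic implied by the abstract is worth making concrete: a 5-second chunk scored every 16 ms yields a few hundred frame-level decisions per chunk. A minimal sketch of that bookkeeping:

```python
def frames_per_chunk(chunk_seconds=5.0, frame_step_ms=16.0):
    """Number of frame-level outputs the model emits per audio chunk,
    given the chunk length and per-frame temporal resolution
    (values taken from the abstract)."""
    return int(chunk_seconds * 1000 / frame_step_ms)

print(frames_per_chunk())  # 312 speaker-activity frames per 5 s chunk
```

Each frame carries one binary activity decision per speaker, which is what makes overlap-aware resegmentation possible at this resolution.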
- Published
- 2021
- Full Text
- View/download PDF
11. End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
- Author
-
Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Yusuke Fujita, and Shinji Watanabe
- Subjects
Diarization error rate, Sound (cs.SD), Sequence, Computation and Language (cs.CL), Computer science, Speech recognition, Speaker diarisation, End-to-end principle, Audio and Speech Processing (eess.AS), Attractor, Embedding, Encoder decoder, Cluster analysis - Abstract
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. The generated attractors are then multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69% diarization error rate (DER) on simulated mixtures and an 8.07% DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56% and 9.54%, respectively. In conditions with an unknown number of speakers, our method attained a 15.29% DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43% DER. (Accepted to INTERSPEECH 2020)
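The decoding step described above, multiplying attractors by the embedding sequence to get per-speaker activities, is a frame-by-attractor dot product squashed through a sigmoid. A minimal sketch with toy 2-D embeddings (the values are illustrative, not from the paper):

```python
import math

def speaker_activities(embeddings, attractors):
    """Dot each frame embedding with each attractor and apply a sigmoid,
    yielding per-frame, per-speaker activity probabilities.
    Shapes: embeddings [T][D], attractors [S][D]."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [[sigmoid(sum(e * a for e, a in zip(frame, att)))
             for att in attractors]
            for frame in embeddings]

emb = [[1.0, 0.0], [0.0, 1.0]]    # 2 frames, 2-dim embeddings
att = [[4.0, -4.0], [-4.0, 4.0]]  # 2 attractors, one per detected speaker
acts = speaker_activities(emb, att)
# Frame 0 is dominated by speaker 0, frame 1 by speaker 1
```

Because the number of attractors is whatever the encoder-decoder emits, the same multiplication handles any speaker count.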
- Published
- 2020
12. Speaker diarization for multi-speaker conversations via x-vectors
- Author
-
Xiao Song, Jian Zhang, and Yangfan Zhang
- Subjects
Speaker diarisation, Diarization error rate, Training set, Artificial neural network, Coincident, Computer science, Speech recognition, Conversation, Cluster analysis, Linear discriminant analysis - Abstract
This paper investigates a new way to build an x-vector-based speaker diarization system for multi-speaker conversations and explores how to improve system performance. Much prior work has demonstrated the superiority of x-vectors for speaker diarization, but they have not been applied in multi-speaker scenarios. We study several techniques in our system, such as dividing a long conversation into short overlapping segments to facilitate the extraction of x-vectors instead of ignoring overlapping regions, and re-classifying the labels of coincident segments after clustering to reduce errors. In addition, we augment the training data to deal with the problem of insufficient discriminant analysis, and select an appropriate number of archives to control the iteration of training samples when training the neural networks. Finally, experimental results on the AMI corpus demonstrate the effectiveness of our system. Compared with the initial system of the 2018 DIHARD challenge track 2, our final Diarization Error Rate (DER) is relatively reduced by 13.21%.
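The segmentation technique mentioned above, dividing a long conversation into short overlapping windows so an x-vector can be extracted per window, can be sketched as follows (the 1.5 s window and 50% overlap are illustrative choices, not the paper's exact values):

```python
def overlapping_segments(duration_s, win_s=1.5, step_s=0.75):
    """Return (start, end) windows covering the recording, each win_s long,
    advancing by step_s so consecutive windows overlap."""
    segments, start = [], 0.0
    while start + win_s <= duration_s:
        segments.append((start, start + win_s))
        start += step_s
    return segments

segs = overlapping_segments(6.0)
print(len(segs))  # 7 overlapping windows over a 6 s conversation
```

One x-vector is then extracted per window and the windows are clustered by speaker.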
- Published
- 2019
- Full Text
- View/download PDF
13. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
- Author
-
Allah Ditta, Hani Alquhayz, Rehan Ahmad, and Syed M. Zubair
- Subjects
Computer science, Speech recognition, Biochemistry, Article, Analytical Chemistry, diarization error rate, SyncNet, Synchronization (computer science), speech activity detection, Electrical and Electronic Engineering, Face detection, Instrumentation, Voice activity detection, Mixture model, Atomic and Molecular Physics, and Optics, Speaker diarisation, Gaussian mixture model, MFCC, speaker diarization, Mel-frequency cepstrum - Abstract
Speaker diarization systems aim to find 'who spoke when?' in multi-speaker recordings. The dataset usually consists of meetings, TV/talk shows, telephone and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network which matches audio frames with their respective visual input segments. On the basis of high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps generate speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. A significant improvement in DER is observed with the proposed method when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to the complex state-of-the-art multimodal diarization, which shows the significance of such a simple yet effective technique.
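The selection step described above, keeping only the video segments the synchronization model scores with high confidence before training speaker clusters, reduces to a threshold filter. A minimal sketch (the field names and the 0.8 threshold are illustrative, not from the paper):

```python
def high_confidence_segments(segments, threshold=0.8):
    """Keep only segments whose audio-visual sync confidence meets the
    threshold; the surviving segments' audio frames then seed the
    per-speaker GMM clusters."""
    return [s for s in segments if s["confidence"] >= threshold]

segs = [{"speaker": "A", "confidence": 0.93},
        {"speaker": "B", "confidence": 0.41},
        {"speaker": "A", "confidence": 0.85}]
print(len(high_confidence_segments(segs)))  # 2
```

Low-confidence frames are still labeled later, once the clusters trained on reliable frames exist.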
- Published
- 2019
- Full Text
- View/download PDF
14. Enhanced low-latency speaker spotting using selective cluster enrichment
- Author
-
Nicholas Evans, Jose Patino, and Héctor Delgado
- Subjects
Diarization error rate, Computer science, Speech recognition, Online processing, Word error rate, Spotting, Rapid detection, Data modeling, Speaker diarisation, Speaker detection - Abstract
Low-latency speaker spotting (LLSS) calls for the rapid detection of known speakers within multi-speaker audio streams. While previous work showed the potential to develop efficient LLSS solutions by combining speaker diarization and speaker detection within an online processing framework, it failed to move significantly beyond the traditional definition of diarization. This paper shows that the latter needs rethinking and that a diarization sub-system tailored to the end application, rather than to the minimisation of the diarization error rate, can improve LLSS performance. The proposed selective cluster enrichment algorithm is used to guide the diarization system to better model segments within a multi-speaker audio stream and hence detect more reliably a given target speaker. The LLSS solution reported in this paper shows that target speakers can be detected with a 16% equal error rate after having been active in multi-speaker audio streams for only 15 seconds.
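The 16% equal error rate quoted above is the operating point where false acceptances and false rejections balance. A minimal sketch of computing it from detector scores (the score lists are made up for illustration):

```python
def equal_error_rate(target_scores, impostor_scores):
    """Sweep thresholds over all observed scores; return the average of the
    false-acceptance and false-rejection rates at the point where they are
    closest (a simple EER approximation)."""
    best = None
    for t in sorted(set(target_scores + impostor_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2

eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(eer)  # 0.25
```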
- Published
- 2018
- Full Text
- View/download PDF
16. Ideas for Clustering of Similar Models of a Speaker in an Online Speaker Diarization System
- Author
-
Vlasta Radová and Marie Kunešová
- Subjects
Speaker diarisation, Diarization error rate, Computer science, Speech recognition, Baseline system, Artificial intelligence, Cluster analysis, Speaker recognition, Natural language processing - Abstract
During online speaker diarization, a situation may occur where a single speaker is represented by several different models. Such a situation worsens diarization results, because the diarization system considers every change of model to be a change of speaker. In this article we describe a method for detecting this situation and propose several ways of solving it. Experiments show that the most suitable option is treating multiple GMMs as belonging to a single speaker, i.e., updating all of them with the same data every time one of them is assigned a new segment. In that case, there was a relative improvement in Diarization Error Rate of 30.69% compared with the baseline system.
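The winning strategy above, updating every model linked to one speaker whenever any of them receives a new segment, can be sketched with a toy online structure (running means stand in for the article's GMMs; all names are illustrative):

```python
class OnlineDiarizer:
    """Toy illustration: models known to belong to the same speaker are
    linked, and a segment assigned to any linked model updates all of
    them, so the duplicates stop drifting apart."""
    def __init__(self):
        self.models = {}        # model_id -> (feature sum, segment count)
        self.same_speaker = {}  # model_id -> set of linked model ids

    def add_segment(self, model_id, value):
        # Update every model linked to this one (or just itself if unlinked)
        for mid in self.same_speaker.get(model_id, {model_id}):
            s, n = self.models.get(mid, (0.0, 0))
            self.models[mid] = (s + value, n + 1)

d = OnlineDiarizer()
d.same_speaker = {"m1": {"m1", "m2"}, "m2": {"m1", "m2"}}
d.add_segment("m1", 2.0)   # detected duplicate: m2 is updated too
print(d.models["m2"])      # (2.0, 1)
```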
- Published
- 2015
- Full Text
- View/download PDF