22 results for "Diarization error rate"
Search Results
2. End-to-End Neural Diarization for Unknown Number of Speakers with Multi-Scale Decoder.
- Author
-
Myat Aye Aye Aung, Win Pa Pa, and Hay Mar Soe Naing
- Subjects
ORAL communication, LINGUISTIC context, SPEECH, ERROR rates, BROADCAST journalism - Abstract
Speaker diarization is crucial for enhancing speech communication across various domains, including broadcast news, meetings, and conferences featuring multiple speakers. Nevertheless, real-time diarization applications face persistent challenges due to overlapping speech and varying acoustic conditions. To address these challenges, End-to-End Neural Diarization (EEND) has demonstrated superior performance compared to traditional clustering-based methods. Conventional neural techniques often rely on fixed datasets, which can hinder their ability to generalize across different speech patterns and real-world environments. Therefore, this research proposes an EEND model utilizing a Multi-Scale approach to compute optimal weights, essential for generating speaker labels across multiple scales. The Multi-Scale Diarization Decoder (MSDD) approach accommodates a flexible number of speakers, supports overlap-aware diarization, and integrates a pre-trained speaker embedding model. The investigation included different languages and datasets, such as the proposed Myanmar M-Diarization dataset and the English AMI meeting corpus. Notably, many benchmark multi-speaker datasets for speaker diarization include no more than 8 speakers per audio and a fixed number of speakers per audio. Hence, this study developed its own dataset featuring a flexible number of speakers, up to 15 per audio. Furthermore, the study demonstrates language independence, underscoring its efficacy across diverse linguistic contexts. Comparative analysis revealed that the proposed model outperformed clustering baseline methods (i-vectors and x-vectors) and single-scale EEND approaches in both languages in terms of Diarization Error Rate (DER). Additionally, the proposed M-Diarization dataset includes audio of varying lengths and scenarios with an overlap ratio of 10%. The model was validated on the M-Diarization dataset, demonstrating its capability to handle flexible speaker counts and audio durations efficiently. This experiment marks the first implementation of an EEND with a Multi-Scale approach on a fixed-speaker English-language corpus and the variable-speaker M-Diarization dataset. It achieved notable DER results on the M-Diarization dataset (overlap ratio 3.31%): 44.63% for i-vectors, 47.38% for x-vectors, 19% for the single-scale EEND approach, and 4.37% for the EEND MSDD approach. The experimental outcomes clearly indicate that the proposed method significantly enhances diarization performance, particularly in scenarios involving varying numbers of speakers and diverse audio conversation lengths. [ABSTRACT FROM AUTHOR] (A brief sketch of how DER is computed follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
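The DER figures quoted in this and the following records all refer to the same metric: the fraction of reference speech time that is wrongly attributed, counting false alarms, missed speech, and speaker confusion. A minimal sketch of a frame-level DER computation is shown below; the frame labels, the silence marker, and the already-resolved mapping between hypothesis and reference speaker IDs are assumptions made purely for illustration, not details from the paper above.

```python
# Minimal illustrative sketch: frame-level Diarization Error Rate (DER).
# DER = (false alarm + missed speech + speaker confusion) / total reference speech.
# Assumes non-overlapping single-speaker frame labels and that hypothesis speaker
# IDs have already been mapped to reference speaker IDs.

def frame_der(reference, hypothesis, silence=None):
    """reference/hypothesis: equal-length lists of per-frame speaker labels."""
    assert len(reference) == len(hypothesis)
    false_alarm = missed = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != silence:
            ref_speech += 1
        if ref == silence and hyp != silence:
            false_alarm += 1          # system speaks where reference is silent
        elif ref != silence and hyp == silence:
            missed += 1               # reference speech not detected
        elif ref != silence and ref != hyp:
            confusion += 1            # speech detected but attributed to the wrong speaker
    return (false_alarm + missed + confusion) / max(ref_speech, 1)

# Example: speakers "A"/"B", None marks silence -> 2 errors over 5 speech frames = 0.4.
print(frame_der(["A", "A", None, "B", "B", "B"],
                ["A", "B", None, "B", "B", None]))
```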
3. Speech Enhancement for Multimodal Speaker Diarization System
- Author
-
Rehan Ahmad, Syed Zubair, and Hani Alquhayz
- Subjects
Multimodal speaker diarization, LSTM, audio-visual synchronization, additive white Gaussian noise, Gaussian mixture model, diarization error rate, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
A speaker diarization system identifies speaker-homogeneous regions in recordings where multiple speakers are present. It answers the question 'who spoke when?'. The data set for speaker diarization usually consists of telephone, meeting, TV/talk show, broadcast news and other multi-speaker recordings. In this paper, we present the performance of our proposed multimodal speaker diarization system under noisy conditions. Two types of noise, additive white Gaussian noise (AWGN) and realistic environmental noise, are used to evaluate the system. To mitigate the effect of noise, we propose to add an LSTM-based speech enhancement block to our diarization pipeline (an illustrative sketch follows this record). This block is trained on a synthesized data set with more than 100 noise types to enhance the noisy speech. The enhanced speech is further used in a multimodal speaker diarization system which utilizes a pre-trained audio-visual synchronization model to find the active speaker. High-confidence active-speaker segments are then used to train the speaker-specific clusters on the enhanced speech. A subset of the AMI corpus consisting of 5.4 h of recordings is used in this analysis. For AWGN, the LSTM model's improvement is comparable with the Wiener filter, while for realistic environmental noise the LSTM model improves significantly over the Wiener filter in terms of diarization error rate (DER).
- Published
- 2020
- Full Text
- View/download PDF
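The enhancement block described above is an LSTM trained to clean noisy speech before diarization. The sketch below shows one common way such a block is built, an LSTM that predicts a time-frequency mask over noisy magnitude spectrograms trained with an MSE objective; the layer sizes, the 257-bin feature dimension, and the loss are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch (assumed architecture): LSTM mask-based speech enhancement.
import torch
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag):              # (batch, frames, n_freq)
        h, _ = self.lstm(noisy_mag)
        return noisy_mag * self.mask(h)        # masked (enhanced) magnitudes

# Training step on paired noisy/clean magnitude spectrograms (random placeholders).
model = LSTMEnhancer()
noisy, clean = torch.rand(4, 100, 257), torch.rand(4, 100, 257)
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
```

The enhanced magnitudes would then be recombined with the noisy phase and passed on to the diarization front end.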
4. Investigating the Effect of Varying Window Sizes in Speaker Diarization for Meetings Domain
- Author
-
Naik, Nirali, Mankad, Sapan H., Thakkar, Priyank, Howlett, Robert James, Series editor, Jain, Lakhmi C., Series editor, Satapathy, Suresh Chandra, editor, and Joshi, Amit, editor
- Published
- 2018
- Full Text
- View/download PDF
5. Unsupervised deep feature embeddings for speaker diarization.
- Author
-
AHMAD, Rehan and ZUBAIR, Syed
- Subjects
GAUSSIAN mixture models, EMBEDDINGS (Mathematics), ERROR rates - Abstract
Speaker diarization aims to determine "who spoke when?" from multi-speaker recording environments. In this paper, we propose to learn a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. These sets of embeddings are learned through a deep autoencoder model trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames. The learned embeddings are then used in Gaussian mixture model based hierarchical clustering for diarization (an illustrative sketch follows this record). The results show that these unsupervised embeddings are better than MFCCs at reducing the diarization error rate. Experiments conducted on the popular subset of the AMI meeting corpus consisting of 5.4 h of recordings show that the new embeddings decrease the average diarization error rate by 2.96%. For individual recordings, a maximum improvement of 8.05% is achieved. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
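As a rough illustration of the pipeline above, a small autoencoder can be trained to reconstruct MFCC frames and its bottleneck activations used as embeddings for clustering. The layer sizes, number of clusters, and the use of scikit-learn's agglomerative clustering as a stand-in for the paper's GMM-based hierarchical clustering are all assumptions.

```python
# Illustrative sketch: unsupervised bottleneck embeddings from MFCCs, then clustering.
import torch
import torch.nn as nn
from sklearn.cluster import AgglomerativeClustering

class MFCCAutoencoder(nn.Module):
    def __init__(self, n_mfcc=20, bottleneck=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mfcc, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, n_mfcc))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

mfcc = torch.rand(1000, 20)                  # placeholder MFCC frames
model = MFCCAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):                          # unsupervised reconstruction training
    recon, _ = model(mfcc)
    loss = nn.functional.mse_loss(recon, mfcc)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    _, embeddings = model(mfcc)              # bottleneck embeddings per frame
labels = AgglomerativeClustering(n_clusters=4).fit_predict(embeddings.numpy())
```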
6. Real-Time Implementation of Speaker Diarization System on Raspberry PI3 Using TLBO Clustering Algorithm
- Author
-
Karim Dabbabi, Adnen Cherif, and Salah Hajji
- Subjects
Diarization error rate, Computer science, Applied Mathematics, Pattern recognition, Execution time, Raspberry Pi, Speaker diarisation, Signal Processing, Artificial intelligence, Cluster analysis, Classifier (UML) - Abstract
In recent years, extensive research has been performed on various possible implementations of speaker diarization systems. These systems require efficient clustering algorithms in order to improve their performance in real-time processing. Teaching–learning-based optimization (TLBO) is one such clustering algorithm, which can be used to find the optimum clustering in a reasonable time. In this paper, a real-time implementation of a speaker diarization (SD) system on a Raspberry Pi 3 (RPi 3) using the TLBO technique as a classifier has been performed. This system has been evaluated on a broadcast dataset (NDTV), and the experimental tests have shown that this technique achieves acceptable performance in terms of diarization error rate (DER = 21.90% and 35% in single- and cross-show diarization, respectively), accuracy (87.30%), and real-time factor (RTF = 2.40). We have also tested the TLBO technique on a 2.4 GHz Intel Core i5 processor using the REPERE corpus, where improved results were obtained in terms of execution time (xRT) and DER for both single- and cross-show speaker diarization (0.08 and 0.095, and 18.50% and 26.30%, respectively).
- Published
- 2020
- Full Text
- View/download PDF
7. Speech Enhancement for Multimodal Speaker Diarization System
- Author
-
Syed M. Zubair, Hani Alquhayz, and Rehan Ahmad
- Subjects
General Computer Science, Computer science, Speech recognition, Multimodal speaker diarization, additive white Gaussian noise, diarization error rate, General Materials Science, Environmental noise, audio-visual synchronization, Wiener filter, General Engineering, Mixture model, Speaker diarisation, Speech enhancement, Noise, Additive white Gaussian noise, Gaussian mixture model, LSTM, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
A speaker diarization system identifies speaker-homogeneous regions in recordings where multiple speakers are present. It answers the question ‘who spoke when?’. The data set for speaker diarization usually consists of telephone, meeting, TV/talk show, broadcast news and other multi-speaker recordings. In this paper, we present the performance of our proposed multimodal speaker diarization system under noisy conditions. Two types of noise, additive white Gaussian noise (AWGN) and realistic environmental noise, are used to evaluate the system. To mitigate the effect of noise, we propose to add an LSTM-based speech enhancement block to our diarization pipeline. This block is trained on a synthesized data set with more than 100 noise types to enhance the noisy speech. The enhanced speech is further used in a multimodal speaker diarization system which utilizes a pre-trained audio-visual synchronization model to find the active speaker. High-confidence active-speaker segments are then used to train the speaker-specific clusters on the enhanced speech. A subset of the AMI corpus consisting of 5.4 h of recordings is used in this analysis. For AWGN, the LSTM model's improvement is comparable with the Wiener filter, while for realistic environmental noise the LSTM model improves significantly over the Wiener filter in terms of diarization error rate (DER).
- Published
- 2020
8. Active correction for speaker diarization with human in the loop
- Author
-
Anthony Larcher, Sylvain Meignier, Loïc Barrault, Yevhenii Prokopalo, Meysam Shamsi, Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), The University of Sheffield [Sheffield, U.K.], and ANR-17-CHR2-0004,ALLIES,Autonomous Lifelong learnIng intelLigent Systems(2017)
- Subjects
Diarization error rate, Active learning, Active learning (machine learning), Computer science, Machine learning, Clustering, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Speaker diarisation, Speaker diarization, Human-in-the-loop, Artificial intelligence, Cluster analysis - Abstract
State-of-the-art diarization systems now achieve decent performance, but it is often not good enough to deploy them without any human supervision. In this paper we propose a framework that solicits a human in the loop to correct the clustering by answering simple questions. After defining the nature of the questions, we propose an algorithm to list those questions and two stopping criteria that are necessary to limit the workload on the human in the loop. Experiments performed on the ALLIES dataset show that a limited interaction with a human expert can lead to considerable improvement of up to 36.5% relative diarization error rate (DER) compared to a strong baseline.
- Published
- 2021
9. Age-Invariant Speaker Embedding for Diarization of Cognitive Assessments
- Author
-
Ka-Ho Wong, Man-Wai Mak, Timothy Kwok, Helen Meng, and Sean Shensheng Xu
- Subjects
Diarization error rate, Speaker diarisation, Training set, Artificial neural network, Computer science, Speech recognition, Embedding, Cognition, Invariant (computer science), Test data - Abstract
This paper investigates an age-invariant speaker embedding approach to speaker diarization, which is an essential step towards automatic cognitive assessment from speech. Studies have shown that incorporating speaker traits (e.g., age, gender, etc.) can improve speaker diarization performance. However, we found that age information in the speaker embeddings is detrimental to speaker diarization if there is a severe mismatch between the age distributions in the training data and test data. To minimize the detrimental effect of age mismatch, an adversarial training strategy is introduced to remove age variability from the utterance-level speaker embeddings (an illustrative sketch follows this record). Evaluations on an interactive dialog dataset for Montreal Cognitive Assessments (MoCA) show that the adversarial training strategy can produce age-invariant embeddings and reduce the diarization error rate (DER) by 4.33%. The approach also outperforms the conventional method even with less training data.
- Published
- 2021
- Full Text
- View/download PDF
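Adversarial removal of a nuisance attribute from embeddings is commonly implemented with a gradient-reversal layer: an auxiliary head tries to predict the attribute, while the reversed gradient pushes the embedding network to discard it. The sketch below illustrates that general mechanism for age; the network sizes, feature dimension, speaker count, and equal loss weighting are assumptions and not the authors' configuration.

```python
# Illustrative sketch: gradient reversal to strip age information from embeddings.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lamb * grad, None          # reverse gradient for the embedder

embedder = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 64))
speaker_head = nn.Linear(64, 100)              # main task: 100 training speakers (assumed)
age_head = nn.Linear(64, 1)                    # adversary: predict (normalised) age

feats = torch.randn(32, 40)                    # placeholder utterance-level features
speakers = torch.randint(0, 100, (32,))
ages = torch.randn(32, 1)

emb = embedder(feats)
spk_loss = nn.functional.cross_entropy(speaker_head(emb), speakers)
age_loss = nn.functional.mse_loss(age_head(GradReverse.apply(emb, 1.0)), ages)
(spk_loss + age_loss).backward()               # embedder learns speaker, unlearns age
```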
10. End-to-end speaker segmentation for overlap-aware resegmentation
- Author
-
Antoine Laurent, Hervé Bredin, Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse 1 Capitole (UT1), Université Fédérale Toulouse Midi-Pyrénées-Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse - Jean Jaurès (UT2J)-Université Toulouse III - Paul Sabatier (UT3), Université Fédérale Toulouse Midi-Pyrénées-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique (Toulouse) (Toulouse INP), Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse 1 Capitole (UT1), Université Fédérale Toulouse Midi-Pyrénées, Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), HPC resources of IDRIS under the allocation AD011012177 made by GENCI (Grand Équipement National de Calcul Intensif), ANR-16-CE92-0025,PLUMCOT,Identification non-supervisée des personnages de films et séries télévisées(2016), and ANR-19-CE38-0012,GEM,Mesure de l'égalité entre les sexes dans les médias(2019)
- Subjects
FOS: Computer and information sciences, speaker segmentation, Sound (cs.SD), Computer science, Speech recognition, overlapped speech detection, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Computer Science - Sound, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], End-to-end principle, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Segmentation, resegmentation, Diarization error rate, Voice activity detection, Speaker diarisation, voice activity detection, Temporal resolution, speaker diarization, Change detection, Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. While this task is usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection), we propose to train an end-to-end segmentation model that addresses it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training (an illustrative sketch follows this record). The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16 ms). Experiments on multiple speaker diarization datasets conclude that our model can be used with great success for both voice activity detection and overlapped speech detection. Our proposed model can also be used as a post-processing step, to detect and correctly assign overlapped speech regions. Relative diarization error rate improvement over the best considered baseline (VBx) reaches 17% on AMI, 13% on DIHARD 3, and 13% on VoxConverse. Comment: Camera-ready version for Interspeech 2021 with significantly better voice activity detection, overlapped speech detection, and speaker diarization results. The code used for results reported in v1 contained a small bug that has now been fixed.
- Published
- 2021
- Full Text
- View/download PDF
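The abstract frames segmentation as multi-label classification trained with permutation-invariant training (PIT): the loss is computed for every ordering of the model's speaker outputs and the best one is kept. The sketch below shows a generic PIT binary cross-entropy over frame-level activity targets; the frame count, speaker count, and exhaustive permutation search are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch: permutation-invariant BCE for frame-level speaker activities.
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_bce(pred, target):
    """pred/target: (frames, n_speakers) activity probabilities / 0-1 labels."""
    n_spk = target.shape[1]
    losses = [F.binary_cross_entropy(pred[:, list(p)], target)
              for p in permutations(range(n_spk))]
    return torch.stack(losses).min()           # keep the best speaker ordering

pred = torch.sigmoid(torch.randn(312, 3))      # e.g. ~5 s of audio at 16 ms steps
target = torch.randint(0, 2, (312, 3)).float()
print(pit_bce(pred, target))
```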
11. End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
- Author
-
Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Yusuke Fujita, and Shinji Watanabe
- Subjects
FOS: Computer and information sciences, Diarization error rate, Sound (cs.SD), Sequence, Computer Science - Computation and Language, Computer science, Speech recognition, Computer Science - Sound, Speaker diarisation, End-to-end principle, Audio and Speech Processing (eess.AS), Attractor, FOS: Electrical engineering, electronic engineering, information engineering, Embedding, Encoder decoder, Cluster analysis, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. The generated attractors are then multiplied by the speech embedding sequence to produce the same number of speaker activities (an illustrative sketch follows this record). The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69% diarization error rate (DER) on simulated mixtures and an 8.07% DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56% and 9.54%, respectively. Under conditions with unknown numbers of speakers, our method attained a 15.29% DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43% DER. Accepted to INTERSPEECH 2020.
- Published
- 2020
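The key step above is that each attractor is combined with the frame-wise embedding sequence to yield one speaker-activity stream, and attractors with low existence probability are discarded. The sketch below illustrates only that step with random tensors; the dimensions, thresholds, and existence probabilities are assumptions, not the authors' implementation.

```python
# Illustrative sketch: turning attractors and frame embeddings into speaker activities.
import torch

frames, dim = 500, 256
embeddings = torch.randn(frames, dim)         # frame-wise output of the encoder
attractors = torch.randn(4, dim)              # produced by the encoder-decoder (EDA)
exist_prob = torch.sigmoid(torch.randn(4))    # one existence probability per attractor

active = attractors[exist_prob > 0.5]                     # keep likely speakers
speaker_activity = torch.sigmoid(embeddings @ active.T)   # (frames, n_speakers)
decisions = speaker_activity > 0.5                        # overlap-aware frame decisions
```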
12. Speaker diarization for multi-speaker conversations via x-vectors
- Author
-
Xiao Song, Jian Zhang, and Yangfan Zhang
- Subjects
Speaker diarisation, Diarization error rate, Training set, Artificial neural network, Coincident, Computer science, Speech recognition, Conversation, Cluster analysis, Linear discriminant analysis - Abstract
This paper investigates a new way to build an x-vector-based speaker diarization system for multi-speaker conversations, and explores how to improve system performance. There has been a lot of work proving the superiority of x-vectors in speaker diarization, but they have not been applied in multi-speaker scenarios. We have studied several techniques in our system, such as dividing a long conversation into short overlapping segments to facilitate the extraction of x-vectors instead of ignoring overlapping regions (an illustrative sketch follows this record), and re-classifying the labels of coincident segments after clustering to reduce errors. In addition, we augment the training data to deal with the problem of insufficient data for discriminant analysis, and select the appropriate number of archives to control the iteration over training samples when training neural networks. Finally, the experimental results on the AMI corpus demonstrate the effectiveness of our system. Compared with the initial system of the 2018 DIHARD challenge track 2, our final result reduces the Diarization Error Rate (DER) by a relative 13.21%.
- Published
- 2019
- Full Text
- View/download PDF
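One technique mentioned above is cutting the long conversation into short overlapping segments so that an x-vector can be extracted per segment before clustering. A minimal segmenter is sketched below; the 1.5 s window, 50% overlap, and 16 kHz sample rate are assumptions, not values taken from the paper.

```python
# Illustrative sketch: overlapping fixed-length windows over a long recording.
def sliding_segments(n_samples, sr=16000, win_s=1.5, hop_s=0.75):
    """Yield (start, end) sample indices of overlapping windows."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    start = 0
    while start + win <= n_samples:
        yield start, start + win
        start += hop
    if start < n_samples:                      # keep the shorter final segment
        yield start, n_samples

# Example: 10 s of 16 kHz audio; extract one x-vector per (start, end) slice.
for start, end in sliding_segments(10 * 16000):
    pass  # xvec = extract_xvector(audio[start:end])  -- hypothetical extractor
```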
13. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
- Author
-
Allah Ditta, Hani Alquhayz, Rehan Ahmad, and Syed M. Zubair
- Subjects
Computer science, Speech recognition, Biochemistry, Article, Analytical Chemistry, diarization error rate, SyncNet, Synchronization (computer science), speech activity detection, Electrical and Electronic Engineering, Face detection, Instrumentation, Voice activity detection, Mixture model, Atomic and Molecular Physics, and Optics, Speaker diarisation, Gaussian mixture model, MFCC, speaker diarization, Mel-frequency cepstrum - Abstract
Speaker diarization systems aim to find 'who spoke when?' in multi-speaker recordings. The dataset usually consists of meetings, TV/talk shows, telephone and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through an audio-visual synchronization model for diarization. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprised of face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-streamed network which matches audio frames with their respective visual input segments. On the basis of high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of recordings for audio and 5.8 h of a different set of multimodal recordings. A significant improvement is noticed with the proposed method in terms of DER when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of the complex state-of-the-art multimodal diarization, which shows the significance of such a simple yet effective technique.
- Published
- 2019
- Full Text
- View/download PDF
14. Enhanced low-latency speaker spotting using selective cluster enrichment
- Author
-
Nicholas Evans, Jose Patino, and Héctor Delgado
- Subjects
Diarization error rate, Computer science, Speech recognition, Online processing, Word error rate, Spotting, Rapid detection, Data modeling, Speaker diarisation, Speaker detection - Abstract
Low-latency speaker spotting (LLSS) calls for the rapid detection of known speakers within multi-speaker audio streams. While previous work showed the potential to develop efficient LLSS solutions by combining speaker diarization and speaker detection within an online processing framework, it failed to move significantly beyond the traditional definition of diarization. This paper shows that the latter needs rethinking and that a diarization sub-system tailored to the end application, rather than to the minimisation of the diarization error rate, can improve LLSS performance. The proposed selective cluster enrichment algorithm is used to guide the diarization system to better model segments within a multi-speaker audio stream and hence more reliably detect a given target speaker. The LLSS solution reported in this paper shows that target speakers can be detected with a 16% equal error rate after having been active in multi-speaker audio streams for only 15 seconds.
- Published
- 2018
- Full Text
- View/download PDF
15. A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization
- Author
-
Mohammed Senoussaoui, Themos Stafylakis, Patrick Kenny, and Pierre Dumouchel
- Subjects
Diarization error rate, Acoustics and Ultrasonics, Iterative method, Computer science, Speech recognition, Pattern recognition, Speech processing, Speaker diarisation, Computational Mathematics, Cosine Distance, Computer Science (miscellaneous), Artificial intelligence, Mean-shift, Electrical and Electronic Engineering, Cluster analysis, Prior information - Abstract
Speaker clustering is a crucial step for speaker diarization. The short duration of speech segments in telephone dialogue and the absence of prior information on the number of clusters dramatically increase the difficulty of diarizing spontaneous telephone conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions (an illustrative sketch follows this record). Two variants of the cosine-distance Mean Shift are compared in an exhaustive practical study. We report state-of-the-art results as measured by the Diarization Error Rate and the Number of Detected Speakers on the LDC CallHome telephone corpus.
- Published
- 2014
- Full Text
- View/download PDF
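The clustering idea studied above can be illustrated with a flat-kernel Mean Shift in which distance is one minus cosine similarity and points are kept length-normalised, so nearby speaker factors (e.g. i-vectors) drift toward a common mode. The bandwidth, iteration count, and input dimensions below are assumptions, not the paper's configuration.

```python
# Illustrative sketch: flat-kernel Mean Shift using cosine distance.
import numpy as np

def cosine_mean_shift(x, bandwidth=0.3, n_iter=20):
    """x: (n_points, dim). Returns shifted points; same-speaker points converge."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)       # unit length
    shifted = x.copy()
    for _ in range(n_iter):
        sim = shifted @ x.T                                 # cosine similarity
        window = ((1.0 - sim) <= bandwidth).astype(float)   # flat kernel membership
        means = window @ x / window.sum(axis=1, keepdims=True)
        shifted = means / np.linalg.norm(means, axis=1, keepdims=True)
    return shifted

points = np.random.randn(50, 100)      # e.g. 50 i-vectors of dimension 100
modes = cosine_mean_shift(points)      # cluster labels follow from grouping close modes
```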
16. Harmonic Structure Features for Robust Speaker Diarization
- Author
-
Junfeng Li, Yu Zhou, Yonghong Yan, and Hongbin Suo
- Subjects
Diarization error rate, Engineering, General Computer Science, Harmonic structure, Microphone, Speech recognition, Pattern recognition, Vibration amplitude, Electronic, Optical and Magnetic Materials, Speaker diarisation, Robustness (computer science), Cepstrum, Artificial intelligence, Electrical and Electronic Engineering - Abstract
In this paper, we present a new approach for speaker diarization. First, we use prosodic information calculated on the original speech to resynthesize new speech data using a spectrum modeling technique. The resynthesized data is modeled with sinusoids based on pitch, vibration amplitude, and phase bias. Then, we use the resynthesized speech data to extract cepstral features and integrate them with the cepstral features from the original speech for speaker diarization. Finally, we show how the two streams of cepstral features can be combined to improve the robustness of speaker diarization. Experiments carried out on standardized datasets (the US National Institute of Standards and Technology Rich Transcription 04-S multiple distant microphone conditions) show a significant improvement in diarization error rate compared to the system based on only the feature stream from the original speech.
- Published
- 2012
- Full Text
- View/download PDF
17. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model.
- Author
-
Ahmad, Rehan, Zubair, Syed, Alquhayz, Hani, and Ditta, Allah
- Subjects
AUDIOVISUAL materials, GAUSSIAN mixture models, SYNCHRONIZATION, RADIO talk programs, SOUND recordings - Abstract
Speaker diarization systems aim to find 'who spoke when?' in multi-speaker recordings. The dataset usually consists of meetings, TV/talk shows, telephone and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through an audio-visual synchronization model for diarization. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprised of face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-streamed network which matches audio frames with their respective visual input segments. On the basis of high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of recordings for audio and 5.8 h of a different set of multimodal recordings. A significant improvement is noticed with the proposed method in terms of DER when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of the complex state-of-the-art multimodal diarization, which shows the significance of such a simple yet effective technique. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
18. Ideas for Clustering of Similar Models of a Speaker in an Online Speaker Diarization System
- Author
-
Vlasta Radová and Marie Kunešová
- Subjects
Speaker diarisation, Diarization error rate, Computer science, Speech recognition, Baseline system, Artificial intelligence, Cluster analysis, Speaker recognition, Natural language processing - Abstract
During online speaker diarization, a situation may occur where a single speaker is represented by several different models. Such a situation leads to worse diarization results, because the diarization system considers every change of model to be a change of speakers. In this article we describe a method for detecting this situation and propose several ways of solving it. Experiments show that the most suitable option is treating multiple GMMs as belonging to a single speaker, i.e. updating all of them with the same data every time one of them is assigned a new segment. In that case, there was a relative improvement in Diarization Error Rate of 30.69% in comparison with the baseline system.
- Published
- 2015
- Full Text
- View/download PDF
19. A Cluster Purification Algorithm for Speaker Diarization System
- Author
-
Zhang Xiang
- Subjects
Speaker diarisation, Diarization error rate, Computer science, Pattern recognition, Artificial intelligence, Cluster analysis, Algorithm - Abstract
In speaker diarization systems, it is common to use a bottom-up clustering method, in which the input data is first split into small pieces and the most similar segments are then merged until a stopping point is reached. However, it is not easy to ensure that every selection merges the right pair, and such errors tend to degrade the post-merging results. In this paper, a fast cluster purification algorithm is introduced once the number of clusters reaches a pre-estimated number K, which is no less than the real number of speakers involved in the conversation; it tries to remedy the errors by moving inappropriate segments into the right cluster. An effective way to estimate the number of speakers is also introduced before the clustering stage. The experimental results show improvement in both cluster purity and diarization error rate (DER) after using the purification algorithm: purity improves by 0.8% and DER is reduced by 1.14% on average.
- Published
- 2014
- Full Text
- View/download PDF
20. Global Speaker Clustering towards Optimal Stopping Criterion in Binary Key Speaker Diarization
- Author
-
Héctor Delgado, Corinne Fredouille, Xavier Anguera, and Javier Serrano
- Subjects
Diarization error rate, Engineering, Binary number, Agglomerative hierarchical clustering, Speaker diarisation, Baseline system, Optimal stopping, Data mining, Cluster analysis, Global optimization problem - Abstract
The recently proposed speaker diarization technique based on binary keys provides a very fast alternative to state-of-the-art systems with little increase in Diarization Error Rate (DER). Although the approach shows great potential, it also presents issues, mainly in the stopping criterion. Therefore, exploring alternative clustering/stopping-criterion approaches is needed. Recently, some works have addressed speaker clustering as a global optimization problem in order to tackle the intrinsic issues of Agglomerative Hierarchical Clustering (AHC), mainly its local-maximum-based decision making. This paper aims at adapting and applying this new framework to the binary key diarization system. In addition, an analysis of cluster purity across the AHC iterations is done using reference speaker ground-truth labels to select the purer clustering as input for the global framework. Experiments on the REPERE phase 1 test database show improvements of around 6% absolute DER compared to the baseline system output.
- Published
- 2014
- Full Text
- View/download PDF
21. A Comparable Study on PNCC in Speaker Diarization for Meetings
- Author
-
Yunpeng Xiao, Weiping Ye, Qiao Li, and Qing Fan
- Subjects
Diarization error rate, Computer science, Speech recognition, Pattern recognition, Speaker recognition, Frequency spectrum, Speaker diarisation, Feature (machine learning), Cepstrum coefficients, Artificial intelligence, Mel-frequency cepstrum, Hidden Markov model - Abstract
In speaker diarization, the most commonly used speaker feature is MFCC, which is also the most commonly used speech feature in speech recognition. The newly proposed Power Normalized Cepstrum Coefficients (PNCC) achieve impressive improvements in noisy speech recognition compared to MFCC, which motivates an evaluation of PNCC for speaker diarization. In this paper, PNCC is evaluated against MFCC in a meeting-domain speaker diarization system. The Diarization Error Rate (DER) shows no improvement with PNCC. This is possibly because PNCC suppresses the high-frequency spectrum, which is believed to represent characteristics of the human voice. An initial-model training-material selection strategy is also proposed and used in the speaker diarization system in this work.
- Published
- 2010
- Full Text
- View/download PDF
22. Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign
- Author
-
Martin Zelenák, Javier Hernando, Henrik Schulz, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, and Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
- Subjects
Diarization error rate, Acoustics and Ultrasonics, Computer science, Speech recognition, Evaluation data, Computational linguistics, Broadcast news, Speaker diarisation, Speaker diarization, Speech processing, Telecommunication engineering::Signal processing::Speech and acoustic signal processing [UPC thematic areas], Systems design, Speech processing systems, Electrical and Electronic Engineering, Evaluation - Abstract
In this article, we present the evaluation results for the task of speaker diarization of broadcast news, which was part of the Albayzin 2010 evaluation campaign of language and speech technologies. The evaluation data consist of a subset of the Catalan broadcast news database recorded from the 3/24 TV channel. A description of the five submitted systems from five different research labs is given, marking the common as well as the distinctive system features. The diarization performance is analyzed in terms of the diarization error rate, the number of detected speakers and the acoustic background conditions. An effort is also made to relate the achieved results to particular system design features.
- Full Text
- View/download PDF