An investigation of domain adaptation in speaker embedding space for speaker recognition.

Authors :
Bahmaninezhad, Fahimeh
Zhang, Chunlei
Hansen, John H.L.
Source :
Speech Communication. May 2021, Vol. 129, pp. 7-16 (10 pages).
Publication Year :
2021

Abstract

Speaker recognition continues to grow as a research challenge, with expanding application in commercial, forensic, educational, and general speech technology interfaces. However, challenges remain, especially for naturalistic audio streams, including recordings with a mismatch between training and test data (i.e., when the training or system development data and the enrollment/test or application data are collected from different sources). Mismatch conditions (Hansen and Hasan, 2015) can be divided into two categories: extrinsic (channel, noise, etc.) and intrinsic (duration, language, and speaker traits including stress, emotion, Lombard effect, vocal effort, and accent). Here, we investigate speaker recognition under domain mismatch (an intrinsic mismatch), focusing on the challenges introduced by the NIST (National Institute of Standards and Technology) SRE (Speaker Recognition Evaluation) in 2016 and 2018. The challenges introduced in NIST SRE-16 and SRE-18 include language mismatch between the training data (used for system development) and the enrollment/test data (used at the application phase). We develop three alternative speaker embedding systems: i-vector, t-vector (an improved triplet loss solution), and x-vector. In addition, a number of unsupervised and supervised (using pseudo labels) methods are studied for domain mismatch compensation, especially at the back-end level. These include adapted PLDA, adapted discriminant analysis, and score normalization and calibration methods using unlabeled in-domain data. We also propose new variations of discriminant analysis with support vectors (SVDA). The results confirm that SVDA can measurably improve speaker recognition performance for the SRE-16 and SRE-18 tasks by +15% and +8%, respectively, in terms of min-Cprimary; for EER the gains are +14% and +16%, respectively, using i-vector speaker embeddings as the baseline.
These advancements offer promising steps toward addressing speaker recognition in naturalistic audio streams.

• Addresses the problem of speaker recognition in naturalistic environments due to domain mismatch.
• Speaker recognition for domain (intrinsic) mismatch is addressed emphasizing challenges from NIST SRE (2016 & 2018).
• Three alternative speaker embedding systems are developed: i-Vector, t-Vector (an improved triplet loss solution), and x-Vector.
• Back-end advancements include: adapted PLDA, discriminant analysis with support vectors (SVDA), score normalization, and calibration methods.
• Results show measurable improvement in speaker recognition performance for SRE-16 and SRE-18 tasks (+14% & +16% for EER).

[ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
0167-6393
Volume :
129
Database :
Academic Search Index
Journal :
Speech Communication
Publication Type :
Academic Journal
Accession number :
149760254
Full Text :
https://doi.org/10.1016/j.specom.2021.01.001