1. Self-supervised deep learning approaches to speaker recognition: A Ph.D. Thesis overview
- Author
-
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Khan, Umair, Hernando Pericás, Francisco Javier, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla, Khan, Umair, and Hernando Pericás, Francisco Javier
- Abstract
Recent advances in Deep Learning (DL) for speaker recognition have improved the performance but are constrained to the need of labels for the background data, which is difficult in prac- tice. In i-vector based speaker recognition, cosine (unsuper- vised) and PLDA (supervised) are the basic scoring techniques, with a big performance gap between the two. In this thesis we tried to fill this gap without using speaker labels in several ways. We applied Restricted Boltzmann Machine (RBM) vectors for the tasks of speaker clustering and tracking in TV broadcast shows. The experiments on AGORA database show that us- ing this approach we gain a relative improvement of 12% and 11% for speaker clustering and tracking tasks, respectively. We also applied DL techniques in order to increase the discrimina- tive power of i-vectors in speaker verification task, for which we have proposed the use of autoencoder in several ways, i.e., (1) as a pre-training for a Deep Neural Network (DNN), (2) as a near- est neighbor autoencoder for i-vectors, (3) as an average pooled nearest neighbor autoencoder. The experiments on VoxCeleb database show that we gain a relative improvement of 21%, 42% and 53%, using the three systems respectively. Finally we also proposed a self-supervised end-to-end speaker verification system. The architecture is based on a Convolutional Neural Network (CNN), trained as a siamese network with multiple branches. From the results we can see that our system shows comparable performance to a supervised baseline, This work was supported by the project PID2019-107579RBI00 / AEI / 10.13039/501100011033, Peer Reviewed, Postprint (published version)
- Published
- 2021