Author: "Huh, Jaesung" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

1. Character-aware audio-visual subtitling in context

Author: Huh, Jaesung and Zisserman, Andrew
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it's said, and who is speaking, providing a more comprehensive and accurate character-aware subtitling for TV shows. Our approach brings improvements on two fronts: first, we show that audio-visual synchronisation can be used to pick out the talking face amongst others present in a video clip, and assign an identity to the corresponding speech segment. This audio-visual approach improves recognition accuracy and yield over current methods. Second, we show that the speaker of short segments can be determined by using the temporal context of the dialogue within a scene. We propose an approach using local voice embeddings of the audio, and large language model reasoning on the text transcription. This overcomes a limitation of existing methods that they are unable to accurately assign speakers to short temporal segments. We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches. Project page : https://www.robots.ox.ac.uk/~vgg/research/llr-context/, Comment: ACCV 2024
Published: 2024

2. The VoxCeleb Speaker Recognition Challenge: A Retrospective

Author: Huh, Jaesung, Chung, Joon Son, Nagrani, Arsha, Brown, Andrew, Jung, Jee-weon, Garcia-Romero, Daniel, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year's special focus affected participants' performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and also at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges. Project page : https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/workshop.html, Comment: TASLP 2024
Published: 2024
Full Text: View/download PDF

3. TIM: A Time Interval Machine for Audio-Visual Action Recognition

Author: Chalk, Jacob, Huh, Jaesung, Kazakos, Evangelos, Zisserman, Andrew, and Damen, Dima
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM, Comment: Accepted to CVPR 2024. Project Webpage: https://jacobchalk.github.io/TIM-Project
Published: 2024

4. Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

Author: Korbar, Bruno, Huh, Jaesung, and Zisserman, Andrew
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method over a variety of TV sitcoms, including Seinfeld, Fraiser and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast amount of videos available on modern streaming services. Project page : \url{https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/}, Comment: Accepted for publication in ICASSP 2024
Published: 2024

5. OxfordVGG Submission to the EGO4D AV Transcription Challenge

Author: Huh, Jaesung, Bain, Max, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This report presents the technical details of our submission on the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team. We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two text normalisers which are publicly available. Our final submission obtained 56.0% of the Word Error Rate (WER) on the challenge test set, ranked 1st on the leaderboard. All baseline codes and models are available on https://github.com/m-bain/whisperX., Comment: Technical Report
Published: 2023

6. WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Author: Bain, Max, Huh, Jaesung, Han, Tengda, and Zisserman, Andrew
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination & repetition; and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding each utterance are prone to inaccuracies and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference., Comment: Accepted to INTERSPEECH 2023
Published: 2023

7. VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

Author: Huh, Jaesung, Brown, Andrew, Jung, Jee-weon, Chung, Joon Son, Nagrani, Arsha, Garcia-Romero, Daniel, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.
Published: 2023

8. Epic-Sounds: A Large-scale Dataset of Actions That Sound

Author: Huh, Jaesung, Chalk, Jacob, Kazakos, Evangelos, Damen, Dima, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce Epic-Sounds, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from video, discarding ambiguities. Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments. We train and evaluate state-of-the-art audio recognition and detection models on our dataset, for both audio-only and audio-visual methods. We also conduct analysis on: the temporal overlap between audio events, the temporal and label correlations between audio and visual modalities, the ambiguities in annotating materials from audio-only input, the importance of audio-only labels and the limitations of current models to understand actions that sound. Project page : https://epic-kitchens.github.io/epic-sounds/, Comment: 12 pages, 12 figures
Published: 2023

9. Disentangled representation learning for multilingual speaker recognition

Author: Nam, Kihyun, Kim, Youkyum, Huh, Jaesung, Heo, Hee Soo, Jung, Jee-weon, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The goal of this paper is to learn robust speaker representation for bilingual speaking scenario. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when speaking in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. In this paper, we publish a large-scale evaluation set named VoxCeleb1-B derived from VoxCeleb that considers bilingual scenarios. We introduce an effective disentanglement learning strategy that combines adversarial and metric learning-based methods. This approach addresses the bilingual situation by disentangling language-related information from speaker representation while ensuring stable speaker representation learning. Our language-disentangled learning method only uses language pseudo-labels without manual information., Comment: Interspeech 2023
Published: 2022

10. In search of strong embedding extractors for speaker diarisation

Author: Jung, Jee-weon, Heo, Hee-Soo, Lee, Bong-Jin, Huh, Jaesung, Brown, Andrew, Kwon, Youngki, Watanabe, Shinji, and Chung, Joon Son
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not seen utterances in which multiple speakers exist. These inputs are inevitably present in speaker diarisation because of overlapped speech and speaker changes; they degrade the performance. To mitigate the first problem, we generate speaker verification evaluation protocols that mimic the diarisation scenario better. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input. One technique generates overlapped speech segments, and the other generates segments where two speakers utter sequentially. Extensive experimental results using three state-of-the-art speaker embedding extractors demonstrate that both proposed approaches are effective., Comment: 5pages, 1 figure, 2 tables, submitted to ICASSP
Published: 2022

11. VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge

Author: Brown, Andrew, Huh, Jaesung, Chung, Joon Son, Nagrani, Arsha, Garcia-Romero, Daniel, and Zisserman, Andrew
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained or `in the wild' data. The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2021. This paper outlines the challenge, and describes the baselines, methods and results. We conclude with a discussion on the new multi-lingual focus of VoxSRC 2021, and on the progression of the challenge since the previous two editions., Comment: arXiv admin note: substantial text overlap with arXiv:2012.06867
Published: 2022

12. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

Author: Kazakos, Evangelos, Huh, Jaesung, Nagrani, Arsha, Zisserman, Andrew, and Damen, Dima
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context as well as incorporating audio input modality and language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN., Comment: Accepted at BMVC 2021
Published: 2021

13. VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

Author: Nagrani, Arsha, Chung, Joon Son, Huh, Jaesung, Brown, Andrew, Coto, Ernesto, Xie, Weidi, McLaren, Mitchell, Reynolds, Douglas A, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020. The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition and diarisation dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2020. This paper outlines the challenge, and describes the baselines, methods used, and results. We conclude with a discussion of the progress over the first installment of the challenge.
Published: 2020

14. Look who's not talking

Author: Kwon, Youngki, Heo, Hee Soo, Huh, Jaesung, Lee, Bong-Jin, and Chung, Joon Son
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The objective of this work is speaker diarisation of speech recordings 'in the wild'. The ability to determine speech segments is a crucial part of diarisation systems, accounting for a large proportion of errors. In this paper, we present a simple but effective solution for speech activity detection based on the speaker embeddings. In particular, we discover that the norm of the speaker embedding is an extremely effective indicator of speech activity. The method does not require an independent model for speech activity detection, therefore allows speaker diarisation to be performed using a unified representation for both speaker modelling and speech activity detection. We perform a number of experiments on in-house and public datasets, in which our method outperforms popular baselines., Comment: SLT 2021
Published: 2020

15. Playing a Part: Speaker Verification at the Movies

Author: Brown, Andrew, Huh, Jaesung, Nagrani, Arsha, Chung, Joon Son, and Zisserman, Andrew
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies contains utterances with varying emotion, accents and background noise, and therefore comprises an entirely different domain to the interview-style, emotionally calm utterances in current speaker recognition datasets such as VoxCeleb; (ii) We provide a number of domain adaptation evaluation sets, and benchmark the performance of state-of-the-art speaker recognition models on these evaluation pairs. We demonstrate that both speaker verification and identification performance drops steeply on this new data, showing the challenge in transferring models across domains; and finally (iii) We show that simple domain adaptation paradigms improve performance, but there is still large room for improvement., Comment: The first three authors contributed equally to this work
Published: 2020

16. Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020

Author: Heo, Hee Soo, Lee, Bong-Jin, Huh, Jaesung, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This report describes our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020. We perform a careful analysis of speaker recognition models based on the popular ResNet architecture, and train a number of variants using a range of loss functions. Our results show significant improvements over most existing works without the use of model ensemble or post-processing. We release the training code and pre-trained models as unofficial baselines for this year's challenge.
Published: 2020

17. Augmentation adversarial training for self-supervised speaker recognition

Author: Huh, Jaesung, Heo, Hee Soo, Kang, Jingu, Watanabe, Shinji, and Chung, Joon Son
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The goal of this work is to train robust speaker recognition models without speaker labels. Recent works on unsupervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceed that of humans., Comment: Workshop on Self-Supervised Learning for Speech and Audio Processing, NeurIPS
Published: 2020

18. Spot the conversation: speaker diarisation in the wild

Author: Chung, Joon Son, Huh, Jaesung, Nagrani, Arsha, Afouras, Triantafyllos, and Zisserman, Andrew
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions., Comment: The dataset will be available for download from http://www.robots.ox.ac.uk/~vgg/data/voxceleb/voxconverse.html . The development set will be released in July 2020, and the test set will be released in October 2020
Published: 2020
Full Text: View/download PDF

19. Metric Learning for Keyword Spotting

Author: Huh, Jaesung, Lee, Minjae, Heo, Heesoo, Mun, Seongkyu, and Chung, Joon Son
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The goal of this work is to train effective representations for keyword spotting via metric learning. Most existing works address keyword spotting as a closed-set classification problem, where both target and non-target keywords are predefined. Therefore, prevailing classifier-based keyword spotting systems perform poorly on non-target sounds which are unseen during the training stage, causing high false alarm rates in real-world scenarios. In reality, keyword spotting is a detection problem where predefined target keywords are detected from a variety of unknown sounds. This shares many similarities to metric learning problems in that the unseen and unknown non-target sounds must be clearly differentiated from the target keywords. However, a key difference is that the target keywords are known and predefined. To this end, we propose a new method based on metric learning that maximises the distance between target and non-target keywords, but also learns per-class weights for target keywords \`a la classification objectives. Experiments on the Google Speech Commands dataset show that our method significantly reduces false alarms to unseen non-target keywords, while maintaining the overall classification accuracy.
Published: 2020

20. In defence of metric learning for speaker recognition

Author: Chung, Joon Son, Huh, Jaesung, Mun, Seongkyu, Lee, Minjae, Heo, Hee Soo, Choe, Soyeon, Ham, Chiheon, Jung, Sunghwan, Lee, Bong-Jin, and Han, Icksang
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and those trained with our proposed metric learning objective outperform state-of-the-art methods., Comment: The code can be found at https://github.com/clovaai/voxceleb_trainer
Published: 2020
Full Text: View/download PDF

21. Modeling Musical Onset Probabilities via Neural Distribution Learning

Author: Huh, Jaesung, Martinsson, Egil, Kim, Adrian, and Ha, Jung-Woo
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Musical onset detection can be formulated as a time-to-event (TTE) or time-since-event (TSE) prediction task by defining music as a sequence of onset events. Here we propose a novel method to model the probability of onsets by introducing a sequential density prediction model. The proposed model estimates TTE & TSE distributions from mel-spectrograms using convolutional neural networks (CNNs) as a density predictor. We evaluate our model on the Bock dataset show-ing comparable results to previous deep-learning models., Comment: 2 pages, 2 figures, 2 tables
Published: 2020

22. The sound of my voice: speaker representation loss for target voice separation

Author: Mun, Seongkyu, Choe, Soyeon, Huh, Jaesung, and Chung, Joon Son
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Content and style representations have been widely studied in the field of style transfer. In this paper, we propose a new loss function using speaker content representation for audio source separation, and we call it speaker representation loss. The objective is to extract the target speaker voice from the noisy input and also remove it from the residual components. Compared to the conventional spectral reconstruction, our proposed framework maximizes the use of target speaker information by minimizing the distance between the speaker representations of reference and source separation output. We also propose triplet speaker representation loss as an additional criterion to remove the target speaker information from residual spectrogram output. VoiceFilter framework is adopted to evaluate source separation performance using the VCTK database, and we achieved improved performances compared to the baseline loss function without any additional network parameters., Comment: To appear in ICASSP 2020. The first two authors contributed equally to this work
Published: 2019

23. Delving into VoxCeleb: environment invariant speaker recognition

Author: Chung, Joon Son, Huh, Jaesung, and Mun, Seongkyu
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets. There has been a plethora of work in search for more powerful architectures or loss functions suitable for the task, but these works do not consider what information is learnt by the models, apart from being able to predict the given labels. In this work, we introduce an environment adversarial training framework in which the network can effectively learn speaker-discriminative and environment-invariant embeddings without explicit domain shift during training. We achieve this by utilising the previously unused `video' information in the VoxCeleb dataset. The environment adversarial training allows the network to generalise better to unseen conditions. The method is evaluated on both speaker identification and verification tasks using the VoxCeleb dataset, on which we demonstrate significant performance improvements over baselines.
Published: 2019

24. Phase-aware Speech Enhancement with Deep Complex U-Net

Author: Choi, Hyeong-Seok, Kim, Jang-Hyun, Huh, Jaesung, Kim, Adrian, Ha, Jung-Woo, and Lee, Kyogu
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin., Comment: Significant error was found in data processing step, therefore will be retracted from International Conference on Learning Representations (ICLR) 2019. It is not recommended to read current version
Published: 2019

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

24 results on '"Huh, Jaesung"'

1. Character-aware audio-visual subtitling in context

2. The VoxCeleb Speaker Recognition Challenge: A Retrospective

3. TIM: A Time Interval Machine for Audio-Visual Action Recognition

4. Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

5. OxfordVGG Submission to the EGO4D AV Transcription Challenge

6. WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

7. VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

8. Epic-Sounds: A Large-scale Dataset of Actions That Sound

9. Disentangled representation learning for multilingual speaker recognition

10. In search of strong embedding extractors for speaker diarisation

11. VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge

12. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

13. VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

14. Look who's not talking

15. Playing a Part: Speaker Verification at the Movies

16. Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020

17. Augmentation adversarial training for self-supervised speaker recognition

18. Spot the conversation: speaker diarisation in the wild

19. Metric Learning for Keyword Spotting

20. In defence of metric learning for speaker recognition

21. Modeling Musical Onset Probabilities via Neural Distribution Learning

22. The sound of my voice: speaker representation loss for target voice separation

23. Delving into VoxCeleb: environment invariant speaker recognition

24. Phase-aware Speech Enhancement with Deep Complex U-Net

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

24 results on '"Huh, Jaesung"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources