A Purely End-to-End System for Multi-speaker Speech Recognition

Authors :: Jonathan Le Roux
John R. Hershey
Shinji Watanabe
Hiroshi Seki
Takaaki Hori
Source :: ACL (1)
Publication Year :: 2018
Publisher :: Association for Computational Linguistics, 2018.
Abstract :: Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.

Subjects :: End to end system
Sequence
Training set
Computer science
Speech recognition
020208 electrical & electronic engineering
Contrast (statistics)
02 engineering and technology
Task (project management)
030507 speech-language pathology & audiology
03 medical and health sciences
0202 electrical engineering, electronic engineering, information engineering
Source separation
0305 other medical science

Database :: OpenAIRE
Journal :: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Accession number :: edsair.doi...........ebe839b0f708111d50160df740a834fc
Full Text :: https://doi.org/10.18653/v1/p18-1244