Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
- Source :
- IEEE Journal of Selected Topics in Signal Processing. 11:1240-1253
- Publication Year :
- 2017
- Publisher :
- Institute of Electrical and Electronics Engineers (IEEE), 2017.
Abstract
- Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder–decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
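The abstract describes two places where CTC and attention are combined: a multiobjective training loss and a joint score used during decoding. The following is a minimal sketch of that idea, not the authors' implementation. It assumes both combinations are linear interpolations controlled by a single weight in [0, 1], a form the abstract itself does not state. The names `hybrid_loss`, `joint_score`, `pick_best`, and `weight` are hypothetical. `pick_best` only compares already-complete hypotheses; it does not reproduce the one-pass beam search mentioned in the abstract.

```python
from typing import List, Tuple


def _check_weight(weight: float) -> None:
    # The interpolation weight is an assumed tunable hyperparameter in [0, 1].
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")


def hybrid_loss(ctc_loss: float, att_loss: float, weight: float = 0.5) -> float:
    """Multiobjective training loss: weighted sum of the CTC and attention losses."""
    _check_weight(weight)
    return weight * ctc_loss + (1.0 - weight) * att_loss


def joint_score(ctc_logp: float, att_logp: float, weight: float = 0.5) -> float:
    """Joint decoding score: weighted sum of CTC and attention log-probabilities."""
    _check_weight(weight)
    return weight * ctc_logp + (1.0 - weight) * att_logp


def pick_best(hypotheses: List[Tuple[str, float, float]], weight: float = 0.5) -> str:
    """Return the text of the hypothesis with the highest joint score.

    Each hypothesis is (text, ctc_log_prob, attention_log_prob).
    """
    best = max(hypotheses, key=lambda h: joint_score(h[1], h[2], weight))
    return best[0]


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    candidates = [("a b", -1.0, -5.0), ("a c", -2.0, -2.5)]
    print(hybrid_loss(2.0, 4.0, weight=0.5))
    print(pick_best(candidates, weight=0.5))
```

Under this assumed form, setting `weight` to 1.0 uses the CTC score alone and setting it to 0.0 uses the attention score alone, so the same functions cover both baselines.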
- Subjects :
- Network architecture
Artificial neural network
Computer science
Speech recognition
Markov process
02 engineering and technology
030507 speech-language pathology & audiology
03 medical and health sciences
Robustness (computer science)
Signal Processing
0202 electrical engineering, electronic engineering, information engineering
Beam search
020201 artificial intelligence & image processing
Language model
Electrical and Electronic Engineering
0305 other medical science
Hidden Markov model
Decoding methods
Details
- ISSN :
- 1941-0484 and 1932-4553
- Volume :
- 11
- Database :
- OpenAIRE
- Journal :
- IEEE Journal of Selected Topics in Signal Processing
- Accession number :
- edsair.doi...........013ec9e5029c93cbe64572a9f4c0719b
- Full Text :
- https://doi.org/10.1109/jstsp.2017.2763455