Author: "John R. Hershey" / Topic: computer science - Searchworks@Jio Institute Digital Library Search Results

Your search for "John R. Hershey" returned 83 results

1. Phasebook and Friends: Leveraging Discrete Representations for Source Separation

Author: Gordon Wichern, Shinji Watanabe, Andy M. Sarroff, John R. Hershey, and Jonathan Le Roux
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer science, Machine Learning (stat.ML), 02 engineering and technology, Computer Science - Sound, Oracle, Machine Learning (cs.LG), Statistics - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Electrical and Electronic Engineering, Computer Science - Computation and Language, Noise measurement, business.industry, Quantization (signal processing), Deep learning, 020206 networking & telecommunications, Time–frequency analysis, Speech enhancement, Signal Processing, Softmax function, Artificial intelligence, business, Computation and Language (cs.CL), Algorithm, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech enhancement and source separation systems based on deep learning have recently reached unprecedented levels of quality, to the point that performance is reaching a new ceiling. Most systems rely on estimating the magnitude of a target source by estimating a real-valued mask to be applied to a time-frequency representation of the mixture signal. A limiting factor in such approaches is a lack of phase estimation: the phase of the mixture is most often used when reconstructing the estimated time-domain signal. Here, we propose “magbook,” “phasebook,” and “combook,” three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks. Magbook layers extend classical sigmoidal units and a recently introduced convex softmax activation for mask-based magnitude estimation. Phasebook layers use a similar structure to give an estimate of the phase mask without suffering from phase wrapping issues. Combook layers are an alternative to the magbook–phasebook combination that directly estimate complex masks. We present various training and inference schemes involving these representations, and explain in particular how to include them in an end-to-end learning framework. We also present an oracle study to assess upper bounds on performance for various types of masks using discrete phase representations. We evaluate the proposed methods on the wsj0-2mix dataset, a well-studied corpus for single-channel speaker-independent speaker separation, matching the performance of state-of-the-art mask-based approaches without requiring additional phase reconstruction steps.
Published: 2019
Full Text: View/download PDF
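The abstract above says phasebook layers estimate phase from a discrete representation without phase wrapping issues. The following is only a minimal sketch of that general idea, not the authors' layer: it assumes a hypothetical codebook of uniformly spaced phase values and a softmax over per-bin logits, and it reads out the phase as the angle of the probability-weighted mean on the unit circle. The names `phase_from_codebook`, `codebook`, and `logits` are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def phase_from_codebook(logits, codebook):
    """Return a phase estimate in radians from logits over a phase codebook.

    Assumption: the estimate is the angle of the expected unit phasor,
    so averaging happens on the circle rather than on raw angles.
    """
    probs = softmax(logits)
    mean_phasor = probs @ np.exp(1j * codebook)
    return np.angle(mean_phasor)

# Hypothetical codebook: 8 phases uniformly spaced in [-pi, pi).
codebook = 2.0 * np.pi * np.arange(8) / 8 - np.pi

# Logits that strongly favor entry 2 give (almost exactly) that phase.
logits = np.full(8, -50.0)
logits[2] = 50.0
estimate = phase_from_codebook(logits, codebook)
```

Averaging phasors instead of raw angles is the design choice that matters here: two candidates at +3.0 and -3.0 radians average to a phase near pi, whereas a plain arithmetic mean of the angles would give 0.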

2. Adversarial training and decoding strategies for end-to-end neural conversation models

Author: John R. Hershey, Bret Harsham, Wen Wang, Koji Yusuke, Takaaki Hori, and Chiori Hori
Subjects: Computer science, media_common.quotation_subject, 02 engineering and technology, Machine learning, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Task (project management), Adversarial system, End-to-end principle, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Conversation, Dialog system, Dialog box, Set (psychology), 010301 acoustics, media_common, business.industry, 020206 networking & telecommunications, Human-Computer Interaction, Artificial intelligence, business, computer, Software, Decoding methods
Abstract: This paper presents adversarial training and decoding methods for neural conversation models that can generate natural responses given dialog contexts. In our prior work, we built several end-to-end conversation systems for the 6th Dialog System Technology Challenges (DSTC6) Twitter help-desk dialog task. These systems included novel extensions of sequence adversarial training, example-based response extraction, and Minimum Bayes-Risk based system combination. In DSTC6, our systems achieved the best performance in most objective measures such as BLEU and METEOR scores and decent performance in a subjective measure based on human rating. In this paper, we provide a complete set of our experiments for DSTC6 and further extend the training and decoding strategies with a greater focus on improving the subjective measure, where we combine responses of three adversarial models. Experimental results demonstrate that the extended methods improve the human rating score and outperform the best score in DSTC6.
Published: 2019
Full Text: View/download PDF

3. Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

Author: Naoyuki Kanda, Maokui He, Shinji Watanabe, Jinyu Li, Zili Huang, Zhuo Chen, Jun Du, Scott Wisdom, John R. Hershey, Pavel Denisov, Desh Raj, Hakan Erdogan, Yi Luo, and Takuya Yoshioka
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Speech recognition, Modular system, 020206 networking & telecommunications, 02 engineering and technology, Computer Science - Sound, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), Error analysis, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, Subtitle, Transcription (software), 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR. Comment: Accepted to IEEE SLT 2021.
Published: 2021
Full Text: View/download PDF

4. End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

Author: Shinji Watanabe, John R. Hershey, Soumi Maiti, Scott Wisdom, Kevin W. Wilson, and Hakan Erdogan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), business.industry, Computer science, Speech recognition, Deep learning, Computer Science - Sound, Data modeling, Machine Learning (cs.LG), Speaker diarisation, Discriminative model, Robustness (computer science), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Loudspeaker, Artificial intelligence, business, Cluster analysis, Electrical Engineering and Systems Science - Audio and Speech Processing, Network model
Abstract: We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data. Comment: 5 pages, 2 figures, ICASSP 2021.
Published: 2021
Full Text: View/download PDF
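The abstract above relies on permutation-invariant cross-entropy losses. As a hedged illustration of the basic idea only, the sketch below handles a fixed, known number of speakers (the variable-number extension described in the abstract is not attempted). It scores every assignment of output channels to reference speakers and keeps the cheapest one. The function names and the brute-force search are assumptions made for illustration.

```python
from itertools import permutations

import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between activity probabilities and 0/1 labels."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def permutation_invariant_bce(pred, target):
    """pred, target: arrays of shape (frames, speakers).

    Tries every ordering of the predicted speaker columns and returns
    (lowest loss, permutation that achieved it).
    """
    n_speakers = pred.shape[1]
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n_speakers)):
        loss = binary_cross_entropy(pred[:, list(perm)], target)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# The model found both speakers but emitted them in swapped order.
target = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pred = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]])
loss, perm = permutation_invariant_bce(pred, target)
```

The exhaustive search grows factorially with the number of speakers, which is acceptable for the handful of speakers in a small example but would need a smarter assignment step at scale.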

5. Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes

Author: Romain Serizel, Hakan Erdogan, Justin Salamon, Nicolas Turpault, John R. Hershey, Scott Wisdom, Eduardo Fonseca, and Prem Seetharaman
Affiliations: Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est; Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Institut National de Recherche en Informatique et en Automatique (Inria); Centre National de la Recherche Scientifique (CNRS); Université de Lorraine (UL); Google Inc; Research at Google; Universitat Pompeu Fabra [Barcelona] (UPF); Descript, Inc.; Adobe Research
Funding: Part of this work was made with the support of the French National Research Agency, in the framework of the project LEAUDS "Learning to understand audio scenes" (ANR-18-CE23-0020), and the French region Grand-Est. High Performance Computing resources were partially provided by the EXPLOR centre hosted by the Université de Lorraine. Grid'5000. ANR-18-CE23-0020, LEAUDS, Apprentissage statistique pour la compréhension de scènes audio (2018).
Subjects: FOS: Computer and information sciences, Sound localization, Sound (cs.SD), Reverberation, Soundscape, Computer science, Speech recognition, 02 engineering and technology, Computer Science - Sound, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), Robustness (computer science), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Sound (geography), synthetic soundscapes, geography, Signal processing, geography.geographical_feature_category, Event (computing), Sound event detection, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Benchmark (computing), sound separation, 020201 artificial intelligence & image processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose a benchmark of state-of-the-art sound event detection (SED) systems. We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time-related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We show that the localization in time of sound events is still a problem for SED systems. We also show that reverberation and non-target sound events severely degrade the performance of the SED systems. In the latter case, sound separation seems like a promising solution.
Published: 2020
Full Text: View/download PDF

6. What's All the FUSS About Free Universal Sound Separation Data?

Author: Romain Serizel, Prem Seetharaman, Justin Salamon, Daniel P. W. Ellis, John R. Hershey, Scott Wisdom, Eduardo Fonseca, Nicolas Turpault, and Hakan Erdogan
Affiliations: Google Inc; Research at Google; Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est; Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Institut National de Recherche en Informatique et en Automatique (Inria); Université de Lorraine (UL); Centre National de la Recherche Scientifique (CNRS); Universitat Pompeu Fabra [Barcelona] (UPF); Adobe Research; Descript, Inc.
Funding: ANR-18-CE23-0020, LEAUDS, Apprentissage statistique pour la compréhension de scènes audio (2018)
Subjects: FOS: Computer and information sciences, Reverberation, Sound (cs.SD), open-source datasets, Computer science, Sound separation, Separation (aeronautics), 02 engineering and technology, Impulse (physics), Computer Science - Sound, Data modeling, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, Open domain, FOS: Electrical engineering, electronic engineering, information engineering, business.industry, Deep learning, deep learning, 020206 networking & telecommunications, Universal sound separation, variable source separation, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 020201 artificial intelligence & image processing, Artificial intelligence, Variable number, business, Algorithm, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.5 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
Published: 2020
Full Text: View/download PDF
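The abstract above reports results as scale-invariant signal-to-noise ratio (SI-SNR) and its improvement over the unprocessed mixture (SI-SNRi). The sketch below shows one common way these quantities are computed; it is not taken from the FUSS code. The zero-mean normalization, the `eps` stabilizer, and the function names are assumptions.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals.

    The reference is rescaled by the least-squares gain before the
    residual is measured, so rescaling the estimate does not change
    the score.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    gain = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = gain * reference
    residual = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(residual, residual) + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

# Synthetic example: the "separated" signal carries a tenth of the interference.
rng = np.random.default_rng(0)
reference = rng.standard_normal(1000)
interference = rng.standard_normal(1000)
mixture = reference + interference
estimate = reference + 0.1 * interference
improvement_db = si_snr_improvement(estimate, mixture, reference)
```

Because the gain is fitted per signal, a separator cannot inflate its score simply by producing louder output.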

7. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Author: Suyoun Kim, John R. Hershey, Shinji Watanabe, Tomoki Hayashi, and Takaaki Hori
Subjects: Network architecture, Artificial neural network, Computer science, Speech recognition, Markov process, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Robustness (computer science), Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, symbols, Beam search, 020201 artificial intelligence & image processing, Language model, Electrical and Electronic Engineering, 0305 other medical science, Hidden Markov model, Decoding methods
Abstract: Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder–decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
Published: 2017
Full Text: View/download PDF
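The abstract above describes combining CTC and attention objectives during training and combining their scores during decoding. A hedged sketch of such a combination follows, assuming a simple linear interpolation with a hypothetical weight `ctc_weight`; the abstract does not state the exact form or value. The paper performs the combination inside a one-pass beam search, whereas this sketch only rescores a finished list of hypotheses.

```python
def multiobjective_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the two training losses (the weight is an assumed value)."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss

def joint_score(ctc_log_prob, attention_log_prob, ctc_weight=0.3):
    """Interpolate the two log-probabilities of one hypothesis."""
    return ctc_weight * ctc_log_prob + (1.0 - ctc_weight) * attention_log_prob

def pick_best(hypotheses, ctc_weight=0.3):
    """hypotheses: list of (text, ctc_log_prob, attention_log_prob) tuples."""
    return max(hypotheses, key=lambda h: joint_score(h[1], h[2], ctc_weight))

# The attention model prefers "a", but CTC penalizes its alignment heavily.
hypotheses = [("a", -5.0, -1.0), ("b", -1.0, -2.0)]
best_text = pick_best(hypotheses, ctc_weight=0.5)[0]
```

Setting `ctc_weight` to 0 recovers attention-only scoring, which makes it easy to see what the CTC term contributes.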

8. Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming

Author: Tsubasa Ochiai, Takaaki Hori, John R. Hershey, Shinji Watanabe, and Xiong Xiao
Subjects: Beamforming, Artificial neural network, Noise measurement, Computer science, Time delay neural network, Speech recognition, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Word error rate, 020206 networking & telecommunications, 02 engineering and technology, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Language model, Electrical and Electronic Engineering, 0305 other medical science, Hidden Markov model
Abstract: This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) to encompass microphone-array signal processing such as a state-of-the-art neural beamformer within the end-to-end framework. Recently, the end-to-end ASR paradigm has attracted great research interest as an alternative to conventional hybrid paradigms with deep neural networks and hidden Markov models. Using this novel paradigm, we simplify ASR architecture by integrating such ASR components as acoustic, phonetic, and language models with a single neural network and optimize the overall components for the end-to-end ASR objective: generating a correct label sequence. Although most existing end-to-end frameworks have mainly focused on ASR in clean environments, our aim is to build more realistic end-to-end systems in noisy environments. To handle such challenging noisy ASR tasks, we study multichannel end-to-end ASR architecture, which directly converts multichannel speech signal to text through speech enhancement. This architecture allows speech enhancement and ASR components to be jointly optimized to improve the end-to-end ASR objective and leads to an end-to-end framework that works well in the presence of strong background noise. We elaborate the effectiveness of our proposed method on the multichannel ASR benchmarks in noisy environments (CHiME-4 and AMI). The experimental results show that our proposed multichannel end-to-end system obtained performance gains over the conventional end-to-end baseline with enhanced inputs from a delay-and-sum beamformer (i.e., BeamformIT) in terms of character error rate. In addition, further analysis shows that our neural beamformer, which is optimized only with the end-to-end ASR objective, successfully learned a noise suppression function.
Published: 2017
Full Text: View/download PDF

9. Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend

Author: Vikramjit Mitra, John R. Hershey, Takaaki Hori, Zhuo Chen, Hakan Erdogan, Shinji Watanabe, and Jonathan Le Roux
Subjects: Beamforming, Artificial neural network, business.industry, Computer science, Speech recognition, Feature extraction, Word error rate, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Theoretical Computer Science, Human-Computer Interaction, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Recurrent neural network, 0202 electrical engineering, electronic engineering, information engineering, Mel-frequency cepstrum, Artificial intelligence, Language model, 0305 other medical science, business, Software
Abstract: This paper gives an in-depth presentation of the multi-microphone speech recognition system we submitted to the 3rd CHiME speech separation and recognition challenge (CHiME-3) and its extension. The proposed system takes advantage of recurrent neural networks (RNNs) throughout the model from the front-end speech enhancement to the language modeling. Three different types of beamforming are used to combine multi-microphone signals to obtain a single higher-quality signal. The beamformed signal is further processed by a single-channel long short-term memory (LSTM) enhancement network, which is used to extract stacked mel-frequency cepstral coefficients (MFCC) features. In addition, the beamformed signal is processed by two proposed noise-robust feature extraction methods. All features are used for decoding in speech recognition systems with deep neural network (DNN) based acoustic models and large-scale RNN language models to achieve high recognition accuracy in noisy environments. Our training methodology includes multi-channel noisy data training and speaker adaptive training, whereas at test time model combination is used to improve generalization. Results on the CHiME-3 benchmark show that the full set of techniques substantially reduced the word error rate (WER). Combining hypotheses from different beamforming and robust-feature systems ultimately achieved 5.05% WER for the real-test data, an 84.7% reduction relative to the baseline of 32.99% WER and a 44.5% reduction from our official CHiME-3 challenge result of 9.1% WER. Furthermore, this final result is better than the best result (5.8% WER) reported in the CHiME-3 challenge.
Published: 2017
Full Text: View/download PDF
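The relative reductions quoted in the abstract above follow directly from the reported word error rates. A small check, using only the figures given in the abstract (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline_wer, improved_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer

final_wer = 5.05      # final system, real-test data
baseline_wer = 32.99  # CHiME-3 baseline
official_wer = 9.1    # authors' official challenge result

vs_baseline = round(relative_reduction(baseline_wer, final_wer), 1)  # 84.7
vs_official = round(relative_reduction(official_wer, final_wer), 1)  # 44.5
```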

10. Prior-based Binary Masking and Discriminative Methods for Reverberant and Noisy Speech Recognition Using Distant Stereo Microphones

Author: Yuuki Tachioka, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux
Subjects: Masking (art), General Computer Science, System combination, Computer science, business.industry, Speech recognition, Binary number, Feature transformation, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Discriminative model, 0202 electrical engineering, electronic engineering, information engineering, Deep neural networks, Artificial intelligence, 0305 other medical science, business
Published: 2017
Full Text: View/download PDF

11. Improving Universal Sound Separation Using Sound Classification

Author: Scott Wisdom, Daniel P. W. Ellis, Efthymios Tzinis, Aren Jansen, and John R. Hershey
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer science, Speech recognition, Machine Learning (stat.ML), 010501 environmental sciences, 01 natural sciences, Oracle, Computer Science - Sound, Domain (software engineering), Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, Classifier (linguistics), Source separation, FOS: Electrical engineering, electronic engineering, information engineering, Sound (geography), 0105 earth and related environmental sciences, Hyperparameter, geography, geography.geographical_feature_category, business.industry, Deep learning, Computer Science::Sound, Artificial intelligence, 0305 other medical science, business, Focus (optics), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic sources from an open domain, regardless of their class. In this paper, we utilize the semantic information learned by sound classifier networks trained on a vast amount of diverse sounds to improve universal sound separation. In particular, we show that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information. This approach is especially useful in an iterative setup, where source estimates from an initial separation stage and their corresponding classifier-derived embeddings are fed to a second separation network. By performing a thorough hyperparameter search consisting of over a thousand experiments, we find that classifier embeddings from oracle clean sources provide nearly one dB of SNR gain, and our best iterative models achieve a significant fraction of this oracle performance, establishing a new state-of-the-art for universal sound separation.
Published: 2019

12. Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

Author: Desh Raj, Shinji Watanabe, Kevin W. Wilson, Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, John R. Hershey, and Zhuo Chen
Subjects: FOS: Computer and information sciences, Beamforming, Sound (cs.SD), Computer Science - Machine Learning, Artificial neural network, Covariance function, business.industry, Computer science, Word error rate, Machine Learning (stat.ML), Context (language use), Pattern recognition, Computer Science - Sound, Machine Learning (cs.LG), Speech enhancement, Signal-to-noise ratio, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, Artificial intelligence, business, Block size, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, including factorizing the spatial covariance into a time-varying amplitude component and a time-invariant spatial component, as well as using block-based techniques. In addition, we introduce a multi-frame beamforming method which improves the results significantly by adding contextual frames to the beamforming formulations. We extensively evaluate and analyze the effects of window size, block size, and multi-frame context size for these methods. Our best method utilizes a sequence of three neural separation and multi-frame time-invariant spatial beamforming stages, and demonstrates an average improvement of 2.75 dB in scale-invariant signal-to-noise ratio and 14.2% absolute reduction in a comparative speech recognition metric across four challenging reverberant speech enhancement and separation tasks. We also use our three-speaker separation model to separate real recordings in the LibriCSS evaluation set into non-overlapping tracks, and achieve a better word error rate as compared to a baseline mask based beamformer. Comment: 7 pages, 7 figures, IEEE SLT 2021 (slt2020.org).
Published: 2019

13. Universal Sound Separation

Author: Ilya Kavalerov, Brian Patton, Kevin W. Wilson, John R. Hershey, Hakan Erdogan, Jonathan Le Roux, and Scott Wisdom
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer science, Speech recognition, Machine Learning (stat.ML), 02 engineering and technology, Computer Science - Sound, Machine Learning (cs.LG), Convolution, 030507 speech-language pathology & audiology, 03 medical and health sciences, Statistics - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Network architecture, Basis (linear algebra), business.industry, Deep learning, Short-time Fourier transform, 020206 networking & telecommunications, Speech enhancement, Task (computing), Artificial intelligence, 0305 other medical science, business, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation. Comment: 5 pages, accepted to WASPAA 2019.
Published: 2019
Full Text: View/download PDF

14. End-to-End Multilingual Multi-Speaker Speech Recognition

Author: John R. Hershey, Shinji Watanabe, Takaaki Hori, Jonathan Le Roux, and Hiroshi Seki
Subjects: End-to-end principle, Computer science, Speech recognition, Encoder decoder, Code-switching
Published: 2019
Full Text: View/download PDF

15. The Phasebook: Building Complex Masks via Discrete Representations for Source Separation

Author: Andy M. Sarroff, Gordon Wichern, Shinji Watanabe, John R. Hershey, and Jonathan Le Roux
Subjects: Noise measurement, business.industry, Computer science, Deep learning, Phase (waves), Signal, Time–frequency analysis, Speech enhancement, Source separation, Artificial intelligence, business, Representation (mathematics), Algorithm, Interpolation
Abstract: Deep learning based speech enhancement and source separation systems have recently reached unprecedented levels of quality, to the point that performance is reaching a new ceiling. Most systems rely on estimating the magnitude of a target source, either directly or by computing a real-valued mask to be applied to a time-frequency representation of the mixture signal. A limiting factor in such approaches is a lack of phase estimation: the phase of the mixture is most often used when reconstructing the estimated time-domain signal. We propose to estimate phase using "phasebook", a new type of layer based on a discrete representation of the phase difference between the mixture and the target. We also introduce "combook", a similar type of layer that directly estimates a complex mask. We present various training and inference schemes involving these representations, and explain in particular how to include them in an end-to-end learning framework. We also present an oracle study to assess upper bounds on performance for various types of masks using discrete phase representations. We evaluate the proposed methods on the wsj0-2mix dataset, a well-studied corpus for single-channel speaker-independent speaker separation, matching the performance of state-of-the-art mask-based approaches without requiring additional phase reconstruction steps.
Published: 2019
Full Text: View/download PDF

16. SDR - half-baked or well done?

Author: Jonathan Le Roux, John R. Hershey, Scott Wisdom, and Hakan Erdogan
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Noise reduction, media_common.quotation_subject, 02 engineering and technology, Data_CODINGANDINFORMATIONTHEORY, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, Signal-to-noise ratio, Interference (communication), Audio and Speech Processing (eess.AS), Distortion, 0202 electrical engineering, electronic engineering, information engineering, Source separation, FOS: Electrical engineering, electronic engineering, information engineering, Quality (business), media_common, Measure (data warehouse), 020206 networking & telecommunications, Speech enhancement, Computer engineering, 0305 other medical science, Communication channel, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their algorithms in a simple, fair, and hopefully insightful way: it attempted to account for channel variations, and to not only evaluate the total distortion in the estimated signal but also split it in terms of various factors such as remaining interference, newly added artifacts, and channel errors. In recent years, hundreds of papers have been relying on this toolkit to evaluate their proposed methods and compare them to previous works, often arguing that differences on the order of 0.1 dB proved the effectiveness of a method over others. We argue here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results. We propose to use a slightly modified definition, resulting in a simpler, more robust measure, called scale-invariant SDR (SI-SDR). We present various examples of critical failure of the original SDR that SI-SDR overcomes.
Published: 2018
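
For readers unfamiliar with the measure named in the abstract above, here is a minimal sketch of SI-SDR in its commonly used form: the reference is rescaled by the least-squares factor alpha = <estimate, reference> / ||reference||^2, and the energy of that scaled target is compared with the energy of the remaining error. The function name `si_sdr` and the plain-list interface are illustrative assumptions, not taken from the paper or from the BSS_eval toolkit, and the sketch assumes both signals are already zero-mean.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (illustrative sketch, not a toolkit API)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal must not be all zeros")
    # Least-squares scaling of the reference toward the estimate.
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]
    target_energy = sum(t * t for t in target)
    error_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    if error_energy == 0:
        return math.inf  # estimate is an exact scaled copy of the reference
    if target_energy == 0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10 * math.log10(target_energy / error_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.5, -1.0, 0.5]
print(si_sdr(estimate, reference))
# Rescaling the estimate leaves the score unchanged.
print(si_sdr([3 * e for e in estimate], reference))
```

Because the target is rescaled before the error is measured, multiplying the estimate by any nonzero constant does not change the score, which is the property the name refers to.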

17. VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Author: Hannah Muckenhirn, Kevin W. Wilson, Quan Wang, Ye Jia, Prashant Sridhar, Ron Weiss, John R. Hershey, Ignacio Lopez Moreno, Rif A. Saurous, and Zelin Wu
Subjects: Masking (art), Signal Processing (eess.SP), FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, Speech recognition, Separation (aeronautics), Machine Learning (stat.ML), Machine Learning (cs.LG), Audio and Speech Processing (eess.AS), Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, Spectrogram, Electrical Engineering and Systems Science - Signal Processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals. To appear in Interspeech 2019
Published: 2018

18. An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech

Author: Shinji Watanabe, Takaaki Hori, Jonathan Le Roux, John R. Hershey, and Hiroshi Seki
Subjects: Flexibility (engineering), Artificial neural network, Language identification, Computer science, Speech recognition, 020206 networking & telecommunications, Speech corpus, 02 engineering and technology, Pronunciation, 01 natural sciences, Mixed language, Rule-based machine translation, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, 010301 acoustics, Utterance
Abstract: End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity to build a monolithic multilingual ASR system with a language-independent neural network architecture. In our previous work, we proposed a monolithic neural network architecture that can recognize multiple languages, and showed its effectiveness compared with conventional language-dependent models. However, the model is not guaranteed to properly handle switches in language within an utterance, thus lacking the flexibility to recognize mixed-language speech such as code-switching. In this paper, we extend our model to enable dynamic tracking of the language within an utterance, and propose a training procedure that takes advantage of a newly created mixed-language speech corpus. Experimental results show that the extended model outperforms both language-dependent models and our previous model without suffering from performance degradation that could be associated with language switching.
Published: 2018
Full Text: View/download PDF

19. Speaker Adaptation for Multichannel End-to-End Speech Recognition

Author: Shigeru Katagiri, John R. Hershey, Takaaki Hori, Shinji Watanabe, and Tsubasa Ochiai
Subjects: Noise measurement, Artificial neural network, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Speech enhancement, End-to-end principle, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Adaptation (computer science), Hidden Markov model, Encoder, Decoding methods
Abstract: Recent work on multichannel end-to-end automatic speech recognition (ASR) has shown that multichannel speech enhancement and speech recognition functions can be integrated into a deep neural network (DNN)-based system, and promising experimental results have been shown using the CHiME-4 and AMI corpora. In other recent DNN-based hidden Markov model (DNN-HMM) hybrid architectures, the effectiveness of speaker adaptation has been well established. Motivated by these results, we propose a multi-path adaptation scheme for end-to-end multichannel ASR, which combines the unprocessed noisy speech features with a speech-enhanced pathway to improve upon previous end-to-end ASR approaches. Experimental results using CHiME-4 show that (1) our proposed multi-path adaptation scheme improves ASR performance and (2) adapting the encoder network is more effective than adapting the neural beamformer, attention mechanism, or decoder network.
Published: 2018
Full Text: View/download PDF

20. End-to-End Multi-Speaker Speech Recognition

Author: Jonathan Le Roux, Shinji Watanabe, John R. Hershey, Shane Settle, and Takaaki Hori
Subjects: Artificial neural network, business.industry, Computer science, Speech recognition, Deep learning, 010501 environmental sciences, 01 natural sciences, Cocktail party effect, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, End-to-end principle, Transcription (linguistics), Task analysis, Artificial intelligence, Isolation (database systems), 0305 other medical science, Cluster analysis, business, 0105 earth and related environmental sciences
Abstract: Current advances in deep learning have resulted in a convergence of methods across a wide range of tasks, opening the door for tighter integration of modules that were previously developed and optimized in isolation. Recent ground-breaking works have produced end-to-end deep network methods for both speech separation and end-to-end automatic speech recognition (ASR). Speech separation methods such as deep clustering address the challenging cocktail-party problem of distinguishing multiple simultaneous speech signals. This is an enabling technology for real-world human machine interaction (HMI). However, speech separation requires ASR to interpret the speech for any HMI task. Likewise, ASR requires speech separation to work in an unconstrained environment. Although these two components can be trained in isolation and connected after the fact, this paradigm is likely to be sub-optimal, since it relies on artificially mixed data. In this paper, we develop the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The joint training framework synergistically adapts the separation and recognition to each other. As an additional benefit, it enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large scale training on existing transcribed data.
Published: 2018
Full Text: View/download PDF

21. Alternative Objective Functions for Deep Clustering

Author: Jonathan Le Roux, John R. Hershey, and Zhong-Qiu Wang
Subjects: Scheme (programming language), Network architecture, Linear programming, business.industry, Computer science, Inference, 020206 networking & telecommunications, 02 engineering and technology, Machine learning, computer.software_genre, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Symmetric matrix, Artificial intelligence, 0305 other medical science, Cluster analysis, business, computer, computer.programming_language
Abstract: The recently proposed deep clustering framework represents a significant step towards solving the cocktail party problem. This study proposes and compares a variety of alternative objective functions for training deep clustering networks. In addition, whereas the original deep clustering work relied on k-means clustering for test-time inference, here we investigate inference methods that are matched to the training objective. Furthermore, we explore the use of an improved chimera network architecture for speech separation, which combines deep clustering with mask-inference networks in a multiobjective training scheme. The deep clustering loss acts as a regularizer while training the end-to-end mask inference network for best separation. With further iterative phase reconstruction, our best proposed method achieves a state-of-the-art 11.5 dB signal-to-distortion ratio (SDR) result on the publicly available wsj0-2mix dataset, with a much simpler architecture than the previous best approach.
Published: 2018
Full Text: View/download PDF

22. Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

Author: John R. Hershey, Jonathan Le Roux, and Zhong-Qiu Wang
Subjects: Microphone array, Computer science, business.industry, Phase (waves), 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Discriminative model, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0305 other medical science, business, Cluster analysis, Spatial analysis
Abstract: The recently-proposed deep clustering algorithm represents a fundamental advance towards solving the cocktail party problem in the single-channel case. When multiple microphones are available, spatial information can be leveraged to differentiate signals from different directions. This study combines spectral and spatial features in a deep clustering framework so that the complementary spectral and spatial information can be simultaneously exploited to improve speech separation. We find that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry. Experiments on a spatialized version of the wsj0-2mix dataset show the strong potential of the proposed algorithm for speech separation in reverberant environments.
Published: 2018
Full Text: View/download PDF

23. Exploring Tradeoffs in Models for Low-latency Speech Enhancement

Author: Michael Chinen, Richard F. Lyon, John R. Hershey, Jeremy Thorpe, Kevin W. Wilson, Rif A. Saurous, Brian Patton, and Jan Skoglund
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Artificial neural network, Computer science, Computation, Speech recognition, Computer Science - Sound, Convolution, Task (project management), Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Spectrogram, Latency (engineering), 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing, Decibel
Abstract: We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and find that zero-look-ahead models can achieve, on average, within 0.03 dB SDR of our best bidirectional model. Further, we find that 200 milliseconds of look-ahead is sufficient to achieve equivalent performance to our best bidirectional model.
Published: 2018
Full Text: View/download PDF

24. Differentiable Consistency Constraints for Improved Deep Speech Enhancement

Author: Jeremy Thorpe, Kevin W. Wilson, Michael Chinen, Brian Patton, Scott Wisdom, John R. Hershey, and Rif A. Saurous
Subjects: Masking (art), FOS: Computer and information sciences, Sound (cs.SD), Artificial neural network, Noise measurement, Computer science, Short-time Fourier transform, 020206 networking & telecommunications, 02 engineering and technology, Computer Science - Sound, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Signal-to-noise ratio, Fourier transform, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, symbols, FOS: Electrical engineering, electronic engineering, information engineering, Spectrogram, Differentiable function, 0305 other medical science, Algorithm, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
Published: 2018
Full Text: View/download PDF
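
As an illustration of the mixture-consistency constraint described in the abstract above (the estimated sources should sum to the input mixture), here is a minimal sketch of one simple projection: the residual between the mixture and the summed estimates is split equally among the sources. The equal split, the function name `project_to_mixture`, and the list-of-lists interface are assumptions made for this example; the abstract does not specify the exact form of the paper's projection layers.

```python
def project_to_mixture(estimates, mixture):
    """Adjust source estimates so they sum to the mixture (illustrative sketch).

    estimates: list of equal-length sequences, one per source.
    mixture:   sequence of the same length (real or complex values).
    """
    n_sources = len(estimates)
    if n_sources == 0:
        raise ValueError("need at least one source estimate")
    if any(len(src) != len(mixture) for src in estimates):
        raise ValueError("every estimate must match the mixture length")
    # Residual the current estimates fail to explain, per sample or bin.
    residual = [
        m - sum(src[i] for src in estimates) for i, m in enumerate(mixture)
    ]
    # Assumption: distribute the residual equally across sources.
    return [
        [s + r / n_sources for s, r in zip(src, residual)]
        for src in estimates
    ]


projected = project_to_mixture([[1.0, 2.0], [3.0, 4.0]], [5.0, 5.0])
print(projected)  # each position now sums to the mixture value
```

The operation uses only addition and scaling, so it is differentiable and works unchanged on complex values such as STFT coefficients.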

25. End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

Author: Jonathan Le Roux, Zhong-Qiu Wang, John R. Hershey, and DeLiang Wang
Subjects: Masking (art), FOS: Computer and information sciences, Sound (cs.SD), Computer science, Phase (waves), Inverse, Machine Learning (stat.ML), 02 engineering and technology, Signal, Computer Science - Sound, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Statistics - Machine Learning, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Computation and Language, Series (mathematics), Short-time Fourier transform, 020206 networking & telecommunications, Function (mathematics), Computer Science - Learning, Fourier transform, symbols, 0305 other medical science, Algorithm, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores reconstruction error introduced by phase inconsistency. In our approach, the loss function is directly defined on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches using the mixture phase for reconstruction, this limitation is less relevant if the estimated magnitudes are to be used together with phase reconstruction. We thus propose several novel activation functions for the output layer of the T-F masking, to allow mask values beyond one. On the publicly-available wsj0-2mix dataset, our approach achieves state-of-the-art 12.6 dB scale-invariant signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new possibilities for deep learning based phase reconstruction and representing fundamental progress towards solving the notoriously-hard cocktail party problem. Comment: Submitted to Interspeech 2018
Published: 2018
Full Text: View/download PDF

26. A Purely End-to-End System for Multi-speaker Speech Recognition

Author: Jonathan Le Roux, John R. Hershey, Shinji Watanabe, Hiroshi Seki, and Takaaki Hori
Subjects: End to end system, Sequence, Training set, Computer science, Speech recognition, 020208 electrical & electronic engineering, Contrast (statistics), 02 engineering and technology, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Source separation, 0305 other medical science
Abstract: Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
Published: 2018
Full Text: View/download PDF

27. Language independent end-to-end architecture for joint language identification and speech recognition

Author: John R. Hershey, Shinji Watanabe, and Takaaki Hori
Subjects: Kanji, Language identification, Rule-based machine translation, Computer science, Character (computing), Katakana, Speech recognition, Character encoding, VoxForge, Hiragana
Abstract: End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity, which we fully exploit in this paper, to build a monolithic multilingual ASR system with a language-independent neural network architecture. We present a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition. The model is based on our hybrid attention/connectionist temporal classification (CTC) architecture which has previously been shown to achieve state-of-the-art performance in several ASR benchmarks. Here we augment its set of output symbols to include the union of character sets appearing in all the target languages. These include Roman and Cyrillic alphabets, Arabic numerals, simplified Chinese, and Japanese Kanji/Hiragana/Katakana characters (5,500 characters in all). This allows training of a single multilingual model, whose parameters are shared across all the languages. The model can jointly identify the language and recognize the speech, automatically formatting the recognized text in the appropriate character set. The experiments, which used speech databases composed of Wall Street Journal (English), Corpus of Spontaneous Japanese, HKUST Mandarin CTS, and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian), demonstrate comparable/superior performance relative to language-dependent end-to-end ASR systems.
Published: 2017
Full Text: View/download PDF

28. Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition

Author: Shinji Watanabe, John R. Hershey, and Takaaki Hori
Subjects: Vocabulary, Computer science, Speech recognition, media_common.quotation_subject, Mandarin Chinese, language.human_language, Character (mathematics), Connectionism, Test set, language, Language model, Hidden Markov model, Word (computer architecture), media_common
Abstract: We propose a combination of character-based and word-based language models in an end-to-end automatic speech recognition (ASR) architecture. In our prior work, we combined a character-based LSTM RNN-LM with a hybrid attention/connectionist temporal classification (CTC) architecture. The character LMs improved recognition accuracy to rival state-of-the-art DNN/HMM systems in Japanese and Mandarin Chinese tasks. Although a character-based architecture can provide for open vocabulary recognition, the character-based LMs generally under-perform relative to word LMs for languages such as English with a small alphabet, because of the difficulty of modeling linguistic constraints across long sequences of characters. This paper presents a novel method for end-to-end ASR decoding with LMs at both the character and word level. Hypotheses are first scored with the character-based LM until a word boundary is encountered. Known words are then re-scored using the word-based LM, while the character-based LM provides for out-of-vocabulary scores. In a standard Wall Street Journal (WSJ) task, we achieved 5.6% WER for the Eval'92 test set using only the SI284 training set and WSJ text data, which is the best score reported for end-to-end ASR systems on this benchmark.
Published: 2017
Full Text: View/download PDF

29. Early and late integration of audio features for automatic video description

Author: Chiori Hori, John R. Hershey, Tim K. Marks, and Takaaki Hori
Subjects: Closed captioning, Audio signal, Artificial neural network, Computer science, business.industry, Deep learning, Speech recognition, Question answering, Artificial intelligence, Mel-frequency cepstrum, business, Fusion mechanism, Natural language
Abstract: This paper presents our approach to improve video captioning by integrating audio and video features. Video captioning is the task of generating a textual description to describe the content of a video. State-of-the-art approaches to video captioning are based on sequence-to-sequence models, in which a single neural network accepts sequential images and audio data, and outputs a sequence of words that best describe the input data in natural language. The network thus learns to encode the video input into an intermediate semantic representation, which can be useful in applications such as multimedia indexing, automatic narration, and audio-visual question answering. In our prior work, we proposed an attention-based multi-modal fusion mechanism to integrate image, motion, and audio features, where the multiple features are integrated in the network. Here, we apply hypothesis-level integration based on minimum Bayes-risk (MBR) decoding to further improve the caption quality, focusing on well-known evaluation metrics (BLEU and METEOR scores). Experiments with the YouTube2Text and MSR-VTT datasets demonstrate that combinations of early and late integration of multimodal features significantly improve the audio-visual semantic representation, as measured by the resulting caption quality. In addition, we compared the performance of our method using two different types of audio features: MFCC features, and the audio features extracted using SoundNet, which was trained to recognize objects and scenes from videos using only the audio signals.
Published: 2017
Full Text: View/download PDF

30. Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

Author: Zhong Meng, Hakan Erdogan, John R. Hershey, and Shinji Watanabe
Subjects: FOS: Computer and information sciences, Beamforming, Sound (cs.SD), Computer science, Speech recognition, 02 engineering and technology, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Environmental noise, Voice activity detection, Computer Science - Computation and Language, Noise measurement, business.industry, Deep learning, Acoustic model, 020206 networking & telecommunications, Speech processing, Filter design, Recurrent neural network, Artificial intelligence, 0305 other medical science, business, Computation and Language (cs.CL), Adaptive beamformer, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions, which results in a set of time-varying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3 real evaluation set. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Published: 2017

31. Attention-Based Multimodal Fusion for Video Description

Author: Kazuhiko Sumi, Chiori Hori, Ziming Zhang, Teng-Yok Lee, Bret Harsham, John R. Hershey, Takaaki Hori, and Tim K. Marks
Subjects: Artificial neural network, Computer science, business.industry, Concatenation, Feature extraction, Cognitive neuroscience of visual object recognition, Pattern recognition, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Recurrent neural network, Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Relevance (information retrieval), Artificial intelligence, 0305 other medical science, business, Word (computer architecture)
Abstract: Current methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has demonstrated the advantages of integrating temporal attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. Such methods typically use two different types of features: image features (from an object classification model), and motion features (from an action recognition model), combined by naive concatenation in the model input. Because different feature modalities may carry task-relevant information at different times, fusing them by naive concatenation may limit the model's ability to dynamically determine the relevance of each type of feature to different parts of the description. In this paper, we incorporate audio features in addition to the image and motion features. To fuse these three modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. Combining our new multimodal attention model with standard temporal attention outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
Published: 2017
Full Text: View/download PDF

32. Student-teacher network learning with enhanced features

Author: Jonathan Le Roux, John R. Hershey, Shinji Watanabe, and Takaaki Hori
Subjects: Training set, Voice activity detection, Noise measurement, Computer science, Speech recognition, Student teacher, 02 engineering and technology, Speech processing, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0305 other medical science
Abstract: Recent advances in distant-talking ASR research have confirmed that speech enhancement is an essential technique for improving the ASR performance, especially in the multichannel scenario. However, speech enhancement inevitably distorts speech signals, which can cause significant degradation when enhanced signals are used as training data. Thus, distant-talking ASR systems often resort to using the original noisy signals as training data and the enhanced signals only at test time, and give up on taking advantage of enhancement techniques in the training stage. This paper proposes to make use of enhanced features in the student-teacher learning paradigm. The enhanced features are used as input to a teacher network to obtain soft targets, while a student network tries to mimic the teacher network's outputs using the original noisy features as input, so that speech enhancement is implicitly performed within the student network. Compared with conventional student-teacher learning, which uses a better network as teacher, the proposed self-supervised method uses better (enhanced) inputs to a teacher. This setup matches the above scenario of making use of enhanced features in network training. Experiments with the CHiME-4 challenge real dataset show significant ASR improvements with an error reduction rate of 12% in the single-channel track and 15% in the 2-channel track, respectively, by using 6-channel beamformed features for the teacher model.
Published: 2017
Full Text: View/download PDF
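
Note: the student-teacher setup described in the abstract above can be sketched as a soft-target cross entropy. This is only an illustration under stated assumptions, not the authors' implementation: the function names and the example logits are hypothetical, and the abstract does not specify the exact loss. The teacher is assumed to see enhanced features and the student the original noisy features.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits):
    """Cross entropy of the student posteriors against the teacher's soft targets."""
    targets = softmax(teacher_logits)
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


# Hypothetical per-frame logits over three output classes.
teacher_logits = [2.0, 0.5, -1.0]  # assumed: computed from enhanced features
student_logits = [1.0, 0.8, -0.5]  # assumed: computed from noisy features
loss = soft_target_loss(teacher_logits, student_logits)
```

The loss is smallest when the student's posteriors match the teacher's, which is the sense in which the student "mimics" the teacher.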

33. Toolkits for Robust Speech Processing

Author: Shinji Watanabe, Florian Metze, Yajie Miao, Marc Delcroix, John R. Hershey, and Takaaki Hori
Subjects: Microphone array, Computer science, business.industry, Deep learning, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Speech processing, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Speech enhancement, Resource (project management), Human–computer interaction, Web page, Artificial intelligence, Language model, business, License
Abstract: Recent robust automatic speech recognition (ASR) techniques have been developed rapidly due to the demand placed on ASR applications in real environments, with the help of publicly available tools developed in the community. This chapter overviews major toolkits available for robust ASR, covering general ASR toolkits, language model toolkits, speech enhancement/microphone array front-end toolkits, deep learning toolkits, and emergent end-to-end ASR toolkits. The aim of this chapter is to provide information about functionalities (features, functions, platform, and language), license, and source location so that readers can easily access such tools to build their own robust ASR systems. Some of the toolkits have actually been used to build state-of-the-art ASR systems for various challenging tasks. The references in this chapter also include the URLs of the resource webpages.
Published: 2017
Full Text: View/download PDF

34. New Era for Robust Speech Recognition

Author: Marc Delcroix, Florian Metze, Shinji Watanabe, and John R. Hershey
Subjects: Computer science, Speech recognition
Published: 2017
Full Text: View/download PDF

35. Joint CTC/attention decoding for end-to-end speech recognition

Author: Takaaki Hori, John R. Hershey, and Shinji Watanabe
Subjects: Markov chain, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Pronunciation, Mandarin Chinese, language.human_language, 030507 speech-language pathology & audiology, 03 medical and health sciences, Tokenization (data security), 0202 electrical engineering, electronic engineering, information engineering, language, 0305 other medical science, Hidden Markov model, Joint (audio engineering), Decoding methods
Abstract: End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as a pronunciation dictionary, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding. We have applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and show performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.
Published: 2017
Full Text: View/download PDF

36. Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition

Author: Liang Lu, Michael L. Seltzer, Shinji Watanabe, Xiong Xiao, Michael I. Mandel, John R. Hershey, Guoguo Chen, Yu Zhang, Hakan Erdogan, and Dong Yu
Subjects: Beamforming, Microphone array, Artificial neural network, business.industry, Time delay neural network, Computer science, Speech recognition, Pipeline (computing), Pattern recognition, Speech enhancement, Discriminative model, Frequency domain, Artificial intelligence, business
Abstract: Speech-processing systems such as automatic speech recognition (ASR) usually consist of a large number of steps to accomplish their tasks. Due to the long processing pipeline, the processing steps are usually designed to optimize cost functions that are not directly related to the task, leading to suboptimal performance. In this chapter, we introduce a beamforming (BF) network to perform spatial filtering that is optimal for the ASR task. The BF network takes in array signals and predicts the optimal beamforming parameters in the frequency domain, assuming that the array geometry does not change. The network consists of both deterministic processing steps and trainable steps realized by neural networks and trained to minimize the cross-entropy cost function of ASR. In our experiments, the BF network is trained with both artificially generated and real microphone array signals. On the AMI meeting transcription task, we found that the trained BF network produces competitive ASR results compared to traditional delay-and-sum beamforming on unseen array signals.
Published: 2017
Full Text: View/download PDF

37. Novel Deep Architectures in Speech Processing

Author: Jonathan Le Roux, Scott Wisdom, Yusuf Ziya Isik, Shinji Watanabe, Zhuo Chen, and John R. Hershey
Subjects: Computer science, business.industry, Inference, Machine learning, computer.software_genre, Speech processing, Spectral clustering, Non-negative matrix factorization, Discriminative model, Problem domain, Artificial intelligence, Cluster analysis, business, Adaptation (computer science), computer
Abstract: Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model. In addition, unsupervised inference tasks such as adaptation and clustering are handled in a natural way. However, these benefits typically come at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, and discriminative training is relatively easy. However, their typically generic architectures often make it unclear how to incorporate specific problem knowledge or to perform flexible tasks such as unsupervised inference. This chapter introduces frameworks to provide the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and reinterpret inference iterations as layers in a deep network, while generalizing the parametrization to create a more powerful network. We show how such frameworks yield new understanding of conventional networks, and how they can result in novel networks for speech processing, including networks based on nonnegative matrix factorization, complex Gaussian microphone array signal processing, and a network inspired by efficient spectral clustering. We then discuss what has been learned in recent work and provide a prospectus for future research in this area.
Published: 2017
Full Text: View/download PDF

38. Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio

Author: Jonathan Le Roux, Shinji Watanabe, John R. Hershey, and Hakan Erdogan
Subjects: Channel (digital image), business.industry, Computer science, Speech recognition, Separation (aeronautics), 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Non-negative matrix factorization, Speech enhancement, Recurrent neural network, Computer Science::Sound, 0202 electrical engineering, electronic engineering, information engineering, Deep neural networks, Artificial intelligence, business
Abstract: We investigate the use of deep neural networks and deep recurrent neural networks for separation and recognition of speech in challenging environments. Mask prediction networks received considerable interest recently for speech separation and speech enhancement problems where the background signals are nonstationary and challenging. Initial signal-level enhancement with deep neural networks has also been shown to be useful for noise-robust speech recognition in these environments. We consider using various loss functions for training the networks and illustrate differences among them. We compare the performance of deep computational architectures with conventional statistical techniques as well as variants of nonnegative matrix factorization, and establish that one can achieve impressively superior results with deep-learning-based techniques on this problem.
Published: 2017
Full Text: View/download PDF

39. Dialog state tracking with attention-based sequence-to-sequence learning

Author: Takaaki Hori, Bret Harsham, Jonathan Le Roux, Shinji Watanabe, Koji Yusuke, Chiori Hori, John R. Hershey, Yi Jing, Takeyuki Aikawa, Zhaocheng Zhu, and Hai Wang
Subjects: BitTorrent tracker, business.industry, Computer science, Speech recognition, Frame (networking), Tracking system, 02 engineering and technology, 010501 environmental sciences, computer.software_genre, 01 natural sciences, Set (abstract data type), 020204 information systems, Test set, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Sequence learning, Pattern matching, Dialog box, business, computer, Natural language processing, 0105 earth and related environmental sciences
Abstract: We present an advanced dialog state tracking system designed for the 5th Dialog State Tracking Challenge (DSTC5). The main task of DSTC5 is to track the dialog state in a human-human dialog. For each utterance, the tracker emits a frame of slot-value pairs considering the full history of the dialog up to the current turn. Our system includes an encoder-decoder architecture with an attention mechanism to map an input word sequence to a set of semantic labels, i.e., slot-value pairs. This handles the problem of the unknown alignment between the utterances and the labels. By combining the attention-based tracker with rule-based trackers elaborated for English and Chinese, the F-score for the development set improved from 0.475 to 0.507 compared to the rule-only trackers. Moreover, we achieved 0.517 F-score by refining the combination strategy based on the topic and slot level performance of each tracker. In this paper, we also validate the efficacy of each technique and report the test set results submitted to the challenge.
Published: 2016
Full Text: View/download PDF

40. Context-Sensitive and Role-Dependent Spoken Language Understanding Using Bidirectional and Attention LSTMs

Author: John R. Hershey, Takaaki Hori, Chiori Hori, and Shinji Watanabe
Subjects: Computer science, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, Context (language use), 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Linguistics, 0105 earth and related environmental sciences, Spoken language
Published: 2016
Full Text: View/download PDF

41. Single-Channel Multi-Speaker Separation Using Deep Clustering

Author: Jonathan Le Roux, Yusuf Ziya Isik, Zhuo Chen, Shinji Watanabe, and John R. Hershey
Subjects: business.industry, Computer science, Speech recognition, Deep learning, Word error rate, 020207 software engineering, 02 engineering and technology, Signal, Cocktail party effect, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Spectrogram, Segmentation, Artificial intelligence, 0305 other medical science, Cluster analysis, business, Communication channel
Abstract: Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal to distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
Published: 2016
Full Text: View/download PDF

42. Minimum word error training of long short-term memory recurrent neural network language models for speech recognition

Author: John R. Hershey, Chiori Hori, Takaaki Hori, and Shinji Watanabe
Subjects: Sequence, Computer science, Speech recognition, Word error rate, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, 0302 clinical medicine, Cross entropy, Discriminative model, Language model, 0305 other medical science, 030217 neurology & neurosurgery, Word (computer architecture)
Abstract: This paper describes minimum word error (MWE) training of recurrent neural network language models (RNNLMs) for speech recognition. RNNLMs are usually trained to minimize the cross entropy of estimated word probabilities against the correct word sequence, which corresponds to the maximum likelihood criterion. However, this training does not necessarily maximize a performance measure in a target task, i.e., it does not explicitly minimize word error rate (WER) in speech recognition. To solve this problem, several discriminative training methods have already been proposed for n-gram language models, but those for RNNLMs have not been sufficiently investigated. In this paper, we propose an MWE training method for RNNLMs, and report significant WER reductions when we applied the MWE method to a standard Elman-type RNNLM and a more advanced model, a Long Short-Term Memory (LSTM) RNNLM. We also present efficient MWE training with N-best lists on Graphics Processing Units (GPUs).
Published: 2016
Full Text: View/download PDF

43. Deep beamforming networks for multi-channel speech recognition

Author: John R. Hershey, Guoguo Chen, Yu Zhang, Michael I. Mandel, Liang Lu, Shinji Watanabe, Michael L. Seltzer, Xiong Xiao, Dong Yu, and Hakan Erdogan
Subjects: Beamforming, Reverberation, Computer science, Microphone, Time delay neural network, Speech recognition, Feature extraction, Word error rate, Direction of arrival, 020206 networking & telecommunications, 02 engineering and technology, Speech processing, 030507 speech-language pathology & audiology, 03 medical and health sciences, Filter design, Noise, Computer Science::Sound, 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science
Abstract: Despite the significant progress in speech recognition enabled by deep neural networks, poor performance persists in some scenarios. In this work, we focus on far-field speech recognition which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing including beam forming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beam former are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are then extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beam forming and acoustic modeling networks are trained jointly using back-propagation with a common cross entropy objective function. In experiments on the AMI meeting corpus, we observed improvements by pre-training each sub-network with a network-specific objective function before joint training of both networks. The proposed method obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
Published: 2016
Full Text: View/download PDF

44. Deep clustering: Discriminative embeddings for segmentation and separation

Author: John R. Hershey, Shinji Watanabe, Zhuo Chen, and Jonathan Le Roux
Subjects: FOS: Computer and information sciences, Computer science, Machine Learning (stat.ML), 02 engineering and technology, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Discriminative model, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Segmentation, Neural and Evolutionary Computing (cs.NE), Cluster analysis, business.industry, Deep learning, Computer Science - Neural and Evolutionary Computing, 020206 networking & telecommunications, Pattern recognition, Spectral clustering, Computer Science - Learning, Computer Science::Sound, Spectrogram, Embedding, Noise (video), Artificial intelligence, 0305 other medical science, business
Abstract: We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6 dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. Comment: Originally submitted on June 5, 2015
Published: 2016
Full Text: View/download PDF
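
Note: the "low-rank approximation to an ideal pairwise affinity matrix" mentioned in the abstract above can be illustrated with a small sketch. It assumes the objective takes the Frobenius form ||V V^T - Y Y^T||^2, where V holds one embedding per time-frequency bin and Y holds one-hot partition labels; the abstract itself does not spell out the formula, and all names and values here are hypothetical. The expansion used avoids building the N x N affinity matrices.

```python
def matmul_t(a, b):
    """Return a^T b for two row-major matrices with the same number of rows."""
    cols_a, cols_b = len(a[0]), len(b[0])
    return [
        [sum(ra[i] * rb[j] for ra, rb in zip(a, b)) for j in range(cols_b)]
        for i in range(cols_a)
    ]


def frob_sq(m):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in m for x in row)


def affinity_loss(V, Y):
    """||V V^T - Y Y^T||_F^2 via ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2."""
    return (
        frob_sq(matmul_t(V, V))
        - 2.0 * frob_sq(matmul_t(V, Y))
        + frob_sq(matmul_t(Y, Y))
    )


# Three time-frequency bins, two sources (toy, hypothetical values).
Y = [[1, 0], [1, 0], [0, 1]]       # one-hot labels: bins 0 and 1 share a source
V_good = [[1, 0], [1, 0], [0, 1]]  # embeddings that reproduce the ideal affinities
V_bad = [[1, 0], [0, 1], [0, 1]]   # embeddings that group bin 1 with bin 2
```

With these toy inputs, `affinity_loss(V_good, Y)` is zero and `affinity_loss(V_bad, Y)` is positive; a simple clustering step such as k-means on the rows of V would then recover the partition.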

45. Deep unfolding for multichannel source separation

Author: Scott Wisdom, Shinji Watanabe, John R. Hershey, and Jonathan Le Roux
Subjects: Network architecture, Markov random field, Computer science, business.industry, Inference, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Mixture model, 030507 speech-language pathology & audiology, 03 medical and health sciences, Generative model, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Artificial intelligence, 0305 other medical science, business
Abstract: Deep unfolding has recently been proposed to derive novel deep network architectures from model-based approaches. In this paper, we consider its application to multichannel source separation. We unfold a multichannel Gaussian mixture model (MCGMM), resulting in a deep MCGMM computational network that directly processes complex-valued frequency-domain multichannel audio and has an architecture defined explicitly by a generative model, thus combining the advantages of deep networks and model-based approaches. We further extend the deep MCGMM by modeling the GMM states using a Markov random field (MRF), whose unfolded mean-field inference updates add dynamics across layers. Experiments on source separation for multichannel mixtures of two simultaneous speakers show that the deep MCGMM leads to improved performance with respect to the original MCGMM model.
Published: 2016
Full Text: View/download PDF

46. Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks

Author: Hakan Erdogan, Shinji Watanabe, John R. Hershey, Michael I. Mandel, and Jonathan Le Roux
Subjects: Beamforming, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Artificial neural network, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science, Communication channel
Abstract: Recent studies on multi-microphone speech databases indicate that it is beneficial to perform beamforming to improve speech recognition accuracies, especially when there is a high level of background noise. Minimum variance distortionless response (MVDR) beamforming is an important beamforming method that performs quite well for speech recognition purposes especially if the steering vector is known. However, steering the beamformer to focus on speech in unknown acoustic conditions remains a challenging problem. In this study, we use single channel speech enhancement deep networks to form masks that can be used for noise spatial covariance estimation, which steers the MVDR beamforming toward the speech. We analyze how mask prediction affects performance and also discuss various ways to use masks to obtain the speech and noise spatial covariance estimates in a reliable way. We show that using a single mask across microphones for covariance prediction with minima-limited post-masking yields the best result in terms of signal-level quality measures and speech recognition word error rates in a mismatched training condition.
Published: 2016

47. Deep Clustering and Conventional Networks for Music Separation: Stronger Together

Author: Jonathan Le Roux, John R. Hershey, Yi Luo, Nima Mesgarani, and Zhuo Chen
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Speech recognition, Correlation clustering, Machine Learning (stat.ML), 02 engineering and technology, Regularization (mathematics), Computer Science - Sound, Article, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Cluster analysis, Network architecture, business.industry, Deep learning, 020206 networking & telecommunications, Pattern recognition, Computer Science - Learning, Embedding, Spectrogram, Artificial intelligence, 0305 other medical science, business
Abstract: Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known about its effectiveness in other challenging situations such as music source separation. Contrary to conventional networks that directly estimate the source signals, deep clustering generates an embedding for each time-frequency bin, and separates sources by clustering the bins in the embedding space. We show that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation, presumably because its more flexible objective engenders better regularization. Since the strengths of deep clustering and conventional network architectures appear complementary, we explore combining them in a single hybrid network trained via an approach akin to multi-task learning. Remarkably, the combination significantly outperforms either of its components. Comment: Published in ICASSP 2017
Published: 2016
Full Text: View/download PDF

48. Hidden Markov Acoustic Modeling With Bootstrap and Restructuring for Low-Resourced Languages

Author: Pierre L. Dognin, Xiaodong Cui, Peder A. Olsen, Jian Xue, Xin Chen, Bowen Zhou, Upendra V. Chaudhari, and John R. Hershey
Subjects: Vocabulary, Acoustics and Ultrasonics, Restructuring, Computer science, business.industry, Speech recognition, Gaussian, media_common.quotation_subject, Machine learning, computer.software_genre, Data modeling, symbols.namesake, ComputingMethodologies_PATTERNRECOGNITION, Computer Science::Sound, symbols, Artificial intelligence, Electrical and Electronic Engineering, Hidden Markov model, business, Cluster analysis, computer, Gaussian process, Decoding methods, media_common
Abstract: This paper proposes an acoustic modeling approach based on bootstrap and restructuring to deal with data sparsity for low-resourced languages. The goal of the approach is to improve the statistical reliability of acoustic modeling for automatic speech recognition (ASR) in the context of speed, memory and response latency requirements for real-world applications. In this approach, randomized hidden Markov models (HMMs) estimated from the bootstrapped training data are aggregated for reliable sequence prediction. The aggregation leads to an HMM with superior prediction capability at the cost of a substantially larger size. For practical usage the aggregated HMM is restructured by Gaussian clustering followed by model refinement. The restructuring aims at reducing the aggregated HMM to a desirable model size while maintaining its performance close to the original aggregated HMM. To that end, various Gaussian clustering criteria and model refinement algorithms have been investigated in the full covariance model space before the conversion to the diagonal covariance model space in the last stage of the restructuring. Large vocabulary continuous speech recognition (LVCSR) experiments on Pashto and Dari have shown that acoustic models obtained by the proposed approach can yield superior performance over the conventional training procedure with almost the same run-time memory consumption and decoding speed.
Published: 2012
Full Text: View/download PDF

49. Tracking Motion, Deformation, and Texture Using Conditionally Gaussian Processes

Author: Javier R. Movellan, John R. Hershey, and Tim K. Marks
Subjects: Computer science, Facial motion capture, Movement, Gaussian, Normal Distribution, Video Recording, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Optical flow, Image processing, Facial recognition system, Pattern Recognition, Automated, symbols.namesake, Image texture, Artificial Intelligence, Motion estimation, Image Processing, Computer-Assisted, Filtering problem, Humans, Computer vision, Stochastic Processes, Facial expression, Ground truth, business.industry, Applied Mathematics, Template matching, Object detection, Computational Theory and Mathematics, Face, Video tracking, symbols, Computer Vision and Pattern Recognition, Artificial intelligence, business, Algorithms, Software
Abstract: We present a generative model and inference algorithm for 3D nonrigid object tracking. The model, which we call G-flow, enables the joint inference of 3D position, orientation, and nonrigid deformations, as well as object texture and background texture. Optimal inference under G-flow reduces to a conditionally Gaussian stochastic filtering problem. The optimal solution to this problem reveals a new space of computer vision algorithms, of which classic approaches such as optic flow and template matching are special cases that are optimal only under special circumstances. We evaluate G-flow on the problem of tracking facial expressions and head motion in 3D from single-camera video. Previously, the lack of realistic video data with ground truth nonrigid position information has hampered the rigorous evaluation of nonrigid tracking. We introduce a practical method of obtaining such ground truth data and present a new face video data set that was created using this technique. Results on this data set show that G-flow is much more robust and accurate than current deterministic optic-flow-based approaches.
Published: 2010
Full Text: View/download PDF

50. Monaural speech separation and recognition challenge

Author: Steven J. Rennie, John R. Hershey, and Martin Cooke
Subjects: Auditory scene analysis, auditory scene analysis, Computer science, Speech recognition, Monaural, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Task (project management), noise robustness, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0103 physical sciences, simultaneous speech, 010301 acoustics, speaker identification, business.industry, speech recognition, Speech processing, Human-Computer Interaction, speech separation, Artificial intelligence, Noise (video), Computational linguistics, 0305 other medical science, business, computer, Software, Natural language processing, Test data, Communication channel
Abstract: Robust speech recognition in everyday conditions requires the solution to a number of challenging problems, not least the ability to handle multiple sound sources. The specific case of speech recognition in the presence of a competing talker has been studied for several decades, resulting in a number of quite distinct algorithmic solutions whose focus ranges from modeling both target and competing speech to speech separation using auditory grouping principles. The purpose of the monaural speech separation and recognition challenge was to permit a large-scale comparison of techniques for the competing talker problem. The task was to identify keywords in sentences spoken by a target talker when mixed into a single channel with a background talker speaking similar sentences. Ten independent sets of results were contributed, alongside a baseline recognition system. Performance was evaluated using common training and test data and common metrics. Listeners' performance in the same task was also measured. This paper describes the challenge problem, compares the performance of the contributed algorithms, and discusses the factors which distinguish the systems. One highlight of the comparison was the finding that several systems achieved near-human performance in some conditions, and one out-performed listeners overall.
Published: 2010
Full Text: View/download PDF
