Author: "John R. Hershey" / Topic: 03 medical and health sciences - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"John R. Hershey"' showing total 27 results

1. Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

Author: Naoyuki Kanda, Maokui He, Shinji Watanabe, Jinyu Li, Zili Huang, Zhuo Chen, Jun Du, Scott Wisdom, John R. Hershey, Pavel Denisov, Desh Raj, Hakan Erdogan, Yi Luo, and Takuya Yoshioka
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Speech recognition, Modular system, 020206 networking & telecommunications, 02 engineering and technology, Computer Science - Sound, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), Error analysis, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, Subtitle, Transcription (software), 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
Comment: Accepted to IEEE SLT 2021
Published: 2021
Full Text: View/download PDF

2. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Author: Suyoun Kim, John R. Hershey, Shinji Watanabe, Tomoki Hayashi, and Takaaki Hori
Subjects: Network architecture, Artificial neural network, Computer science, Speech recognition, Markov process, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Robustness (computer science), Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, symbols, Beam search, 020201 artificial intelligence & image processing, Language model, Electrical and Electronic Engineering, 0305 other medical science, Hidden Markov model, Decoding methods
Abstract: Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder–decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
Published: 2017
Full Text: View/download PDF

3. Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming

Author: Tsubasa Ochiai, Takaaki Hori, John R. Hershey, Shinji Watanabe, and Xiong Xiao
Subjects: Beamforming, Artificial neural network, Noise measurement, Computer science, Time delay neural network, Speech recognition, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Word error rate, 020206 networking & telecommunications, 02 engineering and technology, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Language model, Electrical and Electronic Engineering, 0305 other medical science, Hidden Markov model
Abstract: This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) to encompass microphone-array signal processing such as a state-of-the-art neural beamformer within the end-to-end framework. Recently, the end-to-end ASR paradigm has attracted great research interest as an alternative to conventional hybrid paradigms with deep neural networks and hidden Markov models. Using this novel paradigm, we simplify ASR architecture by integrating such ASR components as acoustic, phonetic, and language models with a single neural network and optimize the overall components for the end-to-end ASR objective: generating a correct label sequence. Although most existing end-to-end frameworks have mainly focused on ASR in clean environments, our aim is to build more realistic end-to-end systems in noisy environments. To handle such challenging noisy ASR tasks, we study multichannel end-to-end ASR architecture, which directly converts multichannel speech signal to text through speech enhancement. This architecture allows speech enhancement and ASR components to be jointly optimized to improve the end-to-end ASR objective and leads to an end-to-end framework that works well in the presence of strong background noise. We elaborate the effectiveness of our proposed method on the multichannel ASR benchmarks in noisy environments (CHiME-4 and AMI). The experimental results show that our proposed multichannel end-to-end system obtained performance gains over the conventional end-to-end baseline with enhanced inputs from a delay-and-sum beamformer (i.e., BeamformIT) in terms of character error rate. In addition, further analysis shows that our neural beamformer, which is optimized only with the end-to-end ASR objective, successfully learned a noise suppression function.
Published: 2017
Full Text: View/download PDF

4. Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend

Author: Vikramjit Mitra, John R. Hershey, Takaaki Hori, Zhuo Chen, Hakan Erdogan, Shinji Watanabe, and Jonathan Le Roux
Subjects: Beamforming, Artificial neural network, business.industry, Computer science, Speech recognition, Feature extraction, Word error rate, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Theoretical Computer Science, Human-Computer Interaction, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Recurrent neural network, 0202 electrical engineering, electronic engineering, information engineering, Mel-frequency cepstrum, Artificial intelligence, Language model, 0305 other medical science, business, Software
Abstract: This paper gives an in-depth presentation of the multi-microphone speech recognition system we submitted to the 3rd CHiME speech separation and recognition challenge (CHiME-3) and its extension. The proposed system takes advantage of recurrent neural networks (RNNs) throughout the model from the front-end speech enhancement to the language modeling. Three different types of beamforming are used to combine multi-microphone signals to obtain a single higher-quality signal. The beamformed signal is further processed by a single-channel long short-term memory (LSTM) enhancement network, which is used to extract stacked mel-frequency cepstral coefficients (MFCC) features. In addition, the beamformed signal is processed by two proposed noise-robust feature extraction methods. All features are used for decoding in speech recognition systems with deep neural network (DNN) based acoustic models and large-scale RNN language models to achieve high recognition accuracy in noisy environments. Our training methodology includes multi-channel noisy data training and speaker adaptive training, whereas at test time model combination is used to improve generalization. Results on the CHiME-3 benchmark show that the full set of techniques substantially reduced the word error rate (WER). Combining hypotheses from different beamforming and robust-feature systems ultimately achieved 5.05% WER for the real-test data, an 84.7% reduction relative to the baseline of 32.99% WER and a 44.5% reduction from our official CHiME-3 challenge result of 9.1% WER. Furthermore, this final result is better than the best result (5.8% WER) reported in the CHiME-3 challenge.
Published: 2017
Full Text: View/download PDF
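
Note on record 4: the relative reductions quoted in the abstract can be checked from the WER figures it gives. The short sketch below assumes the usual definition of relative reduction, (baseline - new) / baseline; the function name `relative_reduction` is illustrative and does not come from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, assuming (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER figures as quoted in the abstract (percent).
final_wer = 5.05
baseline_wer = 32.99
challenge_wer = 9.1

# Reduction relative to the baseline system.
print(round(relative_reduction(baseline_wer, final_wer), 1))
# Reduction relative to the official challenge submission.
print(round(relative_reduction(challenge_wer, final_wer), 1))
```

Under that definition the two values come out to 84.7 and 44.5, matching the percentages stated in the abstract.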

5. Improving Universal Sound Separation Using Sound Classification

Author: Scott Wisdom, Daniel P. W. Ellis, Efthymios Tzinis, Aren Jansen, and John R. Hershey
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer science, Speech recognition, Machine Learning (stat.ML), 010501 environmental sciences, 01 natural sciences, Oracle, Computer Science - Sound, Domain (software engineering), Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, Classifier (linguistics), Source separation, FOS: Electrical engineering, electronic engineering, information engineering, Sound (geography), 0105 earth and related environmental sciences, Hyperparameter, geography, geography.geographical_feature_category, business.industry, Deep learning, Computer Science::Sound, Artificial intelligence, 0305 other medical science, business, Focus (optics), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic sources from an open domain, regardless of their class. In this paper, we utilize the semantic information learned by sound classifier networks trained on a vast amount of diverse sounds to improve universal sound separation. In particular, we show that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information. This approach is especially useful in an iterative setup, where source estimates from an initial separation stage and their corresponding classifier-derived embeddings are fed to a second separation network. By performing a thorough hyperparameter search consisting of over a thousand experiments, we find that classifier embeddings from oracle clean sources provide nearly one dB of SNR gain, and our best iterative models achieve a significant fraction of this oracle performance, establishing a new state-of-the-art for universal sound separation.
Published: 2019

6. Universal Sound Separation

Author: Ilya Kavalerov, Brian Patton, Kevin W. Wilson, John R. Hershey, Hakan Erdogan, Jonathan Le Roux, and Scott Wisdom
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer science, Speech recognition, Machine Learning (stat.ML), 02 engineering and technology, Computer Science - Sound, Machine Learning (cs.LG), Convolution, 030507 speech-language pathology & audiology, 03 medical and health sciences, Statistics - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Network architecture, Basis (linear algebra), business.industry, Deep learning, Short-time Fourier transform, 020206 networking & telecommunications, Speech enhancement, Task (computing), Artificial intelligence, 0305 other medical science, business, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Comment: 5 pages, accepted to WASPAA 2019
Published: 2019
Full Text: View/download PDF

7. Prior-based Binary Masking and Discriminative Methods for Reverberant and Noisy Speech Recognition Using Distant Stereo Microphones

Author: Yuuki Tachioka, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux
Subjects: Masking (art), General Computer Science, System combination, Computer science, business.industry, Speech recognition, Binary number, Feature transformation, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Discriminative model, 0202 electrical engineering, electronic engineering, information engineering, Deep neural networks, Artificial intelligence, 0305 other medical science, business
Published: 2017
Full Text: View/download PDF

8. SDR - half-baked or well done?

Author: Jonathan Le Roux, John R. Hershey, Scott Wisdom, and Hakan Erdogan
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Noise reduction, media_common.quotation_subject, 02 engineering and technology, Data_CODINGANDINFORMATIONTHEORY, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, Signal-to-noise ratio, Interference (communication), Audio and Speech Processing (eess.AS), Distortion, 0202 electrical engineering, electronic engineering, information engineering, Source separation, FOS: Electrical engineering, electronic engineering, information engineering, Quality (business), media_common, Measure (data warehouse), 020206 networking & telecommunications, Speech enhancement, Computer engineering, 0305 other medical science, Communication channel, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their algorithms in a simple, fair, and hopefully insightful way: it attempted to account for channel variations, and to not only evaluate the total distortion in the estimated signal but also split it in terms of various factors such as remaining interference, newly added artifacts, and channel errors. In recent years, hundreds of papers have been relying on this toolkit to evaluate their proposed methods and compare them to previous works, often arguing that differences on the order of 0.1 dB proved the effectiveness of a method over others. We argue here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results. We propose to use a slightly modified definition, resulting in a simpler, more robust measure, called scale-invariant SDR (SI-SDR). We present various examples of critical failure of the original SDR that SI-SDR overcomes.
Published: 2018
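
Note on record 8: the abstract names scale-invariant SDR (SI-SDR) without giving its formula. The sketch below uses the commonly cited definition, in which the reference is rescaled by the optimal projection factor before the energy ratio is taken. It is a plain-Python illustration on short sample lists, not the authors' code; the function name `si_sdr` is an assumption.

```python
import math


def si_sdr(estimate, reference):
    """SI-SDR in dB for two equal-length sample sequences.

    Assumes the commonly cited definition: project the estimate onto the
    reference, then compare the projected (target) energy with the residual
    energy. Raises ZeroDivisionError if the residual is exactly zero.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in reference]       # scaled reference
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
rescaled = [2.0 * e for e in estimate]

# Rescaling the estimate leaves the score unchanged (up to rounding).
print(si_sdr(estimate, reference), si_sdr(rescaled, reference))
```

Both printed values are about 20 dB, which illustrates the scale invariance that gives the measure its name.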

9. End-to-End Multi-Speaker Speech Recognition

Author: Jonathan Le Roux, Shinji Watanabe, John R. Hershey, Shane Settle, and Takaaki Hori
Subjects: Artificial neural network, business.industry, Computer science, Speech recognition, Deep learning, 010501 environmental sciences, 01 natural sciences, Cocktail party effect, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, End-to-end principle, Transcription (linguistics), Task analysis, Artificial intelligence, Isolation (database systems), 0305 other medical science, Cluster analysis, business, 0105 earth and related environmental sciences
Abstract: Current advances in deep learning have resulted in a convergence of methods across a wide range of tasks, opening the door for tighter integration of modules that were previously developed and optimized in isolation. Recent ground-breaking works have produced end-to-end deep network methods for both speech separation and end-to-end automatic speech recognition (ASR). Speech separation methods such as deep clustering address the challenging cocktail-party problem of distinguishing multiple simultaneous speech signals. This is an enabling technology for real-world human machine interaction (HMI). However, speech separation requires ASR to interpret the speech for any HMI task. Likewise, ASR requires speech separation to work in an unconstrained environment. Although these two components can be trained in isolation and connected after the fact, this paradigm is likely to be sub-optimal, since it relies on artificially mixed data. In this paper, we develop the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The joint training framework synergistically adapts the separation and recognition to each other. As an additional benefit, it enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large scale training on existing transcribed data.
Published: 2018
Full Text: View/download PDF

10. Alternative Objective Functions for Deep Clustering

Author: Jonathan Le Roux, John R. Hershey, and Zhong-Qiu Wang
Subjects: Scheme (programming language), Network architecture, Linear programming, business.industry, Computer science, Inference, 020206 networking & telecommunications, 02 engineering and technology, Machine learning, computer.software_genre, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Symmetric matrix, Artificial intelligence, 0305 other medical science, Cluster analysis, business, computer, computer.programming_language
Abstract: The recently proposed deep clustering framework represents a significant step towards solving the cocktail party problem. This study proposes and compares a variety of alternative objective functions for training deep clustering networks. In addition, whereas the original deep clustering work relied on k-means clustering for test-time inference, here we investigate inference methods that are matched to the training objective. Furthermore, we explore the use of an improved chimera network architecture for speech separation, which combines deep clustering with mask-inference networks in a multiobjective training scheme. The deep clustering loss acts as a regularizer while training the end-to-end mask inference network for best separation. With further iterative phase reconstruction, our best proposed method achieves a state-of-the-art 11.5 dB signal-to-distortion ratio (SDR) result on the publicly available wsj0-2mix dataset, with a much simpler architecture than the previous best approach.
Published: 2018
Full Text: View/download PDF

11. Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

Author: John R. Hershey, Jonathan Le Roux, and Zhong-Qiu Wang
Subjects: Microphone array, Computer science, business.industry, Phase (waves), 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Discriminative model, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0305 other medical science, business, Cluster analysis, Spatial analysis
Abstract: The recently-proposed deep clustering algorithm represents a fundamental advance towards solving the cocktail party problem in the single-channel case. When multiple microphones are available, spatial information can be leveraged to differentiate signals from different directions. This study combines spectral and spatial features in a deep clustering framework so that the complementary spectral and spatial information can be simultaneously exploited to improve speech separation. We find that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry. Experiments on a spatialized version of the wsj0-2mix dataset show the strong potential of the proposed algorithm for speech separation in reverberant environments.
Published: 2018
Full Text: View/download PDF

12. Exploring Tradeoffs in Models for Low-latency Speech Enhancement

Author: Michael Chinen, Richard F. Lyon, John R. Hershey, Jeremy Thorpe, Kevin W. Wilson, Rif A. Saurous, Brian Patton, and Jan Skoglund
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Artificial neural network, Computer science, Computation, Speech recognition, Computer Science - Sound, Convolution, Task (project management), Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Spectrogram, Latency (engineering), 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing, Decibel
Abstract: We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and find that zero-look-ahead models can achieve, on average, within 0.03 dB SDR of our best bidirectional model. Further, we find that 200 milliseconds of look-ahead is sufficient to achieve equivalent performance to our best bidirectional model.
Published: 2018
Full Text: View/download PDF

13. Differentiable Consistency Constraints for Improved Deep Speech Enhancement

Author: Jeremy Thorpe, Kevin W. Wilson, Michael Chinen, Brian Patton, Scott Wisdom, John R. Hershey, and Rif A. Saurous
Subjects: Masking (art), FOS: Computer and information sciences, Sound (cs.SD), Artificial neural network, Noise measurement, Computer science, Short-time Fourier transform, 020206 networking & telecommunications, 02 engineering and technology, Computer Science - Sound, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Signal-to-noise ratio, Fourier transform, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, symbols, FOS: Electrical engineering, electronic engineering, information engineering, Spectrogram, Differentiable function, 0305 other medical science, Algorithm, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
Published: 2018
Full Text: View/download PDF
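
Note on record 13: the abstract describes mixture consistency as the requirement that the estimated sources sum to the input mixture, imposed by a simple projection. The sketch below shows one simple form of such a projection, in which the leftover residual is split equally among the sources. The equal split is an assumption made for illustration, the abstract does not specify the weighting, and the function name `mixture_consistency` is hypothetical.

```python
def mixture_consistency(estimates, mixture):
    """Project source estimates so that they sum to the mixture.

    estimates: list of equal-length sample lists, one per source.
    mixture:   sample list of the same length.
    Assumes an equal split of the residual across sources (illustrative choice).
    """
    num_sources = len(estimates)
    projected = [list(source) for source in estimates]  # copy; inputs stay untouched
    for t, mixture_sample in enumerate(mixture):
        # Whatever the estimates fail to explain at this sample.
        residual = mixture_sample - sum(source[t] for source in estimates)
        for source in projected:
            source[t] += residual / num_sources
    return projected


mixture = [1.0, 2.0, 3.0]
estimates = [[0.4, 1.0, 1.0], [0.4, 0.5, 1.0]]
projected = mixture_consistency(estimates, mixture)

# After projection, the sources add up to the mixture at every sample.
print([sum(samples) for samples in zip(*projected)])
```

Because the operation is only additions and a division by a constant, it is differentiable with respect to the estimates, which is consistent with the abstract's description of it as a layer inside the network.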

14. End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

Author: Jonathan Le Roux, Zhong-Qiu Wang, John R. Hershey, and DeLiang Wang
Subjects: Masking (art), FOS: Computer and information sciences, Sound (cs.SD), Computer science, Phase (waves), Inverse, Machine Learning (stat.ML), 02 engineering and technology, Signal, Computer Science - Sound, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Statistics - Machine Learning, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Computation and Language, Series (mathematics), Short-time Fourier transform, 020206 networking & telecommunications, Function (mathematics), Computer Science - Learning, Fourier transform, symbols, 0305 other medical science, Algorithm, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores reconstruction error introduced by phase inconsistency. In our approach, the loss function is directly defined on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches using the mixture phase for reconstruction, this limitation is less relevant if the estimated magnitudes are to be used together with phase reconstruction. We thus propose several novel activation functions for the output layer of the T-F masking, to allow mask values beyond one. On the publicly-available wsj0-2mix dataset, our approach achieves state-of-the-art 12.6 dB scale-invariant signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new possibilities for deep learning based phase reconstruction and representing a fundamental progress towards solving the notoriously-hard cocktail party problem.
Comment: Submitted to Interspeech 2018
Published: 2018
Full Text: View/download PDF

15. A Purely End-to-End System for Multi-speaker Speech Recognition

Author: Jonathan Le Roux, John R. Hershey, Shinji Watanabe, Hiroshi Seki, and Takaaki Hori
Subjects: End to end system, Sequence, Training set, Computer science, Speech recognition, 020208 electrical & electronic engineering, Contrast (statistics), 02 engineering and technology, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Source separation, 0305 other medical science
Abstract: Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
Published: 2018
Full Text: View/download PDF

16. Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

Author: Zhong Meng, Hakan Erdogan, John R. Hershey, and Shinji Watanabe
Subjects: FOS: Computer and information sciences, Beamforming, Sound (cs.SD), Computer science, Speech recognition, 02 engineering and technology, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Environmental noise, Voice activity detection, Computer Science - Computation and Language, Noise measurement, business.industry, Deep learning, Acoustic model, 020206 networking & telecommunications, Speech processing, Filter design, Recurrent neural network, Artificial intelligence, 0305 other medical science, business, Computation and Language (cs.CL), Adaptive beamformer, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions, which results in a set of time-varying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3 real evaluation set. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Published: 2017

17. Attention-Based Multimodal Fusion for Video Description

Author: Kazuhiko Sumi, Chiori Hori, Ziming Zhang, Teng-Yok Lee, Bret Harsham, John R. Hershey, Takaaki Hori, and Tim K. Marks
Subjects: Artificial neural network, Computer science, business.industry, Concatenation, Feature extraction, Cognitive neuroscience of visual object recognition, Pattern recognition, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Recurrent neural network, Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Relevance (information retrieval), Artificial intelligence, 0305 other medical science, business, Word (computer architecture)
Abstract: Current methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has demonstrated the advantages of integrating temporal attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. Such methods typically use two different types of features: image features (from an object classification model), and motion features (from an action recognition model), combined by naive concatenation in the model input. Because different feature modalities may carry task-relevant information at different times, fusing them by naive concatenation may limit the model's ability to dynamically determine the relevance of each type of feature to different parts of the description. In this paper, we incorporate audio features in addition to the image and motion features. To fuse these three modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. Combining our new multimodal attention model with standard temporal attention outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
Published: 2017
Full Text: View/download PDF

18. Student-teacher network learning with enhanced features

Author: Jonathan Le Roux, John R. Hershey, Shinji Watanabe, and Takaaki Hori
Subjects: Training set, Voice activity detection, Noise measurement, Computer science, Speech recognition, Student teacher, 02 engineering and technology, Speech processing, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0305 other medical science
Abstract: Recent advances in distant-talking ASR research have confirmed that speech enhancement is an essential technique for improving the ASR performance, especially in the multichannel scenario. However, speech enhancement inevitably distorts speech signals, which can cause significant degradation when enhanced signals are used as training data. Thus, distant-talking ASR systems often resort to using the original noisy signals as training data and the enhanced signals only at test time, and give up on taking advantage of enhancement techniques in the training stage. This paper proposes to make use of enhanced features in the student-teacher learning paradigm. The enhanced features are used as input to a teacher network to obtain soft targets, while a student network tries to mimic the teacher network's outputs using the original noisy features as input, so that speech enhancement is implicitly performed within the student network. Compared with conventional student-teacher learning, which uses a better network as teacher, the proposed self-supervised method uses better (enhanced) inputs to a teacher. This setup matches the above scenario of making use of enhanced features in network training. Experiments with the CHiME-4 challenge real dataset show significant ASR improvements with an error reduction rate of 12% in the single-channel track and 15% in the 2-channel track, respectively, by using 6-channel beamformed features for the teacher model.
Published: 2017
Full Text: View/download PDF

19. Joint CTC/attention decoding for end-to-end speech recognition

Author: Takaaki Hori, John R. Hershey, and Shinji Watanabe
Subjects: Markov chain, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Pronunciation, Mandarin Chinese, language.human_language, 030507 speech-language pathology & audiology, 03 medical and health sciences, Tokenization (data security), 0202 electrical engineering, electronic engineering, information engineering, language, 0305 other medical science, Hidden Markov model, Joint (audio engineering), Decoding methods
Abstract: End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding. We have applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese) and show performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.
Published: 2017
Full Text: View/download PDF
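
As a rough illustration of how CTC and attention scores might be combined at decoding time, the sketch below rescores a list of candidate transcripts with a weighted sum of the two log-probabilities. The log-linear interpolation, the weight value 0.3, and the names `joint_score` and `pick_best` are assumptions made for this example; the abstract itself does not specify the combination rule, and the scores below are made-up numbers.

```python
def joint_score(log_p_ctc, log_p_att, ctc_weight=0.3):
    """Weighted sum of CTC and attention log-probabilities (illustrative)."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

def pick_best(hypotheses, ctc_weight=0.3):
    """Return the text of the hypothesis with the highest joint score.

    hypotheses: list of (text, log_p_ctc, log_p_att) tuples.
    """
    best = max(hypotheses, key=lambda h: joint_score(h[1], h[2], ctc_weight))
    return best[0]

if __name__ == "__main__":
    # Hypothetical candidates with made-up scores.
    candidates = [
        ("hypothesis a", -1.0, -5.0),  # favored by CTC
        ("hypothesis b", -4.0, -2.0),  # favored by attention
    ]
    print(pick_best(candidates, ctc_weight=0.3))
    print(pick_best(candidates, ctc_weight=0.9))
```

Raising `ctc_weight` shifts the decision toward the hypothesis that CTC prefers, which shows how a single weight trades off the two models.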

20. Single-Channel Multi-Speaker Separation Using Deep Clustering

Author: Jonathan Le Roux, Yusuf Ziya Isik, Zhuo Chen, Shinji Watanabe, and John R. Hershey
Subjects: business.industry, Computer science, Speech recognition, Deep learning, Word error rate, 020207 software engineering, 02 engineering and technology, Signal, Cocktail party effect, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Spectrogram, Segmentation, Artificial intelligence, 0305 other medical science, Cluster analysis, business, Communication channel
Abstract: Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
Published: 2016
Full Text: View/download PDF

21. Minimum word error training of long short-term memory recurrent neural network language models for speech recognition

Author: John R. Hershey, Chiori Hori, Takaaki Hori, and Shinji Watanabe
Subjects: Sequence, Computer science, Speech recognition, Word error rate, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, 0302 clinical medicine, Cross entropy, Discriminative model, Language model, 0305 other medical science, 030217 neurology & neurosurgery, Word (computer architecture)
Abstract: This paper describes minimum word error (MWE) training of recurrent neural network language models (RNNLMs) for speech recognition. RNNLMs are usually trained to minimize the cross entropy of estimated word probabilities against the correct word sequence, which corresponds to the maximum likelihood criterion. However, this training does not necessarily maximize a performance measure in a target task, i.e., it does not explicitly minimize word error rate (WER) in speech recognition. To solve this problem, several discriminative training methods have already been proposed for n-gram language models, but those for RNNLMs have not been sufficiently investigated. In this paper, we propose an MWE training method for RNNLMs, and report significant WER reductions when we applied the MWE method to a standard Elman-type RNNLM and a more advanced model, a Long Short-Term Memory (LSTM) RNNLM. We also present efficient MWE training with N-best lists on Graphics Processing Units (GPUs).
Published: 2016
Full Text: View/download PDF

22. Deep beamforming networks for multi-channel speech recognition

Author: John R. Hershey, Guoguo Chen, Yu Zhang, Michael I. Mandel, Liang Lu, Shinji Watanabe, Michael L. Seltzer, Xiong Xiao, Dong Yu, and Hakan Erdogan
Subjects: Beamforming, Reverberation, Computer science, Microphone, Time delay neural network, Speech recognition, Feature extraction, Word error rate, Direction of arrival, 020206 networking & telecommunications, 02 engineering and technology, Speech processing, 030507 speech-language pathology & audiology, 03 medical and health sciences, Filter design, Noise, Computer Science::Sound, 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science
Abstract: Despite the significant progress in speech recognition enabled by deep neural networks, poor performance persists in some scenarios. In this work, we focus on far-field speech recognition, which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing, including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beamformer are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are then extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beamforming and acoustic modeling networks are trained jointly using back-propagation with a common cross entropy objective function. In experiments on the AMI meeting corpus, we observed improvements by pre-training each sub-network with a network-specific objective function before joint training of both networks. The proposed method obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
Published: 2016
Full Text: View/download PDF

23. Deep clustering: Discriminative embeddings for segmentation and separation

Author: John R. Hershey, Shinji Watanabe, Zhuo Chen, and Jonathan Le Roux
Subjects: FOS: Computer and information sciences, Computer science, Machine Learning (stat.ML), 02 engineering and technology, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Discriminative model, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Segmentation, Neural and Evolutionary Computing (cs.NE), Cluster analysis, business.industry, Deep learning, Computer Science - Neural and Evolutionary Computing, 020206 networking & telecommunications, Pattern recognition, Spectral clustering, Computer Science - Learning, Computer Science::Sound, Spectrogram, Embedding, Noise (video), Artificial intelligence, 0305 other medical science, business
Abstract: We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6 dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains. Comment: Originally submitted on June 5, 2015
Published: 2016
Full Text: View/download PDF
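
The "low-rank approximation to an ideal pairwise affinity matrix" mentioned in the abstract above can be sketched numerically. The example assumes the objective is the squared Frobenius distance between the embedding affinities V V^T and the label affinities Y Y^T, where V holds one D-dimensional embedding per time-frequency bin and Y holds one-hot source assignments; the function name and the toy sizes are illustrative, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Squared Frobenius norm of (V V^T - Y Y^T), without N x N matrices.

    V: (N, D) embeddings, one row per time-frequency bin.
    Y: (N, C) one-hot source assignments for the same bins.
    """
    # Expanding the norm leaves only small D x D, D x C and C x C products,
    # which keeps the cost linear in the number of bins N.
    VtV = V.T @ V
    VtY = V.T @ Y
    YtY = Y.T @ Y
    return np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D, C = 200, 8, 2  # toy sizes, chosen only for illustration
    V = rng.standard_normal((N, D))
    Y = np.eye(C)[rng.integers(0, C, size=N)]
    direct = np.linalg.norm(V @ V.T - Y @ Y.T, "fro") ** 2
    print(deep_clustering_loss(V, Y), direct)
```

The two printed values agree up to floating-point error, and the loss is zero when the embeddings reproduce the label affinities exactly, for instance when V equals Y.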

24. Deep unfolding for multichannel source separation

Author: Scott Wisdom, Shinji Watanabe, John R. Hershey, and Jonathan Le Roux
Subjects: Network architecture, Markov random field, Computer science, business.industry, Inference, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Mixture model, 030507 speech-language pathology & audiology, 03 medical and health sciences, Generative model, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Artificial intelligence, 0305 other medical science, business
Abstract: Deep unfolding has recently been proposed to derive novel deep network architectures from model-based approaches. In this paper, we consider its application to multichannel source separation. We unfold a multichannel Gaussian mixture model (MCGMM), resulting in a deep MCGMM computational network that directly processes complex-valued frequency-domain multichannel audio and has an architecture defined explicitly by a generative model, thus combining the advantages of deep networks and model-based approaches. We further extend the deep MCGMM by modeling the GMM states using a Markov random field (MRF), whose unfolded mean-field inference updates add dynamics across layers. Experiments on source separation for multichannel mixtures of two simultaneous speakers show that the deep MCGMM leads to improved performance with respect to the original MCGMM model.
Published: 2016
Full Text: View/download PDF

25. Monaural speech separation and recognition challenge

Author: Steven J. Rennie, John R. Hershey, and Martin Cooke
Subjects: Auditory scene analysis, auditory scene analysis, Computer science, Speech recognition, Monaural, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Task (project management), noise robustness, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0103 physical sciences, simultaneous speech, 010301 acoustics, speaker identification, business.industry, speech recognition, Speech processing, Human-Computer Interaction, speech separation, Artificial intelligence, Noise (video), Computational linguistics, 0305 other medical science, business, computer, Software, Natural language processing, Test data, Communication channel
Abstract: Robust speech recognition in everyday conditions requires the solution to a number of challenging problems, not least the ability to handle multiple sound sources. The specific case of speech recognition in the presence of a competing talker has been studied for several decades, resulting in a number of quite distinct algorithmic solutions whose focus ranges from modeling both target and competing speech to speech separation using auditory grouping principles. The purpose of the monaural speech separation and recognition challenge was to permit a large-scale comparison of techniques for the competing talker problem. The task was to identify keywords in sentences spoken by a target talker when mixed into a single channel with a background talker speaking similar sentences. Ten independent sets of results were contributed, alongside a baseline recognition system. Performance was evaluated using common training and test data and common metrics. Listeners' performance in the same task was also measured. This paper describes the challenge problem, compares the performance of the contributed algorithms, and discusses the factors which distinguish the systems. One highlight of the comparison was the finding that several systems achieved near-human performance in some conditions, and one outperformed listeners overall.
Published: 2010
Full Text: View/download PDF

26. Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks

Author: Hakan Erdogan, Shinji Watanabe, John R. Hershey, Michael I. Mandel, and Jonathan Le Roux
Subjects: Beamforming, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, Artificial neural network, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science, Communication channel
Abstract: Recent studies on multi-microphone speech databases indicate that it is beneficial to perform beamforming to improve speech recognition accuracies, especially when there is a high level of background noise. Minimum variance distortionless response (MVDR) beamforming is an important beamforming method that performs quite well for speech recognition purposes especially if the steering vector is known. However, steering the beamformer to focus on speech in unknown acoustic conditions remains a challenging problem. In this study, we use single channel speech enhancement deep networks to form masks that can be used for noise spatial covariance estimation, which steers the MVDR beamforming toward the speech. We analyze how mask prediction affects performance and also discuss various ways to use masks to obtain the speech and noise spatial covariance estimates in a reliable way. We show that using a single mask across microphones for covariance prediction with minima-limited post-masking yields the best result in terms of signal-level quality measures and speech recognition word error rates in a mismatched training condition.
Published: 2016

27. Deep Clustering and Conventional Networks for Music Separation: Stronger Together

Author: Jonathan Le Roux, John R. Hershey, Yi Luo, Nima Mesgarani, and Zhuo Chen
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Speech recognition, Correlation clustering, Machine Learning (stat.ML), 02 engineering and technology, Regularization (mathematics), Computer Science - Sound, Article, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Cluster analysis, Network architecture, business.industry, Deep learning, 020206 networking & telecommunications, Pattern recognition, Computer Science - Learning, Embedding, Spectrogram, Artificial intelligence, 0305 other medical science, business
Abstract: Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known about its effectiveness in other challenging situations such as music source separation. Contrary to conventional networks that directly estimate the source signals, deep clustering generates an embedding for each time-frequency bin, and separates sources by clustering the bins in the embedding space. We show that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation, presumably because its more flexible objective engenders better regularization. Since the strengths of deep clustering and conventional network architectures appear complementary, we explore combining them in a single hybrid network trained via an approach akin to multi-task learning. Remarkably, the combination significantly outperforms either of its components. Comment: Published in ICASSP 2017
Published: 2016
Full Text: View/download PDF
