26 results on "Scott Wisdom"
Search Results
2. AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
- Author
-
Efthymios Tzinis, Scott Wisdom, Tal Remez, and John R. Hershey
- Published
- 2022
- Full Text
- View/download PDF
3. Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis
- Author
-
Naoyuki Kanda, Maokui He, Shinji Watanabe, Jinyu Li, Zili Huang, Zhuo Chen, Jun Du, Scott Wisdom, John R. Hershey, Pavel Denisov, Desh Raj, Hakan Erdogan, Yi Luo, and Takuya Yoshioka
- Subjects
Sound (cs.SD), Audio and Speech Processing (eess.AS), Speech recognition, Speaker diarisation, Modular system, Error analysis, Task analysis, Subtitle, Transcription (software)
- Abstract
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR system. (Accepted to IEEE SLT 2021.)
- Published
- 2021
- Full Text
- View/download PDF
4. Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
- Author
-
Aswin Sivaraman, Scott Wisdom, Hakan Erdogan, and John R. Hershey
- Subjects
Sound (cs.SD), Audio and Speech Processing (eess.AS)
- Abstract
The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlapping speech, and are evaluated subjectively by human listeners. To objectively evaluate our models, we also devise a synthetic AMI test set. For human evaluations on real recordings, we also propose a modification of the standard MUSHRA protocol to handle imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets, outperforming unadapted generalist models trained on orders of magnitude more data. Our results show that unsupervised learning through MixIT enables model adaptation on real-world unlabeled spontaneous speech recordings.
- Published
- 2021
- Full Text
- View/download PDF
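As a rough illustration of the mixture invariant training (MixIT) objective described in the entry above: the network separates a mixture of two reference mixtures, and the loss searches over all assignments of estimated sources to those two mixtures. The sketch below is ours, uses a plain mean-squared error in place of the paper's thresholded SNR loss, and the function name is illustrative only.

```python
import itertools
import numpy as np

def mixit_loss(est_sources: np.ndarray, ref_mixtures: np.ndarray) -> float:
    """Mixture invariant training loss (schematic).

    est_sources:  (M, T) sources estimated from the mixture of mixtures
    ref_mixtures: (2, T) the two reference mixtures summed at the input
    """
    M = est_sources.shape[0]
    best = np.inf
    # Try every assignment of each estimated source to one of the two mixtures.
    for assignment in itertools.product([0, 1], repeat=M):
        a = np.asarray(assignment)
        remix0 = est_sources[a == 0].sum(axis=0)
        remix1 = est_sources[a == 1].sum(axis=0)
        loss = np.mean((remix0 - ref_mixtures[0]) ** 2) \
             + np.mean((remix1 - ref_mixtures[1]) ** 2)
        best = min(best, loss)
    return best
```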
5. End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
- Author
-
Shinji Watanabe, John R. Hershey, Soumi Maiti, Scott Wisdom, Kevin W. Wilson, and Hakan Erdogan
- Subjects
Sound (cs.SD), Machine Learning (cs.LG), Audio and Speech Processing (eess.AS), Speech recognition, Deep learning, Speaker diarisation, Discriminative model, Robustness (computer science), Data modeling, Cluster analysis, Loudspeaker, Artificial intelligence, Network model
- Abstract
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on the LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data. (5 pages, 2 figures; ICASSP 2021.)
- Published
- 2021
- Full Text
- View/download PDF
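As a rough illustration of the permutation-invariant cross-entropy loss mentioned in the entry above: for a fixed number of output slots, the binary cross-entropy is minimized over all assignments of network outputs to reference speakers, since slot order is arbitrary. This is only a schematic sketch (the paper additionally handles a variable number of speakers); the function name is ours.

```python
import itertools
import numpy as np

def pit_bce(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-8) -> float:
    """Permutation-invariant binary cross-entropy (schematic).

    probs:  (T, S) per-frame speaker-activity probabilities from the network
    labels: (T, S) per-frame 0/1 reference speaker activities
    """
    S = probs.shape[1]
    best = np.inf
    for perm in itertools.permutations(range(S)):
        p = probs[:, list(perm)]  # reorder output slots
        bce = -(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)).mean()
        best = min(best, bce)
    return best
```

Enumerating permutations is factorial in the number of slots, which is only practical for the small speaker counts considered in meeting diarization.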
6. Self-Supervised Learning from Automatically Separated Sound Scenes
- Author
-
Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, and Xavier Serra
- Subjects
Sound (cs.SD), Machine Learning (cs.LG), Audio and Speech Processing (eess.AS), contrastive learning, source separation, audio representation learning, self-supervision
- Abstract
Paper presented at the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), held 17–20 October 2021 in New Paltz, United States. Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
- Published
- 2021
- Full Text
- View/download PDF
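A minimal sketch of the kind of similarity-maximization objective the entry above builds on: embeddings of input mixtures are pulled toward embeddings of their automatically separated outputs, with the other mixtures in the batch acting as negatives. This is a generic contrastive softmax cross-entropy loss, not the paper's exact combination of similarity-maximization and coincidence-prediction objectives; all names are illustrative.

```python
import numpy as np

def contrastive_loss(mix_emb: np.ndarray, sep_emb: np.ndarray,
                     temperature: float = 0.1) -> float:
    """Contrastive loss between mixtures and their separated views (schematic).

    mix_emb: (B, D) L2-normalized embeddings of B input mixtures
    sep_emb: (B, D) L2-normalized embeddings of one separated output per mixture
    """
    logits = (mix_emb @ sep_emb.T) / temperature        # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(mix_emb))
    # The matching separated view (diagonal) is the positive for each mixture.
    return float(-log_probs[idx, idx].mean())
```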
7. Sound Event Detection and Separation: a Benchmark on DESED Synthetic Soundscapes
- Author
-
Romain Serizel, Hakan Erdogan, Justin Salamon, Nicolas Turpault, John R. Hershey, Scott Wisdom, Eduardo Fonseca, and Prem Seetharaman
- Subjects
Sound (cs.SD), Audio and Speech Processing (eess.AS), Sound event detection, sound separation, synthetic soundscapes, Soundscape, Reverberation, Sound localization, Robustness (computer science), Signal processing, Benchmark (computing), Speech recognition
- Abstract
We propose a benchmark of state-of-the-art sound event detection (SED) systems. We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time-related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We show that the localization in time of sound events is still a problem for SED systems. We also show that reverberation and non-target sound events severely degrade the performance of the SED systems. In the latter case, sound separation seems like a promising solution.
- Published
- 2020
- Full Text
- View/download PDF
8. Performance Study of a Convolutional Time-Domain Audio Separation Network for Real-Time Speech Denoising
- Author
-
Hakan Erdogan, Samuel Sonning, Scott Wisdom, and Christian Schüldt
- Subjects
Noise reduction, Deep learning, Speech recognition, Convolutional neural network, Speech enhancement, Time domain, Representation (mathematics), Artificial intelligence
- Abstract
Time-domain audio separation networks based on dilated temporal convolutions have recently been shown to perform very well compared to methods that are based on a time-frequency representation in speech separation tasks, even outperforming an oracle binary time-frequency mask of the speakers. This paper investigates the performance of such a time-domain network (Conv-TasNet) for speech denoising in a real-time setting, comparing various parameter settings. Most importantly, different amounts of lookahead are evaluated and compared to the baseline of a fully causal model. We show that a large part of the increase in performance between a causal and non-causal model is achieved with a lookahead of only 20 milliseconds, demonstrating the usefulness of even small lookaheads for many real-time applications.
- Published
- 2020
- Full Text
- View/download PDF
9. What's All the FUSS About Free Universal Sound Separation Data?
- Author
-
Romain Serizel, Prem Seetharaman, Justin Salamon, Daniel P. W. Ellis, John R. Hershey, Scott Wisdom, Eduardo Fonseca, Nicolas Turpault, and Hakan Erdogan
- Subjects
Sound (cs.SD), Audio and Speech Processing (eess.AS), open-source datasets, Sound separation, Universal sound separation, variable source separation, deep learning, Reverberation, Impulse (physics), Data modeling, Open domain, Algorithm
- Abstract
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.5 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
- Published
- 2020
- Full Text
- View/download PDF
10. Improving Universal Sound Separation Using Sound Classification
- Author
-
Scott Wisdom, Daniel P. W. Ellis, Efthymios Tzinis, Aren Jansen, and John R. Hershey
- Subjects
Sound (cs.SD), Machine Learning (cs.LG), Machine Learning (stat.ML), Audio and Speech Processing (eess.AS), Source separation, Speech recognition, Deep learning, Classifier (linguistics), Oracle, Hyperparameter, Domain (software engineering), Artificial intelligence
- Abstract
Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic sources from an open domain, regardless of their class. In this paper, we utilize the semantic information learned by sound classifier networks trained on a vast amount of diverse sounds to improve universal sound separation. In particular, we show that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information. This approach is especially useful in an iterative setup, where source estimates from an initial separation stage and their corresponding classifier-derived embeddings are fed to a second separation network. By performing a thorough hyperparameter search consisting of over a thousand experiments, we find that classifier embeddings from oracle clean sources provide nearly one dB of SNR gain, and our best iterative models achieve a significant fraction of this oracle performance, establishing a new state-of-the-art for universal sound separation.
- Published
- 2019
11. Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement
- Author
-
Desh Raj, Shinji Watanabe, Kevin W. Wilson, Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, John R. Hershey, and Zhuo Chen
- Subjects
Sound (cs.SD), Machine Learning (cs.LG), Machine Learning (stat.ML), Audio and Speech Processing (eess.AS), Beamforming, Artificial neural network, Covariance function, Word error rate, Speech enhancement, Signal-to-noise ratio, Pattern recognition, Context (language use), Block size, Artificial intelligence
- Abstract
This work introduces sequential neural beamforming, which alternates between neural-network-based spectral separation and beamforming-based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, including factorizing the spatial covariance into a time-varying amplitude component and a time-invariant spatial component, as well as using block-based techniques. In addition, we introduce a multi-frame beamforming method which improves the results significantly by adding contextual frames to the beamforming formulations. We extensively evaluate and analyze the effects of window size, block size, and multi-frame context size for these methods. Our best method utilizes a sequence of three neural separation and multi-frame time-invariant spatial beamforming stages, and demonstrates an average improvement of 2.75 dB in scale-invariant signal-to-noise ratio and a 14.2% absolute reduction in a comparative speech recognition metric across four challenging reverberant speech enhancement and separation tasks. We also use our three-speaker separation model to separate real recordings in the LibriCSS evaluation set into non-overlapping tracks, and achieve a better word error rate as compared to a baseline mask-based beamformer. (7 pages, 7 figures; IEEE SLT 2021, slt2020.org.)
- Published
- 2019
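For context on the spatial stage described in the entry above, the sketch below shows a standard mask-based MVDR beamformer with a time-invariant spatial covariance per frequency. It is a simplified stand-in, not the paper's exact multi-frame, block-based formulation; the function name and reference-channel convention are assumptions.

```python
import numpy as np

def mvdr_from_masks(stft: np.ndarray, speech_mask: np.ndarray, noise_mask: np.ndarray,
                    ref_mic: int = 0, eps: float = 1e-8) -> np.ndarray:
    """Generic mask-based MVDR beamforming (schematic).

    stft: (C, F, T) complex multichannel STFT
    speech_mask, noise_mask: (F, T) real masks from a neural separation stage
    Returns the beamformed single-channel STFT of shape (F, T).
    """
    C, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]                                  # (C, T)
        phi_s = (speech_mask[f] * X) @ X.conj().T / T      # speech spatial covariance
        phi_n = (noise_mask[f] * X) @ X.conj().T / T + eps * np.eye(C)
        ratio = np.linalg.solve(phi_n, phi_s)              # Phi_n^{-1} Phi_s
        w = ratio[:, ref_mic] / (np.trace(ratio) + eps)    # MVDR weights, reference channel
        out[f] = w.conj() @ X
    return out
```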
12. Universal Sound Separation
- Author
-
Ilya Kavalerov, Brian Patton, Kevin W. Wilson, John R. Hershey, Hakan Erdogan, Jonathan Le Roux, and Scott Wisdom
- Subjects
Sound (cs.SD), Machine Learning (cs.LG), Machine Learning (stat.ML), Audio and Speech Processing (eess.AS), Source separation, Speech enhancement, Speech recognition, Deep learning, Network architecture, Short-time Fourier transform, Convolution, Basis (linear algebra), Task (computing)
- Abstract
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation. (5 pages; accepted to WASPAA 2019.)
- Published
- 2019
- Full Text
- View/download PDF
13. SDR - half-baked or well done?
- Author
-
Jonathan Le Roux, John R. Hershey, Scott Wisdom, and Hakan Erdogan
- Subjects
Sound (cs.SD), Audio and Speech Processing (eess.AS), Source separation, Speech enhancement, Noise reduction, Signal-to-noise ratio, Distortion, Interference (communication), Quality (business), Measure (data warehouse), Communication channel, Computer engineering
- Abstract
In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their algorithms in a simple, fair, and hopefully insightful way: it attempted to account for channel variations, and to not only evaluate the total distortion in the estimated signal but also split it in terms of various factors such as remaining interference, newly added artifacts, and channel errors. In recent years, hundreds of papers have been relying on this toolkit to evaluate their proposed methods and compare them to previous works, often arguing that differences on the order of 0.1 dB proved the effectiveness of a method over others. We argue here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results. We propose to use a slightly modified definition, resulting in a simpler, more robust measure, called scale-invariant SDR (SI-SDR). We present various examples of critical failure of the original SDR that SI-SDR overcomes.
- Published
- 2018
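The scale-invariant SDR (SI-SDR) advocated in the entry above has a simple closed form: rescale the reference by its least-squares projection coefficient, then take the energy ratio of the scaled reference to the residual. A minimal NumPy sketch; the mean removal and eps guard are common conventions of our choosing, not requirements of the definition.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB for 1-D signals of equal length."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Scaling of the reference that best explains the estimate (least squares).
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))
```

Because the target is rescaled before the ratio is taken, the measure is invariant to the overall gain of the estimate, which is the property the paper argues the original BSS_eval SDR lacks in single-channel use.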
14. Differentiable Consistency Constraints for Improved Deep Speech Enhancement
- Author
-
Jeremy Thorpe, Kevin W. Wilson, Michael Chinen, Brian Patton, Scott Wisdom, John R. Hershey, and Rif A. Saurous
- Subjects
Sound (cs.SD), Audio and Speech Processing (eess.AS), Speech enhancement, Artificial neural network, Short-time Fourier transform, Fourier transform, Spectrogram, Signal-to-noise ratio, Noise measurement, Masking, Differentiable function, Algorithm
- Abstract
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
- Published
- 2018
- Full Text
- View/download PDF
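The mixture-consistency constraint described in the entry above can be enforced with a single differentiable projection that redistributes the mixture residual across the estimated sources. The sketch below shows the simplest equal-weight variant in NumPy; the paper also considers weighted projections and the companion STFT-consistency layer, which are not shown, and the function name is ours.

```python
import numpy as np

def mixture_consistency(estimates: np.ndarray, mixture: np.ndarray) -> np.ndarray:
    """Project source estimates so that they sum exactly to the mixture.

    estimates: (num_sources, ...) array of source estimates
    mixture:   array broadcastable to estimates[0], the observed mixture
    """
    residual = mixture - estimates.sum(axis=0)
    # Spread the residual equally over the sources (equal-weight projection).
    return estimates + residual / estimates.shape[0]
```

After this step the estimates sum to the mixture by construction, so the layer can be appended to an enhancement network and trained through with ordinary backpropagation.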
15. Speech Analysis With the Strong Uncorrelating Transform
- Author
-
Scott Wisdom, Les Atlas, and Greg Okopal
- Subjects
Voice activity detection, Acoustics and Ultrasonics, Noise (signal processing), Speech recognition, Speech processing, Signal, Computational Mathematics, Signal-to-noise ratio, Computer Science (miscellaneous), Demodulation, Electrical and Electronic Engineering, Random variable, Algorithm, Statistic
- Abstract
The strong uncorrelating transform (SUT) provides estimates of independent components from linear mixtures using only second-order information, provided that the components have unique circularity coefficients. We propose a processing framework for generating complex-valued subbands from real-valued mixtures of speech and noise where the objective is to control the likely values of the sample circularity coefficients of the underlying speech and noise components in each subband. We show how several processing parameters affect the noncircularity of speech-like and noise components in the subband, ultimately informing parameter choices that allow for estimation of each of the components in a subband using the SUT. Additionally, because the speech and noise components will have unique sample circularity coefficients, this statistic can be used to identify time–frequency regions that contain voiced speech. We give an example of the recovery of the circularity coefficients of a real speech signal from a two-channel noisy mixture at -25 dB SNR, which demonstrates how the estimates of noncircularity can reveal the time-frequency structure of a speech signal in very high levels of noise. Finally, we present the results of a voice activity detection (VAD) experiment showing that two new circularity-based statistics, one of which is derived from the SUT processing, can achieve improved performance over state-of-the-art VADs in real-world recordings of noise.
- Published
- 2015
- Full Text
- View/download PDF
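The circularity coefficient that underpins the entry above measures how far a complex-valued signal departs from rotational invariance. A small sketch of the sample estimate for one subband; the zero-mean convention, eps guard, and function name are ours.

```python
import numpy as np

def sample_circularity(z: np.ndarray, eps: float = 1e-12) -> float:
    """Sample circularity coefficient of a complex sequence.

    Returns a value in [0, 1]: near 0 for a circular (rotation-invariant)
    signal, approaching 1 for a maximally noncircular (improper) one.
    """
    z = z - z.mean()
    return float(np.abs(np.mean(z**2)) / (np.mean(np.abs(z) ** 2) + eps))
```

Applied per subband of a filter bank or STFT, e.g. `[sample_circularity(subbands[k]) for k in range(num_bands)]`, this gives the kind of per-band statistic the paper uses to flag time-frequency regions containing voiced speech.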
16. Building recurrent networks by unfolding iterative thresholding for sequential sparse recovery
- Author
-
Les Atlas, Scott Wisdom, Thomas Powers, and James W. Pitton
- Subjects
Network architecture, Artificial neural network, Deep learning, Pattern recognition, Sparse approximation, Machine learning, Backpropagation, Set (abstract data type), Recurrent neural network, Compressed sensing, Artificial intelligence
- Abstract
Historically, sparse methods and neural networks, particularly modern deep learning methods, have been relatively disparate areas. Sparse methods are typically used for signal enhancement, compression, and recovery, usually in an unsupervised framework, while neural networks commonly rely on a supervised training set. In this paper, we use the specific problem of sequential sparse recovery, which models a sequence of observations over time using a sequence of sparse coefficients, to show how algorithms for sparse modeling can be combined with supervised deep learning to improve sparse recovery. Specifically, we show that the iterative soft-thresholding algorithm (ISTA) for sequential sparse recovery corresponds to a stacked recurrent neural network (RNN) under specific architecture and parameter constraints. Then we demonstrate the benefit of training this RNN with backpropagation using supervised data for the task of column-wise compressive sensing of images. This training corresponds to adaptation of the original iterative thresholding algorithm and its parameters. Thus, we show by example that sparse modeling can provide a rich source of principled and structured deep network architectures that can be trained to improve performance on specific tasks.
- Published
- 2017
- Full Text
- View/download PDF
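The unfolding idea in the entry above starts from the iterative soft-thresholding algorithm (ISTA): each iteration is an affine map followed by a soft-threshold, which maps directly onto one recurrent layer whose weights and thresholds can then be trained with backpropagation. Below is a plain (not unfolded or trained) ISTA sketch for a single observation, with names of our choosing.

```python
import numpy as np

def soft_threshold(z: np.ndarray, tau: float) -> np.ndarray:
    """Elementwise soft-thresholding, the nonlinearity of ISTA."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(x: np.ndarray, A: np.ndarray, lam: float, num_iters: int = 50) -> np.ndarray:
    """ISTA for sparse recovery: min_z 0.5*||x - A z||^2 + lam*||z||_1.

    Each iteration has the form z <- soft(W z + V x, lam / L); unfolding these
    iterations and learning W, V, and the thresholds yields the recurrent
    network described in the entry above.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    V = A.T / L
    W = np.eye(A.shape[1]) - A.T @ A / L
    z = np.zeros(A.shape[1])
    for _ in range(num_iters):
        z = soft_threshold(W @ z + V @ x, lam / L)
    return z
```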
17. Novel Deep Architectures in Speech Processing
- Author
-
Jonathan Le Roux, Scott Wisdom, Yusuf Ziya Isik, Shinji Watanabe, Zhuo Chen, and John R. Hershey
- Subjects
Inference, Machine learning, Speech processing, Spectral clustering, Non-negative matrix factorization, Discriminative model, Problem domain, Cluster analysis, Adaptation (computer science), Artificial intelligence
- Abstract
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model. In addition, unsupervised inference tasks such as adaptation and clustering are handled in a natural way. However, these benefits typically come at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, and discriminative training is relatively easy. However, their typically generic architectures often make it unclear how to incorporate specific problem knowledge or to perform flexible tasks such as unsupervised inference. This chapter introduces frameworks to provide the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and reinterpret inference iterations as layers in a deep network, while generalizing the parametrization to create a more powerful network. We show how such frameworks yield new understanding of conventional networks, and how they can result in novel networks for speech processing, including networks based on nonnegative matrix factorization, complex Gaussian microphone array signal processing, and a network inspired by efficient spectral clustering. We then discuss what has been learned in recent work and provide a prospectus for future research in this area.
- Published
- 2017
- Full Text
- View/download PDF
18. Deep Recurrent NMF for Speech Separation by Unfolding Iterative Thresholding
- Author
-
Scott Wisdom, Thomas Powers, Les Atlas, and James W. Pitton
- Subjects
Sound (cs.SD), Machine Learning (cs.LG), Machine Learning (stat.ML), Network architecture, Optimization problem, Initialization, Pattern recognition, Non-negative matrix factorization, Recurrent neural network, Convergence (routing), Spectrogram, Interpretability, Artificial intelligence
- Abstract
In this paper, we propose a novel recurrent neural network architecture for speech separation. This architecture is constructed by unfolding the iterations of a sequential iterative soft-thresholding algorithm (ISTA) that solves the optimization problem for sparse nonnegative matrix factorization (NMF) of spectrograms. We name this network architecture deep recurrent NMF (DR-NMF). The proposed DR-NMF network has three distinct advantages. First, DR-NMF provides better interpretability than other deep architectures, since the weights correspond to NMF model parameters, even after training. This interpretability also provides principled initializations that enable faster training and convergence to better solutions compared to conventional random initialization. Second, like many deep networks, DR-NMF is an order of magnitude faster at test time than NMF, since computation of the network output only requires evaluating a few layers at each time step. Third, when a limited amount of training data is available, DR-NMF exhibits stronger generalization and separation performance compared to sparse NMF and state-of-the-art long short-term memory (LSTM) networks. When a large amount of training data is available, DR-NMF achieves lower yet competitive separation performance compared to LSTM networks. (To be presented at WASPAA 2017.)
- Published
- 2017
- Full Text
- View/download PDF
19. Enabling Seamless Wireless Power Delivery in Dynamic Environments
- Author
-
Benjamin H. Waters, Scott Wisdom, Alanson P. Sample, and Joshua R. Smith
- Subjects
Engineering, Electrical and Electronic Engineering, Power transmission, Electrical engineering, Electronic engineering, Impedance matching, Tracking system, Inductive charging, Adaptive system, Maximum power transfer theorem, Wireless, Docking station
- Abstract
Effective means of delivering wireless power to volumes of spaces will enable users the freedom and mobility to seamlessly power and recharge their devices in an unencumbered fashion. This has particular importance for consumer electronic, medical, and industrial applications, where usage models focus on unstructured and dynamic environments. However, existing wireless power technology falls short of this vision. Inductive charging solutions are limited to near-contact distances and require a docking station or precise placement for effective operation. Far-field wireless power techniques allow much greater range, but require complicated tracking systems to maintain a line-of-sight connection for high-efficiency power delivery to mobile applications. Recent work using magnetically coupled resonators (MCRs) for wireless power delivery has shown a promising intersection between range (on the order of a meter), efficiency (over 80%), and delivered power (up to tens of watts). However, unpredictable loads rapidly change system operating points, and changes in position disrupt system efficiency, which affects the ultimate usability of these systems. Dynamic adaptation to these changes in operating conditions and power transfer range is a critical capability in developing a fully functional and versatile wireless power solution. This paper provides an overview of methods used to adapt to variations in range, orientation, and load using both wideband and fixed-frequency techniques.
- Published
- 2013
- Full Text
- View/download PDF
20. Benefits of noncircular statistics for nonstationary signals
- Author
-
Les Atlas, James W. Pitton, Greg Okopal, and Scott Wisdom
- Subjects
Gaussian, Estimator, Statistical model, Complex normal distribution, Time–frequency analysis, Fourier transform, Statistics, Random variable, Statistical signal processing, Mathematics
- Abstract
Conventional statistical signal processing of nonstationary signals uses circular complex Gaussian distributions to model the complex-valued short-time Fourier transform. In this paper, we show how noncircular complex Gaussian distributions can provide better statistical models of a variety of nonstationary acoustic signals. The estimators required for this model are computationally efficient, and also have a simple approximate finite-sample distribution. We also show that noncircular Gaussian models provide distinct benefits for statistical signal processing. In particular, we show how noncircular Gaussian models can improve detection of nonstationary acoustic events, and we explore how estimator parameter choices affect performance.
- Published
- 2016
- Full Text
- View/download PDF
21. On spectral noncircularity of natural signals
- Author
-
James W. Pitton, Scott Wisdom, and Les Atlas
- Subjects
Acoustics, Likelihood-ratio test, Estimator, Transient (oscillation), Frequency modulation, Spectral line, Algorithm, Mathematics
- Abstract
Natural signals are typically nonstationary. The complex-valued frequency spectra of nonstationary signals do not have zero spectral correlation, as is assumed for wide-sense stationary processes. Instead, these spectra have non-zero second-order noncircular statistics (that is, they are not rotationally invariant) that are potentially useful for detection, classification, and enhancement. These noncircular statistics are especially significant for transient events, which are common in many natural signals. In this paper we provide practical and effective estimators for spectral noncircularity and spectral correlation. We illustrate the behavior of our spectral noncircularity estimators for synthetic signals. Then, we derive a generalized likelihood ratio test using both circular and noncircular models and show how estimates of spectral noncircularity provide performance improvements for detection of natural acoustic events.
- Published
- 2016
- Full Text
- View/download PDF
22. Deep unfolding for multichannel source separation
- Author
-
Scott Wisdom, Shinji Watanabe, John R. Hershey, and Jonathan Le Roux
- Subjects
Network architecture, Markov random field, Inference, Pattern recognition, Mixture model, Generative model, Source separation, Artificial intelligence
- Abstract
Deep unfolding has recently been proposed to derive novel deep network architectures from model-based approaches. In this paper, we consider its application to multichannel source separation. We unfold a multichannel Gaussian mixture model (MCGMM), resulting in a deep MCGMM computational network that directly processes complex-valued frequency-domain multichannel audio and has an architecture defined explicitly by a generative model, thus combining the advantages of deep networks and model-based approaches. We further extend the deep MCGMM by modeling the GMM states using a Markov random field (MRF), whose unfolded mean-field inference updates add dynamics across layers. Experiments on source separation for multichannel mixtures of two simultaneous speakers show that the deep MCGMM leads to improved performance with respect to the original MCGMM model.
- Published
- 2016
- Full Text
- View/download PDF
23. Voice activity detection using subband noncircularity
- Author
-
Les Atlas, James W. Pitton, Scott Wisdom, and Greg Okopal
- Subjects
Voice activity detection, Noise, Signal-to-noise ratio, Noise measurement, Feature (computer vision), Speech recognition, Speech coding, Speech processing
- Abstract
Many voice activity detection (VAD) systems use the magnitude of complex-valued spectral representations. However, using only the magnitude often does not fully characterize the statistical behavior of the complex values. We present two novel methods for performing VAD on single- and dual-channel audio that do completely account for the second-order statistical behavior of complex data. Our methods exploit the second-order noncircularity (also known as impropriety) of complex subbands of speech and noise. Since speech tends to be more improper than noise, higher impropriety suggests speech activity. Our single-channel method is blind in the sense that it is unsupervised and, unlike many VAD systems, does not rely on non-speech periods for noise parameter estimation. Our methods achieve improved performance over other state-of-the-art magnitude-based VADs on the QUT-NOISE-TIMIT corpus, which indicates that impropriety is a compelling new feature for voice activity detection.
- Published
- 2015
- Full Text
- View/download PDF
24. Extending coherence for optimal detection of nonstationary harmonic signals
- Author
-
Scott Wisdom, James W. Pitton, and Les Atlas
- Subjects
Coherence time, Coherence (signal processing), Detector, Detection performance, Fundamental frequency, Performance improvement, Electronic engineering, Algorithm, Mathematics
- Abstract
This paper describes an improved detector for nonstationary harmonic signals. The performance improvement is accomplished by using a novel method for extending the coherence time of such signals. This method applies a transformation to a noisy signal that attempts to fit a simple model to the signal's slowly changing fundamental frequency over the analysis duration. By matching the change in the signal's fundamental frequency, analysis is more coherent with the signal over longer durations, which allows the use of longer windows and thus improves detection performance.
- Published
- 2014
- Full Text
- View/download PDF
25. Estimating the noncircularity of latent components within complex-valued subband mixtures with applications to speech processing
- Author
-
Scott Wisdom, Les Atlas, and Greg Okopal
- Subjects
Voice activity detection, Noise (signal processing), Complex valued, Pattern recognition, Pitch tracking, Speech processing, Degree (graph theory), Artificial intelligence, Mathematics
- Abstract
This paper describes an approach that estimates the circularity coefficients of multiple underlying components within complex subbands of an additive mixture of voiced speech and noise via the strong uncorrelating transform (SUT). For the SUT to be effective, the latent source signals must have unique nonzero circularity coefficients; this requirement is satisfied by using narrow filters to impose a degree of noncircularity upon what would typically be circular noise. The circularity coefficient estimates are then used for voice activity detection, pitch tracking, and enhancement.
- Published
- 2014
- Full Text
- View/download PDF
26. Extending coherence time for analysis of modulated random processes
- Author
-
Les Atlas, James W. Pitton, and Scott Wisdom
- Subjects
Coherence time, Stochastic process, Estimator, Relaxation (iterative method), Fundamental frequency, Noise (electronics), Frequency modulation, Sonar, Speech recognition, Algorithm, Mathematics
- Abstract
In this paper, we relax a commonly-used assumption about a class of nonstationary random processes composed of modulated wide-sense stationary random processes: that the fundamental frequency of the modulator is stationary within the analysis window. To compensate for the relaxation of this assumption, we define the generalized DEMON (“demodulated noise”) spectrum representing modulation frequency, which we use to increase the coherence time of such signals. Increased coherence time means longer analysis windows, which provides higher SNR estimators. We use the example of detection on both synthetic and real-world passive sonar signals to demonstrate this increase.
- Published
- 2014
- Full Text
- View/download PDF
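For reference, the conventional fixed-window DEMON analysis that the entry above generalizes can be sketched as square-law envelope detection followed by a Fourier transform of the envelope; peaks in the resulting modulation spectrum appear at the modulator's fundamental frequency. This is a generic textbook-style sketch under our own naming, not the paper's generalized DEMON spectrum.

```python
import numpy as np

def demon_spectrum(x: np.ndarray, fs: float, nfft: int = 4096):
    """Basic DEMON ("demodulated noise") spectrum of a modulated random process.

    x:  real-valued passband signal
    fs: sample rate in Hz
    Returns (modulation frequencies in Hz, magnitude spectrum of the envelope).
    """
    envelope = x.astype(float) ** 2            # square-law envelope detector
    envelope -= envelope.mean()                # remove the DC component
    windowed = envelope * np.hanning(len(envelope))
    spectrum = np.abs(np.fft.rfft(windowed, n=nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, spectrum
```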