Author: "Zhang, Wangyou" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Wangyou"' showing total 100 results

1. SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

Author: Jung, Jee-weon, Wu, Yihan, Wang, Xin, Kim, Ji-Hoon, Maiti, Soumi, Matsunaga, Yuta, Shim, Hye-jin, Tian, Jinchuan, Evans, Nicholas, Chung, Joon Son, Zhang, Wangyou, Um, Seyun, Takamichi, Shinnosuke, and Watanabe, Shinji
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Training robust recognition systems requires speech data recorded in varied acoustic environments with different levels of noise. However, existing datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Existing SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. We present SpoofCeleb, which leverages a fully automated pipeline that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. The resulting SpoofCeleb dataset comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We provide baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb., Comment: 9 pages, 2 figures, 8 tables
Published: 2024

2. Text-To-Speech Synthesis In The Wild

Author: Jung, Jee-weon, Zhang, Wangyou, Maiti, Soumi, Wu, Yihan, Wang, Xin, Kim, Ji-Hoon, Matsunaga, Yuta, Um, Seyun, Tian, Jinchuan, Shim, Hye-jin, Evans, Nicholas, Chung, Joon Son, Takamichi, Shinnosuke, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms. The recent literature nonetheless shows efforts to train TTS systems using data collected in the wild. While this approach allows for the use of massive quantities of natural speech, until now, there are no common datasets. We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, in this case, applied to the VoxCeleb1 dataset commonly used for speaker recognition. We further propose two training sets. TITW-Hard is derived from the transcription, segmentation, and selection of VoxCeleb1 source data. TITW-Easy is derived from the additional application of enhancement and additional data selection based on DNSMOS. We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard. Both the dataset and protocols are publicly available and support the benchmarking of TTS systems trained using TITW data., Comment: 5 pages, submitted to ICASSP 2025 as a conference paper
Published: 2024

3. Towards Robust Speech Representation Learning for Thousands of Languages

Author: Chen, William, Zhang, Wangyou, Peng, Yifan, Li, Xinjian, Tian, Jinchuan, Shi, Jiatong, Chang, Xuankai, Maiti, Soumi, Livescu, Karen, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at https://www.wavlab.org/activities/2024/xeus/., Comment: Updated affiliations; 20 pages
Published: 2024

4. URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Author: Zhang, Wangyou, Scheibler, Robin, Saijo, Kohei, Cornell, Samuele, Li, Chenda, Ni, Zhaoheng, Kumar, Anurag, Pirklbauer, Jan, Sach, Marvin, Watanabe, Shinji, Fingscheidt, Tim, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics., Comment: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix
Published: 2024

5. Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Author: Zhang, Wangyou, Saijo, Kohei, Jung, Jee-weon, Li, Chenda, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement. In this paper, we aim to provide new insights for addressing the above issues by exploring the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes. Our investigation involves several popular SE architectures and speech data from different domains. Experiments reveal both similarities and distinctions between the scaling effects in SE and other tasks such as speech recognition. These findings further provide insights into the under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures., Comment: 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024
Published: 2024

6. SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

Author: Wu, Yihan, Maiti, Soumi, Peng, Yifan, Zhang, Wangyou, Li, Chenda, Wang, Yuyue, Wang, Xihua, Watanabe, Shinji, and Song, Ruihua
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advancements in language models have significantly enhanced performance in multiple speech-related tasks. Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model. However, this design omits the intrinsic connections between different speech tasks, which can potentially boost the performance of each task. In this work, we propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens. Built upon four primary tasks -- speech synthesis, speech recognition, speech language modeling, and text language modeling -- SpeechComposer can easily extend to more speech tasks via compositions of well-designed prompt tokens, like voice conversion and speech enhancement. The unification of prompt tokens also makes it possible for knowledge sharing among different speech tasks in a more structured manner. Experimental results demonstrate that our proposed SpeechComposer can improve the performance of both primary tasks and composite tasks, showing the effectiveness of the shared prompt tokens. Remarkably, the unified decoder-only model achieves a comparable and even better performance than the baselines which are expert models designed for single tasks., Comment: 11 pages, 2 figures
Published: 2024

7. ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

Author: Jung, Jee-weon, Zhang, Wangyou, Shi, Jiatong, Aldeneh, Zakaria, Higuchi, Takuya, Theobald, Barry-John, Abdelaziz, Ahmed Hussen, and Watanabe, Shinji
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, facilitating the broad research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN., Comment: 5 pages, 3 figures, 7 tables, Interspeech 2024
Published: 2024

8. Improving Design of Input Condition Invariant Speech Enhancement

Author: Zhang, Wangyou, Jung, Jee-weon, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed showing promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated. For this purpose, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, demonstrating superior performance with fewer parameters and computational costs compared to the existing method. All proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance on real conditions. Recipe with full model details is released at https://github.com/espnet/espnet., Comment: Accepted by ICASSP 2024, 5 pages, 2 figures, 3 tables (corrected the results of no processing on CHiME-4 (Simu) in Table 2)
Published: 2024

9. A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

Author: Saijo, Kohei, Zhang, Wangyou, Wang, Zhong-Qiu, Watanabe, Shinji, Kobayashi, Tetsunori, and Ogawa, Tetsuji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting. This is achieved by integrating two modules into an SE model: 1) an internal separation module that does both speaker counting and separation; and 2) a TSE module that extracts the target speech from the internal separation outputs using target speaker cues. The model is trained to perform TSE if the target speaker cue is given and SS otherwise. By training the model to remove noise and reverberation, we allow the model to tackle the five tasks mentioned above with a single model, which has not been accomplished yet. Evaluation results demonstrate that the proposed MUSE model can successfully handle multiple tasks with a single model., Comment: 6 pages, 4 figures, 2 tables, accepted by ASRU2023
Published: 2023

10. Toward Universal Speech Enhancement for Diverse Input Conditions

Author: Zhang, Wangyou, Saijo, Kohei, Wang, Zhong-Qiu, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation). Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model. In this paper, we make the first attempt to investigate this line of research. First, we devise a single SE model that is independent of microphone channels, signal lengths, and sampling frequencies. Second, we design a universal SE benchmark by combining existing public corpora with multiple conditions. Our experiments on a wide range of datasets show that the proposed single model can successfully handle diverse conditions with strong performance., Comment: 6 pages, 3 figures, 5 tables, published in ASRU 2023 (corrected the results of noisy speech on CHiME-4 (Simu) in Table 4)
Published: 2023

11. Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

Author: Chen, William, Shi, Jiatong, Yan, Brian, Berrebbi, Dan, Zhang, Wangyou, Peng, Yifan, Chang, Xuankai, Maiti, Soumi, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more research groups. We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages. To build WavLabLM, we devise a novel multi-stage pre-training method, designed to address the language imbalance of multilingual data. WavLabLM achieves comparable performance to XLS-R on ML-SUPERB with less than 10% of the training data, making SSL realizable with academic compute. We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials. We open-source all code and models in ESPnet., Comment: Accepted to ASRU 2023
Published: 2023

12. Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Author: Peng, Yifan, Tian, Jinchuan, Yan, Brian, Berrebbi, Dan, Chang, Xuankai, Li, Xinjian, Shi, Jiatong, Arora, Siddhant, Chen, William, Sharma, Roshan, Zhang, Wangyou, Sudo, Yui, Shakeel, Muhammad, Jung, Jee-weon, Maiti, Soumi, and Watanabe, Shinji
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science., Comment: Accepted at ASRU 2023
Published: 2023

13. Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

Author: Masuyama, Yoshiki, Chang, Xuankai, Zhang, Wangyou, Cornell, Samuele, Wang, Zhong-Qiu, Ono, Nobutaka, Qian, Yanmin, and Watanabe, Shinji
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance from the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set, significantly outperforming an existing mask-based MVDR beamforming and filterbank integration (28.9%)., Comment: Accepted to IEEE WASPAA 2023
Published: 2023

14. Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Author: Zhang, Wangyou and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data. On the other hand, the use of weakly-supervised data is less explored for speech pre-training. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It adopts a similar training procedure to the widely-used masked speech prediction based SSL framework, while incorporating additional target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, allowing potential applications to tasks such as target speech recognition. Our experiments on Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance compared to WavLM, the state-of-the-art SSL model with denoising capability., Comment: Accepted by Interspeech; 5 pages, 1 figure, 3 tables
Published: 2023

15. ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

Author: Lu, Yen-Ju, Chang, Xuankai, Li, Chenda, Zhang, Wangyou, Cornell, Samuele, Ni, Zhaoheng, Masuyama, Yoshiki, Yan, Brian, Scheibler, Robin, Wang, Zhong-Qiu, Tsao, Yu, Qian, Yanmin, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front-ends with other tasks, including automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU). To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research. In addition to these new tasks, we also use CHiME-4 and WSJ0-2Mix to benchmark multi- and single-channel SE approaches. Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR, especially in the multi-channel scenario. The code is available online at https://github.com/ESPnet/ESPnet. The multi-channel ST and SLU datasets, which are another contribution of this work, are released on HuggingFace., Comment: To appear in Interspeech 2022
Published: 2022

16. End-to-End Multi-speaker ASR with Independent Vector Analysis

Author: Scheibler, Robin, Zhang, Wangyou, Chang, Xuankai, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition. We propose a frontend for joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm. It uses the fast and stable iterative source steering algorithm together with a neural source model. The parameters from the ASR module and the neural source model are optimized jointly from the ASR loss itself. We demonstrate competitive performance with previous systems using neural beamforming frontends. First, we explore the trade-offs when using various number of channels for training and testing. Second, we demonstrate that the proposed IVA frontend performs well on noisy data, even when trained on clean mixtures only. Furthermore, it extends without retraining to the separation of more speakers, which is demonstrated on mixtures of three and four speakers., Comment: Submitted to INTERSPEECH2022. 5 pages, 2 figures, 3 tables
Published: 2022

17. Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge

Author: Lu, Yen-Ju, Cornell, Samuele, Chang, Xuankai, Zhang, Wangyou, Li, Chenda, Ni, Zhaoheng, Wang, Zhong-Qiu, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of speech enhancement with 3D Ambisonic microphones. The core of our approach combines Deep Neural Network (DNN) driven complex spectral mapping with linear beamformers such as the multi-frame multi-channel Wiener filter. Our proposed system has two DNNs and a linear beamformer in between. Both DNNs are trained to perform complex spectral mapping, using a combination of waveform and magnitude spectrum losses. The estimated signal from the first DNN is used to drive a linear beamformer, and the beamforming result, together with this enhanced signal, are used as extra inputs for the second DNN which refines the estimation. Then, from this new estimated signal, the linear beamformer and second DNN are run iteratively. The proposed method was ranked first in the challenge, achieving, on the evaluation set, a ranking metric of 0.984, versus 0.833 of the challenge baseline., Comment: to be published in IEEE ICASSP 2022
Published: 2022

18. Separating Long-Form Speech with Group-Wise Permutation Invariant Training

Author: Zhang, Wangyou, Chen, Zhuo, Kanda, Naoyuki, Liu, Shujie, Li, Jinyu, Eskimez, Sefik Emre, Yoshioka, Takuya, Xiao, Xiong, Meng, Zhong, Qian, Yanmin, and Wei, Furu
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Multi-talker conversational speech processing has drawn much interest for various applications such as meeting transcription. Speech separation is often required to handle overlapped speech that is commonly observed in conversation. Although the original utterance-level permutation invariant training-based continuous speech separation approach has proven to be effective in various conditions, it lacks the ability to leverage the long-span relationship of utterances and is computationally inefficient due to the highly overlapped sliding windows. To overcome these drawbacks, we propose a novel training scheme named Group-PIT, which allows direct training of the speech separation models on the long-form speech with a low computational cost for label assignment. Two different speech separation approaches with Group-PIT are explored, including direct long-span speech separation and short-span speech separation with long-span tracking. The experiments on the simulated meeting-style data demonstrate the effectiveness of our proposed approaches, especially in dealing with a very long speech input., Comment: 5 pages, 3 figures, 3 tables, submitted to IEEE ICASSP 2022
Published: 2021

19. Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

Author: Zhang, Wangyou, Shi, Jing, Li, Chenda, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The deep learning based time-domain models, e.g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on the time-domain speech enhancement model are done in simulated conditions, and it is not well studied whether the good performance can generalize to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulation and real data. Our preliminary experiments show a large performance gap between the two conditions in terms of the ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus show that our proposed approaches can greatly reduce the speech recognition performance discrepancy between simulation and real data, while preserving the strong speech enhancement capability in the frontend., Comment: 5 pages, 3 figures, accepted by IEEE WASPAA 2021
Published: 2021

20. End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

Author: Zhang, Wangyou, Boeddeker, Christoph, Watanabe, Shinji, Nakatani, Tomohiro, Delcroix, Marc, Kinoshita, Keisuke, Ochiai, Tsubasa, Kamo, Naoyuki, Haeb-Umbach, Reinhold, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel multi-speaker reverberant condition, and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks including voice-activity-detection-like masks. The techniques significantly stabilize the end-to-end training process. The experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves about 35% relative WER reduction compared to our conventional multi-channel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR=12.5 dB) in the reverberant multi-speaker condition while trained only with the ASR criterion., Comment: 5 pages, 1 figure, accepted by ICASSP 2021
Published: 2021

21. The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

Author: Watanabe, Shinji, Boyer, Florian, Chang, Xuankai, Guo, Pengcheng, Hayashi, Tomoki, Higuchi, Yosuke, Hori, Takaaki, Huang, Wen-Chin, Inaguma, Hirofumi, Kamo, Naoyuki, Karita, Shigeki, Li, Chenda, Shi, Jing, Subramanian, Aswin Shanmugam, and Zhang, Wangyou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. Also, ESPnet provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating transformer, advanced data augmentation, and conformer. This project aims to provide up-to-date speech processing experience to the community so that researchers in academia and various industry scales can develop their technologies collaboratively.
Published: 2020

22. Convolutive Transfer Function Invariant SDR training criteria for Multi-Channel Reverberant Speech Separation

Author: Boeddeker, Christoph, Zhang, Wangyou, Nakatani, Tomohiro, Kinoshita, Keisuke, Ochiai, Tsubasa, Delcroix, Marc, Kamo, Naoyuki, Qian, Yanmin, and Haeb-Umbach, Reinhold
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain training objective function. For the objective we propose to use a convolutive transfer function invariant Signal-to-Distortion Ratio (CI-SDR) based loss. While this is a well-known evaluation metric (BSS Eval), it has not been used as a training objective before. To show the effectiveness, we demonstrate the performance on LibriSpeech based reverberant mixtures. On this task, the proposed system approaches the error rate obtained on single-source non-reverberant input, i.e., LibriSpeech test_clean, with a difference of only 1.2 percentage points, thus outperforming a conventional permutation invariant training based system and alternative objectives like Scale Invariant Signal-to-Distortion Ratio by a large margin., Comment: Accepted by ICASSP 2021
Published: 2020

23. ESPnet-SE: End-to-End Speech Enhancement and Separation Toolkit Designed for ASR Integration

Author: Li, Chenda, Shi, Jing, Zhang, Wangyou, Subramanian, Aswin Shanmugam, Chang, Xuankai, Kamo, Naoyuki, Hira, Moto, Hayashi, Tomoki, Boeddeker, Christoph, Chen, Zhuo, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with the optional downstream speech recognition module. ESPnet-SE is a new project which integrates rich automatic speech recognition related models, resources and systems to support and validate the proposed front-end implementation (i.e. speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with various functionalities including dereverberation, denoising and source separation. We provide all-in-one recipes including data pre-processing, feature extraction, training and evaluation pipelines for a wide range of benchmark datasets. This paper describes the design of the toolkit, several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits, and experimental results with major benchmark datasets., Comment: Accepted by SLT 2021
Published: 2020
Full Text: View/download PDF

24. Recent Developments on ESPnet Toolkit Boosted by Conformer

Author: Guo, Pengcheng, Boyer, Florian, Chang, Xuankai, Hayashi, Tomoki, Higuchi, Yosuke, Inaguma, Hirofumi, Kamo, Naoyuki, Li, Chenda, Garcia-Romero, Daniel, Shi, Jiatong, Shi, Jing, Watanabe, Shinji, Wei, Kun, Zhang, Wangyou, and Zhang, Yuekai
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translations (ST), speech separation (SS) and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with or even outperform the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing state-of-the-art research environments usually requiring high resources.
Published: 2020

25. End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Author: Zhang, Wangyou, Subramanian, Aswin Shanmugam, Chang, Xuankai, Watanabe, Shinji, and Qian, Yanmin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Despite successful applications of end-to-end approaches in multi-channel speech recognition, the performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate the dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multi-source mask-based weighted prediction error (WPE) module is incorporated in the frontend for dereverberation. Second, another novel frontend architecture is proposed, which extends the weighted power minimization distortionless response (WPD) convolutional beamformer to perform simultaneous separation and dereverberation. We derive a new formulation from the original WPD, which can handle multi-source input, and replace eigenvalue decomposition with the matrix inverse operation to make the back-propagation algorithm more stable. The above two architectures are optimized in a fully end-to-end manner, only using the speech recognition criterion. Experiments on both spatialized wsj1-2mix corpus and REVERB show that our proposed model outperformed the conventional methods in reverberant scenarios., Comment: 5 pages, 3 figures, conference
Published: 2020
Full Text: View/download PDF

26. End-to-End Multi-speaker Speech Recognition with Transformer

Author: Chang, Xuankai, Zhang, Wangyou, Qian, Yanmin, Roux, Jonathan Le, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down to 12.1% and 6.4% WER, under the anechoic condition in single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER., Comment: To appear in ICASSP 2020
Published: 2020

27. MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Author: Chang, Xuankai, Zhang, Wangyou, Qian, Yanmin, Roux, Jonathan Le, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It is comprised of: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopted a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model can achieve more than 60% WER reduction compared to the single-channel system with high quality enhanced signals (SI-SDR = 23.1 dB) obtained by the above separation function., Comment: Accepted at ASRU 2019
Published: 2019

28. A Comparative Study on Transformer vs RNN in Speech Applications

Author: Karita, Shigeki, Chen, Nanxin, Hayashi, Tomoki, Hori, Takaaki, Inaguma, Hirofumi, Jiang, Ziyan, Someki, Masao, Soplin, Nelson Enrique Yalta, Yamamoto, Ryuichi, Wang, Xiaofei, Watanabe, Shinji, Yoshimura, Takenori, and Zhang, Wangyou
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks for the community to succeed our exciting outcomes., Comment: Accepted at ASRU 2019
Published: 2019
Full Text: View/download PDF

29. Improving Design of Input Condition Invariant Speech Enhancement

Author: Zhang, Wangyou, primary, Jung, Jee-weon, additional, and Qian, Yanmin, additional
Published: 2024
Full Text: View/download PDF

30. Generation-Based Target Speech Extraction with Speech Discretization and Vocoder

Author: Yu, Linfeng, primary, Zhang, Wangyou, additional, Du, Chenpeng, additional, Zhang, Leying, additional, Liang, Zheng, additional, and Qian, Yanmin, additional
Published: 2024
Full Text: View/download PDF

31. A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, And Extraction

Author: Saijo, Kohei, primary, Zhang, Wangyou, additional, Wang, Zhong-Qiu, additional, Watanabe, Shinji, additional, Kobayashi, Tetsunori, additional, and Ogawa, Tetsuji, additional
Published: 2023
Full Text: View/download PDF

32. Toward Universal Speech Enhancement For Diverse Input Conditions

Author: Zhang, Wangyou, primary, Saijo, Kohei, additional, Wang, Zhong-Qiu, additional, Watanabe, Shinji, additional, and Qian, Yanmin, additional
Published: 2023
Full Text: View/download PDF

33. Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing

Author: Zhang, Wangyou, primary, Yang, Lei, additional, and Qian, Yanmin, additional
Published: 2023
Full Text: View/download PDF

34. Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data

Author: Peng, Yifan, primary, Tian, Jinchuan, additional, Yan, Brian, additional, Berrebbi, Dan, additional, Chang, Xuankai, additional, Li, Xinjian, additional, Shi, Jiatong, additional, Arora, Siddhant, additional, Chen, William, additional, Sharma, Roshan, additional, Zhang, Wangyou, additional, Sudo, Yui, additional, Shakeel, Muhammad, additional, Jung, Jee-Weon, additional, Maiti, Soumi, additional, and Watanabe, Shinji, additional
Published: 2023
Full Text: View/download PDF

35. Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning

Author: Chen, William, primary, Shi, Jiatong, additional, Yan, Brian, additional, Berrebbi, Dan, additional, Zhang, Wangyou, additional, Peng, Yifan, additional, Chang, Xuankai, additional, Maiti, Soumi, additional, and Watanabe, Shinji, additional
Published: 2023
Full Text: View/download PDF

36. Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

Author: Masuyama, Yoshiki, primary, Chang, Xuankai, additional, Zhang, Wangyou, additional, Cornell, Samuele, additional, Wang, Zhong-Qiu, additional, Ono, Nobutaka, additional, Qian, Yanmin, additional, and Watanabe, Shinji, additional
Published: 2023
Full Text: View/download PDF

37. Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Author: Zhang, Wangyou, primary and Qian, Yanmin, additional
Published: 2023
Full Text: View/download PDF

38. Overlap Aware Continuous Speech Separation without Permutation Invariant Training

Author: Yu, Linfeng, primary, Zhang, Wangyou, additional, Li, Chenda, additional, and Qian, Yanmin, additional
Published: 2023
Full Text: View/download PDF

39. Two-Stage Single-Channel Speech Enhancement with Multi-Frame Filtering

Author: Lin, Shaoxiong, primary, Zhang, Wangyou, additional, and Qian, Yanmin, additional
Published: 2023
Full Text: View/download PDF

40. End-to-End Multi-Speaker ASR with Independent Vector Analysis

Author: Scheibler, Robin, primary, Zhang, Wangyou, additional, Chang, Xuankai, additional, Watanabe, Shinji, additional, and Qian, Yanmin, additional
Published: 2023
Full Text: View/download PDF

41. Text-Informed Knowledge Distillation for Robust Speech Enhancement and Recognition

Author: Wang, Wei, primary, Zhang, Wangyou, additional, Lin, Shaoxiong, additional, and Qian, Yanmin, additional
Published: 2022
Full Text: View/download PDF

42. ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

Author: Lu, Yen-Ju, primary, Chang, Xuankai, additional, Li, Chenda, additional, Zhang, Wangyou, additional, Cornell, Samuele, additional, Ni, Zhaoheng, additional, Masuyama, Yoshiki, additional, Yan, Brian, additional, Scheibler, Robin, additional, Wang, Zhong-Qiu, additional, Tsao, Yu, additional, Qian, Yanmin, additional, and Watanabe, Shinji, additional
Published: 2022
Full Text: View/download PDF

43. Separating Long-Form Speech with Group-wise Permutation Invariant Training

Author: Zhang, Wangyou, primary, Chen, Zhuo, additional, Kanda, Naoyuki, additional, Liu, Shujie, additional, Li, Jinyu, additional, Emre Eskimez, Sefik, additional, Yoshioka, Takuya, additional, Xiao, Xiong, additional, Meng, Zhong, additional, Qian, Yanmin, additional, and Wei, Furu, additional
Published: 2022
Full Text: View/download PDF

44. A Heterogeneous Graph to Abstract Syntax Tree Framework for Text-to-SQL

Author: Cao, Ruisheng, Chen, Lu, Li, Jieyu, Zhang, Hanchong, Xu, Hongshen, Zhang, Wangyou, and Yu, Kai
Abstract: Text-to-SQL is the task of converting a natural language utterance plus the corresponding database schema into a SQL program. The inputs naturally form a heterogeneous graph while the output SQL can be transduced into an abstract syntax tree (AST). Traditional encoder-decoder models ignore higher-order semantics in heterogeneous graph encoding and introduce permutation biases during AST construction, thus incapable of exploiting the refined structure knowledge precisely. In this work, we propose a generic heterogeneous graph to abstract syntax tree (HG2AST) framework to integrate dedicated structure knowledge into statistics-based models. On the encoder side, we leverage a line graph enhanced encoder (LGESQL) to iteratively update both node and edge features through dual graph message passing and aggregation. On the decoder side, a grammar-based decoder first constructs the equivalent SQL AST and then transforms it into the desired SQL via post-processing. To avoid over-fitting permutation biases, we propose a golden tree-oriented learning (GTL) algorithm to adaptively control the expanding order of AST nodes. The graph encoder and tree decoder are combined into a unified framework through two auxiliary modules. Extensive experiments on various text-to-SQL datasets, including single/multi-table, single/cross-domain, and multilingual settings, demonstrate the superiority and broad applicability.
Published: 2023
Full Text: View/download PDF

45. The SJTU System For Multimodal Information Based Speech Processing Challenge 2021

Author: Wang, Wei, primary, Gong, Xun, additional, Wu, Yifei, additional, Zhou, Zhikai, additional, Li, Chenda, additional, Zhang, Wangyou, additional, Han, Bing, additional, and Qian, Yanmin, additional
Published: 2022
Full Text: View/download PDF

46. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPnet-SE Submission to the L3DAS22 Challenge

Author: Lu, Yen-Ju, primary, Cornell, Samuele, additional, Chang, Xuankai, additional, Zhang, Wangyou, additional, Li, Chenda, additional, Ni, Zhaoheng, additional, Wang, Zhong-Qiu, additional, and Watanabe, Shinji, additional
Published: 2022
Full Text: View/download PDF

47. Text Adaptive Detection for Customizable Keyword Spotting

Author: Xi, Yu, primary, Tan, Tian, additional, Zhang, Wangyou, additional, Yang, Baochen, additional, and Yu, Kai, additional
Published: 2022
Full Text: View/download PDF

48. Exploring Effective Data Utilization for Low-Resource Speech Recognition

Author: Zhou, Zhikai, primary, Wang, Wei, additional, Zhang, Wangyou, additional, and Qian, Yanmin, additional
Published: 2022
Full Text: View/download PDF

49. Ultrasound-Assisted Enzymatic Extraction of Polysaccharides from Waste Corn Bract: Process Optimization, Characterization, Antioxidant and Anti-Diabetic Potentials

Author: Liu, Yihui, primary, Li, Yayi, additional, Niu, Na, additional, Xie, Yuxing, additional, Zhang, Wangyou, additional, Dong, Shuaiyi, additional, Pu, Gulei, additional, Liu, Chenqi, additional, Jiang, Caibo, additional, Cai, Mingjin, additional, Liu, Yang, additional, and Zhang, Yang, additional
Published: 2022
Full Text: View/download PDF

50. End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Author: Zhang, Wangyou, primary, Chang, Xuankai, additional, Boeddeker, Christoph, additional, Nakatani, Tomohiro, additional, Watanabe, Shinji, additional, and Qian, Yanmin, additional
Published: 2022
Full Text: View/download PDF
