Author: "Guo, Pengcheng" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Guo, Pengcheng"' showing total 758 results

Start Over Author "Guo, Pengcheng"

758 results on '"Guo, Pengcheng"'

1. SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Author: Guo, Pengcheng, Chang, Xuankai, Lv, Hang, Watanabe, Shinji, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities to a wide range of downstream tasks. However, a limitation arises from their exclusive handling of single-speaker speech input, making them ineffective in recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we delve into the adaptation of speech foundation models to eliminate interfering speakers from overlapping speech and perform target-speaker automatic speech recognition (TS-ASR). Initially, we utilize the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose an innovative model termed Speaker-Querying Whisper (SQ-Whisper), which employs a set number of trainable queries to capture speaker prompts from overlapping speech based on target-speaker enrollment. These prompts serve to steer the model in extracting speaker-specific features and accurately recognizing target-speaker transcriptions. Experimental results demonstrate that our approach effectively adapts the pre-trained speech foundation model to TS-ASR. Compared with the robust TS-HuBERT model, the proposed SQ-Whisper significantly improves performance, yielding up to 15% and 10% relative reductions in word error rates (WERs) on the Libri2Mix and WSJ0-2Mix datasets, respectively. With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix Test set and 4.4% on the WSJ0-2Mix Test set. Furthermore, we evaluate our model on the real-world AMI meeting dataset, which shows consistent improvement over other adaptation methods., Comment: Accepted by IEEE/ACM TASLP
Published: 2024

2. Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge

Author: Liu, Shuiyun, Kong, Yuxiang, Guo, Pengcheng, Zhuang, Weiji, Gao, Peng, Wang, Yujun, and Xie, Lei
Subjects: Computer Science - Sound, Computer Science - Human-Computer Interaction, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech has emerged as a widely embraced user interface across diverse applications. However, for individuals with dysarthria, the inherent variability in their speech poses significant challenges. This paper presents an end-to-end Pretrain-based Dual-filter Dysarthria Wake-up word Spotting (PD-DWS) system for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge. Specifically, our system improves performance from two key perspectives: audio modeling and dual-filter strategy. For audio modeling, we propose an innovative 2branch-d2v2 model based on the pre-trained data2vec2 (d2v2), which can simultaneously model automatic speech recognition (ASR) and wake-up word spotting (WWS) tasks through a unified multi-task finetuning paradigm. Additionally, a dual-filter strategy is introduced to reduce the false accept rate (FAR) while maintaining the same false reject rate (FRR). Experimental results demonstrate that our PD-DWS system achieves an FAR of 0.00321 and an FRR of 0.005, with a total score of 0.00821 on the test-B eval set, securing first place in the challenge., Comment: 8 pages, Accepted to SLT 2024
Published: 2024

3. NPU-NTU System for Voice Privacy 2024 Challenge

Author: Yao, Jixun, Kuzmin, Nikita, Wang, Qing, Guo, Pengcheng, Ning, Ziqian, Guo, Dake, Lee, Kong Aik, Chng, Eng-Siong, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper, we describe our proposed speaker anonymization system for VPC 2024. Our system employs a disentangled neural codec architecture and a serial disentanglement strategy to gradually disentangle the global speaker identity and time-variant linguistic content and paralinguistic information. We introduce multiple distillation methods to disentangle linguistic content, speaker identity, and emotion. These methods include semantic distillation, supervised speaker distillation, and frame-level emotion distillation. Based on these distillations, we anonymize the original speaker identity using a weighted sum of a set of candidate speaker identities and a randomly generated speaker identity. Our system achieves the best trade-off of privacy protection and emotion preservation in VPC 2024., Comment: System description for VPC 2024
Published: 2024

4. Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models

Author: Yang, Yuncheng, Qin, Yulei, Wu, Tong, Xu, Zihan, Li, Gang, Guo, Pengcheng, Shao, Hang, Shi, Yuchen, Li, Ke, Sun, Xing, Yang, Jie, and Gu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: The cultivation of expertise for large language models (LLMs) to solve tasks of specific areas often requires special-purpose tuning with calibrated behaviors on the expected stable outputs. To avoid huge cost brought by manual preparation of instruction datasets and training resources up to hundreds of hours, the exploitation of open knowledge including a wealth of low rank adaptation (LoRA) models and instruction datasets serves as a good starting point. However, existing methods on model and data selection focus on the performance of general-purpose capabilities while neglecting the knowledge gap exposed in domain-specific deployment. In the present study, we propose to bridge such gap by introducing few human-annotated samples (i.e., K-shot) for advancing task expertise of LLMs with open knowledge. Specifically, we develop an efficient and scalable pipeline to cost-efficiently produce task experts where K-shot data intervene in selecting the most promising expert candidates and the task-relevant instructions. A mixture-of-expert (MoE) system is built to make the best use of individual-yet-complementary knowledge between multiple experts. We unveil the two keys to the success of a MoE system, 1) the abidance by K-shot, and 2) the insistence on diversity. For the former, we ensure that models that truly possess problem-solving abilities on K-shot are selected rather than those blind guessers. Besides, during data selection, instructions that share task-relevant contexts with K-shot are prioritized. For the latter, we highlight the diversity of constituting experts and that of the fine-tuning instructions throughout the model and data selection process. Extensive experimental results confirm the superiority of our approach over existing methods on utilization of open knowledge across various tasks. Our codes will be available at https://github.com/Yaphabates/Rocket., Comment: 29 pages, 12 tables, 10 figures
Published: 2024

5. Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper

Author: Xu, Tianyi, Huang, Kaixun, Guo, Pengcheng, Zhou, Yu, Huang, Longtao, Xue, Hui, and Xie, Lei
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Pre-trained multilingual speech foundation models, like Whisper, have shown impressive performance across different languages. However, adapting these models to new or specific languages is computationally extensive and faces catastrophic forgetting problems. Addressing these issues, our study investigates strategies to enhance the model on new languages in the absence of original training data, while also preserving the established performance on the original languages. Specifically, we first compare various LoRA-based methods to find out their vulnerability to forgetting. To mitigate this issue, we propose to leverage the LoRA parameters from the original model for approximate orthogonal gradient descent on the new samples. Additionally, we also introduce a learnable rank coefficient to allocate trainable parameters for more efficient training. Our experiments with a Chinese Whisper model (for Uyghur and Tibetan) yield better results with a more compact parameter set.
Published: 2024

6. Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Author: Qin, Yulei, Yang, Yuncheng, Guo, Pengcheng, Li, Gang, Shao, Hang, Shi, Yuchen, Xu, Zihan, Gu, Yun, Li, Ke, and Sun, Xing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Signal Processing
Abstract: Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Despite the vast amount of open instruction datasets, naively training a LLM on all existing instructions may not be optimal and practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, under the context of instruction tuning, there still exists a gap in knowledge on what kind of data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones where a unified, fine-grained taxonomy is structured. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, comparison between the latest methods is conducted on their officially reported results to provide in-depth discussions on their limitations. Finally, we summarize the open challenges and propose the promosing avenues for future studies. All related contents are available at https://github.com/yuleiqin/fantastic-data-engineering., Comment: review, survey, 37 pages, 5 figures, 4 tables
Published: 2024

7. CRMSP: A Semi-supervised Approach for Key Information Extraction with Class-Rebalancing and Merged Semantic Pseudo-Labeling

Author: Zhang, Qi, Song, Yonghong, Guo, Pengcheng, and Hui, Yangyang
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: There is a growing demand in the field of KIE (Key Information Extraction) to apply semi-supervised learning to save manpower and costs, as training document data using fully-supervised methods requires labor-intensive manual annotation. The main challenges of applying SSL in the KIE are (1) underestimation of the confidence of tail classes in the long-tailed distribution and (2) difficulty in achieving intra-class compactness and inter-class separability of tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). Firstly, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Secondly, we propose the Merged Semantic Pseudo-Labeling (MSP) module to cluster tail features of unlabeled data by assigning samples to Merged Prototypes (MP). Additionally, we designed a new contrastive loss specifically for MSP. Extensive experimental results on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Remarkably, CRMSP achieves 3.24% f1-score improvement over state-of-the-art on the CORD.
Published: 2024

8. MUSA: Multi-lingual Speaker Anonymization via Serial Disentanglement

Author: Yao, Jixun, Wang, Qing, Guo, Pengcheng, Ning, Ziqian, Yang, Yuguang, Pan, Yu, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker anonymization is an effective privacy protection solution designed to conceal the speaker's identity while preserving the linguistic content and para-linguistic information of the original speech. While most prior studies focus solely on a single language, an ideal speaker anonymization system should be capable of handling multiple languages. This paper proposes MUSA, a Multi-lingual Speaker Anonymization approach that employs a serial disentanglement strategy to perform a step-by-step disentanglement from a global time-invariant representation to a temporal time-variant representation. By utilizing semantic distillation and self-supervised speaker distillation, the serial disentanglement strategy can avoid strong inductive biases and exhibit superior generalization performance across different languages. Meanwhile, we propose a straightforward anonymization strategy that employs empty embedding with zero values to simulate the speaker identity concealment process, eliminating the need for conversion to a pseudo-speaker identity and thereby reducing the complexity of speaker anonymization process. Experimental results on VoicePrivacy official datasets and multi-lingual datasets demonstrate that MUSA can effectively protect speaker privacy while preserving linguistic content and para-linguistic information., Comment: Submitted to TASLP
Published: 2024

9. Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix

Author: Yao, Jixun, Wang, Qing, Guo, Pengcheng, Ning, Ziqian, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker anonymization is an effective privacy protection solution that aims to conceal the speaker's identity while preserving the naturalness and distinctiveness of the original speech. Mainstream approaches use an utterance-level vector from a pre-trained automatic speaker verification (ASV) model to represent speaker identity, which is then averaged or modified for anonymization. However, these systems suffer from deterioration in the naturalness of anonymized speech, degradation in speaker distinctiveness, and severe privacy leakage against powerful attackers. To address these issues and especially generate more natural and distinctive anonymized speech, we propose a novel speaker anonymization approach that models a matrix related to speaker identity and transforms it into an anonymized singular value transformation-assisted matrix to conceal the original speaker identity. Our approach extracts frame-level speaker vectors from a pre-trained ASV model and employs an attention mechanism to create a speaker-score matrix and speaker-related tokens. Notably, the speaker-score matrix acts as the weight for the corresponding speaker-related token, representing the speaker's identity. The singular value transformation-assisted matrix is generated by recomposing the decomposed orthonormal eigenvectors matrix and non-linear transformed singular through Singular Value Decomposition (SVD). Experiments on VoicePrivacy Challenge datasets demonstrate the effectiveness of our approach in protecting speaker privacy under all attack scenarios while maintaining speech naturalness and distinctiveness., Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2024

10. Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Author: Geng, Xuelong, Xu, Tianyi, Wei, Kun, Mu, Bingshen, Xue, Hongfei, Wang, He, Li, Yangze, Guo, Pengcheng, Dai, Yuhang, Li, Longhao, Shao, Mingchen, and Xie, Lei
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.
Published: 2024

11. Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

Author: Wang, He, Guo, Pengcheng, Wan, Xucheng, Zhou, Huan, and Xie, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches only use a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multi-encoder. Specifically, we first propose a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face and an Enhanced ResNet3D visual front-end (VFE) to extract lip features at different scales. For the multi-encoder, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance and fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Finally, our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set., Comment: 6 pages, 3 figures, Accepted at ICMEW 2024
Published: 2024

12. An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

Author: Han, Runduo, Yan, Xiaopeng, Xu, Weiming, Guo, Pengcheng, Sun, Jiayao, Wang, He, Lu, Quan, Jiang, Ning, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benifits the back-end automatic speech recognition (ASR) systems. Experiments show that our approach achieves a character error rate (CER) of 24.2% and 33.2% on the Dev and Eval set, respectively, obtaining the second place in the challenge., Comment: Accepted by ICASSP 2024
Published: 2024

13. The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

Author: Wang, He, Guo, Pengcheng, Chen, Wei, Zhou, Pan, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multi-scale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion, ranking first place in all three tracks we participate., Comment: Included in CNVSRC Workshop 2023, NCMMSC 2023
Published: 2024

14. ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

Author: Wang, He, Guo, Pengcheng, Li, Yue, Zhang, Ao, Sun, Jiayao, Xie, Lei, Chen, Wei, Zhou, Pan, Bu, Hui, Xu, Xin, Zhang, Binbin, Chen, Zhuo, Wu, Jian, Wang, Longbiao, Chng, Eng Siong, and Li, Sun
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours of noise for data augmentation. Two tracks, including automatic speech recognition (ASR) and automatic speech diarization and recognition (ASDR) are set up, using character error rate (CER) and concatenated minimum permutation character error rate (cpCER) as evaluation metrics, respectively. Overall, the ICMC-ASR Challenge attracts 98 participating teams and receives 53 valid results in both tracks. In the end, first-place team USTCiflytek achieves a CER of 13.16% in the ASR track and a cpCER of 21.48% in the ASDR track, showing an absolute improvement of 13.08% and 51.4% compared to our challenge baseline, respectively., Comment: Accepted at ICASSP 2024
Published: 2024

15. MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

Author: Wang, He, Guo, Pengcheng, Zhou, Pan, and Xie, Lei
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to 3.17% relative improvement compared with our previous system which ranked the second place in the challenge. Following the fusion of multiple systems, our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset., Comment: 5 pages, 3 figures Accepted at ICASSP 2024
Published: 2024
Full Text: View/download PDF

16. Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

Author: Mu, Bingshen, Guo, Pengcheng, Guo, Dake, Zhou, Pan, Chen, Wei, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. First of all, we propose two attention-based automatic channel selection modules to select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Furthermore, we introduce inter-channel spatial features to augment the effectiveness of multi-frame cross-channel attention, aiding it in improving the capability of spatial information awareness. Finally, we propose a multi-layer convolution fusion module drawing inspiration from the U-Net architecture to integrate the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation demonstrate that the improvements introduced in our proposed ASR system lead to a relative reduction of 40.1% in the Macro Diarization Attributed Word Error Rates (DA-WER) when compared to the baseline ASR system on the Eval sets., Comment: Accepted by ICASSP 2024
Published: 2023

17. Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

Author: Shao, Qijie, Guo, Pengcheng, Yan, Jinghao, Hu, Pengfei, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related accent characteristics, while coarse-grained units are better for learning linguistic information. Moreover, an explicit interaction of two tasks can also provide complementary information and improve the performance of each other, but it is rarely used by existing approaches. In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which is comprised of a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. Specifically, AR and ASR are first decoupled by separated branches and two-granular modeling units to learn task-specific representations. The AR branch is from our previously proposed linguistic-acoustic bimodal AR model and the ASR branch is an encoder-decoder based Conformer model. Then, for the task interaction, the CTC branch provides aligned text for the AR task, while accent embeddings extracted from our AR model are incorporated into the ASR branch's encoder and decoder. Finally, during ASR inference, a cross-granular rescoring method is introduced to fuse the complementary information from the CTC and attention decoder after the decoupling. Our experiments on English and Chinese datasets demonstrate the effectiveness of the proposed model, which achieves 21.45%/28.53% AR accuracy relative improvement and 32.33%/14.55% ASR error rate relative reduction over a published standard baseline, respectively., Comment: Accepted by IEEE Transactions on Audio, Speech and Language Processing (TASLP)
Published: 2023
Full Text: View/download PDF

18. SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Author: Li, Yangze, Yu, Fan, Liang, Yuhao, Guo, Pengcheng, Shi, Mohan, Du, Zhihao, Zhang, Shiliang, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR).Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we introduce a recently proposed non-autoregressive model Paraformer as an acoustic model in the SA-ASR model.Paraformer uses a single-step decoder to enable parallel generation, obtaining comparable performance to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 RTF compared with the SOTA joint AR SA-ASR model.
Published: 2023

19. A single-cell chromatin accessibility dataset of human primed and naïve pluripotent stem cell-derived teratoma

Author: Li, Jinxiu, Fu, Lixin, Li, Yunpan, Sun, Wei, Yi, Yao, Jia, Wenqi, Li, Haiwei, Liu, Hao, Guo, Pengcheng, Wang, Yang, Shen, Yue, Zhang, Xiuqing, Lv, Yuan, Qin, Baoming, Li, Wenjuan, Liu, Chuanyu, Liu, Longqi, Mazid, Md. Abdul, Lai, Yiwei, Esteban, Miguel A., Jiang, Yu, and Wu, Liang
Published: 2024
Full Text: View/download PDF

20. A novel approach to alleviate acetaminophen-induced hepatotoxicity with hybrid balloon flower root-derived exosome-like nanoparticles (BDEs) with silymarin via inhibition of hepatocyte MAPK pathway and apoptosis

Author: Kim, Jisu, Gao, Chao, Guo, Pengcheng, Sheng, Jianyong, and Wang, Jianxin
Published: 2024
Full Text: View/download PDF

21. Understanding vegetation phenology responses to easily ignored climate factors in china's mid-high latitudes

Author: Wang, Qianfeng, Chen, Huixia, Xu, Feng, Bento, Virgílio A., Zhang, Rongrong, Wu, Xiaoping, and Guo, Pengcheng
Published: 2024
Full Text: View/download PDF

22. Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

Author: Chang, Xuankai, Yan, Brian, Choi, Kwanghee, Jung, Jeeweon, Lu, Yichen, Maiti, Soumi, Sharma, Roshan, Shi, Jiatong, Tian, Jinchuan, Watanabe, Shinji, Fujita, Yuya, Maekaku, Takashi, Guo, Pengcheng, Cheng, Yao-Fei, Denisov, Pavel, Saijo, Kohei, and Wang, Hsiu-Hsuan
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compresses the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while retaining notable performance. In this study, we undertake a comprehensive and systematic exploration into the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all the settings. We intend to release our configurations and trained models to foster future research efforts., Comment: Submitted to IEEE ICASSP 2024
Published: 2023

23. Timbre-reserved Adversarial Attack in Speaker Identification

Author: Wang, Qing, Yao, Jixun, Zhang, Li, Guo, Pengcheng, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: As a type of biometric identification, a speaker identification (SID) system is confronted with various kinds of attacks. The spoofing attacks typically imitate the timbre of the target speakers, while the adversarial attacks confuse the SID system by adding a well-designed adversarial perturbation to an arbitrary speech. Although the spoofing attack copies a similar timbre as the victim, it does not exploit the vulnerability of the SID model and may not make the SID system give the attacker's desired decision. As for the adversarial attack, despite the SID system can be led to a designated decision, it cannot meet the specified text or speaker timbre requirements for the specific attack scenarios. In this study, to make the attack in SID not only leverage the vulnerability of the SID model but also reserve the timbre of the target speaker, we propose a timbre-reserved adversarial attack in the speaker identification. We generate the timbre-reserved adversarial audios by adding an adversarial constraint during the different training stages of the voice conversion (VC) model. Specifically, the adversarial constraint is using the target speaker label to optimize the adversarial perturbation added to the VC model representations and is implemented by a speaker classifier joining in the VC model training. The adversarial constraint can help to control the VC model to generate the speaker-wised audio. Eventually, the inference of the VC model is the ideal adversarial fake audio, which is timbre-reserved and can fool the SID system., Comment: 11 pages, 8 figures
Published: 2023

24. Effects of Rabbit Bone Marrow Mesenchymal Stem Cell Transplantation on Anterior Cruciate Ligament Injury Repair and Related Serum Indicators

Author: Guo, Pengcheng, Zhang, Ziyang, Guo, Renhao, and Cai, Chengfa
Published: 2024
Full Text: View/download PDF

25. Multimodal cell atlas of the ageing human skeletal muscle

Author: Lai, Yiwei, Ramírez-Pardo, Ignacio, Isern, Joan, An, Juan, Perdiguero, Eusebio, Serrano, Antonio L., Li, Jinxiu, García-Domínguez, Esther, Segalés, Jessica, Guo, Pengcheng, Lukesova, Vera, Andrés, Eva, Zuo, Jing, Yuan, Yue, Liu, Chuanyu, Viña, José, Doménech-Fernández, Julio, Gómez-Cabrera, Mari Carmen, Song, Yancheng, Liu, Longqi, Xu, Xun, Muñoz-Cánoves, Pura, and Esteban, Miguel A.
Published: 2024
Full Text: View/download PDF

26. A spatiotemporal atlas of cholestatic injury and repair in mice

Author: Wu, Baihua, Shentu, Xinyi, Nan, Haitao, Guo, Pengcheng, Hao, Shijie, Xu, Jiangshan, Shangguan, Shuncheng, Cui, Lei, Cen, Jin, Deng, Qiuting, Wu, Yan, Liu, Chang, Song, Yumo, Lin, Xiumei, Wang, Zhifeng, Yuan, Yue, Ma, Wen, Li, Ronghai, Li, Yikang, Qian, Qiwei, Du, Wensi, Lai, Tingting, Yang, Tao, Liu, Chuanyu, Ma, Xiong, Chen, Ao, Xu, Xun, Lai, Yiwei, Liu, Longqi, Esteban, Miguel A., and Hui, Lijian
Published: 2024
Full Text: View/download PDF

27. A spatiotemporal atlas of mouse liver homeostasis and regeneration

Author: Xu, Jiangshan, Guo, Pengcheng, Hao, Shijie, Shangguan, Shuncheng, Shi, Quan, Volpe, Giacomo, Huang, Keke, Zuo, Jing, An, Juan, Yuan, Yue, Cheng, Mengnan, Deng, Qiuting, Zhang, Xiao, Lai, Guangyao, Nan, Haitao, Wu, Baihua, Shentu, Xinyi, Wu, Liang, Wei, Xiaoyu, Jiang, Yujia, Huang, Xin, Pan, Fengyu, Song, Yumo, Li, Ronghai, Wang, Zhifeng, Liu, Chuanyu, Liu, Shiping, Li, Yuxiang, Yang, Tao, Xu, Zhicheng, Du, Wensi, Li, Ling, Ahmed, Tanveer, You, Kai, Dai, Zhen, Li, Li, Qin, Baoming, Li, Yinxiong, Lai, Liangxue, Qin, Dajiang, Chen, Junling, Fan, Rong, Li, Yongyin, Hou, Jinlin, Ott, Michael, Sharma, Amar Deep, Cantz, Tobias, Schambach, Axel, Kristiansen, Karsten, Hutchins, Andrew P., Göttgens, Berthold, Maxwell, Patrick H., Hui, Lijian, Xu, Xun, Liu, Longqi, Chen, Ao, Lai, Yiwei, and Esteban, Miguel A.
Published: 2024
Full Text: View/download PDF

28. TVDO: Tchebycheff Value-Decomposition Optimization for Multi-Agent Reinforcement Learning

Author: Hu, Xiaoliang, Guo, Pengcheng, Zhou, Chuanwei, Zhang, Tong, and Cui, Zhen
Subjects: Computer Science - Multiagent Systems
Abstract: In cooperative multi-agent reinforcement learning (MARL) settings, the centralized training with decentralized execution (CTDE) becomes customary recently due to the physical demand. However, the most dilemma is the inconsistency of jointly-trained policies and individually-optimized actions. In this work, we propose a novel value-based multi-objective learning approach, named Tchebycheff value decomposition optimization (TVDO), to overcome the above dilemma. In particular, a nonlinear Tchebycheff aggregation method is designed to transform the MARL task into multi-objective optimal counterpart by tightly constraining the upper bound of individual action-value bias. We theoretically prove that TVDO well satisfies the necessary and sufficient condition of individual global max (IGM) with no extra limitations, which exactly guarantees the consistency between the global and individual optimal action-value function. Empirically, in the climb and penalty game, we verify that TVDO represents precisely from global to individual value factorization with a guarantee of the policy consistency. Furthermore, we also evaluate TVDO in the challenging scenarios of StarCraft II micromanagement tasks, and extensive experiments demonstrate that TVDO achieves more competitive performances than several state-of-the-art MARL methods.
Published: 2023

29. Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition

Author: Xu, Tianyi, Yang, Zhanheng, Huang, Kaixun, Guo, Pengcheng, Zhang, Ao, Li, Biao, Chen, Changru, Li, Chao, and Xie, Lei
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach can achieve up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has a minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
Published: 2023

30. Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

Author: Wang, Qing, Yao, Jixun, Wang, Ziqian, Guo, Pengcheng, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) to not only exploit the weakness of the SID model but also preserve the timbre of the target speaker in a black-box attack setting. Particularly, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. Then, we leverage a pseudo-Siamese network architecture to learn from the black-box SID model constraining both intrinsic similarity and structural similarity simultaneously. The intrinsic similarity loss is to learn an intrinsic invariance, while the structural similarity loss is to ensure that the substitute SID model shares a similar decision boundary to the fixed black-box SID model. The substitute model can be used as a proxy to generate timbre-reserved fake audio for attacking. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset indicate that the attack success rate of our proposed approach yields up to 60.58% and 55.38% in the white-box and black-box scenarios, respectively, and can deceive both human beings and machines., Comment: 5 pages
Published: 2023

31. BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

Author: Liang, Yuhao, Yu, Fan, Li, Yangze, Guo, Pengcheng, Zhang, Shiliang, Chen, Qian, and Xie, Lei
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%., Comment: Accepted by INTERSPEECH 2023
Published: 2023

32. TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition

Author: Xue, Hongfei, Shao, Qijie, Chen, Peikun, Guo, Pengcheng, Xie, Lei, and Liu, Jie
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: UniSpeech has achieved superior performance in cross-lingual automatic speech recognition (ASR) by explicitly aligning latent representations to phoneme units using multi-task self-supervised learning. While the learned representations transfer well from high-resource to low-resource languages, predicting words directly from these phonetic representations in downstream ASR is challenging. In this paper, we propose TranUSR, a two-stage model comprising a pre-trained UniData2vec and a phoneme-to-word Transcoder. Different from UniSpeech, UniData2vec replaces the quantized discrete representations with continuous and contextual representations from a teacher model for phonetically-aware pre-training. Then, Transcoder learns to translate phonemes to words with the aid of extra texts, enabling direct word generation. Experiments on Common Voice show that UniData2vec reduces PER by 5.3% compared to UniSpeech, while Transcoder yields a 14.4% WER reduction compared to grapheme fine-tuning., Comment: 5 pages, 3 figures. Accepted by INTERSPEECH 2023
Published: 2023
Full Text: View/download PDF

33. Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network

Author: Huang, Kaixun, Zhang, Ao, Yang, Zhanheng, Guo, Pengcheng, Mu, Bingshen, Xu, Tianyi, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Contextual information plays a crucial role in speech recognition technologies and incorporating it into the end-to-end speech recognition models has drawn immense interest recently. However, previous deep bias methods lacked explicit supervision for bias tasks. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model. Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER of the context phrases decreases relatively by 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation when using a larger biasing list., Comment: Accepted by interspeech2023
Published: 2023

34. The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

Author: Guo, Pengcheng, Wang, He, Mu, Bingshen, Zhang, Ao, and Chen, Peikun
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. Specifically, the weighted prediction error (WPE) and guided source separation (GSS) techniques are used to reduce reverberation and generate clean signals for each single speaker first. Then, we explore the effectiveness of Branchformer and E-Branchformer based ASR systems. To better make use of the visual modality, a cross-attention based multi-modal fusion module is proposed, which explicitly learns the contextual relationship between different modalities. Experiments show that our system achieves a concatenated minimum-permutation character error rate (cpCER) of 28.13\% and 31.21\% on the Dev and Eval set, and obtains second place in the challenge., Comment: 2 pages, accepted by ICASSP 2023
Published: 2023

35. VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting

Author: Zhang, Ao, Wang, He, Guo, Pengcheng, Fu, Yihui, Xie, Lei, Gao, Yingying, Zhang, Shilei, and Feng, Junlan
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The performance of the keyword spotting (KWS) system based on audio modality, commonly measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages complementary relationships over multiple modalities, has recently gained much attention. However, current studies mainly focus on combining the exclusively learned representations of different modalities, instead of exploring the modal relationships during each respective modeling. In this paper, we propose a novel visual modality enhanced end-to-end KWS framework (VE-KWS), which fuses audio and visual modalities from two aspects. The first one is utilizing the speaker location information obtained from the lip region in videos to assist the training of multi-channel audio beamformer. By involving the beamformer as an audio enhancement module, the acoustic distortions, caused by the far field or noisy environments, could be significantly suppressed. The other one is conducting cross-attention between different modalities to capture the inter-modal relationships and help the representation learning of each modality. Experiments on the MSIP challenge corpus show that our proposed model achieves 2.79% false rejection rate and 2.95% false alarm rate on the Eval set, resulting in a new SOTA performance compared with the top-ranking systems in the ICASSP2022 MISP challenge., Comment: 5 pages. Accepted at ICASSP2023
Published: 2023

36. The Design of an Efficient Distributed Collaborative Scheduling Method and the Optimal Planning Strategy for Providing Photovoltaic Access Capacity

Author: Bai, Fan, Guo, Pengcheng, Wang, Luda, Sun, Liwei, Chu, Haifeng, Lv, Siyu, Liu, Hongzhe, Yu, Wenwu, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Hirche, Sandra, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Hua, Yongzhao, editor, Liu, Yishi, editor, and Han, Liang, editor
Published: 2024
Full Text: View/download PDF

37. Controllability of Windmill Networks

Author: Guo, Pengcheng, Lv, Pengchao, Huang, Junjie, Liu, Bo, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Pan, Linqiang, editor, Wang, Yong, editor, and Lin, Jianqing, editor
Published: 2024
Full Text: View/download PDF

38. TESSP: Text-Enhanced Self-Supervised Speech Pre-training

Author: Yao, Zhuoyuan, Ren, Shuo, Chen, Sanyuan, Ma, Ziyang, Guo, Pengcheng, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representation in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), aiming to incorporate the linguistic information into speech pre-training. Our model consists of three parts, i.e., a speech encoder, a text encoder and a shared encoder. The model takes unsupervised speech and text data as the input and leverages the common HuBERT and MLM losses respectively. We also propose phoneme up-sampling and representation swapping to enable joint modeling of the speech and text information. Specifically, to fix the length mismatching problem between speech and text data, we phonemize the text sequence and up-sample the phonemes with the alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representation with the speech representation extracted by the respective private encoders according to the alignment information. Experiments on the Librispeech dataset shows the proposed TESSP model achieves more than 10% improvement compared with WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, showing our model has better performance on Phoneme Recognition, Acoustic Speech Recognition and Speech Translation compared with WavLM., Comment: 9 pages, 4 figures
Published: 2022

39. Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling

Author: Yao, Jixun, Wang, Qing, Lei, Yi, Guo, Pengcheng, Xie, Lei, Wang, Namin, and Liu, Jie
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Sound
Abstract: Speech data on the Internet are proliferating exponentially because of the emergence of social media, and the sharing of such personal data raises obvious security and privacy concerns. One solution to mitigate these concerns involves concealing speaker identities before sharing speech data, also referred to as speaker anonymization. In our previous work, we have developed an automatic speaker verification (ASV)-model-free anonymization framework to protect speaker privacy while preserving speech intelligibility. Although the framework ranked first place in VoicePrivacy 2022 challenge, the anonymization was imperfect, since the speaker distinguishability of the anonymized speech was deteriorated. To address this issue, in this paper, we directly model the formant distribution and fundamental frequency (F0) to represent speaker identity and anonymize the source speech by the uniformly scaling formant and F0. By directly scaling the formant and F0, the speaker distinguishability degradation of the anonymized speech caused by the introduction of other speakers is prevented. The experimental results demonstrate that our proposed framework can improve the speaker distinguishability and significantly outperforms our previous framework in voice distinctiveness. Furthermore, our proposed method also can trade off the privacy-utility by using different scaling factors., Comment: Submitted to ICASSP 2023
Published: 2022

40. Preserving background sound in noise-robust voice conversion via multi-task learning

Author: Yao, Jixun, Lei, Yi, Wang, Qing, Guo, Pengcheng, Ning, Ziqian, Xie, Lei, Li, Hai, Liu, Junhui, and Xie, Danming
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Background sound is an informative form of art that is helpful in providing a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research about VC, mainly focusing on clean voices, pay rare attention to VC with background sound. The critical problem for preserving background sound in VC is inevitable speech distortion by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework via multi-task learning which sequentially cascades a source separation (SS) module, a bottleneck feature extraction module and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task and the unified task shares a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving comparable quality and speaker similarity to the VC models trained with clean data., Comment: Submitted to ICASSP 2023
Published: 2022

41. MFCCA:Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario

Author: Yu, Fan, Zhang, Shiliang, Guo, Pengcheng, Liang, Yuhao, Du, Zhihao, Lin, Yuxiao, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently cross-channel attention, which better leverages multi-channel signals from microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay of microphone array receiving sound, we propose a multi-frame cross-channel attention, which models cross-channel information between adjacent frames to exploit the complementarity of both frame-wise and channel-wise knowledge. Besides, we also propose a multi-layer convolutional mechanism to fuse the multi-channel output and a channel masking strategy to combat the channel number mismatch problem between training and inference. Experiments on the AliMeeting, a real-world corpus, reveal that our proposed model outperforms single-channel model by 31.7\% and 37.0\% CER reduction on Eval and Test sets. Moreover, with comparable model parameters and training data, our proposed model achieves a new SOTA performance on the AliMeeting corpus, as compared with the top ranking systems in the ICASSP2022 M2MeT challenge, a recently held multi-channel multi-speaker ASR challenge., Comment: Accepted by SLT 2022
Published: 2022

42. Effect of Wind Velocity on Smoldering Ignition of Moist Pine Needles by a Glowing Firebrand

Author: Fang, Wei, Yang, Jiuling, Chen, Haixiang, Zhang, Linhe, Guo, Pengcheng, and Yuan, Yukui
Published: 2024
Full Text: View/download PDF

43. Experimental Study on the Generation of Arc Beads from Conductors by Arc Discharge

Author: Zhang, Hongmin, Guo, Pengcheng, Chen, Haixiang, Fang, Wei, and Zhang, Linhe
Published: 2024
Full Text: View/download PDF

44. NWPU-ASLP System for the VoicePrivacy 2022 Challenge

Author: Yao, Jixun, Wang, Qing, Zhang, Li, Guo, Pengcheng, Liang, Yuhao, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper presents the NWPU-ASLP speaker anonymization system for VoicePrivacy 2022 Challenge. Our submission does not involve additional Automatic Speaker Verification (ASV) model or x-vector pool. Our system consists of four modules, including feature extractor, acoustic model, anonymization module, and neural vocoder. First, the feature extractor extracts the Phonetic Posteriorgram (PPG) and pitch from the input speech signal. Then, we reserve a pseudo speaker ID from a speaker look-up table (LUT), which is subsequently fed into a speaker encoder to generate the pseudo speaker embedding that is not corresponding to any real speaker. To ensure the pseudo speaker is distinguishable, we further average the randomly selected speaker embedding and weighted concatenate it with the pseudo speaker embedding to generate the anonymized speaker embedding. Finally, the acoustic model outputs the anonymized mel-spectrogram from the anonymized speaker embedding and a modified version of HifiGAN transforms the mel-spectrogram into the anonymized speech waveform. Experimental results demonstrate the effectiveness of our proposed anonymization system., Comment: VoicePrivacy 2022 Challenge
Published: 2022

45. Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Author: Wei, Kun, Guo, Pengcheng, and Jiang, Ning
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance by self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling will neglect contextual dependencies that span across utterances. In this paper, we propose to explicitly model the inter-sentential information in a Transformer based end-to-end architecture for conversational speech recognition. Specifically, for the encoder network, we capture the contexts of previous speech and incorporate such historic information into current input by a context-aware residual attention mechanism. For the decoder, the prediction of current utterance is also conditioned on the historic linguistic information through a conditional decoder framework. We show the effectiveness of our proposed method on several open-source dialogue corpora and the proposed method consistently improved the performance from the utterance-level Transformer-based ASR models., Comment: Accepted by Interspeech2022
Published: 2022

46. Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition

Author: Shao, Qijie, Yan, Jinghao, Kang, Jian, Guo, Pengcheng, Shi, Xian, Hu, Pengfei, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: General accent recognition (AR) models tend to directly extract low-level information from spectrums, which always significantly overfit on speakers or channels. Considering accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents will be an easier task with accent shift as input. But due to the lack of native utterance as an anchor, estimating the accent shift is difficult. In this paper, we propose linguistic-acoustic similarity based accent shift (LASAS) for AR tasks. For an accent speech utterance, after mapping the corresponding text vector to multiple accent-associated spaces as anchors, its accent shift could be estimated by the similarities between the acoustic embedding and those anchors. Then, we concatenate the accent shift with a dimension-reduced text vector to obtain a linguistic-acoustic bimodal representation. Compared with pure acoustic embedding, the bimodal representation is richer and more clear by taking full advantage of both linguistic and acoustic information, which can effectively improve AR performance. Experiments on Accented English Speech Recognition Challenge (AESRC) dataset show that our method achieves 77.42% accuracy on Test set, obtaining a 6.94% relative improvement over a competitive system in the challenge., Comment: Accepted by Interspeech 2022
Published: 2022

47. A novel physical and data-driven optimization methodology for designing a renewable energy, power to gas and solid oxide fuel cell system based on ensemble learning algorithm

Author: Ding, Xiaoyi, Wang, Yifan, Guo, Pengcheng, Sun, Wei, Harrison, Gareth P., Lv, Xiaojing, and Weng, Yiwu
Published: 2024
Full Text: View/download PDF

48. Feasibility evaluation of a wind/P2G/SOFC/GT multi-energy microgrid system with synthetic fuel based on C-H-O elemental ternary analysis

Author: Ding, Xiaoyi, Guo, Pengcheng, Sun, Wei, Harrison, Gareth P., Lv, Xiaojing, and Weng, Yiwu
Published: 2024
Full Text: View/download PDF

49. Unveiling mechanical response and deformation mechanism of extruded WE43 magnesium alloy under high-speed impact

Author: Guo, Pengcheng, Ye, Tuo, Wu, Yuanzhi, Qiu, Shawei, Xia, Erli, Liu, Xianlan, and Li, Luoxing
Published: 2024
Full Text: View/download PDF

50. Intergranular deformation mechanism stimulated by {10-12} extension twins in AZ81-La alloy

Author: Zhu, Biwu, Su, Jing, Liu, Xiao, Tang, Changping, Liu, Wenhui, Li, Luoxing, Xu, Congchang, and Guo, Pengcheng
Published: 2024
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

758 results on '"Guo, Pengcheng"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources