Author: "Zhang, Yongmao" - Searchworks@Jio Institute Digital Library Search Results

36 results on '"Zhang, Yongmao"'

1. Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Author: Guo, Dake, Zhu, Xinfa, Xue, Liumeng, Zhang, Yongmao, Tian, Wenjie, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.
Comment: Accepted by INTERSPEECH 2024
Published: 2024

2. Accent-VITS: accent transfer for end-to-end TTS

Author: Ma, Linhan, Zhang, Yongmao, Zhu, Xinfa, Lei, Yi, Ning, Ziqian, Zhu, Pengcheng, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness as compared with a strong baseline.
Comment: Accepted by NCMMSC 2023
Published: 2023

3. PromptSpeaker: Speaker Generation Based on Text Descriptions

Author: Zhang, Yongmao, Liu, Guanghou, Lei, Yi, Chen, Yunlin, Yin, Hao, Xie, Lei, and Li, Zhifei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. We verify by objective metrics that PromptSpeaker can generate new speakers not seen in the training set, and the synthetic speaker voice has reasonable subjective matching quality with the speaker prompt.
Comment: Accepted to ASRU 2023
Published: 2023

4. SStackGNN: Graph Data Augmentation Simplified Stacking Graph Neural Network for Twitter Bot Detection

Author: Shi, Shuhao, Chen, Jian, Wang, Zhengyan, Zhang, Yuxin, Zhang, Yongmao, Fu, Chengqi, Qiao, Kai, and Yan, Bin
Published: 2024
Full Text: View/download PDF

5. METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer

Author: Zhu, Xinfa, Lei, Yi, Li, Tao, Zhang, Yongmao, Zhou, Hongbin, Lu, Heng, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored synthesizing emotional aspects of speech due to the challenges of cross-speaker cross-lingual emotion transfer: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal will make a system produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes the Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the good design of METTS.
Comment: 10 pages, 3 figures
Published: 2023

6. The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Author: Song, Kun, Lei, Yi, Chen, Peikun, Cao, Yiqing, Wei, Kun, Zhang, Yongmao, Xie, Lei, Jiang, Ning, and Zhao, Guoqing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task, which aims to translate multi-source English speech into Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Specifically, to improve the robustness to multi-source speech input, we adopt various data augmentation strategies and a ROVER-based score fusion on multiple ASR model outputs. To better handle the noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality, which leverages a two-stage framework, using network bottleneck features as a robust intermediate representation for speaker timbre and linguistic content disentanglement. Based on the two-stage framework, pre-trained speaker embedding is leveraged as a condition to transfer the speaker timbre in the source English speech to the translated Chinese speech. Experimental results show that our system has high translation accuracy, speech naturalness, sound quality, and speaker similarity. Moreover, it shows good robustness to multi-source data.
Comment: IWSLT@ACL 2023 system paper. Our submitted system ranks 1st in the S2ST task of the IWSLT 2023 evaluation campaign
Published: 2023

7. PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

Author: Liu, Guanghou, Zhang, Yongmao, Lei, Yi, Chen, Yunlin, Wang, Rui, Li, Zhifei, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Style transfer TTS has shown impressive performance in recent years. However, style control is often restricted to systems built on expressive speech recordings with discrete style categories. In practical situations, users may be interested in transferring style by typing text descriptions of desired styles, without the reference speech in the target style. Text-guided content generation techniques have drawn wide attention recently. In this work, we explore the possibility of controllable style transfer with natural language descriptions. To this end, we propose PromptStyle, a text prompt-guided cross-speaker style transfer system. Specifically, PromptStyle consists of an improved VITS and a cross-modal style encoder. The cross-modal style encoder constructs a shared space of stylistic and semantic representation through a two-stage training process. Experiments show that PromptStyle can achieve proper style transfer with text prompts while maintaining relatively high stability and speaker similarity. Audio samples are available on our demo page.
Published: 2023

8. Accent-VITS: Accent Transfer for End-to-End TTS

Author: Ma, Linhan, Zhang, Yongmao, Zhu, Xinfa, Lei, Yi, Ning, Ziqian, Zhu, Pengcheng, and Xie, Lei
Editors: Jia, Jia, Ling, Zhenhua, Chen, Xie, Li, Ya, and Zhang, Zixing
Editorial Board Members: Filipe, Joaquim, Ghosh, Ashish, Prates, Raquel Oliveira, and Zhou, Lizhu
Published: 2024
Full Text: View/download PDF

9. Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

Author: Zhu, Xinfa, Lei, Yi, Song, Kun, Zhang, Yongmao, Li, Tao, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer the fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the good design of our model.
Comment: Accepted by ICASSP 2023
Published: 2022

10. VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Author: Zhang, Yongmao, Xue, Heyang, Li, Hanzhao, Xie, Lei, Guo, Tingwei, Zhang, Ruixiong, and Gong, Caixia
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The end-to-end singing voice synthesis (SVS) model VISinger can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: the text-to-phase problem, in which the end-to-end model learns a meaningless text-to-phase mapping; the glitch problem, in which the harmonic components corresponding to the periodic signal of voiced segments undergo sudden changes with audible artefacts; and a low sampling rate, as 24 kHz does not meet the application needs of high-fidelity generation at the full-band rate (44.1 kHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer to generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract a latent representation without phase information and prevents the prior encoder from modelling the text-to-phase mapping. To avoid glitch artefacts, the HiFi-GAN is modified to accept the waveforms generated by the DSP synthesizer as a condition to produce the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics.
Comment: Submitted to ICASSP 2023
Published: 2022
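
The harmonic-plus-noise idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the VISinger 2 implementation: the function names, the fixed per-harmonic amplitudes, and the uniform white noise are all assumptions made for illustration, whereas the paper's synthesizer is driven by a learned latent representation.

```python
import math
import random

def harmonic_signal(f0, sample_rate, num_samples, amplitudes):
    """Periodic part: sum sinusoids at integer multiples of f0.

    Harmonics at or above the Nyquist frequency are skipped to avoid aliasing.
    `amplitudes[k-1]` is the (hypothetical, fixed) amplitude of harmonic k.
    """
    nyquist = sample_rate / 2.0
    out = [0.0] * num_samples
    for k, amp in enumerate(amplitudes, start=1):
        freq = k * f0
        if freq >= nyquist:
            break  # all higher harmonics would alias as well
        for n in range(num_samples):
            out[n] += amp * math.sin(2.0 * math.pi * freq * n / sample_rate)
    return out

def noise_signal(num_samples, gain, seed=0):
    """Aperiodic part: uniform white noise scaled by `gain` (an assumption)."""
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

def dsp_synthesize(f0, sample_rate, num_samples, amplitudes, noise_gain):
    """Add the periodic and aperiodic parts sample by sample."""
    periodic = harmonic_signal(f0, sample_rate, num_samples, amplitudes)
    aperiodic = noise_signal(num_samples, noise_gain)
    return [p + a for p, a in zip(periodic, aperiodic)]

# Example: 10 ms of a 220 Hz tone with three harmonics at 44.1 kHz.
wave = dsp_synthesize(220.0, 44100, 441, [1.0, 0.5, 0.25], noise_gain=0.01)
```

In a trainable system the amplitudes and the noise shape would be predicted per frame rather than fixed as they are here.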

11. DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Author: Song, Kun, Zhang, Yongmao, Lei, Yi, Cong, Jian, Li, Hanzhao, Xie, Lei, He, Gang, and Bai, Jinfeng
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent development of neural vocoders based on the generative adversarial neural network (GAN) has shown obvious advantages of generating raw waveform conditioned on mel-spectrogram with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech from various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis by applying the time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch problem caused by the ground-truth spectrograms in the training phase and the predicted spectrograms in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the predicted mel-spectrogram from the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision to the GAN-based vocoder. We also utilize sine excitation as the time-domain supervision to improve the harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experiments show that DSPGAN significantly outperforms the compared approaches and it can generate high-fidelity speech for various TTS models trained using diverse data.
Comment: Accepted to ICASSP 2023
Published: 2022

12. Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

Author: Song, Kun, Cong, Jian, Wang, Xinsheng, Zhang, Yongmao, Xie, Lei, Jiang, Ning, and Wu, Haiying
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In the current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder that, once trained, is robust to imperfect mel-spectrograms predicted by the acoustic model. To this end, we propose the Robust MelGAN vocoder by solving the original multi-band MelGAN's metallic sound problem and increasing its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler which separates the speech signal into periodic and aperiodic components, we only perform network dropout on the aperiodic components, which alleviates metallic sounding and maintains good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment fake data in the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments show that Robust MelGAN can be used as a universal vocoder, significantly improving sound quality in TTS systems built on various types of data.
Comment: Accepted by ISCSLP 2022
Published: 2022

13. AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents

Author: Zhang, Yongmao, Wang, Zhichao, Yang, Peiji, Sun, Hongshen, Wang, Zhisheng, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Learning accent from crowd-sourced data is a feasible way to achieve a target speaker TTS system that can synthesize accent speech. To this end, there are two challenging problems to be solved. First, direct use of the poor acoustic quality crowd-sourced data and the target speaker data in accent transfer will apparently lead to synthetic speech with degraded quality. To mitigate this problem, we take a bottleneck feature (BN) based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module to learn accent and a BN-to-Mel (BN2Mel) module to learn speaker timbre, where the neural network based BN feature serves as an intermediate representation that is robust to noise interference. Second, directly training T2BN using the crowd-sourced data in the two-stage system will produce accent speech of the target speaker with poor prosody. This is because the crowd-sourced recordings are contributed by ordinary, non-professional speakers. To tackle this problem, we update the two-stage approach to a novel three-stage approach, where T2BN and BN2Mel are trained using the high-quality target speaker data and a new BN-to-BN module is plugged in between the two modules to perform accent transfer. To train the BN2BN module, the parallel unaccented and accented BN features are obtained by a proposed data augmentation procedure. Finally, the proposed three-stage approach manages to produce accent speech for the target speaker with good prosody, as the prosody pattern is inherited from the professional target speaker and accent transfer is achieved by the BN2BN module at the same time. The proposed approach, named AccentSpeech, is validated in a Mandarin TTS accent transfer task.
Comment: Accepted by ISCSLP 2022
Published: 2022

14. AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

Author: Song, Kun, Xue, Heyang, Wang, Xinsheng, Cong, Jian, Zhang, Yongmao, Xie, Lei, Yang, Bing, Zhang, Xiong, and Su, Dan
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been devoted to this task, little work has addressed low computational resource scenarios, due to the challenges raised by the requirement of a lightweight model and low computational complexity. In this paper, a tiny VITS-based TTS model, named AdaVITS, for low computing resource speaker adaptation is proposed. To effectively reduce the parameters and computational complexity of VITS, an iSTFT-based wave construction decoder is proposed to replace the upsampling-based decoder, which is resource-consuming in the original VITS. Besides, NanoFlow is introduced to share the density estimate across flow blocks to reduce the parameters of the prior encoder. Furthermore, to reduce the computational complexity of the textual encoder, scaled-dot attention is replaced with linear attention. To deal with the instability caused by the simplified model, instead of using the original text encoder, phonetic posteriorgram (PPG) is utilized as the linguistic feature via a text-to-PPG module, which is then used as input for the encoder. Experiments show that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFLOPs computational complexity.
Comment: Accepted by ISCSLP 2022
Published: 2022

15. Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Author: Xue, Heyang, Wang, Xinsheng, Zhang, Yongmao, Xie, Lei, Zhu, Pengcheng, and Bi, Mengxiao
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Building a high-quality singing corpus for a person who is not good at singing is non-trivial, thus making it challenging to create a singing voice synthesizer for this person. Learn2Sing is dedicated to synthesizing the singing voice of a speaker without his or her singing data by learning from data recorded by others, i.e., the singing teacher. Inspired by the fact that pitch is the key style factor to distinguish singing from speaking voice, the proposed Learn2Sing 2.0 first generates the preliminary acoustic feature with averaged pitch value at the phone level, which allows the training of this process for different styles, i.e., speaking or singing, to share the same conditions except for the speaker information. Then, conditioned on the specific style, a diffusion decoder, which is accelerated by a fast sampling algorithm during the inference stage, is adopted to gradually restore the final acoustic feature. During training, to avoid confusion between the information in the speaker embedding and the style embedding, mutual information is employed to restrain the learning of the speaker embedding and the style embedding. Experiments show that the proposed approach is capable of synthesizing high-quality singing voice for the target speaker without singing data with 10 decoding steps.
Comment: Submitted to INTERSPEECH 2022
Published: 2022
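
The "averaged pitch value at the phone level" mentioned in the abstract above can be illustrated with a small sketch. It is an assumption-laden illustration, not the Learn2Sing 2.0 code: the function name, the frame-count representation of phone durations, and the convention that an F0 of 0 marks an unvoiced frame are all hypothetical.

```python
def phone_level_average_pitch(frame_f0, phone_durations):
    """Replace each frame's F0 with the mean F0 of the phone it belongs to.

    frame_f0: per-frame F0 values in Hz (0.0 is assumed to mean unvoiced).
    phone_durations: number of frames covered by each phone, in order.
    """
    if sum(phone_durations) != len(frame_f0):
        raise ValueError("phone durations must cover every frame exactly once")
    averaged = []
    start = 0
    for dur in phone_durations:
        segment = frame_f0[start:start + dur]
        voiced = [f for f in segment if f > 0.0]  # ignore unvoiced frames
        mean = sum(voiced) / len(voiced) if voiced else 0.0
        averaged.extend([mean] * dur)
        start += dur
    return averaged

# Two phones covering 2 and 3 frames respectively.
flat = phone_level_average_pitch([100.0, 120.0, 0.0, 200.0, 220.0], [2, 3])
```

Flattening pitch this way removes the fine pitch movement inside each phone, which is the detail a later stage would have to restore.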

16. Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis

Author: Wang, Yu, Wang, Xinsheng, Zhu, Pengcheng, Wu, Jie, Li, Hanzhao, Xue, Heyang, Zhang, Yongmao, Xie, Lei, and Bi, Mengxiao
Subjects: Computer Science - Sound, Computer Science - Databases, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces Opencpop, a publicly available high-quality Mandarin singing corpus designed for singing voice synthesis (SVS). The corpus consists of 100 popular Mandarin songs performed by a female professional singer. Audio files are recorded with studio quality at a sampling rate of 44,100 Hz and the corresponding lyrics and musical scores are provided. All singing recordings have been phonetically annotated with phoneme boundaries and syllable (note) boundaries. To demonstrate the reliability of the released data and to provide a baseline for future research, we built baseline deep neural network-based SVS models and evaluated them with both objective metrics and subjective mean opinion score (MOS) measure. Experimental results show that the best SVS model trained on our database achieves 3.70 MOS, indicating the reliability of the provided corpus. Opencpop is released to the open-source community WeNet, and the corpus, as well as synthesized demos, can be found on the project homepage.
Comment: will be submitted to Interspeech 2022
Published: 2022

17. VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis

Author: Zhang, Yongmao, Cong, Jian, Xue, Heyang, Xie, Lei, Zhu, Pengcheng, and Bi, Mengxiao
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates audio waveform from lyrics and musical score. Our approach is inspired by VITS, which adopts VAE-based posterior encoder augmented with normalizing flow-based prior encoder and adversarial decoder to realize complete end-to-end speech generation. VISinger follows the main architecture of VITS, but makes substantial improvements to the prior encoder based on the characteristics of singing. First, instead of using phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to get the frame-level mean and variance on acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to stabler singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme to note duration ratio, helped with singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms FastSpeech+Neural-Vocoder two-stage approach and the oracle VITS; ablation study demonstrates the effectiveness of different contributions.
Comment: 5 pages, ICASSP 2022
Published: 2021
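
The length regulator named in the abstract above expands phoneme-level features to frame level. A minimal sketch of that general technique follows; the function name and the list-based feature representation are assumptions for illustration and are not taken from the VISinger code.

```python
def length_regulate(phoneme_features, durations):
    """Repeat each phoneme-level feature for as many frames as its duration.

    phoneme_features: one feature (any object) per phoneme.
    durations: number of frames for each phoneme, in the same order.
    Note: repeated entries refer to the same object, which is fine for
    read-only features such as the strings used below.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for feature, dur in zip(phoneme_features, durations):
        frames.extend([feature] * dur)
    return frames

# Three phonemes lasting 2, 1, and 3 frames.
frame_level = length_regulate(["n", "i", "h"], [2, 1, 3])
```

The output length equals the sum of the durations, so the frame-level sequence lines up with the acoustic frames it is meant to describe.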

18. METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Author: Zhu, Xinfa, primary, Lei, Yi, additional, Li, Tao, additional, Zhang, Yongmao, additional, Zhou, Hongbin, additional, Lu, Heng, additional, and Xie, Lei, additional
Published: 2024
Full Text: View/download PDF

19. SqueezeGCN: Adaptive Neighborhood Aggregation with Squeeze Module for Twitter Bot Detection Based on GCN

Author: Fu, Chengqi, primary, Shi, Shuhao, additional, Zhang, Yuxin, additional, Zhang, Yongmao, additional, Chen, Jian, additional, Yan, Bin, additional, and Qiao, Kai, additional
Published: 2023
Full Text: View/download PDF

20. Promptspeaker: Speaker Generation Based on Text Descriptions

Author: Zhang, Yongmao, primary, Liu, Guanghou, additional, Lei, Yi, additional, Chen, Yunlin, additional, Yin, Hao, additional, Xie, Lei, additional, and Li, Zhifei, additional
Published: 2023
Full Text: View/download PDF

21. Application of wearable devices based on deep learning algorithm in rope skipping data monitoring

Author: Zhang Yongmao and Liu Yuxin
Subjects: Geometry and Topology, Software, Theoretical Computer Science
Abstract: At present, wearable devices have some problems, such as poor adaptability to human motion behavior, and the recognition accuracy required for different wearers cannot be achieved. Based on deep learning principles, this paper develops an intelligent monitoring system for rope skipping movement data. Using a universal human body analysis model, an attention mechanism is introduced and embedded into the decoding network. The rope skipping dataset is classified with multiple labels, and a spatial graph convolution is constructed and extended to the temporal dynamics of moving human skeleton data. To handle the complex data involved in recognizing a moving human body, we use pose estimation to compute the key points of the moving body and extract the dynamic structural information of the skeleton sequence. Because line-of-sight occlusion occurs when tracking a moving human target, a target tracking algorithm based on a multi-domain convolutional neural network is adopted, which improves feature extraction by segmenting the target to be tracked and identifying the area around it. The rope skipping dataset is collected by wearable sensors, and because the numerical ranges may differ greatly, the data is normalized. Finally, the loss function is used to evaluate how well the neural network fits the data and to compute the gradients for optimizing the model parameters, so that the model can cope with different data changes. The final system performance test verifies that the accuracy of the designed system is above 90%, so it can effectively monitor rope skipping data and be used in actual rope skipping practice.
Published: 2023
Full Text: View/download PDF
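
The abstract above says the sensor data is normalized because numerical ranges differ, but it does not say which method is used. As one common possibility, a min-max normalization sketch is shown here; the choice of min-max scaling and the function name are assumptions, not details from the paper.

```python
def min_max_normalize(values):
    """Scale a list of readings linearly into the range [0, 1].

    A constant series has no spread, so it is mapped to all zeros
    to avoid dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical accelerometer readings with a wide numerical range.
scaled = min_max_normalize([2.0, 4.0, 6.0])
```

Each sensor channel would typically be normalized separately so that a channel with large raw values does not dominate the others.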

22. PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

Author: Liu, Guanghou, primary, Zhang, Yongmao, additional, Lei, Yi, additional, Chen, Yunlin, additional, Wang, Rui, additional, Xie, Lei, additional, and Li, Zhifei, additional
Published: 2023
Full Text: View/download PDF

23. VISinger2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Author: Zhang, Yongmao, primary, Xue, Heyang, additional, Li, Hanzhao, additional, Xie, Lei, additional, Guo, Tingwei, additional, Zhang, Ruixiong, additional, and Gong, Caixia, additional
Published: 2023
Full Text: View/download PDF

24. SqueezeGCN: Adaptive Neighborhood Aggregation with Squeeze Module for Twitter Bot Detection Based on GCN.

Author: Fu, Chengqi, Shi, Shuhao, Zhang, Yuxin, Zhang, Yongmao, Chen, Jian, Yan, Bin, and Qiao, Kai
Subjects: NEIGHBORHOODS, SOCIAL networks
Abstract: Despite notable advancements in bot detection methods based on Graph Neural Networks (GNNs), the efficacy of GNNs relies heavily on the homophily assumption, which posits that nodes with the same label are more likely to form connections between them. However, the latest social bots are capable of concealing themselves by extensively interacting with authentic user accounts, forging extensive connections on social graphs, and thus deviating from the homophily assumption. Consequently, conventional GNN methods continue to face significant challenges in detecting these novel types of social bots. To address this issue, we propose SqueezeGCN, an adaptive neighborhood aggregation with the Squeeze Module for Twitter bot detection based on a GCN. The Squeeze Module uses a parallel multi-layer perceptron (MLP) to squeeze feature vectors into a one-dimensional representation. Subsequently, we adopt the sigmoid activation function, which normalizes values between 0 and 1, to produce node aggregation weights. The aggregation weight vector is processed by a linear layer to obtain the aggregation embedding, and the classification result is generated using an MLP classifier. This design generates adaptive aggregation weights for each node, diverging from the traditional singular neighbor aggregation approach. Our experiments demonstrate that SqueezeGCN performs well on three widely acknowledged Twitter bot detection benchmarks. Comparisons with a GCN reveal improvements of 2.37%, 15.59%, and 1.33% for the respective datasets. Furthermore, our approach demonstrates improvements when compared to state-of-the-art algorithms on the three benchmark datasets. The experimental results further affirm the effectiveness of our proposed algorithm for Twitter bot detection. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
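
The idea of sigmoid-normalized, per-node aggregation weights described in the abstract above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the SqueezeGCN architecture: a single linear scoring function stands in for the paper's parallel MLP, the later linear layer and classifier are omitted, and every name here is hypothetical.

```python
import math

def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def squeeze_weights(features, w, b):
    """Squeeze each node's feature vector to one scalar weight in (0, 1).

    A single linear map (w, b) is assumed here in place of an MLP.
    """
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in features]

def weighted_aggregate(features, neighbors, weights):
    """For each node, sum its neighbors' features scaled by their weights."""
    dim = len(features[0])
    aggregated = []
    for nbrs in neighbors:
        agg = [0.0] * dim
        for j in nbrs:
            for d in range(dim):
                agg[d] += weights[j] * features[j][d]
        aggregated.append(agg)
    return aggregated

# Two nodes connected to each other; zero parameters give weight 0.5 everywhere.
feats = [[1.0, 0.0], [0.0, 2.0]]
node_weights = squeeze_weights(feats, w=[0.0, 0.0], b=0.0)
agg = weighted_aggregate(feats, neighbors=[[1], [0]], weights=node_weights)
```

With learned parameters, a low weight would let a node's neighbors discount its features, which is one way to soften reliance on the homophily assumption.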

25. Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

Author: Zhu, Xinfa, primary, Lei, Yi, additional, Song, Kun, additional, Zhang, Yongmao, additional, Li, Tao, additional, and Xie, Lei, additional
Published: 2023
Full Text: View/download PDF

26. DSPGAN: A Gan-Based Universal Vocoder for High-Fidelity TTS by Time-Frequency Domain Supervision from DSP

Author: Song, Kun, primary, Zhang, Yongmao, additional, Lei, Yi, additional, Cong, Jian, additional, Li, Hanzhao, additional, Xie, Lei, additional, He, Gang, additional, and Bai, Jinfeng, additional
Published: 2023
Full Text: View/download PDF

27. The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Author: Song, Kun, primary, Lei, Yi, additional, Chen, Peikun, additional, Cao, Yiqing, additional, Wei, Kun, additional, Zhang, Yongmao, additional, Xie, Lei, additional, Jiang, Ning, additional, and Zhao, Guoqing, additional
Published: 2023
Full Text: View/download PDF

28. AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

Author: Song, Kun, primary, Xue, Heyang, additional, Wang, Xinsheng, additional, Cong, Jian, additional, Zhang, Yongmao, additional, Xie, Lei, additional, Yang, Bing, additional, Zhang, Xiong, additional, and Su, Dan, additional
Published: 2022
Full Text: View/download PDF

29. Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

Author: Song, Kun, primary, Cong, Jian, additional, Wang, Xinsheng, additional, Zhang, Yongmao, additional, Xie, Lei, additional, Jiang, Ning, additional, and Wu, Haiying, additional
Published: 2022
Full Text: View/download PDF

30. AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents

Author: Zhang, Yongmao, primary, Wang, Zhichao, additional, Yang, Peiji, additional, Sun, Hongshen, additional, Wang, Zhisheng, additional, and Xie, Lei, additional
Published: 2022
Full Text: View/download PDF

31. Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Author: Xue, Heyang, primary, Wang, Xinsheng, additional, Zhang, Yongmao, additional, Xie, Lei, additional, Zhu, Pengcheng, additional, and Bi, Mengxiao, additional
Published: 2022
Full Text: View/download PDF

32. Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis

Author: Wang, Yu, primary, Wang, Xinsheng, additional, Zhu, Pengcheng, additional, Wu, Jie, additional, Li, Hanzhao, additional, Xue, Heyang, additional, Zhang, Yongmao, additional, Xie, Lei, additional, and Bi, Mengxiao, additional
Published: 2022
Full Text: View/download PDF

33. VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis

Author: Zhang, Yongmao, Cong, Jian, Xue, Heyang, Xie, Lei, Zhu, Pengcheng, and Bi, Mengxiao
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates audio waveform from lyrics and musical score. Our approach is inspired by VITS, which adopts a VAE-based posterior encoder augmented with a normalizing flow-based prior encoder and an adversarial decoder to realize complete end-to-end speech generation. VISinger follows the main architecture of VITS, but makes substantial improvements to the prior encoder based on the characteristics of singing. First, instead of using phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to get the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech+Neural-Vocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the different contributions. Comment: 5 pages, ICASSP 2022
Published: 2022
Full Text: View/download PDF
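
Note on the abstract above: two ideas it mentions, turning a phoneme-to-note duration ratio into frame counts and expanding phoneme-level items to frame level with a length regulator, can be sketched as follows. This is a simplified illustration under stated assumptions, not the VISinger code. The function names, the truncate-then-assign-remainder rule, and the example values are hypothetical.

```python
def phoneme_frames(note_frames, ratios):
    """Split one note's frame count among its phonemes.

    Assumption: `ratios` holds the phoneme-to-note duration ratios for the
    phonemes sung on this note and sums to 1.
    """
    frames = [int(r * note_frames) for r in ratios]
    # Assign any frames lost to truncation to the last phoneme so the
    # total always equals the note length.
    frames[-1] += note_frames - sum(frames)
    return frames

def length_regulate(phoneme_items, durations):
    """Repeat each phoneme-level item once per frame of its duration."""
    out = []
    for item, d in zip(phoneme_items, durations):
        out.extend([item] * d)
    return out

# Hypothetical example: a note lasting 8 frames carrying two phonemes.
durations = phoneme_frames(8, [0.25, 0.75])
frame_level = length_regulate(["n", "i"], durations)
```

Predicting a ratio rather than an absolute duration keeps the phoneme durations tied to the note lengths given in the score, which is the motivation the abstract gives for improving singing rhythm.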

34. Exploration and Research on Smart Sports Classrooms in Colleges in the Information Age

Author: Zhang, Yongmao, primary, Zhao, Hua, additional, and Peng, Dan, additional
Published: 2022
Full Text: View/download PDF

35. Metabolomic mechanisms of short chain chlorinated paraffins toxicity in rats

Author: Yang, Lixin, primary, Liu, Yinping, additional, Cui, Ze, additional, Zhang, Yongmao, additional, Zhang, Jianbo, additional, and Lian, Kaoqi, additional
Published: 2021
Full Text: View/download PDF

36. Analysis of Current Situation of Fancy Rope Skipping Curriculum in Regular Institutions of Higher Learning in Sichuan Province

Author: Zhang, Yongmao, primary
Published: 2016
Full Text: View/download PDF
