Author: "Su, Dan" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Su, Dan" returned 4,286 results.


1. Nemotron-4 340B Technical Report

Author: Nvidia, Adler, Bo, Agarwal, Niket, Aithal, Ashwath, Anh, Dong H., Bhattacharya, Pallab, Brundyn, Annika, Casper, Jared, Catanzaro, Bryan, Clay, Sharon, Cohen, Jonathan, Das, Sirshak, Dattagupta, Ayush, Delalleau, Olivier, Derczynski, Leon, Dong, Yi, Egert, Daniel, Evans, Ellie, Ficek, Aleksander, Fridman, Denys, Ghosh, Shaona, Ginsburg, Boris, Gitman, Igor, Grzegorzek, Tomasz, Hero, Robert, Huang, Jining, Jawa, Vibhu, Jennings, Joseph, Jhunjhunwala, Aastha, Kamalu, John, Khan, Sadaf, Kuchaiev, Oleksii, LeGresley, Patrick, Li, Hui, Liu, Jiwei, Liu, Zihan, Long, Eileen, Mahabaleshwarkar, Ameya Sunil, Majumdar, Somshubra, Maki, James, Martinez, Miguel, de Melo, Maer Rodrigues, Moshkov, Ivan, Narayanan, Deepak, Narenthiran, Sean, Navarro, Jesus, Nguyen, Phong, Nitski, Osvald, Noroozi, Vahid, Nutheti, Guruprasad, Parisien, Christopher, Parmar, Jupinder, Patwary, Mostofa, Pawelec, Krzysztof, Ping, Wei, Prabhumoye, Shrimai, Roy, Rajarshi, Saar, Trisha, Sabavat, Vasanth Rao Naik, Satheesh, Sanjeev, Scowcroft, Jane Polak, Sewall, Jason, Shamis, Pavel, Shen, Gerald, Shoeybi, Mohammad, Sizer, Dave, Smelyanskiy, Misha, Soares, Felipe, Sreedhar, Makesh Narsimhan, Su, Dan, Subramanian, Sandeep, Sun, Shengyang, Toshniwal, Shubham, Wang, Hao, Wang, Zhilin, You, Jiaxuan, Zeng, Jiaqi, Zhang, Jimmy, Zhang, Jing, Zhang, Vivienne, Zhang, Yian, and Zhu, Chen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
Published: 2024

2. Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

Author: Zhu, Yongxin, Su, Dan, He, Liqiang, Xu, Linli, and Yu, Dong
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speeches in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. See https://youngsheen.github.io/GPST/demo for demo samples.
Comment: Accepted at ACL 2024 (main)
Published: 2024

3. Prompt-guided Precise Audio Editing with Diffusion Models

Author: Xu, Manjie, Li, Chenxing, Zhang, Duzhen, Su, Dan, Liang, Wei, and Yu, Dong
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusion models and enables precise audio editing. The editing is based on the input textual prompt only and is entirely training-free. We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process. Experimental results highlight the effectiveness of our method in various editing tasks.
Comment: Accepted by ICML 2024
Published: 2024

4. Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

Author: Peng, Chong, He, Liqiang, and Su, Dan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Today, there have been many achievements in learning the association between voice and face. However, most previous work models rely on cosine similarity or L2 distance to evaluate the likeness of voices and faces following contrastive learning, subsequently applied to retrieval and matching tasks. This method only considers the embeddings as high-dimensional vectors, utilizing a minimal scope of available information. This paper introduces a novel framework within an unsupervised setting for learning voice-face associations. By employing a multimodal encoder after contrastive learning and addressing the problem through binary classification, we can learn the implicit information within the embeddings in a more effective and varied manner. Furthermore, by introducing an effective pair selection method, we enhance the learning outcomes of both contrastive learning and the matching task. Empirical evidence demonstrates that our framework achieves state-of-the-art results in voice-face matching, verification, and retrieval tasks, improving verification by approximately 3%, matching by about 2.5%, and retrieval by around 1.3%.
Published: 2024

5. Nemotron-4 15B Technical Report

Author: Parmar, Jupinder, Prabhumoye, Shrimai, Jennings, Joseph, Patwary, Mostofa, Subramanian, Sandeep, Su, Dan, Zhu, Chen, Narayanan, Deepak, Jhunjhunwala, Aastha, Dattagupta, Ayush, Jawa, Vibhu, Liu, Jiwei, Mahabaleshwarkar, Ameya, Nitski, Osvald, Brundyn, Annika, Maki, James, Martinez, Miguel, You, Jiaxuan, Kamalu, John, LeGresley, Patrick, Fridman, Denys, Casper, Jared, Aithal, Ashwath, Kuchaiev, Oleksii, Shoeybi, Mohammad, Cohen, Jonathan, and Catanzaro, Bryan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.
Published: 2024

6. MM-LLMs: Recent Advances in MultiModal Large Language Models

Author: Zhang, Duzhen, Yu, Yahan, Dong, Jiahua, Li, Chenxing, Su, Dan, Chu, Chenhui, and Yu, Dong
Subjects: Computer Science - Computation and Language
Abstract: In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.
Comment: Accepted by ACL 2024 (Findings)
Published: 2024

7. pH-controlled reversible sol-gel inversion by cerous phosphate nanofibers for hemostasis

Author: Su, Tuo, Xu, Jun-Chen, Yu, Wei, Su, Dan, Shi, Di-Er, Pang, Yi-Chao, Ying, Yao, Li, Wang-Chang, Li, Juan, Zheng, Jing-Wu, Qiao, Liang, Che, Sheng-Lei, and Yu, Jing
Published: 2024

8. Normal spinopelvic parameters and correlation analysis in 217 asymptomatic children

Author: Qi, Hao, Zhao, ZengHui, Gao, XianDa, Wang, Chenchen, Zhang, Zuzhuo, Su, Dan, Zu, Feiyu, Xue, Rui, Hou, Zhiyong, Chen, Wei, and Zhang, Di
Published: 2024

9. Face it and shoulder it jointly: from personal experience to mitigation behavior of climate change

Author: Fu, Yuling, Shi, Jiaxin, Su, Dan, and Deng, Fumin
Published: 2024

10. Non-volatile memory based on PZT/FeGa thin film memtranstor

Author: He, Jin-Cheng, Xing, Jian, Shen, Jian-Xin, Su, Dan, Liu, En-Ke, Wang, Shou-Guo, and Sun, Young
Subjects: Physics - Applied Physics, Condensed Matter - Materials Science
Abstract: The PZT/FeGa thin film memtranstor was prepared and the modulation of the magnetoelectric coefficient by external magnetic and electric fields was studied. The magnetoelectric coefficient of the PZT/FeGa memtranstor can be reversed by flipping the direction of magnetization of FeGa or ferroelectric polarization of PZT. Notably, the sign of the magnetoelectric coefficient can be switched repeatedly by reversing ferroelectric polarization of PZT when the external magnetic field remains constant. Moreover, the binary switching behavior can still be maintained under zero DC bias magnetic field. When the polarization direction remains stable, the magnetoelectric coefficient also does not change. This means that the magnetoelectric coefficient of PZT/FeGa is non-volatile. Furthermore, the retention and endurance characteristics of the PZT/FeGa thin film memtranstor have been investigated. These findings demonstrate the potential of the PZT/FeGa thin film memtranstor for non-volatile memory applications.
Comment: 10 pages, 4 figures
Published: 2023

11. A High Fidelity and Low Complexity Neural Audio Coding

Author: Liu, Wenzhe, Xiao, Wei, Wang, Meng, Yang, Shan, Shi, Yupeng, Kang, Yuyong, Su, Dan, Shang, Shidong, and Yu, Dong
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide-band components and adopts traditional signal processing to compress high-band components according to psychological hearing knowledge. Inspired by auditory perception theory, a perception-based loss function is designed to improve harmonic modeling. Besides, generative adversarial network (GAN) compression is proposed for the first time for neural audio codecs. Our method is superior to prior advanced neural codecs across subjective and objective metrics and allows real-time inference on desktop and mobile.
Published: 2023

12. DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

Author: Gu, Yu, Bian, Yianrao, Lei, Guangzhi, Weng, Chao, and Su, Dan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are exploited into frame-level encoders to improve the modeling ability of expressiveness. A denoiser incorporating both denoising diffusion probabilistic model (DDPM) for mel-spectrograms and SAIN modules is conducted to further improve the synthetic speech quality and expressiveness. Experimental results prove that the proposed expressive TTS model in this paper can achieve better performance than the state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
Published: 2023

13. Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Author: Zhu, Jiaxu, Tong, Weinan, Xu, Yaoxun, Song, Changhe, Wu, Zhiyong, You, Zhao, Su, Dan, Yu, Dong, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech representation and text representation is inconsistent. Although the previous method up-samples the text representation to align with acoustic modality, it may not match the expected actual duration. In this paper, we proposed novel representations match strategy through down-sampling acoustic representation to align with text modality. By introducing a continuous integrate-and-fire (CIF) module generating acoustic representations consistent with token length, our ASR model can learn unified representations from both modalities better, allowing for domain adaptation using text-only data of the target domain. Experiment results of new domain data demonstrate the effectiveness of the proposed method.
Comment: Proceedings of Interspeech. arXiv admin note: text overlap with arXiv:2309.01437
Published: 2023

14. Tumour-associated macrophages and Schwann cells promote perineural invasion via paracrine loop in pancreatic ductal adenocarcinoma

Author: Zhang, Bin, Guo, Xiaofeng, Huang, Leyi, Zhang, Yuting, Li, Zhiguo, Su, Dan, Lin, Longfa, Zhou, Peng, Ye, Huilin, Lu, Yanan, and Zhou, Quanbo
Published: 2024

15. Strengthening effect of mixed biochar on microbial remediation of PAHs-contaminated soil in cold areas

Author: Su, Dan, Dong, Yushan, Liu, Yihan, Yang, Caixia, and Wang, Xin
Published: 2024

16. Efficacy and safety of immune checkpoint inhibitors in solid tumor patients combined with chronic coronary syndromes or its risk factor: a nationwide multicenter cohort study

Author: Liu, Chao, Ruan, Yuli, Huang, Rui, Fang, Lin, Wu, Tong, Lv, Ying, Cui, Luying, Liao, Yuanyu, Wang, Bojun, Chen, Zhuo, Su, Dan, Ma, Yue, Han, Shuling, Guan, Xin, Cui, Jie, Yao, Yang, Wang, Yao, Wang, Mengmeng, Liu, Ruiqi, and Zhang, Yanqiao
Published: 2024

17. Model Debiasing via Gradient-based Explanation on Representation

Author: Zhang, Jindi, Wang, Luning, Su, Dan, Huang, Yongxiang, Cao, Caleb Chen, and Chen, Lei
Subjects: Computer Science - Machine Learning, Computer Science - Computers and Society
Abstract: Machine learning systems produce biased results towards certain demographic groups, known as the fairness problem. Recent approaches to tackle this problem learn a latent code (i.e., representation) through disentangled representation learning and then discard the latent code dimensions correlated with sensitive attributes (e.g., gender). Nevertheless, these approaches may suffer from incomplete disentanglement and overlook proxy attributes (proxies for sensitive attributes) when processing real-world data, especially for unstructured data, causing performance degradation in fairness and loss of useful information for downstream tasks. In this paper, we propose a novel fairness framework that performs debiasing with regard to both sensitive attributes and proxy attributes, which boosts the prediction performance of downstream task models without complete disentanglement. The main idea is to, first, leverage gradient-based explanation to find two model focuses, 1) one focus for predicting sensitive attributes and 2) the other focus for predicting downstream task labels, and second, use them to perturb the latent code that guides the training of downstream task models towards fairness and utility goals. We show empirically that our framework works with both disentangled and non-disentangled representation learning methods and achieves better fairness-accuracy trade-off on unstructured and structured datasets than previous state-of-the-art approaches.
Published: 2023

18. Learn What NOT to Learn: Towards Generative Safety in Chatbots

Author: Khalatbari, Leila, Bang, Yejin, Su, Dan, Chung, Willy, Ghadimi, Saeed, Sameti, Hossein, and Fung, Pascale
Subjects: Computer Science - Computation and Language
Abstract: Conversational models that are generative and open-domain are particularly susceptible to generating unsafe content since they are trained on web-based social data. Prior approaches to mitigating this issue have drawbacks, such as disrupting the flow of conversation, limited generalization to unseen toxic input contexts, and sacrificing the quality of the dialogue for the sake of safety. In this paper, we present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals. Our approach differs from the standard contrastive learning framework in that it automatically obtains positive and negative signals from the safe and unsafe language distributions that have been learned beforehand. The LOT framework utilizes divergence to steer the generations away from the unsafe subspace and towards the safe subspace while sustaining the flow of conversation. Our approach is memory and time-efficient during decoding and effectively reduces toxicity while preserving engagingness and fluency. Empirical results indicate that LOT reduces toxicity by up to four-fold while achieving four to six-fold higher rates of engagingness and fluency compared to baseline models. Our findings are further corroborated by human evaluation.
Comment: 9 pages, 3 tables, 3 figures
Published: 2023

19. Investigations on the Energy Characteristics and Internal Flow Dynamics of a Mixed-Flow Pump Considering of Inlet Pre-Rotation at Off-Rated Flow Conditions

Author: Yang, Yang, Chen, Xionghuan, Su, Dan, Gu, Tianxiang, Xi, Bin, Wang, Hui, Jiao, Weixuan, Ji, Leilei, He, Zhaoming, and Wang, Chuan
Published: 2024

20. Association Between Serum Neurofilament Light Chain and Cognitive Performance Among Older Adults in the United States: A Cross-Sectional Study

Author: Gao, Yuanyuan, Su, Dan, Xue, Zhouya, Ji, Lin, and Wang, Shu
Published: 2023

21. Genomic analysis and filtration of novel prognostic biomarkers based on metabolic and immune subtypes in pancreatic cancer

Author: Chen, Guangyu, Liu, Yueze, Su, Dan, Qiu, Jiangdong, Long, Junyu, Zhao, Fangyu, Tao, Jinxin, Yang, Gang, Huang, Hua, Xiao, Jianchun, Zhang, Taiping, and Zhao, Yupei
Published: 2023

22. Exploring regional ecological compensation of cultivated land from the perspective of the mismatch between grain supply and demand

Author: Su, Dan, Wang, Jiayi, Wu, Qing, Fang, Xiaoqian, Cao, Yu, Li, Guoyu, and CAO, Yu
Published: 2023

23. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Author: Bang, Yejin, Cahyawijaya, Samuel, Lee, Nayeon, Dai, Wenliang, Su, Dan, Wilie, Bryan, Lovenia, Holy, Ji, Ziwei, Yu, Tiezheng, Chung, Willy, Do, Quyet V., Xu, Yan, and Fung, Pascale
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release codebase for evaluation set extraction.
Comment: 45 pages, AACL 2022
Published: 2023

24. NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Author: Cahyawijaya, Samuel, Lovenia, Holy, Aji, Alham Fikri, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Wibisono, Christian, Romadhony, Ade, Vincentio, Karissa, Koto, Fajri, Santoso, Jennifer, Moeljadi, David, Wirawan, Cahya, Hudi, Frederikus, Parmonangan, Ivan Halim, Alfina, Ika, Wicaksono, Muhammad Satrio, Putra, Ilham Firdausi, Rahmadani, Samsul, Oenang, Yulianti, Septiandri, Ali Akbar, Jaya, James, Dhole, Kaustubh D., Suryani, Arie Ardiyanti, Putri, Rifki Afina, Su, Dan, Stevens, Keith, Nityasya, Made Nindyatama, Adilazuarda, Muhammad Farid, Ignatius, Ryan, Diandaru, Ryandito, Yu, Tiezheng, Ghifari, Vito, Dai, Wenliang, Xu, Yan, Damapuspita, Dyah, Tho, Cuk, Karo, Ichwanul Muslim Karo, Fatyanosa, Tirana Noor, Ji, Ziwei, Fung, Pascale, Neubig, Graham, Baldwin, Timothy, Ruder, Sebastian, Sujaini, Herry, Sakti, Sakriani, and Purwarianti, Ayu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Published: 2022

25. TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR

Author: Cao, Lixin, Wang, Jun, Yang, Ben, Su, Dan, and Yu, Dong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Self-supervised learning (SSL) models confront challenges of abrupt informational collapse or slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing the pre-training. TriNet learns the SSL latent embedding space and incorporates it to a higher level space for predicting pseudo target vectors generated by a frozen teacher. Our experimental results show that the proposed method notably stabilizes and accelerates pre-training and achieves a relative word error rate reduction (WERR) of 6.06% compared to the state-of-the-art (SOTA) Data2vec for a downstream benchmark ASR task. We will release our code at https://github.com/tencent-ailab/.
Comment: Accepted by ICASSP 2023
Published: 2022

26. UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Author: Lei, Yi, Yang, Shan, Wang, Xinsheng, Xie, Qicong, Yao, Jixun, Xie, Lei, and Su, Dan
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating high-quality speaking and singing voice according to textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial to the applications requiring both of them. Existing methods usually suffer from some limitations, which rely on either both singing and speaking data from the same person or cascaded models of multiple tasks. To address these problems, a simplified elegant framework for TTS and SVS, named UniSyn, is proposed in this paper. It is an end-to-end unified model that can make a voice speak and sing with only singing or speaking data from this person. To be specific, a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with the speaker- and style-related (i.e. speak or sing) conditions for flexible control, is proposed in UniSyn. Moreover, supervised guided-VAE and timbre perturbation with the Wasserstein distance constraint are leveraged to further disentangle the speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voice without corresponding training data. The proposed approach outperforms the state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
Published: 2022

27. Nonlinear‐disturbance‐observer‐based predictive control for trajectory tracking of planar motors

Author: Su‐Dan Huang, Zhi‐Hui Xu, Guang‐Zhong Cao, Chao Wu, and Jiangbiao He
Subjects: linear motors, motion control, position control, predictive control, Applications of electric power, TK4001-4102
Abstract: To improve the trajectory tracking performance of planar motors against disturbances, model predictive position control (MPPC) methods using the non‐linear disturbance observer (NDO) are proposed in this study. Based on the single‐axis dynamic model with disturbances, a single‐axis NDO is designed using an extended state observer approach. The designed NDO is expressed as a third‐order non‐linear state‐space equation in which the position error, velocity error, and lumped disturbance in the single axis are taken as the state variables. Two MPPC methods are developed based on the NDO. In the first MPPC, the disturbance is embedded into the prediction model using the NDO, and a controller is designed to minimise a quadratic cost function, which is established by applying the prediction model with disturbance. The output of the controller is the control action. In the second MPPC, a controller is used to minimise the quadratic cost function, which is built by employing the prediction model without disturbance. The sum of the output of the controller and the compensated disturbance estimated by the NDO is the control action. The comparative experiment is performed on a planar motor system self‐developed in the laboratory. The effectiveness of the proposed methods is verified via the experimental results.
Published: 2024

28. Adsorption of Zn atoms by monolayer WS2 doped with different atoms X (X = O, Se, N, P, F, Cl): first principles study

Author: Mu, Yansong, Liu, Guili, Su, Dan, Yang, Zhonghua, and Zhang, Guoying
Published: 2024

29. Electronic structure and optical properties of nitrogen-doped antimonene under biaxial strain: first-principles study

Author: Wei, Ran, Liu, Guili, Qian, Shaoran, Su, Dan, and Zhang, Guoying
Published: 2024

30. Removing Efficiency and Mechanism of Ciprofloxacin from Aqueous Solution Using Rectorite

Author: Su, Dan, Huang, Jingyi, Li, Yang, Chen, Lin, and Wang, Yingru
Published: 2024

31. First principle study of the effect of doping on the optoelectronic properties of Cr-adsorbed MoS2

Author: Wei, Ran, Liu, Guili, Su, Dan, Ma, Mengting, Mu, Yansong, Yang, Zhonghua, and Zhang, Guoying
Published: 2024

32. Generative Long-form Question Answering: Relevance, Faithfulness and Succinctness

Author: Su, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In this thesis, we investigated the relevance, faithfulness, and succinctness aspects of Long Form Question Answering (LFQA). LFQA aims to generate an in-depth, paragraph-length answer for a given question, to help bridge the gap between real scenarios and the existing open-domain QA models which can only extract short-span answers. LFQA is quite challenging and under-explored. Few works have been done to build an effective LFQA system. It is even more challenging to generate a good-quality long-form answer relevant to the query and faithful to facts, since a considerable amount of redundant, complementary, or contradictory information will be contained in the retrieved documents. Moreover, no prior work has been investigated to generate succinct answers. We are among the first to research the LFQA task. We pioneered the research direction to improve the answer quality in terms of 1) query-relevance, 2) answer faithfulness, and 3) answer succinctness.
Comment: PhD Thesis
Published: 2022

33. Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

Author: Dai, Wenliang, Liu, Zihan, Ji, Ziwei, Su, Dan, and Fung, Pascale
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Large-scale vision-language pre-trained (VLP) models are prone to hallucinate non-existent visual objects when generating text based on visual information. In this paper, we systematically study the object hallucination problem from three aspects. First, we examine recent state-of-the-art VLP models, showing that they still hallucinate frequently, and models achieving better scores on standard metrics (e.g., CIDEr) could be more unfaithful. Second, we investigate how different types of image encoding in VLP influence hallucination, including region-based, grid-based, and patch-based. Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination. Third, we decouple various VLP objectives and demonstrate that token-level image-text alignment and controlled generation are crucial to reducing hallucination. Based on that, we propose a simple yet effective VLP loss named ObjMLM to further mitigate object hallucination. Results show that it reduces object hallucination by up to 17.4% when tested on two benchmarks (COCO Caption for in-domain and NoCaps for out-of-domain evaluation)., Comment: Accepted at EACL 2023
Published: 2022

34. Context Generation Improves Open Domain Question Answering

Author: Su, Dan, Patwary, Mostofa, Prabhumoye, Shrimai, Xu, Peng, Prenger, Ryan, Shoeybi, Mohammad, Fung, Pascale, Anandkumar, Anima, and Catanzaro, Bryan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Closed-book question answering (QA) requires a model to directly answer an open-domain question without access to any external knowledge. Prior work on closed-book QA either directly finetunes or prompts a pretrained language model (LM) to leverage the stored knowledge. However, they do not fully exploit the parameterized knowledge. To address this issue, we propose a two-stage, closed-book QA framework which employs a coarse-to-fine approach to extract relevant knowledge and answer a question. Our approach first generates a related context for a given question by prompting a pretrained LM. We then prompt the same LM for answer prediction using the generated context and the question. Additionally, to eliminate failure caused by context uncertainty, we marginalize over generated contexts. Experimental results on three QA benchmarks show that our method significantly outperforms previous closed-book QA methods (e.g. exact matching 68.6% vs. 55.3%), and is on par with open-book methods that exploit external knowledge sources (e.g. 68.6% vs. 68.0%). Our method is able to better exploit the stored knowledge in pretrained LMs without adding extra learnable parameters or needing finetuning, and paves the way for hybrid models that integrate pretrained LMs with external knowledge., Comment: 8 pages; Accepted at EACL2023
Published: 2022

35. The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022

Author: Qin, Xiaoyi, Li, Na, Lin, Yuke, Ding, Yiwei, Weng, Chao, Su, Dan, and Li, Ming
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper is the system description of the DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC22). In this challenge, we focus on track1 and track3. For track1, multiple backbone networks are adopted to extract frame-level features. Since track1 focuses on cross-age scenarios, we adopt the cross-age trials and perform QMF to calibrate scores. The magnitude-based quality measures achieve a large improvement. For track3, the semi-supervised domain adaptation task, the pseudo label method is adopted to perform domain adaptation. Considering the noisy labels in clustering, the ArcFace is replaced by Sub-center ArcFace. The final submission achieves 0.107 mDCF in task1 and 7.135% EER in task3.
Published: 2022

36. Test and analysis of energy characteristics of large vertical submersible pumps

Author: Chen Yang, Li Lingyu, Chen Huixiang, and Su Dan
Subjects: vertical submerged pump, model test, prototype test, efficiency, shaft power, General Works
Abstract: The efficiency of the pump device is an important parameter to judge the overall dynamic performance of the pumping station. The commonly used method at home and abroad is to carry out model tests of the pump device. The performance parameters of the prototype pump and pump device are obtained by the similarity conversion formula. However, at present, there are not many device model tests for large vertical submersible pumps. Taking a large vertical submersible mixed-flow pumping station in China as an example, research predicted the performance of the pump device through a model test and a submersible pump prototype test. The results show that the model test of the large vertical submersible mixed-flow pump device has a maximum efficiency of approximately 77.8%, and the prototype test conversion device has a maximum efficiency of approximately 80.33%. The device model test and the pump factory prototype test results are compared. It is found that the performance parameters of the pump measured by the prototype test are in good agreement with the device model test under the design conditions, and there is a certain error when the deviation from the design conditions is significant. The device model test and the factory test of the pump are indispensable in the large-scale road of submersible pumps, and a large number of tests are needed to sum up the experience.
Published: 2024
Full Text: View/download PDF

37. Multi-state data storage in a two-dimensional stripy antiferromagnet implemented by magnetoelectric effect

Author: Gu, Pingfan, Wang, Cong, Su, Dan, Dong, Zehao, Wang, Qiuyuan, Han, Zheng, Watanabe, Kenji, Taniguchi, Takashi, Ji, Wei, Sun, Young, and Ye, Yu
Subjects: Condensed Matter - Materials Science, Physics - Applied Physics
Abstract: A promising approach to the next generation of low-power, functional, and energy-efficient electronics relies on novel materials with coupled magnetic and electric degrees of freedom. In particular, stripy antiferromagnets often exhibit broken crystal and magnetic symmetries, which may bring about the magnetoelectric (ME) effect and enable the manipulation of intriguing properties and functionalities by electrical means. The demand for expanding the boundaries of data storage and processing technologies has led to the development of spintronics toward two-dimensional (2D) platforms. This work reports the ME effect in the 2D stripy antiferromagnetic insulator CrOCl down to a single layer. By measuring the tunneling resistance of CrOCl on the parameter space of temperature, magnetic field, and applied voltage, we verified the ME coupling down to the 2D limit and unraveled its mechanism. Utilizing the multi-stable states and ME coupling at magnetic phase transitions, we realize multi-state data storage in the tunneling devices. Our work not only advances the fundamental understanding of spin-charge coupling but also demonstrates the great potential of 2D antiferromagnetic materials to deliver devices and circuits beyond the traditional binary operations., Comment: 8 pages, 3 figures
Published: 2022
Full Text: View/download PDF

38. Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings

Author: Qin, Xiaoyi, Li, Na, Weng, Chao, Su, Dan, and Li, Ming
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation (AISR) learning method. Since VoxCeleb is collected from the YouTube platform, the dataset inherently consists of cross-age data. However, the meta-data does not contain the speaker age label. Therefore, we adopt the face age estimation method to predict the speaker age value from the associated visual data, then label the audio recording with the estimated age. We construct multiple Cross-Age test sets on VoxCeleb (Vox-CA), which deliberately select the positive trials with large age-gap. Also, the effect of nationality and gender is considered in selecting negative pairs to align with Vox-H cases. The baseline system performance drops from 1.939% EER on the Vox-H test set to 10.419% on the Vox-CA20 test set, which indicates how difficult the cross-age scenario is. Consequently, we propose an age-decoupling adversarial learning (ADAL) method to alleviate the negative effect of the age gap and reduce intra-class variance. Our method outperforms the baseline system by over 10% relative EER reduction on the Vox-CA20 test set. The source code and trial resources are available on https://github.com/qinxiaoyi/Cross-Age_Speaker_Verification, Comment: Accepted by Interspeech2022
Published: 2022

39. Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

Author: Lei, Yi, Yang, Shan, Cong, Jian, Xie, Lei, and Su, Dan
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting new voices in zero-shot scenario exist in both stages -- acoustic modeling and vocoder, previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem from both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model for extracting latent distribution $p(z)$ of speech and reconstructing waveform from it. Then a flow-based acoustic model only needs to learn the same $p(z)$ from texts, which naturally avoids the mismatch between the acoustic model and the vocoder, resulting in high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, and thus we can further conduct high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely pre-trained speaker encoder and jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 has been proved through TTS and VC experiments conducted on LibriTTS corpus and VCTK corpus.
Published: 2022

40. Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers

Author: Xue, Liumeng, Yang, Shan, Hu, Na, Su, Dan, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Building a voice conversion system for noisy target speakers, such as users providing noisy samples or Internet found data, is a challenging task since the use of contaminated speech in model training will apparently degrade the conversion performance. In this paper, we leverage the advances of our recently proposed Glow-WaveGAN and propose a noise-independent speech representation learning approach for high-quality voice conversion for noisy target speakers. Specifically, we learn a latent feature space where we ensure that the target distribution modeled by the conversion model is exactly from the modeled distribution of the waveform generator. With this premise, we further manage to make the latent feature to be noise-invariant. Specifically, we introduce a noise-controllable WaveGAN, which directly learns the noise-independent acoustic representation from waveform by the encoder and conducts noise control in the hidden space through a FiLM module in the decoder. As for the conversion model, importantly, we use a flow-based model to learn the distribution of noise-independent but speaker-related latent features from phoneme posteriorgrams. Experimental results demonstrate that the proposed model achieves high speech quality and speaker similarity in the voice conversion for noisy target speakers., Comment: Accepted by INTERSPEECH 2022
Published: 2022

41. End-to-End Voice Conversion with Information Perturbation

Author: Xie, Qicong, Yang, Shan, Lei, Yi, Xie, Lei, and Su, Dan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are insufficient to achieve comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfied due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage the recent advances in information perturbation and propose a fully end-to-end approach to conduct high-quality voice conversion. We first adopt information perturbation to remove speaker-related information in the source speech to disentangle speaker timbre and linguistic content and thus the linguistic information is subsequently modeled by a content encoder. To better transfer the prosody of the source speech to the target, we particularly introduce a speaker-related pitch encoder which can maintain the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is set up through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms the state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.
Published: 2022

42. EFFECTS OF LEVOSIMENDAN ON DIAPHRAGMATIC DYSFUNCTION IN PATIENTS WITH SEPSIS

Author: Wu, Jia-Qian, Wang, Ying-Xin, Su, Dan, Shao, Teng-Hao, Ding, Xiao-Xu, Sun, Tao, Cui, Na, and Yu, Zhan-Biao
Published: 2024
Full Text: View/download PDF

43. Uncovering scale effects on spatial patterns and interactions of multiple cropland ecosystem services

Author: Cao, Yu, Su, Dan, Wang, Jiayi, Li, Guoyu, Fang, Xiaoqian, Wu, Qing, and Cao, Yu
Published: 2023
Full Text: View/download PDF

44. Nitrate sources and their influence on hydrogeochemistry in karst caves of Southwest China

Author: Zhou, Zhongfa, Ding, Shengjun, Xiong, Yong, Shi, Liangxing, Su, Dan, Gong, Xiaohuan, Dong, Hui, and Yan, Lihui
Published: 2023
Full Text: View/download PDF

45. Time to sputum culture conversion and its associated factors among drug-resistant tuberculosis patients: a systematic review and meta-analysis

Author: Yang Wenlu, Zhao Xia, Wu Chuntao, Yu Qiaolin, Xiao Xujue, Yao Rong, Su Dan, Yan Xi, and Wan Bin
Subjects: DR-TB, Sputum culture conversion time, Risk factors, Meta-analysis, Infectious and parasitic diseases, RC109-216
Abstract: Objective: We aimed to evaluate the sputum culture conversion time of DR-TB patients and its related factors. Methods: PubMed, The Cochrane Library, Embase, CINAHL, Web of Science, CNKI, Wan Fang, CBM and VIP databases were electronically searched to collect studies on sputum culture conversion time in patients with DR-TB. Meta-analysis was performed by using the R 4.3.0 version and Stata 16 software. Results: A total of 45 studies involving 17373 patients were included. Meta-analysis results showed that the pooled median time to sputum culture conversion was 68.57 days (IQR 61.01, 76.12). The median time of sputum culture conversion in patients with drug-resistant tuberculosis was different in different WHO regions, countries with different levels of development and different treatment schemes. And female (aHR = 0.59, 95% CI: 0.46, 0.76), alcohol history (aHR = 0.70, 95% CI: 0.50, 0.98), smoking history (aHR = 0.58, 95% CI: 0.38, 0.88), history of SLD use (aHR = 0.64, 95% CI: 0.47, 0.87), BMI
Published: 2024
Full Text: View/download PDF

46. AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

Author: Song, Kun, Xue, Heyang, Wang, Xinsheng, Cong, Jian, Zhang, Yongmao, Xie, Lei, Yang, Bing, Zhang, Xiong, and Su, Dan
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been conducted towards this task, seldom work has been performed for low computational resource scenarios due to the challenges raised by the requirement of the lightweight model and less computational complexity. In this paper, a tiny VITS-based TTS model, named AdaVITS, for low computing resource speaker adaptation is proposed. To effectively reduce parameters and computational complexity of VITS, an iSTFT-based wave construction decoder is proposed to replace the upsampling-based decoder which is resource-consuming in the original VITS. Besides, NanoFlow is introduced to share the density estimate across flow blocks to reduce the parameters of the prior encoder. Furthermore, to reduce the computational complexity of the textual encoder, scaled-dot attention is replaced with linear attention. To deal with the instability caused by the simplified model, instead of using the original text encoder, phonetic posteriorgram (PPG) is utilized as linguistic feature via a text-to-PPG module, which is then used as input for the encoder. Experiment shows that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72GFlops computational complexity., Comment: Accepted by ISCSLP 2022
Published: 2022

47. Towards Answering Open-ended Ethical Quandary Questions

Author: Bang, Yejin, Lee, Nayeon, Yu, Tiezheng, Khalatbari, Leila, Xu, Yan, Cahyawijaya, Samuel, Su, Dan, Wilie, Bryan, Barraud, Romain, Barezi, Elham J., Madotto, Andrea, Kee, Hayden, and Fung, Pascale
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Considerable advancements have been made in various NLP tasks based on the impressive power of large language models (LLMs) and many NLP applications are deployed in our daily lives. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist to a single quandary. We explore the current capability of LLMs in providing an answer with a deliberative exchange of different perspectives to an ethical quandary, in the approach of Socratic philosophy, instead of providing a closed answer like an oracle. We propose a model that searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also discuss the remaining challenges and ethical issues involved in this task and suggest the direction toward developing responsible NLP systems by incorporating human values explicitly., Comment: 16 pages
Published: 2022

48. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Author: Huang, Rongjie, Lam, Max W. Y., Wang, Jun, Su, Dan, Yu, Dong, Ren, Yi, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherited iterative sampling process costs hindered their applications to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the sampling steps without sacrificing the generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff demonstrates the state-of-the-art results with higher-quality (MOS 4.28) speech samples. Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalized well to the mel-spectrogram inversion of unseen speakers, and FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/., Comment: Accepted by IJCAI 2022
Published: 2022

49. 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Author: You, Zhao, Feng, Shulin, Su, Dan, and Yu, Dong
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, Conformer based CTC/AED model has become a mainstream architecture for ASR. In this paper, based on our prior work, we identify and integrate several approaches to achieve further improvements for ASR tasks, which we denote as multi-loss, multi-path and multi-level, summarized as "3M" model. Specifically, multi-loss refers to the joint CTC/AED loss and multi-path denotes the Mixture-of-Experts (MoE) architecture which can effectively increase the model capacity without remarkably increasing computation cost. Multi-level means that we introduce auxiliary loss at multiple levels of a deep model to help training. We evaluate our proposed method on the public WenetSpeech dataset and experimental results show that the proposed method provides 12.2%-17.6% relative CER improvement over the baseline model trained by Wenet toolkit. On our large scale dataset of 150k hours corpus, the 3M model has also shown obvious superiority over the baseline Conformer model. Code is publicly available at https://github.com/tencent-ailab/3m-asr., Comment: 5 pages, 1 figure. Submitted to INTERSPEECH 2022
Published: 2022

50. Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

Author: Zhou, Yixuan, Song, Changhe, Li, Xiang, Zhang, Luwen, Wu, Zhiyong, Bian, Yanyao, Su, Dan, and Meng, Helen
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and parameters. Previous researches usually use a speaker encoder to extract a global fixed speaker embedding from reference speech, and several attempts have tried variable-length speaker embedding. However, they neglect to transfer the personal pronunciation characteristics related to phoneme content, leading to poor speaker similarity in terms of detailed speaking styles and pronunciation habits. To improve the ability of the speaker encoder to model personal pronunciation characteristics, we propose content-dependent fine-grained speaker embedding for zero-shot speaker adaptation. The corresponding local content embeddings and speaker embeddings are extracted from a reference speech, respectively. Instead of modeling the temporal relations, a reference attention module is introduced to model the content relevance between the reference speech and the input text, and to generate the fine-grained speaker embedding for each phoneme encoder output. The experimental results show that our proposed method can improve speaker similarity of synthesized speeches, especially for unseen speakers., Comment: Accepted by Interspeech 2022
Published: 2022
