Author: "Liu, Shujie" / Topic: computation and language (cs.cl) - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Liu, Shujie"' showing total 30 results

Start Over Author "Liu, Shujie" Topic computation and language (cs.cl)

30 results on '"Liu, Shujie"'

1. On decoder-only architecture for speech-to-text and large language model integration

Author: Wu, Jian, Gaur, Yashesh, Chen, Zhuo, Zhou, Long, Zhu, Yimeng, Wang, Tianrui, Li, Jinyu, Liu, Shujie, Ren, Bo, Liu, Linquan, and Wu, Yu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
Published: 2023

2. Accelerating Transducers through Adjacent Token Merging

Author: Li, Yuang, Wu, Yu, Li, Jinyu, and Liu, Shujie
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embedding at a high frame rate. However, this design is inefficient, particularly for long speech signals due to the quadratic computation of self-attention. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens with high similarity scores between their key values. In this way, the total time step could be reduced, and the inference of both the encoder and joint network is accelerated. Experiments on LibriSpeech show that our method can reduce 57% of tokens and improve the inference speed on GPU by 70% without any notable loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective solution to reduce tokens in long-form ASR, where the input speech consists of multiple utterances., Interspeech 2023
Published: 2023

3. Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition

Author: Li, Yuang, Wu, Yu, Li, Jinyu, and Liu, Shujie
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The integration of Language Models (LMs) has proven to be an effective way to address domain shifts in speech recognition. However, these approaches usually require a significant amount of target domain text data for the training of LMs. Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM). LLM is used in two ways: 1) second-pass rescoring: reranking N-best hypotheses of a given ASR system with LLaMA; 2) deep LLM-fusion: incorporating LLM into the decoder of an encoder-decoder based ASR system. Experiments show that, with only one domain prompt, both methods can effectively reduce word error rates (WER) on out-of-domain TedLium-2 and SPGISpeech datasets. Especially, the deep LLM-fusion has the advantage of better recall of entity and out-of-vocabulary words.
Published: 2023

4. Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Author: Wei, Kun, Zhou, Long, Zhang, Ziqiang, Chen, Liping, Liu, Shujie, He, Lei, Li, Jinyu, and Wei, Furu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S gets an improvement of about 5 BLEU scores compared to encoder-only pre-training models, and achieves a competitive or even better performance than existing state-of-the-art models1., Comment: Submitted to ICASSP 2023
Published: 2023

5. VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

Author: Wang, Tianrui, Zhou, Long, Zhang, Ziqiang, Wu, Yu, Liu, Shujie, Gaur, Yashesh, Chen, Zhuo, Li, Jinyu, and Wei, Furu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional codec language model task via multi-task learning framework. To accomplish this, we first convert all the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In such a way, all these tasks are converted to token-based sequence conversion problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks. Experimental results demonstrate that the proposed VioLA model can support both single-modal and cross-modal tasks well, and the decoder-only model achieves a comparable and even better performance than the strong baselines., Working in progress
Published: 2023

6. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Author: Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, and Wei, Furu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work., Comment: Working in progress
Published: 2023
Full Text: View/download PDF

7. ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Author: Le, Chenyang, Qian, Yao, Zhou, Long, Liu, Shujie, Zeng, Michael, and Huang, Xuedong
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
Published: 2023
Full Text: View/download PDF

8. Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Author: Zhang, Ziqiang, Zhou, Long, Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, and Wei, Furu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at \url{https://aka.ms/vallex}., Comment: We encourage readers to listen to the audio samples on our demo page: \url{https://aka.ms/vallex}
Published: 2023
Full Text: View/download PDF

9. Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Author: Sun, Eric, Li, Jinyu, Hu, Yuxuan, Zhu, Yimeng, Zhou, Long, Xue, Jian, Wang, Peidong, Liu, Linquan, Liu, Shujie, Lin, Edward, and Gong, Yifan
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and LID loss, enabling transformer experts to learn language-specific information. By combining gated transformer experts with shared transformer layers, we construct multilingual transformer blocks and utilize linear experts to effectively regularize the joint network. The curriculum training scheme leverages LID to guide the gated experts in improving their respective language performance. Experimental results on a bilingual task involving English and Spanish demonstrate significant improvements, with average relative word error reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Notably, our method achieves performance comparable to the upper-bound model trained and inferred with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals similar advantages to those observed in the bilingual models, highlighting its ease of extension to multiple languages.
Published: 2023
Full Text: View/download PDF

10. BEATs: Audio Pre-Training with Acoustic Tokenizers

Author: Chen, Sanyuan, Wu, Yu, Wang, Chengyi, Liu, Shujie, Tompkins, Daniel, Chen, Zhuo, and Wei, Furu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), Computer Science - Artificial Intelligence, FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, the state-of-the-art audio SSL models still employ reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract the high-level audio semantics and discard the redundant details as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, due to the continuous property of audio and unavailable phoneme sequences like speech. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, where an acoustic tokenizer and an audio SSL model are optimized by iterations. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated with the hope of mutual promotion of the acoustic tokenizer and audio SSL model. The experimental results demonstrate our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even outperforming previous models that use more training data and model parameters significantly. Specifically, we set a new state-of-the-art mAP 50.6% on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
Published: 2022

11. SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Author: Zhang, Ziqiang, Zhou, Long, Ao, Junyi, Liu, Shujie, Dai, Lirong, Li, Jinyu, and Wei, Furu
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden-unit as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data respectively. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT gets substantial improvements over strong baselines, and achieves state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand the proposed SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at https://aka.ms/SpeechUT., 14 pages, accepted by EMNLP 2022
Published: 2022

12. The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task

Author: Zhang, Ziqiang, Ao, Junyi, Zhou, Long, Liu, Shujie, Wei, Furu, and Li, Jinyu
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task, which translates from English audio to German, Chinese, and Japanese. The YiTrans system is built on large-scale pre-trained encoder-decoder models. More specifically, we first design a multi-stage pre-training strategy to build a multi-modality model with a large amount of labeled and unlabeled data. We then fine-tune the corresponding components of the model for the downstream speech translation tasks. Moreover, we make various efforts to improve performance, such as data filtering, data augmentation, speech segmentation, model ensemble, and so on. Experimental results show that our YiTrans system obtains a significant improvement than the strong baseline on three translation directions, and it achieves +5.2 BLEU improvements over last year's optimal end-to-end system on tst2021 English-German. Our final submissions rank first on English-German and English-Chinese end-to-end systems in terms of the automatic evaluation metric. We make our code and models publicly available., 11 pages
Published: 2022

13. LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers

Author: Wang, Peidong, Sun, Eric, Xue, Jian, Wu, Yu, Zhou, Long, Gaur, Yashesh, Liu, Shujie, and Li, Jinyu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. It is thus possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and do not require source language identification (i.e. language-agnostic). In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Based on the transducer model structure, we propose four methods, a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for encoder, and connectionist temporal classification regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also reaches the performances of monolingual ASR and bilingual ST models., Comment: INTERSPEECH 2023
Published: 2022
Full Text: View/download PDF

14. SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Author: Zhang, Ziqiang, Chen, Sanyuan, Zhou, Long, Wu, Yu, Ren, Shuo, Liu, Shujie, Yao, Zhuoyuan, Gong, Xun, Dai, Lirong, Li, Jinyu, and Wei, Furu
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. Leveraging only 10K text sentences, our SpeechLM gets a 16\% relative WER reduction over the best base model performance (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, SpeechLM with fewer parameters even outperforms previous SOTA models on CoVoST-2 speech translation tasks. We also evaluate our SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://aka.ms/SpeechLM., We discovered some errors about the data in the pre-training experiments, which will affect all the results about SpeechLM-P Base models. We are re-conducting the related experiments and will update the paper with the new results
Published: 2022
Full Text: View/download PDF

15. LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

Author: Gong, Xun, Wu, Yu, Li, Jinyu, Liu, Shujie, Zhao, Rui, Chen, Xie, and Qian, Yanmin
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Traditional automatic speech recognition~(ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending longer transcription history for a vanilla neural transducer model shows no much gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, containing a real language model, the vocabulary predictor. We propose the {LongFNT-Text} architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder RoBERTa to further boost the performance. Moreover, we propose the {LongFNT} architecture by extending the long-form speech to the original speech input and achieve the best performance. The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate~(WER) reduction, respectively., Comment: Submitted to ICASSP2023
Published: 2022
Full Text: View/download PDF

16. Self-Supervised Learning for speech recognition with Intermediate layer supervision

Author: Wang, Chengyi, Wu, Yu, Chen, Sanyuan, Liu, Shujie, Li, Jinyu, Qian, Yao, and Yang, Zhenglu
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information. Since the network capacity is limited, we believe the speech recognition performance could be further improved if the model is dedicated to audio content information learning. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly, which achieves a 23.5%/11.6% relative word error rate reduction in the w/o language model setting for base/large models. Detailed analysis shows the bottom layers of our model have a better correlation with phonetic units, which is consistent with our intuition and explains the success of our method for ASR., Submitted to ICASSP 2022
Published: 2021

17. UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

Author: Wang, Chengyi, Wu, Yu, Qian, Yao, Kumatani, Kenichi, Liu, Shujie, Wei, Furu, Zeng, Michael, and Huang, Xuedong
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, ComputingMethodologies_PATTERNRECOGNITION, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach., accepted by ICML2021
Published: 2021

18. Knowledge Enhanced Fine-Tuning for Better Handling Unseen Entities in Dialogue Generation

Author: Cui, Leyang, Wu, Yu, Liu, Shujie, and Zhang, Yue
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Although pre-training models have achieved great success in dialogue generation, their performance drops dramatically when the input contains an entity that does not appear in pre-training and fine-tuning datasets (unseen entity). To address this issue, existing methods leverage an external knowledge base to generate appropriate responses. In real-world scenario, the entity may not be included by the knowledge base or suffer from the precision of knowledge retrieval. To deal with this problem, instead of introducing knowledge base as the input, we force the model to learn a better semantic representation by predicting the information in the knowledge base, only based on the input context. Specifically, with the help of a knowledge base, we introduce two auxiliary training objectives: 1) Interpret Masked Word, which conjectures the meaning of the masked entity given the context; 2) Hypernym Generation, which predicts the hypernym of the entity based on the context. Experiment results on two dialogue corpus verify the effectiveness of our methods under both knowledge available and unavailable settings., EMNLP 2021
Published: 2021

19. A Configurable Multilingual Model is All You Need to Recognize All Languages

Author: Zhou, Long, Li, Jinyu, Sun, Eric, and Liu, Shujie
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multilingual automatic speech recognition (ASR) models have shown great promise in recent years because of the simplified model training and deployment process. Conventional methods either train a universal multilingual model without taking any language information or with a 1-hot language ID (LID) vector to guide the recognition of the target language. In practice, the user can be prompted to pre-select several languages he/she can speak. The multilingual model without LID cannot well utilize the language information set by the user while the multilingual model with LID can only handle one pre-selected language. In this paper, we propose a novel configurable multilingual model (CMM) which is trained only once but can be configured as different models based on users' choices by extracting language-specific modules together with a universal model from the trained CMM. Particularly, a single CMM can be deployed to any user scenario where the users can pre-select any combination of languages. Trained with 75K hours of transcribed anonymized Microsoft multilingual data and evaluated with 10-language test sets, the proposed CMM improves from the universal multilingual model by 26.0%, 16.9%, and 10.4% relative word error reduction when the user selects 1, 2, or 3 languages, respectively. CMM also performs significantly better on code-switching test sets.
Published: 2021
Full Text: View/download PDF

20. GraphCodeBERT: Pre-training Code Representations with Data Flow

Author: Guo, Daya, Ren, Shuo, Lu, Shuai, Feng, Zhangyin, Tang, Duyu, Liu, Shujie, Zhou, Long, Duan, Nan, Svyatkovskiy, Alexey, Fu, Shengyu, Tufano, Michele, Deng, Shao Kun, Clement, Colin, Drain, Dawn, Sundaresan, Neel, Yin, Jian, Jiang, Daxin, and Zhou, Ming
Subjects: Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Software Engineering, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search., Comment: Accepted by ICLR2021
Published: 2020
Full Text: View/download PDF

21. Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

Author: Chen, Sanyuan, Wu, Yu, Chen, Zhuo, Yoshioka, Takuya, Liu, Shujie, and Li, Jinyu
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: With its strong modeling capacity that comes from a multi-head and multi-layer structure, Transformer is a very powerful model for learning a sequential representation and has been successfully applied to speech separation recently. However, multi-channel speech separation sometimes does not necessarily need such a heavy structure for all time frames especially when the cross-talker challenge happens only occasionally. For example, in conversation scenarios, most regions contain only a single active speaker, where the separation task downgrades to a single speaker enhancement problem. It turns out that using a very deep network structure for dealing with signals with a low overlap ratio not only negatively affects the inference efficiency but also hurts the separation performance. To deal with this problem, we propose an early exit mechanism, which enables the Transformer model to handle different cases with adaptive depth. Experimental results indicate that not only does the early exit mechanism accelerate the inference, but it also improves the accuracy.
Published: 2020
Full Text: View/download PDF

22. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Author: Ren, Shuo, Guo, Daya, Lu, Shuai, Zhou, Long, Liu, Shujie, Tang, Duyu, Sundaresan, Neel, Zhou, Ming, Blanco, Ambrosio, and Ma, Shuai
Subjects: Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Software Engineering, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy., Comment: 8 pages, 6 figures
Published: 2020
Full Text: View/download PDF

23. Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network

Author: Wang, Chengyi, Wu, Shuangzhi, and Liu, Shujie
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Due to the highly parallelizable architecture, Transformer is faster to train than RNN-based models and popularly used in machine translation tasks. However, at inference time, each output word requires all the hidden states of the previously generated words, which limits the parallelization capability, and makes it much slower than RNN-based ones. In this paper, we systematically analyze the time cost of different components of both the Transformer and RNN-based model. Based on it, we propose a hybrid network of self-attention and RNN structures, in which, the highly parallelizable self-attention is utilized as the encoder, and the simpler RNN structure is used as the decoder. Our hybrid network can decode 4-times faster than the Transformer. In addition, with the help of knowledge distillation, our hybrid network achieves comparable translation quality to the original Transformer.
Published: 2019

24. Source Dependency-Aware Transformer with Supervised Self-Attention

Author: Wang, Chengyi, Wu, Shuangzhi, and Liu, Shujie
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Recently, Transformer has achieved the state-of-the-art performance on many machine translation tasks. However, without syntax knowledge explicitly considered in the encoder, incorrect context information that violates the syntax structure may be integrated into source hidden states, leading to erroneous translations. In this paper, we propose a novel method to incorporate source dependencies into the Transformer. Specifically, we adopt the source dependency tree and define two matrices to represent the dependency relations. Based on the matrices, two heads in the multi-head self-attention module are trained in a supervised manner and two extra cross entropy losses are introduced into the training objective function. Under this training objective, the model is trained to learn the source dependency relations directly. Without requiring pre-parsed input during inference, our model can generate better translations with the dependency-aware context information. Experiments on bi-directional Chinese-to-English, English-to-Japanese and English-to-German translation tasks show that our proposed method can significantly improve the Transformer baseline., Comment: 6 pages
Published: 2019
Full Text: View/download PDF

25. Neural Speech Synthesis with Transformer Network

Author: Li, Naihan, Liu, Shujie, Liu, Yanqing, Zhao, Sheng, Liu, Ming, and Zhou, Ming
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs). Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves the training efficiency. Meanwhile, any two inputs at different times are connected directly by self-attention mechanism, which solves the long range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. For the efficiency, our Transformer TTS network can speed up the training about 4.25 times faster compared with Tacotron2. For the performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforms Tacotron2 with a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).
Published: 2018

26. Approximate Distribution Matching for Sequence-to-Sequence Learning

Author: Chen, Wenhu, Li, Guanlin, Liu, Shujie, Zhang, Zhirui, Li, Mu, and Zhou, Ming
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computation and Language (cs.CL)
Abstract: Sequence-to-Sequence models were introduced to tackle many real-life problems like machine translation, summarization, image captioning, etc. The standard optimization algorithms are mainly based on example-to-example matching like maximum likelihood estimation, which is known to suffer from data sparsity problem. Here we present an alternate view to explain sequence-to-sequence learning as a distribution matching problem, where each source or target example is viewed to represent a local latent distribution in the source or target domain. Then, we interpret sequence-to-sequence learning as learning a transductive model to transform the source local latent distributions to match their corresponding target distributions. In our framework, we approximate both the source and target latent distributions with recurrent neural networks (augmenter). During training, the parallel augmenters learn to better approximate the local latent distributions, while the sequence prediction model learns to minimize the KL-divergence of the transformed source distributions and the approximated target distributions. This algorithm can alleviate the data sparsity issues in sequence learning by locally augmenting more unseen data pairs and increasing the model's robustness. Experiments conducted on machine translation and image captioning consistently demonstrate the superiority of our proposed algorithm over the other competing algorithms., 9 pages, 8 figures
Published: 2018

27. Style Transfer as Unsupervised Machine Translation

Author: Zhang, Zhirui, Ren, Shuo, Liu, Shujie, Wang, Jianyong, Chen, Peng, Li, Mu, Zhou, Ming, and Chen, Enhong
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Language style transferring rephrases text with specific stylistic attributes while preserving the original attribute-independent content. One main challenge in learning a style transfer system is a lack of parallel data where the source sentence is in one style and the target sentence in another style. With this constraint, in this paper, we adapt unsupervised machine translation methods for the task of automatic style transfer. We first take advantage of style-preference information and word embedding similarity to produce pseudo-parallel data with a statistical machine translation (SMT) framework. Then the iterative back-translation approach is employed to jointly train two neural machine translation (NMT) based transfer systems. To control the noise generated during joint training, a style classifier is introduced to guarantee the accuracy of style transfer and penalize bad candidates in the generated pseudo data. Experiments on benchmark datasets show that our proposed method outperforms previous state-of-the-art models in terms of both accuracy of style transfer and quality of input-output correspondence.
Published: 2018
Full Text: View/download PDF

28. Achieving Human Parity on Automatic Chinese to English News Translation

Author: Hassan, Hany, Aue, Anthony, Chen, Chang, Chowdhary, Vishal, Clark, Jonathan, Federmann, Christian, Huang, Xuedong, Junczys-Dowmunt, Marcin, Lewis, William, Li, Mu, Liu, Shujie, Liu, Tie-Yan, Luo, Renqian, Menezes, Arul, Qin, Tao, Seide, Frank, Tan, Xu, Tian, Fei, Wu, Lijun, Wu, Shuangzhi, Xia, Yingce, Zhang, Dongdong, Zhang, Zhirui, and Zhou, Ming
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human parity in translation. We then describe Microsoft's machine translation system and measure the quality of its translations on the widely used WMT 2017 news translation task from Chinese to English. We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations. We also find that it significantly exceeds the quality of crowd-sourced non-professional translations.
Published: 2018
Full Text: View/download PDF

29. Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model

Author: Feng, Shi, Liu, Shujie, Li, Mu, and Zhou, Ming
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Data_CODINGANDINFORMATIONTHEORY, Computation and Language (cs.CL)
Abstract: Neural machine translation has shown very promising results lately. Most NMT models follow the encoder-decoder framework. To make encoder-decoder models more flexible, attention mechanism was introduced to machine translation and also other tasks like speech recognition and image captioning. We observe that the quality of translation by attention-based encoder-decoder can be significantly damaged when the alignment is incorrect. We attribute these problems to the lack of distortion and fertility models. Aiming to resolve these problems, we propose new variations of attention-based encoder-decoder and compare them with other models on machine translation. Our proposed method achieved an improvement of 2 BLEU points over the original attention-based encoder-decoder., Comment: 11 pages, updated details
Published: 2016
Full Text: View/download PDF

30. Beyond Word-based Language Model in Statistical Machine Translation

Author: Zhang, Jiajun, Liu, Shujie, Li, Mu, Zhou, Ming, and Zong, Chengqing
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Language model is one of the most important modules in statistical machine translation and currently the word-based language model dominants this community. However, many translation models (e.g. phrase-based models) generate the target language sentences by rendering and compositing the phrases rather than the words. Thus, it is much more reasonable to model dependency between phrases, but few research work succeed in solving this problem. In this paper, we tackle this problem by designing a novel phrase-based language model which attempts to solve three key sub-problems: 1, how to define a phrase in language model; 2, how to determine the phrase boundary in the large-scale monolingual data in order to enlarge the training set; 3, how to alleviate the data sparsity problem due to the huge vocabulary size of phrases. By carefully handling these issues, the extensive experiments on Chinese-to-English translation show that our phrase-based language model can significantly improve the translation quality by up to +1.47 absolute BLEU score., Comment: 8 pages
Published: 2015
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

30 results on '"Liu, Shujie"'

1. On decoder-only architecture for speech-to-text and large language model integration

2. Accelerating Transducers through Adjacent Token Merging

3. Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition

4. Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

5. VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

6. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

7. ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

8. Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

9. Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

10. BEATs: Audio Pre-Training with Acoustic Tokenizers

11. SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

12. The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task

13. LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers

14. SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

15. LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

16. Self-Supervised Learning for speech recognition with Intermediate layer supervision

17. UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

18. Knowledge Enhanced Fine-Tuning for Better Handling Unseen Entities in Dialogue Generation

19. A Configurable Multilingual Model is All You Need to Recognize All Languages

20. GraphCodeBERT: Pre-training Code Representations with Data Flow

21. Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

22. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

23. Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network

24. Source Dependency-Aware Transformer with Supervised Self-Attention

25. Neural Speech Synthesis with Transformer Network

26. Approximate Distribution Matching for Sequence-to-Sequence Learning

27. Style Transfer as Unsupervised Machine Translation

28. Achieving Human Parity on Automatic Chinese to English News Translation

29. Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model

30. Beyond Word-based Language Model in Statistical Machine Translation

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Journal

Database

Publisher

30 results on '"Liu, Shujie"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources