Author: "Cheng, Ning" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Cheng, Ning"' showing total 2,064 results

1. PFID: Privacy First Inference Delegation Framework for LLMs

Author: Yang, Haoyan, Li, Zhitao, Zhang, Yong, Wang, Jianzong, Cheng, Ning, Li, Ming, and Xiao, Jing
Subjects: Computer Science - Computation and Language
Abstract: This paper introduces a novel privacy-preservation framework named PFID for LLMs that addresses critical privacy concerns by localizing user data through model sharding and singular value decomposition. When users interact with LLM systems, their prompts could be exposed to eavesdroppers within or outside LLM system providers who are interested in collecting users' input. In this work, we propose a framework to camouflage user input, so as to alleviate privacy issues. Our framework places model shards on the client and the public server, and sends compressed hidden states instead of prompts to and from servers. Clients hold back information that can re-privatize the hidden states so that overall system performance is comparable to traditional LLM services. Our framework is designed to be communication efficient: computation can be delegated to the local client so that the server's computation burden is lightened. We conduct extensive experiments on machine translation tasks to verify our framework's performance., Comment: Submitted to EMNLP2024
Published: 2024

2. EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

Author: Zhuang, Ziyang, Miao, Chenfeng, Zou, Kun, Fang, Ming, Wei, Tao, Li, Zijian, Cheng, Ning, Hu, Wei, Wang, Shaojun, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EffectiveASR achieves competitive results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the leading models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the AR Conformer with about 30x inference speedup., Comment: Submitted to ICASSP 2025
Published: 2024

3. Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation

Author: Cheng, Ning, Guan, Changhao, Gao, Jing, Wang, Weihao, Li, You, Meng, Fandong, Zhou, Jie, Fang, Bin, Xu, Jinan, and Han, Wenjuan
Subjects: Computer Science - Robotics
Abstract: Touch holds a pivotal position in enhancing the perceptual and interactive capabilities of both humans and robots. Despite its significance, current tactile research mainly focuses on visual and tactile modalities, overlooking the language domain. Inspired by this, we construct Touch100k, a paired touch-language-vision dataset at the scale of 100k, featuring tactile sensation descriptions in multiple granularities (i.e., sentence-level natural expressions with rich semantics, including contextual and dynamic relationships, and phrase-level descriptions capturing the key features of tactile sensations). Based on the dataset, we propose a pre-training method, Touch-Language-Vision Representation Learning through Curriculum Linking (TLV-Link, for short), inspired by the concept of curriculum learning. TLV-Link aims to learn a tactile representation for the GelSight sensor and capture the relationship between tactile, language, and visual modalities. We evaluate our representation's performance across two task categories (namely, material property identification and robot grasping prediction), focusing on tactile representation and zero-shot touch understanding. The experimental evaluation showcases the effectiveness of our representation. By enabling TLV-Link to achieve substantial improvements and establish a new state-of-the-art in touch-centric multimodal representation learning, Touch100k demonstrates its value as a valuable resource for research. Project page: https://cocacola-lab.github.io/Touch100k/.
Published: 2024

4. Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Author: Shi, Haoxiang, Zhang, Xulong, Cheng, Ning, Zhang, Yong, Yu, Jun, Xiao, Jing, and Wang, Jianzong
Subjects: Computer Science - Computation and Language
Abstract: The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, resulting in the model being unable to focus on modality-specific emotional information. At the same time, the shared information between modalities was not processed when generating emotions, which leads to an information redundancy problem. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network mainly includes two stages: the multi-modal feature fusion stage based on connection vectors and the emotion classification stage based on fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, demonstrating excellent performance on the IEMOCAP and MELD datasets., Comment: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)
Published: 2024

5. RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Author: Shi, Haoxiang, Wang, Jianzong, Zhang, Xulong, Cheng, Ning, Yu, Jun, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information., Comment: Accepted by the 8th APWeb-WAIM International Joint Conference on Web and Big Data
Published: 2024

6. RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

Author: Wang, Jianzong, Shi, Haoxiang, Luo, Kaiyi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Information Retrieval
Abstract: Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be only partially provided. In this research, we introduce an innovative unsupervised hashing technique designed for semi-paired cross-modal retrieval tasks, named Reconstruction Relations Embedded Hashing (RREH). RREH assumes that multi-modal data share a common subspace. For paired data, RREH explores the latent consistent information of heterogeneous modalities by seeking a shared representation. For unpaired data, to effectively capture the latent discriminative features, the high-order relationships between unpaired data and anchors, which are computed by efficient linear reconstruction, are embedded into the latent subspace. The anchors are sampled from paired data, which improves the efficiency of hash learning. RREH trains the underlying features and the binary encodings in a unified framework with high-order reconstruction relations preserved. With the well-devised objective function and discrete optimization algorithm, RREH is designed to be scalable, making it suitable for large-scale datasets and facilitating efficient cross-modal retrieval. In the evaluation process, the proposed method is tested with partially paired data to establish its superiority over several existing methods., Comment: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)
Published: 2024

7. Transformer in Touch: A Survey

Author: Gao, Jing, Cheng, Ning, Fang, Bin, and Han, Wenjuan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The Transformer model, initially achieving significant success in the field of natural language processing, has recently shown great potential in the application of tactile perception. This review aims to comprehensively outline the application and development of Transformers in tactile technology. We first introduce the two fundamental concepts behind the success of the Transformer: the self-attention mechanism and large-scale pre-training. Then, we delve into the application of Transformers in various tactile tasks, including but not limited to object recognition, cross-modal generation, and object manipulation, offering a concise summary of the core methodologies, performance benchmarks, and design highlights. Finally, we suggest potential areas for further research and future work, aiming to generate more interest within the community, tackle existing challenges, and encourage the use of Transformer models in the tactile field., Comment: 27 pages, 2 tables, 5 figures, accepted by ICIC 2024
Published: 2024

8. Potential and Limitations of LLMs in Capturing Structured Semantics: A Case Study on SRL

Author: Cheng, Ning, Yan, Zhaohui, Wang, Ziming, Li, Zhijie, Yu, Jiaming, Zheng, Zilong, Tu, Kewei, Xu, Jinan, and Han, Wenjuan
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias. Nevertheless, an ongoing controversy exists over the extent to which LLMs can grasp structured semantics. To assess this, we propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics. In our assessment, we employ the prompting approach, which leads to the creation of our few-shot SRL parser, called PromptSRL. PromptSRL enables LLMs to map natural languages to explicit semantic structures, which provides an interpretable window into the properties of LLMs. We find interesting potential: LLMs can indeed capture semantic structures, and scaling-up doesn't always mirror potential. Additionally, limitations of LLMs are observed in C-arguments, etc. Lastly, we are surprised to discover that significant overlap in the errors is made by both LLMs and untrained humans, accounting for almost 30% of all errors., Comment: Accepted by ICIC 2024
Published: 2024

9. MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Author: Li, Pengcheng, Wang, Jianzong, Zhang, Xulong, Zhang, Yong, Xiao, Jing, and Cheng, Ning
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and suffer from sizable networks as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to effectively disentangle via a concise neural network. The proposed model utilizes Siamese encoders to learn clean representations, further enhanced by the designed mutual information estimator. The Siamese structure and the newly designed convolution module contribute to the lightweight of our model while ensuring performance in diverse voice conversion tasks. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario., Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

10. Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Author: Deng, Yimin, Wang, Jianzong, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage while the prosodic information of hidden units lacks use. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on soft speech units with knowledge distillation. Experiment results show that the intelligibility and naturalness of converted speech outperform previous work., Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

11. QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering

Author: Ouyang, Sheng, Wang, Jianzong, Zhang, Yong, Li, Zhitao, Liang, Ziqi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computation and Language
Abstract: Extractive Question Answering (EQA) in Machine Reading Comprehension (MRC) often faces the challenge of dealing with semantically identical but format-variant inputs. Our work introduces a novel approach, called the "Query Latent Semantic Calibrator (QLSC)", designed as an auxiliary module for existing MRC models. We propose a unique scaling strategy to capture latent semantic center features of queries. These features are then seamlessly integrated into traditional query and passage embeddings using an attention mechanism. By deepening the comprehension of the semantic queries-passage relationship, our approach diminishes sensitivity to variations in text format and boosts the model's capability in pinpointing accurate answers. Experimental results on robust Question-Answer datasets confirm that our approach effectively handles format-variant but semantically identical queries, highlighting the effectiveness and adaptability of our proposed method., Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

12. EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Author: Liang, Ziqi, Wang, Jianzong, Zhang, Xulong, Zhang, Yong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally take into account disentangling speech components through human-crafted bottleneck features which can not achieve sufficient information disentangling, while pitch and rhythm may still be mixed together. There is a risk of information overlap in the disentangling process which results in less speech naturalness. To overcome such limits, we propose a two-stage model to disentangle speech representations in a self-supervised manner without a human-crafted bottleneck design, which uses the Mutual Information (MI) with the designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage issues. Experiments show that our model can achieve a better performance than the baseline, regarding disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc., Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

13. EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

Author: Wang, Jianzong, Liang, Ziqi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively., Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

14. CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

Author: Wang, Jianzong, Li, Pengcheng, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Singing voice beautifying is a novel task that has application value in people's daily life, aiming to correct the pitch of the singing voice and improve the expressiveness without changing the original timbre and content. Existing methods rely on paired data or only concentrate on the correction of pitch. However, professional songs and amateur songs from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects like emotion and rhythm. Hence, we propose a fast and high-fidelity singing voice beautifying system called ConTuner, a diffusion model combined with a modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping relationship from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose the expressiveness enhancer in the latent space to convert amateur vocal tone to professional. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective., Comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Published: 2024

15. Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

Author: Cheng, Ning, Li, You, Gao, Jing, Fang, Bin, Xu, Jinan, and Han, Wenjuan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, the multimodal research related to touch primarily focuses on visual and tactile modalities, with limited exploration in the domain of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimode alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/., Comment: Accepted by ICIC 2024
Published: 2024

16. Medical Speech Symptoms Classification via Disentangled Representation

Author: Wang, Jianzong, Li, Pengcheng, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Artificial Intelligence
Abstract: Intent is defined for understanding spoken language in existing works. Both textual features and acoustic features involved in medical speech contain intent, which is important for symptomatic diagnosis. In this paper, we propose a medical speech classification model named DRSC that automatically learns to disentangle intent and content representations from textual-acoustic data for classification. The intent representations of the text domain and the Mel-spectrogram domain are extracted via intent encoders, and then the reconstructed text feature and the Mel-spectrogram feature are obtained through two exchanges. After combining the intent from two domains into a joint representation, the integrated intent representation is fed into a decision layer for classification. Experimental results show that our model obtains an average accuracy rate of 95% in detecting 25 different medical symptoms., Comment: Accepted by the 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2024)
Published: 2024

17. Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

Author: Li, Ming, Zhang, Yong, He, Shwai, Li, Zhitao, Zhao, Hongyu, Wang, Jianzong, Cheng, Ning, and Zhou, Tianyi
Subjects: Computer Science - Computation and Language
Abstract: Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But it also leads to extra cost and computation due to the involvement of LLMs in this process. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find their highly consistent capability to perceive instruction difficulty and data selection results. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even better performance on standard benchmarks. Extensive experiments validate the efficacy and efficiency of our approach., Comment: ACL2024 main, Camera-ready
Published: 2024

18. DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Author: Wang, Jianzong, Li, Pengcheng, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most existing neural-based text-to-speech methods rely on extensive datasets and face challenges under low-resource conditions. In this paper, we introduce a novel semi-supervised text-to-speech synthesis model that learns from both paired and unpaired data to address this challenge. The key component of the proposed model is a dynamic quantized representation module, which is integrated into a sequential autoencoder. When given paired data, the module incorporates a trainable codebook that learns quantized representations under the supervision of the paired data. However, because paired data are limited in low-resource scenarios, it is difficult for them to cover all phonemes. Unpaired data are then fed in to expand the dynamic codebook by adding quantized representation vectors that are sufficiently distant from the existing ones during training. Experiments show that with less than 120 minutes of paired data, the proposed method outperforms existing methods in both subjective and objective metrics., Comment: Accepted by the 13th IEEE International Conference on Big Data and Cloud Computing (IEEE BDCloud 2023)
Published: 2023

19. CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Author: Wang, Jianzong, Deng, Yimin, Liang, Ziqi, Zhang, Xulong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and a person image as reference to synthesize a photo-realistic talking-person video, with head poses controlled by a short video clip and a proper eye blinking embedding. Both head pose and eye blinking are important aspects for deepfake detection. Implicit control of poses by video has already been achieved by state-of-the-art work. According to recent research, eye blinking has a weak correlation with input audio, which means that extracting eye blinks from audio and generating them are possible. Hence, we propose a GAN-based architecture to extract eye blink features from the input audio and the reference video respectively and employ contrastive training between them, then embed them into the concatenated features of identity and poses to generate talking face images. Experimental results show that the proposed method can generate photo-realistic talking faces with synchronous lip motions, natural head poses, and blinking eyes., Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)
Published: 2023

20. CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Author: Deng, Yimin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently, contrastive learning has been applied to voice conversion successfully based on speaker labels. However, model performance degrades in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, considering the fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks., Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)
Published: 2023

21. PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

Author: Yang, Haoyan, Li, Zhitao, Zhang, Yong, Wang, Jianzong, Cheng, Ning, Li, Ming, and Xiao, Jing
Subjects: Computer Science - Computation and Language
Abstract: The Retrieval Question Answering (ReQA) task employs the retrieval-augmented framework, composed of a retriever and generator. The generator formulates the answer based on the documents retrieved by the retriever. Incorporating Large Language Models (LLMs) as generators is beneficial due to their advanced QA capabilities, but they are typically too large to be fine-tuned with budget constraints while some of them are only accessible via APIs. To tackle this issue and further improve ReQA performance, we propose a trainable Pluggable Reward-Driven Contextual Adapter (PRCA), keeping the generator as a black box. Positioned between the retriever and generator in a Pluggable manner, PRCA refines the retrieved information by operating in a token-autoregressive strategy via maximizing rewards of the reinforcement learning phase. Our experiments validate PRCA's effectiveness in enhancing ReQA performance on three datasets by up to 20% improvement to fit black-box LLMs into existing frameworks, demonstrating its considerable potential in the LLMs era., Comment: Accepted by the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. (EMNLP2023)
Published: 2023

22. An In-depth Survey of Large Language Model-based Artificial Intelligence Agents

Author: Zhao, Pengyu, Jin, Zijian, and Cheng, Ning
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Due to the powerful capabilities demonstrated by large language models (LLMs), there has been a recent surge in efforts to integrate them with AI agents to enhance their performance. In this paper, we explore the core differences and characteristics between LLM-based AI agents and traditional AI agents. Specifically, we first compare the fundamental characteristics of these two types of agents, clarifying the significant advantages of LLM-based agents in handling natural language, knowledge storage, and reasoning capabilities. Subsequently, we conduct an in-depth analysis of the key components of AI agents, including planning, memory, and tool use. Particularly, for the crucial component of memory, this paper introduces an innovative classification scheme, not only departing from traditional classification methods but also providing a fresh perspective on the design of an AI agent's memory system. We firmly believe that in-depth research and understanding of these core components will lay a solid foundation for the future advancement of AI agent technology. At the end of the paper, we provide directional suggestions for further research in this field, with the hope of offering valuable insights to scholars and researchers in the field.
Published: 2023

23. Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Author: Luo, Kaiyi, Zhang, Xulong, Wang, Jianzong, Li, Huaxiong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, has posed a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation, neglecting the intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades the performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, the latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments in comparison with some state-of-the-art methods on two audio-text datasets have validated the superiority of CLSR., Comment: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)
Published: 2023
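
Note: the abstract above refers to contrastive learning with a temperature parameter. As a rough illustration only, the sketch below computes a generic symmetric InfoNCE-style loss over an audio-text similarity matrix with a fixed temperature. It is not the CLSR method: the intra-modal separability terms, the adaptive temperature strategy, and the reconstruction modules described in the abstract are not modeled. The function name `info_nce` and the default temperature are assumptions made for this example.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative, not the CLSR loss).

    sim[i][j] is an assumed similarity score between audio clip i and
    text j; matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)

    def direction_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    columns = [list(col) for col in zip(*sim)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (direction_loss(sim) + direction_loss(columns))
```

A lower temperature sharpens the softmax, so well-separated matched pairs are rewarded more strongly and hard negatives are penalized more strongly. This is one reason a fixed value can be a poor fit across samples.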

24. FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

Author: Wang, Jianzong, Zhang, Xulong, Sun, Aolan, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and input to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few samples of target speakers, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between input text and generated audio, higher scores in the subjective prosodic evaluation, and the ability to perform voice conversion. Besides, the efficiency of the model is largely boosted through the design of the AI chip operator with 5x acceleration., Comment: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)
Published: 2023

25. AOSR-Net: All-in-One Sandstorm Removal Network

Author: Si, Yazhong, Zhang, Xulong, Yang, Fan, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure too complex. To solve these issues, we introduce a novel image restoration model, named all-in-one sandstorm removal network (AOSR-Net). This model is developed based on a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating intermediate parameters. This integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms., Comment: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)
Published: 2023

26. DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Author: Qi, Zipeng, Zhang, Xulong, Cheng, Ning, Xiao, Jing, and Wang, Jianzong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges associated with directly applying diffusion models to audio control, which are traditionally trained on text-image pairs. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features., Comment: Submitted to ICASSP 2024
Published: 2023

27. Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice conversion is a method that allows for the transformation of speaking style while maintaining the integrity of linguistic information. Many researchers use deep generative models for voice conversion tasks. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. The samples generated by Denoising Diffusion Probabilistic Models (DDPMs) are better than those of GANs in terms of mode coverage and sample diversity, but DDPMs have high computational costs and slower inference than GANs. In order to make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, to achieve non-parallel many-to-many voice conversion (VC). We use large steps to achieve denoising, and also introduce a multimodal conditional GAN to model the denoising diffusion generative adversarial network. According to both objective and subjective evaluation experiments, DiffGAN-VC has been shown to achieve high voice quality on non-parallel data sets. Compared with the CycleGAN-VC method, DiffGAN-VC achieves higher speaker similarity, naturalness, and sound quality., Comment: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)
Published: 2023

28. Machine Unlearning Methodology base on Stochastic Teacher Network

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, Sun, Yifu, Zhang, Chuanyao, and Xiao, Jing
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: The rise of the phenomenon of the "right to be forgotten" has prompted research on machine unlearning, which grants data owners the right to actively withdraw data that has been used for model training, and requires the elimination of the contribution of that data to the model. A simple method to achieve this is to use the remaining data to retrain the model, but this is not acceptable for other data owners who continue to participate in training. Existing machine unlearning methods have been found to be ineffective in quickly removing knowledge from deep learning models. This paper proposes using a stochastic network as a teacher to expedite the mitigation of the influence caused by forgotten data on the model. We performed experiments on three datasets, and the findings demonstrate that our approach can efficiently mitigate the influence of target data on the model within a single epoch. This allows for one-time erasure and reconstruction of the model, and the reconstruction model achieves the same performance as the retrained model., Comment: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)
Published: 2023

29. Symbolic & Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music

Author: Zhu, Kexin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Music Emotion Recognition (MER) involves the automatic identification of emotional elements within music tracks, and it has garnered significant attention due to its broad applicability in the field of Music Information Retrieval. It can also be used as the upstream task of many other human-related tasks such as emotional music generation and music recommendation. According to existing psychology research, music emotion is determined by multiple factors such as the Timbre, Velocity, and Structure of the music. Incorporating multiple factors in MER helps achieve more interpretable and finer-grained methods. However, most prior works were uni-domain and showed weak consistency between arousal modeling performance and valence modeling performance. Based on this background, we designed a multi-domain emotion modeling method for instrumental music that combines symbolic analysis and acoustic analysis. At the same time, because of the rarity of music data and the difficulty of labeling, our multi-domain approach can make full use of limited data. Our approach was implemented and assessed using the publicly available piano dataset EMOPIA, resulting in a notable improvement over our baseline model with a 2.4% increase in overall accuracy, establishing its state-of-the-art performance., Comment: Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)
Published: 2023

30. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Author: Li, Ming, Zhang, Yong, Li, Zhitao, Chen, Jiuhai, Chen, Lichang, Cheng, Ning, Wang, Jianzong, Zhou, Tianyi, and Xiao, Jing
Subjects: Computer Science - Computation and Language
Abstract: In the realm of Large Language Models (LLMs), the balance between instruction data quality and quantity is a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability. Through the application of IFD, cherry samples can be pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of original data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the instruction tuning of LLMs, promising both efficiency and resource-conscious advancements. Codes, data, and models are available: https://github.com/tianyi-lab/Cherry_LLM, Comment: NAACL, Camera-ready
Published: 2023

31. PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Author: Deng, Yimin, Tang, Huaizhen, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice conversion (VC), the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should match not only the timbre information of the target speaker, but also expressive information such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we first propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied in the disentanglement of prosody and content information. The experimental results on the AIShell-3 corpus support our improvement of naturalness and similarity of converted speech., Comment: Accepted by the 31st ACM International Conference on Multimedia (MM2023)
Published: 2023
Full Text: View/download PDF

32. Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism

Author: Fan, Jiaxin, Zhang, Yong, Li, Hanzhang, Wang, Jianzong, Li, Zhitao, Ouyang, Sheng, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Chinese Automatic Speech Recognition (ASR) error correction presents significant challenges due to the Chinese language's unique features, including a large character set and borderless, morpheme-based structure. Current mainstream models often struggle with effectively utilizing word-level features and phonetic information. This paper introduces a novel approach that incorporates a dynamic error scaling mechanism to detect and correct phonetically erroneous text generated by ASR output. This mechanism operates by dynamically fusing word-level features and phonetic information, thereby enriching the model with additional semantic data. Furthermore, our method implements unique error reduction and amplification strategies to address the issues of matching wrong words caused by incorrect characters. Experimental results indicate substantial improvements in ASR error correction, demonstrating the effectiveness of our proposed method and yielding promising results on established datasets., Comment: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)
Published: 2023

33. Prompt Guided Copy Mechanism for Conversational Question Answering

Author: Zhang, Yong, Li, Zhitao, Wang, Jianzong, Gao, Yiming, Cheng, Ning, Yu, Fengying, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Conversational Question Answering (CQA) is a challenging task that aims to generate natural answers for conversational flow questions. In this paper, we propose a pluggable approach for extractive methods that introduces a novel prompt-guided copy mechanism to improve the fluency and appropriateness of the extracted answers. Our approach uses prompts to link questions to answers and employs attention to guide the copy mechanism to verify the naturalness of extracted answers, making necessary edits to ensure that the answers are fluent and appropriate. The three prompts, including a question-rationale relationship prompt, a question description prompt, and a conversation history prompt, enhance the copy mechanism's performance. Our experiments demonstrate that this approach effectively promotes the generation of natural answers and achieves good results in the CoQA challenge., Comment: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)
Published: 2023

34. CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

Author: Wei, Xiang, Chen, Yufeng, Cheng, Ning, Cui, Xingyu, Xu, Jinan, and Han, Wenjuan
Subjects: Computer Science - Computation and Language
Abstract: In order to construct or extend entity-centric and event-centric knowledge graphs (KG and EKG), the information extraction (IE) annotation toolkit is essential. However, existing IE toolkits have several non-trivial problems, such as not supporting multi-tasks, not supporting automatic updates. In this work, we present CollabKG, a learnable human-machine-cooperative IE toolkit for KG and EKG construction. Specifically, for the multi-task issue, CollabKG unifies different IE subtasks, including named entity recognition (NER), entity-relation triple extraction (RE), and event extraction (EE), and supports both KG and EKG. Then, combining advanced prompting-based IE technology, the human-machine-cooperation mechanism with LLMs as the assistant machine is presented which can provide a lower cost as well as a higher performance. Lastly, owing to the two-way interaction between the human and machine, CollabKG with learning ability allows self-renewal. Besides, CollabKG has several appealing features (e.g., customization, training-free, propagation, etc.) that make the system powerful, easy-to-use, and high-productivity. We holistically compare our toolkit with other existing tools on these features. Human evaluation quantitatively illustrates that CollabKG significantly improves annotation quality, efficiency, and stability simultaneously.
Published: 2023

35. The mechanisms of Porphyromonas gingivalis–derived outer membrane vesicles-induced neurotoxicity and microglia activation

Author: Wei-Chun Chuang, Cheng-Ning Yang, Han-Wei Wang, Sze-Kwan Lin, Ching-Chu Yu, Jhe-Hao Syu, Chun-Pin Chiang, Young-Ji Shiao, and Yi-Wen Chen
Subjects: Porphyromonas gingivalis, Outer membrane vesicles, Neuron, Microglial cell, Gingipain, Lipopolysaccharide, Dentistry, RK1-715
Abstract: Background/purpose: Periodontitis is associated with various systemic diseases, potentially facilitated by the passage of Porphyromonas gingivalis outer membrane vesicles (Pg-OMVs). Several recent studies have suggested a connection between Pg-OMVs and neuroinflammation and neurodegeneration, but the precise causal relationship remains unclear. This study aimed to investigate the mechanisms underlying these associations using in vitro models. Materials and methods: Isolated Pg-OMVs were characterized by morphology, size, and gingipain activity. We exposed SH-SY5Y neuroblastoma cells and BV-2 microglial cells to various concentrations of Pg-OMVs. Cell morphology, a 3-(4,5-Dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide assay, an enzyme-linked immunosorbent assay, and Western blot analysis were used to evaluate the cellular mechanism underlying Pg-OMV-induced neurotoxicity in neuronal cells and inflammatory responses in microglial cells. Results: Exposure to Pg-OMVs induced neurotoxicity in SH-SY5Y cells, as evidenced by cellular shrinkage, reduced viability, activation of apoptotic pathways, and diminished neuronal differentiation markers. Gingipain inhibition mitigated these effects, suggesting that gingipain mediates Pg-OMVs-induced neurotoxicity in SH-SY5Y cells. Our research on neuroinflammation suggests that upon endocytosis of Pg-OMVs by BV-2 cells, lipopolysaccharide (LPS) can modulate the production of inducible nitric oxide synthase and tumor necrosis factor-alpha by activating pathways that involve phosphorylated AKT and the phosphorylated JNK pathway. Conclusion: Our study demonstrated that following the endocytosis of Pg-OMVs, gingipain can induce neurotoxicity in SH-SY5Y cells. Furthermore, the Pg-OMVs-associated LPS can trigger neuroinflammation via AKT and JNK signaling pathways in BV-2 cells.
Published: 2024
Full Text: View/download PDF

36. EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Author: Tang, Haobin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on the synthesis of a limited number of emotion types and have achieved unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embedding. Mixed emotion synthesis is achieved by combining the noises predicted by diffusion model conditioned on different emotions during only one sampling process at the run-time. We further apply the Neutral and specific primary emotion mixed in varying degrees to control intensity. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotion and intensity control., Comment: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)
Published: 2023

37. SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Author: Wang, Jianzong, Zhang, Xulong, Tang, Haobin, Sun, Aolan, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by solely conditioning on acoustic features predicted from an acoustic model. However, there are always distortions existing in the predicted acoustic features, compared to those of the groundtruth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome such limits, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) to replace human-crafted acoustic features by introducing distortion prior to an auto-encoder pre-training process. The learned acoustic representation from the proposed framework is proved anti-distortion compared to the most commonly used mel-spectrogram through both objective and subjective evaluation., Comment: Accepted by IJCNN2023. 2023 International Joint Conference on Neural Networks (IJCNN2023)
Published: 2023

38. Melatonin Enhanced Microglia M2 Polarization in Rat Model of Neuro-inflammation Via Regulating ER Stress/PPARδ/SIRT1 Signaling Axis

Author: Pan, Hung-Chuan, Yang, Cheng-Ning, Lee, Wen-Jane, Sheehan, Jason, Wu, Sheng-Mao, Chen, Hong-Shiu, Lin, Mao-Hsun, Shen, Li-Wei, Lee, Shu-Hua, Shen, Chin-Chang, Pan, Liang-Yi, Liu, Shing‑Hwa, and Sheu, Meei-Ling
Published: 2024
Full Text: View/download PDF

39. On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models

Author: Ye, Tong, Si, Shijing, Wang, Jianzong, Cheng, Ning, Li, Zhitao, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Deep neural retrieval models have amply demonstrated their power but estimating the reliability of their predictions remains challenging. Most dialog response retrieval models output a single score for a response on how relevant it is to a given question. However, the poor calibration of deep neural networks makes this single score uncertain, so that unreliable predictions misinform user decisions. To investigate these issues, we present an efficient calibration and uncertainty estimation framework PG-DRR for dialog response retrieval models which adds a Gaussian Process layer to a deterministic deep neural network and recovers conjugacy for tractable posterior inference by Pólya-Gamma augmentation. Finally, PG-DRR achieves the lowest empirical calibration error (ECE) in the in-domain datasets and the distributional shift task while keeping R10@1 and MAP performance., Comment: Accepted by AAAI 2023
Published: 2023
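
Note: the abstract above reports a calibration error (ECE). As a hedged illustration of how such a metric is commonly computed, the sketch below bins predictions by confidence and sums the weighted gap between accuracy and mean confidence in each bin. The equal-width binning, the bin count, and the function name are assumptions for this example; the paper's exact evaluation protocol is not given in this record.

```python
def calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error (illustrative sketch).

    confidences: predicted probabilities in [0, 1] for the chosen response.
    correct: 1 if the prediction was right, 0 otherwise.
    n_bins: number of equal-width confidence bins (assumed value).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's gap by the fraction of samples it holds.
        error += len(members) / n * abs(accuracy - mean_conf)
    return error
```

A model that reports 0.9 confidence but is right only 75% of the time in that bin contributes a gap of 0.15, while a perfectly calibrated model scores 0.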

40. Efficient Uncertainty Estimation with Gaussian Process for Reliable Dialog Response Retrieval

Author: Ye, Tong, Li, Zhitao, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Deep neural networks have achieved remarkable performance in retrieval-based dialogue systems, but they are shown to be ill-calibrated. Though basic calibration methods like Monte Carlo Dropout and Ensemble can calibrate well, these methods are time-consuming in the training or inference stages. To tackle these challenges, we propose an efficient uncertainty calibration framework GPF-BERT for BERT-based conversational search, which employs a Gaussian Process layer and the focal loss on top of the BERT architecture to achieve a high-quality neural ranker. Extensive experiments are conducted to verify the effectiveness of our method. In comparison with basic calibration methods, GPF-BERT achieves the lowest empirical calibration error (ECE) in three in-domain datasets and the distributional shift tasks, while yielding the highest R10@1 and MAP performance in most cases. In terms of time consumption, our GPF-BERT has an 8× speedup., Comment: Accepted by ICASSP 2023
Published: 2023

41. Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations

Author: Zhu, Kexin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Using deep learning methods to classify EEG signals can accurately identify people's emotions. However, existing studies have rarely considered the application of the information in another domain's representations to feature selection in the time-frequency domain. We propose a classification network of EEG signals based on the cross-domain feature fusion method, which makes the network more focused on the features most related to brain activities and thinking changes by using the multi-domain attention mechanism. In addition, we propose a two-step fusion method and apply these methods to the EEG emotion recognition network. Experimental results show that our proposed network, which combines multiple representations in the time-frequency domain and spatial domain, outperforms previous methods on public datasets and achieves state-of-the-art performance., Comment: Accepted by ICASSP 2023 - The 48th IEEE International Conference on Acoustics, Speech, & Signal Processing
Published: 2023

42. Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

Author: Zhang, Xulong, Tang, Haobin, Wang, Jianzong, Cheng, Ning, Luo, Jian, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Because they predict all the target tokens in parallel, non-autoregressive models greatly improve the decoding efficiency of speech recognition compared with traditional autoregressive models. In this work, we present dynamic alignment Mask CTC, introducing two methods: (1) Aligned Cross Entropy (AXE), finding the monotonic alignment that minimizes the cross-entropy loss through dynamic programming, (2) Dynamic Rectification, creating new training samples by replacing some masks with model predicted tokens. The AXE ignores the absolute position alignment between prediction and ground truth sentence and focuses on tokens matching in relative order. The dynamic rectification method makes the model capable of simulating unmasked but possibly wrong tokens, even if they have high confidence. Our experiments on the WSJ dataset demonstrated that both the AXE loss and the rectification method could improve the WER performance of Mask CTC., Comment: Accepted by ICASSP 2023
Published: 2023

43. QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

Author: Tang, Haobin, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent expressive text to speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles such as intonation are neglected. In this paper, we propose QI-TTS which aims to better transfer and control intonation to further deliver the speaker's questioning intention while transferring emotion from reference speech. We propose a multi-style extractor to extract style embedding from two different levels. While the sentence level represents emotion, the final syllable level represents intonation. For fine-grained intonation control, we use relative attributes to represent intonation intensity at the syllable level. Experiments have validated the effectiveness of QI-TTS for improving intonation expressiveness in emotional speech synthesis., Comment: Accepted by ICASSP 2023
Published: 2023

44. Improving Music Genre Classification from Multi-Modal Properties of Music and Genre Correlations Perspective

Author: Ru, Ganghui, Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound
Abstract: Music genre classification has been widely studied in the past few years for its various applications in music information retrieval. Previous works tend to perform unsatisfactorily, since those methods only use audio content or jointly use audio content and lyrics content inefficiently. In addition, as genres normally co-occur in a music track, it is desirable to capture and model the genre correlations to improve the performance of multi-label music genre classification. To solve these issues, we present a novel multi-modal method leveraging an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based on the nature of the multi-label classification, a genre correlations extraction module is presented to capture and model potential genre correlations. Extensive experiments demonstrate that our proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art results on the Music4All dataset., Comment: Accepted by ICASSP 2023
Published: 2023

45. ChatIE: Zero-Shot Information Extraction via Chatting with ChatGPT

Author: Wei, Xiang, Cui, Xingyu, Cheng, Ning, Wang, Xiaobin, Zhang, Xin, Huang, Shen, Xie, Pengjun, Xu, Jinan, Chen, Yufeng, Zhang, Meishan, Jiang, Yong, and Han, Wenjuan
Subjects: Computer Science - Computation and Language
Abstract: Zero-shot information extraction (IE) aims to build IE systems from unannotated text. It is challenging because it involves little human intervention. Challenging but worthwhile, zero-shot IE reduces the time and effort that data labeling takes. Recent efforts on large language models (LLMs, e.g., GPT-3, ChatGPT) show promising performance in zero-shot settings, thus inspiring us to explore prompt-based methods. In this work, we ask whether strong IE models can be constructed by directly prompting LLMs. Specifically, we transform the zero-shot IE task into a multi-turn question-answering problem with a two-stage framework (ChatIE). With the power of ChatGPT, we extensively evaluate our framework on three IE tasks: entity-relation triple extraction, named entity recognition, and event extraction. Empirical results on six datasets across two languages show that ChatIE achieves impressive performance and even surpasses some full-shot models on several datasets (e.g., NYT11-HRL). We believe that our work could shed light on building IE models with limited resources.
Published: 2023

46. Prevalence and Epidemiological Characteristics of Venous Thromboembolism in Jiaxing City

Author: HONG Ling, LU Liping, CHENG Ning, SUN Qin, JIANG Jianhong, ZHU Liangfeng
Subjects: venous thromboembolism, prevalence rate, risk factors, diagnosis-related groups, Jiaxing, retrospective studies, Medicine
Abstract: Background: Venous thromboembolism (VTE) has become the third most common cardiovascular disease after ischemic heart disease and stroke. Since the official launch of the National Programme for Prevention and Management of Pulmonary Embolism and Deep Vein Thrombosis in 2018, medical personnel have paid more attention to VTE, and the screening rate of VTE has also increased significantly. However, epidemiological investigations related to VTE are mostly limited to single-center studies. There is still a lack of large-scale survey data on the prevalence of VTE involving different levels and properties of medical institutions. Objective: To understand the prevalence and epidemiological characteristics of VTE in Jiaxing, and provide a basis for further prevention and treatment of VTE in this region. Methods: A total of 731 755 discharged patients diagnosed with VTE from all secondary and above medical institutions in Jiaxing City in 2021 were included as the study subjects. General patient data and disease diagnosis-related grouping indicators reflecting the number of patients admitted to the hospital and the difficulty of diagnosis and treatment techniques were collected. Multiple linear regression analysis was used to explore factors influencing the number of VTE patients in hospitals. Results: According to inclusion and exclusion criteria, a total of 10 758 VTE patients were diagnosed. The median age of patients was 71 (62, 79) years. The number of VTE patients was highest in general hospitals, amounting to 9 732 cases (90.46%), and lowest in psychiatric hospitals, at 53 cases (0.49%). The number of VTE patients in tertiary hospitals was higher than that in secondary hospitals [7 929 (73.70%) vs. 2 829 (26.30%)]. The proportion of VTE patients was highest in general hospitals at 1.85%, and lowest in maternal and child health hospitals at 0.10%. The proportion of VTE patients in tertiary hospitals was higher than that in secondary hospitals, at 1.80% and 0.97%, respectively. The difference in VTE prevalence among age groups was statistically significant (χ2=32 383.098, P
Published: 2024

47. Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, Zhao, Mengyuan, Zhang, Zhiyong, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The recent emergence of the joint CTC-Attention model has brought significant improvement in automatic speech recognition (ASR). The improvement largely lies in the modeling of linguistic information by the decoder. The decoder, jointly optimized with an acoustic encoder, renders the language model from ground-truth sequences in an auto-regressive manner during training. However, the training corpus of the decoder is limited to the speech transcriptions, which are far smaller than the corpus needed to train an acceptable language model. This leads to poor robustness of the decoder. To alleviate this problem, we propose a linguistic-enhanced transformer, which introduces refined CTC information to the decoder during training, so that the decoder can be more robust. Our experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is reduced by up to 7% relative. We also find that in the joint CTC-Attention ASR model, the decoder is more sensitive to linguistic information than to acoustic information. Comment: Accepted by ECAISS2022, The Fourth International Workshop on Edge Computing and Artificial Intelligence based Sensor-Cloud System
Published: 2022

48. Improving Imbalanced Text Classification with Dynamic Curriculum Learning

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recent advances in pre-trained language models have improved the performance of text classification tasks. However, little attention is paid to the priority scheduling strategy on the samples during training. Humans acquire knowledge gradually from easy to complex concepts, and the difficulty of the same material can also vary significantly in different learning stages. Inspired by these insights, we propose a novel self-paced dynamic curriculum learning (SPDCL) method for imbalanced text classification, which evaluates the sample difficulty by both linguistic character and model capacity. Meanwhile, rather than using static curriculum learning as in the existing research, our SPDCL can reorder and resample training data by a difficulty criterion with an adaptive easy-to-hard pace. Extensive experiments on several classification tasks show the effectiveness of the SPDCL strategy, especially for imbalanced datasets. Comment: Accepted by UEIoT2022, The 3rd International Workshop on Ubiquitous Electric Internet of Things
Published: 2022

49. MetaSpeech: Speech Effects Switch Along with Environment for Metaverse

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The Metaverse expands the physical world to a new dimension, and the physical environment and the Metaverse environment can be directly connected and entered. Voice is an indispensable communication medium in the real world and the Metaverse. Fusion of the voice with environment effects is important for user immersion in the Metaverse. In this paper, we propose a voice-conversion-based method for converting speech to a target environment effect. The proposed method, named MetaSpeech, introduces an environment effect module containing an effect extractor to extract the environment information and an effect encoder to encode the environment effect condition, in which a gradient reversal layer is used for adversarial training to keep the speech content and speaker information while disentangling the environmental effects. Experimental results on the public LJSpeech dataset with four environment effects show that the proposed model can complete the specific environment effect conversion and outperforms the baseline methods from the voice conversion task. Comment: Accepted by AI2OT2022, The Third International Workshop on Artificial Intelligence Applications in Internet of Things
Published: 2022

50. Semi-Supervised Learning Based on Reference Model for Low-resource TTS

Author: Zhang, Xulong, Wang, Jianzong, Cheng, Ning, and Xiao, Jing
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on a large training dataset and struggle to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in which labeled target data is limited, which can also resolve the problem of exposure bias in the previous auto-regressive models. Specifically, we pre-train the reference model based on Fastspeech2 with abundant source data and fine-tune it on a limited target dataset. Meanwhile, pseudo labels generated by the original reference model are used to further guide the fine-tuned model's training, achieve a regularization effect, and reduce the overfitting of the fine-tuned model during training on the limited target data. Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves the voice quality for test data to achieve naturalness and robustness in speech synthesis. Comment: Accepted by NMIC2022, The Fourth International Workshop on Network Meets Intelligent Computations
Published: 2022
