Author: "Xie, Ruobing" / Topic: computer science - computation and language - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xie, Ruobing"' showing total 45 results

1. Autonomy-of-Experts Models

Author: Lv, Ang, Xie, Ruobing, Qian, Yining, Wu, Songhao, Sun, Xingwu, Kang, Zhanhui, Wang, Di, and Yan, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
Published: 2025
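
Note: the selection rule described in the abstract above (experts pre-compute internal activations, are ranked by activation norm, and only the top-ranking ones proceed) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of a single projection matrix per expert as a stand-in for the low-rank factor, and the toy numbers are all assumptions made for illustration.

```python
import math

def activation_norm(x, w):
    """L2 norm of the activation x @ w.

    x is a list of d floats; w is a d x r matrix given as a list of rows.
    Treating w as the expert's cheap pre-computation factor is an assumption.
    """
    r = len(w[0])
    act = [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(r)]
    return math.sqrt(sum(a * a for a in act))

def select_experts(x, expert_weights, top_k):
    """Rank experts by activation norm and return the top_k indices plus all norms."""
    norms = [activation_norm(x, w) for w in expert_weights]
    ranked = sorted(range(len(norms)), key=lambda e: norms[e], reverse=True)
    return ranked[:top_k], norms

# Toy example (hypothetical values): one token, three experts, rank-1 factors.
token = [1.0, 0.0]
experts = [
    [[0.1], [0.0]],  # expert 0: weak response to this token
    [[2.0], [0.0]],  # expert 1: strong response
    [[0.0], [5.0]],  # expert 2: no response
]
chosen, norms = select_experts(token, experts, top_k=2)
print(chosen)  # indices of the experts that would continue the forward pass
```

In this sketch no router is involved: the ranking comes only from the experts' own activations, and the experts outside the top-k would stop after the cheap pre-computation.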

2. Scaling Laws for Floating Point Quantization Training

Author: Sun, Xingwu, Li, Shuaipeng, Xie, Ruobing, Han, Weidong, Wu, Kan, Yang, Zhen, Li, Yixing, Wang, An, Li, Shuai, Xue, Jinbao, Cheng, Yu, Tao, Yangyu, Kang, Zhanhui, Xu, Chengzhong, Wang, Di, and Jiang, Jie
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture, Computer Science - Computation and Language
Abstract: Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, which pays less attention to the constituents of floating-point quantization and thus cannot fit the LLM losses well in this scenario. In contrast, while floating-point quantization training is more commonly implemented in production, the research on it has been relatively superficial. In this paper, we thoroughly explore the effects of floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the floating-point quantization training performance of LLMs. While presenting an accurate floating-point quantization unified scaling law, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to the model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit numbers, which is available for future reference by hardware manufacturers; (2) We discover the formation of the critical data size in low-precision LLM training. Training data exceeding the critical data size will instead degrade LLM performance; (3) The optimal floating-point quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision lies between 4 and 8 bits.
Published: 2025

3. Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

Author: Fu, Yuhan, Xie, Ruobing, Sun, Xingwu, Kang, Zhanhui, and Li, Xirong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications. Recent works have attempted to apply Direct Preference Optimization (DPO) to enhance the performance of MLLMs, but have shown inconsistent improvements in mitigating hallucinations. To address this issue more effectively, we introduce Hallucination-targeted Direct Preference Optimization (HDPO) to reduce hallucinations in MLLMs. Unlike previous approaches, our method tackles hallucinations from their diverse forms and causes. Specifically, we develop three types of preference pair data targeting the following causes of MLLM hallucinations: (1) insufficient visual capabilities, (2) long context generation, and (3) multimodal conflicts. Experimental results demonstrate that our method achieves superior performance across multiple hallucination evaluation datasets, surpassing most state-of-the-art (SOTA) methods and highlighting the potential of our approach. Ablation studies and in-depth analyses further confirm the effectiveness of our method and suggest the potential for further improvements through scaling up.
Published: 2024

4. More Expressive Attention with Negative Weights

Author: Lv, Ang, Xie, Ruobing, Li, Shuaipeng, Liao, Jiayi, Sun, Xingwu, Kang, Zhanhui, Wang, Di, and Yan, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances parameter flexibility. For example, unlike traditional softmax attention heads that use a static output-value (OV) matrix to delete or copy inputs that the heads attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head. Meanwhile, Cog Attention's OV matrix can focus more on refinement or modification. (2) Cog Attention enhances the model's robustness against representational collapse by preventing the "over-squashing" of earlier tokens into later positions. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models at various scales for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
Published: 2024

5. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Author: Sun, Xingwu, Chen, Yanfeng, Huang, Yiqing, Xie, Ruobing, Zhu, Jiaqi, Zhang, Kai, Li, Shuaipeng, Yang, Zhen, Han, Jonny, Shu, Xiaobo, Bu, Jiahao, Chen, Zhongzhi, Huang, Xuemeng, Lian, Fengzong, Yang, Saiyong, Yan, Jianfeng, Zeng, Yuyuan, Ren, Xiaoqin, Yu, Chao, Wu, Lulu, Mao, Yue, Xia, Jun, Yang, Tao, Zheng, Suncong, Wu, Kan, Jiao, Dian, Xue, Jinbao, Zhang, Xipeng, Wu, Decheng, Liu, Kai, Wu, Dengpeng, Xu, Guanghui, Chen, Shaohua, Chen, Shuang, Feng, Xiao, Hong, Yigeng, Zheng, Junqiang, Xu, Chengcheng, Li, Zongwei, Kuang, Xiong, Hu, Jianglu, Chen, Yiqi, Deng, Yuchi, Li, Guiyang, Liu, Ao, Zhang, Chenchen, Hu, Shihui, Zhao, Zilong, Wu, Zifan, Ding, Yao, Wang, Weichao, Liu, Han, Wang, Roberts, Fei, Hao, Yu, Peijie, Zhao, Ze, Cao, Xun, Wang, Hai, Xiang, Fusheng, Huang, Mengyuan, Xiong, Zhiyuan, Hu, Bin, Hou, Xuebin, Jiang, Lei, Ma, Jianqiang, Wu, Jiajia, Deng, Yaping, Shen, Yi, Wang, Qian, Liu, Weijie, Liu, Jie, Chen, Meng, Dong, Liang, Jia, Weiwen, Chen, Hu, Liu, Feifei, Yuan, Rui, Xu, Huilin, Yan, Zhenxiang, Cao, Tengfei, Hu, Zhichao, Feng, Xinhua, Du, Dong, Yu, Tinghao, Tao, Yangyu, Zhang, Feng, Zhu, Jianchen, Xu, Chengzhong, Li, Xirui, Zha, Chong, Ouyang, Wen, Xia, Yinben, Li, Xiang, He, Zekun, Chen, Rongpeng, Song, Jiawei, Chen, Ruibin, Jiang, Fan, Zhao, Chongqing, Wang, Bo, Gong, Hao, Gan, Rong, Hu, Winston, Kang, Zhanhui, Yang, Yong, Liu, Yuhong, Wang, Di, and Jiang, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Codes: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large, Comment: 17 pages, 4 Figures
Published: 2024

6. Continuous Speech Tokenizer in Text To Speech

Author: Li, Yixing, Xie, Ruobing, Sun, Xingwu, Cheng, Yu, and Kang, Zhanhui
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often utilized in text-to-speech tasks for speech compression and portability; they are convenient for joint training with text and have good compression efficiency. However, we found that the discrete speech tokenizer still suffers from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MoS). This enhancement is attributed to the better information preservation rate of the continuous speech tokenizer across both low and high frequencies in the frequency domain., Comment: 4 pages. Under review
Published: 2024

7. Exploring Forgetting in Large Language Model Pre-Training

Author: Liao, Chonghua, Xie, Ruobing, Sun, Xingwu, Sun, Haowen, and Kang, Zhanhui
Subjects: Computer Science - Computation and Language
Abstract: Catastrophic forgetting remains a formidable obstacle to building an omniscient model in large language models (LLMs). Despite the pioneering research on task-level forgetting in LLM fine-tuning, there is scant focus on forgetting during pre-training. We systematically explored the existence and measurement of forgetting in pre-training, questioning traditional metrics such as perplexity (PPL) and introducing new metrics to better detect entity memory retention. Based on our revised assessment of forgetting metrics, we explored low-cost, straightforward methods to mitigate forgetting during the pre-training phase. Further, we carefully analyzed the learning curves, offering insights into the dynamics of forgetting. Extensive evaluations and analyses on forgetting of pre-training could facilitate future research on LLMs.
Published: 2024

8. Lossless KV Cache Compression to 2%

Author: Yang, Zhen, Han, J. N., Wu, Kan, Xie, Ruobing, Wang, An, Sun, Xingwu, and Kang, Zhanhui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is essential. Nonetheless, the growing demands for KV cache memory create significant hurdles for efficient implementation. This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels. CLLA integrates multiple aspects of KV cache compression, including attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework. Our extensive experiments demonstrate that CLLA achieves lossless performance on most tasks while utilizing minimal KV cache, marking a significant advancement in practical KV cache compression.
Published: 2024

9. Magnifier Prompt: Tackling Multimodal Hallucination via Extremely Simple Instructions

Author: Fu, Yuhan, Xie, Ruobing, Liu, Jiazhen, Lan, Bangxiang, Sun, Xingwu, Kang, Zhanhui, and Li, Xirong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Hallucinations in multimodal large language models (MLLMs) hinder their practical applications. To address this, we propose a Magnifier Prompt (MagPrompt), a simple yet effective method to tackle hallucinations in MLLMs via extremely simple instructions. MagPrompt is based on the following two key principles, which guide the design of various effective prompts, demonstrating robustness: (1) MLLMs should focus more on the image. (2) When there are conflicts between the image and the model's inner knowledge, MLLMs should prioritize the image. MagPrompt is training-free and can be applied to open-source and closed-source models, such as GPT-4o and Gemini-pro. It performs well across many datasets and its effectiveness is comparable or even better than more complex methods like VCD. Furthermore, our prompt design principles and experimental analyses provide valuable insights into multimodal hallucination., Comment: The proposed method does not work for up-to-date MLLMs.
Published: 2024

10. Exploring the Benefit of Activation Sparsity in Pre-training

Author: Zhang, Zhengyan, Xiao, Chaojun, Qin, Qiujieli, Lin, Yankai, Zeng, Zhiyuan, Han, Xu, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between the Mixtures-of-Experts (MoE) based sparse training and the conventional dense training during the pre-training process, leveraging the efficiency of sparse training and avoiding the static activation correlation of sparse training. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, the models trained with SSD can be directly used as MoE models for sparse inference and achieve the same performance as dense models with up to 2× faster inference speed. Codes are available at https://github.com/thunlp/moefication., Comment: ICML 2024
Published: 2024

11. Language Models 'Grok' to Copy

Author: Lv, Ang, Xie, Ruobing, Sun, Xingwu, Kang, Zhanhui, and Yan, Rui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context--a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on test set long after the model fit to the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, similarly to how grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrated that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying., Comment: NAACL 2025 main conference, short paper
Published: 2024

12. HMoE: Heterogeneous Mixture of Experts for Language Modeling

Author: Wang, An, Sun, Xingwu, Xie, Ruobing, Li, Shuaipeng, Zhu, Jiaqi, Yang, Zhen, Zhao, Pinxue, Han, J. N., Kang, Zhanhui, Wang, Di, Okazaki, Naoaki, and Xu, Cheng-zhong
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Mixture of Experts (MoE) offers remarkable performance and computational efficiency by selectively activating subsets of model parameters. Traditionally, MoE models use homogeneous experts, each with identical capacity. However, varying complexity in input data necessitates experts with diverse capabilities, while homogeneous MoE hinders effective expert specialization and efficient parameter utilization. In this study, we propose a novel Heterogeneous Mixture of Experts (HMoE), where experts differ in size and thus possess diverse capacities. This heterogeneity allows for more specialized experts to handle varying token complexities more effectively. To address the imbalance in expert activation, we propose a novel training objective that encourages the frequent activation of smaller experts, enhancing computational efficiency and parameter utilization. Extensive experiments demonstrate that HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks. Codes will be released upon acceptance.
Published: 2024

13. Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Author: Chen, Weize, You, Ziming, Li, Ran, Guan, Yitong, Qian, Chen, Zhao, Chenyang, Yang, Cheng, Xie, Ruobing, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. They also face challenges in simulating distributed environments, as most frameworks are limited to single-device setups. Furthermore, these frameworks often rely on hard-coded communication pipelines, limiting their adaptability to dynamic task requirements. Inspired by the concept of the Internet, we propose the Internet of Agents (IoA), a novel framework that addresses these limitations by providing a flexible and scalable platform for LLM-based multi-agent collaboration. IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control. Through extensive experiments on general assistant tasks, embodied AI tasks, and retrieval-augmented generation benchmarks, we demonstrate that IoA consistently outperforms state-of-the-art baselines, showcasing its ability to facilitate effective collaboration among heterogeneous agents. IoA represents a step towards linking diverse agents in an Internet-like environment, where agents can seamlessly collaborate to achieve greater intelligence and capabilities. Our codebase has been released at https://github.com/OpenBMB/IoA., Comment: work in progress
Published: 2024

14. Advancing LLM Reasoning Generalists with Preference Trees

Author: Yuan, Lifan, Cui, Ganqu, Wang, Hanbin, Ding, Ning, Wang, Xingyao, Deng, Jia, Shan, Boji, Chen, Huimin, Xie, Ruobing, Lin, Yankai, Liu, Zhenghao, Zhou, Bowen, Peng, Hao, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model., Comment: Models and data are available at https://github.com/OpenBMB/Eurus
Published: 2024

15. Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models

Author: Ding, Ning, Chen, Yulin, Cui, Ganqu, Lv, Xingtai, Zhao, Weilin, Xie, Ruobing, Zhou, Bowen, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Underlying data distributions of natural language, programming code, and mathematical symbols vary vastly, presenting a complex challenge for large language models (LLMs) that strive to achieve high performance across all three domains simultaneously. Achieving a very high level of proficiency for an LLM within a specific domain often requires extensive training with relevant corpora, which is typically accompanied by a sacrifice in performance in other domains. In this paper, we propose to fuse models that are already highly-specialized directly. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics. A token-level gating mechanism is introduced to blend the specialists' outputs. A two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model could simultaneously achieve mastery of the three crucial domains.
Published: 2024
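
Note: the token-level gating mentioned in the abstract above (blending the outputs of three specialists) can be illustrated with a minimal sketch. This is not the UltraFuser implementation. The softmax gate, the function names, and the toy logits are assumptions chosen only to show how per-token gate weights could mix specialist outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blend_token_logits(specialist_logits, gate_scores):
    """Blend per-token logits from several specialists.

    specialist_logits: one list of vocabulary logits per specialist.
    gate_scores: one unnormalized score per specialist for this token
    (how these scores are produced is left out; that part is hypothetical).
    """
    weights = softmax(gate_scores)
    vocab = len(specialist_logits[0])
    return [
        sum(w * logits[v] for w, logits in zip(weights, specialist_logits))
        for v in range(vocab)
    ]

# Toy example (hypothetical values): text, code, and math specialists, vocabulary of 2.
logits = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
blended = blend_token_logits(logits, gate_scores=[0.0, 0.0, 0.0])
print(blended)  # equal gate scores give the plain average of the three specialists
```

Because the gate is evaluated per token, a different specialist can dominate at each position in the same sequence.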

16. Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

Author: Guo, Yiju, Cui, Ganqu, Yuan, Lifan, Ding, Ning, Sun, Zexu, Sun, Bowen, Chen, Huimin, Xie, Ruobing, Zhou, Jie, Lin, Yankai, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Systems and Control
Abstract: Alignment in artificial intelligence pursues the consistency between model responses and human preferences as well as values. In practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax": a compromise where enhancements in alignment within one objective (e.g., harmlessness) can diminish performance in others (e.g., helpfulness). However, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. To navigate this challenge, we argue the prominence of grounding LLMs with evident preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. Our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving improvements in multi-objective alignment., Comment: EMNLP 2024 main conference
Published: 2024

17. Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication

Author: Chen, Weize, Yuan, Chenfei, Yuan, Jiarui, Su, Yusheng, Qian, Chen, Yang, Cheng, Xie, Ruobing, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Natural language (NL) has long been the predominant format for human cognition and communication, and by extension, has been similarly pivotal in the development and application of Large Language Models (LLMs). Yet, besides NL, LLMs have seen various non-NL formats during pre-training, such as code and logical expression. NL's status as the optimal format for LLMs, particularly in single-LLM reasoning and multi-agent communication, has not been thoroughly examined. In this work, we challenge the default use of NL by exploring the utility of non-NL formats in these contexts. We show that allowing LLMs to autonomously select the most suitable format before reasoning or communicating leads to a 3.3 to 5.7% improvement in reasoning efficiency for different LLMs, and up to a 72.7% reduction in token usage in multi-agent communication, all while maintaining communicative effectiveness. Our comprehensive analysis further reveals that LLMs can devise a format from limited task instructions and that the devised format is effectively transferable across different LLMs. Intriguingly, the structured communication format decided by LLMs exhibits notable parallels with established agent communication languages, suggesting a natural evolution towards efficient, structured communication in agent communication. Our code is released at https://github.com/thunlp/AutoForm., Comment: Code release at https://github.com/thunlp/AutoForm
Published: 2024

18. MAVEN-Arg: Completing the Puzzle of All-in-One Event Understanding Dataset with Event Argument Annotation

Author: Wang, Xiaozhi, Peng, Hao, Guan, Yong, Zeng, Kaisheng, Chen, Jianhui, Hou, Lei, Han, Xu, Lin, Yankai, Liu, Zhiyuan, Xie, Ruobing, Zhou, Jie, and Li, Juanzi
Subjects: Computer Science - Computation and Language
Abstract: Understanding events in texts is a core objective of natural language understanding, which requires detecting event occurrences, extracting event arguments, and analyzing inter-event relationships. However, due to the annotation challenges brought by task complexity, a large-scale dataset covering the full process of event understanding has long been absent. In this paper, we introduce MAVEN-Arg, which augments MAVEN datasets with event argument annotations, making the first all-in-one dataset supporting event detection, event argument extraction (EAE), and event relation extraction. As an EAE benchmark, MAVEN-Arg offers three main advantages: (1) a comprehensive schema covering 162 event types and 612 argument roles, all with expert-written definitions and examples; (2) a large data scale, containing 98,591 events and 290,613 arguments obtained with laborious human annotation; (3) the exhaustive annotation supporting all task variants of EAE, which annotates both entity and non-entity event arguments in document level. Experiments indicate that MAVEN-Arg is quite challenging for both fine-tuned EAE models and proprietary large language models (LLMs). Furthermore, to demonstrate the benefits of an all-in-one dataset, we preliminarily explore a potential application, future event prediction, with LLMs. MAVEN-Arg and codes can be obtained from https://github.com/THU-KEG/MAVEN-Argument., Comment: Accepted at ACL 2024. Camera-ready version
Published: 2023

19. Variator: Accelerating Pre-trained Models with Plug-and-Play Compression Modules

Author: Xiao, Chaojun, Luo, Yuqi, Zhang, Wenbin, Zhang, Pengle, Han, Xu, Lin, Yankai, Zhang, Zhengyan, Xie, Ruobing, Liu, Zhiyuan, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language
Abstract: Pre-trained language models (PLMs) have achieved remarkable results on NLP tasks but at the expense of huge parameter sizes and the consequent computational costs. In this paper, we propose Variator, a parameter-efficient acceleration method that enhances computational efficiency through plug-and-play compression plugins. Compression plugins are designed to reduce the sequence length via compressing multiple hidden vectors into one and trained with original PLMs frozen. Different from traditional model acceleration methods, which compress PLMs to smaller sizes, Variator offers two distinct advantages: (1) In real-world applications, the plug-and-play nature of our compression plugins enables dynamic selection of different compression plugins with varying acceleration ratios based on the current workload. (2) The compression plugin comprises a few compact neural network layers with minimal parameters, significantly saving storage and memory overhead, particularly in scenarios with a growing number of tasks. We validate the effectiveness of Variator on seven datasets. Experimental results show that Variator can save 53% computational costs using only 0.9% additional parameters with a performance drop of less than 2%. Moreover, when the model scales to billions of parameters, Variator matches the strong performance of uncompressed PLMs., Comment: Accepted by Findings of EMNLP
Published: 2023

20. Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models

Author: Chen, Weize, Xu, Xiaoyue, Han, Xu, Lin, Yankai, Xie, Ruobing, Liu, Zhiyuan, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Parameter-shared pre-trained language models (PLMs) have emerged as a successful approach in resource-constrained environments, enabling substantial reductions in model storage and memory costs without significant performance compromise. However, it is important to note that parameter sharing does not alleviate computational burdens associated with inference, thus impeding its practicality in situations characterized by stringent latency requirements or limited computational resources. Building upon neural ordinary differential equations (ODEs), we introduce a straightforward technique to enhance the inference efficiency of parameter-shared PLMs. Additionally, we propose a simple pre-training technique that leads to fully or partially shared models capable of achieving even greater inference acceleration. The experimental results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs, providing novel insights into more efficient utilization of parameter-shared models in resource-constrained settings., Comment: EMNLP 2023 Findings
Published: 2023

21. AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems

Author: Zhang, Junjie, Hou, Yupeng, Xie, Ruobing, Sun, Wenqi, McAuley, Julian, Zhao, Wayne Xin, Lin, Leyu, and Wen, Ji-Rong
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: Recently, there has been an emergence of employing LLM-powered agents as believable human proxies, based on their remarkable decision-making capability. However, existing studies mainly focus on simulating human dialogue. Human non-verbal behaviors, such as item clicking in recommender systems, implicitly exhibit user preferences and could enhance the modeling of users, yet they have not been deeply explored. The main reasons lie in the gap between language modeling and behavior modeling, as well as the incomprehension of LLMs about user-item relations. To address this issue, we propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering. We creatively consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together. Specifically, at each time step, we first prompt the user and item agents to interact autonomously. Then, based on the disparities between the agents' decisions and real-world interaction records, user and item agents are prompted to reflect on and adjust the misleading simulations collaboratively, thereby modeling their two-sided relations. The optimized agents can also propagate their preferences to other agents in subsequent interactions, implicitly capturing the collaborative filtering idea. Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions. The results show that these agents can demonstrate personalized behaviors akin to those of real-world individuals, sparking the development of next-generation user behavior simulation.
Published: 2023

22. UltraFeedback: Boosting Language Models with Scaled AI Feedback

Author: Cui, Ganqu, Yuan, Lifan, Ding, Ning, Yao, Guanming, He, Bingxiang, Zhu, Wei, Ni, Yuan, Xie, Guotong, Xie, Ruobing, Lin, Yankai, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Learning from human feedback has become a pivot technique in aligning large language models (LLMs) with human preferences. However, acquiring vast and premium human feedback is bottlenecked by time, labor, and human capability, resulting in small sizes or limited topics of current datasets. This further hinders feedback learning as well as alignment research within the open-source community. To address this issue, we explore how to go beyond human feedback and collect high-quality AI feedback automatically for a scalable alternative. Specifically, we identify scale and diversity as the key factors for feedback data to take effect. Accordingly, we first broaden instructions and responses in both amount and breadth to encompass a wider range of user-assistant interactions. Then, we meticulously apply a series of techniques to mitigate annotation biases for more reliable AI feedback. We finally present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset, which contains over 1 million GPT-4 feedback for 250k user-assistant conversations from various aspects. Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks. Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models, serving as a solid foundation for future feedback learning research. Our data and models are available at https://github.com/thunlp/UltraFeedback., Comment: ICML 2024 camera ready
Published: 2023

23. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

Author: Chen, Weize, Su, Yusheng, Zuo, Jingwei, Yang, Cheng, Yuan, Chenfei, Chan, Chi-Min, Yu, Heyang, Lu, Yaxi, Hung, Yi-Hsin, Qian, Chen, Qin, Yujia, Cong, Xin, Xie, Ruobing, Liu, Zhiyuan, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language
Abstract: Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework AgentVerse that can collaboratively and dynamically adjust its composition as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that the AgentVerse framework can effectively deploy multi-agent groups that outperform a single agent. Furthermore, we delve into the emergence of social behaviors among individual agents within a group during collaborative task accomplishment. In view of these behaviors, we discuss some possible strategies to leverage positive ones and mitigate negative ones for improving the collaborative potential of multi-agent groups. Our codes for AgentVerse will soon be released at https://github.com/OpenBMB/AgentVerse., Comment: Under review. Code at https://github.com/OpenBMB/AgentVerse/
Published: 2023

24. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Author: Qin, Yujia, Liang, Shihao, Ye, Yining, Zhu, Kunlun, Yan, Lan, Lu, Yaxi, Lin, Yankai, Cong, Xin, Tang, Xiangru, Qian, Bill, Zhao, Sihan, Hong, Lauren, Tian, Runchu, Xie, Ruobing, Zhou, Jie, Gerstein, Mark, Li, Dahai, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.
Published: 2023

25. Emergent Modularity in Pre-trained Transformers

Author: Zhang, Zhengyan, Zeng, Zhiyuan, Lin, Yankai, Xiao, Chaojun, Wang, Xiaozhi, Han, Xu, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This work examines the presence of modularity in pre-trained Transformers, a feature commonly found in human brains and thought to be vital for general intelligence. In analogy to human brains, we consider two main characteristics of modularity: (1) functional specialization of neurons: we evaluate whether each neuron is mainly specialized in a certain function, and find that the answer is yes. (2) function-based neuron grouping: we explore finding a structure that groups neurons into modules by function, and each module works for its corresponding function. Given the enormous amount of possible structures, we focus on Mixture-of-Experts as a promising candidate, which partitions neurons into experts and usually activates different experts for different inputs. Experimental results show that there are functional experts, where clustered are the neurons specialized in a certain function. Moreover, perturbing the activations of functional experts significantly affects the corresponding function. Finally, we study how modularity emerges during pre-training, and find that the modular structure is stabilized at the early stage, which is faster than neuron stabilization. It suggests that Transformers first construct the modular structure and then learn fine-grained neuron functions. Our code and data are available at https://github.com/THUNLP/modularity-analysis., Comment: Findings of ACL 2023
Published: 2023

26. Large Language Models are Zero-Shot Rankers for Recommender Systems

Author: Hou, Yupeng, Zhang, Junjie, Lin, Zihan, Lu, Hongyu, Xie, Ruobing, McAuley, Julian, and Zhao, Wayne Xin
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: Recently, large language models (LLMs) (e.g., GPT-4) have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work aims to investigate the capacity of LLMs that act as the ranking model for recommender systems. We first formalize the recommendation problem as a conditional ranking task, considering sequential interaction histories as conditions and the items retrieved by other candidate generation models as candidates. To solve the ranking task by LLMs, we carefully design the prompting template and conduct extensive experiments on two widely-used datasets. We show that LLMs have promising zero-shot ranking abilities but (1) struggle to perceive the order of historical interactions, and (2) can be biased by popularity or item positions in the prompts. We demonstrate that these issues can be alleviated using specially designed prompting and bootstrapping strategies. Equipped with these insights, zero-shot LLMs can even challenge conventional recommendation models when ranking candidates are retrieved by multiple candidate generators. The code and processed datasets are available at https://github.com/RUCAIBox/LLMRank., Comment: Accepted by ECIR 2024
Published: 2023

27. Recyclable Tuning for Continual Pre-training

Author: Qin, Yujia, Qian, Cheng, Han, Xu, Lin, Yankai, Wang, Huadong, Xie, Ruobing, Liu, Zhiyuan, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Continual pre-training is the paradigm where pre-trained language models (PLMs) continually acquire fresh knowledge from growing data and gradually get upgraded. Before an upgraded PLM is released, we may have tuned the original PLM for various tasks and stored the adapted weights. However, when tuning the upgraded PLM, these outdated adapted weights will typically be ignored and discarded, causing a potential waste of resources. We bring this issue to the forefront and contend that proper algorithms for recycling outdated adapted weights should be developed. To this end, we formulate the task of recyclable tuning for continual pre-training. In pilot studies, we find that after continual pre-training, the upgraded PLM remains compatible with the outdated adapted weights to some extent. Motivated by this finding, we analyze the connection between continually pre-trained PLMs from two novel aspects, i.e., mode connectivity, and functional similarity. Based on the corresponding findings, we propose both an initialization-based method and a distillation-based method for our task. We demonstrate their feasibility in improving the convergence and performance for tuning the upgraded PLM. We also show that both methods can be combined to achieve better performance. The source codes are publicly available at https://github.com/thunlp/RecyclableTuning., Comment: Findings of ACL 2023
Published: 2023

28. Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach

Author: Zhang, Junjie, Xie, Ruobing, Hou, Yupeng, Zhao, Wayne Xin, Lin, Leyu, and Wen, Ji-Rong
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: In the past decades, recommender systems have attracted much attention in both research and industry communities, and a large number of studies have been devoted to developing effective recommendation models. Basically speaking, these models mainly learn the underlying user preference from historical behavior data, and then estimate the user-item matching relationships for recommendations. Inspired by the recent progress on large language models (LLMs), we take a different approach to developing the recommendation models, considering recommendation as instruction following by LLMs. The key idea is that the preferences or needs of a user can be expressed in natural language descriptions (called instructions), so that LLMs can understand and further execute the instruction for fulfilling the recommendation task. Instead of using public APIs of LLMs, we instruction tune an open-source LLM (3B Flan-T5-XL), in order to better adapt LLMs to recommender systems. For this purpose, we first design a general instruction format for describing the preference, intention, task form and context of a user in natural language. Then we manually design 39 instruction templates and automatically generate a large amount of user-personalized instruction data (252K instructions) with varying types of preferences and intentions. To demonstrate the effectiveness of our approach, we instantiate the instruction templates into several widely-studied recommendation (or search) tasks, and conduct extensive experiments on these tasks with real-world datasets. Experiment results show that the proposed approach can outperform several competitive baselines, including the powerful GPT-3.5, on these evaluation tasks. Our approach sheds light on developing more user-friendly recommender systems, in which users can freely communicate with the system and obtain more accurate recommendations via natural language instructions.
Published: 2023

29. WebCPM: Interactive Web Search for Chinese Long-form Question Answering

Author: Qin, Yujia, Cai, Zihan, Jin, Dian, Yan, Lan, Liang, Shihao, Zhu, Kunlun, Lin, Yankai, Han, Xu, Ding, Ning, Wang, Huadong, Xie, Ruobing, Qi, Fanchao, Liu, Zhiyuan, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: Long-form question answering (LFQA) aims at answering complex, open-ended questions with detailed, paragraph-length responses. The de facto paradigm of LFQA necessitates two procedures: information retrieval, which searches for relevant supporting facts, and information synthesis, which integrates these facts into a coherent answer. In this paper, we introduce WebCPM, the first Chinese LFQA dataset. One unique feature of WebCPM is that its information retrieval is based on interactive web search, which engages with a search engine in real time. Following WebGPT, we develop a web search interface. We recruit annotators to search for relevant information using our interface and then answer questions. Meanwhile, the web search behaviors of our annotators would be recorded. In total, we collect 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions. We fine-tune pre-trained language models to imitate human behaviors for web search and to generate answers based on the collected facts. Our LFQA pipeline, built on these fine-tuned models, generates answers that are no worse than human-written ones in 32.5% and 47.5% of the cases on our dataset and DuReader, respectively., Comment: ACL 2023, main conference
Published: 2023

30. Visually Grounded Commonsense Knowledge Acquisition

Author: Yao, Yuan, Yu, Tianyu, Zhang, Ao, Li, Mengdi, Xie, Ruobing, Weber, Cornelius, Liu, Zhiyuan, Zheng, Hai-Tao, Wermter, Stefan, Chua, Tat-Seng, and Sun, Maosong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large-scale commonsense knowledge bases empower a broad range of AI applications, where the automatic extraction of commonsense knowledge (CKE) is a fundamental and challenging problem. CKE from text is known for suffering from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), which can serve as promising sources for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge in promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show strong correlation with human judgment with a 0.78 Spearman coefficient. Moreover, the extracted commonsense can also be grounded into images with reasonable interpretability. The data and codes can be obtained at https://github.com/thunlp/CLEVER., Comment: Accepted by AAAI 2023
Published: 2022

31. Pruning Pre-trained Language Models Without Fine-Tuning

Author: Jiang, Ting, Wang, Deqing, Zhuang, Fuzhen, Xie, Ruobing, and Xia, Feng
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: To overcome the overparameterization problem in Pre-trained Language Models (PLMs), pruning is widely used as a simple and straightforward compression method by directly removing unimportant weights. Previous first-order methods successfully compress PLMs to extremely high sparsity with little performance drop. These methods, such as movement pruning, use first-order information to prune PLMs while fine-tuning the remaining weights. In this work, we argue fine-tuning is redundant for first-order pruning, since first-order pruning is sufficient to converge PLMs to downstream tasks without fine-tuning. Under this motivation, we propose Static Model Pruning (SMP), which only uses first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level. In addition, we also design a new masking function and training objective to further improve SMP. Extensive experiments at various sparsity levels show SMP has significant improvements over first-order and zero-order methods. Unlike previous first-order methods, SMP is also applicable to low sparsity and outperforms zero-order methods. Meanwhile, SMP is more parameter efficient than other methods because it does not require fine-tuning., Comment: Accepted to ACL 2023; Code and models are available at https://github.com/kongds/SMP
Published: 2022

32. Better Pre-Training by Reducing Representation Confusion

Author: Zhang, Haojie, Liang, Mingfei, Xie, Ruobing, Sun, Zhenlong, Zhang, Bo, and Lin, Leyu
Subjects: Computer Science - Computation and Language
Abstract: In this work, we revisit the Transformer-based pre-trained language models and identify two different types of information confusion in position encoding and model representations, respectively. Firstly, we show that in the relative position encoding, the joint modeling about relative distances and directions brings confusion between two heterogeneous information. It may make the model unable to capture the associative semantics of the same distance and the opposite directions, which in turn affects the performance of downstream tasks. Secondly, we notice the BERT with Mask Language Modeling (MLM) pre-training objective outputs similar token representations (last hidden states of different tokens) and head representations (attention weights of different heads), which may make the diversity of information expressed by different tokens and heads limited. Motivated by the above investigation, we propose two novel techniques to improve pre-trained language models: Decoupled Directional Relative Position (DDRP) encoding and MTH pre-training objective. DDRP decouples the relative distance features and the directional features in classical relative position encoding. MTH applies two novel auxiliary regularizers besides MLM to enlarge the dissimilarities between (a) last hidden states of different tokens, and (b) attention weights of different heads. These designs allow the model to capture different categories of information more clearly, as a way to alleviate information confusion in representation learning for better optimization. Extensive experiments and ablation studies on GLUE benchmark demonstrate the effectiveness of our proposed methods., Comment: EACL 2023(Findings)
Published: 2022

33. Prompt Tuning for Discriminative Pre-trained Language Models

Author: Yao, Yuan, Dong, Bowen, Zhang, Ao, Zhang, Zhengyan, Xie, Ruobing, Liu, Zhiyuan, Lin, Leyu, Sun, Maosong, and Wang, Jianyong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recent works have shown promising results of prompt tuning in stimulating pre-trained language models (PLMs) for natural language processing (NLP) tasks. However, to the best of our knowledge, existing works focus on prompt-tuning generative PLMs that are pre-trained to generate target tokens, such as BERT. It is still unknown whether and how discriminative PLMs, e.g., ELECTRA, can be effectively prompt-tuned. In this work, we present DPT, the first prompt tuning framework for discriminative PLMs, which reformulates NLP tasks into a discriminative language modeling problem. Comprehensive experiments on text classification and question answering show that, compared with vanilla fine-tuning, DPT achieves significantly higher performance, and also prevents the unstable problem in tuning large PLMs in both full-set and low-resource settings. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/DPT., Comment: Accepted by Findings of ACL 2022
Published: 2022

34. Denoising Relation Extraction from Document-level Distant Supervision

Author: Xiao, Chaojun, Yao, Yuan, Xie, Ruobing, Han, Xu, Liu, Zhiyuan, Sun, Maosong, Lin, Fen, and Lin, Leyu
Subjects: Computer Science - Computation and Language
Abstract: Distant supervision (DS) has been widely used to generate auto-labeled data for sentence-level relation extraction (RE), which improves RE performance. However, the existing success of DS cannot be directly transferred to the more challenging document-level relation extraction (DocRE), since the inherent noise in DS may be even multiplied in document level and significantly harm the performance of RE. To address this challenge, we propose a novel pre-trained model for DocRE, which denoises the document-level DS data via multiple pre-training tasks. Experimental results on the large-scale DocRE benchmark show that our model can capture useful information from noisy DS data and achieve promising results., Comment: EMNLP 2020 short paper
Published: 2020

35. Knowledge Transfer via Pre-training for Recommendation: A Review and Prospect

Author: Zeng, Zheni, Xiao, Chaojun, Yao, Yuan, Xie, Ruobing, Liu, Zhiyuan, Lin, Fen, Lin, Leyu, and Sun, Maosong
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: Recommender systems aim to provide item recommendations for users, and are usually faced with data sparsity problem (e.g., cold start) in real-world scenarios. Recently pre-trained models have shown their effectiveness in knowledge transfer between domains and tasks, which can potentially alleviate the data sparsity problem in recommender systems. In this survey, we first provide a review of recommender systems with pre-training. In addition, we show the benefits of pre-training to recommender systems through experiments. Finally, we discuss several promising directions for future research for recommender systems with pre-training., Comment: This paper is submitted to Frontiers in Big Data and is under review
Published: 2020

36. Connecting Embeddings for Knowledge Graph Entity Typing

Author: Zhao, Yu, Zhang, Anxiang, Xie, Ruobing, Liu, Kang, and Wang, Xiaojie
Subjects: Computer Science - Computation and Language, Computer Science - Formal Languages and Automata Theory
Abstract: Knowledge graph (KG) entity typing aims at inferring possible missing entity type instances in KG, which is a very significant but still under-explored subtask of knowledge graph completion. In this paper, we propose a novel approach for KG entity typing which is trained by jointly utilizing local typing knowledge from existing entity type assertions and global triple knowledge from KGs. Specifically, we present two distinct knowledge-driven effective mechanisms of entity type inference. Accordingly, we build two novel embedding models to realize the mechanisms. Afterward, a joint model with them is used to infer missing entity type instances, which favors inferences that agree with both entity type instances and triple knowledge in KGs. Experimental results on two real-world datasets (Freebase and YAGO) demonstrate the effectiveness of our proposed mechanisms and models for improving KG entity typing. The source code and data of this paper can be obtained from: https://github.com/Adam1679/ConnectE
Published: 2020

37. FAQ-based Question Answering via Knowledge Anchors

Author: Xie, Ruobing, Lu, Yanan, Lin, Fen, and Lin, Leyu
Subjects: Computer Science - Computation and Language
Abstract: Question answering (QA) aims to understand questions and find appropriate answers. In real-world QA systems, Frequently Asked Question (FAQ) based QA is usually a practical and effective solution, especially for some complicated questions (e.g., How and Why). Recent years have witnessed the great successes of knowledge graphs (KGs) in KBQA systems, while there are still few works focusing on making full use of KGs in FAQ-based QA. In this paper, we propose a novel Knowledge Anchor based Question Answering (KAQA) framework for FAQ-based QA to better understand questions and retrieve more appropriate answers. More specifically, KAQA mainly consists of three modules: knowledge graph construction, query anchoring and query-document matching. We consider entities and triples of KGs in texts as knowledge anchors to precisely capture the core semantics, which brings in higher precision and better interpretability. The multi-channel matching strategy also enables most sentence matching models to be flexibly plugged in our KAQA framework to fit different real-world computation limitations. In experiments, we evaluate our models on both offline and online query-document matching tasks on a real-world FAQ-based QA system in WeChat Search, with detailed analysis, ablation tests and case studies. The significant improvements confirm the effectiveness and robustness of the KAQA framework in real-world FAQ-based QA., Comment: 12 pages, accepted by NLPCC-2020
Published: 2019

38. Neural Snowball for Few-Shot Relation Learning

Author: Gao, Tianyu, Han, Xu, Xie, Ruobing, Liu, Zhiyuan, Lin, Fen, Lin, Leyu, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Knowledge graphs typically undergo open-ended growth of new relations. This cannot be well handled by relation extraction that focuses on pre-defined relations with sufficient training data. To address new relations with few-shot instances, we propose a novel bootstrapping approach, Neural Snowball, to learn new relations by transferring semantic knowledge about existing relations. More specifically, we use Relational Siamese Networks (RSN) to learn the metric of relational similarities between instances based on existing relations and their labeled data. Afterwards, given a new relation and its few-shot instances, we use RSN to accumulate reliable instances from unlabeled corpora; these instances are used to train a relation classifier, which can further identify new facts of the new relation. The process is conducted iteratively like a snowball. Experiments show that our model can gather high-quality instances for better few-shot relation learning and achieves significant improvement compared to baselines. Codes and datasets are released on https://github.com/thunlp/Neural-Snowball., Comment: Accepted by AAAI2020
Published: 2019

39. Knowledge Representation Learning: A Quantitative Review

Author: Lin, Yankai, Han, Xu, Xie, Ruobing, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Knowledge representation learning (KRL) aims to represent entities and relations in knowledge graphs in a low-dimensional semantic space, and has been widely used in massive knowledge-driven tasks. In this article, we introduce the reader to the motivations for KRL, and overview existing approaches for KRL. Afterwards, we conduct an extensive quantitative comparison and analysis of several typical KRL methods on three evaluation tasks of knowledge acquisition including knowledge graph completion, triple classification, and relation extraction. We also review the real-world applications of KRL, such as language modeling, question answering, information retrieval, and recommender systems. Finally, we discuss the remaining challenges and outline the future directions for KRL. The codes and datasets used in the experiments can be found in https://github.com/thunlp/OpenKE., Comment: 58 pages
Published: 2018

40. Language Modeling with Sparse Product of Sememe Experts

Author: Gu, Yihong, Yan, Jun, Zhu, Hao, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, Lin, Fen, and Lin, Leyu
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Most language modeling methods rely on large-scale data to statistically learn the sequential patterns of words. In this paper, we argue that words are atomic language units but not necessarily atomic semantic units. Inspired by HowNet, we use sememes, the minimum semantic units in human languages, to represent the implicit semantics behind words for language modeling, named Sememe-Driven Language Model (SDLM). More specifically, to predict the next word, SDLM first estimates the sememe distribution given the textual context. Afterward, it regards each sememe as a distinct semantic expert, and these experts jointly identify the most probable senses and the corresponding word. In this way, SDLM enables language models to work beyond word-level manipulation to fine-grained sememe-level semantics and offers us more powerful tools to fine-tune language models and improve the interpretability as well as the robustness of language models. Experiments on language modeling and the downstream application of headline generation demonstrate the significant effect of SDLM. Source code and data used in the experiments can be accessed at https://github.com/thunlp/SDLM-pytorch., Comment: EMNLP 2018. The first three authors contribute equally
Published: 2018

41. Incorporating Chinese Characters of Words for Lexical Sememe Prediction

Author: Jin, Huiming, Zhu, Hao, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, Lin, Fen, and Lin, Leyu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Learning
Abstract: Sememes are minimum semantic units of concepts in human languages, such that each word sense is composed of one or multiple sememes. Words are usually manually annotated with their sememes by linguists, and form linguistic common-sense knowledge bases widely used in various NLP tasks. Recently, the lexical sememe prediction task has been introduced. It consists of automatically recommending sememes for words, which is expected to improve annotation efficiency and consistency. However, existing methods of lexical sememe prediction typically rely on the external context of words to represent the meaning, which usually fails to deal with low-frequency and out-of-vocabulary words. To address this issue for Chinese, we propose a novel framework to take advantage of both internal character information and external context information of words. We experiment on HowNet, a Chinese sememe knowledge base, and demonstrate that our framework outperforms state-of-the-art baselines by a large margin, and maintains a robust performance even for low-frequency words., Comment: Accepted as an ACL 2018 long paper. The first two authors contribute equally. Code is available at https://github.com/thunlp/Character-enhanced-Sememe-Prediction
Published: 2018

42. Does William Shakespeare REALLY Write Hamlet? Knowledge Representation Learning with Confidence

Author: Xie, Ruobing, Liu, Zhiyuan, Lin, Fen, and Lin, Leyu
Subjects: Computer Science - Computation and Language
Abstract: Knowledge graphs (KGs), which could provide essential relational information between entities, have been widely utilized in various knowledge-driven applications. Since the overall body of human knowledge is innumerable, still grows explosively, and changes frequently, knowledge construction and update inevitably involve automatic mechanisms with less human supervision, which usually bring plenty of noise and conflicts into KGs. However, most conventional knowledge representation learning methods assume that all triple facts in existing KGs share the same significance without any noise. To address this problem, we propose a novel confidence-aware knowledge representation learning framework (CKRL), which detects possible noise in KGs while learning knowledge representations with confidence simultaneously. Specifically, we introduce the triple confidence to conventional translation-based methods for knowledge representation learning. To make triple confidence more flexible and universal, we only utilize the internal structural information in KGs, and propose three kinds of triple confidences considering both local and global structural information. In experiments, we evaluate our models on knowledge graph noise detection, knowledge graph completion and triple classification. Experimental results demonstrate that our confidence-aware models achieve significant and consistent improvements on all tasks, which confirms the capability of CKRL modeling confidence with structural information in both KG noise detection and knowledge representation learning., Comment: 8 pages
Published: 2017

43. Neural Emoji Recommendation in Dialogue Systems

Author: Xie, Ruobing, Liu, Zhiyuan, Yan, Rui, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Emoji is an essential component in dialogues which has been broadly utilized on almost all social platforms. It could express more delicate feelings beyond plain texts and thus smooth the communications between users, making dialogue systems more anthropomorphic and vivid. In this paper, we focus on automatically recommending appropriate emojis given the contextual information in multi-turn dialogue systems, where the challenge lies in understanding the whole conversation. More specifically, we propose the hierarchical long short-term memory model (H-LSTM) to construct dialogue representations, followed by a softmax classifier for emoji classification. We evaluate our models on the task of emoji classification in a real-world dataset, with some further explorations on parameter sensitivity and case study. Experimental results demonstrate that our method achieves the best performance on all evaluation metrics. It indicates that our method could well capture the contextual information and emotion flow in dialogues, which is significant for emoji recommendation., Comment: 7 pages
Published: 2016

44. Knowledge Representation via Joint Learning of Sequential Text and Knowledge Graphs

Author: Wu, Jiawei, Xie, Ruobing, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Textual information is considered a significant supplement to knowledge representation learning (KRL). There are two main challenges for constructing knowledge representations from plain texts: (1) How to take full advantage of sequential contexts of entities in plain texts for KRL. (2) How to dynamically select those informative sentences of the corresponding entities for KRL. In this paper, we propose the Sequential Text-embodied Knowledge Representation Learning to build knowledge representations from multiple sentences. Given each reference sentence of an entity, we first utilize a recurrent neural network with pooling or a long short-term memory network to encode the semantic information of the sentence with respect to the entity. Then we further design an attention model to measure the informativeness of each sentence, and build text-based representations of entities. We evaluate our method on two tasks, including triple classification and link prediction. Experimental results demonstrate that our method outperforms other baselines on both tasks, which indicates that our method is capable of selecting informative sentences and encoding the textual information well into knowledge representations., Comment: 10 pages, 3 figures
Published: 2016

45. Image-embodied Knowledge Representation Learning

Author: Xie, Ruobing, Liu, Zhiyuan, Luan, Huanbo, and Sun, Maosong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Entity images could provide significant visual information for knowledge representation learning. Most conventional methods learn knowledge representations merely from structured triples, ignoring rich visual information extracted from entity images. In this paper, we propose a novel Image-embodied Knowledge Representation Learning model (IKRL), where knowledge representations are learned with both triple facts and images. More specifically, we first construct representations for all images of an entity with a neural image encoder. These image representations are then integrated into an aggregated image-based representation via an attention-based method. We evaluate our IKRL models on knowledge graph completion and triple classification. Experimental results demonstrate that our models outperform all baselines on both tasks, which indicates the significance of visual information for knowledge representations and the capability of our models in learning knowledge representations with images., Comment: 7 pages; Accepted by IJCAI-2017
Published: 2016
