Author: "Shang, Lifeng" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shang, Lifeng"' showing total 313 results

Start Over Author "Shang, Lifeng"

313 results on '"Shang, Lifeng"'

1. Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework

Author: Xu, Kaishuai, Yu, Tiezheng, Hou, Wenjun, Cheng, Yi, Li, Liangyou, Jiang, Xin, Shang, Lifeng, Liu, Qun, and Li, Wenjie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
Published: 2025

2. DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities

Author: Zhuang, Tianyi, Kuang, Chuqiao, Li, Xiaoguang, Teng, Yihua, Wu, Jihao, Wang, Yasheng, and Shang, Lifeng
Subjects: Computer Science - Artificial Intelligence
Abstract: We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure the task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1)Advanced slow-thinking reasoning models like o1-preview(69.7%) and DeepSeek-R1(66.3%) significantly outperform best general instruct models like Claude 3.5 Sonnet(57.7%); 2)Distilled reasoning models like DeepSeek-R1-Distill-Qwen-32B(41.3%) falls far behind the teacher model, suggesting challenges to maintain the generalization of reasoning capabilities relying solely on distillation.
Published: 2025

3. Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving

Author: Xu, Xin, Xu, Yan, Chen, Tianhao, Yan, Yuchen, Liu, Chengwu, Chen, Zaoyu, Wang, Yufei, Yin, Yichun, Wang, Yasheng, Shang, Lifeng, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model's unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities., Comment: 8 pages
Published: 2025

4. Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge

Author: Zhang, Qiyuan, Wang, Yufei, Jiang, Yuxin, Li, Liangyou, Wu, Chuhan, Wang, Yasheng, Jiang, Xin, Shang, Lifeng, Tang, Ruiming, Lyu, Fuyuan, and Ma, Chen
Subjects: Computer Science - Computation and Language
Abstract: LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by the CoT reasoning's inability to capture comprehensive and deeper details, often leading to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which is insufficient to address the limitation in CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs generated by ours are more comprehensive and of higher quality, and evaluation accuracy improves as inference scales.
Published: 2025

5. More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

Author: Zhang, Jiebin, Zhu, Dawei, Song, Yifan, Wu, Wenhao, Kuang, Chuqiao, Li, Xiaoguang, Shang, Lifeng, Liu, Qun, and Li, Sujian
Subjects: Computer Science - Computation and Language
Abstract: As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately. However, these works leaving the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.Experiments demonstrate that storing more tokens in the KV cache with lower precision,a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. In-depth analysis of the token-precision trade-off across key aspects demonstrates that, quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning demonstrates notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV., Comment: 13pages,9 figures
Published: 2024

6. ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis

Author: Wang, Zezhong, Zeng, Xingshan, Liu, Weiwen, Li, Liangyou, Wang, Yasheng, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Wong, Kam-Fai
Subjects: Computer Science - Computation and Language
Abstract: Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, tools sampled randomly lack relevance, making them difficult to combine and thus reducing the diversity of the data. Additionally, current work overlooks the coherence between turns of dialogues, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline ToolFlow. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
Published: 2024

7. Subtle Errors Matter: Preference Learning via Error-injected Self-editing

Author: Xu, Kaishuai, Yu, Tiezheng, Hou, Wenjun, Cheng, Yi, Leong, Chak Tou, Li, Liangyou, Jiang, Xin, Shang, Lifeng, Liu, Qun, and Li, Wenjie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs' full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.
Published: 2024

8. RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Author: Zhang, Qiyuan, Wang, Yufei, YU, Tiezheng, Jiang, Yuxin, Wu, Chuhan, Li, Liangyou, Wang, Yasheng, Jiang, Xin, Shang, Lifeng, Tang, Ruiming, Lyu, Fuyuan, and Ma, Chen
Subjects: Computer Science - Computation and Language
Abstract: With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality in a wide range of tasks. However, there still remains a reliability gap between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of reference pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via the response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treat the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval's effectiveness in bias reduction, the impact of inference cost, and reference relevance.
Published: 2024

9. Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Author: Li, Tao, He, Zhengbao, Li, Yujun, Wang, Yasheng, Shang, Lifeng, and Huang, Xiaolin
Subjects: Computer Science - Machine Learning
Abstract: Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computational and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, provides an efficient way to fine-tune models by optimizing only a low-rank matrix. Despite recent progress made in improving LoRA's performance, the connection between the LoRA optimization space and the original full parameter space is often overlooked. A solution that appears flat in the LoRA space may exist sharp directions in the full parameter space, potentially harming generalization performance. In this paper, we propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space.Instead of relying on the well-established sharpness-aware minimization approach, which can incur significant computational and memory burdens, we utilize random weight perturbation with a Bayesian expectation loss objective to maintain training efficiency and design a refined perturbation generation strategy for improved performance. Experiments on natural language processing and image classification tasks with various architectures demonstrate the effectiveness of our approach., Comment: Work in progress
Published: 2024

10. ToolACE: Winning the Points of LLM Function Calling

Author: Liu, Weiwen, Huang, Xu, Zeng, Xingshan, Hao, Xinlong, Yu, Shuai, Li, Dexun, Wang, Shuai, Gan, Weinan, Liu, Zhengying, Yu, Yuanqing, Wang, Zezhong, Wang, Yuxian, Ning, Wu, Hou, Yutai, Wang, Bin, Wu, Chuhan, Wang, Xinzhi, Liu, Yong, Wang, Yasheng, Tang, Duyu, Tu, Dandan, Shang, Lifeng, Jiang, Xin, Tang, Ruiming, Lian, Defu, Liu, Qun, and Chen, Enhong
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE., Comment: 21 pages, 22 figures
Published: 2024

11. Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization

Author: Jiang, Yuxin, Huang, Bo, Wang, Yufei, Zeng, Xingshan, Li, Liangyou, Wang, Yasheng, Jiang, Xin, Shang, Lifeng, Tang, Ruiming, and Wang, Wei
Subjects: Computer Science - Computation and Language
Abstract: Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the generation of the winning response and the losing response within pairwise data are typically isolated, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. Firstly, we increase the consistency and informativeness of the pairwise preference signals through targeted modifications, synthesizing a pseudo-winning response by improving the losing response with the winning response as a reference. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we propose learning token-level correlations by dynamically leveraging the policy model's confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, significantly surpassing competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method's superior performance over DPO and showcases its versatility to other DPO variants. We release our repository at https://github.com/YJiangcm/BMC., Comment: 20 pages, 9 figures, 12 tables. Accepted at ICLR 2025
Published: 2024

12. Chain-of-Probe: Examing the Necessity and Accuracy of CoT Step-by-Step

Author: Wang, Zezhong, Zeng, Xingshan, Liu, Weiwen, Wang, Yufei, Li, Liangyou, Wang, Yasheng, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Wong, Kam-Fai
Subjects: Computer Science - Computation and Language
Abstract: Current research found the issue of Early Answering in large language models (LLMs), where the models already have an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of CoT? To address these questions, we propose a method, namely Chain-of-Probe (CoP), to probe changes in the mind during the model's reasoning. The probing results show that in a significant number of question-answer cases, CoT appears to be unnecessary, and this necessity correlates with the simplicity of the task, defined by reasoning steps required. Furthermore, by analyzing patterns in mind change, we examine the correctness of the model's reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP to prioritize answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model's reasoning.
Published: 2024

13. Prompt-Based Length Controlled Generation with Multiple Control Types

Author: Jie, Renlong, Meng, Xiaojun, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have attracted great attention given their strong performance on a wide range of NLP tasks. In practice, users often expect generated texts to fall within a specific length range, making length controlled generation an important topic, especially for GPT-style models. Existing length control methods mostly focus on a simple control type of "equal to" a target length. Different from them, we propose a prompt-based method to achieve length controlled generation under different control types with high accuracy. In particular, we adopt reinforcement learning (RL) and sample filtering with the reward signal given by rule-based reward models, which enhances the length control ability of models by rewarding outputs that follow certain control instructions. In addition, we introduce a standard prompt extractor to parse arbitrary users' input into standard control instructions. Experiments show that our method significantly improves the accuracy of prompt-based length control on popular summarization datasets like CNNDM and NYT under multiple control types. Moreover, both the standard prompt extractor and RL-tuned model show strong generalization to unseen control prompt templates., Comment: Accepted by ACL 2024 findings. arXiv admin note: text overlap with arXiv:2308.12030
Published: 2024

14. Evaluating the External and Parametric Knowledge Fusion of Large Language Models

Author: Zhang, Hao, Zhang, Yuyang, Li, Xiaoguang, Shi, Wenxuan, Xu, Haonan, Liu, Huanshuo, Wang, Yasheng, Shang, Lifeng, Liu, Qun, Liu, Yong, and Tang, Ruiming
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: Integrating external knowledge into large language models (LLMs) presents a promising solution to overcome the limitations imposed by their antiquated and static parametric memory. Prior studies, however, have tended to over-reliance on external knowledge, underestimating the valuable contributions of an LLMs' intrinsic parametric knowledge. The efficacy of LLMs in blending external and parametric knowledge remains largely unexplored, especially in cases where external knowledge is incomplete and necessitates supplementation by their parametric knowledge. We propose to deconstruct knowledge fusion into four distinct scenarios, offering the first thorough investigation of LLM behavior across each. We develop a systematic pipeline for data construction and knowledge infusion to simulate these fusion scenarios, facilitating a series of controlled experiments. Our investigation reveals that enhancing parametric knowledge within LLMs can significantly bolster their capability for knowledge integration. Nonetheless, we identify persistent challenges in memorizing and eliciting parametric knowledge, and determining parametric knowledge boundaries. Our findings aim to steer future explorations on harmonizing external and parametric knowledge within LLMs., Comment: 15 pages, 3 figures, 3 tables
Published: 2024

15. Does the Generator Mind its Contexts? An Analysis of Generative Model Faithfulness under Context Transfer

Author: Hu, Xinshuo, Hu, Baotian, Li, Dongfang, Li, Xiaoguang, and Shang, Lifeng
Subjects: Computer Science - Computation and Language
Abstract: The present study introduces the knowledge-augmented generator, which is specifically designed to produce information that remains grounded in contextual knowledge, regardless of alterations in the context. Previous research has predominantly focused on examining hallucinations stemming from static input, such as in the domains of summarization or machine translation. However, our investigation delves into the faithfulness of generative question answering in the presence of dynamic knowledge. Our objective is to explore the existence of hallucinations arising from parametric memory when contextual knowledge undergoes changes, while also analyzing the underlying causes for their occurrence. In order to efficiently address this issue, we propose a straightforward yet effective measure for detecting such hallucinations. Intriguingly, our investigation uncovers that all models exhibit a tendency to generate previous answers as hallucinations. To gain deeper insights into the underlying causes of this phenomenon, we conduct a series of experiments that verify the critical role played by context in hallucination, both during training and testing, from various perspectives., Comment: LREC-Coling 2024
Published: 2024

16. Learning to Edit: Aligning LLMs with Knowledge Editing

Author: Jiang, Yuxin, Wang, Yufei, Wu, Chuhan, Zhong, Wanjun, Zeng, Xingshan, Gao, Jiahui, Li, Liangyou, Jiang, Xin, Shang, Lifeng, Tang, Ruiming, Liu, Qun, and Wang, Wei
Subjects: Computer Science - Computation and Language
Abstract: Knowledge editing techniques, aiming to efficiently modify a minor proportion of knowledge in large language models (LLMs) without negatively impacting performance across other inputs, have garnered widespread attention. However, existing methods predominantly rely on memorizing the updated knowledge, impeding LLMs from effectively combining the new knowledge with their inherent knowledge when answering questions. To this end, we propose a Learning to Edit (LTE) framework, focusing on teaching LLMs to apply updated knowledge into input questions, inspired by the philosophy of "Teach a man to fish." LTE features a two-phase process: (i) the Alignment Phase, which fine-tunes LLMs on a meticulously curated parallel dataset to make reliable, in-scope edits while preserving out-of-scope information and linguistic proficiency; and (ii) the Inference Phase, which employs a retrieval-based mechanism for real-time and mass knowledge editing. By comparing our approach with seven advanced baselines across four popular knowledge editing benchmarks and two LLM architectures, we demonstrate LTE's superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds. The data and code are available at https://github.com/YJiangcm/LTE., Comment: 17 pages, 8 figures, 9 tables. ACL 2024 main camera-ready version
Published: 2024

17. Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

Author: Huang, Shijue, Zhong, Wanjun, Lu, Jianqiao, Zhu, Qi, Gao, Jiahui, Liu, Weiwen, Hou, Yutai, Zeng, Xingshan, Wang, Yasheng, Shang, Lifeng, Jiang, Xin, Xu, Ruifeng, and Liu, Qun
Subjects: Computer Science - Computation and Language
Abstract: The recent trend of using Large Language Models (LLMs) as tool agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset. Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool., Comment: Accepted by ACL2024 Findings
Published: 2024

18. MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models

Author: Kwan, Wai-Chung, Zeng, Xingshan, Jiang, Yuxin, Wang, Yufei, Li, Liangyou, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Wong, Kam-Fai
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) are increasingly relied upon for complex multi-turn conversations across diverse real-world applications. However, existing benchmarks predominantly focus on single-turn evaluations, overlooking the models' capabilities in multi-turn interactions. To address this gap, we introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. We construct multi-turn queries for each category either by augmenting existing datasets or by creating new examples with GPT-4 to avoid data leakage. To study the factors impacting multi-turn abilities, we create single-turn versions of the 1170 multi-turn queries and compare performance. Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks. We observe significant performance degradation in multi-turn settings compared to single-turn settings in most models, which is not correlated with the models' fundamental capabilities. Moreover, we identify the distance to relevant content and susceptibility to error propagation as the key factors influencing multi-turn performance. MT-Eval is released publicly to encourage future research towards more robust conversational models., Comment: Code and data are available at https://github.com/KwanWaiChung/MT-Eval
Published: 2024

19. YODA: Teacher-Student Progressive Learning for Language Models

Author: Lu, Jianqiao, Zhong, Wanjun, Wang, Yufei, Guo, Zhijiang, Zhu, Qi, Huang, Wenyong, Wang, Yanlin, Mi, Fei, Wang, Baojun, Wang, Yasheng, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Although large language models (LLMs) have demonstrated adeptness in a range of tasks, they still lag behind human learning efficiency. This disparity is often linked to the inherent human capacity to learn from basic examples, gradually generalize and handle more complex problems, and refine their skills with continuous feedback. Inspired by this, this paper introduces YODA, a novel teacher-student progressive learning framework that emulates the teacher-student education process to improve the efficacy of model fine-tuning. The framework operates on an interactive \textit{basic-generalized-harder} loop. The teacher agent provides tailored feedback on the student's answers, and systematically organizes the education process. This process unfolds by teaching the student basic examples, reinforcing understanding through generalized questions, and then enhancing learning by posing questions with progressively enhanced complexity. With the teacher's guidance, the student learns to iteratively refine its answer with feedback, and forms a robust and comprehensive understanding of the posed questions. The systematic procedural data, which reflects the progressive learning process of humans, is then utilized for model training. Taking math reasoning as a testbed, experiments show that training LLaMA2 with data from YODA improves SFT with significant performance gain (+17.01\% on GSM8K and +9.98\% on MATH). In addition, we find that training with curriculum learning further improves learning robustness., Comment: 14 pages, 4 figures, 3 tables
Published: 2024

20. PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models

Author: Tan, Haochen, Guo, Zhijiang, Shi, Zhan, Xu, Lu, Liu, Zhili, Feng, Yunlong, Li, Xiaoguang, Wang, Yasheng, Shang, Lifeng, Liu, Qun, and Song, Linqi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) have succeeded remarkably in understanding long-form contents. However, exploring their capability for generating long-form contents, such as reports and articles, has been relatively unexplored and inadequately assessed by existing benchmarks. The prevalent evaluation methods, which predominantly rely on crowdsourcing, are recognized for their labor-intensive nature and lack of efficiency, whereas automated metrics, such as the ROUGE score, demonstrate discordance with human judgment criteria. In this paper, we propose ProxyQA, an innovative framework dedicated to assessing long-text generation. ProxyQA comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. LLMs are tasked to generate extensive content in response to these meta-questions, by engaging an evaluator and incorporating the generated texts as contextual background, ProxyQA assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions. We examine multiple LLMs, emphasizing ProxyQA's demanding nature as a high-quality assessment tool. Human evaluation demonstrates that the proxy-question method is notably self-consistent and aligns closely with human evaluative standards. The dataset and leaderboard is available at \url{https://proxy-qa.com}., Comment: Accepted to ACL 2024 main conference
Published: 2024

21. Preparing Lessons for Progressive Training on Language Models

Author: Pan, Yu, Yuan, Ye, Yin, Yichun, Shi, Jiaxin, Xu, Zenglin, Zhang, Ming, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prep\textbf{a}res lessons for ex\textbf{p}anding \textbf{o}perations by \textbf{l}earning high-\textbf{l}ayer functi\textbf{o}nality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.
Published: 2024

22. Data Management For Training Large Language Models: A Survey

Author: Wang, Zige, Zhong, Wanjun, Wang, Yufei, Zhu, Qi, Mi, Fei, Wang, Baojun, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Data plays a fundamental role in training Large Language Models (LLMs). Efficient data management, particularly in formulating a well-suited training dataset, is significant for enhancing model performance and improving training efficiency during pretraining and supervised fine-tuning stages. Despite the considerable importance of data management, the underlying mechanism of current prominent practices are still unknown. Consequently, the exploration of data management has attracted more and more attention among the research community. This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various aspects of data management strategy design. Looking into the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through efficient data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM., Comment: Work in progress
Published: 2023

23. FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

Author: Jiang, Yuxin, Wang, Yufei, Zeng, Xingshan, Zhong, Wanjun, Li, Liangyou, Mi, Fei, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Wang, Wei
Subjects: Computer Science - Computation and Language
Abstract: The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response follows constraints stated in the instruction. To fill this research gap, in this paper, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench comprehensively includes five different types (i.e., Content, Situation, Style, Format, and Example) of fine-grained constraints. To enable a precise constraint following estimation on diverse difficulties, we introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. To assess whether LLMs' outputs have satisfied every individual constraint, we propose to prompt strong LLMs with constraint-evolution paths to handle challenging open-ended instructions. By evaluating 13 closed-source and open-source popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench., Comment: 22 pages, 11 figures, 16 tables. ACL 2024 main camera-ready version
Published: 2023

24. M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

Author: Kwan, Wai-Chung, Zeng, Xingshan, Wang, Yufei, Sun, Yusen, Li, Liangyou, Shang, Lifeng, Liu, Qun, and Wong, Kam-Fai
Subjects: Computer Science - Computation and Language
Abstract: Managing long sequences has become an important and necessary feature for large language models (LLMs). However, it is still an open question of how to comprehensively and systematically evaluate the long-sequence capability of LLMs. One of the reasons is that conventional and widely-used benchmarks mainly consist of short sequences. In this paper, we propose M4LE, a Multi-ability, Multi-range, Multi-task, Multi-domain benchmark for Long-context Evaluation. M4LE is based on a diverse NLP task pool comprising 36 NLP datasets, 11 task types and 12 domains. To alleviate the scarcity of tasks with naturally long sequences and incorporate multiple-ability assessment, we propose an automatic approach (but with negligible human annotations) to convert short-sequence tasks into a unified long-sequence scenario where LLMs have to identify single or multiple relevant spans in long contexts based on explicit or semantic hints. Specifically, the scenario includes five different types of abilities: (1) explicit single-span; (2) semantic single-span; (3) explicit multiple-span; (4) semantic multiple-span; and (5) global context understanding. The resulting samples in M4LE are evenly distributed from 1k to 8k input length. We conducted a systematic evaluation on 11 well-established LLMs, especially those optimized for long-sequence inputs. Our results reveal that: 1) Current LLMs struggle to understand long context, particularly when tasks require multiple-span attention. 2) Semantic retrieval task is more difficult for competent LLMs. 3) Models fine-tuned on longer text with position interpolation have comparable performance to those using Neural Tangent Kernel (NTK) aware scaling methods without fine-tuning. We make our benchmark publicly available to encourage future research in this challenging area., Comment: Code and data are available at https://github.com/KwanWaiChung/M4LE
Published: 2023

25. Reusing Pretrained Models by Multi-linear Operators for Efficient Training

Author: Pan, Yu, Yuan, Ye, Yin, Yichun, Xu, Zenglin, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Training large models from scratch usually costs a substantial amount of resources. Towards this problem, recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model (termed the ``target model''), leading to a considerable acceleration in training. Despite the successes of these previous studies, they grew pretrained models by mapping partial weights only, ignoring potential correlations across the entire model. As we show in this paper, there are inter- and intra-interactions among the weights of both the pretrained and the target models. As a result, the partial mapping may not capture the complete information and lead to inadequate growth. In this paper, we propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model to further enhance acceleration ability. We utilize multi-linear operators to reduce computational and spacial complexity, enabling acceptable resource requirements. Experiments demonstrate that our method can save 76\% computational costs on DeiT-base transferred from DeiT-small, which outperforms bert2BERT by +12.0\% and LiGO by +20.7\%, respectively., Comment: Accepted in NeurIPS 2023
Published: 2023

26. Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

Author: Chen, Kai, Wang, Chunwei, Yang, Kuo, Han, Jianhua, Hong, Lanqing, Mi, Fei, Xu, Hang, Liu, Zhengying, Huang, Wenyong, Li, Zhenguo, Yeung, Dit-Yan, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges. This becomes particularly evident when LLMs inadvertently generate harmful or toxic content, either unintentionally or because of intentional inducement. Existing alignment methods usually direct LLMs toward the favorable outcomes by utilizing human-annotated, flawless instruction-response pairs. Conversely, this study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them. In this case, mistakes are repurposed into valuable data for alignment, effectively helping to avoid the production of erroneous responses. Without external models or human annotations, our method leverages a model's intrinsic ability to discern undesirable mistakes and improves the safety of its generated responses. Experimental results reveal that our method outperforms existing alignment approaches in enhancing model safety while maintaining the overall utility., Comment: Accepted by ICLR 2024
Published: 2023

27. Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment

Author: Xue, Boyang, Wang, Weichao, Wang, Hongru, Mi, Fei, Wang, Rui, Wang, Yasheng, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Wong, Kam-Fai
Subjects: Computer Science - Computation and Language
Abstract: Pretrained language models (PLMs) based knowledge-grounded dialogue systems are prone to generate responses that are factually inconsistent with the provided knowledge source. In such inconsistent responses, the dialogue models fail to accurately express the external knowledge they rely upon. Inspired by previous work which identified that feed-forward networks (FFNs) within Transformers are responsible for factual knowledge expressions, we investigate two methods to efficiently improve the factual expression capability {of FFNs} by knowledge enhancement and alignment respectively. We first propose \textsc{K-Dial}, which {explicitly} introduces {extended FFNs in Transformers to enhance factual knowledge expressions} given the specific patterns of knowledge-grounded dialogue inputs. Additionally, we apply the reinforcement learning for factual consistency (RLFC) method to implicitly adjust FFNs' expressions in responses by aligning with gold knowledge for the factual consistency preference. To comprehensively assess the factual consistency and dialogue quality of responses, we employ extensive automatic measures and human evaluations including sophisticated fine-grained NLI-based metrics. Experimental results on WoW and CMU\_DoG datasets demonstrate that our methods efficiently enhance the ability of the FFN module to convey factual knowledge, validating the efficacy of improving factual consistency for knowledge-grounded dialogue systems., Comment: EMNLP2023 Findings
Published: 2023

28. Exploring the Usage of Chinese Pinyin in Pretraining

Author: Wang, Baojun, Xu, Kun, and Shang, Lifeng
Subjects: Computer Science - Computation and Language
Abstract: Unlike alphabetic languages, Chinese spelling and pronunciation are different. Both characters and pinyin take an important role in Chinese language understanding. In Chinese NLP tasks, we almost adopt characters or words as model input, and few works study how to use pinyin. However, pinyin is essential in many scenarios, such as error correction and fault tolerance for ASR-introduced errors. Most of these errors are caused by the same or similar pronunciation words, and we refer to this type of error as SSP(the same or similar pronunciation) errors for short. In this work, we explore various ways of using pinyin in pretraining models and propose a new pretraining method called PmBERT. Our method uses characters and pinyin in parallel for pretraining. Through delicate pretraining tasks, the characters and pinyin representation are fused, which can enhance the error tolerance for SSP errors. We do comprehensive experiments and ablation tests to explore what makes a robust phonetic enhanced Chinese language model. The experimental results on both the constructed noise-added dataset and the public error-correction dataset demonstrate that our model is more robust compared to SOTA models.
Published: 2023

29. Cancer-Associated Fibroblast-Secreted Exosomes Promote Gastric Cancer Cell Migration and Invasion via the IL-32/ESR1 Axis

Author: Shang, Lifeng, Chen, Xinli, Zhu, Tianyu, Chong, Shujing, Liu, Haiwang, Huang, Wei, Fu, Weibo, She, Hao, and Shen, Xin
Published: 2024
Full Text: View/download PDF

30. Convergent evolution of allele-specific gene expression that leads to non-small cell lung cancer in different human populations

Author: Hou, Qiuyu, Shang, Lifeng, Chen, Xu, Luo, Qiang, Wei, Liang, and Zhang, Chence
Published: 2024
Full Text: View/download PDF

31. SELF: Self-Evolution with Language Feedback

Author: Lu, Jianqiao, Zhong, Wanjun, Huang, Wenyong, Wang, Yufei, Zhu, Qi, Mi, Fei, Wang, Baojun, Wang, Weichao, Zeng, Xingshan, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs) have demonstrated remarkable versatility across various domains. To further advance LLMs, we propose 'SELF' (Self-Evolution with Language Feedback), a novel approach that enables LLMs to self-improve through self-reflection, akin to human learning processes. SELF initiates with a meta-skill learning process that equips the LLMs with capabilities for self-feedback and self-refinement. Subsequently, the model undergoes an iterative process of self-evolution. In each iteration, it utilizes an unlabeled dataset of instructions to generate initial responses. These responses are enhanced through self-feedback and self-refinement. The model is then fine-tuned using this enhanced data. The model undergoes progressive improvement through this iterative self-evolution process. Moreover, the SELF framework enables the model to apply self-refinement during inference, which further improves response quality. Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention. The SELF framework indicates a promising direction for the autonomous evolution of LLMs, transitioning them from passive information receivers to active participants in their development., Comment: 20 pages, 4 figures, 11 tables
Published: 2023

32. Prompt-Based Length Controlled Generation with Reinforcement Learning

Author: Jie, Renlong, Meng, Xiaojun, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising performance on a wide range of NLP tasks. Length controlled generation of LLMs emerges as an important topic, which enables users to fully leverage the capability of LLMs in more real-world scenarios like generating a proper answer or essay of a desired length. In addition, the autoregressive generation in LLMs is extremely time-consuming, while the ability of controlling this generated length can reduce the inference cost by limiting the length. Therefore, we propose a prompt-based length control method to achieve high-accuracy length controlled generation. In particular, we adopt reinforcement learning with the reward signal given by either trainable or rule-based reward models, which further enhances the length-control ability of LLMs by rewarding outputs that follows pre-defined control instruction. To enable rule-based inference, we also introduce standard prompt extractor to collect the standard control information from users' input. Experiments show that our method significantly improves the accuracy of prompt-based length control for summarization task on popular datasets like CNNDM and NYT. Both the standard prompt extractor and the RL-tuned model have show strong generalization ability to unseen control prompt templates.
Published: 2023

33. AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models

Author: Li, Siheng, Yang, Cheng, Yin, Yichun, Zhu, Xinyu, Cheng, Zesen, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Yang, Yujiu
Subjects: Computer Science - Computation and Language
Abstract: Information-seeking conversation, which aims to help users gather information through conversation, has achieved great progress in recent years. However, the research is still stymied by the scarcity of training data. To alleviate this problem, we propose AutoConv for synthetic conversation generation, which takes advantage of the few-shot learning ability and generation capacity of large language models (LLM). Specifically, we formulate the conversation generation problem as a language modeling task, then finetune an LLM with a few human conversations to capture the characteristics of the information-seeking process and use it for generating synthetic conversations with high quality. Experimental results on two frequently-used datasets verify that AutoConv has substantial improvements over strong baselines and alleviates the dependence on human annotation. In addition, we also provide several analysis studies to promote future research., Comment: Accepted to ACL 2023 Main Conference (Short)
Published: 2023
Full Text: View/download PDF

34. NewsDialogues: Towards Proactive News Grounded Conversation

Author: Li, Siheng, Yin, Yichun, Yang, Cheng, Jiang, Wangjie, Li, Yiwei, Cheng, Zesen, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Yang, Yujiu
Subjects: Computer Science - Computation and Language
Abstract: Hot news is one of the most popular topics in daily conversations. However, news grounded conversation has long been stymied by the lack of well-designed task definition and scarce data. In this paper, we propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news. In addition, both information-seeking and chit-chat scenarios are included realistically, where the user may ask a series of questions about the news details or express their opinions and be eager to chat. To further develop this novel task, we collect a human-to-human Chinese dialogue dataset \ts{NewsDialogues}, which includes 1K conversations with a total of 14.6K utterances and detailed annotations for target topics and knowledge spans. Furthermore, we propose a method named Predict-Generate-Rank, consisting of a generator for grounded knowledge prediction and response generation, and a ranker for the ranking of multiple responses to alleviate the exposure bias. We conduct comprehensive experiments to demonstrate the effectiveness of the proposed method and further present several key findings and challenges to prompt future research., Comment: Accepted to ACL 2023 Conference (Long Paper; Findings)
Published: 2023
Full Text: View/download PDF

35. Aligning Large Language Models with Human: A Survey

Author: Wang, Yufei, Zhong, Wanjun, Li, Liangyou, Mi, Fei, Zeng, Xingshan, Huang, Wenyong, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks. Despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or factually incorrect (hallucinated) information. Hence, aligning LLMs with human expectations has become an active area of interest within the research community. This survey presents a comprehensive overview of these alignment technologies, including the following aspects. (1) Data collection: the methods for effectively collecting high-quality instructions for LLM alignment, including the use of NLP benchmarks, human annotations, and leveraging strong LLMs. (2) Training methodologies: a detailed review of the prevailing training methods employed for LLM alignment. Our exploration encompasses Supervised Fine-tuning, both Online and Offline human preference training, along with parameter-efficient training mechanisms. (3) Model Evaluation: the methods for evaluating the effectiveness of these human-aligned LLMs, presenting a multifaceted approach towards their assessment. In conclusion, we collate and distill our findings, shedding light on several promising future research avenues in the field. This survey, therefore, serves as a valuable resource for anyone invested in understanding and advancing the alignment of LLMs to better suit human-oriented tasks and expectations. An associated GitHub link collecting the latest papers is available at https://github.com/GaryYufei/AlignLLMHumanSurvey., Comment: work in progress
Published: 2023

36. Enhancing Coherence of Extractive Summarization with Multitask Learning

Author: Jie, Renlong, Meng, Xiaojun, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: This study proposes a multitask learning architecture for extractive summarization with coherence boosting. The architecture contains an extractive summarizer and coherent discriminator module. The coherent discriminator is trained online on the sentence vectors of the augmented textual input, thus improving its general ability of judging whether the input sentences are coherent. Meanwhile, we maximize the coherent scores from the coherent discriminator by updating the parameters of the summarizer. To make the extractive sentences trainable in a differentiable manner, we introduce two strategies, including pre-trained converting model (model-based) and converting matrix (MAT-based) that merge sentence representations. Experiments show that our proposed method significantly improves the proportion of consecutive sentences in the extracted summaries based on their positions in the original article (i.e., automatic sentence-level coherence metric), while the goodness in terms of other automatic metrics (i.e., Rouge scores and BertScores) are preserved. Human evaluation also evidences the improvement of coherence and consistency of the extracted summaries given by our method., Comment: 11 pages, 4 figures
Published: 2023

37. Retrieval-based Disentangled Representation Learning with Natural Language Supervision

Author: Zhou, Jiawei, Li, Xiaoguang, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Chen, Lei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Disentangled representation learning remains challenging as the underlying factors of variation in the data do not naturally exist. The inherent complexity of real-world data makes it unfeasible to exhaustively enumerate and encapsulate all its variations within a finite set of factors. However, it is worth noting that most real-world data have linguistic equivalents, typically in the form of textual descriptions. These linguistic counterparts can represent the data and effortlessly decomposed into distinct tokens. In light of this, we present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning. Our approach employ a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish dimensions that capture intrinsic characteristics within data through its natural language counterpart, thus facilitating disentanglement. We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results compellingly demonstrate the superiority of VDR over previous bi-encoder retrievers with comparable model size and training costs, achieving an impressive 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, The results from human evaluation indicate that interpretability of our method is on par with SOTA captioning models.
Published: 2022

38. G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks

Author: Wan, Zhongwei, Yin, Yichun, Zhang, Wei, Shi, Jiaxin, Shang, Lifeng, Chen, Guangyong, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language
Abstract: Recently, domain-specific PLMs have been proposed to boost the task performance of specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs with domain-specific corpora. However, this Domain-Adaptive Pre-Training (DAPT; Gururangan et al. (2020)) tends to forget the previous general knowledge acquired by general PLMs, which leads to a catastrophic forgetting phenomenon and sub-optimal performance. To alleviate this problem, we propose a new framework of General Memory Augmented Pre-trained Language Model (G-MAP), which augments the domain-specific PLM by a memory representation built from the frozen general PLM without losing any general knowledge. Specifically, we propose a new memory-augmented layer, and based on it, different augmented strategies are explored to build the memory representation and then adaptively fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on various domains (biomedical and computer science publications, news, and reviews) and different kinds (text classification, QA, NER) of tasks, and the extensive results show that the proposed G-MAP can achieve SOTA results on all tasks., Comment: EMNLP 2022,Long paper,Main conference
Published: 2022

39. Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation

Author: Zhang, Zhexin, Cheng, Jiale, Sun, Hao, Deng, Jiawen, Mi, Fei, Wang, Yasheng, Shang, Lifeng, and Huang, Minlie
Subjects: Computer Science - Computation and Language
Abstract: Large pretrained language models can easily produce toxic or biased content, which is prohibitive for practical use. In order to detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers, or automatic generation to construct adversarial contexts that are likely to induce toxic generations. However, what type of context is more likely to induce unsafe responses is still under-explored. In this paper, we identify that context toxicity and context category (e.g., \textit{profanity}, \textit{insult}, \textit{drugs}, etc.) are two important factors to cause safety issues in response generation. Hence, we propose a method called \emph{reverse generation} to construct adversarial contexts conditioned on a given response, with the flexibility to control category, toxicity level, and inductivity of the generated contexts. Via reverse generation, we augment the existing BAD dataset and construct a new dataset BAD+ which contains more than 120K diverse and highly inductive contexts in 12 categories. We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems. Furthermore, we show that BAD+ can greatly enhance the safety of generation and reveal the key factors of safety improvement. Our code and dataset is available at \url{https://github.com/thu-coai/Reverse_Generation}., Comment: Findings of EMNLP 2022
Published: 2022

40. LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

Author: Chen, Dongsheng, Tao, Chaofan, Hou, Lu, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive due to the requirement of millions of video-text pairs and the redundant data structure of each video. To mitigate these problems, we propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training. To enhance the temporal modeling lacking in the image-language model, we propose to add temporal attention modules in the image encoder of BLIP with dynamic temporal scaling. Besides the model-wise adaptation, we also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text. Experimental results on text-video retrieval and video question answering show that the proposed LiteVL even outperforms previous video-language pre-trained models by a clear margin, though without any video-language pre-training., Comment: 13 pages, 6 figures, accepted by EMNLP 2022 main conference
Published: 2022

41. Pre-training Language Models with Deterministic Factual Knowledge

Author: Li, Shaobo, Li, Xiaoguang, Shang, Lifeng, Sun, Chengjie, Liu, Bingquan, Ji, Zhenzhou, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language
Abstract: Previous works show that Pre-trained Language Models (PLMs) can capture factual knowledge. However, some analyses reveal that PLMs fail to perform it robustly, e.g., being sensitive to the changes of prompts when extracting factual knowledge. To mitigate this issue, we propose to let PLMs learn the deterministic relationship between the remaining context and the masked content. The deterministic relationship ensures that the masked factual content can be deterministically inferable based on the existing clues in the context. That would provide more stable patterns for PLMs to capture factual knowledge than randomly masking. Two pre-training tasks are further introduced to motivate PLMs to rely on the deterministic relationship when filling masks. Specifically, we use an external Knowledge Base (KB) to identify deterministic relationships and continuously pre-train PLMs with the proposed methods. The factual knowledge probing experiments indicate that the continuously pre-trained PLMs achieve better robustness in factual knowledge capturing. Further experiments on question-answering datasets show that trying to learn a deterministic relationship with the proposed methods can also help other knowledge-intensive tasks., Comment: Accepted at EMNLP 2022
Published: 2022

42. Exploring Extreme Parameter Compression for Pre-trained Language Models

Author: Ren, Yuxin, Wang, Benyou, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent work explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated one. Two decomposition and reconstruction protocols are further proposed to improve the effectiveness and efficiency during compression. Our compressed BERT with ${1}/{7}$ parameters in Transformer layers performs on-par with, sometimes slightly better than the original BERT in GLUE benchmark. A tiny version achieves $96.7\%$ performance of BERT-base with $ {1}/{48} $ encoder parameters (i.e., less than 2M parameters excluding the embedding layer) and $2.7 \times$ faster on inference. To show that the proposed method is orthogonal to existing compression methods like knowledge distillation, we also explore the benefit of the proposed method on a distilled BERT., Comment: Accepted at ICLR2022. Code available at https://github.com/twinkle0331/Xcompression
Published: 2022

43. PanGu-Bot: Efficient Generative Dialogue Pre-training from Pre-trained Language Model

Author: Mi, Fei, Li, Yitong, Zeng, Yulong, Zhou, Jingyan, Wang, Yasheng, Xu, Chuanfei, Shang, Lifeng, Jiang, Xin, Zhao, Shiqi, and Liu, Qun
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we introduce PanGu-Bot, a Chinese pre-trained open-domain dialogue generation model based on a large pre-trained language model (PLM) PANGU-alpha (Zeng et al.,2021). Different from other pre-trained dialogue models trained over a massive amount of dialogue data from scratch, we aim to build a powerful dialogue model with relatively fewer data and computation costs by inheriting valuable language capabilities and knowledge from PLMs. To this end, we train PanGu-Bot from the large PLM PANGU-alpha, which has been proven well-performed on a variety of Chinese natural language tasks. We investigate different aspects of responses generated by PanGu-Bot, including response quality, knowledge, and safety. We show that PanGu-Bot outperforms state-of-the-art Chinese dialogue systems (CDIALGPT (Wang et al., 2020), EVA (Zhou et al., 2021), EVA2.0 (Gu et al., 2022)) w.r.t. the above three aspects. We also demonstrate that PanGu-Bot can be easily deployed to generate emotional responses without further training. Throughout our empirical analysis, we also point out that the PanGu-Bot response quality, knowledge correctness, and safety are still far from perfect, and further explorations are indispensable to building reliable and smart dialogue systems. Our model and code will be available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/PanGu-Bot soon., Comment: Update model and results; add comparison with EVA2.0
Published: 2022

44. How Pre-trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis

Author: Li, Shaobo, Li, Xiaoguang, Shang, Lifeng, Dong, Zhenhua, Sun, Chengjie, Liu, Bingquan, Ji, Zhenzhou, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language
Abstract: Recently, there has been a trend to investigate the factual knowledge captured by Pre-trained Language Models (PLMs). Many works show the PLMs' ability to fill in the missing factual words in cloze-style prompts such as "Dante was born in [MASK]." However, it is still a mystery how PLMs generate the results correctly: relying on effective clues or shortcut patterns? We try to answer this question by a causal-inspired analysis that quantitatively measures and evaluates the word-level patterns that PLMs depend on to generate the missing words. We check the words that have three typical associations with the missing words: knowledge-dependent, positionally close, and highly co-occurred. Our analysis shows: (1) PLMs generate the missing factual words more by the positionally close and highly co-occurred words than the knowledge-dependent words; (2) the dependence on the knowledge-dependent words is more effective than the positionally close and highly co-occurred words. Accordingly, we conclude that the PLMs capture the factual knowledge ineffectively because of depending on the inadequate associations., Comment: Accepted at Findings of ACL 2022
Published: 2022

45. Compression of Generative Pre-trained Language Models via Quantization

Author: Tao, Chaofan, Hou, Lu, Zhang, Wei, Shang, Lifeng, Jiang, Xin, Liu, Qun, Luo, Ping, and Wong, Ngai
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and the underlying difficulty remains unclear. In this paper, we compress generative PLMs by quantization. We find that previous quantization methods fail on generative tasks due to the \textit{homogeneous word embeddings} caused by reduced capacity, and \textit{varied distribution of weights}. Correspondingly, we propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules. Empirical results on various tasks show that our proposed method outperforms the state-of-the-art compression methods on generative PLMs by a clear margin. With comparable performance with the full-precision models, we achieve 14.4x and 13.4x compression rates on GPT-2 and BART, respectively., Comment: ACL 2022
Published: 2022

46. Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering

Author: Zhou, Jiawei, Li, Xiaoguang, Shang, Lifeng, Luo, Lan, Zhan, Ke, Hu, Enrui, Zhang, Xinyu, Jiang, Hao, Cao, Zhao, Yu, Fan, Jiang, Xin, Liu, Qun, and Chen, Lei
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by more than 10 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios., Comment: Accepted by ACL 2022 main conference; The dataset and code are available at https://github.com/jzhoubu/HLP
Published: 2022

47. Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

Author: Dai, Wenliang, Hou, Lu, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Fung, Pascale
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) with a tremendous amount of image-text pair data, has shown its superiority on various multimodal alignment tasks. Despite its success, the resulting models are not capable of multimodal generative tasks due to the weak text encoder. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling the capability for multimodal generation. VLKD is pretty data- and computation-efficient compared to the pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous state-of-the-art zero-shot model with $7\times$ fewer parameters. Furthermore, the original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks., Comment: Accepted to ACL 2022
Published: 2022

48. Read before Generate! Faithful Long Form Question Answering with Machine Reading

Author: Su, Dan, Li, Xiaoguang, Zhang, Jindi, Shang, Lifeng, Jiang, Xin, Liu, Qun, and Fung, Pascale
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Long-form question answering (LFQA) aims to generate a paragraph-length answer for a given question. While current work on LFQA using large pre-trained model for generation are effective at producing fluent and somewhat relevant content, one primary challenge lies in how to generate a faithful answer that has less hallucinated content. We propose a new end-to-end framework that jointly models answer generation and machine reading. The key idea is to augment the generation model with fine-grained, answer-related salient information which can be viewed as an emphasis on faithful facts. State-of-the-art results on two LFQA datasets, ELI5 and MS MARCO, demonstrate the effectiveness of our method, in comparison with strong baselines on automatic and human evaluation metrics. A detailed analysis further proves the competency of our methods in generating fluent, relevant, and more faithful answers., Comment: long paper, accepted to ACL 2022 findings
Published: 2022

49. Systematic exploration of the molecular characteristics of CD8+ T cells to predict the response to immunotherapy and the prognosis of patients with colon adenocarcinoma

Author: Shen, Xin, Shang, Lifeng, Han, Junwei, Zhang, Yi, Niu, Wenkai, Liu, Haiwang, and Shi, Hai
Published: 2024
Full Text: View/download PDF

50. bert2BERT: Towards Reusable Pretrained Language Models

Author: Chen, Cheng, Yin, Yichun, Shang, Lifeng, Jiang, Xin, Qin, Yujia, Wang, Fengyu, Wang, Zhi, Chen, Xiao, Liu, Zhiyuan, and Liu, Qun
Subjects: Computer Science - Computation and Language
Abstract: In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources and most of the models are trained from scratch without reusing the existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model (e.g., BERT_BASE) to a large model (e.g., BERT_LARGE) through parameter initialization and significantly improve the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving on Transformer-based language model, and further improve it by proposing advanced knowledge for large model's initialization. In addition, a two-stage pre-training method is proposed to further accelerate the training process. We did extensive experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that (1) our method can save a significant amount of training cost compared with baselines including learning from scratch, StackBERT and MSLT; (2) our method is generic and applicable to different types of pre-trained models. In particular, bert2BERT saves about 45% and 47% computational cost of pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half sizes. The source code will be publicly available upon publication.
Published: 2021

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

313 results on '"Shang, Lifeng"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources