Author: "Dai, Sunhao" - Searchworks@Jio Institute Digital Library Search Results

Search keyword "Dai, Sunhao": 23 results in total

1. Few-shot LLM Synthetic Data with Distribution Matching

Author: Ren, Jiyuan, Du, Zhaocheng, Wen, Zhihao, Jia, Qinglin, Dai, Sunhao, Wu, Chuhan, and Dong, Zhenhua
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: As large language models (LLMs) advance, their ability to perform in-context learning and few-shot language generation has improved significantly. This has spurred using LLMs to produce high-quality synthetic data to enhance the performance of smaller models like online retrievers or weak LLMs. However, LLM-generated synthetic data often differs from the real data in key language attributes (e.g., styles, tones, content proportions). As a result, mixing these synthetic data directly with real data may distort the original data distribution, potentially hindering performance improvements. To solve this, we introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching. Before generation, SynAlign employs an uncertainty tracker surrogated by the Gaussian Process model to iteratively select data clusters distinct from selected ones as demonstrations for new data synthesis, facilitating efficient exploration of the diversity of the real data. Then, a latent attribute reasoning method is employed: the LLM summarizes linguistic attributes of demonstrations and then synthesizes new data based on them. This approach facilitates synthesizing diverse data with linguistic attributes that appear in real data. After generation, the Maximum Mean Discrepancy is used as the objective function to learn the sampling weight of each synthetic sample, ensuring distribution matching with the real data. Our experiments on multiple text prediction tasks show significant performance improvements. We also conducted an online A/B test on an online retriever to demonstrate SynAlign's effectiveness.
Comment: 10 pages, 5 figures, accepted at WWW 2025
Published: 2025

2. CitaLaw: Enhancing LLM with Citations in Legal Domain

Author: Zhang, Kepu, Yu, Weijie, Dai, Sunhao, and Xu, Jun
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we propose CitaLaw, the first benchmark designed to evaluate LLMs' ability to produce legally sound responses with appropriate citations. CitaLaw features a diverse set of legal questions for both laypersons and practitioners, paired with a comprehensive corpus of law articles and precedent cases as a reference pool. This framework enables LLM-based systems to retrieve supporting citations from the reference corpus and align these citations with the corresponding sentences in their responses. Moreover, we introduce syllogism-inspired evaluation methods to assess the legal alignment between retrieved references and LLM-generated responses, as well as their consistency with user questions. Extensive experiments on 2 open-domain and 7 legal-specific LLMs demonstrate that integrating legal references substantially enhances response quality. Furthermore, our proposed syllogism-based evaluation method exhibits strong agreement with human judgments.
Published: 2024

3. Length-Induced Embedding Collapse in Transformer-based Models

Author: Zhou, Yuqi, Dai, Sunhao, Cao, Zhanshuo, Zhang, Xiao, and Xu, Jun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering the self-attention mechanism inherently functions as a low-pass filter, we prove that long sequences increase the attenuation rate of the low-pass filter effect of the self-attention mechanism. With layers going deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps will collapse into a narrow space, especially in long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse limitation by introducing a temperature in softmax(), which achieves a higher low-filter attenuation rate. The tuning-free method, called TempScale, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale can improve existing embedding models, especially on long text inputs, bringing up to 0.53% performance gains on 40 datasets from Massive Text Embedding Benchmark (MTEB) and 0.82% performance gains on 4 datasets from LongEmbed, which specifically focuses on long context retrieval.
Published: 2024

4. From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions

Author: Qu, Changle, Dai, Sunhao, Wei, Xiaochi, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Xu, Jun, and Wen, Ji-Rong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Tool learning enables Large Language Models (LLMs) to interact with external environments by invoking tools, serving as an effective strategy to mitigate the limitations inherent in their pre-training data. In this process, tool documentation plays a crucial role by providing usage instructions for LLMs, thereby facilitating effective tool utilization. This paper concentrates on the critical challenge of bridging the comprehension gap between LLMs and external tools due to the inadequacies and inaccuracies inherent in existing human-centric tool documentation. We propose a novel framework, DRAFT, aimed at Dynamically Refining tool documentation through the Analysis of Feedback and Trials emanating from LLMs' interactions with external tools. This methodology pivots on an innovative trial-and-error approach, consisting of three distinct learning phases: experience gathering, learning from experience, and documentation rewriting, to iteratively enhance the tool documentation. This process is further optimized by implementing a diversity-promoting exploration strategy to ensure explorative diversity and a tool-adaptive termination mechanism to prevent overfitting while enhancing efficiency. Extensive experiments on multiple datasets demonstrate that DRAFT's iterative, feedback-based refinement significantly ameliorates documentation quality, fostering a deeper comprehension and more effective utilization of tools by LLMs. Notably, our analysis reveals that the tool documentation refined via our approach demonstrates robust cross-model generalization capabilities.
Published: 2024

5. Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method

Author: Yang, Chen, Dai, Sunhao, Hou, Yupeng, Zhao, Wayne Xin, Xu, Jun, Song, Yang, and Zhu, Hengshu
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence
Abstract: Reciprocal recommender systems (RRS), conducting bilateral recommendations between two involved parties, have gained increasing attention for enhancing matching efficiency. However, the majority of existing methods in the literature still reuse conventional ranking metrics to separately assess the performance on each side of the recommendation process. These methods overlook the fact that the ranking outcomes of both sides collectively influence the effectiveness of the RRS, neglecting the necessity of a more holistic evaluation and a capable systemic solution. In this paper, we systematically revisit the task of reciprocal recommendation by introducing new metrics, a new formulation, and a new method. Firstly, we propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS from three distinct perspectives: overall coverage, bilateral stability, and balanced ranking. These metrics provide a more holistic understanding of the system's effectiveness and enable a comprehensive evaluation. Furthermore, we formulate the RRS from a causal perspective, formulating recommendations as bilateral interventions, which can better model the decoupled effects of potential influencing factors. By utilizing the potential outcome framework, we further develop a model-agnostic causal reciprocal recommendation method that considers the causal effects of recommendations. Additionally, we introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics. Extensive experiments on two real-world datasets from recruitment and dating scenarios demonstrate the effectiveness of our proposed metrics and approach. The code and dataset are available at: https://github.com/RUCAIBox/CRRS.
Comment: KDD 2024
Published: 2024

6. Towards Robust Recommendation via Decision Boundary-aware Graph Contrastive Learning

Author: Tang, Jiakai, Dai, Sunhao, Sun, Zexu, Chen, Xu, Xu, Jun, Yu, Wenhui, Hu, Lantao, Jiang, Peng, and Li, Han
Subjects: Computer Science - Information Retrieval
Abstract: In recent years, graph contrastive learning (GCL) has received increasing attention in recommender systems due to its effectiveness in reducing bias caused by data sparsity. However, most existing GCL models rely on heuristic approaches and usually assume entity independence when constructing contrastive views. We argue that these methods struggle to strike a balance between semantic invariance and view hardness across the dynamic training process, both of which are critical factors in graph contrastive learning. To address the above issues, we propose a novel GCL-based recommendation framework RGCL, which effectively maintains the semantic invariance of contrastive pairs and dynamically adapts as the model capability evolves through the training process. Specifically, RGCL first introduces decision boundary-aware adversarial perturbations to constrain the exploration space of contrastive augmented views, avoiding the decrease of task-specific information. Furthermore, to incorporate global user-user and item-item collaboration relationships for guiding the generation of hard contrastive views, we propose an adversarial-contrastive learning objective to construct a relation-aware view-generator. Besides, considering that unsupervised GCL could potentially narrow the margins between data points and the decision boundary, resulting in decreased model robustness, we introduce adversarial examples based on maximum perturbations to achieve margin maximization. We also provide theoretical analyses on the effectiveness of our designs. Through extensive experiments on five public datasets, we demonstrate the superiority of RGCL compared against twelve baseline models.
Comment: KDD 2024
Published: 2024

7. Source Echo Chamber: Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop

Author: Zhou, Yuqi, Dai, Sunhao, Pang, Liang, Wang, Gang, Dong, Zhenhua, Xu, Jun, and Wen, Ji-Rong
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Recently, researchers have uncovered that neural retrieval models prefer AI-generated content (AIGC), called source bias. Compared to active search behavior, recommendation represents another important means of information acquisition, where users are more prone to source bias. Furthermore, delving into the recommendation scenario, as AIGC becomes integrated within the feedback loop involving users, data, and the recommender system, it progressively contaminates the candidate items, the user interaction history, and ultimately, the data used to train the recommendation models. How and to what extent the source bias affects the neural recommendation models within the feedback loop remains unknown. In this study, we extend the investigation of source bias into the realm of recommender systems, specifically examining its impact across different phases of the feedback loop. We conceptualize the progression of AIGC integration into the recommendation content ecosystem in three distinct phases (HGC dominance, HGC-AIGC coexistence, and AIGC dominance), representing past, present, and future states, respectively. Through extensive experiments across three datasets from diverse domains, we demonstrate the prevalence of source bias and reveal a potential digital echo chamber with source bias amplification throughout the feedback loop. This trend risks creating a recommender ecosystem in which limited information sources, such as AIGC, are disproportionately recommended. To counteract this bias and prevent its escalation in the feedback loop, we introduce a black-box debiasing method that maintains model impartiality towards both HGC and AIGC. Our experimental results validate the effectiveness of the proposed debiasing method, confirming its potential to disrupt the feedback loop.
Published: 2024

8. Tool Learning with Large Language Models: A Survey

Author: Qu, Changle, Dai, Sunhao, Wei, Xiaochi, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Xu, Jun, and Wen, Ji-Rong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive survey of existing works on tool learning with LLMs. In this survey, we focus on reviewing existing literature from two primary aspects: (1) why tool learning is beneficial and (2) how tool learning is implemented, enabling a comprehensive understanding of tool learning with LLMs. We first explore the "why" by reviewing both the benefits of tool integration and the inherent benefits of the tool learning paradigm from six specific aspects. In terms of "how", we systematically review the literature according to a taxonomy of four key stages in the tool learning workflow: task planning, tool selection, tool calling, and response generation. Additionally, we provide a detailed summary of existing benchmarks and evaluation methods, categorizing them according to their relevance to different stages. Finally, we discuss current challenges and outline potential future directions, aiming to inspire both researchers and industrial developers to further explore this emerging and promising area. We also maintain a GitHub repository to continually keep track of the relevant papers and resources in this rising area at https://github.com/quchangle1/LLM-Tool-Survey.
Comment: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: 10.1007/s11704-024-40678-2
Published: 2024

9. ReCODE: Modeling Repeat Consumption with Neural ODE

Author: Dai, Sunhao, Qu, Changle, Chen, Sirui, Zhang, Xiao, and Xu, Jun
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence
Abstract: In real-world recommender systems, such as in the music domain, repeat consumption is a common phenomenon where users frequently listen to a small set of preferred songs or artists repeatedly. The key point of modeling repeat consumption is capturing the temporal patterns between a user's repeated consumption of the items. Existing studies often rely on heuristic assumptions, such as assuming an exponential distribution for the temporal gaps. However, due to the high complexity of real-world recommender systems, these pre-defined distributions may fail to capture the intricate dynamic user consumption patterns, leading to sub-optimal performance. Drawing inspiration from the flexibility of neural ordinary differential equations (ODE) in capturing the dynamics of complex systems, we propose ReCODE, a novel model-agnostic framework that utilizes neural ODE to model repeat consumption. ReCODE comprises two essential components: a user's static preference prediction module and the modeling of user dynamic repeat intention. By considering both immediate choices and historical consumption patterns, ReCODE offers comprehensive modeling of user preferences in the target context. Moreover, ReCODE seamlessly integrates with various existing recommendation models, including collaborative-based and sequential-based models, making it easily applicable in different scenarios. Experimental results on two real-world datasets consistently demonstrate that ReCODE significantly improves the performance of base models and outperforms other baseline methods.
Comment: Accepted by SIGIR 2024 (Short Paper)
Published: 2024

10. Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration

Author: Dai, Sunhao, Liu, Weihao, Zhou, Yuqi, Pang, Liang, Ruan, Rongju, Wang, Gang, Dong, Zhenhua, Xu, Jun, and Wen, Ji-Rong
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events. Through conducting over 1,000 experiments to assess state-of-the-art retrieval models against the benchmarked datasets in Cocktail, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the necessity for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era, with all data and code publicly available at https://github.com/KID-22/Cocktail.
Comment: Accepted by Findings of ACL 2024; Datasets Link: https://huggingface.co/IR-Cocktail
Published: 2024

11. Towards Completeness-Oriented Tool Retrieval for Large Language Models

Author: Qu, Changle, Dai, Sunhao, Wei, Xiaochi, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Xu, Jun, and Wen, Ji-Rong
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Recently, integrating external tools with Large Language Models (LLMs) has gained significant attention as an effective strategy to mitigate the limitations inherent in their pre-training data. However, real-world systems often incorporate a wide array of tools, making it impractical to input all tools into LLMs due to length limitations and latency constraints. Therefore, to fully exploit the potential of tool-augmented LLMs, it is crucial to develop an effective tool retrieval system. Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions, frequently leading to the retrieval of redundant, similar tools. Consequently, these methods fail to provide a complete set of diverse tools necessary for addressing the multifaceted problems encountered by LLMs. In this paper, we propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools. Specifically, we first fine-tune the PLM-based retrieval models to capture the semantic relationships between queries and tools in the semantic learning stage. Subsequently, we construct three bipartite graphs among queries, scenes, and tools and introduce a dual-view graph collaborative learning framework to capture the intricate collaborative relationships among tools during the collaborative learning stage. Extensive experiments on both the open benchmark and the newly introduced ToolLens dataset show that COLT achieves superior performance. Notably, the performance of BERT-mini (11M) with our proposed model framework outperforms BERT-large (340M), which has 30 times more parameters. Furthermore, we will release ToolLens publicly to facilitate future research on tool retrieval.
Comment: Accepted by CIKM 2024; GitHub: https://github.com/quchangle1/COLT
Published: 2024

12. Bias and Unfairness in Information Retrieval Systems: New Challenges in the LLM Era

Author: Dai, Sunhao, Xu, Chen, Xu, Shicheng, Pang, Liang, Dong, Zhenhua, and Xu, Jun
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: With the rapid advancements of large language models (LLMs), information retrieval (IR) systems, such as search engines and recommender systems, have undergone a significant paradigm shift. This evolution, while heralding new opportunities, introduces emerging challenges, particularly in terms of biases and unfairness, which may threaten the information ecosystem. In this paper, we present a comprehensive survey of existing works on emerging and pressing bias and unfairness issues in IR systems arising from the integration of LLMs. We first unify bias and unfairness issues as distribution mismatch problems, providing a groundwork for categorizing various mitigation strategies through distribution alignment. Subsequently, we systematically delve into the specific bias and unfairness issues arising from three critical stages of LLM integration into IR systems: data collection, model development, and result evaluation. In doing so, we meticulously review and analyze recent literature, focusing on the definitions, characteristics, and corresponding mitigation strategies associated with these issues. Finally, we identify and highlight some open problems and challenges for future work, aiming to inspire researchers and stakeholders in the IR field and beyond to better understand and mitigate bias and unfairness issues of IR in this LLM era. We also consistently maintain a GitHub repository for the relevant papers and resources in this rising direction at https://github.com/KID-22/LLM-IR-Bias-Fairness-Survey.
Comment: KDD 2024 Tutorial & Survey; Tutorial Website: https://llm-ir-bias-fairness.github.io/
Published: 2024

13. UOEP: User-Oriented Exploration Policy for Enhancing Long-Term User Experiences in Recommender Systems

Author: Zhang, Changshuo, Chen, Sirui, Zhang, Xiao, Dai, Sunhao, Yu, Weijie, and Xu, Jun
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence
Abstract: Reinforcement learning (RL) has gained traction for enhancing user long-term experiences in recommender systems by effectively exploring users' interests. However, modern recommender systems exhibit distinct user behavioral patterns among tens of millions of items, which increases the difficulty of exploration. For example, user behaviors with different activity levels require varying intensity of exploration, while previous studies often overlook this aspect and apply a uniform exploration strategy to all users, which ultimately hurts user experiences in the long run. To address these challenges, we propose User-Oriented Exploration Policy (UOEP), a novel approach facilitating fine-grained exploration among user groups. We first construct a distributional critic which allows policy optimization under varying quantile levels of cumulative reward feedback from users, representing user groups with varying activity levels. Guided by this critic, we devise a population of distinct actors, each aimed at effective and fine-grained exploration within its respective user group. To simultaneously enhance diversity and stability during the exploration process, we further introduce a population-level diversity regularization term and a supervision module. Experimental results on public recommendation datasets demonstrate that our approach outperforms all other baselines in terms of long-term performance, validating its user-oriented exploration effectiveness. Meanwhile, further analyses reveal our approach's benefits of improved performance for low-activity users as well as increased fairness among users.
Published: 2024

14. Neural Retrievers are Biased Towards LLM-Generated Content

Author: Dai, Sunhao, Zhou, Yuqi, Pang, Liang, Liu, Weihao, Hu, Xiaolin, Liu, Yong, Zhang, Xiao, Wang, Gang, and Xu, Jun
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search, by generating vast amounts of human-like texts on the Internet. As a result, IR systems in the LLM era are facing a new challenge: the indexed documents are now not only written by human beings but also automatically generated by the LLMs. How these LLM-generated documents influence the IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of biases in neural retrievers towards the LLM-generated content as the source bias. Moreover, we discover that this bias is not confined to the first-stage neural retrievers, but extends to the second-stage neural re-rankers. Then, in-depth analyses from the perspective of text compression indicate that LLM-generated texts exhibit more focused semantics with less noise, making it easier for neural retrieval models to perform semantic matching. To mitigate the source bias, we also propose a plug-and-play debiased constraint for the optimization objective, and experimental results show its effectiveness. Finally, we discuss the potential severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the two newly constructed benchmarks are available at https://github.com/KID-22/Source-Bias.
Comment: KDD 2024
Published: 2023

15. Uncovering ChatGPT's Capabilities in Recommender Systems

Author: Dai, Sunhao, Shao, Ninglu, Zhao, Haiyuan, Yu, Weijie, Si, Zihua, Xu, Chen, Sun, Zhongxiang, Zhang, Xiao, and Xu, Jun
Subjects: Computer Science - Information Retrieval
Abstract: The debut of ChatGPT has recently attracted the attention of the natural language processing (NLP) community and beyond. Existing studies have demonstrated that ChatGPT shows significant improvement in a range of downstream NLP tasks, but the capabilities and limitations of ChatGPT in terms of recommendations remain unclear. In this study, we aim to conduct an empirical analysis of ChatGPT's recommendation ability from an Information Retrieval (IR) perspective, including point-wise, pair-wise, and list-wise ranking. To achieve this goal, we re-formulate the above three recommendation policies into a domain-specific prompt format. Through extensive experiments on four datasets from different domains, we demonstrate that ChatGPT outperforms other large language models across all three ranking policies. Based on the analysis of unit cost improvements, we identify that ChatGPT with list-wise ranking achieves the best trade-off between cost and performance compared to point-wise and pair-wise ranking. Moreover, ChatGPT shows the potential for mitigating the cold start problem and explainable recommendation. To facilitate further explorations in this area, the full code and detailed original results are open-sourced at https://github.com/rainym00d/LLM4RS.
Comment: Accepted by RecSys 2023
Published: 2023

16. A Semi-Synthetic Dataset Generation Framework for Causal Inference in Recommender Systems

Author: Lyu, Yan, Dai, Sunhao, Wu, Peng, Dai, Quanyu, Deng, Yuhao, Hu, Wenjie, Dong, Zhenhua, Xu, Jun, Zhu, Shengyu, and Zhou, Xiao-Hua
Subjects: Computer Science - Information Retrieval
Abstract: Accurate recommendation and reliable explanation are two key issues for modern recommender systems. However, most recommendation benchmarks only concern the prediction of user-item ratings while omitting the underlying causes behind the ratings. For example, the widely-used Yahoo!R3 dataset contains little information on the causes of the user-movie ratings. A solution could be to conduct surveys and require the users to provide such information. In practice, the user surveys can hardly avoid compliance issues and sparse user responses, which greatly hinders the exploration of causality-based recommendation. To better support the studies of causal inference and further explanations in recommender systems, we propose a novel semi-synthetic data generation framework for recommender systems where causal graphical models with missingness are employed to describe the causal mechanism of practical recommendation scenarios. To illustrate the use of our framework, we construct a semi-synthetic dataset with Causal Tags And Ratings (CTAR), based on the movies as well as their descriptive tags and rating information collected from a famous movie rating website. Using the collected data and the causal graph, the user-item-ratings and their corresponding user-item-tags are automatically generated, which provides the reasons (selected tags) why the user rates the items. Descriptive statistics and baseline results regarding the CTAR dataset are also reported. The proposed data generation framework is not limited to recommendation, and the released APIs can be used to generate customized datasets for other research tasks.
Comment: 10 pages, 6 figures, 4 tables
Published: 2022

17. COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models

Author: Qu, Changle, Dai, Sunhao, Wei, Xiaochi, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Xu, Jun, and Wen, Ji-Rong
Abstract: Recently, the integration of external tools with Large Language Models (LLMs) has emerged as a promising approach to overcome the inherent constraints of their pre-training data. However, real-world applications often involve a diverse range of tools, making it infeasible to incorporate all tools directly into LLMs due to constraints on input length and response time. Therefore, to fully exploit the potential of tool-augmented LLMs, it is crucial to develop an effective tool retrieval system. Existing tool retrieval methods mainly rely on semantic matching between user queries and tool descriptions, which often results in the selection of redundant tools. As a result, these methods fail to provide a complete set of diverse tools necessary for addressing the multifaceted problems encountered by LLMs. In this paper, we propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools. Specifically, we first fine-tune the PLM-based retrieval models to capture the semantic relationships between queries and tools in the semantic learning stage. Subsequently, we construct three bipartite graphs among queries, scenes, and tools and introduce a dual-view graph collaborative learning framework to capture the intricate collaborative relationships among tools during the collaborative learning stage. Extensive experiments on both the open benchmark and the newly introduced ToolLens dataset show that COLT achieves superior performance. Notably, the performance of BERT-mini (11M) with our proposed model framework outperforms BERT-large (340M), which has 30 times more parameters. Additionally, we plan to publicly release the ToolLens dataset to support further research in tool retrieval.
Published: 2024

18. Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models

Author: Dai, Sunhao, Xu, Chen, Xu, Shicheng, Pang, Liang, Dong, Zhenhua, and Xu, Jun
Abstract: With the rapid advancement of large language models (LLMs), information retrieval (IR) systems, such as search engines and recommender systems, have undergone a significant paradigm shift. This evolution, while heralding new opportunities, introduces emerging challenges, particularly in terms of biases and unfairness, which may threaten the information ecosystem. In this paper, we present a comprehensive survey of existing works on emerging and pressing bias and unfairness issues in IR systems arising from the integration of LLMs. We first unify bias and unfairness issues as distribution mismatch problems, providing a groundwork for categorizing various mitigation strategies through distribution alignment. Subsequently, we systematically delve into the specific bias and unfairness issues arising from three critical stages of LLM integration into IR systems: data collection, model development, and result evaluation. In doing so, we meticulously review and analyze recent literature, focusing on the definitions, characteristics, and corresponding mitigation strategies associated with these issues. Finally, we identify and highlight some open problems and challenges for future work, aiming to inspire researchers and stakeholders in the IR field and beyond to better understand and mitigate bias and unfairness issues of IR in this LLM era. We also consistently maintain a GitHub repository for the relevant papers and resources in this rising direction at https://github.com/KID-22/LLM-IR-Bias-Fairness-Survey.
Published: 2024

19. Dually Enhanced Delayed Feedback Modeling for Streaming Conversion Rate Prediction

Author: Dai, Sunhao, primary, Zhou, Yuqi, additional, Xu, Jun, additional, and Wen, Ji-Rong, additional
Published: 2023

20. Uncovering ChatGPT’s Capabilities in Recommender Systems

Author: Dai, Sunhao, primary, Shao, Ninglu, additional, Zhao, Haiyuan, additional, Yu, Weijie, additional, Si, Zihua, additional, Xu, Chen, additional, Sun, Zhongxiang, additional, Zhang, Xiao, additional, and Xu, Jun, additional
Published: 2023

21. LLMs may Dominate Information Access: Neural Retrievers are Biased Towards LLM-Generated Texts

Author: Dai, Sunhao, Zhou, Yuqi, Pang, Liang, Liu, Weihao, Hu, Xiaolin, Liu, Yong, Zhang, Xiao, Wang, Gang, and Xu, Jun
Abstract: Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search. With their remarkable capabilities in generating human-like texts, LLMs have generated an enormous volume of text on the Internet. As a result, IR systems in the LLM era are facing a new challenge: the indexed documents are now not only written by human beings but also automatically generated by LLMs. How these LLM-generated documents influence IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of different IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of biases in neural retrieval models towards LLM-generated text as the source bias. Moreover, we discover that this bias is not confined to the first-stage neural retrievers, but extends to the second-stage neural re-rankers. Then, we provide an in-depth analysis from the perspective of text compression and observe that neural models can better understand the semantic information of LLM-generated text, which is further substantiated by our theoretical analysis. To mitigate the source bias, we also propose a plug-and-play debiased constraint for the optimization objective, and experimental results show its effectiveness. Finally, we discuss the potentially severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the two newly constructed benchmarks and the code will later be available at https://github.com/KID-22/LLM4IR-Bias.
Published: 2023

22. Tool learning with large language models: a survey.

Author: Qu, Changle, Dai, Sunhao, Wei, Xiaochi, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Xu, Jun, and Wen, Ji-rong
Abstract: Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive survey of existing works on tool learning with LLMs. In this survey, we focus on reviewing existing literature from two primary aspects: (1) why tool learning is beneficial and (2) how tool learning is implemented, enabling a comprehensive understanding of tool learning with LLMs. We first explore the "why" by reviewing both the benefits of tool integration and the inherent benefits of the tool learning paradigm from six specific aspects. In terms of "how", we systematically review the literature according to a taxonomy of four key stages in the tool learning workflow: task planning, tool selection, tool calling, and response generation. Additionally, we provide a detailed summary of existing benchmarks and evaluation methods, categorizing them according to their relevance to different stages. Finally, we discuss current challenges and outline potential future directions, aiming to inspire both researchers and industrial developers to further explore this emerging and promising area.
Published: 2025

23. Counteracting User Attention Bias in Music Streaming Recommendation via Reward Modification

Author: Zhang, Xiao, primary, Dai, Sunhao, additional, Xu, Jun, additional, Dong, Zhenhua, additional, Dai, Quanyu, additional, and Wen, Ji-Rong, additional
Published: 2022
