$\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs
- Publication Year :
- 2024
Abstract
- Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO's assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for $\alpha$-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that $\alpha$-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at https://github.com/junkangwu/alpha-DPO.
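- The abstract describes the mechanism only at a high level. As a rough illustration, the sketch below contrasts SimPO's fixed target reward margin with a hypothetical adaptive margin driven by the gap between the policy and reference models' preference strengths, which is the general idea the abstract attributes to $\alpha$-DPO. The exact objective, the hyperparameter names (`beta`, `gamma`, `alpha`), and the stop-gradient placement are assumptions made for this example, not details taken from the paper; see the linked repository for the authors' implementation.

```python
# Illustrative sketch only (not the authors' exact formulation).
# Inputs are assumed to be length-normalized sequence log-probabilities
# for the chosen and rejected responses under the policy and a frozen
# reference model.
import torch
import torch.nn.functional as F


def simpo_loss(policy_logps_chosen, policy_logps_rejected, beta=2.0, gamma=0.5):
    """SimPO-style loss: a fixed target reward margin gamma for every pair."""
    margin = beta * (policy_logps_chosen - policy_logps_rejected) - gamma
    return -F.logsigmoid(margin).mean()


def adaptive_margin_loss(policy_logps_chosen, policy_logps_rejected,
                         ref_logps_chosen, ref_logps_rejected,
                         beta=2.0, gamma=0.5, alpha=0.1):
    """Hypothetical adaptive-margin loss in the spirit of alpha-DPO.

    Instead of a single constant margin, each pair gets a personalized
    margin modulated by how strongly the policy already prefers the chosen
    response relative to the reference model (treated as a constant via
    stop-gradient in this sketch).
    """
    policy_gap = policy_logps_chosen - policy_logps_rejected
    ref_gap = ref_logps_chosen - ref_logps_rejected
    # Adaptive correction: grows when the policy deviates from the
    # reference model's preference strength on this pair.
    adaptive_term = alpha * (policy_gap - ref_gap).detach()
    margin = beta * policy_gap - (gamma + adaptive_term)
    return -F.logsigmoid(margin).mean()


if __name__ == "__main__":
    # Toy batch of length-normalized log-probabilities.
    torch.manual_seed(0)
    pc, pr = torch.randn(4) - 1.0, torch.randn(4) - 2.0
    rc, rr = torch.randn(4) - 1.5, torch.randn(4) - 1.5
    print("SimPO loss:          ", simpo_loss(pc, pr).item())
    print("adaptive-margin loss:", adaptive_margin_loss(pc, pr, rc, rr).item())
```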
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2410.10148
- Document Type :
- Working Paper