1. Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
- Author
- Yin, Yueqin, Yang, Shentao, Xie, Yujia, Yang, Ziyi, Sun, Yuting, Awadalla, Hany, Chen, Weizhu, and Zhou, Mingyuan
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle to proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.
- Published
- 2025