Kimi k1.5: Scaling Reinforcement Learning with LLMs

Authors :
Kimi Team
Du, Angang
Gao, Bofei
Xing, Bowei
Jiang, Changjiu
Chen, Cheng
Li, Cheng
Xiao, Chenjun
Du, Chenzhuang
Liao, Chonghua
Tang, Chuning
Wang, Congcong
Zhang, Dehao
Yuan, Enming
Lu, Enzhe
Tang, Fengxiang
Sung, Flood
Wei, Guangda
Lai, Guokun
Guo, Haiqing
Zhu, Han
Ding, Hao
Hu, Hao
Yang, Hao
Zhang, Hao
Yao, Haotian
Zhao, Haotian
Lu, Haoyu
Li, Haoze
Yu, Haozhen
Gao, Hongcheng
Zheng, Huabin
Yuan, Huan
Chen, Jia
Guo, Jianhang
Su, Jianlin
Wang, Jianzhou
Zhao, Jie
Zhang, Jin
Liu, Jingyuan
Yan, Junjie
Wu, Junyan
Shi, Lidong
Ye, Ling
Yu, Longhui
Dong, Mengnan
Zhang, Neo
Ma, Ningchen
Pan, Qiwei
Gong, Qucheng
Liu, Shaowei
Ma, Shengling
Wei, Shupeng
Cao, Sihan
Huang, Siying
Jiang, Tao
Gao, Weihao
Xiong, Weimin
He, Weiran
Huang, Weixiao
Wu, Wenhao
He, Wenyang
Wei, Xianghui
Jia, Xianqing
Wu, Xingzhe
Xu, Xinran
Zu, Xinxing
Zhou, Xinyu
Pan, Xuehai
Charles, Y.
Li, Yang
Hu, Yangyang
Liu, Yangyang
Chen, Yanru
Wang, Yejie
Liu, Yibo
Qin, Yidao
Liu, Yifeng
Yang, Ying
Bao, Yiping
Du, Yulun
Wu, Yuxin
Wang, Yuzhi
Zhou, Zaida
Wang, Zhaoji
Li, Zhaowei
Zhu, Zhen
Zhang, Zheng
Wang, Zhexu
Yang, Zhilin
Huang, Zhiqi
Huang, Zihao
Xu, Ziyao
Yang, Zonghan
Publication Year :
2025

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, and 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

Comment: 25 pages
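As a rough illustration of the kind of setup the abstract alludes to (not the authors' implementation), the sketch below shows a single REINFORCE-style policy-gradient update driven only by a scalar outcome reward, with no learned value function, process reward model, or Monte Carlo tree search. The function name, the HuggingFace-style model(input_ids).logits interface, and the binary correctness reward are assumptions made for this example.

    # Hedged sketch, not the authors' code: one outcome-reward policy-gradient update.
    import torch
    import torch.nn.functional as F

    def policy_gradient_step(model, optimizer, prompt_ids, response_ids, reward):
        """Scale the response log-likelihood by a scalar outcome reward
        (e.g., 1.0 if the final answer is verified correct, else 0.0)."""
        input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
        logits = model(input_ids).logits  # assumes a causal LM exposing .logits

        # Log-probabilities of the sampled response tokens under the current policy;
        # logits at position t predict the token at position t + 1.
        resp_logits = logits[:, prompt_ids.size(-1) - 1 : -1, :]
        log_probs = F.log_softmax(resp_logits, dim=-1)
        token_logp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)

        # Outcome-only objective: no per-step (process) rewards, no learned critic.
        loss = -(reward * token_logp.sum())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In practice one would sample many responses per prompt, verify final answers to obtain rewards, and average the resulting gradients; the point here is only that the update needs nothing beyond the policy itself and an outcome-level reward signal.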

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2501.12599
Document Type :
Working Paper