
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Authors:
DeepSeek-AI
Liu, Aixin
Feng, Bei
Wang, Bin
Wang, Bingxuan
Liu, Bo
Zhao, Chenggang
Dengr, Chengqi
Ruan, Chong
Dai, Damai
Guo, Daya
Yang, Dejian
Chen, Deli
Ji, Dongjie
Li, Erhang
Lin, Fangyun
Luo, Fuli
Hao, Guangbo
Chen, Guanting
Li, Guowei
Zhang, H.
Xu, Hanwei
Yang, Hao
Zhang, Haowei
Ding, Honghui
Xin, Huajian
Gao, Huazuo
Li, Hui
Qu, Hui
Cai, J. L.
Liang, Jian
Guo, Jianzhong
Ni, Jiaqi
Li, Jiashi
Chen, Jin
Yuan, Jingyang
Qiu, Junjie
Song, Junxiao
Dong, Kai
Gao, Kaige
Guan, Kang
Wang, Lean
Zhang, Lecong
Xu, Lei
Xia, Leyi
Zhao, Liang
Zhang, Liyue
Li, Meng
Wang, Miaojun
Zhang, Mingchuan
Zhang, Minghua
Tang, Minghui
Li, Mingming
Tian, Ning
Huang, Panpan
Wang, Peiyi
Zhang, Peng
Zhu, Qihao
Chen, Qinyu
Du, Qiushi
Chen, R. J.
Jin, R. L.
Ge, Ruiqi
Pan, Ruizhe
Xu, Runxin
Chen, Ruyi
Li, S. S.
Lu, Shanghao
Zhou, Shangyan
Chen, Shanhuang
Wu, Shaoqing
Ye, Shengfeng
Ma, Shirong
Wang, Shiyu
Zhou, Shuang
Yu, Shuiping
Zhou, Shunfeng
Zheng, Size
Wang, T.
Pei, Tian
Yuan, Tian
Sun, Tianyu
Xiao, W. L.
Zeng, Wangding
An, Wei
Liu, Wen
Liang, Wenfeng
Gao, Wenjun
Zhang, Wentao
Li, X. Q.
Jin, Xiangyue
Wang, Xianzu
Bi, Xiao
Liu, Xiaodong
Wang, Xiaohan
Shen, Xiaojin
Chen, Xiaokang
Chen, Xiaosha
Nie, Xiaotao
Sun, Xiaowen
Wang, Xiaoxiang
Liu, Xin
Xie, Xin
Yu, Xingkai
Song, Xinnan
Zhou, Xinyi
Yang, Xinyu
Lu, Xuan
Su, Xuecheng
Wu, Y.
Li, Y. K.
Wei, Y. X.
Zhu, Y. X.
Xu, Yanhong
Huang, Yanping
Li, Yao
Zhao, Yao
Sun, Yaofeng
Li, Yaohui
Wang, Yaohui
Zheng, Yi
Zhang, Yichao
Xiong, Yiliang
Zhao, Yilong
He, Ying
Tang, Ying
Piao, Yishi
Dong, Yixin
Tan, Yixuan
Liu, Yiyuan
Wang, Yongji
Guo, Yongqiang
Zhu, Yuchen
Wang, Yuduan
Zou, Yuheng
Zha, Yukun
Ma, Yunxian
Yan, Yuting
You, Yuxiang
Liu, Yuxuan
Ren, Z. Z.
Ren, Zehui
Sha, Zhangli
Fu, Zhe
Huang, Zhen
Zhang, Zhen
Xie, Zhenda
Hao, Zhewen
Shao, Zhihong
Wen, Zhiniu
Xu, Zhipeng
Zhang, Zhongyu
Li, Zhuoshu
Wang, Zihan
Gu, Zihui
Li, Zilin
Xie, Ziwei
Publication Year: 2024

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures, including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality, multi-source corpus of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
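The abstract's efficiency claim for MLA rests on caching a small latent vector per token and reconstructing keys and values from it, rather than caching full per-head keys and values. The sketch below illustrates that low-rank KV-compression idea in PyTorch. It is a minimal sketch under simplifying assumptions: the module name and dimensions are illustrative, and rotary position embeddings and causal masking are omitted, so it is not the paper's exact MLA design.

```python
# Minimal sketch of latent KV compression (the idea behind MLA).
# Assumption: module name, dimensions, and structure are illustrative only;
# RoPE and causal masking are omitted for brevity.
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        # Down-projection: the latent vector is the only per-token KV state cached.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections reconstruct per-head keys and values from the latent vector.
        self.w_up_k = nn.Linear(d_latent, d_model, bias=False)
        self.w_up_v = nn.Linear(d_latent, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        # x: (batch, seq, d_model)
        b, t, _ = x.shape
        c_kv = self.w_down_kv(x)                      # (b, t, d_latent)
        if latent_cache is not None:                  # append to previously cached latents
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_up_k(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), c_kv                    # c_kv is the updated KV cache

# Usage sketch: prefill, then one decoding step reusing the latent cache.
layer = LatentKVAttention()
y, cache = layer(torch.randn(1, 4, 512))                         # cache: (1, 4, 64)
y_next, cache = layer(torch.randn(1, 1, 512), latent_cache=cache)  # cache: (1, 5, 64)
```

Per cached token this stores only d_latent floats instead of 2 * n_heads * d_head for full keys and values, which is the source of the KV-cache reduction the abstract reports.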

Details

Database: arXiv
Publication Type: Report
Accession number: edsarx.2405.04434
Document Type: Working Paper