
YAYI 2: Multilingual Open-Source Large Language Models

Authors:
Luo, Yin
Kong, Qingchao
Xu, Nan
Cao, Jia
Hao, Bao
Qu, Baoyu
Chen, Bo
Zhu, Chao
Zhao, Chenyang
Zhang, Donglei
Feng, Fan
Zhao, Feifei
Sun, Hailong
Yang, Hanxuan
Pan, Haojun
Liu, Hongyu
Guo, Jianbin
Du, Jiangtao
Wang, Jingyi
Li, Junfeng
Sun, Lei
Liu, Liduo
Dong, Lifeng
Liu, Lili
Wang, Lin
Zhang, Liwen
Wang, Minzheng
Wang, Pin
Yu, Ping
Li, Qingxiao
Yan, Rui
Zou, Rui
Li, Ruiqun
Huang, Taiwen
Wang, Xiaodong
Wu, Xiaofei
Peng, Xin
Zhang, Xina
Fang, Xing
Xiao, Xinglin
Hao, Yanni
Dong, Yao
Wang, Yigang
Liu, Ying
Jiang, Yongyu
Wang, Yungan
Wang, Yuqi
Wang, Zhangsheng
Yu, Zhaoxin
Luo, Zhen
Mao, Wenji
Wang, Lei
Zeng, Dajun
Publication Year:
2023

Abstract

As the latest advancement in natural language processing, large language models (LLMs) have achieved human-level language understanding and generation abilities in many real-world tasks, and have even been regarded as a potential path to artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been proposed and have achieved performance comparable to proprietary models. However, these models are primarily designed for English scenarios and perform poorly in Chinese contexts. In this technical report, we propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus containing 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning on millions of instructions and reinforcement learning from human feedback. Extensive experiments on multiple benchmarks, such as MMLU and CMMLU, consistently demonstrate that the proposed YAYI 2 outperforms other open-source models of similar size.
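Since the abstract describes YAYI 2 as open-source base and chat models, the following is a minimal sketch of how such a released checkpoint could be loaded and queried with the Hugging Face transformers library. The repository id "wenge-research/yayi2-30b" and the use of the standard causal-LM interface are assumptions, not confirmed by this record; consult the official release for exact model names and any chat prompt format.

```python
# Minimal sketch: loading an open-source YAYI 2 checkpoint with transformers.
# The repo id below is an assumption for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wenge-research/yayi2-30b"  # assumed repository id for the 30B base model

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 30B parameters: bf16 plus multi-GPU sharding
    device_map="auto",
    trust_remote_code=True,
)

prompt = "请简要介绍大语言模型的预训练流程。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```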

Details

Database:
arXiv
Publication Type:
Report
Accession number:
edsarx.2312.14862
Document Type:
Working Paper