
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

Authors :
An, Wei
Bi, Xiao
Chen, Guanting
Chen, Shanhuang
Deng, Chengqi
Ding, Honghui
Dong, Kai
Du, Qiushi
Gao, Wenjun
Guan, Kang
Guo, Jianzhong
Guo, Yongqiang
Fu, Zhe
He, Ying
Huang, Panpan
Li, Jiashi
Liang, Wenfeng
Liu, Xiaodong
Liu, Xin
Liu, Yiyuan
Liu, Yuxuan
Lu, Shanghao
Lu, Xuan
Nie, Xiaotao
Pei, Tian
Qiu, Junjie
Qu, Hui
Ren, Zehui
Sha, Zhangli
Su, Xuecheng
Sun, Xiaowen
Tan, Yixuan
Tang, Minghui
Wang, Shiyu
Wang, Yaohui
Wang, Yongji
Xie, Ziwei
Xiong, Yiliang
Xu, Yanhong
Ye, Shengfeng
Yu, Shuiping
Zha, Yukun
Zhang, Liyue
Zhang, Haowei
Zhang, Mingchuan
Zhang, Wentao
Zhang, Yichao
Zhao, Chenggang
Zhao, Yao
Zhou, Shangyan
Zhou, Shunfeng
Zou, Yuheng
Publication Year :
2024

Abstract

The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased the demand for computational power and bandwidth. This, combined with the high cost of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework, and its best practices. For DL training, we deployed Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieving performance approximating that of the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.

Comment: This is the preprint version of the paper accepted for presentation at the 2024 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24). © 2024 IEEE. Personal use of this material is permitted. For other uses, permission from IEEE must be obtained. Please refer to IEEE Xplore for the final published version.
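The abstract's central scalability idea is overlapping collective communication with computation during training. The sketch below is a minimal illustration of that general technique using standard PyTorch asynchronous collectives; it is not the paper's HFReduce or HaiScale implementation, and the function names (attach_overlap_allreduce, the returned finish callback) and the per-parameter hook approach are illustrative assumptions.

# Minimal sketch of overlapping all-reduce with backward computation.
# NOT the paper's HFReduce; uses plain torch.distributed async collectives.
# Requires PyTorch >= 2.1 for register_post_accumulate_grad_hook.
import torch
import torch.distributed as dist

def attach_overlap_allreduce(model):
    """Launch an async all-reduce for each parameter's gradient as soon as
    it is produced, so communication overlaps with the rest of backward()."""
    handles = []

    def hook(param):
        # Non-blocking collective: returns immediately with a work handle
        # while the rest of the backward pass keeps running.
        work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        handles.append((work, param))

    for p in model.parameters():
        if p.requires_grad:
            # Fires right after p.grad has been accumulated during backward.
            p.register_post_accumulate_grad_hook(hook)

    def finish():
        # Call after loss.backward(): wait for communication, then average.
        world = dist.get_world_size()
        for work, p in handles:
            work.wait()
            p.grad.div_(world)
        handles.clear()

    return finish

# Usage (assuming the default process group is already initialized):
#   finish = attach_overlap_allreduce(model)
#   loss = model(batch).sum()
#   loss.backward()   # all-reduces launch while backward still proceeds
#   finish()          # synchronize and average before optimizer.step()

In a production system such as the one described in the paper, gradients would typically be bucketed into larger flattened buffers before reduction to amortize per-collective overhead; the per-parameter hook above is kept deliberately simple to show only the overlap mechanism.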

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2408.14158
Document Type :
Working Paper