ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library
- Authors
Qianyuan Ran, Xiaowei Jiang, Yingya Zhang, Shanyuan Gao, Pengcheng Li, Heng Pan, Lingbo Tang, Liuyihan Song, Jie Zhang, Pan Pan, Fei Feng, Hao Li, Yong Li, Shaochuang Wang, Zhisheng Xia, Guohui Wang, Jianbo Dong, Xin Long, Zheng Cao, and Yiqun Guo
- Subjects
Collective communication, Hardware and Architecture, Computer science, Server, Distributed computing, Scalability, Bandwidth (signal processing), Parallel algorithm, Electrical and Electronic Engineering, Routing (electronic design automation), Performance improvement, Training (civil), Software
- Abstract
Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that fully exploit heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.
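To illustrate the kind of collective operation such a library optimizes, below is a minimal, hedged sketch of the classic ring all-reduce pattern (reduce-scatter followed by all-gather), simulated serially over in-memory "ranks". The function name and structure are illustrative assumptions for exposition only; they are not ACCL's actual API or its specific algorithms for heterogeneous interconnects.

```python
def ring_allreduce(buffers):
    """Sum-all-reduce equal-length lists in place across n simulated ranks,
    using the textbook ring algorithm (reduce-scatter, then all-gather).
    Illustrative sketch only; not ACCL's API."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must be divisible by rank count"
    chunk = size // n

    def seg(c):
        # Slice covering chunk index c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At step s, rank r forwards chunk (r - s) mod n
    # to its ring neighbor (r + 1) mod n, which accumulates it. Sends are
    # snapshotted first to mimic simultaneous transfers.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, list(buffers[r][seg((r - step) % n)]))
                 for r in range(n)]
        for r, c, data in sends:
            dst = (r + 1) % n
            buffers[dst][seg(c)] = [a + b
                                    for a, b in zip(buffers[dst][seg(c)], data)]

    # Phase 2: all-gather. Each fully reduced chunk (rank r now owns chunk
    # (r + 1) mod n) circulates around the ring, overwriting stale copies.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(buffers[r][seg((r + 1 - step) % n)]))
                 for r in range(n)]
        for r, c, data in sends:
            buffers[(r + 1) % n][seg(c)] = data
    return buffers
```

Each rank transfers 2(n-1)/n of the buffer in total, which is what makes the ring variant bandwidth-optimal and a common baseline that production libraries refine for specific interconnect topologies.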
- Published
- 2021