Scalable Fully Pipelined Hardware Architecture for In-Network Aggregated AllReduce Communication
- Author
- Liu, Yao; Zhang, Junyi; Liu, Shuo; Wang, Qiaoling; Dai, Wangchen; Cheung, Ray Chak Chung
- Subjects
- Machine learning, Peer-to-peer architecture (Computer networks), Hardware, Bandwidths, Task analysis
- Abstract
The Ring-AllReduce framework is currently the most popular solution for deploying industry-level distributed machine learning tasks. However, even under optimal conditions, it achieves only about half of the maximum bandwidth. In recent years, several in-network aggregation frameworks have been proposed to overcome this drawback, but limited hardware information has been disclosed. In this paper, we propose a scalable fully-pipelined architecture that handles tasks such as forwarding, aggregation, and retransmission with no bandwidth loss. The architecture is implemented on a Xilinx UltraScale FPGA connected to 8 worker servers with 10 Gb/s network adapters, and it can scale to more complex scenarios involving more workers. Compared with Ring-AllReduce, AllReduce-Switch improves the effective bandwidth of AllReduce communication by $1.75\times$. In image training tasks, the proposed hardware architecture achieves up to a $1.67\times$ speedup of the training process. For compute-intensive models, the communication speedup may be partially hidden by computation. In particular, for ResNet-50, AllReduce-Switch accelerates training with MPI and NCCL by $1.30\times$ and $1.04\times$, respectively. [ABSTRACT FROM AUTHOR]
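As a rough aid to the bandwidth claims above, the sketch below works through the standard Ring-AllReduce cost model that the "about half of the maximum bandwidth" figure refers to; the symbols N (workers), M (gradient buffer size), and B (link bandwidth) are illustrative notation, not taken from the paper.

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    % Standard Ring-AllReduce cost model (illustrative notation N, M, B).
    With $N$ workers, a buffer of size $M$, and link bandwidth $B$,
    Ring-AllReduce moves $2\frac{N-1}{N}M$ bytes over each link
    (reduce-scatter plus all-gather), so its effective bandwidth is
    \[
      B_{\mathrm{eff}}^{\mathrm{ring}}
        = \frac{M}{\,2\frac{N-1}{N}M / B\,}
        = \frac{N}{2(N-1)}\,B
        \;\longrightarrow\; \tfrac{1}{2}B \quad (N \to \infty).
    \]
    % With an aggregating switch, each worker sends and receives M once,
    % so the effective bandwidth approaches the full link rate B.
    The ideal gain over the ring is therefore $2(N-1)/N$; for the
    8-worker setup this is $2 \cdot 7 / 8 = 1.75$, consistent with the
    reported $1.75\times$ effective-bandwidth improvement.
    \end{document}

This accounting only bounds the communication phase; end-to-end training speedups (e.g. the $1.30\times$ and $1.04\times$ figures for ResNet-50) are smaller because computation overlaps with and partially hides communication.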
- Published
- 2021