Scalable Fully Pipelined Hardware Architecture for In-Network Aggregated AllReduce Communication.
- Source :
- IEEE Transactions on Circuits & Systems. Part I: Regular Papers. Oct 2021, Vol. 68 Issue 10, p4194-4206. 13p.
- Publication Year :
- 2021
Abstract
- The Ring-AllReduce framework is currently the most popular solution for deploying industry-level distributed machine learning tasks. However, it achieves only about half of the maximum bandwidth even under optimal conditions. In recent years, several in-network aggregation frameworks have been proposed to overcome this drawback, but limited hardware information has been disclosed. In this paper, we propose a scalable, fully pipelined architecture that handles tasks such as forwarding, aggregation, and retransmission with no bandwidth loss. The architecture is implemented on a Xilinx UltraScale FPGA that connects to 8 working servers with 10 Gb/s network adapters, and it can scale to more complicated scenarios involving more workers. Compared with Ring-AllReduce, using AllReduce-Switch improves the efficient bandwidth of AllReduce communication by $1.75\times$. In image training tasks, the proposed hardware architecture achieves up to a $1.67\times$ speedup of the training process. For computing-intensive models, the speedup from communication may be partially hidden by computing. In particular, for ResNet-50, AllReduce-Switch speeds up training with MPI and NCCL by $1.30\times$ and $1.04\times$, respectively. [ABSTRACT FROM AUTHOR]
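The "about half of the maximum bandwidth" remark can be illustrated with the textbook cost model for Ring-AllReduce. The sketch below is not taken from the paper. It assumes the standard reduce-scatter plus all-gather ring schedule and an idealized switch that sums the workers' data at line rate, so each worker sends its payload exactly once. All function names are hypothetical. Under these assumptions the ideal gain for 8 workers is 7/4, which coincides with the reported $1.75\times$. The abstract does not say whether the authors derive their figure this way.

```python
from fractions import Fraction


def ring_allreduce_bytes_sent(num_workers: int, payload: int) -> Fraction:
    """Bytes each worker sends in a textbook Ring-AllReduce.

    Reduce-scatter and all-gather each take (N - 1) steps, and every step
    moves payload / N bytes per worker, giving 2 * (N - 1) / N * payload.
    """
    n = num_workers
    return Fraction(2 * (n - 1) * payload, n)


def switch_allreduce_bytes_sent(num_workers: int, payload: int) -> Fraction:
    """Bytes each worker sends to an idealized aggregating switch (assumption).

    The worker count does not matter here: the switch does the summation,
    so each worker uploads its payload once and downloads the result once.
    """
    return Fraction(payload)


def effective_bandwidth_fraction(bytes_sent: Fraction, payload: int) -> Fraction:
    """Payload reduced per byte put on the link; 1 means full line rate."""
    return Fraction(payload) / bytes_sent


def ideal_gain(num_workers: int, payload: int = 1_000_000) -> Fraction:
    """Ratio of switch-based to ring-based effective bandwidth in this model."""
    ring = effective_bandwidth_fraction(
        ring_allreduce_bytes_sent(num_workers, payload), payload
    )
    switch = effective_bandwidth_fraction(
        switch_allreduce_bytes_sent(num_workers, payload), payload
    )
    return switch / ring


if __name__ == "__main__":
    payload = 1_000_000  # bytes of gradients per worker (illustrative value)
    ring = ring_allreduce_bytes_sent(8, payload)
    print(effective_bandwidth_fraction(ring, payload))  # 4/7, about 0.57
    print(ideal_gain(8))                                # 7/4, i.e. 1.75
```

In this model the ring's effective bandwidth fraction is N / (2(N - 1)), which tends to 1/2 as the number of workers grows. This is consistent with the abstract's statement about the optimal case.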
Details
- Language :
- English
- ISSN :
- 15498328
- Volume :
- 68
- Issue :
- 10
- Database :
- Academic Search Index
- Journal :
- IEEE Transactions on Circuits & Systems. Part I: Regular Papers
- Publication Type :
- Periodical
- Accession number :
- 153763139
- Full Text :
- https://doi.org/10.1109/TCSI.2021.3098841