
Scalable Fully Pipelined Hardware Architecture for In-Network Aggregated AllReduce Communication.

Authors :
Liu, Yao
Zhang, Junyi
Liu, Shuo
Wang, Qiaoling
Dai, Wangchen
Cheung, Ray Chak Chung
Source :
IEEE Transactions on Circuits & Systems. Part I: Regular Papers. Oct 2021, Vol. 68 Issue 10, p4194-4206. 13p.
Publication Year :
2021

Abstract

The Ring-AllReduce framework is currently the most popular solution for deploying industry-level distributed machine learning tasks. However, it achieves only about half of the maximum bandwidth even under optimal conditions. In recent years, several in-network aggregation frameworks have been proposed to overcome this drawback, but limited hardware information has been disclosed. In this paper, we propose a scalable, fully pipelined architecture that handles tasks such as forwarding, aggregation, and retransmission with no bandwidth loss. The architecture is implemented on a Xilinx UltraScale FPGA connected to 8 working servers with 10 Gb/s network adapters, and it can scale to more complicated scenarios involving more workers. Compared with Ring-AllReduce, AllReduce-Switch improves the effective bandwidth of AllReduce communication by a ratio of 1.75×. In image training tasks, the proposed hardware architecture achieves up to 1.67× speedup of the training process. For computing-intensive models, the speedup from communication may be partially hidden by computation. In particular, for ResNet-50, AllReduce-Switch improves the training process with MPI and NCCL by 1.30× and 1.04×, respectively. [ABSTRACT FROM AUTHOR]
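
The roughly half-bandwidth limit of Ring-AllReduce, and the reported 1.75× gain at 8 workers, follow from the standard cost model in which each worker moves 2(N-1)/N times the buffer size over the ring. Below is a minimal back-of-envelope Python sketch of that model; the function names and the idealized in-network efficiency of 1.0 are illustrative assumptions, not details taken from the paper.

# Back-of-envelope comparison (not from the paper's text): effective
# AllReduce bandwidth of Ring-AllReduce vs. a single in-network
# aggregation hop, assuming N workers on links of raw bandwidth B.

def ring_allreduce_efficiency(n_workers: int) -> float:
    # Ring-AllReduce moves 2*(N-1)/N * M bytes per worker for an
    # M-byte reduction, so the effective bandwidth is a fraction
    # N / (2*(N-1)) of the link rate, approaching 1/2 as N grows.
    return n_workers / (2 * (n_workers - 1))

def in_network_efficiency() -> float:
    # With switch-side aggregation each worker sends and receives the
    # M-byte buffer exactly once, so the link can be fully utilized.
    return 1.0

if __name__ == "__main__":
    n = 8  # the paper's testbed size
    ratio = in_network_efficiency() / ring_allreduce_efficiency(n)
    print(f"Ring efficiency @ {n} workers: {ring_allreduce_efficiency(n):.3f}")
    print(f"Ideal in-network vs. ring ratio: {ratio:.2f}x")  # prints 1.75x

For N = 8 the ring efficiency is 8/14 ≈ 0.57, so an ideal aggregation switch that sustains full line rate is better by 2(N-1)/N = 1.75, consistent with the abstract's reported figure.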

Details

Language :
English
ISSN :
1549-8328
Volume :
68
Issue :
10
Database :
Academic Search Index
Journal :
IEEE Transactions on Circuits & Systems. Part I: Regular Papers
Publication Type :
Periodical
Accession number :
153763139
Full Text :
https://doi.org/10.1109/TCSI.2021.3098841