BaPipe: Balanced Pipeline Parallelism for DNN Training
- Authors
- Zhao, Letian; Xu, Rui; Wang, Tianqi; Tian, Teng; Wang, Xiaotian; Wu, Wei; Ieong, Chio-In; Jin, Xi
- Subjects
- Artificial neural networks; Deep learning; Machine translation; Machine learning
- Abstract
The size of deep neural networks (DNNs) grows rapidly as the complexity of machine learning algorithms increases. Distributed deep learning based on model parallelism has been widely used to satisfy the computation and memory requirements of DNN training. In this paper, we propose a pipeline-parallel training framework called BaPipe (Balanced Pipeline) that automatically explores pipeline-parallel scheduling methods and balanced partition strategies for DNN training on heterogeneous accelerator clusters. In BaPipe, each accelerator calculates the forward and backward propagation for its assigned partition of the network, implementing an intra-batch pipeline parallelism strategy. By considering the parameters of DNN models as well as the computation, memory, and communication resources of each accelerator, BaPipe automatically selects the most suitable pipeline scheduling method from among multiple proposed scheduling modes. It also uses a novel strategy to automatically explore load balancing across inter-layer partition, intra-layer partition, and coarse-grained partition. We trained DNNs such as VGG-16, ResNet-50, and Google's Neural Machine Translation (GNMT) on GPU clusters, and simulated training performance on FPGA clusters. Compared with state-of-the-art data parallelism (DP) and pipeline parallelism frameworks, BaPipe provides a 3.2× speedup and a 4× reduction in memory on various homogeneous and heterogeneous platforms. [ABSTRACT FROM AUTHOR] (A rough illustrative sketch of balanced inter-layer partitioning appears after this record.)
- Published
- 2022
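
The abstract above describes balancing inter-layer partitions of a network across pipeline stages. As a rough, hypothetical illustration of that idea (the record does not describe BaPipe's actual partitioning or scheduling algorithms), the Python sketch below splits a list of per-layer compute-cost estimates into contiguous pipeline stages so that the slowest (bottleneck) stage is as fast as possible; all layer costs and function names here are made up for illustration.

```python
# A minimal sketch of balanced inter-layer pipeline partitioning (hypothetical;
# not BaPipe's actual algorithm). Layers are split into contiguous stages so
# that the maximum per-stage cost is minimized -- a standard min-max partition DP.

from typing import List, Tuple


def balanced_partition(layer_costs: List[float], num_stages: int) -> Tuple[float, List[List[int]]]:
    """Split layers (by index) into `num_stages` contiguous groups minimizing
    the maximum per-stage cost. Returns (bottleneck_cost, stage_layer_indices)."""
    n = len(layer_costs)
    prefix = [0.0] * (n + 1)
    for i, c in enumerate(layer_costs):
        prefix[i + 1] = prefix[i] + c

    INF = float("inf")
    # dp[s][i]: minimal bottleneck cost when the first i layers occupy s stages.
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0
    for s in range(1, num_stages + 1):
        for i in range(1, n + 1):
            for j in range(s - 1, i):
                # Stage s holds layers j..i-1; its cost is the prefix-sum difference.
                cost = max(dp[s - 1][j], prefix[i] - prefix[j])
                if cost < dp[s][i]:
                    dp[s][i] = cost
                    cut[s][i] = j

    # Recover which layer indices each stage was assigned.
    stages: List[List[int]] = []
    i = n
    for s in range(num_stages, 0, -1):
        j = cut[s][i]
        stages.append(list(range(j, i)))
        i = j
    stages.reverse()
    return dp[num_stages][n], stages


if __name__ == "__main__":
    # Hypothetical per-layer forward+backward times (ms) for a small network.
    costs = [4.0, 9.0, 7.0, 3.0, 6.0, 8.0, 2.0]
    bottleneck, stages = balanced_partition(costs, num_stages=3)
    print("bottleneck stage cost:", bottleneck)
    print("layers per stage:", stages)
```

On a heterogeneous cluster, the per-stage costs would additionally be weighted by each accelerator's compute, memory, and communication limits, as the abstract indicates BaPipe does; this sketch does not model those constraints.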