LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster.
- Source :
- Journal of Supercomputing. Jun 2024, Vol. 80 Issue 9, p12247-12272. 26p.
- Publication Year :
- 2024
Abstract
- As the cost of deep learning training increases, using heterogeneous GPU clusters is a reasonable way to scale cluster resources to support distributed deep learning (DDL) tasks. However, the commonly used synchronous stochastic gradient descent (SSGD) algorithm, which is based on the bulk synchronous parallel (BSP) model, suffers from stragglers in heterogeneous clusters, resulting in a significant reduction in training efficiency. To overcome this challenge, we propose load-balanced batching (LBB) to eliminate stragglers in DDL workloads. LBB first formulates the load balancing problem and builds performance models for all workers in DDL workloads by analyzing the relationship between DDL iteration time and each worker's local batch size. LBB then balances all workers' workloads by coordinating their local batch sizes. In particular, LBB greatly mitigates static stragglers and severe dynamic stragglers by solving the load balancing problem, and it eliminates stragglers by fine-tuning batch sizes during training. LBB is implemented in PyTorch, and extensive experiments are performed with three different models on a heterogeneous server equipped with four GPUs. The experimental results verify the effectiveness of LBB on standard benchmarks, demonstrating that LBB can significantly reduce training time by 64.57%, 59%, and 5.4% compared to SSGD, local SGD, and FlexRR, respectively, without sacrificing accuracy. [ABSTRACT FROM AUTHOR]
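The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the form of the performance model, so the sketch assumes a hypothetical linear model t_i(b) = a_i * b + c_i for each worker (per-sample cost a_i, fixed overhead c_i) and a fixed global batch size. The function name `balance_batches` and its parameters are likewise illustrative. Under those assumptions, equalizing predicted iteration times has a closed-form solution, which is then rounded to integers.

```python
from typing import List, Tuple


def balance_batches(models: List[Tuple[float, float]], global_batch: int) -> List[int]:
    """Split global_batch across workers so predicted iteration times match.

    models[i] = (a_i, c_i) is an ASSUMED linear model t_i(b) = a_i * b + c_i;
    the paper's actual performance model may differ.
    """
    inv_sum = sum(1.0 / a for a, _ in models)
    # Common iteration time T that satisfies sum_i (T - c_i) / a_i == global_batch.
    target = (global_batch + sum(c / a for a, c in models)) / inv_sum
    exact = [(target - c) / a for a, c in models]
    if min(exact) < 0:
        raise ValueError("a worker's fixed overhead exceeds the balanced iteration time")
    # Round down, then give the leftover samples to the largest fractional parts
    # so the local batch sizes still sum to the global batch size.
    sizes = [int(x) for x in exact]
    leftover = global_batch - sum(sizes)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


# Three workers whose per-sample cost differs by 2x and 4x, with no fixed overhead.
print(balance_batches([(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)], 700))
```

In this sketch the fastest worker receives the largest local batch, so all workers are predicted to finish an iteration at the same time. The fine-tuning during training that the abstract mentions is not modeled here.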
- Subjects :
- *BUILDING performance
- *DEEP learning
Details
- Language :
- English
- ISSN :
- 09208542
- Volume :
- 80
- Issue :
- 9
- Database :
- Academic Search Index
- Journal :
- Journal of Supercomputing
- Publication Type :
- Academic Journal
- Accession number :
- 177648309
- Full Text :
- https://doi.org/10.1007/s11227-023-05886-w