
LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster.

Authors:
Yao, Feixiang
Zhang, Zhonghao
Ji, Zeyu
Liu, Bin
Gao, Haoyuan
Source:
Journal of Supercomputing. Jun 2024, Vol. 80, Issue 9, p12247-12272. 26p.
Publication Year:
2024

Abstract

As the cost of deep learning training increases, using heterogeneous GPU clusters is a reasonable way to scale cluster resources to support distributed deep learning (DDL) tasks. However, the commonly used synchronous stochastic gradient descent (SSGD) algorithm based on the bulk synchronous parallel (BSP) model suffers from stragglers in heterogeneous clusters, resulting in a significant reduction in training efficiency. To overcome this challenge, we propose load-balanced batching (LBB) to eliminate stragglers in DDL workloads. LBB first formulates the load-balancing problem and builds performance models for all workers in DDL workloads by analyzing the relationship between DDL iteration time and each worker's local batch size. LBB then balances all workers' workloads by coordinating their local batch sizes. In particular, LBB greatly mitigates static stragglers and severe dynamic stragglers by solving the load-balancing problem, and eliminates remaining stragglers through batch-size fine-tuning during training. LBB is implemented in PyTorch, and extensive experiments are performed on a heterogeneous server equipped with four GPUs of three different models. The experimental results verify the effectiveness of LBB on standard benchmarks, demonstrating that LBB can reduce training time by 64.57%, 59%, and 5.4% compared to SSGD, local SGD, and FlexRR, respectively, without sacrificing accuracy. [ABSTRACT FROM AUTHOR]
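
The abstract describes balancing workers by coordinating local batch sizes based on a per-worker performance model. As a rough illustration only, the Python sketch below shows one simple way such a proportional allocation could look under a linear iteration-time model (iteration time of worker i is approximately its per-sample time multiplied by its local batch size). The function name, the linear model, and the example numbers are assumptions for illustration and are not taken from the paper.

# Hypothetical sketch of batch-size load balancing in the spirit of LBB:
# local batch sizes are set inversely proportional to each worker's measured
# per-sample time so all workers finish an iteration at roughly the same time,
# while the global batch size stays fixed. Not the authors' implementation.

def balance_batch_sizes(per_sample_times, global_batch_size):
    """Assign local batch sizes so that t_i * b_i is roughly equal across workers.

    per_sample_times: measured seconds per sample for each worker (assumed known).
    global_batch_size: total samples per iteration, kept constant.
    """
    # Under the assumed linear model, worker i's speed is proportional to 1 / t_i.
    speeds = [1.0 / t for t in per_sample_times]
    total_speed = sum(speeds)

    # Proportional allocation, rounded down; leftover samples go to the fastest workers.
    batch_sizes = [int(global_batch_size * s / total_speed) for s in speeds]
    remainder = global_batch_size - sum(batch_sizes)
    for i in sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True):
        if remainder == 0:
            break
        batch_sizes[i] += 1
        remainder -= 1
    return batch_sizes


if __name__ == "__main__":
    # Example: a heterogeneous 4-GPU node where the slowest GPU needs 3x the
    # per-sample time of the fastest one (made-up numbers).
    times = [0.010, 0.015, 0.020, 0.030]  # seconds per sample
    print(balance_batch_sizes(times, global_batch_size=256))
    # -> [103, 68, 51, 34]: slower GPUs receive proportionally smaller local batches.

In practice the per-worker performance model would be refitted from measured iteration times and the allocation adjusted during training (the fine-tuning step the abstract mentions); this sketch only covers the static proportional split.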

Details

Language:
English
ISSN:
0920-8542
Volume:
80
Issue:
9
Database:
Academic Search Index
Journal:
Journal of Supercomputing
Publication Type:
Academic Journal
Accession Number:
177648309
Full Text:
https://doi.org/10.1007/s11227-023-05886-w