
MPCA SGD—A Method for Distributed Training of Deep Learning Models on Spark.

Authors :
Langer, Matthias
Hall, Ashley
He, Zhen
Rahayu, Wenny
Source :
IEEE Transactions on Parallel & Distributed Systems. 11/1/2018, Vol. 29 Issue 11, p2540-2556. 17p.
Publication Year :
2018

Abstract

Many distributed deep learning systems have been published over the past few years, often accompanied by impressive performance claims. In practice, these figures are often achieved in high performance computing (HPC) environments with fast InfiniBand network connections. For the average deep learning practitioner this is usually an unrealistic scenario, since access to such facilities is unaffordable. Simple re-implementations of algorithms such as EASGD for standard Ethernet environments often fail to replicate the scalability and performance of the original works. In this paper, we explore this particular problem domain and present MPCA SGD, a method for distributed training of deep neural networks that is specifically designed to run in low-budget environments. MPCA SGD tries to make the best possible use of available resources, and can operate well even when network bandwidth is constrained. Furthermore, MPCA SGD runs on top of the popular Apache Spark framework. Thus, it can easily be deployed in existing data centers and office environments where Spark is already used. When training large deep learning models in a gigabit Ethernet cluster, MPCA SGD achieves significantly faster convergence rates than many popular alternatives. For example, MPCA SGD can train ResNet-152 up to 5.3x faster than state-of-the-art systems like MXNet, up to 5.3x faster than bulk-synchronous systems like SparkNet, and up to 5.3x faster than decentralized asynchronous systems like EASGD. [ABSTRACT FROM AUTHOR]
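The abstract itself contains no pseudocode. As a rough illustration of the kind of bulk-synchronous model averaging on Spark that systems such as SparkNet perform (and that MPCA SGD improves upon by making better use of bandwidth and overlapping local computation with communication), the following is a minimal, hypothetical Scala/Spark sketch. The object name AveragedSgdSketch, the toy least-squares model, and all hyperparameters are illustrative assumptions, not taken from the paper.

import org.apache.spark.sql.SparkSession

// Hypothetical sketch: periodic model averaging over Spark partitions.
// The model is a plain weight vector trained with a toy least-squares
// gradient; MPCA SGD's actual scheme (overlapping computation with
// communication under constrained bandwidth) is not reproduced here.
object AveragedSgdSketch {
  type Weights = Array[Double]

  // One local SGD pass over a partition, starting from the broadcast weights.
  def localSgd(init: Weights, data: Iterator[(Array[Double], Double)],
               lr: Double): Weights = {
    val w = init.clone()
    for ((x, y) <- data) {
      val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      val err = pred - y
      for (i <- w.indices) w(i) -= lr * err * x(i) // least-squares gradient step
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("averaged-sgd-sketch")
      .master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy data: features and a linear target, spread across partitions ("workers").
    val dim = 4
    val data = sc.parallelize(Seq.tabulate(1000) { i =>
      val x = Array.tabulate(dim)(j => math.abs(math.sin(i * (j + 1))))
      (x, x.sum * 0.5)
    }, numSlices = 4).cache()

    var weights: Weights = Array.fill(dim)(0.0)
    for (round <- 1 to 10) {
      val bc = sc.broadcast(weights)                 // ship current model to workers
      val updated = data.mapPartitions { part =>     // independent local SGD per partition
        Iterator.single(localSgd(bc.value, part, lr = 0.01))
      }.collect()
      weights = updated.transpose.map(_.sum / updated.length) // average the replicas
      bc.destroy()
    }
    println(weights.mkString("weights = [", ", ", "]"))
    spark.stop()
  }
}

Each round broadcasts the current weights, runs independent SGD on every partition, and averages the resulting replicas on the driver. In a gigabit Ethernet cluster, the broadcast and collect steps of such bulk-synchronous rounds become the bottleneck for large models, which is the constraint the paper targets.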

Details

Language :
English
ISSN :
1045-9219
Volume :
29
Issue :
11
Database :
Academic Search Index
Journal :
IEEE Transactions on Parallel & Distributed Systems
Publication Type :
Academic Journal
Accession number :
132313663
Full Text :
https://doi.org/10.1109/TPDS.2018.2833074