
Robust Fully-Asynchronous Methods for Distributed Training over General Architecture

Authors :
Zhu, Zehan
Tian, Ye
Huang, Yan
Xu, Jinming
He, Shibo
Publication Year :
2023
Publisher :
arXiv, 2023.

Abstract

Perfect synchronization in distributed machine learning problems is inefficient and even impossible due to the existence of latency, packet losses, and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across devices and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. More importantly, the proposed method utilizes two spanning-tree graphs for communication, as long as both share at least one common root, enabling flexible designs in communication architectures. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives, and to a stationary point with a sublinear rate for general non-convex settings. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms existing asynchronous SOTA algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.
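For intuition, the gradient-tracking idea the method builds on can be sketched as follows. The snippet below is a minimal, synchronous gradient-tracking loop in NumPy, not the authors' R-FAST algorithm: it omits the fully asynchronous updates, the robust tracking and buffering auxiliary variables, the packet-loss handling, and the two spanning-tree communication graphs described in the abstract. The node count, mixing matrix, step size, and quadratic local objectives are illustrative assumptions.

# Minimal, synchronous gradient-tracking sketch for intuition only.
# NOT the authors' asynchronous R-FAST method: no asynchrony, packet
# losses, buffering variables, or spanning-tree graphs are modeled.
import numpy as np

n, d, alpha, steps = 4, 3, 0.05, 200   # nodes, dimension, step size, iterations (assumptions)
rng = np.random.default_rng(0)

# Heterogeneous local objectives f_i(x) = 0.5 * ||x - b_i||^2,
# so grad_i(x) = x - b_i and the global optimum is mean(b_i).
b = rng.normal(size=(n, d))
grad = lambda i, x: x - b[i]

# Doubly stochastic mixing matrix for a 4-node ring (assumption).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

x = np.zeros((n, d))                                   # local models
y = np.stack([grad(i, x[i]) for i in range(n)])        # gradient trackers, y_i^0 = grad_i(x_i^0)

for _ in range(steps):
    x_new = W @ x - alpha * y                          # consensus step plus descent along tracked gradient
    y = W @ y + np.stack([grad(i, x_new[i]) - grad(i, x[i]) for i in range(n)])
    x = x_new

print("consensus error:", np.linalg.norm(x - x.mean(0)))
print("distance to optimum:", np.linalg.norm(x.mean(0) - b.mean(0)))

Running this drives all local models to a consensus at the minimizer of the average objective despite heterogeneous local data; the tracker y corrects for the mismatch between local and global gradients, which is the property R-FAST is designed to preserve under asynchrony and packet losses.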

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....29cda838ac328fd4881a069479a033d3
Full Text :
https://doi.org/10.48550/arxiv.2307.11617