Author: "Xu, Yuanzhong" / Topic: computer science - distributed, parallel, and cluster computing - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xu, Yuanzhong"' showing total 4 results

1. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Author: Zheng, Lianmin, Li, Zhuohan, Zhang, Hao, Zhuang, Yonghao, Chen, Zhifeng, Huang, Yanping, Wang, Yida, Xu, Yuanzhong, Zhuo, Danyang, Xing, Eric P., Gonzalez, Joseph E., and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Programming Languages
Abstract: Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa.
Comment: OSDI 2022
Published: 2022

2. GSPMD: General and Scalable Parallelization for ML Computation Graphs

Author: Xu, Yuanzhong, Lee, HyoukJoong, Chen, Dehao, Hechtman, Blake, Huang, Yanping, Joshi, Rahul, Krikun, Maxim, Lepikhin, Dmitry, Ly, Andy, Maggioni, Marcello, Pang, Ruoming, Shazeer, Noam, Wang, Shibo, Wang, Tao, Wu, Yonghui, and Chen, Zhifeng
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models. GSPMD infers the partitioning for every operator based on limited user annotations, making it convenient to scale existing single-device programs. It solves several technical challenges for production usage, allowing GSPMD to achieve 50% to 62% compute utilization on up to 2048 Cloud TPUv3 cores for models with up to one trillion parameters.
Published: 2021

3. Exploring the limits of Concurrency in ML Training on Google TPUs

Author: Kumar, Sameer, Bradbury, James, Young, Cliff, Wang, Yu Emma, Levskaya, Anselm, Hechtman, Blake, Chen, Dehao, Lee, HyoukJoong, Deveci, Mehmet, Kumar, Naveen, Kanwar, Pankaj, Wang, Shibo, Wanderman-Milne, Skye, Lacy, Steve, Wang, Tao, Oguntebi, Tayo, Zu, Yazhou, Xu, Yuanzhong, and Swing, Andy
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Recent results in language understanding using neural networks have required training hardware of unprecedented scale, with thousands of chips cooperating on a single training run. This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism to overcome scaling limitations from the fixed batch size in data parallelism, communication/collective optimizations, distributed evaluation of training metrics, and host input processing scaling optimizations. These techniques are demonstrated in both the TensorFlow and JAX programming frameworks. We also present performance results from the recent Google submission to the MLPerf-v0.7 benchmark contest, achieving record training times from 16 to 28 seconds in four MLPerf models on the Google TPU-v3 Multipod machine.
Published: 2020

4. Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training

Author: Xu, Yuanzhong, Lee, HyoukJoong, Chen, Dehao, Choi, Hongjun, Hechtman, Blake, and Wang, Shibo
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In data-parallel synchronous training of deep neural networks, different devices (replicas) run the same program with different partitions of the training batch, but weight update computation is repeated on all replicas, because the weights do not have a batch dimension to partition. This can be a bottleneck for performance and scalability in typical language models with large weights, and models with small per-replica batch size which is typical in large-scale training. This paper presents an approach to automatically shard the weight update computation across replicas with efficient communication primitives and data formatting, using static analysis and transformations on the training computation graph. We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs, requiring no change to model code. This technique helps close the gap between traditionally expensive (ADAM) and cheap (SGD) optimizers, as they will only take a small part of training step time and have similar peak memory usage. It helped us to achieve state-of-the-art training performance in Google's MLPerf 0.6 submission.
Comment: 12 pages, 23 figures, 1 table
Published: 2020
