Search

Your search for author "Phanishayee, Amar" returned 76 results.


Search Results

1. Integrated Hardware Architecture and Device Placement Search

2. Data-driven Forecasting of Deep Learning Performance on GPUs

3. Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

4. DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving

5. Blox: A Modular Toolkit for Deep Learning Schedulers

6. A Study on the Intersection of GPU Utilization and CNN Inference

7. Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers

8. Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

9. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

10. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

11. Analyzing and Mitigating Data Stalls in DNN Training

12. Efficient Algorithms for Device Placement of DNN Graph Operators

13. Memory-Efficient Pipeline-Parallel DNN Training

14. Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training

15. Blink: Fast and Generic Collectives for Distributed ML

16. The Non-IID Data Quagmire of Decentralized Machine Learning

17. Themis: Fair and Efficient GPU Cluster Scheduling

18. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

19. PipeDream: Fast and Efficient Pipeline Parallel DNN Training

20. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

21. TBD: Benchmarking and Analyzing Deep Neural Network Training

22. Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

23. Compositional programming and testing of dynamic distributed systems

24. Harmony

25. Efficient large-scale language model training on GPU clusters using Megatron-LM

26. Doing more with less

28. Parameter Hub

32. FAWN: A Fast Array of Wimpy Nodes

33. ProjecToR

35. A (In)Cast of Thousands: Scaling Datacenter TCP to Kiloservers and Gigabits (CMU-PDL-09-101)

36. FAWNdamentally Power-Efficient Clusters

37. FAWN: A Fast Array of Wimpy Nodes (CMU-PDL-08-108)

38. Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems (CMU-PDL-07-105)

39. Lab of things

40. HomeLab

41. Flex-KV

42. FAWN

44. FAWN

46. Ditto

50. Ditto
