Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters.

Authors :: Chen, Zhaoyun
Quan, Wei
Wen, Mei
Fang, Jianbin
Yu, Jie
Zhang, Chunyuan
Luo, Lei
Source :: IEEE Transactions on Parallel & Distributed Systems. Jan 2020, Vol. 31 Issue 1, p34-50. 17p.
Publication Year :: 2020
Abstract: Deep learning (DL) has been widely adopted in various domains of artificial intelligence (AI), achieving dramatic developments in industry and academia. Besides giant AI companies, numerous small and medium-sized enterprises, institutes, and universities (EIUs) have focused on the research and development (R&D) of DL. Considering the high cost of datacenters and high performance computing (HPC) systems, EIUs prefer adopting off-the-shelf GPU clusters as a DL R&D platform for multiple users and developers to process diverse DL workloads. In such scenarios, the scheduling of multiple DL tasks on a shared GPU cluster is both significant and challenging in terms of efficiently utilizing limited resources. Existing schedulers cannot predict the resource requirements of diverse DL workloads, leading to the under-utilization of computing resources and a decline in user satisfaction. This paper proposes GENIE, a QoS-aware dynamic scheduling framework for a shared GPU cluster, which achieves users’ QoS guarantee and high system utilization. In accordance with an exhaustive characterization, GENIE analyzes the key factors that affect the performance of DL tasks and proposes a prediction model derived from lightweight profiling to estimate the processing rate and response latency for diverse DL workloads. Based on the prediction models, we propose a QoS-aware scheduling algorithm to identify the best placements for DL tasks and schedule them on the shared cluster. Experiments on a GPU cluster and large-scale simulations demonstrate that GENIE achieves a QoS-guarantee percentage improvement of up to 67.4 percent and a makespan reduction of up to 28.2 percent, compared to other baseline schedulers. [ABSTRACT FROM AUTHOR]

Subjects :: *COMPUTER scheduling
*DEEP learning
*HIGH performance computing
*GRAPHICS processing units
*PRODUCTION scheduling
*RESEARCH & development
*SMALL business

Details

Language :: English
ISSN :: 1045-9219
Volume :: 31
Issue :: 1
Database :: Academic Search Index
Journal :: IEEE Transactions on Parallel & Distributed Systems
Publication Type :: Academic Journal
Accession number :: 143316129
Full Text :: https://doi.org/10.1109/TPDS.2019.2931558


