Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters.

Authors :
Chen, Zhaoyun
Quan, Wei
Wen, Mei
Fang, Jianbin
Yu, Jie
Zhang, Chunyuan
Luo, Lei
Source :
IEEE Transactions on Parallel & Distributed Systems. Jan 2020, Vol. 31, Issue 1, pp. 34-50. 17 pp.
Publication Year :
2020

Abstract

Deep learning (DL) has been widely adopted in various domains of artificial intelligence (AI), achieving dramatic developments in industry and academia. Besides giant AI companies, numerous small and medium-sized enterprises, institutes, and universities (EIUs) have focused on the research and development (R&D) of DL. Considering the high cost of datacenters and high performance computing (HPC) systems, EIUs prefer adopting off-the-shelf GPU clusters as a DL R&D platform for multiple users and developers to process diverse DL workloads. In such scenarios, the scheduling of multiple DL tasks on a shared GPU cluster is both significant and challenging in terms of efficiently utilizing limited resources. Existing schedulers cannot predict the resource requirements of diverse DL workloads, leading to the under-utilization of computing resources and a decline in user satisfaction. This paper proposes GENIE, a QoS-aware dynamic scheduling framework for a shared GPU cluster, which achieves users’ QoS guarantee and high system utilization. In accordance with an exhaustive characterization, GENIE analyzes the key factors that affect the performance of DL tasks and proposes a prediction model derived from lightweight profiling to estimate the processing rate and response latency for diverse DL workloads. Based on the prediction models, we propose a QoS-aware scheduling algorithm to identify the best placements for DL tasks and schedule them on the shared cluster. Experiments on a GPU cluster and large-scale simulations demonstrate that GENIE achieves a QoS-guarantee percentage improvement of up to 67.4 percent and a makespan reduction of up to 28.2 percent, compared to other baseline schedulers. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1045-9219
Volume :
31
Issue :
1
Database :
Academic Search Index
Journal :
IEEE Transactions on Parallel & Distributed Systems
Publication Type :
Academic Journal
Accession number :
143316129
Full Text :
https://doi.org/10.1109/TPDS.2019.2931558