1. Design of an adaptive GPU sharing and scheduling scheme in container-based cluster
- Author
- Qichen Chen, Yoonhee Kim, Seoyoung Kim, and Jisun Oh
- Subjects
- Computer Networks and Communications, Computer Science, Distributed Computing, Software Development, Networking and Telecommunications, Workload, Virtualization, Supercomputing, Scheduling (computing), Out of Memory, Software Portability, Software
- Abstract
Container-based virtualization is an innovative technology that accelerates software development by providing portability and maintainability of applications. Recently, a growing number of workloads such as high-performance computing (HPC) and deep learning (DL) have been deployed in container-based environments. However, GPU resource management in container-based clusters, especially the GPU memory oversubscription problem, which brings substantial performance loss, remains challenging. This paper proposes an adaptive fair-share method to share GPU resources effectively in a container-based virtualization environment, as well as an execution rescheduling method that manages the execution order of each container to maximize performance gain. We also propose a checkpoint-based mechanism, designed specifically for DL workloads running on TensorFlow, that can efficiently resolve the GPU memory oversubscription problem. We demonstrate that our approach improves overall performance and resource utilization compared to the default and static fair-share methods under both homogeneous and heterogeneous workloads. Compared with these two baselines, the proposed method reduces average execution time by 16.37% and 15.61%, and increases average GPU memory utilization by approximately 52.46% and 10.3%, respectively. We also evaluated our checkpoint-based mechanism by running multiple CNN workloads on TensorFlow concurrently; the results show that the proposed mechanism ensures each workload executes safely without out-of-memory (OOM) errors.
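To make the fair-share idea concrete, the sketch below shows one plausible proportional-allocation rule: grant each container its requested GPU memory when the device has room, and scale all requests down uniformly when the total oversubscribes capacity. All names and the specific rescaling rule are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of fair-share GPU memory allocation.
# The function name, inputs, and proportional scaling rule are
# assumptions for illustration; the paper's adaptive method is not
# reproduced here.

def fair_share(requests_mb, gpu_capacity_mb):
    """Return per-container memory grants that never exceed capacity."""
    total = sum(requests_mb.values())
    if total <= gpu_capacity_mb:
        # No oversubscription: grant every request in full.
        return dict(requests_mb)
    # Oversubscribed: shrink each request by the same factor so the
    # grants fit on the device, avoiding OOM errors.
    scale = gpu_capacity_mb / total
    return {c: int(r * scale) for c, r in requests_mb.items()}

# Three containers request 16 GB in total on an 8 GB GPU; each grant
# is halved so the sum stays within capacity.
grants = fair_share({"c1": 6000, "c2": 6000, "c3": 4000}, 8000)
```

A static fair-share policy would instead give every container a fixed equal slice regardless of demand; the proportional variant above is one simple way to adapt grants to actual requests.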
- Published
- 2019