SMGuard: A Flexible and Fine-Grained Resource Management Framework for GPUs.

Authors :
Yu, Chao
Bai, Yuebin
Yang, Hailong
Cheng, Kun
Gu, Yuhao
Luan, Zhongzhi
Qian, Depei
Source :
IEEE Transactions on Parallel & Distributed Systems. 12/1/2018, Vol. 29 Issue 12, p2849-2862. 14p.
Publication Year :
2018

Abstract

GPUs have become an indispensable computing platform in data centers, and co-locating multiple applications on the same GPU is widely used to improve resource utilization. However, performance interference caused by uncontrolled resource contention severely degrades the performance of co-located applications and fails to deliver a satisfactory user experience. In this paper, we present SMGuard, a software approach to flexibly manage the GPU resource usage of multiple applications under co-location. We also propose CapSM, a capacity-based GPU resource model that provisions GPU resources among co-located applications at a fine granularity. When latency-sensitive applications are co-located with batch applications, SMGuard uses a quota-based mechanism to prevent batch applications from occupying resources without constraint, and a reservation-based mechanism to guarantee the resource usage of latency-sensitive applications. In addition, SMGuard supports dynamic resource adjustment: it evicts the running thread blocks of batch applications to release the resources they occupy and remaps the uncompleted thread blocks to the remaining resources, which avoids relaunching the preempted kernel. SMGuard is a pure software solution that does not rely on special GPU hardware or a special programming model, which makes it easy to adopt on commodity GPUs in data centers. Our evaluation shows that SMGuard improves the average performance of latency-sensitive applications by 9.8× when they are co-located with batch applications. Meanwhile, GPU utilization improves by 35 percent on average. [ABSTRACT FROM AUTHOR]
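
The quota, reservation, and evict-and-remap ideas in the abstract can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: the abstract does not define CapSM's capacity units or SMGuard's interfaces, so the function names `provision` and `run_with_eviction`, the abstract "capacity units", and the one-step-per-block timing are all hypothetical and are not taken from the paper.

```python
from collections import deque


def provision(total, ls_demand, batch_demand, reservation, batch_quota):
    """Return (ls_grant, batch_grant) in abstract capacity units (hypothetical model)."""
    # Quota: batch work may not exceed its quota or intrude on the reserved share.
    batch_cap = min(batch_quota, total - reservation)
    batch_grant = min(batch_demand, batch_cap)
    # Reservation: latency-sensitive work may use whatever batch work does not hold,
    # which is always at least the reserved share.
    ls_grant = min(ls_demand, total - batch_grant)
    return ls_grant, batch_grant


def run_with_eviction(num_blocks, slots, evict_step, slots_after):
    """Simulate one batch kernel whose logical thread blocks are pulled from a queue.

    Each step, every resident slot runs one logical block to completion.
    At `evict_step`, the slot count changes to `slots_after`; blocks that were
    running on released slots are put back on the queue (remapped) instead of
    forcing a kernel relaunch. Returns the block IDs in completion order.
    """
    if slots < 1 or slots_after < 1:
        raise ValueError("at least one slot must remain for the batch kernel")
    pending = deque(range(num_blocks))
    completed = []
    step = 0
    while pending:
        running = [pending.popleft() for _ in range(min(slots, len(pending)))]
        if step == evict_step:
            # Evict blocks on the released slots before they finish ...
            evicted = running[slots_after:]
            running = running[:slots_after]
            # ... and remap them to the front of the queue for the surviving slots.
            pending.extendleft(reversed(evicted))
            slots = slots_after
        completed.extend(running)
        step += 1
    return completed


if __name__ == "__main__":
    print(provision(total=100, ls_demand=50, batch_demand=90,
                    reservation=40, batch_quota=70))
    print(run_with_eviction(num_blocks=8, slots=4, evict_step=1, slots_after=2))
```

In this toy model the batch kernel is capped at 60 units even though it asks for 90, and every logical block still completes exactly once after half of its slots are taken away, which is the property that lets a preempted kernel continue without a relaunch.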

Details

Language :
English
ISSN :
1045-9219
Volume :
29
Issue :
12
Database :
Academic Search Index
Journal :
IEEE Transactions on Parallel & Distributed Systems
Publication Type :
Academic Journal
Accession number :
132967338
Full Text :
https://doi.org/10.1109/TPDS.2018.2848621