1. Deadline-Aware Offloading for High-Throughput Accelerators
- Author
- Bradford M. Beckmann, Matthew D. Sinclair, Timothy G. Rogers, and Tsung Tai Yeh
- Subjects
- Job scheduler, Queueing theory, Computer science, Network packet, Distributed computing, Computer hardware & architecture, Instruction set, Resource allocation (computer), General-purpose computing on graphics processing units, Throughput, Host (network)
- Abstract
Contemporary GPUs are widely used for throughput-oriented, data-parallel workloads and are increasingly being considered for latency-sensitive applications in datacenters. Examples include recurrent neural network (RNN) inference, network packet processing, and intelligent personal assistants. These data-parallel applications have both high throughput demands and real-time deadlines (40 μs–7 ms). Moreover, the kernels in these applications have relatively few threads and do not fully utilize the device unless a large batch size is used. However, batching forces jobs to wait, which increases their latency, especially when realistic job arrival times are considered.

Previously, programmers have managed the tradeoffs associated with concurrent, latency-sensitive jobs by combining GPU streams with advanced scheduling algorithms running on the CPU host. Although GPU streams allow the accelerator to execute multiple jobs concurrently, prior state-of-the-art solutions use the relatively distant CPU host to prioritize the latency-sensitive GPU tasks. Thus, these approaches are forced to operate at a coarse granularity and cannot quickly adapt to rapidly changing program behavior.

We observe that fine-grain, device-integrated kernel schedulers efficiently meet the deadlines of concurrent, latency-sensitive GPU jobs. To overcome the limitations of software-only, CPU-side approaches, we extend the GPU queue scheduler to manage real-time deadlines. We propose a novel laxity-aware scheduler (LAX) that uses information collected within the GPU to dynamically vary job priority based on how much laxity jobs have before their deadlines. Compared to contemporary GPUs, three state-of-the-art CPU-side schedulers, and six other advanced GPU-side schedulers, LAX meets the deadlines of 1.7x–5.0x more jobs and provides better energy efficiency, throughput, and 99th-percentile tail latency. (A minimal sketch of the laxity idea appears after this record.)
- Published
- 2021
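
To make the laxity metric concrete, here is a minimal least-laxity-first sketch in Python. It is an illustration only, under assumed per-job deadlines and remaining-execution-time estimates; the names `Job`, `laxity`, and `dispatch_next` are invented for this example, and the paper's LAX scheduler is a hardware extension of the GPU's queue scheduler, not host-side software like this.

```python
from dataclasses import dataclass

# Hypothetical job record; the fields are assumptions for this sketch,
# not the paper's data structures. Times are in seconds.
@dataclass
class Job:
    name: str
    deadline: float       # absolute deadline
    remaining_est: float  # estimated remaining execution time

def laxity(job: Job, now: float) -> float:
    # Laxity is how long a job can sit in the queue and still finish by
    # its deadline; negative laxity means the deadline will be missed.
    return (job.deadline - now) - job.remaining_est

def dispatch_next(ready_queue: list[Job], now: float) -> Job | None:
    # Dispatch the job with the least laxity, i.e. the most urgent one.
    # Recomputing laxity at every dispatch is what lets priorities vary
    # dynamically as time passes and estimates change.
    if not ready_queue:
        return None
    job = min(ready_queue, key=lambda j: laxity(j, now))
    ready_queue.remove(job)
    return job

if __name__ == "__main__":
    now = 0.0
    queue = [
        Job("rnn-infer", deadline=2e-3, remaining_est=1.5e-3),   # laxity 0.5 ms
        Job("packet-proc", deadline=4e-4, remaining_est=1e-4),   # laxity 0.3 ms
    ]
    while queue:
        job = dispatch_next(queue, now)
        print(f"dispatch {job.name} (laxity {laxity(job, now) * 1e3:.2f} ms)")
```

Run as written, the example dispatches `packet-proc` before `rnn-infer` even though it arrived with a later position in the queue, because its smaller slack makes it the more urgent job.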