Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling
- Authors
- Po-Han Wang, Hsiang-Yun Cheng, Li-Jhan Chen, and Chia-Lin Yang
- Subjects
- Computer science, Locality, Parallel computing, Thread (computing), Win32 Thread Information Block, Instruction set, Shared memory, Hardware and Architecture, Locality of reference, Cache, Cache algorithms
- Abstract
Modern GPGPUs support the concurrent execution of thousands of threads to provide an energy-efficient platform. However, the massive multithreading of GPGPUs incurs serious cache contention, as the cache lines brought in by one thread can easily be evicted by other threads in the small shared cache. In this paper, we propose a software-hardware cooperative approach that exploits the spatial locality among different thread blocks to better utilize the precious cache capacity. Through dynamic locality estimation and thread block scheduling, we can capture more performance improvement opportunities than prior work that only explores the spatial locality between consecutive thread blocks. Evaluations across diverse GPGPU applications show that, on average, our locality-aware scheduler provides performance improvements of 25 and 9 percent over the commonly employed round-robin scheduler and a state-of-the-art scheduler, respectively.
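The abstract's key observation is inter-thread-block spatial locality: neighboring thread blocks often touch overlapping cache lines, so dispatching them to the same streaming multiprocessor lets the later block hit in lines the earlier one already fetched. The CUDA kernel below is a minimal illustrative sketch, not code from the paper; the kernel name and the 3-point stencil access pattern are assumptions chosen only to make the cache-line sharing at block boundaries concrete.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// 3-point stencil: thread block b reads in[b*B - 1 .. b*B + B], so
// consecutive blocks touch the same boundary cache lines (illustrative
// example, not from the paper). A scheduler that dispatches blocks b and
// b+1 to the same SM lets b+1 reuse lines b already brought into L1;
// round-robin dispatch scatters them across SMs and forfeits that reuse.
__global__ void stencil3(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20, threads = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    stencil3<<<(n + threads - 1) / threads, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %f\n", out[1]);  // expect 1.000000
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With 256 threads per block and 128-byte cache lines, blocks b and b+1 share the line holding the elements around index 256*(b+1). Per the abstract, the paper's dynamic locality estimation aims to capture such sharing between arbitrary, not only consecutive, thread blocks at run time.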
- Published
- 2017