Data-centric Reliability Management in GPUs

Authors :
Evgenia Smirni
Adwait Jog
Gurunath Kadam
Source :
DSN
Publication Year :
2021
Publisher :
IEEE, 2021.

Abstract

Graphics Processing Units (GPUs) have become the default choice for acceleration in a wide range of application domains. To keep up with computational demands, the GPU memory system is constantly being innovated from both the cache and DRAM perspectives. Such innovations can adversely affect GPU reliability and, in fact, can lead to an increase in the number of multi-bit faults. To address this problem, we systematically study a wide range of GPGPU applications and find that, usually, only a small percentage of data needs protection to increase application resilience. This data is highly accessed and shared (it constitutes hot memory), which implies that faults in this space can often lead to incorrect application output. An in-depth analysis of application code shows that information about such data can be passed on to the hardware to guide low-overhead detection/correction schemes. In this vein, we developed low-overhead partial data replication schemes that exploit latency tolerance in GPUs. Overall, this data-centric approach dramatically improves GPGPU application resilience, with a minimal additional average performance overhead of 1.2% for detection-only and 3.4% for detection-and-correction.

Details

Database :
OpenAIRE
Journal :
2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Accession number :
edsair.doi...........200a92e51e7b51f0563bd4a281a1de18
Full Text :
https://doi.org/10.1109/dsn48987.2021.00040