Data-centric Reliability Management in GPUs
- Source :
- DSN
- Publication Year :
- 2021
- Publisher :
- IEEE, 2021.
Abstract
- Graphics Processing Units (GPUs) have become the default choice of acceleration in a wide range of application domains. To keep up with computational demands, the GPU memory system is constantly being innovated from both the cache and DRAM perspectives. Such innovations can adversely affect GPU reliability and can, in fact, increase the number of multi-bit faults. To address this problem, we systematically study a wide range of GPGPU applications and find that usually only a small percentage of data needs protection to increase application resilience. This data is highly accessed and shared (it constitutes hot memory), which implies that faults in this space can often lead to incorrect application output. An in-depth analysis of application code shows that information about such data can be passed on to the hardware to guide low-overhead detection/correction schemes. In this vein, we developed low-overhead partial data replication schemes that exploit latency tolerance in GPUs. Overall, this data-centric approach dramatically improves GPGPU application resilience, with a minimal additional average performance overhead of 1.2% for detection-only and 3.4% for detection-and-correction.
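The idea of protecting only hot data through replication can be illustrated with a toy software model. This is a minimal sketch and not the paper's hardware scheme. The abstract does not say how many replicas are kept or how hot data is selected at run time, so the following are assumptions made purely for illustration: hot addresses are picked by access count from a trace (the paper instead derives this information from application code), detection compares two copies, and correction takes a majority vote over three copies. All function names and the `fraction` parameter are hypothetical.

```python
from collections import Counter


def hot_addresses(trace, fraction=0.1):
    """Return the most frequently accessed addresses in `trace`.

    Assumption: "hot" means the top `fraction` of distinct addresses
    by access count (at least one address is always returned).
    """
    counts = Counter(trace)
    k = max(1, int(len(counts) * fraction))
    return {addr for addr, _ in counts.most_common(k)}


def read_detect(primary, replica, addr):
    """Detection-only read: compare the primary copy with one replica."""
    value = primary[addr]
    if value != replica[addr]:
        raise ValueError(f"fault detected at address {addr:#x}")
    return value


def read_correct(copies, addr):
    """Detection-and-correction read: majority vote over three copies."""
    values = [copy[addr] for copy in copies]
    value, votes = Counter(values).most_common(1)[0]
    if votes < 2:
        raise ValueError(f"uncorrectable fault at address {addr:#x}")
    return value


# Hypothetical access trace: address 0x10 dominates, so it is the hot one.
trace = [0x10] * 5 + [0x20, 0x30]
hot = hot_addresses(trace)

# Replicate only the hot data, then flip several bits in one copy.
memory = {0x10: 7, 0x20: 1, 0x30: 2}
copies = [{a: memory[a] for a in hot} for _ in range(3)]
copies[1][0x10] ^= 0b1100  # simulated multi-bit fault

recovered = read_correct(copies, 0x10)
```

In this model the storage cost scales with the size of the hot set rather than with all of memory, which is the intuition behind replicating only a small percentage of the data.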
Details
- Database :
- OpenAIRE
- Journal :
- 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- Accession number :
- edsair.doi...........200a92e51e7b51f0563bd4a281a1de18
- Full Text :
- https://doi.org/10.1109/dsn48987.2021.00040