A study of the viability of exploiting memory content similarity to improve resilience to memory errors

Authors :
Kurt B. Ferreira
Patrick G. Bridges
Aidan P. Thompson
Scott Levy
Christian Robert Trott
Source :
The International Journal of High Performance Computing Applications. 29:5-20
Publication Year :
2014
Publisher :
SAGE Publications, 2014.

Abstract

Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
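
The abstract does not say how similarity was measured. As a purely illustrative sketch, and not the authors' runtime, one simple notion of memory content similarity is exact duplication of fixed-size pages within a snapshot. The function name `duplicate_page_fraction`, the 4096-byte page size, and the toy snapshot below are all assumptions made for this example.

```python
import hashlib
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical, not from the paper)


def duplicate_page_fraction(snapshot: bytes, page_size: int = PAGE_SIZE) -> float:
    """Return the fraction of pages whose content also appears in another page."""
    # Split the snapshot into fixed-size pages (the last page may be shorter).
    pages = [snapshot[i:i + page_size] for i in range(0, len(snapshot), page_size)]
    if not pages:
        return 0.0
    # Group pages by a digest of their content.
    counts = Counter(hashlib.sha256(page).digest() for page in pages)
    # A page counts as duplicated if at least one other page has the same digest.
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(pages)


# Toy snapshot: three identical zero-filled pages plus one distinct page.
snapshot = bytes(PAGE_SIZE) * 3 + b"\x01" * PAGE_SIZE
fraction = duplicate_page_fraction(snapshot)
```

Under these assumptions, a page that has an identical copy elsewhere in memory could in principle be restored from that copy after an uncorrectable error; whether and how the proposed runtime does this is not stated in the abstract.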

Details

ISSN :
1741-2846 and 1094-3420
Volume :
29
Database :
OpenAIRE
Journal :
The International Journal of High Performance Computing Applications
Accession number :
edsair.doi...........6e5cb7a57c03b3b9f4da7ca684873510
Full Text :
https://doi.org/10.1177/1094342014560354