Back to Search Start Over

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers.

Authors :
Jaulmes, Luc
Moreto, Miquel
Ayguade, Eduard
Labarta, Jesus
Valero, Mateo
Casas, Marc
Source :
IEEE Transactions on Parallel & Distributed Systems. Sep2018, Vol. 29 Issue 9, p1961-1974. 14p.
Publication Year :
2018

Abstract

Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g., by Error Correcting Codes (ECC). For a program to be fault-tolerant, it needs to also handle the Errors that are Detected and Uncorrected (DUE), such as an ECC encountering too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique, rolling back the program state to a previously taken checkpoint, sets back any progress done since then. Alternately, application specific techniques exist, such as restarting an iterative program with its latest iteration's values as initial guess. We introduce a novel error correction technique for iterative linear solvers, designed to preserve both the progress made and the solver's future convergence by recovering the program's state exactly. Leveraging the asynchrony of task-based programming models, we mask our technique's overhead by overlapping error correction with the solver's normal workload. Our technique relies on analysing solvers to find redundancy in the form of relations between data. We are then able to restore discarded or corrupted data by recomputing or inverting the appropriate relations. We demonstrate that this approach allows to recover any part of three widely used Krylov subspace methods: CG, GMRES and BiCGStab, and their pre-conditioned versions. We implement our technique for CG and recover lost data at the scale of a memory page, which is the granularity at which Operating Systems (OS) report memory errors on commodity hardware, and study the effect of varying the memory page size to address non-standard sizes and the possible use of huge pages in High Performance Computing (HPC). When compared to checkpointing and to the state-of-the-art algorithmic restart technique, on small (8 cores) to large scale (1024 cores), our methods show less overhead. A trade-off arises between our straightforward and asynchronous approaches, based on the rate at which faults happen. At the lowest considered rate and page size, overlapping recoveries decreases their average cost from 5.40 to 2.24 percent of the ideal faultless execution time. Our methods generally outperform the state-of-the-art even with increased overheads on big page sizes, and perform similarly on edge cases. These results also indicate that our techniques are increasingly efficient as the matrix size increases. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
10459219
Volume :
29
Issue :
9
Database :
Academic Search Index
Journal :
IEEE Transactions on Parallel & Distributed Systems
Publication Type :
Academic Journal
Accession number :
131229247
Full Text :
https://doi.org/10.1109/TPDS.2018.2817524