
ML-driven risk estimation for memory failure in a data center environment with convolutional neural networks, self-supervised data labeling and distribution-based model drift determination.

Authors :
Breitenbach, Tim
Malavalli Divakar, Shrikanth
Rasbach, Lauritz
Jahnke, Patrick
Source :
Journal of Parallel & Distributed Computing. Mar2024, Vol. 185, pN.PAG-N.PAG. 1p.
Publication Year :
2024

Abstract

• End-to-end solution for memory reliability in a data center environment, combining ML models and engineering tools.
• Memory error patterns modeled as images to apply CNN models.
• A visualization tool for memory errors at cell level.
• An MLOps framework for operating the CNN model in production.
• A concept for self-supervised data labeling.

With the trend towards multi-socket server systems, the demand for random access memory (RAM) per server has increased. The consequence is more DIMM sockets per server. Since every dual in-line memory module (DIMM), which comprises a series of dynamic random-access memory integrated circuits, has a probability of failure, RAM issues have become a dominant failure pattern for servers. The concept introduced in this work contributes to improving the reliability of data centers by avoiding RAM failures and mitigating their impact. For this purpose, an ML-driven framework is provided to estimate the probability of memory failure for each RAM module. The ML framework is based on structural information between correctable errors (CEs) and uncorrectable errors (UEs). In a common memory scenario, a corrupted bit within a module can be restored by redundancy using an error correction code (ECC), resulting in a CE. However, if there is more than one corrupted bit within a group of bits covered by the ECC, the information cannot be restored, resulting in a UE. Consequently, the task requesting the memory content, and the corresponding service, may crash. There is evidence that UEs have a CE history and that there is a structural relation between the CEs. However, for UEs without a CE history, or when the ML framework makes a false decision, we extend the overall framework with engineering measures that mitigate the impact of a UE by avoiding kernel panic and using backups. The engineering measures use a mapping between physical and logical memory addresses. [ABSTRACT FROM AUTHOR]
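The CE/UE distinction described in the abstract can be illustrated with a small sketch. The code below is not taken from the paper: it assumes a toy extended Hamming(8,4) single-error-correcting, double-error-detecting (SECDED) code, whereas real server DIMMs typically protect much wider words and may use different ECC schemes. The function names `encode` and `decode` are hypothetical.

```python
from functools import reduce
from operator import xor

# Toy SECDED code (assumption for illustration, not the paper's ECC):
# index 0 holds the overall parity bit, indices 1..7 form a Hamming(7,4)
# codeword with parity bits at positions 1, 2, 4 and data at 3, 5, 6, 7.

def encode(data):
    """Encode 4 data bits into an 8-bit SECDED word."""
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data
    word[1] = word[3] ^ word[5] ^ word[7]   # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]   # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]   # covers positions 4, 5, 6, 7
    word[0] = reduce(xor, word[1:])         # overall parity over positions 1..7
    return word

def decode(word):
    """Return (status, data); status is 'OK', 'CE' or 'UE'."""
    word = list(word)
    # XOR of the positions of all set bits is 0 for a valid codeword.
    syndrome = reduce(xor, (pos for pos in range(1, 8) if word[pos]), 0)
    overall = reduce(xor, word)
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd number of flips: treat as one flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "CE"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return "UE", None
    return status, [word[3], word[5], word[6], word[7]]
```

Flipping one bit of an encoded word yields a `CE` with the original data restored, while flipping two bits in the same word yields a `UE` with no recoverable data, which mirrors the case the abstract describes as potentially crashing the requesting task.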

Details

Language :
English
ISSN :
07437315
Volume :
185
Database :
Academic Search Index
Journal :
Journal of Parallel & Distributed Computing
Publication Type :
Academic Journal
Accession number :
174322930
Full Text :
https://doi.org/10.1016/j.jpdc.2023.104800