Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC

Authors :
Timalsina, Madan
Gerhardt, Lisa
Tyler, Nicholas
Blaschke, Johannes P.
Arndt, William
Publication Year :
2024

Abstract

This paper presents an in-depth examination of checkpoint-restart mechanisms in High-Performance Computing (HPC). It focuses on the use of Distributed MultiThreaded CheckPointing (DMTCP) in various computational settings, including both within and outside of containers. The study is grounded in real-world applications running on NERSC Perlmutter, a state-of-the-art supercomputing system. We discuss the advantages of checkpoint-restart (C/R) in managing complex and lengthy computations in HPC, highlighting its efficiency and reliability in such environments. The role of DMTCP in enhancing these workflows, especially in multi-threaded and distributed applications, is thoroughly explored. Additionally, the paper delves into the use of HPC containers, such as Shifter and Podman-HPC, which aid in the management of computational tasks, ensuring uniform performance across different environments. The methods, results, and potential future directions of this research, including its application in various scientific domains, are also covered, showcasing the critical advancements made in computational methodologies through this study.

Comment: 9 pages, 4 figures

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2407.19117
Document Type :
Working Paper