1. Supercomputer 3D Digital Twin for User Focused Real-Time Monitoring
- Author
Bergeron, William, Hubbell, Matthew, Mojica, Daniel, Reuther, Albert, Arcand, William, Bestor, David, Burrill, Daniel, Byun, Chansup, Gadepally, Vijay, Houle, Michael, Jananthan, Hayden, Jones, Michael, Luszczek, Piotr, Michaleas, Peter, Milechin, Lauren, Mullen, Julie, Prout, Andrew, Rosa, Antonio, Yee, Charles, and Kepner, Jeremy
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
Real-time supercomputing performance analysis is a critical aspect of evaluating and optimizing computational systems in a dynamic user environment. The operation of supercomputers produces vast quantities of analytic data from multiple sources and of varying types, so compiling this data in an efficient manner is critical to the process. The MIT Lincoln Laboratory Supercomputing Center has used the Unity 3D game engine for several years to create a Digital Twin of its supercomputing systems for system monitoring. Unity offers robust visualization capabilities, making it well suited to creating a sophisticated representation of the computational processes. As we scale the systems to include a diversity of resources, such as accelerators, and add more users, we need to implement new analysis tools for the monitoring system. Research workloads change continuously, as do the capabilities of Unity, and this allows us to adapt our monitoring tools to scale and to incorporate features enabling efficient replay of system-wide events, user isolation, and machine-level granularity. Our system takes full advantage of the modern capabilities of the Unity Engine in a way that intuitively represents the real-time workload performed on a supercomputer. Its responsive user interface, which scales efficiently with large data sets, allows HPC system engineers to quickly diagnose usage-related errors.
- Published
- 2024