Author: "Wellein, Gerhard" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wellein, Gerhard"' showing total 388 results

1. Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Author: Lacey, Dane C., Alappat, Christie L., Lange, Florian, Hager, Georg, Fehske, Holger, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph-based formulation of the SpMV and employs a level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a "traditional" distributed memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster., Comment: 15 pages, 12 figures, 5 tables; added affiliation & extended acknowledgment
Published: 2024

2. Charge-order melting in the one-dimensional Edwards model

Author: Lange, Florian, Wellein, Gerhard, and Fehske, Holger
Subjects: Condensed Matter - Strongly Correlated Electrons
Abstract: We use infinite matrix-product-state techniques to study the time evolution of the charge-density-wave (CDW) order after a quench or a light pulse in a fundamental fermion-boson model. The motion of fermions in the model is linked to the creation of bosonic excitations, which counteracts the melting of the CDW order. For low-energy quenches corresponding to a change of the boson relaxation rate, we find behavior similar to that in an effective $t$-$V$ model. When the boson energy is quenched instead or a light pulse is applied to the system, the transient dynamics are more complex, with the CDW order first quickly decreasing to an intermediate value while the density-wave-like order of the bosons rises. In the case of pulse irradiation, the subsequent time-evolution of the CDW order depends strongly on the photon frequency. For frequencies slightly below the boson energy, we observe a temporary increase of the CDW order parameter. Our results reveal the complex physics of driven Mott insulators in low-dimensional systems with strong correlations., Comment: 6 pages, 6 figures, supplemental material
Published: 2024
Full Text: View/download PDF

3. Alya towards Exascale: Optimal OpenACC Performance of the Navier-Stokes Finite Element Assembly on GPUs

Author: Owen, Herbert, Ernst, Dominik, Gruber, Thomas, Lemkuhl, Oriol, Houzeaux, Guillaume, Gasparino, Lucas, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: This paper addresses the challenge of providing portable and highly efficient code structures for CPU and GPU architectures. We choose the assembly of the right-hand term in the incompressible flow module of the High-Performance Computational Mechanics code Alya, which is one of the two CFD codes in the Unified European Benchmark Suite. Starting from an efficient CPU-code and a related OpenACC-port for GPUs we successively investigate performance potentials arising from code specialization, algorithmic restructuring and low-level optimizations. We demonstrate that only the combination of these different dimensions of runtime optimization unveils the full performance potential on the GPU and CPU. Roofline-based performance modelling is applied in this process and we demonstrate the need to investigate new optimization strategies if a classical roofline limit such as memory bandwidth utilization is achieved, rather than stopping the process. The final unified OpenACC-based implementation boosts performance by more than 50x on an NVIDIA A100 GPU (achieving approximately 2.5 TF/s FP64) and a further factor of 5x for an Intel Icelake based CPU-node (achieving approximately 1.0 TF/s FP64). The insights gained in our manual approach lay the groundwork for implementing unified but still highly efficient code structures for related kernels in Alya and other applications. These can be realized by manual coding or automatic code generation frameworks.
Published: 2024
Full Text: View/download PDF

4. CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion

Author: Laukemann, Jan, Gruber, Thomas, Hager, Georg, Oryspayev, Dossay, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measurements and microbenchmarks to study memory data traffic behavior, we can connect the breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs. For serial and full-node cases we are able to predict the memory data volume analytically with an error of a few percent. We find that if the number of processes is prime, SpecI2M fails to work properly, which we can attribute to short inner loops emerging from the one-dimensional domain decomposition in this case. We can also rule out other possible causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance., Comment: 19 pages including artifact appendix; 11 figures, 1 table; numerous corrections, esp. in Table 1
Published: 2023
Full Text: View/download PDF

5. Physical Oscillator Model for Supercomputing

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Physics - Computational Physics
Abstract: A parallel program together with the parallel hardware it is running on is not only a vehicle to solve numerical problems, it is also a complex system with interesting dynamical behavior: resynchronization and desynchronization of parallel processes, propagating phases of idleness, and the peculiar effects of noise and system topology are just a few examples. We propose a physical oscillator model (POM) to describe aspects of the dynamics of interacting parallel processes. Motivated by the well-known Kuramoto Model, a process with its regular compute-communicate cycles is modeled as an oscillator which is coupled to other oscillators (processes) via an interaction potential. Instead of a simple all-to-all connectivity, we employ a sparse topology matrix mapping the communication structure and thus the inter-process dependencies of the program onto the oscillator model and propose two interaction potentials that are suitable for different scenarios in parallel computing: resource-scalable and resource-bottlenecked applications. The former are not limited by a resource bottleneck such as memory bandwidth or network contention, while the latter are. Unlike the original Kuramoto model, which has a periodic sinusoidal potential that is attractive for small angles, our characteristic potentials are always attractive for large angles and only differ in the short-distance behavior. We show that the model with appropriate potentials can mimic the propagation of delays and the synchronizing and desynchronizing behavior of scalable and bottlenecked parallel programs, respectively., Comment: 5 pages, 2 figures
Published: 2023
Full Text: View/download PDF

6. SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In this work, fundamental performance, power, and energy characteristics of the full SPEChpc 2021 benchmark suite are assessed on two different clusters based on Intel Ice Lake and Sapphire Rapids CPUs using the MPI-only codes' variants. We use memory bandwidth, data volume, and scalability metrics in order to categorize the benchmarks and pinpoint relevant performance and scalability bottlenecks on the node and cluster levels. Common patterns such as memory bandwidth limitation, dominating communication and synchronization overhead, MPI serialization, superlinear scaling, and alignment issues could be identified, in isolation or in combination, showing that SPEChpc 2021 is representative of many HPC workloads. Power dissipation and energy measurements indicate that the modern Intel server CPUs have such a high idle power level that race-to-idle is the paramount strategy for energy to solution and energy-delay product minimization. On the chip level, only memory-bound code shows a clear advantage of Sapphire Rapids compared to Ice Lake in terms of energy to solution., Comment: 9 pages, 6 figures; corrected links to system docs
Published: 2023
Full Text: View/download PDF

7. Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

Author: Alappat, Christie, Thies, Jonas, Hager, Georg, Fehske, Holger, and Wellein, Gerhard
Subjects: Mathematics - Numerical Analysis, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches has been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms allow the polynomial to be evaluated using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix A and a dense vector x, or a related operation. Recently we have shown that using the level-based formulation of sparse matrix-vector multiplications in the Recursive Algebraic Coloring Engine (RACE) framework we can perform temporal cache blocking of MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain reduces significantly, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers., Comment: 25 pages, 11 figures, 3 tables
Published: 2023

8. Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Author: Afzal, Ayesha, Hager, Georg, Markidis, Stefano, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate injection of delays can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics. They are useful in spotting certain patterns in parallel execution that will easily go unnoticed with traditional tracing tools. We investigate five different microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers, and the LULESH and HPCG proxy applications., Comment: 18 pages, 14 figures, 7 tables. Corrected Fig. 4 layout
Published: 2023
Full Text: View/download PDF

9. MD-Bench: Engineering the in-core performance of short-range molecular dynamics kernels from state-of-the-art simulation packages

Author: Machado, Rafael Ravedutti Lucio, Eitzinger, Jan, Laukemann, Jan, Hager, Georg, Köstler, Harald, and Wellein, Gerhard
Subjects: Physics - Chemical Physics, Computer Science - Performance, Physics - Computational Physics
Abstract: Molecular dynamics (MD) simulations provide considerable benefits for the investigation and experimentation of systems at atomic level. Their usage is widespread across several research fields, but their system size and timescale are also crucially limited by the computing power they can make use of. Performance engineering of MD kernels is therefore important to understand their bottlenecks and point out possible improvements. For that reason, we developed MD-Bench, a proxy-app for short-range MD kernels that implements state-of-the-art algorithms from multiple production applications such as LAMMPS and GROMACS. MD-Bench is intended to have simpler, understandable and extensible source code, as well as to be transparent and suitable for teaching, benchmarking and researching MD algorithms. In this paper we introduce MD-Bench and describe its design, structure, and implemented algorithms. Finally, we show five usage examples of MD-Bench and describe how they help to gain a deeper understanding of MD kernels from a performance point of view, also exposing some interesting performance insights., Comment: 17 pages, 10 figures, 5 tables. arXiv admin note: text overlap with arXiv:2207.13094
Published: 2023

10. MD-Bench: A generic proxy-app toolbox for state-of-the-art molecular dynamics algorithms

Author: Machado, Rafael Ravedutti Lucio, Eitzinger, Jan, Köstler, Harald, and Wellein, Gerhard
Subjects: Physics - Computational Physics, Computer Science - Performance
Abstract: Proxy-apps, or mini-apps, are simple self-contained benchmark codes with performance-relevant kernels extracted from real applications. Initially used to facilitate software-hardware co-design, they are a crucial ingredient for serious performance engineering, especially when dealing with large-scale production codes. MD-Bench is a new proxy-app in the area of classical short-range molecular dynamics. In contrast to existing proxy-apps in MD (e.g. miniMD and coMD) it does not resemble a single application code, but implements state-of-the-art algorithms from multiple applications (currently LAMMPS and GROMACS). The MD-Bench source code is understandable, extensible and suited for teaching, benchmarking and researching MD algorithms. Primary design goals are transparency and simplicity; a developer is able to tinker with the source code down to the assembly level. This paper introduces MD-Bench, explains its design and structure, covers implemented optimization variants, and illustrates its usage on three examples., Comment: 12 Pages, 2 figures, submitted to PPAM22
Published: 2022

11. Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Author: Afzal, Ayesha, Hager, Georg, Wellein, Gerhard, and Markidis, Stefano
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Performance
Abstract: This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics., Comment: 12 pages, 9 figures, 1 table
Published: 2022
Full Text: View/download PDF

12. The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper we investigate the effect of "bottleneck evasion" and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code so that its processes become desynchronized. This occurs most prominently in memory-bound programs, which is why we choose memory-bound benchmark and application codes, specifically an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code to demonstrate the consequences of desynchronization on two different supercomputing platforms. We investigate the role of idle waves as possible triggers for desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structures, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) precipitates performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime., Comment: 13 pages, 7 figures, 6 tables
Published: 2022
Full Text: View/download PDF

13. Level-based Blocking for Sparse Matrices: Sparse Matrix-Power-Vector Multiplication

Author: Alappat, Christie L., Hager, Georg, Schenk, Olaf, and Wellein, Gerhard
Subjects: Mathematics - Numerical Analysis, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: The multiplication of a sparse matrix with a dense vector (SpMV) is a key component in many numerical schemes and its performance is known to be severely limited by main memory access. Several numerical schemes require the multiplication of a sparse matrix polynomial with a dense vector, which is typically implemented as a sequence of SpMVs. This results in low performance and ignores the potential to increase the arithmetic intensity by reusing the matrix data from cache. In this work we use the recursive algebraic coloring engine (RACE) to enable blocking of sparse matrix data across the polynomial computations. In the graph representing the sparse matrix we form levels using a breadth-first search. Locality relations of these levels are then used to improve spatial and temporal locality when accessing the matrix data and to implement an efficient multithreaded parallelization. Our approach is independent of the matrix structure and avoids shortcomings of existing "blocking" strategies in terms of hardware efficiency and parallelization overhead. We quantify the quality of our implementation using performance modelling and demonstrate speedups of up to 3$\times$ and 5$\times$ compared to an optimal SpMV-based baseline on a single multicore chip of recent Intel and AMD architectures. As a potential application, we demonstrate the benefit of our implementation for a Chebyshev time propagation scheme, representing the class of polynomial approximations to exponential integrators. Further numerical schemes which may benefit from our developments include $s$-step Krylov solvers and power clustering algorithms., Comment: 18 pages, 19 figures, 3 tables
Published: 2022
Full Text: View/download PDF

14. Analytical Performance Estimation during Code Generation on Modern GPUs

Author: Ernst, Dominik, Holzer, Markus, Hager, Georg, Knorr, Matthias, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning to select the best-performing configuration. This paper identifies the relevant performance-defining mechanisms for memory-intensive GPU applications through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient code candidates with high accuracy. We examine the changes of the A100 GPU architecture compared to the predecessor V100 and address the challenges of how to model the data transfer volumes through the new memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D-25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels but can be integrated into any code generator that can generate the required address expressions., Comment: arXiv admin note: substantial text overlap with arXiv:2107.01143
Published: 2022
Full Text: View/download PDF

15. Opening the Black Box: Performance Estimation during Code Generation for GPUs

Author: Ernst, Dominik, Hager, Georg, Holzer, Markus, Knorr, Matthias, and Wellein, Gerhard
Subjects: Computer Science - Performance, C.4
Abstract: Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D-25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.
Published: 2021
Full Text: View/download PDF

16. Valley filtering in strain-induced $\alpha$-$\mathcal{T}_3$ quantum dots

Author: Filusch, Alexander, Bishop, Alan R., Saxena, Avadh, Wellein, Gerhard, and Fehske, Holger
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics
Abstract: We test the valley-filtering capabilities of a quantum dot inscribed by locally straining an $\alpha$-$\mathcal{T}_3$ lattice. Specifically, we consider an out-of-plane Gaussian bump in the center of a four-terminal configuration and calculate the generated pseudomagnetic field having an opposite direction for electrons originating from different valleys, the resulting valley-polarized currents, and the conductance between the injector and collector situated opposite one another. Depending on the quantum dot's width and width-to-height ratio, we detect different transport regimes with and without valley filtering for both the $\alpha$-$\mathcal{T}_3$ and dice lattice structures. In addition, we analyze the essence of the conductance resonances with a high valley polarization in terms of related (pseudo-) Landau levels, the spatial distribution of the local density of states, and the local current densities. The observed local charge and current density patterns reflect the local inversion symmetry breaking by the strain, besides the global inversion symmetry breaking due to the scaling parameter $\alpha$. In this way we can also filter out different sublattices., Comment: 8 pages, 5 figures, slightly corrected version
Published: 2021
Full Text: View/download PDF

17. Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Most distributed-memory bulk-synchronous parallel programs in HPC assume that compute resources are available continuously and homogeneously across the allocated set of compute nodes. However, long one-off delays on individual processes can cause global disturbances, so-called idle waves, by rippling through the system. This process is mainly governed by the communication topology of the underlying parallel code. This paper makes significant contributions to the understanding of idle wave dynamics. We study the propagation mechanisms of idle waves across the ranks of MPI-parallel programs. We present a validated analytic model for their propagation velocity with respect to communication parameters and topology, with a special emphasis on sparse communication patterns. We study the interaction of idle waves with MPI collectives and show that, depending on the implementation, a collective may be transparent to the wave. Finally we analyze two mechanisms of idle wave decay: topological decay, which is rooted in differences in communication characteristics among parts of the system, and noise-induced decay, which is caused by system or application noise. We show that noise-induced decay is largely independent of noise characteristics but depends only on the overall noise power. An analytic expression for idle wave decay rate with respect to noise power is derived. For model validation we use microbenchmarks and stencil algorithms on three different supercomputing platforms., Comment: 19 pages, 10 figures, 2 tables
Published: 2021
Full Text: View/download PDF

18. ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

Author: Alappat, Christie, Meyer, Nils, Laukemann, Jan, Gruber, Thomas, Hager, Georg, Wellein, Gerhard, and Wettig, Tilo
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing, High Energy Physics - Lattice
Abstract: The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory (ECM) performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics (QCD). For SpMV we show why the CRS matrix storage format is not a good practical choice on this architecture and how the SELL-C-sigma format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW., Comment: 32 pages, 25 figures, 6 tables
Published: 2021
Full Text: View/download PDF

19. MD-Bench: A Generic Proxy-App Toolbox for State-of-the-Art Molecular Dynamics Algorithms

Author: Ravedutti Lucio Machado, Rafael, Eitzinger, Jan, Köstler, Harald, and Wellein, Gerhard
Editor: Wyrzykowski, Roman, Dongarra, Jack, Deelman, Ewa, and Karczewski, Konrad
Series Editors: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), and Yung, Moti (Editorial Board Member)
Published: 2023
Full Text: View/download PDF

20. Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Author: Afzal, Ayesha, Hager, Georg, Wellein, Gerhard, and Markidis, Stefano
Editor: Wyrzykowski, Roman, Dongarra, Jack, Deelman, Ewa, and Karczewski, Konrad
Series Editors: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), and Yung, Moti (Editorial Board Member)
Published: 2023
Full Text: View/download PDF

21. MD-Bench: A performance-focused prototyping harness for state-of-the-art short-range molecular dynamics algorithms

Author: Ravedutti Lucio Machado, Rafael, Eitzinger, Jan, Laukemann, Jan, Hager, Georg, Köstler, Harald, and Wellein, Gerhard
Published: 2023
Full Text: View/download PDF

22. An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop cannot be upheld. Motivated by observations on plain and modified versions of the HPCG benchmark, we construct a performance model of execution of memory-bound loop kernels. It can predict the memory bandwidth share per kernel on a memory contention domain depending on the number of active cores and which other workload the kernel is paired with. The only code features required are the single-thread cache line access frequency per kernel, which is directly related to the single-thread memory bandwidth, and its saturated bandwidth. It can either be measured directly or predicted using the Execution-Cache-Memory (ECM) performance model. The computational intensity of the kernels and the detailed structure of the code are of no significance. We validate our model on Intel Broadwell, Intel Cascade Lake, and AMD Rome processors pairing various streaming and stencil kernels. The error in predicting the bandwidth share per kernel is less than 8%., Comment: 10 pages, 9 figures
Published: 2020

23. Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

Author: Alappat, Christie L., Laukemann, Jan, Gruber, Thomas, Hager, Georg, Wellein, Gerhard, Meyer, Nils, and Wettig, Tilo
Subjects: Computer Science - Performance, Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The A64FX CPU powers the current number one supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. We also identify architectural peculiarities and derive optimization hints. Applying the ECM model to sparse matrix-vector multiplication (SpMV), we motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format with suitable code optimizations can achieve bandwidth saturation for SpMV., Comment: 6 pages, 5 figures, 3 tables
Published: 2020
Full Text: View/download PDF
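
For readers unfamiliar with the CRS (compressed row storage) format named in the abstract above, the following is a minimal sketch of a sparse matrix-vector product over CRS arrays. The function name `spmv_crs` and the array names are illustrative assumptions and are not taken from the paper; the plain Python loop only shows the access pattern and says nothing about performance on any architecture.

```python
def spmv_crs(row_ptr, col_idx, values, x):
    """Compute y = A * x for a matrix A stored in CRS format.

    row_ptr[i] .. row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx and values hold their column indices and numeric values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Inner loop length equals the number of nonzeros in this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access to x
        y[row] = acc
    return y


# Hypothetical 3x3 example matrix:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_crs(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

The inner loop runs over one row at a time, so its trip count is the row length, which is typically short for sparse matrices.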

24. Multiway $p$-spectral graph cuts on Grassmann manifolds

Author: Pasadakis, Dimosthenis, Alappat, Christie Louis, Schenk, Olaf, and Wellein, Gerhard
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning, 68R10 (Primary), 90C27 (Secondary), G.2.1, G.2.2
Abstract: Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. We present a novel direct multiway spectral clustering algorithm in the $p$-norm, for $p \in (1, 2]$. The problem of computing multiple eigenvectors of the graph $p$-Laplacian, a nonlinear generalization of the standard graph Laplacian, is recast as an unconstrained minimization problem on a Grassmann manifold. The value of $p$ is reduced in a pseudocontinuous manner, promoting sparser solution vectors that correspond to optimal graph cuts as $p$ approaches one. Monitoring the monotonic decrease of the balanced graph cuts guarantees that we obtain the best available solution from the $p$-levels considered. We demonstrate the effectiveness and accuracy of our algorithm in various artificial test-cases. Our numerical examples and comparative results with various state-of-the-art clustering methods indicate that the proposed method obtains high quality clusters both in terms of balanced graph cut metrics and in terms of the accuracy of the labelling assignment. Furthermore, we conduct studies for the classification of facial images and handwritten characters to demonstrate the applicability in real-world datasets.
Published: 2020
Full Text: View/download PDF

25. Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Author: Alappat, Christie L., Hofmann, Johannes, Hager, Georg, Fehske, Holger, Bishop, Alan R., and Wellein, Gerhard
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are relevant for performance are not even known to most application users or developers. Microbenchmarks, i.e., simple codes that fathom a particular aspect of the hardware, can help to shed light on such issues, but only if they are well understood and if the results can be reconciled with known facts or performance models. The insight gained from microbenchmarks may then be applied to real applications for performance analysis or optimization. In this paper we investigate two modern Intel x86 server CPU architectures in depth: Broadwell EP and Cascade Lake SP. We highlight relevant hardware configuration settings that can have a decisive impact on code performance and show how to properly measure on-chip and off-chip data transfer bandwidths. The new victim L3 cache of Cascade Lake and its advanced replacement policy receive due attention. Finally we use DGEMM, sparse matrix-vector multiplication, and the HPCG benchmark to make a connection to relevant application scenarios., Comment: 19 pages, 9 figures, 3 tables. Corrected affiliations and acknowledgments
Published: 2020
Full Text: View/download PDF

26. Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time often does not yield a satisfactory prediction of parallel runtime due to deviations from the expected simple lockstep pattern caused by system noise, variations in communication time, and inherent load imbalance. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks we observe that although desynchronization can introduce increased waiting time per process, it does not necessarily cause lower resource utilization but can lead to an increase in available bandwidth per core. In case of significant communication overhead, even natural noise can shove the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and the emerging stable wave pattern. We also demonstrate how hybrid MPI-OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting., Comment: 18 pages, 8 figures
Published: 2020
Full Text: View/download PDF

27. Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

Author: Laukemann, Jan, Hammer, Julian, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements., Comment: 6 pages, 3 figures
Published: 2019
Full Text: View/download PDF
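
The abstract above states that a throughput prediction gives a lower bound on kernel runtime and the critical path gives an upper bound. The sketch below illustrates the two quantities on a toy instruction list. It is a simplified model under stated assumptions, not OSACA's actual algorithm or interface: the dictionary layout, port names, latencies, and the four-instruction kernel are all hypothetical.

```python
def throughput_bound(instructions):
    """Largest summed occupancy (in cycles) over all execution ports."""
    load = {}
    for instr in instructions:
        for port, cycles in instr["ports"].items():
            load[port] = load.get(port, 0.0) + cycles
    return max(load.values())


def critical_path(instructions):
    """Longest latency chain through the dependency graph.

    Instructions are given in program order; "deps" lists the indices
    of earlier instructions whose results are consumed.
    """
    finish = []
    for instr in instructions:
        start = max((finish[d] for d in instr["deps"]), default=0.0)
        finish.append(start + instr["latency"])
    return max(finish)


# Hypothetical kernel: load -> multiply -> add -> store
kernel = [
    {"ports": {"p2": 1.0}, "latency": 4.0, "deps": []},   # load
    {"ports": {"p0": 1.0}, "latency": 4.0, "deps": [0]},  # multiply
    {"ports": {"p0": 1.0}, "latency": 4.0, "deps": [1]},  # add
    {"ports": {"p4": 1.0}, "latency": 1.0, "deps": [2]},  # store
]
lower = throughput_bound(kernel)  # busiest port: p0 with 2 cycles
upper = critical_path(kernel)     # 4 + 4 + 4 + 1 = 13 cycles
```

In this toy case the two bounds differ widely because every instruction depends on the previous one; a kernel with independent instructions would move the critical path closer to the throughput bound.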

28. Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Author: Afzal, Ayesha, primary, Hager, Georg, additional, Wellein, Gerhard, additional, and Markidis, Stefano, additional
Published: 2023
Full Text: View/download PDF

29. MD-Bench: A Generic Proxy-App Toolbox for State-of-the-Art Molecular Dynamics Algorithms

Author: Ravedutti Lucio Machado, Rafael, primary, Eitzinger, Jan, additional, Köstler, Harald, additional, and Wellein, Gerhard, additional
Published: 2023
Full Text: View/download PDF

30. A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication

Author: Alappat, Christie L., Hager, Georg, Schenk, Olaf, Thies, Jonas, Basermann, Achim, Bishop, Alan R., Fehske, Holger, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the Roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring., Comment: 40 pages, 23 figures
Published: 2019
Full Text: View/download PDF

31. Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

Author: Hofmann, Johannes, Alappat, Christie L., Hager, Georg, Fey, Dietmar, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, Computer Science - Performance
Abstract: We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises the loop code, problem sizes, and other runtime parameters, while a machine model is an abstraction of all performance-relevant properties of a CPU. We introduce a generic method for determining machine models and present results for relevant server-processor architectures by Intel, AMD, IBM, and Marvell/Cavium. Considering this wide range of architectures, the set of features required for adequate performance modeling is surprisingly small. To validate our approach, we compare performance predictions to empirical data for an OpenMP-parallel preconditioned CG algorithm, which includes compute- and memory-bound kernels. Both single- and multicore analysis shows that the model exhibits average and maximum relative errors of 5% and 10%. Deviations from the model and insights gained are discussed in detail., Comment: 12 pages, 7 figures
Published: 2019
Full Text: View/download PDF

32. Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT

Author: Hornich, Julian, Hammer, Julian, Hager, Georg, Gruber, Thomas, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Stencil algorithms have been receiving considerable interest in HPC research for decades. The techniques used to approach multi-core stencil performance modeling and engineering span basic runtime measurements, elaborate performance models, detailed hardware counter analysis, and thorough scaling behavior evaluation. Due to the plurality of approaches and stencil patterns, we set out to develop a generalizable methodology for reproducible measurements accompanied by state-of-the-art performance models. Our open-source toolchain and collected results are publicly available in the "Intranode Stencil Performance Evaluation Collection" (INSPECT). We present the underlying methodologies, models and tools involved in gathering and documenting the performance behavior of a collection of typical stencil patterns across multiple architectures and hardware configuration options. Our aim is to endow performance-aware application developers with reproducible baseline performance data and validated models to initiate a well-defined process of performance assessment and optimization.
Published: 2019
Full Text: View/download PDF

33. Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") destroy the assumptions of regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long delays that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. We analyze the dependence of the propagation speed of idle waves emanating from injected delays with respect to the execution and communication properties of the application, study how such delays decay under increased noise levels, and how they interact with each other. We also show how fine-grained noise can make a system immune against the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications., Comment: 10 pages, 9 figures; title changed
Published: 2019
Full Text: View/download PDF

34. Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs

Author: Ernst, Dominik, Hager, Georg, Thies, Jonas, and Wellein, Gerhard
Subjects: Computer Science - Mathematical Software, Computer Science - Performance
Abstract: General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model in this case. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU., Comment: 12 pages, 22 figures. Extended version of arXiv:1905.03136v1 for journal submission
Published: 2019
Full Text: View/download PDF
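
The roofline model referred to in the abstract above bounds attainable performance by the smaller of the peak compute rate and the product of computational intensity and memory bandwidth. A minimal sketch follows; the function name and the machine numbers are hypothetical placeholders, not measurements from the paper.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Attainable performance in Gflop/s under the basic roofline model."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gbs)


# Hypothetical device: 7000 Gflop/s peak, 800 GB/s memory bandwidth.
PEAK = 7000.0
BANDWIDTH = 800.0

memory_bound = roofline(PEAK, BANDWIDTH, 0.5)    # limited by bandwidth
compute_bound = roofline(PEAK, BANDWIDTH, 20.0)  # limited by peak

# "At least 2/3 of the roofline performance" for the memory-bound case:
target = 2.0 / 3.0 * memory_bound
```

With these placeholder numbers, a kernel at 0.5 flop/byte is bandwidth-limited, while one at 20 flop/byte hits the compute ceiling.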

35. Analytic Performance Modeling and Analysis of Detailed Neuron Simulations

Author: Cremonesi, Francesco, Hager, Georg, Wellein, Gerhard, and Schürmann, Felix
Subjects: Computer Science - Performance, Computer Science - Computational Engineering, Finance, and Science
Abstract: Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel computer performance has been supporting these developments, and at the same time maintainers of neuroscientific simulation code have strived to optimally and efficiently exploit new hardware features. Current state-of-the-art software for the simulation of biological networks has so far been developed using performance engineering practices, but a thorough analysis and modeling of the computational and performance characteristics, especially in the case of morphologically detailed neuron simulations, is lacking. Other computational sciences have successfully used analytic performance engineering and modeling methods to gain insight on the computational properties of simulation kernels, aid developers in performance optimizations and eventually drive co-design efforts, but to our knowledge a model-based performance analysis of neuron simulations has not yet been conducted. We present a detailed study of the shared-memory performance of morphologically detailed neuron simulations based on the Execution-Cache-Memory (ECM) performance model. We demonstrate that this model can deliver accurate predictions of the runtime of almost all the kernels that constitute the neuron models under investigation. The gained insight is used to identify the main governing mechanisms underlying performance bottlenecks in the simulation. The implications of this analysis on the optimization of neural simulation software and eventually co-design of future hardware architectures are discussed. In this sense, our work represents a valuable conceptual and quantitative contribution to understanding the performance properties of biological network simulations., Comment: 18 pages, 6 figures, 15 tables
Published: 2019
Full Text: View/download PDF

36. Analytical performance estimation during code generation on modern GPUs

Author: Ernst, Dominik, Holzer, Markus, Hager, Georg, Knorr, Matthias, and Wellein, Gerhard
Published: 2023
Full Text: View/download PDF

37. Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

Author: Laukemann, Jan, Hammer, Julian, Hofmann, Johannes, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance, Computer Science - Software Engineering
Abstract: An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally we give an outlook on how the method may be generalized to new architectures., Comment: 11 pages, 4 figures, 7 tables
Published: 2018
Full Text: View/download PDF

38. Benefits from using mixed precision computations in the ELPA-AEO and ESSEX-II eigensolver projects

Author: Alvermann, Andreas, Basermann, Achim, Bungartz, Hans-Joachim, Carbogno, Christian, Ernst, Dominik, Fehske, Holger, Futamura, Yasunori, Galgon, Martin, Hager, Georg, Huber, Sarah, Huckle, Thomas, Ida, Akihiro, Imakura, Akira, Kawai, Masatoshi, Köcher, Simone, Kreutzer, Moritz, Kus, Pavel, Lang, Bruno, Lederer, Hermann, Manin, Valeriy, Marek, Andreas, Nakajima, Kengo, Nemec, Lydia, Reuter, Karsten, Rippl, Michael, Röhrig-Zöllner, Melven, Sakurai, Tetsuya, Scheffler, Matthias, Scheurer, Christoph, Shahzad, Faisal, Brambila, Danilo Simoes, Thies, Jonas, and Wellein, Gerhard
Subjects: Physics - Computational Physics, Condensed Matter - Materials Science
Abstract: We first briefly report on the status and recent achievements of the ELPA-AEO (Eigenvalue Solvers for Petaflop Applications - Algorithmic Extensions and Optimizations) and ESSEX II (Equipping Sparse Solvers for Exascale) projects. In both collaborative efforts, scientists from the application areas, mathematicians, and computer scientists work together to develop and make available efficient highly parallel methods for the solution of eigenvalue problems. Then we focus on a topic addressed in both projects, the use of mixed precision computations to enhance efficiency. We give a more detailed description of our approaches for benefiting from either lower or higher precision in three selected contexts and of the results thus obtained.
Published: 2018

39. Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs

Author: Kreutzer, Moritz, Hager, Georg, Ernst, Dominik, Fehske, Holger, Bishop, Alan R., and Wellein, Gerhard
Subjects: Computer Science - Mathematical Software, Computer Science - Performance, Physics - Computational Physics
Abstract: Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi "Knights Landing" and Nvidia "Pascal." We propose two optimizations: (1) Subspace blocking is applied for improved performance and data access efficiency. We also show that it allows transparently handling problems much larger than the high-bandwidth memory without significant performance penalties. (2) Pipelining of communication and computation phases of successive subspaces is implemented to hide communication costs without extra memory traffic. As an application scenario we use filter diagonalization studies on topological insulator materials. Performance numbers on up to 512 nodes of the OakForest-PACS and Piz Daint supercomputers are presented, achieving beyond 100 Tflop/s for computing 100 inner eigenvalues of sparse matrices of dimension one billion., Comment: 18 pages, 8 figures
Published: 2018
Full Text: View/download PDF

40. Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis

Author: Wittmann, Markus, Haag, Viktor, Zeiser, Thomas, Köstler, Harald, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Lattice Boltzmann methods (LBM) are an important part of current computational fluid dynamics (CFD). They allow easy implementations and boundary handling. However, competitive time to solution not only depends on the choice of a reasonable method, but also on an efficient implementation on modern hardware. Hence, performance optimization has a long history in the lattice Boltzmann community. A variety of options exists regarding the implementation with direct impact on the solver performance. Experimenting and evaluating each option often is hard as the kernel itself is typically embedded in a larger code base. With our suite of lattice Boltzmann kernels we provide the infrastructure for such endeavors. Already included are several kernels ranging from simple to fully optimized implementations. Although these kernels are not fully functional CFD solvers, they are equipped with a solid verification method. The kernels may act as a reference for performance comparisons and as a blueprint for optimization strategies. In this paper we give an overview of already available kernels, establish a performance model for each kernel, and show a comparison of implementations and recent architectures., Comment: preprint, submitted to Computers & Fluids Special Issue DSFD2017
Published: 2017
Full Text: View/download PDF

41. Validation of hardware events for successful performance pattern identification in High Performance Computing

Author: Röhl, Thomas, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Hardware performance monitoring (HPM) is a crucial ingredient of performance analysis tools. While there are interfaces like LIKWID, PAPI or the kernel interface perf_event which provide HPM access with some additional features, many higher level tools combine event counts with results retrieved from other sources like function call traces to derive (semi-)automatic performance advice. However, although HPM has been available for x86 systems since the early 90s, only a small subset of the HPM features is used in practice. Performance patterns provide a more comprehensive approach, enabling the identification of various performance-limiting effects. Patterns address issues like bandwidth saturation, load imbalance, non-local data access in ccNUMA systems, or false sharing of cache lines. This work defines HPM event sets that are best suited to identify a selection of performance patterns on the Intel Haswell processor. We validate the chosen event sets for accuracy in order to arrive at a reliable pattern detection mechanism and point out shortcomings that cannot be easily circumvented due to bugs or limitations in the hardware.
Published: 2017
Full Text: View/download PDF

42. Multiway p-spectral graph cuts on Grassmann manifolds

Author: Pasadakis, Dimosthenis, Alappat, Christie Louis, Schenk, Olaf, and Wellein, Gerhard
Published: 2022
Full Text: View/download PDF

43. CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

Author: Shahzad, Faisal, Thies, Jonas, Kreutzer, Moritz, Zeiser, Thomas, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.
Published: 2017

44. LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses

Author: Röhl, Thomas, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: System monitoring is an established tool to measure the utilization and health of HPC systems. Usually system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. Automatic and continuous performance monitoring of jobs is an essential component for increasing the efficient use of HPC systems. It can help to identify pathological cases, provide instant performance feedback to the users, offer initial data to judge the optimization potential of applications, and help to build a statistical foundation about application specific system usage. The LIKWID monitoring stack is a modular framework built on top of the LIKWID tools library. It aims at enabling job specific performance monitoring using HPM data, system metrics and application-level data for small to medium sized commodity clusters. Moreover, it is designed to integrate into existing monitoring infrastructures to speed up the change from pure system monitoring to job-aware monitoring., Comment: 4 pages, 4 figures. Accepted for HPCMASPA 2017, the Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, held in conjunction with IEEE Cluster 2017, Honolulu, HI, September 5, 2017
Published: 2017
Full Text: View/download PDF

45. An analysis of core- and chip-level architectural features in four generations of Intel server processors

Author: Hofmann, Johannes, Hager, Georg, Wellein, Gerhard, and Fey, Dietmar
Subjects: Computer Science - Performance
Abstract: This paper presents a survey of architectural features among four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell) with a focus on performance with floating-point workloads. Starting on the core level and going down the memory hierarchy we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock speed and its limitations, L2 and L3 cache bandwidth and latency, the impact of Cluster on Die (CoD) and cache snoop modes, and the Uncore clock speed. Using microbenchmarks we study the influence of these factors on code performance. This insight can then serve as input for analytic performance models. We show that the energy efficiency of the LINPACK and HPCG benchmarks can be improved considerably by tuning the Uncore clock speed without sacrificing performance, and that the Graph500 benchmark performance may profit from a suitable choice of cache snoop mode settings.
Published: 2017

46. Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

Author: Hammer, Julian, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. The Roofline Model and the Execution-Cache-Memory (ECM) model are proven approaches to performance modeling of loop nests. In comparison to the Roofline model, the ECM model can also describe the single-core performance and saturation behavior on a multicore chip. We give an introduction to the Roofline and ECM models, and to stencil performance modeling using layer conditions (LC). We then present Kerncraft, a tool that can automatically construct Roofline and ECM models for loop nests by performing the required code, data transfer, and LC analysis. The layer condition analysis allows the prediction of optimal spatial blocking factors for loop nests. Together with the models it enables an ab-initio estimate of the potential benefits of loop blocking optimizations and of useful block sizes. In cases where LC analysis is not easily possible, Kerncraft supports a cache simulator as a fallback option. Using a 25-point long-range stencil we demonstrate the usefulness and predictive power of the Kerncraft tool. Comment: 22 pages, 5 figures
Published: 2017
Full Text: View/download PDF

47. Extreme Scale-out SuperMUC Phase 2 - lessons learned

Author: Hammer, Nicolay, Jamitzky, Ferdinand, Satzger, Helmut, Allalen, Momme, Block, Alexander, Karmakar, Anupam, Brehm, Matthias, Bader, Reinhold, Iapichino, Luigi, Ragagnin, Antonio, Karakasis, Vasilios, Kranzlmüller, Dieter, Bode, Arndt, Huber, Herbert, Kühn, Martin, Machado, Rui, Grünewald, Daniel, Edelmann, Philipp V. F., Röpke, Friedrich K., Wittmann, Markus, Zeiser, Thomas, Wellein, Gerhard, Mathias, Gerald, Schwörer, Magnus, Lorenzen, Konstantin, Federrath, Christoph, Klessen, Ralf, Bamberg, Karl-Ulrich, Ruhl, Hartmut, Schornbaum, Florian, Bauer, Martin, Nikhil, Anand, Qi, Jiaxing, Klimach, Harald, Stüben, Hinnerk, Deshmukh, Abhishek, Falkenstein, Tobias, Dolag, Klaus, and Petkova, Margarita
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Astrophysics - Instrumentation and Methods for Astrophysics, Physics - Computational Physics, Physics - Fluid Dynamics
Abstract: In spring 2015, the Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ) installed their new Peta-Scale System SuperMUC Phase 2. Selected users were invited for a 28-day extreme scale-out block operation during which they were allowed to use the full system for their applications. The following projects participated in the extreme scale-out workshop: BQCD (Quantum Physics), SeisSol (Geophysics, Seismics), GPI-2/GASPI (Toolkit for HPC), Seven-League Hydro (Astrophysics), ILBDC (Lattice Boltzmann CFD), Iphigenie (Molecular Dynamics), FLASH (Astrophysics), GADGET (Cosmological Dynamics), PSC (Plasma Physics), waLBerla (Lattice Boltzmann CFD), Musubi (Lattice Boltzmann CFD), Vertex3D (Stellar Astrophysics), CIAO (Combustion CFD), and LS1-Mardyn (Material Science). The projects were allowed to use the machine exclusively during the 28-day period, which corresponds to a total of 63.4 million core-hours, of which 43.8 million core-hours were used by the applications, resulting in a utilization of 69%. The top 3 users were using 15.2, 6.4, and 4.7 million core-hours, respectively. Comment: 10 pages, 5 figures, presented at ParCo2015 - Advances in Parallel Computing, held in Edinburgh, September 2015. The final publication is available at IOS Press through http://dx.doi.org/10.3233/978-1-61499-621-7-827
Published: 2016
Full Text: View/download PDF

48. ESSEX: Equipping Sparse Solvers For Exascale

Author: Alappat, Christie L., Alvermann, Andreas, Basermann, Achim, Fehske, Holger, Futamura, Yasunori, Galgon, Martin, Hager, Georg, Huber, Sarah, Imakura, Akira, Kawai, Masatoshi, Kreutzer, Moritz, Lang, Bruno, Nakajima, Kengo, Röhrig-Zöllner, Melven, Sakurai, Tetsuya, Shahzad, Faisal, Thies, Jonas, Wellein, Gerhard, Barth, Timothy J., Series Editor, Griebel, Michael, Series Editor, Keyes, David E., Series Editor, Nieminen, Risto M., Series Editor, Roose, Dirk, Series Editor, Schlick, Tamar, Series Editor, Bungartz, Hans-Joachim, editor, Reiz, Severin, editor, Uekermann, Benjamin, editor, Neumann, Philipp, editor, and Nagel, Wolfgang E., editor
Published: 2020
Full Text: View/download PDF

49. EXASTEEL: Towards a Virtual Laboratory for the Multiscale Simulation of Dual-Phase Steel Using High-Performance Computing

Author: Klawonn, Axel, Lanser, Martin, Uran, Matthias, Rheinbach, Oliver, Köhler, Stephan, Schröder, Jörg, Scheunemann, Lisa, Brands, Dominik, Balzani, Daniel, Gandhi, Ashutosh, Wellein, Gerhard, Wittmann, Markus, Schenk, Olaf, Janalík, Radim, Barth, Timothy J., Series Editor, Griebel, Michael, Series Editor, Keyes, David E., Series Editor, Nieminen, Risto M., Series Editor, Roose, Dirk, Series Editor, Schlick, Tamar, Series Editor, Bungartz, Hans-Joachim, editor, Reiz, Severin, editor, Uekermann, Benjamin, editor, Neumann, Philipp, editor, and Nagel, Wolfgang E., editor
Published: 2020
Full Text: View/download PDF

50. Performance Engineering for a Tall & Skinny Matrix Multiplication Kernels on GPUs

Author: Ernst, Dominik, Hager, Georg, Thies, Jonas, Wellein, Gerhard, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wyrzykowski, Roman, editor, Deelman, Ewa, editor, Dongarra, Jack, editor, and Karczewski, Konrad, editor
Published: 2020
Full Text: View/download PDF
