Author: "Hager, Georg" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Hager, Georg"' showing total 402 results

1. Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Author: Lacey, Dane C., Alappat, Christie L., Lange, Florian, Hager, Georg, Fehske, Holger, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph-based formulation of the SpMV and employs a level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a "traditional" distributed memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster., Comment: 15 pages, 12 figures, 5 tables; added affiliation & extended acknowledgment
Published: 2024
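
For orientation, the "traditional" matrix power kernel mentioned in the abstract above can be sketched as a sequence of back-to-back SpMVs. This is only a minimal illustration under stated assumptions, not the authors' RACE-based or distributed implementation: the CSR storage format, the function names `spmv_csr` and `matrix_power_kernel`, and the small test matrix are all hypothetical choices made for this sketch.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A @ x for a sparse matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def matrix_power_kernel(row_ptr, col_idx, values, x, power):
    """Return [x, A x, A^2 x, ..., A^power x] via back-to-back SpMVs.

    Each SpMV sweeps over the whole matrix again, so for matrices larger
    than the cache the matrix data is reloaded from memory `power` times.
    """
    vectors = [list(x)]
    for _ in range(power):
        vectors.append(spmv_csr(row_ptr, col_idx, values, vectors[-1]))
    return vectors


# Hypothetical 3x3 tridiagonal test matrix: [[2,-1,0],[-1,2,-1],[0,-1,2]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
result = matrix_power_kernel(row_ptr, col_idx, values, [1.0, 1.0, 1.0], 2)
```

The repeated full sweeps over the matrix are the data-reuse opportunity the abstract refers to; a cache-blocked variant would reorder the work so that matrix data is reused while it is still in cache.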

2. CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion

Author: Laukemann, Jan, Gruber, Thomas, Hager, Georg, Oryspayev, Dossay, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measurements and microbenchmarks to study memory data traffic behavior, we can connect the breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs. For serial and full-node cases we are able to predict the memory data volume analytically with an error of a few percent. We find that if the number of processes is prime, SpecI2M fails to work properly, which we can attribute to short inner loops emerging from the one-dimensional domain decomposition in this case. We can also rule out other possible causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance., Comment: 19 pages including artifact appendix; 11 figures, 1 table; numerous corrections, esp. in Table 1
Published: 2023
Full Text: View/download PDF

3. Physical Oscillator Model for Supercomputing

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Physics - Computational Physics
Abstract: A parallel program together with the parallel hardware it is running on is not only a vehicle to solve numerical problems, it is also a complex system with interesting dynamical behavior: resynchronization and desynchronization of parallel processes, propagating phases of idleness, and the peculiar effects of noise and system topology are just a few examples. We propose a physical oscillator model (POM) to describe aspects of the dynamics of interacting parallel processes. Motivated by the well-known Kuramoto Model, a process with its regular compute-communicate cycles is modeled as an oscillator which is coupled to other oscillators (processes) via an interaction potential. Instead of a simple all-to-all connectivity, we employ a sparse topology matrix mapping the communication structure and thus the inter-process dependencies of the program onto the oscillator model and propose two interaction potentials that are suitable for different scenarios in parallel computing: resource-scalable and resource-bottlenecked applications. The former are not limited by a resource bottleneck such as memory bandwidth or network contention, while the latter are. Unlike the original Kuramoto model, which has a periodic sinusoidal potential that is attractive for small angles, our characteristic potentials are always attractive for large angles and only differ in the short-distance behavior. We show that the model with appropriate potentials can mimic the propagation of delays and the synchronizing and desynchronizing behavior of scalable and bottlenecked parallel programs, respectively., Comment: 5 pages, 2 figures
Published: 2023
Full Text: View/download PDF
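
As background for the abstract above, the classic Kuramoto model with a sparse coupling topology can be sketched as follows. This is the standard sinusoidal coupling, not the interaction potentials proposed in the paper; the explicit Euler integrator, the unnormalized coupling constant, the function name `kuramoto_step`, and the four-oscillator ring are assumptions made purely for illustration.

```python
import math


def kuramoto_step(theta, omega, neighbors, coupling, dt):
    """Advance all oscillator phases by one explicit Euler step.

    theta     -- current phase of each oscillator
    omega     -- natural frequency of each oscillator
    neighbors -- neighbors[i] lists the oscillators coupled to i
                 (a sparse topology instead of all-to-all coupling)
    coupling  -- coupling strength (assumed unnormalized here)
    """
    new_theta = []
    for i, phase in enumerate(theta):
        # Sinusoidal interaction: attractive for small phase differences.
        pull = sum(math.sin(theta[j] - phase) for j in neighbors[i])
        new_theta.append(phase + dt * (omega[i] + coupling * pull))
    return new_theta


# Hypothetical setup: four identical oscillators coupled in a ring.
neighbors = [[1, 3], [0, 2], [1, 3], [2, 0]]
omega = [1.0, 1.0, 1.0, 1.0]
theta = [0.0, 0.5, 1.0, 0.3]
initial_spread = max(theta) - min(theta)

for _ in range(2000):
    theta = kuramoto_step(theta, omega, neighbors, coupling=1.0, dt=0.01)

final_spread = max(theta) - min(theta)
```

With identical frequencies and initial phases within a quarter turn of each other, the phase spread shrinks as the oscillators synchronize, which is the analogue of resynchronizing processes in this picture.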

4. SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In this work, fundamental performance, power, and energy characteristics of the full SPEChpc 2021 benchmark suite are assessed on two different clusters based on Intel Ice Lake and Sapphire Rapids CPUs using the MPI-only codes' variants. We use memory bandwidth, data volume, and scalability metrics in order to categorize the benchmarks and pinpoint relevant performance and scalability bottlenecks on the node and cluster levels. Common patterns such as memory bandwidth limitation, dominating communication and synchronization overhead, MPI serialization, superlinear scaling, and alignment issues could be identified, in isolation or in combination, showing that SPEChpc 2021 is representative of many HPC workloads. Power dissipation and energy measurements indicate that the modern Intel server CPUs have such a high idle power level that race-to-idle is the paramount strategy for energy to solution and energy-delay product minimization. On the chip level, only memory-bound code shows a clear advantage of Sapphire Rapids compared to Ice Lake in terms of energy to solution., Comment: 9 pages, 6 figures; corrected links to system docs
Published: 2023
Full Text: View/download PDF

5. Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

Author: Alappat, Christie, Thies, Jonas, Hager, Georg, Fehske, Holger, and Wellein, Gerhard
Subjects: Mathematics - Numerical Analysis, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches has been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms allow the polynomial to be evaluated using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix A and a dense vector x, or a related operation. Recently we have shown that using the level-based formulation of sparse matrix-vector multiplications in the Recursive Algebraic Coloring Engine (RACE) framework we can perform temporal cache blocking of MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain reduces significantly, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers., Comment: 25 pages, 11 figures, 3 tables
Published: 2023

6. Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Author: Afzal, Ayesha, Hager, Georg, Markidis, Stefano, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate injection of delays can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics. They are useful in spotting certain patterns in parallel execution that will easily go unnoticed with traditional tracing tools. We investigate five different microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers, and the LULESH and HPCG proxy applications., Comment: 18 pages, 14 figures, 7 tables. Corrected Fig. 4 layout
Published: 2023
Full Text: View/download PDF

7. MD-Bench: Engineering the in-core performance of short-range molecular dynamics kernels from state-of-the-art simulation packages

Author: Machado, Rafael Ravedutti Lucio, Eitzinger, Jan, Laukemann, Jan, Hager, Georg, Köstler, Harald, and Wellein, Gerhard
Subjects: Physics - Chemical Physics, Computer Science - Performance, Physics - Computational Physics
Abstract: Molecular dynamics (MD) simulations provide considerable benefits for the investigation and experimentation of systems at atomic level. Their usage is widespread across several research fields, but their system size and timescale are also crucially limited by the computing power they can make use of. Performance engineering of MD kernels is therefore important to understand their bottlenecks and point out possible improvements. For that reason, we developed MD-Bench, a proxy-app for short-range MD kernels that implements state-of-the-art algorithms from multiple production applications such as LAMMPS and GROMACS. MD-Bench is intended to have simpler, understandable and extensible source code, as well as to be transparent and suitable for teaching, benchmarking and researching MD algorithms. In this paper we introduce MD-Bench and describe its design, structure, and implemented algorithms. Finally, we show five usage examples of MD-Bench and describe how these are useful for gaining a deeper understanding of MD kernels from a performance point of view, also exposing some interesting performance insights., Comment: 17 pages, 10 figures, 5 tables. arXiv admin note: text overlap with arXiv:2207.13094
Published: 2023

8. Orthogonal layers of parallelism in large-scale eigenvalue computations

Author: Alvermann, Andreas, Hager, Georg, and Fehske, Holger
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model which includes a communication metric that is computed directly from the matrix sparsity pattern without running any code. The performance model quantifies to which extent scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer the rows of the sparse matrix are distributed across individual processes. In the vertical layer bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, one implements the two orthogonal layers of parallelism via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid state physics, road networks, and nonlinear programming. We finally demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases -- an exciton and a strongly correlated electron system -- which incur either small or large communication overhead., Comment: Final version, almost as published. 32 pages, 12 figures
Published: 2022
Full Text: View/download PDF

9. Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Author: Afzal, Ayesha, Hager, Georg, Wellein, Gerhard, and Markidis, Stefano
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Performance
Abstract: This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics., Comment: 12 pages, 9 figures, 1 table
Published: 2022
Full Text: View/download PDF

10. The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper we investigate the effect of "bottleneck evasion" and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code so that its processes become desynchronized. This occurs most prominently in memory-bound programs, which is why we choose memory-bound benchmark and application codes, specifically an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code to demonstrate the consequences of desynchronization on two different supercomputing platforms. We investigate the role of idle waves as possible triggers for desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structures, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) precipitates performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime., Comment: 13 pages, 7 figures, 6 tables
Published: 2022
Full Text: View/download PDF

11. Level-based Blocking for Sparse Matrices: Sparse Matrix-Power-Vector Multiplication

Author: Alappat, Christie L., Hager, Georg, Schenk, Olaf, and Wellein, Gerhard
Subjects: Mathematics - Numerical Analysis, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: The multiplication of a sparse matrix with a dense vector (SpMV) is a key component in many numerical schemes and its performance is known to be severely limited by main memory access. Several numerical schemes require the multiplication of a sparse matrix polynomial with a dense vector, which is typically implemented as a sequence of SpMVs. This results in low performance and ignores the potential to increase the arithmetic intensity by reusing the matrix data from cache. In this work we use the recursive algebraic coloring engine (RACE) to enable blocking of sparse matrix data across the polynomial computations. In the graph representing the sparse matrix we form levels using a breadth-first search. Locality relations of these levels are then used to improve spatial and temporal locality when accessing the matrix data and to implement an efficient multithreaded parallelization. Our approach is independent of the matrix structure and avoids shortcomings of existing "blocking" strategies in terms of hardware efficiency and parallelization overhead. We quantify the quality of our implementation using performance modelling and demonstrate speedups of up to 3$\times$ and 5$\times$ compared to an optimal SpMV-based baseline on a single multicore chip of recent Intel and AMD architectures. As a potential application, we demonstrate the benefit of our implementation for a Chebyshev time propagation scheme, representing the class of polynomial approximations to exponential integrators. Further numerical schemes which may benefit from our developments include $s$-step Krylov solvers and power clustering algorithms., Comment: 18 pages, 19 figures, 3 tables
Published: 2022
Full Text: View/download PDF
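
The level construction mentioned in the abstract above ("we form levels using a breadth-first search") can be illustrated with a short sketch. It is not the RACE implementation: it assumes a structurally symmetric sparsity pattern in CSR form and a connected graph, and the function name `bfs_levels`, the start row, and the test pattern are hypothetical.

```python
from collections import deque


def bfs_levels(row_ptr, col_idx, start=0):
    """Assign each row (graph vertex) its BFS distance from `start`.

    The sparsity pattern is read as an undirected graph: a nonzero at
    (row, col) is an edge between vertices `row` and `col`.
    Unreachable vertices keep the level -1.
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[start] = 0
    queue = deque([start])
    while queue:
        row = queue.popleft()
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            if level[col] == -1:
                level[col] = level[row] + 1
                queue.append(col)
    return level


# Hypothetical 5x5 tridiagonal sparsity pattern in CSR form.
row_ptr = [0, 2, 5, 8, 11, 13]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4]
levels = bfs_levels(row_ptr, col_idx)
```

For an undirected graph, a vertex in level i has neighbors only in levels i-1, i, and i+1, so the rows of one level touch vector entries from at most three adjacent levels. Locality of this kind is what a level-based blocking scheme can build on.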

12. Analytical Performance Estimation during Code Generation on Modern GPUs

Author: Ernst, Dominik, Holzer, Markus, Hager, Georg, Knorr, Matthias, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning to select the best-performing configuration. This paper identifies the relevant performance-defining mechanisms for memory-intensive GPU applications through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient code candidates with high accuracy. We examine the changes of the A100 GPU architecture compared to the predecessor V100 and address the challenges of how to model the data transfer volumes through the new memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D-25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels but can be integrated into any code generator that can generate the required address expressions., Comment: arXiv admin note: substantial text overlap with arXiv:2107.01143
Published: 2022
Full Text: View/download PDF

13. Opening the Black Box: Performance Estimation during Code Generation for GPUs

Author: Ernst, Dominik, Hager, Georg, Holzer, Markus, Knorr, Matthias, and Wellein, Gerhard
Subjects: Computer Science - Performance, C.4
Abstract: Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range four 3D25pt stencil and a complex two phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.
Published: 2021
Full Text: View/download PDF

14. Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Most distributed-memory bulk-synchronous parallel programs in HPC assume that compute resources are available continuously and homogeneously across the allocated set of compute nodes. However, long one-off delays on individual processes can cause global disturbances, so-called idle waves, by rippling through the system. This process is mainly governed by the communication topology of the underlying parallel code. This paper makes significant contributions to the understanding of idle wave dynamics. We study the propagation mechanisms of idle waves across the ranks of MPI-parallel programs. We present a validated analytic model for their propagation velocity with respect to communication parameters and topology, with a special emphasis on sparse communication patterns. We study the interaction of idle waves with MPI collectives and show that, depending on the implementation, a collective may be transparent to the wave. Finally we analyze two mechanisms of idle wave decay: topological decay, which is rooted in differences in communication characteristics among parts of the system, and noise-induced decay, which is caused by system or application noise. We show that noise-induced decay is largely independent of noise characteristics but depends only on the overall noise power. An analytic expression for idle wave decay rate with respect to noise power is derived. For model validation we use microbenchmarks and stencil algorithms on three different supercomputing platforms., Comment: 19 pages, 10 figures, 2 tables
Published: 2021
Full Text: View/download PDF

15. ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

Author: Alappat, Christie, Meyer, Nils, Laukemann, Jan, Gruber, Thomas, Hager, Georg, Wellein, Gerhard, and Wettig, Tilo
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing, High Energy Physics - Lattice
Abstract: The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory (ECM) performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics (QCD). For SpMV we show why the CRS matrix storage format is not a good practical choice on this architecture and how the SELL-C-sigma format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW., Comment: 32 pages, 25 figures, 6 tables
Published: 2021
Full Text: View/download PDF

16. Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Author: Afzal, Ayesha, Hager, Georg, Wellein, Gerhard, Markidis, Stefano, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wyrzykowski, Roman, editor, Dongarra, Jack, editor, Deelman, Ewa, editor, and Karczewski, Konrad, editor
Published: 2023
Full Text: View/download PDF

17. MD-Bench: A performance-focused prototyping harness for state-of-the-art short-range molecular dynamics algorithms

Author: Ravedutti Lucio Machado, Rafael, Eitzinger, Jan, Laukemann, Jan, Hager, Georg, Köstler, Harald, and Wellein, Gerhard
Published: 2023
Full Text: View/download PDF

18. An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop cannot be upheld. Motivated by observations on plain and modified versions of the HPCG benchmark, we construct a performance model of execution of memory-bound loop kernels. It can predict the memory bandwidth share per kernel on a memory contention domain depending on the number of active cores and which other workload the kernel is paired with. The only code features required are the single-thread cache line access frequency per kernel, which is directly related to the single-thread memory bandwidth, and its saturated bandwidth. It can either be measured directly or predicted using the Execution-Cache-Memory (ECM) performance model. The computational intensity of the kernels and the detailed structure of the code are of no significance. We validate our model on Intel Broadwell, Intel Cascade Lake, and AMD Rome processors pairing various streaming and stencil kernels. The error in predicting the bandwidth share per kernel is less than 8%., Comment: 10 pages, 9 figures
Published: 2020

19. Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

Author: Alappat, Christie L., Laukemann, Jan, Gruber, Thomas, Hager, Georg, Wellein, Gerhard, Meyer, Nils, and Wettig, Tilo
Subjects: Computer Science - Performance, Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The A64FX CPU powers the current number one supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. We also identify architectural peculiarities and derive optimization hints. Applying the ECM model to sparse matrix-vector multiplication (SpMV), we motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format with suitable code optimizations can achieve bandwidth saturation for SpMV., Comment: 6 pages, 5 figures, 3 tables
Published: 2020
Full Text: View/download PDF

20. Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Author: Alappat, Christie L., Hofmann, Johannes, Hager, Georg, Fehske, Holger, Bishop, Alan R., and Wellein, Gerhard
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are relevant for performance are not even known to most application users or developers. Microbenchmarks, i.e., simple codes that fathom a particular aspect of the hardware, can help to shed light on such issues, but only if they are well understood and if the results can be reconciled with known facts or performance models. The insight gained from microbenchmarks may then be applied to real applications for performance analysis or optimization. In this paper we investigate two modern Intel x86 server CPU architectures in depth: Broadwell EP and Cascade Lake SP. We highlight relevant hardware configuration settings that can have a decisive impact on code performance and show how to properly measure on-chip and off-chip data transfer bandwidths. The new victim L3 cache of Cascade Lake and its advanced replacement policy receive due attention. Finally we use DGEMM, sparse matrix-vector multiplication, and the HPCG benchmark to make a connection to relevant application scenarios., Comment: 19 pages, 9 figures, 3 tables. Corrected affiliations and acknowledgments
Published: 2020
Full Text: View/download PDF

21. Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time often does not yield a satisfactory prediction of parallel runtime due to deviations from the expected simple lockstep pattern caused by system noise, variations in communication time, and inherent load imbalance. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks we observe that although desynchronization can introduce increased waiting time per process, it does not necessarily cause lower resource utilization but can lead to an increase in available bandwidth per core. In case of significant communication overhead, even natural noise can shove the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and the emerging stable wave pattern. We also demonstrate how hybrid MPI+OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting., Comment: 18 pages, 8 figures
Published: 2020
Full Text: View/download PDF

22. Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

Author: Laukemann, Jan, Hammer, Julian, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements., Comment: 6 pages, 3 figures
Published: 2019
Full Text: View/download PDF

23. Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Author: Afzal, Ayesha, Hager, Georg, Wellein, Gerhard, and Markidis, Stefano
Published: 2023
Full Text: View/download PDF

24. A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication

Author: Alappat, Christie L., Hager, Georg, Schenk, Olaf, Thies, Jonas, Basermann, Achim, Bishop, Alan R., Fehske, Holger, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the Roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring., Comment: 40 pages, 23 figures
Published: 2019
Full Text: View/download PDF

25. Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

Author: Hofmann, Johannes, Alappat, Christie L., Hager, Georg, Fey, Dietmar, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, Computer Science - Performance
Abstract: We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises the loop code, problem sizes, and other runtime parameters, while a machine model is an abstraction of all performance-relevant properties of a CPU. We introduce a generic method for determining machine models and present results for relevant server-processor architectures by Intel, AMD, IBM, and Marvell/Cavium. Considering this wide range of architectures, the set of features required for adequate performance modeling is surprisingly small. To validate our approach, we compare performance predictions to empirical data for an OpenMP-parallel preconditioned CG algorithm, which includes compute- and memory-bound kernels. Both single- and multicore analysis shows that the model exhibits average and maximum relative errors of 5% and 10%. Deviations from the model and insights gained are discussed in detail., Comment: 12 pages, 7 figures
Published: 2019
Full Text: View/download PDF

26. Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT

Author: Hornich, Julian, Hammer, Julian, Hager, Georg, Gruber, Thomas, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Stencil algorithms have been receiving considerable interest in HPC research for decades. The techniques used to approach multi-core stencil performance modeling and engineering span basic runtime measurements, elaborate performance models, detailed hardware counter analysis, and thorough scaling behavior evaluation. Due to the plurality of approaches and stencil patterns, we set out to develop a generalizable methodology for reproducible measurements accompanied by state-of-the-art performance models. Our open-source toolchain and collected results are publicly available in the "Intranode Stencil Performance Evaluation Collection" (INSPECT). We present the underlying methodologies, models and tools involved in gathering and documenting the performance behavior of a collection of typical stencil patterns across multiple architectures and hardware configuration options. Our aim is to endow performance-aware application developers with reproducible baseline performance data and validated models to initiate a well-defined process of performance assessment and optimization.
Published: 2019
Full Text: View/download PDF

27. Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study

Author: Afzal, Ayesha, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") destroy the assumptions of regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long delays that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. We analyze the dependence of the propagation speed of idle waves emanating from injected delays with respect to the execution and communication properties of the application, study how such delays decay under increased noise levels, and how they interact with each other. We also show how fine-grained noise can make a system immune against the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications., Comment: 10 pages, 9 figures; title changed
Published: 2019
Full Text: View/download PDF

28. Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs

Author: Ernst, Dominik, Hager, Georg, Thies, Jonas, and Wellein, Gerhard
Subjects: Computer Science - Mathematical Software, Computer Science - Performance
Abstract: General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model in this case. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU., Comment: 12 pages, 22 figures. Extended version of arXiv:1905.03136v1 for journal submission
Published: 2019
Full Text: View/download PDF

29. Analytic Performance Modeling and Analysis of Detailed Neuron Simulations

Author: Cremonesi, Francesco, Hager, Georg, Wellein, Gerhard, and Schürmann, Felix
Subjects: Computer Science - Performance, Computer Science - Computational Engineering, Finance, and Science
Abstract: Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel computer performance has been supporting these developments, and at the same time maintainers of neuroscientific simulation code have strived to optimally and efficiently exploit new hardware features. Current state of the art software for the simulation of biological networks has so far been developed using performance engineering practices, but a thorough analysis and modeling of the computational and performance characteristics, especially in the case of morphologically detailed neuron simulations, is lacking. Other computational sciences have successfully used analytic performance engineering and modeling methods to gain insight on the computational properties of simulation kernels, aid developers in performance optimizations and eventually drive co-design efforts, but to our knowledge a model-based performance analysis of neuron simulations has not yet been conducted. We present a detailed study of the shared-memory performance of morphologically detailed neuron simulations based on the Execution-Cache-Memory (ECM) performance model. We demonstrate that this model can deliver accurate predictions of the runtime of almost all the kernels that constitute the neuron models under investigation. The gained insight is used to identify the main governing mechanisms underlying performance bottlenecks in the simulation. The implications of this analysis on the optimization of neural simulation software and eventually co-design of future hardware architectures are discussed. In this sense, our work represents a valuable conceptual and quantitative contribution to understanding the performance properties of biological networks simulations., Comment: 18 pages, 6 figures, 15 tables
Published: 2019
Full Text: View/download PDF

30. Analytical performance estimation during code generation on modern GPUs

Author: Ernst, Dominik, Holzer, Markus, Hager, Georg, Knorr, Matthias, and Wellein, Gerhard
Published: 2023
Full Text: View/download PDF

31. Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

Author: Laukemann, Jan, Hammer, Julian, Hofmann, Johannes, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance, Computer Science - Software Engineering
Abstract: An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally we give an outlook on how the method may be generalized to new architectures., Comment: 11 pages, 4 figures, 7 tables
Published: 2018
Full Text: View/download PDF

32. Benefits from using mixed precision computations in the ELPA-AEO and ESSEX-II eigensolver projects

Author: Alvermann, Andreas, Basermann, Achim, Bungartz, Hans-Joachim, Carbogno, Christian, Ernst, Dominik, Fehske, Holger, Futamura, Yasunori, Galgon, Martin, Hager, Georg, Huber, Sarah, Huckle, Thomas, Ida, Akihiro, Imakura, Akira, Kawai, Masatoshi, Köcher, Simone, Kreutzer, Moritz, Kus, Pavel, Lang, Bruno, Lederer, Hermann, Manin, Valeriy, Marek, Andreas, Nakajima, Kengo, Nemec, Lydia, Reuter, Karsten, Rippl, Michael, Röhrig-Zöllner, Melven, Sakurai, Tetsuya, Scheffler, Matthias, Scheurer, Christoph, Shahzad, Faisal, Brambila, Danilo Simoes, Thies, Jonas, and Wellein, Gerhard
Subjects: Physics - Computational Physics, Condensed Matter - Materials Science
Abstract: We first briefly report on the status and recent achievements of the ELPA-AEO (Eigenvalue Solvers for Petaflop Applications - Algorithmic Extensions and Optimizations) and ESSEX II (Equipping Sparse Solvers for Exascale) projects. In both collaboratory efforts, scientists from the application areas, mathematicians, and computer scientists work together to develop and make available efficient highly parallel methods for the solution of eigenvalue problems. Then we focus on a topic addressed in both projects, the use of mixed precision computations to enhance efficiency. We give a more detailed description of our approaches for benefiting from either lower or higher precision in three selected contexts and of the results thus obtained.
Published: 2018

33. Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs

Author: Kreutzer, Moritz, Hager, Georg, Ernst, Dominik, Fehske, Holger, Bishop, Alan R., and Wellein, Gerhard
Subjects: Computer Science - Mathematical Software, Computer Science - Performance, Physics - Computational Physics
Abstract: Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi "Knights Landing" and Nvidia "Pascal." We propose two optimizations: (1) Subspace blocking is applied for improved performance and data access efficiency. We also show that it allows transparently handling problems much larger than the high-bandwidth memory without significant performance penalties. (2) Pipelining of communication and computation phases of successive subspaces is implemented to hide communication costs without extra memory traffic. As an application scenario we use filter diagonalization studies on topological insulator materials. Performance numbers on up to 512 nodes of the OakForest-PACS and Piz Daint supercomputers are presented, achieving beyond 100 Tflop/s for computing 100 inner eigenvalues of sparse matrices of dimension one billion., Comment: 18 pages, 8 figures
Published: 2018
Full Text: View/download PDF

34. On the accuracy and usefulness of analytic energy models for contemporary multicore processors

Author: Hofmann, Johannes, Hager, Georg, and Fey, Dietmar
Subjects: Computer Science - Performance
Abstract: This paper presents refinements to the execution-cache-memory performance model and a previously published power model for multicore processors. The combination of both enables a very accurate prediction of performance and energy consumption of contemporary multicore processors as a function of relevant parameters such as number of active cores as well as core and Uncore frequencies. Model validation is performed on the Sandy Bridge-EP and Broadwell-EP microarchitectures. Production-related variations in chip quality are demonstrated through a statistical analysis of the fit parameters obtained on one hundred Broadwell-EP CPUs of the same model. Insights from the models are used to explain the performance- and energy-related behavior of the processors for scalable as well as saturating (i.e., memory-bound) codes. In the process we demonstrate the models' capability to identify optimal operating points with respect to highest performance, lowest energy-to-solution, and lowest energy-delay product and identify a set of best practices for energy-efficient execution.
Published: 2018
Full Text: View/download PDF

35. Validation of hardware events for successful performance pattern identification in High Performance Computing

Author: Röhl, Thomas, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Hardware performance monitoring (HPM) is a crucial ingredient of performance analysis tools. While there are interfaces like LIKWID, PAPI or the kernel interface perf_event which provide HPM access with some additional features, many higher level tools combine event counts with results retrieved from other sources like function call traces to derive (semi-)automatic performance advice. However, although HPM has been available for x86 systems since the early 90s, only a small subset of the HPM features is used in practice. Performance patterns provide a more comprehensive approach, enabling the identification of various performance-limiting effects. Patterns address issues like bandwidth saturation, load imbalance, non-local data access in ccNUMA systems, or false sharing of cache lines. This work defines HPM event sets that are best suited to identify a selection of performance patterns on the Intel Haswell processor. We validate the chosen event sets for accuracy in order to arrive at a reliable pattern detection mechanism and point out shortcomings that cannot be easily circumvented due to bugs or limitations in the hardware.
Published: 2017
Full Text: View/download PDF

36. A domain-specific language and matrix-free stencil code for investigating electronic properties of Dirac and topological materials

Author: Pieper, Andreas, Hager, Georg, and Fehske, Holger
Subjects: Physics - Computational Physics, Computer Science - Performance
Abstract: We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and Topological Materials), a library and code generator based on a domain-specific language tailored to implement the specific stencil-like algorithms that can describe Dirac and topological materials such as graphene and topological insulators in a matrix-free way. The generated hybrid-parallel (MPI+OpenMP) code is fully vectorized using Single Instruction Multiple Data (SIMD) extensions. It is significantly faster than matrix-based approaches on the node level and performs in accordance with the roofline model. We demonstrate the chip-level performance and distributed-memory scalability of basic building blocks such as sparse matrix-(multiple-) vector multiplication on modern multicore CPUs. As an application example, we use the PVSC-DTM scheme to (i) explore the scattering of a Dirac wave on an array of gate-defined quantum dots, to (ii) calculate a bunch of interior eigenvalues for strong topological insulators, and to (iii) discuss the photoemission spectra of a disordered Weyl semimetal., Comment: 16 pages, 2 tables, 11 figures
Published: 2017
Full Text: View/download PDF

37. CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

Author: Shahzad, Faisal, Thies, Jonas, Kreutzer, Moritz, Zeiser, Thomas, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanisms. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.
Published: 2017

38. LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses

Author: Röhl, Thomas, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: System monitoring is an established tool to measure the utilization and health of HPC systems. Usually system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological cases, provides instant performance feedback to the users, offers initial data to judge the optimization potential of applications and helps to build a statistical foundation about application specific system usage. The LIKWID monitoring stack is a modular framework built on top of the LIKWID tools library. It aims at enabling job specific performance monitoring using HPM data, system metrics and application-level data for small to medium sized commodity clusters. Moreover, it is designed to integrate into existing monitoring infrastructures to speed up the change from pure system monitoring to job-aware monitoring., Comment: 4 pages, 4 figures. Accepted for HPCMASPA 2017, the Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, held in conjunction with IEEE Cluster 2017, Honolulu, HI, September 5, 2017
Published: 2017
Full Text: View/download PDF

39. An analysis of core- and chip-level architectural features in four generations of Intel server processors

Author: Hofmann, Johannes, Hager, Georg, Wellein, Gerhard, and Fey, Dietmar
Subjects: Computer Science - Performance
Abstract: This paper presents a survey of architectural features among four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell) with a focus on performance with floating point workloads. Starting on the core level and going down the memory hierarchy we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock speed and its limitations, L2 and L3 cache bandwidth and latency, the impact of Cluster on Die (CoD) and cache snoop modes, and the Uncore clock speed. Using microbenchmarks we study the influence of these factors on code performance. This insight can then serve as input for analytic performance models. We show that the energy efficiency of the LINPACK and HPCG benchmarks can be improved considerably by tuning the Uncore clock speed without sacrificing performance, and that the Graph500 benchmark performance may profit from a suitable choice of cache snoop mode settings.
Published: 2017

40. Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

Author: Hammer, Julian, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. The Roofline Model and the Execution-Cache-Memory (ECM) model are proven approaches to performance modeling of loop nests. In comparison to the Roofline model, the ECM model can also describe the single-core performance and saturation behavior on a multicore chip. We give an introduction to the Roofline and ECM models, and to stencil performance modeling using layer conditions (LC). We then present Kerncraft, a tool that can automatically construct Roofline and ECM models for loop nests by performing the required code, data transfer, and LC analysis. The layer condition analysis allows the prediction of optimal spatial blocking factors for loop nests. Together with the models it enables an ab-initio estimate of the potential benefits of loop blocking optimizations and of useful block sizes. In cases where LC analysis is not easily possible, Kerncraft supports a cache simulator as a fallback option. Using a 25-point long-range stencil we demonstrate the usefulness and predictive power of the Kerncraft tool., Comment: 22 pages, 5 figures
Published: 2017
Full Text: View/download PDF

41. ESSEX: Equipping Sparse Solvers For Exascale

Author: Alappat, Christie L., Alvermann, Andreas, Basermann, Achim, Fehske, Holger, Futamura, Yasunori, Galgon, Martin, Hager, Georg, Huber, Sarah, Imakura, Akira, Kawai, Masatoshi, Kreutzer, Moritz, Lang, Bruno, Nakajima, Kengo, Röhrig-Zöllner, Melven, Sakurai, Tetsuya, Shahzad, Faisal, Thies, Jonas, Wellein, Gerhard, Barth, Timothy J., Series Editor, Griebel, Michael, Series Editor, Keyes, David E., Series Editor, Nieminen, Risto M., Series Editor, Roose, Dirk, Series Editor, Schlick, Tamar, Series Editor, Bungartz, Hans-Joachim, editor, Reiz, Severin, editor, Uekermann, Benjamin, editor, Neumann, Philipp, editor, and Nagel, Wolfgang E., editor
Published: 2020
Full Text: View/download PDF

42. Performance Engineering for a Tall & Skinny Matrix Multiplication Kernels on GPUs

Author: Ernst, Dominik, Hager, Georg, Thies, Jonas, Wellein, Gerhard, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wyrzykowski, Roman, editor, Deelman, Ewa, editor, Dongarra, Jack, editor, and Karczewski, Konrad, editor
Published: 2020
Full Text: View/download PDF

43. Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors

Author: Hofmann, Johannes, Fey, Dietmar, Riedmann, Michael, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient SIMD-vectorized implementations on recent multi- and manycore processors. Using low-level instruction analysis and the execution-cache-memory (ECM) performance model we pinpoint the relevant performance bottlenecks for single-core and thread-parallel execution, and predict performance and saturation behavior. We show that the Kahan-enhanced scalar product comes at almost no additional cost compared to the naive (non-Kahan) scalar product if appropriate low-level optimizations, notably SIMD vectorization and unrolling, are applied. The ECM model is extended appropriately to accommodate not only modern Intel multicore chips but also the Intel Xeon Phi "Knights Corner" coprocessor and an IBM POWER8 CPU. This allows us to discuss the impact of processor features on the performance across four modern architectures that are relevant for high performance computing., Comment: 15 pages, 10 figures
Published: 2016

44. Analysis of Intel's Haswell Microarchitecture Using The ECM Model and Microbenchmarks

Author: Hofmann, Johannes, Fey, Dietmar, Eitzinger, Jan, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture
Abstract: This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined are the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements such as new and improved execution units, and improvements throughout the memory hierarchy. The Execution-Cache-Memory diagnostic performance model is used together with a generic set of microbenchmarks to quantify the efficiency of the microarchitecture. The set of microbenchmarks is chosen such that it can serve as a blueprint for other streaming loop kernels.
Comment: arXiv admin note: substantial text overlap with arXiv:1509.03118
Published: 2015

45. Optimization of an electromagnetics code with multicore wavefront diamond blocking and multi-dimensional intra-tile parallelization

Author: Malas, Tareq M., Hornich, Julian, Hager, Georg, Ltaief, Hatem, Pflaum, Christoph, and Keyes, David E.
Subjects: Computer Science - Computational Engineering, Finance, and Science, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Understanding and optimizing the properties of solar cells is becoming a key issue in the search for alternatives to nuclear and fossil energy sources. A theoretical analysis via numerical simulations involves solving Maxwell's Equations in discretized form and typically requires substantial computing effort. We start from a hybrid-parallel (MPI+OpenMP) production code that implements the Time Harmonic Inverse Iteration Method (THIIM) with Finite-Difference Frequency Domain (FDFD) discretization. Although this algorithm has the characteristics of a strongly bandwidth-bound stencil update scheme, it is significantly different from the popular stencil types that have been exhaustively studied in the high performance computing literature to date. We apply a recently developed stencil optimization technique, multicore wavefront diamond tiling with multi-dimensional cache block sharing, and describe in detail the peculiarities that need to be considered due to the special stencil structure. Concurrency in updating the components of the electric and magnetic fields provides an additional level of parallelism. The dependence of the cache size requirement of the optimized code on the blocking parameters is modeled accurately, and an auto-tuner searches for optimal configurations in the remaining parameter space. We were able to completely decouple the execution from the memory bandwidth bottleneck, accelerating the implementation by a factor of three to four compared to an optimal implementation with pure spatial blocking on an 18-core Intel Haswell CPU.
Published: 2015

46. Multi-dimensional intra-tile parallelization for memory-starved stencil computations

Author: Malas, Tareq, Hager, Georg, Ltaief, Hatem, and Keyes, David
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Optimizing the performance of stencil algorithms has been the subject of intense research over the last two decades. Since many stencil schemes have low arithmetic intensity, most optimizations focus on increasing the temporal data access locality, thus reducing the data traffic through the main memory interface with the ultimate goal of decoupling from this bottleneck. There are, however, only a few approaches that explicitly leverage the shared cache feature of modern multicore chips. If every thread works on its private, separate cache block, the available cache space can become too small, and sufficient temporal locality may not be achieved. We propose a flexible multi-dimensional intra-tile parallelization method for stencil algorithms on multicore CPUs with a shared outer-level cache. This method leads to a significant reduction in the required cache space without adverse effects from hardware prefetching or TLB shortage. Our Girih framework includes an auto-tuner to select optimal parameter configurations on the target hardware. We conduct performance experiments on two contemporary Intel processors and compare with the state-of-the-art stencil frameworks PLUTO and Pochoir, using four corner-case stencil schemes and a wide range of problem sizes. Girih shows substantial performance advantages and best arithmetic intensity at almost all problem sizes, especially on low-intensity stencils with variable coefficients. We study in detail the performance behavior at varying grid size using phenomenological performance modeling. Our analysis of energy consumption reveals that our method can save energy by reduced DRAM bandwidth usage even at marginal performance gain. It is thus well suited for future architectures that will be strongly challenged by the cost of data movement, be it in terms of performance or energy consumption.
Published: 2015

47. High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations

Author: Pieper, Andreas, Kreutzer, Moritz, Alvermann, Andreas, Galgon, Martin, Fehske, Holger, Hager, Georg, Lang, Bruno, and Wellein, Gerhard
Subjects: Mathematics - Numerical Analysis, Condensed Matter - Mesoscale and Nanoscale Physics, Computer Science - Numerical Analysis
Abstract: We study Chebyshev filter diagonalization as a tool for the computation of many interior eigenvalues of very large sparse symmetric matrices. In this technique the subspace projection onto the target space of wanted eigenvectors is approximated with filter polynomials obtained from Chebyshev expansions of window functions. After the discussion of the conceptual foundations of Chebyshev filter diagonalization we analyze the impact of the choice of the damping kernel, search space size, and filter polynomial degree on the computational accuracy and effort, before we describe the necessary steps towards a parallel high-performance implementation. Because Chebyshev filter diagonalization avoids the need for matrix inversion it can deal with matrices and problem sizes that are presently not accessible with rational function methods based on direct or iterative linear solvers. To demonstrate the potential of Chebyshev filter diagonalization for large-scale problems of this kind we include as an example the computation of the $10^2$ innermost eigenpairs of a topological insulator matrix with dimension $10^9$ derived from quantum physics applications.
Comment: 31 pages, 14 figures, 4 tables. Extended version
Published: 2015

48. Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

Author: Hammer, Julian, Hager, Georg, Eitzinger, Jan, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling.
Comment: 11 pages, 4 figures, 8 listings
Published: 2015

49. GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

Author: Kreutzer, Moritz, Thies, Jonas, Röhrig-Zöllner, Melven, Pieper, Andreas, Shahzad, Faisal, Galgon, Martin, Basermann, Achim, Fehske, Holger, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Mathematical Software
Abstract: While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.
Comment: 32 pages, 11 figures
Published: 2015

50. Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero

Author: Wittmann, Markus, Zeiser, Thomas, Hager, Georg, and Wellein, Gerhard
Subjects: Computer Science - Performance
Abstract: Simple floating-point operations like addition or multiplication on normalized floating-point values can be computed by current AMD and Intel processors in three to five cycles. This is different for denormalized numbers, which appear when an underflow occurs and the value can no longer be represented as a normalized floating-point value. Here the costs are about two orders of magnitude higher.
Published: 2015
