7 results for "Wellein, Gerhard"
Search Results
2. Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs.
- Author
- Anzt, Hartwig, Kreutzer, Moritz, Ponce, Eduardo, Peterson, Gregory D., Wellein, Gerhard, and Dongarra, Jack
- Subjects
- MATHEMATICAL optimization, GRAPHICS processing units, KERNEL functions, SPARSE matrices, DIMENSION reduction (Statistics), COMPUTER algorithms
- Abstract
- In this paper, we present an optimized GPU implementation of the induced dimension reduction (IDR) algorithm. We improve data locality, combine it with an efficient sparse matrix-vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent kernel execution. A comprehensive performance evaluation is conducted using a suitable performance model. The analysis reveals efficiency of up to 90%, which indicates that the implementation achieves performance close to the theoretically attainable bound.
- Published
- 2018
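For memory-bound Krylov kernels such as those in IDR, the "suitable performance model" mentioned above is typically a bandwidth-based (roofline-style) bound. The sketch below shows how such a bound and the resulting efficiency figure are computed; all hardware numbers and the measured rate are illustrative placeholders, not values from the paper.

```c
/* Minimal sketch of a bandwidth-based (roofline-style) performance bound.
 * All numbers are illustrative placeholders, not measurements. */
#include <stdio.h>

int main(void) {
    /* Assumed device characteristics (hypothetical GPU). */
    double peak_flops = 1.2e12;   /* peak double-precision rate [flop/s] */
    double bandwidth  = 200e9;    /* attainable memory bandwidth [byte/s] */

    /* Arithmetic intensity of a memory-bound kernel such as spMVM:
     * roughly 2 flops (multiply+add) per nonzero, with ~12 bytes of
     * traffic per nonzero (8-byte value + 4-byte column index). */
    double intensity = 2.0 / 12.0;                 /* [flop/byte] */

    double attainable = bandwidth * intensity;     /* bandwidth-bound limit */
    if (attainable > peak_flops) attainable = peak_flops;

    double measured = 30e9;                        /* hypothetical measurement */
    printf("attainable: %.2f Gflop/s\n", attainable / 1e9);
    printf("efficiency: %.1f %%\n", 100.0 * measured / attainable);
    return 0;
}
```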
3. GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems.
- Author
- Kreutzer, Moritz, Thies, Jonas, Röhrig-Zöllner, Melven, Pieper, Andreas, Shahzad, Faisal, Galgon, Martin, Basermann, Achim, Fehske, Holger, Hager, Georg, and Wellein, Gerhard
- Subjects
- SPARSE matrices, LINEAR algebra, GRAPHICS processing units, HETEROGENEOUS computing, SOFTWARE libraries (Computer programming)
- Abstract
- While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring 'standard' as well as 'accelerated' resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The 'General, Hybrid, and Optimized Sparse Toolkit' (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the 'MPI+X' paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out, and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source.
- Published
- 2017
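GHOST's programming model is 'MPI+X'. The sketch below illustrates that paradigm in its plainest form, MPI across processes plus OpenMP threads within each process, using a distributed dot product. This is generic MPI+OpenMP code for illustration, not GHOST's actual C interface.

```c
/* Generic sketch of the 'MPI+X' execution model: MPI between processes,
 * OpenMP threading within each process. Not GHOST's API. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* Request threading support suitable for hybrid execution. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Thread-parallel local work: partial dot product of two vectors. */
    enum { N = 1000000 };
    static double x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; ++i)
        local += x[i] * y[i];

    /* Process-level reduction across the MPI communicator. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global dot product: %g\n", global);
    MPI_Finalize();
    return 0;
}
```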
4. Increasing the Performance of the Jacobi-Davidson Method by Blocking.
- Author
- Röhrig-Zöllner, Melven, Thies, Jonas, Kreutzer, Moritz, Alvermann, Andreas, Pieper, Andreas, Basermann, Achim, Hager, Georg, Wellein, Gerhard, and Fehske, Holger
- Subjects
- JACOBI-Davidson method, EIGENFUNCTIONS, SPARSE matrices, EIGENVALUES, HIGH performance computing, CENTRAL processing units, NUMERICAL analysis
- Abstract
- Block variants of the Jacobi-Davidson method for computing a few eigenpairs of a large sparse matrix are known to improve the robustness of the standard algorithm when it comes to computing multiple or clustered eigenvalues. In practice, however, they are typically avoided because the total number of matrix-vector operations increases. In this paper we present the implementation of a block Jacobi-Davidson solver. By detailed performance engineering and numerical experiments we demonstrate that the increase in operations is typically more than compensated by performance gains through better cache usage on modern CPUs, resulting in a method that is both more efficient and more robust than its single-vector counterpart. The steps to be taken to achieve a block speedup involve both kernel optimizations for sparse matrix and block vector operations, and algorithmic choices that allow using blocked operations in most parts of the computation. We discuss the aspect of avoiding synchronization in the algorithm and show by numerical experiments with our hybrid parallel implementation that a significant speedup through blocking can be achieved for a variety of matrices on up to 5120 CPU cores, as long as at least about 20 eigenpairs are sought.
- Published
- 2015
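The performance gain from blocking comes largely from replacing nb separate sparse matrix-vector products with one sparse matrix times block-of-vectors kernel, so the matrix is streamed from memory once instead of nb times. A minimal CSR-based sketch of that kernel idea follows; it is a hypothetical illustration, not the authors' optimized implementation.

```c
/* Sparse matrix times a block of nb vectors (CSR format). Each matrix
 * entry is loaded once and reused for all nb vectors, which is the cache
 * and bandwidth advantage of blocking. Row-major block vectors keep the
 * nb values belonging to one row contiguous in memory. */
void spmmv_csr(int n, int nb,
               const int *rowptr, const int *col, const double *val,
               const double *X,   /* n x nb, row-major */
               double *Y)         /* n x nb, row-major */
{
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < nb; ++k)
            Y[i * nb + k] = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j) {
            double a = val[j];                  /* loaded once ...        */
            const double *x = &X[col[j] * nb];
            for (int k = 0; k < nb; ++k)
                Y[i * nb + k] += a * x[k];      /* ... used for nb vectors */
        }
    }
}
```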
5. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units.
- Author
- Kreutzer, Moritz, Hager, Georg, Wellein, Gerhard, Fehske, Holger, and Bishop, Alan R.
- Subjects
- SPARSE matrices, VECTORS (Calculus), MULTIPLICATION, INFORMATION retrieval, SIMD (Computer architecture), GRAPHICS processing units, COMPUTER programming, NETWORK performance
- Abstract
- Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-σ, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units and vector computer programming. We discuss the advantages of SELL-C-σ compared to established formats like Compressed Row Storage and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-σ spMVM kernel. SELL-C-σ comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
- Published
- 2014
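A minimal sketch of an spMVM kernel over a SELL-C-σ-like layout, under the assumptions stated in the comments: rows are packed in chunks of C, each chunk is padded to its longest row with zero-valued entries, and values are stored column-major within a chunk so the loop over the C rows maps onto SIMD lanes. The sorting of σ-row windows by row length, which keeps padding small, is a construction-time step omitted here.

```c
/* spMVM over a SELL-C-sigma-like layout (hypothetical sketch).
 * Assumptions: the row count equals nchunks*C; y is zeroed on entry;
 * padded entries store val 0.0 and a valid column index (e.g., 0), so
 * they contribute nothing to the result. */
void spmv_sell_c(int nchunks, int C,
                 const int *chunkptr,    /* start offset of each chunk      */
                 const int *chunklen,    /* padded row length of each chunk */
                 const double *val, const int *col,
                 const double *x, double *y)
{
    for (int c = 0; c < nchunks; ++c) {
        int off = chunkptr[c];
        for (int j = 0; j < chunklen[c]; ++j) {
            /* One 'column' of the chunk: C consecutive values, one per
             * row. This innermost loop vectorizes cleanly. */
            for (int i = 0; i < C; ++i) {
                int idx = off + j * C + i;
                y[c * C + i] += val[idx] * x[col[idx]];
            }
        }
    }
}
```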
6. Sparse Matrix-vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation.
- Author
- Kreutzer, Moritz, Hager, Georg, Wellein, Gerhard, Fehske, Holger, Basermann, Achim, and Bishop, Alan R.
- Abstract
- Sparse matrix-vector multiplication (spMVM) is the dominant operation in many sparse solvers. We investigate performance properties of spMVM with matrices of various sparsity patterns on the Nvidia "Fermi" class of GPGPUs. A new "padded jagged diagonals storage" (pJDS) format is proposed which may substantially reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme while making no assumptions about the matrix structure. In our test scenarios the pJDS format cuts the overall spMVM memory footprint on the GPGPU by up to 70%, and achieves 91% to 130% of the ELLPACK-R performance. Using a suitable performance model we identify node-level performance bottlenecks that render some types of matrix structures unsuitable for efficient multi-GPGPU parallelization. For appropriate sparsity patterns we extend previous work on distributed-memory parallel spMVM to demonstrate a scalable hybrid MPI-GPGPU code, achieving efficient overlap of communication and computation.
- Published
- 2012
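The footprint reduction relative to ELLPACK-R stems from where padding is applied: ELLPACK pads every row to the global maximum row length, while pJDS sorts rows by descending length and pads only within small blocks of rows. A hypothetical helper that makes this difference concrete (not the paper's code):

```c
/* Compare stored entries (including padding) for ELLPACK-style padding
 * versus a pJDS-like scheme, given the row lengths and a block height. */
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b) {
    return *(const int *)b - *(const int *)a;
}

void padding_footprint(int n, const int *rowlen, int block,
                       long *ellpack, long *pjds)
{
    int max = 0;
    for (int i = 0; i < n; ++i)
        if (rowlen[i] > max) max = rowlen[i];
    *ellpack = (long)n * max;          /* every row padded to global max */

    int *sorted = malloc(n * sizeof *sorted);
    for (int i = 0; i < n; ++i) sorted[i] = rowlen[i];
    qsort(sorted, n, sizeof *sorted, cmp_desc);   /* jagged-diagonal order */

    *pjds = 0;
    for (int b = 0; b < n; b += block) {
        /* After descending sort, the first row of a block is its longest,
         * so each block is padded only to that local maximum. */
        int rows = (b + block <= n) ? block : n - b;
        *pjds += (long)rows * sorted[b];
    }
    free(sorted);
}
```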
7. A Survey of Checkpoint/Restart Techniques on Distributed Memory Systems.
- Author
- Shahzad, Faisal, Wittmann, Markus, Kreutzer, Moritz, Zeiser, Thomas, Hager, Georg, and Wellein, Gerhard
- Subjects
- HIGH performance computing, MEAN time between failure, FAULT tolerance (Engineering), LATTICE Boltzmann methods, SPARSE matrices
- Abstract
- The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism in the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of these basic components does not increase at the same rate as their parallelism, which reduces the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable for running large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it attractive, but it typically introduces significant overhead for the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare the overheads of various C/R techniques by implementing them in two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We use the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for the overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver serve as prototypical applications in which all the techniques considered here may be applied.
- Published
- 2013
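The baseline in the comparison above is naive, synchronous, application-level checkpointing to the parallel file system. A minimal sketch of that baseline, with one file per MPI rank; the file naming and the 'state' payload are illustrative assumptions, not the paper's code.

```c
/* Naive synchronous application-level checkpoint to a PFS: every rank
 * writes its own state file, then all ranks synchronize so the checkpoint
 * set is globally consistent before computation resumes. */
#include <mpi.h>
#include <stdio.h>

void write_checkpoint(int rank, int step, const double *state, int n) {
    char name[64];
    /* Hypothetical naming scheme: one file per step and rank. */
    snprintf(name, sizeof name, "ckpt_step%06d_rank%04d.bin", step, rank);
    FILE *f = fopen(name, "wb");
    if (!f) MPI_Abort(MPI_COMM_WORLD, 1);
    fwrite(state, sizeof(double), (size_t)n, f);
    fclose(f);
    /* The 'synchronous' part: compute resumes only after all I/O is done. */
    MPI_Barrier(MPI_COMM_WORLD);
}
```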