1. Parallel spherical harmonic transforms on heterogeneous architectures (graphics processing units/multi-core CPUs)
- Author
- Pierre Esterie, Laura Grigori, Mikolaj Szydlarski, Radek Stompor, and Joel Falcou
- Subjects
- Multi-core processor, Computer Networks and Communications, Computer science, Message Passing Interface, Parallel algorithm, 010103 numerical & computational mathematics, GPU cluster, Parallel computing, Supercomputer, 01 natural sciences, Porting, Computer Science Applications, Theoretical Computer Science, CUDA, Computational Theory and Mathematics, 0103 physical sciences, Scalability, 0101 mathematics, Graphics, 010303 astronomy & astrophysics, Software
- Abstract
- Spherical harmonic transforms (SHT) are at the heart of many scientific and practical applications ranging from climate modelling to cosmological observations. In many of these areas, new cutting-edge science goals have recently been proposed that require simulations and analyses of experimental or observational data at very high resolutions and of unprecedented volumes. Both these aspects pose a formidable challenge for the currently existing implementations of the transforms. This paper describes parallel algorithms for computing SHT with two variants of intra-node parallelism appropriate for novel supercomputer architectures: multi-core processors and graphics processing units (GPUs). It also discusses their performance, alone and embedded within a top-level, Message Passing Interface (MPI)-based parallelisation layer ported from the S2HAT library, in terms of their accuracy, overall efficiency and scalability. We show that our inverse SHT run on GeForce 400 Series GPUs equipped with the latest Compute Unified Device Architecture (CUDA) architecture, Fermi, outperforms the state-of-the-art implementation for a multi-core processor executed on a current Intel Core i7-2600K. Furthermore, we show that an MPI/CUDA version of the inverse transform run on a cluster of 128 Nvidia Tesla S1070 is as much as 3 times faster than the hybrid MPI/OpenMP version executed on the same number of quad-core Intel Nehalem processors for problem sizes motivated by our target applications. Performance of the direct transforms is, however, found to be at best comparable in these cases. We discuss in detail the algorithmic solutions devised for the major steps involved in the transforms calculation, emphasising those with a major impact on their overall performance, and elucidate the sources of the dichotomy between the direct and the inverse operations. Copyright © 2013 John Wiley & Sons, Ltd.
- Published
- 2013