1. Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods
- Author
- George Biros, Andreas Gerstlauer, Dhairya Malhotra, Lizy K. John, Mochamad Asri, and Jiajun Wang
- Subjects
- Distributed computing, Random access memory, Computer science, Pipeline (computing), Supercomputer, Idle, Software pipelining, Software, Computational Theory and Mathematics, Application-specific integrated circuit, Hardware and Architecture, Embedded system, Signal Processing, x86, Hardware acceleration, System on a chip, DRAM
- Abstract
In this article, we study the performance and energy-saving benefits of hardware acceleration under different hardware configurations and usage scenarios for a state-of-the-art Fast Multipole Method (FMM), a popular N-body method. We use a dedicated Application-Specific Integrated Circuit (ASIC) to accelerate General Matrix-Matrix Multiply (GEMM) operations. FMM is widely used in applications and is a representative example of the workloads found in many HPC applications. We compare architectures that integrate the GEMM ASIC next to, in, or near main memory against an on-chip coupling aimed at minimizing or avoiding repeated round-trip transfers through DRAM for communication between the accelerator and the CPU. We study the tradeoffs using detailed and accurately calibrated x86 CPU, accelerator, and DRAM simulations. Our results show that simply moving accelerators closer to the chip does not necessarily lead to performance or energy gains. We demonstrate that, while careful software blocking and on-chip placement optimizations can reduce DRAM accesses by 2X over a naive on-chip integration, these dramatic savings in DRAM traffic do not automatically translate into significant total energy or runtime savings. This is chiefly due to the application's characteristics, high idle power, and the effective hiding of memory latencies in modern systems. Only when more aggressive co-optimizations such as software pipelining and overlapping are applied can additional performance and energy savings of 37 and 35 percent, respectively, be unlocked over baseline acceleration. When similar optimizations (pipelining and overlapping) are applied to an off-chip integration, on-chip integration still delivers up to 20 percent better performance and 17 percent lower total energy consumption.
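To make the key co-optimization concrete, the sketch below illustrates the software-pipelining/overlapping idea described in the abstract: while the (stand-in) accelerator computes GEMM on tile i, the host stages tile i+1 on a background thread, hiding transfer latency behind compute. This is a minimal Python sketch, not the authors' simulator or ASIC interface; the names stage_tile, gemm_on_accelerator, and TILE are hypothetical, and plain NumPy plus a worker thread stand in for the accelerator and its DMA engine.

```python
# Minimal double-buffered pipelining sketch (hypothetical names throughout):
# staging of the next operand tile overlaps with "accelerator" compute.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

TILE = 256  # hypothetical blocking factor, e.g., sized to accelerator-local SRAM


def stage_tile(a, b, i):
    """Host-side staging: copy the operand tiles for step i.

    The copies stand in for the CPU -> accelerator DMA transfer.
    """
    return a[i * TILE:(i + 1) * TILE].copy(), b.copy()


def gemm_on_accelerator(a_tile, b_tile):
    """Stand-in for the offloaded GEMM; here plain NumPy on the host."""
    return a_tile @ b_tile


def pipelined_gemm(a, b):
    """Compute C = A @ B tile by tile, overlapping staging with compute."""
    n_tiles = a.shape[0] // TILE
    c_tiles = []
    with ThreadPoolExecutor(max_workers=1) as staging:
        # Prime the pipeline: stage tile 0 before the loop starts.
        next_tiles = staging.submit(stage_tile, a, b, 0)
        for i in range(n_tiles):
            a_tile, b_tile = next_tiles.result()
            if i + 1 < n_tiles:
                # Overlap: stage tile i+1 on the worker thread ...
                next_tiles = staging.submit(stage_tile, a, b, i + 1)
            # ... while the main thread computes on tile i.
            c_tiles.append(gemm_on_accelerator(a_tile, b_tile))
    return np.vstack(c_tiles)


if __name__ == "__main__":
    a = np.random.rand(4 * TILE, TILE)
    b = np.random.rand(TILE, TILE)
    assert np.allclose(pipelined_gemm(a, b), a @ b)
```

Without the overlap, each step would serialize transfer and compute; with it, steady-state runtime is bounded by the slower of the two stages rather than their sum, which is the effect the abstract's pipelining/overlapping gains rely on.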
- Published
- 2021