Fast static analysis for compile-time restructuring of application parallelism on Graphics Processing Units
- Publication Year :
- 2019
- Publisher :
- Imperial College London, 2019.
Abstract
- Parallelism is everywhere, with co-processors such as Graphics Processing Units (GPUs) accelerating applications such as training deep-learning neural networks, climate forecasting, bitcoin mining, medical imaging, and data analytics, on platforms ranging from desktop computers and mobile phones to cloud computing and high-performance clusters. Code optimisations make it possible to realise the available performance of such devices, and automating these optimisations enables performance portability of software between different architectures. In this thesis, we consider two code optimisations that can improve application performance by reducing the degree of hardware and software parallelism in a program execution: thread coarsening, which merges threads to reduce the number of threads launched, and artificial occupancy reduction, which limits the number of threads processed simultaneously by allocating superfluous resources.

We show how occupancy prediction through re-compilation enables the selection of near-optimal coarsening factors at compile time, so that thread coarsening can be applied in a fully automated manner without requiring auto-tuning. We demonstrate that our approach achieves a maximum speedup of 5.08x (1.30x average) across three different NVIDIA GPU architectures, two modes of coarsening, different problem sizes, and code pre-optimised to different degrees.

When predicting the likely effects of thread coarsening, it is important to consider its impact on cache pressure. We describe how a fast static analysis based on partial symbolic execution can be implemented to identify cache line re-use in programs. We demonstrate that this heuristic approach improves on the runtime and memory requirements of a more extensive re-use distance analysis by several orders of magnitude, making it light-weight enough to be executed at run time. We show that the analysis is able to identify kernels that are likely to experience an increase in cache pressure after coarsening.

Finally, we explore the interaction of thread coarsening and artificial occupancy reduction, which can have negative effects on cache pressure and processor workload, respectively. We show that the two optimisation techniques can cancel these effects out when applied in combination, yielding a performance improvement of 8% in some cases. We also investigate whether the cache line re-use analysis can identify candidates for artificial occupancy reduction.
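As a rough illustration of the first optimisation, the sketch below contrasts a baseline GPU kernel with a thread-coarsened variant. It is written in CUDA purely for illustration; the kernel names, the coarsening factor of 4, and the launch configuration are assumptions for this sketch and are not taken from the thesis, which also distinguishes two modes of coarsening that are not differentiated here.

```cuda
// Minimal sketch of thread coarsening for a vector-add kernel.
// Assumed illustrative parameters: coarsening factor CF = 4, 256 threads per block.
#include <cstdio>
#include <cuda_runtime.h>

// Baseline kernel: one thread handles one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Coarsened kernel: each thread handles CF consecutive elements,
// so the kernel can be launched with CF times fewer threads.
#define CF 4
__global__ void vecAddCoarsened(const float *a, const float *b, float *c, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * CF;
    #pragma unroll
    for (int k = 0; k < CF; ++k) {
        int i = base + k;
        if (i < n) c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    // Baseline launch: one thread per element.
    vecAdd<<<(n + threads - 1) / threads, threads>>>(a, b, c, n);
    // Coarsened launch: CF times fewer threads; computes the same result.
    vecAddCoarsened<<<(n + threads * CF - 1) / (threads * CF), threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expected: 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The second optimisation, artificial occupancy reduction, is described in the abstract as allocating superfluous resources so that fewer threads are processed simultaneously. One hypothetical way to do this, sketched below as a drop-in replacement for the baseline kernel above, is an otherwise unused static shared-memory allocation; the 32 KiB size is an arbitrary example and not a value from the thesis.

```cuda
// Hypothetical illustration of artificial occupancy reduction: the padding
// array consumes shared memory per block, so fewer blocks fit on each
// streaming multiprocessor and fewer threads run concurrently.
__global__ void vecAddLowOccupancy(const float *a, const float *b, float *c, int n) {
    __shared__ float occupancy_pad[8 * 1024];   // superfluous 32 KiB allocation
    if (threadIdx.x == 0) occupancy_pad[0] = 0.0f;  // token use so the allocation is not elided
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```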
- Subjects :
- 004
Details
- Language :
- English
- Database :
- British Library EThOS
- Publication Type :
- Dissertation/Thesis
- Accession number :
- edsble.788962
- Document Type :
- Electronic Thesis or Dissertation
- Full Text :
- https://doi.org/10.25560/73865