1. Exploiting vector code semantics for efficient data cache prefetching
- Author
- Martínez Palau, Francesc; Torrents Lapuerta, Martí; Armejach Sanosa, Adrià; Casas, Marc (Universitat Politècnica de Catalunya, Departament d'Arquitectura de Computadors; Barcelona Supercomputing Center)
- Abstract
Emerging workloads from domains like high-performance computing, data analytics, or deep learning consume large amounts of memory bandwidth. To mitigate this problem, computing systems include large and deep cache hierarchies that exploit both spatial and temporal locality. In this context, hardware data cache prefetching is a useful method to anticipate cache misses and boost performance. Despite their high coverage rates, current data cache prefetchers issue a significant number of late and sometimes useless prefetches. Additionally, these state-of-the-art prefetchers are not aware of architecture trends towards larger vector units and vector-length agnostic instruction sets. This paper demonstrates that these trends bring new prefetching opportunities that make it possible to increase the accuracy and timeliness of any state-of-the-art prefetcher at a negligible area cost. We propose the Register Vector Length Agnostic (ReVeLA) prefetcher, which exploits program semantics present in vectorized codes. ReVeLA complements existing data cache prefetchers by providing highly accurate prefetch requests that improve prefetching timeliness and accuracy without significantly increasing memory bandwidth consumption. When applied on top of a state-of-the-art out-of-order vector processor, ReVeLA delivers a speed-up of 1.23× with respect to a system without any prefetching approach. When combined with the NextLine, BOP, SPP, and PPF prefetchers, ReVeLA improves performance by 6.57%, 4.46%, 11.83%, and 11.40%, respectively, with respect to a vector processor equipped with these prefetching approaches. Additionally, our evaluation demonstrates that ReVeLA increases memory bandwidth consumption by only 3.74% when combined with the best-performing data cache prefetcher of our experimental campaign.
- Funding
This work has been partially supported by the Spanish Ministry of Science and Innovation MCIN/AEI/10.13039/501100011033 (contract PID2019-107255GB-C21) and ESF Investing in your future, the Generalitat de Catalunya (contract 2021-SGR-00763), the European HiPEAC Network of Excellence, and the European Processor Initiative (EPI), which is part of the European Union's Horizon 2020 research and innovation programme under grant agreement No. 826647. A. Armejach is a Serra Hunter Fellow. The authors thank the Departament de Recerca i Universitats de la Generalitat de Catalunya for supporting the Research Group "Performance understanding, analysis, and simulation/emulation of novel architectures" (Code: 2021 SGR 00865).
- Notes
Peer reviewed. Postprint (author's final draft).
- Published
- 2024
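- Illustration
As context for the abstract, the sketch below shows what a vector-length agnostic loop looks like in practice. It is written with Arm SVE ACLE intrinsics purely as an illustration of the kind of vectorized memory accesses whose semantics (vector length, predicate, and unit-stride addresses carried by vector loads and stores) a vector-aware prefetcher such as ReVeLA could exploit; the function name and code are hypothetical examples and are not taken from the paper.

```c
#include <arm_sve.h>
#include <stddef.h>

/* Hypothetical example: a vector-length agnostic DAXPY (y[i] += a * x[i]).
 * Each iteration issues unit-stride vector loads and a vector store whose
 * width equals the hardware vector length, so the access pattern of the
 * loop is fully described by the vector memory instructions themselves. */
void daxpy_vla(double *restrict y, const double *restrict x,
               double a, size_t n)
{
    for (size_t i = 0; i < n; i += svcntd()) {      /* advance by one vector of doubles */
        svbool_t pg = svwhilelt_b64(i, n);           /* predicate covers the loop tail  */
        svfloat64_t vx = svld1_f64(pg, &x[i]);       /* unit-stride vector load of x    */
        svfloat64_t vy = svld1_f64(pg, &y[i]);       /* unit-stride vector load of y    */
        vy = svmla_n_f64_x(pg, vy, vx, a);           /* vy = vy + vx * a                */
        svst1_f64(pg, &y[i], vy);                    /* unit-stride vector store of y   */
    }
}
```

Because each vector memory instruction in such code touches a full, predictable span of consecutive cache lines, a prefetcher that observes these instructions can anticipate the next spans accurately; this is the general opportunity the abstract refers to, not a description of ReVeLA's internal mechanism.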