Author: "Shalf, John" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shalf, John"' showing total 839 results

1. Comparison of Vectorization Capabilities of Different Compilers for X86 and ARM CPUs

Author: Sakib, Nazmus, Prabhu, Tarun, Santhi, Nandakishore, Shalf, John, and Badawy, Abdel-Hameed A.
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Most modern processors contain vector units that simultaneously perform the same arithmetic operation over multiple sets of operands. The ability of compilers to automatically vectorize code is critical to effectively using these units. Understanding this capability is important for anyone writing compute-intensive, high-performance, and portable code. We tested the ability of several compilers to vectorize code on x86 and ARM. We used the TSVC2 suite, with modifications that made it more representative of real-world code. On x86, GCC reported 54% of the loops in the suite as having been vectorized, ICX reported 50%, and Clang, 46%. On ARM, GCC reported 56% of the loops as having been vectorized, ACFL reported 54%, and Clang, 47%. We found that the vectorized code did not always outperform the unvectorized code. In some cases, given two very similar vectorizable loops, a compiler would vectorize one but not the other. We also report cases where a compiler vectorized a loop on only one of the two platforms. Based on our experiments, we cannot definitively say if any one compiler is significantly better than the others at vectorizing code on any given platform.
Comment: IEEE HPEC 2024
Published: 2025

2. Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

Author: Michelogiannakis, George, Arafa, Yehia, Cook, Brandon, Dai, Liang Yuan, Badawy, Abdel-Hameed, Glick, Madeleine, Wang, Yuyang, Bergman, Keren, and Shalf, John
Subjects: Information and Computing Sciences, Engineering, Electronics, Sensors and Digital Hardware
Abstract: The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4× fewer memory modules and 2× fewer NICs than a non-disaggregated baseline.
Published: 2023

3. Preparing for the Future -- Rethinking Proxy Apps

Author: Matsuoka, Satoshi, Domke, Jens, Wahib, Mohamed, Drozd, Aleksandr, Bair, Ray, Chien, Andrew A., Vetter, Jeffrey S., and Shalf, John
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center to develop their own benchmarks. Unfortunately, proxy applications force HPC centers and providers (vendors) into an undesirable state of rigidity, in contrast to the fast-moving trends of current technology and future heterogeneity. To accommodate an extremely heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade, and avoid repeating the past mistakes. This position paper outlines the current state-of-the-art in system co-design, challenges encountered over the past years, and a proposed plan to move forward.
Published: 2022

4. ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM

Author: Zhang, Chao, Bremer, Maximilian, Chan, Cy, Shalf, John, and Guo, Xiaochen
Subjects: SpGEMM, sparse accumulation, sparse linear algebra, Markov clustering, Computer Software, Electrical and Electronic Engineering
Abstract: Sparse linear algebra is an important kernel in many different applications. Among various sparse general matrix-matrix multiplication (SpGEMM) algorithms, Gustavson's column-wise SpGEMM has good locality when reading the input matrix and can be easily parallelized by distributing the computation of different columns of an output matrix to different processors. However, the sparse accumulation (SPA) step in column-wise SpGEMM, which merges partial sums from each of the multiplications by the row indices, is still a performance bottleneck. The state-of-the-art software implementation uses a hash table for partial sum search in the SPA, which makes SPA the largest contributor to the execution time of SpGEMM. There are three reasons that cause the SPA to become the bottleneck: (1) hash probing requires data-dependent branches that are difficult for a branch predictor to predict correctly; (2) the accumulation of partial sums is dependent on the results of the hash probing, which makes it difficult to hide the hash probing latency; and (3) hash collision requires time-consuming linear search, and optimizations to reduce these collisions require an accurate estimation of the number of non-zeros in each column of the output matrix. This work proposes the ASA architecture to accelerate the SPA. ASA overcomes the challenges of SPA by (1) executing the partial sum search and accumulate with a single instruction through ISA extension to eliminate data-dependent branches in hash probing, (2) using a dedicated on-chip cache to perform the search and accumulation in a pipelined fashion, (3) relying on the parallel search capability of a set-associative cache to reduce search latency, and (4) delaying the merging of overflowed entries. As a result, ASA achieves an average of 2.25× and 5.05× speedup as compared to the state-of-the-art software implementation of a Markov clustering application and its SpGEMM kernel, respectively. As compared to a state-of-the-art hashing accelerator design, ASA achieves an average of 1.95× speedup in the SpGEMM kernel.
Published: 2022
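
Note on result 4: the column-wise formulation and the hash-based sparse accumulation (SPA) step that the abstract describes can be illustrated with a minimal software sketch. This is only an illustration under stated assumptions, not the ASA hardware design or the authors' implementation: the function name `spgemm_columnwise` and the dict-of-columns matrix layout are hypothetical, and a Python dict stands in for the hash table used by software SPA implementations.

```python
def spgemm_columnwise(a_cols, b_cols):
    """Column-wise (Gustavson-style) sparse product C = A * B.

    Assumed layout (hypothetical, for illustration only): each matrix is a
    dict mapping a column index to a dict of {row index: value}.
    """
    c_cols = {}
    for j, b_col in b_cols.items():
        # Sparse accumulator (SPA) for output column j, keyed by row index.
        # A Python dict plays the role of the hash table here.
        spa = {}
        for k, b_kj in b_col.items():
            # Column k of A, scaled by B[k, j], contributes partial sums.
            for i, a_ik in a_cols.get(k, {}).items():
                # Search for row i and accumulate: the step the abstract
                # identifies as the bottleneck in software.
                spa[i] = spa.get(i, 0.0) + a_ik * b_kj
        if spa:
            c_cols[j] = spa
    return c_cols


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], stored by column.
A = {0: {0: 1.0}, 1: {0: 2.0, 1: 3.0}}
B = {0: {0: 4.0, 1: 5.0}, 1: {1: 6.0}}
C = spgemm_columnwise(A, B)
```

Each output column is computed independently, which is why the abstract notes that the method parallelizes easily across columns; the inner search-and-accumulate on `spa` is the part a hardware accelerator would target.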

5. A Case For Intra-rack Resource Disaggregation in HPC

Author: Michelogiannakis, George, Klenk, Benjamin, Cook, Brandon, Teh, Min Yee, Glick, Madeleine, Dennison, Larry, Bergman, Keren, and Shalf, John
Subjects: Distributed Computing and Systems Software, Information and Computing Sciences, Engineering, Electronics, Sensors and Digital Hardware, Disaggregation, HPC, utilization, memory, LDMS, Computer Software, Electrical and Electronic Engineering, Electronics, sensors and digital hardware, Distributed computing and systems software
Abstract: The expected halt of traditional technology scaling is motivating increased heterogeneity in high-performance computing (HPC) systems with the emergence of numerous specialized accelerators. As heterogeneity increases, so does the risk of underutilizing expensive hardware resources if we preserve today's rigid node configuration and reservation strategies. This has sparked interest in resource disaggregation to enable finer-grain allocation of hardware resources to applications. However, there is currently no data-driven study of what range of disaggregation is appropriate in HPC. To that end, we perform a detailed analysis of key metrics sampled in NERSC's Cori, a production HPC system that executes a diverse open-science HPC workload. In addition, we profile a variety of deep-learning applications to represent an emerging workload. We show that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit with intra-rack disaggregation has a 99.5% probability to find all resources it requires inside its rack. In addition, ideal intra-rack resource disaggregation in Cori could reduce memory and NIC resources by 5.36% to 69.01% and still satisfy the worst-case average rack utilization.
Published: 2022

6. TIGER: Topology-aware Assignment using Ising machines Application to Classical Algorithm Tasks and Quantum Circuit Gates

Author: Butko, Anastasiia, Turimbetov, Ilyas, Michelogiannakis, George, Donofrio, David, Unat, Didem, and Shalf, John
Subjects: Computer Science - Emerging Technologies, Quantum Physics, C.3, D.0, H.0
Abstract: Optimally mapping a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar mapping problem exists in gate-based quantum computing where the objective is to map tasks to gates in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuristic or based on physical optimization algorithms, providing different speed and solution quality trade-offs. Ising machines such as quantum and digital annealers have recently become available and offer an alternative hardware solution to solve this type of optimization problem. In this paper, we propose an algorithm that allows solving the topology-aware assignment problem using Ising machines. We demonstrate the algorithm on two use cases, i.e., classical task scheduling and quantum circuit gate scheduling. TIGER (topology-aware task/gate assignment mapper tool) implements our proposed algorithms and automatically integrates them into the quantum software environment. To address the limitations of the physical solver, we propose and implement a domain-specific partition strategy that allows solving larger-scale problems and a weight optimization algorithm that allows tuning Ising model parameters to achieve better results. We use D-Wave's quantum annealer to demonstrate our algorithm and evaluate the proposed tool flow in terms of performance, partition efficiency, and solution quality. Results show significant speed-up compared to classical solutions, better scalability, and higher solution quality when using TIGER together with the proposed partition method. It reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer.
Comment: 15 pages, 10 figures
Published: 2020

7. Extreme Heterogeneity 2018 - Productive Computational Science in the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Heterogeneity

Author: Vetter, Jeffrey S, Brightwell, Ron, Gokhale, Maya, McCormick, Pat, Ross, Rob, Shalf, John, Antypas, Katie, Donofrio, David, Humble, Travis, Schuman, Catherine, Van Essen, Brian, Yoo, Shinjae, Aiken, Alex, Bernholdt, David, Byna, Suren, Cameron, Kirk, Cappello, Frank, Chapman, Barbara, Chien, Andrew, Hall, Mary, Hartman-Baker, Rebecca, Lan, Zhiling, Lang, Michael, Leidel, John, Li, Sherry, Lucas, Robert, Mellor-Crummey, John, Peltz Jr., Paul, Peterka, Thomas, Strout, Michelle, and Wilke, Jeremiah
Published: 2021

8. 2019 Computing Sciences Strategic Plan

Author: Yelick, Kathy, Agarwal, Deb, Bard, Debbie, Shalf, John, Almgren, Ann, Bhimji, Wahid, Brown, Ben, Carter, Jonathan, Jong, Bert, Doerfler, Doug, Donofrio, David, Guok, Chin, Iancu, Costin, Kiran, Mariam, Li, Sherry, Nugent, Peter, Prabhat, M, Ramakrishnan, Lavanya, Vasudevan, Dilip, Wright, Nick, Cademartori, Helen, Antypas, Katie, and Kincade, Kathy
Published: 2021

9. Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization

Author: Butko, Anastasiia, Michelogiannakis, George, Williams, Samuel, Iancu, Costin, Donofrio, David, Shalf, John, Carter, Jonathan, and Siddiqi, Irfan
Subjects: Quantum Physics
Abstract: Continuing the scaling of quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a methodology to quantitatively assess the effectiveness of the ISA to encode quantum circuits for intermediate-scale quantum devices with O($10^2$) qubits. The characterization model that we define reflects performance, the ability to meet timing constraint implications, scalability for future quantum chips, and other important considerations making them useful guides for future designs. Using our methodology, we propose scalar (QUASAR) and vector (qV) quantum ISAs as extensions and compare them with other ISAs in metrics such as circuit encoding efficiency, the ability to meet real-time gate cycle requirements of quantum chips, and the ability to scale to more qubits.
Comment: 10 pages, 8 figures
Published: 2019

10. Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization

Author: Butko, Anastasiia, Michelogiannakis, George, Williams, Samuel, Iancu, Costin, Donofrio, David, Shalf, John, Carter, Jonathan, and Siddiqi, Irfan
Subjects: Quantum Physics, Engineering, Electronics, Sensors and Digital Hardware, Physical Sciences, quantum control processor, ISA extension, RISC-V, quantum circuit characterization, specialized architecture, quant-ph
Abstract: Continuing the scaling of quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a methodology to quantitatively assess the effectiveness of the ISA to encode quantum circuits for intermediate-scale quantum devices with O($10^2$) qubits. The characterization model that we define reflects performance, the ability to meet timing constraint implications, scalability for future quantum chips, and other important considerations making them useful guides for future designs. Using our methodology, we propose scalar (QUASAR) and vector (qV) quantum ISAs as extensions and compare them with other ISAs in metrics such as circuit encoding efficiency, the ability to meet real-time gate cycle requirements of quantum chips, and the ability to scale to more qubits.
Published: 2020

11. PINE: Photonic Integrated Networked Energy efficient datacenters (ENLITENED Program) [Invited]

Author: Glick, Madeleine, Abrams, Nathan C, Cheng, Qixiang, Teh, Min Yee, Hung, Yu-Han, Jimenez, Oscar, Liu, Songtao, Okawachi, Yoshitomo, Meng, Xiang, Johansson, Leif, Ghobadi, Manya, Dennison, Larry, Michelogiannakis, George, Shalf, John, Liu, Alan, Bowers, John, Gaeta, Alex, Lipson, Michal, and Bergman, Keren
Subjects: Communications Engineering, Engineering, Electronics, Sensors and Digital Hardware, Affordable and Clean Energy, Communications engineering, Electronics, sensors and digital hardware
Abstract: We review the motivation, goals, and achievements of the Photonic Integrated Networked Energy efficient datacenter (PINE) project, which is part of the Advanced Research Projects Agency-Energy (ARPA-E) ENergy-efficient Light-wave Integrated Technology Enabling Networks that Enhance Dataprocessing (ENLITENED) program. The PINE program leverages the unique features of photonic technologies to enable alternative mega-datacenters and high-performance computing (HPC) system architectures that deliver more substantial energy efficiency improvements than can be achieved through link energy efficiency alone. In phase 1 of the program, the PINE system architecture demonstrated an average factor of 2.2× improvement in transactions/joule across a diverse set of HPC and datacenter applications. In phase 2, PINE will demonstrate an aggressive 1.0 pJ/bit total link budget with high-bandwidth-density dense wavelength-division multiplexing (DWDM) links to enable additional 2.5× or more efficiency gains through deep resource disaggregation.
Published: 2020

12. The Future of Computing beyond Moore's Law

Author: Shalf, John, primary
Published: 2023

13. The future of computing beyond Moore's Law

Author: Shalf, John
Subjects: Engineering, Electronics, Sensors and Digital Hardware, high-performance computing, Moore's Law, microelectronics, lithography, computing, post-CMOS, Moore’s Law, General Science & Technology
Abstract: Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore's Law for 50 years is failing and is anticipated to flatten by 2025. This article provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this article covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Published: 2020

14. TraceTracker: Hardware/Software Co-Evaluation for Large-Scale I/O Workload Reconstruction

Author: Kwon, Miryeong, Zhang, Jie, Park, Gyuyoung, Choi, Wonil, Donofrio, David, Shalf, John, Kandemir, Mahmut, and Jung, Myoungsoo
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture
Abstract: Block traces are widely used for system studies, model verifications, and design analyses in both industry and academia. While such traces include detailed block access patterns, existing trace-driven research unfortunately often fails to find true-north due to a lack of runtime contexts such as user idle periods and system delays, which are fundamentally linked to the characteristics of target storage hardware. In this work, we propose TraceTracker, a novel hardware/software co-evaluation method that allows users to reuse a broad range of the existing block traces by keeping most of their execution contexts and user scenarios while adjusting them with new system information. Specifically, our TraceTracker's software evaluation model can infer CPU burst times and user idle periods from old storage traces, whereas its hardware evaluation method remasters the storage traces by interoperating the inferred time information, and updates all inter-arrival times by making them aware of the target storage system. We apply the proposed co-evaluation model to 577 traces, which were collected by servers from different institutions and locations a decade ago, and revive the traces on a high-performance flash-based storage array. The evaluation results reveal that the accuracy of the execution contexts reconstructed by TraceTracker is on average 99% and 96% with regard to the frequency of idle operations and the total idle periods, respectively.
Comment: This paper is accepted by and will be published at 2017 IEEE International Symposium on Workload Characterization
Published: 2017

15. SimpleSSD: Modeling Solid State Drives for Holistic System Simulation

Author: Jung, Myoungsoo, Zhang, Jie, Abulila, Ahmed, Kwon, Miryeong, Shahidi, Narges, Shalf, John, Kim, Nam Sung, and Kandemir, Mahmut
Subjects: Computer Science - Hardware Architecture
Abstract: Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with traditional SSD models often requires unreasonably long runtimes and excessive computational resources. In this work, we propose SimpleSSD, a high-fidelity simulator that models all detailed characteristics of hardware and software, while simplifying the nondescript features of storage internals. In contrast to existing SSD simulators, SimpleSSD can easily be integrated into publicly-available full system simulators. In addition, it can accommodate a complete storage stack and evaluate the performance of SSDs along with diverse memory technologies and microarchitectures. Thus, it facilitates simulations that explore the full design space at different levels of system abstraction.
Comment: This paper has been accepted at IEEE Computer Architecture Letters (CAL)
Published: 2017

16. Extreme Heterogeneity 2018: DOE ASCR Basic Research Needs Workshop on Extreme Heterogeneity

Author: Byna, Surendra, Vetter, Jeff, Brightwell, Ron, Gokhale, Maya, McCormick, Patrick, Ross, Robert, Shalf, John, Antypas, Katie, Donofrio, David, Dubey, Anshu, Humble, T, Schuman, C, Van Essen, Brian, Yoo, S, Aiken, A, Bernholdt, D, Cameron, Kirk, Cappello, F, Chapman, B, Chien, A, Hall, M, Hartman-Baker, R, Lan, Zhiling, Lang, M, Leidel, J, Li, S, Lucas, R, Mellor-Crummey, J, Peltz, P, Peterka, T, Strout, M, and Wilke, J
Published: 2018

17. BoxLib with Tiling: An AMR Software Framework

Author: Zhang, Weiqun, Almgren, Ann, Day, Marcus, Nguyen, Tan, Shalf, John, and Unat, Didem
Subjects: Computer Science - Mathematical Software, Physics - Computational Physics, 97N80
Abstract: In this paper we introduce a block-structured adaptive mesh refinement (AMR) software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next generation architectures is essential. With the expectation of many more cores per node on next generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be sufficient. We describe a new version of BoxLib in which the tiling constructs are embedded so that BoxLib-based applications can easily realize expected performance gains without extra effort on the part of the application developer. We also discuss a path forward to enable future versions of BoxLib to take advantage of NUMA-aware optimizations using the TiDA portable library.
Comment: Accepted for publication in SIAM J. on Scientific Computing
Published: 2016
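
Note on result 17: tiling, the loop transformation named in the abstract, splits a large index space into small blocks that are traversed one at a time, so that each block can be handed to a thread and stays cache-resident while it is processed. The sketch below is a generic illustration only and does not use or reproduce the BoxLib or TiDA APIs; the names `tiles` and `scale_tiled` and the default tile sizes are hypothetical.

```python
def tiles(n_rows, n_cols, tile_rows, tile_cols):
    """Yield half-open tile bounds (r0, r1, c0, c1) covering an
    n_rows x n_cols index space. Edge tiles are clipped to the grid."""
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            yield (r0, min(r0 + tile_rows, n_rows),
                   c0, min(c0 + tile_cols, n_cols))


def scale_tiled(grid, factor, tile_rows=2, tile_cols=2):
    """Multiply every cell of a 2D list by `factor`, visiting it tile by tile.

    The result matches an untiled double loop; only the traversal order
    changes. Each tile is independent, so tiles could be given to threads.
    """
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    out = [[0] * n_cols for _ in range(n_rows)]
    for r0, r1, c0, c1 in tiles(n_rows, n_cols, tile_rows, tile_cols):
        for r in range(r0, r1):
            for c in range(c0, c1):
                out[r][c] = factor * grid[r][c]
    return out


grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
doubled = scale_tiled(grid, 2)
```

Keeping the tile iteration inside a helper such as `tiles` mirrors the idea in the abstract of embedding the tiling construct in the framework so that application loops need little or no change.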

18. Can the United States Maintain Its Leadership in High-Performance Computing? - A report from the ASCAC Subcommittee on American Competitiveness and Innovation to the ASCR Office

Author: Dongarra, Jack, primary, Deelman, Ewa, additional, Hey, Tony, additional, Matsuoka, Satoshi, additional, Sarkar, Vivek, additional, Bell, Greg, additional, Foster, Ian, additional, Keyes, David, additional, Kranzlmueller, Dieter, additional, Lucas, Bob, additional, Parker, Lynne, additional, Shalf, John, additional, Stanzione, Dan, additional, Stevens, Rick, additional, and Yelick, Katherine, additional
Published: 2023

19. 1 Applications and key performance indicators for data communications

Author: Stone, Robert, Shalf, John, Carmean, Doug, Seyedi, Ashkan, and Schmidtke, Katharine
Subjects: Information and Computing Sciences, Engineering, Electronics, Sensors and Digital Hardware
Published: 2023

20. SimpleSSD: Modeling Solid State Drives for Holistic System Simulation

Author: Jung, Myoungsoo, Zhang, Jie, Abulila, Ahmed, Kwon, Miryeong, Shahidi, Narges, Shalf, John, Kim, Nam Sung, and Kandemir, Mahmut
Subjects: Built Environment and Design, Architecture, Affordable and Clean Energy, Hardware, computer architecture, parallel processing, computational modeling, systems simulation, microprocessors, software, Computer Hardware & Architecture
Abstract: Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with traditional SSD models often requires unreasonably long runtimes and excessive computational resources. In this work, we propose SimpleSSD, a high-fidelity simulator that models all detailed characteristics of hardware and software, while simplifying the nondescript features of storage internals. In contrast to existing SSD simulators, SimpleSSD can easily be integrated into publicly-available full system simulators. In addition, it can accommodate a complete storage stack and evaluate the performance of SSDs along with diverse memory technologies and microarchitectures. Thus, it facilitates simulations that explore the full design space at different levels of system abstraction.
Published: 2018

21. Last Level Collective Hardware Prefetching for Data-Parallel Applications

Author: Michelogiannakis, George and Shalf, John
Subjects: Built Environment and Design, Engineering, Architecture, Electronics, Sensors and Digital Hardware, Affordable and Clean Energy, prefetch, cache, data-parallel
Abstract: With rapidly increasing parallelism, DRAM performance and power have surfaced as primary constraints from consumer electronics to high performance computing (HPC) for a variety of applications, including bulk-synchronous data-parallel applications, which are key drivers for multi-core, with examples including image processing, climate modeling, physics simulation, gaming, face recognition, and many others. We present the last-level collective prefetcher (LLCP), a purely hardware last-level cache (LLC) prefetcher that exploits the highly correlated prefetch patterns of data-parallel algorithms that would otherwise not be recognized by a prefetcher that is oblivious to data parallelism. LLCP generates prefetches on behalf of multiple cores in memory address order to maximize DRAM efficiency and bandwidth, and can prefetch from multiple memory pages without expensive translations. Compared to other well-established prefetchers, LLCP improves execution time by 5.5% on average (10% maximum), increases DRAM bandwidth by 9% to 18%, decreases DRAM rank energy by 6%, produces 27% more timely prefetches, and increases coverage by 25% at minimum.
Published: 2017

22. APHiD: Hierarchical Task Placement to Enable a Tapered Fat Tree Topology for Lower Power and Cost in HPC Networks

Author: Michelogiannakis, George, Ibrahim, Khaled Z., Shalf, John, Wilke, Jeremiah J., Knight, Samuel, and Kenny, Joseph P.
Subjects: HPC networking
Abstract: The power and procurement cost of bandwidth in system-wide networks has forced a steady drop in the byte/flop ratio. This trend of computation becoming faster relative to the network is expected to hold. In this paper, we explore how cost-oriented task placement enables reducing the cost of system-wide networks by enabling high performance even on tapered topologies where more bandwidth is provisioned at lower levels. We describe APHiD, an efficient hierarchical placement algorithm that uses new techniques to improve the quality of heuristic solutions and reduces the demand on high-level, expensive bandwidth in hierarchical topologies. We apply APHiD to a tapered fat-tree, demonstrating that APHiD maintains application scalability even for severely tapered network configurations. Using simulation, we show that for tapered networks APHiD improves performance by more than 50% over random placement and even 15% in some cases over costlier, state-of-the-art placement algorithms.
Published: 2017

23. Trends in Data Locality Abstractions for HPC Systems

Author: Unat, Didem, Dubey, Anshu, Hoefler, Torsten, Shalf, John, Abraham, Mark, Bianco, Mauro, Chamberlain, Bradford L, Cledat, Romain, Edwards, H Carter, Finkel, Hal, Fuerlinger, Karl, Hannig, Frank, Jeannot, Emmanuel, Kamil, Amir, Keasler, Jeff, Kelly, Paul HJ, Leung, Vitus, Ltaief, Hatem, Maruyama, Naoya, Newburn, Chris J, and Pericas, Miquel
Subjects: Distributed Computing and Systems Software, Information and Computing Sciences, Data locality, programming abstractions, high-performance computing, data layout, locality-aware runtimes, Computer Software, Distributed Computing, Communications Technologies, Distributed computing and systems software
Abstract: The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
Published: 2017

24. Continuing the Scaling of Digital Computing Post Moore’s Law

Author: Michelogiannakis, George, Shalf, John, Donofrio, David, and Bachan, John
Abstract: Traditional CMOS technology scaling, which up until now has followed Moore's law, is coming to an end in the next decade. However, the DOE has come to depend on the rapid, predictable, and cheap scaling of computing performance to meet mission needs for scientific theory, large scale experiments, and national security. Moving forward, performance scaling of digital computing will need to originate from energy and cost reductions that are a result of novel architectures, devices, manufacturing technologies, and programming models. The deeper issue presented by these changes is the threat to DOE’s mission and to the future economic growth of the U.S. computing industry and to society as a whole. With the impending end of Moore’s law, it is imperative for the Office of Advanced Scientific Computing Research (ASCR) to develop a balanced research agenda to assess the viability of novel semiconductor technologies and navigate the ensuing challenges. This report identifies four areas and research directions for ASCR and how each can be used to preserve performance scaling of digital computing beyond exascale and after Moore's law ends.
Published: 2016

25. Optical Interconnects and Extreme Computing

Author: Bergman, Keren, Shalf, John, and Hausken, Tom
Subjects: Optoelectronics & Photonics
Published: 2016

26. NANDFlashSim

Author: Jung, Myoungsoo, Choi, Wonil, Gao, Shuwen, Wilson, Ellis Herbert, Donofrio, David, Shalf, John, and Kandemir, Mahmut Taylan
Subjects: Engineering, Electronics, Sensors and Digital Hardware, Affordable and Clean Energy, Non-volatile memory, NAND flash memory, cycle-level simulation, solid state disk, performance evaluation, Data Format, Networking & Telecommunications, Communications engineering, Distributed computing and systems software
Abstract: As the popularity of NAND flash expands in arenas from embedded systems to high-performance computing, a high-fidelity understanding of its specific properties becomes increasingly important. Further, with the increasing trend toward multiple-die, multiple-plane architectures and high-speed interfaces, flash memory systems are expected to continue to scale and cheapen, resulting in their broader proliferation. However, when designing NAND-based devices, making decisions about the optimal system configuration is nontrivial, because flash is sensitive to a number of parameters and suffers from inherent latency variations, and no available tools suffice for studying these nuances. The parameters include the architectures, such as multidie and multiplane, diverse node technologies, bit densities, and cell reliabilities. Therefore, we introduce NANDFlashSim, a high-fidelity, latency-variation-aware, and highly configurable NAND-flash simulator, which implements a detailed timing model for 16 state-of-the-art NAND operations. Using NANDFlashSim, we notably discover the following. First, regardless of the operation, reads fail to leverage internal parallelism. Second, MLC provides lower I/O bus contention than SLC, but contention becomes a serious problem as the number of dies increases. Third, many-die architectures outperform many-plane architectures for disk-friendly workloads. Finally, employing a high-performance I/O bus or an increased page size does not enhance energy savings. Our simulator is available at http://nfs.camelab.org.
Published: 2016

27. Asynchronous AMR on Multi-GPUs

Author: Farooqi, Muhammad Nufail, Nguyen, Tan, Zhang, Weiqun, Almgren, Ann S., Shalf, John, Unat, Didem, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Weiland, Michèle, editor, Juckeland, Guido, editor, Alam, Sadaf, editor, and Jagode, Heike, editor
Published: 2019
Full Text: View/download PDF

28. Computing beyond Moore's Law

Author: Shalf, John M and Leland, Robert
Subjects: Clinical Research, Information and Computing Sciences, Software Engineering
Published: 2015

29. ExaSAT: An exascale co-design tool for performance modeling

Author: Unat, Didem, Chan, Cy, Zhang, Weiqun, Williams, Samuel, Bachan, John, Bell, John, and Shalf, John
Subjects: Information and Computing Sciences, Software Engineering, Affordable and Clean Energy, Performance modeling, exascale co-design, exascale systems, performance analysis, combustion codes, abstract machine model, compiler analysis, cache modeling, stencil applications, design trade-offs, Distributed Computing, Applied computing, Distributed computing and systems software
Abstract: One of the emerging challenges to designing HPC systems is understanding and projecting the requirements of exascale applications. In order to determine the performance consequences of different hardware designs, analytic models are essential because they can provide fast feedback to the co-design centers and chip designers without costly simulations. However, current attempts to analytically model program performance typically rely on the user manually specifying a performance model. We introduce the ExaSAT framework that automates the extraction of parameterized performance models directly from source code using compiler analysis. The parameterized analytic model enables quantitative evaluation of a broad range of hardware design trade-offs and software optimizations on a variety of different performance metrics, with a primary focus on data movement as a metric. We demonstrate the ExaSAT framework's ability to perform deep code analysis of a proxy application from the Department of Energy Combustion Co-design Center to illustrate its value to the exascale co-design process. ExaSAT analysis provides insights into the hardware and software trade-offs and lays the groundwork for exploring a more targeted set of design points using cycle-accurate architectural simulators.
Published: 2015

30. Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies

Author: Vasudevan, Dilip, Butko, Anastasiia, Michelogiannakis, George, Donofrio, David, Shalf, John, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Kunkel, Julian M., editor, Yokota, Rio, editor, Taufer, Michela, editor, and Shalf, John, editor
Published: 2017
Full Text: View/download PDF

31. Reconfigurable Silicon Photonic Interconnect for Many-Core Architecture

Author: Guan, Hang, Rumley, Sébastien, Wen, Ke, Donofrio, David, Shalf, John, Bergman, Keren, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Kunkel, Julian M., editor, Yokota, Rio, editor, Taufer, Michela, editor, and Shalf, John, editor
Published: 2017
Full Text: View/download PDF

32. OpenSoC Fabric: On-Chip Network Generator: Using Chisel to Generate a Parameterizable On-Chip Interconnect Fabric

Author: Fatollahi-Fard, Farzad, Donofrio, David, Michelogiannakis, George, and Shalf, John
Abstract: Recent advancements in technology scaling have sparked a trend towards greater integration with large-scale chips containing thousands of processors connected to memories and other I/O devices using non-trivial network topologies. Software simulation suffers from long execution times or reduced accuracy in such complex systems, whereas hardware RTL development is too time-consuming. We present OpenSoC Fabric, a parameterizable and powerful on-chip network generator for evaluating future large-scale chip multiprocessors and SoCs. OpenSoC Fabric leverages a new hardware DSL, Chisel, which contains powerful abstractions provided by its base language, Scala, and generates both software (C++) and hardware (Verilog) models from a single code base. This is in contrast to other tools readily available which typically provide either software or hardware models, but not both. The OpenSoC Fabric infrastructure is modeled after existing state-of-the-art simulators, offers large and powerful collections of configuration options, is open-source, and uses object-oriented design and functional programming to make functionality extension as easy as possible.
Published: 2014

33. Reimagining Codesign for Advanced Scientific Computing: Report for the ASCR Workshop on Reimagining Codesign

Author: Ang, James, primary, Chien, Andrew, additional, Hammond, Simon, additional, Hoisie, Adolfy, additional, Karlin, Ian, additional, Pakin, Scott, additional, Shalf, John, additional, and Vetter, Jeffrey, additional
Published: 2022
Full Text: View/download PDF

34. Cactus Framework: Black Holes to Gamma Ray Bursts

Author: Schnetter, Erik, Ott, Christian D., Allen, Gabrielle, Diener, Peter, Goodale, Tom, Radke, Thomas, Seidel, Edward, and Shalf, John
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin. They are among the most scientifically interesting astrophysical systems, and the riddle concerning their central engines and emission mechanisms is one of the most complex and challenging problems of astrophysics today. In this article we outline our petascale approach to the GRB problem and discuss the computational toolkits and numerical codes that are currently in use and that will be scaled up to run on emerging petaflop scale computing platforms in the near future. Petascale computing will require additional ingredients over conventional parallelism. We consider some of the challenges which will be caused by future petascale architectures, and discuss our plans for the future development of the Cactus framework and its applications to meet these challenges in order to profit from these new architectures., Comment: 16 pages, 4 figures. To appear in Petascale Computing: Algorithms and Applications, Ed. D. Bader, CRC Press LLC (2007)
Published: 2007

35. Envisioning Science in 2050

Author: Ahrens, James, primary, Boehnlein, Amber, additional, Carlson, Rich, additional, Elliot, Joshua, additional, Fagnan, Kjiersten, additional, Ferrier, Nicola, additional, Foster, Ian, additional, Gimpel, Lee, additional, Shalf, John, additional, and Ratner, Dan, additional
Published: 2022
Full Text: View/download PDF

36. Software Design Space Exploration for Exascale Combustion Co-design

Author: Chan, Cy, Unat, Didem, Lijewski, Michael, Zhang, Weiqun, Bell, John, and Shalf, John
Subjects: Information and Computing Sciences, Software Engineering, Affordable and Clean Energy, Artificial Intelligence & Image Processing, Information and computing sciences
Abstract: The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. In order to characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for the development of advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency. © 2013 Springer-Verlag.
Published: 2013

37. The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment

Author: Allen, Gabrielle, Angulo, David, Foster, Ian, Lanfermann, Gerd, Liu, Chuang, Radke, Thomas, Seidel, Ed, and Shalf, John
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3
Abstract: The ability to harness heterogeneous, dynamically available "Grid" resources is attractive to typically resource-starved computational scientists and engineers, as in principle it can increase, by significant factors, the number of cycles that can be delivered to applications. However, new adaptive application structures and dynamic runtime system mechanisms are required if we are to operate effectively in Grid environments. In order to explore some of these issues in a practical setting, we are developing an experimental framework, called Cactus, that incorporates both adaptive application structures for dealing with changing resource characteristics and adaptive resource selection mechanisms that allow applications to change their resource allocations (e.g., via migration) when performance falls outside specified limits. We describe here the adaptive resource selection mechanisms and describe how they are used to achieve automatic application migration to "better" resources following performance degradation. Our results provide insights into the architectural structures required to support adaptive resource selection. In addition, we suggest that this "Cactus Worm" is an interesting challenge problem for Grid computing., Comment: 14 pages, 5 figures, to be published in International Journal of Supercomputing Applications
Published: 2001

38. Optimization of Geometric Multigrid for Emerging Multi-and Manycore Processors

Author: Williams, Samuel, Kalamkar, Dhiraj D, Amik, Singh, Deshpande, An M, van Straalen, Brian, Smelyanskiy, Mikhail, Almgren, Ann, Dubey, Pradeep, Shalf, John, and Oliker, Leonid
Subjects: Geometric Multigrid, communication-avoiding, multicore, Xeon Phi, Knights Corner, OpenMP, auto-tuning
Abstract: Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based Infiniband clusters, as well as the new Intel® Xeon Phi coprocessor (Knights Corner). Our work examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations. © 2012 IEEE.
Published: 2012

39. Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Author: Williams, Samuel, Oliker, Leonid, Carter, Jonathan, and Shalf, John
Abstract: We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting performance for a variety of emerging multicore-based supercomputing platforms. Our examined application is a structured grid-based Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. First, we examine sophisticated sequential auto-tuning techniques including loop transformations, virtual vectorization, and use of ISA-specific intrinsics. Next, we present a variety of parallel optimization approaches including programming model exploration (at MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed to mitigate communication bottlenecks. Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our unique tuning approach improves performance and energy requirements by up to 3.4× using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods on forthcoming HPC systems. Copyright 2011 ACM.
Published: 2011

40. Hardware/software co-design for energy-efficient seismic modeling

Author: Krueger, Jens, Donofrio, David, Shalf, John, Mohiyuddin, Marghoob, Williams, Samuel, Oliker, Leonid, and Pfreund, Franz-Josef
Subjects: Networking and Information Technology R&D (NITRD), Affordable and Clean Energy
Abstract: Reverse Time Migration (RTM) has become the standard for high-quality imaging in the seismic industry. RTM relies on PDE solutions using stencils that are 8th order or larger, which require large-scale HPC clusters to meet the computational demands. However, the rising power consumption of conventional cluster technology has prompted investigation of architectural alternatives that offer higher computational efficiency. In this work, we compare the performance and energy efficiency of three architectural alternatives - the Intel Nehalem X5530 multicore processor, the NVIDIA Tesla C2050 GPU, and a general-purpose manycore chip design optimized for high-order wave equations called "Green Wave". We have developed an FPGA-accelerated architectural simulation platform to accurately model the power and performance of the Green Wave design. Results show that across highly-tuned high-order RTM stencils, the Green Wave implementation can offer up to 8× and 3.5× energy efficiency improvement per node respectively, compared with the Nehalem and GPU platforms. These results point to the enormous potential energy advantages of our hardware/software co-design methodology. Copyright 2011 ACM.
Published: 2011

41. Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems

Author: Madduri, Kamesh, Ibrahim, Khaled Z, Williams, Samuel, Im, Eun-Jin, Ethier, Stephane, Shalf, John, and Oliker, Leonid
Subjects: Nuclear and Plasma Physics, Information and Computing Sciences, Applied Computing, Physical Sciences
Abstract: The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance performance of GTC, a PIC-based production code for studying plasma microturbulence in tokamak devices. Our optimizations encompass all six GTC sub-routines and include multi-level particle and grid decompositions designed to improve multi-node parallel scaling, particle binning for improved load balance, GPU acceleration of key subroutines, and memory-centric optimizations to improve single-node scaling and reduce memory utilization. The new hybrid MPI-OpenMP and MPI-OpenMP-CUDA GTC versions achieve up to a 2× speedup over the production Fortran code on four parallel systems - clusters based on the AMD Magny-Cours, Intel Nehalem-EP, IBM BlueGene/P, and NVIDIA Fermi architectures. Finally, strong scaling experiments provide insight into parallel scalability, memory utilization, and programmability trade-offs for large-scale gyrokinetic PIC simulations, while attaining a 1.6× speedup on 49,152 XE6 cores. Copyright 2011 ACM.
Published: 2011

42. Hardware/software co‐design of global cloud system resolving models

Author: Wehner, Michael F, Oliker, Leonid, Shalf, John, Donofrio, David, Drummond, Leroy A, Heikes, Ross, Kamil, Shoaib, Kono, Celal, Miller, Norman, Miura, Hiroaki, Mohiyuddin, Marghoob, Randall, David, and Yang, Woo‐Sun
Subjects: Climate Action, Atmospheric Sciences
Published: 2011

43. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

Author: Williams, Samuel, Oliker, Leonid, Vuduc, Richard, Shalf, John, Yelick, Katherine, and Demmel, James
Published: 2010

44. An Auto-Tuning Framework for Parallel Multicore Stencil Computations

Author: Kamil, Shoaib, Chan, Cy, Oliker, Leonid, Shalf, John, and Williams, Samuel
Abstract: Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. Results show that our generalized methodology delivers significant performance gains of up to 22× speedup over the reference serial implementation. Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems. © 2010 IEEE.
Published: 2010

45. Nonintrusive AMR Asynchrony for Communication Optimization

Author: Farooqi, Muhammad Nufail, Unat, Didem, Nguyen, Tan, Zhang, Weiqun, Almgren, Ann, Shalf, John, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Rivera, Francisco F., editor, Pena, Tomás F., editor, and Cabaleiro, José C., editor
Published: 2017
Full Text: View/download PDF

46. A design methodology for domain-optimized power-efficient supercomputing

Author: Mohiyuddin, Marghoob, Murphy, Mark, Oliker, Leonid, Shalf, John, Wawrzynek, John, and Williams, Samuel
Subjects: Built Environment and Design, Engineering, Information and Computing Sciences, Electrical Engineering, Electronics, Sensors and Digital Hardware, Architecture, Affordable and Clean Energy
Abstract: As power has become the pre-eminent design constraint for future HPC systems, computational efficiency is being emphasized over simply peak performance. Recently, static benchmark codes have been used to find a power efficient architecture. Unfortunately, because compilers generate sub-optimal code, benchmark performance can be a poor indicator of the performance potential of architecture design points. Therefore, we present hardware/software cotuning as a novel approach for system design, in which traditional architecture space exploration is tightly coupled with software auto-tuning for delivering substantial improvements in area and power efficiency. We demonstrate the proposed methodology by exploring the parameter space of a Tensilica-based multi-processor running three of the most heavily used kernels in scientific computing, each with widely varying micro-architectural requirements: sparse matrix vector multiplication, stencil-based computations, and general matrix-matrix multiplication. Results demonstrate that co-tuning significantly improves hardware area and energy efficiency - a key driver for next generation of HPC system design. Copyright 2009 ACM.
Published: 2009

47. Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors

Author: Madduri, Kamesh, Williams, Samuel, Ethier, Stéphane, Oliker, Leonid, Shalf, John, Strohmaier, Erich, and Yelick, Katherine
Subjects: Information and Computing Sciences, Applied Computing
Abstract: We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application to study turbulent transport in magnetic-confinement fusion devices. Particle-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charges to a 3D toroidal mesh, and multiple particles may contribute to the charge at a grid point. We design new parallel algorithms for the GTC charge deposition kernel, and analyze their performance on three leading multicore platforms. We implement thirteen different variants for this kernel and identify the best-performing ones given typical PIC parameters such as the grid size, number of particles per cell, and the GTC-specific particle Larmor radius variation. We find that our best strategies can be 2x faster than the reference optimized MPI implementation, and our analysis provides insight into desirable architectural features for high-performance PIC simulation codes. Copyright 2009 ACM.
Published: 2009

48. Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

Author: Michelogiannakis, George, primary, Arafa, Yehia, additional, Cook, Brandon, additional, Dai, Liang Yuan, additional, Hameed Badawy, Abdel-Hameed, additional, Glick, Madeleine, additional, Wang, Yuyang, additional, Bergman, Keren, additional, and Shalf, John, additional
Published: 2023
Full Text: View/download PDF

49. Optimization of sparse matrix–vector multiplication on emerging multicore platforms

Author: Williams, Samuel, Oliker, Leonid, Vuduc, Richard, Shalf, John, Yelick, Katherine, and Demmel, James
Subjects: Multicore, Sparse, Performance, Autotuning, HPC, Cell, Niagara, Distributed Computing, Cognitive Sciences
Abstract: We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific-optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms. © 2008 Elsevier B.V.
Published: 2009

50. 2019 Computing Sciences Strategic Plan

Author: Yelick, Kathy, primary, Agarwal, Deb, additional, Bard, Debbie, additional, Shalf, John, additional, Almgren, Ann, additional, Bhimji, Wahid, additional, Brown, Ben, additional, Carter, Jonathan, additional, Jong, Bert, additional, Doerfler, Doug, additional, Donofrio, David, additional, Guok, Chin, additional, Iancu, Costin, additional, Kiran, Mariam, additional, Li, Sherry, additional, Nugent, Peter, additional, Prabhat, M., additional, Ramakrishnan, Lavanya, additional, Vasudevan, Dilip, additional, Wright, Nick, additional, Cademartori, Helen, additional, Antypas, Katie, additional, and Kincade, Kathy, additional
Published: 2021
Full Text: View/download PDF
