25 results for "Charles J. Archer"
Search Results
2. Early Evaluation of Scalable Fabric Interface for PGAS Programming Models.
- Author
- Miao Luo, Kayla Seager, Karthik S. Murthy, Charles J. Archer, Sayantan Sur, and Sean Hefty
- Published
- 2014
- Full Text
- View/download PDF
3. Efficient implementation of MPI-3 RMA over openFabrics interfaces
- Author
- Sayantan Sur, Erik Paulson, Hajime Fujita, María Jesús Garzarán, Charles J. Archer, and Chongxiao Cao
- Subjects
- MPICH, Computer Networks and Communications, Computer science, Message Passing Interface, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Software, Artificial Intelligence, Hardware and Architecture, Embedded system, Programming paradigm
- Abstract
The Message Passing Interface (MPI) standard supports Remote Memory Access (RMA) operations, in which a process can read or write the memory of another process without requiring the target process to be involved in the communication. This enables new, more efficient programming models. This paper describes the RMA design and implementation in MPICH-OFI, an MPICH-based open-source implementation of the MPI standard that uses the OpenFabrics Interfaces (OFI), a lightweight communication framework for modern high-speed interconnects, to communicate with the underlying network fabric. MPICH-OFI is built on a new communication layer called CH4, which was designed to achieve high performance by minimizing runtime software overhead and by providing an internal API that is well aligned with MPI functions. Thanks to CH4 and OFI, MPICH-OFI achieves low latency and high bandwidth for RMA operations. Our microbenchmark results show that, on the Intel® Omni-Path Architecture, MPICH-OFI achieves more than 3x better put/get latency and bandwidth than MPICH CH3, 10% better latency than Open MPI and MVAPICH2, and more than 1.7x the bandwidth of MVAPICH2 for small messages (≤ 4 KB). (A minimal sketch of the RMA programming style follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
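The sketch below illustrates the MPI-3 RMA style of communication that the abstract above targets, using only standard MPI-3 calls (MPI_Win_allocate, MPI_Win_lock_all, MPI_Put, MPI_Win_flush). It is a minimal illustration of the programming model, not of MPICH-OFI internals.

```c
/* Minimal MPI-3 RMA sketch: rank 0 writes into rank 1's window without
 * rank 1 making a matching receive call. Standard MPI-3 only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose one integer per process as remotely accessible memory. */
    int *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = rank;
    MPI_Barrier(MPI_COMM_WORLD);      /* initialization done before any put */

    /* Passive-target epoch: the target is not involved in the transfer. */
    MPI_Win_lock_all(0, win);
    if (rank == 0 && size > 1) {
        int val = 42;
        MPI_Put(&val, 1, MPI_INT, /*target rank*/ 1,
                /*target disp*/ 0, 1, MPI_INT, win);
        MPI_Win_flush(1, win);        /* complete the put at the target */
    }
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1)
        printf("rank 1 window now holds %d\n", *base);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```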
4. Software combining to mitigate multithreaded MPI contention
- Author
- Shintaro Iwasaki, Chongxiao Cao, Charles J. Archer, Hajime Fujita, Yanfei Guo, Pavan Balaji, Min Si, Kenjiro Taura, Jeff R. Hammond, Kenneth Raffenetti, Sagar Thapaliya, María Jesús Garzarán, Mikhail Shiryaev, Michael Chuvelev, Abdelhalim Amer, and Michael Alan Blocksome
- Subjects
- Software, Computer science, Embedded system, Scalability, Thread safety, Thread (computing), Lock (computer science)
- Abstract
Efforts to mitigate lock contention from concurrent threaded access to MPI have reduced contention through fine-grained locking, avoided locking altogether by offloading communication to dedicated threads, or alleviated the negative side effects of contention with better lock-management protocols. The blocking nature of lock-based methods, however, wastes the asynchrony benefits of nonblocking MPI operations, and the offloading model sacrifices CPU resources and incurs unnecessary software-offloading overhead under low contention. We propose two new thread-safety models, CSync and LockQ, based on software combining, a form of software offloading that needs no dedicated threads: a thread holding the lock combines the work of threads that failed their lock acquisitions (a generic sketch of this idea follows this entry). We demonstrate that CSync, a direct application of software combining, improves scalability but lacks asynchrony and incurs unnecessary offloading. LockQ alleviates these shortcomings by leveraging MPI semantics to relax synchronization and reduce offloading requirements. We present the implementation, analysis, and evaluation of these models on a modern network fabric and show that LockQ outperforms most existing thread-safety models in both low- and high-contention regimes.
- Published
- 2019
- Full Text
- View/download PDF
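Below is a minimal, generic sketch of the software-combining idea referenced above, written with POSIX threads and C11 atomics: a thread publishes its request in a per-thread slot, and whichever thread holds the lock drains all pending requests. It is an illustration of the general technique under assumed names and structures, not the paper's CSync or LockQ implementation.

```c
/* Generic software combining: threads that cannot take the lock have their
 * queued work executed by the current lock holder (the "combiner"). */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 64

typedef struct {
    void (*fn)(void *);            /* operation to perform under the lock */
    void *arg;
    atomic_bool pending;           /* set by requester, cleared by combiner */
} request_t;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static request_t slots[MAX_THREADS];

/* Execute every pending request, including the combiner's own. */
static void drain_slots(void)
{
    for (int i = 0; i < MAX_THREADS; i++) {
        if (atomic_load(&slots[i].pending)) {
            slots[i].fn(slots[i].arg);
            atomic_store(&slots[i].pending, false);
        }
    }
}

void combining_submit(int tid, void (*fn)(void *), void *arg)
{
    slots[tid].fn = fn;
    slots[tid].arg = arg;
    atomic_store(&slots[tid].pending, true);

    /* Either a combiner finishes our request, or the lock frees up and we
     * become the combiner on behalf of everyone ourselves. */
    while (atomic_load(&slots[tid].pending)) {
        if (pthread_mutex_trylock(&lock) == 0) {
            drain_slots();
            pthread_mutex_unlock(&lock);
        } else {
            sched_yield();
        }
    }
}
```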
5. Why is MPI so slow?
- Author
- Alexander Sannikov, Sangmin Seo, Yanfei Guo, Ken Raffenetti, Paul Fischer, Tomislav Janjusic, Thilina Rathnayake, Michael Alan Blocksome, Jithin Jose, Matthew Otten, Hajime Fujita, Sergey Oblomov, Sayantan Sur, Masamichi Takagi, Pavan Balaji, Masayuki Hatanaka, Misun Min, Abdelhalim Amer, Paul Coffman, Wesley Bland, Akhil Langer, Michael Chuvelev, Dmitry Durnov, Charles J. Archer, Min Si, Lena Oden, Gengbin Zheng, and Xin Zhao
- Subjects
- Distributed computing, Network architecture, Computer science, Variety (cybernetics), Software, Embedded system, PATH (variable)
- Abstract
This paper provides an in-depth analysis of software overheads on the MPI performance-critical path and identifies overheads that are mandated by the MPI-3.1 specification itself. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack, all the way from the application to the low-level network communication API, takes only a few tens of instructions. We carefully study these instructions and trace the remaining overheads to specific requirements of the MPI standard that cannot be avoided under the current specification. We recommend potential changes to the MPI standard that would minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from the proposed changes. (A minimal latency microbenchmark of the kind used to expose such per-message overheads follows this entry.)
- Published
- 2017
- Full Text
- View/download PDF
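The abstract above concerns per-message software overhead on the MPI critical path. The following minimal ping-pong microbenchmark (standard MPI calls only, not the paper's benchmark suite) shows how such small-message overheads are typically measured.

```c
/* Minimal two-rank ping-pong latency microbenchmark: for 8-byte messages,
 * the measured half round-trip time is dominated by software overhead. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100000;
    char buf[8] = {0};                  /* small message */
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("half round-trip latency: %.3f us\n",
               (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```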
6. Memory Compression Techniques for Network Address Management in MPI
- Author
- Ken Raffenetti, Pavan Balaji, Michael A. Blocksome, Wesley Bland, Scott Parker, Charles J. Archer, and Yanfei Guo
- Subjects
- Distributed computing, Computer science, Address space, Parallel computing, Data structure, Metadata, Logical address, Memory management, Network address, Integer (computer science)
- Abstract
MPI allows applications to treat processes as a logical collection of integer ranks within each MPI communicator, while internally translating these logical ranks into actual network addresses. In current MPI implementations, the management and lookup of these network addresses use memory proportional to the number of processes in each communicator. In this paper, we propose a new mechanism, called AV-Rankmap, for managing this translation. AV-Rankmap takes advantage of the logical patterns in rank-to-address mapping that most applications naturally exhibit, and it exploits the fact that some parts of the network address structures are more performance-critical than others. It uses this information to compress the memory used for network address management. We demonstrate that AV-Rankmap can achieve performance similar to or better than that of other MPI implementations while using significantly less memory. (A generic sketch of the pattern-compression idea follows this entry.)
- Published
- 2017
- Full Text
- View/download PDF
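As a rough illustration of the compression idea above, the sketch below detects a simple strided rank-to-address pattern and stores only a base and a stride, falling back to a full per-rank table otherwise. The types and names (net_addr_t, rankmap_t) are hypothetical; this is not the AV-Rankmap data structure itself.

```c
/* Compress a communicator's rank-to-address table when the mapping is
 * regular; keep the O(n) table only for irregular mappings. Assumes n >= 1. */
#include <stdlib.h>
#include <stdbool.h>

typedef int net_addr_t;              /* hypothetical: index into a job-wide address table */

typedef struct {
    int n;                           /* number of ranks in the communicator */
    bool strided;                    /* true if addr(rank) = base + rank*stride */
    net_addr_t base;
    int stride;
    net_addr_t *table;               /* fallback: one entry per rank */
} rankmap_t;

rankmap_t rankmap_build(const net_addr_t *addrs, int n)
{
    rankmap_t m = { .n = n, .strided = true, .base = addrs[0],
                    .stride = (n > 1) ? addrs[1] - addrs[0] : 0, .table = NULL };
    for (int r = 1; r < n && m.strided; r++)
        if (addrs[r] != m.base + r * m.stride)
            m.strided = false;

    if (!m.strided) {                /* irregular: fall back to a full table */
        m.table = malloc(n * sizeof *m.table);
        for (int r = 0; r < n; r++)
            m.table[r] = addrs[r];
    }
    return m;                        /* regular case uses O(1) memory per communicator */
}

net_addr_t rankmap_lookup(const rankmap_t *m, int rank)
{
    return m->strided ? m->base + rank * m->stride : m->table[rank];
}
```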
7. BlueGene/L applications: Parallelism On a Massive Scale
- Author
- Ümit V. Çatalyürek, Mehul Patel, Alan Gara, Robert K. Yates, Martin Schulz, José E. Moreira, Bor Chan, Kai Kadau, William Clarence McLendon, Franz Franchetti, Peter Williams, Andy Yoo, Keith Henderson, Bob Walkup, Bruce Hendrickson, Timothy C. Germann, George Almási, Christoph Überhuber, Erik W. Draeger, James C. Sexton, John A. Gunnels, Andrew W. Cook, Edmond Chow, Stefan Kral, Frederick H. Streitz, Vasily V. Bulatov, Jeffrey Greenough, Gyan Bhanot, Steve Louis, C. A. Rendleman, Manish Gupta, Charles J. Archer, Michael Welcome, Jürgen Lorenz, Francois Gygi, William H. Cabot, Bronis R. de Supinski, Alison Kubota, Peter S. Lomdahl, Brian J. Miller, Thomas E. Spelce, and James N. Glosli
- Subjects
- TOP500, Scale (ratio), Hardware and Architecture, Computer science, Parallelism (grammar), Code (cryptography), Parallel computing, IBM, Software, Theoretical Computer Science
- Abstract
BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in scale, with 131,072 processors, and in absolute performance, with a peak rate of 367 Tflop/s. BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the full machine installed at LLNL and is expected to remain the fastest computer for the next few editions of the list. However, the real value of a machine such as BG/L derives from the scientific breakthroughs that real applications can produce by successfully using its unprecedented scale and computational power. In this paper, we describe our experiences with eight large-scale applications on BG/L from several application domains, ranging from molecular dynamics to dislocation dynamics and from turbulence simulations to searches in semantic graphs. We also discuss the challenges we faced when scaling these codes and present several successful optimization techniques. All applications show excellent scaling behavior, even at very large processor counts, with one code achieving a sustained performance of more than 100 Tflop/s, clearly demonstrating the success of the BG/L design.
- Published
- 2008
- Full Text
- View/download PDF
8. EUDOC on the IBM Blue Gene/L system: Accelerating the transfer of drug discoveries from laboratory to patient
- Author
- Jeffrey S. McAllister, T. J. Mullins, Amanda Peters, Charles J. Archer, Brent Allen Swartz, Brian E. Smith, Brian Paul Wallenfelt, R. G. Musselman, K. W. Pinnow, and Yuan Ping Pang
- Subjects
- Engineering, Virtual screening, General Computer Science, Serial code, Supercomputer, Identification (information), Computer architecture, Scalability, Benchmark (computing), Operating system, SIMD, IBM
- Abstract
EUDOC™ is a molecular docking program that has successfully helped to identify new drug leads. This virtual screening (VS) tool identifies drug candidates by computationally testing the binding of these drugs to biologically important protein targets. This approach can reduce the research time required of biochemists, accelerating the identification of therapeutically useful drugs and helping to transfer discoveries from the laboratory to the patient. Migration of the EUDOC application code to the IBM Blue Gene/L™ (BG/L) supercomputer has been highly successful. This migration led to a 200-fold improvement in elapsed time for a representative VS application benchmark. Three focus areas provided benefits. First, we enhanced the performance of serial code through application redesign, hand-tuning, and increased usage of SIMD (single-instruction, multiple-data) floating-point unit operations. Second, we studied computational load-balancing schemes to maximize processor utilization and application scalability for the massively parallel architecture of the BG/L system. Third, we greatly enhanced system I/O interaction design. We also identified and resolved severe performance bottlenecks, allowing for efficient performance on more than 4,000 processors. This paper describes specific improvements in each of the areas of focus.
- Published
- 2008
- Full Text
- View/download PDF
9. The Blue Gene/L Supercomputer: A Hardware and Software Story
- Author
- Peter Bergner, Valentina Salapura, Mark E. Giampapa, Alan Gara, Gordon G. Stewart, Todd A. Inglett, Jose R. Brunheroto, A. A. Bright, Paul W. Coteus, Rick A. Rand, Ralph Bellofatto, Thomas Eugene Engelsiepen, Dirk Hoenicke, Pavlos M. Vranas, Sam Ellis, Derek Lieber, Brian Paul Wallenfelt, Michael A. Blocksome, David Roy Limpert, R. A. Haring, Dong Chen, Shawn A. Hall, Mark G. Megerian, Mathias Blumrich, Patrick Joseph McCarthy, Tom Gooding, Martin Ohmacht, Todd E. Takken, José E. Moreira, Philip Heidelberger, R. Bickford, Michael B. Mundy, A. Sanomiya, Paul G. Crumley, José G. Castaños, Richard Michael Shok, Brian E. Smith, Gerrard V. Kopcsay, Jeffrey J. Parker, Ramendra K. Sahoo, Joe Ratterman, George Almási, Roger L. Haskin, Michael Brian Brutman, Charles J. Archer, and Don Darrell Reed
- Subjects
- Computer science, Supercomputer, Theoretical Computer Science, Blue Gene, Software, Operating system, National laboratory, Computer hardware, Information Systems
- Abstract
The Blue Gene/L system at the Department of Energy's Lawrence Livermore National Laboratory in Livermore, California is the world's most powerful supercomputer. It has achieved groundbreaking performance in both standard benchmarks and real scientific applications, and in the process it has enabled new science that simply could not be done before. Blue Gene/L was developed by a relatively small team of dedicated scientists and engineers. This article is both a description of the Blue Gene/L supercomputer and an account of how that system was designed, developed, and delivered. It reports on the technical characteristics of the system that made it possible to build such a powerful supercomputer. It also reports on how teams across the world worked around the clock to accomplish this milestone of high-performance computing.
- Published
- 2007
- Full Text
- View/download PDF
10. Blue Gene/L programming and operating environment
- Author
- Michael B. Mundy, Paul G. Crumley, Manish Gupta, Jose R. Brunheroto, Charles J. Archer, G. G. Stewart, Brian E. Smith, Patrick Joseph McCarthy, D. Limpert, Mark Megerian, D. Reed, G. Almasi, A. Sanomiya, José G. Castaños, Ramendra K. Sahoo, Michael Brian Brutman, Ralph Bellofatto, M. Mendell, Derek Lieber, R. Shok, Todd A. Inglett, José E. Moreira, and P. Bergner
- Subjects
- General Computer Science, Operating environment, Computer science, Operating system, Systems architecture, Leverage (statistics), Architecture, Supercomputer, Porting, Blue Gene, System software
- Abstract
With up to 65,536 compute nodes and a peak performance of more than 360 teraflops, the Blue Gene®/L (BG/L) supercomputer represents a new level of massively parallel systems. The system software stack for BG/L creates a programming and operating environment that harnesses the raw power of this architecture with great effectiveness. The design and implementation of this environment followed three major principles: simplicity, performance, and familiarity. By specializing the services provided by each component of the system architecture, we were able to keep each one simple and leverage the BG/L hardware features to deliver high performance to applications. We also implemented standard programming interfaces and programming languages that greatly simplified the job of porting applications to BG/L. The effectiveness of our approach has been demonstrated by the operational success of several prototype and production machines, which have already been scaled to 16,384 nodes.
- Published
- 2005
- Full Text
- View/download PDF
11. UV-vis and Binding Studies of Cobalt Tetrasulfophthalocyanine–Thiolate Complexes as Intermediates of the Merox Process
- Author
- Eduard M. Tyapochkin, Ali Navid, Charles J. Archer, and Evguenii I. Kozliak
- Subjects
- Monomer, Ultraviolet visible spectroscopy, Autoxidation, Thiol, Phthalocyanine, Stacking, General Chemistry, Transition metal thiolate complex, Photochemistry, Catalysis
- Abstract
Intermediates of the cobalt tetrasulfophthalocyanine (CoTSPc)-catalyzed thiol autoxidation were studied by UV-vis spectroscopy. All thiolates react with CoTSPc to form 1:1 complexes. Three major factors control both the stability and the aggregation of the complexes: thiolate basicity, metal-to-ligand charge transfer (MLCT), and π stacking. Basic thiolates partially reduce Co(II)TSPc, whereas CoTSPc complexes with low-basicity aliphatic thiolates (pKa < 4) show no Co(II) reduction, based on the absence of the characteristic Co(I) charge-transfer band at 450 nm. CoTSPc complexes with aliphatic and bulky aromatic thiolates appear to be aggregated in aqueous solution and are characterized by a broad band at 650 nm. Non-bulky aromatic thiolates of low basicity (pKa < 6) form uniquely stable monomeric Co(II)TSPc complexes; this spectral feature can be attributed to π stacking between the phthalocyanine ring and the thiolate. Comparison of binding constants shows that the partial reduction of Co(II) contributes significantly to thiolate binding. A combination of aromatic π stacking and MLCT appears to be responsible for the observed 1000-fold stronger binding of non-basic aromatic thiolates compared with aliphatic ligands of similar basicity (see the note following this entry for the equivalent free-energy difference). Kinetic studies confirm the importance of the thiolate binding type for catalysis.
- Published
- 1999
- Full Text
- View/download PDF
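As an illustrative note (not taken from the paper), the 1:1 association equilibrium and the free-energy equivalent of the reported roughly 1000-fold difference in binding constants can be written as:

```latex
% 1:1 association of CoTSPc with a thiolate RS^- and the free-energy
% equivalent of a ~1000-fold difference in binding constants at 298 K.
\[
\mathrm{CoTSPc} + \mathrm{RS^-} \rightleftharpoons \mathrm{CoTSPc{\cdot}RS^-},
\qquad
K_b = \frac{[\mathrm{CoTSPc{\cdot}RS^-}]}{[\mathrm{CoTSPc}][\mathrm{RS^-}]}
\]
\[
\Delta\Delta G^\circ = -RT \ln\!\frac{K_{b,\mathrm{arom}}}{K_{b,\mathrm{aliph}}}
\approx -(8.314\ \mathrm{J\,mol^{-1}\,K^{-1}})(298\ \mathrm{K})\,\ln 10^{3}
\approx -17\ \mathrm{kJ\,mol^{-1}}
\]
```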
12. Case Study: LRZ Liquid Cooling, Energy Management, Contract Specialities
- Author
- Ingmar Meijer, Achim Bomelburg, Torsten Bloth, Torsten Wilde, Herbert Huber, Steffen Waitz, Charles J. Archer, and Axel Auweter
- Subjects
- Presentation, Computer cooling, Computer science, Energy management, Mechanical engineering, Data center, Efficient energy use
- Abstract
This presentation explores energy management, liquid cooling, and heat re-use, as well as contract specialities, for the Leibniz-Rechenzentrum (LRZ).
- Published
- 2012
- Full Text
- View/download PDF
13. Network Endpoints for Clusters of SMPs
- Author
- Gabriel Tanase, Hanhong Xue, Gheorghe Almasi, and Charles J. Archer
- Subjects
- Critical section, Computer science, InfiniBand, Simultaneous multithreading, Networking hardware, Software, Computer architecture, Multithreading, Operating system, Cache, System software
- Abstract
Modern large-scale parallel machines feature an increasingly deep hierarchy of interconnections. Individual processing cores employ simultaneous multithreading (SMT) to better exploit their functional units; multiple coherent processors are collocated in a node (SMP) to better exploit links to cache, memory, and the network; and multiple nodes are interconnected by specialized low-latency, high-speed networks. Current trends indicate ever-wider SMP nodes in the future. To service these nodes, modern high-performance network devices (including InfiniBand and all of IBM's recent offerings) offer the ability to sub-divide the network device's resources among the processing threads. System software, however, lags in exploiting these capabilities, leaving users of, e.g., MPI [14] and UPC [19] in a bind, requiring complex and fragile workarounds in user programs. In this paper we discuss our implementation of endpoints, the software paradigm central to the IBM PAMI messaging library [3]. A PAMI endpoint is an expression in software of a slice of the network device. System software can service endpoints without serializing the many threads on an SMP through a single critical section (a generic sketch of the idea follows this entry). We describe the basic guarantees offered by PAMI to the programmer and how they can be used to enable efficient implementations of high-level libraries and programming languages such as UPC. We evaluate the efficiency of our implementation on a novel P7IH system with up to 4096 cores, running microbenchmarks designed to find performance deficiencies in the endpoints implementation of both point-to-point and collective functions.
- Published
- 2012
- Full Text
- View/download PDF
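The following generic C sketch conveys the endpoint idea described above: each thread binds to its own slice of communication state, so message injection touches only thread-private data and needs no process-wide critical section. All names here are hypothetical; this is not the PAMI API.

```c
/* Per-thread "endpoint": a private slice of communication state (its own
 * injection queue and device channel), so threads never contend on a lock. */
#include <pthread.h>
#include <string.h>

#define EP_QUEUE_DEPTH 256

typedef struct {
    int  hw_channel;                     /* slice of the network device */
    char queue[EP_QUEUE_DEPTH][64];      /* per-endpoint injection queue */
    int  head, tail;
} endpoint_t;

static endpoint_t endpoints[64];         /* one per thread, sized at init */
static pthread_key_t ep_key;             /* maps a thread to its endpoint */

void endpoints_init(int nthreads)
{
    pthread_key_create(&ep_key, NULL);
    for (int i = 0; i < nthreads; i++)
        endpoints[i].hw_channel = i;     /* sub-divide device resources */
}

void endpoint_bind(int tid)              /* called once by each thread */
{
    pthread_setspecific(ep_key, &endpoints[tid]);
}

/* Post a message descriptor; only this thread's endpoint state is touched. */
int endpoint_post(const void *desc, size_t len)
{
    endpoint_t *ep = pthread_getspecific(ep_key);
    int next = (ep->tail + 1) % EP_QUEUE_DEPTH;
    if (next == ep->head)
        return -1;                       /* queue full: caller retries later */
    memcpy(ep->queue[ep->tail], desc, len < 64 ? len : 64);
    ep->tail = next;
    return 0;
}
```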
14. Composable, non-blocking collective operations on power7 IH
- Author
- Hanhong Xue, Charles J. Archer, Gabriel Tanase, and Gheorghe Almasi
- Subjects
- Shared memory, Computer science, Node (networking), Scalability, Hierarchical organization, Parallel computing, Latency (engineering), Blocking (statistics), Simultaneous multithreading, System software
- Abstract
The Power7 IH (P7IH) is one of IBM's latest-generation supercomputers. Like most modern parallel machines, it has a hierarchical organization consisting of simultaneous multithreading (SMT) within a core, multiple cores per processor, multiple processors per node (SMP), and multiple SMPs per cluster. A low-latency, high-bandwidth network with specialized accelerators interconnects the SMP nodes, and the system software is tuned to exploit this hierarchical organization. In this paper we present a novel set of collective operations that take advantage of the P7IH hardware. We discuss non-blocking collective operations implemented using point-to-point messages, shared memory, and accelerator hardware, and we show how collectives can be composed to exploit the hierarchical organization of the P7IH to provide low-latency, high-bandwidth operations (a generic illustration of such hierarchical composition follows this entry). We demonstrate the scalability of the collectives we designed with experimental results on a P7IH system with up to 4096 cores.
- Published
- 2012
- Full Text
- View/download PDF
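The sketch below shows, with standard MPI-3 calls only, how a cluster-wide allreduce can be composed hierarchically: a node-local reduce, an inter-node allreduce among node leaders, and a node-local broadcast. It illustrates the composition idea discussed above and does not use the P7IH shared-memory or accelerator paths.

```c
/* Hierarchically composed allreduce (sum of doubles), standard MPI-3. */
#include <mpi.h>

void hierarchical_allreduce_sum(const double *in, double *out, int count)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank, world_rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group ranks that share a node (and thus shared memory). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader per node joins the inter-node communicator. */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Step 1: reduce within the node onto the leader. */
    MPI_Reduce(in, out, count, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: allreduce among node leaders across the network. */
    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, out, count, MPI_DOUBLE, MPI_SUM, leader_comm);

    /* Step 3: broadcast the result back within each node. */
    MPI_Bcast(out, count, MPI_DOUBLE, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```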
15. The deep computing messaging framework
- Author
- Michael A. Blocksome, Dong Chen, Ahmad Faraj, Joseph D. Ratterman, Mark E. Giampapa, Gabor Dozsa, Philip Heidelberger, Sameer Kumar, Charles J. Archer, Brian E. Smith, Gheorghe Almasi, and Jeffrey J. Parker
- Subjects
- Application programming interface, Computer science, Interface (Java), Node (networking), Message passing, Message Passing Interface, Operating system, Programming paradigm, Supercomputer, Direct memory access
- Abstract
We present the architecture of the Deep Computing Messaging Framework (DCMF), a message-passing runtime designed for the Blue Gene/P machine and other HPC architectures. DCMF has been designed to easily support several programming paradigms, such as the Message Passing Interface (MPI), the Aggregate Remote Memory Copy Interface (ARMCI), Charm++, and others. This support is made possible because DCMF provides an application programming interface (API) with active messages and non-blocking collectives (a generic sketch of the active-message pattern follows this entry). DCMF is being open-sourced and has a layered, component-based architecture with multiple levels of abstraction, allowing members of the community to contribute new components at the various layers. The DCMF runtime can be extended to other architectures through architecture-specific implementations of its interface classes. The production DCMF runtime on Blue Gene/P takes advantage of the direct memory access (DMA) hardware to offload message-passing work and achieve good overlap of computation and communication. We take advantage of the fact that the Blue Gene/P node is a symmetric multiprocessor with four cache-coherent cores and use multithreading to optimize performance on the collective network. We also present a performance evaluation of the DCMF runtime on Blue Gene/P and show that it delivers performance close to the hardware limits.
- Published
- 2008
- Full Text
- View/download PDF
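Below is a generic sketch of the active-message pattern mentioned in the abstract: handlers are registered under integer ids and invoked by the receive path when a matching packet arrives. The names are hypothetical; this is not the actual DCMF API.

```c
/* Generic active-message dispatch: the message itself names its handler,
 * so the receiver needs no posted receive matching the sender. */
#include <stddef.h>

#define AM_MAX_HANDLERS 128

typedef void (*am_handler_t)(const void *payload, size_t len, int src_rank);

static am_handler_t am_table[AM_MAX_HANDLERS];

int am_register(int id, am_handler_t fn)
{
    if (id < 0 || id >= AM_MAX_HANDLERS)
        return -1;
    am_table[id] = fn;
    return 0;
}

/* Called from the runtime's receive/progress path for every incoming packet. */
void am_dispatch(int id, const void *payload, size_t len, int src_rank)
{
    if (id >= 0 && id < AM_MAX_HANDLERS && am_table[id])
        am_table[id](payload, len, src_rank);
}
```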
16. Design and Implementation of a One-Sided Communication Interface for the IBM eServer Blue Gene
- Author
- José E. Moreira, J. Nieplocha, Charles J. Archer, Michael A. Blocksome, José G. Castaños, Brian E. Smith, Derek Lieber, G. Almasi, Todd A. Inglett, P. McCarthy, Joseph D. Ratterman, Sriram Krishnamoorthy, Vinod Tipparaju, Albert Sidelnik, and Michael B. Mundy
- Subjects
- Grid network, Computer science, Message passing, Global Arrays, Supercomputer, Software, Kernel (image processing), Computer architecture, Operating system, IBM, Interrupt
- Abstract
This paper discusses the design and implementation of a one-sided communication interface for the IBM Blue Gene/L supercomputer. This interface facilitates ARMCI and the Global Arrays toolkit and can be used by other one-sided communication libraries. New protocols, interrupt-driven communication, and compute node kernel enhancements were required to enable these libraries. Three possible methods for enabling ARMCI on the Blue Gene/L software stack are discussed. A detailed look into the development process shows how the implementation of the one-sided communication interface was completed. This was accomplished on a compressed time scale with the collaboration of various organizations within IBM and open source communities. In addition to enabling the one-sided libraries, bandwidth enhancements were made for communication along a diagonal on the Blue Gene/L torus network. The maximum bandwidth improved by a factor of three. This work will enable a variety of one-sided applications to run on Blue Gene/L.
- Published
- 2006
- Full Text
- View/download PDF
17. Blue Gene system software---Design and implementation of a one-sided communication interface for the IBM eServer Blue Gene® supercomputer
- Author
- Derek Lieber, J. Nieplocha, José G. Castaños, Joseph D. Ratterman, Charles J. Archer, Michael B. Mundy, Vinod Tipparaju, Sriram Krishnamoorthy, José E. Moreira, G. Almasi, Albert Sidelnik, Todd A. Inglett, Brian E. Smith, P. McCarthy, and Michael A. Blocksome
- Subjects
- Grid network, Computer science, Global Arrays, Supercomputer, Blue Gene, Software, Kernel (image processing), Operating system, IBM, Interrupt
- Abstract
This paper discusses the design and implementation of a one-sided communication interface for the IBM Blue Gene/L supercomputer. This interface facilitates ARMCI and the Global Arrays toolkit and can be used by other one-sided communication libraries. New protocols, interrupt driven communication, and compute node kernel enhancements were required to enable these libraries. Three possible methods for enabling ARMCI on the Blue Gene/L software stack are discussed. A detailed look into the development process shows how the implementation of the one-sided communication interface was completed. This was accomplished on a compressed time scale with the collaboration of various organizations within IBM and open source communities. In addition to enabling the one-sided libraries, bandwidth enhancements were made for communication along a diagonal on the Blue Gene/L torus network. The maximum bandwidth improved by a factor of three. This work will enable a variety of one-sided applications to run on Blue Gene/L.
- Published
- 2006
- Full Text
- View/download PDF
18. Scaling physics and material science applications on a massively parallel Blue Gene/L system
- Author
- Alan Gara, José E. Moreira, Peter Williams, Thomas E. Spelce, George Almási, Manish Gupta, Robert K. Yates, Alison Kubota, Bob Walkup, Francois Gygi, Charles J. Archer, James N. Glosli, Andrew W. Cook, Vasily V. Bulatov, Gyan Bhanot, Steve Louis, James C. Sexton, C. A. Rendleman, Jeffrey Greenough, Frederick H. Streitz, and Bronis R. de Supinski
- Subjects
- Interconnection, Scale (ratio), Computer science, Node (networking), Scalability, Parallel computing, L-system, Software architecture, Scaling, Massively parallel
- Abstract
Blue Gene/L represents a new way to build supercomputers, using a large number of low power processors, together with multiple integrated interconnection networks. Whether real applications can scale to tens of thousands of processors (on a machine like Blue Gene/L) has been an open question. In this paper, we describe early experience with several physics and material science applications on a 32,768 node Blue Gene/L system, which was installed recently at the Lawrence Livermore National Laboratory. Our study shows some problems in the applications and in the current software implementation, but overall, excellent scaling of these applications to 32K nodes on the current Blue Gene/L system. While there is clearly room for improvement, these results represent the first proof point that MPI applications can effectively scale to over ten thousand processors. They also validate the scalability of the hardware and software architecture of Blue Gene/L.
- Published
- 2005
- Full Text
- View/download PDF
19. Optimization of MPI collective communication on BlueGene/L systems
- Author
- C. Christopher Erway, Yili Zheng, Charles J. Archer, Burkhard Steinmacher-Burow, José E. Moreira, George Almási, Philip Heidelberger, and Xavier Martorell
- Subjects
- Collective communication, Shared memory, Computer science, Natural language programming, Parallel computing, Supercomputer, Performance results
- Abstract
BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine, and the BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives (a sketch of this generic style follows this entry), which turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized collectives, describing the algorithms and presenting performance results measured with targeted microbenchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
- Published
- 2005
- Full Text
- View/download PDF
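For contrast with the machine-optimized collectives described above, the sketch below is a minimal binomial-tree broadcast built purely from point-to-point calls, i.e., the generic MPICH2-style baseline that such torus- and collective-network-aware algorithms improve upon.

```c
/* Binomial-tree broadcast over point-to-point messages: each process first
 * receives from its parent, then forwards to its children down the tree. */
#include <mpi.h>

void binomial_bcast(void *buf, int count, MPI_Datatype type,
                    int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int rel = (rank - root + size) % size;   /* rank relative to the root */
    int mask = 1;

    /* Receive once from the parent: the peer that differs in my lowest set bit. */
    while (mask < size) {
        if (rel & mask) {
            int src = (rank - mask + size) % size;
            MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Then forward down the tree: one send per remaining lower bit. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size) {
            int dst = (rank + mask) % size;
            MPI_Send(buf, count, type, dst, 0, comm);
        }
        mask >>= 1;
    }
}
```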
20. Early Experience with Scientific Applications on the Blue Gene/L Supercomputer
- Author
- Yuriy Zhestkov, Robert S. Germain, Frank Suits, Katherine Riley, Henry M. Tufo, Alessandro Curioni, Bob Walkup, Richard Loft, Wanda Andreoni, Gyan Bhanot, Pavlos M. Vranas, Christopher Ward, Manish Gupta, Aleksandr Rayshubskiy, José E. Moreira, Charles J. Archer, Mike Pitman, Dong Chen, Blake G. Fitch, George Almási, John A. Gunnels, Maria Eleftheriou, Philip Heidelberger, Alan Gara, T. Voran, and James C. Sexton
- Subjects
- Computer architecture, Computer science, Node (networking), Embedded system, Scalability, Software development, Message Passing Interface, Systems architecture, Supercomputer, Software architecture, System software
- Abstract
Blue Gene/L uses a large number of low power processors, together with multiple integrated interconnection networks, to build a supercomputer with low cost, space and power consumption. It uses a novel system software architecture designed with application scalability in mind. However, whether real applications will scale to tens of thousands of processors has been an open question. In this paper, we describe early experience with several applications on a 16,384 node Blue Gene/L system. This study establishes that applications from a broad variety of scientific disciplines can effectively scale to thousands of processors. The results reported in this study represent the highest performance ever demonstrated for most of these applications, and in fact, show effective scaling for the first time ever on thousands of processors.
- Published
- 2005
- Full Text
- View/download PDF
21. Implementing MPI on the BlueGene/L Supercomputer
- Author
- Brian Toonen, George Almási, Kurt Walter Pinnow, Philip Heidelberger, Charles J. Archer, C. Christopher Erway, José E. Moreira, Joe Ratterman, William Gropp, Xavier Martorell, Burkhard Steinmacher-Burow, Nils Smeds, and José G. Castaños
- Subjects
- Grid network, Computer science, Node (networking), Message passing, Tree network, Parallel computing, Network topology, Supercomputer
- Abstract
The BlueGene/L supercomputer will consist of 65,536 dual-processor compute nodes interconnected by two high-speed networks: a three-dimensional torus network and a tree topology network. Each compute node can only address its own local memory, making message passing the natural programming model for BlueGene/L. In this paper we present our implementation of MPI for BlueGene/L. In particular, we discuss how we leveraged the architectural features of BlueGene/L to arrive at an efficient implementation of MPI in this machine. We validate our approach by comparing MPI performance against the hardware limits and also the relative performance of the different modes of operation of BlueGene/L. We show that dedicating one of the processors of a node to communication functions greatly improves the bandwidth achieved by MPI operation, whereas running two MPI tasks per compute node can have a positive impact on application performance.
- Published
- 2004
- Full Text
- View/download PDF
22. Architecture and Performance of the BlueGene/L Message Layer
- Author
- Xavier Martorell, John A. Gunnels, George Almási, Charles J. Archer, José E. Moreira, and Philip Heidelberger
- Subjects
- Grid network, Computer science, Message passing, Parallel computing, Supercomputer, Tree (data structure), Software, Virtual machine, Layer (object-oriented design), Cache coherence
- Abstract
The BlueGene/L supercomputer is planned to consist of 65,536 dual-processor compute nodes interconnected by high speed torus and tree networks. Compute nodes can only address local memory, making message passing the natural programming model for the machine. In this paper we present the architecture and performance of the BlueGene/L message layer, the software library that makes an efficient MPI implementation possible. We describe the components and protocols of the message layer, and present microbenchmark based performance results for several aspects of the library.
- Published
- 2004
- Full Text
- View/download PDF
23. MPI on BlueGene/L: Designing an Efficient General Purpose Messaging Solution for a Large Cellular System
- Author
- José E. Moreira, Brian Toonen, José G. Castaños, Xavier Martorell, Silvius Rus, George Almási, William Gropp, Manish Gupta, and Charles J. Archer
- Subjects
- Cellular architecture, Grid network, Computer science, Embedded system, Message passing, Scalability, System on a chip, Software architecture, Cache coherence
- Abstract
The BlueGene/L computer uses system-on-a-chip integration and a highly scalable 65,536-node cellular architecture to deliver 360 Tflops of peak computing power. Efficient operation of the machine requires a fast, scalable, and standards compliant MPI library. In this paper, we discuss our efforts to port the MPICH2 library to BlueGene/L.
- Published
- 2003
- Full Text
- View/download PDF
24. Breaking the petaflops barrier
- Author
- A. Emerich, Charles J. Archer, J. A. Fritzjunker, D. Grice, S. H. Lewis, James E. Carey, T. Schimke, P. McCarthy, Cornell G. Wright, Philip J. Sanders, H. Brandt, and P. R. Germann
- Subjects
- Serviceability (computer), General Computer Science, Computer science, Node (networking), Programming complexity, Petascale computing, Software, Roadrunner, Synchronization (computer science), Operating system, IBM
- Abstract
In this paper, we discuss the impact of petascale computing and the major issues in getting beyond a petaflops. We describe IBM approaches to petascale computing, with a major focus on the Los Alamos National Laboratory Roadrunner machine. We provide an overview of the hardware and software structures, focusing on the new triblade compute-node architecture and the corresponding data-control and synchronization software that enables high-performance computing applications on this architecture. The fundamental technology drivers and issues for petascale computing and beyond are software complexity, energy efficiency, and system reliability, availability, and serviceability.
- Published
- 2009
- Full Text
- View/download PDF
25. Design and implementation of message-passing services for the Blue Gene/L supercomputer
- Author
- C. Christopher Erway, Joseph D. Ratterman, Charles J. Archer, Burkhard Steinmacher-Burow, Brian Toonen, José E. Moreira, José G. Castaños, William Gropp, John A. Gunnels, Xavier Martorell, K. W. Pinnow, P. Heidelberger, and G. Almasi
- Subjects
- Coprocessor, General Computer Science, Computer science, Node (networking), Message passing, Bandwidth (signal processing), Message Passing Interface, Parallel computing, Supercomputer, Mode (computer interface), Operating system, Massively parallel
- Abstract
The Blue Gene®/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. This message layer can be used both to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two BG/L modes of operation: the coprocessor mode and the virtual node mode. Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.