Descriptor: "Programming paradigm" / Journal: parallel computing - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Programming paradigm"' showing total 79 results


1. Efficient implementation of MPI-3 RMA over openFabrics interfaces

Author: Sayantan Sur, Erik Paulson, Hajime Fujita, María Jesús Garzarán, Charles J. Archer, and Chongxiao Cao
Subjects: MPICH, ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, Computer Networks and Communications, Computer science, business.industry, Message Passing Interface, 010103 numerical & computational mathematics, Software_PROGRAMMINGTECHNIQUES, 01 natural sciences, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, 010101 applied mathematics, Software, Artificial Intelligence, Hardware and Architecture, Embedded system, Programming paradigm, 0101 mathematics, business
Abstract: The Message Passing Interface (MPI) standard supports Remote Memory Access (RMA) operations, where a process can read or write memory of another process without requiring the target process to be involved in the communication. This enables new, more efficient programming models. This paper describes the RMA design and implementation in MPICH-OFI, an MPICH-based open source implementation of the MPI standard that uses the OpenFabrics Interfaces (OFI) to communicate with the underlying network fabric. MPICH-OFI is based on a new communication layer called CH4, which was designed to achieve high performance by minimizing the runtime software overhead and by having an internal API that is well aligned with MPI functions. MPICH-OFI uses the OpenFabrics Interfaces (OFI), a lightweight communication framework to support modern high-speed interconnects. Thanks to CH4 and OFI, MPICH-OFI achieves low latency and high bandwidth for RMA operations. Our experimental results using microbenchmarks show that MPICH-OFI achieves more than 3x better put/get latency and bandwidth than MPICH CH3, 10% better latency than Open MPI and MVAPICH2, and more than 1.7x bandwidth than MVAPICH2 for small messages (≤ 4 KB), on Intel® Omni-Path Architecture.
Published: 2019

2. iPregel: Vertex-centric programmability vs memory efficiency and performance, why choose?

Author: J. Mark Bull, Nick Brown, Timothy A. K. Zakian, Zhenjiang Hu, and Ludovic Anthony Richard Capelli
Subjects: FOS: Computer and information sciences, Vertex (graph theory), Computer Networks and Communications, Computer science, 010103 numerical & computational mathematics, Parallel computing, 01 natural sciences, Computer Graphics and Computer-Aided Design, Execution time, Graph, Theoretical Computer Science, 010101 applied mathematics, Computer Science - Distributed, Parallel, and Cluster Computing, Shared memory, Artificial Intelligence, Hardware and Architecture, Asynchronous communication, Programming paradigm, Memory footprint, Distributed, Parallel, and Cluster Computing (cs.DC), 0101 mathematics, Software
Abstract: The vertex-centric programming model, designed to improve the programmability in graph processing application writing, has attracted great attention over the years. Multiple shared memory frameworks that have implemented the vertex-centric interface all expose a common tradeoff: programmability against memory efficiency and performance. Our approach consists in preserving vertex-centric programmability, while implementing optimisations missing from FemtoGraph, developing new ones and designing these so they are transparent to a user’s application code, hence not impacting programmability. We therefore implemented our own shared memory vertex-centric framework iPregel, relying on in-memory storage and synchronous execution. In this paper, we evaluate it against FemtoGraph, whose characteristics are identical, but also an asynchronous counterpart GraphChi and the vertex-subset-centric framework Ligra. Our experiments include three of the most popular vertex-centric benchmark applications over 4 real-world publicly accessible graphs, which cover all orders of magnitude between a million to a billion edges. We then measure the execution time and the peak memory usage. Finally, we evaluate the programmability of each framework by comparing it against the original Pregel, Google’s closed-source implementation that started the whole area of vertex-centric programming. Experiments demonstrate that iPregel, like FemtoGraph, does not sacrifice vertex-centric programmability for additional performance and memory efficiency optimisations, which contrasts with GraphChi and Ligra. Sacrificing vertex-centric programmability allowed the latter to benefit from substantial performance and memory efficiency gains. However, experiments demonstrate that iPregel is up to 2,300 times faster than FemtoGraph, as well as generating a memory footprint up to 100 times smaller.
These results greatly change the situation; Ligra and GraphChi are up to 17,000 and 700 times faster than FemtoGraph but, when comparing against iPregel, this maximum speed-up drops to 10. Furthermore, on PageRank, it is iPregel that proves to be the fastest overall. When it comes to memory efficiency, the same observation applies; Ligra and GraphChi are 100 and 50 times lighter than FemtoGraph, but iPregel nullifies these benefits: it provides the same memory efficiency as Ligra and even proves to be 3 to 6 times lighter than GraphChi on average. In other words, iPregel demonstrates that preserving vertex-centric programmability is not incompatible with a competitive performance and memory efficiency.
Published: 2019
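
Illustration for entry 2 (not taken from the paper): the vertex-centric model with synchronous execution mentioned in the abstract can be sketched in a few lines of Python. This is a minimal sketch of the general idea only; the function name, the dictionary-based graph, and the parameter values are assumptions made for illustration and do not reflect the iPregel API.

```python
def pagerank_vertex_centric(out_edges, supersteps=20, damping=0.85):
    """Synchronous vertex-centric PageRank on a graph without dangling vertices.

    `out_edges` maps each vertex to the list of vertices it points to.
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # Message phase: every vertex sends an equal share of its rank
        # along each outgoing edge.
        inbox = {v: [] for v in out_edges}
        for v, neighbours in out_edges.items():
            for w in neighbours:
                inbox[w].append(rank[v] / len(neighbours))
        # Compute phase: every vertex updates itself from its own inbox only.
        rank = {
            v: (1.0 - damping) / n + damping * sum(inbox[v])
            for v in out_edges
        }
    return rank


# Hypothetical three-vertex graph used only for this illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_vertex_centric(graph)
```

Each superstep has a message phase followed by a compute phase, so a vertex only ever sees values produced in the previous superstep; that separation is what "synchronous execution" refers to here.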

3. pointerchain: Tracing pointers to their roots – A case study in molecular dynamics simulations

Author: Sunita Chandrasekaran, Millad Ghane, and Margaret S. Cheung
Subjects: Computer Networks and Communications, Computer science, business.industry, Embarrassingly parallel, 010103 numerical & computational mathematics, Parallel computing, Tracing, Data structure, 01 natural sciences, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, 010101 applied mathematics, CUDA, Software, Artificial Intelligence, Hardware and Architecture, Pointer (computer programming), Programming paradigm, 0101 mathematics, business, Codebase
Abstract: As scientific frameworks become sophisticated, so do their data structures. A data structure typically includes pointers and arrays to other structures in order to preserve an application’s state. In order to ensure data consistency from a scientific application on a modern high performance computing (HPC) architecture, the management of such pointers on the host and the device has become complicated in terms of memory allocations because they occupy separate memory spaces. It becomes so severe that one must go through a chain of pointers to extract the effective address. In this paper, we propose to reduce the need for excessive data transfer by introducing the idea of pointerchain, a directive that replaces the pointer chains with their corresponding effective address inside the parallel region of a code. Based on our analysis, pointerchain leads to a 39% and 38% reduction in the amount of generated code and the total executed instructions, respectively. With pointerchain, we have parallelized CoMD, a Molecular Dynamics (MD) proxy application on heterogeneous HPC architectures while maintaining a single portable codebase. This portable codebase utilizes OpenACC, an emerging directive-based programming model, to address the need of memory allocations from three computational kernels in CoMD. Two of the three embarrassingly parallel kernels highly benefit from OpenACC and perform better than the hand-written CUDA counterparts. The third kernel performed 61% of peak performance of its CUDA counterpart. The three kernels are common modules in any MD simulations. Our findings provide useful insights into parallelizing legacy MD software across heterogeneous platforms.
Published: 2019

4. Resilient computational applications using Coarray Fortran

Author: Alessandro Fanfarillo, Daniel Nagle, Dinshaw S. Balsara, and Sudip K. Garain
Subjects: Computer Networks and Communications, business.industry, Fortran, Computer science, Probabilistic logic, 010103 numerical & computational mathematics, Supercomputer, 01 natural sciences, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, 010101 applied mathematics, Software, Computer engineering, Artificial Intelligence, Hardware and Architecture, Programming paradigm, 0101 mathematics, Programmer, business, Coarray Fortran, computer, computer.programming_language
Abstract: With the increase in the number of hardware components and layers of the software stack in High Performance Computing (HPC), there will likely be an increase in the number of hardware and software failures, which will be user-visible. Even under the most optimistic assumptions about the reliability of individual components, probabilistic amplification from using millions of nodes has a dramatic impact on the Mean Time Between Failure (MTBF) of the entire platform. Although several techniques to address this problem have been developed, the support provided by the programming model, for the user to mitigate or work around this issue, is still insufficient. The Fortran 2018 standard defines failed images, a new feature that allows the programmer to detect and manage image failures in a parallel program. In this paper we show how to use failed images and teams, another feature defined in the Fortran 2018 standard, to implement resilient computational applications.
Published: 2019

5. PMIx: Process management for exascale environments

Author: David G. Solt, Aurelien Bouteiller, Ralph H. Castain, and Joshua Hursey
Subjects: Process management, Computer Networks and Communications, Interface (Java), Computer science, Distributed computing, 010103 numerical & computational mathematics, 02 engineering and technology, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Process management (computing), Artificial Intelligence, 020204 information systems, Systems management, 0202 electrical engineering, electronic engineering, information engineering, Orchestration (computing), 0101 mathematics, 020203 distributed computing, Computer Graphics and Computer-Aided Design, Workflow, Hardware and Architecture, Scalability, Programming paradigm, Resource allocation, State (computer science), computer, Software
Abstract: High-Performance Computing (HPC) applications have historically executed in static resource allocations, using programming models that ran independently from the resident system management stack (SMS). Achieving exascale performance that is both cost-effective and fits within site-level environmental constraints will, however, require that the application and SMS collaboratively orchestrate the flow of work to optimize resource utilization and compensate for on-the-fly faults. The Process Management Interface - Exascale (PMIx) community is committed to establishing scalable workflow orchestration by defining an abstract set of interfaces by which not only applications and tools can interact with the resident SMS, but also the various SMS components can interact with each other. This paper presents a high-level overview of the goals and current state of the PMIx standard, and lays out a roadmap for future directions.
Published: 2018

6. OpenMP application experiences: Porting to accelerated nodes

Author: Colleen Bertoni, Thomas R. W. Scogland, Vivek S. Kale, Swen Boehm, Paul R. C. Kent, Hal Finkel, Barbara Chapman, JaeHyuk Kwack, Markus Eisenbach, Kiran Ravikumar, Oscar Hernandez, Buu Pham, Ye Luo, Joseph Huber, Pui-Kuen Yeung, Piotr Luszczek, Johannes Doerfert, Seonmyeong Bak, Reuben D. Budiardja, Shintaro Iwasaki, Swaroop Pophale, Meifeng Lin, Vivek Sarkar, and Shilei Tian
Subjects: Computer Networks and Communications, business.industry, Computer science, Supercomputer, Computer Graphics and Computer-Aided Design, Porting, Port (computer networking), Theoretical Computer Science, Variety (cybernetics), Artificial Intelligence, Hardware and Architecture, Software deployment, Programming paradigm, Software engineering, business, Implementation, Software
Abstract: As recent enhancements to the OpenMP specification become available in its implementations, there is a need to share the results of experimentation in order to better understand the OpenMP implementation’s behavior in practice, to identify pitfalls, and to learn how the implementations can be effectively deployed in scientific codes. We report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy. Additionally, we identify important challenges and open problems related to the deployment of OpenMP. Through our report of experiences, we find that OpenMP implementations are successful on current supercomputing platforms and that OpenMP is a promising programming model to use for applications to be run on emerging and future platforms with accelerated nodes.
Published: 2022

7. Collectives in hybrid MPI+MPI code: Design, practice and performance

Author: Huan Zhou, Naweiluo Zhou, Ralf Schneider, and José Gracia
Subjects: Scheme (programming language), FOS: Computer and information sciences, Computer Networks and Communications, Semantics (computer science), Computer science, Message Passing Interface, Context (language use), 010103 numerical & computational mathematics, Parallel computing, Software_PROGRAMMINGTECHNIQUES, 01 natural sciences, Theoretical Computer Science, Artificial Intelligence, Synchronization (computer science), 0101 mathematics, computer.programming_language, Shared memory model, Message passing, Computer Graphics and Computer-Aided Design, 010101 applied mathematics, ComputingMilieux_GENERAL, Computer Science - Distributed, Parallel, and Cluster Computing, Shared memory, Hardware and Architecture, Programming paradigm, Distributed, Parallel, and Cluster Computing (cs.DC), computer, Software
Abstract: The use of a hybrid scheme combining the message passing programming models for inter-node parallelism and the shared memory programming models for node-level parallelism is widespread. Existing extensive practices on hybrid Message Passing Interface (MPI) plus Open Multi-Processing (OpenMP) programming account for its popularity. Nevertheless, strong programming efforts are required to gain performance benefits from the MPI+OpenMP code. An emerging hybrid method that combines MPI and the MPI shared memory model (MPI+MPI) is promising. However, writing an efficient hybrid MPI+MPI program -- especially when the collective communication operations are involved -- is not to be taken for granted. In this paper, we propose a new design method to implement hybrid MPI+MPI context-based collective communication operations. Our method avoids on-node memory replications (on-node communication overheads) that are required by semantics in pure MPI. We also offer wrapper primitives hiding all the design details from users, which comes with practices on how to structure hybrid MPI+MPI code with these primitives. The micro-benchmarks show that our collectives are comparable or superior to those in pure MPI context. We have further validated the effectiveness of the hybrid MPI+MPI model (which uses our wrapper primitives) in three computational kernels, by comparison to the pure MPI and hybrid MPI+OpenMP models. Comment: 14 pages. Accepted for publication in Parallel Computing
Published: 2020

8. Output nondeterminism detection for programming models combining dataflow with shared memory

Author: Hassan Salehe Matar, Serdar Tasiran, Erdal Mutlu, and Didem Unat
Subjects: Signal programming, Computer Networks and Communications, Dataflow, Computer science, Programming language, Dataflow programming, 020207 software engineering, 02 engineering and technology, Parallel computing, computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Dependency graph, Shared memory, Artificial Intelligence, Hardware and Architecture, TheoryofComputation_LOGICSANDMEANINGSOFPROGRAMS, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, 020201 artificial intelligence & image processing, Compiler, Programmer, computer, Execution model, Software
Abstract: Implementing highly concurrent programs can be challenging because programmers can easily introduce unintended nondeterminism, which has the potential to affect the program output. We propose and implement a technique for detecting unintended nondeterminism in applications developed on shared memory systems with a dataflow execution model. Such nondeterminism bugs may be caused by missing or incorrect ordering of task dependencies that are used for ensuring certain ordering of tasks. The proposed method is based on the formulation of a happens-before relation on task executions in a dataflow dependency graph. Its implementation is composed of two main phases: log recording and detection. For recording the necessary information from the execution, the tool instruments the dataflow framework and the applications, on top of the LLVM compiler infrastructure. Later it processes the collected log and reports on the found output nondeterminism in the execution. The tool can integrate well with the development cycle to provide the programmer with a testing framework against possible nondeterminism bugs. To demonstrate its effectiveness, we study a set of benchmark applications written in the Atomic DataFlow programming model and report on real nondeterminism bugs in them.
Published: 2018
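
Illustration for entry 8 (not taken from the paper): the abstract describes checking a happens-before relation over a dataflow dependency graph. A heavily simplified sketch of that idea is to flag pairs of tasks that write the same location while neither is reachable from the other through dependency edges. The task names, the `deps` and `writes` dictionaries, and both helper functions are hypothetical; the actual tool works from recorded execution logs and is considerably more involved.

```python
from itertools import combinations


def reachable(deps, start):
    """Return every task reachable from `start` by following dependency edges."""
    seen, stack = set(), [start]
    while stack:
        task = stack.pop()
        for nxt in deps.get(task, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def unordered_conflicts(deps, writes):
    """Find task pairs that write a common location with no happens-before order."""
    after = {t: reachable(deps, t) for t in writes}
    conflicts = []
    for a, b in combinations(sorted(writes), 2):
        ordered = b in after[a] or a in after[b]
        if not ordered and writes[a] & writes[b]:
            conflicts.append((a, b))
    return conflicts


# Hypothetical graph: an edge t1 -> t2 means t1 must finish before t2 starts.
deps = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": []}
writes = {"t1": {"x"}, "t2": {"y"}, "t3": {"y"}, "t4": {"x"}}
conflicts = unordered_conflicts(deps, writes)
```

In this example `t1` and `t4` both write `x` but are ordered through `t2`, so only the unordered pair `t2` and `t3`, which both write `y`, is reported.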

9. Evaluating MPI resource usage summary statistics

Author: Kurt B. Ferreira and Scott Levy
Subjects: Computer Networks and Communications, Computer science, Message passing, Computer Graphics and Computer-Aided Design, Usage data, Theoretical Computer Science, Set (abstract data type), Resource (project management), Computer engineering, Artificial Intelligence, Hardware and Architecture, Synchronization (computer science), Key (cryptography), Programming paradigm, Software, TRACE (psycholinguistics)
Abstract: The Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on today’s high-performance computing (HPC) systems. This dominance stems from MPI’s powerful semantics for inter-process communication that has enabled scientists to write applications for simulating important physical phenomena. MPI does not, however, specify how messages and synchronization should be carried out. Those details are typically dependent on low-level architecture details and the message characteristics of the application. Therefore, analyzing an application’s MPI resource usage is critical to tuning MPI’s performance on a particular platform. The result of this analysis is typically a discussion of the mean message sizes, queue search lengths and message arrival times for a workload or set of workloads. While a discussion of the arithmetic mean in MPI resource usage might be the most intuitive summary statistic, it is not always the most accurate in terms of representing the underlying data. In this paper, we analyze MPI resource usage for a number of key MPI workloads using an existing MPI trace collector and discrete-event simulator. Our analysis demonstrates that the average, while easy and efficient to calculate, is a useful metric for characterizing latency and bandwidth measurements, but may not be a good representation of application message sizes, match list search depths, or MPI inter-operation times. Additionally, we show that the median and mode are superior choices in many cases. We also observe that the arithmetic mean is not the best representation of central tendency for data that are drawn from distributions that are multi-modal or have heavy tails. The results and analysis of our work provide valuable guidance on how we, as a community, should discuss and analyze MPI resource usage data for scientific applications.
Published: 2021
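
Illustration for entry 9 (not taken from the paper): the abstract argues that the arithmetic mean can misrepresent data drawn from multi-modal or heavy-tailed distributions. The sketch below uses made-up message sizes, chosen only to make the effect visible, and Python's standard `statistics` module.

```python
import statistics

# Hypothetical message sizes in bytes (not from the paper): many small
# control messages plus a few large bulk transfers.
message_sizes = [64] * 90 + [1_048_576] * 10


def summarize(sizes):
    """Return the mean, median, and mode of a list of sizes."""
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "mode": statistics.mode(sizes),
    }


summary = summarize(message_sizes)
```

With these invented values the mean is about 104,915 bytes, a size that no message in the sample actually has, while the median and mode both report the 64-byte messages that make up 90% of the sample.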

10. Toward performance-portable PETSc for GPU-based exascale systems

Author: Hannah Morgan, Karl Rupp, Junchao Zhang, Hong Zhang, Todd Munson, Jed Brown, Matthew G. Knepley, Mark F. Adams, Richard T. Mills, Stefano Zampini, Satish Balay, Alp Dener, Scott Kruger, and Barry Smith
Subjects: Flexibility (engineering), Computer Networks and Communications, Computer science, Parallel computing, Computer Graphics and Computer-Aided Design, Extensibility, Theoretical Computer Science, CUDA, Nonlinear system, Software portability, Artificial Intelligence, Hardware and Architecture, Scalability, Programming paradigm, Software
Abstract: The Portable Extensible Toolkit for Scientific computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library, and it enables application developers to use their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL, on upcoming exascale systems. A blueprint for using GPUs from PETSc-based codes is provided, and case studies emphasize the flexibility and high performance achieved on current GPU-based systems.
Published: 2021

11. A technique to automatically determine Ad-hoc communication patterns at runtime

Author: Diego R. Llanos, Ana Moreton-Fernandez, and Arturo Gonzalez-Escribano
Subjects: 020203 distributed computing, Multi-core processor, Computer Networks and Communications, Computer science, Distributed computing, 020207 software engineering, 02 engineering and technology, Parallel computing, computer.software_genre, Supercomputer, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Scalability, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Distributed memory, Compiler, computer, SPMD, Software
Abstract: Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory, and the exploitation of shared-memory models. In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks, containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities.
Our experimental results show that, despite our runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes, and with codes generated with auto-parallelizing compilers. 2018-12-01. MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 (TIN2016-81840-REDT), COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS), and by the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
Published: 2017

12. Traleika Glacier: A hardware-software co-designed approach to exascale computing

Author: Joshua B. Fryman, Ankit More, Bala Seshasayee, Romain Cledat, Paul Griffin, Dave Dunning, Shekhar Borkar, Vincent Cavé, and Sanjay Chatterjee
Subjects: 0301 basic medicine, Global address space, Computer Networks and Communications, Computer science, Distributed computing, 02 engineering and technology, computer.software_genre, Theoretical Computer Science, 03 medical and health sciences, Runtime system, Software, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Hardware compatibility list, Hardware architecture, 020203 distributed computing, business.industry, Computer Graphics and Computer-Aided Design, Exascale computing, Petascale computing, 030104 developmental biology, Hardware and Architecture, Programming paradigm, Operating system, Hardware acceleration, business, computer
Abstract: The Traleika Glacier architecture, targeted at exascale hardware, is proposed. A task-based runtime system, the Open Community Runtime, is presented. The experience of co-designing hardware and software for exascale is described. The move from current petascale machines to future exascale machines will need both hardware improvements and software changes. Hardware will need to evolve to focus primarily on features that lower energy consumption: near-threshold voltage operation, fine-grained power and clock management and heterogeneity. Software will also need to evolve and be able to express more parallelism, become more dynamic and adaptable in order to be able to operate on much more variable hardware. In this paper, we present Traleika Glacier, an effort that seeks to evaluate radical design changes to meet the constraints, both in terms of power and cost, of exascale computing. The salient features of the hardware design presented in the work include a) the use of heterogeneous cores, b) a redesign of the memory system that centers around hierarchical scratchpads and a global address space, c) the hardware acceleration of certain memory and network operations through specialized engines and, d) very fine-grained control and monitoring capabilities. On the software side, we describe a task-based runtime system, the Open Community Runtime (OCR), which aims to express a wide range of higher-level programming models with a very limited set of core concepts: event-driven tasks for computation, events for synchronization and relocatable data-blocks for data management.
Published: 2017

13. Evaluation of the partitioned global address space (PGAS) model for an inviscid Euler solver

Author: Alexander Ostermann, Martina Prugger, and Lukas Einkemmer
Subjects: FOS: Computer and information sciences, Computer Networks and Communications, Computer science, 02 engineering and technology, Parallel computing, Software_PROGRAMMINGTECHNIQUES, Theoretical Computer Science, Artificial Intelligence, Inviscid flow, Unified Parallel C, 0202 electrical engineering, electronic engineering, information engineering, Partitioned global address space, Euler solver, computer.programming_language, 020203 distributed computing, Numerical analysis, Computer Graphics and Computer-Aided Design, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware and Architecture, Programming paradigm, Computer Science - Mathematical Software, 020201 artificial intelligence & image processing, Distributed, Parallel, and Cluster Computing (cs.DC), Mathematical Software (cs.MS), computer, Software, Xeon Phi
Abstract: In this paper we evaluate the performance of Unified Parallel C (which implements the partitioned global address space programming model) using a numerical method that is widely used in fluid dynamics. In order to evaluate the incremental approach to parallelization (which is possible with UPC) and its performance characteristics, we implement different levels of optimization of the UPC code and compare it with an MPI parallelization on four different clusters of the Austrian HPC infrastructure (LEO3, LEO3E, VSC2, VSC3) and on an Intel Xeon Phi. We find that UPC is significantly easier to develop in than MPI and that the performance achieved is comparable to MPI in most situations. The obtained results show worse performance (on VSC2), competitive performance (on LEO3, LEO3E and VSC3), and superior performance (on the Intel Xeon Phi). Comment: Parallel Computing 2016
Published: 2016

14. CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters

Author: Akshay Venkatesh, Ching-Hsiang Chu, Khaled Hamidouche, Ammar Ahmad Awan, Dhabaleswar K. Panda, and Hari Subramoni
Subjects: 020203 distributed computing, Remote direct memory access, Computer Networks and Communications, Computer science, Address space, 020209 energy, InfiniBand, 02 engineering and technology, Parallel computing, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, CUDA, Shared memory, Kernel (image processing), Artificial Intelligence, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Partitioned global address space, Software
Abstract: GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs. It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions, along with one-sided communication semantics. However, current approaches and designs for OpenSHMEM on GPU clusters do not take advantage of the GDR features, leaving potential performance improvements untapped. In this paper, we introduce "CUDA-Aware" concepts for OpenSHMEM that enable operations to be directly performed from/on buffers residing in GPU memory. We propose novel and efficient designs that ensure "truly one-sided" communication for different intra-/inter-node configurations while working around the hardware limitations. We achieve 2.5× and 7× improvement in point-to-point communication for intra-node and inter-node configurations, respectively. Our proposed framework achieves 2.2 µs for an intra-node 8-byte put operation from CPU to local GPU and 3.13 µs for an inter-node 8-byte put operation from GPU to remote GPU. The proposed designs lead to a 19% reduction in the execution time of the Stencil2D application kernel from the SHOC benchmark suite on the Wilkes system, which is composed of 64 dual-GPU nodes. Similarly, the evolution time of the GPULBM application is reduced by 45% on 64 GPUs. On a CS-Storm-based system with 8 GPUs per node, we show 50% and 23% improvement on 32 and 64 GPUs, respectively.
Published: 2016

15. Integrating blocking and non-blocking MPI primitives with task-based programming models

Author: Josep M. Perez, Antonio J. Peña, Xavier Teruel, Kevin Sala, Vicenç Beltran, Jesús Labarta, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: FOS: Computer and information sciences, Schedule, Application program interfaces (Computer software), Computer Networks and Communications, Computer science, media_common.quotation_subject, 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, 01 natural sciences, Theoretical Computer Science, Task (project management), Runtime system, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Task, 0101 mathematics, Function (engineering), Queue, OmpSs-2, media_common, Informàtica::Programació [Àrees temàtiques de la UPC], OpenMP, Interfícies de programació d'aplicacions (Programari), Interoperability, Computer Graphics and Computer-Aided Design, Blocking (computing), 020202 computer hardware & architecture, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware and Architecture, Scalability, Programming paradigm, MPI, Distributed, Parallel, and Cluster Computing (cs.DC), TAMPI, Software
Abstract: In this paper we present the Task-Aware MPI library (TAMPI) that integrates both blocking and non-blocking MPI primitives with task-based programming models. The TAMPI library leverages two new runtime APIs to improve both programmability and performance of hybrid applications. The first API allows to pause and resume the execution of a task depending on external events. This API is used to improve the interoperability between blocking MPI communication primitives and tasks. When an MPI operation executed inside a task blocks, the task running is paused so that the runtime system can schedule a new task on the core that became idle. Once the blocked MPI operation is completed, the paused task is put again on the runtime system's ready queue, so eventually it will be scheduled again and its execution will be resumed. The second API defers the release of dependencies associated with a task completion until some external events are fulfilled. This API is composed only of two functions, one to bind external events to a running task and another function to notify about the completion of external events previously bound. TAMPI leverages this API to bind non-blocking MPI operations with tasks, deferring the release of their task dependencies until both task execution and all its bound MPI operations are completed. Our experiments reveal that the enhanced features of TAMPI not only simplify the development of hybrid MPI+OpenMP applications that use blocking or non-blocking MPI primitives but they also naturally overlap computation and communication phases, which improves application performance and scalability by removing artificial dependencies across communication tasks. European Commission's projects: INTERTWinE (EC-H2020-671602), Marie Skłodowska-Curie (EC-H2020-749516). Postprint submitted to the Parallel Computing Journal (Elsevier). Figures from section 7.2 updated, typos corrected
Published: 2019

16. Accelerated molecular dynamics simulation of Silicon Crystals on TaihuLight using OpenACC

Author: Jianguo Liang, Rong Hua, You Fu, Hao Zhang, and Zhu Wenqiang
Subjects: Amdahl's law, Speedup, TOP500, Xeon, Computer Networks and Communications, Computer science, Parallel computing, SW26010, computer.software_genre, Computer Graphics and Computer-Aided Design, Execution time, Porting, Theoretical Computer Science, symbols.namesake, Artificial Intelligence, Hardware and Architecture, Programming paradigm, symbols, Compiler, computer, Software, Sunway TaihuLight
Abstract: The Sunway TaihuLight with the theoretical peak performance of 125 PFlop/s is now ranked third in the TOP500 list. It provides a high-level programming model named OpenACC, which extends the OpenACC 2.0 standard with some customized extensions. We assess the performance of the extended programming model and the SW26010 heterogeneous many-core processor for running molecular dynamics (MD) simulation of solid covalent crystals using many-body potentials, such as the Tersoff potentials. Considering the special architecture of the SW26010 processor, we implement the porting of the MD simulation of silicon crystals using the Sunway OpenACC under the guidance of the extended Amdahl’s law. Since the Sunway OpenACC compiler cannot deal with the performance bottleneck of the MD simulation of silicon crystals, we implement two primary optimizations including designing the software cache and minimizing the access frequency of the main memory on an intermediate version of the code generated by the compiler. Experimental results indicate that a single-process many-core speedup of 12.89x can be achieved by using manual optimization strategies. Compared with the execution time of the serial version on Intel® Xeon® CPU E5-2620 v4 processor, 8.7x speedup can be achieved.
Published: 2020

17. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

Author: Bei Wang, William Tang, Taisuke Boku, Keisuke Tsugane, Hitoshi Murai, and Mitsuhisa Sato
Subjects: Computer Networks and Communications, Data parallelism, Computer science, Fortran, Message Passing Interface, 02 engineering and technology, Parallel computing, Thread (computing), computer.software_genre, 01 natural sciences, 010305 fluids & plasmas, Theoretical Computer Science, Artificial Intelligence, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Partitioned global address space, computer.programming_language, Programming language, 020207 software engineering, Computer Graphics and Computer-Aided Design, Hardware and Architecture, Parallel programming model, Programming paradigm, Distributed memory, Memory model, computer, Software
Abstract: We propose the hybrid-view programming approach in PGAS language XcalableMP. We port Gyrokinetic Toroidal Code - Princeton (GTC-P) to XcalableMP. The comparison of the performance and productivity with XMP and MPI implementations. Hybrid-view implementation increases the readability of the code. Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. The performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
Published: 2016

18. Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

Author: José Nelson Amaral, Ettore Tiotto, Xavier Martorell, Michail Alvanos, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Computer Networks and Communications, Computer science, Parallel programming (Computer science), Optimizing compiler, 02 engineering and technology, Parallel computing, Programació en paral·lel (Informàtica), Theoretical Computer Science, Artificial Intelligence, Unified Parallel C, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), Compiler optimization, Instrumentation (computer programming), Partitioned global address space, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], computer.programming_language, Address space, Communication, Locality, Unified parallel C, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Hardware and Architecture, Programming paradigm, 020201 artificial intelligence & image processing, computer, Software
Abstract: We improve performance of fine-grain UPC applications by orders of magnitude. We introduce a novel shared-data localization transformation. We present a thorough performance analysis and evaluation. We show that reducing run-time calls is crucial for performance. We achieve performance comparable to C and MPI using the UPC programming model. Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses to larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase performance of UPC programs up to 1.8× their UPC hand-optimized counterpart for applications with regular accesses and up to 6.3× for applications with irregular accesses.
Published: 2016

19. Hybrid MPI-thread parallelization of adaptive mesh operations

Author: Mark S. Shephard, Dan Ibanez, and Ian Dunn
Subjects: POSIX Threads, Computer Networks and Communications, Computer science, Message passing, Message Passing Interface, 010103 numerical & computational mathematics, Thread (computing), Parallel computing, Supercomputer, 01 natural sciences, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, 010101 applied mathematics, Shared memory, Artificial Intelligence, Hardware and Architecture, Hybrid system, Scalability, Programming paradigm, 0101 mathematics, Software
Abstract: Development of a hybrid MPI-thread programming system called PCU. Inter-thread message passing, including non-blocking collectives. A novel, scalable termination detection technique for communication rounds. Hybrid parallel scalability to 16K cores on an IBM Blue Gene/Q. Many of the world's leading supercomputer architectures are a hybrid of shared memory and network-distributed memory. Such an architecture lends itself to a hybrid MPI-thread programming model. We first present an implementation of inter-thread message passing based on the MPI and pthread libraries. In addition, we present an efficient implementation of termination detection for communication rounds. We use the term phased message passing to denote the communication interface based on this termination detection. This interface is then used to implement parallel operations for adaptive unstructured meshes, and the performance of resulting applications is compared to pure MPI operation. We also present new workflows enabled by the ability to vary the number of threads during runtime.
Published: 2016

20. Design and evaluation of efficient global data movement in partitioned global address space

Author: Mitsuhisa Sato and Hitoshi Murai
Subjects: Theoretical computer science, Computer Networks and Communications, Computer science, Movement (music), media_common.quotation_subject, 010103 numerical & computational mathematics, computer.software_genre, 01 natural sciences, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, 010101 applied mathematics, Artificial Intelligence, Hardware and Architecture, Feature (computer vision), Section (archaeology), Computer cluster, Programming paradigm, Partitioned global address space, Compiler, 0101 mathematics, Function (engineering), computer, Software, media_common
Abstract: Global data movement is the most general, and therefore important, function of inter-node communication in the partitioned global address space programming models, such as XcalableMP. Our implementation of it consists of compile-time and run-time optimization for specific cases and run-time processing based on the calculus of common-stride section descriptors for general cases, which allows efficient construction of communication schedules for global data movement. As a result of the evaluation of the implementation on the K computer and a common Linux cluster, it is verified to be effective and useful as a compiler feature in most cases.
Published: 2020

21. Parallelization and scalability analysis of inverse factorization using the chunks and tasks programming model

Author: Elias Rudberg, Anton G. Artemov, and Emanuel H. Rubensson
Subjects: FOS: Computer and information sciences, D.1.3, Computer Networks and Communications, Computer science, Inverse, G.1.3, Numerical Analysis (math.NA), Parallel computing, 65Y05, 65F30, 65F50, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Computational Engineering, Finance, and Science (cs.CE), Factorization, Artificial Intelligence, Hardware and Architecture, Scalability, FOS: Mathematics, Data_FILES, Programming paradigm, Mathematics - Numerical Analysis, Computer Science - Computational Engineering, Finance, and Science, Software
Abstract: We present three methods for distributed memory parallel inverse factorization of block-sparse Hermitian positive definite matrices. The three methods are a recursive variant of the AINV inverse Cholesky algorithm, iterative refinement, and localized inverse factorization, respectively. All three methods are implemented using the Chunks and Tasks programming model, building on the distributed sparse quad-tree matrix representation and parallel matrix-matrix multiplication in the publicly available Chunks and Tasks Matrix Library (CHTML). Although the algorithms are generally applicable, this work was mainly motivated by the need for efficient and scalable inverse factorization of the basis set overlap matrix in large scale electronic structure calculations. We perform various computational tests on overlap matrices for quasi-linear Glutamic Acid-Alanine molecules and three-dimensional water clusters discretized using the standard Gaussian basis set STO-3G with up to more than 10 million basis functions. We show that for such matrices the computational cost increases only linearly with system size for all the three methods. We show both theoretically and in numerical experiments that the methods based on iterative refinement and localized inverse factorization outperform previous parallel implementations in weak scaling tests where the system size is increased in direct proportion to the number of processes. We show also that compared to the method based on pure iterative refinement the localized inverse factorization requires much less communication. Comment: 20 pages, 7 figures, corrected the author list
Published: 2019

22. Global transformations for legacy parallel applications via structural analysis and rewriting

Author: Adrian Prantl, Sriram Krishnamoorthy, Daniel Chavarría-Miranda, Ajay Panyala, and Wenjing Ma
Subjects: Source code, Computer Networks and Communications, Fortran, Computer science, media_common.quotation_subject, Parallel computing, computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Domain (software engineering), Artificial Intelligence, Hardware and Architecture, Scalability, Programming paradigm, Benchmark (computing), Compiler, Rewriting, computer, Software, media_common, computer.programming_language
Abstract: Performance and scalability optimization of large HPC applications is currently a labor-intensive, manual process with very low productivity. Major difficulties come from the disaggregated environment for HPC application development: the compiler is only involved in local decisions (core or multithreaded domain), while a library-based, communication-oriented programming model realizes whole-machine parallelism. Realizing any major global change in such a disaggregated environment is very difficult and involves changing large portions of the source code. We present semi-automated techniques, based on structural analysis and rewriting, for performing global transformations on an HPC application source code. We present two case studies using the Self-Consistent Field (SCF) standalone benchmark as well as the Coupled Cluster (CCSD) module (2.9 million lines of Fortran code), a key module of the NWChem computational chemistry application. We demonstrate how structural rewriting techniques can be used to automate transformations that affect multiple sections of the application’s source code. We show that the transformations can be applied in a systematic fashion across the source code bases with minimal manual effort. These transformations improve the scalability of the SCF benchmark by more than two orders of magnitude and the performance of the full CCSD module by a factor of four.
Published: 2015

23. Tuning and hybrid parallelization of a genetic-based multi-point statistics simulation code

Author: Julián M. Ortiz, Cristóbal Samaniego, Oscar Peredo, José R. Herrero, and Barcelona Supercomputing Center
Subjects: Parallel computing, Code optimization, Computer Networks and Communications, Computer science, Execution time, Theoretical Computer Science, Artificial Intelligence, Stochastic simulation, Geostatistics, Multi point, Simulació, Mètodes de, Enginyeria biomèdica [Àrees temàtiques de la UPC], Locality, Genetic algorithms, Program optimization, Supercomputer, Computer Graphics and Computer-Aided Design, Simulation methods, Automatic parallelization, Hardware and Architecture, Programming paradigm, Cache, Multi-point statistics, Genètica, Software
Abstract: One of the main difficulties in using multi-point statistical (MPS) simulation based on annealing techniques or genetic algorithms concerns the excessive amount of time and memory that must be spent in order to achieve convergence. In this work we propose code optimizations and parallelization schemes over a genetic-based MPS code with the aim of speeding up the execution time. The code optimizations involve reducing cache misses in the array accesses, avoiding branching instructions and increasing the locality of the accessed data. The hybrid parallelization scheme involves a fine-grain parallelization of loops using a shared-memory programming model (OpenMP) and a coarse-grain distribution of load among several computational nodes using a distributed-memory programming model (MPI). Convergence, execution time and speed-up results are presented using 2D training images of sizes 100 × 100 × 1 and 1000 × 1000 × 1 on a distributed-shared memory supercomputing facility.
Published: 2014

24. A compiler infrastructure for embedded heterogeneous MPSoCs

Author: Maximilian Odendahl, Mark Bertsch, Weihua Sheng, Gerd Ascheid, Vitaliy Volevach, Rainer Leupers, and Stefan Schürmans
Subjects: Computer Networks and Communications, Computer science, business.industry, 020206 networking & telecommunications, 02 engineering and technology, Dynamic compilation, Construct (python library), MPSoC, computer.software_genre, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Theoretical Computer Science, Domain (software engineering), Software, Computer architecture, Artificial Intelligence, Hardware and Architecture, Embedded system, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Key (cryptography), Compiler, business, computer
Abstract: Programming heterogeneous MPSoCs (Multi-Processor Systems on Chip) is a grand challenge for embedded SoC providers and users today. In this paper, we argue for the need and significance of positioning the language and tool design from the perspective of practicality to address this challenge. We motivate, describe and justify such a practical design of a compilation framework for heterogeneous MPSoCs targeting the domain of streaming applications, named MAPS (MPSoC Application Programming Studio). MAPS defines a clean, light-weight C language extension to capture streaming programming models. A retargetable source-to-source compiler is developed to provide key capabilities to construct practical compilation frameworks for real-world, complex MPSoC platforms. Our results have shown that MAPS is a promising compiler infrastructure that enables programming of heterogeneous MPSoCs and increases productivity of MPSoC software developers.
Published: 2014

25. X10-FT: Transparent fault tolerance for APGAS language and runtime

Author: Binyu Zang, Zhijun Hao, Chenning Xie, and Haibo Chen
Subjects: Multi-core processor, Computer Networks and Communications, Computer science, Distributed computing, Fault tolerance, computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Set (abstract data type), Artificial Intelligence, Hardware and Architecture, Asynchronous communication, Paxos, Operating system, Programming paradigm, Partitioned global address space, Compiler, computer, Software
Abstract: The asynchronous partitioned global address space (APGAS) model is a programming model aiming at unifying programming on multicore and clusters, with good productivity. However, it currently lacks support for fault tolerance (FT) such that a single transient failure may render hours to months of computation useless. In this paper, we thoroughly analyze the feasibility of providing fault tolerance for APGAS model and make the first attempt to add fault tolerance support to an APGAS language called X10. Based on the analysis, we design and implement a fault-tolerance framework called X10-FT that leverages renowned techniques in distributed systems like distributed file systems and Paxos, as well as specific solutions based on the characteristics of the APGAS model to make checkpoints and consensus. This allows the system to transparently handle machine failures at different granularities. Using the features of the APGAS model, we extend the X10 compiler to automatically locate execution points to checkpoint program states without any intervention from programmers. Evaluation using a set of benchmarks shows that the cost for fault tolerance is modest.
Published: 2014

26. Characterizing the challenges and evaluating the efficacy of a CUDA-to-OpenCL translator

Author: Gabriel Martinez, Paul Sathre, Mark K. Gardner, and Wu-chun Feng
Subjects: Source lines of code, Computer Networks and Communications, Computer science, Graphics processing unit, Symmetric multiprocessor system, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Software_PROGRAMMINGTECHNIQUES, computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, CUDA, Computer architecture, Artificial Intelligence, Hardware and Architecture, Programming paradigm, Compiler, General-purpose computing on graphics processing units, computer, Software
Abstract: The proliferation of heterogeneous computing systems has led to increased interest in parallel architectures and their associated programming models. One of the most promising models for heterogeneous computing is the accelerator model, and one of the most cost-effective, high-performance accelerators currently available is the general-purpose, graphics processing unit (GPU). Two similar programming environments have been proposed for GPUs: CUDA and OpenCL. While there are more lines of code already written in CUDA, OpenCL is an open standard that supports a broader. Hence, there is significant interest in automatic translation from CUDA to OpenCL. The contributions of this work are three-fold: (1) an extensive characterization of the subtle challenges of translation, (2) CU2CL (CUDA to OpenCL) - an implementation of a translator, and (3) an evaluation of CU2CL with respect to coverage of CUDA, translation performance, and performance of the translated applications.
Published: 2013

27. MRO-MPI: MapReduce overlapping using MPI and an optimized data exchange policy

Author: Hisham Mohamed and Stéphane Marchand-Maillet
Subjects: Speedup, Computer Networks and Communications, Computer science, Pipeline (computing), Nearest neighbor search, Search engine indexing, Message Passing Interface, Parallel computing, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Data exchange, Programming paradigm, Software
Abstract: MapReduce is a programming model proposed to simplify large-scale data processing. In contrast, the message passing interface (MPI) standard is extensively used for algorithmic parallelization, as it accommodates an efficient communication infrastructure. In the original implementation of MapReduce, the reduce function can only start processing following termination of the map function. If the map function is slow for any reason, this will affect the whole running time. In this paper, we propose MapReduce overlapping using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing. Our implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI. At the same time, we maintain the usability and the simplicity of MapReduce. Experimental results based on three different applications (WordCount, Distributed Inverted Indexing and Distributed Approximate Similarity Search) show a good speedup compared to the earlier versions of MapReduce such as Hadoop and the available MPI-MapReduce implementations.
Published: 2013

28. Special Issue on Topics on Heterogeneous Computing

Author: Antonio J. Peña and Sunita Chandrasekaran
Subjects: Computer Networks and Communications, Computer science, Distributed computing, Symmetric multiprocessor system, 010103 numerical & computational mathematics, 01 natural sciences, Computer Graphics and Computer-Aided Design, Field (computer science), Theoretical Computer Science, 010101 applied mathematics, Set (abstract data type), Artificial Intelligence, Hardware and Architecture, Workload scheduling, Programming paradigm, 0101 mathematics, Unconventional computing, Software
Abstract: This special issue includes a set of articles in the heterogeneous computing field, advancing the state-of-the-art in programming models, tools, algorithms, applications, and workload scheduling.
Published: 2017

29. 8th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM’17)

Author: Min Si, Pavan Balaji, and Zhiyi Huang
Subjects: Artificial Intelligence, Computer Networks and Communications, Hardware and Architecture, Computer science, business.industry, Programming paradigm, Software engineering, business, Computer Graphics and Computer-Aided Design, Software, Theoretical Computer Science
Published: 2018

30. A framework for argument-based task synchronization with automatic detection of dependencies

Author: Basilio B. Fraguela and Carlos H. González
Subjects: Theoretical computer science, Computer Networks and Communications, Computer science, Parallel programming, Libraries, Task parallelism, 02 engineering and technology, Synchronization, computer.software_genre, Theoretical Computer Science, Task (project management), Out-of-order execution, Artificial Intelligence, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, 020203 distributed computing, Programming language, Programming models, 020207 software engineering, Computer Graphics and Computer-Aided Design, Hardware and Architecture, Dependencies, Programming paradigm, Parallelism (grammar), Implicit parallelism, computer, Software
Abstract: Synchronization in parallel applications can be achieved either implicitly or explicitly. Implicit synchronization is typical of programming environments that provide predefined, and often simple, patterns of parallelism such as data-parallel libraries and languages and skeletal operations. Nevertheless, more flexible approaches that allow arbitrary task-level parallel computations to be expressed without a predefined structure require in turn that the user explicitly specify the synchronization needed among the parallel tasks. In this paper we present a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user. Our proposal is the first generic approach to express parallelism we know of that requires neither explicit synchronizations nor a detail of the dependencies of the parallel tasks. Our strategy relies on expressing the parallel tasks as functions that convey their dependencies implicitly by means of their arguments. These function arguments are analyzed by our library, called DepSpawn, when a parallel task is spawned in order to enforce its dependencies. Our experiments indicate that DepSpawn is very competitive, both in terms of performance and programmability, with respect to a widespread high-level approach like OpenMP. Xunta de Galicia; INCITE08PXIB105161PR. Ministerio de Ciencia e Innovación; TIN2010-16735. Ministerio de Educación de España; AP2009-4752
Published: 2013

31. Improving performance of adaptive component-based dataflow middleware

Author: Timothy D. R. Hartley, Ümit V. Çatalyürek, and Erik Saule
Subjects: Computer Networks and Communications, Dataflow, Computer science, Distributed computing, Symmetric multiprocessor system, computer.software_genre, Computer Graphics and Computer-Aided Design, Execution time, Theoretical Computer Science, Software framework, Artificial Intelligence, Hardware and Architecture, Asynchronous communication, Middleware (distributed applications), Component (UML), Programming paradigm, Implementation, computer, Software
Abstract: Making the best use of modern computational resources for distributed applications requires expert knowledge of low-level programming tools, or a productive high-level and high-performance programming framework. Unfortunately, even state-of-the-art high-level frameworks still require the developer to conduct a tedious manual tuning step to find the work partitioning which gives the best application execution performance. Here, we present a novel framework, with which developers can easily create high-performance dataflow applications, without the tedious tuning process. We compare the performance of our approach to that of three distributed programming frameworks which differ significantly in their programming paradigm, their support for multi-core CPUs and accelerators, and their load-balancing approach. These three frameworks are DataCutter, a component-based dataflow framework, KAAPI, a framework using asynchronous function calls, and MR-MPI, a MapReduce implementation. By highly optimizing the implementations of three applications on the four frameworks and comparing the execution time performance of the runtime engines, we show their strengths and weaknesses. We show that our approach achieves good performance for a wide range of applications, with a much-reduced development cost.
Published: 2012

32. Using shared arrays in message-driven parallel programs

Author: Phil Miller, Laxmikant V. Kale, and Aaron T. Becker
Subjects: Flexibility (engineering), Computer Networks and Communications, Computer science, Distributed computing, Message passing, Message Passing Interface, Parallel computing, Loose coupling, Type (model theory), Object (computer science), Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Parallel programming model, Reactive programming, Programming paradigm, Array programming, Programming domain, Software, Compile time
Abstract: This paper describes a safe and efficient combination of the object-based message-driven execution and shared array parallel programming models. In particular, we demonstrate how this combination engenders the composition of loosely coupled parallel modules safely accessing a common shared array. That loose coupling enables both better flexibility in parallel execution and greater ease of implementing multi-physics simulations. As a case study, we describe how the parallelization of a new method for molecular dynamics simulation benefits from both of these advantages. We also describe a system of typed handle objects that embed some of the determinacy constraints of the Multiphase Shared Array programming model in the C++ type system, to catch some violations at compile time. The combined programming model communicates in terms of these handles as a natural means of detecting and preventing errors.
Published: 2012

33. Using explicit platform descriptions to support programming of heterogeneous many-core systems

Author: Siegfried Benkner, Sabri Pllana, and Martin Sandrieser
Subjects: 020203 distributed computing, Source code, Computer Networks and Communications, Programming language, Computer science, Distributed computing, media_common.quotation_subject, Symmetric multiprocessor system, 02 engineering and technology, computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Scheduling (computing), Runtime system, Architectural pattern, Artificial Intelligence, Hardware and Architecture, Server, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, 020201 artificial intelligence & image processing, General-purpose computing on graphics processing units, computer, Software, media_common
Abstract: Heterogeneous many-core systems constitute a viable approach for coping with power constraints in modern computer architectures and can now be found across the whole computing landscape ranging from mobile devices, to desktop systems and servers, all the way to high-end supercomputers and large-scale data centers. While these systems promise to offer superior performance-power ratios, programming heterogeneous many-core architectures efficiently has been shown to be notoriously difficult. Programmers typically are forced to take into account a plethora of low-level architectural details and usually have to resort to a combination of different programming models within a single application. In this paper we propose a platform description language (PDL) that captures key architectural patterns of commonly used heterogeneous computing systems. PDL architecture descriptions support both programmers and toolchains by providing platform-specific information in a well-defined and explicit manner. We have developed a prototype source-to-source compilation framework that utilizes PDL descriptors to transform sequential task-based programs with source code annotations into a form that is convenient for execution on heterogeneous many-core systems. Our framework relies on a component-based approach that accommodates different implementation variants of tasks, customized for different parts of a heterogeneous platform, and utilizes an advanced runtime system for exploiting parallelism through dynamic task scheduling. We show various usage scenarios of our PDL and demonstrate the effectiveness of our framework for a commonly used scientific kernel and a financial application on different configurations of a state-of-the-art CPU/GPU system.
Published: 2012

34. Generating synchronization statements in divide-and-conquer programs

Author: Rob V. van Nieuwpoort, Pieter Hijma, Henri E. Bal, Ceriel J. H. Jacobs, Computer Systems, Network Institute, High Performance Distributed Computing, and Secure and Liable Computer Systems
Subjects: Divide and conquer algorithms, SDG 16 - Peace, Computer Networks and Communications, Computer science, Programming language, SDG 16 - Peace, Justice and Strong Institutions, sync, Static program analysis, Parallel computing, Cilk, computer.software_genre, Computer Graphics and Computer-Aided Design, Justice and Strong Institutions, Jam sync, Synchronization, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Asynchronous communication, Synchronization (computer science), Programming paradigm, computer, Software, computer.programming_language
Abstract: Divide-and-conquer is a well-known and important programming model that supports efficient execution of parallel applications on multi-cores, clusters, and grids. In divide-and-conquer systems such as Satin or Cilk, recursive calls are automatically transformed into jobs that execute asynchronously. Since the calls are non-blocking, consecutive calls are the source of parallelism. However, programmers have to manually enforce synchronization with sync statements that indicate where the system has to wait for the result of the asynchronous jobs. In this article, we investigate the feasibility of automatically inserting sync statements to relieve programmers of the burden of thinking about synchronization. We investigate whether correctness can be guaranteed and to what extent the amount of parallelism is reduced. We discuss the code analysis algorithms that are needed in detail. To evaluate our approach, we have extended the Satin divide-and-conquer system, which targets efficient execution on grids, with a sync generator. Our experiments show that, with our analysis, we can automatically generate synchronization statements in virtually all real-life cases: in 31 out of 35 real-world applications the sync statements are placed optimally. The automatic placement is correct in all cases, and in one case the sync generator corrected synchronization errors in an application (FFT). © 2011 Elsevier B.V. All rights reserved.
Published: 2012
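
Note: The spawn/sync pattern described in the abstract of entry 34 can be illustrated with a minimal sketch. It is written in Python with the standard `concurrent.futures` module rather than in Satin or Cilk, and the names `fib`, `fib_seq`, `parallel_fib` and the `depth` cutoff are assumptions made for illustration only. It shows the manual placement of a sync point that the article proposes to generate automatically; it does not reproduce the authors' sync generator. Because of CPython's global interpreter lock, the sketch shows the program structure, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def fib_seq(n):
    """Plain sequential Fibonacci, used below the spawn cutoff."""
    return n if n < 2 else fib_seq(n - 1) + fib_seq(n - 2)


def fib(n, pool, depth=2):
    """Divide-and-conquer Fibonacci with one explicit spawn and one sync."""
    if n < 2 or depth == 0:
        return fib_seq(n)
    # "spawn": submit() returns a future at once; the job runs asynchronously.
    left = pool.submit(fib, n - 1, pool, depth - 1)
    # The parent keeps working on the other branch in the meantime.
    right = fib(n - 2, pool, depth - 1)
    # "sync": block until the spawned job has finished before using its value.
    return left.result() + right


def parallel_fib(n):
    # depth=2 keeps the number of blocked workers below the pool size.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return fib(n, pool)
```

Reading `left.result()` any earlier than the final line would serialize the two branches, while omitting it would use a value that is not yet available; this is the placement decision the abstract refers to.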

35. Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

Author: Scott Pakin, Darren J. Kerbyson, and Michael Lang
Subjects: business.operation, Computer Networks and Communications, Computer science, Node (networking), Distributed computing, Supercomputer, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Petascale computing, Roadrunner, Artificial Intelligence, Hardware and Architecture, Hybrid system, Programming paradigm, Upstream (networking), business, Software
Abstract: Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance especially in hybrid systems using accelerators. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contains wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional steps in the parallel computation and higher use of on-chip communications. This tradeoff is explored using a performance model. An implementation using the reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in communication performance exists.
Published: 2011
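
Note: The dependency rule stated in the abstract of entry 35 (a cell can be processed only after its upstream neighbors) can be sketched as follows. This is the basic, non-hierarchical wave-front on a 2D grid, not the hierarchical variant the article develops. The function names and the simple "sum of upstream values" update are assumptions chosen so that the result is easy to check.

```python
def wavefront_steps(rows, cols):
    """Group grid cells by anti-diagonal i + j.

    Cells in the same group do not depend on each other, so each group
    could be processed in parallel once the previous group is complete.
    """
    steps = []
    for d in range(rows + cols - 1):
        lo = max(0, d - cols + 1)
        hi = min(rows, d + 1)
        steps.append([(i, d - i) for i in range(lo, hi)])
    return steps


def sweep(rows, cols):
    """Process the grid in wave-front order.

    Each cell combines its upstream (upper and left) neighbors; the
    corner cell (0, 0) has no upstream neighbors and is seeded with 1.
    """
    grid = [[0] * cols for _ in range(rows)]
    for cells in wavefront_steps(rows, cols):
        for i, j in cells:  # independent within one step
            if i == 0 and j == 0:
                grid[i][j] = 1
                continue
            up = grid[i - 1][j] if i > 0 else 0
            left = grid[i][j - 1] if j > 0 else 0
            grid[i][j] = up + left
    return grid
```

A grid of `rows` by `cols` cells needs `rows + cols - 1` sequential steps, which is why the cost of passing boundary data between steps matters so much for this class of applications.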

36. Swift: A language for distributed parallel scripting

Author: Ben Clifford, Daniel S. Katz, Justin M. Wozniak, Mihael Hategan, Michael Wilde, and Ian Foster
Subjects: File system, Many-task computing, Computer Networks and Communications, Dataflow, Computer science, Programming language, Programming complexity, computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Scripting language, Parallel programming model, Programming paradigm, Data-intensive computing, computer, Software, Language construct
Abstract: Scientists, engineers, and statisticians must execute domain-specific application programs many times on large collections of file-based data. This activity requires complex orchestration and data management as data is passed to, from, and among application invocations. Distributed and parallel computing resources can accelerate such processing, but their use further increases programming complexity. The Swift parallel scripting language reduces these complexities by making file system structures accessible via language constructs and by allowing ordinary application programs to be composed into powerful parallel scripts that can efficiently utilize parallel and distributed resources. We present Swift's implicitly parallel and deterministic programming model, which applies external applications to file collections using a functional style that abstracts and simplifies distributed parallel execution.
Published: 2011

37. High performance computing using MPI and OpenMP on multi-core parallel systems

Author: Rupak Biswas, Haoqiang Jin, Piyush Mehrotra, Dennis C. Jespersen, Barbara Chapman, and Lei Huang
Subjects: Multi-core processor, Computer Networks and Communications, Computer science, Node (networking), Locality, Parallel computing, Supercomputer, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Petascale computing, Shared memory, Computer architecture, Artificial Intelligence, Hardware and Architecture, Programming paradigm, Distributed memory, IBM, Software
Abstract: The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.
Published: 2011

38. Performance implications of synchronization structure in parallel programming

Author: Valentín Cardeñoso-Payo, Arturo Gonzalez-Escribano, and Arjan J. C. van Gemund
Subjects: Computer Networks and Communications, Data parallelism, Computer science, Distributed computing, Task parallelism, Parallel computing, Structured programming, Computer Graphics and Computer-Aided Design, Inductive programming, Theoretical Computer Science, Procedural programming, Artificial Intelligence, Hardware and Architecture, Parallel programming model, Programming paradigm, Reactive programming, Implicit parallelism, Programmer, Software
Abstract: The restricted synchronization structure of so-called structured parallel programming paradigms has an advantageous effect on programmer productivity, cost modeling, and scheduling complexity. However, imposing these restrictions can lead to a loss of parallelism, compared to using a programming approach that does not impose synchronization structure. In this paper we study the potential loss of parallelism when expressing parallel computations into a programming model which limits the computation graph (DAG) to series-parallel topology, which characterizes all well-known structured programming models. We present an analytical model that approximately captures this loss of parallelism in terms of simple parameters that are related to DAG topology and workload distribution. We validate the model using a wide range of synthetic and real-world parallel computations running on shared and distributed-memory machines. Although the loss of parallelism is theoretically unbounded, our measurements show that for all above applications the performance loss due to choosing a series-parallel structured model is invariably limited up to 10%. In all cases, the loss of parallelism is predictable provided the topology and workload variability of the DAG are known.
Published: 2009
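
Note: The loss of parallelism discussed in the abstract of entry 38 can be made concrete with a small sketch. It compares the critical path of a task graph that is not series-parallel (the "N" shape: a→c, a→d, b→d) with the same tasks after a full barrier is placed between the two levels, which adds the edge b→c. The task names and weights are hypothetical and deliberately unbalanced to exaggerate the effect; the abstract reports that the measured loss stayed within about 10%. `graphlib` is in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter


def critical_path(weights, preds):
    """Length of the longest weighted path through a task DAG.

    weights: task -> execution time
    preds:   task -> iterable of tasks that must finish first
    """
    finish = {}
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds.get(node, ())), default=0)
        finish[node] = start + weights[node]
    return max(finish.values())


# Hypothetical task times.
weights = {"a": 1, "b": 5, "c": 5, "d": 1}

# Original "N"-shaped DAG: c waits only for a, d waits for a and b.
n_shape = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"]}

# Series-parallel version: a barrier makes c and d wait for both a and b.
series_parallel = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"]}

original_length = critical_path(weights, n_shape)
structured_length = critical_path(weights, series_parallel)
```

With these weights the critical path grows from 6 to 10 time units, because the long task c can no longer overlap with the long task b. With balanced weights (all tasks equal) the two versions have the same critical path, which is consistent with the abstract's point that the loss depends on workload variability as well as topology.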

39. Overcoming performance bottlenecks in using OpenMP on SMP clusters

Author: Yang-Suk Kee, Woo-Chul Jeun, Soonhoi Ha, and Changdon Kee
Subjects: Computer Networks and Communications, Computer science, Software distributed shared memory, Parallel computing, computer.software_genre, Computer Graphics and Computer-Aided Design, Execution time, Theoretical Computer Science, Transmission (telecommunications), Artificial Intelligence, Hardware and Architecture, Computer cluster, Programming paradigm, Operating system, Overhead (computing), Mutual exclusion, computer, Execution model, Software
Abstract: This paper presents a new parallel programming environment called ParADE to enable easy, portable, and high-performance computing for SMP clusters. Different from the prior studies, ParADE separates the programming model from the execution model: it enables shared-address-space programming while it realizes hybrid execution of message-passing and shared-address-space. To overcome the poor performance of conventional OpenMP on SDSM (Software Distributed Shared Memory), ParADE implements an intelligent OpenMP translator supporting efficient mutual exclusion and efficient page transmission. The experimental results on a Linux cluster demonstrate that ParADE reduces mutual exclusion overhead and overall execution time.
Published: 2008

40. Scout: a data-parallel programming language for graphics processors

Author: Greg Roth, Patrick McCormick, Sharen J. Cummins, Jeff Inman, James Ahrens, and Jamaludin Mohd-Yusof
Subjects: Computer Networks and Communications, Computer science, Graphics hardware, Symmetric multiprocessor system, computer.software_genre, Theoretical Computer Science, Computer graphics, Artificial Intelligence, Reactive programming, Graphics, Parallel rendering, Programming language, Software rendering, Scientific visualization, Computer Graphics and Computer-Aided Design, Graphics pipeline, Visualization, Real-time computer graphics, Graphics software, Computer architecture, Hardware and Architecture, High-level programming language, Parallel programming model, Programming paradigm, Fourth-generation programming language, General-purpose computing on graphics processing units, computer, Texture memory, Software
Abstract: Commodity graphics hardware has seen incredible growth in terms of performance, programmability, and arithmetic precision. Even though these trends have been primarily driven by the entertainment industry, the price-to-performance ratio of graphics processors (GPUs) has attracted the attention of many within the high-performance computing community. While the performance of the GPU is well suited for computational science, the programming interface, and several hardware limitations, have prevented their wide adoption. In this paper we present Scout, a data-parallel programming language for graphics processors that hides the nuances of both the underlying hardware and supporting graphics software layers. In addition to general-purpose programming constructs, the language provides extensions for scientific visualization operations that support the exploration of existing or computed data sets.
Published: 2007

41. Is MPI suitable for a generative design-pattern system?

Author: José Nelson Amaral, Paras Mehta, and Duane Szafron
Subjects: Java, Computer Networks and Communications, Computer science, Functional logic programming, Comparison of multi-paradigm programming languages, computer.software_genre, Theoretical Computer Science, Third-generation programming language, Artificial Intelligence, Reactive programming, Fifth-generation programming language, Generative Design, Programming domain, computer.programming_language, Declarative programming, Programming language, Second-generation programming language, Computer Graphics and Computer-Aided Design, Inductive programming, Procedural programming, Hardware and Architecture, Parallel programming model, Software design pattern, Programming paradigm, Fourth-generation programming language, First-generation programming language, computer, Software, Generative grammar, Programming language theory
Abstract: Generative parallel design patterns can be used to improve the productivity of parallel program development. However, many of the generative design-pattern systems are developed for target languages that are not widely used by the high-performance computing community. This paper describes an initial effort to develop a system that will hopefully answer the question in the title in the affirmative. This new system is ostensibly based on, and built upon the experience with, the successful CO2P3S system. Significant challenges must be overcome to implement the features of a system that generates frameworks conceived for an object-oriented programming language (Java) into a parallel-annotated procedural language (MPI/C).
Published: 2006

42. A hierarchical distributed-shared memory parallel Branch&Bound application with PVM and OpenMP for multiprocessor clusters

Author: Beniamino Di Martino, Nicola Mazzocca, Salvatore Venticinque, and Rocco Aversa
Subjects: Distributed shared memory, Branch and bound, Computer Networks and Communications, Computer science, Uniform memory access, Multiprocessing, Parallel computing, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Shared memory, Artificial Intelligence, Hardware and Architecture, Programming paradigm, Distributed memory, SPMD, Software
Abstract: Branch&Bound (B&B) is a technique widely used to solve combinatorial optimization problems in physics and engineering science. In this paper we show how the combined use of PVM and OpenMP libraries can be a promising approach to exploit the intrinsic parallel nature of this class of application and to obtain efficient code for hybrid computational architectures. We describe how both the shared memory and the distributed memory programming models can be applied to implement the same algorithm for the inter-node and intra-node parallelization. Some experimental tests on a local area network (LAN) of workstations are finally discussed.
Published: 2005

43. Optimizing neural networks on SIMD parallel computers

Author: Arun Jagota, Andrea Di Blas, and Richard Hughey
Subjects: Artificial neural network, Computer Networks and Communications, Computer science, Parallel computing, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Hopfield network, Clique problem, Artificial Intelligence, Hardware and Architecture, Asynchronous communication, Programming paradigm, Combinatorial optimization, SIMD, Implementation, Software
Abstract: Hopfield neural networks are often used to solve difficult combinatorial optimization problems. Multiple restarts versions find better solutions but are slow on serial computers. Here, we study two parallel implementations on SIMD computers of multiple restarts Hopfield networks for solving the maximum clique problem. The first one is a fine-grained implementation on the Kestrel Parallel Processor, a linear SIMD array designed and built at the University of California, Santa Cruz. The second one is an implementation on the MasPar MP-2 according to the "SIMD Phase Programming Model", a new method to solve asynchronous, irregular problems on SIMD machines. We find that the neural networks map well to the parallel architectures and afford substantial speedups with respect to the serial program, without sacrificing solution quality.
Published: 2005

44. Scalability of hybrid programming for a CFD code on the Earth Simulator

Author: Mitsuo Yokokawa, Takashi Ishihara, Atsuya Uno, Ken'ichi Itakura, and Yukio Kaneda
Subjects: Computer Networks and Communications, Computer science, business.industry, Parallel computing, Thread (computing), Computational fluid dynamics, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Computational science, Hybrid programming, Artificial Intelligence, Hardware and Architecture, Parallel programming model, Scalability, Programming paradigm, Reactive programming, business, Software, Simulation
Abstract: The Earth Simulator (ES) is an SMP cluster system. There are two types of parallel programming models available on the ES. One is a flat programming model, in which a parallel program is implemented by MPI interfaces only, both within an SMP node and among nodes. The other is a hybrid programming model, in which a parallel program is written by using thread programming within an SMP node and MPI programming among nodes simultaneously. It is generally known that it is difficult to obtain the same high level of performance using the hybrid programming model as can be achieved with the flat programming model. In this paper, we have evaluated scalability of the code for direct numerical simulation of the Navier-Stokes equations on the ES. The hybrid programming model achieves the sustained performance of 346.9 Gflop/s, while the flat programming model achieves 296.4 Gflop/s with 16 PNs of the ES for a DNS problem size of 256^3. For small scale problems, however, the hybrid programming model is not as efficient because of microtasking overhead. It is shown that there is an advantage for the hybrid programming model on the ES for the larger size problems.
Published: 2004

45. Special Issue on Parallel Programming Models and Systems Software for High-End Computing

Author: Abhinav Vishnu, Yong Chen, and Pavan Balaji
Subjects: Resource-oriented architecture, Theoretical computer science, Computer Networks and Communications, Computer science, Computer programming, 010103 numerical & computational mathematics, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Software development process, System programming, Software, Artificial Intelligence, Reactive programming, Software system, 0101 mathematics, Programming domain, Programming in the large and programming in the small, business.industry, Software development, Computer Graphics and Computer-Aided Design, Inductive programming, 010101 applied mathematics, Software framework, Extreme programming practices, Hardware and Architecture, Component-based software engineering, Software construction, Programming paradigm, Software design, Software engineering, business, computer, Programming language theory
Abstract: This special issue features a collection of papers that extend the literature in unique ways, improving the state of art of programming models and systems software for high-end computing systems.
Published: 2016

46. SilkRoad II: mixed paradigm cluster computing with RC_dag consistency

Author: Liang Peng, Weng-Fai Wong, and Chung-Kwong Yuen
Subjects: Divide and conquer algorithms, Theoretical computer science, Computer Networks and Communications, Computer science, Programming language, Software distributed shared memory, Cilk, Directed acyclic graph, computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Runtime system, Artificial Intelligence, Hardware and Architecture, Computer cluster, Programming paradigm, Memory model, computer, Software, computer.programming_language
Abstract: A parallel programming paradigm indicates the way to express applications. It also restricts the algorithms that may be used in the applications. Unfortunately, runtime systems for parallel computing often impose a particular programming paradigm. For a wider choice of algorithms, it is therefore desirable to support more than one paradigm. In this paper, we propose a formalism for modeling parallel programming paradigms from a graph-theoretic view of their execution instance dag and the memory consistency assumptions. This model allows us to formally reason about the properties of parallel programming paradigms that are hitherto only known informally and intuitively. We propose the concept of general paradigm and show that the single program multiple data, the divide and conquer, and the master/slave paradigms are all subsets of this general paradigm. We also propose a superset of these three paradigms, which we call the mixed paradigm, and introduce the RC_dag memory consistency model. We also present our work on SilkRoad II, a variant of the Cilk runtime system for cluster computing. What is unique about SilkRoad II is its memory model which supports multiple paradigms with the underlying software distributed shared memory. Our experimental results show that the stronger RC_dag can achieve performance comparable to LC of Cilk while supporting a bigger set of paradigms with rather good performance.
Published: 2003

47. Message passing and shared address space parallelism on an SMP cluster

Author: Rupak Biswas, Hongzhang Shan, Leonid Oliker, and Jaswinder Pal Singh
Subjects: Distributed shared memory, Computer Networks and Communications, Computer science, Address space, Distributed computing, Message passing, Locality, Message Passing Interface, Parallel computing, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Scalability, Programming paradigm, Implementation, Software
Abstract: Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI, and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and the programming effort required for six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI + SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.
Published: 2003

48. From patterns to frameworks to parallel programs

Author: Kai Tan, Jonathan Schaeffer, John Anvik, Steve MacDonald, Duane Szafron, and Steven Bromling
Subjects: Correctness, Computer Networks and Communications, Programming language, Computer science, Design pattern, computer.software_genre, Computer Graphics and Computer-Aided Design, Inductive programming, Theoretical Computer Science, Domain (software engineering), Artificial Intelligence, Hardware and Architecture, Parallel programming model, Software design pattern, Reactive programming, Programming paradigm, Programming domain, Programmer, computer, Software
Abstract: Object-oriented programming, design patterns, and frameworks are abstraction techniques that have been used to reduce the complexity of sequential programming. This paper describes our approach of applying these three techniques to the more difficult parallel programming domain. The Parallel Design Patterns (PDP) process, the basis of the CO2P3S parallel programming system, combines these techniques in a layered development model. The result is a new approach to parallel programming that addresses correctness and openness in a unique way. At the topmost development layer, a customized framework is generated from a design pattern specification of the parallel structure of the program. This framework encapsulates all of the structural details of the pattern, including communication and synchronization, to prevent programmer errors and ensure correctness. Lower layers are used only for performance tuning to make the code as efficient as necessary. This paper describes CO2P3S, based on the PDP process, and demonstrates it using an example application. We also provide results from a usability study of CO2P3S.
Published: 2002

49. A component-based approach to build a portable and flexible middleware for metacomputing

Author: M. Di Santo, Franco Frattolillo, Eugenio Zimeo, and Wilma Russo
Subjects: Java, Computer Networks and Communications, business.industry, Computer science, Distributed computing, Node (networking), computer.software_genre, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Metacomputing, Software, Grid computing, Artificial Intelligence, Hardware and Architecture, Middleware, Component (UML), Middleware (distributed applications), Programming paradigm, Operating system, The Internet, business, computer, Virtual network, computer.programming_language
Abstract: The huge amount of computing resources in the Internet makes it possible to build meta-computers for solving large-scale problems. Despite the great availability of software infrastructures for managing such systems, metacomputer programming is often based on models that do not appear to be suitable to run applications on wide-area, unreliable, highly-variable networks of computers. In this paper, we present a customisable, Java-based middleware which provides programmers with a portable and flexible framework to run applications over a hierarchical, virtual network architecture. The middleware is designed according to a component-based approach that enables the execution behaviour of each computing node to be customised in order to satisfy application needs. The paper shows some examples of programming model customisation and demonstrates that flexibility can be achieved without significantly compromising performance.
Published: 2002

50. The programming model of ASSIST, an environment for parallel and distributed portable applications

Author: Marco Vanneschi
Subjects: Computer Networks and Communications, business.industry, Computer science, Programming language, Semantics (computer science), Data parallelism, Software development, Reuse, Data structure, computer.software_genre, Computer Graphics and Computer-Aided Design, Skeleton (computer programming), Operational semantics, Theoretical Computer Science, Software portability, Artificial Intelligence, Hardware and Architecture, Component (UML), Programming paradigm, Programmer, business, computer, Software
Abstract: A software development system based upon integrated skeleton technology (ASSIST) is a proposal of a new programming environment oriented to the development of parallel and distributed high-performance applications according to a unified approach. The main goals are: high-level programmability and software productivity for complex multidisciplinary applications, including data-intensive and interactive software; performance portability across different platforms, in particular large-scale platforms and grids; effective reuse of parallel software; efficient evolution of applications through versions that scale according to the underlying technologies. The purpose of this paper is to show the principles of the proposed approach in terms of the programming model (successive papers will deal with the environment implementation and with performance evaluation). The features and the characteristics of the ASSIST programming model are described according to an operational semantics style and using examples to drive the presentation, to show the expressive power and to discuss the research issues. According to our previous experience in structured parallel programming, in ASSIST we wish to overcome some limitations of the classical skeletons approach to improve generality and flexibility, expressive power and efficiency for irregular, dynamic and interactive applications, as well as for complex combinations of task and data parallelism. A new paradigm, called "parallel module" (parmod) is defined which, in addition to expressing the semantics of several skeletons as particular cases, is able to express more general parallel and distributed program structures, including both data-flow and nondeterministic reactive computations. ASSIST allows the programmer to design the applications in the form of generic graphs of parallel components. Another distinguishing feature is that ASSIST modules are able to utilize external objects, including shared data structures and abstract objects (e.g. CORBA), with standard interfacing mechanisms. In turn, an ASSIST application can be reused and exported as a component for other applications, possibly expressed in different formalisms.
Published: 2002
