432 results for '"Intel Paragon"'
Search Results
2. Routing with finite speeds of memory and network
- Author
-
Sibeyn, Jop F., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Prívara, Igor, editor, and Ružička, Peter, editor
- Published
- 1997
- Full Text
- View/download PDF
3. High performance computational chemistry: NWChem and fully distributed parallel applications
- Author
-
Guest, M. F., Aprà, E., Bernholdt, D. E., Früchtl, H. A., Harrison, R. J., Kendall, R. A., Kutteh, R. A., Long, X., Nicholas, J. B., Nichols, J. A., Taylor, H. L., Wong, A. T., Fann, G. I., Littlefield, R. J., Nieplocha, J., Goos, G., editor, Hartmanis, J., editor, van Leeuwen, J., editor, Dongarra, Jack, editor, Madsen, Kaj, editor, and Waśniewski, Jerzy, editor
- Published
- 1996
- Full Text
- View/download PDF
4. System management tools for SHPC systems — Partition management
- Author
-
Günther, Holger, Brockners, Frank, Bemmerl, Thomas, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Hertzberger, Bob, editor, and Serazzi, Giuseppe, editor
- Published
- 1995
- Full Text
- View/download PDF
5. Message-passing-systems on workstation clusters and parallel computers — the impact of software- and network-architectures on applications
- Author
-
Resch, Michael, Geiger, Alfred, Zikeli, Jörg, Goos, G., editor, Hartmanis, J., editor, Gentzsch, Wolfgang, editor, and Harms, Uwe, editor
- Published
- 1994
- Full Text
- View/download PDF
6. Reference implementation of scalable I/O low-level API on Intel Paragon.
- Author
-
Sun, Ninghui
- Abstract
The Scalable I/O (SIO) Initiative’s Low-Level Application Programming Interface (SIO LLAPI) provides file system implementers with a simple low-level interface to support high-level parallel I/O interfaces efficiently and effectively. This paper describes a reference implementation and the evaluation of the SIO LLAPI on the Intel Paragon multicomputer. The implementation provides the file system structure and striping algorithm, compatible with the Parallel File System (PFS) of the Intel Paragon, and runs either inside the kernel or as a user-level library. Scatter-gather addressing for reads and writes, asynchronous I/O, client caching and prefetching mechanisms, a file access hint mechanism, collective I/O, and highly efficient file copy have been implemented. The preliminary experience shows that the SIO LLAPI provides opportunities for significant performance improvement and is easy to implement. Some high-level file system interfaces and applications, such as PFS, ADIO, and a Hartree-Fock application, are also implemented on top of SIO. The performance of PFS is at least the same as that of Intel’s native PFS, and in many cases, such as small sequential file accesses, huge I/O requests, and collective I/O, it is stable and much better. The SIO features help to support high-level interfaces easily, quickly, and more efficiently, and the caching, prefetching, and hint mechanisms are useful for obtaining better performance under different access patterns. The scalability and performance of SIO are limited by the network latency, scalable network bandwidth, memory copy bandwidth, memory size, and the pattern of I/O requests. The tradeoff between generality and efficiency should be considered in implementation. [ABSTRACT FROM AUTHOR]
- Published
- 1999
- Full Text
- View/download PDF
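Note: the entry above describes a scatter-gather ("list-directed") style of low-level I/O interface. The abstract does not reproduce the actual SIO LLAPI signatures, so the C sketch below only illustrates the general idea; every identifier in it (sio_iovec_t, sio_read_list) is invented for illustration, and the fallback implementation simply loops over POSIX pread.

    /* Hypothetical sketch of a scatter-gather ("list-directed") read in the
     * style of a low-level parallel I/O API such as the SIO LLAPI described
     * above. All identifiers here are invented; the paper's real interface
     * is not reproduced in the abstract. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    typedef struct {
        off_t  file_offset;   /* where in the file this piece lives */
        void  *mem_addr;      /* where in memory it should land     */
        size_t length;        /* number of bytes in this piece      */
    } sio_iovec_t;            /* hypothetical name */

    /* One call moves many noncontiguous pieces, so the file system can
     * optimize the whole request instead of seeing isolated small reads. */
    ssize_t sio_read_list(int fd, const sio_iovec_t *iov, int n)
    {
        ssize_t total = 0;
        for (int i = 0; i < n; i++) {   /* naive fallback: one pread per piece */
            ssize_t r = pread(fd, iov[i].mem_addr, iov[i].length,
                              iov[i].file_offset);
            if (r < 0) return -1;
            total += r;
        }
        return total;
    }

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char a[64], b[64];
        sio_iovec_t iov[2] = {
            { 0,    a, sizeof a },   /* first stripe  */
            { 4096, b, sizeof b },   /* distant piece */
        };
        ssize_t n = sio_read_list(fd, iov, 2);
        printf("read %zd bytes in one logical request\n", n);
        close(fd);
        return 0;
    }

A real implementation would hand the whole list to the file system so the pieces can be coalesced and scheduled as one request, which is what makes such an interface efficient for small, noncontiguous accesses.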
7. MCGS: A Modified Conjugate Gradient Squared Algorithm for Nonsymmetric Linear Systems.
- Author
-
Maheswaran, Muthucumaru, Webb, Kevin, and Siegel, Howard
- Abstract
The conjugate gradient squared (CGS) algorithm is a Krylov subspace algorithm that can be used to obtain fast solutions for linear systems (Ax=b) with complex nonsymmetric, very large, and very sparse coefficient matrices (A). By considering electromagnetic scattering problems as examples, a study of the performance and scalability of this algorithm on two MIMD machines is presented. A modified CGS (MCGS) algorithm, in which the synchronization overhead is effectively reduced by a factor of two, is proposed in this paper. This is achieved by changing the computation sequence in the CGS algorithm. Both experimental and theoretical analyses are performed to investigate the impact of this modification on the overall execution time. The theoretical and experimental analyses show that CGS is faster than MCGS for a smaller number of processors, while MCGS outperforms CGS as the number of processors increases. Based on this observation, a set-of-algorithms approach is proposed, in which either CGS or MCGS is selected depending on the dimension of the A matrix (N) and the number of processors (P). The set approach provides an algorithm that is more scalable than either the CGS or MCGS algorithm alone. The experiments performed on a 128-processor mesh Intel Paragon and on a 16-processor IBM SP2 with a multistage network indicate that MCGS is approximately 20% faster than CGS. [ABSTRACT FROM AUTHOR]
- Published
- 1999
- Full Text
- View/download PDF
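Note: the halved synchronization overhead described above comes from reordering the CGS recurrences so that inner products can be combined. A generic MPI illustration of that kind of saving, fusing two dot products into a single MPI_Allreduce, is sketched below; it shows the idiom only, not the paper's exact MCGS recurrence.

    /* Two inner products computed back-to-back are reduced with ONE
     * MPI_Allreduce instead of two, halving the number of global
     * synchronizations per iteration. Generic idiom, not the paper's
     * exact reordering. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* each rank holds a slice of the vectors r and q */
        double r[4] = {1, 2, 3, 4}, q[4] = {4, 3, 2, 1};

        double local[2] = {0, 0}, global[2];
        for (int i = 0; i < 4; i++) {
            local[0] += r[i] * r[i];   /* <r, r> */
            local[1] += r[i] * q[i];   /* <r, q> */
        }
        /* one collective instead of two: both sums travel together */
        MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("<r,r> = %g, <r,q> = %g\n", global[0], global[1]);

        MPI_Finalize();
        return 0;
    }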
8. Parallel Distributive Join Algorithm on the Intel Paragon.
- Author
-
Chung, Soon and Chatterjee, Arindam
- Abstract
In this paper, we analyze the performance of the parallel Distributive Join algorithm that we proposed in Chung and Yang (1995). We implemented the algorithm on an Intel Paragon machine and analyzed the effect of the number of processors and the join selectivity on the performance of the algorithm. We also compared the performance of the Distributive Join (DJ) algorithm with that of the Hybrid-Hash (HH) join algorithm. Our results show that the DJ performs comparably with the HH over the entire range of processor counts used and join selectivities tested. A big advantage of the parallel DJ algorithm over the HH join algorithm is that it can easily support non-equijoin operations. The results can also be used to estimate the performance of file-I/O-intensive applications to be implemented on the Intel Paragon machine. [ABSTRACT FROM AUTHOR]
- Published
- 1999
- Full Text
- View/download PDF
9. Parallel Image Correlation: Case Study to Examine Trade-Offs in Algorithm-to-Machine Mappings.
- Author
-
Armstrong, James, Maheswaran, Muthucumaru, Theys, Mitchell, Siegel, Howard, Nichols, Mark, and Casey, Kenneth
- Abstract
Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This study analyzes various mappings of image correlation algorithms in SIMD, MIMD, and mixed-mode environments. Experiments were conducted on the Intel Paragon, MasPar MP-1, nCUBE 2, and PASM prototype. The machine features considered in this study include: modes of parallelism, communication/computation ratio, network topology and implementation, SIMD CU/PE overlap, and communication/computation overlap. Performance of an implementation can be enhanced by using algorithmic techniques that match the machine features. Some algorithmic techniques discussed here are additional communication versus redundant computation, data block transfers, and communication/computation overlap. The results presented are applicable to a large class of image processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping the subtasks of an application task, or a set of independent application tasks, onto a heterogeneous suite of parallel machines. [ABSTRACT FROM AUTHOR]
- Published
- 1998
- Full Text
- View/download PDF
10. Real-time MPEG-2 video encoding on parallel and distributed systems
- Author
-
Shahriar M. Akramullah
- Subjects
business.industry, Computer science, Frame (networking), computer.file_format, MPEG-2, Encoding (memory), Codec, business, computer, Encoder, Massively parallel, Computer hardware, Intel Paragon, Data compression
- Abstract
The information technology of the future is likely to be highly dependent on digital video. The key obstacle to realizing this technology, however, lies in the transmission, access, and storage requirements of huge amounts of data. Compression of a video sequence is an inevitable solution to overcome this obstacle. While various video compression standards have been proposed, their software implementations pose a formidable challenge due to their heavy processing requirements. On the other hand, hardware solutions may not be cost-effective. This thesis deals with the parallel implementation of the MPEG-2 video encoder on various parallel and distributed platforms. We use a data-parallel approach and exploit parallelism within each frame, unlike some of the previous approaches that employ multiple processing of several disjoint video sequences. This makes our encoder suitable for real-time applications where the complete video sequence may not be present on disk and may become available on a frame-by-frame basis over time. The encoder also provides control over various parameters, and has the flexibility to allow the inclusion of various algorithms for different stages of the codec. An encoding rate higher than 30 frames/sec has been achieved on the Intel Paragon. The encoder is portable across various platforms and allows the user to control the granularity of the problem by enabling it to run on a few fast workstations in a coarse-grained fashion as well as on large-scale massively parallel processors in a fine-grained fashion.
- Published
- 2014
11. An experimental evaluation of I/O optimizations on different applications
- Author
-
David E. Bernholdt, Mahmut Kandemir, Alok Choudhary, and Meenakshi A. Kandaswamy
- Subjects
Input/output, business.industry, Computer science, Distributed computing, Message passing, Disk layout, Parallel computing, Program optimization, Application software, computer.software_genre, Software, Computational Theory and Mathematics, Hardware and Architecture, Scalability, Signal Processing, Concurrent computing, business, computer, Intel Paragon
- Abstract
Many large-scale applications have significant I/O requirements as well as computational and memory requirements. Unfortunately, the limited number of I/O nodes provided in a typical configuration of modern message-passing distributed-memory architectures such as the Intel Paragon and the IBM SP-2 severely limits the I/O performance of these applications. In this paper, we examine some software optimization techniques and evaluate their effects in five different I/O-intensive codes from both small and large application domains. Our goals in this study are twofold. First, we want to understand the behavior of large-scale data-intensive applications and the impact of I/O subsystems on their performance and vice versa. Second, and more importantly, we strive to determine solutions for improving the applications' performance through a mix of software techniques. Our results reveal that different applications can benefit from different optimizations. For example, we found that some applications benefit from file layout optimizations, whereas others take advantage of collective I/O. A combination of architectural and software solutions is normally needed to obtain good I/O performance. For example, we show that with a limited number of I/O resources, it is possible to obtain good performance by using appropriate software optimizations. We also show that beyond a certain level, imbalance in the architecture results in performance degradation even when using optimized software, thereby indicating the necessity of an increase in I/O resources.
- Published
- 2002
12. Design and Implementation of a Parallel I/O Runtime System for Irregular Applications
- Author
-
Jaechun No, Jesus Carretero Perez, Sung-Soon Park, and Alok Choudhary
- Subjects
Input/output, Runtime system, Artificial Intelligence, Computer Networks and Communications, Hardware and Architecture, Computer science, Parallel computing, Parallel I/O, Software, Theoretical Computer Science, Intel Paragon, Scheduling (computing)
- Abstract
We present the design, implementation, and evaluation of a runtime system based on collective I/O techniques for irregular applications. The design is motivated by the requirements of a large number of science and engineering applications including teraflops applications, where the data must be reorganized into a canonical form for further processing or restarts. We present two designs: “collective I/O” and “pipelined collective I/O.” In the first design, all processors participate in I/O simultaneously, making scheduling of I/O requests simpler but creating possible contention at the I/O nodes. In the second design, processors are organized into several groups so that only one group performs I/O while the next group performs the communication to rearrange data and this entire process is dynamically pipelined to reduce I/O node contention. In other words, the design provides support for dynamic contention management. We also present a software caching method using collective I/O to reduce I/O cost by reusing the data already present in the memory of other nodes. Chunking and on-line compression mechanisms are included in both models. We present performance results on the Intel Paragon at Caltech and on the ASCI/Red teraflops machine at Sandia National Laboratories.
- Published
- 2002
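Note: in the pipelined design described above, processors are organized into groups so that only one group performs I/O at a time while the others rearrange data. A minimal sketch of forming such groups with MPI_Comm_split follows; the group count, the round-robin stage loop, and the barrier-based stage boundary are illustrative assumptions, not the paper's dynamic pipelining.

    /* Sketch of organizing processors into I/O groups in the spirit of the
     * pipelined collective I/O design above. NGROUPS and the crude barrier
     * between stages are invented for illustration. */
    #include <mpi.h>
    #include <stdio.h>

    #define NGROUPS 4   /* assumed number of pipeline stages */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int group = rank % NGROUPS;        /* which pipeline stage am I in? */
        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

        for (int stage = 0; stage < NGROUPS; stage++) {
            if (group == stage) {
                /* this group's turn at the I/O nodes */
                printf("rank %d: doing I/O in stage %d\n", rank, stage);
            } else {
                /* meanwhile, the other groups would exchange and
                 * rearrange data for their upcoming turn */
            }
            MPI_Barrier(MPI_COMM_WORLD);   /* crude stage boundary */
        }

        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }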
13. Scalability versus Execution Time in Scalable Systems
- Author
-
Xian-He Sun
- Subjects
Relation (database), Computer Networks and Communications, Computer science, Parallel algorithm, Parallel computing, Execution time, Theoretical Computer Science, Range (mathematics), Parallel processing (DSP implementation), Artificial Intelligence, Hardware and Architecture, Scalability, Systems architecture, Point (geometry), Software, Intel Paragon
- Abstract
Parallel programming is elusive. The relative performance of different parallel implementations varies with machine architecture, system and problem size. How to compare different implementations over a wide range of machine architectures and problem sizes has not been well addressed due to its difficulty. Scalability has been proposed in recent years to reveal scaling properties of parallel algorithms and machines. In this paper, the relation between scalability and execution time is carefully studied. The concepts of crossing point analysis and range comparison are introduced. Crossing point analysis finds slow/fast performance crossing points of parallel algorithms and machines. Range comparison compares performance over a wide range of ensemble and problem size via scalability and crossing point analysis. Three algorithms from scientific computing are implemented on an Intel Paragon and an IBM SP2 parallel computer. Experimental and theoretical results show how the combination of scalability, crossing point analysis, and range comparison provides a practical solution for scalable performance evaluation and prediction. While our tests were conducted on homogeneous parallel computers, the proposed methodology applies to heterogeneous and network computing as well.
- Published
- 2002
14. A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures
- Author
-
Jack Dongarra, Greg Henry, and David S. Watkins
- Subjects
Computational Mathematics, ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, Schur decomposition, Applied Mathematics, Scalability, QR algorithm, Distributed memory, Parallel computing, IBM, Supercomputer, Intel Paragon, QR decomposition, Mathematics
- Abstract
One approach to solving the nonsymmetric eigenvalue problem in parallel is to parallelize the QR algorithm. Not long ago, this was widely considered to be a hopeless task. Recent efforts have led to significant advances, although the methods proposed up to now have suffered from scalability problems. This paper discusses an approach to parallelizing the QR algorithm that greatly improves scalability. A theoretical analysis indicates that the algorithm is ultimately not scalable, but the nonscalability does not become evident until the matrix dimension is enormous. Experiments on the Intel Paragon system, the IBM SP2 supercomputer, the SGI Origin 2000, and the Intel ASCI Option Red supercomputer are reported.
- Published
- 2002
15. Optimizing noncontiguous accesses in MPI–IO
- Author
-
William Gropp, Ewing Lusk, and Rajeev Thakur
- Subjects
Unix, Hardware_MEMORYSTRUCTURES, Computer Networks and Communications, Computer science, Subroutine, Parallel computing, computer.software_genre, Computer Graphics and Computer-Aided Design, Parallel I/O, Theoretical Computer Science, Software portability, Memory management, Parallel processing (DSP implementation), Artificial Intelligence, Hardware and Architecture, Operating system, Benchmark (computing), computer, Software, Intel Paragon
- Abstract
The I/O access patterns of many parallel applications consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access noncontiguous data with a single I/O function call, unlike in Unix I/O. In this paper, we explain how critical this feature of MPI-IO is for high performance and how it enables implementations to perform optimizations. We first provide a classification of the different ways of expressing an application's I/O needs in MPI-IO -- we classify them into four levels, called levels 0--3. We demonstrate that, for applications with noncontiguous access patterns, the I/O performance improves dramatically if users write their applications to make level-3 requests (noncontiguous, collective) rather than level-0 requests (Unix style). We then describe how our MPI-IO implementation, ROMIO, delivers high performance for noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how we have implemented these optimizations portably on multiple machines and file systems, controlled their memory requirements, and also achieved high performance. We demonstrate the performance and portability with performance results for three applications -- an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC) -- on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
- Published
- 2002
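Note: the level-0 through level-3 classification above is expressed directly in standard MPI-IO calls. The self-contained example below makes a level-3 request (noncontiguous and collective): a subarray datatype carries the noncontiguity in the file view, and a single MPI_File_write_all lets the library apply data sieving and collective I/O. The 64x64 array and the slab decomposition are arbitrary, and the example assumes the number of processes divides 64.

    /* A level-3 request in the classification above: noncontiguous data
     * described by a file view, written with one collective call. Sizes
     * and the block decomposition are arbitrary examples. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* global 64x64 array of ints, split into horizontal slabs;
         * assumes nprocs divides 64 */
        int gsizes[2] = {64, 64};
        int lsizes[2] = {64 / nprocs, 64};
        int starts[2] = {rank * lsizes[0], 0};

        MPI_Datatype filetype;
        MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                                 MPI_ORDER_C, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        int local[64][64];   /* sized for the worst case (one process) */
        for (int i = 0; i < lsizes[0]; i++)
            for (int j = 0; j < lsizes[1]; j++)
                local[i][j] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

        /* one collective call delivers everyone's noncontiguous pieces */
        MPI_File_write_all(fh, local, lsizes[0] * lsizes[1], MPI_INT,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Finalize();
        return 0;
    }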
16. Square Lattice Self-Avoiding Walks and Polygons
- Author
-
Anthony J. Guttmann and Andrew R. Conway
- Subjects
Combinatorics, Discrete mathematics, Series (mathematics), Simple (abstract algebra), Isotropy, Generating function, Discrete Mathematics and Combinatorics, Supercomputer, Square lattice, Mathematics, Term (time), Intel Paragon
- Abstract
We give an algorithm for the enumeration of self-avoiding walks on the (anisotropic) square lattice. Application of the algorithm on a 1024-processor Intel Paragon supercomputer resulted in a 51-term series. For (isotropic) square lattice self-avoiding polygons, a related algorithm has produced a 90-term series. Careful analysis provides compelling evidence for simple rational values of the exponents in both the dominant and subdominant terms in the asymptotic form of the coefficients. We also advance compelling arguments – but not a proof – that the generating function for SAW is not differentiably finite. The corresponding result for SAP has recently been proved.
- Published
- 2001
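Note: the 51-term series above required a sophisticated finite-lattice algorithm and a 1024-processor machine; for orientation only, here is the textbook backtracking enumerator for square-lattice self-avoiding walks, which reproduces the small known counts (c_1 = 4, c_2 = 12, c_3 = 36, c_4 = 100, ...) but scales exponentially.

    /* Direct backtracking enumeration of square-lattice self-avoiding
     * walks. This naive baseline is only a sanity check; it is NOT the
     * far more efficient algorithm the paper describes. */
    #include <stdio.h>
    #include <string.h>

    #define NMAX 12
    #define SIZE (2 * NMAX + 1)

    static char visited[SIZE][SIZE];
    static const int dx[4] = {1, -1, 0, 0};
    static const int dy[4] = {0, 0, 1, -1};

    static long long count_saw(int x, int y, int steps_left)
    {
        if (steps_left == 0) return 1;
        long long total = 0;
        for (int d = 0; d < 4; d++) {
            int nx = x + dx[d], ny = y + dy[d];
            if (!visited[nx][ny]) {            /* self-avoidance check */
                visited[nx][ny] = 1;
                total += count_saw(nx, ny, steps_left - 1);
                visited[nx][ny] = 0;           /* backtrack */
            }
        }
        return total;
    }

    int main(void)
    {
        for (int n = 1; n <= NMAX; n++) {
            memset(visited, 0, sizeof visited);
            visited[NMAX][NMAX] = 1;           /* walk starts at the origin */
            printf("c_%d = %lld\n", n, count_saw(NMAX, NMAX, n));
        }
        return 0;
    }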
17. Local versus global lookahead in conservative parallel simulations
- Author
-
Carl Tropper and Azzedine Boukerche
- Subjects
Speedup, Computer Networks and Communications, Computer science, Parallel computing, Deadlock, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Scheduling (computing), Artificial Intelligence, Hardware and Architecture, Dijkstra's algorithm, Algorithm, Software, Sequential algorithm, Intel Paragon
- Abstract
This paper presents an algorithm, which we refer to as SGTNE, to efficiently obtain lookahead information from a cluster of processors in a parallel simulation in order to unblock (logical) processes (LP) in a given processor. The SGTNE algorithm is based on a TNE conservative synchronization scheme that relies on an independent execution of a shortest path algorithm in individual processors in order to provide lookahead to the resident LPs. Because TNE is executed on individual processors, it is susceptible to inter-processor deadlocks, which must be detected and broken at some cost. SGTNE (Semi-Global TNE) avoids these deadlocks by executing a shortest path algorithm over a snapshot of the LPs in a cluster of processors. An experimental study of SGTNE was conducted on an Intel Paragon A4. The study compared SGTNE to TNE and to an optimized version of the Chandy–Misra (CM) null-message algorithm. We also investigated several scheduling algorithms for SGTNE and determined factors influencing its performance, most notably the influence of partitioning. Our results indicate that SGTNE provides good speedup relative to the fastest sequential algorithm and that it outperforms TNE for the population level examined; SGTNE was 3–5 times as fast as the CM algorithm.
- Published
- 2001
18. A scalable off-line MPEG-2 video encoding scheme using a multiprocessor system
- Author
-
Shahriar M. Akramullah, Ishfaq Ahmad, M.L. Liou, and Muhammad Kafil
- Subjects
Computer Networks and Communications, Computer science, Multimedia database, Multiprocessing, Parallel computing, computer.file_format, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Scheduling (computing), Idle, MIMD, Artificial Intelligence, Hardware and Architecture, MPEG-2, Scalability, computer, Encoder, Software, Data compression, Intel Paragon
- Abstract
Video compression plays a central role in a vast number of multimedia applications but its computational requirements overwhelm the capabilities of any present single processor system. In this paper, we explore the use of parallel machines like the Intel Paragon to compress MPEG-2 video sequences. The motivation is to build a production-based compression facility by exploiting the potential power of the available machine. Given a video sequence or a set of sequences, the aim of the parallel encoder is to achieve the maximum possible encoding rate. A collective scheduling scheme for the processors, I/O nodes, and disks is proposed that provides fast I/O, minimizes the idle times of processors, and enables the system to work in a highly balanced fashion. An efficient data layout scheme for storing video frames is also proposed in order for the I/O to sustain the desired data transfer rates. Using a small percentage of processors as the I/O nodes results in an efficient utilization of the system resources. As shown by experimental and analytical results, the encoding scheme is scalable and higher performance can be achieved with larger machines. The performance of the proposed scheme can be many times the real-time encoding rates with Standard Interface Format (SIF) and CCIR-601 video sequences. The experimental results indicate about two-fold gain in performance compared to the previous studies. Such a system is useful for the conversion of analog videos to compressed digital form in large studios, digital libraries, and other multimedia database environments. The proposed scheme partitions the system into groups of compute nodes, and I/O nodes, and can be easily extended to other MIMD machines or a set of networked workstations.
- Published
- 2001
19. Fast Parallel Direct Solvers for Coarse Grid Problems
- Author
-
Henry M. Tufo and Paul Fischer
- Subjects
Partial differential equation, Computer Networks and Communications, Nested dissection, Differential equation, Computer science, Computation, Domain decomposition methods, Parallel computing, Solver, Theoretical Computer Science, Multigrid method, Factorization, Elliptic partial differential equation, Artificial Intelligence, Hardware and Architecture, Algorithm, Software, Intel Paragon
- Abstract
We have developed a fast direct solver for parallel solution of coarse grid problems, Ax=b, such as arise when domain decomposition or multigrid methods are applied to elliptic partial differential equations in d space dimensions. The approach is based on a (quasi-)sparse factorization of the inverse of A. If A is n×n and the number of processors is P, the algorithm requires O(n^γ log P) time for communication and O(n^(1+γ)/P) time for computation, where γ ≤ (d−1)/d. The method is particularly suited to leading-edge multicomputer systems having thousands of processors. It achieves minimal message startup costs and substantially reduced message volume and arithmetic complexity compared with competing methods, which require O(n log P) time for communication and O(n^(1+γ)) or O(n^2/P) time for computation. Timings on the Intel Paragon and ASCI-Red machines reflect these complexity estimates.
- Published
- 2001
20. Environmental modeling on massively parallel computers
- Author
-
Maria Zicarelli and Maria Antonietta Pirozzi
- Subjects
Environmental Engineering, Environmental modeling, Computer science, Ecological Modeling, Reliability (computer networking), Parallel algorithm, Parallel computing, Boundary value problem, Solver, Massively parallel, Software, Intel Paragon
- Abstract
In a previous work we studied the concurrent implementation of a numerical model, CONDIFP, developed for the analysis of depth-averaged convection–diffusion problems. Initial experiments were conducted on the Intel Touchstone Delta System, using up to 512 processors and different problem sizes. As for other computation-intensive applications, the results demonstrated an asymptotic trend to unity efficiency when the computational load dominates the communication load. This paper reports further numerical experiments, in both one and two space dimensions and with various choices of initial and boundary conditions, carried out on the Intel Paragon XP/S Model L38 with the aim of illustrating the parallel solver's versatility and reliability.
- Published
- 2000
21. Airshed Pollution Modeling in an HPF Style Environment
- Author
-
Jaspal Subhlok and Peter Steenkiste
- Subjects
Airshed, Computer Networks and Communications, Computer science, Fortran, Task parallelism, Parallel computing, computer.software_genre, Bottleneck, Theoretical Computer Science, Computer architecture, Artificial Intelligence, Hardware and Architecture, High-level programming language, Compiler, computer, Software, High Performance Fortran, computer.programming_language, Intel Paragon
- Abstract
In this paper, we describe our experience with developing Airshed, a large pollution modeling application, in the Fx programming environment. We demonstrate that high level parallel programming languages like Fx and High Performance Fortran offer a simple and attractive model for developing portable and efficient parallel applications. Performance results are presented for the Airshed application executing on Intel Paragon and Cray T3D and T3E parallel computers. The results demonstrate that the application is “performance portable,” i.e., it achieves good and consistent performance across different architectures, and that the performance can be explained and predicted using a simple model for the communication and computation phases in the program. We also show how task parallelism was used to alleviate I/O related bottlenecks, an important consideration in many applications. Finally, we demonstrate how external parallel modules developed using different parallelization methods can be integrated in a relatively simple and flexible way with modules developed in the Fx compiler framework. Overall, our experience demonstrates that a high level parallel programming environment based on a language like HPF is suitable for developing complex multidisciplinary applications.
- Published
- 2000
22. Design, implementation and evaluation of parallel pipelined STAP on parallel computers
- Author
-
R. Linderman, Russell D. Brown, Alok Choudhary, Pramod K. Varshney, M. Linderman, Donald D. Weiner, and Wei-keng Liao
- Subjects
business.industry, Computer science, Parallel algorithm, Aerospace Engineering, Parallel computing, Application software, computer.software_genre, Software, Scalability, Concurrent computing, Electrical and Electronic Engineering, business, computer, Intel Paragon
- Abstract
Performance results are presented for the design and implementation of parallel pipelined space-time adaptive processing (STAP) algorithms on parallel computers. In particular, the issues involved in parallelization, our approach to parallelization, and performance results on an Intel Paragon are described. The process of developing software for such an application on parallel computers, when latency and throughput must be considered together, is discussed, and the tradeoffs considered with respect to inter- and intratask communication and data redistribution are presented. The results show that not only was scalable performance achieved for the individual component tasks of STAP, but linear speedups were also obtained for the integrated task performance, for latency as well as throughput. Results are presented for up to 236 compute nodes (limited by the machine size available to us). Another interesting observation from the implementation results is that assigning additional processors to one task can improve the performance of other tasks without any increase in the number of processors assigned to them. Normally, this cannot be predicted by theoretical analysis.
- Published
- 2000
23. Design and performance evaluation of a portable parallel library for space-time adaptive processing
- Author
-
A.W. Bojanczyk and J.M. Lebak
- Subjects
Class (computer programming), Computer science, Subroutine, Parallel computing, Execution time, Space-time adaptive processing, Software portability, Test case, Computational Theory and Mathematics, Parallel processing (DSP implementation), Hardware and Architecture, Signal Processing, IBM, Intel Paragon
- Abstract
Space-time adaptive processing (STAP) refers to a class of methods for detecting targets using an array of sensors. Various STAP methods use similar operations on different data or in different orders. We have developed a portable, parallel library of subroutines for prototyping STAP methods. The subroutines work on the IBM SP2 and the Intel Paragon under three different operating systems and three different communication libraries, and can also be configured for other systems. We provide execution-time models for predicting the performance of each subroutine. Using the library routines, we created a parallel version of element-space pre-Doppler processing, three parallel versions of higher-order post-Doppler processing, and two versions of PRI-staggered post-Doppler processing. We implemented a fourth version of higher-order post-Doppler processing, the hybrid method, which uses a combination of fine-grain and coarse-grain parallelism to reduce execution time. The hybrid method can be used to improve performance when a large number of processors is available. Our execution time models generally predict the best method and predict execution times to within 10 percent or better for large test cases.
- Published
- 2000
24. Computational Experience with a Parallel Algorithm for Tetrangle Inequality Bound Smoothing
- Author
-
Kumar Rajan and Narsingh Deo
- Subjects
Pharmacology, Discrete mathematics, Magnetic Resonance Spectroscopy, Triangle inequality, General Mathematics, General Neuroscience, Immunology, Molecular Conformation, Parallel algorithm, Binary logarithm, General Biochemistry, Genetics and Molecular Biology, Set (abstract data type), Combinatorics, Computational Theory and Mathematics, Parallel random-access machine, Pairwise comparison, General Agricultural and Biological Sciences, Algorithms, Smoothing, General Environmental Science, Mathematics, Intel Paragon
- Abstract
Determining molecular structure from interatomic distances is an important and challenging problem. Given a molecule with n atoms, lower and upper bounds on interatomic distances can usually be obtained only for a small subset of the n(n−1)/2 atom pairs, using NMR. Given the bounds so obtained on the distances between some of the atom pairs, it is often useful to compute tighter bounds on all the n(n−1)/2 pairwise distances. This process is referred to as bound smoothing. The initial lower and upper bounds for the pairwise distances not measured are usually assumed to be 0 and ∞. One method for bound smoothing is to use the limits imposed by the triangle inequality. The distance bounds so obtained can often be tightened further by applying the tetrangle inequality—the limits imposed on the six pairwise distances among a set of four atoms (instead of three for the triangle inequalities). The tetrangle inequality is expressed by the Cayley–Menger determinants. For every quadruple of atoms, each pass of the tetrangle inequality bound smoothing procedure finds upper and lower limits on each of the six distances in the quadruple. Applying the tetrangle inequalities to each of the (n choose 4) quadruples requires O(n^4) time. Here, we propose a parallel algorithm for bound smoothing employing the tetrangle inequality. Each pass of our algorithm requires O(n^3 log n) time on a CREW PRAM (Concurrent Read Exclusive Write Parallel Random Access Machine) with O(n/log n) processors. An implementation of this parallel algorithm on the Intel Paragon XP/S and its performance are also discussed.
- Published
- 1999
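Note: the tetrangle pass above works with Cayley–Menger determinants over quadruples and is too involved to sketch here, but the triangle-inequality baseline it tightens is compact. The C sketch below performs the standard Floyd–Warshall-style triangle bound smoothing over upper bounds U and lower bounds Lo; the example distances are invented.

    /* Triangle-inequality bound smoothing, the baseline the abstract
     * builds on (the tetrangle pass over quadruples is omitted).
     * U[i][j]/Lo[i][j] are upper/lower bounds on the distance between
     * atoms i and j; unmeasured pairs start at "infinity" and 0. */
    #include <stdio.h>

    #define N 4
    #define INF 1e30

    static double U[N][N], Lo[N][N];

    static void triangle_smooth(void)
    {
        /* a path i-k-j caps the upper bound; a long i-k leg with a
         * short k-j cap raises the lower bound */
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    if (U[i][k] + U[k][j] < U[i][j])
                        U[i][j] = U[i][k] + U[k][j];
                    if (Lo[i][k] - U[k][j] > Lo[i][j])
                        Lo[i][j] = Lo[i][k] - U[k][j];
                    if (Lo[k][j] - U[i][k] > Lo[i][j])
                        Lo[i][j] = Lo[k][j] - U[i][k];
                }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                U[i][j]  = (i == j) ? 0 : INF;
                Lo[i][j] = 0;
            }
        /* two measured pairs (invented values); pair (0,2) is unmeasured */
        U[0][1] = U[1][0] = 2.0;  Lo[0][1] = Lo[1][0] = 1.5;
        U[1][2] = U[2][1] = 3.0;  Lo[1][2] = Lo[2][1] = 2.5;

        triangle_smooth();
        printf("bounds on d(0,2): [%.2f, %.2f]\n", Lo[0][2], U[0][2]);
        return 0;   /* prints [0.50, 5.00] */
    }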
25. Efficient parallel reduction to bidiagonal form
- Author
-
Benedikt Großer and Bruno Lang
- Subjects
Computer Networks and Communications, ScaLAPACK, Parallel computing, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Reduction (complexity), Singular value, Matrix (mathematics), Artificial Intelligence, Hardware and Architecture, Linear algebra, Singular value decomposition, Software, Mathematics, Sparse matrix, Intel Paragon
- Abstract
Most methods for calculating the SVD (singular value decomposition) require to first bidiagonalize the matrix. The blocked reduction of a general, dense matrix to bidiagonal form, as implemented in ScaLAPACK, does about one half of the operations with BLAS3. By subdividing the reduction into two stages dense → banded and banded → bidiagonal with cubic and quadratic arithmetic costs, respectively, we are able to carry out a much higher portion of the calculations in matrix–matrix multiplications. Thus, higher performance can be expected. This paper presents and compares three parallel techniques for reducing a full matrix to banded form. (The second reduction stage is described in another paper [B. Lang, Parallel Comput. 22 (1996) 1–18]). Numerical experiments on the Intel Paragon and IBM SP/1 distributed memory parallel computers demonstrate that the two-stage reduction approach can be significantly superior if only the singular values are required.
- Published
- 1999
26. DIAGONAL-IMPLICITLY ITERATED RUNGE–KUTTA METHODS ON DISTRIBUTED MEMORY MACHINES
- Author
-
Gudula Rünger and Thomas Rauber
- Subjects
Computer science, Data parallelism, MathematicsofComputing_NUMERICALANALYSIS, Numerical methods for ordinary differential equations, Parallel computing, Cash–Karp method, Mathematics::Numerical Analysis, Theoretical Computer Science, Runge–Kutta methods, Computational Theory and Mathematics, Ordinary differential equation, Distributed memory, Algorithm, Dormand–Prince method, Intel Paragon
- Abstract
We consider diagonal-implicitly iterated Runge–Kutta methods which are one-step methods for stiff ordinary differential equations providing embedded solutions for stepsize control. In these methods, algorithmic parallelism is introduced at the expense of additional computations. In this paper, we concentrate on the algorithmic structure of these Runge–Kutta methods and consider several parallel variants of the method exploiting algorithmic and data parallelism in different ways. Our aim is to investigate whether these variants lead to good performance on current distributed memory machines such as the Intel Paragon and the IBM SP2. As test application we use ordinary differential equations with dense and sparse right-hand side functions.
- Published
- 1999
27. Resource scaling effects on MPP performance: the STAP benchmark implications
- Author
-
Zhiwei Xu, Choming Wang, Kai Hwang, and Cho-Li Wang
- Subjects
ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, Memory hierarchy, Computer science, Data parallelism, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Single system image, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Scalability, Benchmark (computing), SPMD, Massively parallel, Intel Paragon
- Abstract
Presently, massively parallel processors (MPPs) are available in only a few commercial models. A sequence of three ASCI Teraflops MPPs has appeared before the new millennium. This paper evaluates six MPP systems through STAP benchmark experiments. The STAP is a radar signal processing benchmark which exploits regularly structured SPMD data parallelism. We reveal the resource scaling effects on MPP performance along the orthogonal dimensions of machine size, processor speed, memory capacity, messaging latency, and network bandwidth. We show how to achieve balanced resource scaling against enlarged workloads (problem sizes). Among the three commercial MPPs, the IBM SP2 shows the highest speed and efficiency, attributed to its well-designed network with middleware support for a single system image. The Cray T3D demonstrates a high network bandwidth with a good NUMA memory hierarchy. The Intel Paragon trails far behind due to its slow processors and the excessive latency experienced in passing messages. Our analysis projects the lowest STAP speed on the ASCI Red, compared with the projected speed of the two ASCI Blue machines. This is attributed to the slow processors used in the ASCI Red and the mismatch between its hardware and software. The Blue Pacific shows the highest potential to deliver scalable performance up to thousands of nodes. The Blue Mountain is designed to have the highest network bandwidth. Our results suggest a limit on the scalability of the distributed shared-memory (DSM) architecture adopted in the Blue Mountain. The scaling model offers a quantitative method to match resource scaling with problem scaling to yield truly scalable performance. The model helps MPP designers optimize the processors, memory, network, and I/O subsystems of an MPP. For MPP users, the scaling results can be applied to partition a large workload for SPMD execution or to minimize the software overhead in collective communication or remote memory update operations. Finally, our scaling model is assessed to evaluate MPPs with benchmarks other than STAP.
- Published
- 1999
28. A Parallel Circuit-Partitioned Algorithm for Timing-Driven Standard Cell Placement
- Author
-
Prithviraj Banerjee and John A. Chandy
- Subjects
Standard cell, Very-large-scale integration, Computer Networks and Communications, Computer science, Circuit design, Parallel algorithm, Parallel computing, Series and parallel circuits, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Simulated annealing, Hardware_INTEGRATEDCIRCUITS, Placement, Algorithm, Software, Intel Paragon
- Abstract
Simulated annealing based standard cell placement for VLSI designs has long been acknowledged as a computation-intensive process, and as a result, several research efforts have been undertaken to parallelize this algorithm. Parallel placement is most needed for very large circuits. Since these circuits do not fit in memory, the traditional approach has been to partition and place individual modules. This causes a degradation in placement quality in terms of area and wirelength. Our algorithm is circuit-partitioned and can handle arbitrarily large circuits on cluster-of-workstations-type parallel machines, such as the Intel Paragon and IBM SP-2. Most previous work in parallel placement has minimized just area and wirelength, but with current deep submicron designs, minimizing wirelength delay is most important. As a result the algorithm discussed in this paper also supports timing-driven placement for partitioned circuits. The algorithm, called mpiPLACE, has been tested on several large industry benchmarks on a variety of parallel architectures.
- Published
- 1999
29. BLOCK-JACOBI SVD ALGORITHMS FOR DISTRIBUTED MEMORY SYSTEMS I: HYPERCUBES AND RINGS*
- Author
-
Martin Bečka and Marian Vajtersic
- Subjects
General Computer Science, Parallel algorithm, Jacobi method, Block matrix, Parallel computing, MIMD, symbols.namesake, Singular value decomposition, symbols, Hypercube, Algorithm, Mathematics, Block (data storage), Intel Paragon
- Abstract
The paper presents parallel algorithms for efficient solution of the Singular Value Decomposition (SVD) problem by the block two-sided Jacobi method. In this part of the work, we show how the method may be used on MIMD computers with hypercube and ring topologies. We analyse three types of orderings for solving the SVD on block-structured submatrices from the point of view of communication requirements and suitability for parallel execution of the computational process. The algorithms map well onto the hypercube topology. Two of the ordering schemes can also be directly implemented on rings. Results obtained on an Intel Paragon are shown and discussed for all three types of orderings.
- Published
- 1999
30. [Untitled]
- Author
-
Arindam Chatterjee and Soon M. Chung
- Subjects
Hash join, Distributive property, Parallel processing (DSP implementation), Hardware and Architecture, Computer science, Join (sigma algebra), Parallel computing, Algorithm, Software, Information Systems, Theoretical Computer Science, Range (computer programming), Intel Paragon
- Abstract
In this paper, we analyze the performance of the parallel Distributive Join algorithm that we proposed in Chung and Yang (1995). We implemented the algorithm on an Intel Paragon machine and analyzed the effect of the number of processors and the join selectivity on the performance of the algorithm. We also compared the performance of the Distributive Join (DJ) algorithm with that of the Hybrid-Hash (HH) join algorithm. Our results show that the DJ performs comparably with the HH over the entire range of processor counts used and join selectivities tested. A big advantage of the parallel DJ algorithm over the HH join algorithm is that it can easily support non-equijoin operations. The results can also be used to estimate the performance of file-I/O-intensive applications to be implemented on the Intel Paragon machine.
- Published
- 1999
31. [Untitled]
- Author
-
Eleanor Chu and Dianqin Wang
- Subjects
Toroid, Fortran, Computer science, Parallel algorithm, Multiprocessing, Torus, Parallel computing, Load balancing (computing), Network topology, Theoretical Computer Science, Intel iPSC, Hardware and Architecture, Hypercube, Computer Science::Operating Systems, computer, Software, Information Systems, computer.programming_language, Intel Paragon
- Abstract
In this article, we study the effects of network topology and load balancing on the performance of a new parallel algorithm for solving triangular systems of linear equations on distributed-memory message-passing multiprocessors. The proposed algorithm employs novel runtime data mapping and workload redistribution methods on a communication network which is configured as a toroidal mesh. A fully parameterized theoretical model is used to predict communication behaviors of the proposed algorithm relevant to load balancing, and the analytical performance results correctly determine the optimal dimensions of the toroidal mesh, which vary with the problem size, the number of available processors, and the hardware parameters of the machine. Further enhancement to the proposed algorithm is then achieved through redistributing the arithmetic workload at runtime. Our FORTRAN implementation of the proposed algorithm as well as its enhanced version has been tested on an Intel iPSC/2 hypercube, and the same code is also suitable for executing the algorithm on the iPSC/860 hypercube and the Intel Paragon mesh multiprocessor. The actual timing results support our theoretical findings, and they both confirm the very significant impact a network topology chosen at runtime can have on the computational load distribution, the communication behaviors and the overall performance of parallel algorithms.
- Published
- 1999
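Note: the numerical core being parallelized in the entry above is ordinary triangular substitution. A serial C sketch of forward substitution for a lower-triangular system Lx = b is given below for reference; the paper's actual contribution, the runtime data mapping and workload redistribution on a toroidal mesh, is not shown.

    /* Serial forward substitution for a lower-triangular system Lx = b.
     * The example matrix and right-hand side are invented so that the
     * exact solution is x = (1, 2, 3). */
    #include <stdio.h>

    #define N 3

    static void forward_subst(const double L[N][N], const double b[N],
                              double x[N])
    {
        for (int i = 0; i < N; i++) {
            double s = b[i];
            for (int j = 0; j < i; j++)
                s -= L[i][j] * x[j];   /* subtract already-solved terms */
            x[i] = s / L[i][i];
        }
    }

    int main(void)
    {
        double L[N][N] = {{2, 0, 0}, {1, 3, 0}, {4, 5, 6}};
        double b[N] = {2, 7, 32}, x[N];
        forward_subst(L, b, x);
        printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);   /* expect 1, 2, 3 */
        return 0;
    }

The row-by-row dependence (x[i] needs all earlier x[j]) is exactly why load balancing and network topology matter so much in the parallel version studied above.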
32. [Untitled]
- Author
-
Howard Jay Siegel, Kevin J. Webb, and Muthucumaru Maheswaran
- Subjects
Computer science, Computation, Linear system, Krylov subspace, Theoretical Computer Science, MIMD, Matrix (mathematics), Dimension (vector space), Hardware and Architecture, ComputingMethodologies_SYMBOLICANDALGEBRAICMANIPULATION, Overhead (computing), Algorithm, Software, Information Systems, Intel Paragon
- Abstract
The conjugate gradient squared (CGS) algorithm is a Krylov subspace algorithm that can be used to obtain fast solutions for linear systems (Ax=b) with complex nonsymmetric, very large, and very sparse coefficient matrices (A). By considering electromagnetic scattering problems as examples, a study of the performance and scalability of this algorithm on two MIMD machines is presented. A modified CGS (MCGS) algorithm, in which the synchronization overhead is effectively reduced by a factor of two, is proposed in this paper. This is achieved by changing the computation sequence in the CGS algorithm. Both experimental and theoretical analyses are performed to investigate the impact of this modification on the overall execution time. The theoretical and experimental analyses show that CGS is faster than MCGS for a smaller number of processors, while MCGS outperforms CGS as the number of processors increases. Based on this observation, a set-of-algorithms approach is proposed, in which either CGS or MCGS is selected depending on the dimension of the A matrix (N) and the number of processors (P). The set approach provides an algorithm that is more scalable than either the CGS or MCGS algorithm alone. The experiments performed on a 128-processor mesh Intel Paragon and on a 16-processor IBM SP2 with a multistage network indicate that MCGS is approximately 20% faster than CGS.
- Published
- 1999
33. [Untitled]
- Author
-
Ting-Chuen Pong, Ishfaq Ahmad, and Hoi-Man Yip
- Subjects
Speedup, Data parallelism, Computer science, Parallel algorithm, Image processing, Parallel computing, Filter (signal processing), Theoretical Computer Science, Convolution, Gaussian filter, symbols.namesake, MIMD, Hardware and Architecture, symbols, Partial derivative, Algorithm, Software, Information Systems, Intel Paragon
- Abstract
In this paper, we propose a parallel convolution algorithm for estimating the partial derivatives of 2D and 3D images on distributed-memory MIMD architectures. Exploiting the separability of the Gaussian filter, the proposed algorithm consists of multiple phases such that each phase corresponds to a separated filter. Furthermore, it exploits both task and data parallelism, and reduces communication through data redistribution. We have implemented the proposed algorithm on the Intel Paragon and obtained a substantial speedup using more than 100 processors. The performance of the algorithm is also evaluated analytically. The analytical results, which agree with the experimental results, indicate that the proposed algorithm scales very well with the problem size and the number of processors. We have also applied our algorithm to the design and implementation of an efficient parallel scheme for the 3D surface tracking process. Although our focus is on 3D image data, the algorithm is also applicable to 2D image data, and can be useful for a myriad of important applications including medical imaging, magnetic resonance imaging, ultrasonic imagery, scientific visualization, and image sequence analysis.
- Published
- 1999
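Note: the multi-phase structure above rests on the separability of the Gaussian: a 2D convolution factors into a row pass followed by a column pass, cutting the work from O(WHK^2) to O(WHK) for a K-tap kernel. The C sketch below demonstrates the factorization with a 3-tap binomial (1-2-1) approximation to a Gaussian and clamped borders; both choices are illustrative assumptions, not the paper's filters.

    /* Separable 2D filtering: one horizontal and one vertical 1D pass
     * reproduce the 2D convolution with the outer-product kernel. */
    #include <stdio.h>

    #define W 5
    #define H 5

    static const double k1d[3] = {0.25, 0.5, 0.25};   /* binomial ~ Gaussian */

    static int clampi(int v, int lo, int hi)
    {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    static void pass(const double src[H][W], double dst[H][W], int horiz)
    {
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                double s = 0;
                for (int t = -1; t <= 1; t++) {
                    int xx = horiz ? clampi(x + t, 0, W - 1) : x;
                    int yy = horiz ? y : clampi(y + t, 0, H - 1);
                    s += k1d[t + 1] * src[yy][xx];
                }
                dst[y][x] = s;
            }
    }

    int main(void)
    {
        double img[H][W] = {0}, tmp[H][W], out[H][W];
        img[2][2] = 1.0;        /* impulse input: output shows the kernel */

        pass(img, tmp, 1);      /* separated horizontal phase */
        pass(tmp, out, 0);      /* separated vertical phase   */

        for (int y = 0; y < H; y++, puts(""))
            for (int x = 0; x < W; x++)
                printf("%6.3f ", out[y][x]);
        return 0;
    }

In the parallel algorithm above, each separated pass becomes one phase, and the data redistribution between phases is what the paper optimizes.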
34. [Untitled]
- Author
-
Howard Jay Siegel, Min Tan, and Janet M. Siegel
- Subjects
MIMD, Parallel processing (DSP implementation), Computer science, Parallel algorithm, Video processing, SIMD, Parallel computing, SPMD, Motion vector, Software, Information Systems, Theoretical Computer Science, Intel Paragon
- Abstract
Parallel algorithms, based on a distributed-memory machine model, for an exhaustive search technique for motion vector estimation in video compression are being designed and evaluated. Results from the execution on a 16,384-processor MasPar MP-1 (an SIMD machine), a 140-node Intel Paragon XP/S and a 16-node IBM SP2 (two MIMD machines), and the 16-processor PASM prototype (a partitionable SIMD/MIMD mixed-mode machine) are presented. The trade-offs of using different modes of parallelism (SIMD, SPMD, and mixed-mode) and different data partitioning schemes (the rectangular and stripe subimage methods) are examined. The analytical and experimental results shown in this application study will help practitioners to predict and contrast the performance of different approaches to parallel implementation of this important video compression technique. The results presented are also applicable to a large class of image and video processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping a set of independent application tasks, or the subtasks of a single application task, onto a heterogeneous suite of parallel machines.
- Published
- 1999
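Note: the exhaustive search technique parallelized above is standard full-search block matching. A serial C reference is sketched below: for one block, every candidate displacement in a ±R window is scored by the sum of absolute differences (SAD) and the minimizer becomes the motion vector. Frame, block, and search-range sizes are arbitrary illustrative choices.

    /* Full-search block matching: the exhaustive motion estimation the
     * entry above parallelizes. The synthetic frames shift by (2,1), so
     * the recovered vector should be (2,1) with SAD 0. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <limits.h>

    #define W 32
    #define H 32
    #define B 8          /* block size       */
    #define R 4          /* search range +/- */

    static unsigned char cur[H][W], ref[H][W];

    /* SAD between the block at (bx,by) in cur and (bx+dx,by+dy) in ref */
    static long sad(int bx, int by, int dx, int dy)
    {
        long s = 0;
        for (int y = 0; y < B; y++)
            for (int x = 0; x < B; x++)
                s += labs((long)cur[by + y][bx + x] -
                          (long)ref[by + dy + y][bx + dx + x]);
        return s;
    }

    int main(void)
    {
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                cur[y][x] = (unsigned char)((x * 7 + y * 13) & 0xff);
                ref[y][x] = (unsigned char)(((x - 2) * 7 + (y - 1) * 13) & 0xff);
            }

        int bx = 8, by = 8, best_dx = 0, best_dy = 0;
        long best = LONG_MAX;
        for (int dy = -R; dy <= R; dy++)        /* every candidate is tried: */
            for (int dx = -R; dx <= R; dx++) {  /* that is what "exhaustive" */
                long s = sad(bx, by, dx, dy);   /* search means              */
                if (s < best) { best = s; best_dx = dx; best_dy = dy; }
            }
        printf("motion vector (%d,%d), SAD %ld\n", best_dx, best_dy, best);
        return 0;
    }

The rectangular and stripe subimage partitionings studied above differ only in how blocks like this one are assigned to processors.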
35. [Untitled]
- Author
-
Krishna R. Pattipati, Robert L. Popp, and Yaakov Bar-Shalom
- Subjects
MIMD, Shared memory, Computer science, Distributed computing, Interface (computing), Theory of computation, General Decision Sciences, Parallel computing, Management Science and Operations Research, SPMD, Assignment problem, Bottleneck, Intel Paragon
- Abstract
To date, there has been a lack of efficient and practical distributed- and shared-memory parallelizations of the data association problem for multitarget tracking. Filling this gap is one of the primary focuses of the present work. We begin by describing our data association algorithm in terms of an Interacting Multiple Model (IMM) state estimator embedded into an optimization framework, namely, a two-dimensional (2D) assignment problem (i.e., weighted bipartite matching). Contrary to conventional wisdom, we show that the data association (or optimization) problem is not the major computational bottleneck; instead, the interface to the optimization problem, namely, computing the rather numerous gating tests and IMM state estimates, covariance calculations, and likelihood function evaluations (used as cost coefficients in the 2D assignment problem), is the primary source of the workload. Hence, for both a general-purpose shared-memory MIMD (Multiple Instruction Multiple Data) multiprocessor system and a distributed-memory Intel Paragon high-performance computer, we developed parallelizations of the data association problem that focus on the interface problem. For the former, a coarse-grained dynamic parallelization was developed that realizes excellent performance (i.e., superlinear speedups) independent of numerous factors influencing problem size (e.g., many models in the IMM, dense/cluttered environments, contentious target-measurement data, etc.). For the latter, an SPMD (Single Program Multiple Data) parallelization was developed that realizes near-linear speedups using relatively simple dynamic task allocation algorithms. Using a real measurement database based on two FAA air traffic control radars, we show that the parallelizations developed in this work offer great promise in practice.
- Published
- 1999
36. Design and implementation of a parallel cellular language for MIMD architectures
- Author
-
Stéphane Vialle, Thierry Cornu, and Yannick Lallement
- Subjects
ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, General Computer Science, Artificial neural network, Workstation, Computer science, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, law.invention, Constructed language, Parallel language, MIMD, Computer architecture, law, Parallel programming model, Applications of artificial intelligence, Intel Paragon
- Abstract
In this paper we present a new language, called ParCeL-1, intended to ease the implementation of computation-intensive and artificial intelligence applications on highly parallel MIMD architectures. The computational model of ParCeL-1 is object-oriented and synchronous, based on virtual processors called cells that compute and communicate alternately. The current prototype is implemented on several multi-processor systems (Cray T3D, Intel Paragon, Telmat T-Node, workstation networks using PVM, and SGI Power Challenge Array). Several applications written in ParCeL-1 are described, together with their performance obtained on a 256-processor Cray T3D.
- Published
- 1998
37. A Comparison of Logical and Physical Parallel I/o pAtterns
- Author
-
Daniel A. Reed and Huseyin Simitci
- Subjects
020203 distributed computing, SCSI, Computer science, 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, computer.software_genre, 01 natural sciences, Parallel I/O, Theoretical Computer Science, Hardware and Architecture, Logical conjunction, 0202 electrical engineering, electronic engineering, information engineering, Key (cryptography), Operating system, 0101 mathematics, computer, Software, Intel Paragon
- Abstract
Although there are several extant studies of parallel scientific application request patterns, there is little experimental data on the correlation of physical I/O patterns with application I/O stimuli. To understand these correlations, the authors have instrumented the SCSI device drivers of the Intel Paragon OSF/1 operating system to record key physical I/O activities, and have correlated this data with the I/O patterns of scientific applications captured via the Pablo analysis toolkit. This analysis shows that disk hardware features profoundly affect the distribution of request delays and that current parallel file systems respond to parallel application I/O patterns in nonscalable ways.
- Published
- 1998
38. Performance analysis and optimization on a parallel atmospheric general circulation model code
- Author
-
John Z. Lou and John D. Farrara
- Subjects
Computer science, Fast Fourier transform, General Engineering, Parallel computing, Load balancing (computing), computer.software_genre, Convolution, Reduction (complexity), Grid computing, Scalability, Code (cryptography), Concurrent computing, Distributed memory, computer, Massively parallel, Intel Paragon
- Abstract
An analysis is presented of the primary factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on distributed-memory, massively parallel computer systems. Several modifications to the original parallel AGCM code aimed at improving its numerical efficiency, load balance, and single-node code performance are discussed. The impact of these optimization strategies on the performance on two state-of-the-art parallel computers, the Intel Paragon and Cray T3D, is presented and analyzed. It is found that implementation of a load-balanced FFT algorithm results in a reduction in overall execution time of approximately 45% compared to the original convolution-based algorithm. Preliminary results of the application of a load-balancing scheme for the physics part of the AGCM code suggest additional reductions in execution time of 15-20% can be achieved. Finally, several strategies for improving the single-node performance of the code are presented, and the results obtained thus far suggest reductions in execution time in the range of 30-40% are possible.
- Published
- 1998
39. Using high performance Fortran for parallel programming
- Author
-
Thomas Zacharia, D. Miles, and Gorti B. Sarma
- Subjects
Deformation process simulation, Finite element method, Syntax (programming languages), Fortran, Computer science, Parallel program, Polycrystal plasticity, Parallel computing, Solver, Supercomputer, Computational science, Computational Mathematics, Computational Theory and Mathematics, Modelling and Simulation, Modeling and Simulation, Conjugate gradient method, High Performance Fortran, computer, Massively parallel, computer.programming_language, Intel Paragon
- Abstract
A finite element code with a polycrystal plasticity model for simulating deformation processing of metals has been developed for parallel computers using High Performance Fortran (HPF). The conversion of the code from an original implementation on the Connection Machine systems using CM Fortran is described. The sections of the code requiring minimal inter-processor communication are easily parallelized, by changing only the syntax for specifying data layout (the distribution idea is sketched after this entry). However, the solver routine based on the conjugate gradient method required additional modifications, which are discussed in detail. The performance of the code on a massively parallel distributed-memory Intel Paragon supercomputer is evaluated through timing statistics. Published by Elsevier Science Ltd.
- Published
- 1998
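The easy part of the HPF port in entry 39 rests on one idea: an elementwise loop needs no communication once the data layout is fixed by a directive such as DISTRIBUTE (BLOCK). The Python sketch below is a hedged illustration of what that BLOCK layout means (block ownership plus the owner-computes rule); the function names and sizes are invented for illustration and are not HPF syntax.

```python
# Illustrative model of HPF's BLOCK distribution: each of p processors
# owns one contiguous block of the array and updates only what it owns
# (the "owner computes" rule), so an elementwise loop needs no
# communication at all.
import numpy as np

def block_bounds(n, p, rank):
    """Return the [lo, hi) index range owned by `rank` out of `p`."""
    size = (n + p - 1) // p          # ceiling division, as in BLOCK layout
    lo = min(rank * size, n)
    hi = min(lo + size, n)
    return lo, hi

n, p = 10, 4
x = np.arange(float(n))
y = np.zeros(n)

for rank in range(p):                # each "processor" touches its block only
    lo, hi = block_bounds(n, p, rank)
    y[lo:hi] = 2.0 * x[lo:hi]

print(y)                             # same result as the serial loop
```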
40. High-level programming of massively parallel computers based on shared virtual memory
- Author
-
Michael Gerndt
- Subjects
Global address space ,Computer Networks and Communications ,Computer science ,Distributed computing ,Overlay ,Theoretical Computer Science ,Memory address ,Artificial Intelligence ,Reactive programming ,Partitioned global address space ,Data diffusion machine ,Massively parallel ,Intel Paragon ,Distributed shared memory ,Message passing ,Uniform memory access ,Computer Graphics and Computer-Aided Design ,Memory map ,Inductive programming ,Extended memory ,Computer architecture ,Shared memory ,Hardware and Architecture ,High-level programming language ,Parallel programming model ,Virtual memory ,Scalability ,Programming paradigm ,Distributed memory ,Software ,Memory protection - Abstract
Highly parallel machines needed to solve compute-intensive scientific applications are based on the distribution of physical memory across the compute nodes. The drawback of such systems is the necessity to write applications in the message passing programming model. Therefore, much research addresses higher-level programming models and the supporting hardware, operating system techniques, and languages. The research direction outlined in this article is based on shared virtual memory systems, i.e., scalable parallel systems with a global address space which support an adaptive mapping of global addresses to physical memories (a sketch of this mapping follows this entry). We introduce programming concepts and program optimizations for SVM systems in the context of the SVM-Fortran programming environment, which is based on a shared virtual memory system implemented on the Intel Paragon. The performance results for real applications show that this environment enables users to obtain performance similar to or better than that of programs written in HPF.
- Published
- 1998
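As a rough picture of the shared-virtual-memory model in entry 40, the sketch below maps one global address space onto node memories page by page. The page size and the first-touch, round-robin home assignment are assumptions made for illustration; the abstract does not specify the actual SVM policies.

```python
# Toy shared-virtual-memory mapping: a global address space whose pages
# acquire a home node adaptively on first touch. Policies here are
# illustrative assumptions, not the SVM-Fortran system's actual ones.
PAGE_SIZE = 4096

class SVMSpace:
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self.home = {}                      # page number -> owning node

    def node_of(self, global_addr):
        """Map a global address to the node holding its page."""
        page = global_addr // PAGE_SIZE
        if page not in self.home:           # adaptive first-touch placement
            self.home[page] = page % self.n_nodes
        return self.home[page]

svm = SVMSpace(n_nodes=8)
for addr in (0, 5000, 123456):
    print(f"address {addr} -> page {addr // PAGE_SIZE}, "
          f"node {svm.node_of(addr)}")
```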
41. Compilation techniques for out-of-core parallel computations
- Author
-
Alok Choudhary, J. Ramanujam, Mahmut Kandemir, and Rajesh Bordawekar
- Subjects
Input/output ,Computer Networks and Communications ,Data parallelism ,Computer science ,Locality ,Parallel computing ,computer.software_genre ,Computer Graphics and Computer-Aided Design ,Theoretical Computer Science ,Artificial Intelligence ,Hardware and Architecture ,Virtual memory ,Out-of-core algorithm ,Compiler ,Disk storage ,computer ,Software ,Abstraction (linguistics) ,Intel Paragon - Abstract
The difficulty of handling out-of-core data limits the performance of supercomputers as well as the potential of parallel machines. Since writing an efficient out-of-core version of a program is a difficult task and virtual memory systems do not perform well on scientific computations, we believe that there is a clear need for a compiler-directed explicit I/O approach for out-of-core computations. In this paper, we first present an out-of-core compilation strategy based on a disk storage abstraction. Then, we offer a compiler algorithm to optimize the locality of disk accesses in out-of-core codes by choosing a good combination of file layouts on disks and loop transformations (a sketch of the tiling idea follows this entry). We introduce the memory coefficient and processor coefficient concepts to characterize the behavior of out-of-core programs under different memory constraints. We also enhance our algorithm to handle data-parallel programs which contain multiple loop nests. Our initial experimental results obtained on the IBM SP-2 and Intel Paragon provide encouraging evidence that our approach is successful at optimizing programs which depend on disk-resident data in distributed-memory machines.
- Published
- 1998
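The compiler-directed explicit I/O of entry 41 amounts to tiling loops over disk-resident arrays so that only one in-core tile is live at a time; the tile size relative to the data size plays the role of the paper's memory coefficient. The Python sketch below shows the read-compute-write pattern such a compiler would emit, with numpy's memmap standing in for generated file I/O and all sizes chosen for illustration.

```python
# Out-of-core tiled loop: read one tile, compute on it in core, write
# it back. Only `tile` elements are resident at any moment even though
# the array itself lives on disk.
import numpy as np, os, tempfile

n, tile = 1_000_000, 100_000                 # data size, in-core tile size
path = os.path.join(tempfile.mkdtemp(), "a.dat")
np.arange(n, dtype=np.float64).tofile(path)  # create the disk-resident array

a = np.memmap(path, dtype=np.float64, mode="r+")
for lo in range(0, n, tile):
    hi = min(lo + tile, n)
    chunk = np.array(a[lo:hi])               # explicit read of one tile
    chunk = np.sqrt(chunk) + 1.0             # in-core computation
    a[lo:hi] = chunk                         # explicit write-back
a.flush()
print(a[:5])
```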
42. Performance modeling for SPMD message-passing programs
- Author
-
Patrick H. Worley, Manish Madhukar, and Jürgen Brehm
- Subjects
Computer science ,Message passing ,Parallel programming model ,General Engineering ,Parallel algorithm ,Performance prediction ,Parallel computing ,Execution time ,SPMD ,Massively parallel ,Intel Paragon - Abstract
Today's massively parallel machines are typically message-passing systems consisting of hundreds or thousands of processors. Implementing parallel applications efficiently in this environment is a challenging task, and poor parallel design decisions can be expensive to correct. Tools and techniques that allow the fast and accurate evaluation of different parallelization strategies would significantly improve the productivity of application developers and increase throughput on parallel architectures. This paper investigates one of the major issues in building tools to compare parallelization strategies: determining what type of performance models of the application code and of the computer system are sufficient for a fast and accurate comparison of different strategies. The paper is built around a case study employing the performance prediction tool (PerPreT) to predict performance of the parallel spectral transform shallow water model code (PSTSWM) on the Intel Paragon. PSTSWM is a parallel application code that was designed to evaluate different parallel strategies for the spectral transform method as it is used in climate modeling and weather forecasting. Multiple parallel algorithms and algorithm variants are embedded in the code. PerPreT uses a relatively simple algebraic model to predict execution time for SPMD (single program multiple data) parallel applications. Applications are modeled through parameterized formulae for communication and computation, where the parameters include the problem size, the number of processors used to execute the program, and system characteristics (e.g. setup times for communication, link bandwidth and sustained computing performance per processor). In this paper we describe performance models that predict the performance of the different algorithms in PSTSWM accurately enough to allow them to be compared, establishing the feasibility of such a demanding application of performance modeling. We also discuss issues in generating and validating the performance models, emphasizing the practical importance of tools such as PerPreT in such studies. © 1998 John Wiley & Sons, Ltd.
- Published
- 1998
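Entry 42's PerPreT models an SPMD code through parameterized formulae for computation and communication. The sketch below is a hedged reconstruction of that style of model, T(N, P) = W(N)/(P*s) + m*(t_setup + b/B); the functional form is the generic one described in the abstract, and every numeric value is invented for illustration rather than a measured Paragon characteristic.

```python
# Generic SPMD performance model in the PerPreT spirit: computation
# time (work / aggregate sustained speed) plus communication time
# (messages x (setup + bytes / bandwidth)). All parameters illustrative.
def predict_time(N, P, flops_per_point, speed, n_msgs, bytes_per_msg,
                 t_setup, bandwidth):
    t_comp = N * flops_per_point / (P * speed)
    t_comm = n_msgs * (t_setup + bytes_per_msg / bandwidth)
    return t_comp + t_comm

for P in (16, 64, 256):
    t = predict_time(N=1_000_000, P=P, flops_per_point=500,
                     speed=50e6,                    # 50 Mflop/s sustained
                     n_msgs=4 * P,
                     bytes_per_msg=8 * 1_000_000 // P,
                     t_setup=100e-6, bandwidth=100e6)
    print(f"P = {P:4d}: predicted time {t:.3f} s")
```

A model in this form makes comparing parallelization strategies cheap: swapping strategies changes only the communication counts and message sizes, not the measurement apparatus.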
43. A study of application sensitivity to variation in message-passing latency and bandwidth
- Author
-
Allen C. Robinson, David R. Mackay, Edward J. Barragy, and Patrick H. Worley
- Subjects
ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION ,Computer science ,Message passing ,General Engineering ,Parallel algorithm ,Parallel computing ,Latency (engineering) ,Oak Ridge National Laboratory ,Intel Paragon - Abstract
This study measures the effects of changes in message latency and bandwidth for production-level codes on a current-generation tightly coupled MPP, the Intel Paragon. Messages are sent multiple times to study the application sensitivity to variations in bandwidth and latency (a sketch of this technique follows this entry). This method preserves the effects of contention on the interconnection network. Two applications are studied: PCTH, a shock physics code developed at Sandia National Laboratories, and PSTSWM, a spectral shallow water code developed at Oak Ridge National Laboratory and Argonne National Laboratory. These codes are significant in that PCTH is a "full physics" application code in production use, while PSTSWM serves as a parallel algorithm test bed and benchmark for production codes used in atmospheric modeling. They are also significant in that the message-passing behavior differs significantly between the two codes, each representing an important class of scientific message-passing applications.
- Published
- 1998
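The perturbation technique of entry 43 can be pictured as a wrapper around the native send: repeating a message inflates the communication cost while the extra traffic still flows through the real interconnect, preserving contention. In the hypothetical sketch below, raw_send stands in for the native primitive, and the split into latency and bandwidth factors is an assumption about how the resends might be apportioned, not a description of the authors' exact harness.

```python
# Wrap a send primitive so each logical message costs more: extra
# zero-length sends inflate mainly the latency term, extra full copies
# inflate mainly the bandwidth term. `raw_send` is a stand-in.
def make_perturbed_send(raw_send, latency_factor=1, bandwidth_factor=1):
    def send(dest, payload):
        for _ in range(latency_factor - 1):
            raw_send(dest, b"")            # zero-length: latency only
        for _ in range(bandwidth_factor - 1):
            raw_send(dest, payload)        # full copy: bandwidth too
        raw_send(dest, payload)            # the real message
    return send

# Demonstration with a counting stub in place of a real network.
sent = []
def raw_send(dest, payload):
    sent.append((dest, len(payload)))

send = make_perturbed_send(raw_send, latency_factor=3, bandwidth_factor=2)
send(dest=1, payload=b"x" * 1024)
print(sent)        # two empty messages, then the 1024-byte payload twice
```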
44. Front tracking in two and three dimensions
- Author
-
F. M. Tangerman, D. Tan, Xiaolin Li, James Glimm, T.M. Smith, M.J. Graham, John W. Grove, and Qiang Zhang
- Subjects
Physics ,Deposition and etching processes ,Interface (computing) ,010103 numerical & computational mathematics ,Numerical diffusion ,Grid ,Chip ,Tracking (particle physics) ,01 natural sciences ,Riemann problems ,Computational science ,010101 applied mathematics ,Computational Mathematics ,MIMD ,Richtmyer-Meshkov instability ,Computational Theory and Mathematics ,Modelling and Simulation ,Modeling and Simulation ,Node (circuits) ,0101 mathematics ,Front tracking ,Simulation ,Intel Paragon - Abstract
Front tracking is a method for solving conservation laws in which the evolution of discontinuities is determined through the solution of Riemann problems. This method often does not require highly refined grids, and it has no numerical diffusion. We show the success of this method through a comparison of simulations of the Richtmyer-Meshkov instability, an unstable material interface, with experimental data. Good simulations of such instabilities are notoriously difficult, and we also demonstrate for the same physical problem that grid orientations have no effect on the numerical solution. We also present the first results of our three-dimensional front tracking code by simulating an important aspect of the computer chip manufacturing process: material deposition and etching. Our two- and three-dimensional front tracking code is parallelized for MIMD architectures and runs on our 128-node Intel Paragon.
- Published
- 1998
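As a one-dimensional cartoon of the front tracking described in entry 44, the sketch below advances a single discontinuity of the inviscid Burgers equation u_t + (u^2/2)_x = 0 at the speed given by its Riemann problem, the Rankine-Hugoniot condition s = (uL + uR)/2. Because the front is represented explicitly, it stays perfectly sharp: there is no grid to smear it and hence no numerical diffusion. This toy is an illustration, not the authors' multidimensional MIMD code.

```python
# Track a Burgers shock explicitly: the discontinuity position is a
# degree of freedom advanced at the Riemann-problem shock speed.
uL, uR = 2.0, 0.0            # left/right states; uL > uR gives a shock
x_front, t, dt = 0.0, 0.0, 0.01

s = 0.5 * (uL + uR)          # Rankine-Hugoniot speed for Burgers flux u^2/2
for _ in range(100):
    x_front += s * dt        # evolve the tracked front
    t += dt

def u(x):                    # solution stays exactly sharp at the front
    return uL if x < x_front else uR

print(f"t = {t:.2f}, front at x = {x_front:.2f}, u(0.9) = {u(0.9)}")
```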
45. Classical molecular simulations of complex, industrially-important systems on the Intel Paragon
- Author
-
R.K. Bhupathiraju, Peter T. Cummings, S.A. Gupta, S.T. Cui, Philip F. LoCascio, and Henry D. Cochran
- Subjects
Canonical ensemble ,chemistry.chemical_classification ,Monte Carlo method ,Non-equilibrium thermodynamics ,Domain decomposition methods ,Polymer ,Molecular dynamics ,Supercomputer ,Condensed Matter::Soft Condensed Matter ,Computational Mathematics ,Computational Theory and Mathematics ,chemistry ,Replicated data ,Modelling and Simulation ,Modeling and Simulation ,Domain decomposition ,Statistical physics ,Distributed memory supercomputer ,Mathematics ,Intel Paragon - Abstract
Advances in parallel supercomputing now make possible molecular-based engineering and science calculations that will soon revolutionize many technologies, such as those involving polymers and those involving aqueous electrolytes. We have developed a suite of message-passing codes for classical molecular simulation of these industrially-important systems on the Intel Paragon and summarize some of the results. Nonequilibrium, multiple time step molecular dynamics lets us investigate the rheology of molecular fluids. Chain molecule Monte Carlo simulations in the Gibbs ensemble permit calculation of phase equilibrium of long-chain molecular systems. Complementary equilibrium molecular dynamics yields fundamental insight into the technologically-important problem of liquid-liquid phase separation in polymer blends. Parallel codes for quaternion dynamics using techniques for handling long-range Coulombic forces allow study of ion pairing in supercritical aqueous electrolyte solutions.
- Published
- 1998
46. The first principles O[N] LSMS method and its applications to magnetic structure of alloys
- Author
-
G. M. Stocks, Don M. Nicholson, Yang Wang, and William A. Shelton
- Subjects
Parallel computing ,Magnetic structure ,Condensed matter physics ,Magnetic moment ,Scattering ,Oak Ridge National Laboratory ,Space (mathematics) ,Computational Mathematics ,Computational Theory and Mathematics ,Modeling and Simulation ,Modelling and Simulation ,First principles method ,Magnetic moments ,Alloys ,Multiple scattering theory ,Ground state ,Mathematics ,Intel Paragon - Abstract
We present an overview of the locally self-consistent multiple scattering (LSMS) method. The method is based on real space multiple scattering theory, is naturally highly parallel, and has been implemented on Intel Paragon parallel platforms within the Center for Computational Sciences at Oak Ridge National Laboratory. O[N] scaling is demonstrated for unit cells as large as 1000 atoms. The LSMS method can be extended to treat noncollinear magnetic states of materials. We applied the method to calculating the ground state magnetic structure of Fe0.65Ni0.35 alloys. The result indicates the possible existence of noncollinear arrangements of magnetic moments in these alloys.
- Published
- 1998
- Full Text
- View/download PDF
47. Key concepts for parallel out-of-core LU factorization
- Author
-
Jack Dongarra, David W. Walker, and Sven Hammarling
- Subjects
Parallel computing ,Computer Networks and Communications ,Computer science ,Performance ,Out-of-core LU factorization ,Theoretical Computer Science ,law.invention ,Artificial Intelligence ,Simple (abstract algebra) ,law ,Modelling and Simulation ,Performance model ,Intel Paragon ,ScaLAPACK ,Dense matrices ,Incomplete LU factorization ,Computer Graphics and Computer-Aided Design ,Parallel I/O ,LU decomposition ,Computational Mathematics ,Computational Theory and Mathematics ,Hardware and Architecture ,Modeling and Simulation ,Key (cryptography) ,Out-of-core algorithm ,Software - Abstract
This paper considers key ideas in the design of out-of-core dense LU factorization routines. A left-looking variant of the LU factorization algorithm is shown to require less I/O to disk than the right-looking variant, and is used to develop a parallel, out-of-core implementation. This implementation makes use of a small library of parallel I/O routines, together with ScaLAPACK and PBLAS routines. Results for runs on an Intel Paragon are presented and interpreted using a simple performance model.
- Published
- 1998
- Full Text
- View/download PDF
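A serial toy of the left-looking organization analyzed in entry 47: each column panel is read, updated by every previously factored panel, factored, and written back exactly once, which is the property that keeps disk traffic below the right-looking variant's. In the hedged Python sketch below, a dict stands in for disk-resident panels, and pivoting is omitted for brevity (the paper's ScaLAPACK/PBLAS implementation is more general).

```python
import numpy as np

def lu_unblocked(P):
    """In-place LU without pivoting on an (m x nb) panel."""
    m, nb = P.shape
    for k in range(nb):
        P[k+1:, k] /= P[k, k]                          # column of L
        P[k+1:, k+1:] -= np.outer(P[k+1:, k], P[k, k+1:])

def left_looking_lu(disk, n, nb):
    for j in range(0, n, nb):
        P = disk[j].copy()                             # read panel j once
        for k in range(0, j, nb):                      # pending updates
            F = disk[k]                                # read factored panel
            Lkk = np.tril(F[k:k+nb, :], -1) + np.eye(nb)
            P[k:k+nb, :] = np.linalg.solve(Lkk, P[k:k+nb, :])   # U block
            P[k+nb:, :] -= F[k+nb:, :] @ P[k:k+nb, :]           # GEMM update
        lu_unblocked(P[j:, :])                         # factor diagonal part
        disk[j] = P                                    # write back once

# Check on a small, diagonally dominant matrix (safe without pivoting).
n, nb = 8, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)) + n * np.eye(n)
disk = {j: A[:, j:j+nb].copy() for j in range(0, n, nb)}
left_looking_lu(disk, n, nb)
F = np.hstack([disk[j] for j in range(0, n, nb)])
L, U = np.tril(F, -1) + np.eye(n), np.triu(F)
print(np.allclose(L @ U, A))                           # True
```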
48. Massively parallel finite volume computation of three-dimensional thermal convective flows
- Author
-
Ping Wang
- Subjects
Finite volume method ,ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION ,business.industry ,Computer science ,Computation ,General Engineering ,Domain decomposition methods ,Parallel computing ,Computational science ,Algebraic equation ,Multigrid method ,Software ,business ,Massively parallel ,Intel Paragon - Abstract
A parallel implementation of the finite volume method for three-dimensional, time-dependent, thermal convective flows is presented. The algebraic equations resulting from the finite volume discretization are solved by a parallel multigrid method. A flexible parallel code has been implemented on distributed-memory systems using domain decomposition techniques and the MPI communication software (a sketch of the decomposition idea follows this entry). The code uses one-, two- or three-dimensional partitions according to the geometry, and currently runs on the Intel Paragon, the Cray T3D and T3E, the IBM SP2 and Beowulf systems; it can be ported easily to other parallel systems. A comparison of the wallclock time of the code between these systems is made, and code performance with respect to different numbers of processors is presented.
- Published
- 1998
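Entry 48's domain decomposition can be illustrated with a serial toy: split the domain into subdomains that each carry one ghost cell per neighbor, refresh the ghosts every step (an MPI exchange in the real code, a plain copy here), then update each interior independently. The explicit 1-D diffusion below is an illustrative stand-in for the paper's three-dimensional thermal-convection operator.

```python
# Ghost-cell ("halo") exchange followed by purely local stencil updates.
import numpy as np

n, p, steps, alpha = 16, 4, 50, 0.25   # alpha <= 0.5 keeps this stable
local_n = n // p
# each subdomain: [left ghost, interior cells..., right ghost]
parts = [np.zeros(local_n + 2) for _ in range(p)]
parts[0][1] = 100.0                    # hot cell at the left end

for _ in range(steps):
    # halo exchange: copy neighbors' edge values into the ghost cells
    for r in range(p):
        parts[r][0]  = parts[r - 1][-2] if r > 0     else parts[r][1]
        parts[r][-1] = parts[r + 1][1]  if r < p - 1 else parts[r][-2]
    # local update; no remote data is needed once the ghosts are fresh
    for r in range(p):
        u = parts[r]
        u[1:-1] = u[1:-1] + alpha * (u[2:] - 2 * u[1:-1] + u[:-2])

print(np.concatenate([u[1:-1] for u in parts]).round(2))
```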
49. High performance computing issues for model reduction/expansion
- Author
-
D. J. Inman, Calvin J. Ribbens, and D. F. Pilkey
- Subjects
Fine-tuning ,Mathematical optimization ,Modal data ,Computer science ,General Engineering ,Model expansion ,Supercomputer ,Reduction (mathematics) ,Software ,Finite element method ,Beam (structure) ,Intel Paragon - Abstract
Several methods are investigated for finite element model expansion, with particular attention paid to the role of high performance computing for this problem in the area of structural dynamics. Model expansion is used in model updating, which is concerned with fine-tuning finite element models so that they agree with experimentally obtained modal data. Three popular methods are explored: static expansion, dynamic expansion, and the Improved Reduced System (a sketch of static expansion follows this entry). Beam and plate examples are used to investigate the performance of these methods on an Intel Paragon when faced with large problems and very limited measurement data.
- Published
- 1998
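Among the three methods named in entry 49, static (Guyan) expansion has the simplest closed form: partition the stiffness matrix into measured (master, m) and unmeasured (slave, s) degrees of freedom and recover the slaves as x_s = -K_ss^{-1} K_sm x_m, i.e. the slaves deflect as if loaded only through the masters. The sketch below applies this to a small spring-chain stiffness matrix invented for illustration.

```python
# Static (Guyan) expansion of a measured mode shape to all DOFs.
import numpy as np

# 4-DOF spring chain with unit stiffnesses, fixed at the left end
K = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])

m = [0, 3]                    # master DOFs (where sensors sit)
s = [1, 2]                    # slave DOFs to be expanded
Kss = K[np.ix_(s, s)]
Ksm = K[np.ix_(s, m)]

x_m = np.array([0.1, 0.4])                # "measured" mode-shape entries
x_s = -np.linalg.solve(Kss, Ksm @ x_m)    # x_s = -Kss^{-1} Ksm x_m

x = np.empty(4)
x[m], x[s] = x_m, x_s
print(x)                                  # [0.1 0.2 0.3 0.4]
```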
50. Efficient parallel multigrid based solvers for large scale groundwater flow simulations
- Author
-
G. Mahinthakumar and Faisal Saied
- Subjects
Discretization ,Preconditioner ,Computer science ,MathematicsofComputing_NUMERICALANALYSIS ,Krylov subspace ,Solver ,Supercomputer ,Partial differential equations ,Computer Science::Numerical Analysis ,Mathematics::Numerical Analysis ,Computational science ,Computational Mathematics ,Multigrid method ,Computational Theory and Mathematics ,Modelling and Simulation ,Modeling and Simulation ,Computer Science::Mathematical Software ,Multiprocessors ,Distributed memory ,Numerical methods ,Hydrology ,Intel Paragon - Abstract
In this paper, we present parallel solvers for large linear systems arising from the finite-element discretization of three-dimensional groundwater flow problems. We have tested our parallel implementations on the Intel Paragon XP/S 150 supercomputer using up to 1024 parallel processors. Our solvers are based on multigrid and Krylov subspace methods. Our goal is to combine powerful algorithms and current-generation high-performance computers to enhance the capabilities of computer models for groundwater modeling. We show that multigrid can be a scalable algorithm on distributed memory machines. We demonstrate the effectiveness of parallel multigrid based solvers by solving problems requiring more than 64 million nodes in less than a minute. Our results show that multigrid as a stand-alone solver works best for problems with smooth coefficients, but for rough coefficients it is best used as a preconditioner for a Krylov subspace method (a sketch of this combination follows this entry).
- Published
- 1998
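Entry 50's central finding, multigrid as a stand-alone solver for smooth coefficients but as a preconditioner for a Krylov method on rough ones, is sketched below with a serial two-grid cycle preconditioning conjugate gradients on a 1-D Poisson problem. The smoother, transfer operators, and sizes are textbook choices made for illustration, not the authors' 3-D parallel configuration.

```python
# Two-grid V-cycle (damped Jacobi smoothing, full-weighting restriction,
# linear interpolation, exact coarse solve) used as a CG preconditioner.
import numpy as np

def poisson(n):
    """Dense 1-D Poisson matrix on n interior points, h = 1/(n+1)."""
    h2 = (n + 1) ** 2
    return h2 * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))

def two_grid(A, r, omega=2/3, nu=2):
    """One symmetric two-grid cycle for A z = r with zero initial guess."""
    n, d = A.shape[0], np.diag(A)
    z = np.zeros(n)
    for _ in range(nu):                    # pre-smoothing
        z += omega * (r - A @ z) / d
    nc = (n - 1) // 2                      # coarse grid: every other point
    P = np.zeros((n, nc))                  # linear interpolation
    for j in range(nc):
        i = 2 * j + 1
        P[i - 1, j], P[i, j], P[i + 1, j] = 0.5, 1.0, 0.5
    R = 0.5 * P.T                          # full-weighting restriction
    Ac = R @ A @ P                         # Galerkin coarse operator
    z += P @ np.linalg.solve(Ac, R @ (r - A @ z))   # coarse correction
    for _ in range(nu):                    # post-smoothing (keeps symmetry)
        z += omega * (r - A @ z) / d
    return z

def pcg(A, b, tol=1e-8, maxit=200):
    x = np.zeros_like(b)
    r = b.copy()
    z = two_grid(A, r)
    p = z.copy()
    rz = r @ z
    for it in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it + 1
        z = two_grid(A, r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, maxit

A, b = poisson(63), np.ones(63)
x, iters = pcg(A, b)
print(f"converged in {iters} iterations, residual "
      f"{np.linalg.norm(b - A @ x):.2e}")
```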