19 results on '"Shahid H"'
Search Results
2. The Tera Multithreaded Architecture and Unstructured Meshes
- Author
-
Bokhari, Shahid H and Mavriplis, Dimitri J
- Subjects
Computer Operations And Hardware - Abstract
The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine), running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data issues that would be of paramount importance in other parallel architectures.
- Published
- 1998
3. Balancing Contention and Synchronization on the Intel Paragon
- Author
-
Bokhari, Shahid H and Nicol, David M
- Subjects
Computer Systems - Abstract
The Intel Paragon is a mesh-connected distributed memory parallel computer. It uses an oblivious and deterministic message routing algorithm: this permits us to develop highly optimized schedules for frequently needed communication patterns. The complete exchange is one such pattern. Several approaches are available for carrying it out on the mesh. We study an algorithm developed by Scott. This algorithm assumes that a communication link can carry one message at a time and that a node can only transmit one message at a time. It requires global synchronization to enforce a schedule of transmissions. Unfortunately global synchronization has substantial overhead on the Paragon. At the same time the powerful interconnection mechanism of this machine permits 2 or 3 messages to share a communication link with minor overhead. It can also overlap multiple message transmission from the same node to some extent. We develop a generalization of Scott's algorithm that executes complete exchange with a prescribed contention. Schedules that incur greater contention require fewer synchronization steps. This permits us to tradeoff contention against synchronization overhead. We describe the performance of this algorithm and compare it with Scott's original algorithm as well as with a naive algorithm that does not take interconnection structure into account. The Bounded contention algorithm is always better than Scott's algorithm and outperforms the naive algorithm for all but the smallest message sizes. The naive algorithm fails to work on meshes larger than 12 x 12. These results show that due consideration of processor interconnect and machine performance parameters is necessary to obtain peak performance from the Paragon and its successor mesh machines.
- Published
- 1996
4. Communication overhead on the Intel Paragon, IBM SP2 and Meiko CS-2
- Author
-
Bokhari, Shahid H
- Subjects
Computer Operations And Hardware - Abstract
Interprocessor communication overhead is a crucial measure of the power of parallel computing systems-its impact can severely limit the performance of parallel programs. This report presents measurements of communication overhead on three contemporary commercial multicomputer systems: the Intel Paragon, the IBM SP2 and the Meiko CS-2. In each case the time to communicate between processors is presented as a function of message length. The time for global synchronization and memory access is discussed. The performance of these machines in emulating hypercubes and executing random pairwise exchanges is also investigated. It is shown that the interprocessor communication time depends heavily on the specific communication pattern required. These observations contradict the commonly held belief that communication overhead on contemporary machines is independent of the placement of tasks on processors. The information presented in this report permits the evaluation of the efficiency of parallel algorithm implementations against standard baselines.
- Published
- 1995
5. Multiphase complete exchange on Paragon, SP2 and CS-2
- Author
-
Bokhari, Shahid H
- Subjects
Computer Operations And Hardware - Abstract
The overhead of interprocessor communication is a major factor in limiting the performance of parallel computer systems. The complete exchange is the severest communication pattern in that it requires each processor to send a distinct message to every other processor. This pattern is at the heart of many important parallel applications. On hypercubes, multiphase complete exchange has been developed and shown to provide optimal performance over varying message sizes. Most commercial multicomputer systems do not have a hypercube interconnect. However, they use special purpose hardware and dedicated communication processors to achieve very high performance communication and can be made to emulate the hypercube quite well. Multiphase complete exchange has been implemented on three contemporary parallel architectures: the Intel Paragon, IBM SP2 and Meiko CS-2. The essential features of these machines are described and their basic interprocessor communication overheads are discussed. The performance of multiphase complete exchange is evaluated on each machine. It is shown that the theoretical ideas developed for hypercubes are also applicable in practice to these machines and that multiphase complete exchange can lead to major savings in execution time over traditional solutions.
- Published
- 1995
6. The Linux operating system: An introduction
- Author
-
Bokhari, Shahid H
- Subjects
Computer Operations And Hardware - Abstract
Linux is a Unix-like operating system for Intel 386/486/Pentium based IBM-PCs and compatibles. The kernel of this operating system was written from scratch by Linus Torvalds and, although copyrighted by the author, may be freely distributed. A world-wide group has collaborated in developing Linux on the Internet. Linux can run the powerful set of compilers and programming tools of the Free Software Foundation, and XFree86, a port of the X Window System from MIT. Most capabilities associated with high performance workstations, such as networking, shared file systems, electronic mail, TeX, LaTeX, etc. are freely available for Linux. It can thus transform cheap IBM-PC compatible machines into Unix workstations with considerable capabilities. The author explains how Linux may be obtained, installed and networked. He also describes some interesting applications for Linux that are freely available. The enormous consumer market for IBM-PC compatible machines continually drives down prices of CPU chips, memory, hard disks, CDROMs, etc. Linux can convert such machines into powerful workstations that can be used for teaching, research and software development. For professionals who use Unix based workstations at work, Linux permits virtually identical working environments on their personal home machines. For cost conscious educational institutions Linux can create world-class computing environments from cheap, easily maintained, PC clones. Finally, for university students, it provides an essentially cost-free path away from DOS into the world of Unix and X Windows.
- Published
- 1995
7. Experience with parametric binary dissection
- Author
-
Bokhari, Shahid H
- Subjects
Cybernetics - Abstract
Parametric Binary Dissection (PBD) is a new algorithm that can be used for partitioning graphs embedded in 2- or 3-dimensional space. It partitions explicitly on the basis of nodes + (lambda)x(edges cut), where lambda is the ratio of time to communicate over an edge to the time to compute at a node. The new algorithm is faster than the original binary dissection algorithm and attempts to obtain better partitions than the older algorithm, which only takes nodes into account. The performance of parametric dissection with plain binary dissection on 3 large unstructured 3-d meshes obtained from computational fluid dynamics and on 2 random graphs were compared. It was showm that the new algorithm can usually yield partitions that are substantially superior, but that its performance is heavily dependent on the input data.
- Published
- 1993
8. Multiphase complete exchange: A theoretical analysis
- Author
-
Bokhari, Shahid H
- Subjects
Computer Programming And Software - Abstract
Complete Exchange requires each of N processors to send a unique message to each of the remaining N-1 processors. For a circuit switched hypercube with N = 2(sub d) processors, the Direct and Standard algorithms for Complete Exchange are optimal for very large and very small message sizes, respectively. For intermediate sizes, a hybrid Multiphase algorithm is better. This carries out Direct exchanges on a set of subcubes whose dimensions are a partition of the integer d. The best such algorithm for a given message size m could hitherto only be found by enumerating all partitions of d. The Multiphase algorithm is analyzed assuming a high performance communication network. It is proved that only algorithms corresponding to equipartitions of d (partitions in which the maximum and minimum elements differ by at most 1) can possibly be optimal. The run times of these algorithms plotted against m form a hull of optimality. It is proved that, although there is an exponential number of partitions, (1) the number of faces on this hull is Theta(square root of d), (2) the hull can be found in theta(square root of d) time, and (3) once it has been found, the optimal algorithm for any given m can be found in Theta(log d) time. These results provide a very fast technique for minimizing communication overhead in many important applications, such as matrix transpose, Fast Fourier transform, and ADI.
- Published
- 1993
9. Parametric binary dissection
- Author
-
Bokhari, Shahid H, Crockett, Thomas W, and Nicol, David M
- Subjects
Cybernetics - Abstract
Binary dissection is widely used to partition non-uniform domains over parallel computers. This algorithm does not consider the perimeter, surface area, or aspect ratio of the regions being generated and can yield decompositions that have poor communication to computation ratio. Parametric Binary Dissection (PBD) is a new algorithm in which each cut is chosen to minimize load + lambda x(shape). In a 2 (or 3) dimensional problem, load is the amount of computation to be performed in a subregion and shape could refer to the perimeter (respectively surface) of that subregion. Shape is a measure of communication overhead and the parameter permits us to trade off load imbalance against communication overhead. When A is zero, the algorithm reduces to plain binary dissection. This algorithm can be used to partition graphs embedded in 2 or 3-d. Load is the number of nodes in a subregion, shape the number of edges that leave that subregion, and lambda the ratio of time to communicate over an edge to the time to compute at a node. An algorithm is presented that finds the depth d parametric dissection of an embedded graph with n vertices and e edges in O(max(n log n, de)) time, which is an improvement over the O(dn log n) time of plain binary dissection. Parallel versions of this algorithm are also presented; the best of these requires O((n/p) log(sup 3)p) time on a p processor hypercube, assuming graphs of bounded degree. How PBD is applied to 3-d unstructured meshes and yields partitions that are better than those obtained by plain dissection is described. Its application to the color image quantization problem is also discussed, in which samples in a high-resolution color space are mapped onto a lower resolution space in a way that minimizes the color error.
- Published
- 1993
10. A network flow model for load balancing in circuit-switched multicomputers
- Author
-
Bokhari, Shahid H
- Subjects
Computer Systems - Abstract
In multicomputers that utilize circuit switching or wormhole routing, communication overhead depends largely on link contention - the variation due to distance between nodes is negligible. This has a major impact on the load balancing problem. In this case, there are some nodes with excess load (sources) and others with deficit load (sinks) and it is required to find a matching of sources to sinks that avoids contention. The problem is made complex by the hardwired routing on currently available machines: the user can control only which nodes communicate but not how the messages are routed. Network flow models of message flow in the mesh and the hypercube were developed to solve this problem. The crucial property of these models is the correspondence between minimum cost flows and correctly routed messages. To solve a given load balancing problem, a minimum cost flow algorithm is applied to the network. This permits one to determine efficiently a maximum contention free matching of sources to sinks which, in turn, tells one how much of the given imbalance can be eliminated without contention.
- Published
- 1993
- Full Text
- View/download PDF
11. Multiphase complete exchange on a circuit switched hypercube
- Author
-
Bokhari, Shahid H
- Subjects
Mathematical And Computer Sciences (General) - Abstract
On a distributed memory parallel computer, the complete exchange (all-to-all personalized) communication pattern requires each of n processors to send a different block of data to each of the remaining n - 1 processors. This pattern is at the heart of many important algorithms, most notably the matrix transpose. For a circuit switched hypercube of dimension d(n = 2(sup d)), two algorithms for achieving complete exchange are known. These are (1) the Standard Exchange approach that employs d transmissions of size 2(sup d-1) blocks each and is useful for small block sizes, and (2) the Optimal Circuit Switched algorithm that employs 2(sup d) - 1 transmissions of 1 block each and is best for large block sizes. A unified multiphase algorithm is described that includes these two algorithms as special cases. The complete exchange on a hypercube of dimension d and block size m is achieved by carrying out k partial exchange on subcubes of dimension d(sub i) Sigma(sup k)(sub i=1) d(sub i) = d and effective block size m(sub i) = m2(sup d-di). When k = d and all d(sub i) = 1, this corresponds to algorithm (1) above. For the case of k = 1 and d(sub i) = d, this becomes the circuit switched algorithm (2). Changing the subcube dimensions d, varies the effective block size and permits a compromise between the data permutation and block transmission overhead of (1) and the startup overhead of (2). For a hypercube of dimension d, the number of possible combinations of subcubes is p(d), the number of partitions of the integer d. This is an exponential but very slowly growing function and it is feasible over these partitions to discover the best combination for a given message size. The approach was analyzed for, and implemented on, the Intel iPSC-860 circuit switched hypercube. Measurements show good agreement with predictions and demonstrate that the multiphase approach can substantially improve performance for block sizes in the 0 to 160 byte range. This range, which corresponds to 0 to 40 floating point numbers per processor, is commonly encountered in practical numeric applications. The multiphase technique is applicable to all circuit-switched hypercubes that use the common e-cube routing strategy.
- Published
- 1991
12. Complete exchange on the iPSC-860
- Author
-
Bokhari, Shahid H
- Subjects
Mathematical And Computer Sciences (General) - Abstract
The implementation of complete exchange on the circuit switched Intel iPSC-860 hypercube is described. This pattern, also known as all-to-all personalized communication, is the densest requirement that can be imposed on a network. On the iPSC-860, care needs to be taken to avoid edge contention, which can have a disastrous impact on communication time. There are basically two classes of algorithms that achieve contention-free complete exchange. The first contains the classical standard exchange algorithm that is generally useful for small message sizes. The second includes a number of optimal or near-optimal algorithms that are best for large messages. Measurement of communication overhead on the iPSC-860 are given and a notation for analyzing communication link usage is developed. It is shown that for the two classes of algorithms, there is substantial variation in performance with synchronization technique and choice of message protocol. Timings of six implementations are given; each of these is useful over a particular range of message size and cube dimension. Since the complete exchange is a superset of communication patterns, these timings represent upper bounds on the time required by an arbitrary communication requirement. These results indicate that the programmer needs to evaluate several possibilities before finalizing an implementation - a careful choice can lead to very significant savings in time.
- Published
- 1991
13. Efficient algorithms for a class of partitioning problems
- Author
-
Iqbal, M. Ashraf and Bokhari, Shahid H
- Subjects
Computer Programming And Software - Abstract
The problem of optimally partitioning the modules of chain- or tree-like tasks over chain-structured or host-satellite multiple computer systems is addressed. This important class of problems includes many signal processing and industrial control applications. Prior research has resulted in a succession of faster exact and approximate algorithms for these problems. Polynomial exact and approximate algorithms are described for this class that are better than any of the previously reported algorithms. The approach is based on a preprocessing step that condenses the given chain or tree structured task into a monotonic chain or tree. The partitioning of this monotonic take can then be carried out using fast search techniques.
- Published
- 1990
14. Communication overhead on the Intel iPSC-860 hypercube
- Author
-
Bokhari, Shahid H
- Subjects
Computer Operations And Hardware - Abstract
Experiments were conducted on the Intel iPSC-860 hypercube in order to evaluate the overhead of interprocessor communication. It is demonstrated that: (1) contrary to popular belief, the distance between two communicating processors has a significant impact on communication time, (2) edge contention can increase communication time by a factor of more than 7, and (3) node contention has no measurable impact.
- Published
- 1990
15. A network flow model for load balancing in circuit-switched multicomputers
- Author
-
Bokhari, Shahid H
- Subjects
Computer Systems - Abstract
In multicomputers that utilize circuit switching or wormhole routing, communication overhead depends largely on link contention - the variation due to distance between nodes is negligible. This has a major impact on the load balancing problem. In this case, there are some nodes with excess load (sources) and others with deficit load (sinks) and it is required to find a matching of sources to sinks that avoids contention. The problem is made complex by the hardwired routing on currently available machines: the user can control only which nodes communicate but not how the messages are routed. Network flow models of message flow in the mesh and the hypercube were developed to solve this problem. The crucial property of these models is the correspondence between minimum cost flows and correctly routed messages. To solve a given load balancing problem, a minimum cost flow algorithm is applied to the network. This permits one to determine efficiently a maximum contention free matching of sources to sinks which, in turn, tells one how much of the given imbalance can be eliminated without contention.
- Published
- 1990
16. Parallelization of a three-dimensional compressible transition code
- Author
-
Erlebacher, G, Hussaini, M. Y, and Bokhari, Shahid H
- Subjects
Computer Programming And Software - Abstract
The compressible, three-dimensional, time-dependent Navier-Stokes equations are solved on a 20 processor Flex/32 computer. The code is a parallel implementation of an existing code operational on the Cray-2 at NASA Ames, which performs direct simulations of the initial stages of the transition process of wall-bounded flow at supersonic Mach numbers. Spectral collocation in all three spatial directions (Fourier along the plate and Chebyshev normal to it) ensures high accuracy of the flow variables. By hiding most of the parallelism in low-level routines, the casual user is shielded from most of the nonstandard coding constructs. Speedups of 13 out of a maximum of 16 are achieved on the largest computational grids.
- Published
- 1990
17. Partitioning problems in parallel, pipelined, and distributed computing
- Author
-
Bokhari, Shahid H
- Subjects
Computer Systems - Abstract
The problem of optimally assigning the modules of a parallel program over the processors of a multiple-computer system is addressed. A sum-bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple-satellite system: partitioning multiple chain-structured parallel programs, multiple arbitrarily structured serial programs, and single-tree structured parallel programs. In addition, the problem of partitioning chain-structured parallel programs across chain-connected systems is solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple-computer architectures for a wide range of problems of practical interest.
- Published
- 1988
- Full Text
- View/download PDF
18. A partitioning strategy for nonuniform problems on multiprocessors
- Author
-
Berger, Marsha J and Bokhari, Shahid H
- Subjects
Computer Systems - Abstract
The partitioning of a problem on a domain with unequal work estimates in different subdomains is considered in a way that balances the work load across multiple processors. Such a problem arises for example in solving partial differential equations using an adaptive method that places extra grid points in certain subregions of the domain. A binary decomposition of the domain is used to partition it into rectangles requiring equal computational effort. The communication costs of mapping this partitioning onto different microprocessors: a mesh-connected array, a tree machine and a hypercube is then studied. The communication cost expressions can be used to determine the optimal depth of the above partitioning.
- Published
- 1987
19. A comparative analysis of static and dynamic load balancing strategies
- Author
-
Iqbal, M. Ashraf, Bokhari, Shahid H, and Saltz, Joel H
- Subjects
Computer Programming And Software - Abstract
The problem of uniformly distributing the load of a parallel program over a multiprocessor system was considered. A program was analyzed whose structure permits the computation of the optimal static solution. Then four strategies for load balancing were described and their performance compared. The strategies are: (1) the optimal static assignment algorithm which is guaranteed to yield the best static solution, (2) the static binary dissection method which is very fast but suboptimal, (3) the greedy algorithm, a static fully polynomial time approximation scheme, which estimates the optimal solution to arbitrary accuracy, and (4) the predictive dynamic load balancing heuristic which uses information on the precedence relationships within the program and outperforms any of the static methods. It is also shown that the overhead incurred by the dynamic heuristic is reduced considerably if it is started off with a static assignment provided by either of the three strategies.
- Published
- 1986
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.