82 results for "Software pipelining"
Search Results
2. Instruction Re-selection for Iterative Modulo Scheduling on High Performance Multi-issue DSPs
- Author
-
Cho, Doosan, Ravi, Ayyagari, Uh, Gang-Ryung, Paek, Yunheung, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Zhou, Xiaobo, editor, Sokolsky, Oleg, editor, Yan, Lu, editor, Jung, Eun-Sun, editor, Shao, Zili, editor, Mu, Yi, editor, Lee, Dong Chun, editor, Kim, Dae Young, editor, Jeong, Young-Sik, editor, and Xu, Cheng-Zhong, editor
- Published
- 2006
- Full Text
- View/download PDF
3. MIRS: Modulo Scheduling with Integrated Register Spilling
- Author
-
Zalamea, Javier, Llosa, Josep, Ayguadé, Eduard, Valero, Mateo, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Dietz, Henry G., editor
- Published
- 2003
- Full Text
- View/download PDF
4. Aggregate operation movement: A min-cut approach to global code motion
- Author
-
Lo, Raymond, Chan, Sun, Dehnert, Jim, Towle, Ross, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Bougé, Luc, editor, Fraigniaud, Pierre, editor, Mignotte, Anne, editor, and Robert, Yves, editor
- Published
- 1996
- Full Text
- View/download PDF
5. A comparison of modulo scheduling techniques for software pipelining
- Author
-
Pfahler, Peter, Piepenbrock, Georg, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Gyimóthy, Tibor, editor
- Published
- 1996
- Full Text
- View/download PDF
6. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops
- Author
-
Wang, Jian, Gao, Guang R., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Gyimóthy, Tibor, editor
- Published
- 1996
- Full Text
- View/download PDF
7. Delayed exceptions — Speculative execution of trapping instructions
- Author
-
Ertl, M. Anton, Krall, Andreas, Goos, Gerhard, editor, Hartmanis, Juris, editor, and Fritzson, Peter A., editor
- Published
- 1994
- Full Text
- View/download PDF
8. Software pipelining with CGA and proposed intrinsics on a reconfigurable processor for HEVC decoders
- Author
-
Ahn, Yong-Jo, Yoo, Jonghun, Jo, Hyun-Ho, and Sim, Donggyu
- Published
- 2019
- Full Text
- View/download PDF
9. Optimizing microprograms for recurrent loops on pipelined architectures using timed Petri nets
- Author
-
Hanen, Claire, Goos, G., editor, Hartmanis, J., editor, Barstow, D., editor, Brauer, W., editor, Brinch Hansen, P., editor, Gries, D., editor, Luckham, D., editor, Moler, C., editor, Pnueli, A., editor, Seegmüller, G., editor, Stoer, J., editor, Wirth, N., editor, and Rozenberg, Grzegorz, editor
- Published
- 1990
- Full Text
- View/download PDF
10. Low Power Compiler Optimization for Pipelining Scaling
- Author
-
Jen-Chieh Chang, Cheng-Yu Lee, Chia-Jung Chen, and Rong-Guey Chang
- Subjects
Loop optimization, Computer science, Pipeline (computing), Optimizing compiler, Parallel computing, Energy consumption, Dynamic voltage scaling, Software pipelining, Embedded system, Profile-guided optimization, Compiler - Abstract
Low power consumption plays an increasingly important role in embedded systems. Lowering voltage and frequency is a straightforward and effective way to save power, so dynamic voltage scaling (DVS) has become a prevalent low-power technique. However, DVS has no effect on power saving once the voltage reaches its lower bound. Fortunately, a technique called dynamic pipeline scaling (DPS) can overcome this limitation by switching pipeline modes at low voltage levels. Previous work on DPS relied on hardware support; from the compiler's viewpoint, little has been done on this issue. This paper presents a DPS optimization technique applied at compile time to reduce power dissipation. Information about an application is exploited to devise an analytical model that assesses the cost of enabling the DPS mechanism. As a consequence, we can determine the switching timing between pipeline modes at compile time without significant run-time overhead. Experimental results show that our approach is effective in reducing energy consumption.
- Published
- 2013
- Full Text
- View/download PDF
11. Synthetic Aperture Radar Data Processing on an FPGA Multi-core System
- Author
-
Jørgen Dall, Anders Kusk, Pascal Schleuniger, and Sven Karlsson
- Subjects
Synthetic aperture radar, Data processing, Multi-core processor, Computer science, CAS latency, Network on a chip, Software pipelining, Radar imaging, Field-programmable gate array, Computer hardware - Abstract
Synthetic aperture radar (SAR) is a high-resolution imaging radar. The direct back-projection algorithm allows for precise SAR output image reconstruction and can compensate for deviations in the flight track of airborne radars. Graphics processing units (GPUs) are often used for data processing, as the back-projection algorithm is computationally expensive and highly parallel. However, GPUs may not be an appropriate solution for applications with strictly constrained space and power requirements. In this paper, we describe how we map a SAR direct back-projection application to a multi-core system on an FPGA. The fabric, consisting of 64 processor cores and a 2D mesh interconnect, utilizes 60% of the hardware resources of a Xilinx Virtex-7 device with 550 thousand logic cells and consumes about 10 watts. We apply software pipelining to hide memory latency and reduce the hardware footprint by 14%. We show that the system provides real-time processing of a SAR application that maps a 3000 m wide area at a resolution of 2x2 meters.
- Published
- 2013
- Full Text
- View/download PDF
12. Locality Improvement of Data-Parallel Adams–Bashforth Methods through Block-Based Pipelining of Time Steps
- Author
-
Matthias Korch
- Subjects
Software pipelining, CPU cache, Computer science, Pipeline (computing), Locality, Scalability, Initial value problem, Parallel computing, Supercomputer, Block (data storage), Linear multistep method - Abstract
Adams–Bashforth methods are a well-known class of explicit linear multi-step methods for the solution of initial value problems of ordinary differential equations. This article discusses data-parallel implementation variants with different loop structures and communication patterns and compares the resulting locality and scalability. In particular, pipelining of time steps is employed to improve the locality of memory references. The comparison is based on detailed runtime experiments performed on parallel computer systems with different architectures, including the two supercomputer systems JUROPA and HLRB II.
- Published
- 2012
- Full Text
- View/download PDF
13. Pipelining Software and Services for Language Processing
- Author
-
Arif Bramantoro, Ulrich Schäfer, and Toru Ishida
- Subjects
Language Grid, Computer science, Interoperability, Software development, Business Process Execution Language, Software pipelining, Software, Computer architecture, The Internet, Web service, Software engineering - Abstract
This chapter reports on our experiences with combining two different platforms in natural language processing research, Heart of Gold and the Language Grid, to make more language resources available on the Web. Heart of Gold is a middleware architecture for pipelining deep and shallow natural language processing components. The Language Grid is a service grid infrastructure built on top of the Internet to provide pipelined language services. Both frameworks provide composite language services and components. Integrating Heart of Gold into the Language Grid environment increases interoperability among various language services. The integrated architecture also supports combining pipelined language services in the Language Grid with the pipelined natural language processing components in Heart of Gold, providing better-quality language services on the Web. Thus, language services with different characteristics can be combined based on the concept of Web services, with different treatment of each combination. An evaluation shows that the overhead of the integration is not significant.
- Published
- 2011
- Full Text
- View/download PDF
14. Translation Validation of Loop Optimizations and Software Pipelining in the TVOC Framework
- Author
-
Benjamin Goldberg
- Subjects
Loop optimization, Transformation (function), Software pipelining, LOOP (programming language), Programming language, Computer science, Process (computing), Code (cryptography), Compiler, Translation (geometry) - Abstract
Translation validation (TV) is the process of proving that the execution of a translator has generated an output that is a correct translation of the input. When applied to optimizing compilers, TV is used to prove that the generated target code is a correct translation of the source program being compiled. This is in contrast to verifying a compiler, i.e. ensuring that the compiler will generate correct target code for every possible source program, which is generally a far more difficult endeavor. This paper reviews the TVOC framework developed by Amir Pnueli and his colleagues for translation validation for optimizing compilers, where the program being compiled undergoes substantial transformation for the purposes of optimization. The paper concludes with a discussion of how recent work on the TV of software pipelining by Tristan & Leroy can be incorporated into the TVOC framework.
- Published
- 2010
- Full Text
- View/download PDF
15. A Remote Mirroring Architecture with Adaptively Cooperative Pipelining
- Author
-
Bing Liu, Zhenhai Zhao, Xiaoguang Liu, Tingting Qin, Yongzhi Song, and Gang Wang
- Subjects
Computer science, Pipeline (computing), Parallel computing, Encryption, Software pipelining, Computer architecture, Backup, Internet Protocol, Architecture, Mirroring - Abstract
In recent years, remote mirroring technology has attracted increasing attention. In this paper, we present a novel adaptively cooperative pipelining model for remote mirroring systems. Unlike the traditional pipelining model, this new model takes the decentralization of processors into account and adopts an adaptive batching strategy to alleviate the imbalanced pipeline stages this decentralization causes. To relieve the heavy CPU load exerted by compression, encryption, the TCP/IP protocol stack, and so on, we design fine-grained pipelining, multi-threaded pipelining, and hybrid pipelining. We implement a remote mirroring prototype based on Linux LVM2. The experimental results show that the adaptively cooperative pipelining model effectively balances the primary and backup sites (the two stages of the pipeline), and that fine-grained, multi-threaded, and hybrid pipelining improve performance remarkably.
- Published
- 2010
- Full Text
- View/download PDF
16. Implementation and Evaluation of Fast Parallel Packet Filters on a Cell Processor
- Author
-
Masato Tsuru and Yoshiyuki Yamashita
- Subjects
Software pipelining, Software, Computer science, Filter (video), Network packet, Embedded system, Network processor, Single-core, Program optimization, Virtual network - Abstract
Packet filters are essential in most areas of recent information network technologies. While expensive high-end routers and firewalls are implemented in hardware, flexible and cost-effective ones are usually software solutions running on general-purpose CPUs, at the cost of performance. The authors have studied methods of applying code optimization techniques to packet filters executing on a single-core processor. In this paper, by utilizing the multi-core Cell Broadband Engine processor with software pipelining, we construct a parallelized and SIMDized packet filter 40 times faster than the naive C program filter executed on a single core.
- Published
- 2010
- Full Text
- View/download PDF
17. Concurrent Separation Logic for Pipelined Parallelization
- Author
-
Christian J. Bell, Andrew W. Appel, and David Walker
- Subjects
Multi-core processor, Software pipelining, Computer architecture, Shared memory, Computer science, Separation logic, Parallel computing, Compiler, Hoare logic, Operational semantics - Abstract
Recent innovations in automatic parallelizing compilers are showing impressive speedups on multicore processors using shared memory with asynchronous channels. We have formulated an operational semantics and proved sound a concurrent separation logic to reason about multithreaded programs that communicate asynchronously through channels and share memory. Our logic supports shared channel endpoints (multiple producers and consumers) and introduces histories to overcome limitations with local reasoning. We demonstrate how to transform a sequential proof into a parallelized proof that targets the output of the parallelizing optimization DSWP (Decoupled Software Pipelining).
- Published
- 2010
- Full Text
- View/download PDF
18. Ultra High Throughput Implementations for MD5 Hash Algorithm on FPGA
- Author
-
Liehui Jiang, Qiuxia Zhao, Yi Shao, and Yuliang Wang
- Subjects
Loop unrolling, Software pipelining, Speedup, Computer science, Pipeline (computing), Stratix, Pentium, Parallel computing, Field-programmable gate array, Throughput (business), AND gate - Abstract
This paper first presents a new architecture for MD5 that achieves the theoretical upper bound on throughput for the iterative architecture. Then, based on the proposed general architecture, it implements two further pipelined architectures, based on the iterative technique and the loop unrolling technique respectively. The latter, with 32-stage pipelining, reaches a throughput of 32.035 Gbps on an Altera Stratix II GX EP2SGX90FF FPGA, a 194x speedup over the Intel Pentium 4 3.0 processor. To the authors' knowledge, this is the fastest published FPGA-based design at the time of writing. Finally, the proposed designs are compared with other published MD5 designs and show clear advantages in both speed and logic requirements.
- Published
- 2010
- Full Text
- View/download PDF
19. MPSoC Design Using Application-Specific Architecturally Visible Communication
- Author
-
Philip Brisk, Paolo Ienne, Theo Kluter, and Edoardo Charbon
- Subjects
Speedup, CPU cache, Computer science, MESI protocol, MESIF protocol, Runtime system, Software pipelining, Embedded system, Cache, Memory safety - Abstract
This paper advocates the placement of Architecturally Visible Communication (AVC) buffers between adjacent cores in MPSoCs to provide high-throughput communication for streaming applications. Producer/consumer relationships map poorly onto cache-based MPSoCs. Instead, we instantiate application specific AVC buffers on top of a distributed consistent and coherent cache-based system with shared main memory to provide the desired functionality. Using JPEG compression as a case study, we show that the use of AVC buffers in conjunction with parallel execution via heterogeneous software pipelining provides a speedup of as much as 4.2x compared to a baseline single processor system, with an increase in estimated memory energy consumption of only 1.6x. Additionally, we describe a method to integrate the AVC buffers into the L1 cache coherence protocol; this allows the runtime system to guarantee memory safety and coherence in situations where the parallelization of the application may be unsafe due to pointers that could not be resolved at compile time.
- Published
- 2009
- Full Text
- View/download PDF
20. Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures
- Author
-
Alexander Fell, Ranjani Narayan, Keshavan Varadarajan, Mythri Alle, and S. K. Nandy
- Subjects
Reduction (complexity), Data flow diagram, Software pipelining, Computer architecture, Binary decision diagram, Computer science, Computation, Optimizing compiler, Compiler, Parallel computing, Data-flow analysis - Abstract
In this paper we develop compilation techniques for realizing applications described in a High Level Language (HLL) on a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps), subgraphs of an application's data flow graph comprising elementary operations with strong producer-consumer relationships. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On average, HyperOps offer a 44% reduction in total execution time and an 18% reduction in management overheads compared to using basic blocks as coarse grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.
- Published
- 2009
- Full Text
- View/download PDF
21. Deriving Efficient Data Movement from Decoupled Access/Execute Specifications
- Author
-
Anton Lokhmotov, Lee Howes, Paul H. J. Kelly, and Alastair F. Donaldson
- Subjects
Software pipelining ,Kernel (image processing) ,Computer science ,Computation ,Broadband ,Compiler ,Parallel computing ,Architecture ,computer.software_genre ,Programmer ,computer ,Implementation - Abstract
On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming the Cell Broadband Engine architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.
- Published
- 2009
- Full Text
- View/download PDF
22. Software Pipelining in Nested Loops with Prolog-Epilog Merging
- Author
-
Cohen, Albert and Fellahi, Mohammed (Architectures, Languages and Compilers to Harness the End of Moore Years (ALCHEMY); Laboratoire de Recherche en Informatique (LRI); Université Paris-Sud - Paris 11 (UP11); CentraleSupélec; Centre National de la Recherche Scientifique (CNRS); Inria Saclay - Ile de France), and Seznec, André, editor, Emer, Joel, editor, O'Boyle, Mike, editor, Martonosi, Margaret, editor, and Ungerer, Theo, editor
- Subjects
Loop unrolling, Loop inversion, Computer science, Pipeline (computing), Parallel computing, Scheduling (computing), Prolog, Software pipelining, Nested loop join, Inner loop - Abstract
Software pipelining (or modulo scheduling) is a powerful back-end optimization to exploit instruction and vector parallelism. Software pipelining is particularly popular for embedded devices, as it improves computation throughput without increasing the size of the inner loop kernel (unlike loop unrolling), a desirable property to minimize the amount of code in local memories or caches. Unfortunately, common media and signal processing codes exhibit series of low-trip-count inner loops. In this situation, software pipelining is often not an option: it incurs severe fill/drain time overheads and code size expansion due to nested prologs and epilogs. We propose a method to pipeline series of inner loops without increasing the size of the loop nest, apart from an outermost prolog and epilog. Our method achieves significant code size savings and allows pipelining of low-trip-count loops. These benefits come at the cost of additional scheduling constraints, leading to a linear optimization problem that trades memory usage for pipelining opportunities.
- Published
- 2009
- Full Text
- View/download PDF
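The prolog/kernel/epilog structure that the abstract above refers to can be made concrete with a small hand-pipelined loop. This is an illustrative sketch, not code from any of the listed papers: a modulo scheduler would produce this kind of stage overlap automatically, and the fill (prolog) and drain (epilog) code is exactly the overhead that grows when loops are nested.

```c
#include <assert.h>

/* Hand-pipelined copy-and-scale loop: the three stages (load, multiply,
 * store) belonging to iterations i, i-1, and i-2 overlap in each kernel
 * step, which is the steady state a modulo scheduler aims for. */
void scale2_pipelined(const int *a, int *b, int n)
{
    if (n < 2) {                      /* trip count too low to pipeline */
        for (int i = 0; i < n; i++)
            b[i] = a[i] * 2;
        return;
    }
    int load, mul;
    /* prolog: fill the pipeline */
    load = a[0];
    mul  = load * 2;
    load = a[1];
    /* kernel: steady state, one result stored per step */
    for (int i = 2; i < n; i++) {
        b[i - 2] = mul;      /* store stage (iteration i-2) */
        mul  = load * 2;     /* multiply stage (iteration i-1) */
        load = a[i];         /* load stage (iteration i) */
    }
    /* epilog: drain the pipeline */
    b[n - 2] = mul;
    b[n - 1] = load * 2;
}
```

For a low-trip-count loop, the prolog and epilog here are almost as large as the kernel, which illustrates why the paper merges prologs and epilogs across a series of inner loops.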
23. Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies
- Author
-
William J. Dally and Abhishek Das
- Subjects
Software pipelining, Memory hierarchy, Computer science, Distributed computing, Computation, Parallel algorithm, Interleaved memory, Parallel computing, Memory map, Scheduling (computing) - Abstract
With the emergence of streaming and multi-core architectures, there is an increasing demand to map parallel algorithms efficiently across all architectures. This paper describes a platform-independent optimization framework called Stream Scheduling that orchestrates parallel execution of bulk computations and data transfers and allocates storage at multiple levels of a memory hierarchy. By adjusting block sizes and applying software pipelining to bulk operations, it ensures that the computation-to-communication ratio is maximized at each level. We evaluate our framework on a diverse set of Sequoia applications, targeting systems with different memory hierarchies: a Cell blade, a distributed-memory cluster, and the Cell blade attached to a disk.
- Published
- 2008
- Full Text
- View/download PDF
24. Pipelined Java Virtual Machine Interpreters
- Author
-
Lex Augusteijn and Jan Hoogerbrugge
- Subjects
Source code, Programming language, Computer science, Parallel computing, Bytecode, Software pipelining, Threaded code, Very long instruction word, Compiler, Rewriting, Interpreter - Abstract
The performance of a Java Virtual Machine (JVM) interpreter running on a very long instruction word (VLIW) processor can be improved by means of pipelining: while one bytecode is in its execute stage, the next bytecode is in its decode stage, and the one after that is in its fetch stage. The paper describes how we implemented threading and pipelining by rewriting the interpreter source code and making several modifications to the compiler. Experiments for evaluating the effectiveness of pipelining are described. Pipelining improves the execution speed of a threaded interpreter by 19.4% in terms of instruction count and 14.4% in terms of cycle count. Most of the simple bytecodes, like additions and multiplications, execute in four cycles. This number corresponds to the branch latency of our target VLIW processor; thus most of the interpreter's code is executed in branch delay slots.
- Published
- 2007
- Full Text
- View/download PDF
25. Register Allocation and Optimal Spill Code Scheduling in Software Pipelined Loops Using 0-1 Integer Linear Programming Formulation
- Author
-
Santosh Nagarakatte and Ramaswamy Govindarajan
- Subjects
Linear programming, Computer science, Parallel computing, Scheduling (computing), Software, Software pipelining, Code generation, Heuristics, Instruction-level parallelism, Register allocation - Abstract
In achieving higher instruction-level parallelism, software pipelining increases the register pressure in the loop. The usefulness of the generated schedule may be restricted to cases where the register pressure is less than the available number of registers; otherwise, spill instructions need to be introduced. But scheduling these spill instructions in the compact schedule is a difficult task. Several heuristics have been proposed to schedule spill code; they may generate more spill code than necessary, and scheduling it may necessitate increasing the initiation interval. We model the problem of register allocation with spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. The formulation minimizes the increase in initiation interval (II) by optimally placing spill code and simultaneously minimizes the amount of spill code produced. To the best of our knowledge, this is the first integrated formulation for register allocation, optimal spill code generation, and scheduling for software pipelined loops. The proposed formulation performs better than the existing heuristics, preventing an increase in II in 11.11% of the loops and generating 18.48% less spill code on average among the loops extracted from the Perfect Club and SPEC benchmarks, with a moderate increase in compilation time.
- Published
- 2007
- Full Text
- View/download PDF
26. Software Pipelining for Packet Filters
- Author
-
Yoshiyuki Yamashita and Masato Tsuru
- Subjects
Computer science, Network packet, Packet generator, Parallel computing, Scheduling (computing), Software pipelining, Computer engineering, Basic block, Packet analyzer, Compiler, Fast packet switching, Processing delay - Abstract
Packet filters play an essential role in traffic management and security management on the Internet. To create software-based packet filters fast enough to work even under a DoS attack, it is vital to effectively combine higher-level optimization of the algorithmic structure with lower-level acceleration techniques from compiler research. In this paper, we focus on the lower-level (machine code) optimization using software pipelining, and report experimental results that indicate the potential of our approach for accelerating packet filter performance. The technical difficulty is that a packet filter is a lump of conditional branches, so standard optimization techniques usually applied to basic blocks are not directly applicable. Using predicated execution and enhanced modulo scheduling, we solve this problem and achieve 20 times higher performance compared with a conventional interpreter-based packet filter. We also compare the proposed filters with compiler-based packet filters and obtain a better than two-fold increase in performance.
- Published
- 2007
- Full Text
- View/download PDF
27. Loop Striping: Maximize Parallelism for Nested Loops
- Author
-
Edwin H.-M. Sha, Meilin Liu, Zili Shao, Meikang Qiu, and Chun Xue
- Subjects
For loop, Loop counter, Recursion, Loop inversion, Computer science, Loop fusion, Parallel computing, Loop tiling, Loop fission, Loop splitting, Software pipelining, Loop nest optimization, Loop interchange, Nested loop join, Algorithm - Abstract
The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase parallelism in these nested loops. Most existing loop transformation techniques either cannot achieve maximum parallelism, or achieve it only with complicated loop bounds and loop index calculations. This paper proposes a new technique, loop striping, that can maximize parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where a stripe is a group of iterations that are all independent and can be executed in parallel. Theorems and efficient algorithms are proposed for loop striping transformations. The experimental results show that loop striping always achieves a better iteration period than software pipelining and loop unfolding, improving the average iteration period by 50% and 54% respectively.
- Published
- 2006
- Full Text
- View/download PDF
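The abstract above defines a stripe as a group of mutually independent iterations. As an illustrative sketch only (the recurrence, the anti-diagonal grouping, and the `run_stripes` name are our assumptions, not the paper's algorithm), a 2D recurrence shows the idea: all cells on one anti-diagonal depend only on earlier diagonals, so each diagonal forms a stripe whose cells could execute in parallel.

```c
#include <assert.h>

#define N 4

/* For the recurrence a[i][j] = a[i-1][j] + a[i][j-1], cells with the same
 * i + j are independent of each other: each such anti-diagonal is a
 * "stripe". Stripes run in order; cells within a stripe could run in
 * parallel (here they are simply iterated sequentially). */
void run_stripes(int a[N][N])
{
    for (int d = 2; d <= 2 * (N - 1); d++)   /* one stripe per diagonal */
        for (int i = 1; i < N; i++) {        /* cells within the stripe */
            int j = d - i;
            if (j >= 1 && j < N)
                a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
}
```

With borders initialized to 1, each interior cell ends up counting lattice paths, so the result is easy to check against the binomial coefficients C(i+j, i).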
28. Instruction Re-selection for Iterative Modulo Scheduling on High Performance Multi-issue DSPs
- Author
-
Yunheung Paek, Gang-Ryung Uh, Doosan Cho, and Ayyagari Ravi
- Subjects
Loop unrolling, Idle, Software pipelining, Computer science, Iterative method, Trace scheduling, Modulo, Compiler, Parallel computing, Dependence analysis, Scheduling (computing) - Abstract
Iterative modulo scheduling is very important for compilers targeting high performance multi-issue digital signal processors, because these processors are often severely limited by idle functional units, so reducing idle units can have a significant impact on performance. However, complex instructions used in most recent DSPs, such as mac, usually increase data dependence complexity, and the complex dependencies in signal processing applications often restrict modulo scheduling freedom and therefore become a limiting factor for the iterative modulo scheduler. In this work, we propose a technique that efficiently re-selects instructions of an application's loop code in light of dependence complexity, directly resolving the dependence constraint. It is specifically aimed at accelerating software pipelining performance by minimizing the length of intrinsic cyclic dependencies. A few existing compilers support a loop-unrolling-based dependence relaxation technique to this end, but only for some limited cases, mainly because loop unrolling typically incurs a huge code size overhead, and iterative modulo scheduling with relaxed dependences in the general case is an NP-hard problem requiring complex assignments of registers and functional units. Our technique uses a heuristic to handle this problem efficiently in a pre-stage of iterative modulo scheduling, without loop unrolling.
- Published
- 2006
- Full Text
- View/download PDF
29. SCAN: A Heuristic for Near-Optimal Software Pipelining
- Author
-
F. Blachot, Guillaume Huard, and Benoît Dupont de Dinechin
- Subjects
Computer science, Modulo, Optimizing compiler, Parallel computing, Scheduling (computing), Software pipelining, Software, Very long instruction word, Compiler, Heuristics, Integer programming - Abstract
Software pipelining is a classic compiler optimization that improves the performance of inner loops on instruction-level parallel processors. In the context of embedded computing, applications are compiled prior to manufacturing the system, so it is possible to invest large amounts of time in compiler optimizations. Traditionally, software pipelining is performed by heuristics such as iterative modulo scheduling. Optimal software pipelining can be formulated as an integer linear program; however, these formulations can take exponential time to solve. As a result, the size of loops that can be optimally software pipelined is quite limited. In this article, we present the SCAN heuristic, which makes it possible to benefit from the integer linear programming formulations of software pipelining even on loops of significant size. The principle of the SCAN heuristic is to iteratively constrain the software pipelining problem until the integer linear programming formulation is solvable in reasonable time. We applied the SCAN heuristic to a multimedia benchmark for the ST200 VLIW processor. We show that it almost always computes an optimal solution for loops that are intractable by classic integer linear programming approaches. This improves performance by up to 33.3% over the heuristic modulo scheduling of the production ST200 compiler.
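The SCAN principle, tightening the problem until the exact formulation becomes tractable, can be sketched independently of any ILP solver. In this illustrative driver (the names and the window-narrowing policy are assumptions, not the paper's actual algorithm), each operation's issue time is confined to a window around a heuristic schedule, and the window shrinks until the exact solver finishes within its time budget:

```python
def scan_driver(solve_exact, heuristic_schedule, max_window):
    """Try progressively tighter per-operation time windows around a
    heuristic schedule; return the first exact solution found, falling
    back to the heuristic schedule if every attempt times out.
    solve_exact(windows) is expected to return None on timeout."""
    for w in range(max_window, -1, -1):       # loosest (most optimal) first
        windows = {op: (t - w, t + w) for op, t in heuristic_schedule.items()}
        solution = solve_exact(windows)
        if solution is not None:
            return solution
    return heuristic_schedule
```

The first window size for which the solver succeeds is the loosest tractable constraint, which preserves as much of the optimality of the exact formulation as the time budget allows.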
- Published
- 2006
- Full Text
- View/download PDF
30. Register Pressure in Software-Pipelined Loop Nests: Fast Computation and Impact on Architecture Design
- Author
-
Guang R. Gao and Alban Douillet
- Subjects
Schedule ,Software pipelining ,Computer science ,Processor register ,Pipeline (computing) ,Register file ,Parallel computing ,Loop variant ,Nested loop join ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Register allocation - Abstract
Recently the Single-dimension Software Pipelining (SSP) technique was proposed to software pipeline loop nests at an arbitrary loop level [18,19,20]. However, SSP schedules require a high number of rotating registers, and may become infeasible if register needs exceed the number of available registers. It is therefore desirable to have a method that computes the register pressure quickly (without actually performing register allocation) as an early measure of the feasibility of an SSP schedule. Such a method can also provide valuable feedback to processor architects in their register file design decisions, as far as the needs of loop nests are concerned. This paper presents a method that quickly computes the minimum number of rotating registers required by an SSP schedule. The results demonstrate that the method is always accurate and is 3 to 4 orders of magnitude faster on average than the register allocator. Experiments also suggest that 64 floating-point rotating registers are in general enough to accommodate the needs of the loop nests used in scientific computations.
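The quantity being estimated is essentially MaxLive under a modulo schedule: because the kernel repeats every II cycles, each lifetime of the flat schedule folds onto cycle slots modulo II, and a value that lives longer than II cycles overlaps copies of itself from later iterations, which is exactly what rotating registers accommodate. A minimal sketch (the lifetime encoding is an assumption for illustration):

```python
def modulo_register_pressure(lifetimes, ii):
    """Max number of simultaneously live values in a modulo schedule.
    lifetimes: (start, end) half-open cycle ranges in the flat schedule.
    Folding the ranges modulo ii counts overlap both between different
    values and between copies of one long-lived value."""
    pressure = [0] * ii
    for start, end in lifetimes:
        for t in range(start, end):
            pressure[t % ii] += 1
    return max(pressure) if lifetimes else 0
```

A single value live for 5 cycles under II = 2, for example, needs 3 registers on its own, since three copies of it are in flight at once.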
- Published
- 2006
- Full Text
- View/download PDF
31. Register Allocation on Stream Processor with Local Register File
- Author
-
Mei Wen, Nan Wu, Ju Ren, Chunyuan Zhang, and Yi He
- Subjects
Instruction register ,Computer science ,Working set ,Register file ,Parallel computing ,computer.software_genre ,law.invention ,Memory address register ,Software pipelining ,law ,Control register ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Processor register ,FLAGS register ,Register renaming ,Register window ,Stack register ,Microprocessor ,Very long instruction word ,Status register ,Basic block ,Operating system ,Index register ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Memory data register ,Instruction-level parallelism ,computer ,Register allocation - Abstract
Emerging stream processors for intensive computing use local register files to support ALU arrays and use VLIW to exploit instruction-level parallelism. Current VLIW compilers for local register files, such as ISCD, work well on moderate media applications without considering register allocation pressure. However, more complicated applications, and optimizations that increase the size of the working set such as software pipelining, make it necessary to consider register pressure during scheduling. Based on the ISCD compiler, this paper presents two new techniques, spill scheduling and basic block repartitioning, which together form a new scheduling algorithm that alleviates register pressure. Experimental results show that it handles heavy-workload applications well. The algorithm can also be applied to other microprocessors with similar register architectures.
- Published
- 2006
- Full Text
- View/download PDF
32. Pipelining Network Storage I/O
- Author
-
Dan Feng, Lingfang Zeng, and Fang Wang
- Subjects
Input/output ,business.industry ,Computer science ,Network storage ,Pipeline (computing) ,Parallel computing ,Scheduling (computing) ,Storage area network ,Software pipelining ,Embedded system ,Computer data storage ,Data_FILES ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business - Abstract
This paper studies the application of pipelining to network storage (e.g., network-attached storage and storage area networks) and the segmentation of I/O into reasonable pipeline stages, which can provide significant guidance for enhancing the performance of network storage systems. Experiments with pipelined I/O were carried out on the Heter-RAID. The results show that pipelined I/O scheduling can exploit pipelining to achieve high-performance I/O over a network storage system.
- Published
- 2006
- Full Text
- View/download PDF
33. Multi-dimensional Kernel Generation for Loop Nest Software Pipelining
- Author
-
Guang R. Gao, Hongbo Rong, and Alban Douillet
- Subjects
Rate-monotonic scheduling ,Schedule ,Fixed-priority pre-emptive scheduling ,Software pipelining ,Job shop scheduling ,Computer science ,Two-level scheduling ,Dynamic priority scheduling ,Parallel computing ,Nested loop join ,Execution time ,Fair-share scheduling ,Scheduling (computing) - Abstract
Single-dimension Software Pipelining (SSP) has been proposed as an effective software pipelining technique for multi-dimensional loops [16]. This paper introduces for the first time the scheduling methods that actually produce the kernel code. Because of the multi-dimensional nature of the problem, the scheduling problem is more complex and challenging than with traditional modulo scheduling. The scheduler must handle multiple subkernels and initiation rates under specific scheduling constraints, while producing a solution that minimizes the execution time of the final schedule. In this paper three approaches are proposed: the level-by-level method, which schedules operations in loop level order, starting from the innermost, and does not let other operations interfere with the already scheduled levels; the flat method, which schedules operations from different loop levels with the same priority; and the hybrid method, which uses the level-by-level mechanism for the innermost level and the flat solution for the other levels. The methods subsume Huff's modulo scheduling [8] for single loops as a special case. We also break a scheduling constraint introduced in earlier publications and allow for a more compact kernel. The proposed approaches were implemented in the Open64/ORC compiler, and evaluated on loop nests from the Livermore, SPEC2000 and NAS benchmarks.
- Published
- 2006
- Full Text
- View/download PDF
34. Software Pipelining Support for Transport Triggered Architecture Processors
- Author
-
T. Jarvinen, Perttu Salmela, Jarmo Takala, and Pekka Jääskeläinen
- Subjects
Assembly language ,Cycles per instruction ,business.industry ,Computer science ,Parallel algorithm ,Transport triggered architecture ,Microarchitecture ,Parallel language ,Software pipelining ,Computer architecture ,High-level programming language ,Embedded system ,business ,computer ,computer.programming_language - Abstract
Many telecommunication applications, especially baseband processing, and digital signal processing (DSP) applications call for high-performance implementations due to the complexity of algorithms and high throughput requirements. In general, the required performance is obtained with the aid of parallel computational resources. In these application domains, software implementations are often preferred over fixed-function ASICs due to the flexibility and ease of development. Application-specific instruction-set processor (ASIP) architectures can be used to exploit the inherent parallelism of the algorithms efficiently while still maintaining flexibility. Using high-level languages to program processor architectures with parallel resources can lead to inefficient resource utilization, while parallel assembly programming is error-prone and tedious. In this paper, the inherent problems of parallel programming and software pipelining are mitigated with a parallel language syntax and automatic generation of software-pipelined code for iteration kernels. With the aid of the developed tool support, the underlying performance of a processor architecture with parallel resources can be exploited, and full utilization of the main processing resources is obtained for pipelined loop kernels. The given examples show that efficiency can be obtained without reducing performance.
- Published
- 2006
- Full Text
- View/download PDF
35. HiLO: High Level Optimization of FFTs
- Author
-
David Padua and Nick Rizzolo
- Subjects
Software pipelining ,Computer science ,Fast Fourier transform ,Process (computing) ,Code (cryptography) ,Program transformation ,Compiler ,Parallel computing ,Program optimization ,computer.software_genre ,Abstract syntax tree ,computer - Abstract
As computing platforms become more and more complex, the task of optimizing performance critical codes becomes more challenging. Recently, more attention has been focused on automating this optimization process by making aggressive assumptions about the algorithm. Motivated by these trends, this paper presents HiLO, the high-level optimizer for FFT codes. HiLO blends traditional optimization techniques into an optimization strategy tailored to the needs of FFT codes and outputs C code ready to be further optimized by the native compiler. It has already been shown that such high-level transformations are important to coax the native compiler into doing its job well. HiLO provides a more appropriate platform for researching these phenomena, suggests an optimization strategy that improves on previous approaches, and shows that even software pipelining at the C level can improve the final binary's performance.
- Published
- 2005
- Full Text
- View/download PDF
36. Hardware Support for Multithreaded Execution of Loops with Limited Parallelism
- Author
-
Georgios Dimitriou and Constantine D. Polychronopoulos
- Subjects
Schedule ,business.industry ,Computer science ,Re-order buffer ,Processor scheduling ,Thread (computing) ,Parallel computing ,Scheduling (computing) ,Shared resource ,Software pipelining ,Loop scheduling ,Macro ,business ,Computer hardware ,Context switch - Abstract
Loop scheduling on multithreaded processors differs significantly from loop scheduling on other parallel processors. The sharing of hardware resources imposes new scheduling limitations, but it also allows faster communication across threads. We present a multithreaded processor model, Coral 2000, with hardware extensions that support Macro Software Pipelining, a loop scheduling technique for multithreaded processors. We tested and evaluated Coral 2000 on a cycle-level simulator, using synthetic and integer SPEC benchmarks. We obtained speedups of up to 30% with respect to highly optimized superblock-based schedules on loops that exhibit limited parallelism.
- Published
- 2005
- Full Text
- View/download PDF
37. Parallelism Improvements of Software Pipelining by Combining Spilling with Rematerialization
- Author
-
Hiroaki Ogi, Naohiro Ishii, Kazunori Iwata, and Tsubasa Mochizuki
- Subjects
Speedup ,Rematerialization ,Computer science ,Pipeline (computing) ,Parallel computing ,computer.software_genre ,Scheduling (computing) ,Software pipelining ,Loop scheduling ,Computer architecture ,Very long instruction word ,Compiler ,Instruction-level parallelism ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer - Abstract
On instruction-level parallel architectures such as EPIC and VLIW machines, performance is strongly affected by compiler techniques, and integrated optimization techniques have been studied to develop this parallelism. In this paper, we develop a software pipelining technique that improves parallel processing on these machine structures. Software pipelining is a loop scheduling technique that overlaps the execution of several consecutive iterations of a program loop; consequently, many registers are needed to realize it. Here, spill code and rematerialization are incorporated into the pipeline scheduling. Experimental results of the proposed method are compared with conventional methods, and show improved speedup of parallel processing on the benchmarks.
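The trade-off the abstract combines can be sketched as a cost comparison: rematerialization recomputes a value from its defining instruction instead of storing and reloading it, so it wins whenever recomputation is cheaper than a store/load pair. A toy chooser (the cost fields and the greedy policy are illustrative assumptions, not the paper's algorithm):

```python
def relieve_pressure(live_values, num_regs):
    """Greedily evict values until register pressure fits, preferring
    whichever of spilling or rematerializing is cheaper per value.
    live_values: dicts with 'name', 'spill_cost', 'remat_cost'."""
    actions = []
    # Evict the cheapest-to-restore values first.
    by_cost = sorted(live_values,
                     key=lambda v: min(v['spill_cost'], v['remat_cost']))
    excess = len(live_values) - num_regs
    for v in by_cost[:max(excess, 0)]:
        kind = 'remat' if v['remat_cost'] < v['spill_cost'] else 'spill'
        actions.append((v['name'], kind))
    return actions
```

In a software-pipelined loop the payoff is doubled: every avoided load is an instruction removed from a kernel that executes once per iteration.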
- Published
- 2005
- Full Text
- View/download PDF
38. Coping with Data Dependencies of Multi-dimensional Array References
- Author
-
Lin Qiao, Weitong Huang, and Zhizhong Tang
- Subjects
Multidimensional analysis ,Software pipelining ,Linear system ,Parallel algorithm ,Static analysis ,Dependence analysis ,Nested loop join ,Lambda ,Algorithm ,Mathematics - Abstract
This paper presents a new static data dependence analysis approach, the Dependence Difference Inequality Test, which can deal with coupled subscripts in multi-dimensional array references for software pipelining techniques for nested loops. The Dependence Difference Inequality Test (DDIT) replaces direction vectors with dependence difference inequalities as constraints on the variables of a linear system. The method presented in this paper extends the applicable range of the Generalized Lambda Test and appears to be a practical scheme for analyzing data dependence. Experimental results show that the number of data independences detected by the DDIT algorithm is only slightly smaller than the number found manually. It is also shown that our method is better than other traditional data dependence analysis methods without increasing time cost: it increases the success rate of the Generalized Lambda Test by approximately 14.19%.
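The DDIT itself is specific to the paper, but the flavor of static dependence testing it extends can be illustrated with the classic GCD test, a quick necessary condition for a dependence between subscripts a*i and b*j + c (a stand-in example, not the DDIT algorithm):

```python
from math import gcd

def gcd_dependence_test(a, b, c):
    """Necessary condition for A[a*i] and A[b*j + c] to touch the same
    element: a*i - b*j = c has an integer solution iff gcd(a, b)
    divides c. Returning False proves independence; True is
    inconclusive and needs a sharper test (such as the DDIT)."""
    return c % gcd(a, b) == 0
```

Tests like this one ignore loop bounds and coupling between subscripts; the inequality-based constraints described in the abstract exist precisely to tighten such conservative answers.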
- Published
- 2005
- Full Text
- View/download PDF
39. A Static Data Dependence Analysis Approach for Software Pipelining
- Author
-
Lin Qiao, Zhizhong Tang, and Weitong Huang
- Subjects
Speedup ,Software pipelining ,Program analysis ,Computer science ,Linear system ,Parallel algorithm ,Dependence analysis ,Instruction-level parallelism ,Algorithm ,Computer Science::Databases ,Loop dependence analysis - Abstract
This paper introduces a new static data dependence constraint, called the dependence difference inequality, which can deal with coupled subscripts in multi-dimensional array references. Unlike direction vectors, dependence difference inequalities are related not only to the iteration space of a loop program but also to the operation distance between two operations. They are stricter than other methods, and can act as additional constraints on each variable in a linear system, on their own or combined with others. As a result, the solution space of the linear system can be compressed heavily. So long as the dependence difference inequalities cannot be satisfied simultaneously, the loop can be software-pipelined with any initiation interval even if there exists a data dependence between two operations. Meanwhile, by replacing direction vectors with dependence difference inequalities, some conservative estimations made by other traditional data dependence analysis approaches can be eliminated.
- Published
- 2005
- Full Text
- View/download PDF
40. Self-loop Pipelining and Reconfigurable Dataflow Arrays
- Author
-
João M. P. Cardoso
- Subjects
Computer science ,Dataflow ,business.industry ,Pipeline (computing) ,Control reconfiguration ,Parallel computing ,Reconfigurable computing ,Data flow diagram ,Imperative programming ,Software ,Software pipelining ,Computer architecture ,business ,Software architecture - Abstract
This paper presents some interesting concepts of static dataflow machines that can be used by reconfigurable computing architectures. We introduce some data-driven reconfigurable arrays and summarize techniques to map imperative software programs to those architectures, some of which are the focus of current research work. In particular, we briefly present a novel technique for pipelining loops. Experiments with the technique confirm important improvements over conventional loop pipelining. Hence, the technique proves to be an efficient approach for mapping loops to coarse-grained reconfigurable architectures employing a static dataflow computational model.
- Published
- 2004
- Full Text
- View/download PDF
41. Increasing Software-Pipelined Loops in the Itanium-Like Architecture
- Author
-
Yu Chen, Haibo Lin, Zhizhong Tang, and Wenlong Li
- Subjects
business.industry ,Processor register ,Computer science ,Parallel computing ,Loop variant ,computer.software_genre ,Software ,Software pipelining ,Explicitly parallel instruction computing ,Itanium ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,business ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer ,Register allocation - Abstract
The Itanium architecture (IPF) contains features such as register rotation to support efficient software pipelining. One of the drawbacks of software pipelining is its high register requirement, which may lead to failure when the registers provided by the architecture are insufficient. This paper evaluates the register requirements of software-pipelined loops in the Itanium architecture and presents two new methods that try to reduce static general register requirements in software-pipelined loops, either by reducing instructions in the loop body or by allocating unused rotating registers for variants that use static registers. We have implemented our methods in the Open Research Compiler (ORC) targeted for Itanium processors, and experiments show that the number of software-pipelined loops in the NAS benchmarks increases. For some benchmarks, performance is improved by more than 18%.
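Register rotation exists so that a value defined in one kernel iteration is not clobbered by the next iteration's definition before its last use. On machines without it, the classic fallback is modulo variable expansion: unroll the kernel enough times that each value's lifetime fits, then rename per copy. A minimal sketch of the unroll-factor computation (the lifetime encoding is illustrative, and MVE is the general technique rather than anything specific to this paper):

```python
from math import ceil

def mve_unroll_factor(lifetimes, ii):
    """Modulo variable expansion: without rotating registers, the kernel
    must be unrolled ceil(lifetime / II) times for the longest-lived
    value, so each in-flight copy gets its own register name.
    lifetimes: (start, end) half-open cycle ranges."""
    return max((ceil((end - start) / ii) for start, end in lifetimes),
               default=1)
```

Hardware rotation makes this unrolling, and its code-size cost, unnecessary, which is why exhausting the rotating file (the failure mode discussed above) is so costly.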
- Published
- 2004
- Full Text
- View/download PDF
42. Minimizing Variables’ Lifetime in Loop-Intensive Applications
- Author
-
Noureddine Chabini and Wayne Wolf
- Subjects
Polynomial ,Mathematical optimization ,Variable (computer science) ,Software pipelining ,Computer science ,Context (language use) ,Nested loop join ,Retiming ,Time complexity ,Algorithm - Abstract
In this paper, we address a set of research problems regarding loop-intensive applications. The first problem consists of minimizing variables' lifetimes under timing constraints. Any variable that is alive for more than one iteration must be saved, for instance by storing it into a register. The second problem is derived from the first one, and consists of balancing variables' lifetimes for a target total number of registers under timing constraints. For the third problem, we focus on minimizing variables' lifetimes in the context of software pipelining, in the case of a single loop as well as nested loops. We provide methods to solve these three problems, and show that a subset of them have polynomial run-time. Once these methods are used, one may need to solve the problem of generating the transformed code and reducing its size; we provide algorithms to solve this fourth problem. Designers face these problems during hardware-software co-design, and in designing embedded systems as well as systems-on-chip. Solving these problems is also useful in low-power design. We exercise some of these methods on known benchmarks, and provide numerical results that show their effectiveness. We solve these problems using techniques related to retiming, an algorithm originally developed for hardware optimization.
- Published
- 2003
- Full Text
- View/download PDF
43. Tailoring Software Pipelining for Effective Exploitation of Zero Overhead Loop Buffer
- Author
-
Gang-Ryung Uh
- Subjects
Computer science ,Loop inversion ,CPU cache ,Pipeline (computing) ,Register renaming ,Parallel computing ,computer.software_genre ,Instruction selection ,Execution time ,Loop fission ,Software pipelining ,Overhead (computing) ,Compiler ,Cache ,Instruction-level parallelism ,computer - Abstract
A Zero Overhead Loop Buffer (ZOLB) is an architectural feature that is commonly found in DSPs (Digital Signal Processors). This buffer can be viewed as a compiler (or program) managed cache that can hold a limited number of instructions, which will be executed a specified number of times without incurring any loop overhead. Preliminary versions of the research, which exploit a ZOLB, report significant improvement in execution time with a minimal code size increase [UH99,UH00]. This paper extends the previous compiler efforts to further exploit a ZOLB by employing a new software pipelining methodology. The proposed techniques choose complex instructions, which capitalize on instruction level parallelism across loop iteration boundaries. Unlike the traditional pipelining techniques, the proposed pipelining strategy is tightly coupled with instruction selection so that it can perform register renaming and/or proactively generate additional instruction(s) on the fly to discover more loop parallelism on the ZOLB. This framework reports additional significant improvements in execution time with modest code size increases for various signal processing applications on the DSP16000.
- Published
- 2003
- Full Text
- View/download PDF
44. MIRS: Modulo Scheduling with Integrated Register Spilling
- Author
-
Eduard Ayguadé, Mateo Valero, Javier Zalamea, and Josep Llosa
- Subjects
Computer science ,Backtracking ,Modulo ,Instruction scheduling ,Parallel computing ,computer.software_genre ,Scheduling (computing) ,Software pipelining ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Instruction-level parallelism ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer ,Register allocation - Abstract
The overlapping of loop iterations in software pipelining techniques imposes high register requirements. The schedule for a loop is valid if it requires at most the number of registers available in the target architecture. Otherwise its register requirements have to be reduced by spilling registers to memory. Previous proposals for spilling in software pipelined loops require a two-step process. The first step performs the actual instruction scheduling without register constraints. The second step adds (if required) spill code and reschedules the modified loop. The process is repeated until a valid schedule, requiring no more registers than those available, is found. The paper presents MIRS (Modulo scheduling with Integrated Register Spilling), a novel register-constrained modulo scheduler that performs modulo scheduling and register spilling simultaneously in a single step. The algorithm is iterative and uses backtracking to undo previous scheduling decisions whenever resource or dependence conflicts appear. MIRS is compared against a state-of-the-art two-step approach already described in the literature. For this purpose, a workbench composed of a large set of loops from the Perfect Club and a set of processor configurations are used. On the average, for the loops that require spill code a speed-up in the range 14-31% and a reduction of the memory traffic by a factor in the range 0.90-0.72 are achieved.
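The baseline that MIRS extends is Rau-style iterative modulo scheduling: try candidate IIs from the minimum upward, placing operations into a modulo reservation table. The following toy scheduler (no backtracking and no spill insertion, both of which MIRS adds, and the operation encoding is an assumption) shows the core loop:

```python
def modulo_schedule(ops, deps, slots_per_cycle, max_ii=32):
    """Toy modulo scheduler: for each candidate II, place ops (given in
    dependence order) at the earliest cycle that satisfies scheduled
    predecessors and has a free issue slot at cycle mod II.
    deps: (src, dst, latency, distance) tuples."""
    preds = {op: [] for op in ops}
    for u, v, lat, dist in deps:
        preds[v].append((u, lat, dist))
    for ii in range(1, max_ii + 1):
        table = [0] * ii                 # issue slots used per modulo cycle
        sched, ok = {}, True
        for op in ops:
            earliest = max([sched[u] + lat - ii * dist
                            for u, lat, dist in preds[op] if u in sched] + [0])
            for t in range(earliest, earliest + ii):   # scan one full II window
                if table[t % ii] < slots_per_cycle:
                    table[t % ii] += 1
                    sched[op] = t
                    break
            else:
                ok = False               # no free slot at this II
                break
        if ok:
            return ii, sched
    return None
```

With four single-slot operations and a 1-wide machine, the resource bound forces II = 4 regardless of dependences, which the sketch reproduces. Where this version simply gives up and bumps the II, MIRS instead backtracks or inserts spill code at the same II.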
- Published
- 2003
- Full Text
- View/download PDF
45. A Model for Hardware Realization of Kernel Loops
- Author
-
Jirong Liao, Weng-Fai Wong, and Tulika Mitra
- Subjects
Loop unrolling ,Loop optimization ,business.industry ,Computer science ,Design space exploration ,Loop fusion ,Parallel computing ,computer.software_genre ,Software pipelining ,Kernel (statistics) ,Compiler ,business ,computer ,Realization (systems) ,Computer hardware - Abstract
Hardware realization of kernel loops holds the promise of accelerating the overall application performance and is therefore an important part of the synthesis process. In this paper, we consider two important loop optimization techniques, namely loop unrolling and software pipelining that can impact the performance and cost of the synthesized hardware. We propose a novel model that accounts for various characteristics of a loop, including dependencies, parallelism and resource requirement, as well as certain high level constraints of the implementation platform. Using this model, we are able to deduce the optimal unroll factor and technique for achieving the best performance given a fixed resource budget. The model was verified using a compiler-based FPGA synthesis framework on a number of kernel loops. We believe that our model is general and applicable to other synthesis frameworks, and will help reduce the time for design space exploration.
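The kind of model described can be caricatured as maximizing iteration throughput per unit of hardware under a fixed resource budget; the paper's model additionally accounts for dependences and parallelism, but a toy version (all names and costs here are assumptions) shows the shape of the search for an optimal unroll factor:

```python
def best_unroll_factor(res_per_iter, fixed_overhead, budget, max_unroll=64):
    """Pick the unroll factor u maximizing iterations per resource unit,
    u / (fixed_overhead + u * res_per_iter), subject to a budget on
    total resources consumed by the synthesized loop hardware."""
    best_rate, best_u = 0.0, 1
    for u in range(1, max_unroll + 1):
        cost = fixed_overhead + u * res_per_iter
        if cost > budget:
            break                         # budget exhausted
        rate = u / cost
        if rate > best_rate:
            best_rate, best_u = rate, u
    return best_u
```

Under this toy model the fixed overhead (control logic, pipeline registers) is amortized over more iterations as u grows, so the best factor is the largest one that fits the budget; a richer model with dependence limits would flatten the curve earlier.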
- Published
- 2003
- Full Text
- View/download PDF
46. Overcoming Static Register Pressure for Software Pipelining in the Itanium Architecture
- Author
-
Haibo Lin, Zhizhong Tang, and Wenlong Li
- Subjects
Scheme (programming language) ,Software pipelining ,Computer science ,Explicitly parallel instruction computing ,Pattern recognition (psychology) ,Parallel algorithm ,Itanium ,Parallel computing ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer ,computer.programming_language ,Register allocation - Abstract
Software pipelining techniques have been shown to significantly improve the performance of loop-intensive scientific programs. The Itanium architecture contains many features to enhance parallel execution, such as support for efficient software pipelining of loops. The drawback of software pipelining is its high register requirement, which may lead to software pipelining failure due to the limited number of static general registers in Itanium. This paper evaluates the register requirements of software-pipelined loops. It then presents a novel register allocation scheme, which allocates stacked registers to serve as static registers. Experimental results show that this method gains an average 2.4% improvement, and a peak 18% improvement, in performance on the NAS benchmarks.
- Published
- 2003
- Full Text
- View/download PDF
47. A First Step Towards Time Optimal Software Pipelining of Loops with Control Flows
- Author
-
Soo-Mook Moon, Jihong Kim, and Han-Saem Yun
- Subjects
Flow control (data) ,Basis (linear algebra) ,Computer science ,business.industry ,Pipeline (computing) ,Software development ,Parallel computing ,computer.software_genre ,Software pipelining ,Problem of time ,Control system ,Compiler ,business ,computer - Abstract
We address the problem of time optimal software pipelining of loops with control flows, one of the most difficult open problems in the area of parallelizing compilers. We present a necessary condition for loops with control flows to have equivalent time optimal programs, generalizing the result by Schwiegelshohn et al., which has been the most significant theoretical result on the problem. As part of the formal treatment of the problem, we propose a new formalization of software pipelining, which provides a basis of our proof as well as a new theoretical framework for software pipelining research. Being the first generalized result on the problem, our work described in this paper forms an important first step towards time optimal software pipelining.
- Published
- 2001
- Full Text
- View/download PDF
48. Global Software Pipelining with Iteration Preselection
- Author
-
David Gregg
- Subjects
Difficult problem ,Software pipelining ,Computer science ,Global software ,Parallel computing ,Heuristics ,Instruction-level parallelism ,Code growth ,Scheduling (computing) - Abstract
Software pipelining loops containing multiple paths is a very difficult problem. Loop shifting offers the possibility of a close to optimal schedule with acceptable code growth. Deciding how often to shift each operation is difficult, and existing heuristics are rather ad hoc. We separate loop shifting from scheduling, and present new, non-greedy heuristics. Experimental results show that our approach yields better performance and less code growth.
- Published
- 2000
- Full Text
- View/download PDF
49. Pipelining Wavefront Computations: Experiences and Performance
- Author
-
E. Christopher Lewis and Lawrence Snyder
- Subjects
Stencil code ,Computer science ,Pipeline (computing) ,Message passing ,Message Passing Interface ,Parallel computing ,Program optimization ,computer.software_genre ,Parallel language ,Automatic parallelization ,Software pipelining ,Compiler ,Programmer ,computer - Abstract
Wavefront computations are common in scientific applications. Although it is well understood how wavefronts are pipelined for parallel execution, the question remains: How are they best presented to the compiler for the effective generation of pipelined code? We address this question through a quantitative and qualitative study of three approaches to expressing pipelining: programmer implemented via message passing, compiler discovered via automatic parallelization, and programmer defined via explicit parallel language features for pipelining. This work is the first assessment of the efficacy of these approaches in solving wavefront computations, and in the process, we reveal surprising characteristics of commercial compilers. We also demonstrate that a parallel language-level solution simplifies development and consistently performs well.
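A wavefront computation sweeps a grid in which each cell depends on already-computed neighbours, so all cells along one anti-diagonal are mutually independent; pipelined parallelizations assign rows or blocks to processors that start as soon as their upstream neighbour produces a value. A sequential sketch of the dependence pattern (the max-plus recurrence is just an example kernel, not from the paper):

```python
def wavefront(grid):
    """Sweep anti-diagonals of a 2D recurrence in which each cell depends
    on its north and west neighbours; all cells on one anti-diagonal are
    independent, which is what a pipelined parallelization exploits."""
    n, m = len(grid), len(grid[0])
    out = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                    # anti-diagonal index
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            north = out[i - 1][j] if i > 0 else 0
            west = out[i][j - 1] if j > 0 else 0
            out[i][j] = grid[i][j] + max(north, west)
    return out
```

The three approaches compared in the paper differ only in who makes this anti-diagonal structure explicit: the programmer (via message passing), the compiler (via parallelization), or the language (via pipelining constructs).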
- Published
- 2000
- Full Text
- View/download PDF
50. Loop Shifting for Loop Compaction
- Author
-
Guillaume Huard and Alain Darte
- Subjects
For loop ,Loop unrolling ,Loop counter ,Loop inversion ,Computer science ,Loop fusion ,Do while loop ,Parallel computing ,Loop tiling ,Loop variant ,Loop fission ,Loop splitting ,Software pipelining ,Loop nest optimization ,Loop interchange ,While loop ,Inner loop ,Loop dependence analysis - Abstract
The idea of decomposed software pipelining is to decouple the software pipelining problem into a cyclic scheduling problem without resource constraints and an acyclic scheduling problem with resource constraints. In terms of loop transformation and code motion, the technique can be formulated as a combination of loop shifting and loop compaction. Loop shifting amounts to moving statements between iterations, thereby changing some loop-independent dependences into loop-carried dependences and vice versa. Then, loop compaction schedules the body of the loop considering only loop-independent dependences, but taking into account the details of the target architecture. In this paper, we show how loop shifting can be optimized so as to minimize both the length of the critical path and the number of dependences for loop compaction. Both problems (and their combination) are polynomially solvable with fast graph algorithms: the first by using an algorithm due to Leiserson and Saxe, the second by designing a variant of minimum-cost flow algorithms.
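Loop shifting is retiming in the Leiserson-Saxe sense: assigning each statement a shift r(v) moves it across iteration boundaries, and an edge (u, v) with dependence distance w gets retimed distance w + r(v) - r(u). The shift is legal when every retimed distance stays non-negative, and loop compaction then only has to respect the edges whose retimed distance is zero. A small checker (the edge encoding is illustrative):

```python
def retime(edges, r):
    """Apply a retiming r (statement -> shift count) to dependence edges
    (src, dst, distance). Returns the retimed edges and whether the
    retiming is legal, i.e. all retimed distances are non-negative."""
    retimed = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
    return retimed, all(w >= 0 for _, _, w in retimed)
```

The optimization problem the paper solves is choosing r so that as many distances as possible become positive (removing them from compaction's critical path) while none goes negative.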
- Published
- 2000
- Full Text
- View/download PDF