82 results for "Software pipelining"
Search Results
2. Instruction Re-selection for Iterative Modulo Scheduling on High Performance Multi-issue DSPs
- Author
-
Cho, Doosan, Ravi, Ayyagari, Uh, Gang-Ryung, Paek, Yunheung, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Zhou, Xiaobo, editor, Sokolsky, Oleg, editor, Yan, Lu, editor, Jung, Eun-Sun, editor, Shao, Zili, editor, Mu, Yi, editor, Lee, Dong Chun, editor, Kim, Dae Young, editor, Jeong, Young-Sik, editor, and Xu, Cheng-Zhong, editor
- Published
- 2006
- Full Text
- View/download PDF
3. MIRS: Modulo Scheduling with Integrated Register Spilling
- Author
-
Zalamea, Javier, Llosa, Josep, Ayguadé, Eduard, Valero, Mateo, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Dietz, Henry G., editor
- Published
- 2003
- Full Text
- View/download PDF
4. Aggregate operation movement: A min-cut approach to global code motion
- Author
-
Lo, Raymond, Chan, Sun, Dehnert, Jim, Towle, Ross, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Bougé, Luc, editor, Fraigniaud, Pierre, editor, Mignotte, Anne, editor, and Robert, Yves, editor
- Published
- 1996
- Full Text
- View/download PDF
5. A comparison of modulo scheduling techniques for software pipelining
- Author
-
Pfahler, Peter, Piepenbrock, Georg, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Gyimóthy, Tibor, editor
- Published
- 1996
- Full Text
- View/download PDF
6. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops
- Author
-
Wang, Jian, Gao, Guang R., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Gyimóthy, Tibor, editor
- Published
- 1996
- Full Text
- View/download PDF
7. Delayed exceptions — Speculative execution of trapping instructions
- Author
-
Ertl, M. Anton, Krall, Andreas, Goos, Gerhard, editor, Hartmanis, Juris, editor, and Fritzson, Peter A., editor
- Published
- 1994
- Full Text
- View/download PDF
8. Software pipelining with CGA and proposed intrinsics on a reconfigurable processor for HEVC decoders
- Author
-
Ahn, Yong-Jo, Yoo, Jonghun, Jo, Hyun-Ho, and Sim, Donggyu
- Published
- 2019
- Full Text
- View/download PDF
9. Optimizing microprograms for recurrent loops on pipelined architectures using timed Petri nets
- Author
-
Hanen, Claire, Goos, G., editor, Hartmanis, J., editor, Barstow, D., editor, Brauer, W., editor, Brinch Hansen, P., editor, Gries, D., editor, Luckham, D., editor, Moler, C., editor, Pnueli, A., editor, Seegmüller, G., editor, Stoer, J., editor, Wirth, N., editor, and Rozenberg, Grzegorz, editor
- Published
- 1990
- Full Text
- View/download PDF
10. Low Power Compiler Optimization for Pipelining Scaling
- Author
-
Jen-Chieh Chang, Cheng-Yu Lee, Chia-Jung Chen, and Rong-Guey Chang
- Subjects
Loop optimization, Computer science, Pipeline (computing), Optimizing compiler, Parallel computing, Energy consumption, Dynamic voltage scaling, Software pipelining, Embedded system, Profile-guided optimization, Compiler - Abstract
Low power consumption plays an increasingly important role in embedded systems. Lowering voltage and frequency is a straightforward and effective way to save power, so dynamic voltage scaling (DVS) has become a prevalent low-power technique. However, DVS has no effect on power saving once the voltage reaches its lower bound. Fortunately, a technique called dynamic pipeline scaling (DPS) can overcome this limitation by switching pipeline modes at low voltage levels. Previous work on DPS relied on hardware support; from the compiler's viewpoint, little has been done on this issue. This paper presents a DPS optimization technique applied at compile time to reduce power dissipation. Information about an application is exploited to devise an analytical model that assesses the cost of enabling the DPS mechanism. As a consequence, we can determine the switching timing between pipeline modes at compile time without significant run-time overhead. Experimental results show that our approach is effective in reducing energy consumption.
- Published
- 2013
- Full Text
- View/download PDF
11. Synthetic Aperture Radar Data Processing on an FPGA Multi-core System
- Author
-
Jørgen Dall, Anders Kusk, Pascal Schleuniger, and Sven Karlsson
- Subjects
Synthetic aperture radar, Data processing, Multi-core processor, Computer science, CAS latency, Network on a chip, Software pipelining, Radar imaging, Field-programmable gate array, Computer hardware - Abstract
Synthetic aperture radar (SAR) is a high-resolution imaging radar. The direct back-projection algorithm allows for precise SAR output image reconstruction and can compensate for deviations in the flight track of airborne radars. Graphics processing units (GPUs) are often used for data processing, as the back-projection algorithm is computationally expensive and highly parallel. However, GPUs may not be an appropriate solution for applications with strictly constrained space and power requirements. In this paper, we describe how we map a SAR direct back-projection application to a multi-core system on an FPGA. The fabric, consisting of 64 processor cores and a 2D mesh interconnect, utilizes 60% of the hardware resources of a Xilinx Virtex-7 device with 550 thousand logic cells and consumes about 10 watts. We apply software pipelining to hide memory latency and reduce the hardware footprint by 14%. We show that the system provides real-time processing of a SAR application that maps a 3000 m wide area at a resolution of 2x2 meters.
- Published
- 2013
- Full Text
- View/download PDF
12. Locality Improvement of Data-Parallel Adams–Bashforth Methods through Block-Based Pipelining of Time Steps
- Author
-
Matthias Korch
- Subjects
Software pipelining, CPU cache, Computer science, Pipeline (computing), Locality, Scalability, Initial value problem, Parallel computing, Supercomputer, Block (data storage), Linear multistep method - Abstract
Adams–Bashforth methods are a well-known class of explicit linear multi-step methods for the solution of initial value problems of ordinary differential equations. This article discusses data-parallel implementation variants with different loop structures and communication patterns and compares the resulting locality and scalability. In particular, pipelining of time steps is employed to improve the locality of memory references. The comparison is based on detailed runtime experiments performed on parallel computer systems with different architectures, including the two supercomputer systems JUROPA and HLRB II.
- Published
- 2012
- Full Text
- View/download PDF
13. Pipelining Software and Services for Language Processing
- Author
-
Arif Bramantoro, Ulrich Schäfer, and Toru Ishida
- Subjects
Language Grid, Computer science, Interoperability, Software development, Business Process Execution Language, Software pipelining, Software, Computer architecture, The Internet, Web service, Software engineering - Abstract
This chapter reports on our experiences with combining two different platforms in natural language processing research, Heart of Gold and the Language Grid, to make more language resources available on the Web. Heart of Gold is a middleware architecture for pipelining deep and shallow natural language processing components. The Language Grid is a service grid infrastructure built on top of the Internet to provide pipelined language services. Both frameworks provide composite language services and components. Integrating Heart of Gold into the Language Grid environment increases interoperability among various language services. The integrated architecture also supports combining pipelined language services in the Language Grid with the pipelined natural language processing components in Heart of Gold, providing better-quality language services on the Web. Thus, language services with different characteristics can be combined based on the concept of Web services, with different treatment of each combination. An evaluation shows that the overhead of the integration is not significant.
- Published
- 2011
- Full Text
- View/download PDF
14. Translation Validation of Loop Optimizations and Software Pipelining in the TVOC Framework
- Author
-
Benjamin Goldberg
- Subjects
Loop optimization, Transformation (function), Software pipelining, LOOP (programming language), Programming language, Computer science, Process (computing), Code (cryptography), Compiler, Translation (geometry) - Abstract
Translation validation (TV) is the process of proving that the execution of a translator has generated an output that is a correct translation of the input. When applied to optimizing compilers, TV is used to prove that the generated target code is a correct translation of the source program being compiled. This is in contrast to verifying a compiler, i.e. ensuring that the compiler will generate correct target code for every possible source program, which is generally a far more difficult endeavor. This paper reviews the TVOC framework developed by Amir Pnueli and his colleagues for translation validation for optimizing compilers, where the program being compiled undergoes substantial transformation for the purposes of optimization. The paper concludes with a discussion of how recent work on the TV of software pipelining by Tristan & Leroy can be incorporated into the TVOC framework.
- Published
- 2010
- Full Text
- View/download PDF
15. A Remote Mirroring Architecture with Adaptively Cooperative Pipelining
- Author
-
Bing Liu, Zhenhai Zhao, Xiaoguang Liu, Tingting Qin, Yongzhi Song, and Gang Wang
- Subjects
Computer science, Pipeline (computing), Parallel computing, Encryption, Software pipelining, Computer architecture, Backup, Internet Protocol, Architecture, Mirroring - Abstract
In recent years, remote mirroring technology has attracted increasing attention. In this paper, we present a novel adaptively cooperative pipelining model for remote mirroring systems. Unlike the traditional pipelining model, this new model takes the decentralization of processors into account and adopts an adaptive batching strategy to alleviate the imbalanced pipeline stages this decentralization causes. To relieve the heavy CPU load exerted by compression, encryption, the TCP/IP protocol stack, and so on, we design fine-grained pipelining, multi-threaded pipelining, and hybrid pipelining. We implement a remote mirroring prototype based on Linux LVM2. The experimental results show that the adaptively cooperative pipelining model effectively balances the primary and backup sites (the two stages of the pipeline), and that fine-grained, multi-threaded, and hybrid pipelining improve performance remarkably.
- Published
- 2010
- Full Text
- View/download PDF
16. Implementation and Evaluation of Fast Parallel Packet Filters on a Cell Processor
- Author
-
Masato Tsuru and Yoshiyuki Yamashita
- Subjects
Software pipelining, Software, Computer science, Filter (video), Network packet, Embedded system, Network processor, Single-core, Program optimization, Virtual network - Abstract
Packet filters are essential in most areas of recent information network technologies. While expensive high-end routers and firewalls are implemented in hardware, flexible and cost-effective ones are usually software solutions running on general-purpose CPUs, at the cost of performance. The authors have studied methods of applying code optimization techniques to packet filters executing on a single-core processor. In this paper, by utilizing the multi-core Cell Broadband Engine processor with software pipelining, we construct a parallelized and SIMDized packet filter 40 times faster than the naive C program filter executed on a single core.
- Published
- 2010
- Full Text
- View/download PDF
17. Concurrent Separation Logic for Pipelined Parallelization
- Author
-
Christian J. Bell, Andrew W. Appel, and David Walker
- Subjects
Multi-core processor, Software pipelining, Computer architecture, Shared memory, Computer science, Separation logic, Parallel computing, Compiler, Hoare logic, Operational semantics - Abstract
Recent innovations in automatic parallelizing compilers are showing impressive speedups on multicore processors using shared memory with asynchronous channels. We have formulated an operational semantics and proved sound a concurrent separation logic to reason about multithreaded programs that communicate asynchronously through channels and share memory. Our logic supports shared channel endpoints (multiple producers and consumers) and introduces histories to overcome limitations with local reasoning. We demonstrate how to transform a sequential proof into a parallelized proof that targets the output of the parallelizing optimization DSWP (Decoupled Software Pipelining).
- Published
- 2010
- Full Text
- View/download PDF
18. Ultra High Throughput Implementations for MD5 Hash Algorithm on FPGA
- Author
-
Liehui Jiang, Qiuxia Zhao, Yi Shao, and Yuliang Wang
- Subjects
Loop unrolling, Software pipelining, Speedup, Computer science, Pipeline (computing), Stratix, Pentium, Parallel computing, Field-programmable gate array, Throughput (business), AND gate - Abstract
This paper first presents a new architecture for MD5 that achieves the theoretical upper bound on throughput for the iterative architecture. Then, based on the proposed general architecture, it implements two further pipelined architectures, based on the iterative technique and the loop unrolling technique respectively. The latter, with 32-stage pipelining, reaches a throughput of 32.035 Gbps on an Altera Stratix II GX EP2SGX90FF FPGA, a 194x speedup over the Intel Pentium 4 3.0 processor. To the authors' knowledge, this is the fastest published FPGA-based design at the time of writing. Finally, the proposed designs are compared with other published MD5 designs and show clear advantages in both speed and logic requirements.
- Published
- 2010
- Full Text
- View/download PDF
19. MPSoC Design Using Application-Specific Architecturally Visible Communication
- Author
-
Philip Brisk, Paolo Ienne, Theo Kluter, and Edoardo Charbon
- Subjects
Speedup, CPU cache, Computer science, MESI protocol, MESIF protocol, Runtime system, Software pipelining, Embedded system, Cache, Memory safety - Abstract
This paper advocates the placement of Architecturally Visible Communication (AVC) buffers between adjacent cores in MPSoCs to provide high-throughput communication for streaming applications. Producer/consumer relationships map poorly onto cache-based MPSoCs. Instead, we instantiate application specific AVC buffers on top of a distributed consistent and coherent cache-based system with shared main memory to provide the desired functionality. Using JPEG compression as a case study, we show that the use of AVC buffers in conjunction with parallel execution via heterogeneous software pipelining provides a speedup of as much as 4.2x compared to a baseline single processor system, with an increase in estimated memory energy consumption of only 1.6x. Additionally, we describe a method to integrate the AVC buffers into the L1 cache coherence protocol; this allows the runtime system to guarantee memory safety and coherence in situations where the parallelization of the application may be unsafe due to pointers that could not be resolved at compile time.
- Published
- 2009
- Full Text
- View/download PDF
20. Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures
- Author
-
Alexander Fell, Ranjani Narayan, Keshavan Varadarajan, Mythri Alle, and S. K. Nandy
- Subjects
Reduction (complexity), Data flow diagram, Software pipelining, Computer architecture, Binary decision diagram, Computer science, Computation, Optimizing compiler, Compiler, Parallel computing, Data-flow analysis - Abstract
In this paper we develop compilation techniques for realizing applications described in a High Level Language (HLL) on a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps), subgraphs of an application's data flow graph comprising elementary operations with strong producer-consumer relationships. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On average, HyperOps offer a 44% reduction in total execution time and an 18% reduction in management overheads compared to using basic blocks as coarse grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.
- Published
- 2009
- Full Text
- View/download PDF
21. Deriving Efficient Data Movement from Decoupled Access/Execute Specifications
- Author
-
Anton Lokhmotov, Lee Howes, Paul H. J. Kelly, and Alastair F. Donaldson
- Subjects
Software pipelining ,Kernel (image processing) ,Computer science ,Computation ,Broadband ,Compiler ,Parallel computing ,Architecture ,computer.software_genre ,Programmer ,computer ,Implementation - Abstract
On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming the Cell Broadband Engine architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.
- Published
- 2009
- Full Text
- View/download PDF
22. Software Pipelining in Nested Loops with Prolog-Epilog Merging
- Author
-
Cohen, Albert and Fellahi, Mohammed (Architectures, Languages and Compilers to Harness the End of Moore Years (ALCHEMY); Laboratoire de Recherche en Informatique (LRI); Université Paris-Sud - Paris 11 (UP11); CentraleSupélec; Centre National de la Recherche Scientifique (CNRS); Inria Saclay - Ile de France), and Seznec, André, editor, Emer, Joel, editor, O'Boyle, Mike, editor, Martonosi, Margaret, editor, and Ungerer, Theo, editor
- Subjects
Loop unrolling, Loop inversion, Computer science, Pipeline (computing), Parallel computing, Scheduling (computing), Prolog, Software pipelining, Nested loop join, Inner loop - Abstract
Software pipelining (or modulo scheduling) is a powerful back-end optimization to exploit instruction and vector parallelism. Software pipelining is particularly popular for embedded devices, as it improves computation throughput without increasing the size of the inner loop kernel (unlike loop unrolling), a desirable property to minimize the amount of code in local memories or caches. Unfortunately, common media and signal processing codes exhibit series of low-trip-count inner loops. In this situation, software pipelining is often not an option: it incurs severe fill/drain time overheads and code size expansion due to nested prologs and epilogs. We propose a method to pipeline series of inner loops without increasing the size of the loop nest, apart from an outermost prolog and epilog. Our method achieves significant code size savings and allows pipelining of low-trip-count loops. These benefits come at the cost of additional scheduling constraints, leading to a linear optimization problem that trades memory usage for pipelining opportunities.
- Published
- 2009
- Full Text
- View/download PDF
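The prolog/kernel/epilog structure that the abstract above refers to can be made concrete with a small hand-pipelined loop. This is an illustrative sketch, not code from any of the listed papers: a modulo scheduler would produce this kind of stage overlap automatically, and the fill (prolog) and drain (epilog) code is exactly the overhead that grows when loops are nested.

```c
#include <assert.h>

/* Hand-pipelined copy-and-scale loop: the three stages (load, multiply,
 * store) belonging to iterations i, i-1, and i-2 overlap in each kernel
 * step, which is the steady state a modulo scheduler aims for. */
void scale2_pipelined(const int *a, int *b, int n)
{
    if (n < 2) {                      /* trip count too low to pipeline */
        for (int i = 0; i < n; i++)
            b[i] = a[i] * 2;
        return;
    }
    int load, mul;
    /* prolog: fill the pipeline */
    load = a[0];
    mul  = load * 2;
    load = a[1];
    /* kernel: steady state, one result stored per step */
    for (int i = 2; i < n; i++) {
        b[i - 2] = mul;      /* store stage (iteration i-2) */
        mul  = load * 2;     /* multiply stage (iteration i-1) */
        load = a[i];         /* load stage (iteration i) */
    }
    /* epilog: drain the pipeline */
    b[n - 2] = mul;
    b[n - 1] = load * 2;
}
```

For a low-trip-count loop, the prolog and epilog here are almost as large as the kernel, which illustrates why the paper merges prologs and epilogs across a series of inner loops.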
23. Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies
- Author
-
William J. Dally and Abhishek Das
- Subjects
Software pipelining, Memory hierarchy, Computer science, Distributed computing, Computation, Parallel algorithm, Interleaved memory, Parallel computing, Memory map, Scheduling (computing) - Abstract
With the emergence of streaming and multi-core architectures, there is an increasing demand to map parallel algorithms efficiently across all architectures. This paper describes a platform-independent optimization framework called Stream Scheduling that orchestrates parallel execution of bulk computations and data transfers and allocates storage at multiple levels of a memory hierarchy. By adjusting block sizes and applying software pipelining to bulk operations, it ensures that the computation-to-communication ratio is maximized at each level. We evaluate our framework on a diverse set of Sequoia applications, targeting systems with different memory hierarchies: a Cell blade, a distributed-memory cluster, and the Cell blade attached to a disk.
- Published
- 2008
- Full Text
- View/download PDF
24. Pipelined Java Virtual Machine Interpreters
- Author
-
Lex Augusteijn and Jan Hoogerbrugge
- Subjects
Source code, Programming language, Computer science, Parallel computing, Bytecode, Software pipelining, Threaded code, Very long instruction word, Compiler, Rewriting, Interpreter - Abstract
The performance of a Java Virtual Machine (JVM) interpreter running on a very long instruction word (VLIW) processor can be improved by means of pipelining: while one bytecode is in its execute stage, the next bytecode is in its decode stage, and the one after that is in its fetch stage. The paper describes how we implemented threading and pipelining by rewriting the interpreter source code and making several modifications to the compiler. Experiments for evaluating the effectiveness of pipelining are described. Pipelining improves the execution speed of a threaded interpreter by 19.4% in terms of instruction count and 14.4% in terms of cycle count. Most of the simple bytecodes, like additions and multiplications, execute in four cycles. This number corresponds to the branch latency of our target VLIW processor; thus most of the interpreter's code is executed in branch delay slots.
- Published
- 2007
- Full Text
- View/download PDF
25. Register Allocation and Optimal Spill Code Scheduling in Software Pipelined Loops Using 0-1 Integer Linear Programming Formulation
- Author
-
Santosh Nagarakatte and Ramaswamy Govindarajan
- Subjects
Linear programming, Computer science, Parallel computing, Scheduling (computing), Software, Software pipelining, Code generation, Heuristics, Instruction-level parallelism, Register allocation - Abstract
In achieving higher instruction-level parallelism, software pipelining increases the register pressure in the loop. The usefulness of the generated schedule may be restricted to cases where the register pressure is less than the available number of registers; otherwise, spill instructions need to be introduced. But scheduling these spill instructions in the compact schedule is a difficult task. Several heuristics have been proposed to schedule spill code; they may generate more spill code than necessary, and scheduling it may necessitate increasing the initiation interval. We model the problem of register allocation with spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. The formulation minimizes the increase in initiation interval (II) by optimally placing spill code and simultaneously minimizes the amount of spill code produced. To the best of our knowledge, this is the first integrated formulation for register allocation, optimal spill code generation, and scheduling for software pipelined loops. The proposed formulation performs better than the existing heuristics, preventing an increase in II in 11.11% of the loops and generating 18.48% less spill code on average among the loops extracted from the Perfect Club and SPEC benchmarks, with a moderate increase in compilation time.
- Published
- 2007
- Full Text
- View/download PDF
26. Software Pipelining for Packet Filters
- Author
-
Yoshiyuki Yamashita and Masato Tsuru
- Subjects
Computer science, Network packet, Packet generator, Parallel computing, Scheduling (computing), Software pipelining, Computer engineering, Basic block, Packet analyzer, Compiler, Fast packet switching, Processing delay - Abstract
Packet filters play an essential role in traffic management and security management on the Internet. To create software-based packet filters fast enough to work even under a DoS attack, it is vital to effectively combine higher-level optimization of the algorithmic structure with lower-level acceleration techniques from compiler research. In this paper, we focus on the lower-level (machine code) optimization using software pipelining, and report experimental results that indicate the potential of our approach for accelerating packet filter performance. The technical difficulty is that a packet filter is a lump of conditional branches, so standard optimization techniques usually applied to basic blocks are not directly applicable. Using predicated execution and enhanced modulo scheduling, we solve this problem and achieve 20 times higher performance compared with a conventional interpreter-based packet filter. We also compare the proposed filters with compiler-based packet filters and obtain a better than two-fold increase in performance.
- Published
- 2007
- Full Text
- View/download PDF
27. Loop Striping: Maximize Parallelism for Nested Loops
- Author
-
Edwin H.-M. Sha, Meilin Liu, Zili Shao, Meikang Qiu, and Chun Xue
- Subjects
For loop, Loop counter, Recursion, Loop inversion, Computer science, Loop fusion, Parallel computing, Loop tiling, Loop fission, Loop splitting, Software pipelining, Loop nest optimization, Loop interchange, Nested loop join, Algorithm - Abstract
The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase parallelism in these nested loops. Most existing loop transformation techniques either cannot achieve maximum parallelism, or achieve it only with complicated loop bounds and loop index calculations. This paper proposes a new technique, loop striping, that can maximize parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where a stripe is a group of iterations that are all independent and can be executed in parallel. Theorems and efficient algorithms are proposed for loop striping transformations. The experimental results show that loop striping always achieves a better iteration period than software pipelining and loop unfolding, improving the average iteration period by 50% and 54% respectively.
- Published
- 2006
- Full Text
- View/download PDF
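The abstract above defines a stripe as a group of mutually independent iterations. As an illustrative sketch only (the recurrence, the anti-diagonal grouping, and the `run_stripes` name are our assumptions, not the paper's algorithm), a 2D recurrence shows the idea: all cells on one anti-diagonal depend only on earlier diagonals, so each diagonal forms a stripe whose cells could execute in parallel.

```c
#include <assert.h>

#define N 4

/* For the recurrence a[i][j] = a[i-1][j] + a[i][j-1], cells with the same
 * i + j are independent of each other: each such anti-diagonal is a
 * "stripe". Stripes run in order; cells within a stripe could run in
 * parallel (here they are simply iterated sequentially). */
void run_stripes(int a[N][N])
{
    for (int d = 2; d <= 2 * (N - 1); d++)   /* one stripe per diagonal */
        for (int i = 1; i < N; i++) {        /* cells within the stripe */
            int j = d - i;
            if (j >= 1 && j < N)
                a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
}
```

With borders initialized to 1, each interior cell ends up counting lattice paths, so the result is easy to check against the binomial coefficients C(i+j, i).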
28. Instruction Re-selection for Iterative Modulo Scheduling on High Performance Multi-issue DSPs
- Author
-
Yunheung Paek, Gang-Ryung Uh, Doosan Cho, and Ayyagari Ravi
- Subjects
Loop unrolling, Idle, Software pipelining, Computer science, Iterative method, Trace scheduling, Modulo, Compiler, Parallel computing, Dependence analysis, Scheduling (computing) - Abstract
Iterative modulo scheduling is very important for compilers targeting high performance multi-issue digital signal processors, because these processors are often severely limited by idle functional units, so reducing idle units can have a significant impact on performance. However, complex instructions used in most recent DSPs, such as mac, usually increase data dependence complexity, and the complex dependencies in signal processing applications often restrict modulo scheduling freedom and therefore become a limiting factor for the iterative modulo scheduler. In this work, we propose a technique that efficiently re-selects instructions of an application's loop code in light of dependence complexity, directly resolving the dependence constraint. It is specifically aimed at accelerating software pipelining performance by minimizing the length of intrinsic cyclic dependencies. A few existing compilers support a loop-unrolling-based dependence relaxation technique to this end, but only for some limited cases, mainly because loop unrolling typically incurs a huge code size overhead, and iterative modulo scheduling with relaxed dependences in the general case is an NP-hard problem requiring complex assignments of registers and functional units. Our technique uses a heuristic to handle this problem efficiently in a pre-stage of iterative modulo scheduling, without loop unrolling.
- Published
- 2006
- Full Text
- View/download PDF
29. SCAN: A Heuristic for Near-Optimal Software Pipelining
- Author
-
F. Blachot, Guillaume Huard, and Benoît Dupont de Dinechin
- Subjects
Computer science, Modulo, Optimizing compiler, Parallel computing, Scheduling (computing), Software pipelining, Software, Very long instruction word, Compiler, Heuristics, Integer programming - Abstract
Software pipelining is a classic compiler optimization that improves the performance of inner loops on instruction-level parallel processors. In the context of embedded computing, applications are compiled prior to manufacturing the system, so it is possible to invest large amounts of time in compiler optimizations. Traditionally, software pipelining is performed by heuristics such as iterative modulo scheduling. Optimal software pipelining can be formulated as an integer linear program; however, these formulations can take exponential time to solve. As a result, the size of loops that can be optimally software pipelined is quite limited. In this article, we present the SCAN heuristic, which makes it possible to benefit from the integer linear programming formulations of software pipelining even on loops of significant size. The principle of the SCAN heuristic is to iteratively constrain the software pipelining problem until the integer linear programming formulation is solvable in reasonable time. We applied the SCAN heuristic to a multimedia benchmark for the ST200 VLIW processor. We show that it almost always computes an optimal solution for loops that are intractable by classic integer linear programming approaches. This improves performance by up to 33.3% over the heuristic modulo scheduling of the production ST200 compiler.
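The SCAN principle, tightening the problem until the exact formulation becomes tractable, can be sketched independently of any ILP solver. In this illustrative driver (the names and the window-narrowing policy are assumptions, not the paper's actual algorithm), each operation's issue time is confined to a window around a heuristic schedule, and the window shrinks until the exact solver finishes within its time budget:

```python
def scan_driver(solve_exact, heuristic_schedule, max_window):
    """Try progressively tighter per-operation time windows around a
    heuristic schedule; return the first exact solution found, falling
    back to the heuristic schedule if every attempt times out.
    solve_exact(windows) is expected to return None on timeout."""
    for w in range(max_window, -1, -1):       # loosest (most optimal) first
        windows = {op: (t - w, t + w) for op, t in heuristic_schedule.items()}
        solution = solve_exact(windows)
        if solution is not None:
            return solution
    return heuristic_schedule
```

The first window size for which the solver succeeds is the loosest tractable constraint, which preserves as much of the optimality of the exact formulation as the time budget allows.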
- Published
- 2006
- Full Text
- View/download PDF
30. Register Pressure in Software-Pipelined Loop Nests: Fast Computation and Impact on Architecture Design
- Author
-
Guang R. Gao and Alban Douillet
- Subjects
Schedule ,Software pipelining ,Computer science ,Processor register ,Pipeline (computing) ,Register file ,Parallel computing ,Loop variant ,Nested loop join ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Register allocation - Abstract
Recently the Single-dimension Software Pipelining (SSP) technique was proposed to software pipeline loop nests at an arbitrary loop level [18,19,20]. However, SSP schedules require a high number of rotating registers, and may become infeasible if register needs exceed the number of available registers. It is therefore desirable to have a method that computes the register pressure quickly (without actually performing register allocation) as an early measure of the feasibility of an SSP schedule. Such a method can also provide valuable feedback to processor architects in their register file design decisions, as far as the needs of loop nests are concerned. This paper presents a method that quickly computes the minimum number of rotating registers required by an SSP schedule. The results demonstrate that the method is always accurate and is 3 to 4 orders of magnitude faster on average than the register allocator. Experiments also suggest that 64 floating-point rotating registers are in general enough to accommodate the needs of the loop nests used in scientific computations.
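The quantity being estimated is essentially MaxLive under a modulo schedule: because the kernel repeats every II cycles, each lifetime of the flat schedule folds onto cycle slots modulo II, and a value that lives longer than II cycles overlaps copies of itself from later iterations, which is exactly what rotating registers accommodate. A minimal sketch (the lifetime encoding is an assumption for illustration):

```python
def modulo_register_pressure(lifetimes, ii):
    """Max number of simultaneously live values in a modulo schedule.
    lifetimes: (start, end) half-open cycle ranges in the flat schedule.
    Folding the ranges modulo ii counts overlap both between different
    values and between copies of one long-lived value."""
    pressure = [0] * ii
    for start, end in lifetimes:
        for t in range(start, end):
            pressure[t % ii] += 1
    return max(pressure) if lifetimes else 0
```

A single value live for 5 cycles under II = 2, for example, needs 3 registers on its own, since three copies of it are in flight at once.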
- Published
- 2006
- Full Text
- View/download PDF
31. Register Allocation on Stream Processor with Local Register File
- Author
-
Mei Wen, Nan Wu, Ju Ren, Chunyuan Zhang, and Yi He
- Subjects
Instruction register ,Computer science ,Working set ,Register file ,Parallel computing ,computer.software_genre ,law.invention ,Memory address register ,Software pipelining ,law ,Control register ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Processor register ,FLAGS register ,Register renaming ,Register window ,Stack register ,Microprocessor ,Very long instruction word ,Status register ,Basic block ,Operating system ,Index register ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Memory data register ,Instruction-level parallelism ,computer ,Register allocation - Abstract
Emerging stream processors for intensive computing use local register files to support ALU arrays and use VLIW to exploit instruction-level parallelism. Current VLIW compilers for local register files, such as ISCD, work well on moderate media applications without considering register allocation pressure. However, more complicated applications, and optimizations that increase the size of the working set such as software pipelining, make it necessary to consider register pressure during scheduling. Based on the ISCD compiler, this paper presents two new techniques, spill scheduling and basic block repartitioning, which together form a new scheduling algorithm that alleviates register pressure. Experimental results show that it handles heavy-workload applications well. The algorithm can also be applied to other microprocessors with similar register architectures.
- Published
- 2006
- Full Text
- View/download PDF
32. Pipelining Network Storage I/O
- Author
-
Dan Feng, Lingfang Zeng, and Fang Wang
- Subjects
Input/output ,business.industry ,Computer science ,Network storage ,Pipeline (computing) ,Parallel computing ,Scheduling (computing) ,Storage area network ,Software pipelining ,Embedded system ,Computer data storage ,Data_FILES ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business - Abstract
This paper studies the application of pipelining to network storage (e.g., network-attached storage and storage area networks) and the segmentation of I/O into reasonable pipeline stages, which can provide significant guidance for enhancing the performance of network storage systems. Experiments with pipelined I/O were carried out on the Heter-RAID. The results show that pipelined I/O scheduling can exploit pipelining to achieve high-performance I/O over a network storage system.
- Published
- 2006
- Full Text
- View/download PDF
33. Multi-dimensional Kernel Generation for Loop Nest Software Pipelining
- Author
-
Guang R. Gao, Hongbo Rong, and Alban Douillet
- Subjects
Rate-monotonic scheduling ,Schedule ,Fixed-priority pre-emptive scheduling ,Software pipelining ,Job shop scheduling ,Computer science ,Two-level scheduling ,Dynamic priority scheduling ,Parallel computing ,Nested loop join ,Execution time ,Fair-share scheduling ,Scheduling (computing) - Abstract
Single-dimension Software Pipelining (SSP) has been proposed as an effective software pipelining technique for multi-dimensional loops [16]. This paper introduces for the first time the scheduling methods that actually produce the kernel code. Because of the multi-dimensional nature of the problem, the scheduling problem is more complex and challenging than with traditional modulo scheduling. The scheduler must handle multiple subkernels and initiation rates under specific scheduling constraints, while producing a solution that minimizes the execution time of the final schedule. In this paper three approaches are proposed: the level-by-level method, which schedules operations in loop level order, starting from the innermost, and does not let other operations interfere with the already scheduled levels; the flat method, which schedules operations from different loop levels with the same priority; and the hybrid method, which uses the level-by-level mechanism for the innermost level and the flat solution for the other levels. The methods subsume Huff's modulo scheduling [8] for single loops as a special case. We also break a scheduling constraint introduced in earlier publications and allow for a more compact kernel. The proposed approaches were implemented in the Open64/ORC compiler, and evaluated on loop nests from the Livermore, SPEC2000 and NAS benchmarks.
- Published
- 2006
- Full Text
- View/download PDF
34. Software Pipelining Support for Transport Triggered Architecture Processors
- Author
-
T. Jarvinen, Perttu Salmela, Jarmo Takala, and Pekka Jääskeläinen
- Subjects
Assembly language ,Cycles per instruction ,business.industry ,Computer science ,Parallel algorithm ,Transport triggered architecture ,Microarchitecture ,Parallel language ,Software pipelining ,Computer architecture ,High-level programming language ,Embedded system ,business ,computer ,computer.programming_language - Abstract
Many telecommunication applications, especially baseband processing, and digital signal processing (DSP) applications call for high-performance implementations due to the complexity of algorithms and high throughput requirements. In general, the required performance is obtained with the aid of parallel computational resources. In these application domains, software implementations are often preferred over fixed-function ASICs due to the flexibility and ease of development. Application-specific instruction-set processor (ASIP) architectures can be used to exploit the inherent parallelism of the algorithms efficiently while still maintaining flexibility. Using high-level languages to program processor architectures with parallel resources can lead to inefficient resource utilization, while parallel assembly programming is error-prone and tedious. In this paper, the inherent problems of parallel programming and software pipelining are mitigated with a parallel language syntax and automatic generation of software-pipelined code for iteration kernels. With the aid of the developed tool support, the underlying performance of a processor architecture with parallel resources can be exploited, and full utilization of the main processing resources is obtained for pipelined loop kernels. The given examples show that efficiency can be obtained without reducing performance.
- Published
- 2006
- Full Text
- View/download PDF
35. HiLO: High Level Optimization of FFTs
- Author
-
David Padua and Nick Rizzolo
- Subjects
Software pipelining ,Computer science ,Fast Fourier transform ,Process (computing) ,Code (cryptography) ,Program transformation ,Compiler ,Parallel computing ,Program optimization ,computer.software_genre ,Abstract syntax tree ,computer - Abstract
As computing platforms become more and more complex, the task of optimizing performance critical codes becomes more challenging. Recently, more attention has been focused on automating this optimization process by making aggressive assumptions about the algorithm. Motivated by these trends, this paper presents HiLO, the high-level optimizer for FFT codes. HiLO blends traditional optimization techniques into an optimization strategy tailored to the needs of FFT codes and outputs C code ready to be further optimized by the native compiler. It has already been shown that such high-level transformations are important to coax the native compiler into doing its job well. HiLO provides a more appropriate platform for researching these phenomena, suggests an optimization strategy that improves on previous approaches, and shows that even software pipelining at the C level can improve the final binary's performance.
- Published
- 2005
- Full Text
- View/download PDF
36. Hardware Support for Multithreaded Execution of Loops with Limited Parallelism
- Author
-
Georgios Dimitriou and Constantine D. Polychronopoulos
- Subjects
Schedule ,business.industry ,Computer science ,Re-order buffer ,Processor scheduling ,Thread (computing) ,Parallel computing ,Scheduling (computing) ,Shared resource ,Software pipelining ,Loop scheduling ,Macro ,business ,Computer hardware ,Context switch - Abstract
Loop scheduling on multithreaded processors differs significantly from loop scheduling on other parallel processors. The sharing of hardware resources imposes new scheduling limitations, but it also allows faster communication across threads. We present a multithreaded processor model, Coral 2000, with hardware extensions that support Macro Software Pipelining, a loop scheduling technique for multithreaded processors. We tested and evaluated Coral 2000 on a cycle-level simulator, using synthetic and integer SPEC benchmarks. We obtained speedups of up to 30% with respect to highly optimized superblock-based schedules on loops that exhibit limited parallelism.
- Published
- 2005
- Full Text
- View/download PDF
37. Parallelism Improvements of Software Pipelining by Combining Spilling with Rematerialization
- Author
-
Hiroaki Ogi, Naohiro Ishii, Kazunori Iwata, and Tsubasa Mochizuki
- Subjects
Speedup ,Rematerialization ,Computer science ,Pipeline (computing) ,Parallel computing ,computer.software_genre ,Scheduling (computing) ,Software pipelining ,Loop scheduling ,Computer architecture ,Very long instruction word ,Compiler ,Instruction-level parallelism ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer - Abstract
On instruction-level parallel architectures such as EPIC and VLIW machines, performance is strongly affected by compiler techniques, and integrated optimization techniques have been studied to develop this parallelism. In this paper, we develop a software pipelining technique that improves parallel processing on these machine structures. Software pipelining is a loop scheduling technique that overlaps the execution of several consecutive iterations of a program loop; consequently, many registers are needed to realize it. Here, spill code and rematerialization are incorporated into the pipeline scheduling. Experimental results of the proposed method are compared with conventional methods, and show improved speedup of parallel processing on the benchmarks.
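The trade-off the abstract combines can be sketched as a cost comparison: rematerialization recomputes a value from its defining instruction instead of storing and reloading it, so it wins whenever recomputation is cheaper than a store/load pair. A toy chooser (the cost fields and the greedy policy are illustrative assumptions, not the paper's algorithm):

```python
def relieve_pressure(live_values, num_regs):
    """Greedily evict values until register pressure fits, preferring
    whichever of spilling or rematerializing is cheaper per value.
    live_values: dicts with 'name', 'spill_cost', 'remat_cost'."""
    actions = []
    # Evict the cheapest-to-restore values first.
    by_cost = sorted(live_values,
                     key=lambda v: min(v['spill_cost'], v['remat_cost']))
    excess = len(live_values) - num_regs
    for v in by_cost[:max(excess, 0)]:
        kind = 'remat' if v['remat_cost'] < v['spill_cost'] else 'spill'
        actions.append((v['name'], kind))
    return actions
```

In a software-pipelined loop the payoff is doubled: every avoided load is an instruction removed from a kernel that executes once per iteration.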
- Published
- 2005
- Full Text
- View/download PDF
38. Coping with Data Dependencies of Multi-dimensional Array References
- Author
-
Lin Qiao, Weitong Huang, and Zhizhong Tang
- Subjects
Multidimensional analysis ,Software pipelining ,Linear system ,Parallel algorithm ,Static analysis ,Dependence analysis ,Nested loop join ,Lambda ,Algorithm ,Mathematics - Abstract
This paper presents a new static data dependence analysis approach, the Dependence Difference Inequality Test, which can deal with coupled subscripts in multi-dimensional array references for software pipelining techniques for nested loops. The Dependence Difference Inequality Test (DDIT) replaces direction vectors with dependence difference inequalities as constraints on the variables of a linear system. The method presented in this paper extends the applicable range of the Generalized Lambda Test and appears to be a practical scheme for analyzing data dependence. Experimental results show that the number of data independences detected by the DDIT algorithm is only slightly smaller than the number found manually. It is also shown that our method is better than other traditional data dependence analysis methods without increasing time cost: it increases the success rate of the Generalized Lambda Test by approximately 14.19%.
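The DDIT itself is specific to the paper, but the flavor of static dependence testing it extends can be illustrated with the classic GCD test, a quick necessary condition for a dependence between subscripts a*i and b*j + c (a stand-in example, not the DDIT algorithm):

```python
from math import gcd

def gcd_dependence_test(a, b, c):
    """Necessary condition for A[a*i] and A[b*j + c] to touch the same
    element: a*i - b*j = c has an integer solution iff gcd(a, b)
    divides c. Returning False proves independence; True is
    inconclusive and needs a sharper test (such as the DDIT)."""
    return c % gcd(a, b) == 0
```

Tests like this one ignore loop bounds and coupling between subscripts; the inequality-based constraints described in the abstract exist precisely to tighten such conservative answers.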
- Published
- 2005
- Full Text
- View/download PDF
39. A Static Data Dependence Analysis Approach for Software Pipelining
- Author
-
Lin Qiao, Zhizhong Tang, and Weitong Huang
- Subjects
Speedup ,Software pipelining ,Program analysis ,Computer science ,Linear system ,Parallel algorithm ,Dependence analysis ,Instruction-level parallelism ,Algorithm ,Computer Science::Databases ,Loop dependence analysis - Abstract
This paper introduces a new static data dependence constraint, called the dependence difference inequality, which can deal with coupled subscripts in multi-dimensional array references. Unlike direction vectors, dependence difference inequalities are related not only to the iteration space of a loop program but also to the operation distance between two operations. They are stricter than other methods, and can act as additional constraints on each variable in a linear system, on their own or combined with others. As a result, the solution space of the linear system can be compressed heavily. So long as the dependence difference inequalities cannot be satisfied simultaneously, the loop can be software-pipelined with any initiation interval even if there exists a data dependence between two operations. Meanwhile, by replacing direction vectors with dependence difference inequalities, some conservative estimations made by other traditional data dependence analysis approaches can be eliminated.
- Published
- 2005
- Full Text
- View/download PDF
40. Self-loop Pipelining and Reconfigurable Dataflow Arrays
- Author
-
João M. P. Cardoso
- Subjects
Computer science ,Dataflow ,business.industry ,Pipeline (computing) ,Control reconfiguration ,Parallel computing ,Reconfigurable computing ,Data flow diagram ,Imperative programming ,Software ,Software pipelining ,Computer architecture ,business ,Software architecture - Abstract
This paper presents some interesting concepts of static dataflow machines that can be used by reconfigurable computing architectures. We introduce some data-driven reconfigurable arrays and summarize techniques to map imperative software programs to those architectures, some of which are the focus of current research work. In particular, we briefly present a novel technique for pipelining loops. Experiments with the technique confirm important improvements over conventional loop pipelining. Hence, the technique proves to be an efficient approach for mapping loops to coarse-grained reconfigurable architectures employing a static dataflow computational model.
- Published
- 2004
- Full Text
- View/download PDF
41. Increasing Software-Pipelined Loops in the Itanium-Like Architecture
- Author
-
Yu Chen, Haibo Lin, Zhizhong Tang, and Wenlong Li
- Subjects
business.industry ,Processor register ,Computer science ,Parallel computing ,Loop variant ,computer.software_genre ,Software ,Software pipelining ,Explicitly parallel instruction computing ,Itanium ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,business ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer ,Register allocation - Abstract
The Itanium architecture (IPF) contains features such as register rotation to support efficient software pipelining. One of the drawbacks of software pipelining is its high register requirement, which may lead to failure when the registers provided by the architecture are insufficient. This paper evaluates the register requirements of software-pipelined loops in the Itanium architecture and presents two new methods that try to reduce static general register requirements in software-pipelined loops, either by reducing instructions in the loop body or by allocating unused rotating registers for variants that use static registers. We have implemented our methods in the Open Research Compiler (ORC) targeted for Itanium processors, and experiments show that the number of software-pipelined loops in the NAS benchmarks increases. For some benchmarks, performance is improved by more than 18%.
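Register rotation exists so that a value defined in one kernel iteration is not clobbered by the next iteration's definition before its last use. On machines without it, the classic fallback is modulo variable expansion: unroll the kernel enough times that each value's lifetime fits, then rename per copy. A minimal sketch of the unroll-factor computation (the lifetime encoding is illustrative, and MVE is the general technique rather than anything specific to this paper):

```python
from math import ceil

def mve_unroll_factor(lifetimes, ii):
    """Modulo variable expansion: without rotating registers, the kernel
    must be unrolled ceil(lifetime / II) times for the longest-lived
    value, so each in-flight copy gets its own register name.
    lifetimes: (start, end) half-open cycle ranges."""
    return max((ceil((end - start) / ii) for start, end in lifetimes),
               default=1)
```

Hardware rotation makes this unrolling, and its code-size cost, unnecessary, which is why exhausting the rotating file (the failure mode discussed above) is so costly.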
- Published
- 2004
- Full Text
- View/download PDF
42. Minimizing Variables’ Lifetime in Loop-Intensive Applications
- Author
-
Noureddine Chabini and Wayne Wolf
- Subjects
Polynomial ,Mathematical optimization ,Variable (computer science) ,Software pipelining ,Computer science ,Context (language use) ,Nested loop join ,Retiming ,Time complexity ,Algorithm - Abstract
In this paper, we address a set of research problems regarding loop-intensive applications. The first problem consists of minimizing variables' lifetimes under timing constraints. Any variable that is alive for more than one iteration must be saved, for instance by storing it into a register. The second problem is derived from the first one, and consists of balancing variables' lifetimes for a target total number of registers under timing constraints. For the third problem, we focus on minimizing variables' lifetimes in the context of software pipelining, in the case of a single loop as well as nested loops. We provide methods to solve these three problems, and show that a subset of them have polynomial run-time. Once these methods are used, one may need to solve the problem of generating the transformed code and reducing its size; we provide algorithms to solve this fourth problem. Designers face these problems during hardware-software co-design, and in designing embedded systems as well as systems-on-chip. Solving these problems is also useful in low-power design. We exercise some of these methods on known benchmarks, and provide numerical results that show their effectiveness. We solve these problems using techniques related to retiming, an algorithm originally developed for hardware optimization.
- Published
- 2003
- Full Text
- View/download PDF
43. Tailoring Software Pipelining for Effective Exploitation of Zero Overhead Loop Buffer
- Author
-
Gang-Ryung Uh
- Subjects
Computer science ,Loop inversion ,CPU cache ,Pipeline (computing) ,Register renaming ,Parallel computing ,computer.software_genre ,Instruction selection ,Execution time ,Loop fission ,Software pipelining ,Overhead (computing) ,Compiler ,Cache ,Instruction-level parallelism ,computer - Abstract
A Zero Overhead Loop Buffer (ZOLB) is an architectural feature that is commonly found in DSPs (Digital Signal Processors). This buffer can be viewed as a compiler (or program) managed cache that can hold a limited number of instructions, which will be executed a specified number of times without incurring any loop overhead. Preliminary versions of the research, which exploit a ZOLB, report significant improvement in execution time with a minimal code size increase [UH99,UH00]. This paper extends the previous compiler efforts to further exploit a ZOLB by employing a new software pipelining methodology. The proposed techniques choose complex instructions, which capitalize on instruction level parallelism across loop iteration boundaries. Unlike the traditional pipelining techniques, the proposed pipelining strategy is tightly coupled with instruction selection so that it can perform register renaming and/or proactively generate additional instruction(s) on the fly to discover more loop parallelism on the ZOLB. This framework reports additional significant improvements in execution time with modest code size increases for various signal processing applications on the DSP16000.
- Published
- 2003
- Full Text
- View/download PDF
44. MIRS: Modulo Scheduling with Integrated Register Spilling
- Author
-
Eduard Ayguadé, Mateo Valero, Javier Zalamea, and Josep Llosa
- Subjects
Computer science ,Backtracking ,Modulo ,Instruction scheduling ,Parallel computing ,computer.software_genre ,Scheduling (computing) ,Software pipelining ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Instruction-level parallelism ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer ,Register allocation - Abstract
The overlapping of loop iterations in software pipelining techniques imposes high register requirements. The schedule for a loop is valid if it requires at most the number of registers available in the target architecture. Otherwise its register requirements have to be reduced by spilling registers to memory. Previous proposals for spilling in software pipelined loops require a two-step process. The first step performs the actual instruction scheduling without register constraints. The second step adds (if required) spill code and reschedules the modified loop. The process is repeated until a valid schedule, requiring no more registers than those available, is found. The paper presents MIRS (Modulo scheduling with Integrated Register Spilling), a novel register-constrained modulo scheduler that performs modulo scheduling and register spilling simultaneously in a single step. The algorithm is iterative and uses backtracking to undo previous scheduling decisions whenever resource or dependence conflicts appear. MIRS is compared against a state-of-the-art two-step approach already described in the literature. For this purpose, a workbench composed of a large set of loops from the Perfect Club and a set of processor configurations are used. On the average, for the loops that require spill code a speed-up in the range 14-31% and a reduction of the memory traffic by a factor in the range 0.90-0.72 are achieved.
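The baseline that MIRS extends is Rau-style iterative modulo scheduling: try candidate IIs from the minimum upward, placing operations into a modulo reservation table. The following toy scheduler (no backtracking and no spill insertion, both of which MIRS adds, and the operation encoding is an assumption) shows the core loop:

```python
def modulo_schedule(ops, deps, slots_per_cycle, max_ii=32):
    """Toy modulo scheduler: for each candidate II, place ops (given in
    dependence order) at the earliest cycle that satisfies scheduled
    predecessors and has a free issue slot at cycle mod II.
    deps: (src, dst, latency, distance) tuples."""
    preds = {op: [] for op in ops}
    for u, v, lat, dist in deps:
        preds[v].append((u, lat, dist))
    for ii in range(1, max_ii + 1):
        table = [0] * ii                 # issue slots used per modulo cycle
        sched, ok = {}, True
        for op in ops:
            earliest = max([sched[u] + lat - ii * dist
                            for u, lat, dist in preds[op] if u in sched] + [0])
            for t in range(earliest, earliest + ii):   # scan one full II window
                if table[t % ii] < slots_per_cycle:
                    table[t % ii] += 1
                    sched[op] = t
                    break
            else:
                ok = False               # no free slot at this II
                break
        if ok:
            return ii, sched
    return None
```

With four single-slot operations and a 1-wide machine, the resource bound forces II = 4 regardless of dependences, which the sketch reproduces. Where this version simply gives up and bumps the II, MIRS instead backtracks or inserts spill code at the same II.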
- Published
- 2003
- Full Text
- View/download PDF
45. A Model for Hardware Realization of Kernel Loops
- Author
-
Jirong Liao, Weng-Fai Wong, and Tulika Mitra
- Subjects
Loop unrolling ,Loop optimization ,business.industry ,Computer science ,Design space exploration ,Loop fusion ,Parallel computing ,computer.software_genre ,Software pipelining ,Kernel (statistics) ,Compiler ,business ,computer ,Realization (systems) ,Computer hardware - Abstract
Hardware realization of kernel loops holds the promise of accelerating the overall application performance and is therefore an important part of the synthesis process. In this paper, we consider two important loop optimization techniques, namely loop unrolling and software pipelining that can impact the performance and cost of the synthesized hardware. We propose a novel model that accounts for various characteristics of a loop, including dependencies, parallelism and resource requirement, as well as certain high level constraints of the implementation platform. Using this model, we are able to deduce the optimal unroll factor and technique for achieving the best performance given a fixed resource budget. The model was verified using a compiler-based FPGA synthesis framework on a number of kernel loops. We believe that our model is general and applicable to other synthesis frameworks, and will help reduce the time for design space exploration.
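The kind of model described can be caricatured as maximizing iteration throughput per unit of hardware under a fixed resource budget; the paper's model additionally accounts for dependences and parallelism, but a toy version (all names and costs here are assumptions) shows the shape of the search for an optimal unroll factor:

```python
def best_unroll_factor(res_per_iter, fixed_overhead, budget, max_unroll=64):
    """Pick the unroll factor u maximizing iterations per resource unit,
    u / (fixed_overhead + u * res_per_iter), subject to a budget on
    total resources consumed by the synthesized loop hardware."""
    best_rate, best_u = 0.0, 1
    for u in range(1, max_unroll + 1):
        cost = fixed_overhead + u * res_per_iter
        if cost > budget:
            break                         # budget exhausted
        rate = u / cost
        if rate > best_rate:
            best_rate, best_u = rate, u
    return best_u
```

Under this toy model the fixed overhead (control logic, pipeline registers) is amortized over more iterations as u grows, so the best factor is the largest one that fits the budget; a richer model with dependence limits would flatten the curve earlier.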
- Published
- 2003
- Full Text
- View/download PDF
46. Overcoming Static Register Pressure for Software Pipelining in the Itanium Architecture
- Author
-
Haibo Lin, Zhizhong Tang, and Wenlong Li
- Subjects
Scheme (programming language) ,Software pipelining ,Computer science ,Explicitly parallel instruction computing ,Pattern recognition (psychology) ,Parallel algorithm ,Itanium ,Parallel computing ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,computer ,computer.programming_language ,Register allocation - Abstract
Software pipelining techniques have been shown to significantly improve the performance of loop-intensive scientific programs. The Itanium architecture contains many features to enhance parallel execution, such as support for efficient software pipelining of loops. The drawback of software pipelining is its high register requirement, which may lead to software pipelining failure due to the limited number of static general registers in Itanium. This paper evaluates the register requirements of software-pipelined loops. It then presents a novel register allocation scheme, which allocates stacked registers to serve as static registers. Experimental results show that this method gains an average 2.4% improvement, and a peak 18% improvement, in performance on the NAS benchmarks.
- Published
- 2003
- Full Text
- View/download PDF
47. A First Step Towards Time Optimal Software Pipelining of Loops with Control Flows
- Author
-
Soo-Mook Moon, Jihong Kim, and Han-Saem Yun
- Subjects
Flow control (data) ,Basis (linear algebra) ,Computer science ,business.industry ,Pipeline (computing) ,Software development ,Parallel computing ,computer.software_genre ,Software pipelining ,Problem of time ,Control system ,Compiler ,business ,computer - Abstract
We address the problem of time optimal software pipelining of loops with control flows, one of the most difficult open problems in the area of parallelizing compilers. We present a necessary condition for loops with control flows to have equivalent time optimal programs, generalizing the result by Schwiegelshohn et al., which has been the most significant theoretical result on the problem. As part of the formal treatment of the problem, we propose a new formalization of software pipelining, which provides a basis of our proof as well as a new theoretical framework for software pipelining research. Being the first generalized result on the problem, our work described in this paper forms an important first step towards time optimal software pipelining.
- Published
- 2001
- Full Text
- View/download PDF
48. Global Software Pipelining with Iteration Preselection
- Author
-
David Gregg
- Subjects
Difficult problem ,Software pipelining ,Computer science ,Global software ,Parallel computing ,Heuristics ,Instruction-level parallelism ,Code growth ,Scheduling (computing) - Abstract
Software pipelining loops containing multiple paths is a very difficult problem. Loop shifting offers the possibility of a close to optimal schedule with acceptable code growth. Deciding how often to shift each operation is difficult, and existing heuristics are rather ad hoc. We separate loop shifting from scheduling, and present new, non-greedy heuristics. Experimental results show that our approach yields better performance and less code growth.
- Published
- 2000
- Full Text
- View/download PDF
49. Pipelining Wavefront Computations: Experiences and Performance
- Author
-
E. Christopher Lewis and Lawrence Snyder
- Subjects
Stencil code ,Computer science ,Pipeline (computing) ,Message passing ,Message Passing Interface ,Parallel computing ,Program optimization ,computer.software_genre ,Parallel language ,Automatic parallelization ,Software pipelining ,Compiler ,Programmer ,computer - Abstract
Wavefront computations are common in scientific applications. Although it is well understood how wavefronts are pipelined for parallel execution, the question remains: How are they best presented to the compiler for the effective generation of pipelined code? We address this question through a quantitative and qualitative study of three approaches to expressing pipelining: programmer implemented via message passing, compiler discovered via automatic parallelization, and programmer defined via explicit parallel language features for pipelining. This work is the first assessment of the efficacy of these approaches in solving wavefront computations, and in the process, we reveal surprising characteristics of commercial compilers. We also demonstrate that a parallel language-level solution simplifies development and consistently performs well.
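A wavefront computation sweeps a grid in which each cell depends on already-computed neighbours, so all cells along one anti-diagonal are mutually independent; pipelined parallelizations assign rows or blocks to processors that start as soon as their upstream neighbour produces a value. A sequential sketch of the dependence pattern (the max-plus recurrence is just an example kernel, not from the paper):

```python
def wavefront(grid):
    """Sweep anti-diagonals of a 2D recurrence in which each cell depends
    on its north and west neighbours; all cells on one anti-diagonal are
    independent, which is what a pipelined parallelization exploits."""
    n, m = len(grid), len(grid[0])
    out = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                    # anti-diagonal index
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            north = out[i - 1][j] if i > 0 else 0
            west = out[i][j - 1] if j > 0 else 0
            out[i][j] = grid[i][j] + max(north, west)
    return out
```

The three approaches compared in the paper differ only in who makes this anti-diagonal structure explicit: the programmer (via message passing), the compiler (via parallelization), or the language (via pipelining constructs).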
- Published
- 2000
- Full Text
- View/download PDF
50. Loop Shifting for Loop Compaction
- Author
-
Guillaume Huard and Alain Darte
- Subjects
For loop ,Loop unrolling ,Loop counter ,Loop inversion ,Computer science ,Loop fusion ,Do while loop ,Parallel computing ,Loop tiling ,Loop variant ,Loop fission ,Loop splitting ,Software pipelining ,Loop nest optimization ,Loop interchange ,While loop ,Inner loop ,Loop dependence analysis - Abstract
The idea of decomposed software pipelining is to decouple the software pipelining problem into a cyclic scheduling problem without resource constraints and an acyclic scheduling problem with resource constraints. In terms of loop transformation and code motion, the technique can be formulated as a combination of loop shifting and loop compaction. Loop shifting amounts to moving statements between iterations, thereby changing some loop-independent dependences into loop-carried dependences and vice versa. Then, loop compaction schedules the body of the loop considering only loop-independent dependences, but taking into account the details of the target architecture. In this paper, we show how loop shifting can be optimized so as to minimize both the length of the critical path and the number of dependences for loop compaction. Both problems (and their combination) are polynomially solvable with fast graph algorithms: the first by using an algorithm due to Leiserson and Saxe, the second by designing a variant of minimum-cost flow algorithms.
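Loop shifting is retiming in the Leiserson-Saxe sense: assigning each statement a shift r(v) moves it across iteration boundaries, and an edge (u, v) with dependence distance w gets retimed distance w + r(v) - r(u). The shift is legal when every retimed distance stays non-negative, and loop compaction then only has to respect the edges whose retimed distance is zero. A small checker (the edge encoding is illustrative):

```python
def retime(edges, r):
    """Apply a retiming r (statement -> shift count) to dependence edges
    (src, dst, distance). Returns the retimed edges and whether the
    retiming is legal, i.e. all retimed distances are non-negative."""
    retimed = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
    return retimed, all(w >= 0 for _, _, w in retimed)
```

The optimization problem the paper solves is choosing r so that as many distances as possible become positive (removing them from compaction's critical path) while none goes negative.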
- Published
- 2000
- Full Text
- View/download PDF