166 results for "Trace scheduling"
Search Results
2. Trace Scheduling
- Author
- Freudenberger, Stefan M. and Padua, David, editor
- Published
- 2011
- Full Text
- View/download PDF
3. The Multiflow Trace Scheduling Compiler
- Author
- Lowney, P. Geoffrey, Freudenberger, Stefan M., Karzes, Thomas J., Lichtenstein, W. D., Nix, Robert P., O’Donnell, John S., Ruttenberg, John C., Rau, B. R., editor, and Fisher, J. A., editor
- Published
- 1993
- Full Text
- View/download PDF
4. Mathematical Foundation of Trace Scheduling.
- Author
- Banerjee, Utpal
- Subjects
- EXECUTION traces (Computer program testing), SCHEDULING, MATHEMATICS, THEORY, ALGORITHMS, PARALLEL programs (Computer programs), PROGRAM transformation, COMPUTER programming, SOFTWARE engineering
- Abstract
Since its introduction by Joseph A. Fisher in 1979, trace scheduling has influenced much of the work on compile-time ILP (Instruction Level Parallelism) transformations. Initially developed for use in microcode compaction, it quickly became the main technique for machine-level compile-time parallelism exploitation. Although it has been used since the 1980s in many state-of-the-art compilers (e.g., Intel, Fujitsu, HP), a rigorous theory of trace scheduling is still lacking in the existing literature. This is reflected in the ad hoc way compensation code is inserted after a trace compaction, in the total absence of any attempts to measure the size of that compensation code, and so on. The aim of this article is to create a mathematical theory of the foundation of trace scheduling. We give a clear algorithm showing how to insert compensation code after a trace is replaced with its schedule, and then prove that the resulting program is indeed equivalent to the original program. We derive an upper bound on the size of that compensation code, and show that this bound can be actually attained. We also give a very simple proof that the trace scheduling algorithm always terminates. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
5. Optimization of Arithmetic Coding for JPEG2000.
- Author
- Rhu, M. and Park, I.-C.
- Subjects
- JPEG (Image coding standard), MATHEMATICAL optimization, CODING theory, IMAGE processing, IMAGING systems, IMAGE compression
- Abstract
Embedded block coding with optimized truncation (EBCOT) employed in the JPEG2000 standard accounts for the majority of the processing time, because the EBCOT is full of bit operations that cannot be implemented efficiently in software. The block coder consists of a bit-plane coder (BPC) followed by a binary arithmetic coder (BAC), where the most up-to-date BPC architectures are capable of producing symbols at a much higher rate than the conventional BACs can handle. This letter proposes a novel pipelined BAC architecture that can encode input symbols at a much higher rate than the conventional BAC architectures. The proposed architecture can significantly reduce the critical path delay and can achieve a throughput of 400Msymbols/s. The critical path delay synthesized with 0.18μm CMOS technology is 2.42 ns, which is almost half of the delay taken in conventional BAC architectures. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
6. Compiling Real-Time Programs With Timing Constraint Refinement and Structural Code Motion.
- Author
- Gerber, Richard and Hong, Seongsoo
- Subjects
- COMPUTER software, COMPUTER programming, REAL-time computing, SOFTWARE engineering, ENGINEERING
- Abstract
We present a programming language called TCEL (Time-Constrained Event Language), whose semantics are based on time-constrained relationships between observable events. Such a semantics infers only those timing constraints necessary to achieve real-time correctness, without overconstraining the system. Moreover, an optimizing compiler can exploit this looser semantics to help tune the code, so that its worst-case execution time is consistent with its real-time requirements. In this paper we describe such a transformation system, which works in two phases. First, the TCEL source code is translated into an intermediate representation. Then an instruction-scheduling algorithm rearranges selected unobservable operations and synthesizes tasks guaranteed to respect the original event-based constraints. [ABSTRACT FROM AUTHOR]
- Published
- 1995
- Full Text
- View/download PDF
7. Region Scheduling: An Approach for Detecting and Redistributing Parallelism.
- Author
- Gupta, Rajiv and Soffa, Mary Lou
- Subjects
- PARALLEL computers, SOFTWARE engineering, COMPUTER software, COMPUTER systems, COMPUTERS
- Abstract
In developing compiler techniques for programs targeted for parallel execution, it is imperative that a program representation be utilized that not only facilitates the detection and scheduling of parallelism but also easily enables program transformations that increase opportunities for parallelism. These requirements are the driving force behind region scheduling, a technique applicable to both fine grain and coarse grain parallelism. This technique employs a program representation that divides a program into regions consisting of source and intermediate level statements and enables the expression of both data and control dependencies. Guided by estimates of the parallelism present in regions, the region scheduler redistributes code, thus providing opportunities for parallelism in those regions containing insufficient parallelism compared to the capabilities of the executing architecture. The program representation and the transformations are applicable to both structured and unstructured programs, making region scheduling useful for a wide range of applications. The results of experiments conducted using the technique in the generation of code for a reconfigurable long instruction word architecture are presented. The advantages of region scheduling over trace scheduling, another technique for transforming and detecting fine grain parallelism in programs, are discussed. [ABSTRACT FROM AUTHOR]
- Published
- 1990
- Full Text
- View/download PDF
8. Horizon: A Retargetable Compiler for Horizontal Microarchitectures.
- Author
- Mueller, Robert A., Duda, Michael R., Sweany, Philip H., and Walicki, Jack S.
- Subjects
- COMPILERS (Computer programs), MICROPROGRAMMING, SOFTWARE engineering, COMPUTER programming, COMPUTER software development, SYSTEMS software, COMPUTER architecture
- Abstract
Advances in memory and circuit technologies make custom microarchitectures a cost-effective approach to architectures designed to support high-performance applications such as image processing and synthesis. However, the vertical migration of complex application code into horizontal microcode makes traditional methods of handwritten and hand-optimized microcode with primitive assembly languages impractical. Higher level languages that permit abstraction from low-level timing and concurrency details are a major step toward alleviating the problem. This approach is feasible only if compilers for these languages exist that can produce high-quality microcode and that can be targeted to new machines with modest effort and high reliability. In this paper we overview the distinctive aspects of the Horizon retargetable microcode compiler that facilitate the production of highly optimized microcode and the targeting of the compiler to specific machines. [ABSTRACT FROM AUTHOR]
- Published
- 1988
- Full Text
- View/download PDF
9. The multiflow trace scheduling compiler.
- Author
- Lowney, P., Freudenberger, Stefan, Karzes, Thomas, Lichtenstein, W., Nix, Robert, O'Donnell, John, and Ruttenberg, John
- Abstract
The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks. [ABSTRACT FROM AUTHOR]
- Published
- 1993
- Full Text
- View/download PDF
10. Symbolic execution using approximate computing (SEAC) — A novel branch hazard distribution method
- Author
- Alaa Ali, Oladiran G. Olaleye, Magdy Bayoumi, and Dmitri Perkins
- Subjects
- Loop unrolling, Software pipelining, Speedup, Trace scheduling, Computer science, Parallel computing, Compiler, Symbolic execution, Hazard (computer architecture)
- Abstract
To improve the overall performance of computer systems, instruction-level parallelism (ILP) has been widely exploited. However, branch hazards, conditional and unconditional, still limit the efficiency of most ILP techniques. Compiler techniques such as loop unrolling, software pipelining, and trace scheduling are being used to increase the amount of parallelism available in systems with fairly predictable branches, while predicated instructions have been useful in eliminating branch hazards in specific cases. The limitations imposed on ILP by branch hazards, however, are significant in large blocks of code or, at best, hidden at the expense of processor resources. As a result, researchers are exploring the techniques of approximate computing, which, when applied, are suitable only for fault-tolerant systems. Some are also working on methods of code approximation, which mainly involve hazard minimization by distribution over specific parts of code segments. In this work, we propose and demonstrate a novel branch hazard distribution technique — Symbolic Execution using Approximate Computing (SEAC). We applied the proposed technique to a test program and ran simulation experiments using the Detailed CPU model in the gem5 simulator. Simulation results show that SEAC is 3.57, 1.95, and 1.32 times better than the best of the tested conventional ILP techniques in terms of speedup, energy saving, and branch hazard distribution coefficient, respectively.
- Published
- 2018
- Full Text
- View/download PDF
11. Architecture-Aware Real-Time Compression of Execution Traces
- Author
- Željko Žilić, Bojan Mihajlović, and Warren J. Gross
- Subjects
- Trace scheduling, Computer science, Parallel computing, Link register, Branch predictor, Program counter, Branch trace, Debugging, Hardware and Architecture, Branch target predictor, Redundancy (engineering), Software
- Abstract
In recent years, on-chip trace generation has been recognized as a solution to the debugging of increasingly complex software. An execution trace can be seen as the most fundamentally useful type of trace, allowing the execution path of software to be determined post hoc. However, the bandwidth required to output such a trace can be excessive. Our architecture-aware trace compression (AATC) scheme adds an on-chip branch predictor and branch target buffer to reduce the volume of execution trace data in real time through on-chip compression. Novel redundancy reduction strategies are employed, most notably in exploiting the widespread use of linked branches and the compiler-driven movement of return addresses between link register, stack, and program counter. In doing so, the volume of branch target addresses is reduced by 52%, whereas other algorithmic improvements further decrease trace volume. An analysis of spatial and temporal redundancy in the trace stream allows a comparison of encoding strategies to be made for systematically increasing compression performance. A combination of differential, Fibonacci, VarLen, and Move-to-Front encodings are chosen to produce two compressor variants: a performance-focused xAATC that encodes 56.5 instructions/bit using 24,133 gates and an area-efficient fAATC that encodes 48.1 instructions/bit using only 9,854 gates.
- Published
- 2015
- Full Text
- View/download PDF
12. Adaptive Space-Shared Scheduling for Shared-Memory Parallel Programs
- Author
- Bernhard Egger, Young Hyun Cho, and Surim Oh
- Subjects
- Rate-monotonic scheduling, Trace scheduling, Computer science, Distributed computing, Dynamic priority scheduling, Gang scheduling, Fair-share scheduling, Fixed-priority pre-emptive scheduling, Shared memory, Two-level scheduling
- Abstract
Space-sharing is regarded as the proper resource management scheme for many-core OSes. For today’s many-core chips and parallel programming models providing no explicit resource requirements, an important research problem is to provide a proper resource allocation to the running applications while considering not only the architectural features but also the characteristics of the parallel applications.
- Published
- 2017
- Full Text
- View/download PDF
13. Path-Dividing Based Scheduling Algorithm for Reducing Energy Consumption of Clustered VLIW Architectures
- Author
- Xu Yang
- Subjects
- Rate-monotonic scheduling, Schedule, Computer science, Trace scheduling, Instruction scheduling, Dynamic priority scheduling, Parallel computing, Energy consumption, Round-robin scheduling, Fair-share scheduling, Theoretical Computer Science, Scheduling (computing), Fixed-priority pre-emptive scheduling, Computational Theory and Mathematics, Hardware and Architecture, Two-level scheduling, Software
- Abstract
This paper presents an instruction scheduling algorithm for clustered very long instruction word (VLIW) architectures. It exploits a path-dividing-based technique to decide a more appropriate processing order of instructions, and takes a more global view when generating the schedule by simultaneously considering both the data dependence relations between instructions and the distribution of instructions among clusters. The algorithm is composed of two stages. The first stage virtually schedules all the instructions based on the path-dividing technique. The cluster assignment made virtually in the first stage is delivered to the second stage, while the cycle virtually scheduled in the first stage is used only within that stage, to account for the interplay between cluster assignment and cycle scheduling. The second stage performs actual scheduling according to the cluster assignment decisions made in the first stage and generates the final scheduling results. The proposed algorithm is implemented and evaluated with benchmarks extracted from UTDSP and MediaBench. Results show that the improvement is impressive: besides producing a large reduction in energy consumption, the algorithm also achieves a remarkable performance speed-up. The average energy reduction ranges from 28.1% (4 clusters) to 37.3% (8 clusters), and the average speed-up ranges from 29.9% (4 clusters) to 39.1% (8 clusters).
- Published
- 2014
- Full Text
- View/download PDF
14. Compiler-Assisted Leakage- and Temperature- Aware Instruction-Level VLIW Scheduling
- Author
- Zhaolin Li, Fang Wang, Shan Cao, and Shaojun Wei
- Subjects
- Schedule, Computer science, Trace scheduling, Instruction scheduling, Energy consumption, Parallel computing, Fair-share scheduling, Scheduling (computing), Hardware and Architecture, Very long instruction word, Embedded system, Algorithm design, Electrical and Electronic Engineering, Software, Leakage (electronics)
- Abstract
As technology scales into the nanometer regime, leakage energy consumption accounts for a greater proportion of total energy than ever, especially for very long instruction word (VLIW) architectures with a large number of functional units (FUs). The growing energy consumption leads to an increase in chip temperature, which in turn brings an exponential growth in leakage current, and consequently in leakage energy. However, few studies consider both leakage energy and temperature reduction when compiling for VLIW architectures. In this paper, a leakage- and temperature-aware design flow is presented to assist instruction-level VLIW scheduling during compilation, and two scheduling algorithms are proposed for the design flow. First, the leakage-aware rescheduling algorithm is proposed for leakage energy reduction by concentrating operations onto fewer FUs and shutting more FUs down. Then, the temperature-aware workload balance algorithm is presented to reduce peak temperature by balancing the concentrated workloads among homogeneous FUs. It is proved that the two proposed algorithms can reduce leakage energy and peak temperature without performance loss. Experimental results demonstrate that the peak temperature is reduced by 15.27% and 12.84% for FU groups with three and two FUs, and that the leakage energy is reduced by 78.14% and 30.31% on average compared with the communication scheduling and list algorithms, respectively.
- Published
- 2014
- Full Text
- View/download PDF
15. Optimization Algorithms in Project Scheduling
- Author
- Amer M. Fahmy
- Subjects
- Rate-monotonic scheduling, Computer science, Trace scheduling, Nurse scheduling problem, Two-level scheduling, Flow shop scheduling, Dynamic priority scheduling, Industrial engineering, Fair-share scheduling, Deadline-monotonic scheduling
- Abstract
Scheduling, or planning in a general perspective, is the backbone of project management; thus, the successful implementation of project scheduling is a key factor to projects’ success. Due to its complexity and challenging nature, scheduling has become one of the most famous research topics within the operational research context, and it has been widely researched in practical applications within various industries, especially manufacturing, construction, and computer engineering. Accordingly, the literature is rich with many implementations of different optimization algorithms and their extensions within the project scheduling problem (PSP) analysis field. This study is intended to exhibit the general modelling of the PSP, and to survey the implementations of various optimization algorithms adopted for solving the different types of the PSP.
- Published
- 2016
16. Trace-based affine reconstruction of codes
- Author
- Mahmut Kandemir, Gabriel Rodríguez, Juan Touriño, and José M. Andión
- Subjects
- Source code, Trace scheduling, Computer science, Parallel computing, Loop tiling, Scheduling (computing), Tree traversal, Runtime system, Affine transformation, Compiler
- Abstract
Complete comprehension of loop codes is desirable for a variety of program optimizations. Compilers perform static code analyses and transformations, such as loop tiling or memory partitioning, by constructing and manipulating formal representations of the source code. Runtime systems observe and characterize application behavior to drive resource management and allocation, including dependence detection and parallelization, or scheduling. However, the source codes of target applications are not always available to the compiler or runtime system in an analyzable form. It becomes necessary to find alternate ways to model application behavior. This paper presents a novel mathematical framework to rebuild loops from their memory access traces. An exploration engine traverses a tree-like solution space, driven by the access strides in the trace. It is guaranteed that the engine will find the minimal affine nest capable of reproducing the observed sequence of accesses by exploring this space in a brute force fashion, but most real traces will not be tractable in this way. Methods for an efficient solution space traversal based on mathematical properties of the equation systems which model the solution space are proposed. The experimental evaluation shows that these strategies achieve efficient loop reconstruction, processing hundreds of gigabytes of trace data in minutes. The proposed approach is capable of correctly and minimally reconstructing 100% of the static control parts in PolyBench/C applications. As a side effect, the trace reconstruction process can be used to efficiently compress trace files. The proposed tool can also be used for dynamic access characterization, predicting over 99% of future memory accesses.
- Published
- 2016
- Full Text
- View/download PDF
17. A Function-Oriented Local Scheduling Framework for VLIW Architectures in Assembly-Level
- Author
- Chaohuan Hou, Hao Zhu, Peng Chu, Donghui Wang, and Qi Wang
- Subjects
- General Computer Science, Trace scheduling, Computer science, Very long instruction word, Scheduling (production processes), Parallel computing
- Published
- 2012
- Full Text
- View/download PDF
18. A computational study of heuristic and exact techniques for superblock instruction scheduling
- Author
- Michael Chase, R. Wayne Oldford, Peter van Beek, Tyrel Russell, and Abid M. Malik
- Subjects
- Rate-monotonic scheduling, Computer science, Trace scheduling, General Engineering, Instruction scheduling, Dynamic priority scheduling, Parallel computing, Management Science and Operations Research, Round-robin scheduling, Fair-share scheduling, Artificial Intelligence, Two-level scheduling, Greedy algorithm, Software
- Abstract
Compilers perform instruction scheduling to improve the performance of code on modern computer architectures. Superblocks—a straight-line sequence of code with a single entry point and multiple possible exit points—are a commonly used scheduling region within compilers. Superblock scheduling is NP-complete, and is done suboptimally in production compilers using a greedy algorithm coupled with a heuristic. Recently, exact schedulers have also been proposed. In this paper, we perform an extensive computational study of heuristic and exact techniques for scheduling superblocks. Our study extends previous work in using a more realistic architectural model, in not assuming perfect profile information, and in systematically investigating the case where profile information is not available. Our experimental results show that heuristics can be brittle and what looks promising under idealized (but unrealistic) conditions may not be robust in practice. As well, for the case where profile information is not available, some methods clearly dominate. Notably, a much inferior method is deployed in at least one existing compiler.
- Published
- 2012
- Full Text
- View/download PDF
19. Mathematical foundation of trace scheduling
- Author
- Utpal Banerjee
- Subjects
- Instruction set, Mathematical theory, Trace scheduling, Computer science, Microcode, Parallel computing, Compiler, Instruction-level parallelism, Upper and lower bounds, Software, Scheduling (computing)
- Abstract
Since its introduction by Joseph A. Fisher in 1979, trace scheduling has influenced much of the work on compile-time ILP (Instruction Level Parallelism) transformations. Initially developed for use in microcode compaction, it quickly became the main technique for machine-level compile-time parallelism exploitation. Although it has been used since the 1980s in many state-of-the-art compilers (e.g., Intel, Fujitsu, HP), a rigorous theory of trace scheduling is still lacking in the existing literature. This is reflected in the ad hoc way compensation code is inserted after a trace compaction, in the total absence of any attempts to measure the size of that compensation code, and so on. The aim of this article is to create a mathematical theory of the foundation of trace scheduling. We give a clear algorithm showing how to insert compensation code after a trace is replaced with its schedule, and then prove that the resulting program is indeed equivalent to the original program. We derive an upper bound on the size of that compensation code, and show that this bound can be actually attained. We also give a very simple proof that the trace scheduling algorithm always terminates.
- Published
- 2011
- Full Text
- View/download PDF
20. Data dependence graph directed scheduling for clustered VLIW architectures
- Author
- Xu Yang, Hu He, and Yihe Sun
- Subjects
- Multidisciplinary, Speedup, Trace scheduling, Very long instruction word, Computer science, Instruction scheduling, Graph (abstract data type), Parallel computing, Scheduling (computing)
- Abstract
This paper presents an instruction scheduling and cluster assignment approach for clustered very long instruction words (VLIW) processors. The technique produces high performance code by simultaneously balancing instructions among clusters and minimizing the amount of inter-cluster data communications. The scheme is evaluated based on benchmarks extracted from UTDSP. Results show a significant speedup compared with previously used techniques with speed-ups of up to 44%, with average speed-ups ranging from 14% (2-cluster) to 18% (4-cluster).
- Published
- 2010
- Full Text
- View/download PDF
21. FADAlib: an open source C++ library for fuzzy array dataflow analysis
- Author
- Denis Barthou, Adrien Eliche, Marouane Belaoucha, and Sid Ahmed Ali Touati
- Subjects
- Multi-core processor, Dataflow, Trace scheduling, Computer science, Parallel computing, Dependence analysis, Fuzzy logic, Data flow diagram, Software, Compiler, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]
- Abstract
Ubiquitous multicore architectures require that many levels of parallelism be found in codes. Dependence analysis is the main approach in compilers for the detection of parallelism. It enables vectorisation and automatic parallelisation, among many other optimising transformations, and is therefore of crucial importance for optimising compilers. This paper presents new open source software, FADAlib, performing an instance-wise dataflow analysis for scalar and array references. The software is a C++ implementation of the Fuzzy Array Dataflow Analysis (FADA) method. This method can be applied to codes with irregular control such as while-loops, if-then-else, or non-regular array accesses, and computes exact instance-wise dataflow analysis on regular codes. As far as we know, FADAlib is the first released open source C++ implementation of instance-wise data flow dependence handling larger classes of programs. In addition, the library is technically independent from any existing compiler and can be plugged into many of them; this article shows an example of a successful integration inside gcc/GRAPHITE. We give details concerning the library implementation and then report some initial results with gcc and possible use for trace scheduling on irregular codes.
- Published
- 2010
- Full Text
- View/download PDF
22. Optimal trace scheduling using enumeration
- Author
- Kent Wilken, Ghassan Shobaki, and Mark Heffernan
- Subjects
- Mathematical optimization, Schedule, Computer science, Trace scheduling, Heuristic (computer science), Instruction scheduling, Dynamic priority scheduling, Parallel computing, Fair-share scheduling, Hardware and Architecture, Basic block, Software, Information Systems
- Abstract
This article presents the first optimal algorithm for trace scheduling. The trace is a global scheduling region used by compilers to exploit instruction-level parallelism across basic block boundaries. Several heuristic techniques have been proposed for trace scheduling, but the precision of these techniques has not been studied relative to optimality. This article describes a technique for finding provably optimal trace schedules, where optimality is defined in terms of a weighted sum of schedule lengths across all code paths in a trace. The optimal algorithm uses branch-and-bound enumeration to efficiently explore the entire solution space. Experimental evaluation of the algorithm shows that, with a time limit of 1 s per problem, 91% of the hard trace scheduling problems in the SPEC CPU 2006 Integer Benchmarks are solved optimally. For 58% of these hard problems, the optimal schedule is improved compared to that produced by a heuristic scheduler with a geometric mean improvement of 3.2% in weighted schedule length and 18% in compensation code size.
- Published
- 2009
- Full Text
- View/download PDF
23. How VLIW almost disappeared - and then proliferated
- Author
- R. Lethin
- Subjects
- Trace scheduling, Computer science, Concurrency, Concurrency control, Very long instruction word, Microcode, Operating system, Compiler, Electrical and Electronic Engineering, Architecture, Instruction-level parallelism
- Abstract
Very long instruction word (VLIW) refers to a computer architecture and algorithms that take advantage of large amounts of instruction level parallelism (ILP). Joseph A. (Josh) Fisher, a former Yale professor and a Hewlett-Packard Senior Fellow, introduced VLIW architecture in the early 1980s. The insights underlying Fisher's invention of VLIW came to him when he was a graduate student at New York University's Courant Institute in the late 1970s. He was microcoding a clone of the Control Data Corporation (CDC) 6600 computer. To maximize the performance of the clone, called PUMA, he used a standard trick of writing microcode with many concurrent operations. In doing so, he realized he could get even more concurrency and performance by moving operations speculatively above branches. His study of this motion, and how it violated the laws of computer architecture of the day, led to his invention of the trace scheduling compiler algorithm. He wrote his Ph.D. thesis about trace scheduling. He later became a professor at Yale University, where he started the Extremely Long Instruction (ELI) project, which developed the architecture for the trace scheduling compiler.
- Published
- 2009
- Full Text
- View/download PDF
24. Dynamic Instruction Scheduling in a Trace-based Multi-threaded Architecture
- Author
-
Alberto F. De Souza and P. Rounce
- Subjects
Out-of-order execution ,Computer science ,Trace scheduling ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Theoretical Computer Science ,Scheduling (computing) ,Very long instruction word ,Two-level scheduling ,Cache ,Software ,Information Systems - Abstract
Simulation results are presented using the hardware-implemented, trace-based dynamic instruction scheduler of our single-process DTSVLIW architecture to schedule instructions from several processes into multiple streams of VLIW instructions for execution by a wide-issue, simultaneous multi-threading (SMT) execution engine. The scheduling process involves single-instruction execution of each process, dynamically scheduling executed instructions into blocks of VLIW instructions cached for subsequent SMT execution; SMT provides a mechanism to reduce the impact of the horizontal and vertical waste, and variable memory latencies, seen in the DTSVLIW. Preliminary experiments explore this extended model. Results achieve PE utilization of up to 87% on a 4-thread, 1-scalar, 8-PE design, with speed-ups of up to 6.3 times that of a single processor. Notably, only a single scalar process needs to be scheduled at any time, with main memory fetches being 1-4% those of a single processor.
- Published
- 2008
- Full Text
- View/download PDF
25. DAG Scheduling for Heterogeneous Systems Using Biogeography-Based Optimization
- Author
-
Kaijun Ren, Shaowei Liu, Kefeng Deng, and Junqiang Song
- Subjects
Rate-monotonic scheduling ,Mathematical optimization ,Nurse scheduling problem ,Heuristic ,Computer science ,Trace scheduling ,Two-level scheduling ,Symmetric multiprocessor system ,Flow shop scheduling ,Dynamic priority scheduling ,Round-robin scheduling ,Fair-share scheduling ,Scheduling (computing) - Abstract
An efficient scheduling algorithm is critical for DAG-based applications to obtain high performance in heterogeneous computing systems. In comparison with heuristic-based algorithms, meta-heuristic scheduling algorithms can produce better results by searching in a guided manner. Biogeography-based optimization (BBO) is a recently proposed optimization technique which has fewer parameters, faster convergence, and better performance than existing meta-heuristics. In this article, we introduce this novel optimization technique into the field of DAG scheduling. To reduce scheduling overhead, the proposed algorithm encodes only the task mapping while using a heuristic strategy to determine task ordering. Moreover, it uses heuristic-based algorithms as baselines to obtain better results. We evaluate the BBO-based scheduling algorithm using three real-world DAG-based applications under various parameter settings. The results show that the BBO-based scheduling algorithm outperforms state-of-the-art meta-heuristic algorithms.
- Published
- 2015
- Full Text
- View/download PDF
26. Instruction scheduling for a clustered VLIW processor with a word-interleaved cache
- Author
-
F. Jesus Sanchez, Enric Gibert, and Antonio González
- Subjects
Loop unrolling ,Hardware_MEMORYSTRUCTURES ,Computer Networks and Communications ,Computer science ,Cache coloring ,Trace scheduling ,Instruction scheduling ,Register file ,Pipeline burst cache ,Parallel computing ,Distributed cache ,Replication (computing) ,Computer Science Applications ,Theoretical Computer Science ,Scheduling (computing) ,Computational Theory and Mathematics ,Very long instruction word ,Cache ,Cache algorithms ,Software - Abstract
Clustering is a common technique to overcome the wire delay problem incurred by the evolution of technology. Fully distributed architectures, where the register file, the functional units and the data cache are partitioned, are particularly effective at dealing with these constraints, and moreover they are very scalable. In this paper, effective instruction scheduling techniques for a word-interleaved cache clustered VLIW processor are presented. Such scheduling techniques rely on (i) loop unrolling and variable alignment to increase the fraction of local accesses, (ii) a latency assignment process to schedule memory instructions with an appropriate latency, and (iii) different heuristics to assign memory instructions to clusters. Memory consistency is guaranteed by constraining the assignment of memory instructions to clusters. In addition, the use of Attraction Buffers is also introduced. An Attraction Buffer is a hardware mechanism that allows some data replication in order to increase the number of local accesses and, in consequence, reduce stall time. Performance results for the Mediabench benchmark suite demonstrate the effectiveness of the presented techniques and mechanisms. The number of local accesses is increased by more than 25% by using the mentioned scheduling techniques, while stall time is reduced by more than 30% when Attraction Buffers are used. Finally, depending on the scheduling heuristic, IPC results for such an architecture are 10% and 5% better than those of a clustered VLIW processor with a centralized/unified data cache. Copyright © 2006 John Wiley & Sons, Ltd.
- Published
- 2006
- Full Text
- View/download PDF
27. Compiler optimization on VLIW instruction scheduling for low power
- Author
-
Chingren Lee, Shi-Chun Tsai, Jenq Kuen Lee, and TingTing Hwang
- Subjects
Earliest deadline first scheduling ,Rate-monotonic scheduling ,Trace scheduling ,Computer science ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,Round-robin scheduling ,Computer Graphics and Computer-Aided Design ,Fair-share scheduling ,Computer Science Applications ,Two-level scheduling ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Electrical and Electronic Engineering - Abstract
In this article, we investigate compiler transformation techniques for scheduling VLIW instructions to reduce power consumption in the instruction bus of VLIW architectures. The problem can be categorized into two types: horizontal scheduling and vertical scheduling. For horizontal scheduling, we propose a bipartite-matching scheme for instruction scheduling. We prove that our greedy bipartite-matching scheme always gives the optimal switching activity of the instruction bus for a given VLIW instruction scheduling policy. For vertical scheduling, we prove that the problem is NP-hard, and we further propose a heuristic algorithm to solve it. Our experiments are performed on Alpha-based VLIW architectures and an ATOM simulator, and the compiler incorporating our proposed schemes is implemented on top of SUIF and MachSUIF. Experimental results for horizontal scheduling optimization show an average 13.30% reduction with a four-way issue architecture and an average 20.15% reduction with an eight-way issue architecture in transitional activity of the instruction bus, compared with conventional list scheduling, on an extensive set of benchmarks. The additional reduction in transitional activity from horizontal to vertical scheduling ranges from 4.57% to 10.42% (average 7.66%) with window size four, and from 6.99% to 15.25% (average 10.55%) with window size eight.
- Published
- 2003
- Full Text
- View/download PDF
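The horizontal-scheduling idea in the abstract above, choosing which issue slot each operation occupies so that consecutive instruction words differ in as few bits as possible, can be illustrated as follows. This brute-force sketch stands in for the paper's bipartite-matching scheme; it assumes slots are interchangeable (homogeneous functional units) and uses hypothetical integer op encodings.

```python
from itertools import permutations

def best_slot_order(prev_word, ops):
    """Assign the operations of one VLIW word to issue slots so that the
    Hamming distance to the previous word (bit transitions on the
    instruction bus, summed slot by slot) is minimized.  Brute force over
    all slot permutations; a bipartite matching solves this in
    polynomial time."""
    best_perm, best_cost = None, None
    for perm in permutations(ops):
        cost = sum(bin(a ^ b).count("1") for a, b in zip(prev_word, perm))
        if best_cost is None or cost < best_cost:
            best_perm, best_cost = list(perm), cost
    return best_perm, best_cost
```

For example, if the previous word is `[0b1111, 0b0000]` and the next word's operations encode as `[0b0000, 0b1111]`, swapping the two slots brings the transition count from 8 down to 0.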
28. Fast enumeration-based modulo scheduling heuristic for VLIW architectures
- Author
-
Phillipe Le Gall, Phillipe Elleaume, Said Belkouch, and Mounir Bahtat
- Subjects
Rate-monotonic scheduling ,Fixed-priority pre-emptive scheduling ,Trace scheduling ,Computer science ,Two-level scheduling ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,Round-robin scheduling ,Fair-share scheduling - Abstract
Modulo scheduling is a software pipelining technique that exploits the instruction-level parallelism (ILP) of VLIW architectures to implement loops efficiently. This paper presents a novel enumeration-based, resource-constrained heuristic for modulo scheduling. It takes into consideration the criticality of the nodes, generating near-optimal schedules in terms of initiation intervals and register requirements. The scheduling algorithm outperformed better-known heuristics in the quality of its schedules, while requiring little compilation time, enabling its use in a production environment. Experimental results on the VLIW TMS320C6678 DSP processor showed improved performance on a set of signal processing algorithms.
- Published
- 2014
- Full Text
- View/download PDF
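Modulo schedulers such as the one in the abstract above typically start from a standard lower bound on the initiation interval (II) and then search upward from it. A minimal sketch of that bound, assuming a machine with a single issue resource:

```python
from math import ceil

def mii(n_ops, issue_width, recurrences):
    """Minimum initiation interval, the standard lower bound for modulo
    scheduling.  ResMII reflects resource pressure; RecMII reflects
    dependence recurrences, each given here as a
    (total latency around the cycle, total iteration distance) pair."""
    res_mii = ceil(n_ops / issue_width)
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)
```

A scheduler then tries II = MII, MII + 1, ... until a feasible modulo schedule is found; enumeration-based heuristics like the one above differ in how they search within each candidate II.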
29. Modulo scheduler implementation for VLIW processor
- Author
-
Ingoo Heo, Yunheung Paek, Jangseop Shin, Sangjun Han, and Hyungyun Jung
- Subjects
Schedule ,Computer science ,Trace scheduling ,business.industry ,Instruction scheduling ,Parallel computing ,computer.software_genre ,Fixed-priority pre-emptive scheduling ,Software ,Software pipelining ,Very long instruction word ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,business ,computer - Abstract
For VLIW processors, the compiler must statically schedule instructions, since there is no hardware for detecting hazards and reordering instructions at runtime. Thus, instruction scheduling techniques for VLIW processors have a critical influence on correct execution and effective utilization of hardware resources. Software pipelining is a popular instruction scheduling technique which enables overlapped execution of successive loop iterations. We implemented a modulo scheduler, a widely used technique for obtaining software pipelined schedules. Experiments on a set of multimedia applications show that performance increases by up to 2.6x compared to a simple list scheduling implementation.
- Published
- 2014
- Full Text
- View/download PDF
30. Warp-aware trace scheduling for GPUs
- Author
-
Maurice Herlihy, Onur Mutlu, Thomas B. Jablin, and James A. Jablin
- Subjects
Rate-monotonic scheduling ,Trace scheduling ,Computer science ,Two-level scheduling ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Instruction-level parallelism ,Fair-share scheduling ,Branch trace - Abstract
GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths), and optimizes each trace in a context-independent way. Adapting Trace Scheduling to GPU code requires revisiting and revising each step of microcode Trace Scheduling to attend to branch and warp behavior: identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time. Here, we propose "Warp-Aware Trace Scheduling" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully-automatic optimization achieves a geometric mean speedup of 1.10× on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12× and reducing instruction serialization and total instructions executed.
- Published
- 2014
- Full Text
- View/download PDF
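The trace-formation step common to all trace scheduling variants, including the warp-aware one summarized above, can be sketched as a greedy walk over profile data. The CFG and edge frequencies here are hypothetical examples, not the paper's benchmarks.

```python
def pick_trace(cfg, freq, entry):
    """Greedily grow a trace from `entry` by always following the
    successor with the highest profiled execution frequency, stopping at
    loop-back edges (blocks already on the trace) or at blocks with no
    successors.  `cfg` maps a block to its successor list; `freq` gives
    per-edge frequencies."""
    trace, block = [entry], entry
    while cfg.get(block):
        nxt = max(cfg[block], key=lambda s: freq.get((block, s), 0))
        if nxt in trace:  # following this edge would revisit the trace
            break
        trace.append(nxt)
        block = nxt
    return trace
```

Once a trace is selected, its blocks are compacted as if they were one big block, and off-trace entries and exits receive compensation code.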
31. Aligned Scheduling: Cache-Efficient Instruction Scheduling for VLIW Processors
- Author
-
Vasileios Porpodas and Marcelo Cintra
- Subjects
Rate-monotonic scheduling ,Fixed-priority pre-emptive scheduling ,Trace scheduling ,Computer science ,Very long instruction word ,Two-level scheduling ,Instruction scheduling ,Parallel computing ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Fair-share scheduling ,Multiprocessor scheduling - Abstract
The performance of statically scheduled VLIW processors is highly sensitive to the instruction scheduling performed by the compiler. In this work we identify a major deficiency in existing instruction scheduling for VLIW processors. Unlike most dynamically scheduled processors, a VLIW processor with no load-use hardware interlocks will completely stall upon a cache-miss of any of the operations that are scheduled to run in parallel. Other operations in the same or subsequent instruction words must stall. However, if coupled with non-blocking caches, the VLIW processor is capable of simultaneously resolving multiple loads from the same word. Existing instruction scheduling algorithms do not optimize for this VLIW-specific problem.
- Published
- 2014
- Full Text
- View/download PDF
32. Continuous program optimization: Design and evaluation
- Author
-
M. Franz and T. Kistler
- Subjects
Continuous optimization ,Profiling (computer programming) ,Trace scheduling ,Computer science ,Distributed computing ,Locality ,Real-time computing ,Dynamic compilation ,Program optimization ,Data structure ,Theoretical Computer Science ,Computational Theory and Mathematics ,Hardware and Architecture ,Software - Abstract
This paper presents a system in which the already executing user code is continually and automatically reoptimized in the background, using dynamically collected execution profiles as a guide. Whenever a new code image has been constructed in the background in this manner, it is hot-swapped in place of the previously executing one. Control is then transferred to the new code and construction of yet another code image is initiated in the background. Two new runtime optimization techniques have been implemented in the context of this system: object layout adaptation and dynamic trace scheduling. The former technique constantly improves the storage layout of dynamically allocated data structures to improve data cache locality. The latter increases the instruction-level parallelism by continually adapting the instruction schedule to predominantly executed program paths. The empirical results presented in this paper make a case in favor of continuous optimization, but also indicate some of the pitfalls and current shortcomings of continuous optimization. If not applied judiciously, the costs of dynamic optimizations outweigh their benefit in many situations so that no break-even point is ever reached. In favorable circumstances, however, speed-ups of over 96 percent have been observed. It appears as if the main beneficiaries of continuous optimization are shared libraries in specific application domains which, at different times, can be optimized in the context of the currently dominant client application.
- Published
- 2001
- Full Text
- View/download PDF
33. Evolution-based scheduling of multiple variant and multiple processor programs
- Author
-
Piotr Jȩdrzejowicz, Henryk Szreder, Ireneusz Czarnowski, and Aleksander Skakowski
- Subjects
Earliest deadline first scheduling ,Rate-monotonic scheduling ,Computer Networks and Communications ,Trace scheduling ,Least slack time scheduling ,Computer science ,Distributed computing ,Scheduling (production processes) ,Dynamic priority scheduling ,Flow shop scheduling ,Round-robin scheduling ,Gang scheduling ,Deadline-monotonic scheduling ,Multiprocessor scheduling ,Fair-share scheduling ,Scheduling (computing) ,Gain scheduling ,Fixed-priority pre-emptive scheduling ,Hardware and Architecture ,Genetic algorithm scheduling ,Nurse scheduling problem ,Two-level scheduling ,Lottery scheduling ,Software - Abstract
The paper proposes to manage the complexity and cost issues of fault-tolerant programs not at the single-program level, but rather from the point of view of the whole set of such programs, which are to be run under time constraints. The paper introduces a family of scheduling problems called fault-tolerant program scheduling. Since the discussed problems are, in general, computationally difficult, the challenge is to find effective scheduling procedures. Several evolution-based algorithms solving three basic kinds of fault-tolerant program scheduling problems have been proposed. The problems involve scheduling multiple-variant or multiple-processor tasks on multiple identical processors under time constraints. To validate the algorithms, a computational experiment has been conducted. Experimental results show that evolution-based algorithms produce satisfactory to good solutions in a reasonable time.
- Published
- 2001
- Full Text
- View/download PDF
34. Retrospective: very long instruction word architectures and the ELI-512
- Author
-
Joseph A. Fisher
- Subjects
Trace scheduling ,Computer science ,Instruction scheduling ,Parallel computing ,Permission ,computer.software_genre ,Terminology ,Scheduling (computing) ,Very long instruction word ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Electrical and Electronic Engineering ,Instruction-level parallelism ,computer - Abstract
In this paper I introduced the term VLIW. VLIW was motivated by a compiler technique, and, for many readers, this paper was their introduction to "region scheduling" as well. I had put forward the first region scheduling algorithm, called trace scheduling, a few years before. Since region scheduling is a compiler technique, it is of interest to fewer people, but it enables superscalars and VLIWs with lots of instruction-level parallelism (ILP). Because I could see the power of region scheduling, I first began to think about VLIWs. I was fortunate in that this allowed me to coin the term instruction-level parallelism, and to work out a lot of the original details and terminology of ILP, before many others believed it was important.
- Published
- 2009
- Full Text
- View/download PDF
35. [Untitled]
- Author
-
Benjamin Bishop, Thomas P. Kelliher, Robert M. Owens, and Mary Jane Irwin
- Subjects
Instruction register ,Earliest deadline first scheduling ,Rate-monotonic scheduling ,Speedup ,Trace scheduling ,Computer science ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,Round-robin scheduling ,Fair-share scheduling ,Scheduling (computing) ,Instruction set ,Fixed-priority pre-emptive scheduling ,Two-level scheduling ,Signal Processing ,Electrical and Electronic Engineering ,Information Systems - Abstract
We consider the increased performance that can be obtained by using in concert, three previously proposed (and in two cases used in commercial systems) ideas. These ideas are aggressive dynamic (run time) instruction scheduling, reuse of decoded instructions, and trace scheduling. We show that these ideas complement and support one another. Hence, while each of these ideas has been shown to have merit in its own right, when used in concert, we claim the overall advantage is greater than that obtained by using any one singly. To support this claim, we present the results from running several common multimedia kernels. Overall, these results show an average speedup of 3.50 times what can be had by using dynamic instruction scheduling alone.
- Published
- 1999
- Full Text
- View/download PDF
36. Rotation scheduling: a loop pipelining algorithm
- Author
-
Andrea S. LaPaugh, Edwin H.-M. Sha, and Liang-Fang Chao
- Subjects
Rate-monotonic scheduling ,Open-shop scheduling ,I/O scheduling ,Trace scheduling ,Computer science ,Processor scheduling ,Dynamic priority scheduling ,Parallel computing ,Multiprocessor scheduling ,Fair-share scheduling ,Scheduling (computing) ,Fixed-priority pre-emptive scheduling ,Nurse scheduling problem ,High-level synthesis ,Electrical and Electronic Engineering ,Retiming ,Computer Science::Operating Systems ,Data-flow analysis ,Instruction scheduling ,Flow shop scheduling ,Round-robin scheduling ,Computer Graphics and Computer-Aided Design ,Two-level scheduling ,Heuristics ,Algorithm ,Software - Abstract
We consider the resource-constrained scheduling of loops with inter-iteration dependencies. A loop is modeled as a data flow graph (DFG), where edges are labeled with the number of iterations between dependencies. We design a novel and flexible technique, called rotation scheduling, for scheduling cyclic DFGs using loop pipelining. The rotation technique repeatedly transforms a schedule into a more compact schedule. We provide a theoretical basis for the operations based on retiming. We propose two heuristics to perform rotation scheduling, and give experimental results showing that they have very good performance.
- Published
- 1997
- Full Text
- View/download PDF
37. [Untitled]
- Author
-
Bernd Jurisch, Peter Brucker, and Andreas Krämer
- Subjects
Rate-monotonic scheduling ,Earliest deadline first scheduling ,Open-shop scheduling ,Trace scheduling ,Computer science ,Distributed computing ,Scheduling (production processes) ,General Decision Sciences ,Dynamic priority scheduling ,Management Science and Operations Research ,Gang scheduling ,Fair-share scheduling ,Multiprocessor scheduling ,Scheduling (computing) ,Fixed-priority pre-emptive scheduling ,Genetic algorithm scheduling ,Nurse scheduling problem ,Lottery scheduling ,Computer Science::Operating Systems ,Job shop scheduling ,Instruction scheduling ,Flow shop scheduling ,Round-robin scheduling ,Deadline-monotonic scheduling ,Two-level scheduling ,Minimum-cost flow problem - Abstract
In a multi-purpose machine scheduling problem, jobs or operations can be processed by any machine of prespecified subsets of the machine set. In this paper, we study the computational complexity of such scheduling problems. We study scheduling problems with identical and uniform machines as well as problems with shop characteristics.
- Published
- 1997
- Full Text
- View/download PDF
38. Session 18: Ubi/cloud computing
- Author
-
Chaohuan Hou, Tiejun Zhang, Lidan Bao, Si Xu, and Donghui Wang
- Subjects
Rate-monotonic scheduling ,Fixed-priority pre-emptive scheduling ,Trace scheduling ,Computer science ,Very long instruction word ,Two-level scheduling ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,Fair-share scheduling - Abstract
Instruction scheduling helps to hide pipeline delays and cache latencies in VLIW architectures, whose performance heavily depends on compiler optimization techniques. However, instruction conflicts across blocks also need to be taken into account while compiling. In order to solve this problem with as little performance loss as possible, the paper presents a general framework which hides the delay of conflicts in three different aspects. The experiments indicate that the proposed algorithms solve the conflicts at little cost in performance.
- Published
- 2013
- Full Text
- View/download PDF
39. A Delay Slot Scheduling Framework for VLIW Architectures in Assembly-Level
- Author
-
Hao Zhu, Zhiyuan Xue, Donghui Wang, Ying Huan, and Chaohuan Hou
- Subjects
Rate-monotonic scheduling ,Fixed-priority pre-emptive scheduling ,Trace scheduling ,Computer science ,Two-level scheduling ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,Round-robin scheduling ,Fair-share scheduling - Abstract
A delay slot scheduling framework for VLIW architectures is presented in this paper. In this framework, an assembly data dependence graph is proposed to describe the data dependences among instructions in a basic block. An assembly control flow graph is also proposed to describe control relations among basic blocks. With the help of predicate analysis, the majority of invalid control flows are detected. Along the edges of these two graphs, instructions are selected to fill the delay slots. Additionally, a scheme to balance local and global scheduling is proposed. The framework is evaluated by several experiments on the SuperV-DSP processor, and the results demonstrate the effectiveness of the new approach.
- Published
- 2013
- Full Text
- View/download PDF
40. Compiler-assisted leakage energy optimization of media applications on stream architectures
- Author
-
Guoyue Jiang, Zhaolin Li, Shaojun Wei, Shan Cao, and Zhixiang Chen
- Subjects
Rate-monotonic scheduling ,Idle ,Fixed-priority pre-emptive scheduling ,business.industry ,Trace scheduling ,Computer science ,Embedded system ,Two-level scheduling ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,business ,Fair-share scheduling - Abstract
Stream architecture is emerging as an important architecture for improving the performance of media applications. With technology scaled down to the nanometer regime, leakage energy accounts for a greater proportion of total energy consumption than ever, especially for stream architectures with a large number of functional units. In this paper, a compiler-assisted instruction-level scheduling algorithm is proposed to optimize leakage energy through idle-interval distribution for stream architectures. The algorithm explores the scheduling space of idle intervals spatially and temporally and gathers the idle intervals for power-gating. Leakage energy is optimized without performance loss by increasing the number of efficiently power-gated cycles and decreasing the number of switches. We implement the proposed scheduling algorithm on the Imagine processor and employ a set of benchmarks to evaluate its effectiveness. Experimental results show that leakage energy consumption is reduced by 52% on average compared with the list scheduling algorithm.
- Published
- 2013
- Full Text
- View/download PDF
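The core decision in idle-interval power-gating, as in the abstract above, is whether an idle interval is long enough to amortize the energy cost of switching a unit off and back on. A minimal sketch, with a hypothetical break-even threshold in cycles:

```python
def power_gate_savings(busy_cycles, total_cycles, breakeven):
    """Count the idle cycles of one functional unit coverable by
    power-gating: an idle interval saves leakage only if it is at least
    `breakeven` cycles long, otherwise switching costs dominate.
    `busy_cycles` is the set of cycles in which the unit executes."""
    gated, idle = 0, 0
    for c in range(total_cycles + 1):  # the +1 flushes the final interval
        if c < total_cycles and c not in busy_cycles:
            idle += 1
        else:
            if idle >= breakeven:
                gated += idle
            idle = 0
    return gated
```

A scheduler that gathers idle cycles into fewer, longer intervals raises this count without changing the total amount of idle time, which is exactly the effect the paper's algorithm targets.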
41. The Compiler Forest
- Author
-
Joel Galenson, Gordon Plotkin, and Mihai Budiu
- Subjects
business.industry ,Computer science ,Programming language ,Trace scheduling ,Parallel computing ,Modular design ,computer.software_genre ,Set (abstract data type) ,Compiler construction ,Computer cluster ,A-normal form ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Software_PROGRAMMINGLANGUAGES ,business ,Software architecture ,computer - Abstract
We address the problem of writing compilers targeting complex execution environments, such as computer clusters composed of machines with multi-core CPUs. To that end we introduce partial compilers. These compilers can pass sub-programs to several child (partial) compilers, combining the code generated by their children to generate the final target code. We define a set of high-level polymorphic operations manipulating both compilers and partial compilers as first-class values. These mechanisms provide a software architecture for modular compiler construction. This allows the building of a forest of compilers, providing a structured treatment of multistage compilers.
- Published
- 2013
- Full Text
- View/download PDF
42. Trace software pipelining
- Author
-
M. Anton Ertl, Jian Wang, and Andreas Krall
- Subjects
Trace scheduling ,Loop inversion ,Computer science ,business.industry ,Pipeline (computing) ,Parallel computing ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Loop scheduling ,Software ,Software pipelining ,Computational Theory and Mathematics ,Hardware and Architecture ,Very long instruction word ,Superscalar ,Compiler ,Instruction-level parallelism ,business ,computer - Abstract
Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches. This paper presents a novel global software pipelining technique, called Trace Software Pipelining, targeted at instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines. Trace software pipelining applies a global code scheduling technique to compact the original loop body. The resulting loop is called a trace software pipelined (TSP) code. The trace software pipelined code can be directly executed with special architectural support or can be transformed into a globally software pipelined loop for current VLIW and superscalar processors. Thus, exploiting parallelism across all iterations of a loop can be accomplished by compacting the original loop body with any global code scheduling technique. This makes our new technique very promising in practical compilers. Finally, we also present preliminary experimental results to support our new approach.
- Published
- 1995
- Full Text
- View/download PDF
43. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism
- Author
-
Jack L. Lo and Susan J. Eggers
- Subjects
Loop unrolling ,Rate-monotonic scheduling ,Schedule ,Trace scheduling ,Computer science ,Instruction scheduling ,Dynamic priority scheduling ,Parallel computing ,Round-robin scheduling ,Computer Graphics and Computer-Aided Design ,Fair-share scheduling ,Microarchitecture ,Scheduling (computing) ,Fixed-priority pre-emptive scheduling ,Two-level scheduling ,Cache ,Instruction-level parallelism ,Software - Abstract
Traditional list schedulers order instructions based on an optimistic estimate of the load latency imposed by the hardware and therefore cannot respond to variations in memory latency caused by cache hits and misses on non-blocking architectures. In contrast, balanced scheduling schedules instructions based on an estimate of the amount of instruction-level parallelism in the program. By scheduling independent instructions behind loads based on what the program can provide, rather than what the implementation stipulates in the best case (i.e., a cache hit), balanced scheduling can hide variations in memory latencies more effectively. Since its success depends on the amount of instruction-level parallelism in the code, balanced scheduling should perform even better when more parallelism is available. In this study, we combine balanced scheduling with three compiler optimizations that increase instruction-level parallelism: loop unrolling, trace scheduling and cache locality analysis. Using code generated for the DEC Alpha by the Multiflow compiler, we simulated a non-blocking processor architecture that closely models the Alpha 21164. Our results show that balanced scheduling benefits from all three optimizations, producing average speedups that range from 1.15 to 1.40, across the optimizations. More importantly, because of its ability to tolerate variations in load interlocks, it improves its advantage over traditional scheduling. Without the optimizations, balanced scheduled code is, on average, 1.05 times faster than that generated by a traditional scheduler; with them, its lead increases to 1.18.
- Published
- 1995
- Full Text
- View/download PDF
44. Compiling real-time programs with timing constraint refinement and structural code motion
- Author
-
Richard Gerber and Seongsoo Hong
- Subjects
Intermediate language ,Correctness ,Source code ,Static single assignment form ,Computer science ,Programming language ,Trace scheduling ,media_common.quotation_subject ,Optimizing compiler ,Static timing analysis ,Semantics ,computer.software_genre ,Execution time ,High-level programming language ,computer ,Software ,media_common - Abstract
We present a programming language called TCEL (Time-Constrained Event Language), whose semantics are based on time-constrained relationships between observable events. Such a semantics infers only those timing constraints necessary to achieve real-time correctness, without overconstraining the system. Moreover, an optimizing compiler can exploit this looser semantics to help tune the code, so that its worst-case execution time is consistent with its real-time requirements. In this paper we describe such a transformation system, which works in two phases. First, the TCEL source code is translated into an intermediate representation. Then an instruction-scheduling algorithm rearranges selected unobservable operations and synthesizes tasks guaranteed to respect the original event-based constraints.
- Published
- 1995
- Full Text
- View/download PDF
45. ON OPTIMAL LOOP UNROLLING IN TWO-PROCESSOR SCHEDULING
- Author
-
Hesham H. Ali and Hesham El-Rewini
- Subjects
Loop unrolling ,General Computer Science ,Computer science ,Trace scheduling ,Optimal scheduling ,Parallel algorithm ,Processor scheduling ,Parallel computing ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Single loop ,Two stages ,Scheduling (computing) - Abstract
In this paper, we study the problem of scheduling loop task graphs onto two-processor systems to minimize the execution time. We use the idea of loop unrolling to uncover dependences among different loop iterations in loop task graphs. The scheduling is divided into two stages: determining an optimal unrolling factor, and constructing an optimal schedule for this unrolling factor. We propose a criterion for optimal loop unrolling for unit-delay task graphs, where the tasks are enclosed in a single loop. Our main theorem presents a useful test for detecting optimal unrolling factors. A scheduling algorithm that employs the concept of optimal loop unrolling is introduced. We also present serial and parallel algorithms to find the best unrolling factor.
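The unrolling transformation itself can be illustrated in toy form. This is a hypothetical Python sketch of unrolling a counted loop by a factor, which exposes several iterations at once for a scheduler to interleave across processors; it ignores the paper's optimality criterion:

```python
def unroll(body_fn, n, factor):
    """Unroll a counted loop by `factor`: the unrolled body executes
    `factor` iterations at once, exposing cross-iteration parallelism
    to a scheduler; a remainder loop handles leftover iterations."""
    i = 0
    while i + factor <= n:
        for k in range(factor):   # the `factor` unrolled copies of the body
            body_fn(i + k)
        i += factor
    while i < n:                  # remainder loop
        body_fn(i)
        i += 1

out = []
unroll(out.append, 10, 4)
print(out)  # same visible result as the rolled loop over range(10)
```

Choosing `factor` is the hard part this paper addresses: too small an unrolling factor hides cross-iteration independence from the two-processor scheduler, while too large a factor inflates the task graph without improving the schedule.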
- Published
- 1995
- Full Text
- View/download PDF
46. Two-Dimensional Dynamic Loop Scheduling Schemes for Computer Clusters
- Author
-
Satish Penmatsa, Eric Ogharandukun, Anthony T. Chronopoulos, and Naveen Jayakumar
- Subjects
Rate-monotonic scheduling ,Earliest deadline first scheduling ,Open-shop scheduling ,I/O scheduling ,Least slack time scheduling ,Computer science ,Trace scheduling ,Distributed computing ,Processor scheduling ,Workload ,Dynamic priority scheduling ,Flow shop scheduling ,Parallel computing ,Round-robin scheduling ,Gang scheduling ,Fair-share scheduling ,Stride scheduling ,Multiprocessor scheduling ,Scheduling (computing) ,Fixed-priority pre-emptive scheduling ,Loop scheduling ,Computer cluster ,Two-level scheduling ,Lottery scheduling ,Computer Science::Operating Systems - Abstract
Efficient scheduling of parallel loops in a network of computers can significantly reduce the total execution time of complex scientific applications. In this paper, we compare the performance of two-dimensional dynamic loop scheduling schemes for computer clusters with that of one-dimensional loop scheduling schemes. The loop scheduling schemes are implemented using the Message Passing Interface on a cluster of processors. Experimental results show that the two-dimensional schemes significantly reduce the total execution time of tasks relative to the one-dimensional schemes. In addition, the two-dimensional schemes produce a more balanced distribution of the workload among the computers in the cluster.
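The two-dimensional idea, treating rectangular tiles of the iteration space (rather than runs of a single loop index) as the unit of work a worker grabs, can be sketched as follows. This is a hypothetical Python sketch with fixed tile sizes; the schemes in the paper size their chunks dynamically:

```python
def blocks_2d(n_rows, n_cols, br, bc):
    """Partition an n_rows x n_cols iteration space into tiles of at
    most br x bc iterations; each tile is one schedulable unit that a
    cluster worker would request from the scheduler."""
    tiles = []
    for r0 in range(0, n_rows, br):
        for c0 in range(0, n_cols, bc):
            tiles.append((r0, min(r0 + br, n_rows),
                          c0, min(c0 + bc, n_cols)))
    return tiles

tiles = blocks_2d(8, 8, 4, 2)
print(len(tiles))  # 2 row bands x 4 column bands = 8 tiles
```

A one-dimensional scheme would only carve the outer loop into chunks of whole rows; tiling both dimensions yields more, smaller units, which is what allows the finer load balancing the experiments report.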
- Published
- 2012
- Full Text
- View/download PDF
47. Low Power Instructions Scheduling Based on Ant Colony Optimization
- Author
-
Yong Chen, Yanxiang He, Qian Liu, Ximi Liao, and Nian Chen
- Subjects
Rate-monotonic scheduling ,Mathematical optimization ,Trace scheduling ,Computer science ,Nurse scheduling problem ,Ant colony optimization algorithms ,Two-level scheduling ,Dynamic priority scheduling ,Parallel computing ,Metaheuristic ,Fair-share scheduling - Abstract
In this paper, we focus on the problem of instruction scheduling based on the meta-heuristic algorithm Ant Colony Optimization (ACO). During the execution of a program, control flows from one instruction to the next, and each pair of consecutive instructions causes binary transitions corresponding to the bit differences between their encodings. Our method pursues an optimization strategy that reorders instructions to reduce the total number of binary transitions during execution. Since finding the best schedule is NP-hard, ACO is used to search for an approximate but effective solution to this combinatorial optimization problem. The final goal is to minimize low-level power dissipation.
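The cost being minimized can be written down directly: the number of bit flips between the encodings of consecutive instructions. This is a toy Python sketch with hypothetical 8-bit encodings, comparing two orderings by hand; a real scheduler would let ACO search only over dependence-legal permutations:

```python
def transitions(order, code):
    """Total bit flips (Hamming distance summed over consecutive pairs)
    on the instruction bus for a given instruction ordering."""
    return sum(bin(code[a] ^ code[b]).count("1")
               for a, b in zip(order, order[1:]))

# Hypothetical 8-bit encodings of four independent instructions.
code = {"i0": 0b00000000, "i1": 0b11111111,
        "i2": 0b00001111, "i3": 0b11110000}

bad  = ["i0", "i1", "i2", "i3"]   # 8 + 4 + 8 bit flips
good = ["i0", "i2", "i1", "i3"]   # 4 + 4 + 4 bit flips
print(transitions(bad, code), transitions(good, code))
```

This is an instance of a minimum-weight Hamiltonian path over the instructions, which is why the problem is NP-hard and why a metaheuristic such as ACO is a natural fit.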
- Published
- 2012
- Full Text
- View/download PDF
48. Reducing branch costs via branch alignment
- Author
-
Dirk Grunwald and Brad Calder
- Subjects
Trace scheduling ,Computer science ,Emphasis (telecommunications) ,Basic block ,General Earth and Planetary Sciences ,Cache ,Parallel computing ,Branch predictor ,Computer Graphics and Computer-Aided Design ,Software ,General Environmental Science - Abstract
Several researchers have proposed algorithms for basic block reordering, which we call branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular, the importance of reducing branch costs will increase, and branch alignment is one mechanism that can effectively reduce these costs. In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that program performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compiled by any existing compiler and then transformed via binary transformations. When implementing these algorithms on an Alpha AXP 21064, up to a 16% reduction in total execution time is achieved.
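A minimal greedy version of the reordering idea, laying out each block's hottest successor as its fall-through so the frequent path takes no branches, might look like the following. This is an illustrative Python sketch, not the authors' algorithm; the block names and profile edges are hypothetical:

```python
def align(blocks, hot_succ, entry):
    """Greedy branch alignment: chain each block to its most frequent
    successor so the hot path becomes straight-line fall-through code,
    then append the remaining (cold) blocks afterwards."""
    order, placed = [], set()
    b = entry
    while b is not None and b not in placed:
        order.append(b)
        placed.add(b)
        b = hot_succ.get(b)        # follow the hottest outgoing edge
    order += [x for x in blocks if x not in placed]  # cold blocks
    return order

blocks = ["A", "B", "C", "D"]
hot_succ = {"A": "C", "C": "D"}    # profile says A->C and C->D are hot
print(align(blocks, hot_succ, "A"))
```

In the toy layout the hot path A, C, D executes with no taken branches, while the cold block B is pushed out of line; the paper's contribution is to drive this reordering with the real architectural cost model and the branch predictor's behavior rather than a simple greedy chain.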
- Published
- 1994
- Full Text
- View/download PDF
49. Avoidance and suppression of compensation code in a trace scheduling compiler
- Author
-
P. Geoffrey Lowney, Stefan M. Freudenberger, and Thomas Gross
- Subjects
Dead code ,Trace scheduling ,Computer science ,Loop-invariant code motion ,Instruction scheduling ,Unreachable code ,Parallel computing ,Redundant code ,Software ,Dead code elimination ,Fair-share scheduling - Abstract
Trace scheduling is an optimization technique that selects a sequence of basic blocks as a trace and schedules the operations from the trace together. If an operation is moved across basic block boundaries, one or more compensation copies may be required in the off-trace code. This article discusses the generation of compensation code in a trace scheduling compiler and presents techniques for limiting the amount of compensation code: avoidance (restricting code motion so that no compensation code is required) and suppression (analyzing the global flow of the program to detect when a copy is redundant). We evaluate the effectiveness of these techniques based on measurements for the SPEC89 suite and the Livermore Fortran Kernels, using our implementation of trace scheduling for a Multiflow Trace 7/300. The article compares different compiler models contrasting the performance of trace scheduling with the performance obtained from typical RISC compilation techniques. There are two key results of this study. First, the amount of compensation code generated is not large. For the SPEC89 suite, the average code size increase due to trace scheduling is 6%. Avoidance is more important than suppression, although there are some kernels that benefit significantly from compensation code suppression. Since compensation code is not a major issue, a compiler can be more aggressive in code motion and loop unrolling. Second, compensation code is not critical to obtain the benefits of trace scheduling. Our implementation of trace scheduling improves the SPEC mark rating by 30% over basic block scheduling, but restricting trace scheduling so that no compensation code is required improves the rating by 25%. This indicates that most basic block scheduling techniques can be extended to trace scheduling without requiring any complicated compensation code bookkeeping.
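The bookkeeping for one compensation case, moving an operation downward past a side exit of the trace, can be sketched as follows. This is a toy Python sketch assuming traces are lists of strings and side exits are branch positions in the trace; the article's actual machinery also covers side entrances (joins) and upward motion:

```python
def move_below_side_exit(trace, op_idx, target_idx, side_exits):
    """Move trace[op_idx] down to position target_idx. Any side exit
    crossed by the motion must receive a compensation copy of the
    operation, so the off-trace path still observes its effect."""
    op = trace[op_idx]
    comp = {}  # side-exit position -> compensation copies needed there
    for e in side_exits:
        if op_idx < e <= target_idx:   # exit lies between old and new spot
            comp.setdefault(e, []).append(op)
    new_trace = (trace[:op_idx] + trace[op_idx + 1:target_idx + 1]
                 + [op] + trace[target_idx + 1:])
    return new_trace, comp

trace = ["x=1", "br L", "y=2"]          # "br L" exits the trace at index 1
new, comp = move_below_side_exit(trace, 0, 2, side_exits=[1])
print(new)   # "x=1" now sits below the branch on the trace
print(comp)  # the exit path at L needs its own copy of "x=1"
```

Avoidance, in these terms, means refusing motions for which `comp` would be non-empty; suppression means proving, via global flow analysis, that a generated copy is redundant on the off-trace path and deleting it.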
- Published
- 1994
- Full Text
- View/download PDF
50. Instruction window size trade-offs and characterization of program parallelism
- Author
-
George B. Adams, Pradeep Dubey, and Michael J. Flynn
- Subjects
Trace scheduling ,Fortran ,Computer science ,Concurrency ,Speculative execution ,Instruction window ,Parallel computing ,computer.software_genre ,Theoretical Computer Science ,Scheduling (computing) ,Concurrency control ,Computational Theory and Mathematics ,Hardware and Architecture ,Superscalar ,Compiler ,Instruction-level parallelism ,computer ,Software ,computer.programming_language - Abstract
Detecting independent operations is a prime objective for computers that are capable of issuing and executing multiple operations simultaneously. The number of instructions that are simultaneously examined for detecting those that are independent is the scope of concurrency detection. The authors present an analytical model for predicting the performance impact of varying the scope of concurrency detection as a function of available resources, such as the number of pipelines in a superscalar architecture. The model developed can show where a performance bottleneck might be: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative execution is examined via a set of probability distributions that characterize the inherent parallelism in the instruction stream. These results were derived using traces from the Multiflow TRACE SCHEDULING compacting FORTRAN 77 and C compilers. The experiments provide misprediction delay estimates for 11 common application-level benchmarks under scope constraints, assuming speculative, out-of-order execution and run-time scheduling. The throughput prediction of the analytical model is shown to be close to the measured static throughput of the compiler output.
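The effect of scope can be illustrated with a toy issue model: within a lookahead window of a given size, issue together every instruction whose producers have already completed. This is a hypothetical Python sketch; the paper's model is analytical and additionally accounts for speculation cost:

```python
def issue_groups(trace, deps, window):
    """Greedy grouped issue: in each cycle, look at the oldest `window`
    unissued instructions and issue every one whose dependences (given
    as producer index sets in `deps`) are already satisfied."""
    done, groups = set(), []
    pending = list(range(len(trace)))
    while pending:
        group = [i for i in pending[:window]
                 if deps.get(i, set()) <= done]
        if not group:              # dependence stall: issue oldest alone
            group = [pending[0]]
        groups.append([trace[i] for i in group])
        done |= set(group)
        pending = [i for i in pending if i not in group]
    return groups

trace = ["a", "b", "c", "d"]
deps = {1: {0}, 3: {2}}            # b depends on a, d depends on c
print(len(issue_groups(trace, deps, window=4)))  # wide scope finds pairs
print(len(issue_groups(trace, deps, window=1)))  # narrow scope serializes
```

With a window of 4 the independent pairs (a, c) and (b, d) issue together in two cycles; with a window of 1 the same code takes four cycles, even though the machine resources are identical. This is the scope bottleneck the model isolates from the resource and program-parallelism bottlenecks.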
- Published
- 1994
- Full Text
- View/download PDF