Descriptor: "Data dependence" / Topic: parallel computing - Searchworks@Jio Institute Digital Library Search Results

2. Skipping Non-essential Instructions Makes Data-Dependence Profiling Faster

Author: Nicolas Morew, Mohammad Norouzi, Ali Jannesari, and Felix Wolf
Subjects: 010302 applied physics, Profiling (computer programming), Computer science, Data dependence, 020207 software engineering, 02 engineering and technology, Parallel computing, Static analysis, computer.software_genre, 01 natural sciences, Pointer (computer programming), 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Compiler, computer, Compile time
Abstract: Data-dependence profiling is a dynamic program-analysis technique to discover potential parallelism in sequential programs. Unlike purely static analysis, which may overestimate the number of dependences because it does not know many pointer values and array indices at compile time, profiling has the advantage of recording data dependences that actually occur at runtime. But it has the disadvantage of significantly slowing down program execution, often by a factor of 100. In our earlier work, we lowered the overhead of data-dependence profiling by excluding polyhedral loops, which can be handled statically using certain compilers. However, neither does every program contain polyhedral loops, nor are statically identifiable dependences restricted to such loops. In this paper, we introduce an orthogonal approach, focusing on data dependences between accesses to scalar variables - across the entire program, inside and outside loops. We first analyze the program statically and identify memory-access instructions that create data dependences that would appear in any execution of these instructions. Then, we exclude these instructions from instrumentation, allowing the profiler to skip them at runtime and avoid the associated overhead. We evaluate our approach with 49 benchmarks from three benchmark suites. We improved the profiling time of all programs by at least 38%, with a median reduction of 61% across all the benchmarks.
Published: 2020
Full Text: View/download PDF

3. GOPipe

Author: Jidong Zhai, Youngmin Yi, Zhen Zheng, Chanyoung Oh, and Xipeng Shen
Subjects: 020203 distributed computing, Computer science, Computation, Data dependence, 020207 software engineering, 02 engineering and technology, Parallel computing, computer.software_genre, Stencil, Task (project management), Pipeline transport, Software framework, Software portability, Programming productivity, Factor (programming language), 0202 electrical engineering, electronic engineering, information engineering, Granularity, computer, computer.programming_language
Abstract: Recent studies have shown promising performance benefits when multiple stages of a pipelined stencil application are mapped to different parts of a GPU to run concurrently. An important factor for the computing efficiency of such pipelines is the granularity of a task. In previous programming frameworks that support true pipelined computations on GPU, the choice has to be made by the programmers during the application development time. Due to many difficulties, programmers' decisions are often far from optimal, causing inferior performance and performance portability. This paper presents GOPipe, a granularity-oblivious programming framework for efficient pipelined stencil executions on GPU. With GOPipe, programmers no longer need to specify the appropriate task granularity. GOPipe automatically finds it, and dynamically schedules tasks of that granularity for efficiency while observing all inter-task and inter-stage data dependencies. In our experiments on six real-life applications and various scenarios, GOPipe outperforms the state-of-the-art system by 1.39X on average with a much better programming productivity.
Published: 2019
Full Text: View/download PDF

4. Design of a Dual-Warp Scheduler for Streaming Multi-Processors Based GP-GPU

Author: Jong Joon Park and Do-Hyun Kim
Subjects: Stream processing, symbols.namesake, General Computer Science, Computer science, Superscalar, Data dependence, symbols, Deadline scheduler, Image processing, Parallel computing, Gaussian filter, Scheduling (computing)
Abstract: In this paper, a warp scheduler is proposed to improve the performance of a GP-GPU based on multi-core stream processors. The proposed scheduler divides warps into odd and even warps, which are issued separately by applying dual-warp issue. Furthermore, it can simultaneously process up to four instructions because each warp can issue two instructions through superscalar issue. Superscalar issue has the limitation that it cannot simultaneously process two instructions that have a data dependence. To address this limitation, the warp scheduler decides whether to issue an instruction by testing the issuing condition of the multi-core processor and the read/write register dependence. Round-robin was used as the scheduling algorithm. To measure the performance of multi-core stream processors, the Gaussian filter mask processing result of the GP-GPU using the proposed warp scheduler was compared with that of the multi-core CPU on various embedded platforms. The experimental results showed that the processing speed of the GP-GPU using the warp scheduler was 6-7 times faster. The GP-GPU also performed better on an image processing application.
Published: 2016
Full Text: View/download PDF

5. Accelerating Data-Dependence Profiling with Static Hints

Author: Mohammad Norouzi, Felix Wolf, Ali Jannesari, and Qamar Ilias
Subjects: 010302 applied physics, Profiling (computer programming), Computer science, Data dependence, 020207 software engineering, 02 engineering and technology, Parallel computing, Static analysis, Dependence analysis, 01 natural sciences, Pointer (computer programming), 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Persistent data structure, Merge (version control)
Abstract: Data-dependence profiling is a program-analysis technique to discover potential parallelism in sequential programs. In contrast to purely static dependence analysis, profiling has the advantage that it captures only those dependences that actually occur during execution. Lacking critical runtime information such as the values of pointers and array indices, purely static analysis may overestimate the number of dependences. On the downside, dependence profiling significantly slows down the program, often prolonging execution by a factor of 100. In this paper, we propose a hybrid approach that substantially reduces this overhead. First, we statically identify persistent data dependences that will appear in any execution. We then exclude the affected source-code locations from instrumentation, allowing the profiler to skip them at runtime and avoiding the associated overhead. At the end, we merge static and dynamic dependences. We evaluated our approach with 38 benchmarks from two benchmark suites and obtained a median reduction of the profiling time by 62% across all the benchmarks.
Published: 2019
Full Text: View/download PDF

6. HITSM: A Heuristic Algorithm for Independent Task Scheduling in Multicore

Author: Yujie Huang, Xiaoyang Zeng, Minge Jing, Chen-Lu Li, and Quan Zhang
Subjects: Multi-core processor, Job shop scheduling, Computer science, Server, Data dependence, 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, Parallel computing, Load balancing (computing), 020202 computer hardware & architecture, Scheduling (computing)
Abstract: Nowadays, multicore processors are widely used in servers that are shared by many users, and the tasks submitted by different users are usually data-independent. However, existing scheduling algorithms for multicore always take data dependence into consideration, which limits multicore performance. Therefore, we propose a heuristic algorithm called HITSM for independent task scheduling in multicore. In the proposed HITSM, we take makespan and load balancing into consideration. The experimental results demonstrate that, compared to the First Come First Served (FCFS) and Min-min algorithms, the proposed HITSM reduces the makespan by 9.4% and 6.8%, respectively, and improves load-balancing performance by 27.6 and 25.2 times, respectively, on heterogeneous cores.
Published: 2018
Full Text: View/download PDF

7. Prediction based convolution neural network acceleration

Author: Zhonghai Lu and Yuan Yao
Subjects: 010302 applied physics, Computer science, Data dependence, 02 engineering and technology, Parallel computing, Work in process, 01 natural sciences, Convolutional neural network, 020202 computer hardware & architecture, Convolution, Acceleration, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Parallelism (grammar), Algorithm, Order of magnitude
Abstract: Although intra-layer parallelism is commonly used to expedite CNN execution, it is difficult to achieve inter-layer parallelism because of data dependence between layers. In this paper, we propose a two-phase prediction and correction mechanism to break the data dependence between CNN layers so as to enable inter-layer parallelism. Our technique achieves one more order of magnitude (from the order of 10 to the order of 100) of CNN acceleration compared to three other state-of-the-art GPU-based CNN acceleration mechanisms.
Published: 2017
Full Text: View/download PDF

8. Approximate Data Dependence Graph Generation Using Adaptive Sampling

Author: Mostafa M. Abbas and Ahmed El-Mahdy
Subjects: 010302 applied physics, Profiling (computer programming), Adaptive sampling, Computer science, Data dependence, Binary number, 02 engineering and technology, Parallel computing, 01 natural sciences, 020202 computer hardware & architecture, Program analysis, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), Instrumentation (computer programming), Graph generation, Algorithm, Data-flow analysis
Abstract: Identifying data dependence among loop iterations is a fundamental step in the parallelisation process. Generally, code instrumentation provides such information at the expense of a high runtime performance penalty. This paper proposes an efficient method that trades a slight accuracy reduction for a significant performance gain to generate an approximate dependence graph. The proposed method relies on replicating the loop under test, providing instrumented and non-instrumented code versions, and adaptively switching between them, as well as deciding on the analysis detail, depending on the stability of measured dependence distances. Moreover, the method utilises random sampling, decreasing the chances of missing dependent irregular memory accesses. An initial performance investigation of the method is conducted using the Pin binary instrumentation tools. Results on selected PolyBench kernels show up to 8.5× improvement in instrumentation time, with no missed dependencies in 14 kernels and 45% missed dependencies in one kernel.
Published: 2016
Full Text: View/download PDF

9. Selective restart of threads for efficient thread-level speculation on multicore architecture

Author: Inhwan Lee and Sungjae Lee
Subjects: Speedup, Recovery method, Computer science, Data dependence, Speculative multithreading, Parallel computing, Thread (computing), ComputerSystemsOrganization_PROCESSORARCHITECTURES, Electrical and Electronic Engineering, Condensed Matter Physics, Multicore architecture, Electronic, Optical and Magnetic Materials
Abstract: An efficient recovery method for thread-level speculation (TLS) is proposed. The method tracks the inter-thread data dependence as a method for identifying those threads that are obviously unaffected by a data dependence violation. The method is simple to implement. Still, the simulation results using benchmark applications show that the method can significantly reduce the number of unnecessary thread restarts and consequently improve the performance of TLS. Specifically, when compared with the baseline TLS, TLS with the proposed method is 2.3 times faster for IS, 1.7 times faster for equake, and 3.5 times faster for mcf with the use of 64 cores. With the method, the performance of TLS increases steadily up to 64 cores for IS, equake, and mcf, while the speedup of the baseline TLS starts to saturate at 8 or 16 cores.
Published: 2012
Full Text: View/download PDF

10. Optimizing Scheduling Technology for Clustered VLIW Architectures Using Data Dependence Graph

Author: Xu Yang, Yi-He Sun, and Hu He
Subjects: Computer Networks and Communications, Hardware and Architecture, Computer science, Very long instruction word, Data dependence, Graph (abstract data type), Parallel computing, Computer Graphics and Computer-Aided Design, Software, Scheduling (computing)
Published: 2011
Full Text: View/download PDF

11. An Instruction Scheduler for Dynamic ALU Cascading Adoption

Author: Shinji Tomita, Hajime Shimada, Kosuke Ogata, Hiroshi Nakashima, Shinobu Miwa, and Jun Yao
Subjects: General Computer Science, business.industry, Computer science, Clock rate, Data dependence, Instruction scheduling, Workload, Energy consumption, Parallel computing, Energy minimization, Scheduling (computing), Embedded system, Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING, Hardware_ARITHMETICANDLOGICSTRUCTURES, business, Voltage
Abstract: To reduce processor energy consumption under low-workload, low-clock-frequency execution, a possible solution is to use ALU cascading while keeping the supply voltage unchanged. This cascading scheme uses a single cycle to execute multiple ALU instructions that have a data dependence relationship between them and thus saves clock cycles for the whole execution. Since processor energy consumption is the product of power and execution time, ALU cascading is expected to help energy optimization for microprocessors operating at low frequency. To implement ALU cascading in a current superscalar processor, a specific instruction scheduler is required to wake up a pair of cascadable instructions simultaneously despite the data dependence relationship between them. Furthermore, ALU cascading is only applied in the low-clock-frequency execution mode, so the instruction scheduler must also support standard scheduling for normal-clock-frequency execution. In this paper, we propose an instruction scheduling method that enables the additional wakeup features needed for ALU cascading without large hardware extensions. With this scheduler, the average IPC improvement is 3.7% in SPECint2000 and 6.4% in Mediabench, as compared to the baseline execution. The delay of the additional hardware required for ALU cascading is also evaluated to study the complexity of ALU cascading.
Published: 2009
Full Text: View/download PDF

12. An Efficient Data-Dependence Profiler for Sequential and Parallel Programs

Author: Felix Wolf, Zhen Li, and Ali Jannesari
Subjects: Profiling (computer programming), Automatic parallelization, Program analysis, Memory management, Computer science, Pointer (computer programming), Data dependence, Performance tuning, Parallel computing, Scheduling (computing)
Abstract: Extracting data dependences from programs serves as the foundation of many program analysis and transformation methods, including automatic parallelization, runtime scheduling, and performance tuning. To obtain data dependences, more and more related tools are adopting profiling approaches because they can track dynamically allocated memory, pointers, and array indices. However, dependence profiling suffers from high runtime and space overhead. To lower the overhead, earlier dependence profiling techniques exploit features of the specific program analyses they are designed for. As a result, every program analysis tool in need of data-dependence information requires its own customized profiler. In this paper, we present an efficient and at the same time generic data-dependence profiler that can be used as a uniform basis for different dependence-based program analyses. Its lock-free parallel design reduces the runtime overhead to around 86× on average. Moreover, signature-based memory management adjusts space requirements to practical needs. Finally, to support analyses and tuning approaches for parallel programs such as communication pattern detection, our profiler produces detailed dependence records not only for sequential but also for multi-threaded code.
Published: 2015
Full Text: View/download PDF

13. Parallelization and performance tuning of molecular dynamics code with OpenMP

Author: Kui-lin Lu, Shu-ren Bai, and Liping Ran
Subjects: Molecular dynamics, Automatic parallelization, Shared memory, Scale (ratio), Mechanics of Materials, Computer science, Mechanical Engineering, Improved algorithm, Data dependence, Performance tuning, Code (cryptography), General Materials Science, Parallel computing
Abstract: An OpenMP approach was proposed to parallelize sequential molecular dynamics (MD) code on shared-memory machines. When a code is converted from sequential to parallel form, data dependence is a main problem. A traditional sequential molecular dynamics code is analyzed to find the segments with data dependences, and two different methods, i.e., the recover method and the backward mapping method, were used to eliminate those data dependences in order to parallelize this sequential MD code. The performance of the parallelized MD code was analyzed using performance analysis tools. The test results show that the computing size of this code increases sharply from 1 million atoms before parallelization to 20 million atoms after parallelization, and the wall-clock time of the computation is greatly reduced. Some hot spots in this code are found and optimized with an improved algorithm. The efficiency of parallel computing is 30% higher than before, calculation time is saved, and larger-scale problems can be solved.
Published: 2006
Full Text: View/download PDF

14. Parallelization of Block Encryption Algorithm Based on Piecewise Nonlinear Map

Author: Dariusz Burak
Subjects: business.industry, Data dependence, Chaotic, Parallelism (grammar), Piecewise, Parallel computing, Nonlinear map, Encryption, business, Block (data storage), Mathematics
Abstract: In this paper, the results of parallelizing a chaotic block encryption algorithm based on a piecewise nonlinear map are presented. A data dependence analysis of loops was applied in order to parallelize this algorithm. The OpenMP standard is used to express the parallelism of the algorithm. Efficiency measurements for the parallel program are shown.
Published: 2015
Full Text: View/download PDF

15. Parallelization of a Block Cipher Based on Chaotic Neural Networks

Author: Dariusz Burak
Subjects: CHAOS (operating system), Artificial neural network, Computer science, business.industry, Chaotic neural network, Data dependence, Parallelism (grammar), Parallel algorithm, Parallel computing, Encryption, business, Block cipher
Abstract: In this paper the results of parallelizing a block cipher based on chaotic neural networks are presented. A data dependence analysis of loops is applied in order to parallelize the algorithm. The parallelism of the algorithm is demonstrated in accordance with the OpenMP standard. As a result of this study, it is stated that the most time-consuming loops of the algorithm are suitable for parallelization. The efficiency measurements of a parallel algorithm working in standard modes of operation are shown.
Published: 2015
Full Text: View/download PDF

16. Fast Data-Dependence Profiling by Skipping Repeatedly Executed Memory Operations

Author: Michael Beaumont, Felix Wolf, Ali Jannesari, and Zhen Li
Subjects: Profiling (computer programming), Program analysis, Exploit, Computer science, Real-time computing, Data dependence, Time overhead, Parallel computing
Abstract: Nowadays, more and more program analysis tools adopt profiling approaches to obtain data dependences because of their ability to track dynamically allocated memory, pointers, and array indices. However, dependence profiling suffers from high time overhead. To lower the overhead, earlier dependence profiling techniques either exploit features of the specific program analyses they are designed for, or let the profiling process run in parallel. Although they successfully lowered the time overhead of dependence profiling by a certain amount, none of them has tried to solve the fundamental problem that causes the high time overhead: the memory operations that are repeatedly executed in loops. Most of the time, these memory operations lead to exactly the same data dependences. However, a profiling method has to profile all these memory operations over and over again in order not to miss a single dependence that may occur just once. In this paper, we present a method that allows a dependence profiling technique to skip memory operations that are repeatedly executed in loops without missing any data dependence. Our method works with all types of loops and does not require any preprocessing, such as source annotation of the input code. Experimental results show that our method can lower the time overhead of data-dependence profiling by up to 52%.
Published: 2015
Full Text: View/download PDF

17. Adding static data dependence collapsing to a high-performance instruction scheduler

Author: Fleur L. Steven, Colin Egan, Richard D. Potter, and G.B. Steven
Subjects: Hardware and Architecture, Single entity, Computer science, Data dependence, Instruction scheduling, Parallel computing, Static data, Operand, Execution time, Software, Scheduling (computing), Compile time
Abstract: State-of-the-art processors achieve high performance by executing multiple instructions in parallel. However, the parallel execution of instructions is ultimately limited by true data dependencies between individual instructions. The objective of this paper is to present and quantify the benefits of static data dependence collapsing, a non-speculative technique for reducing the impact of true data dependencies on program execution time. Data dependence collapsing involves combining a pair of instructions when the second instruction is directly dependent on the first. The two instructions are then treated as a single entity and are executed together in a single functional unit that is optimised to handle functions with three input operands instead of the traditional two inputs. Dependence collapsing can be accomplished either dynamically at run time or statically at compile time. Since dynamic dependence collapsing has been studied extensively elsewhere, this paper concentrates on static dependence collapsing. To quantify the benefits of static dependence collapsing, we added a new dependence collapsing option to the Hatfield Superscalar Scheduler (HSS), a state-of-the-art instruction scheduler that targets the Hatfield Superscalar Architecture (HSA). We demonstrate that the addition of dependence collapsing to HSS delivers a significant performance increase of up to 15%. Furthermore, since HSA already executes over four instructions in each processor cycle without dependence collapsing, dependence collapsing enables 0.4 additional instructions to be executed in each processor cycle.
Published: 2001
Full Text: View/download PDF

18. Non-linear array data dependence test

Author: Cheng-Ming Yang and Tsung-Chuan Huang
Subjects: Wavefront, Scheme (programming language), Current (mathematics), Computer science, Data dependence, Process (computing), Parallel computing, Loop (topology), Nonlinear system, Hardware and Architecture, Code (cryptography), computer, Algorithm, Software, Information Systems, computer.programming_language
Abstract: Data dependence analysis is the most essential process while parallelizing a sequential program. Most current data dependence tests cannot handle array subscripts that are non-linear expressions. In this paper, we present a new parallelization algorithm, called non-linear array subscripts (NLA) test, to deal with non-linear or complex array subscripts. In this scheme, the iterations subject to loop-carried dependence are scheduled into different wavefronts, while the iterations with no loop-carried dependence are assigned into the same wavefront. Based on the wavefront information, the original loop is then transformed into parallel code. Our experimental results on shared-memory parallel machines HP SPP2000 and ALR Quad6 prove the high effectiveness of the NLA test.
Published: 2001
Full Text: View/download PDF

19. An Approach for Parallel Detection and Execution of Arithmetic Operations at Inter Instruction Level

Author: S Valli and V Sankaranarayanan
Subjects: Data dependency, Computer science, Data dependence, Parallelism (grammar), Limit (mathematics), Parallel detection, Parallel computing, Electrical and Electronic Engineering, Arithmetic expressions, Arithmetic, Operand, Computer Science Applications, Theoretical Computer Science
Abstract: An approach is developed for parallel detection and evaluation of arithmetic operations at inter instruction level. Parallelism exploitation is targeted at inter instruction level. Data dependence analysis is carried out since dependence will limit the number of parallel operations. Arithmetic expressions with +, -, *, / as permissible operators, variables and constants as operands are considered in the implementation. Transputers are used in the implementation.
Published: 2001
Full Text: View/download PDF

20. Enhancing java processor performance with smart dynamic folding

Author: Lung-Chung Chang, Min Fu Kao, Lee-Ren Ton, and Chung-Ping Chung
Subjects: Mechanism design, business.product_category, Java processor, Stack (abstract data type), Computer science, Data dependence, Stack trace, General Engineering, Folding (DSP implementation), Parallel computing, Internet appliance, business, Stack machine
Abstract: The Java processor is suitable for Internet appliances or embedded controllers due to its speed and low memory requirement. However, its performance is severely limited by true data dependence. In this work, we present a smart, dynamic technique for folding stack operations: POC model‐based folding. The stack instructions are classified into three types: P, O, and C. The folding algorithm can automatically determine the folding relations among all the instructions based on the type and folding attributes of each instruction. The proposed algorithm does not need to match different patterns. A typical folding mechanism design based on this model is then introduced. Also, the performance of various folding methods based on the POC model is evaluated. Simulation data indicate that the 4‐foldable method eliminates 84% of all stack operations. Furthermore, the 2‐, 3‐, and 4‐foldable methods accelerate the overall program by 1.22, 1.32 and 1.34, respectively, as compared to a Java processor without folding.
Published: 2000
Full Text: View/download PDF

21. Loop parallelisation for pvm-based distributed-memory systems

Author: David J. Evans and Mohd Yazid Saman
Subjects: LOOP (programming language), Programming language, Computer science, Applied Mathematics, Data dependence, Process (computing), Parallel computing, computer.software_genre, Computer Science Applications, Task (computing), Computational Theory and Mathematics, Distributed memory systems, Virtual machine, Distributed memory, computer
Abstract: Writing programs for a distributed-memory system (DMS) is a difficult job. In this paper, a method for parallelising sequential programs for DMS is presented. The input programs are C programs and the output parallel versions are programs containing routines for the Parallel Virtual Machine (PVM). PVM allows a group of computers in a network to be specified as a DMS and provides the routines for task activation and communication. The main task in parallelising a program is to process the loops in the source program and determine whether any data dependences exist. If the loop iterations are independent, the body will be transformed into tasks that will run independently under PVM.
Published: 2000
Full Text: View/download PDF

22. JAPS: an automatic parallelizing system based on JAVA

Author: Jiancheng Du, Daoxu Chen, and Li Xie
Subjects: Java, Exploit, Programming language, Computer science, Data parallelism, Data dependence, Task parallelism, Dynamic priority scheduling, Parallel computing, Dependence analysis, computer.software_genre, Scheduling (computing), computer, computer.programming_language
Abstract: JAPS is an automatic parallelizing system based on JAVA running on NOW. It implements the automatic process from dependence analysis to parallel execution. The current version of JAPS can exploit functional parallelism and the detection of data parallelism will be incorporated in the new version, which is underway. The framework and key techniques of JAPS are presented. Specific topics discussed are task partitioning, summary information collection, data dependence analysis, pre-scheduling and dynamic scheduling, etc.
Published: 1999
Full Text: View/download PDF

23. Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization

Author: Philippe Clauss and Alain Ketterlin (affiliations as listed: Compilation pour les Architectures MUlti-coeurS (CAMUS); Laboratoire des Sciences de l'Image, de l'Informatique et de la Télédétection (LSIIT), équipe ICPS; Centre National de la Recherche Scientifique (CNRS); Institut National de Recherche en Informatique et en Automatique (Inria), Inria Nancy - Grand Est; MULTICORE)
Subjects: 010302 applied physics, Profiling (computer programming), Computer science, Data dependence, 020207 software engineering, 02 engineering and technology, Parallel computing, Static analysis, 01 natural sciences, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], Graph, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering
Abstract: This paper describes a tool using one or more executions of a sequential program to detect parallel portions of the program. The tool, called Parwiz, uses dynamic binary instrumentation, targets various forms of parallelism, and suggests distinct parallelization actions, ranging from simple directive tagging to elaborate loop transformations. The first part of the paper details the link between the program's static structures (like routines and loops), the memory accesses performed by the program, and the dependencies that are used to highlight potential parallelism. This part also describes the instrumentation involved, and the general architecture of the system. The second part of the paper puts the framework into action. The first study focuses on loop parallelism, targeting OpenMP parallel-for directives, including privatization when necessary. The second study is an adaptation of a well-known vectorization technique based on a slightly richer dependence description, where the tool suggests an elaborate loop transformation. The third study views loops as a graph of (hopefully lightly) dependent iterations. The third part of the paper explains how the overall cost of data-dependence profiling can be reduced. This cost has two major causes: first, instrumenting memory accesses slows down the program, and second, turning memory accesses into dependence graphs consumes processing time. Parwiz uses static analysis of the original (binary) program to provide data at a coarser level, moving from individual accesses to complete loops whenever possible, thereby reducing the impact of both sources of inefficiency.
Published: 2012
Full Text: View/download PDF
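
Note on the entry above: the abstract describes turning recorded memory accesses into dependences in order to find parallel loops. As a rough illustration only (this is not Parwiz's implementation, which instruments binaries), the Python sketch below replays a hypothetical access trace and reports loop-carried dependences using a shadow table of last accesses. The trace format and the names `profile_loop` and `parallel_candidate` are assumptions made for this example, and tracking only the most recent read per address is a simplification.

```python
def profile_loop(trace):
    """Collect loop-carried dependences from an access trace.

    trace: iterable of (iteration, op, address) in execution order,
           where op is 'R' (read) or 'W' (write). Hypothetical format.
    Returns a set of (kind, source_iteration, sink_iteration) tuples.
    """
    last_write = {}  # address -> iteration of the most recent write
    last_read = {}   # address -> iteration of the most recent read
    deps = set()
    for it, op, addr in trace:
        if op == 'R':
            w = last_write.get(addr)
            if w is not None and w != it:
                deps.add(('RAW', w, it))   # value flows across iterations
            last_read[addr] = it
        elif op == 'W':
            w = last_write.get(addr)
            if w is not None and w != it:
                deps.add(('WAW', w, it))   # output dependence
            r = last_read.get(addr)
            if r is not None and r != it:
                deps.add(('WAR', r, it))   # anti dependence
            last_write[addr] = it
        else:
            raise ValueError(f"unknown access kind: {op!r}")
    return deps


def parallel_candidate(deps):
    """A loop with no loop-carried RAW dependence is a candidate for
    parallelization; remaining WAR/WAW dependences may be removable
    by privatizing the variable (assumption of this sketch)."""
    return not any(kind == 'RAW' for kind, _, _ in deps)


# a[i] = a[i-1] + 1 for i = 1..3: each iteration reads the previous result.
recurrence = [(1, 'R', ('a', 0)), (1, 'W', ('a', 1)),
              (2, 'R', ('a', 1)), (2, 'W', ('a', 2)),
              (3, 'R', ('a', 2)), (3, 'W', ('a', 3))]

# t = ...; use(t) in every iteration: a scalar temporary written before it is read.
temporary = [(1, 'W', 't'), (1, 'R', 't'),
             (2, 'W', 't'), (2, 'R', 't')]

print(sorted(profile_loop(recurrence)), parallel_candidate(profile_loop(recurrence)))
print(sorted(profile_loop(temporary)), parallel_candidate(profile_loop(temporary)))
```

In the second trace only WAR and WAW dependences appear, which is the pattern where giving each iteration its own copy of `t` would remove the conflict.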

24. Runtime analysis of application binaries for function level parallelism potential using QEMU

Author: Ghulam Mustafa, Abdul Qadeer, Khansa Butt, and Abdul Waheed
Subjects: Profiling (computer programming), Source code, Exploit, Computer science, Data dependence, Task parallelism, Parallel computing, Automatic parallelization, Compiler, Dynamic instrumentation
Abstract: The efficacy of automatic parallelization techniques that rely on source code analysis alone is often limited by a lack of information about the runtime characteristics of target applications. In order to exploit runtime application behavior for parallelization, we need: (1) tools/techniques for dynamic instrumentation and profiling; and (2) a methodology to identify areas of an application that are amenable to explicit and speculative parallelization. In this paper, we present an infrastructure that provides the above-mentioned facilities to analyze ELF binaries in an emulated runtime environment. The infrastructure, which is implemented as an extension to the quick emulator (QEMU), includes a profiling mechanism to capture the runtime behavior of an application and an inter-function dependence metric to quantitatively measure the potential for function-level parallelism. The dependence metric is an extension of the data dependence densities effort [7]. We ran sequential versions of the NAS benchmarks through this infrastructure to determine their function-level parallelization potential. The resulting data can be consumed for manual parallelization efforts as well as for automated parallelization through compiler feedback during the build process.
Published: 2012
Full Text: View/download PDF

25. Implement Improved Loop Level OpenMP Program Based on Parallel Region Reconstruction

Author: Shi'an Hu, Aixian Dong, and Hongtu Ma
Subjects: Instruction set, Computer science, Data dependence, Parallel computing, Fork–join queue, Merge (version control)
Abstract: Based on three OpenMP program models, this paper discusses parallel region reconstruction as a way to implement improved loop-level OpenMP programs. Parallel region reconstruction expands and merges parallel regions. When reconstructing parallel regions, two things must be preserved before and after optimization: data attributes and data dependences. Experimental results of PPOPP show that after parallel region reconstruction, the largest improvement is 28.1% for lu1k, and the smallest is about 1.87% for erle64. lu1k improves the most because a parallel region is expanded outside a loop of 1024 iterations, which eliminates 1023 parallel region creations. The experimental results indicate that parallel region reconstruction reduces the number of parallel region creations and improves the performance of the OpenMP program.
Published: 2012
Full Text: View/download PDF

26. Multi-slicing: a compiler-supported parallel approach to data dependence profiling

Author: Hongtao Yu and Zhiyuan Li
Subjects: Profiling (computer programming), Multi-core processor, Computer science, Data dependence, Ambiguity, Parallel computing, Slicing, Software, Compiler, Granularity
Abstract: Retrofitting existing software for the increasingly dominant multicore microprocessors has a strong appeal from the economic point of view. One of the key issues in such an effort is to fully understand the data dependences in the existing software. Unfortunately, current compilers have quite limited ability to analyze data dependences. Therefore, execution-driven data dependence profiling has gained significant interest because it can resolve memory access ambiguity exactly during program execution, which allows data dependences to be analyzed exactly. Although such dependence profiling is valid for specific inputs only, the insight it provides can be highly valuable to software engineers in their parallelization effort. On the other hand, dependence profiling itself can take tremendous memory and machine time. In this paper, we propose a novel dependence profiling method which, with the support of several new compiler and runtime techniques, partitions the profiling task into many independent slices, each requiring significantly less memory. Different slices can be profiled in parallel, producing subgraphs which are eventually combined automatically into the complete data dependence graph by the compiler. The slices can be extracted with different degrees of granularity. Experiments show that, for several well-known benchmark programs, our parallel scheme shortens the profiling time by a few orders of magnitude.
Published: 2012
Full Text: View/download PDF

27. Fast loop-level data dependence profiling

Author: Hongtao Yu and Zhiyuan Li
Subjects: Profiling (computer programming), Memory address, Alias, Computer science, Data dependence, Hash function, Compiler, Parallel computing, Dependence analysis, Data structure
Abstract: Execution-driven data dependence profiling has gained significant interest as a tool to compensate for the weakness of static data dependence analysis. Although such dependence profiling is valid for specific inputs only, its result can be used in many ways for program parallelization. Unfortunately, traditional hash-based dependence profiling can take tremendous memory and machine time, which severely limits its practical use. In this paper, we propose new compiler-based techniques to perform fast loop-level data dependence profiling. Firstly, using type consistency and alias information, our compiler embeds memory tags into the data structures in the original program such that memory addresses can be efficiently compared for dependence testing. This approach avoids the bytewise hashing overhead in conventional profiling methods. Secondly, we prove that a partial dependence graph obtained from profiling is sufficient for loop-level reordering transformations and parallelization. Such a partial dependence graph can be obtained very fast, without having to exhaustively enumerate all dependence edges. Thirdly, our compiler partitions the profiling task into independent slices. Such slices can be profiled in parallel, producing subgraphs which are eventually combined automatically into the complete data dependence graph by the compiler. Experiments show that these techniques significantly reduce the memory use and shorten the profiling time (by an order of magnitude for several SPEC2006 benchmarks). Benchmarks too big to profile at all loop levels by previous methods can now be profiled fully within several hours.
Published: 2012
Full Text: View/download PDF

28. Distributed replay protocol for distributed uniprocessors

Author: Xuechao Wei, Wenting Han, Wei Zhou, Tao Sun, Hong An, Mengjie Mao, and Bobin Deng
Subjects: Computer science, Distributed computing, Data dependence, Uniprocessor system, Parallel computing, Architecture design, Latency (engineering), Speculation, Partition (database)
Abstract: Data speculation techniques have been heavily exploited in various scenarios of architecture design. They bridge the time or space gap between data producer and data consumer, which gives processors opportunities to gain significant speedups. However, large instruction windows, deep pipelines, and the increasing latency of on-chip communication make data misspeculation very expensive in modern processors. This paper proposes a Distributed Replay Protocol (DRP) that addresses data misspeculation in a distributed uniprocessor named TFlex. The partitioned nature of distributed uniprocessors aggravates the penalty of data misspeculation. After detecting misspeculation, DRP avoids squashing the pipeline; instead, it retains all instructions in the window and selectively replays the instructions that depend on the misspeculated data. As one possible use of DRP, we apply it to recovery from data dependence speculation. We also summarize the challenges of implementing a selective replay mechanism on distributed uniprocessors, and then come up with two variations of DRP to effectively solve these challenges. The evaluation results show that without data speculation, DRP achieves 99% of the performance of perfect memory disambiguation. It speeds up diverse applications over baseline TFlex (with a state-of-the-art data dependence predictor) by a geometric mean of 24%.
Published: 2012
Full Text: View/download PDF

29. Copy Elimination on Data Dependence Graphs

Author: Quentin Colombet and Florian Brandner (Compilation and embedded computing systems (COMPSYS), Inria Grenoble - Rhône-Alpes; Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon (ENS de Lyon), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon; Institut National de Recherche en Informatique et en Automatique (Inria); Centre National de la Recherche Scientifique (CNRS))
Subjects: Register (music), Computer science, Data dependence, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), 020207 software engineering, 020201 artificial intelligence & image processing, 02 engineering and technology, Parallel computing, Extension (predicate logic), Arithmetic, SPECint, Register allocation
Abstract: Register allocation recently regained much interest due to new decoupled strategies that split the problem into separate phases: spilling, register assignment, and copy elimination. A common assumption of existing copy elimination approaches is that the original ordering of the instructions in the program is not changed. This work presents an extension of a local recoloring technique called Parallel Copy Motion. We perform code motion on data dependence graphs in order to eliminate useless copies and reorder instructions, while at the same time a valid register assignment is preserved. Our results show that even after traditional register allocation with coalescing our technique is able to eliminate an additional 3% (up to 9%) of the remaining copies and reduce the weighted costs of register copies by up to 25% for the SPECINT 2000 benchmarks. In comparison to Parallel Copy Motion, our technique removes 11% (up to 20%) more copies and up to 39% more of the copy costs.
Published: 2012
Full Text: View/download PDF

30. Parallelization of the Discrete Chaotic Block Encryption Algorithm

Author: Michał Chudzik and Dariusz Burak
Subjects: Automatic parallelization, Parallelizable manifold, Computer science, Data dependence, Chaotic, Parallelism (grammar), Parallel computing, Encryption, Data dependency analysis, Block (data storage)
Abstract: In this paper, we present the results of parallelizing the Lian et al. discrete chaotic block encryption algorithm. Data dependence analysis of loops was applied in order to parallelize this algorithm. The OpenMP standard is chosen for presenting the parallelism of the algorithm. We show that the algorithm introduced by Lian et al. can be divided into parallelizable and unparallelizable parts. Our study found that the most time-consuming loops of the algorithm are suitable for parallelization. The efficiency measurement for a parallel program is presented.
Published: 2012
Full Text: View/download PDF

31. A general data dependence test for dynamic, pointer-based data structures

Author: Joseph Hummel, Alexandru Nicolau, and Laurie Hendren
Subjects: Computer engineering, Computer science, Escape analysis, Pointer (computer programming), Data dependence, Compiler, Parallel computing, Data structure, Pointer analysis, Computer Graphics and Computer-Aided Design, Software
Abstract: Optimizing compilers require accurate dependence testing to enable numerous, performance-enhancing transformations. However, data dependence testing is a difficult problem, particularly in the presence of pointers. Though existing approaches work well for pointers to named memory locations (i.e. other variables), they are overly conservative in the case of pointers to unnamed memory locations. The latter occurs in the context of dynamic, pointer-based data structures, used in a variety of applications ranging from system software to computational geometry to N-body and circuit simulations. In this paper we present a new technique for performing more accurate data dependence testing in the presence of dynamic, pointer-based data structures. We will demonstrate its effectiveness by breaking false dependences that existing approaches cannot, and provide results which show that removing these dependences enables significant parallelization of a real application.
Published: 1994
Full Text: View/download PDF

32. Parallelism analysis and optimization in SPEFY, a programming environment

Author: S. Srinivas, Ming Li, and K.J.M. Moriarty
Subjects: Data flow diagram, Hardware and Architecture, Computer science, Data parallelism, Data dependence, Code (cryptography), Feature (machine learning), Parallelism (grammar), General Physics and Astronomy, Task parallelism, Parallel computing, Instruction-level parallelism
Abstract: SPEFY (Scotia Programming Environment and Facility) is a new software development environment designed to simplify and accelerate the development of large-scale programs in a manner that makes the most efficient use of the supercomputers on which they run. The centerpiece of SPEFY is the Parallelism Analysis and Optimization tool, which is an interactive facility for analyzing code, detecting data dependence, and optimizing the program by parallelism-enhancing transformations. A significant feature of the analysis is that it is performed both across and within procedures, which greatly increases the precision of data flow and dependence information. The objective of this paper is to describe the Parallelism Analysis and Optimization tool of SPEFY. It discusses data dependence, interprocedural analysis by determining the relevant effects of procedure calls, data dependence analysis incorporating interprocedural information, and program restructuring optimization techniques.
Published: 1994
Full Text: View/download PDF

33. Exclusive squashing for thread-level speculation

Author: Arturo Gonzalez-Escribano, Alvaro Garcia-Yaguez, and Diego R. Llanos
Subjects: Reduction (complexity), Automatic parallelization, Speedup, Computer science, 020204 information systems, Computation, Data dependence, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), Speculative multithreading, 020207 software engineering, 02 engineering and technology, Parallel computing
Abstract: Speculative parallelization is a runtime technique that optimistically executes sequential code in parallel, checking that no dependence violations appear. In this paper, we address the problem of minimizing the number of threads that should be restarted when a data dependence violation is found. We present a new mechanism that keeps track of inter-thread dependencies in order to selectively stop and restart offending threads, and all threads that have consumed data from them. Results show a reduction of 38.5% to 81.8% in the number of restarted threads for real application loops and up to a 10% speedup, depending on the amount of local computation.
Published: 2011
Full Text: View/download PDF
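
Note on the entry above: the abstract describes restarting only the offending thread and every thread that has consumed data from it. A minimal Python sketch of that selection step follows, assuming the runtime keeps a map from each producer thread to the threads that read its data. The map layout and the name `threads_to_restart` are hypothetical and are not taken from the paper.

```python
from collections import deque


def threads_to_restart(offender, consumers):
    """Return the offending thread plus all direct and indirect consumers.

    offender:  id of the thread where the dependence violation was found.
    consumers: dict mapping a producer thread id to the set of thread ids
               that consumed data from it (hypothetical bookkeeping).
    """
    to_restart = {offender}
    queue = deque([offender])
    while queue:
        producer = queue.popleft()
        for consumer in consumers.get(producer, ()):
            if consumer not in to_restart:
                # This thread used possibly stale data, so it is squashed too.
                to_restart.add(consumer)
                queue.append(consumer)
    return to_restart


# Thread 3 read from thread 2, thread 5 read from thread 3,
# and thread 6 read from thread 4.
consumed_from = {2: {3}, 3: {5}, 4: {6}}
print(sorted(threads_to_restart(2, consumed_from)))
```

In this example threads 4 and 6 keep running, whereas a scheme that squashes every thread after the offender would restart them as well.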

34. Optimal data dependence chaining in parallel loops

Author: Z. Szczerbiński
Subjects: Arc (geometry), MIMD, Computer science, Data dependence, Chaining, General Engineering, Parallel computing
Abstract: We propose a method for optimizing shared-memory MIMD programs containing parallel loops with loop-carried data dependences of various distances. The optimization consists of reducing the number of synchronizations necessary to satisfy the dependences by identifying dependences which are redundant and need not be synchronized. The idea of dependence chaining is presented. Next, conditions for dependence arc elimination are formulated. A proposal for application of nested forward dependences in dependence chaining so as to achieve extra arc elimination is put forward. Theoretical considerations are accompanied by the description of the algorithm which implements the proposed method and an example of its application.
Published: 1993
Full Text: View/download PDF
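
Note on the entry above: the abstract concerns dropping synchronizations for dependences that are already enforced by a chain of other dependences. The Python sketch below illustrates the idea in a deliberately simplified setting that is an assumption of this example, not the paper's algorithm: every dependence has the same source and sink statement, so a dependence of distance d is covered whenever d can be written as a sum of distances that are still synchronized. The name `split_distances` is hypothetical.

```python
def split_distances(distances):
    """Split dependence distances into those that must be synchronized
    and those covered by a chain of smaller synchronized distances.

    Simplifying assumption: all dependences share one source statement
    and one sink statement, so chaining is plain addition of distances.
    Returns (kept, redundant), both sorted in ascending order.
    """
    kept, redundant = [], []
    for d in sorted(set(distances)):
        # reachable[k] is True when k is a sum of distances already kept.
        reachable = [False] * (d + 1)
        reachable[0] = True
        for k in range(1, d + 1):
            reachable[k] = any(k >= s and reachable[k - s] for s in kept)
        if reachable[d]:
            redundant.append(d)   # enforced transitively by a chain
        else:
            kept.append(d)        # needs its own synchronization
    return kept, redundant


print(split_distances([1, 3]))
print(split_distances([2, 3, 4, 7]))
```

With distances 1 and 3, waiting on the previous iteration already orders an iteration after the one three steps back, so only the distance-1 dependence needs a synchronization in this model.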

35. On data dependence analysis for compiling programs on distributed-memory machines (extended abstract)

Author: Chua-Huang Huang, Sanjay Sharma, and P. Sadayappan
Subjects: Programming language, Computer science, Data dependence, Distributed memory, Parallel computing, Computer Graphics and Computer-Aided Design, Software
Published: 1993
Full Text: View/download PDF

36. Compiling lisp programs for parallel execution

Author: James R. Larus
Subjects: General Computer Science, Computer science, Programming language, Data dependence, Program transformation, Fexpr, Multiprocessing, Parallel computing, Computer Science Applications, Parallelism (grammar), Recursive functions, Preprocessor, Lisp, Software
Abstract: Curare, the program restructurer described in this paper, automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor. Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result. Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity, Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke. Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part.
Published: 1991
Full Text: View/download PDF

37. OpenMP Implementation of Parallel Linear Solver for Reservoir Simulation

Author: Changjun Hu, Jue Wang, Jilin Zhang, and Jianjiang Li
Subjects: Reservoir simulation, Computer science, Locality, Data dependence, Domain decomposition methods, Linear solver, Parallel computing, System of linear equations, Computational science
Abstract: In this paper, we discuss an OpenMP implementation of an evolutionary LSOR method, the MBLSOR method, for the solution of systems of linear equations related to reservoir simulation on SMPs. The MBLSOR method not only improves data locality through a spatial computational domain decomposition technique, but also parallelizes the sub-blocks that have no data dependence. We compare the performance of different parallel LSOR methods in terms of efficiency and data locality. Numerical results on SMPs indicate that the MBLSOR algorithm is more efficient.
Published: 2008
Full Text: View/download PDF

38. Application of redundant computation in software performance analysis

Author: Zakarya A. Alzamil and Bogdan Korel
Subjects: Computer science, Computation, Data dependence, Redundancy (engineering), Software performance analysis, Parallel computing, Dependence analysis, Redundant code
Abstract: Redundant computation is an execution of a program statement(s) that does not contribute to the program output. The same statement on one execution may exhibit redundant computation whereas on a different execution, it contributes to the program output. A redundant (dead) statement always exhibits redundant computation, i.e., its execution is always redundant. However, a statement that exhibits redundant computation is not necessarily a redundant statement. Redundant computation represents a partial redundancy of a statement. A high degree of redundant computation in a program may indicate a performance deficiency. Therefore, elimination (or reduction) of redundant computation may improve a program's performance. In this paper we present an approach for automated detection of redundant computation in programs and show its application in performance analysis. We developed a tool that automatically detects redundant computations in C programs and identifies potential performance deficiencies related to redundant computation. We have performed an experimental study that showed that redundant computation is a commonly occurring phenomenon in programs, and it is frequently a source of performance deficiency.
Published: 2005
Full Text: View/download PDF

39. Quantification of ISA Impact on Superscalar Processing

Author: Rafael Rico and Raúl Durán
Subjects: Instruction set, Computer science, Superscalar, Matrix representation, Data dependence, x86, Graph (abstract data type), Graph theory, Parallel computing, Machine code
Abstract: The differences found between the superscalar performance of x86 and non-x86 processors, together with the peculiar characteristics of the x86 ISA, call for a thorough analysis of the available parallelism at the machine language layer. However, computer architecture evaluation requires new tools that complement the customary simulations and, in this sense, traditional graph theory can help to create a new framework for fine-grain parallelism analysis. We construct the matrix representation associated with the data dependence graph of execution traces. In this paper, we explain how this matrix characterizes the corresponding code in a mathematical manner, fulfills a number of properties and restrictions, and provides information about the ability of the code to be processed concurrently. We also show how different data dependence sources can be composed, thus providing a mechanism to explore their final influence on the degree of parallelism. These techniques are applied to an example from which some conclusions are derived.
Published: 2005
Full Text: View/download PDF
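
Note on the entry above: the abstract describes representing the data dependence graph of an execution trace as a matrix and reading parallelism information from it. The Python sketch below shows one plausible form of that idea under assumptions made for this example: the trace is a list of hypothetical `(destination, sources)` register tuples, only true (read-after-write) dependences are recorded, and the longest dependence chain is used as a simple measure of how much of the trace must run sequentially. The function names are illustrative and do not come from the paper.

```python
def dependence_matrix(trace):
    """Build D where D[i][j] == 1 if instruction i reads a value
    last written by the earlier instruction j (true dependence).

    trace: list of (dest, sources) tuples; dest may be None.
    """
    n = len(trace)
    D = [[0] * n for _ in range(n)]
    last_writer = {}  # register name -> index of its most recent writer
    for i, (dest, sources) in enumerate(trace):
        for reg in sources:
            j = last_writer.get(reg)
            if j is not None:
                D[i][j] = 1
        if dest is not None:
            last_writer[dest] = i
    return D


def critical_path_length(D):
    """Length, in instructions, of the longest dependence chain.
    Relies on D being strictly lower triangular (producers come first)."""
    depth = [1] * len(D)
    for i, row in enumerate(D):
        for j in range(i):
            if row[j]:
                depth[i] = max(depth[i], depth[j] + 1)
    return max(depth, default=0)


# r3 needs r1 and r2, r4 needs r3, r5 needs only r1.
trace = [("r1", ()), ("r2", ()), ("r3", ("r1", "r2")),
         ("r4", ("r3",)), ("r5", ("r1",))]
D = dependence_matrix(trace)
steps = critical_path_length(D)
print(D)
print(len(trace) / steps)  # average instructions per dependence-limited step
```

Because a producer always precedes its consumer in the trace, every nonzero entry lies below the diagonal, which is one of the structural properties such a matrix satisfies.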

40. Comparison of Data Dependence Analysis Tests

Author: Miia Viitanen and Timo Hämäläinen
Subjects: Program analysis, Computer science, Data dependence, Parallelism (grammar), Compiler, Parallel computing, Execution time
Abstract: A comparison of six data dependence analysis algorithms is presented. The algorithms are intended for a parallel compiler that is being developed for a configurable multi-DSP system, PARNEU. The algorithms are implemented in the SUIF compiler framework and benchmarked with Perfect Club, Audio Signal Processing, and Media Bench test problems. Proprietary PARNEU programs that have been manually parallelised are also included. The performance of the data dependence algorithms, in terms of accuracy and execution time, has been measured and compared. The results show that the Omega test is the most accurate but also takes the most execution time for benchmarks with for-loop parallelism.
Published: 2004
Full Text: View/download PDF

41. Maximizing Parallelism for Nested Loops with Non-uniform Dependences

Author: Sam Jin Jeong
Subjects: Convex hull, Variable (computer science), Data dependence, Parallelism (grammar), Iteration space, Parallel computing, Nested loop join, Mathematics
Abstract: Partitioning of loops is a very important optimization issue and requires efficient and exact data dependence analysis. Although several methods exist to parallelize loops with non-uniform dependences, most of them perform poorly due to irregular and complex dependence constraints. This paper proposes the Improved Region Partitioning Method for minimizing the size of the sequential region and maximizing parallelism. Our approach is based on Convex Hull theory, which has adequate information to handle non-uniform dependences. By parallelizing the anti-dependence region using variable renaming, we divide the iteration space into two parallel regions and at most one sequential region. Comparison with other schemes shows more parallelism than the existing techniques.
Published: 2004
Full Text: View/download PDF

42. Integrating fault-tolerant feature into TOPAS parallel programming environment for distributed systems

Author: Giang Nguyen, Viet Tran, and M. Kotocova
Subjects: Computer science, Distributed computing, Data dependence, Reactive programming, Dynamic load balancing, Control reconfiguration, Fault tolerance, Parallel computing, Scheduling (computing)
Abstract: In this paper, TOPAS, a new parallel programming environment for distributed systems, is presented. TOPAS automatically analyzes data dependence among tasks and synchronizes data, which reduces the time needed for parallel program development. TOPAS also provides support for scheduling, dynamic load balancing, and fault tolerance. Experiments show the simplicity and efficiency of parallel programming in the TOPAS environment with fault-tolerant integration, which provides graceful performance degradation and quick reconfiguration time for application recovery.
Published: 2003
Full Text: View/download PDF

43. Incremental data dependence analysis

Author: K.V. Praveen, R.K. Ghosh, and Sanjeev K. Aggarwal
Subjects: Program analysis, Computer science, Data dependence, Work (physics), Code (cryptography), Concurrent computing, Value (computer science), Parallel computing, Dependence analysis, Algorithm, Data-flow analysis
Abstract: Under the existing framework for dependence analysis, every time a program is modified, exhaustive reanalysis has to be carried out to restructure the program. Often the changes in the program are limited to a small portion and do not affect a major part of the value-based dependences. Therefore, a framework for incremental dependence analysis, which would only recompute dependences pertaining to modified code, is desirable. Incremental analyzers modify only the affected parts of the solution, and hence can quickly reflect the effect of the changes on the solutions. In this work we present a framework for incremental dependence analysis for value-based dependences.
Published: 2002
Full Text: View/download PDF

44. Data optimization: minimizing residual interprocessor data motion on SIMD machines

Author: K. Knobe and V. Natarajan
Subjects: Control flow, Computer science, Data dependence, Data optimization, Parallel computing, Simd machines, Residual, Data structure, Motion (physics)
Abstract: Basic concepts in array layout are summarized, and unhonored preferences and residual data motion are discussed. A technique for minimizing such motion is presented. For each array the source program is divided into regions, each associated with a single home. This enables efficient handling of residual data motion. The partitioning into regions is based on control flow and data dependence. Preliminary results obtained with this technique show an order-of-magnitude improvement for certain classes of programs.
Published: 2002
Full Text: View/download PDF

45. Parallelization of sequential programs for net-based execution

Author: W.B. Joerg and D.J. Maier
Subjects: Theoretical computer science, Computer science, Programming language, Data dependence, Parallel computing, Pascal (programming language), Petri net, Simulation based
Abstract: We present an experimental tool for identifying coarse-grained parallelism in Pascal programs. The tool produces a net description of a sequential program where statements that could potentially be executed in parallel have been identified. Conventional control and data dependence analysis is used to map the statements in a sequential program into execution steps in a dependency net. We introduce the concept of dependency strength and show how it is used to guide the grouping of statements. A simulation based on laws adapted from electrostatics and mechanics is performed where the statements are allowed to attract and repel one another to affect their position within the dependency net. Statements that must be executed sequentially are coalesced together. Several translation parameters can be modified and their effects on the resulting net descriptions can be studied.
Published: 2002
Full Text: View/download PDF

46. A general compiler framework for speculative multithreading

Author: Manoj Franklin and Anasua Bhowmik
Subjects: Super-threading, Speedup, Computer science, Data dependence, Speculative multithreading, Compiler, Parallel computing, Thread (computing), Data structure, Pointer analysis
Abstract: Speculative multithreading (SpMT) promises to be an effective mechanism for parallelizing non-numeric programs, which tend to use irregular data structures with pointers and have complex flows of control. Proper thread formation is crucial to obtaining good speedup in an SpMT system. This paper presents a compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system. This framework is very general, and supports a wide variety of threads, such as speculative threads, non-speculative threads, loop-centric threads, and out-of-order thread spawning. The compiler uses profiling, intra-procedural pointer analysis, data dependence information and control dependence information. The compiler is implemented on the SUIF-MachSUIF platform. A simulation-based evaluation of the generated threads shows that an average speedup of 3 can be obtained with 6 processing elements for non-numeric programs. This speedup reduces to 2 if we use only loop-based threads.
Published: 2002
Full Text: View/download PDF

47. Equimax: A New Formulation of Optimal Register-Sensitive Scheduling for ILP Processors

Author: Sid-Ahmed-Ali Touati
Subjects: Job shop scheduling, Very long instruction word, Computer science, Bounded function, Data dependence, Parallel computing, Integer programming, Fair-share scheduling, Software architecture description, Scheduling (computing)
Abstract: In this article, we give a new formulation of the acyclic scheduling problem under register and resource constraints in multiple-instruction-issue processors (VLIW and superscalar). Given a directed acyclic data dependence graph G = (V, E), the complexity of our integer linear programming model is bounded by O(|V|^2) variables and O(|E| + |V|^2) constraints according to a target architecture description. This complexity is better than the complexity of the existing techniques, which includes a worst total schedule time factor.
Published: 2001
Full Text: View/download PDF

48. Slicing concurrent programs

Author: Mangala Gowri Nanda and S. Ramesh
Subjects: Reverse engineering, Computer science, Programming language, media_common.quotation_subject, Concurrency, Data dependence, General Medicine, Parallel computing, computer.software_genre, Slicing, Program analysis, Shared memory, Debugging, Program slicing, Mutual exclusion, computer, media_common
Abstract: Slicing is a well-known program analysis technique for analyzing sequential programs and has been found useful in debugging, testing and reverse engineering. This paper extends the notion of slicing to concurrent programs with shared memory, interleaving semantics and mutual exclusion. Interference among concurrent threads or processes complicates the computation of slices of concurrent programs. Further, unlike slicing of sequential programs, a slicing algorithm for concurrent programs needs to differentiate between loop-independent data dependences and certain loop-carried data dependences. We show why previous methods do not give precise solutions in the presence of nested threads and loops, and describe our solution, which correctly and efficiently computes precise slices. Though the complexity of this algorithm is exponential in the number of threads, a number of optimizations are suggested. Using these optimizations, we are able to get near-linear behavior for many practical concurrent programs.
Published: 2000
Full Text: View/download PDF

49. Limits of Instruction Level Parallelism with Data Value Speculation

Author: José González and Antonio González
Subjects: Data value, Computer science, Data dependence, Thread (computing), Parallel computing, Branch predictor, Key issues, Instruction-level parallelism, Speculation, Popularity
Abstract: Increasing instruction level parallelism (ILP) is one of the key issues in boosting the performance of future generation processors. Current processor organizations include different mechanisms to overcome the limitations imposed by name and control dependencies, but no mechanisms targeting data dependencies. Thus, these dependencies will become one of the main bottlenecks in the future. Data value speculation is gaining popularity as a mechanism to overcome the limitations imposed by data dependencies by predicting the values that flow through them. In this work, we present a study of the potential of data value speculation to boost the limits of instruction level parallelism using both perfect and realistic predictors. Speedups obtained by data value speculation are very large for an infinite window and still significant for a limited window. Different prediction schemes oriented to single-thread and multiple-thread (from a single program) architectures have been studied. The latter show a significant improvement with respect to the former for FP benchmarks, although the difference is much smaller for integer programs.
Published: 1999
Full Text: View/download PDF

50. Data dependence speculation using data address prediction and its enhancement with instruction reissue

Author: Toshinori Sato
Subjects: Out-of-order execution, Parallel processing (DSP implementation), Order (exchange), Computer science, Dynamic data, Data dependence, Overhead (computing), Parallel computing, Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING, Speculation, Instruction-level parallelism
Abstract: Introduces an instruction reissue mechanism in order to enhance dynamic data dependence speculation using data address prediction. Since instructions which are not data-dependent upon speculatively executed instructions are not squashed, the effect of data dependence speculation is enhanced. We extend the register update unit to reissue misspeculated instructions. The overhead caused by the extension is small, and thus it does not have any impact on processor cycle time. From the experimental evaluation, we have found that instruction reissue with dynamic data dependence speculation improves the processor performance even for those application programs whose performance is degraded when instruction squashing is used.
Published: 1998
Full Text: View/download PDF


68 results on '"Data dependence"'
