762 results for "Software pipelining"
Search Results
2. The compiler design handbook: optimizations and machine code generation.
- Author
-
Shankar, P. and Srikant, Y. N.
- Subjects
Code generators, Compilers (Computer programs), Software pipelining
- Abstract
Summary: Today's embedded devices and sensor networks are becoming more and more sophisticated, requiring more efficient and highly flexible compilers. Engineers are discovering that many of the compilers in use today are ill-suited to meet the demands of more advanced computer architectures. Updated to include the latest techniques, The Compiler Design Handbook, Second Edition offers a unique opportunity for designers and researchers to update their knowledge, refine their skills, and prepare for emerging innovations. The completely revised handbook includes 14 new chapters addressing topics such as worst case execution time estimation, garbage collection, and energy aware compilation. The editors take special care to consider the growing proliferation of embedded devices, as well as the need for efficient techniques to debug faulty code. New contributors provide additional insight to chapters on register allocation, software pipelining, instruction scheduling, and type systems.
- Published
- 2003
3. Adaptive Low-Cost Loop Expansion for Modulo Scheduling
- Author
-
Zhong, Hongli, Liu, Zhong, Liu, Sheng, Ma, Sheng, Li, Chen, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Shaoshan, editor, and Wei, Xiaohui, editor
- Published
- 2022
- Full Text
- View/download PDF
4. Concept of the Blocks Architecture
- Author
-
Wijtvliet, Mark, Corporaal, Henk, and Kumar, Akash
- Published
- 2022
- Full Text
- View/download PDF
5. Evaluating a Dynamic and a Modulo Scheduling-based Static Approach for Configuration Generation in CGRA Accelerators.
- Author
-
Fernandes Ribeiro, Lucas, Silva, Francisco Carlos, and Saraiva Silva, Ivan
- Abstract
With the increasing complexity of applications, there is an urgent need for solutions that improve their performance. The loops present in some applications are responsible for up to 71% of code execution time, so optimizing loop execution can accelerate the whole application. This performance gain can be obtained with software pipelining. This work proposes RENOIR, a tool that applies the software pipelining technique through a software implementation of the modulo scheduling algorithm to generate configurations for a coarse-grained reconfigurable architecture (CGRA). RENOIR is implemented in a compiler, which receives an application and generates code that can run on a gem5 implementation of a CGRA comprising a RISC-V microprocessor and a thin reconfigurable array. The results show that all the applications tested achieved an improvement in performance, reaching a gain of 2.32x in certain applications. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
6. Novel Implementation Approach with Enhanced Memory Access Performance of MGS Algorithm for VLIW Architecture.
- Author
-
Najoui, Mohamed, Hatim, Anas, Belkouch, Said, and Chabini, Noureddine
- Subjects
-
ALGORITHMS, PARALLEL processing, SCHEDULING software, LINEAR equations, IMAGE processing
- Abstract
The Modified Gram–Schmidt (MGS) algorithm is one of the best-known forms of QR decomposition (QRD). It has been used in many signal and image processing applications to solve least squares problems and linear equations or to invert matrices. However, QRD is widely regarded as a computationally expensive technique, and its sequential implementation fails to meet the requirements of many real-time applications. In this paper, we suggest a new parallel version of the MGS algorithm that uses VLIW (Very Long Instruction Word) resources efficiently to gain performance. The presented parallel MGS is based on compact VLIW kernels designed for each algorithm step, taking into account architectural and algorithmic constraints. Based on instruction scheduling and software pipelining techniques, the proposed kernels efficiently exploit data-, instruction- and loop-level parallelism. Additionally, cache memory properties were used to enhance parallel memory access and to avoid cache misses. The robustness, accuracy and speed of the introduced parallel MGS implementation on VLIW significantly enhance the performance of systems under severe real-time and low-power constraints. Experimental results show great improvements over the optimized vendor QRD implementation and the state of the art. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
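For reference, the baseline arithmetic that the paper's VLIW kernels implement can be sketched in plain Python. This is a scalar, textbook Modified Gram-Schmidt, with none of the instruction scheduling or pipelining the abstract describes:

```python
def mgs_qr(A):
    """Modified Gram-Schmidt QR: A (m x n, list of rows) -> Q, R with A = Q R.

    Q is stored column-wise (Q[k] is the k-th orthonormal column); R is
    upper triangular.  This is the numerically stabler variant: each new
    column is orthogonalised against q_k one at a time, always using the
    already-updated vector rather than the original one.
    """
    m, n = len(A), len(A[0])
    V = [[A[i][j] for i in range(m)] for j in range(n)]  # columns of A
    Q = [[0.0] * m for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        R[k][k] = sum(x * x for x in V[k]) ** 0.5
        Q[k] = [x / R[k][k] for x in V[k]]
        for j in range(k + 1, n):
            R[k][j] = sum(Q[k][i] * V[j][i] for i in range(m))
            V[j] = [V[j][i] - R[k][j] * Q[k][i] for i in range(m)]
    return Q, R
```

Normalising a column and immediately removing it from all remaining columns is what distinguishes MGS from classical Gram-Schmidt and improves its numerical stability.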
7. Epilogue
- Author
-
Aiken, Alex, Banerjee, Utpal, Kejariwal, Arun, and Nicolau, Alexandru
- Published
- 2016
- Full Text
- View/download PDF
8. Software pipelining with CGA and proposed intrinsics on a reconfigurable processor for HEVC decoders.
- Author
-
Ahn, Yong-Jo, Yoo, Jonghun, Jo, Hyun-Ho, and Sim, Donggyu
- Abstract
This work proposes several intrinsics for a reconfigurable processor intended for HEVC decoding, together with software pipelining algorithms that use a coarse-grained array (CGA) architecture and the proposed intrinsic instructions. Software pipelining algorithms are developed for CGA acceleration of the inverse transform, pixel reconstruction, de-blocking filter and sample adaptive offset modules. To enable efficient software pipelining, several very-long-instruction-word-based intrinsics are designed to maximize parallelization rather than computational acceleration. We found that the HEVC decoder with the proposed intrinsics runs 2.3 times faster, in clock cycles, than a decoder that does not use the intrinsics. In addition, the HEVC decoder with CGA pipelining algorithms executes 10.9 times faster than one without the CGA mode. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
9. Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies
- Author
-
Das, Abhishek, Dally, William J., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Luque, Emilio, editor, Margalef, Tomàs, editor, and Benítez, Domingo, editor
- Published
- 2008
- Full Text
- View/download PDF
10. Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods
- Author
-
George Biros, Andreas Gerstlauer, Dhairya Malhotra, Lizy K. John, Mochamad Asri, and Jiajun Wang
- Subjects
Random access memory, Computer science, Pipeline (computing), Supercomputer, Software pipelining, Software, Computational Theory and Mathematics, Application-specific integrated circuit, Hardware and Architecture, Embedded system, Signal Processing, x86, Hardware acceleration, System on a chip, DRAM
- Abstract
In this article, we study the performance and energy-saving benefits of hardware acceleration under different hardware configurations and usage scenarios for a state-of-the-art Fast Multipole Method (FMM), a popular N-body method. We use a dedicated Application-Specific Integrated Circuit (ASIC) to accelerate General Matrix-Matrix Multiply (GEMM) operations. FMM is widely used in applications and is a representative example of the workload of many HPC applications. We compare architectures that integrate the GEMM ASIC next to, in or near main memory with an on-chip coupling aimed at minimizing or avoiding repeated round-trip transfers through DRAM for communication between accelerator and CPU. We study tradeoffs using detailed and accurately calibrated x86 CPU, accelerator and DRAM simulations. Our results show that simply moving accelerators closer to the chip does not necessarily lead to performance/energy gains. We demonstrate that, while careful software blocking and on-chip placement optimizations can reduce DRAM accesses by 2X over a naive on-chip integration, these dramatic savings in DRAM traffic do not automatically translate into significant total energy or runtime savings, chiefly due to the application characteristics, the high idle power and the effective hiding of memory latencies in modern systems. Only when more aggressive co-optimizations such as software pipelining and overlapping are applied can additional performance and energy savings be unlocked, by 37 and 35 percent respectively over baseline acceleration. When similar optimizations (pipelining and overlapping) are applied with an off-chip integration, on-chip integration delivers up to 20 percent better performance and 17 percent less total energy consumption than off-chip integration.
- Published
- 2021
- Full Text
- View/download PDF
11. Instruction Re-selection for Iterative Modulo Scheduling on High Performance Multi-issue DSPs
- Author
-
Cho, Doosan, Ravi, Ayyagari, Uh, Gang-Ryung, Paek, Yunheung, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Zhou, Xiaobo, editor, Sokolsky, Oleg, editor, Yan, Lu, editor, Jung, Eun-Sun, editor, Shao, Zili, editor, Mu, Yi, editor, Lee, Dong Chun, editor, Kim, Dae Young, editor, Jeong, Young-Sik, editor, and Xu, Cheng-Zhong, editor
- Published
- 2006
- Full Text
- View/download PDF
12. MIRS: Modulo Scheduling with Integrated Register Spilling
- Author
-
Zalamea, Javier, Llosa, Josep, Ayguadé, Eduard, Valero, Mateo, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Dietz, Henry G., editor
- Published
- 2003
- Full Text
- View/download PDF
13. VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing.
- Author
-
Najoui, Mohamed, Bahtat, Mounir, Hatim, Anas, Belkouch, Said, and Chabini, Noureddine
- Subjects
-
REAL-time computing, LINEAR algebra, SIGNAL processing, PARALLEL computers, ALGORITHMS
- Abstract
QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures, such as very long instruction word (VLIW), compiler efficiency is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are: (i) a new parallel and fast version of the Givens algorithm based on the VLIW features (instruction-level parallelism (ILP) and data-level parallelism (DLP)), including cache memory properties; (ii) an efficient data management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, the C6678 and AK2H12, were used as implementation targets. The introduced parallel QR implementation method achieves, on average, speedups of more than 12x and 6x over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
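The Givens variant of QRD that this entry targets zeroes one subdiagonal entry per plane rotation. A scalar Python reference, again without the paper's low-level VLIW instruction scheme, might look like:

```python
import math

def givens_qr(A):
    """QR via Givens rotations: returns (Q, R) with A = Q R.

    Each rotation mixes rows i-1 and i so that R[i][j] becomes zero;
    the transposed rotations accumulate into Q, so A = Q R throughout.
    Scalar reference code only.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]  # identity
    for j in range(n):
        for i in range(m - 1, j, -1):          # zero entries below the diagonal
            a, b = R[i - 1][j], R[i][j]
            r = math.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            for k in range(n):                  # apply rotation to rows of R
                t1, t2 = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * t1 + s * t2
                R[i][k] = -s * t1 + c * t2
            for k in range(m):                  # accumulate Q := Q * G^T
                t1, t2 = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * t1 + s * t2
                Q[k][i] = -s * t1 + c * t2
    return Q, R
```

Because each rotation touches only two rows, the independent rotations are natural candidates for the instruction-level parallelism the abstract exploits.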
14. Evaluating a Dynamic and a Modulo Scheduling-based Static Approach for Configuration Generation in CGRA Accelerators
- Author
-
Lucas Fernandes Ribeiro, Ivan Saraiva Silva, and Francisco Carlos Silva
- Subjects
General Computer Science, Computer science, Modulo, Scheduling (computing), Microprocessor, Software pipelining, Software, Very long instruction word, Embedded system, Compiler, Electrical and Electronic Engineering
- Abstract
With the increasing complexity of applications, there is an urgent need for solutions that improve their performance. The loops present in some applications are responsible for up to 71% of code execution time, so optimizing loop execution can accelerate the whole application. This performance gain can be obtained with software pipelining. This work proposes RENOIR, a tool that applies the software pipelining technique through a software implementation of the modulo scheduling algorithm to generate configurations for a coarse-grained reconfigurable architecture (CGRA). RENOIR is implemented in a compiler, which receives an application and generates code that can run on a gem5 implementation of a CGRA that includes a RISC-V microprocessor and a thin reconfigurable array. The results show that all the applications tested achieved an improvement in performance, reaching a gain of 2.32x in certain applications.
- Published
- 2020
- Full Text
- View/download PDF
15. Energy and memory-aware software pipelining streaming applications on NoC-based MPSoCs
- Author
-
Suhaimi Abd Ishak, Umair Ullah Tariq, and Hui Wu
- Subjects
Computer Networks and Communications, Computer science, Parallel computing, Energy consumption, MPSoC, Scheduling (computing), Software pipelining, Hardware and Architecture, Frequency scaling, Retiming, Integer programming, Software
- Abstract
In this article, we explore the problem of energy-aware scheduling of real-time applications, modelled by conditional task graphs, on NoC-based MPSoCs such that the total energy consumption is minimized. We propose a novel energy- and memory-aware retiming conditional task graph (EMRCTG) approach that integrates task-level coarse-grained software pipelining with Dynamic Voltage and Frequency Scaling (DVFS). Our approach not only optimizes energy consumption but also ensures that memory capacity constraints are satisfied. EMRCTG has two phases. In the first phase, we map tasks to processors, transform intra-period data dependencies into inter-period ones, and generate a schedule with a Non-Linear Programming (NLP)-based algorithm assuming infinite memory capacity. The NLP-based algorithm assigns a continuous frequency and voltage to each task and each communication, and uses a polynomial-time heuristic to convert the continuous frequencies and voltages into discrete ones. We analyse the memory consumption of the generated schedule and initiate the second, schedule-repair phase if the memory capacity constraints are violated. The schedule-repair phase finds a set of nodes whose retiming values can be reduced so that the memory capacity constraints are satisfied. We compare our approach against two existing approaches, GeneS and JCCTS. GeneS is a genetic algorithm that first transforms the dependent task set into an independent task set and then jointly performs task mapping, ordering and voltage scaling. JCCTS is a mixed-integer linear programming based approach that optimally removes inter-processor communication overhead. Our experimental results show that, compared to GeneS, our approach obtains an improvement of 1.6 to 18 percent, 11 percent on average; compared to JCCTS, it achieves an improvement of 9 to 42 percent, 26 percent on average.
- Published
- 2020
- Full Text
- View/download PDF
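The retiming step in the abstract, transforming intra-period dependencies into inter-period ones, is the classic software-pipelining move. A minimal two-stage sketch (the stage functions `f` and `g` are hypothetical stand-ins, not the paper's task graphs):

```python
def sequential(xs, f, g):
    """Baseline: each period runs stage f then stage g on the same item,
    so g depends on f within the period and the two cannot overlap."""
    return [g(f(x)) for x in xs]

def pipelined(xs, f, g):
    """Retimed version: in steady state, period i runs g on f's result from
    period i-1 while f already works on item i.  The f->g dependency has
    become inter-period, so on two processors the stages could execute in
    parallel; the one-slot buffer `t` is the extra memory this costs.
    """
    if not xs:
        return []
    out = []
    t = f(xs[0])                 # prologue: fill the pipeline
    for x in xs[1:]:             # steady state: g(previous) and f(next)
        out.append(g(t))
        t = f(x)
    out.append(g(t))             # epilogue: drain the pipeline
    return out
```

Both versions compute the same outputs; only the schedule of stage instances across periods changes, which is exactly why the transformed schedule needs a memory check afterwards.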
16. Aggregate operation movement: A min-cut approach to global code motion
- Author
-
Lo, Raymond, Chan, Sun, Dehnert, Jim, Towle, Ross, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Bougé, Luc, editor, Fraigniaud, Pierre, editor, Mignotte, Anne, editor, and Robert, Yves, editor
- Published
- 1996
- Full Text
- View/download PDF
17. A comparison of modulo scheduling techniques for software pipelining
- Author
-
Pfahler, Peter, Piepenbrock, Georg, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Gyimóthy, Tibor, editor
- Published
- 1996
- Full Text
- View/download PDF
18. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops
- Author
-
Wang, Jian, Gao, Guang R., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Gyimóthy, Tibor, editor
- Published
- 1996
- Full Text
- View/download PDF
19. Delayed exceptions — Speculative execution of trapping instructions
- Author
-
Ertl, M. Anton, Krall, Andreas, Goos, Gerhard, editor, Hartmanis, Juris, editor, and Fritzson, Peter A., editor
- Published
- 1994
- Full Text
- View/download PDF
20. Compiling for the Cydra 5
- Author
-
Dehnert, James C., Towle, Ross A., Rau, B. R., editor, and Fisher, J. A., editor
- Published
- 1993
- Full Text
- View/download PDF
21. Exploiting Parallelism of Imperfect Nested Loops on Coarse-Grained Reconfigurable Architectures.
- Author
-
Yin, Shouyi, Lin, Xinhan, Liu, Leibo, and Wei, Shaojun
- Subjects
-
PARALLEL processing, COMPUTER architecture, DATA pipelining, KERNEL operating systems, APPLICATION-specific integrated circuits
- Abstract
Coarse-grained reconfigurable architecture (CGRA) is a promising parallel computing platform that provides high performance, high power efficiency and flexibility. However, for imperfect nested loops, the existing loop mapping methods often result in low execution performance and poor hardware utilization. To tackle this problem, this paper makes three contributions:
1) a highly effective and general approach to map imperfect loops on CGRAs; 2) a global optimization strategy to search for the optimal initiation intervals (IIs); 3) a powerful kernel compression method to reduce the oversized kernel. Experimental results show that our approach can reduce the total computing latency by 20.5, 58.5 and 73.2 percent compared to the state-of-the-art approaches on 2 x 2 […]
- Published
- 2016
- Full Text
- View/download PDF
22. Software pipelining for graphic processing unit acceleration: Partition, scheduling and granularity.
- Author
-
Liu, Bozhong, Qiu, Weidong, Jiang, Lin, and Gong, Zheng
- Subjects
-
DATA pipelining, ELECTRONICS, PIPELINED ADCs, HIGH performance computing, ELECTRONIC data processing
- Abstract
The graphic processing unit (GPU) is becoming increasingly popular as a performance accelerator in applications requiring high-performance parallel computing capability. In a hybrid central processing unit (CPU)/GPU system, software pipelining is a major task in delivering accelerated performance, where hiding CPU-GPU communication overheads by splitting a large task into small units is the key challenge. In this paper, we carry out a systematic investigation into task partitioning in order to achieve maximum performance gain. We first validate the advantage of an even partition strategy, and then propose the optimal scheduling, with a detailed study of how to achieve the optimal unit size (data granularity) in an analytical framework. Experiments on AMD and NVIDIA GPU platforms demonstrate that our approaches achieve around 31-59% performance improvement using software pipelining. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
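The pipelining scheme the abstract studies, splitting work into even units so the transfer of one unit overlaps computation of another, can be mimicked in pure Python with a background thread. The names `transfer` and `compute` are illustrative stand-ins for the host-to-device copy and the kernel launch, not a real GPU API:

```python
import threading
from queue import Queue

def run_pipelined(data, chunk_size, transfer, compute):
    """Split `data` into even chunks and overlap `transfer` of chunk i+1
    with `compute` of chunk i, so communication latency hides behind work.
    A bounded queue of size 2 models double buffering: at most one chunk
    is in flight while another is being computed.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    q = Queue(maxsize=2)

    def producer():
        for c in chunks:
            q.put(transfer(c))   # "copy" the next chunk in the background
        q.put(None)              # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (buf := q.get()) is not None:
        results.extend(compute(buf))
    return results
```

The unit size (`chunk_size`) is the data-granularity knob the paper optimizes: too large and the first transfer dominates, too small and per-unit overhead does.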
23. Software Pipelining in a Non-Conventional Architecture to Improve Performance.
- Author
-
Freire Lopes Nunes, Denis and Fernandes, Silvio Roberto
- Abstract
Parallelizing applications is one way to increase performance on computer architectures. In addition to the hardware organization, techniques such as pipelining and software pipelining are adopted, without changing the architecture or instruction set, to increase the throughput of processed instructions. This paper presents a version of the software pipelining technique for IPNoSys, an unconventional architecture, to improve performance when executing applications containing loops. The new version was compared with a previously implemented version of the software pipelining technique for IPNoSys and was two times more efficient than the old version. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
24. Real-Time Software Pipelining for Multidomain Motion Controllers.
- Author
-
Kang, Hyeongseok, Kim, Kanghee, and Jin, Hyun-Wook
- Abstract
Motion control systems require an isochronal real-time guarantee: each control task should periodically produce outputs with no jitter. However, it is difficult to build such a tight isochronal system on a multicore architecture with a general-purpose operating system, because the inherent resource-sharing principle introduces large jitter into the control tasks. This paper proposes a software pipelining framework for an EtherCAT-based motion controller that achieves a tight isochronal guarantee with that combination. The tight guarantee is made possible by multicore partitioning and reservation-aware task phasing, which reduce resource contention between the tasks on each stage of the pipeline. Through experiments, we show that the proposed pipelining framework gives a tight isochronal guarantee with high scalability in terms of the number of motion transactions. On a real 8-axis motion control platform with two processor cores dedicated to the pipeline and a slight modification of the Linux operating system, it achieves a maximum jitter of 10 µs for four motion transactions with a common period of 1.6 ms, whereas a priority-driven method gives a maximum jitter of a few hundred microseconds under the same conditions. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
25. Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage.
- Author
-
Bahtat, Mounir, Belkouch, Said, Elleaume, Philippe, and Le Gall, Philippe
- Subjects
FOURIER transforms, DIGITAL signal processing, MICROPROCESSORS
- Abstract
The fast Fourier transform (FFT) is perhaps today's most ubiquitous algorithm for digital data; hence, it is still studied extensively. Beyond reducing the arithmetic count of the FFT algorithm, memory references and the mapping of the scheme onto the processor's architecture are critical for a fast and efficient implementation. The main bottlenecks are the long-latency memory accesses to the butterflies' legs and the redundant references to twiddle factors. In this paper, we describe a new FFT implementation for high-end very long instruction word (VLIW) digital signal processors (DSPs), which improves performance in terms of clock cycles thanks to the resulting low-level resource balance and the reduced memory accesses for twiddle factors. The method introduces a tradeoff parameter between accuracy and speed. Additionally, we suggest a cache-efficient implementation methodology for the FFT, depending on the provided VLIW hardware resources and cache structure. Experimental results on a TI VLIW DSP show that our method reduces the number of clock cycles by an average of 51% (a 2x acceleration) compared to the most assembly-optimized and vendor-tuned FFT libraries. The FFT was generated using an instruction-level scheduling heuristic: a modulo-based, register-sensitive scheduling algorithm able to compute an aggressively efficient sequence of VLIW instructions for the FFT, maximizing the parallelism rate and minimizing clock cycles and register usage. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
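For context, the arithmetic skeleton being scheduled here is the standard radix-2 decimation-in-time FFT. A textbook Python version follows; the paper's contribution is the VLIW scheduling and twiddle-access reduction, not this recursion:

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.

    Splits the input into even- and odd-indexed halves, transforms each,
    then combines them with one butterfly (and one twiddle factor) per
    output pair.  Recomputing the twiddles per call is exactly the kind
    of redundant access a tuned implementation would table away.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t                # butterfly upper leg
        out[k + n // 2] = even[k] - t       # butterfly lower leg
    return out
```

Each butterfly reads two "legs" and one twiddle factor, which is why memory access patterns, not just the O(n log n) arithmetic, dominate real implementations.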
26. Optimizing microprograms for recurrent loops on pipelined architectures using timed Petri nets
- Author
-
Hanen, Claire, Goos, G., editor, Hartmanis, J., editor, Barstow, D., editor, Brauer, W., editor, Brinch Hansen, P., editor, Gries, D., editor, Luckham, D., editor, Moler, C., editor, Pnueli, A., editor, Seegmüller, G., editor, Stoer, J., editor, Wirth, N., editor, and Rozenberg, Grzegorz, editor
- Published
- 1990
- Full Text
- View/download PDF
27. Reducing Memory Access Conflicts with Loop Transformation and Data Reuse on Coarse-grained Reconfigurable Architecture
- Author
-
Jianfei Jiang, Zhigang Mao, Weiguang Sheng, Guanghui He, Zhongyuan Zhao, and Chen Yuge
- Subjects
Data access, Software pipelining, Computer science, Distributed computing, Synchronization (computer science), Bandwidth (computing), Overhead (computing), Control reconfiguration, Compiler, Scheduling (computing)
- Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) are promising accelerators with low power consumption and high energy efficiency. In recent years, many research works have focused on improving the programmability of CGRAs by enabling fast reconfiguration during execution. The performance of these CGRAs critically hinges on the scheduling power of the compiler, and one of the critical challenges is to reduce memory access conflicts using static compilation techniques. A memory access conflict brings synchronization overhead, which stalls the pipeline and reduces CGRA performance. Existing compilers usually tackle this challenge by orchestrating the data placement in the on-chip global memory (OGM) of the CGRA so that parallel memory accesses avoid bank conflicts. However, we find that bank conflicts are not the only cause of memory access conflicts. In some CGRAs, the bandwidth of the data network between the OGM and the processing element array (PEA) is also limited, due to low-power design principles, and unbalanced network bandwidth load is another cause of conflicts. Furthermore, redundant data accesses across iterations are one of the primary causes of memory access conflicts. Based on these observations, we provide a comprehensive and generalized compilation flow to reduce memory conflicts. First, we develop a loop transformation model that maximizes the inter-iteration data reuse of loops to reduce memory accesses under the software pipelining scheme. Second, we enhance the bandwidth utilization of the network between the OGM and the PEA, and avoid bank conflicts, with a conflict-aware spatial mapping algorithm that can easily be integrated into an existing CGRA modulo scheduling compilation flow. Experimental results show our method improves performance by an average of 44% compared with a state-of-the-art CGRA compilation flow.
- Published
- 2021
- Full Text
- View/download PDF
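The inter-iteration data reuse that the loop transformation model maximizes can be illustrated with a 3-point stencil: rotating already-loaded values across iterations removes two of the three loads in steady state. This is a schematic sketch of the idea, not the paper's compilation flow:

```python
def stencil_naive(a):
    """3-point moving sum; each iteration issues three array loads."""
    loads = 0
    out = []
    for i in range(1, len(a) - 1):
        x, y, z = a[i - 1], a[i], a[i + 1]
        loads += 3
        out.append(x + y + z)
    return out, loads

def stencil_reuse(a):
    """Same computation, but two of the three values are carried over from
    the previous iteration, so steady state needs one new load per
    iteration.  Fewer loads means fewer chances for bank or network
    conflicts when the loop is software-pipelined on a CGRA.
    """
    loads = 2
    x, y = a[0], a[1]            # initial window fill
    out = []
    for i in range(1, len(a) - 1):
        z = a[i + 1]; loads += 1 # the only load in steady state
        out.append(x + y + z)
        x, y = y, z              # rotate the reuse window
    return out, loads
```

On a CGRA the rotated values would live in the PE-local register network rather than go back through the OGM, which is the conflict-reduction effect the abstract describes.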
28. On the effectiveness of register moves to minimise post-pass unrolling in software pipelined loops.
- Author
-
Bachir, Mounira, Cohen, Albert, and Touati, Sid-Ahmed-Ali
- Abstract
Software pipelining is a powerful technique to expose fine-grain parallelism, but it results in variables staying alive across more than one kernel iteration. It requires periodic register allocation and is challenging for code generation: the lack of a reliable solution currently restricts the applicability of software pipelining. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori [11], [10]. However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming, achieved through the insertion of move operations; but this may increase the initiation interval (II), which nullifies the benefits of software pipelining. This article aims at tightly controlling the post-pass loop unrolling necessary to generate code. We study the potential of live range splitting to reduce kernel loop unrolling, introducing additional move instructions without increasing the II. We provide a complete formalisation of the problem, an algorithm, and extensive experiments. Our algorithm yields low unrolling degrees in most cases, with no increase of the II. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
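The unrolling degrees at stake follow a classical modulo-variable-expansion rule: a variable alive for L cycles in a kernel with initiation interval II needs ceil(L/II) copies, and a move-free kernel is unrolled to the least common multiple of these counts, which is how the degree can "reach absurd levels". A toy calculation under that assumption (not the article's algorithm):

```python
from math import ceil, gcd

def unroll_degree(lifetimes, ii):
    """Kernel unroll factor under modulo variable expansion: the lcm over
    all variables of ceil(lifetime / II).  Each variable needs that many
    copies so successive iterations do not clobber each other's values,
    and the kernel must be unrolled to a common multiple of the counts.
    """
    degree = 1
    for life in lifetimes:
        copies = ceil(life / ii)
        degree = degree * copies // gcd(degree, copies)
    return degree
```

Splitting a long live range with a move instruction shrinks one lifetime and hence the lcm: in the example below, cutting the 9-cycle range into ranges of 5 and 4 cycles drops the unroll degree from 60 to 12, which is the lever the article studies.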
29. Assembly programming optimization based on TS201.
- Author
-
Shuting Guo and Hongchun Hu
- Abstract
Optimizing the performance of DSP-based programs is very important for efficient execution. Several software optimization measures for assembly code on the TS201 are compared and analyzed, including avoiding or exploiting pipeline delays, SIMD techniques, software pipelining and loop unrolling. An example shows that execution efficiency is improved by 49.3%. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
30. Efficient implementation of Background Subtraction GMM method on Digital Signal Processor
- Author
-
Abdelouahed Abounada, Assia Arsalane, Smail Bariko, and Abdessamad Klilou
- Subjects
Background subtraction, Digital signal processor, Computer science, Video processing, Software pipelining, Software, Memory management, Computer engineering, Compiler, Digital signal processing
- Abstract
Detecting moving objects in real-time video processing is a challenging task. Background subtraction using a Gaussian Mixture Model (GMM) is one of the most widely used techniques; however, it requires high computing power to meet real-time processing constraints. An efficient implementation of the GMM algorithm on an embedded platform based on the C6678 digital signal processor (DSP) is proposed in this paper. First, an optimized GMM code was developed for the proposed embedded platform. Then, an efficient memory management strategy was implemented to ensure contiguous memory access. In addition, compiler options were enabled to benefit from loop unrolling and software pipelining and improve computing performance. To evaluate the quality of our implementation, a metrics comparison was performed against the ground-truth images provided by literature dataset references. Experimental results show important improvements over the optimized implementation of the state of the art.
- Published
- 2020
- Full Text
- View/download PDF
31. Deployment of loop-intensive applications on heterogeneous multiprocessor architectures (Déploiement d'applications à boucles intensives sur des architectures multiprocesseurs hétérogènes)
- Author
-
Glanon, Philippe Anicet, Laboratoire d'Intégration des Systèmes et des Technologies (LIST), Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA), Université Paris-Saclay, and Chokri Mraidha
- Subjects
[INFO.INFO-SY] Computer Science [cs]/Systems and Control [cs.SY] ,Maximum throughput ,[INFO.INFO-AR]Computer Science [cs]/Hardware Architecture [cs.AR] ,[INFO.INFO-CC]Computer Science [cs]/Computational Complexity [cs.CC] ,[INFO.INFO-AR] Computer Science [cs]/Hardware Architecture [cs.AR] ,[INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing ,Cyclic scheduling ,[INFO.INFO-DM]Computer Science [cs]/Discrete Mathematics [cs.DM] ,Static dataflow graphs ,Architectures hétérogènes ,Système cyber-Physiques ,[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing ,[INFO.INFO-FL]Computer Science [cs]/Formal Languages and Automata Theory [cs.FL] ,[INFO.INFO-PF] Computer Science [cs]/Performance [cs.PF] ,[INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,[INFO.INFO-SY]Computer Science [cs]/Systems and Control [cs.SY] ,Ordonnancement cyclique ,Débit maximal ,[INFO.INFO-FL] Computer Science [cs]/Formal Languages and Automata Theory [cs.FL] ,Ordonnancement multiprocesseur ,Heterogeneous architectures ,Software pipelining ,Graphes de flots de données statiques ,[INFO.INFO-ES] Computer Science [cs]/Embedded Systems ,[INFO.INFO-DM] Computer Science [cs]/Discrete Mathematics [cs.DM] ,[INFO.INFO-PF]Computer Science [cs]/Performance [cs.PF] ,Pipeline logiciel ,[INFO.INFO-CC] Computer Science [cs]/Computational Complexity [cs.CC] ,Cyber-Physical systems ,[INFO.INFO-ES]Computer Science [cs]/Embedded Systems ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,Multiprocessor scheduling - Abstract
Cyber-physical systems (CPSs) are distributed computing-intensive systems that integrate a wide range of software applications and heterogeneous processing resources, each interacting with the others through different communication resources to process a large volume of data sensed from physical, chemical or biological processes. An essential issue in the design stage of these systems is to predict the timing behaviour of software applications and to provide performance guarantees to these applications. In order to tackle this issue, efficient static scheduling strategies are required to deploy the computations of software applications on the processing architectures. These scheduling strategies should deal with several constraints, including the loop-carried dependency constraints between the computational programs as well as the resource and communication constraints of the processing architectures intended to execute these programs. Since loops are among the most time-critical parts of many computing-intensive applications, the optimal timing behaviour and performance of the applications depend on the optimal schedule of the loop structures enclosed in the computational programs executed by the applications. Therefore, to provide performance guarantees for the applications, the scheduling strategies should efficiently explore and exploit the parallelism embedded in the repetitive execution patterns of loops while respecting the resource and communication constraints of the processing architectures of CPSs. Scheduling a loop under resource and communication constraints is a complex problem; to solve it efficiently, heuristics are necessary. However, to design efficient heuristics, it is important to characterize the set of optimal solutions for the scheduling problem. An optimal solution for a scheduling problem is a schedule that achieves an optimal performance goal.
In this thesis, we tackle the study of resource-constrained and communication-constrained scheduling of loop-intensive applications on heterogeneous multiprocessor architectures, with the goal of optimizing application throughput. In order to characterize the set of optimal scheduling solutions and to design efficient scheduling heuristics, we use the synchronous dataflow (SDF) model of computation to describe the loop structures specified in the computational programs of software applications, and we design software-pipelined scheduling strategies based on the structural and mathematical properties of the SDF model, so as to generate optimal and approximate scheduling solutions.
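The SDF model underpinning these scheduling strategies admits a basic consistency check: solving the balance equations for the smallest integer repetition vector. The sketch below is standard SDF theory (not this thesis's algorithms) and assumes a connected graph; names are illustrative:

```python
from fractions import Fraction
from math import gcd, lcm

def sdf_repetition_vector(n, edges):
    """Smallest integer repetition vector of an SDF graph, or None if the
    rates are inconsistent. edges: (u, v, prod, cons) meaning actor u
    produces `prod` tokens and actor v consumes `cons` tokens per firing.
    Assumes the graph is connected (Python 3.9+ for variadic gcd/lcm)."""
    adj = [[] for _ in range(n)]
    for u, v, p, c in edges:
        adj[u].append((v, Fraction(p, c)))   # balance: q[v] = q[u] * p / c
        adj[v].append((u, Fraction(c, p)))
    q = [None] * n
    q[0] = Fraction(1)
    stack = [0]
    while stack:                             # propagate rates over the graph
        u = stack.pop()
        for v, ratio in adj[u]:
            if q[v] is None:
                q[v] = q[u] * ratio
                stack.append(v)
            elif q[v] != q[u] * ratio:
                return None                  # inconsistent: no periodic schedule
    denom = lcm(*(f.denominator for f in q))
    ints = [int(f * denom) for f in q]
    g = gcd(*ints)
    return [x // g for x in ints]
```

A consistent repetition vector is a precondition for the kind of periodic (software-pipelined) schedules the thesis constructs.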
- Published
- 2020
32. Scalable Multi-FPGA Acceleration for Large RNNs with Full Parallelism Levels
- Author
-
Hamin Jang, Eriko Nurvitadhi, Dongup Kwon, Suyeon Hur, and Jangwoo Kim
- Subjects
Scheme (programming language) ,Dataflow ,Computer science ,02 engineering and technology ,Parallel computing ,010501 environmental sciences ,01 natural sciences ,020202 computer hardware & architecture ,Software pipelining ,Recurrent neural network ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,Field-programmable gate array ,computer ,0105 earth and related environmental sciences ,computer.programming_language - Abstract
The increasing size of recurrent neural networks (RNNs) makes it hard to meet the growing demand for real-time AI services. For low-latency RNN serving, FPGA-based accelerators can leverage specialized architectures with optimized dataflow. However, they also suffer from severe HW under-utilization when partitioning RNNs, and thus fail to obtain scalable performance. In this paper, we identify the performance bottlenecks of existing RNN partitioning strategies. Then, we propose a novel RNN partitioning strategy to achieve scalable multi-FPGA acceleration for large RNNs. First, we introduce three parallelism levels and exploit them by partitioning weight matrices, matrix/vector operations, and layers. Second, we examine the performance impact of collective communications and software pipelining to derive more accurate and optimal distribution results. We prototyped an FPGA-based acceleration system using multiple Intel high-end FPGAs, and our partitioning scheme allows up to 2.4x faster inference of modern RNN workloads than conventional partitioning methods.
- Published
- 2020
- Full Text
- View/download PDF
33. An Embedded System for Gray Matter Segmentation of PET-Image
- Author
-
Dipankar Khorat, Samarendra Kumar Sharma, and Khakon Das
- Subjects
Software pipelining ,Wavelet ,Lifting scheme ,business.industry ,Computer science ,Embedded system ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,k-means clustering ,Wavelet transform ,Segmentation ,business ,Peak signal-to-noise ratio ,Haar wavelet - Abstract
This paper presents a complete embedded system for segmenting the gray matter of the human brain from Positron Emission Tomography (PET) images, using an 8-bit embedded Atmel microcontroller operating at low power and a frequency of 16 MHz. The embedded system for gray matter segmentation (ESGMS) of PET images is implemented in hardware, with the design made more efficient by adopting principles from software development. Parallelism is achieved through six levels of software pipelining and two levels of dedicated hardware pipelining. Using the lifting technique, the wavelet function is generated and the image is denoised simultaneously; this is generally known as a second-generation wavelet. The proposed architecture reduces the number of memories through an in-place approach, in which the result of an operation overwrites the operand location. The design uses the Haar wavelet as the mother wavelet, combined with the lifting technique to perform the split and update steps. The Haar mother wavelet with the lifting scheme produces the denoised image for subsequent operations, and the wavelets yield a better output image, as established by several statistical measures. With the proposed system, a peak signal-to-noise ratio (PSNR) of about 107 dB is obtained. For segmentation, an unsupervised clustering algorithm (K-means) is used and each segment is superimposed with a different color; the gray matter of the human brain in the PET image is labeled in pink.
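The split/predict/update structure of the Haar lifting scheme described above can be sketched as follows; this is the textbook lifting factorisation with one common scaling convention, not the paper's microcontroller code:

```python
def haar_lift_forward(x):
    """One level of the in-place Haar lifting scheme: split into even/odd
    samples, predict (detail = odd - even), update (approx = even + detail/2).
    Afterwards x = [approximations..., details...]. Scaling conventions vary."""
    even, odd = x[0::2], x[1::2]             # split
    for i in range(len(odd)):
        odd[i] -= even[i]                    # predict step
    for i in range(len(even)):
        even[i] += odd[i] / 2.0              # update step
    x[:] = even + odd
    return x

def haar_lift_inverse(x):
    """Undo the update and predict steps, then merge even/odd samples."""
    half = len(x) // 2
    even, odd = x[:half], x[half:]
    for i in range(half):
        even[i] -= odd[i] / 2.0
    for i in range(half):
        odd[i] += even[i]
    x[:] = [v for pair in zip(even, odd) for v in pair]
    return x
```

Because every step overwrites its own operands, lifting needs no auxiliary buffer, which is the in-place, memory-reducing property the abstract exploits.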
- Published
- 2020
- Full Text
- View/download PDF
34. Performance evaluation of implicit and explicit SIMDization
- Author
-
Angela Pohl, Asadollah Shahbahrami, Hossein Amiri, and Ben Juurlink
- Subjects
Loop unrolling ,Computer Networks and Communications ,Computer science ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Streaming SIMD Extensions ,computer.software_genre ,Instruction set ,Software pipelining ,Artificial Intelligence ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,020201 artificial intelligence & image processing ,Image tracing ,SIMD ,Compiler ,computer ,Software - Abstract
Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions to exploit data-level parallelism in their General Purpose Processors (GPPs). Each SIMD technology, such as Streaming SIMD Extensions (SSE) and Advanced Vector eXtensions (AVX), has its own Instruction Set Architecture (ISA) equipped with Special Purpose Instructions (SPIs). In order to exploit these features, many programming approaches have been developed. The Intrinsic Programming Model (IPM) is a low-level concept for explicit SIMDization. Besides, Compiler Automatic Vectorization (CAV) has been embedded in modern compilers such as the Intel C++ compiler (ICC), the GNU Compiler Collection (GCC) and LLVM for implicit vectorization. Each SIMDization approach yields different improvements because of differences in SIMD ISAs, vector register widths, and programming approaches. Our goal in this paper is to evaluate the performance of explicit and implicit vectorization. Our experimental results show that the behavior of explicit vectorization is almost the same across different compilers, in contrast to implicit vectorization. IPM improves performance more than CAVs. In general, the ICC and GCC compilers vectorize kernels and use SPIs more efficiently than LLVM. In addition, AVX2 technology is more useful for small matrices and compute-intensive kernels than for large matrices and data-intensive kernels, because of memory bottlenecks. Furthermore, CAVs fail to vectorize kernels that have overlapping and non-consecutive memory access patterns. The way a kernel is implemented impacts its vectorization. In order to understand what kinds of scalar implementations of an algorithm are suitable for vectorization, an approach based on a code modification technique is proposed.
Our experimental results show that scalar implementations that have either loop collapsing, loop unrolling, software pipelining, or loop exchange techniques can be efficiently vectorized compared to straightforward implementations.
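A scalar implementation shape the abstract finds vectorises well is a loop unrolled over independent accumulators. The sketch below only illustrates the source-level transformation (in Python it brings no speedup; in C the independent partial sums expose data-level parallelism to the auto-vectorizer):

```python
def dot_unrolled(a, b):
    """Dot product unrolled by 4 with four independent accumulators,
    plus a scalar epilogue for leftover elements."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:            # unrolled main loop
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    s = s0 + s1 + s2 + s3
    while i < n:                 # epilogue
        s += a[i] * b[i]
        i += 1
    return s
```

Breaking the single accumulator into four removes the loop-carried dependence on one sum, which is what lets a compiler map the body onto SIMD lanes.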
- Published
- 2018
- Full Text
- View/download PDF
35. SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores
- Author
-
Trevor E. Carlson, Kim-Anh Tran, Konstantinos Koukos, Alexandra Jimborean, Magnus Själander, and Stefanos Kaxiras
- Subjects
Computer science ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,CAS latency ,Software pipelining ,Software ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Code generation ,010302 applied physics ,Computer Sciences ,business.industry ,Program optimization ,Computer Graphics and Computer-Aided Design ,020202 computer hardware & architecture ,Microarchitecture ,Datavetenskap (datalogi) ,Very long instruction word ,Memory-level parallelism ,Compiler ,Cache ,business ,Instruction-level parallelism ,computer ,Computer hardware - Abstract
Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency. In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach is able to hide memory latency and attain increased memory and instruction level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches we do not rely on either software or hardware speculation, which can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, and is thus able to transform applications that include highly complex control-flow and indirect memory accesses. © ACM, 2018. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions of Computing Education, https://doi.org/10.1145/3296979.3192393
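The decoupled access-execute idea SWOOP builds on can be illustrated by splitting a loop into an access phase that issues all the loads of a chunk and an execute phase that consumes them. The sketch below (function and chunk size are illustrative; plain Python has no real prefetch, so chunking only shows the reordering) mimics how hoisted loads overlap long-latency misses:

```python
def access_execute(indices, table):
    """Process table[indices] in chunks: first gather all loads of a chunk
    (access phase), then run the computation on them (execute phase)."""
    CHUNK = 4
    out = []
    for base in range(0, len(indices), CHUNK):
        chunk = indices[base:base + CHUNK]
        loaded = [table[i] for i in chunk]   # access phase: issue the loads
        out.extend(v * 2 for v in loaded)    # execute phase: consume them
    return out
```

In the compiled setting, the access phase lets several cache misses be outstanding at once instead of serializing load-use pairs, without any speculation.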
- Published
- 2018
- Full Text
- View/download PDF
36. Throughput-Memory Footprint Trade-Off in Synthesis of Streaming Software on Embedded Multiprocessors.
- Author
-
HASHEMI, MATIN, FOROOZANNEJAD, MOHAMMAD H., and GHIASI, SOHEIL
- Subjects
COMPUTER storage devices ,COMPUTER software ,STREAMING technology ,DATA flow computing ,ELECTRONIC data processing ,MULTIPROCESSORS - Abstract
We study the trade-off between throughput and memory footprint of embedded software that is synthesized from acyclic static dataflow (task graph) specifications targeting distributed memory multiprocessors. We identify iteration overlapping as a knob in the synthesis process by which one can trade application throughput for its memory requirement. Given an initial processor assignment and non-overlapped task schedule, we formally present underlying properties of the problem, such as constraints on a valid iteration overlapping, maximum possible throughput, and minimum memory footprint. Moreover, we develop an effective algorithm for generation of a rich set of design points that provide a range of trade-off options. Experimental results on a number of applications and architectures validate the effectiveness of our approach. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
37. Minimal Unroll Factor for Code Generation of Software Pipelining.
- Author
-
Bachir, Mounira, Touati, Sid-Ahmed-Ali, Brault, Frederic, Gregg, David, and Cohen, Albert
- Subjects
- *
DATA pipelining , *LOOP tiling (Computer science) , *PARALLEL processing , *EMBEDDED computer systems , *COMPUTER software , *CIPHERS - Abstract
We address the problem of generating compact code from software pipelined loops. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code generation challenge. We are looking for the minimal unrolling factor enabling the periodic register allocation of software pipelined kernels. This challenge is generally addressed through one of: (1) hardware support in the form of rotating register files, which solve the unrolling problem but are expensive in hardware; (2) register renaming by inserting register moves, which increases the number of operations in the loop, and may damage the schedule of the software pipeline and reduce throughput; (3) post-pass loop unrolling that does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers (the maximal number of values simultaneously alive) are sufficient for periodic register allocation (Eisenbeis et al. in PACT '95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264-267, Manchester, UK; Hendren et al. in CC '92: Proceedings of the 4th International Conference on Compiler Construction, pages 176-191, London, UK). However, the best existing heuristic for controlling this code growth, modulo variable expansion (Lam in SIGPLAN Not 23(7):318-328), may not apply the correct amount of loop unrolling to guarantee that MAXLIVE registers are enough, which may result in register spills (Eisenbeis et al. in PACT '95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264-267, Manchester, UK).
This paper presents our research results on the open problem of minimal loop unrolling, allowing a software-only code generation that does not trade the optimality of the initiation interval (II) for the compactness of the generated code. Our novel idea is to use the remaining free registers after periodic register allocation to relax the constraints on register reuse. The problem of minimal loop unrolling arises either before or after software pipelining, either with a single or with multiple register types (classes). We provide a formal problem definition for each scenario, and we propose and study a dedicated algorithm for each problem. Our solutions are implemented within an industrial-strength compiler for a VLIW embedded processor from STMicroelectronics, and validated on multiple benchmark suites. [ABSTRACT FROM AUTHOR]
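For contrast with the minimal factor studied above, the classical unrolling requirement can be sketched as follows. This is the textbook sufficient condition behind modulo variable expansion, not the paper's algorithm; names are illustrative:

```python
from math import ceil, gcd
from functools import reduce

def mve_unroll_factor(lifetimes, ii):
    """Unroll factor under the classical modulo-variable-expansion view:
    a value alive for L cycles in an II-cycle kernel needs
    q = ceil(L / II) rotating register copies, and unrolling the kernel
    by lcm of all q_v lets every value cycle through its copies
    consistently. lifetimes: lifetime length (in cycles) of each value."""
    qs = [max(1, ceil(l / ii)) for l in lifetimes]
    return reduce(lambda a, b: a * b // gcd(a, b), qs, 1)  # lcm of the q_v
```

The lcm can grow quickly (e.g. lifetimes needing 2 and 3 copies already force a factor of 6), which is precisely the code-growth problem the minimal-unrolling work attacks.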
- Published
- 2013
- Full Text
- View/download PDF
38. Performance Aspects of Sparse Matrix-Vector Multiplication
- Author
-
I. Šimeček
- Subjects
sparse matrix-vector multiplication ,code restructuring ,loop unrolling ,software pipelining ,cache hierarchy ,Engineering (General). Civil engineering (General) ,TA1-2040 - Abstract
Sparse matrix-vector multiplication (shortly SpM×V) is an important building block in algorithms solving sparse systems of linear equations, e.g., FEM. Due to matrix sparsity, the memory access patterns are irregular and utilization of the cache can suffer from low spatial or temporal locality. Approaches to improve the performance of SpM×V are based on matrix reordering and register blocking [1, 2], sometimes combined with software pipelining [3]. Due to its overhead, register blocking achieves good speedups only for a large number of executions of SpM×V with the same matrix A. We have investigated the impact of two simple SW transformation techniques (software pipelining and loop unrolling) on the performance of SpM×V, and have compared it with several implementation modifications aimed at reducing computational and memory complexity and improving the spatial locality. We investigate the performance gains of these modifications on four CPU platforms.
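The baseline loop these transformations start from is the plain CSR SpM×V kernel; a minimal reference version (not the paper's optimized variants) looks like this:

```python
def spmv_csr(val, col_ind, row_ptr, x):
    """y = A*x for A stored in Compressed Sparse Row (CSR) format.
    val: nonzero values, col_ind: their column indices,
    row_ptr: start offset of each row in val (length n+1)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += val[k] * x[col_ind[k]]   # irregular, indirect access to x
        y[i] = s
    return y
```

The indirect access `x[col_ind[k]]` is the source of the poor locality discussed above; unrolling and software pipelining the inner loop, or register-blocking groups of rows, are applied on top of exactly this structure.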
- Published
- 2006
39. Software Pipelining for Stream Programs on Resource Constrained Multicore Architectures.
- Author
-
Wei, Haitao, Yu, Junqing, Yu, Huafei, Qin, Mingkang, and Gao, Guang R.
- Subjects
- *
COMPUTER software , *DATA pipelining , *STREAMING technology , *MULTICORE processors , *COMPUTER architecture , *COMPUTER scheduling , *SOURCE code - Abstract
The stream programming model has been productively applied to a number of important application domains, and software pipelining is an important code scheduling technique for stream programs. However, the multicore evolution has presented a new dimension of challenges: how to orchestrate the best software pipelining schedule in the face of resource-constrained architectures (e.g., number of cores, available memory, and bandwidth)? In this paper, we propose a new solution methodology to address the problem above. Our main contributions include the following. A unified Integer Linear Programming (ILP) formulation has been proposed that combines the requirements of both rate-optimal software pipelining and the minimization of inter-core communication overhead. Next, an extended formulation has been proposed to model the schedule on memory-size-constrained systems. It orchestrates rate-optimal software pipelined execution of stream programs under strict memory, processor core, and communication constraints. A solution testbed has been implemented for the proposed problem formulations, realized by extending the Brook programming environment with our software pipelining support, named DFBrook. An experimental study has been conducted to verify the effectiveness of the proposed solutions. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
40. A software pipelining algorithm of streaming applications with low buffer requirements.
- Author
-
Hatanaka, A. and Bagherzadeh, N.
- Subjects
ALGORITHMS ,STREAMING technology ,APPLICATION software ,BUFFER storage (Computer science) ,PROGRAMMING languages - Abstract
Stream programming languages have become popular owing to their representations that enable parallelization of applications via static analysis. Several research groups have proposed approaches to software pipeline streaming applications onto multi/many-core architectures, such as CELL BE processors and NVIDIA GPUs. In this paper, we present a novel scheduling algorithm that software-pipelines streaming applications onto multi/many core architectures. The algorithm generates software pipeline schedules by formulating and solving MILP (Mixed Integer Linear Programming) problems. Experimental results show that compared to previous works, our approach generates schedules that use up to a 71% smaller amount of buffers needed for communication between kernels. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
41. Worst case analysis of decomposed software pipelining for cyclic unitary RCPSP with precedence delays.
- Author
-
Benabid, Abir and Hanen, Claire
- Subjects
DATA pipelining ,END-to-end delay ,DELAY lines ,SCHEDULING ,APPROXIMATION algorithms - Abstract
In this paper, we address a cyclic scheduling problem of finding a periodic schedule with minimal period for the unitary resource constrained cyclic scheduling problem. The main originality is in being able to cope with both precedence delays and complex resource settings, which make the problem NP-complete in general. A guaranteed approach, called Decomposed Software Pipelining, has been proposed by Gasperoni and Schwiegelshohn (Parallel Process. Lett. 4:391-403), followed by the retiming method by Calland et al. (IEEE Trans. Parallel Distrib. Syst. 9(1):24-35) to solve the problem assuming parallel identical processors and ordinary precedence constraints. In this paper, an extension of this approach to unitary resource-constrained cyclic scheduling problems with precedence delays is analyzed and its worst case performance ratio is provided. [ABSTRACT FROM AUTHOR]
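A standard lower bound on the period of any such cyclic schedule is the recurrence-constrained minimum initiation interval, the maximum over dependence cycles of total latency divided by total distance. A brute-force sketch (general cyclic-scheduling background, not the paper's approximation algorithm; only sensible for small graphs):

```python
from math import ceil

def rec_mii(n, edges):
    """Recurrence-constrained lower bound on the initiation interval:
    max over dependence cycles of ceil(total latency / total distance).
    edges: (u, v, latency, distance). Assumes every cycle carries a
    positive total distance (otherwise no valid schedule exists)."""
    adj = {}
    for u, v, lat, dist in edges:
        adj.setdefault(u, []).append((v, lat, dist))
    best = 1

    def dfs(start, node, lat, dist, visited):
        nonlocal best
        for v, l, d in adj.get(node, []):
            if v == start:                   # closed a cycle back to start
                if dist + d > 0:
                    best = max(best, ceil((lat + l) / (dist + d)))
            elif v not in visited and v > start:
                dfs(start, v, lat + l, dist + d, visited | {v})

    for s in range(n):                       # enumerate each simple cycle
        dfs(s, s, 0, 0, {s})                 # once, from its smallest vertex
    return best
```

Any periodic schedule's period is at least this value, so the worst-case ratios analyzed above compare achieved periods against bounds of exactly this kind.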
- Published
- 2011
- Full Text
- View/download PDF
42. Leakage-Aware Modulo Scheduling for Embedded VLIW Processors.
- Author
-
Yong Guan and Jingling Xue
- Subjects
EMBEDDED computer systems ,SCHEDULING ,HIGH performance processors ,SEMICONDUCTORS ,ENERGY consumption ,NANOELECTROMECHANICAL systems ,TRIMARANS ,COMPILERS (Computer programs) - Abstract
As semiconductor technologies move down to the nanometer scale, leakage power has become a significant component of total power consumption. In this paper, we present a leakage-aware modulo scheduling algorithm to achieve leakage energy savings for applications with loops on Very Long Instruction Word (VLIW) architectures. The proposed algorithm is designed to maximize the idleness of function units integrated with the dual-threshold domino logic, and to reduce the number of transitions between the active and sleep modes. We have implemented our technique in the Trimaran compiler and conducted experiments using a set of embedded benchmarks from DSPstone and MiBench on the cycle-accurate VLIW simulator of Trimaran. The results show that our technique achieves significant leakage energy savings compared with a previously published DAG-based (Directed Acyclic Graph) leakage-aware scheduling algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
43. A survey on system level energy optimisation for MPSoCs in IoT and consumer electronics
- Author
-
Haider Ali, Yongjun Zheng, Xiaojun Zhai, Nikos Antonopoulos, James Hardy, Kaniz Fatema, Umair Ullah Tariq, Faycal Bensaali, Liu Lu, and Abbes Amira
- Subjects
Service (systems architecture) ,General Computer Science ,business.industry ,Computer science ,Quality of service ,MPSoC ,Theoretical Computer Science ,Software pipelining ,Computer architecture ,Smart city ,The Internet ,Frequency scaling ,business ,Wireless sensor network - Abstract
Internet-of-Things (IoT) is an appealing service to revolutionise Smart City (SC) initiatives across the globe. IoT interconnects a plethora of digital devices known as Sensor Nodes (SNs) to the Internet. Due to their high performance and exceptional Quality-of-Service (QoS), Multiprocessor System-on-Chip (MPSoC) computing architectures are gaining increasing popularity for computationally intensive workloads in both IoT and consumer electronics. In this survey, we explore the balance between the IoT paradigm and its applications in SCs while introducing Wireless Sensor Networks (WSNs), including the structure of the SN. We consider MPSoC systems in relation to characteristics such as architecture and the communication technology involved, providing an insight into the benefits of coupling MPSoCs with IoT. This paper also investigates prevalent software-level energy optimisation techniques and extensively reviews workload mapping and scheduling approaches from 2001 to today for energy savings using (1) Dynamic Voltage and Frequency Scaling (DVFS) and/or Dynamic Power Management (DPM), (2) inter-processor communication reduction, and (3) coarse-grained software pipelining integrated with DVFS. This paper constructively summarises the findings of these approaches and algorithms, identifying insightful directions for future research avenues.
- Published
- 2021
- Full Text
- View/download PDF
44. Periodic register saturation in innermost loops
- Author
-
Touati, Sid-Ahmed-Ali and Mathe, Zsolt
- Subjects
- *
CONSTRAINT satisfaction , *HIGH performance computing , *COMPUTER scheduling , *ESTIMATION theory , *ENERGY consumption , *COMPUTER software - Abstract
Abstract: This article treats register constraints in high performance codes and embedded VLIW computing, aiming to decouple register constraints from instruction scheduling. It extends the register saturation (RS) concept to periodic instruction schedules, i.e., software pipelining (SWP). We formally study an approach which consists in computing the exact upper bound of the register need over all the valid SWP schedules of an innermost loop (without overestimation), independently of the functional unit constraints. We call this upper limit the periodic register saturation (PRS) of the data dependence graph (DDG). PRS is well adapted to situations where spilling is not a favoured solution for reducing register pressure compared to ILP scheduling: spill operations access memory, incurring higher energy consumption. Also, spill code introduces unpredictable cache effects and makes Worst-Case-Execution-Time (WCET) estimation less accurate. PRS is a computer science concept that has many possible applications. First, it provides compiler designers new ways to generate better code regarding register optimisation by avoiding useless spilling. Second, its computation can help architectural designers decide on the most suitable number of available registers inside an embedded VLIW processor; such an architectural decision can be made with full respect to instruction level parallelism extraction, independently of the chosen functional unit configuration. Third, it can be used to verify, prior to instruction scheduling, that a code does not require spilling. In this paper, we develop an appropriate mathematical formalism for the problem of PRS computation and reduction in the case of loops where data dependence graphs may be cyclic. We prove that PRS reduction is NP-hard and we provide optimal and approximate methods. We have implemented our methods, and our experimental results demonstrate the practical usefulness and efficiency of the PRS approach.
[Copyright © Elsevier]
- Published
- 2009
- Full Text
- View/download PDF
45. High-performance computing of for a vector of inputs on Alpha and IA-64 CPUs
- Author
-
Haidar Sharif, Md., Basermann, Achim, Seidel, Christian, and Hunger, Axel
- Subjects
- *
MICROPROCESSORS , *HIGH performance computing , *MODERN architecture , *TECHNOLOGY - Abstract
Abstract: As modern microprocessors have become more sophisticated, the performance of software on modern architectures has grown more and more difficult to dissect and predict. The execution of a program nowadays entails the complex interaction of code, compiler and processor micro-architecture. The built-in functions to compute or of the math library and hardware are often incapable of achieving the challenging performance of high-performance numerical computing. To meet this demand, the current trend in constructing high-performance numerical computing for the specific processors Alpha 21264 & 21364, and IA-64 has been optimized for and for a vector of inputs, which is significantly faster than optimized library routines. A detailed discussion of how the processor micro-architecture as well as the manual optimization techniques improve the computing performance is developed. [Copyright © Elsevier]
- Published
- 2008
- Full Text
- View/download PDF
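The names of the math functions optimized in this record were lost when its inline formulas were stripped, so they are not reconstructed here. As a generic illustration of the technique the abstract describes — evaluating a numerical routine for a whole vector of inputs so the compiler can overlap independent iterations — here is a Horner-rule polynomial kernel (a degree-10 Taylor polynomial of exp is used purely as a stand-in example):

```python
import math

def horner_vector(coeffs, xs):
    """Evaluate a polynomial (highest-degree coefficient first) at every
    point of xs with Horner's rule. Batching the whole vector through one
    regular loop nest exposes independent iterations that a compiler can
    software-pipeline, unlike one-call-at-a-time library use."""
    out = []
    for x in xs:
        acc = 0.0
        for c in coeffs:
            acc = acc * x + c   # Horner step: acc = (...((c_n x + c_{n-1}) x + ...))
        out.append(acc)
    return out

# stand-in workload: degree-10 Taylor polynomial of exp on [-0.5, 0.5]
coeffs = [1.0 / math.factorial(k) for k in range(10, -1, -1)]
xs = [i / 100.0 for i in range(-50, 51)]
ys = horner_vector(coeffs, xs)
```

On a statically scheduled machine like the IA-64, the compiler can keep several `acc` chains in flight across loop iterations, which is the source of the speedups the article reports.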
46. On the Periodic Register Need in Software Pipelining.
- Author
-
Touati, Sid-Ahmed-Ali
- Subjects
- *
ALGORITHMS , *DATA pipelining , *ONLINE data processing , *COMPUTER scheduling , *TIME measurements , *COMPUTER networks , *CONSTRAINT databases , *SOFTWARE architecture , *COMPUTER network architectures - Abstract
This paper presents several theoretical and fundamental results on the register need in periodic schedules, also known as MAXLIVE. Our first contribution is a novel formula for computing the exact number of registers needed by a scheduled loop. This formula has two advantages: its computation can be done by a polynomial algorithm with O(n lg n) complexity (n is the number of instructions in the loop), and it allows the generalization of a previous result [13]. Second, during software pipelining, we show that the minimal number of registers needed may increase when incrementing the initiation interval (II), which is contrary to intuition. For the case of zero architectural delays in accessing registers, we provide a sufficient condition for keeping the minimal number of registers from increasing when incrementing the II. Third, we prove an interesting property that enables us to optimally compute the minimal periodic register sufficiency of a loop over all its valid periodic schedules, irrespective of II. Fourth and last, we prove that the problem of optimal stage scheduling under register constraints is polynomially solvable for a subclass of data dependence graphs, whereas this problem is known to be NP-complete for arbitrary dependence graphs [7]. The latter result generalizes a previous achievement [13] that addressed data dependence trees and forests of trees. In this study, we consider cyclic data dependence graphs without taking any resource constraints into account. The aim of our theoretical results on the periodic register need is to help current and future software pipeliners achieve significant performance improvements by making better (if not the best) use of the available resources. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
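The exact register need of a scheduled loop, as discussed in the record above, can be computed directly from the value lifetimes and the initiation interval: a value whose lifetime is longer than II stays live across several overlapping iterations, each of which needs its own register. A minimal sketch (integer cycle times and the function name are assumptions of this illustration, not the article's O(n lg n) formula):

```python
def periodic_register_need(lifetimes, II):
    """MAXLIVE of a software-pipelined loop. Each value has an integer
    lifetime [def_time, kill_time) in the flat schedule; in the periodic
    steady state a new iteration starts every II cycles, so the need at
    cycle residue t counts, over all values, how many iteration copies
    are live at a time congruent to t mod II."""
    best = 0
    for t in range(II):
        live = 0
        for d, k in lifetimes:
            span = k - d
            # span // II full wrap-arounds, plus the fractional interval
            live += span // II + (1 if (t - d) % II < span % II else 0)
        best = max(best, live)
    return best

# one value live for 5 cycles, II = 2: three iteration copies overlap
assert periodic_register_need([(0, 5)], 2) == 3
```

This brute-force scan is O(II * n); the article's contribution is, among other things, an O(n lg n) formula for the same quantity.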
47. Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests
- Author
-
Steve Dai, Gai Liu, Ritchie Zhao, Mingxing Tan, and Zhiru Zhang
- Subjects
For loop ,business.industry ,Computer science ,Loop inversion ,Pipeline (computing) ,Loop fusion ,Memory bandwidth ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,Loop tiling ,01 natural sciences ,Computer Graphics and Computer-Aided Design ,020202 computer hardware & architecture ,Loop fission ,Software pipelining ,Embedded system ,Memory architecture ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,0101 mathematics ,Electrical and Electronic Engineering ,business ,Software ,Inner loop - Abstract
Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. While existing HLS pipelining techniques obtain good performance with low complexity for regular loop nests, they provide inadequate support for effectively synthesizing irregular loop nests. For loop nests with dynamic-bound inner loops, current pipelining techniques require unrolling of the inner loops, which is either very expensive in resources or simply inapplicable because of the dynamic loop bounds. To address this major limitation, this paper proposes ElasticFlow, a novel architecture capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in an area-efficient manner. The proposed LPUs can be either specialized to execute an individual inner loop or shared among multiple inner loops to balance the tradeoff between performance and area. A customized banked memory architecture is proposed to coordinate memory accesses among different LPUs, maximizing memory bandwidth without significantly increasing memory footprint. We evaluate ElasticFlow on a variety of real-life applications and demonstrate significant performance improvements over a state-of-the-art commercial HLS tool for Xilinx FPGAs.
- Published
- 2017
- Full Text
- View/download PDF
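ElasticFlow's hardware details are not in this record, but the scheduling idea it describes — dispatching dynamic-bound inner loops to whichever processing unit frees up first, instead of unrolling them — has a simple software analogue. A minimal sketch (the cost model "one cycle per trip" and the function name are assumptions of this illustration):

```python
import heapq

def dispatch(trip_counts, n_lpus):
    """Greedy dynamic dispatch: each inner-loop instance (whose cost is
    taken to be its dynamic trip count) is sent to the processing unit
    that becomes free earliest; returns the cycle at which all finish."""
    free_at = [0] * n_lpus          # next free cycle of every LPU
    heapq.heapify(free_at)
    finish = 0
    for cost in trip_counts:
        start = heapq.heappop(free_at)   # earliest-free unit
        end = start + cost
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

# four irregular inner loops on two LPUs finish in 13 cycles, not 20
assert dispatch([7, 3, 4, 6], 2) == 13
```

Because the trip counts are only known at run time, this first-free assignment is exactly the kind of decision that must be made dynamically rather than fixed at synthesis time.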
48. Software pipelining with CGA and proposed intrinsics on a reconfigurable processor for HEVC decoders
- Author
-
Dong-Gyu Sim, Jonghun Yoo, Hyun-Ho Jo, and Yong-Jo Ahn
- Subjects
Offset (computer science) ,Pixel ,Cycles per instruction ,Computer science ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Intrinsics ,Computer graphics ,Software pipelining ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Multimedia information systems ,Decoding methods ,Information Systems - Abstract
This work proposes several intrinsics on a reconfigurable processor intended for HEVC decoding, together with software pipelining algorithms for its coarse-grained array (CGA) architecture that use the proposed intrinsic instructions. Software pipelining algorithms are developed for the CGA acceleration of the inverse transform, pixel reconstruction, de-blocking filter and sample adaptive offset modules. To enable efficient software pipelining, several very-long-instruction-word-based intrinsics are designed to maximize parallelization rather than raw computational acceleration. We found that the HEVC decoder with the proposed intrinsics runs 2.3 times faster, in clock cycles, than a decoder that does not use them. In addition, the HEVC decoder with the CGA pipelining algorithms executes 10.9 times faster than one without the CGA mode.
- Published
- 2017
- Full Text
- View/download PDF
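Several records in this list rely on the same loop shape a software pipeliner produces: a prologue that fills the pipeline, a steady-state kernel in which stages of different iterations execute together, and an epilogue that drains it. A minimal sketch of a three-stage (load / compute / store) pipeline, written out in plain Python for clarity (the stage split is illustrative, not any article's schedule):

```python
def pipelined(xs, f):
    """Apply f to every element of xs using an explicit three-stage
    software pipeline. In each kernel slot i, three different iterations
    are in flight: iteration i-2 stores, i-1 computes, i loads. Stages
    run store-first so each reads the value produced in the previous slot."""
    n = len(xs)
    out = [None] * n
    loaded = computed = None
    for i in range(n + 2):          # n kernel slots + 2 drain slots
        if i >= 2:
            out[i - 2] = computed   # stage 3: store result of iteration i-2
        if 1 <= i <= n:
            computed = f(loaded)    # stage 2: compute on iteration i-1's load
        if i < n:
            loaded = xs[i]          # stage 1: load for iteration i
    return out

assert pipelined([1, 2, 3, 4], lambda x: x * x) == [1, 4, 9, 16]
```

On a CGA or VLIW target the three stages would occupy different functional units in the same cycle; here the sequential store/compute/load order inside the loop body models that overlap.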
49. Register allocation methods of improved software pipelining for loops with conditional branches.
- Author
-
Itoga, Hiroya, Haraikawa, Tomohiro, Yamashita, Yoshiyuki, and Nakata, Ikuo
- Subjects
- *
DATA pipelining , *SOFTWARE engineering , *CYBERNETICS , *ALGORITHMS , *GRAPHIC methods , *KERNEL functions , *LOOPS (Group theory) , *ELECTRONICS , *PIPELINED ADCs - Abstract
Improved Enhanced Modulo Scheduling (Improved EMS) is proposed as an efficient software pipelining method for loops with conditional branches. In EMS and Improved EMS, multiple pipeline kernels are generated, depending on the truth value of the condition, and the appropriate one is selected at each iteration during execution. Since these kernels mix variables whose lifetimes span several stages under different conditions, register allocation becomes difficult. In this paper, the authors discuss register allocation methods using interference graphs for EMS and Improved EMS. The first method applies a conventional interference graph to the expanded kernels. Because that interference graph grows in proportion to the number of kernels, however, the proposed method reduces its size by overlapping the per-kernel graphs, improving the results of register allocation. Both methods were evaluated with approximate coloring algorithms. The evaluation showed that allocation with the overlapped, reduced graphs can be performed rapidly and requires no more registers than the conventional method. © 2006 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 89(12): 59–69, 2006; Published online in Wiley InterScience (
www.interscience.wiley.com). DOI 10.1002/ecjc.20290 [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
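The coloring step at the heart of the interference-graph methods above can be sketched in a few lines: two variables that are simultaneously live interfere and must receive distinct registers, and an approximate (greedy) coloring assigns each variable the lowest register not used by an already-colored neighbor. A minimal sketch, not the article's specific algorithm:

```python
def greedy_color(interference):
    """Greedy register allocation on an interference graph, given as a
    dict mapping each variable to the variables it interferes with.
    Returns a dict variable -> register number; interfering variables
    are guaranteed different registers, but the count is only an
    approximation of the chromatic number."""
    reg = {}
    for v in interference:                       # insertion order
        taken = {reg[u] for u in interference[v] if u in reg}
        r = 0
        while r in taken:                        # lowest free register
            r += 1
        reg[v] = r
    return reg

# three mutually interfering variables need three registers
tri = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b']}
assert len(set(greedy_color(tri).values())) == 3
```

Overlapping the per-kernel graphs, as the article proposes, shrinks the input to this step without changing the coloring procedure itself.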
50. Optimized Hardware Implementation of Enhanced TRIPLE-DES using Cluster LUT and Pipelining on SPARTEN FPGA
- Author
-
Noha Hussen, Amany Sarhan, and Marwa Fayez
- Subjects
Triple DES ,Software pipelining ,Computer architecture ,Computer science ,Lookup table ,Cluster (physics) ,Field-programmable gate array - Published
- 2017
- Full Text
- View/download PDF