1,352 results for "multicore"
Search Results
2. Hardware Software Co-Design for Multi-Threaded Computation on RISC-V-Based Multicore System
- Author
-
Binh Kieu-do-Nguyen, Khai-Duy Nguyen, Nguyen The Binh, Khai-Minh Ma, Tri-Duc Ta, Duc-Hung Le, Cong-Kha Pham, and Trong-Thuc Hoang
- Subjects
RISC-V, multithread, multicore, task scheduling, hardware-software co-design, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
The open-source and customizable features of the RISC-V Instruction Set Architecture (ISA) have facilitated its rapid adoption since its publication in 2011. The availability of numerous free core designs has led to the pervasiveness of RISC-V-based devices across diverse applications spanning the Internet of Things (IoT), embedded systems, artificial intelligence (AI), and virtual/augmented reality (VR/AR). The increasing prevalence of RISC-V cores has consequently created demand for high-performance and resource-efficient multicore systems. However, while numerous proposals exist for constructing multicore systems on conventional architectures, realizing an efficient multicore system that effectively leverages the features of RISC-V remains a challenge. This paper introduces a novel hardware/software co-design methodology to address this challenge while minimizing resource utilization. Experimental results demonstrate the efficiency of our approach, exhibiting significant performance gains over single-threaded implementations and even surpassing traditional multi-threaded approaches.
- Published
- 2024
- Full Text
- View/download PDF
3. Scalable High-Throughput and Low-Latency DVB-S2(x) LDPC Decoders on SIMD Devices
- Author
-
Bertrand Le Gal
- Subjects
LDPC decoding, SIMD, multicore, DVB-S2, DVB-S2x, high-throughput, Telecommunication, TK5101-6720, Transportation and communications, HE1-9990 - Abstract
Low-density parity-check (LDPC) codes are error correction codes (ECC) with near-Shannon-limit correction performance, boosting the reliability of the digital communication systems that use them. Their efficiency goes hand in hand with their high computational complexity, resulting in a computational bottleneck in physical-layer processing. Solutions based on multicore and many-core architectures have been proposed to support the development of software-defined radio and virtualized radio access networks (vRANs). Many studies have focused on the efficient parallelization of LDPC decoding algorithms. In this study, we propose an efficient SIMD parallelization strategy for DVB-S2(x) LDPC codes. It achieves throughputs from 7 Gbps to 12 Gbps on an Intel Xeon Gold target when 10 layered decoding iterations are executed. Simultaneously, latencies remain below 400 μs. These performances are equivalent to FPGA-based solutions and surpass related CPU and GPU works by factors of up to 5×.
- Published
- 2024
- Full Text
- View/download PDF
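- Example
The data-parallel kernel that SIMD LDPC decoders vectorize is the per-check-node min-sum update. A minimal NumPy sketch of that update for a single check node follows; it is illustrative only and does not reproduce the paper's layered DVB-S2(x) decoder or its SIMD frame mapping.
```python
import numpy as np

def check_node_update(msgs: np.ndarray) -> np.ndarray:
    """Min-sum update: msgs holds the LLR messages entering one check node
    (assumed nonzero); returns the extrinsic message sent back on each edge."""
    signs = np.sign(msgs)
    sign_prod = np.prod(signs)                   # product of all incoming signs
    mags = np.abs(msgs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]  # two smallest magnitudes
    out_mag = np.full_like(mags, min1)
    out_mag[order[0]] = min2                     # the min edge gets the 2nd min
    return sign_prod * signs * out_mag           # divides out each edge's own sign

print(check_node_update(np.array([-1.5, 0.3, 2.0, -0.7])))  # [-0.3  0.7  0.3 -0.3]
```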
4. Reinforcement Learning-Based Cache Replacement Policies for Multicore Processors
- Author
-
Matheus A. Souza and Henrique C. Freitas
- Subjects
Cache replacement, coherence, multicore, reinforcement learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
High-performance computing (HPC) systems need to handle ever-increasing data sizes for fast processing and quick response times. However, modern processors’ caches are unable to handle massive amounts of data, leading to significant cache miss penalties that affect performance. In this context, selecting an effective cache replacement policy is crucial to improving HPC performance. Existing cache replacement policies fall short of Bélády’s optimal algorithm, and we propose a new approach that leverages the coherence state and sharers’ bit-vector of a cache block to make better decisions. We suggest a reinforcement learning-based strategy that learns from past eviction decisions and applies this knowledge to make better decisions in the future. Our approach uses a next-attempt method that combines the results from classic cache replacement algorithms with reinforcement learning. We evaluated our approach using the Sniper simulator and seven kernels from CAP Benchmarks. Our results show that our approach can significantly reduce the cache miss rate by 41.20% and 27.30% in L1 and L2 caches, respectively. In addition, our approach can improve the IPC by 27.33% in the best case and reduce energy consumption by 20.36% compared to an unmodified policy.
- Published
- 2024
- Full Text
- View/download PDF
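- Example
A minimal sketch of the reinforcement-learning flavor the abstract describes: eviction candidates are scored by the coherence-state and sharers information of a block, and the learner is penalized when an evicted block is re-referenced soon after. The bandit-style update, reward scheme, and state encoding are simplifying assumptions, not the authors' exact next-attempt design.
```python
import random
from collections import defaultdict

ALPHA, EPSILON = 0.1, 0.1
q = defaultdict(float)  # (coherence_state, num_sharers) -> eviction desirability

def pick_victim(candidates):
    """candidates: list of (way, coherence_state, num_sharers) in one cache set."""
    if random.random() < EPSILON:                          # explore occasionally
        return random.choice(candidates)
    return max(candidates, key=lambda c: q[(c[1], c[2])])  # otherwise exploit

def learn_from_eviction(victim, reused_soon: bool):
    """Penalize states whose evicted blocks came back quickly."""
    state = (victim[1], victim[2])
    reward = -1.0 if reused_soon else 1.0
    q[state] += ALPHA * (reward - q[state])                # incremental average

victim = pick_victim([(0, "M", 1), (1, "S", 3), (2, "I", 0)])
learn_from_eviction(victim, reused_soon=False)
```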
5. A Multi-core Based Real-time Scheduler Supporting Periodic and Sporadic Threads and Processes.
- Author
-
Kim, Sanggyu and Park, Hong Seong
- Abstract
This paper proposes, implements, and verifies a multicore real-time scheduler (MCRT scheduler) for periodic and sporadic threads and processes as well as non-real-time processes, where periodic and sporadic (event-driven) processes are handled according to real-time characteristics such as bounded periods and deadlines. The MCRT scheduler generates scheduling tables for periodic and sporadic threads and processes, based on which they are executed during the basic period. Using the Xenomai and Linux operating systems, the proposed scheduler was implemented and verified through various test cases designed for multicore operations. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
6. Preparation of Multicore Millimeter-Sized Spherical Alginate Capsules to Specifically and Sustainedly Release Fish Oil
- Author
-
Lina Tao, Panpan Wang, Ting Zhang, Mengzhen Ding, Lijie Liu, Ningping Tao, Xichang Wang, and Jian Zhong
- Subjects
Ionotropic gelation, Millimeter-sized spherical capsule, Monoaxial dispersion electrospraying, Multicore, Specific and sustained release, Nutrition. Foods and food supply, TX341-641 - Abstract
Specific and sustained release of nutrients from capsules to the gastrointestinal tract has attracted much attention in the field of food and drug delivery. In this work, we report a monoaxial dispersion electrospraying-ionotropic gelation technique to prepare multicore millimeter-sized spherical capsules for specific and sustained release of fish oil. The spherical capsules had diameters ranging from 2.05 mm down to 0.35 mm as the applied voltage increased. The capsules consisted of uniform (at applied voltages of ≤ 10 kV) or nonuniform (at applied voltages of > 10 kV) multicores. The obtained capsules had reasonable loading ratios (9.7%–6.3%) due to the multicore structure. In addition, the obtained capsules showed specific and sustained release of fish oil into the small intestinal phase of in vitro gastrointestinal tract and small intestinal tract models. The simple monoaxial dispersion electrospraying-ionotropic gelation technique involves no complicated preparation formulations or polymer modification, which gives it promising application prospects for fish oil preparations and for the encapsulation of functional active substances in the food and drug industries.
- Published
- 2023
- Full Text
- View/download PDF
7. An Efficient Authenticated Elliptic Curve Cryptography Scheme for Multicore Wireless Sensor Networks
- Author
-
Esau Taiwo Oladipupo, Oluwakemi Christiana Abikoye, Agbotiname Lucky Imoize, Joseph Bamidele Awotunde, Ting-Yi Chang, Cheng-Chi Lee, and Dinh-Thuan Do
- Subjects
Multiprocessor, multicore, wireless sensor, encryption, chosen plaintext attack, chosen ciphertext attack, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
The need to ensure the longevity of Wireless Sensor Networks (WSNs) and to secure their communication has spurred researchers to propose various WSN models. Prime among the methods for extending the life span of WSNs is the clustering of Wireless Sensors (WS), which reduces the sensors' workload and thereby their power consumption. However, the drastic reduction in sensor power consumption achievable when multicore sensors are combined with sensor clustering has not been well explored. Therefore, this work proposes a WSN model that employs clustering of multicore WS. The existing Elliptic Curve Cryptography (ECC) algorithm is optimized for parallel execution of the encryption/decryption processes and for security against primitive attacks. Elliptic Curve Diffie-Hellman (ECDH) was used for key exchange, and the Elliptic Curve Digital Signature Algorithm (ECDSA) was used to authenticate the communicating nodes. A security analysis of the model and a comparative performance analysis against existing models are presented. The security analysis results reveal that the proposed model meets the security requirements and resists various security attacks. Additionally, the projected model is scalable, energy-conservative, and supports data freshness. The results of the comparative performance analysis show that the proposed WSN model can efficiently leverage multiprocessors and/or many cores for quicker execution and conserves power.
- Published
- 2023
- Full Text
- View/download PDF
8. A Comprehensive Survey on the Use of Hypervisors in Safety-Critical Systems
- Author
-
Santiago Lozano, Tamara Lugo, and Jesus Carretero
- Subjects
Aerospace, automotive, aviation, embedded, hypervisor, multicore, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
Virtualization has become one of the main tools for making efficient use of the resources offered by multicore embedded platforms. In recent years, even sectors such as space, aviation, and automotive, traditionally wary of adopting this type of technology due to the impact it could have on the safety of their systems, have been forced to introduce it into their day-to-day work, as their applications become increasingly complex and demanding. This article provides a comprehensive review of the research work that uses or considers the use of a hypervisor as the basis for building a virtualized safety-critical embedded system. Once the hypervisors developed or adapted for this type of system have been identified, an exhaustive qualitative comparison is made between them. To the best of our knowledge, this is the first time that all this information has been collected in a single article. Therefore, the main contribution of this article is that it collects and categorizes the information on each hypervisor and compares them with each other, so that this article can serve as a starting point for future researchers in this area, who will be able to quickly check which hypervisor is best suited to their research needs.
- Published
- 2023
- Full Text
- View/download PDF
9. A hybrid CUDA, OpenMP, and MPI parallel TCA-based domain adaptation for classification of very high-resolution remote sensing images.
- Author
-
Garea, Alberto S., Heras, Dora B., Argüello, Francisco, and Demir, Begüm
- Subjects
DEEP learning, MULTISPECTRAL imaging, REMOTE sensing, MESSAGE passing (Computer science), CLASSIFICATION - Abstract
Domain Adaptation (DA) is a technique that aims at extracting information from a labeled remote sensing image to allow classifying a different image obtained by the same sensor but at a different geographical location. This is a very complex problem from the computational point of view, especially due to the very high resolution of multispectral images. TCANet is a deep learning neural network for DA classification problems that has proven very accurate at solving them. TCANet consists of several stages based on the application of convolutional filters obtained through Transfer Component Analysis (TCA) computed over the input images. It does not require backpropagation training, in contrast to the usual CNN-based networks, as the convolutional filters are directly computed based on the TCA transform applied over the training samples. In this paper, a hybrid parallel TCA-based domain adaptation technique for solving the classification of very high-resolution multispectral images is presented. It is designed for efficient execution on a multi-node computer by using the Message Passing Interface (MPI), exploiting the available Graphics Processing Units (GPUs), and making efficient use of each multicore node by using Open Multi-Processing (OpenMP). As a result, a DA technique that is accurate from the point of view of classification and offers high speedup values over the sequential version is obtained, increasing the applicability of the technique to real problems. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
10. Reducing energy consumption using heterogeneous voltage frequency scaling of data-parallel applications for multicore systems.
- Author
-
Bratek, Pawel, Szustak, Lukasz, Wyrzykowski, Roman, and Olas, Tomasz
- Subjects
VOLTAGE, FLUID dynamics, MULTICORE processors, PARALLEL algorithms - Abstract
This paper investigates the exploitation of heterogeneous DVFS (dynamic voltage frequency scaling) control for improving the energy efficiency of data-parallel applications on ccNUMA shared-memory systems. We propose to adjust the clock frequency individually for appropriately selected groups of cores, taking into account the diversified costs of parallel computation. This paper evaluates the proposed approach using two different data-parallel applications: solving the 3D diffusion problem, and the MPDATA fluid dynamics application. As a result, we observe energy-saving gains of up to 20 percentage points over the traditional homogeneous frequency scaling approach on a server with two 18-core Intel Xeon Gold 6240 processors. Additionally, we confirm the effectiveness of our strategy using two 64-core AMD EPYC 7773X processors. This paper also introduces two pruning algorithms that help select the optimal heterogeneous DVFS setups, taking into account the energy or performance profile of the studied applications. Finally, the cost and efficiency of the developed algorithms are verified and compared experimentally against brute-force search. • Heterogeneous DVFS method for energy efficiency of regular data-parallel applications. • Individually adjusting clock frequency for cores based on workload distribution. • Pruning algorithms for selecting optimal heterogeneous DVFS setups. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
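- Example
The search space the pruning algorithms target can be pictured as follows: one frequency per core group, minimizing an energy estimate subject to a deadline. This brute-force sketch uses a made-up energy model and numbers purely for illustration; the paper derives energy/performance profiles from measurements and prunes this search.
```python
from itertools import product

FREQS = [1.0, 1.5, 2.0, 2.5, 3.0]   # candidate frequencies (GHz) per core group
WORK = [4.0, 3.0, 2.0, 1.0]         # relative workload assigned to each group
DEADLINE = 4.0                      # time budget for the parallel step

def energy(setup):
    """Toy model: groups run concurrently; dynamic energy ~ work * f^2."""
    if max(w / f for w, f in zip(WORK, setup)) > DEADLINE:
        return float("inf")         # this setup misses the deadline
    return sum(w * f ** 2 for w, f in zip(WORK, setup))

best = min(product(FREQS, repeat=len(WORK)), key=energy)  # |FREQS|^4 = 625 setups
print(best, energy(best))           # (1.0, 1.0, 1.0, 1.0) 10.0
```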
11. Performance evaluation on work-stealing featured parallel programs on asymmetric performance multicore processors
- Author
-
Adnan
- Subjects
Amdahl's law, Speedup factor, Asymmetric performance, Multicore, Work stealing, Computer engineering. Computer hardware, TK7885-7895, Electronic computers. Computer science, QA75.5-76.95 - Abstract
The speed difference between the high-performance CPUs and energy-efficient CPUs found in asymmetric performance multicore processors affects the current form of Amdahl's law equation. This paper proposes two updates to that equation based on the performance evaluation of a simple parallel pi program written with OpenCilk. Performance was evaluated by measuring execution time and instructions per cycle (IPC). The evaluation of the parallel program executed on the Intel Core i5 1240P processor did not indicate decreased performance due to the asymmetry; instead, the program, benefiting from OpenCilk's efficient work stealing, performed well. When the execution time of the P-CPU is used as the reference for computing speedup, the evaluation yields a sublinear speedup; conversely, when the execution time of the E-CPU is used as the reference, it yields a superlinear speedup. These two evaluation results motivate the two proposed updates to Amdahl's law equation.
- Published
- 2023
- Full Text
- View/download PDF
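- Example
For orientation, a textbook heterogeneous variant of Amdahl's law (not necessarily either of the paper's proposed updates): with n_P performance cores of normalized speed 1, n_E efficiency cores of relative speed rho < 1, and parallel fraction p,
```latex
\[
  S(p) = \frac{1}{(1-p) + \dfrac{p}{n_P + \rho\, n_E}}
\]
```
Measuring time against a P-core baseline divides out a speed of 1, while an E-core baseline rescales the result by 1/rho, which is how the same run can appear sublinear against one reference and superlinear against the other, as the abstract reports.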
12. DAG Hierarchical Schedulability Analysis for Avionics Hypervisor in Multicore Processors.
- Author
-
Yang, Huan, Zhao, Shuai, Shi, Xiangnan, Zhang, Shuang, and Guo, Yangming
- Subjects
AVIONICS, HYPERVISOR (Computer software), MULTICORE processors, VIRTUAL machine systems, DIRECTED acyclic graphs - Abstract
Parallel hierarchical scheduling of multicore processors in avionics hypervisors is studied. Parallel hierarchical scheduling uses modular reasoning about the temporal behavior of the upper Virtual Machine (VM) by partitioning CPU time, and Directed Acyclic Graphs (DAGs) are used to model functional dependencies. However, the existing DAG scheduling algorithm wastes resources and is inaccurate. Decreasing the completion time (CT) of a DAG and offering a tight and safe bound requires exploiting joint-level parallelism and inter-joint dependency, the two key factors of DAG topology. Firstly, the Concurrent Parent and Child Model (CPCM) is studied, which accurately captures these two factors and can be applied recursively when parsing a DAG. Based on CPCM, the paper puts forward a hierarchical scheduling algorithm that focuses on decreasing the maximum CT of joints. Secondly, a new Response Time Analysis (RTA) algorithm is proposed, which offers a general bound for other execution sequences of noncritical joints (NC-joints) and a specific bound for a fixed execution sequence. Finally, the results show that the parallel hierarchical scheduling algorithm achieves higher performance than other algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
13. Fine-grain data classification to filter token coherence traffic.
- Author
-
Upadhyay, Bhargavi R., Ros, Alberto, and M., Supriya
- Subjects
OPTICAL disks, CLASSIFICATION - Abstract
Snoop-based cache coherence protocols perform well in small-scale systems by enabling low latency cache-to-cache data transfers in just two-hop coherence transactions. However, they are not a scalable alternative as they require frequent broadcast of coherence requests. Token coherence protocols were proposed to improve the scalability of snoop-based protocols by removing a large amount of traffic due to broadcast responses. Still, broadcasting coherence requests on every cache miss represents a scalability issue for medium and large-scale systems. In this paper, we propose to reduce the number of broadcast operations in Token coherence protocols by performing an efficient fine-grain private-shared data classification and disabling broadcasts for misses to data classified as private. Our fine-grain classification is orchestrated and stored by the Translation Look-aside Buffers (TLBs), where entries are kept for a longer time than in local caches. We explore different classification granularity accounting for different storage overheads and their impact on filtering coherence traffic. We evaluate our proposals on a set of parallel benchmarks through full-system cycle-accurate simulation and show that a subpage-grain classification offers the best trade-off when accounting for storage, traffic, and performance. When running a 16-core configuration, our subpage-grain classification eliminates 40.1% of broadcast operations compared to not performing any classification and 13.7% of broadcast operations more than a page-grain data classification. This reduction translates into less network traffic (16.0%), and finally, performance improvements of 12.0% compared to not having a classification mechanism. • Evaluation of TLB-based private/shared classification with varying granularities. • Proposal of a new TLB-based sub-page classification mechanism. • Integration of classification techniques to filter Token coherence traffic. • Reduction in traffic by 16% and performance by 20% with a low storage cost. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
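- Example
The classification idea at its simplest: remember, per subpage, the first core that touched it; a touch from a different core upgrades it to shared, and only shared data needs a broadcast on a miss. The granularities and bookkeeping below are illustrative stand-ins for the paper's TLB-orchestrated mechanism.
```python
PAGE_BITS = 12                      # 4 KiB pages
SUBPAGE_BITS = 9                    # 512 B subpages -> 8 subpages per page
SUBPAGES_PER_PAGE = 1 << (PAGE_BITS - SUBPAGE_BITS)

table = {}                          # (page, subpage) -> owning core or "shared"

def classify_access(addr: int, core: int) -> str:
    key = (addr >> PAGE_BITS, (addr >> SUBPAGE_BITS) % SUBPAGES_PER_PAGE)
    owner = table.setdefault(key, core)     # first toucher becomes the owner
    if owner not in (core, "shared"):
        table[key] = "shared"               # second core seen: upgrade
    return "private" if table[key] == core else "shared"

print(classify_access(0x1000, core=0))      # private: first touch
print(classify_access(0x1000, core=1))      # shared: another core touched it
```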
14. A Survey of Techniques for Reducing Interference in Real-Time Applications on Multicore Platforms
- Author
-
Tamara Lugo, Santiago Lozano, Javier Fernandez, and Jesus Carretero
- Subjects
Real-time systems, architecture, multicore, timing analysis, schedulability analysis, WCET, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
This survey reviews the scientific literature on techniques for reducing interference in real-time multicore systems, focusing on approaches proposed between 2015 and 2020. It also presents proposals that use interference reduction techniques without considering the predictability issue. The survey highlights interference sources and categorizes proposals from the perspective of the shared resource. It covers techniques for reducing contention in main memory, cache memory, and the memory bus, as well as the integration of interference effects into schedulability analysis. Every section contains an overview of each proposal and an assessment of its advantages and disadvantages.
- Published
- 2022
- Full Text
- View/download PDF
15. FPGA-based programmable embedded platform for image processing applications
- Author
-
Siddiqui, Fahad Manzoor, Woods, Roger, and Rafferty, Karen
- Subjects
621.36, FPGA, Dataflow, Multicore, Zynq, Parallel computing, Hardware acceleration, Image Processing, Programmable - Abstract
A vast majority of electronic systems, including medical, surveillance, and critical infrastructure, employ image processing to provide intelligent analysis. They use onboard pre-processing to reduce data bandwidth and memory requirements before sending information to the central system. Field Programmable Gate Arrays (FPGAs) represent a strong platform as they permit reconfigurability and pipelining for streaming applications. However, rapid advances and changes in these application use cases call for adaptable hardware architectures that can process dynamic data workloads and be easily programmed to achieve efficient solutions in terms of area, time, and power. FPGA-based development needs iterative design cycles, hardware synthesis, and place-and-route times which are alien to software developers. This work proposes an FPGA-based programmable hardware acceleration approach to reduce design effort and time. This allows developers to use FPGAs to profile, optimise and quickly prototype algorithms using a more familiar software-centric, edit-compile-run design flow that enables the programming of the platform by software rather than by high-level synthesis (HLS) engineering principles. Central to the work has been the development of an optimised FPGA-based processor called the Image Processing Processor (IPPro), which efficiently uses the underlying resources and presents a programmable environment to the programmer using a dataflow design principle. This gives superior performance when compared to competing alternatives. From this, a three-layered platform has been created which enables the realisation of parallel computing skeletons on FPGA which are used to efficiently express designs in high-level programming languages. From bottom up, these layers represent programming (actor, multiple actors, and parallel skeletons) and hardware (IPPro core, multicore IPPro, system infrastructure) abstraction. The platform allows acceleration of parallel and non-parallel dataflow applications. A set of point and area image pre-processing functions were implemented on the Avnet Zedboard platform, allowing evaluation of the performance. The point function achieved 2.53 times better performance than the area functions, and the point and area functions achieved performance improvements of 7.80 and 5.27 times over single-core IPPro by exploiting data parallelism. The pipelined execution of multiple stages revealed that a dataflow graph can be decomposed into balanced actors to deliver maximum performance by hiding data transfer and processing time through exploiting task parallelism; otherwise, the maximum achievable performance is limited by the slowest actor due to the ripple effect caused by unbalanced actors. The platform delivered better performance in terms of fps/Watt/Area than an embedded Graphics Processing Unit (GPU), considering both technologies allow a software-centric design flow.
- Published
- 2018
16. Testing the Effect of Multicore Processors on Virtualization Server Performance Using the Load Testing Method
- Author
-
Doddy Ferdiansyah, Aliev Riaunanda Kamal, Sali Alas Majapahit, and Ferry Mulyanto
- Subjects
security, information security, laboratory, multicore, performance testing, Mathematics, QA1-939, Electronic computers. Computer science, QA75.5-76.95 - Abstract
In building an information security laboratory, attention must be paid to the hardware, the applications, and the environment. Such a laboratory aims to test the security level of an application that will be or has already been built. The main problem, however, is the difficulty of selecting the type of hardware that matches the requirements. Several parameters must be considered in determining the right hardware components, namely Random Access Memory (RAM), the processor, and the Network Interface Card (NIC). This study focuses only on testing the influence of multiple cores in a processor. As is well known, the processor is the main device of a computer; figuratively speaking, it is the brain of the computer that will be used to test the security level of an application that will be or has already been built. In addition, the choice of processor strongly affects the processing of the tasks to be performed by the test computer in the design of this information security laboratory, so processor selection is very important. The final result of this research is a recommendation for a processor that fits the needs of the test computer in the information security laboratory blueprint.
- Published
- 2021
- Full Text
- View/download PDF
17. RT-SEAT: A hybrid approach based real-time scheduler for energy and temperature efficient heterogeneous multicore platforms
- Author
-
Yanshul Sharma and Sanjay Moulik
- Subjects
Multicore, Deadline, Energy, Heterogeneous, Temperature, Technology - Abstract
The demand for heterogeneous multicore platforms is growing at a rapid pace in modern gadgets. Such platforms help cater to various types of applications and thus provide high resource utilization. As each task has a different execution time on different types of cores, it is very challenging to schedule tasks on such platforms. With the advancement in technology, it has become imperative to manage energy consumption and temperatures of cores in multicore platforms. Hence, in this work, we propose RT-SEAT, a hybrid real-time scheduler for energy and temperature-efficient heterogeneous multicore systems. Through extensive experimental analysis, we have found that RT-SEAT is able to schedule more tasks (up to 46.75%), save more energy (up to 6.89%), and reduce the average temperature of the cores by 20.45%, when the system workload varies from 50% to 100%, with respect to the state-of-the-art.
- Published
- 2022
- Full Text
- View/download PDF
18. Speeding up wheel factoring method.
- Author
-
Bahig, Hazem M., Nassr, Dieaa I., Mahdi, Mohammed A., Hazber, Mohamed A. G., Al-Utaibi, Khaled, and Bahig, Hatem M.
- Subjects
PUBLIC key cryptography, PARALLEL algorithms, CRYPTOSYSTEMS, POLYNOMIAL time algorithms, ALGORITHMS, WHEELS - Abstract
The security of many public key cryptosystems in use today depends on the difficulty of factoring an integer into its prime factors. Although there is a polynomial-time quantum algorithm for integer factorization, no polynomial-time algorithm is known on a classical computer. In this paper, we study how to improve the wheel factoring method using two approaches. The first approach introduces two sequential modifications of the wheel factoring method. The second approach parallelizes the modified algorithms on a parallel system. Experimental studies on composite integers n that are a product of two primes of equal size show the following results. (1) The percentage improvements of the two modified sequential methods over the wheel factoring method are almost 47% and 90%. (2) The percentage improvement of the two proposed parallel methods over the two modified sequential algorithms is 90% on average. (3) The maximum speedup achieved by the best proposed parallel algorithm using 24 threads is almost 336 times that of the wheel factoring method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
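- Example
For reference, the baseline being accelerated: trial division with a 2-3-5 wheel, which skips every candidate divisor sharing a factor with 30. This plain sequential version only defines the method; the paper's two sequential modifications and its multithreaded variants are not reproduced here.
```python
def wheel_factor(n: int) -> list[int]:
    """Factor n by trial division, stepping through numbers coprime to 30."""
    factors = []
    for p in (2, 3, 5):
        while n % p == 0:
            factors.append(p)
            n //= p
    gaps = (4, 2, 4, 2, 4, 6, 2, 6)  # gaps between numbers coprime to 30, from 7
    d, i = 7, 0
    while d * d <= n:
        if n % d == 0:
            factors.append(d)
            n //= d                  # stay on d to pull out repeated factors
        else:
            d += gaps[i]
            i = (i + 1) % len(gaps)
    if n > 1:
        factors.append(n)            # remaining n is prime
    return factors

print(wheel_factor(2 * 3 * 7 * 97 * 101))   # [2, 3, 7, 97, 101]
```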
19. Memory-Aware Denial-of-Service Attacks on Shared Cache in Multicore Real-Time Systems.
- Author
-
Bechtel, Michael and Yun, Heechul
- Subjects
DENIAL of service attacks, SHARED workspaces, RANDOM access memory, MULTICORE processors, MICROELECTROMECHANICAL systems - Abstract
In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness for performing denial-of-service attacks on shared cache. Based on this insight, we introduce new cache DoS attacks, which can be mounted from the user-space and can cause extreme worst-case execution time (WCET) impacts to cross-core victims—even if the shared cache is partitioned—by taking advantage of the platform’s memory address mapping information and HugePage support. We deploy these enhanced attacks on two popular embedded out-of-order multicore platforms using both synthetic and real-world benchmarks. The proposed DoS attacks achieve up to 111X WCET increases on the tested platforms. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
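- Example
The "memory address mapping information" such attacks exploit boils down to arithmetic like the following: the cache-set index sits between the line-offset bits and the tag, so with HugePages an attacker controls enough physical-address bits to pick a stride that lands every access in one set. The cache geometry below (64 B lines, 16 ways, 2 MiB) is an illustrative assumption, not the tested platforms'.
```python
LINE = 64                                # bytes per cache line
WAYS = 16
CACHE_SIZE = 2 * 1024 * 1024
SETS = CACHE_SIZE // (LINE * WAYS)       # 2048 sets
OFFSET_BITS = (LINE - 1).bit_length()    # 6 line-offset bits

def cache_set(paddr: int) -> int:
    """Set index = address bits just above the line offset."""
    return (paddr >> OFFSET_BITS) & (SETS - 1)

# Same-set addresses repeat every SETS * LINE bytes (128 KiB here), so one
# 2 MiB huge page yields 16 same-set lines: enough to fill a 16-way set.
stride = SETS * LINE
base = 0x600000                          # some 2 MiB-aligned huge-page address
conflict_lines = [base + way * stride for way in range(WAYS)]
print({cache_set(a) for a in conflict_lines})   # one set index: all collide
```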
20. Snowflake: A lightweight portable stencil DSL
- Author
-
Zhang, N, Driscoll, M, Markley, C, Williams, S, Basu, P, and Fox, A
- Subjects
Scientific Computing, Domain-Specific Language, Python, GPU, Multicore - Abstract
Stencil computations are not well optimized by general-purpose production compilers and the increased use of multicore, manycore, and accelerator-based systems makes the optimization problem even more challenging. In this paper we present Snowflake, a Domain Specific Language (DSL) for stencils that uses a 'micro-compiler' approach, i.e., small, focused, domain-specific code generators. The approach is similar to that used in image processing stencils, but Snowflake handles the much more complex stencils that arise in scientific computing, including complex boundary conditions, higher-order operators (larger stencils), higher dimensions, variable coefficients, non-unit-stride iteration spaces, and multiple input or output meshes. Snowflake is embedded in the Python language, allowing it to interoperate with popular scientific tools like SciPy and iPython; it also takes advantage of built-in Python libraries for powerful dependence analysis as part of a just-in-time compiler. We demonstrate the power of the Snowflake language and the micro-compiler approach with a complex scientific benchmark, HPGMG, that exercises the generality of stencil support in Snowflake. By generating OpenMP comparable to, and OpenCL within a factor of 2x of hand-optimized HPGMG, Snowflake demonstrates that a micro-compiler can support diverse processor architectures and is performance-competitive whilst preserving a high-level Python implementation.
- Published
- 2017
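- Example
For readers outside the domain, the kind of computation these "stencils" denote: every output point is a fixed combination of its neighbors. A 2D 5-point Laplacian in plain NumPy is shown below; a DSL like Snowflake exists to turn such specifications into optimized OpenMP/OpenCL code rather than leaving them as interpreted array operations.
```python
import numpy as np

def laplacian_5pt(u: np.ndarray) -> np.ndarray:
    """5-point stencil: north + south + west + east - 4 * center."""
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] +
                       u[1:-1, :-2] + u[1:-1, 2:] -
                       4.0 * u[1:-1, 1:-1])
    return out

u = np.random.rand(256, 256)
print(laplacian_5pt(u).shape)   # (256, 256); boundary left untouched
```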
21. Snowflake: A Lightweight Portable Stencil DSL
- Author
-
Fox, Armando [Univ. of California, Berkeley, CA (United States). Dept. of Electrical Engineering and Computer Science]
- Published
- 2017
- Full Text
- View/download PDF
22. Snowflake: A Lightweight Portable Stencil DSL
- Author
-
Zhang, Nathan, Driscoll, Michael, Fox, Armando, Markley, Charles, Williams, Samuel, and Basu, Protonu
- Subjects
Information and Computing Sciences, Applied Computing, Scientific Computing, Domain-Specific Language, Python, GPU, Multicore - Abstract
Stencil computations are not well optimized by general-purpose production compilers and the increased use of multicore, manycore, and accelerator-based systems makes the optimization problem even more challenging. In this paper we present Snowflake, a Domain Specific Language (DSL) for stencils that uses a 'micro-compiler' approach, i.e., small, focused, domain-specific code generators. The approach is similar to that used in image processing stencils, but Snowflake handles the much more complex stencils that arise in scientific computing, including complex boundary conditions, higher-order operators (larger stencils), higher dimensions, variable coefficients, non-unit-stride iteration spaces, and multiple input or output meshes. Snowflake is embedded in the Python language, allowing it to interoperate with popular scientific tools like SciPy and iPython; it also takes advantage of built-in Python libraries for powerful dependence analysis as part of a just-in-time compiler. We demonstrate the power of the Snowflake language and the micro-compiler approach with a complex scientific benchmark, HPGMG, that exercises the generality of stencil support in Snowflake. By generating OpenMP comparable to, and OpenCL within a factor of 2x of hand-optimized HPGMG, Snowflake demonstrates that a micro-compiler can support diverse processor architectures and is performance-competitive whilst preserving a high-level Python implementation.
- Published
- 2017
23. mdtmFTP and its evaluation on ESNET SDN testbed
- Author
-
Pouyoul, Eric [ESnet, Berkeley, CA (United States)]
- Published
- 2017
- Full Text
- View/download PDF
24. KiloCore: A 32-nm 1000-Processor Computational Array
- Author
-
Bohnenstiehl, Brent, Stillmaker, Aaron, Pimentel, Jon J, Andreas, Timothy, Liu, Bin, Tran, Anh T, Adeagbo, Emmanuel, and Baas, Bevan M
- Subjects
Affordable and Clean Energy, Globally asynchronous locally synchronous, many core, multicore, NoC, parallel processor, Condensed Matter Physics, Electrical and Electronic Engineering, Other Technology, Electrical & Electronic Engineering - Abstract
A processor array containing 1000 independent processors and 12 memory modules was fabricated in 32-nm partially depleted silicon-on-insulator CMOS. The programmable processors occupy 0.055 mm² each, contain no algorithm-specific hardware, and operate up to an average maximum clock frequency of 1.78 GHz at 1.1 V. At 0.9 V, processors operating at an average of 1.24 GHz dissipate 17 mW while issuing one instruction per cycle. At 0.56 V, processors operating at an average of 115 MHz dissipate 0.61 mW while issuing one instruction per cycle, resulting in an energy consumption of 5.3 pJ/instruction. On-die communication is performed by complementary circuit- and packet-based networks that yield a total array bisection bandwidth of 4.2 Tb/s. Independent memory modules handle data and instructions and operate up to an average maximum clock frequency of 1.77 GHz at 1.1 V. All processors, their packet routers, and the memory modules contain unconstrained clock oscillators within independent clock domains that adapt to large supply voltage noise. Compared with a variety of Intel i7s and Nvidia GPUs, the KiloCore at 1.1 V has geometric mean improvements of 4.3× higher throughput per area and 9.4× higher energy efficiency for AES encryption, 4095-b low-density parity-check decoding, 4096-point complex fast Fourier transform, and 100-B record sorting applications.
- Published
- 2017
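- Example
The quoted 5.3 pJ/instruction follows directly from the 0.56 V numbers, since one instruction issues per cycle (so the instruction rate equals the clock frequency):
```latex
\[
  E_{\mathrm{inst}} = \frac{P}{f}
                    = \frac{0.61\ \mathrm{mW}}{115\ \mathrm{MHz}}
                    \approx 5.3\ \mathrm{pJ/instruction}
\]
```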
25. Two novel cache management mechanisms on CPU-GPU heterogeneous processors
- Author
-
Huijing Yang and Tingwen Yu
- Subjects
heterogeneous, multicore, cpu-gpu, Information technology, T58.5-58.64 - Abstract
Heterogeneous multicore processors that take full advantage of CPUs and GPUs within the same chip raise an emerging challenge for sharing a series of on-chip resources, particularly Last-Level Cache (LLC) resources. Since GPU cores have good parallelism and memory-latency tolerance, the majority of the LLC space is utilized by GPU applications. Under current cache management policies, the LLC share of CPU applications can be markedly reduced by the presence of GPU workloads, seriously affecting overall performance. To alleviate the unfair contention between CPUs and GPUs for cache capacity, we propose two novel cache management mechanisms: a static cache partitioning scheme based on an adaptive replacement policy (SARP) and a dynamic cache partitioning scheme based on GPU miss awareness (DGMA). The SARP scheme first uses cache partitioning to split the cache ways between CPUs and GPUs and then applies an adaptive cache replacement policy depending on the type of the requested message. The DGMA scheme monitors the GPU's cache performance metrics at run time and sets an appropriate threshold to dynamically change the ratio of the mutual LLC between various kernels. Experimental results show that the SARP mechanism can increase CPU performance by up to 32.6%, with an average increase of 8.4%. The DGMA scheme improves CPU performance while ensuring that GPU performance is not affected, achieving a maximum increase of 18.1% and an average increase of 7.7%.
- Published
- 2021
- Full Text
- View/download PDF
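- Example
The gist of a DGMA-style dynamic scheme can be pictured as a feedback loop over LLC way allocation: watch the GPU's miss rate and shift ways across the CPU/GPU boundary around thresholds. The thresholds, way counts, and step size here are invented for illustration and are not the paper's tuned values.
```python
TOTAL_WAYS = 16     # ways of the shared LLC split between CPU and GPU

def adjust_gpu_ways(gpu_ways: int, gpu_miss_rate: float,
                    low: float = 0.02, high: float = 0.10) -> int:
    """One monitoring-interval step of the partition controller."""
    if gpu_miss_rate > high and gpu_ways < TOTAL_WAYS - 1:
        return gpu_ways + 1     # GPU thrashing: reclaim a way from the CPU
    if gpu_miss_rate < low and gpu_ways > 1:
        return gpu_ways - 1     # GPU latency-tolerant: donate a way to the CPU
    return gpu_ways             # within band: leave the partition alone

print(adjust_gpu_ways(8, 0.15))   # 9
print(adjust_gpu_ways(8, 0.01))   # 7
```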
26. Fast Newton-Raphson Power Flow Analysis Based on Sparse Techniques and Parallel Processing.
- Author
-
Ahmadi, Afshin, Smith, Melissa C., Collins, Edward R., Dargahi, Vahid, and Jin, Shuangshuang
- Subjects
ELECTRICAL load, MULTICORE processors, PARALLEL processing, SPARSE matrices, SYSTEM analysis, GRAPHICS processing units - Abstract
Power flow (PF) calculation provides the basis for the steady-state power system analysis and is the backbone of many power system applications ranging from operations to planning. The calculated voltage and power values by PF are essential to determining the system condition and ensuring the security and stability of the grid. The emergence of multicore processors provides an opportunity to accelerate the speed of PF computation and, consequently, improve the performance of applications that run PF within their processes. This paper introduces a fast Newton-Raphson power flow implementation on multicore CPUs by combining sparse matrix techniques, mathematical methods, and parallel processing. Experimental results validate the effectiveness of our approach by finding the power flow solution of a synthetic U.S. grid test case with 82,000 buses in just 1.8 seconds. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
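- Example
The iteration skeleton such solvers share: repeatedly solve the sparse linear system J(x)Δx = -f(x) and update x until the mismatch vanishes. The toy two-equation system below stands in for the real polar-form P/Q mismatch equations; none of the paper's parallel or ordering optimizations are reproduced.
```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def newton_raphson(f, jac, x0, tol=1e-8, max_iter=20):
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx, np.inf) < tol:    # mismatch small enough
            break
        x += spsolve(csr_matrix(jac(x)), -fx)   # sparse linear solve per step
    return x

# Placeholder nonlinear system (real PF solvers build these from bus data)
f = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
jac = lambda x: np.array([[2 * x[0], 1.0], [1.0, 2 * x[1]]])
print(newton_raphson(f, jac, np.array([1.0, 1.0])))   # -> [1. 2.]
```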
27. Co-Design of Multicore Hardware and Multithreaded Software for Thread Performance Assessment on an FPGA.
- Author
-
Adam, George K.
- Subjects
GATE array circuits, COMPUTER systems, PARTICIPATORY design, MULTICORE processors, INTERNET of things - Abstract
Multicore and multithreaded architectures increase the performance of computing systems. The increase in cores and threads, however, raises further issues in the efficiency achieved in terms of speedup and parallelization, particularly for the real-time requirements of Internet of Things (IoT)-embedded applications. This research investigates the efficiency of a 32-core field-programmable gate array (FPGA) architecture, with memory management unit (MMU) and real-time operating system (OS) support, in exploiting the thread-level parallelism (TLP) of tasks running in parallel as threads on multiple cores. The research outcomes confirm the feasibility of the proposed approach for the efficient execution of recursive sorting algorithms, as well as their evaluation in terms of speedup and parallelization. The results reveal that parallel implementation of the prevalent merge sort and quicksort algorithms on this platform is more efficient. The increase in speedup is proportional to the core scaling, reaching a maximum of 53% for the configuration with the highest number of cores and threads. However, the maximum magnitude of parallelization (66%) was found to be bounded at a low count of two cores and four threads; a further increase in the number of cores and threads did not improve the parallelism. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
28. Performance Evaluation of Massively Parallel Systems Using SPEC OMP Suite.
- Author
-
Mustafa, Dheya
- Subjects
HIGH performance processors, SUPERCOMPUTERS, MODERN architecture, COMPUTER software development, COPROCESSORS - Abstract
Performance analysis plays an essential role in achieving scalable application performance on massively parallel supercomputers equipped with thousands of processors. This paper is an empirical investigation that studies, in depth, the performance of two of the most common high-performance computing architectures in the world. IBM has developed three generations of Blue Gene supercomputers (Blue Gene/L, P, and Q) that use, at large scale, low-power processors to achieve high performance. Better CPU core efficiency has been empowered by a higher level of integration to gain more parallelism per processing element. On the other hand, the Intel Xeon Phi coprocessor, armed with 61 on-chip x86 cores, provides high theoretical peak performance as well as software development flexibility with existing high-level programming tools. We present an extensive evaluation of the performance peaks and scalability of these two modern architectures using the SPEC OMP benchmarks. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
29. Interference-Aware Schedulability Analysis and Task Allocation for Multicore Hard Real-Time Systems.
- Author
-
Aceituno, José María, Guasque, Ana, Balbastre, Patricia, Simó, José, and Crespo, Alfons
- Subjects
TASK analysis, HIGH performance computing, CACHE memory, MULTICORE processors - Abstract
There has been a trend towards using multicore platforms for real-time embedded systems due to their high computing performance. In the scheduling of a multicore hard real-time system, there are interference delays due to contention for shared hardware resources. The main sources of interference are main memory, cache memory, and the shared memory bus. These interferences are a great source of unpredictability, and they are not always taken into account. Recent papers have proposed task models and schedulability algorithms to account for this interference delay. The aim of this paper is to provide a schedulability analysis for a task model that incorporates interference delay, for both fixed and dynamic priorities. We assume an implicit-deadline task model. We rely on a task model where this interference is integrated in a general way, without depending on a specific type of hardware resource. There are similar approaches, but they consider only fixed priorities. An allocation algorithm to minimise this interference (Imin) is also proposed and compared with existing allocators. The results show that Imin achieves the best rates in terms of schedulability percentages and increased utilisation. In addition, Imin presents good results in terms of solution times. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
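- Example
The generic shape of such an analysis is the classical fixed-point response-time recurrence with a per-task interference term I_i folded in: R_i = C_i + I_i + Σ_{j∈hp(i)} ⌈R_i/T_j⌉·C_j. The sketch below implements that generic recurrence for fixed priorities and implicit deadlines; the paper's interference model and dynamic-priority analysis are richer than this.
```python
import math

def response_times(tasks):
    """tasks: list of (C, T, I) sorted highest priority first; implicit
    deadlines, so task i is schedulable iff its response time R_i <= T_i."""
    out = []
    for i, (C, T, I) in enumerate(tasks):
        R = C + I
        while True:
            nxt = C + I + sum(math.ceil(R / Tj) * Cj for Cj, Tj, _ in tasks[:i])
            if nxt == R or nxt > T:       # fixed point found, or unschedulable
                R = nxt
                break
            R = nxt
        out.append((R, R <= T))
    return out

print(response_times([(1, 4, 0.5), (2, 8, 0.5), (3, 20, 1.0)]))
# [(1.5, True), (3.5, True), (8.0, True)]
```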
30. Real-Time System Benchmarking with Embedded Linux and RT Linux on a Multi-Core Hardware Platform
- Author
-
Hosseini, Kian
- Abstract
To catch up with the growing trend of parallelism, this thesis focuses on the adaptation of embedded real-time systems to a multicore platform. We use the Xilinx ZCU-102, a multicore board, as an example of an embedded system without going deep into its architecture. First, we deal with the tasks required to make an embedded system operational and discuss why they differ from those for normal computer systems. The processes it takes to make a custom operating system for the given Xilinx embedded system are examined, and patching and customizing the custom operating system are studied. We then look at related work in the field of benchmarking real-time and embedded systems and, with a good understanding of that work, propose a similar design for benchmarking embedded systems. The benchmarks we use run on multiple cores and aim at challenging the Xilinx board's capability to run real-time tasks while the other cores on the board are occupied with independent tasks. We test the designed benchmarks under different conditions on two operating systems, RT-Linux and Embedded Linux, to study the differences between them. We then note how RT-Linux would be a real upgrade for real-time systems if multicore operations are considered. The final result is that core idling might decrease the performance of real-time tasks, and RT-Linux might experience more interrupts, but it is also better at recovering from interrupts.
- Published
- 2024
31. Reachability-Based Response-Time Analysis of Preemptive Tasks Under Global Scheduling
- Author
-
Gohari, Pourya, Voeten, Jeroen, and Nasri, Mitra
- Abstract
Global scheduling reduces the average response times as it can use the available computing cores more efficiently for scheduling ready tasks. However, this flexibility poses challenges in accurately quantifying interference scenarios, often resulting in either conservative response-time analyses or scalability issues. In this paper, we present a new response-time analysis for preemptive periodic tasks (or job sets) subject to release jitter under global job-level fixed-priority (JLFP) scheduling. Our analysis relies on the notion of schedule-abstraction graph (SAG), a reachability-based response-time analysis known for its potential accuracy and efficiency. Up to this point, SAG was limited to non-preemptive tasks due to the complexity of handling preemption when the number of preemptions and the moments they occur are not known beforehand. In this paper, we introduce the concept of time partitions and demonstrate how it facilitates the extension of SAG for preemptive tasks. Moreover, our paper provides the first response-time analysis for the global EDF(k) policy - a JLFP scheduling policy introduced in 2003 to address the Dhall’s effect. Our experiments show that our analysis is significantly more accurate compared to the state-of-the-art analyses. For example, we identify 12 times more schedulable task sets than existing tests for the global EDF policy (e.g., for systems with 6 to 16 tasks, 70% utilization, and 4 cores) with an average runtime of 30 minutes. We show that EDF(k) outperforms global RM and EDF by scheduling on average 24.9% more task sets (e.g., for systems with 2 to 10 cores and 70% utilization). Moreover, for the first time, we show that global JLFP scheduling policies (particularly, global EDF(k)) are able to schedule task sets that are not schedulable using well-known partitioning heuristics.
- Published
- 2024
- Full Text
- View/download PDF
32. Tracking Multicore Contention in Memory Controllers and DRAM
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Moretó Planas, Miquel, Cazorla Almeida, Francisco Javier, and Fernández de Lecea Navarro, Asier
- Abstract
The main memory subsystem has traditionally been one of the more complex resources to analyze in multicore real-time embedded systems, with memory controller considerations and JEDEC timing constraints being the more prominent factors contributing to such complexity. One of the main challenges in multicore real-time systems is the production of the necessary evidence regarding the management of contention for the certification of multicore platforms in safety-relevant sectors. As current MPSoC platforms provide little information on how tasks may be interacting and delaying each other at large, it still remains a tall order to provide evidence about the correctness of hardware and software mechanisms deployed specifically to mitigate and manage contention on shared resources. This work attempts to bridge this gap by proposing a low-overhead hardware mechanism to tightly track inter-core contention within the main memory subsystem. The proposed technique enhances the quality of timing- and contention-related evidence, increasing the explainability and management of multicore contention in the main memory subsystem for multicore real-time systems in relation to applicable safety standards regulating their usage.
- Published
- 2024
33. Task Scheduling With Multicore Edge Computing in Dense Small Cell Networks
- Author
-
Wei Kuang Lai, Chin-Shiuh Shieh, and Yen-Ping Chen
- Subjects
Task scheduling, edge computing, multicore, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
As a reaction and complement to cloud computing, edge computing is a computing paradigm designed for low-latency computing. Edge servers, deployed at the boundary of the Internet, bridge distributed end devices and the centralized cloud server, forming a harmonious architecture with low latency and balanced loading. Elaborate task scheduling, including task assignment and processor dispatching, is essential to the success of edge computing systems in dense small cell networks. Plenty of issues need to be considered, such as servers' computing power, storage capacity, loading, and bandwidth, and tasks' sizes, delays, partitionability, etc. This study contributes to task scheduling for multicore edge computing environments. We first show that this scheduling problem is NP-hard. An efficient and effective heuristic is then proposed to tackle it. Our Multicore Task assignment for maximum Rewards (MAR) scheme differs from most previous schemes in jointly considering all three critical factors: task partitionability, multicore execution, and task properties. A task's priority is decided by its cost function, which takes into account the task's size, deadline, and partitionability, as well as the cores' loading, processing power, and so forth. First, tasks from end devices are assigned to edge servers considering the servers' loading and storage. Next, tasks are assigned to the cores of the selected server. Simulations compare the proposed scheme with First-Come-First-Serve (FCFS), Shortest Task First (STF), Delay Priority Scheduling (DPS), and the Green Greedy Algorithm (GGA). They demonstrate that the task completion ratio can be significantly increased and the number of aborted tasks greatly reduced: compared with FCFS, STF, DPS, and GGA, the improvement in task completion ratio for hotspots is up to 26%, 25%, 22%, and 9%, respectively.
- Published
- 2021
- Full Text
- View/download PDF
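- Example
A minimal sketch of the scheme's overall shape: score tasks with a cost function over size, deadline, and partitionability, then greedily place the most urgent task on the least-loaded core. The score weights and formula below are invented for illustration; the paper defines its own cost function and a two-stage server-then-core assignment.
```python
import heapq

def schedule(tasks, num_cores, w=(1.0, 2.0, 0.5)):
    """tasks: list of (size, deadline, partitionable). Returns (task, core,
    finish_time) tuples; task size doubles as its processing time."""
    def urgency(t):
        size, deadline, partitionable = t
        return w[0] * size / deadline + w[1] / deadline - w[2] * partitionable
    cores = [(0.0, c) for c in range(num_cores)]       # (busy-until, core id)
    heapq.heapify(cores)
    plan = []
    for task in sorted(tasks, key=urgency, reverse=True):
        busy, cid = heapq.heappop(cores)               # least-loaded core
        busy += task[0]
        plan.append((task, cid, busy))
        heapq.heappush(cores, (busy, cid))
    return plan

for row in schedule([(4, 10, False), (1, 3, True), (6, 30, True)], num_cores=2):
    print(row)
```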
34. Succinct parallel Lempel–Ziv factorization on a multicore computer.
- Author
-
Han, Ling Bo, Lao, Bin, and Nong, Ge
- Subjects
FACTORIZATION, PARALLEL algorithms, COMPUTERS - Abstract
This article proposes a succinct parallel algorithm, called pLZone, to compute the Lempel–Ziv (LZ77) factorization of a size-n input string over a constant alphabet in O(n) time using a small workspace of approximately n words, where each word occupies ⌈log n⌉ bits. pLZone is designed by dividing the computing process of the sequential factorization algorithm LZone into multiple stages organized as a pipeline that performs operations in parallel for acceleration; a checking method is integrated into the pipeline to efficiently verify the output and prevent bugs during implementation. A performance evaluation experiment was conducted by running pLZone and existing representative algorithms on a set of realistic and artificial datasets. Our proposed algorithm achieved both the best time and the best space results, suggesting that this work could provide a potential solution for efficient LZ77 computation. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
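- Example
What the algorithm computes, via the naive quadratic definition: each LZ77 factor is the longest prefix of the remaining suffix that also occurs starting at an earlier position (overlaps allowed), or a single fresh character. pLZone produces this same factorization in O(n) time with a pipelined parallel algorithm; the version below merely defines the output.
```python
def lz77_factorize(s: str):
    factors, i = [], 0
    while i < len(s):
        best_len, best_pos = 0, 0
        for j in range(i):                   # candidate earlier occurrence
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1                       # j + k may run past i: overlap ok
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append(("literal", s[i]))
            i += 1
        else:
            factors.append(("copy", best_pos, best_len))
            i += best_len
    return factors

print(lz77_factorize("abababc"))
# [('literal', 'a'), ('literal', 'b'), ('copy', 0, 4), ('literal', 'c')]
```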
35. Fine-Grained Power Modeling of Multicore Processors Using FFNNs.
- Author
-
Sagi, Mark, Vu Doan, Nguyen Anh, Fasfous, Nael, Wild, Thomas, and Herkersdorf, Andreas
- Subjects
MULTICORE processors, ARTIFICIAL neural networks - Abstract
To minimize power consumption while maximizing performance, today's multicore processors rely on fine-grained run-time dynamic power information, both in the time domain (e.g., μs to ms) and in the space domain (e.g., core level). The state of the art for deriving such power information is mainly based on predetermined power models which use linear modeling techniques to determine the core-performance/core-power relationship. However, with multicore processors becoming ever more complex, linear modeling techniques can no longer capture all possible core-performance-related power states. Although artificial neural networks (ANN) have been proposed for coarse-grained power modeling of servers with time resolutions in the range of seconds, few works have investigated fine-grained ANN-based power modeling. In this paper, we explore feed-forward neural networks (FFNNs) for core-level power modeling with estimation rates in the range of 10 kHz. To achieve high estimation accuracy while minimizing run-time overhead, we propose a multi-objective optimization of the neural architecture using NSGA-II, with the FFNNs trained on performance counter and power data from a complex out-of-order processor architecture. We show that the relative power estimation error of the highest-accuracy FFNN decreases on average by 7.5% compared to a state-of-the-art linear power modeling approach and by 5.5% compared to a multivariate polynomial regression model. For the FFNNs optimized for both accuracy and overhead, the average error decreases by between 4.1% and 6.7% compared to linear modeling while offering significantly lower overhead than the highest-accuracy FFNN. Furthermore, we propose a microcontroller-based and an accelerator-based implementation for run-time inference of the power modeling FFNN and show that the area overhead is negligible. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
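- Example
The estimator's shape, stripped to its essentials: a small feed-forward network mapping a core's performance-counter vector to a power estimate at each sampling interval. Layer sizes, the choice of counters, and the random weights below are placeholders; the paper trains on measured power and searches architectures with NSGA-II, neither of which this sketch does.
```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # 4 counters -> 8 hidden units
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)    # hidden -> scalar power

def estimate_power(counters: np.ndarray) -> float:
    """counters: e.g. [ipc, cache_misses, fp_ops, branch_mispredicts]."""
    hidden = np.maximum(W1 @ counters + b1, 0.0)  # ReLU layer
    return float(W2 @ hidden + b2)                # estimated core power (watts)

print(estimate_power(np.array([1.2, 0.03, 0.4, 0.01])))
```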
36. A Lock Free Approach To Parallelize The Cellular Potts Model: Application To Ductal Carcinoma In Situ
- Author
-
Tomeu Antonio J. and Salguero Alberto G.
- Subjects
cellular automata, cellular potts model, dcis, multicore, parallel, software transactional memory, speedup, Biotechnology, TP248.13-248.65 - Abstract
In the field of computational biology, the Cellular Potts Model (CPM) has been used to simulate multiscale biological systems; it determines the actions that simulated cells can perform through an energy Hamiltonian that takes into account the influence exerted by neighboring cells, under a wide range of parameters. There are some proposals in the literature that parallelize the CPM; in all cases, either lock-based techniques or other techniques that require large amounts of information to be disseminated among parallel tasks are used to preserve data coherence. In both cases, computational performance is limited. This work proposes an alternative approach to parallelizing the model that uses transactional memory to maintain the coherence of the information. A Java implementation has been applied to the simulation of ductal carcinoma in situ of the breast (DCIS). Execution times and speedups of the model on our university's cluster are analyzed. The results show a good speedup.
- Published
- 2020
- Full Text
- View/download PDF
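- Example
The unit of work being parallelized is the CPM copy attempt: propose overwriting a lattice site with a neighbor's cell ID and accept via the Metropolis rule on the Hamiltonian change. The volume-constraint term below is the textbook lam*(v - target)^2 form and the adhesion change is collapsed to a constant; both are illustrative, not the DCIS model, and the STM-based concurrency is omitted.
```python
import math
import random

def metropolis_accept(delta_h: float, temperature: float) -> bool:
    return delta_h <= 0 or random.random() < math.exp(-delta_h / temperature)

def attempt_copy(lattice, volumes, src, dst, temperature=10.0,
                 j_adh=2.0, lam=1.0, target_vol=25):
    """Try to copy the cell ID at src over dst; mutates state if accepted."""
    s_src, s_dst = lattice[src], lattice[dst]
    if s_src == s_dst:
        return False
    # Exact change of lam * (v - target)^2 when v_src grows and v_dst shrinks:
    dvol = (2 * (volumes[s_src] - target_vol) + 1) + \
           (-2 * (volumes[s_dst] - target_vol) + 1)
    delta_h = j_adh + lam * dvol   # j_adh: stand-in for contact-energy change
    if metropolis_accept(delta_h, temperature):
        volumes[s_src] += 1
        volumes[s_dst] -= 1
        lattice[dst] = s_src
        return True
    return False

lattice = {(0, 0): 1, (0, 1): 2}
volumes = {1: 25, 2: 25}
print(attempt_copy(lattice, volumes, (0, 0), (0, 1)))
```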
37. DAG Hierarchical Schedulability Analysis for Avionics Hypervisor in Multicore Processors
- Author
-
Huan Yang, Shuai Zhao, Xiangnan Shi, Shuang Zhang, and Yangming Guo
- Subjects
multicore, avionics hypervisor, DAG, hierarchical, parallel scheduling, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999 - Abstract
Parallel hierarchical scheduling of multicore processors in avionics hypervisors is studied. Parallel hierarchical scheduling uses modular reasoning about the temporal behavior of the upper Virtual Machine (VM) by partitioning CPU time, and Directed Acyclic Graphs (DAGs) are used to model functional dependencies. However, the existing DAG scheduling algorithm wastes resources and is inaccurate. Decreasing the completion time (CT) of a DAG and offering a tight and safe bound requires exploiting joint-level parallelism and inter-joint dependency, the two key factors of DAG topology. Firstly, the Concurrent Parent and Child Model (CPCM) is studied, which accurately captures these two factors and can be applied recursively when parsing a DAG. Based on CPCM, the paper puts forward a hierarchical scheduling algorithm that focuses on decreasing the maximum CT of joints. Secondly, a new Response Time Analysis (RTA) algorithm is proposed, which offers a general bound for other execution sequences of noncritical joints (NC-joints) and a specific bound for a fixed execution sequence. Finally, the results show that the parallel hierarchical scheduling algorithm achieves higher performance than other algorithms.
- Published
- 2023
- Full Text
- View/download PDF
38. GPU-Accelerated Adaptive PCBSO Mode-Based Hybrid RLA for Sparse LU Factorization in Circuit Simulation.
- Author
-
Lee, Wai-Kong and Achar, Ramachandra
- Subjects
GRAPHICS processing units ,FACTORIZATION ,SIMULATION Program with Integrated Circuit Emphasis ,TRANSMISSION line matrix methods - Abstract
LU factorization is extensively used in engineering and scientific computations for the solution of large sets of linear equations. In particular, circuit simulators rely heavily on the sparse version of LU factorization for solutions involving circuit matrices. One of the recent advances in this field is exploiting the emerging computing platform of graphics processing units (GPUs) for parallel and sparse LU factorization. In this article, the following contributions are made to advance the state of the art in the hybrid right-looking algorithm (RLA): 1) a novel GPU kernel based on parallel column and block size optimization (PCBSO) is developed for adaptively allocating the block size while optimizing the number of columns for parallel execution, based on the size of their associated submatrices at every level; the proposed approach helps to minimize resource contention and to improve computational performance; and 2) an algorithm is developed to enable the execution of the new adaptive mode with dynamic parallelism. Also, a comprehensive performance comparison using a set of benchmark circuit examples is presented. The results indicate that the proposed advancements can improve the results of state-of-the-art right-looking sparse LU factorization on GPUs by $1.54\times $ (arithmetic mean). [ABSTRACT FROM AUTHOR]
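As a rough illustration of the adaptive idea (not the authors' PCBSO kernel), a host-side heuristic might pick the block size and the number of concurrently processed columns per level as follows; all thresholds are assumptions:

    #include <algorithm>

    // Assumed thresholds: match the CUDA block size to the largest submatrix
    // touched at this level, so small columns do not tie up oversized blocks.
    int pick_block_size(int max_submatrix_rows) {
        if (max_submatrix_rows <= 64)  return 64;
        if (max_submatrix_rows <= 128) return 128;
        if (max_submatrix_rows <= 256) return 256;
        return 512;                    // cap to limit per-block resource usage
    }

    // Columns launched concurrently at this level, bounded so the grid fits the
    // device while leaving headroom for other kernels (max_threads is assumed).
    int pick_parallel_columns(int level_cols, int block_size, int max_threads = 65536) {
        return std::min(level_cols, std::max(1, max_threads / block_size));
    }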
- Published
- 2021
- Full Text
- View/download PDF
39. Improved composability of software components through parallel hardware platforms for in-car multimedia systems
- Author
-
Knirsch, Andreas
- Subjects
629.2 ,Composability ,In-Vehicle Infotainment ,Automotive ,In-Car Multimedia ,Multicore ,UI Compositing ,Scheduling - Abstract
Recent years have witnessed a significant change to vehicular user interfaces (UIs). This is the result of increased functionality, triggered by the continuous proliferation of vehicular software and computer systems. The UI represents the integration point that must fulfil particular requirements for usability despite the increased functionality. A concurrent trend is the substitution of federated systems with integrated architectures. The steadily rising number of interacting functional components and the increasing integration density imply a growing complexity that affects system development. This evolution raises demands for concepts that aid the composition of such complex and interactive embedded software systems, operated within safety-critical environments. This thesis explores the requirements related to composability of software components, based on the example of In-Car Multimedia (ICM), and proposes a novel software architecture that provides an integration path for next-generation ICM. The investigation begins with an examination of characteristics, existing frameworks and applied practice regarding the development and composition of ICM systems. To this end, constructive aspects are identified as potential means for improving the composability of independently developed software components that differ in criticality and in temporal and computational characteristics. This research examines the feasibility of partitioning software components by exploiting parallel hardware architectures. Experimental evaluations demonstrate the applicability of encapsulated scheduling domains, achieved through the utilisation of multiple technologies that complement each other and provide different levels of containment while featuring efficient communication to preserve adequate interoperability. In spite of allocating dedicated computational resources to software components, certain resources are still shared and require concurrent access. Particular attention has been paid to the management of concurrent access to shared resources, taking into account each software component's criticality and derived priority; a software-based resource arbiter is specified and evaluated to improve the system's determinism. Within the context of automotive interactive systems, the UI is of vital importance, as it must conceal inherent complexity to minimise driver distraction. The architecture is therefore enhanced with a UI compositing infrastructure to facilitate the implementation of a homogeneous and comprehensive look and feel despite the segregation of functionality. The core elements of the novel architecture are validated both individually and in combination through a proof-of-concept prototype. The proposed integral architecture supports the development, and in particular the integration, of mixed-critical and interactive systems.
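A minimal sketch of the kind of priority-driven, software-based resource arbiter the thesis describes (names and the strict-priority policy are illustrative; the thesis' arbiter also considers temporal characteristics):

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    class Arbiter {
        struct Req {
            int prio, id;
            bool operator<(const Req& o) const {          // higher prio first,
                return prio < o.prio || (prio == o.prio && id > o.id); // FIFO ties
            }
        };
        std::priority_queue<Req> waiting_;
        std::mutex m_;
        std::condition_variable cv_;
        bool busy_ = false;
        int next_id_ = 0;

    public:
        void acquire(int prio) {                 // called by a component thread
            std::unique_lock<std::mutex> lk(m_);
            int id = next_id_++;
            waiting_.push({prio, id});
            cv_.wait(lk, [&] { return !busy_ && waiting_.top().id == id; });
            waiting_.pop();
            busy_ = true;                        // resource granted
        }
        void release() {
            { std::lock_guard<std::mutex> lk(m_); busy_ = false; }
            cv_.notify_all();                    // highest-priority waiter proceeds
        }
    };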
- Published
- 2015
40. Research on Multicore Key-Value Storage System for Domain Name Storage.
- Author
-
Han, Luchao, Guo, Zhichuan, and Zeng, Xuewen
- Subjects
ALGORITHMS ,DATA structures ,MULTICORE processors ,STORAGE ,INTERNET domain naming system - Abstract
This article proposes a domain name caching method for a multicore network-traffic capture system that significantly improves insert latency, throughput, and hit rate. The caching method is composed of a cache replacement algorithm and a cache set method. It is easy to implement, low in deployment cost, and suitable for various multicore caching systems; moreover, it reduces the use of locks by changing data structures and algorithms. Experimental results show that, compared with other caching systems, the proposed method reaches the highest throughput under multiple cores, indicating that it is best suited for domain name caching. [ABSTRACT FROM AUTHOR]
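A minimal sketch of a lock-reduced design in this spirit, sharding the domain name cache per core so the fast path needs no lock; the shard count, hashing, and plain LRU policy are assumptions rather than the paper's exact replacement algorithm:

    #include <list>
    #include <string>
    #include <unordered_map>
    #include <vector>

    class ShardedDnsCache {
        struct Shard {                           // one shard per core: single-writer,
            std::list<std::pair<std::string, std::string>> lru;   // no lock needed
            std::unordered_map<std::string,
                std::list<std::pair<std::string, std::string>>::iterator> index;
            size_t capacity = 4096;              // assumed per-shard capacity
        };
        std::vector<Shard> shards_;
    public:
        explicit ShardedDnsCache(size_t cores) : shards_(cores) {}

        void put(size_t core_id, const std::string& name, const std::string& addr) {
            Shard& s = shards_[core_id % shards_.size()];
            auto it = s.index.find(name);
            if (it != s.index.end()) s.lru.erase(it->second);
            s.lru.emplace_front(name, addr);
            s.index[name] = s.lru.begin();
            if (s.lru.size() > s.capacity) {     // evict least recently used
                s.index.erase(s.lru.back().first);
                s.lru.pop_back();
            }
        }
        const std::string* get(size_t core_id, const std::string& name) {
            Shard& s = shards_[core_id % shards_.size()];
            auto it = s.index.find(name);
            if (it == s.index.end()) return nullptr;
            s.lru.splice(s.lru.begin(), s.lru, it->second);   // refresh recency
            return &it->second->second;
        }
    };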
- Published
- 2021
- Full Text
- View/download PDF
41. A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes
- Author
-
Hamidreza Khaleghzadeh, Ravi Reddy Manumachu, and Alexey Lastovetsky
- Subjects
Heterogeneous platforms ,multicore ,Nvidia GPU ,Intel Xeon Phi ,workload partitioning ,performance ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Modern HPC platforms are highly heterogeneous, with tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to address the twin critical concerns of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as the Last Level Cache (LLC) and interconnect, and for shared nodal resources such as DRAM and PCI-E links, resulting in complexities such as resource contention, non-uniform memory access (NUMA), and accelerator-specific limitations such as limited main memory, thereby necessitating support for efficient out-of-card execution. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we propose a hierarchical two-level data partitioning algorithm minimizing the parallel execution time of data-parallel applications on clusters of h identical nodes, where each node has c heterogeneous processors. The algorithm takes as input c discrete speed functions of cardinality m corresponding to the c heterogeneous processors and makes no assumptions about the shapes of these functions. Unlike load-balancing algorithms, optimal solutions found by the algorithm may not load-balance an application in terms of execution time. The proposed algorithm has a low time complexity of O(m^2 × h + m^3 × c^3), unlike the state-of-the-art algorithm solving the same problem with a complexity of O(m^3 × c^3 × h^3). We also propose an extension of the algorithm for clusters of h non-identical nodes where each node has c heterogeneous processors. We experimentally demonstrate the optimality of our algorithm using two well-known and highly optimized multi-threaded data-parallel applications, matrix-matrix multiplication and 2D fast Fourier transform, on a heterogeneous multi-accelerator NUMA node containing an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor, as well as on a simulated homogeneous cluster of such nodes.
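To make the problem concrete, a minimal sketch of partitioning with discrete time functions via plain dynamic programming (not the authors' algorithm and without its complexity guarantees):

    #include <algorithm>
    #include <limits>
    #include <vector>

    // time[p][w]: execution time of processor p with w workload chunks, w = 0..m.
    // Plain O(c*m^2) dynamic programming over processors; the makespan-optimal
    // answer may intentionally leave processors unbalanced.
    double partition_min_time(const std::vector<std::vector<double>>& time, int m) {
        const int c = static_cast<int>(time.size());
        const double INF = std::numeric_limits<double>::infinity();
        std::vector<double> best(m + 1, INF), next(m + 1, INF);
        for (int w = 0; w <= m; ++w) best[w] = time[0][w];
        for (int p = 1; p < c; ++p) {
            std::fill(next.begin(), next.end(), INF);
            for (int w = 0; w <= m; ++w)          // chunks already assigned
                for (int k = 0; k + w <= m; ++k)  // chunks given to processor p
                    next[w + k] = std::min(next[w + k],
                                           std::max(best[w], time[p][k]));
            best.swap(next);
        }
        return best[m];                           // minimal parallel execution time
    }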
- Published
- 2020
- Full Text
- View/download PDF
42. HP-DCFNoC: High Performance Distributed Dynamic TDM Scheduler Based on DCFNoC Theory
- Author
-
Tomas Picornell, Jose Flich, Jose Duato, and Carles Hernandez
- Subjects
Dynamic scheduler ,multicore ,real-time ,tdma ,time predictable network ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The need to increase the performance of critical real-time embedded systems pushes the industry to adopt complex multicore processor designs with embedded networks-on-chip. In this paper we present hp-DCFNoC, a distributed dynamic scheduler design that, by relying on the key properties of a delayed conflict-free NoC (DCFNoC), is able to achieve peak performance numbers very close to a wormhole-based NoC design without compromising its real-time guarantees. In particular, our results show that the proposed scheduler achieves overall throughput improvements of 6.9× and 14.4× over a baseline DCFNoC for 16- and 64-node meshes, respectively. When compared against a standard wormhole router, 95% of its network throughput is preserved while strict timing predictability is kept as a property. This achievement opens the door to new high-performance, time-predictable NoC designs.
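For orientation, a minimal sketch of the static TDM principle underlying DCFNoC-style designs; hp-DCFNoC's dynamic reassignment of unused slots is not shown:

    #include <cstdint>
    #include <vector>

    struct TdmPort {
        std::vector<int> slot_table;   // slot -> flow id, -1 = free
        std::uint64_t cycle = 0;

        // A flit of 'flow' may only be injected in its own slot, which is what
        // makes end-to-end latency analyzable in a conflict-free TDM NoC.
        bool try_inject(int flow) {
            bool ok = slot_table[cycle % slot_table.size()] == flow;
            ++cycle;
            return ok;
        }
    };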
- Published
- 2020
- Full Text
- View/download PDF
43. Brushless DC Motor Control for Electric Vehicles using a Dual Core Microcontroller
- Author
-
Ramiro Adrián Ghignone, Julián Guido Giampetruzzi, Sharon Michelle Domanico, Cristian Gabriel Juárez, and Federico Joaquín Calá
- Subjects
bldc ,regenerative braking ,electric vehicle ,power electronics ,arm cortex ,multicore ,ciaa ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 ,Computer engineering. Computer hardware ,TK7885-7895 - Abstract
This paper describes the development and verification of a power system for brushless DC motors, specially designed for use in electric vehicles. The proposal is motivated by the current expansion of electric mobility technologies as a solution to reduce pollutant emissions from transport. The proposed design implements additional features that are fundamental for this kind of application, such as regenerative braking and telemetry through a mobile application. The prototype was built and verified through several laboratory tests.
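A minimal sketch of the standard six-step commutation at the heart of such a BLDC drive; the HAL calls and the hall-state-to-step mapping are hypothetical, and the described firmware additionally implements regenerative braking and telemetry:

    #include <cstdint>

    void pwm_set_duty(std::uint8_t phase, float duty);   // hypothetical HAL call
    void phase_sink(std::uint8_t phase);                 // hypothetical HAL call

    struct Step { std::uint8_t high, low; };             // phases: 0=A, 1=B, 2=C

    // Six-step sequence A+B-, A+C-, B+C-, B+A-, C+A-, C+B-; the mapping from
    // hall state (1..6) to step depends on sensor placement and is assumed here.
    static const Step kCommutation[6] = {
        {0, 1}, {0, 2}, {1, 2}, {1, 0}, {2, 0}, {2, 1},
    };

    void commutate(std::uint8_t hall_state, float duty) {
        if (hall_state < 1 || hall_state > 6) return;    // invalid sensor reading
        Step s = kCommutation[hall_state - 1];
        pwm_set_duty(s.high, duty);    // high side switches at the PWM duty cycle
        phase_sink(s.low);             // low side conducts the return current
        // The third phase floats; its back-EMF can be sampled, and during braking
        // the bridge can be reconfigured to push current back into the battery.
    }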
- Published
- 2019
44. Automated detection of structured coarse-grained parallelism in sequential legacy applications
- Author
-
Edler Von Koch, Tobias Joseph Kastulus, Franke, Bjoern, Garcia, Frankie, and Singer, Jeremy
- Subjects
005.2 ,compilers ,automatic parallelization ,multicore ,skeletons ,programming - Abstract
The efficient execution of sequential legacy applications on modern, parallel computer architectures is one of today's most pressing problems. Automatic parallelization has been investigated as a potential solution for several decades, but its success generally remains restricted to small niches of regular, array-based applications. This thesis investigates two techniques that have the potential to overcome these limitations. Beginning at the lowest level of abstraction, the binary executable, it presents a study of the limits of Dynamic Binary Parallelization (DBP), a recently proposed technique that takes advantage of an underlying multicore host to transparently parallelize a sequential binary executable. While still in its infancy, DBP has received broad interest within the research community. This thesis seeks to gain an understanding of the factors contributing to the limits of DBP and the costs and overheads of its implementation. An extensive evaluation using a parameterizable DBP system targeting a CMP with lightweight architectural TLS support is presented. The results show that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy SPEC CPU2006 benchmarks, but that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average. While automatically parallelizing compilers have traditionally focused on data parallelism, additional parallelism exists in a plethora of other shapes such as task farms, divide & conquer, map/reduce and many more. These algorithmic skeletons, i.e. high-level abstractions for commonly used patterns of parallel computation, differ substantially from data-parallel loops. Unfortunately, algorithmic skeletons are largely informal programming abstractions and lack a formal characterization in terms of established compiler concepts. This thesis develops compiler-friendly characterizations of popular algorithmic skeletons using a novel notion of commutativity based on liveness. A hybrid static/dynamic analysis framework for the context-sensitive detection of skeletons in legacy code, which overcomes the limitations of static analysis by complementing it with profiling information, is described. A proof-of-concept implementation of this framework in the LLVM compiler infrastructure is evaluated against SPEC CPU2006 benchmarks for the detection of a typical skeleton. The results illustrate that skeletons are often context-sensitive in nature. Like the two approaches presented in this thesis, many dynamic parallelization techniques exploit the fact that some statically detected data and control flow dependences do not manifest themselves in every possible program execution (may-dependences) but occur only infrequently, e.g. for some corner cases, or not at all for any legal program input. While the effectiveness of dynamic parallelization techniques critically depends on the absence of such dependences, not much is known about their nature. This thesis presents an empirical analysis and characterization of the variability of both data dependences and control flow across program runs. The cBench benchmark suite is run with 100 randomly chosen input data sets to generate whole-program control and data flow graphs (CDFGs) for each run, which are then compared to obtain a measure of the variance in the observed control and data flow.
The results show that, on average, the cumulative profile information gathered with at least 55, and up to 100, different input data sets is needed to achieve full coverage of the data flow observed across all runs. For control flow, the figures stand at 46 and 100 data sets, respectively. This suggests that profile-guided parallelization needs to be applied with utmost care, as misclassification of sequential loops as parallel was observed even when up to 94 input data sets were used.
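To make the notion of an algorithmic skeleton concrete, a minimal C++ sketch of a task farm, one of the patterns the thesis characterizes; this shows the target shape only, not the detection analysis:

    #include <functional>
    #include <future>
    #include <vector>

    // Task farm: independent work items fanned out to workers, results gathered
    // in order. Detection in legacy code hinges on the loop body being free of
    // cross-iteration dependences (commutative in the thesis' liveness sense).
    template <typename In, typename Out>
    std::vector<Out> task_farm(const std::vector<In>& items,
                               std::function<Out(const In&)> worker) {
        std::vector<std::future<Out>> futures;
        futures.reserve(items.size());
        for (const In& item : items)             // farm out
            futures.push_back(std::async(std::launch::async, worker,
                                         std::cref(item)));
        std::vector<Out> results;
        results.reserve(items.size());
        for (auto& f : futures)                  // gather, preserving order
            results.push_back(f.get());
        return results;
    }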
- Published
- 2014
45. Enhancing the performance of decoupled software pipeline through backward slicing
- Author
-
Alwan, Esraa, Padget, Julian, and Fitch, John
- Subjects
004.35 ,decoupled software pipeline ,slicing ,multicore ,thread-level parallelism ,automatic restructuring - Abstract
The rapidly increasing number of cores available in multicore processors does not necessarily lead directly to a commensurate increase in performance: programs written in conventional languages, such as C, need careful restructuring, preferably automatic, before the benefits can be observed in improved run-times. Even then, much depends upon the intrinsic capacity of the original program for concurrent execution. Software techniques that parallelize sequential applications can raise the level of gain from multicore systems. Parallel programming is not an easy job for the user, who has to deal with many issues such as dependencies, synchronization, load balancing, and race conditions. For this reason, the role of automatically parallelizing compilers and of techniques for extracting several threads from single-threaded programs without programmer intervention is becoming more important and may help to deliver better utilization of modern hardware. One parallelizing technique that has been shown to be effective for applications with irregular control flow and complex memory access patterns is the Decoupled Software Pipeline (DSWP). This transformation partitions the loop body into a set of stages, ensuring that critical-path dependencies are kept local to a stage. Each stage becomes a thread, and data is passed between threads using inter-core communication. The success of DSWP depends on being able to extract the relatively fine-grained parallelism that is present in many applications. Another technique that offers potential gains in parallelizing general-purpose applications is slicing. Program slicing transforms large programs into several smaller ones that execute independently, each consisting of only the statements relevant to the computation of certain, so-called (program) points. This dissertation explores the possibility of performance benefits arising from a secondary transformation of DSWP stages by slicing. To that end, a new combination method called DSWP/Slice is presented. Our observation is that individual DSWP stages can be parallelized by slicing, leading to improved performance of the longest-duration DSWP stages. In particular, this approach can be applicable in cases where DOALL is not. In consequence, better load balancing can be achieved between the DSWP stages. Moreover, we introduce an automatic implementation of the combination method using the Low Level Virtual Machine (LLVM) compiler framework. The combination is particularly effective when a whole long stage comprises a function body: more than one slice extracted from a function body can speed up its execution and also increase the scalability of DSWP. An evaluation of this technique on six programs with a range of dependence patterns leads to considerable performance gains on a Core i7-870 machine with 4 cores/8 threads. The results are obtained from an automatic implementation and show that the proposed method can give a speedup of up to 1.8× compared with the original sequential code.
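A minimal sketch of the DSWP shape itself, with two stages decoupled by a queue; the DSWP/Slice refinement of the heavier stage is not shown:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    template <typename T>
    class Channel {                              // inter-stage (inter-core) queue
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

    void dswp_loop(int n) {
        Channel<int> ch;
        std::thread stage1([&] {                 // stage 1: control/traversal part
            for (int i = 0; i < n; ++i) ch.push(i * i);   // stand-in computation
        });
        std::thread stage2([&] {                 // stage 2: heavy computation part
            for (int i = 0; i < n; ++i) { volatile int sink = ch.pop(); (void)sink; }
        });
        stage1.join();
        stage2.join();
    }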
- Published
- 2014
46. A data dependency recovery system for a heterogeneous multicore processor
- Author
-
Kainth, Haresh S., Hill, Richard, Bagdasar, Ovidiu, and Jones, Clifton
- Subjects
004.35 ,Thread-Level Speculation ,TLS ,Multicore ,IBM Cell ,Programming ,Data Hazards ,Asymmetric Architecture ,Heterogenous - Abstract
Multicore processors often increase the performance of applications, but with their deeper pipelining they have proven increasingly difficult to improve. In an attempt to deliver enhanced performance at lower power requirements, semiconductor microprocessor manufacturers have progressively utilised chip-multicore processors. Existing research has utilised a very common technique known as thread-level speculation, which attempts to compute results before the actual result is known. However, thread-level speculation impacts operation latency and circuit timing, and confounds data cache behaviour and code generation in the compiler. We describe a software framework, codenamed Lyuba, for an asymmetric chip-multicore processor that handles low-level data hazards and automatically recovers the application from them without programmer intervention. The problem of determining the correct execution of multiple threads when data hazards occur on conventional symmetric chip-multicore processors is a significant and ongoing challenge, yet there has been very little focus on the use of asymmetric (heterogeneous) processors with applications that have complex data dependencies. The purpose of this thesis is to: (i) define the development of a software framework for an asymmetric (heterogeneous) chip-multicore processor; (ii) present optimal software control of hardware for distributed processing and recovery from violations; and (iii) present performance results for five applications using three datasets. Applications with a small dataset showed an improvement of 17% and those with a larger dataset an improvement of 16%, giving an overall 11% improvement in performance.
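A minimal sketch of the recovery idea behind such a framework (data layout and names are illustrative, not Lyuba's actual interfaces): speculative reads are logged with versions, revalidated before commit, and re-executed on a detected hazard:

    #include <atomic>
    #include <vector>

    struct VersionedCell { std::atomic<int> version{0}; int value = 0; };
    struct ReadLogEntry  { VersionedCell* cell; int seen_version; };

    // A speculative thread logs the version of everything it reads; before
    // committing, the log is revalidated, and a mismatch means another thread
    // wrote the datum (a data hazard), so the work is re-executed.
    bool validate(const std::vector<ReadLogEntry>& log) {
        for (const auto& e : log)
            if (e.cell->version.load(std::memory_order_acquire) != e.seen_version)
                return false;
        return true;
    }

    template <typename F>
    void run_speculative(F body) {
        std::vector<ReadLogEntry> log;
        body(log);                     // speculative attempt, recording reads
        if (!validate(log)) {          // hazard detected
            log.clear();
            body(log);                 // recovery: a real system would re-run
        }                              // non-speculatively or serialized
    }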
- Published
- 2014
- Full Text
- View/download PDF
47. An efficient parallel strategy for high-cost prefix operation.
- Author
-
Bahig, Hazem M. and Fathy, Khaled A.
- Subjects
PARALLEL algorithms ,COMPUTER vision ,SUFFIXES & prefixes (Grammar) ,SORTING (Electronic computers) ,MULTICORE processors ,MULTIPLICATION - Abstract
The prefix computation strategy is a fundamental technique used to solve many problems in computer science, such as sorting, clustering, and computer vision. A large number of parallel algorithms based on a variety of high-performance systems have been introduced. However, these algorithms do not consider the cost of the prefix computation operation. In this paper, we design a novel strategy for prefix computation to reduce the running time for high-cost operations such as multiplication. The proposed algorithm is based on (1) reducing the size of the partition and (2) keeping a fixed-size partition during all steps of the computation. Experiments on a multicore system for different array sizes and number sizes demonstrate that the proposed parallel algorithm reduces the running time of the best-known optimal parallel algorithm by 62.7–79.6% on average. Moreover, the proposed algorithm has high speedup and is more scalable than those in previous works. [ABSTRACT FROM AUTHOR]
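For contrast, the classic two-pass, fixed-partition parallel prefix that work in this area builds on, shown here with addition standing in for the high-cost operator:

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Each thread scans one fixed-size partition locally; partition totals are
    // then combined and the offsets applied in a second parallel pass. '+' stands
    // in for the associative high-cost operator (e.g., big-number multiplication).
    void parallel_prefix_sum(std::vector<long long>& a, int nthreads) {
        const size_t n = a.size(), chunk = (n + nthreads - 1) / nthreads;
        std::vector<long long> partial(nthreads, 0);
        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {           // pass 1: local scans
                size_t lo = t * chunk, hi = std::min(n, lo + chunk);
                for (size_t i = lo + 1; i < hi; ++i) a[i] += a[i - 1];
                if (hi > lo) partial[t] = a[hi - 1];
            });
        for (auto& th : pool) th.join();
        pool.clear();
        for (int t = 1; t < nthreads; ++t)       // sequential offset combine
            partial[t] += partial[t - 1];
        for (int t = 1; t < nthreads; ++t)
            pool.emplace_back([&, t] {           // pass 2: apply offsets
                size_t lo = t * chunk, hi = std::min(n, lo + chunk);
                for (size_t i = lo; i < hi; ++i) a[i] += partial[t - 1];
            });
        for (auto& th : pool) th.join();
    }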
- Published
- 2021
- Full Text
- View/download PDF
48. Co-Design of Multicore Hardware and Multithreaded Software for Thread Performance Assessment on an FPGA
- Author
-
George K. Adam
- Subjects
multicore ,multithreading ,performance evaluation ,real-time systems ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Multicore and multithreaded architectures increase the performance of computing systems. The increase in cores and threads, however, raises further issues in the efficiency achieved in terms of speedup and parallelization, particularly for the real-time requirements of Internet of Things (IoT) embedded applications. This research investigates the efficiency of a 32-core field-programmable gate array (FPGA) architecture, with memory management unit (MMU) and real-time operating system (OS) support, in exploiting the thread-level parallelism (TLP) of tasks running in parallel as threads on multiple cores. The research outcomes confirm the feasibility of the proposed approach for the efficient execution of recursive sorting algorithms, as well as their evaluation in terms of speedup and parallelization. The results reveal that parallel implementation of the prevalent merge sort and quicksort algorithms on this platform is more efficient. The speedup increases in proportion to core scaling, reaching a maximum of 53% for the configuration with the highest number of cores and threads. However, the maximum parallelization (66%) was found to be bounded at a low count of two cores and four threads; a further increase in the number of cores and threads did not improve parallelism.
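A minimal sketch of the kind of depth-limited, thread-parallel recursive merge sort such an evaluation exercises; the depth cap bounds the total number of threads:

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Spawn a thread per left half until the depth cap is reached (bounding the
    // number of concurrent threads), then sort sequentially and merge.
    void parallel_merge_sort(std::vector<int>& v, size_t lo, size_t hi, int depth) {
        if (hi - lo < 2) return;
        size_t mid = lo + (hi - lo) / 2;
        if (depth > 0) {
            std::thread left([&] { parallel_merge_sort(v, lo, mid, depth - 1); });
            parallel_merge_sort(v, mid, hi, depth - 1);
            left.join();
        } else {
            parallel_merge_sort(v, lo, mid, 0);
            parallel_merge_sort(v, mid, hi, 0);
        }
        std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
    }
    // e.g., parallel_merge_sort(data, 0, data.size(), 2) runs up to 4 branches.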
- Published
- 2022
- Full Text
- View/download PDF
49. Performance Evaluation of Massively Parallel Systems Using SPEC OMP Suite
- Author
-
Dheya Mustafa
- Subjects
performance measurement ,Xeon Phi ,Blue Gene ,multicore ,many integrated cores ,performance analysis ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Performance analysis plays an essential role in achieving scalable application performance on massively parallel supercomputers equipped with thousands of processors. This paper is an in-depth empirical study of the performance of two of the most common high-performance computing architectures in the world. IBM has developed three generations of Blue Gene supercomputers (Blue Gene/L, P, and Q) that use low-power processors at large scale to achieve high performance, with better CPU core efficiency enabled by a higher level of integration that gains more parallelism per processing element. The Intel Xeon Phi coprocessor, on the other hand, armed with 61 on-chip x86 cores, provides high theoretical peak performance as well as software development flexibility with existing high-level programming tools. We present an extensive evaluation of the performance peaks and scalability of these two modern architectures using the SPEC OMP benchmarks.
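For orientation, a minimal example of the OpenMP parallel-for-with-reduction pattern that the SPEC OMP benchmarks stress at much larger scale:

    #include <cstdio>
    #include <omp.h>

    int main() {
        const int n = 1 << 24;
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum) schedule(static)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / (1.0 + static_cast<double>(i));
        // Thread count is controlled externally, e.g. OMP_NUM_THREADS=64.
        std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
        return 0;
    }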
- Published
- 2022
- Full Text
- View/download PDF
50. Parallelization of Frequent Itemset Mining Methods with FP-tree: An Experiment with PrePost+ Algorithm.
- Author
-
Jamsheela, Olakara and Gopalakrishna, Raju
- Published
- 2021
- Full Text
- View/download PDF