9,440 results on '"Instruction set"'
Search Results
2. A Hybrid Sparse-dense Defensive DNN Accelerator Architecture against Adversarial Example Attacks.
- Author
-
Wang, Xingbin, Zhao, Boyan, Su, Yulan, Zhang, Sisi, Yuan, Fengkai, Zhang, Jun, Meng, Dan, and Hou, Rui
- Subjects
RELIABILITY in engineering ,RESOURCE management ,SYSTEM safety ,SYNCHRONIZATION ,ARTIFICIAL intelligence - Abstract
Understanding how to defend against adversarial attacks is crucial for ensuring the safety and reliability of these systems in real-world applications. Various adversarial defense methods have been proposed that aim to improve the robustness of neural networks against adversarial attacks by changing the model structure, adding detection networks, or adding adversarial purification networks. However, deploying adversarial defense methods on existing DNN accelerators or defensive accelerators raises many key issues. To address these challenges, this article proposes sDNNGuard, an elastic heterogeneous DNN accelerator architecture that can efficiently orchestrate the simultaneous execution of the original (target) DNN and the detection algorithm or network. It not only supports dense DNN detection algorithms, but also allows sparse DNN defense methods and other mixed dense-sparse (e.g., dense-dense and sparse-dense) workloads to fully exploit the benefits of sparsity. sDNNGuard also includes a CPU core that supports non-DNN computing and special neural-network layers, and that handles the conversion of weights and activation values to sparse storage formats. To reduce off-chip traffic and improve resource utilization, a new hardware abstraction with elastic on-chip buffer/computing resource management is proposed to achieve a dynamic resource scheduling mechanism. We propose an extended AI instruction set for neural network synchronization, task scheduling, and efficient data interaction. Experimental results show that sDNNGuard can effectively validate the legitimacy of input samples in parallel with the target DNN model, achieving an average 1.42× speedup compared with state-of-the-art accelerators. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
3. Hardware for Resilient Computing
- Author
-
Schagaev, Igor, Gutknecht, Jürg, Schagaev, Igor, and Gutknecht, Jürg
- Published
- 2024
- Full Text
- View/download PDF
4. Design of High-performance Heterogeneous Integrated Circuits
- Author
-
Melikyan, Vazgen and Melikyan, Vazgen
- Published
- 2024
- Full Text
- View/download PDF
5. Computer System Design
- Author
-
LaMeres, Brock J. and LaMeres, Brock J.
- Published
- 2024
- Full Text
- View/download PDF
6. Computer Systems
- Author
-
LaMeres, Brock J. and LaMeres, Brock J.
- Published
- 2023
- Full Text
- View/download PDF
7. Design of a Graph Convolutional Neural Network Accelerator Based on RISC-V [基于RISC-V 的图卷积神经网络加速器设计].
- Author
-
周 理, 赵祉乔, 潘国腾, 铁俊波, and 赵 王
- Abstract
Graph Convolutional Networks (GCN), an algorithm for processing non-Euclidean data, are currently implemented mainly on deep learning frameworks such as PyTorch with GPU acceleration. GCN computation involves nested matrix multiplications and data access operations, which GPUs can satisfy in real time but at high deployment cost and low energy efficiency. To improve the computational performance of the GCN algorithm while maintaining software flexibility, this paper proposes a custom GCN accelerator based on a RISC-V SoC, which extends the dot-product operation with a hardware accelerator through hardware-software co-design on the Hummingbird E203 SoC platform. Analysis of the neural network parameters determines the hardware quantization scheme from floating point to 32-bit fixed point. Experimental results show that the proposed accelerator incurs no accuracy loss and achieves a maximum speedup of 6.88 times when running the GCN algorithm on the Cora dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
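The entry above mentions quantizing network parameters from floating point to a 32-bit fixed-point format. As a minimal illustration of that kind of conversion, the C sketch below uses a Q16.16 layout; the fractional-bit split and the function names are assumptions chosen for illustration, not details taken from the paper.

```c
#include <stdint.h>
#include <math.h>
#include <stdio.h>

#define FRAC_BITS 16  /* assumed Q16.16 split: 16 integer bits, 16 fractional bits */

/* Convert a float to 32-bit fixed point with rounding and saturation. */
static int32_t quantize_q16_16(float x)
{
    double scaled = (double)x * (double)(1 << FRAC_BITS);
    if (scaled >= 2147483647.0)  return INT32_MAX;
    if (scaled <= -2147483648.0) return INT32_MIN;
    return (int32_t)lrint(scaled);
}

/* Convert a 32-bit fixed-point value back to float. */
static float dequantize_q16_16(int32_t q)
{
    return (float)q / (float)(1 << FRAC_BITS);
}

int main(void)
{
    float w = 0.34375f;                     /* example weight */
    int32_t q = quantize_q16_16(w);
    printf("%f -> 0x%08x -> %f\n", w, (unsigned)q, dequantize_q16_16(q));
    return 0;
}
```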
8. Hardware Acceleration for SLAM in Mobile Systems.
- Author
-
Fan, Zhe, Hao, Yi-Fan, Zhi, Tian, Guo, Qi, and Du, Zi-Dong
- Subjects
INDUSTRIAL robots ,ARM microprocessors ,ROBOT industry ,MOBILE robots ,HARDWARE - Abstract
The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping (SLAM) problem. However, existing SLAM platforms have difficulty meeting the real-time and low-power requirements imposed by mobile systems. Though specialized hardware is promising for achieving high performance and lowering power, designing an efficient accelerator for SLAM is severely hindered by the wide variety of SLAM algorithms. Based on our detailed analysis of representative SLAM algorithms, we observe that SLAM algorithms pose two challenges for designing efficient hardware accelerators: the large number of computational primitives and irregular control flows. To address these two challenges, we propose a hardware accelerator that features composable computation units classified as matrix, vector, scalar, and control units. In addition, we design a hierarchical instruction set for coping with a broad range of SLAM algorithms with irregular control flows. Experimental results show that, compared against an Intel x86 processor, our accelerator, with an area of 7.41 mm², achieves on average 10.52× and 112.62× better performance and energy savings, respectively, across different datasets. Compared against a more energy-efficient ARM Cortex processor, our accelerator still achieves 33.03× and 62.64× better performance and energy savings, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
9. New Processor Architecture and Its Use in Mobile Application Development
- Author
-
Fojtik, Rostislav, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, and Antipova, Tatiana, editor
- Published
- 2022
- Full Text
- View/download PDF
10. THE IMPLEMENTATION OF A HIGHLY CONFIGURABLE CONTROL STANDARD IN THE DEVELOPMENT OF A ROBOTICS PLATFORM FOR THE INSPECTION OF CONFINED SPACES.
- Author
-
Albei, Victor-Eduard, Ilie, Cristinel, Popa, Marius, Tănase, Nicolae, Ovezea, Dragoș, Constantin, Alexandru, Nedelcu, Adrian, and Berindei, Adelin-Marian
- Subjects
- *
ROBOTICS , *CONFINED spaces (Work environment) , *DATA transmission systems , *CONTROL theory (Engineering) , *AUTOMATION - Abstract
This paper describes a general-purpose control standard and demonstrates its implementation as part of a controller-effector assembly. It elaborates on the specific layers on which the standard is defined: an adaptive control structure, a data transmission protocol, and its corresponding instruction set. The aforementioned effector component consists of a reduced-form-factor robotics platform capable of remote-controlled movement and optical inspection. [ABSTRACT FROM AUTHOR]
- Published
- 2023
11. A Reconfigurable Architecture to Implement Linear Transforms of Image Processing Applications
- Author
-
Sanyal, Atri, Sinha, Amitabha, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Bhattacharjee, Debotosh, editor, Kole, Dipak Kumar, editor, Dey, Nilanjan, editor, Basu, Subhadip, editor, and Plewczynski, Dariusz, editor
- Published
- 2021
- Full Text
- View/download PDF
12. Embedded Microprocessor Systems Basics
- Author
-
Meyer-Baese, Uwe and Meyer-Baese, Uwe
- Published
- 2021
- Full Text
- View/download PDF
13. Rapid FPGA-Based Implementation of a Customized RISC-V Processor [基于FPGA 快速实现定制化RISC-V 处理器].
- Author
-
陆 松, 蒋句平, and 任会峰
- Abstract
With the rise of the open RISC-V instruction set, a number of open-source and commercial soft cores have emerged and are used in fields such as IoT hardware, embedded systems, artificial intelligence chips, security devices, and high-performance computers. Better balancing performance, power consumption, and chip area requires an instruction set that can be easily tailored and extended and that is supported by the software development environment. To this end, this paper proposes a quick customization method for RISC-V processors, covering the addition of custom instructions, extension of ALU functional units, connection of control signals and data paths, FPGA prototype verification, customization of the cross compiler, and application testing. Taking matrix calculation acceleration as an example, a custom instruction for the vector inner product is designed on the open-source IP Hummingbird E203, and prototype verification is completed on an FPGA. The matrix calculation benchmark shows that the performance of the customized RISC-V processor is significantly improved; for matrix multiplication, the speedup reaches 5.3 to 7.6×. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
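The abstract above describes extending the Hummingbird E203 with a custom vector inner-product instruction. The C sketch below is only the scalar reference computation that such an instruction would replace; the fixed-point element type and the function name are illustrative assumptions, not taken from the paper.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Scalar reference for a vector inner product: one multiply-accumulate per
 * element. A custom instruction of the kind described would fold this loop
 * (or a fixed-width slice of it) into a single extended-ALU operation. */
static int64_t dot_product_ref(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];
    return acc;
}

int main(void)
{
    int32_t a[4] = {1, 2, 3, 4};
    int32_t b[4] = {5, 6, 7, 8};
    printf("dot = %lld\n", (long long)dot_product_ref(a, b, 4));  /* 70 */
    return 0;
}
```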
14. Compilation and Wear Leveling for Programmable Logic-in-Memory Architecture
- Author
-
Shirinzadeh, Saeideh, Drechsler, Rolf, Shirinzadeh, Saeideh, and Drechsler, Rolf
- Published
- 2020
- Full Text
- View/download PDF
15. Optimization of beam pointing algorithm based on PowerPC
- Author
-
Lei Shulan, Wu Huixiang, and Li Wenxue
- Subjects
beam pointing ,powerpc architecture ,cordic algorithm ,instruction set ,Electronics ,TK7800-8360 - Abstract
Based on the PowerPC architecture, this paper proposes an optimization strategy for a beam pointing algorithm, realized through trigonometric function calculation optimization, floating-point arithmetic optimization, loop nesting optimization, and PowerPC instruction optimization. With the proposed optimizations, the processing time of the algorithm is reduced to one tenth of the original. The proposed optimization strategy also provides guidance and a reference for algorithm development and optimization on other platforms.
- Published
- 2021
- Full Text
- View/download PDF
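The subject terms for the entry above include the CORDIC algorithm, a standard way to optimize trigonometric evaluation on processors such as the PowerPC targets described. The sketch below is a generic fixed-point CORDIC sine/cosine routine in C, not the authors' implementation; the Q16.16 format and the iteration count are assumptions, and an arithmetic right shift is assumed for signed values (true on common compilers).

```c
#include <stdint.h>
#include <stdio.h>

#define CORDIC_ITERS 16
#define Q 16                                   /* Q16.16 fixed point (assumed) */
#define TO_FIX(x) ((int32_t)((x) * (1 << Q)))

/* atan(2^-i) for i = 0..15, in Q16.16 radians */
static const int32_t atan_tab[CORDIC_ITERS] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
      256,   128,    64,   32,   16,    8,    4,   2
};

/* Rotation-mode CORDIC: start from (K, 0) and rotate by 'angle'.
 * K = prod 1/sqrt(1 + 2^-2i) ~= 0.607253 pre-compensates the CORDIC gain. */
static void cordic_sincos(int32_t angle, int32_t *cos_out, int32_t *sin_out)
{
    int32_t x = TO_FIX(0.607252935), y = 0, z = angle;
    for (int i = 0; i < CORDIC_ITERS; i++) {
        int32_t dx = y >> i, dy = x >> i;      /* x*2^-i, y*2^-i (old values) */
        if (z >= 0) { x -= dx; y += dy; z -= atan_tab[i]; }
        else        { x += dx; y -= dy; z += atan_tab[i]; }
    }
    *cos_out = x;
    *sin_out = y;
}

int main(void)
{
    int32_t c, s;
    cordic_sincos(TO_FIX(0.5), &c, &s);        /* angle = 0.5 rad, |angle| <= pi/2 */
    printf("cos(0.5) ~= %f, sin(0.5) ~= %f\n",
           c / 65536.0, s / 65536.0);          /* expect ~0.8776 and ~0.4794 */
    return 0;
}
```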
16. Towards Integration of a Dedicated Memory Controller and Its Instruction Set to Improve Performance of Systems Containing Computational SRAM.
- Author
-
Mambu, Kévin, Charles, Henri-Pierre, Kooli, Maha, and Dumas, Julie
- Subjects
STATIC random access memory ,DYNAMIC random access memory ,DATA structures ,MEMORY - Abstract
In-memory computing (IMC) aims to solve the performance gap between CPU and memories introduced by the memory wall. However, it does not address the energy wall problem caused by data transfer over memory hierarchies. This paper proposes the data-locality management unit (DMU) to efficiently transfer data from a DRAM memory to a computational SRAM (C-SRAM) memory allowing IMC operations. The DMU is tightly coupled within the C-SRAM and allows one to align the data structure in order to perform effective in-memory computation. We propose a dedicated instruction set within the DMU to issue data transfers. The performance evaluation of a system integrating C-SRAM within the DMU compared to a reference scalar system architecture shows an increase from ×5.73 to ×11.01 in speed-up and from ×29.49 to ×46.67 in energy reduction, versus a system integrating C-SRAM without any transfer mechanism compared to a reference scalar system architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
17. Application of WINDLX Simulator in Teaching Practice to Solve the Structural and Control Related in the Pipeline
- Author
-
Jingmei, Li, Yanxia, Wu, Guoyin, Zhang, Chaoguang, Men, Chunguang, Ma, Xiang, Li, Changting, Shi, Akan, Ozgur, Series Editor, Bellavista, Paolo, Series Editor, Cao, Jiannong, Series Editor, Coulson, Geoffrey, Series Editor, Dressler, Falko, Series Editor, Ferrari, Domenico, Series Editor, Gerla, Mario, Series Editor, Kobayashi, Hisashi, Series Editor, Palazzo, Sergio, Series Editor, Sahni, Sartaj, Series Editor, Shen, Xuemin (Sherman), Series Editor, Stan, Mircea, Series Editor, Xiaohua, Jia, Series Editor, Zomaya, Albert Y., Series Editor, Liu, Shuai, editor, Glowatz, Matt, editor, Zappatore, Marco, editor, Gao, Honghao, editor, Jia, Bing, editor, and Bucciero, Alberto, editor
- Published
- 2018
- Full Text
- View/download PDF
18. Application of WINDLX Simulator in Teaching Practice to Solve the Data-Related in the Pipeline
- Author
-
Jingmei, Li, Akan, Ozgur, Series Editor, Bellavista, Paolo, Series Editor, Cao, Jiannong, Series Editor, Coulson, Geoffrey, Series Editor, Dressler, Falko, Series Editor, Ferrari, Domenico, Series Editor, Gerla, Mario, Series Editor, Kobayashi, Hisashi, Series Editor, Palazzo, Sergio, Series Editor, Sahni, Sartaj, Series Editor, Shen, Xuemin (Sherman), Series Editor, Stan, Mircea, Series Editor, Xiaohua, Jia, Series Editor, Zomaya, Albert Y., Series Editor, Liu, Shuai, editor, Glowatz, Matt, editor, Zappatore, Marco, editor, Gao, Honghao, editor, Jia, Bing, editor, and Bucciero, Alberto, editor
- Published
- 2018
- Full Text
- View/download PDF
19. Research on a Real-Time Machine Model and a Time-Semantic Instruction Set [实时机模型及时间语义指令集研究].
- Author
-
陈香兰, 李曦, 汪超, and 周学海
- Abstract
In mixed-criticality systems, applications with different security and time criticality share computing resources. Because of various uncertainties in the system, designers need a tightly timed design method that can simultaneously satisfy multiple design constraints such as functional behavior certainty, timing behavior predictability, and high computing performance, which challenges the theories and methods of existing computer architectures and programming languages. A real-time machine model, RTM, and a time-triggered instruction set, TTI, which support time semantics, are proposed as an important foundation for constructing a multi-tier tight timing design method, MTTT. A helicopter flight control program is used as an example to illustrate the effectiveness of RTM and the TTI instruction set. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
20. Implementation of a Software-Programmable FPGA Network Measurement Engine [软件可编程的 FPGA 网络测量引擎技术实现].
- Author
-
晏子杰, 王京梅, 陈卓, and 刘宇
- Abstract
In order to solve the problems of large resource overhead and coarse granularity in existing network transmission and switching performance monitoring schemes, an implementation of a software-programmable FPGA network measurement engine is proposed. First, the measurement controller preprocesses the input rules, compiles them into a custom instruction set, and sends them to the data collection points in each network node. Then, each data collection point processes the received instructions in a pipelined manner to measure the network flows. The proposed solution involves key technologies such as pre-processing of network-flow measurement rules and a pipelined high-speed processing engine with programmable hardware measurement rules, and it can be used for efficient measurement under complex rule definitions. Board-level verification is performed by injecting network flows with different parameters into the system. The verification results show that the designed system can correctly receive and process the custom instruction set issued by the measurement controller and achieve the measurement function. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
21. Towards Integration of a Dedicated Memory Controller and Its Instruction Set to Improve Performance of Systems Containing Computational SRAM
- Author
-
Kévin Mambu, Henri-Pierre Charles, Maha Kooli, and Julie Dumas
- Subjects
in-memory computing ,energy modeling ,non-von neumann ,instruction set ,compilation ,stencils ,Applications of electric power ,TK4001-4102 - Abstract
In-memory computing (IMC) aims to solve the performance gap between CPU and memories introduced by the memory wall. However, it does not address the energy wall problem caused by data transfer over memory hierarchies. This paper proposes the data-locality management unit (DMU) to efficiently transfer data from a DRAM memory to a computational SRAM (C-SRAM) memory allowing IMC operations. The DMU is tightly coupled within the C-SRAM and allows one to align the data structure in order to perform effective in-memory computation. We propose a dedicated instruction set within the DMU to issue data transfers. The performance evaluation of a system integrating C-SRAM within the DMU compared to a reference scalar system architecture shows an increase from ×5.73 to ×11.01 in speed-up and from ×29.49 to ×46.67 in energy reduction, versus a system integrating C-SRAM without any transfer mechanism compared to a reference scalar system architecture.
- Published
- 2022
- Full Text
- View/download PDF
22. Design and implementation of RISC-V assembler supporting vector instructions.
- Author
-
DENG Ping, ZHU Xiao-long, SUN Hai-yan, and Ren Yi
- Abstract
Vector computing can effectively improve the computing efficiency of computers and reduce unnecessary hardware overhead. With the improvement of CPU computing capability, the expansion of the number of registers, and other hardware development trends, vector computing has become one of the widely used technologies for improving CPU performance. The RISC-V architecture, which is attracting great attention, also needs vector technology to improve its performance. The open-source RISC-V assembler supports only standard instructions and, until now, has not supported vector instructions. In order to support RISC-V vector instructions, this paper details the design and implementation of a RISC-V assembler supporting vector instructions. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
23. EVALUATION OF CUSTOM VIRTUAL MACHINE INSTRUCTION SET EMULATOR.
- Author
-
KOSTELANSKÝ, Jozef and DEDERA, Ľubomír
- Subjects
SIMPLE machines ,ALGORITHMS ,DIRECT instruction ,TEACHING - Abstract
The main goal of the article is to evaluate performance characteristics of a custom virtual machine instruction set emulator. The instruction set has been designed as part of research aimed at utilization of custom virtual machines in the area of obfuscation techniques for software protection and malware detection, with the aim to efficiently implement the particular algorithm (CRC16). In the paper we compare performance characteristics of two implementations of the CRC16 algorithm - in the emulated custom virtual machine instruction set and the direct C-to-x86-compiled executable. The aim is to show that the emulation process of such a simple virtual machine has only minor influence on execution time in comparison with the C-to-x86-compiled code. [ABSTRACT FROM AUTHOR]
- Published
- 2020
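For reference, the CRC16 algorithm implemented by the custom virtual machine in the entry above can be written in a few lines of plain C. The paper's abstract does not state which CRC16 variant is used, so the CRC-16/CCITT-FALSE parameters below (polynomial 0x1021, initial value 0xFFFF) are an assumption chosen for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection, no xorout. */
static uint16_t crc16_ccitt_false(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    const char *msg = "123456789";
    /* the standard check value for CRC-16/CCITT-FALSE over "123456789" is 0x29B1 */
    printf("crc16 = 0x%04X\n", crc16_ccitt_false((const uint8_t *)msg, strlen(msg)));
    return 0;
}
```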
24. Hardware: The ERRIC Architecture
- Author
-
Schagaev, Igor, Kaegi-Trachsel, Thomas, Schagaev, Igor, and Kaegi-Trachsel, Thomas
- Published
- 2016
- Full Text
- View/download PDF
25. Embedded Computing
- Author
-
N. Makarov, Sergey, Ludwig, Reinhold, Bitar, Stephen J., N. Makarov, Sergey, Ludwig, Reinhold, and Bitar, Stephen J.
- Published
- 2016
- Full Text
- View/download PDF
26. A Lightweight Posit Processing Unit for RISC-V Processors in Deep Neural Network Applications
- Author
-
Federico Rossi, Sergio Saponara, Emanuele Ruffaldi, and Marco Cococcioni
- Subjects
Artificial neural network ,business.industry ,Computer science ,Toolchain ,Computer Science Applications ,Human-Computer Interaction ,Instruction set ,Task (computing) ,Software ,Computer architecture ,RISC-V ,Computer Science (miscellaneous) ,Key (cryptography) ,Circuit complexity ,business ,Information Systems - Abstract
Nowadays, two groundbreaking factors are emerging in neural networks. First, there is the RISC-V open instruction set architecture (ISA), which allows seamless implementation of custom instruction sets. Second, there are several novel formats for real-number arithmetic. In this work, we combine these two key aspects using the very promising posit format, developing a lightweight Posit Processing Unit (PPU). We present an extension of the base RISC-V ISA that allows conversion between 8- or 16-bit posits and 32-bit IEEE floats or fixed-point formats, in order to offer a compressed representation of real numbers with little to no accuracy degradation. We then elaborate on the hardware and software toolchain integration of our PPU inside the Ariane RISC-V core and its toolchain, showing how little it impacts circuit complexity and power consumption. Indeed, only 0.36% of the circuit is devoted to the PPU, while the full RISC-V core occupies 33% of the overall circuit complexity. Finally, we present the impact of our PPU-light on a deep neural network task, reporting speedups of up to 10× on sample inference processing time.
- Published
- 2022
- Full Text
- View/download PDF
27. Exploiting Reuse for GPU Subgraph Enumeration
- Author
-
Wentian Guo, Yuchen Li, and Kian-Lee Tan
- Subjects
Instruction set ,Set (abstract data type) ,Computational Theory and Mathematics ,Computer science ,Computation ,Enumeration ,Pattern matching ,Parallel computing ,Graphics ,Reuse ,Data structure ,Computer Science Applications ,Information Systems - Abstract
Subgraph enumeration is important for many applications such as network motif discovery, community detection, and frequent subgraph mining. To accelerate the execution, recent works utilize graphics processing units (GPUs) to parallelize subgraph enumeration. The performances of these parallel schemes are dominated by the set intersection operations which account for up to 95% of the total processing time. (Un)surprisingly, a significant portion (as high as 99%) of these operations is actually redundant, i.e., the same set of vertices is repeatedly encountered and evaluated. Therefore, in this paper, we seek to salvage and recycle the results of such operations to avoid repeated computation. Our solution consists of two phases. In the first phase, we generate a reusable plan that determines the opportunity for reuse. The plan is based on a novel reuse discovery mechanism that can identify available results to prevent redundant computation. In the second phase, the plan is executed to produce the subgraph enumeration results. This processing is based on a newly designed reusable parallel search strategy that can efficiently maintain and retrieve the results of set intersection operations. Our implementation on GPUs shows that our approach can achieve up to 5× speedups compared with the state-of-the-art GPU solutions.
- Published
- 2022
- Full Text
- View/download PDF
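The abstract above attributes most of the enumeration time to set intersections over adjacency lists and proposes reusing their results. The C sketch below shows only the core primitive, a two-pointer intersection of sorted neighbor lists; the reuse described in the paper would cache such results instead of recomputing them, and that caching layer (as well as the GPU parallelization) is omitted here.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Two-pointer intersection of two sorted vertex lists; writes common vertices
 * to 'out' and returns how many were found. This is the primitive whose
 * repeated evaluation the reuse-based approach tries to avoid. */
static size_t intersect_sorted(const uint32_t *a, size_t na,
                               const uint32_t *b, size_t nb,
                               uint32_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])        i++;
        else if (a[i] > b[j])   j++;
        else { out[k++] = a[i]; i++; j++; }
    }
    return k;
}

int main(void)
{
    /* neighbor lists of two vertices (sorted ascending) */
    uint32_t adj_u[] = {1, 3, 5, 7, 9};
    uint32_t adj_v[] = {2, 3, 4, 7, 8};
    uint32_t common[5];
    size_t n = intersect_sorted(adj_u, 5, adj_v, 5, common);
    printf("%zu common neighbors\n", n);   /* 2: vertices 3 and 7 */
    return 0;
}
```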
28. Exploring Data Analytics Without Decompression on Embedded GPU Systems
- Author
-
Feng Zhang, Onur Mutlu, Xiaoyong Du, Zaifeng Pan, Yanliang Zhou, Xipeng Shen, and Jidong Zhai
- Subjects
Lossless compression ,Speedup ,Computer science ,business.industry ,Memory pool ,Instruction set ,Computational Theory and Mathematics ,Parallel processing (DSP implementation) ,Hardware and Architecture ,Embedded system ,Signal Processing ,Synchronization (computer science) ,Data analysis ,business ,Efficient energy use - Abstract
With the development of computer architecture, even for embedded systems, GPU devices can be integrated, providing outstanding performance and energy efficiency to meet the requirements of different industries, applications, and deployment environments. Data analytics is an important application scenario for embedded systems. Unfortunately, due to the limited capacity of embedded devices, the scale of problems an embedded system can handle is restricted. In this paper, we propose a novel data analytics method, called G-TADOC, for efficient text analytics directly on compression on embedded GPU systems. A large amount of data can be compressed and stored in embedded systems, and can be processed directly in the compressed state, which greatly enhances the processing capabilities of the systems. Particularly, G-TADOC has three innovations. First, a novel fine-grained thread-level workload scheduling strategy for GPU threads has been developed, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, a GPU thread-safe memory pool has been developed to handle inconsistency with low synchronization overheads. Third, a sequence-support strategy is provided to maintain high GPU parallelism while ensuring sequence information for lossless compression. Moreover, G-TADOC involves special optimizations for embedded GPUs, such as utilizing the CPU-GPU shared unified memory. Experiments show that G-TADOC provides 13.2× average speedup compared to the state-of-the-art TADOC. G-TADOC also improves performance-per-cost by 2.6× and energy efficiency by 32.5× over TADOC.
- Published
- 2022
- Full Text
- View/download PDF
29. On a Consistency Testing Model and Strategy for Revealing RISC Processor’s Dark Instructions and Vulnerabilities
- Author
-
Yuze Wang, Yingtao Jiang, Xiaohang Wang, Peng Liu, and Weidong Wang
- Subjects
Reduced instruction set computing ,Programming language ,Computer science ,Code coverage ,computer.file_format ,Space (commercial competition) ,computer.software_genre ,Theoretical Computer Science ,Test (assessment) ,Instruction set ,Consistency (database systems) ,Computational Theory and Mathematics ,Hardware and Architecture ,Encoding (memory) ,Executable ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,computer ,Software - Abstract
As reduced instruction set computing (RISC) processors are widely used nowadays, to meet the requirement that no secret instructions be included in the processor ISA or implemented in the processor micro-architecture, a consistency testing approach capable of revealing any possible dark instructions (i.e., executable instructions without clear definitions) in RISC processors has been proposed; it comes in three phases. During the generation phase, based on the instruction set encoding rules, all the undefined instructions are generated. Even with a smaller test space, this step guarantees the test coverage needed to reveal all possible dark instructions that exist. In the next phase, all the undefined instructions obtained from the previous phase are executed on the processor under test, following some persistence strategies; any instruction exhibiting a normal execution result will be deemed suspicious and recorded as such. During the last analysis phase, each of those recorded suspicious instructions will be checked and analyzed to decide whether it truly constitutes a dark instruction. We have applied the proposed testing model and strategy to several RISC processors and found that all of them have a few previously unknown dark instructions. The potential vulnerabilities introduced by these dark instructions have thus been evaluated and exposed.
- Published
- 2022
- Full Text
- View/download PDF
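The generation phase described above enumerates encodings that the ISA leaves undefined. The C sketch below illustrates the idea for the RISC-V 32-bit format by walking the 7-bit major-opcode space and flagging opcodes absent from a table of defined ones; the table is deliberately partial (a subset of RV32I base opcodes), and a real generator, as the abstract notes, would also apply the per-opcode funct3/funct7 encoding rules, so treat this purely as an illustration, not the paper's tool.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Partial list of defined RV32I major opcodes (bits [6:0]); illustration only. */
static const uint8_t defined_opcodes[] = {
    0x03 /* LOAD   */, 0x13 /* OP-IMM */, 0x17 /* AUIPC  */, 0x23 /* STORE  */,
    0x33 /* OP     */, 0x37 /* LUI    */, 0x63 /* BRANCH */, 0x67 /* JALR   */,
    0x6F /* JAL    */, 0x73 /* SYSTEM */
};

static int opcode_is_defined(uint8_t op)
{
    for (size_t i = 0; i < sizeof defined_opcodes; i++)
        if (defined_opcodes[i] == op)
            return 1;
    return 0;
}

int main(void)
{
    /* 32-bit RISC-V instructions have bits [1:0] == 0b11, so walk bits [6:2]. */
    for (uint32_t upper = 0; upper < 32; upper++) {
        uint8_t op = (uint8_t)((upper << 2) | 0x3);
        if (!opcode_is_defined(op))
            printf("candidate undefined encoding: opcode 0x%02X, word 0x%08X\n",
                   op, (uint32_t)op);        /* remaining fields left as zero */
    }
    return 0;
}
```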
30. Instruction Set Optimization for Application Specific Processors
- Author
-
Ferger, Max, Hübner, Michael, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Kobsa, Alfred, editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Weikum, Gerhard, editor, Goehringer, Diana, editor, Santambrogio, Marco Domenico, editor, Cardoso, João M. P., editor, and Bertels, Koen, editor
- Published
- 2014
- Full Text
- View/download PDF
31. A reconfigurable computing architecture for 5G communication.
- Author
-
Guo, Yang, Liu, Zi-Jun, Yang, Lei, Li, Huan, and Wang, Dong-lin
- Published
- 2019
- Full Text
- View/download PDF
32. Workload Balancing via Graph Reordering on Multicore Systems
- Author
-
YuAng Chen and Yeh-Ching Chung
- Subjects
Instruction set ,Multi-core processor ,Speedup ,Computational Theory and Mathematics ,Hardware and Architecture ,Computer science ,Signal Processing ,Scalability ,Overhead (computing) ,Thread (computing) ,Cache ,Parallel computing ,Data structure - Abstract
In a shared-memory multicore system, the intrinsic irregular data structure of graphs leads to poor cache utilization, and therefore deteriorates the performance of graph analytics. To address the problem, prior works have proposed a variety of lightweight reordering methods with focus on the optimization of cache locality. However, there is a compromise between cache locality and workload balance. Little insight has been devoted into the issue of workload imbalance for the underlying multicore system, which degrades the effectiveness of parallel graph processing. In this work, a measurement approach is proposed to quantify the imbalance incurred by the concentration of vertices. Inspired by it, we present Cache-aware Reorder (Corder), a lightweight reordering method exploiting the cache hierarchy of multicore systems. At the shared-memory level, Corder promotes even distribution of computation loads amongst multicores. At the private-cache level, Corder facilitates cache efficiency by applying further refinement to local vertex order. Comprehensive performance evaluation of Corder is conducted on various graph applications and datasets. Experimental results show that Corder yields speedup of up to 2.59× and on average 1.45×, which significantly outperforms existing lightweight reordering methods. To identify the root causes of performance boost delivered by Corder, multicore activities are investigated in terms of thread behavior, cache efficiency, and memory utilization. Statistical analysis demonstrates that the issue of imbalanced thread execution time dominates other factors in determining the overall graph processing time. Moreover, Corder achieves remarkable advantages in cross-platform scalability and reordering overhead.
- Published
- 2022
- Full Text
- View/download PDF
33. Transparent Asynchronous Parallel I/O Using Background Threads
- Author
-
Houjun Tang, John Ravi, Suren Byna, and Quincey Koziol
- Subjects
Monitoring ,parallel I/O ,Computer science ,Test data generation ,Libraries ,computer.software_genre ,Computer Software ,Instruction set ,Instruction sets ,Asynchronous I/O ,Middleware ,background threads ,Communications Technologies ,Volume (computing) ,Byte ,Computational modeling ,Parallel I/O ,Computational Theory and Mathematics ,Hardware and Architecture ,Asynchronous communication ,POSIX ,Task analysis ,Signal Processing ,Operating system ,Asynchronous I/O ,Distributed Computing ,Connectors ,computer - Abstract
Moving toward exascale computing, the size of data stored and accessed by applications is ever increasing. However, traditional disk-based storage has not seen improvements that keep up with the explosion of data volume or the speed of processors. Multiple levels of non-volatile storage devices are being added to handle bursty I/O; however, moving data across the storage hierarchy can take longer than the data generation or analysis. Asynchronous I/O can reduce the impact of I/O latency as it allows applications to schedule I/O early and check its status later. I/O is thus overlapped with application communication or computation or both, effectively hiding some or all of the I/O latency. POSIX and MPI-I/O provide asynchronous read and write operations, but lack support for non-data operations such as file open and close. Users also have to manually manage data dependencies and use low-level byte offsets, which requires significant effort and expertise to adopt. In this article, we present an asynchronous I/O framework that supports all types of I/O operations, manages data dependencies transparently and automatically, provides implicit and explicit modes for application flexibility, and supports error information retrieval. We implemented these techniques in HDF5. Our evaluation of several benchmarks and application workloads demonstrates its effectiveness in hiding the I/O cost from the application.
- Published
- 2022
- Full Text
- View/download PDF
34. Critical Path Analysis through Hierarchical Distributed Virtualized Environments Using Host Kernel Tracing
- Author
-
Jason Puncher, Hani Nemati, Francois Tetreault, and Michel Dagenais
- Subjects
Computer Networks and Communications ,Computer science ,business.industry ,Cloud computing ,Parallel computing ,Tracing ,Virtualization ,computer.software_genre ,Computer Science Applications ,Instruction set ,Hardware and Architecture ,Kernel (statistics) ,business ,computer ,Host (network) ,Critical path method ,Software ,Information Systems - Published
- 2022
- Full Text
- View/download PDF
35. cuNH: Efficient GPU Implementations of Post-Quantum KEM NewHope
- Author
-
Jia Xu, Hongbing Wang, and Yiwen Gao
- Subjects
Computer science ,business.industry ,Concurrency ,Cryptography ,Parallel computing ,Instruction set ,Task (computing) ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Task analysis ,Overhead (computing) ,Key encapsulation ,SIMD ,business - Abstract
Post-quantum cryptography was proposed in past years due to the foreseeable emergence of quantum computers that are able to break the conventional public key cryptosystems at acceptable costs. However, post-quantum schemes are usually less efficient than conventional ones, which makes them less practical in scenarios with limited resources or high concurrency. Server-side applications always feature multiple users, therefore requiring efficient execution of batch tasks. The GPU is intrinsically well-suited to batch tasks owing to its SIMD/SIMT execution fashion, so it naturally helps to achieve high performance. However, a naive GPU-based implementation cannot make the best use of the hardware resources of the GPU regardless of task loads. In this article, we propose SIMD parallelization paradigms for fine-grained GPU implementations and then apply them to a post-quantum key encapsulation algorithm called NewHope, where we carefully design every module, especially NTT and inverse NTT, to fit into the SIMD parallelization paradigms. In addition, we employ multi-streaming to improve performance from the user's perspective. Finally, our evaluations are made on two testbeds with GPU accelerators NVIDIA GeForce MX150 and GeForce GTX 1650, respectively. The experimental results show that the fine-grained implementations save up to 98 percent latency at low task loads, and their throughputs increase by up to 86 percent at high task loads, when compared with the naive ones from the kernel's perspective; the multi-streaming implementations greatly reduce the latency overhead percentage at high task loads by up to 86 percent, when compared with the fine-grained implementation from the user's perspective. Moreover, our fine-grained implementation and multi-streaming implementation are respectively 51.5 and 45.5 percent faster than Gupta et al.'s implementations under reasonable assumptions. Furthermore, as lattice-based post-quantum schemes have similar operations, our proposal also easily applies to other lattice-based post-quantum schemes.
- Published
- 2022
- Full Text
- View/download PDF
36. Information Leakage Analysis Using a Co-Design-Based Fault Injection Technique on a RISC-V Microprocessor
- Author
-
Jim Plusquellic, Brian Dziki, Tom J. Mannos, and Donald E. Owen
- Subjects
Computer science ,business.industry ,Plaintext ,Fault injection ,Fault (power engineering) ,Computer Graphics and Computer-Aided Design ,Instruction set ,Programmable logic device ,Embedded system ,RISC-V ,Information leakage ,State (computer science) ,Electrical and Electronic Engineering ,business ,Software - Abstract
The RISC-V instruction set architecture open licensing policy has spawned a hive of development activity, making a range of implementations publicly available. The environments in which RISC-V operates have expanded correspondingly, driving the need for a generalized approach to evaluating the reliability of RISC-V implementations under adverse operating conditions or after normal wear-out periods. Fault injection (FI) refers to the process of changing the state of registers or wires, either permanently or momentarily, and then observing execution behavior. The analysis provides insight into the development of countermeasures that protect against the leakage or corruption of sensitive information which might occur because of unexpected execution behavior. In this paper, we develop a hardware-software co-design architecture that enables fast, configurable fault emulation and utilize it for information leakage and data corruption analysis. Modern System-on-chip FPGAs enable building an evaluation platform where control elements run on a processor(s) (PS) simultaneously with the target design running in the programmable logic (PL). Software components of the FI system introduce faults and report execution behavior. A pair of RISC-V FI-instrumented implementations are created and configured to execute the Advanced Encryption Standard and Twister algorithms. Key and plaintext information leakage and degraded pseudo-random sequences are both observed in the output for a subset of the emulated faults.
- Published
- 2022
- Full Text
- View/download PDF
37. Optimizing Depthwise Separable Convolution Operations on GPUs
- Author
-
Weizhe Zhang, Zheng Wang, and Gangzhao Lu
- Subjects
Instruction set ,Floating point ,Computational Theory and Mathematics ,Kernel (image processing) ,Hardware and Architecture ,Computer science ,Signal Processing ,Overhead (computing) ,Parallel computing ,Convolutional neural network ,Data type ,Integer (computer science) ,Convolution - Abstract
The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs), and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes with a large number of samples to be processed at once. Such approaches are inadequate for small-batch-sized model training and the typical scenario of model inference where the model takes in a few samples at once. This article aims to bridge the gap of optimizing depthwise separable convolutions by targeting the GPU architecture. We achieve this by designing two novel algorithms to improve the column and row reuse of the convolution operation to reduce the number of memory operations performed on the width and the height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads to improve GPU utilization and to hide the memory access latency. We apply our approach on two GPU platforms: an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and two data types: 32-bit floating point (FP32) and 8-bit integer (INT8). We compared our approach against cuDNN that is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2× (up to 3×) performance improvement over cuDNN. We show that, when using a moderate batch size, our approach averagely reduces the end-to-end training time of MobileNet and EfficientNet by 9.7 and 7.3 percent respectively, and reduces the end-to-end inference time of MobileNet and EfficientNet by 12.2 and 11.6 percent respectively.
- Published
- 2022
- Full Text
- View/download PDF
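To make the operation being optimized above concrete, the C sketch below is a plain single-threaded reference for a depthwise separable convolution: a per-channel 3x3 depthwise pass followed by a 1x1 pointwise pass. The memory layouts, unit stride, and lack of padding are simplifying assumptions; this is not the paper's GPU kernel or its tiling scheme.

```c
#include <stdio.h>

/* Depthwise separable convolution, reference version:
 *   depthwise: each input channel convolved with its own 3x3 filter
 *   pointwise: 1x1 convolution mixes the C intermediate channels into K outputs
 * No padding, stride 1, so spatial size shrinks from HxW to (H-2)x(W-2). */
static void depthwise_separable_conv(
    const float *in,  /* [C][H][W]           */
    const float *dw,  /* [C][3][3] depthwise */
    const float *pw,  /* [K][C]    pointwise */
    float *tmp,       /* [C][H-2][W-2]       */
    float *out,       /* [K][H-2][W-2]       */
    int C, int K, int H, int W)
{
    int oh = H - 2, ow = W - 2;
    for (int c = 0; c < C; c++)                       /* depthwise pass */
        for (int y = 0; y < oh; y++)
            for (int x = 0; x < ow; x++) {
                float acc = 0.0f;
                for (int ky = 0; ky < 3; ky++)
                    for (int kx = 0; kx < 3; kx++)
                        acc += in[(c * H + y + ky) * W + (x + kx)]
                             * dw[(c * 3 + ky) * 3 + kx];
                tmp[(c * oh + y) * ow + x] = acc;
            }
    for (int k = 0; k < K; k++)                       /* pointwise pass */
        for (int y = 0; y < oh; y++)
            for (int x = 0; x < ow; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)
                    acc += tmp[(c * oh + y) * ow + x] * pw[k * C + c];
                out[(k * oh + y) * ow + x] = acc;
            }
}

int main(void)
{
    enum { C = 2, K = 3, H = 4, W = 4 };
    float in[C * H * W], dw[C * 3 * 3], pw[K * C];
    float tmp[C * (H - 2) * (W - 2)], out[K * (H - 2) * (W - 2)];
    for (int i = 0; i < C * H * W; i++) in[i] = (float)i * 0.1f;
    for (int i = 0; i < C * 3 * 3; i++) dw[i] = 0.05f;
    for (int i = 0; i < K * C; i++)     pw[i] = 1.0f;
    depthwise_separable_conv(in, dw, pw, tmp, out, C, K, H, W);
    printf("out[0] = %f\n", out[0]);
    return 0;
}
```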
38. Design and Simulation Analysis of a Time Predictable Computer Architecture
- Author
-
Shivaraj, K. M. and Dharishini, P. Padma Priya
- Published
- 2015
39. Toward RISC-V CSR Compliance Testing
- Author
-
Rolf Drechsler, Niklas Bruns, Daniel Große, and Vladimir Herdt
- Subjects
General Computer Science ,Computer science ,business.industry ,Control (management) ,Compliance (psychology) ,Set (abstract data type) ,Instruction set ,Control and Systems Engineering ,RISC-V ,Corporate social responsibility ,Architecture specification ,Software engineering ,business ,Conformance testing - Abstract
Recently, the critical compliance testing (CT) problem for reduced instruction set computer (RISC)-V has received significant attention. However, control and status registers (CSRs), which form the backbone of the RISC-V privileged architecture specification, have been mostly neglected in the CT effort so far. In this letter, we first analyze the RISC-V privileged architecture specification in detail to group the CSRs into different classes according to their functionality. Based on the classes and additional common CSR characteristics, we come up with a set of fundamental CSR tests. These partly automatically generated CSR tests allow checking the compliance of RISC-V simulators and cores. We found several previously unknown errors in numerous RISC-V simulators. The results demonstrate the necessity for extensive CSR testing to ensure compliance with the RISC-V specification.
- Published
- 2021
- Full Text
- View/download PDF
40. RF-RISA: A novel flexible random forest accelerator based on FPGA
- Author
-
Hui Yang, Shuang Zhao, Fei Wang, Ziling Wei, and Shuhui Chen
- Subjects
Scheme (programming language) ,Hardware architecture ,Computer Networks and Communications ,business.industry ,Computer science ,Process (computing) ,Control reconfiguration ,Theoretical Computer Science ,Random forest ,Instruction set ,Artificial Intelligence ,Hardware and Architecture ,business ,Field-programmable gate array ,Throughput (business) ,computer ,Software ,Computer hardware ,computer.programming_language - Abstract
Recently, FPGAs have been utilized to accelerate the Random Forest prediction process to meet the speed requirements of real-time tasks. However, the existing accelerators impose restrictions on the parameters of the accelerated model. The accelerators have to be reconfigured to adapt to a model whose parameters exceed the predefined restrictions. When these accelerators are applied in scenarios where the model updates or switches frequently, non-trivial time overhead and maintenance costs may be introduced. To solve the above problem, a flexible accelerator, RF-RISA (Random Forest Reduced Instruction Set Accelerator), is presented in this paper. Compared with the existing accelerators, RF-RISA eliminates all the restrictions by decoupling the model parameters from its hardware implementation. Specifically, RF-RISA encodes the model information into a group of instructions, which are then stored in memory rather than hardcoded in the hardware. Meanwhile, a mapping scheme is proposed to map the instructions into the memory dynamically. Finally, a new hardware architecture is designed to support pipelined computing. The theoretical analysis and experimental results show that the proposed RF-RISA can accelerate a wide range of RF models without reconfiguration. At the same time, it can achieve the same throughput as the state-of-the-art.
- Published
- 2021
- Full Text
- View/download PDF
41. Instruction-Set Accelerated Implementation of CRYSTALS-Kyber
- Author
-
Mojtaba Bisheh-Niasar, Reza Azarderakhsh, and Mehran Mozaffari-Kermani
- Subjects
Hardware architecture ,Computer science ,business.industry ,020208 electrical & electronic engineering ,Cryptography ,02 engineering and technology ,Instruction set ,Computer Science::Hardware Architecture ,Computer engineering ,0202 electrical engineering, electronic engineering, information engineering ,Cryptosystem ,Quantum algorithm ,Electrical and Electronic Engineering ,Elliptic curve cryptography ,business ,Key exchange ,Computer Science::Cryptography and Security ,Quantum computer - Abstract
Large-scale quantum computers will break classical public-key cryptography protocols using quantum algorithms such as Shor's algorithm. Hence, designing quantum-safe cryptosystems to replace current classical algorithms is crucial. Luckily, there are some post-quantum candidates that are assumed to be resistant against future attacks from quantum computers, and NIST is considering standardizing them. Among these candidates, lattice-based cryptography is more attractive than the others due to its performance results as well as confidence in its security. There are few works in the literature evaluating the performance of lattice-based cryptography in hardware. In this paper, we focus on the Cryptographic Suite for Algebraic Lattices (CRYSTALS) key exchange mechanism known as Kyber, provide an instruction-set hardware architecture, and implement it on a Xilinx Artix-7 FPGA for performance evaluation and testing. Our proposed architecture provides an efficient and high-performance set of components to perform polynomial sampling, number-theoretic transform (NTT), and point-wise multiplication to speed up lattice-based post-quantum cryptography (PQC). This architecture implemented on ASIC outperforms state-of-the-art implementations.
- Published
- 2021
- Full Text
- View/download PDF
42. Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads
- Author
-
Fabian Schuiki, Florian Zaruba, Torsten Hoefler, Luca Benini, Zaruba F., Schuiki F., Hoefler T., and Benini L.
- Subjects
FOS: Computer and information sciences ,Floating point ,Computer science ,Pipeline (computing) ,RISC-V ,02 engineering and technology ,Parallel computing ,020202 computer hardware & architecture ,Theoretical Computer Science ,Vector processor ,Instruction set ,Computational Theory and Mathematics ,Hardware and Architecture ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,general purpose ,Computer Science - Hardware Architecture ,many-core ,energy efficiency ,Software ,Gate equivalent ,Efficient energy use ,Integer (computer science) - Abstract
Data-parallel applications, such as data analytics, machine learning, and scientific computing, are placing an ever-growing demand on floating-point operations per second on emerging systems. With increasing integration density, the quest for energy efficiency becomes the number one design concern. While dedicated accelerators provide high energy efficiency, they are over-specialized and hard to adjust to algorithmic changes. We propose an architectural concept that tackles the issues of achieving extreme energy efficiency while still maintaining high flexibility as a general-purpose compute engine. The key idea is to pair a tiny 10kGE (kilo gate equivalent) control core, called Snitch, with a double-precision floating-point unit (FPU) to adjust the compute to control ratio. While traditionally minimizing non-floating-point unit (FPU) area and achieving high floating-point utilization has been a trade-off, with Snitch, we achieve them both, by enhancing the ISA with two minimally intrusive extensions: stream semantic registers (SSR) and a floating-point repetition instruction (FREP). SSRs allow the core to implicitly encode load/store instructions as register reads/writes, eliding many explicit memory instructions. The FREP extension decouples the floating-point and integer pipeline by sequencing instructions from a micro-loop buffer. These ISA extensions significantly reduce the pressure on the core and free it up for other tasks, making Snitch and FPU effectively dual-issue at a minimal incremental cost of 3.2 percent. The two low overhead ISA extensions make Snitch more flexible than a contemporary vector processor lane, achieving a 2× energy-efficiency improvement. We have evaluated the proposed core and ISA extensions on an octa-core cluster in 22 nm technology. We achieve more than 6× multi-core speed-up and a 3.5× gain in energy efficiency on several parallel microkernels.
- Published
- 2021
- Full Text
- View/download PDF
43. Interactions, Impacts, and Coincidences of the First Golden Age of Computer Architecture
- Author
-
John R. Mashey
- Subjects
Unix ,Reduced instruction set computing ,Computer science ,Supercomputer ,Minicomputer ,law.invention ,Instruction set ,System programming ,Computer architecture ,Hardware and Architecture ,law ,Server ,Electrical and Electronic Engineering ,Turing ,computer ,Software ,computer.programming_language - Abstract
In their 2018 Turing Award lecture and 2019 paper, John Hennessy and David Patterson reviewed computer architecture progress since the 1960s. They projected a second golden age akin to the first, approximately 1986–1996, when new instruction set architectures, almost all reduced instruction set computers (RISCs), revolutionized the industry, eliminated most minicomputer vendors, rivaled mainframes, and began a takeover of supercomputing. The C language and derivatives came to pervade systems programming, whereas Unix derivatives came to run many servers, desktops, and smartphones. Such outcomes were not inevitable but depended on evolutionary interactions of computer architecture and languages, industry dynamics, and sometimes random coincidences.
- Published
- 2021
- Full Text
- View/download PDF
44. The Origin of Intel's Micro-Ops
- Author
-
Robert P. Colwell
- Subjects
Instruction set ,Out-of-order execution ,Reduced instruction set computing ,Hardware and Architecture ,Computer science ,business.industry ,x86 ,Electrical and Electronic Engineering ,Architecture ,Software engineering ,business ,Software ,Microarchitecture - Abstract
It was a different computing world in the late 1980s. Many if not most researchers in the computer architecture area had become convinced that complex instruction sets such as the Intel x86 were doomed in light of the many advantages promised by reduced instruction set architecture publications. There were many voices within Intel urging upper management to abandon x86 and get started on some alternative. Even engineers who had worked on Intel's then-flagship 486 were expressing serious reservations about whether the x86 architecture could be “dragged further up the hill” to be, if not directly competitive with emerging RISC designs, at least close enough for x86 to remain profitable.
- Published
- 2021
- Full Text
- View/download PDF
45. MIPSGPU: Minimizing Pipeline Stalls for GPUs With Non-Blocking Execution
- Author
-
Chao Yu, Yuebin Bai, and Rui Wang
- Subjects
Memory hierarchy ,Computer science ,Pipeline (computing) ,Instruction scheduling ,Task parallelism ,Parallel computing ,Blocking (computing) ,Theoretical Computer Science ,Instruction set ,Computational Theory and Mathematics ,Hardware and Architecture ,Latency (engineering) ,Performance improvement ,Software - Abstract
Improving the latency hiding ability is important for GPU performance. Although existing works, which mainly target either improving thread-level parallelism or optimizing the memory hierarchy, are effective at improving GPUs' latency hiding ability, warps are still blocked after executing long latency operations, reducing the number of schedulable warps. This article revisits the recently proposed non-blocking execution for GPUs to improve the latency hiding ability of GPUs. With non-blocking execution, instructions from warps blocked by long latency operations can be pre-executed to make full use of GPU resources. However, we find that the state-of-the-art non-blocking GPU architecture gains limited performance improvement. Through in-depth analysis, we observe that the poor performance is largely due to inefficient pre-execution state management, duplicate instruction extraction, frequent early eviction, and severe resource congestion. To make non-blocking execution actually useful for GPUs and minimize hardware overheads, we carefully redesign the non-blocking architecture for GPUs based on our analysis and propose MIPSGPU . Our evaluations show that MIPSGPU , relative to the state-of-the-art non-blocking GPU architecture, improves the performance of memory-intensive applications by 19.05 percent, and reduces memory-to-SM traffic by 14 percent.
- Published
- 2021
- Full Text
- View/download PDF
46. A Generic GPU-Accelerated Framework for the Dial-A-Ride Problem
- Author
-
Song Guang Ho, Justin Dauwels, Ramesh Ramasamy Pandi, and Sarat Chandra Nagavarapu
- Subjects
Speedup ,Optimization problem ,Computer science ,business.industry ,Mechanical Engineering ,Distributed computing ,Tabu search ,Computer Science Applications ,Instruction set ,Automotive Engineering ,Operational planning ,Local search (optimization) ,business ,Metaheuristic ,Variable neighborhood search - Abstract
Accelerating the performance of optimization algorithms is crucial for many day-to-day applications. Mobility-on-demand is one such application that is transforming urban mobility by offering reliable and convenient on-demand door-to-door transportation at any time. Dial-a-ride problem (DARP) is an underlying optimization problem in the operational planning of mobility-on-demand systems. The primary objective of DARP is to design routes and schedules to serve passenger transportation requests with high-level user comfort. DARP often arises in dynamic real-world scenarios, where rapid route planning is essential. The traditional CPU-based algorithms are generally too slow to be useful in practice. Since customers expect quick response for their mobility requests, there has been a growing interest in fast solution methods. Therefore, in this paper, we introduce a GPU-based solution methodology for the dial-a-ride problem to produce good solutions in a short time. Specifically, we develop a GPU framework to accelerate time-critical neighborhood exploration of local search operations under the guidance of metaheuristics such as tabu search and variable neighborhood search. Besides, we propose device-oriented optimization strategies to enhance the utilization of a current-generation GPU architecture (Tesla P100). We report speedup achieved by our GPU approach when compared to its classical CPU counterpart, and the effect of each device optimization strategy on computational speedup. Results are based on standard test instances from the literature. Ultimately, the proposed GPU methodology generates better solutions in a short time when compared to the existing sequential approaches.
- Published
- 2021
- Full Text
- View/download PDF
47. DESIGNING THE PROCESSOR INSTRUCTION SET ON A PROGRAMMABLE LOGIC ARRAY.
- Author
-
DAN, ROTAR, GEORGE, CULEA, and DRAGOS, ANDRIOAIA
- Subjects
FIELD programmable gate arrays ,EMBEDDED computer systems ,FIELD programmable analog arrays - Abstract
The paper presents the authors' original contributions to the synthesis of embedded systems based on programmable logic arrays. An embedded system has one or more central units with a program structure. This allows the optimized design of the instruction set for that central unit. The paper presents the method used to design central units with a dedicated set of instructions. The advantages and disadvantages of the method are discussed and the design steps are presented. We discuss the shortcomings regarding program portability and show how to address them. The paper also analyzes the structure used for the dedicated instruction set. [ABSTRACT FROM AUTHOR]
- Published
- 2017
48. Instruction and Belief Effects on Sentential Reasoning
- Author
-
Matarazzo, Olimpia, Baldassarre, Ivana, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Esposito, Anna, editor, Esposito, Antonietta M., editor, Martone, Raffaele, editor, Müller, Vincent C., editor, and Scarpetta, Gaetano, editor
- Published
- 2011
- Full Text
- View/download PDF
49. IMPLEMENTATION OF THE RISC-V INSTRUCTION SET [ІМПЛЕМЕНТАЦІЯ НАБОРУ ІНСТРУКЦІЇ RISC-V]
- Subjects
microprocessors ,instruction set ,computer systems ,RISC-V ,high-performance computing systems
In the 1970s, during the active development of electronic computing technology, the development of computing systems went in two directions: high-performance computing systems and narrow-profile computing systems. High-performance systems are universal computing systems whose task is to maximize computing speed per unit of time. Narrow-profile computing systems aimed at performing certain types of tasks, where raw computing speed mattered less and other technical characteristics came to the fore: energy efficiency, product ergonomics, the need to perform only certain types of tasks, and so on. The type of task has become decisive in terms of the architectural implementation of the computing system. This paper considers the RISC-V architecture in order to assess the prospects of its introduction into the mass-market segment.
- Published
- 2022
50. Effective Runtime Management of Tasks and Priorities in GNU OpenMP Applications
- Author
-
Emiliano Silvestri, Alessandro Pellegrini, Pierangelo Di Sanzo, Francesco Quaglia, Silvestri, E., Pellegrini, A., Di Sanzo, P., and Quaglia, F.
- Subjects
Instruction set ,Multi-core computing ,Settore ING-INF/05 ,Proposal ,Message systems ,Message system ,Runtime environment ,operating system support ,Switche ,task parallelism ,Theoretical Computer Science ,Instruction sets ,Hardware ,Computational Theory and Mathematics ,Hardware and Architecture ,Task analysis ,Task analysi ,Proposals ,Switches ,Software - Abstract
OpenMP has become a reference standard for the design of parallel applications. This standard is evolving very fast, thus offering ever new opportunities to application programmers. However, OpenMP runtime environments are often not fully aligned with the actual requirements imposed by the evolution of the standard. Among the main shortcomings are: (a) a limited capability to effectively cope with task priorities, and (b) inadequacy in guaranteeing core properties while processing tasks, such as the so-called work-conservativeness, the ability of the OpenMP runtime environment to fully exploit the underlying multi-processor/multi-core machine by avoiding thread-blocking phases. In this article we present the design of extensions to the GNU OpenMP (GOMP) implementation, integrated into gcc, which allow the effective management of tasks and their priorities. Our proposal is based on a user-space library, modularly combined with the one already offered by GOMP, and an external kernel-level Linux module offering the opportunity to exploit emerging hardware facilities for the purpose of task/priority management. We also provide experimental results showing the effectiveness of our proposal, achieved by running either OpenMP common benchmarks or a new benchmark application (named Hashtag-Text) that we explicitly devised in order to stress the OpenMP runtime environment in relation to the above-mentioned task/priority management aspects.
- Published
- 2022