Author: "Schuiki, Fabian" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Schuiki, Fabian"' showing total 39 results

1. Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

Author: Scheffler, Paul, Zaruba, Florian, Schuiki, Fabian, Hoefler, Torsten, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an existing memory-streaming RISC-V ISA extension to accelerate both one- and two-sided operand sparsity on widespread sparse tensor formats like compressed sparse row (CSR) and compressed sparse fiber (CSF) by accelerating the underlying operations of streaming indirection, intersection, and union. Our extensions enable single-core speedups over an optimized RISC-V baseline of up to 7.0x, 7.7x, and 9.8x on sparse-dense multiply, sparse-sparse multiply, and sparse-sparse addition, respectively, and peak FPU utilizations of up to 80% on sparse-dense problems. On an eight-core cluster, sparse-dense and sparse-sparse matrix-vector multiply using real-world matrices are up to 4.9x and 5.9x faster and up to 2.9x and 3.0x more energy efficient. We explore further applications for our extensions, such as stencil codes and graph pattern matching. Compared to recent CPU, GPU, and accelerator approaches, our extensions enable higher flexibility on data representation, degree of sparsity, and dataflow at a minimal hardware footprint, adding only 1.8% in area to a compute cluster. A cluster with our extensions running CSR matrix-vector multiplication achieves 9.9x and 1.7x higher peak floating-point utilizations than recent highly optimized sparse data structures and libraries for CPU and GPU, respectively, even when accounting for off-chip main memory (HBM) and on-chip interconnect latency and bandwidth effects.
Comment: 15 pages, 8 figures. Accepted for publication in IEEE TPDS
Published: 2023
Full Text: View/download PDF

2. Implementing CNN Layers on the Manticore Cluster-Based Many-Core Architecture

Author: Kurth, Andreas, Schuiki, Fabian, and Benini, Luca
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, C.4, C.1.4, F.2.1, I.2
Abstract: This document presents implementations of fundamental convolutional neural network (CNN) layers on the Manticore cluster-based many-core architecture and discusses their characteristics and trade-offs.
Comment: Technical report. 18 pages, 4 figures, 5 algorithms
Published: 2021

3. Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Author: Scheffler, Paul, Zaruba, Florian, Schuiki, Fabian, Hoefler, Torsten, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix product kernels using our hardware, enabling single-core FPU utilizations of up to 80% and speedups of up to 7.2x over an optimized baseline without extensions. A matrix-vector implementation on a multi-core cluster is up to 5.8x faster and 2.7x more energy-efficient with our kernels than an optimized baseline. We propose further uses for our indirection hardware, such as scatter-gather operations and codebook decoding, and compare our work to state-of-the-art CPU, GPU, and accelerator approaches, measuring a 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a GTX 1080 Ti GPU running a cuSPARSE kernel.
Comment: 6 pages, 4 figures. Submitted to DATE 2021. Camera-ready version
Published: 2020

4. An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication

Author: Kurth, Andreas, Rönninger, Wolfgang, Benz, Thomas, Cavalcante, Matheus, Schuiki, Fabian, Zaruba, Florian, and Benini, Luca
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing, B.4.3, C.1.2, C.5.4
Abstract: On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous many-cores and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research area. In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes components to build and link subnetworks with customizable bandwidth and concurrency properties and adheres to a state-of-the-art, industry-standard protocol. We discuss microarchitectural trade-offs and timing/area characteristics of our modules and show that they can be composed to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end on-chip communication fabrics (not only network switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML training accelerator, where our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores.
Comment: 14 pages, 24 figures, 4 tables
Published: 2020
Full Text: View/download PDF

5. Manticore: A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing

Author: Zaruba, Florian, Schuiki, Fabian, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Data-parallel problems demand ever growing floating-point (FP) operations per second under tight area- and energy-efficiency constraints. In this work, we present Manticore, a general-purpose, ultra-efficient chiplet-based architecture for data-parallel FP workloads. We have manufactured a prototype of the chiplet's computational core in Globalfoundries 22FDX process and demonstrate more than 5x improvement in energy efficiency on FP intensive workloads compared to CPUs and GPUs. The compute capability at high energy and area efficiency is provided by Snitch clusters containing eight small integer cores, each controlling a large FPU. The core supports two custom ISA extensions: The SSR extension elides explicit load and store instructions by encoding them as register reads and writes. The FREP extension decouples the integer core from the FPU allowing floating-point instructions to be issued independently. These two extensions allow the single-issue core to minimize its instruction fetch bandwidth and saturate the instruction bandwidth of the FPU, achieving FPU utilization above 90%, with more than 40% of core area dedicated to the FPU.
Published: 2020

6. FPnew: An Open-Source Multi-Format Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

Author: Mach, Stefan, Schuiki, Fabian, Zaruba, Florian, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: The slowdown of Moore's law and the power wall necessitate a shift towards finely tunable precision (a.k.a. transprecision) computing to reduce energy footprint. Hence, we need circuits capable of performing floating-point operations on a wide range of precisions with high energy-proportionality. We present FPnew, a highly configurable open-source transprecision floating-point unit (TP-FPU) capable of supporting a wide range of standard and custom FP formats. To demonstrate the flexibility and efficiency of FPnew in general-purpose processor architectures, we extend the RISC-V ISA with operations on half-precision, bfloat16, and an 8-bit FP format, as well as SIMD vectors and multi-format operations. Integrated into a 32-bit RISC-V core, our TP-FPU can speed up execution of mixed-precision applications by 1.67x w.r.t. an FP32 baseline, while maintaining end-to-end precision and reducing system energy by 37%. We also integrate FPnew into a 64-bit RISC-V core, supporting five FP formats on scalars or 2, 4, or 8-way SIMD vectors. For this core, we measured the silicon manufactured in Globalfoundries 22FDX technology across a wide voltage range from 0.45V to 1.2V. The unit achieves leading-edge measured energy efficiencies between 178 Gflop/sW (on FP64) and 2.95 Tflop/sW (on 8-bit mini-floats), and a performance between 3.2 Gflop/s and 25.3 Gflop/s.
Published: 2020

7. LLHD: A Multi-level Intermediate Representation for Hardware Description Languages

Author: Schuiki, Fabian, Kurth, Andreas, Grosser, Tobias, and Benini, Luca
Subjects: Computer Science - Programming Languages
Abstract: Modern Hardware Description Languages (HDLs) such as SystemVerilog or VHDL are, due to their sheer complexity, insufficient to transport designs through modern circuit design flows. Instead, each design automation tool lowers HDLs to its own Intermediate Representation (IR). These tools are monolithic and mostly proprietary, disagree in their implementation of HDLs, and while many redundant IRs exist, no IR today can be used through the entire circuit design flow. To solve this problem, we propose the LLHD multi-level IR. LLHD is designed as a simple, unambiguous reference description of a digital circuit, yet fully captures existing HDLs. We show this with our reference compiler on designs as complex as full CPU cores. LLHD comes with lowering passes to a hardware-near structural IR, which readily integrates with existing tools. LLHD establishes the basis for innovation in HDLs and tools without redundant compilers or disjoint IRs. For instance, we implement an LLHD simulator that runs up to 2.4x faster than commercial simulators but produces equivalent, cycle-accurate results. An initial vertically-integrated research prototype is capable of representing all levels of the IR, implements lowering from the behavioural to the structural IR, and covers a sufficient subset of SystemVerilog to support a full CPU design.
Published: 2020

8. Snitch: A tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Author: Zaruba, Florian, Schuiki, Fabian, Hoefler, Torsten, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Data-parallel applications, such as data analytics, machine learning, and scientific computing, are placing an ever-growing demand on floating-point operations per second on emerging systems. With increasing integration density, the quest for energy efficiency becomes the number one design concern. While dedicated accelerators provide high energy efficiency, they are over-specialized and hard to adjust to algorithmic changes. We propose an architectural concept that tackles the issues of achieving extreme energy efficiency while still maintaining high flexibility as a general-purpose compute engine. The key idea is to pair a tiny 10kGE control core, called Snitch, with a double-precision FPU to adjust the compute to control ratio. While traditionally minimizing non-FPU area and achieving high floating-point utilization has been a trade-off, with Snitch, we achieve them both, by enhancing the ISA with two minimally intrusive extensions: stream semantic registers (SSR) and a floating-point repetition instruction (FREP). SSRs allow the core to implicitly encode load/store instructions as register reads/writes, eliding many explicit memory instructions. The FREP extension decouples the floating-point and integer pipeline by sequencing instructions from a micro-loop buffer. These ISA extensions significantly reduce the pressure on the core and free it up for other tasks, making Snitch and FPU effectively dual-issue at a minimal incremental cost of 3.2%. The two low overhead ISA extensions make Snitch more flexible than a contemporary vector processor lane, achieving a 2x energy-efficiency improvement. We have evaluated the proposed core and ISA extensions on an octa-core cluster in 22nm technology. We achieve more than 5x multi-core speed-up and a 3.5x gain in energy efficiency on several parallel microkernels.
Published: 2020
Full Text: View/download PDF

9. Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores

Author: Schuiki, Fabian, Zaruba, Florian, Hoefler, Torsten, and Benini, Luca
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Single-issue processor cores are very energy efficient but suffer from the von Neumann bottleneck, in that they must explicitly fetch and issue the loads/stores necessary to feed their ALU/FPU. Each instruction spent on moving data is a cycle not spent on computation, limiting ALU/FPU utilization to 33% on reductions. We propose "Stream Semantic Registers" to boost utilization and increase energy efficiency. SSR is a lightweight, non-invasive RISC-V ISA extension which implicitly encodes memory accesses as register reads/writes, eliminating a large number of loads/stores. We implement the proposed extension in the RTL of an existing multi-core cluster and synthesize the design for a modern 22nm technology. Our extension provides a significant, 2x to 5x, architectural speedup across different kernels at a small 11% increase in core area. Sequential code runs 3x faster on a single core, and 3x fewer cores are needed in a cluster to achieve the same performance. The utilization increase to almost 100% leads to a 2x energy efficiency improvement in a multi-core cluster. The extension reduces instruction fetches by up to 3.5x and instruction cache power consumption by up to 5.6x. Compilers can automatically map loop nests to SSRs, making the changes transparent to the programmer.
Published: 2019

10. Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI

Author: Cavalcante, Matheus, Schuiki, Fabian, Zaruba, Florian, Schaffner, Michael, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 x 256 double precision matrix multiplication on sixteen lanes. Ara runs at more than 1 GHz in the typical corner (TT/0.80V/25 °C) achieving a performance up to 33 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 41 DP-GFLOPS/W under the same conditions, which is slightly superior to similar vector processors found in literature. An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.
Comment: 13 pages. Accepted for publication in IEEE Transactions on Very Large Scale Integration Systems
Published: 2019
Full Text: View/download PDF

11. NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22nm FD-SOI

Author: Schuiki, Fabian, Schaffner, Michael, and Benini, Luca
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture
Abstract: Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads such as Deep Learning are becoming widespread in SoC platforms, from GPUs to mobile SoCs. In this paper we revisit NTX (an efficient accelerator developed for training Deep Neural Networks at scale) as a generalized MAC and reduction streaming engine. The architecture consists of a set of 32 bit floating-point streaming co-processors that are loosely coupled to a RISC-V core in charge of orchestrating data movement and computation. Post-layout results of a recent silicon implementation in 22 nm FD-SOI technology show the accelerator's capability to deliver up to 20 Gflop/s at 1.25 GHz and 168 mW. Based on these results we show that a version of NTX scaled down to 14 nm can achieve a 3x energy efficiency improvement over contemporary GPUs at 10.4x less silicon area, and a compute performance of 1.4 Tflop/s for training large state-of-the-art networks with full floating-point precision. An extended evaluation of MAC-intensive kernels shows that NTX can consistently achieve up to 87% of its peak performance across general reduction workloads beyond machine learning. Its modular architecture enables deployment at different scales ranging from high-performance GPU-class to low-power embedded scenarios.
Comment: 6 pages, invited paper at DATE 2019
Published: 2018

12. A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

Author: Schuiki, Fabian, Schaffner, Michael, Gürkaynak, Frank K., and Benini, Luca
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture
Abstract: Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by 7x over previously published results; (ii) an optimized IEEE754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) evaluation of near-memory computing with NTX embedded into residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a 2.7x energy efficiency improvement of NTX over contemporary GPUs at 4.4x less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95% parallel and energy efficiency, while providing 2.1x energy savings or 3.1x performance improvement over a GPU-based system.
Comment: 14 pages, submitted to IEEE Transactions on Computers journal
Published: 2018

13. Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

Author: Scheffler, Paul, primary, Zaruba, Florian, additional, Schuiki, Fabian, additional, Hoefler, Torsten, additional, and Benini, Luca, additional
Published: 2023
Full Text: View/download PDF

14. Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Author: Zaruba, Florian, primary, Schuiki, Fabian, additional, Hoefler, Torsten, additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

15. Banshee: A Fast LLVM-Based RISC-V Binary Translator

Author: Riedel, Samuel, primary, Schuiki, Fabian, additional, Scheffler, Paul, additional, Zaruba, Florian, additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

16. A 10-core SoC with 20 Fine-Grain Power Domains for Energy-Proportional Data-Parallel Processing over a Wide Voltage and Temperature Range

Author: Benz, Thomas, primary, Bertaccini, Luca, additional, Zaruba, Florian, additional, Schuiki, Fabian, additional, Gurkaynak, Frank K., additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

17. Streaming Architectures for Extreme Energy Efficiency in High-Performance Computing

Author: Schuiki, Fabian, Benini, Luca, and Batten, Christopher
Subjects: Electric engineering, ddc:621.3
Published: 2021
Full Text: View/download PDF

18. An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication.

Author: Kurth, Andreas, Ronninger, Wolfgang, Benz, Thomas, Cavalcante, Matheus, Schuiki, Fabian, Zaruba, Florian, and Benini, Luca
Subjects: SWITCHING systems (Telecommunication), COMPUTER architecture, COMMUNICATION infrastructure, MULTIPROCESSORS, BANDWIDTHS, SYSTEMS on a chip
Abstract: On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous many-cores and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research area. In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes components to build and link subnetworks with customizable bandwidth and concurrency properties and adheres to a state-of-the-art, industry-standard protocol. We discuss microarchitectural trade-offs and timing/area characteristics of our modules and show that they can be composed to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end on-chip communication fabrics (not only network switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML training accelerator, where our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores.
Published: 2022
Full Text: View/download PDF

19. FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

Author: Mach, Stefan, primary, Schuiki, Fabian, additional, Zaruba, Florian, additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

20. Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing

Author: Zaruba, Florian, primary, Schuiki, Fabian, additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

21. Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Author: Scheffler, Paul, primary, Zaruba, Florian, additional, Schuiki, Fabian, additional, Hoefler, Torsten, additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

22. Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores

Author: Schuiki, Fabian, primary, Zaruba, Florian, additional, Hoefler, Torsten, additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

23. An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication

Author: Kurth, Andreas, primary, Ronninger, Wolfgang, additional, Benz, Thomas, additional, Cavalcante, Matheus, additional, Schuiki, Fabian, additional, Zaruba, Florian, additional, and Benini, Luca, additional
Published: 2021
Full Text: View/download PDF

24. Live Demonstration: Exploiting Body-Biasing for Static Corner Trimming and Maximum Energy Efficiency Operation in 22nm FDX Technology

Author: Di Mauro, Alfio, primary, Zaruba, Florian, additional, Schuiki, Fabian, additional, Mach, Stefan, additional, and Benini, Luca, additional
Published: 2020
Full Text: View/download PDF

25. A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing

Author: Zaruba, Florian, primary, Schuiki, Fabian, additional, and Benini, Luca, additional
Published: 2020
Full Text: View/download PDF

26. Replication Package for Paper: LLHD: A Multi-level Intermediate Representation for Hardware Description Languages

Author: Schuiki, Fabian, primary, Kurth, Andreas, additional, Grosser, Tobias, additional, and Benini, Luca, additional
Published: 2020
Full Text: View/download PDF

27. LLHD: a multi-level intermediate representation for hardware description languages

Author: Schuiki, Fabian, primary, Kurth, Andreas, additional, Grosser, Tobias, additional, and Benini, Luca, additional
Published: 2020
Full Text: View/download PDF

28. Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems

Author: Cavalcante, Matheus, primary, Kurth, Andreas, additional, Schuiki, Fabian, additional, and Benini, Luca, additional
Published: 2020
Full Text: View/download PDF

29. XwattPilot: A Full-stack Cloud System Enabling Agile Development of Transprecision Software for Low-power SoCs

Author: Diamantopoulos, Dionysios, primary, Scheidegger, Florian, additional, Mach, Stefan, additional, Schuiki, Fabian, additional, Haugou, Germain, additional, Schaffner, Michael, additional, Gurkaynak, Frank K., additional, Hagleitner, Christoph, additional, Malossi, A. Cristiano I., additional, and Benini, Luca, additional
Published: 2020
Full Text: View/download PDF

30. Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor With Multiprecision Floating-Point Support in 22-nm FD-SOI

Author: Cavalcante, Matheus, primary, Schuiki, Fabian, additional, Zaruba, Florian, additional, Schaffner, Michael, additional, and Benini, Luca, additional
Published: 2020
Full Text: View/download PDF

31. The Floating Point Trinity: A Multi-modal Approach to Extreme Energy-Efficiency and Performance

Author: Zaruba, Florian, Schuiki, Fabian, Mach, Stefan, and Benini, Luca
Abstract: 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS), ISBN:978-1-7281-0996-1, ISBN:978-1-7281-0997-8
Published: 2019
Full Text: View/download PDF

32. The Floating Point Trinity: A Multi-modal Approach to Extreme Energy-Efficiency and Performance

Author: Zaruba, Florian, primary, Schuiki, Fabian, additional, Mach, Stefan, additional, and Benini, Luca, additional
Published: 2019
Full Text: View/download PDF

33. NTX: A 260 Gflop/sW Streaming Accelerator for Oblivious Floating-Point Algorithms in 22 nm FD-SOI

Author: Schuiki, Fabian, primary, Schaffner, Michael, additional, and Benini, Luca, additional
Published: 2019
Full Text: View/download PDF

34. A 0.80pJ/flop, 1.24Tflop/sW 8-to-64 bit Transprecision Floating-Point Unit for a 64 bit RISC-V Processor in 22nm FD-SOI

Author: Mach, Stefan, primary, Schuiki, Fabian, additional, Zaruba, Florian, additional, and Benini, Luca, additional
Published: 2019
Full Text: View/download PDF

35. A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

Author: Schuiki, Fabian, primary, Schaffner, Michael, additional, Gurkaynak, Frank K., additional, and Benini, Luca, additional
Published: 2019
Full Text: View/download PDF

36. NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22 nm FD-SOI

Author: Schuiki, Fabian, primary, Schaffner, Michael, additional, and Benini, Luca, additional
Published: 2019
Full Text: View/download PDF

37. Needle in a Haystack

Author: Asadpour, Mahdi, primary, Burger, Mario, additional, Schuiki, Fabian, additional, and Hummel, Karin Anna, additional
Published: 2015
Full Text: View/download PDF

38. Live Demonstration: Exploiting Body-Biasing for Static Corner Trimming and Maximum Energy Efficiency Operation in 22nm FDX Technology

Author: Di Mauro, Alfio, Zaruba, Florian, Schuiki, Fabian, Mach, Stefan, and Benini, Luca
Subjects: Systems-on-Chip, Forcing (recursion theory), Computer science, Electronic engineering, Process (computing), Biasing, Trimming, Frequency scaling, Efficient energy use, Graphical user interface, Voltage
Abstract: To provide high computational capabilities and, at the same time, minimize power consumption, modern Systems-on-Chip (SoCs) target very low energy consumption per operation as a primary objective. This goal has been achieved in recent years by adopting simple, yet very effective strategies like aggressive voltage and frequency scaling. However, the process variations that affect highly scaled technology nodes represent a severe limitation to the application of such techniques [2], forcing digital designers to account for significant supply voltage margins to guarantee sign-off frequencies [1]. In this demo, we show how the effect of process variations can be statically mitigated on a chip fabricated in 22nm FDX technology, thanks to the application of a Body-Biasing (BB) voltage, which is capable of trimming the performance of the circuit.
Published: 2020

39. XwattPilot: A Full-stack Cloud System Enabling Agile Development of Transprecision Software for Low-power SoCs

Author: Diamantopoulos, Dionysios, Scheidegger, Florian, Mach, Stefan, Schuiki, Fabian, Haugou, Germain, Schaffner, Michael, Gurkaynak, Frank K., Hagleitner, Christoph, Malossi, A. Cristiano I., and Benini, Luca
Subjects: Transprecision, RISC-V, FPGA, Cache-Coherent, CAPI, Virtual Prototyping, Cloud, Energy Efficiency, Computer science, Software development, Cloud computing, Software, Embedded system, Performance improvement, Von Neumann architecture, Agile software development, 010302 applied physics, 0103 physical sciences, 01 natural sciences, 020202 computer hardware & architecture, 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology
Abstract: The performance improvement rate of conventional von Neumann processors has slowed as Moore's Law grinds to an economic halt, giving rise to a new age of heterogeneity for energy-efficient computing. Extending processors with finely tunable precision instructions has emerged as a form of heterogeneity that trades off computation precision against power consumption. However, the prolonged design time due to customization of the supported framework for a system-on-a-chip may counteract the advantages of transprecision computing. We propose XwattPilot, a system aiming at accelerating the transprecision software development of low-power processors using cloud technology. We show that the total energy-to-solution can be significantly decreased by using transprecision computations, whereas the proposed system can accelerate the energy-efficiency evaluation runtime by 10.3x.
Published: 2020
