Author: "Benini, Luca" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Benini, Luca"' showing total 3,308 results

Start Over Author "Benini, Luca"

3,308 results on '"Benini, Luca"'

1. Design and Experimental Investigation of Trikarenos: A Fault-Tolerant 28nm RISC-V-based SoC

Author: Rogenmoser, Michael, Wiese, Philip, Forlin, Bruno Endres, Gürkaynak, Frank K., Rech, Paolo, Menicucci, Alessandra, Ottavi, Marco, and Benini, Luca
Subjects: Physics - Instrumentation and Detectors, Computer Science - Hardware Architecture, High Energy Physics - Experiment
Abstract: We present a fault-tolerant by-design RISC-V SoC and experimentally assess it under atmospheric neutrons and 200 MeV protons. The dedicated ECC and Triple-Core Lockstep countermeasures correct most errors, guaranteeing a device cross-section lower than $5.36 \times 10^{-12}$ cm$^2$., Comment: 4 pages (excluding title page), accepted at RADECS 2024
Published: 2024

2. Spatzformer: An Efficient Reconfigurable Dual-Core RISC-V V Cluster for Mixed Scalar-Vector Workloads

Author: Perotti, Matteo, Raeber, Michele, Sinigaglia, Mattia, Cavalcante, Matheus, Rossi, Davide, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Multi-core vector processor architectures excel in handling computationally intensive vectorizable tasks but struggle to achieve optimal resource utilization when facing sequential and control tasks that cannot be vectorized. This work presents Spatzformer, the first reconfigurable RISC-V V (RVV) architecture developed from a baseline open-source dual-core cluster based on Snitch scalar cores augmented with compact Spatz vector units. Spatzformer operates in two distinct modes: split mode, working as a dual-core vector architecture to handle vectorizable tasks concurrently, and merge mode, where two vector units are driven by a single scalar core, allowing the remaining scalar core to handle non-vectorizable control tasks. We implement Spatzformer in a 12-nm technology node and characterize the cost of the added architectural reconfigurability. We show that merge mode accelerates mixed scalar-vector kernels by up to 1.8x compared to split mode. Moreover, it accelerates the vector kernels that require fine-grained synchronization (such as FFT) by up to 20% with respect to the baseline. The reconfigurability features do not degrade the architecture's maximum frequency (1.2GHz, TT, 0.8V, 25C) and have a negligible area impact (+1.4%), with a worst-case energy efficiency drop of only 7% with respect to the non-reconfigurable baseline., Comment: To be published in the 2024 IEEE 35th International Conference on Application Specific Systems (ASAP), Architectures and Processors
Published: 2024

3. Ultra-Lightweight Collaborative Mapping for Robot Swarms

Author: Niculescu, Vlad, Polonelli, Tommaso, Magno, Michele, and Benini, Luca
Subjects: Computer Science - Robotics
Abstract: A key requirement in robotics is the ability to simultaneously self-localize and map a previously unknown environment, relying primarily on onboard sensing and computation. Achieving fully onboard accurate simultaneous localization and mapping (SLAM) is feasible for high-end robotic platforms, whereas small and inexpensive robots face challenges due to constrained hardware, therefore frequently resorting to external infrastructure for sensing and computation. The challenge is further exacerbated in swarms of robots, where coordination, scalability, and latency are crucial concerns. This work introduces a decentralized and lightweight collaborative SLAM approach that enables mapping on virtually any robot, even those equipped with low-cost hardware, including miniaturized insect-size devices. Moreover, the proposed solution supports large swarm formations with the capability to coordinate hundreds of agents. To substantiate our claims, we have successfully implemented collaborative SLAM on centimeter-size drones weighing only 46 grams. Remarkably, we achieve results comparable to high-end state-of-the-art solutions while reducing the cost, memory, and computation requirements by two orders of magnitude. Our approach is innovative in three main aspects. First, it enables onboard infrastructure-less collaborative mapping with a lightweight and cost-effective solution in terms of sensing and computation. Second, we optimize the data traffic within the swarm to support hundreds of cooperative agents using standard wireless protocols such as ultra-wideband (UWB), Bluetooth, or WiFi. Last, we implement a distributed swarm coordination policy to decrease mapping latency and enhance accuracy., Comment: 20 pages, 7 figures
Published: 2024

4. Tiny-PULP-Dronets: Squeezing Neural Networks for Faster and Lighter Inference on Multi-Tasking Autonomous Nano-Drones

Author: Lamberti, Lorenzo, Niculescu, Vlad, Barcis, Michał, Bellone, Lorenzo, Natalizio, Enrico, Benini, Luca, and Palossi, Daniele
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Pocket-sized autonomous nano-drones can revolutionize many robotic use cases, such as visual inspection in narrow, constrained spaces, and ensure safer human-robot interaction due to their tiny form factor and weight -- i.e., tens of grams. This compelling vision is challenged by the high level of intelligence needed aboard, which clashes against the limited computational and storage resources available on PULP (parallel-ultra-low-power) MCU class navigation and mission controllers that can be hosted aboard. This work moves from PULP-Dronet, a State-of-the-Art convolutional neural network for autonomous navigation on nano-drones. We introduce Tiny-PULP-Dronet: a novel methodology to squeeze by more than one order of magnitude model size (50x fewer parameters), and number of operations (27x less multiply-and-accumulate) required to run inference with similar flight performance as PULP-Dronet. This massive reduction paves the way towards affordable multi-tasking on nano-drones, a fundamental requirement for achieving high-level intelligence., Comment: 3 Figures, 1 table. Accepted for publication at IEEE Artificial Intelligence Circuits and Systems (AICAS), 2022
Published: 2024

5. BISeizuRe: BERT-Inspired Seizure Data Representation to Improve Epilepsy Monitoring

Author: Benfenati, Luca, Ingolfsson, Thorir Mar, Cossettini, Andrea, Pagliari, Daniele Jahier, Burrello, Alessio, and Benini, Luca
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: This study presents a novel approach for EEG-based seizure detection leveraging a BERT-based model. The model, BENDR, undergoes a two-phase training process. Initially, it is pre-trained on the extensive Temple University Hospital EEG Corpus (TUEG), a 1.5 TB dataset comprising over 10,000 subjects, to extract common EEG data patterns. Subsequently, the model is fine-tuned on the CHB-MIT Scalp EEG Database, consisting of 664 EEG recordings from 24 pediatric patients, of which 198 contain seizure events. Key contributions include optimizing fine-tuning on the CHB-MIT dataset, where the impact of model architecture, pre-processing, and post-processing techniques are thoroughly examined to enhance sensitivity and reduce false positives per hour (FP/h). We also explored custom training strategies to ascertain the most effective setup. The model undergoes a novel second pre-training phase before subject-specific fine-tuning, enhancing its generalization capabilities. The optimized model demonstrates substantial performance enhancements, achieving as low as 0.23 FP/h, 2.5$\times$ lower than the baseline model, with a lower but still acceptable sensitivity rate, showcasing the effectiveness of applying a BERT-based approach on EEG-based seizure detection., Comment: 4 pages, 2 tables, 2 figures
Published: 2024

6. Basilisk: An End-to-End Open-Source Linux-Capable RISC-V SoC in 130nm CMOS

Author: Scheffler, Paul, Sauter, Philippe, Benz, Thomas, Gürkaynak, Frank K., and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Open-source hardware (OSHW) is rapidly gaining traction in academia and industry. The availability of open RTL descriptions, EDA tools, and even PDKs enables a fully auditable supply chain for end-to-end (RTL to layout) open-source silicon, significantly strengthening security and transparency. Despite promising developments, existing OSHW efforts have so far fallen short of producing end-to-end open-source SoCs at the complexity and performance level needed to run a general-purpose OS. We present Basilisk, the first end-to-end open-source, Linux-capable RISC-V SoC taped out in IHP's open 130 nm technology. Basilisk features a 64-bit RISC-V core, a fully digital HyperRAM DRAM controller, and a rich set of IO peripherals including USB 1.1 and VGA. To tape out Basilisk, we create a reusable tool pipeline to convert its industry-grade SystemVerilog description to Verilog. We optimized logic synthesis in the open source Yosys synthesis tool, obtaining an increase in Basilisk's peak clock speed by 2.3x to 77 MHz and reducing its cell area by 1.6x to 1.1 MGE while also reducing synthesis runtime and RAM usage. We further optimized place and route in OpenROAD, enabling convergence to zero DRC violations while increasing core area utilization by 10% and reducing die area by 12%., Comment: 3 pages, 4 figures. Accepted at SSH-SoC 2024 workshop
Published: 2024

7. Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

Author: Paulin, Gianna, Scheffler, Paul, Benz, Thomas, Cavalcante, Matheus, Fischer, Tim, Eggimann, Manuel, Zhang, Yichao, Wistoff, Nils, Bertaccini, Luca, Colagrande, Luca, Ottavi, Gianmarco, Gürkaynak, Frank K., Rossi, Davide, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply., Comment: 2 pages, 7 figures. Accepted at the 2024 IEEE Symposium on VLSI Technology & Circuits
Published: 2024

8. Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVs

Author: Kühne, Jonas, Magno, Michele, and Benini, Luca
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Visual Inertial Odometry (VIO) is the task of estimating the movement trajectory of an agent from an onboard camera stream fused with additional Inertial Measurement Unit (IMU) measurements. A crucial subtask within VIO is the tracking of features, which can be achieved through Optical Flow (OF). As the calculation of OF is a resource-demanding task in terms of computational load and memory footprint, which needs to be executed at low latency, especially in robotic applications, OF estimation is today performed on powerful CPUs or GPUs. This restricts its use in a broad spectrum of applications where the deployment of such powerful, power-hungry processors is unfeasible due to constraints related to cost, size, and power consumption. On-sensor hardware acceleration is a promising approach to enable low latency VIO even on resource-constrained devices such as nano drones. This paper assesses the speed-up in a VIO sensor system exploiting a compact OF sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC). By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction of compute load of the VIO pipeline over the original VINS-Mono implementation, allowing VINS-Mono operation up to 50 FPS instead of 20 FPS on the quad-core ARM Cortex-A72 processor of a Raspberry Pi Compute Module 4., Comment: This article has been accepted for publication in the IEEE Sensors Journal (JSEN)
Published: 2024
Full Text: View/download PDF

9. GAPses: Versatile smart glasses for comfortable and fully-dry acquisition and parallel ultra-low-power processing of EEG and EOG

Author: Frey, Sebastian, Lucchini, Mattia Alberto, Kartsch, Victor, Ingolfsson, Thorir Mar, Bernardi, Andrea Helga, Segessenmann, Michael, Osieleniec, Jakub, Benatti, Simone, Benini, Luca, and Cossettini, Andrea
Subjects: Electrical Engineering and Systems Science - Signal Processing, Electrical Engineering and Systems Science - Systems and Control
Abstract: Recent advancements in head-mounted wearable technology are revolutionizing the field of biopotential measurement, but the integration of these technologies into practical, user-friendly devices remains challenging due to issues with design intrusiveness, comfort, and data privacy. To address these challenges, this paper presents GAPSES, a novel smart glasses platform designed for unobtrusive, comfortable, and secure acquisition and processing of electroencephalography (EEG) and electrooculography (EOG) signals. We introduce a direct electrode-electronics interface with custom fully dry soft electrodes to enhance comfort for long wear. An integrated parallel ultra-low-power RISC-V processor (GAP9, Greenwaves Technologies) processes data at the edge, thereby eliminating the need for continuous data streaming through a wireless link, enhancing privacy, and increasing system reliability in adverse channel conditions. We demonstrate the broad applicability of the designed prototype through validation in a number of EEG-based interaction tasks, including alpha waves, steady-state visual evoked potential analysis, and motor movement classification. Furthermore, we demonstrate an EEG-based biometric subject recognition task, where we reach a sensitivity and specificity of 98.87% and 99.86% respectively, with only 8 EEG channels and an energy consumption per inference on the edge as low as 121 uJ. Moreover, in an EOG-based eye movement classification task, we reach an accuracy of 96.68% on 11 classes, resulting in an information transfer rate of 94.78 bit/min, which can be further increased to 161.43 bit/min by reducing the accuracy to 81.43%. The deployed implementation has an energy consumption of 24 uJ per inference and a total system power of only 16.28 mW, allowing for continuous operation of more than 12 h with a small 75 mAh battery., Comment: 10 pages, 5 figures, 5 tables. This paper has been submitted to IEEE Transactions on Biomedical Circuits and Systems
Published: 2024

10. HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms

Author: Van Delm, Josse, Vandersteegen, Maarten, Burrello, Alessio, Sarda, Giuseppe Maria, Conti, Francesco, Pagliari, Daniele Jahier, Benini, Luca, and Verhelst, Marian
Subjects: Computer Science - Programming Languages, Computer Science - Distributed, Parallel, and Cluster Computing, D.3.4
Abstract: Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM - a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf(TM) Tiny suite on DIANA, an SoC with a RISC-V CPU, and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment., Comment: Presented at DAC2023. Open-source code is available at https://github.com/KULeuven-MICAS/htvm
Published: 2024
Full Text: View/download PDF

11. A Gigabit, DMA-enhanced Open-Source Ethernet Controller for Mixed-Criticality Systems

Author: Liang, Chaoqun, Ottaviano, Alessandro, Benz, Thomas, Sinigaglia, Mattia, Benini, Luca, Garofalo, Angelo, and Rossi, Davide
Subjects: Computer Science - Hardware Architecture
Abstract: The ongoing revolution in application domains targeting autonomous navigation, first and foremost automotive "zonalization", has increased the importance of certain off-chip communication interfaces, particularly Ethernet. The latter will play an essential role in next-generation vehicle architectures as the backbone connecting simultaneously and instantaneously the zonal/domain controllers. There is thereby an incumbent need to introduce a performant Ethernet controller in the open-source HW community, to be used as a proxy for architectural explorations and prototyping of mixed-criticality systems (MCSs). Driven by this trend, in this work, we propose a fully open-source, DMA-enhanced, technology-agnostic Gigabit Ethernet architecture that overcomes the limitations of existing open-source architectures, such as Lowrisc's Ethernet, often tied to FPGA implementation, performance-bound by sub-optimal design choices such as large memory buffers, and in general not mature enough to bridge the gap between academia and industry. Besides the area advantage, the proposed design increases packet transmission speed up to almost 3x compared to Lowrisc's and is validated through implementation and FPGA prototyping into two open-source, heterogeneous MCSs., Comment: 4 pages,4 figures, 21st ACM International Conference on Computing Frontiers Workshops and Special Sessions
Published: 2024
Full Text: View/download PDF

12. Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Author: Potocnik, Viviane, Colagrande, Luca, Fischer, Tim, Bertaccini, Luca, Pagliari, Daniele Jahier, Burrello, Alessio, and Benini, Luca
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Hardware Architecture, C.4, C.3, I.2
Abstract: Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x utilizing the HW platform while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve 16.1x speedup in the Non-Autoregressive (NAR) mode and up to 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization., Comment: 14 pages, 10 figures, 4 tables, IEEE Transactions on Circuits and Systems for Artificial Intelligence
Published: 2024

13. xTern: Energy-Efficient Ternary Neural Network Inference on RISC-V-Based Edge Systems

Author: Rutishauser, Georg, Mihali, Joan, Scherer, Moritz, and Benini, Luca
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning
Abstract: Ternary neural networks (TNNs) offer a superior accuracy-energy trade-off compared to binary neural networks. However, until now, they have required specialized accelerators to realize their efficiency potential, which has hindered widespread adoption. To address this, we present xTern, a lightweight extension of the RISC-V instruction set architecture (ISA) targeted at accelerating TNN inference on general-purpose cores. To complement the ISA extension, we developed a set of optimized kernels leveraging xTern, achieving 67% higher throughput than their 2-bit equivalents. Power consumption is only marginally increased by 5.2%, resulting in an energy efficiency improvement by 57.1%. We demonstrate that the proposed xTern extension, integrated into an octa-core compute cluster, incurs a minimal silicon area overhead of 0.9% with no impact on timing. In end-to-end benchmarks, we demonstrate that xTern enables the deployment of TNNs achieving up to 1.6 percentage points higher CIFAR-10 classification accuracy than 2-bit networks at equal inference latency. Our results show that xTern enables RISC-V-based ultra-low-power edge AI platforms to benefit from the efficiency potential of TNNs., Comment: Accepted for publication at IEEE ASAP 2024
Published: 2024

14. Modeling and Controlling Many-Core HPC Processors: an Alternative to PID and Moving Average Algorithms

Author: Bambini, Giovanni, Ottaviano, Alessandro, Conficoni, Christian, Tilli, Andrea, Benini, Luca, and Bartolini, Andrea
Subjects: Electrical Engineering and Systems Science - Systems and Control, Computer Science - Performance
Abstract: The race towards performance increase and computing power has led to chips with heterogeneous and complex designs, integrating an ever-growing number of cores on the same monolithic chip or chiplet silicon die. Higher integration density, compounded with the slowdown of technology-driven power reduction, implies that power and thermal management become increasingly relevant. Unfortunately, existing research lacks a detailed analysis and modeling of thermal, power, and electrical coupling effects and how they have to be jointly considered to perform dynamic control of complex and heterogeneous Multi-Processor System on Chips (MPSoCs). To close the gap, in this work, we first provide a detailed thermal and power model targeting a modern High Performance Computing (HPC) MPSoC. We consider real-world coupling effects such as actuators' non-idealities and the exponential relation between the dissipated power, the temperature state, and the voltage level in a single processing element. We analyze how these factors affect the control algorithm behavior and the type of challenges that they pose. Based on the analysis, we propose a thermal capping strategy inspired by Fuzzy control theory to replace the state-of-the-art PID controller, as well as a root-finding iterative method to optimally choose the shared voltage value among cores grouped in the same voltage domain. We evaluate the proposed controller with model-in-the-loop and hardware-in-the-loop co-simulations. We show an improvement over state-of-the-art methods of up to 5x the maximum exceeded temperature while providing an average of 3.56% faster application execution runtime across all the evaluation scenarios., Comment: Paper in Review
Published: 2024

15. A Passive and Asynchronous Wake-up Receiver for Acoustic Underwater Communication

Author: Schulthess, Lukas, Mayer, Philipp, Benini, Luca, and Magno, Michele
Subjects: Electrical Engineering and Systems Science - Systems and Control
Abstract: Establishing reliable data exchange in an underwater domain using energy and power-efficient communication methods is crucial and challenging. Radio frequencies are absorbed by the salty and mineral-rich water and optical signals are obstructed and scattered after short distances. In contrast, acoustic communication benefits from low absorption and enables communication over long distances. Underwater communication must match low power and energy requirements as underwater sensor systems must have a long battery lifetime and need to work reliably due to their deployment and maintenance cost. For long-term deployments, the sensors' overall power consumption is determined by the power consumption during idle state. It can be reduced by integrating asynchronous always-on wake-up circuits with nano-watt power consumption. However, this approach does reduce but not eliminate idle power consumption, leaving a margin for improvement. This paper presents a passive and asynchronous wake-up receiver for acoustic underwater communication enabling zero-power always-on listening. Zero-power listening is achieved by combining energy and information transmission using a low-power wake-up receiver that extracts energy out of the acoustic signal and eliminates radio frontend idle consumption. In-field evaluations demonstrate that the wake-up circuit requires only 63 uW to detect and compare an 8-bit UUID at a data rate of 200 bps up to a distance of 5 m and that the needed energy can directly be extracted from the acoustic signal.
Published: 2024

16. SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Author: Huang, Wei, Qin, Haotong, Liu, Yangdong, Li, Yawei, Liu, Xianglong, Benini, Luca, Magno, Michele, and Qi, Xiaojuan
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique extensively investigated in LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths. Standard PTQ methods using group-wise quantization suffer difficulties in quantizing LLMs accurately to such low-bit, but advanced methods remaining high-precision weights element-wisely are hard to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine optimal bit-width and quantizers for accurate LLM quantization, while aligning bit-width partition to groups for compact memory usage and fast integer inference. Specifically, the proposed SliM-LLM mainly relies on two novel techniques: (1) Salience-Determined Bit Allocation utilizes the clustering characteristics of salience distribution to allocate the bit-widths of each group, increasing the accuracy of quantized LLMs and maintaining the inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within the group, balancing the maintenance of salient information and minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bits, e.g., 2-bit LLaMA-7B achieves a 5.5-times memory-saving than original model on NVIDIA A800 GPUs, and 48% decrease of perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, which is integrated from the extension of SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%., Comment: 22 pages
Published: 2024

17. SentryCore: A RISC-V Co-Processor System for Safe, Real-Time Control Applications

Author: Rogenmoser, Michael, Ottaviano, Alessandro, Benz, Thomas, Balas, Robert, Perotti, Matteo, Garofalo, Angelo, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: In the last decade, we have witnessed exponential growth in the complexity of control systems for safety-critical applications (automotive, robots, industrial automation) and their transition to heterogeneous mixed-criticality systems (MCSs). The growth of the RISC-V ecosystem is creating a major opportunity to develop open-source, vendor-neutral reference platforms for safety-critical computing. We present SentryCore, a reliable, real-time, self-contained, open-source mega-IP for advanced control functions that can be seamlessly integrated into Systems-on-Chip, e.g., for automotive applications, through industry-standard Advanced eXtensible Interface 4 (AXI4). SentryCore features three embedded RISC-V processor cores in lockstep with error-correcting code (ECC) protected data memory for reliable execution of any safety-critical application. Context switching is accelerated to under 110 clock cycles via a RISC-V core-local interrupt controller (CLIC) and dedicated hardware extensions, while a timer-based direct memory access (DMA) engine streamlines sensor data readout during periodic control loops. SentryCore was implemented in Intel's 16nm process node and tested with FreeRTOS, ThreadX, and RTIC software support., Comment: 2 pages, accepted at the RISC-V Summit Europe 2024
Published: 2024

18. Compressed Latent Replays for Lightweight Continual Learning on Spiking Neural Networks

Author: Dequino, Alberto, Carpegna, Alessio, Nadalini, Davide, Savino, Alessandro, Benini, Luca, Di Carlo, Stefano, and Conti, Francesco
Subjects: Computer Science - Neural and Evolutionary Computing, Computer Science - Artificial Intelligence, Computer Science - Emerging Technologies, Computer Science - Machine Learning
Abstract: Rehearsal-based Continual Learning (CL) has been intensely investigated in Deep Neural Networks (DNNs). However, its application in Spiking Neural Networks (SNNs) has not been explored in depth. In this paper we introduce the first memory-efficient implementation of Latent Replay (LR)-based CL for SNNs, designed to seamlessly integrate with resource-constrained devices. LRs combine new samples with latent representations of previously learned data, to mitigate forgetting. Experiments on the Heidelberg SHD dataset with Sample and Class-Incremental tasks reach a Top-1 accuracy of 92.5% and 92%, respectively, without forgetting the previously learned information. Furthermore, we minimize the LRs' requirements by applying a time-domain compression, reducing by two orders of magnitude their memory requirement, with respect to a naive rehearsal setup, with a maximum accuracy drop of 4%. On a Multi-Class-Incremental task, our SNN learns 10 new classes from an initial set of 10, reaching a Top-1 accuracy of 78.4% on the full test set.
Published: 2024

19. TeraPool-SDR: An 1.89TOPS 1024 RV-Cores 4MiB Shared-L1 Cluster for Next-Generation Open-Source Software-Defined Radios

Author: Zhang, Yichao, Bertuletti, Marco, Riedel, Samuel, Cavalcante, Matheus, Vanelli-Coralli, Alessandro, and Benini, Luca
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture
Abstract: Radio Access Networks (RAN) workloads are rapidly scaling up in data processing intensity and throughput as the 5G (and beyond) standards grow in number of antennas and sub-carriers. Offering flexible Processing Elements (PEs), efficient memory access, and a productive parallel programming model, many-core clusters are a well-matched architecture for next-generation software-defined RANs, but staggering performance requirements demand a high number of PEs coupled with extreme Power, Performance and Area (PPA) efficiency. We present the architecture, design, and full physical implementation of Terapool-SDR, a cluster for Software Defined Radio (SDR) with 1024 latency-tolerant, compact RV32 PEs, sharing a global view of a 4MiB, 4096-banked, L1 memory. We report various feasible configurations of TeraPool-SDR featuring an ultra-high bandwidth PE-to-L1-memory interconnect, clocked at 730MHz, 880MHz, and 924MHz (TT/0.80 V/25 {\deg}C) in 12nm FinFET technology. The TeraPool-SDR cluster achieves high energy efficiency on all SDR key kernels for 5G RANs: Fast Fourier Transform (93GOPS/W), Matrix-Multiplication (125GOPS/W), Channel Estimation (96GOPS/W), and Linear System Inversion (61GOPS/W). For all the kernels, it consumes less than 10W, in compliance with industry standards., Comment: 6 pages, 6 figures and 3 tables
Published: 2024
Full Text: View/download PDF

20. Insights from Basilisk: Are Open-Source EDA Tools Ready for a Multi-Million-Gate, Linux-Booting RV64 SoC Design?

Author: Sauter, Philippe, Benz, Thomas, Scheffler, Paul, Gürkaynak, Frank K., and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Designing complex, multi-million-gate application-specific integrated circuits requires robust and mature electronic design automation (EDA) tools. We describe our efforts in enhancing the open-source Yosys+Openroad EDA flow to implement Basilisk, a fully open-source, Linux-booting RV64GC system-on-chip (SoC) design. We analyze the quality-of-results impact of our enhancements to synthesis tools, interfaces between EDA tools, logic optimization scripts, and a newly open-sourced library of optimized arithmetic macro-operators. We also introduce a streamlined physical design flow with an improved power grid and cell placement integration. Our Basilisk SoC design was taped out in IHP's open 130 nm technology. It achieves an operating frequency of 77 MHz (51 logic levels) under typical conditions, a 2.3x improvement compared to the baseline open-source EDA flow, while also reducing logic area by 1.6x. Furthermore, tool runtime was reduced by 2.5x, and peak RAM usage decreased by 2.9x. Through collaboration with EDA tool developers and domain experts, Basilisk establishes solid "proof of existence" for a fully open-source EDA flow used in designing a competitive multi-million-gate digital SoC., Comment: 8 pages, 6 figures, submitted at IWLS 2024
Published: 2024

21. Basilisk: Achieving Competitive Performance with Open EDA Tools on an Open-Source Linux-Capable RISC-V SoC

Author: Sauter, Phillippe, Benz, Thomas, Scheffler, Paul, Jiang, Zerun, Muheim, Beat, Gürkaynak, Frank K., and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: We introduce Basilisk, an optimized application-specific integrated circuit (ASIC) implementation and design flow building on the end-to-end open-source Iguana system-on-chip (SoC). We present enhancements to synthesis tools and logic optimization scripts improving quality of results (QoR), as well as an optimized physical design with an improved power grid and cell placement integration enabling a higher core utilization. The tapeout-ready version of Basilisk implemented in IHP's open 130 nm technology achieves an operation frequency of 77 MHz (51 logic levels) under typical conditions, a 2.3x improvement compared to the baseline open-source EDA design flow presented in Iguana, and a higher 55 % core utilization compared to 50 % in the baseline design. Through collaboration with EDA tool developers and domain experts, Basilisk exemplifies a synergistic effort towards competitive open-source electronic design automation (EDA) tools for research and industry applications., Comment: 2 pages, 1 figure, accepted as a poster at the RISC-V Summit Europe 2024
Published: 2024

22. A Spiking Neural Network Decoder for Implantable Brain Machine Interfaces and its Sparsity-aware Deployment on RISC-V Microcontrollers

Author: Liao, Jiawei, Toomey, Oscar, Wang, Xiaying, Widmer, Lars, Chestek, Cynthia A., Benini, Luca, and Jang, Taekwang
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: Implantable Brain-machine interfaces (BMIs) are promising for motor rehabilitation and mobility augmentation, and they demand accurate and energy-efficient algorithms. In this paper, we propose a novel spiking neural network (SNN) decoder for regression tasks for implantable BMIs. The SNN is trained with enhanced spatio-temporal backpropagation to fully leverage its capability to handle temporal problems. The proposed SNN decoder outperforms the state-of-the-art Kalman filter and artificial neural network (ANN) decoders in offline finger velocity decoding tasks. The decoder is deployed on a RISC-V-based hardware platform and optimized to exploit sparsity. The proposed implementation has an average power consumption of 0.50 mW in a duty-cycled mode. When conducting continuous inference without duty-cycling, it achieves an energy efficiency of 1.88 uJ per inference, which is 5.5X less than the baseline ANN. Additionally, the average decoding latency is 0.12 ms for each inference, which is 5.7X faster than the ANN implementation.
Published: 2024

23. Multi-resolution Rescored ByteTrack for Video Object Detection on Ultra-low-power Embedded Systems

Author: Bompani, Luca, Rusci, Manuele, Palossi, Daniele, Conti, Francesco, and Benini, Luca
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, I.4
Abstract: This paper introduces Multi-Resolution Rescored Byte-Track (MR2-ByteTrack), a novel video object detection framework for ultra-low-power embedded processors. This method reduces the average compute load of an off-the-shelf Deep Neural Network (DNN) based object detector by up to 2.25$\times$ by alternating the processing of high-resolution images (320$\times$320 pixels) with multiple down-sized frames (192$\times$192 pixels). To tackle the accuracy degradation due to the reduced image input size, MR2-ByteTrack correlates the output detections over time using the ByteTrack tracker and corrects potential misclassification using a novel probabilistic Rescore algorithm. By interleaving two down-sized images for every high-resolution one as the input of different state-of-the-art DNN object detectors with our MR2-ByteTrack, we demonstrate an average accuracy increase of 2.16% and a latency reduction of 43% on the GAP9 microcontroller compared to a baseline frame-by-frame inference scheme using exclusively full-resolution images. Code available at: https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack, Comment: 9 pages, 3 figures Accepted for publication at the Embedded Vision Workshop of the Computer Vision and Pattern Recognition conference, Seattle, 2024
Published: 2024

24. SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers

Author: Scheffler, Paul, Colagrande, Luca, and Benini, Luca
Subjects: Computer Science - Mathematical Software, Computer Science - Hardware Architecture
Abstract: Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil acceleration using register-mapped indirect streams. We demonstrate SARIS for various stencil codes on an eight-core RISC-V compute cluster with indirect stream registers, achieving significant speedups of 2.72x, near-ideal FPU utilizations of 81%, and energy efficiency improvements of 1.58x over an RV32G baseline on average. Scaling out to a 256-core manycore system, we estimate an average FPU utilization of 64%, an average speedup of 2.14x, and up to 15% higher fractions of peak compute than a leading GPU code generator., Comment: 6 pages, 5 figures, 2 tables. Accepted at DAC 2024
Published: 2024

25. Optimizing the Deployment of Tiny Transformers on Low-Power MCUs

Author: Jung, Victor J. B., Burrello, Alessio, Scherer, Moritz, Conti, Francesco, and Benini, Luca
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNN, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad-hoc, platform, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations into the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting ARM and RISC-V ISA, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14ms and energy consumption of 4.92 micro-joules, 2.32x lower than the SotA PULP-NN library on the same platform., Comment: Pre-print manuscript submitted for review to the IEEE Transactions on Computers
Published: 2024

26. Foundation Models for Structural Health Monitoring

Author: Benfenati, Luca, Pagliari, Daniele Jahier, Zanatta, Luca, Velez, Yhorman Alexander Bedoya, Acquaviva, Andrea, Poncino, Massimo, Macii, Enrico, Benini, Luca, and Burrello, Alessio
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Systems and Control, I.2.1, I.2.3
Abstract: Structural Health Monitoring (SHM) is a critical task for ensuring the safety and reliability of civil infrastructures, typically realized on bridges and viaducts by means of vibration monitoring. In this paper, we propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for SHM. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training, which, coupled with task-specific fine-tuning, allows them to outperform state-of-the-art traditional methods on diverse tasks, including Anomaly Detection (AD) and Traffic Load Estimation (TLE). We then extensively explore model size versus accuracy trade-offs and experiment with Knowledge Distillation (KD) to improve the performance of smaller Transformers, enabling their embedding directly into the SHM edge nodes. We showcase the effectiveness of our foundation models using data from three operational viaducts. For AD, we achieve a near-perfect 99.9% accuracy with a monitoring time span of just 15 windows. In contrast, a state-of-the-art method based on Principal Component Analysis (PCA) obtains its first good result (95.03% accuracy) only considering 120 windows. On two different TLE tasks, our models obtain state-of-the-art performance on multiple evaluation metrics (R$^2$ score, MAE% and MSE%). On the first benchmark, we achieve an R$^2$ score of 0.97 and 0.85 for light and heavy vehicle traffic, respectively, while the best previous approach stops at 0.91 and 0.84. On the second one, we achieve an R$^2$ score of 0.54 versus the 0.10 of the best existing method., Comment: 16 pages, 4 tables, 9 figures
Published: 2024

27. Optimizing Offload Performance in Heterogeneous MPSoCs

Author: Colagrande, Luca and Benini, Luca
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Heterogeneous multi-core architectures combine a few "host" cores, optimized for single-thread performance, with many small energy-efficient "accelerator" cores for data-parallel processing, on a single chip. Offloading a computation to the many-core acceleration fabric introduces a communication and synchronization cost which reduces the speedup attainable on the accelerator, particularly for small and fine-grained parallel tasks. We demonstrate that by co-designing the hardware and offload routines, we can increase the speedup of an offloaded DAXPY kernel by as much as 47.9%. Furthermore, we show that it is possible to accurately model the runtime of an offloaded application, accounting for the offload overheads, with as low as 1% MAPE error, enabling optimal offload decisions under offload execution time constraints., Comment: 2 pages, 1 figure. Accepted for publication in the DATE24 conference proceedings
Published: 2024

28. BatDeck: Advancing Nano-drone Navigation with Low-power Ultrasound-based Obstacle Avoidance

Author: Müller, Hanna, Kartsch, Victor, Magno, Michele, and Benini, Luca
Subjects: Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control
Abstract: Nano-drones, distinguished by their agility, minimal weight, and cost-effectiveness, are particularly well-suited for exploration in confined, cluttered and narrow spaces. Recognizing transparent, highly reflective or absorbing materials, such as glass and metallic surfaces is challenging, as classical sensors, such as cameras or laser rangers, often do not detect them. Inspired by bats, which can fly at high speeds in complete darkness with the help of ultrasound, this paper introduces \textit{BatDeck}, a pioneering sensor-deck employing a lightweight and low-power ultrasonic sensor for nano-drone autonomous navigation. This paper first provides insights about sensor characteristics, highlighting the influence of motor noise on the ultrasound readings, then it introduces the results of extensive experimental tests for obstacle avoidance (OA) in a diverse environment. Results show that \textit{BatDeck} allows exploration for a flight time of 8 minutes while covering 136m on average before crash in a challenging environment with transparent and reflective obstacles, proving the effectiveness of ultrasonic sensors for OA on nano-drones., Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Published: 2024

29. Combining Local and Global Perception for Autonomous Navigation on Nano-UAVs

Author: Lamberti, Lorenzo, Rutishauser, Georg, Conti, Francesco, and Benini, Luca
Subjects: Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control
Abstract: A critical challenge in deploying unmanned aerial vehicles (UAVs) for autonomous tasks is their ability to navigate in an unknown environment. This paper introduces a novel vision-depth fusion approach for autonomous navigation on nano-UAVs. We combine the visual-based PULP-Dronet convolutional neural network for semantic information extraction, i.e., serving as the global perception, with 8x8px depth maps for close-proximity maneuvers, i.e., the local perception. When tested in-field, our integration strategy highlights the complementary strengths of both visual and depth sensory information. We achieve a 100% success rate over 15 flights in a complex navigation scenario, encompassing straight pathways, static obstacle avoidance, and 90{\deg} turns., Comment: 5 pages, 2 figures, 1 table, 1 video
Published: 2024

30. On-Device Domain Learning for Keyword Spotting on Low-Power Extreme Edge Embedded Systems

Author: Cioflan, Cristian, Cavigelli, Lukas, Rusci, Manuele, de Prado, Miguel, and Benini, Luca
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Keyword spotting accuracy degrades when neural networks are exposed to noisy environments. On-site adaptation to previously unseen noise is crucial to recovering accuracy loss, and on-device learning is required to ensure that the adaptation process happens entirely on the edge device. In this work, we propose a fully on-device domain adaptation system achieving up to 14% accuracy gains over already-robust keyword spotting models. We enable on-device learning with less than 10 kB of memory, using only 100 labeled utterances to recover 5% accuracy after adapting to the complex speech noise. We demonstrate that domain adaptation can be achieved on ultra-low-power microcontrollers with as little as 806 mJ in only 14 s on always-on, battery-operated devices., Comment: 5 pages, 2 tables, 2 figures. Accepted at IEEE AICAS 2024
Published: 2024

31. 12 mJ per Class On-Device Online Few-Shot Class-Incremental Learning

Author: Wibowo, Yoga Esa, Cioflan, Cristian, Ingolfsson, Thorir Mar, Hersche, Michael, Zhao, Leo, Rahimi, Abbas, and Benini, Luca
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Few-Shot Class-Incremental Learning (FSCIL) enables machine learning systems to expand their inference capabilities to new classes using only a few labeled examples, without forgetting the previously learned classes. Classical backpropagation-based learning and its variants are often unsuitable for battery-powered, memory-constrained systems at the extreme edge. In this work, we introduce Online Few-Shot Class-Incremental Learning (O-FSCIL), based on a lightweight model consisting of a pretrained and metalearned feature extractor and an expandable explicit memory storing the class prototypes. The architecture is pretrained with a novel feature orthogonality regularization and metalearned with a multi-margin loss. For learning a new class, our approach extends the explicit memory with novel class prototypes, while the remaining architecture is kept frozen. This allows learning previously unseen classes based on only a few examples with one single pass (hence online). O-FSCIL obtains an average accuracy of 68.62% on the FSCIL CIFAR100 benchmark, achieving state-of-the-art results. Tailored for ultra-low-power platforms, we implement O-FSCIL on the 60 mW GAP9 microcontroller, demonstrating online learning capabilities within just 12 mJ per new class., Comment: 6 pages, 4 tables, 3 figures. Accepted at IEEE DATE 2024
Published: 2024

32. Boosting keyword spotting through on-device learnable user speech characteristics

Author: Cioflan, Cristian, Cavigelli, Lukas, and Benini, Luca
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% from 30.1% to 24.3% based on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers., Comment: 5 pages, 3 tables, 2 figures. Accepted as a full paper by the tinyML Research Symposium 2024
Published: 2024

33. SzCORE: A Seizure Community Open-source Research Evaluation framework for the validation of EEG-based automated seizure detection algorithms

Author: Dan, Jonathan, Pale, Una, Amirshahi, Alireza, Cappelletti, William, Ingolfsson, Thorir Mar, Wang, Xiaying, Cossettini, Andrea, Bernini, Adriano, Benini, Luca, Beniczky, Sándor, Atienza, David, and Ryvlin, Philippe
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Machine Learning
Abstract: The need for high-quality automated seizure detection algorithms based on electroencephalography (EEG) becomes ever more pressing with the increasing use of ambulatory and long-term EEG monitoring. Heterogeneity in validation methods of these algorithms influences the reported results and makes comprehensive evaluation and comparison challenging. This heterogeneity concerns in particular the choice of datasets, evaluation methodologies, and performance metrics. In this paper, we propose a unified framework designed to establish standardization in the validation of EEG-based seizure detection algorithms. Based on existing guidelines and recommendations, the framework introduces a set of recommendations and standards related to datasets, file formats, EEG data input content, seizure annotation input and output, cross-validation strategies, and performance metrics. We also propose the 10-20 seizure detection benchmark, a machine-learning benchmark based on public datasets converted to a standardized format. This benchmark defines the machine-learning task as well as reporting metrics. We illustrate the use of the benchmark by evaluating a set of existing seizure detection algorithms. The SzCORE (Seizure Community Open-source Research Evaluation) framework and benchmark are made publicly available along with an open-source software library to facilitate research use, while enabling rigorous evaluation of the clinical significance of the algorithms, fostering a collective effort to more optimally detect seizures to improve the lives of people with epilepsy.
Published: 2024

34. Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters

Author: Mazzola, Sergio, Riedel, Samuel, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient RISC-V cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead RISC-V ISA extensions for efficient systolic execution, namely Xqueue and Queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several DSP kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80V/25{\deg}C), in a 22 nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
Published: 2024

35. A Tiny Transformer for Low-Power Arrhythmia Classification on Microcontrollers

Author: Busia, Paola, Scrugli, Matteo Antonio, Jung, Victor Jean-Baptiste, Benini, Luca, and Meloni, Paolo
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Wearable systems for the continuous and real-time monitoring of cardiovascular diseases are becoming widespread and valuable assets in diagnosis and therapy. A promising approach for real-time analysis of the electrocardiographic (ECG) signal and the detection of heart conditions, such as arrhythmia, is represented by the transformer machine learning model. Transformers are powerful models for the classification of time series, although efficient implementation in the wearable domain raises significant design challenges, to combine adequate accuracy and a suitable complexity. In this work, we present a tiny transformer model for the analysis of the ECG signal, requiring only 6k parameters and reaching 98.97% accuracy in the recognition of the 5 most common arrhythmia classes from the MIT-BIH Arrhythmia database, assessed considering 8-bit integer inference as required for efficient execution on low-power microcontroller-based devices. We explored an augmentation-based training approach for improving the robustness against electrode motion artifacts noise, resulting in a worst-case post-deployment performance assessment of 98.36% accuracy. Suitability for wearable monitoring solutions is finally demonstrated through efficient deployment on the parallel ultra-low-power GAP9 processor, where inference execution requires 4.28ms and 0.09mJ., Comment: 2024 IEEE Transactions on Biomedical Circuits and Systems
Published: 2024
Full Text: View/download PDF

36. A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit for Analog In-Memory Computing

Author: Ferro, Elena, Vasilopoulos, Athanasios, Lammie, Corey, Gallo, Manuel Le, Benini, Luca, Boybat, Irem, and Sebastian, Abu
Subjects: Computer Science - Hardware Architecture, Computer Science - Emerging Technologies, Computer Science - Machine Learning
Abstract: Analog In-Memory Computing (AIMC) is an emerging technology for fast and energy-efficient Deep Learning (DL) inference. However, a certain amount of digital post-processing is required to deal with circuit mismatches and non-idealities associated with the memory devices. Efficient near-memory digital logic is critical to retain the high area/energy efficiency and low latency of AIMC. Existing systems adopt Floating Point 16 (FP16) arithmetic with limited parallelization capability and high latency. To overcome these limitations, we propose a Near-Memory digital Processing Unit (NMPU) based on fixed-point arithmetic. It achieves competitive accuracy and higher computing throughput than previous approaches while minimizing the area overhead. Moreover, the NMPU supports standard DL activation steps, such as ReLU and Batch Normalization. We perform a physical implementation of the NMPU design in a 14 nm CMOS technology and provide detailed performance, power, and area assessments. We validate the efficacy of the NMPU by using data from an AIMC chip and demonstrate that a simulated AIMC system with the proposed NMPU outperforms existing FP16-based implementations, providing 139$\times$ speed-up, 7.8$\times$ smaller area, and a competitive power consumption. Additionally, our approach achieves an inference accuracy of 86.65 %/65.06 %, with an accuracy drop of just 0.12 %/0.4 % compared to the FP16 baseline when benchmarked with ResNet9/ResNet32 networks trained on the CIFAR10/CIFAR100 datasets, respectively., Comment: Accepted at ISCAS2024
Published: 2024
Full Text: View/download PDF

37. Zero-shot Classification using Hyperdimensional Computing

Author: Ruffino, Samuele, Karunaratne, Geethan, Hersche, Michael, Benini, Luca, Sebastian, Abu, and Rahimi, Abbas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Classification based on Zero-shot Learning (ZSL) is the ability of a model to classify inputs into novel classes on which the model has not previously seen any training examples. Providing an auxiliary descriptor in the form of a set of attributes describing the new classes involved in the ZSL-based classification is one of the favored approaches to solving this challenging task. In this work, inspired by Hyperdimensional Computing (HDC), we propose the use of stationary binary codebooks of symbol-like distributed representations inside an attribute encoder to compactly represent a computationally simple end-to-end trainable model, which we name Hyperdimensional Computing Zero-shot Classifier~(HDC-ZSC). It consists of a trainable image encoder, an attribute encoder based on HDC, and a similarity kernel. We show that HDC-ZSC can be used to first perform zero-shot attribute extraction tasks and, can later be repurposed for Zero-shot Classification tasks with minimal architectural changes and minimal model retraining. HDC-ZSC achieves Pareto optimal results with a 63.8% top-1 classification accuracy on the CUB-200 dataset by having only 26.6 million trainable parameters. Compared to two other state-of-the-art non-generative approaches, HDC-ZSC achieves 4.3% and 9.9% better accuracy, while they require more than 1.85x and 1.72x parameters compared to HDC-ZSC, respectively., Comment: This is the extended version of a paper accepted in the Design, Automation, and Test in Europe Conference (DATE), 2024
Published: 2024

38. TOP: Towards Open & Predictable Heterogeneous SoCs

Author: Valente, Luca, Restuccia, Francesco, Rossi, Davide, Kastner, Ryan, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Ensuring predictability in modern real-time Systems-on-Chip (SoCs) is an increasingly critical concern for many application domains such as automotive, robotics, and industrial automation. An effective approach involves the modeling and development of hardware components, such as interconnects and shared memory resources, to evaluate or enforce their deterministic behavior. Unfortunately, these IPs are often closed-source, and these studies are limited to the single modules that must later be integrated with third-party IPs in more complex SoCs, hindering the precision and scope of modeling and compromising the overall predictability. With the coming-of-age of open-source instruction set architectures (RISC-V) and hardware, major opportunities for changing this status quo are emerging. This study introduces an innovative methodology for modeling and analyzing State-of-the-Art (SoA) open-source SoCs for low-power cyber-physical systems. Our approach models and analyzes the entire set of open-source IPs within these SoCs and then provides a comprehensive analysis of the entire architecture. We validate this methodology on a sample heterogenous low-power RISC-V architecture through RTL simulation and FPGA implementation, minimizing pessimism in bounding the service time of transactions crossing the architecture between 28% and 1%, which is considerably lower when compared to similar SoA works.
Published: 2024

39. MX: Enhancing RISC-V's Vector ISA for Ultra-Low Overhead, Energy-Efficient Matrix Multiplication

Author: Perotti, Matteo, Zhang, Yichao, Cavalcante, Matheus, Mustafa, Enis, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Dense Matrix Multiplication (MatMul) is arguably one of the most ubiquitous compute-intensive kernels, spanning linear algebra, DSP, graphics, and machine learning applications. Thus, MatMul optimization is crucial not only in high-performance processors but also in embedded low-power platforms. Several Instruction Set Architectures (ISAs) have recently included matrix extensions to improve MatMul performance and efficiency at the cost of added matrix register files and units. In this paper, we propose Matrix eXtension (MX), a lightweight approach that builds upon the open-source RISC-V Vector (RVV) ISA to boost MatMul energy efficiency. Instead of adding expensive dedicated hardware, MX uses the pre-existing vector register file and functional units to create a hybrid vector/matrix engine at a negligible area cost (< 3%), which comes from a compact near-FPU tile buffer for higher data reuse, and no clock frequency overhead. We implement MX on a compact and highly energy-optimized RVV processor and evaluate it in both a Dual- and 64-Core cluster in a 12-nm technology node. MX boosts the Dual-Core's energy efficiency by 10% for a double-precision 64x64x64 matrix multiplication with the same FPU utilization (~97%) and by 25% on the 64-Core cluster for the same benchmark on 32-bit data, with a 56% performance gain.
Published: 2024

40. A Heterogeneous RISC-V based SoC for Secure Nano-UAV Navigation

Author: Valente, Luca, Nadalini, Alessandro, Veeran, Asif, Sinigaglia, Mattia, Sa, Bruno, Wistoff, Nils, Tortorella, Yvan, Benatti, Simone, Psiakis, Rafail, Kulmala, Ari, Mohammad, Baker, Pinto, Sandro, Palossi, Daniele, Benini, Luca, and Rossi, Davide
Subjects: Computer Science - Hardware Architecture, Computer Science - Artificial Intelligence
Abstract: The rapid advancement of energy-efficient parallel ultra-low-power (ULP) ucontrollers units (MCUs) is enabling the development of autonomous nano-sized unmanned aerial vehicles (nano-UAVs). These sub-10cm drones represent the next generation of unobtrusive robotic helpers and ubiquitous smart sensors. However, nano-UAVs face significant power and payload constraints while requiring advanced computing capabilities akin to standard drones, including real-time Machine Learning (ML) performance and the safe co-existence of general-purpose and real-time OSs. Although some advanced parallel ULP MCUs offer the necessary ML computing capabilities within the prescribed power limits, they rely on small main memories (<1MB) and ucontroller-class CPUs with no virtualization or security features, and hence only support simple bare-metal runtimes. In this work, we present Shaheen, a 9mm2 200mW SoC implemented in 22nm FDX technology. Differently from state-of-the-art MCUs, Shaheen integrates a Linux-capable RV64 core, compliant with the v1.0 ratified Hypervisor extension and equipped with timing channel protection, along with a low-cost and low-power memory controller exposing up to 512MB of off-chip low-cost low-power HyperRAM directly to the CPU. At the same time, it integrates a fully programmable energy- and area-efficient multi-core cluster of RV32 cores optimized for general-purpose DSP as well as reduced- and mixed-precision ML. To the best of the authors' knowledge, it is the first silicon prototype of a ULP SoC coupling the RV64 and RV32 cores in a heterogeneous host+accelerator architecture fully based on the RISC-V ISA. We demonstrate the capabilities of the proposed SoC on a wide range of benchmarks relevant to nano-UAV applications. The cluster can deliver up to 90GOp/s and up to 1.8TOp/s/W on 2-bit integer kernels and up to 7.9GFLOp/s and up to 150GFLOp/s/W on 16-bit FP kernels.
Published: 2024
Full Text: View/download PDF

41. Data-Driven Power Modeling and Monitoring via Hardware Performance Counters Tracking

Author: Mazzola, Sergio, Ara, Gabriele, Benz, Thomas, Forsberg, Björn, Cucinotta, Tommaso, and Benini, Luca
Subjects: Computer Science - Performance, Computer Science - Operating Systems
Abstract: In the current high-performance and embedded computing era, full-stack energy-centric design is paramount. Use cases require increasingly high performance at an affordable power budget, often under real-time constraints. Extreme heterogeneity and parallelism address these issues but greatly complicate online power consumption assessment, which is essential for dynamic hardware and software stack adaptations. We introduce a novel architecture-agnostic power modeling methodology with state-of-the-art accuracy, low overhead, and high responsiveness. Our methodology identifies the best Performance Monitoring Counters (PMCs) to model the power consumption of each hardware sub-system at each Dynamic Voltage and Frequency Scaling (DVFS) state. The individual linear models are combined into a complete model that effectively describes the power consumption of the whole system, achieving high accuracy and low overhead. Our evaluation reports an average estimation error of 7.5 % for power consumption and 1.3 % for energy. Furthermore, we propose Runmeter, an open-source, PMC-based monitoring framework integrated into the Linux kernel. Runmeter manages PMC samples collection and manipulation, efficiently evaluating our power models at runtime. With a time overhead of only 0.7 % in the worst case, Runmeter provides responsive and accurate power measurements directly in the kernel, which can be employed for actuation policies such as Dynamic Power Management (DPM) and power-aware task scheduling., Comment: 13 pages, 5 figures, submitted to the IEEE for possible publication
Published: 2024

42. Siracusa: A 16 nm Heterogenous RISC-V SoC for Extended Reality with At-MRAM Neural Engine

Author: Prasad, Arpan Suravi, Scherer, Moritz, Conti, Francesco, Rossi, Davide, Di Mauro, Alfio, Eggimann, Manuel, Gómez, Jorge Tómas, Li, Ziyun, Sarwar, Syed Shakib, Wang, Zhao, De Salvo, Barbara, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Extended reality (XR) applications are Machine Learning (ML)-intensive, featuring deep neural networks (DNNs) with millions of weights, tightly latency-bound (10-20 ms end-to-end), and power-constrained (low tens of mW average power). While ML performance and efficiency can be achieved by introducing neural engines within low-power systems-on-chip (SoCs), system-level power for nontrivial DNNs depends strongly on the energy of non-volatile memory (NVM) access for network weights. This work introduces Siracusa, a near-sensor heterogeneous SoC for next-generation XR devices manufactured in 16 nm CMOS. Siracusa couples an octa-core cluster of RISC-V digital signal processing cores with a novel tightly-coupled "At-Memory" integration between a state-of-the-art digital neural engine called N-EUREKA and an on-chip NVM based on magnetoresistive memory(MRAM), achieving 1.7x higher throughput and 3x better energy efficiency than XR SoCs using NVM as background memory. The fabricated SoC prototype achieves an area efficiency of 65.2 GOp/s/mm2 and a peak energy efficiency of 8.84 TOp/J for DNN inference while supporting complex heterogeneous application workloads, which combine ML with conventional signal processing and control., Comment: Final accepted manuscript pre-print submitted to the IEEE Journal of Solid-State Circuits
Published: 2023

43. Multi-sensory Anti-collision Design for Autonomous Nano-swarm Exploration

Author: Pourjabar, Mahyar, Rusci, Manuele, Bompani, Luca, Lamberti, Lorenzo, Niculescu, Vlad, Palossi, Daniele, and Benini, Luca
Subjects: Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control
Abstract: This work presents a multi-sensory anti-collision system design to achieve robust autonomous exploration capabilities for a swarm of 10 cm-side nano-drones operating on object detection missions. We combine lightweight single-beam laser ranging to avoid proximity collisions with a long-range vision-based obstacle avoidance deep learning model (i.e., PULP-Dronet) and an ultra-wide-band (UWB) based ranging module to prevent intra-swarm collisions. An in-field study shows that our multisensory approach can prevent collisions with static obstacles, improving the mission success rate from 20% to 80% in cluttered environments w.r.t. a State-of-the-Art (SoA) baseline. At the same time, the UWB-based sub-system shows a 92.8% success rate in preventing collisions between drones of a four-agent fleet within a safety distance of 65 cm. On a SoA robotic platform extended by a GAP8 multi-core processor, the PULP-Dronet runs interleaved with an objected detection task, which constraints its execution at 1.6 frame/s. This throughput is sufficient for avoiding obstacles with a probability of about 40% but shows a need for more capable processors for the next-generation nano-drone swarms.
Published: 2023

44. TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

Author: Terzic, Aleksandar, Hersche, Michael, Karunaratne, Geethan, Benini, Luca, Sebastian, Abu, and Rahimi, Abbas
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(LlogL)$, with $L$ being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$. The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, as well as a synthetic reasoning benchmark associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with $1.37\times$/$1.24\times$ faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster operations than the FFT-based parallelized recurrence in GPUs, making them a scalable candidate for handling very large sequence lengths: they are up to $7.07\times$/$2.86\times$ faster in the forward/backward pass for sequences up to 131k. Further on LRA, TCNCA achieves, on average, $1.28\times$ speed-up during inference with similar accuracy to what MEGA achieves. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA on a range of sequence lengths and vocabulary sizes.
Published: 2023

45. MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition

Author: Menet, Nicolas, Hersche, Michael, Karunaratne, Geethan, Benini, Luca, Sebastian, Abu, and Rahimi, Abbas
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4 x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets., Comment: accepted in NeurIPS 2023
Published: 2023

46. Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV

Author: Zhang, Chi, Scheffler, Paul, Benz, Thomas, Perotti, Matteo, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Sparse matrix vector multiplication (SpMV) is central to numerous data-intensive applications, but requires streaming indirect memory accesses that severely degrade both processing and memory throughput in state-of-the-art architectures. Near-memory hardware units, decoupling indirect streams from processing elements, partially alleviate the bottleneck, but rely on low DRAM access granularity, which is highly inefficient for modern DRAM standards like HBM and LPDDR. To fully address the end-to-end challenge, we propose a low-overhead data coalescer combined with a near-memory indirect streaming unit for AXI-Pack, an extension to the widespread AXI4 protocol packing narrow irregular stream elements onto wide memory buses. Our combined solution leverages the memory-level parallelism and coalescence of streaming indirect accesses in irregular applications like SpMV to maximize the performance and bandwidth efficiency attained on wide memory interfaces. Our solution delivers an average speedup of 8x in effective indirect access, often reaching the full memory bandwidth. As a result, we achieve an average end-to-end speedup on SpMV of 3x. Moreover, our approach demonstrates remarkable on-chip efficiency, requiring merely 27kB of on-chip storage and a very compact implementation area of 0.2-0.3mm^2 in a 12nm node., Comment: 6 pages, 6 figures. Submitted to DATE 2024
Published: 2023

47. AXI-REALM: A Lightweight and Modular Interconnect Extension for Traffic Regulation and Monitoring of Heterogeneous Real-Time SoCs

Author: Benz, Thomas, Ottaviano, Alessandro, Balas, Robert, Garofalo, Angelo, Restuccia, Francesco, Biondi, Alessandro, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: The increasing demand for heterogeneous functionality in the automotive industry and the evolution of chip manufacturing processes have led to the transition from federated to integrated critical real-time embedded systems (CRTESs). This leads to higher integration challenges of conventional timing predictability techniques due to access contention on shared resources, which can be resolved by providing system-level observability and controllability in hardware. We focus on the interconnect as a shared resource and propose AXI-REALM, a lightweight, modular, and technology-independent real-time extension to industry-standard AXI4 interconnects, available open-source. AXI-REALM uses a credit-based mechanism to distribute and control the bandwidth in a multi-subordinate system on periodic time windows, proactively prevents denial of service from malicious actors in the system, and tracks each manager's access and interference statistics for optimal budget and period selection. We provide detailed performance and implementation cost assessment in a 12nm node and an end-to-end functional case study implementing AXI-REALM into an open-source Linux-capable RISC-V SoC. In a system with a general-purpose core and a hardware accelerator's DMA engine causing interference on the interconnect, AXI-REALM achieves fair bandwidth distribution among managers, allowing the core to recover 68.2 % of its performance compared to the case without contention. Moreover, near-ideal performance (above 95 %) can be achieved by distributing the available bandwidth in favor of the core, improving the worst-case memory access latency from 264 to below eight cycles. Our approach minimizes buffering compared to other solutions and introduces only 2.45 % area overhead compared to the original SoC., Comment: 6 pages, 6 figures, accepted as a regular paper at DATE24
Published: 2023

48. PELS: A Lightweight and Flexible Peripheral Event Linking System for Ultra-Low Power IoT Processors

Author: Ottaviano, Alessandro, Balas, Robert, Sauter, Philippe, Eggimann, Manuel, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: A key challenge for ultra-low-power (ULP) devices is handling peripheral linking, where the main central processing unit (CPU) periodically mediates the interaction among multiple peripherals following wake-up events. Current solutions address this problem by either integrating event interconnects that route single-wire event lines among peripherals or by general-purpose I/O processors, with a strong trade-off between the latency, efficiency of the former, and the flexibility of the latter. In this paper, we present an open-source, peripheral-agnostic, lightweight, and flexible Peripheral Event Linking System (PELS) that combines dedicated event routing with a tiny I/O processor. With the proposed approach, the power consumption of a linking event is reduced by 2.5 times compared to a baseline relying on the main core for the event-linking process, at a low area of just 7 kGE in its minimal configuration, when integrated into a ULP RISC-V IoT processor., Comment: 6 pages, accepted at DATE24 conference, camera-ready version
Published: 2023

49. Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication

Author: Schönleber, Jannis, Cavigelli, Lukas, Andri, Renzo, Perotti, Matteo, and Benini, Luca
Subjects: Computer Science - Hardware Architecture, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9., Comment: 6 pages, 7 figures, preprint under review
Published: 2023

50. CV32RT: Enabling Fast Interrupt and Context Switching for RISC-V Microcontrollers

Author: Balas, Robert, Ottaviano, Alessandro, and Benini, Luca
Subjects: Computer Science - Hardware Architecture
Abstract: Processors using the open RISC-V ISA are finding increasing adoption in the embedded world. Many embedded use cases have real-time constraints and require flexible, predictable, and fast reactive handling of incoming events. However, RISC- V processors are still lagging in this area compared to more mature proprietary architectures, such as ARM Cortex-M and TriCore, which have been tuned for years. The default interrupt controller standardized by RISC-V, the Core Local Interruptor (CLINT), lacks configurability in prioritization and preemption of interrupts. The RISC-V Core Local Interrupt Controller (CLIC) specification addresses this concern by enabling pre-emptible, low-latency vectored interrupts while also envisioning optional extensions to improve interrupt latency. In this work, we implement a CLIC for the CV32E40P, an industrially supported open-source 32-bit MCU-class RISC-V core, and enhance it with fastirq: a custom extension that provides interrupt latency as low as 6 cycles. We call CV32RT our enhanced core. To the best of our knowledge, CV32RT is the first fully open-source RV32 core with competitive interrupt-handling features compared to the Arm Cortex-M series and TriCore. The proposed extensions are also demonstrated to improve task context switching in real-time operating systems., Comment: 12 pages, submitted to IEEE Transactions on VLSI Systems (TVLSI)
Published: 2023

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

3,308 results on '"Benini, Luca"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources