Author: "Verhelst, Marian" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

1. OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

Author: Yi, Xiaoling, Antonio, Ryan, Dumoulin, Joren, Sun, Jiacong, Van Delm, Josse, Paim, Guilherme, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Computer Science - Artificial Intelligence
Abstract: Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety ofGeMM workloads, while achieving 4.68 TOPS/W system efficiency.
Published: 2024

2. MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices

Author: Hamdi, Mohamed Amine, Daghero, Francesco, Sarda, Giuseppe Maria, Van Delm, Josse, Symons, Arne, Benini, Luca, Verhelst, Marian, Pagliari, Daniele Jahier, and Burrello, Alessio
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, I.2.2, D.1.3
Abstract: Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. On the opposite side, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, resulting in the generation of general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework enhanced with hardware cost models can compete with and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite MATCH reduces inference latency by up to 60.88 times on DIANA, compared to using the plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by 2.15 times compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergically exploits the DNN accelerator and the eight-cores cluster available on board., Comment: 13 pages, 11 figures, 4 tables
Published: 2024

3. Pack my weights and run! Minimizing overheads for in-memory computing accelerators

Author: Houshmand, Pouya and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: In-memory computing hardware accelerators allow more than 10x improvements in peak efficiency and performance for matrix-vector multiplications (MVM) compared to conventional digital designs. For this, they have gained great interest for the acceleration of neural network workloads. Nevertheless, these potential gains are only achieved when the utilization of the computational resources is maximized and the overhead from loading operands in the memory array minimized. To this aim, this paper proposes a novel mapping algorithm for the weights in the IMC macro, based on efficient packing of the weights of network layers in the available memory. The algorithm realizes 1) minimization of weight loading times while at the same time 2) maximally exploiting the parallelism of the IMC computational fabric. A set of case studies are carried out to show achievable trade-offs for the MLPerf Tiny benchmark \cite{mlperftiny} on IMC architectures, with potential $10-100\times$ EDP improvements., Comment: 7 pages, 9 figures
Published: 2024

4. COAC: Cross-layer Optimization of Accelerator Configurability for Efficient CNN Processing

Author: Colleman, Steven, Shi, Man, and Verhelst, Marian
Subjects: Electrical Engineering and Systems Science - Systems and Control
Abstract: To achieve high accuracy, convolutional neural networks (CNNs) are increasingly growing in complexity and diversity in layer types and topologies. This makes it very challenging to efficiently deploy such networks on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks enable searching for the optimal execution schedules or hardware mappings of individual network layers, by optimizing each layer's spatial (dataflow parallelization) and temporal unrolling (execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is also not ideal, as this misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across target neural networks, while taking into account the overhead to support multiple unrollings. This paper, therefore, presents COAC, a cross-layer design space exploration and mapping framework to optimize the flexibility of neural processing architectures by balancing configurability overhead against resulting energy and latency savings for end-to-end inference. COAC does not only provide a systematical analysis of the architectural overhead in function of the supported spatial unrollings, but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% EDP (energy-delay-product) savings for a set of six neural networks at the expense of a relative area increase of 9.5%., Comment: 14 pages,17 figures.Journal IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Published: 2024
Full Text: View/download PDF

5. Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis

Author: Sarda, Giuseppe M., Shah, Nimish, Bhattacharjee, Debjyoti, Debacker, Peter, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture
Abstract: GPGPU execution analysis has always been tied to closed-source, proprietary benchmarking tools that provide high-level, non-exhaustive, and/or statistical information, preventing a thorough understanding of bottlenecks and optimization possibilities. Open-source hardware platforms offer opportunities to overcome such limits and co-optimize the full {hardware-mapping-algorithm} compute stack. Yet, so far, this has remained under-explored. In this work, we exploit micro-architecture parameter analysis to develop a hardware-aware, runtime mapping technique for OpenCL kernels on the open Vortex RISC-V GPGPU. Our method is based on trace observations and targets optimal hardware resource utilization to achieve superior performance and flexibility compared to hardware-agnostic mapping approaches. The technique was validated on different architectural GPU configurations across several OpenCL kernels. Overall, our approach significantly enhances the performance of the open-source Vortex GPGPU, contributing to unlocking its potential and usability.
Published: 2024
Full Text: View/download PDF

6. CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories

Author: Shi, Man, Colleman, Steven, VanDeMieroop, Charlotte, Joseph, Antony, Meijer, Maurice, Dehaene, Wim, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Deep neural networks (DNN) use a wide range of network topologies to achieve high accuracy within diverse applications. This model diversity makes it impossible to identify a single "dataflow" (execution schedule) to perform optimally across all possible layers and network topologies. Several frameworks support the exploration of the best dataflow for a given DNN layer and hardware. However, switching the dataflow from one layer to the next layer within one DNN model can result in hardware inefficiencies stemming from memory data layout mismatch among the layers. Unfortunately, all existing frameworks treat each layer independently and typically model memories as black boxes (one large monolithic wide memory), which ignores the data layout and can not deal with the data layout dependencies of sequential layers. These frameworks are not capable of doing dataflow cross-layer optimization. This work, hence, aims at cross-layer dataflow optimization, taking the data dependency and data layout reshuffling overheads among layers into account. Additionally, we propose to exploit the multibank memories typically present in modern DNN accelerators towards efficiently reshuffling data to support more dataflow at low overhead. These innovations are supported through the Cross-layer Memory-aware Dataflow Scheduler (CMDS). CMDS can model DNN execution energy/latency while considering the different data layout requirements due to the varied optimal dataflow of layers. Compared with the state-of-the-art (SOTA), which performs layer-optimized memory-unaware scheduling, CMDS achieves up to 5.5X energy reduction and 1.35X latency reduction with negligible hardware cost.
Published: 2024
Full Text: View/download PDF

7. Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms

Author: Colleman, Steven, Symons, Arne, Jung, Victor J. B., and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture
Abstract: The impact of transformer networks is booming, yet, they come with significant computational complexity. It is therefore essential to understand how to optimally map and execute these networks on modern neural processor hardware. So far, literature on transformer scheduling optimization has been focusing on deployment on GPU and specific ASICs. This work enables extensive hardware/mapping exploration by extending the DSE framework Stream towards support for transformers across a wide variety of hardware architectures and different execution schedules. After validation, we explore the optimal schedule for transformer layers/attention heads and investigate whether layer fusion is beneficial to improve latency, energy or memory requirements. Our study shows that the memory requirements for active feature data can be drastically reduced, by adapting the execution schedule based on the size of the input of the attention head., Comment: Accepted to ISQED2024
Published: 2024

8. HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms

Author: Van Delm, Josse, Vandersteegen, Maarten, Burrello, Alessio, Sarda, Giuseppe Maria, Conti, Francesco, Pagliari, Daniele Jahier, Benini, Luca, and Verhelst, Marian
Subjects: Computer Science - Programming Languages, Computer Science - Distributed, Parallel, and Cluster Computing, D.3.4
Abstract: Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM - a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf(TM) Tiny suite on DIANA, an SoC with a RISC-V CPU, and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment., Comment: Presented at DAC2023. Open-source code is available at https://github.com/KULeuven-MICAS/htvm
Published: 2024
Full Text: View/download PDF

9. ACCO: Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators

Author: Yin, Jun, Mei, Linyan, Guntoro, Andre, and Verhelst, Marian
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: Spatio-Temporal Convolutional Neural Networks (ST-CNN) allow extending CNN capabilities from image processing to consecutive temporal-pattern recognition. Generally, state-of-the-art (SotA) ST-CNNs inflate the feature maps and weights from well-known CNN backbones to represent the additional time dimension. However, edge computing applications would suffer tremendously from such large computation or memory overhead. Fortunately, the overlapping nature of ST-CNN enables various optimizations, such as the dilated causal convolution structure and Depth-First (DF) layer fusion to reuse the computation between time steps and CNN sliding windows, respectively. Yet, no hardware-aware approach has been proposed that jointly explores the optimal strategy from a scheduling as well as a hardware point of view. To this end, we present ACCO, an automated optimizer that explores efficient Causal CNN transformation and DF scheduling for ST-CNNs on edge hardware accelerators. By cost-modeling the computation and data movement on the accelerator architecture, ACCO automatically selects the best scheduling strategy for the given hardware-algorithm target. Compared to the fixed dilated causal structure, ST-CNNs with ACCO reach an ~8.4x better Energy-Delay-Product. Meanwhile, ACCO improves ~20% in layer-fusion optimals compared to the SotA DF exploration toolchain. When jointly optimizing ST-CNN on the temporal and spatial dimension, ACCO's scheduling outcomes are on average 19x faster and 37x more energy-efficient than spatial DF schemes.
Published: 2024
Full Text: View/download PDF

10. Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling

Author: Sun, Jiacong, Houshmand, Pouya, and Verhelst, Marian
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Hardware Architecture, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: In-Memory Computing (IMC) has emerged as a promising paradigm for energy-efficient, throughput-efficient and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication technologies among published IMC realizations have made it difficult to grasp their relative strengths. Moreover, previous studies have primarily focused on exploring and benchmarking the peak performance of a single IMC macro rather than full system performance on real workloads. This paper aims to address the lack of a quantitative comparison of Analog In-Memory Computing (AIMC) and Digital In-Memory Computing (DIMC) processor architectures. We propose an analytical IMC performance model that is validated against published implementations and integrated into a system-level exploration framework for comprehensive performance assessments on different workloads with varying IMC configurations. Our experiments show that while DIMC generally has higher computational density than AIMC, AIMC with large macro sizes may have better energy efficiency than DIMC on convolutional-layers and pointwise-layers, which can exploit high spatial unrolling. On the other hand, DIMC with small macro size outperforms AIMC on depthwise-layers, which feature limited spatial unrolling opportunities inside a macro.
Published: 2024
Full Text: View/download PDF

11. PATRONoC: Parallel AXI Transport Reducing Overhead for Networks-on-Chip targeting Multi-Accelerator DNN Platforms at the Edge

Author: Jain, Vikram, Cavalcante, Matheus, Bruschi, Nazareno, Rogenmoser, Michael, Benz, Thomas, Kurth, Andreas, Rossi, Davide, Benini, Luca, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture
Abstract: Emerging deep neural network (DNN) applications require high-performance multi-core hardware acceleration with large data bursts. Classical network-on-chips (NoCs) use serial packet-based protocols suffering from significant protocol translation overheads towards the endpoints. This paper proposes PATRONoC, an open-source fully AXI-compliant NoC fabric to better address the specific needs of multi-core DNN computing platforms. Evaluation of PATRONoC in a 2D-mesh topology shows 34% higher area efficiency compared to a state-of-the-art classical NoC at 1 GHz. PATRONoC's throughput outperforms a baseline NoC by 2-8X on uniform random traffic and provides a high aggregated throughput of up to 350 GiB/s on synthetic and DNN workload traffic., Comment: Accepted and presented at 60th DAC
Published: 2023

12. Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference

Author: Risso, Matteo, Burrello, Alessio, Sarda, Giuseppe Maria, Benini, Luca, Macii, Enrico, Poncino, Massimo, Verhelst, Marian, and Pagliari, Daniele Jahier
Subjects: Computer Science - Machine Learning
Abstract: The need to execute Deep Neural Networks (DNNs) at low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chips (SoCs) encapsulating a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip, splitting individual layers and executing them in parallel, to reduce inference energy consumption or latency, while taking into account each accelerator's quantization precision to maintain accuracy. Pareto-optimal networks in the accuracy vs. energy or latency space are pursued for three popular dataset/DNN pairs, and deployed on the DIANA heterogeneous ultra-low power edge AI SoC. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings., Comment: Accepted at 2023 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)
Published: 2023

13. Benchmarking and modeling of analog and digital SRAM in-memory computing architectures

Author: Houshmand, Pouya, Sun, Jiacong, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Electrical Engineering and Systems Science - Image and Video Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: In-memory-computing is emerging as an efficient hardware paradigm for deep neural network accelerators at the edge, enabling to break the memory wall and exploit massive computational parallelism. Two design models have surged: analog in-memory-computing (AIMC) and digital in-memory-computing (DIMC), offering a different design space in terms of accuracy, efficiency and dataflow flexibility. This paper targets the fair comparison and benchmarking of both approaches to guide future designs, through a.) an overview of published architectures; b.) an analytical cost model for energy and throughput; c.) scheduling of workloads on a variety of modeled IMC architectures for end-to-end network efficiency analysis, offering valuable workload-hardware co-design insights.
Published: 2023

14. SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators

Author: Jung, Victor J. B., Symons, Arne, Mei, Linyan, Verhelst, Marian, and Benini, Luca
Subjects: Computer Science - Hardware Architecture, Computer Science - Artificial Intelligence
Abstract: To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule, however, SotA schedulers struggle to consistently provide optimum schedules in a reasonable time across all DNN-HW combinations. This paper proposes SALSA, a fast dual-engine scheduler to generate optimal execution schedules for both even and uneven mapping. We introduce a new strategy, combining exhaustive search with simulated annealing to address the dynamic nature of the loop ordering design space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA and Timeloop on 5 different DNNs, on average SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7x and 24x compared to LOMA and Timeloop, respectively., Comment: 5 pages, 6 figures, open-source at https://github.com/ZigZag-Project/zigzag
Published: 2023

15. NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and Systems

Author: Yik, Jason, Berghe, Korneel Van den, Blanken, Douwe den, Bouhadjar, Younes, Fabre, Maxime, Hueber, Paul, Ke, Weijie, Khoei, Mina A, Kleyko, Denis, Pacik-Nelson, Noah, Pierro, Alessandro, Stratmann, Philipp, Sun, Pao-Sheng Vincent, Tang, Guangzhi, Wang, Shenqi, Zhou, Biyan, Ahmed, Soikat Hasan, Joseph, George Vathakkattil, Leto, Benedetto, Micheli, Aurora, Mishra, Anurag Kumar, Lenz, Gregor, Sun, Tao, Ahmed, Zergham, Akl, Mahmoud, Anderson, Brian, Andreou, Andreas G., Bartolozzi, Chiara, Basu, Arindam, Bogdan, Petrut, Bohte, Sander, Buckley, Sonia, Cauwenberghs, Gert, Chicca, Elisabetta, Corradi, Federico, de Croon, Guido, Danielescu, Andreea, Daram, Anurag, Davies, Mike, Demirag, Yigit, Eshraghian, Jason, Fischer, Tobias, Forest, Jeremy, Fra, Vittorio, Furber, Steve, Furlong, P. Michael, Gilpin, William, Gilra, Aditya, Gonzalez, Hector A., Indiveri, Giacomo, Joshi, Siddharth, Karia, Vedant, Khacef, Lyes, Knight, James C., Kriener, Laura, Kubendran, Rajkumar, Kudithipudi, Dhireesha, Liu, Yao-Hong, Liu, Shih-Chii, Ma, Haoyuan, Manohar, Rajit, Margarit-Taulé, Josep Maria, Mayr, Christian, Michmizos, Konstantinos, Muir, Dylan, Neftci, Emre, Nowotny, Thomas, Ottati, Fabrizio, Ozcelikkale, Ayca, Panda, Priyadarshini, Park, Jongkil, Payvand, Melika, Pehle, Christian, Petrovici, Mihai A., Posch, Christoph, Renner, Alpha, Sandamirskaya, Yulia, Schaefer, Clemens JS, van Schaik, André, Schemmel, Johannes, Schmidgall, Samuel, Schuman, Catherine, Seo, Jae-sun, Sheik, Sadique, Shrestha, Sumit Bam, Sifalakis, Manolis, Sironi, Amos, Stewart, Matthew, Stewart, Kenneth, Stewart, Terrence C., Timcheck, Jonathan, Tömen, Nergis, Urgese, Gianvito, Verhelst, Marian, Vineyard, Craig M., Vogginger, Bernhard, Yousefzadeh, Amirreza, Zohora, Fatima Tuz, Frenkel, Charlotte, and Reddi, Vijay Janapa
Subjects: Computer Science - Artificial Intelligence
Abstract: Neuromorphic computing shows promise for advancing computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. Prior neuromorphic computing benchmark efforts have not seen widespread adoption due to a lack of inclusive, actionable, and iterative benchmark design and guidelines. To address these shortcomings, we present NeuroBench: a benchmark framework for neuromorphic computing algorithms and systems. NeuroBench is a collaboratively-designed effort from an open community of researchers across industry and academia, aiming to provide a representative structure for standardizing the evaluation of neuromorphic approaches. The NeuroBench framework introduces a common set of tools and systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent (algorithm track) and hardware-dependent (system track) settings. In this article, we outline tasks and guidelines for benchmarks across multiple application domains, and present initial performance baselines across neuromorphic and conventional approaches for both benchmark tracks. NeuroBench is intended to continually expand its benchmarks and features to foster and track the progress made by the research community., Comment: System track baselines added
Published: 2023

16. Real-Time Acoustic Perception for Automotive Applications

Author: Yin, Jun, Damiano, Stefano, Verhelst, Marian, van Waterschoot, Toon, and Guntoro, Andre
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years the automotive industry has been strongly promoting the development of smart cars, equipped with multi-modal sensors to gather information about the surroundings, in order to aid human drivers or make autonomous decisions. While the focus has mostly been on visual sensors, also acoustic events are crucial to detect situations that require a change in the driving behavior, such as a car honking, or the sirens of approaching emergency vehicles. In this paper, we summarize the results achieved so far in the Marie Sklodowska-Curie Actions (MSCA) European Industrial Doctorates (EID) project Intelligent Ultra Low-Power Signal Processing for Automotive (I-SPOT). On the algorithmic side, the I-SPOT Project aims to enable detecting, localizing and tracking environmental audio signals by jointly developing microphone array processing and deep learning techniques that specifically target automotive applications. Data generation software has been developed to cover the I-SPOT target scenarios and research challenges. This tool is currently being used to develop low-complexity deep learning techniques for emergency sound detection. On the hardware side, the goal impels workflows for hardware-algorithm co-design to ease the generation of architectures that are sufficiently flexible towards algorithmic evolutions without giving up on efficiency, as well as enable rapid feedback of hardware implications of algorithmic decision. This is pursued though a hierarchical workflow that breaks the hardware-algorithm design space into reasonable subsets, which has been tested for operator-level optimizations on state-of-the-art robust sound source localization for edge devices. Further, several open challenges towards an end-to-end system are clarified for the next stage of I-SPOT.
Published: 2023

17. TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge

Author: Jain, Vikram, Giraldo, Sebastian, De Roose, Jaro, Mei, Linyan, Boons, Bert, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture
Abstract: Extreme edge devices or Internet-of-thing nodes require both ultra-low power always-on processing as well as the ability to do on-demand sampling and processing. Moreover, support for IoT applications like voice recognition, machine monitoring, etc., requires the ability to execute a wide range of ML workloads. This brings challenges in hardware design to build flexible processors operating in ultra-low power regime. This paper presents TinyVers, a tiny versatile ultra-low power ML system-on-chip to enable enhanced intelligence at the Extreme Edge. TinyVers exploits dataflow reconfiguration to enable multi-modal support and aggressive on-chip power management for duty-cycling to enable smart sensing applications. The SoC combines a RISC-V host processor, a 17 TOPS/W dataflow reconfigurable ML accelerator, a 1.7 $\mu$W deep sleep wake-up controller, and an eMRAM for boot code and ML parameter retention. The SoC can perform up to 17.6 GOPS while achieving a power consumption range from 1.7 $\mu$W-20 mW. Multiple ML workloads aimed for diverse applications are mapped on the SoC to showcase its flexibility and efficiency. All the models achieve 1-2 TOPS/W of energy efficiency with power consumption below 230 $\mu$W in continuous operation. In a duty-cycling use case for machine monitoring, this power is reduced to below 10 $\mu$W., Comment: Accepted in IEEE Journal of Solid-State Circuits
Published: 2023
Full Text: View/download PDF

18. Towards Heterogeneous Multi-core Accelerators Exploiting Fine-grained Scheduling of Layer-Fused Deep Neural Networks

Author: Symons, Arne, Mei, Linyan, Colleman, Steven, Houshmand, Pouya, Karl, Sebastian, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture
Abstract: To keep up with the ever-growing performance demand of neural networks, specialized hardware (HW) accelerators are shifting towards multi-core and chiplet architectures. So far, these multi-accelerator systems exploit the increased parallelism by pipelining different NN layers across input batches on different cores to increase throughput. Yet, when pursuing this with non-batched layer-by-layer scheduling of latency-critical applications, this fails to fully exploit the available HW resources towards energy-efficient execution at the edge. This work, therefore, enables fine-grained depth-first scheduling of layer-fused DNNs onto multi-core architectures through an open-source modeling framework called Stream. Stream is capable of representing a wide range of scheduling granularities and HW architectures and optimizes execution schedules towards minimal energy, minimal latency and/or minimal memory footprint for constrained edge devices. We validate against three SotA HW implementations employing layer-fused scheduling showing tight matching with measured efficiencies. Using Stream in further explorations, we demonstrate that high-level architectural decisions greatly impact hardware efficiency under the fine-grained scheduling paradigm, reducing the energy-delay product from 2.4x for single-core architectures to up to 30x for heterogeneous multi-core architectures compared to the traditional scheduling at layer granularity., Comment: 9 pages + references, 15 figures
Published: 2022

19. DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling

Author: Mei, Linyan, Goetschalckx, Koen, Symons, Arne, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: DNN workloads can be scheduled onto DNN accelerators in many different ways: from layer-by-layer scheduling to cross-layer depth-first scheduling (a.k.a. layer fusion, or cascaded execution). This results in a very broad scheduling space, with each schedule leading to varying hardware (HW) costs in terms of energy and latency. To rapidly explore this vast space for a wide variety of hardware architectures, analytical cost models are crucial to estimate scheduling effects on the HW level. However, state-of-the-art cost models are lacking support for exploring the complete depth-first scheduling space, for instance focusing only on activations while ignoring weights, or modeling only DRAM accesses while overlooking on-chip data movements. These limitations prevent researchers from systematically and accurately understanding the depth-first scheduling space. After formalizing this design space, this work proposes a unified modeling framework, DeFiNES, for layer-by-layer and depth-first scheduling to fill in the gaps. DeFiNES enables analytically estimating the hardware cost for possible schedules in terms of both energy and latency, while considering data access at every memory level. This is done for each schedule and HW architecture under study by optimally choosing the active part of the memory hierarchy per unique combination of operand, layer, and feature map tile. The hardware costs are estimated, taking into account both data computation and data copy phases. The analytical cost model is validated against measured data from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-end neural network level. A comparison with generalized state-of-the-art demonstrates up to 10X better solutions found with DeFiNES., Comment: Accepted by HPCA 2023
Published: 2022

20. DPU-v2: Energy-efficient execution of irregular directed acyclic graphs

Author: Shah, Nimish, Meert, Wannes, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture
Abstract: A growing number of applications like probabilistic machine learning, sparse linear algebra, robotic navigation, etc., exhibit irregular data flow computation that can be modeled with directed acyclic graphs (DAGs). The irregularity arises from the seemingly random connections of nodes, which makes the DAG structure unsuitable for vectorization on CPU or GPU. Moreover, the nodes usually represent a small number of arithmetic operations that cannot amortize the overhead of launching tasks/kernels for each node, further posing challenges for parallel execution. To enable energy-efficient execution, this work proposes DAG processing unit (DPU) version 2, a specialized processor architecture optimized for irregular DAGs with static connectivity. It consists of a tree-structured datapath for efficient data reuse, a customized banked register file, and interconnects tuned to support irregular register accesses. DPU-v2 is utilized effectively through a targeted compiler that systematically maps operations to the datapath, minimizes register bank conflicts, and avoids pipeline hazards. Finally, a design space exploration identifies the optimal architecture configuration that minimizes the energy-delay product. This hardware-software co-optimization approach results in a speedup of 1.4$\times$, 3.5$\times$, and 14$\times$ over a state-of-the-art DAG processor ASIP, a CPU, and a GPU, respectively, while also achieving a lower energy-delay product. In this way, this work takes an important step toward enabling an embedded execution of emerging DAG workloads.
Published: 2022

21. Hardware-aware mobile building block evaluation for computer vision

Author: Bonnaerens, Maxim, Freiberger, Matthias, Verhelst, Marian, and Dambre, Joni
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this work we propose a methodology to accurately evaluate and compare the performance of efficient neural network building blocks for computer vision in a hardware-aware manner. Our comparison uses pareto fronts based on randomly sampled networks from a design space to capture the underlying accuracy/complexity trade-offs. We show that our approach allows to match the information obtained by previous comparison paradigms, but provides more insights in the relationship between hardware cost and accuracy. We use our methodology to analyze different building blocks and evaluate their performance on a range of embedded hardware platforms. This highlights the importance of benchmarking building blocks as a preselection step in the design process of a neural network. We show that choosing the right building block can speed up inference by up to a factor of 2x on specific hardware ML accelerators.
Published: 2022

22. Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention

Author: Jelčicová, Zuzana and Verhelst, Marian
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Multi-head self-attention forms the core of Transformer networks. However, their quadratically growing complexity with respect to the input sequence length impedes their deployment on resource-constrained edge devices. We address this challenge by proposing a dynamic pruning method, which exploits the temporal stability of data across tokens to reduce inference cost. The threshold-based method only retains significant differences between the subsequent tokens, effectively reducing the number of multiply-accumulates, as well as the internal tensor data sizes. The approach is evaluated on the Google Speech Commands Dataset for keyword spotting, and the performance is compared against the baseline Keyword Transformer. Our experiments show that we can reduce ~80% of operations while maintaining the original 98.4% accuracy. Moreover, a reduction of ~87-94% operations can be achieved when only degrading the accuracy by 1-4%, speeding up the multi-head self-attention inference by a factor of ~7.5-16.
Published: 2022

23. DPU: DAG Processing Unit for Irregular Graphs with Precision-Scalable Posit Arithmetic in 28nm

Author: Shah, Nimish, Olascoaga, Laura Isabel Galindez, Zhao, Shirui, Meert, Wannes, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing, Electrical Engineering and Systems Science - Systems and Control
Abstract: Computation in several real-world applications like probabilistic machine learning, sparse linear algebra, and robotic navigation, can be modeled as irregular directed acyclic graphs (DAGs). The irregular data dependencies in DAGs pose challenges to parallel execution on general-purpose CPUs and GPUs, resulting in severe under-utilization of the hardware. This paper proposes DPU, a specialized processor designed for the efficient execution of irregular DAGs. The DPU is equipped with parallel compute units that execute different subgraphs of a DAG independently. The compute units can synchronize within a cycle using a hardware-supported synchronization primitive, and communicate via an efficient interconnect to a global banked scratchpad. Furthermore, a precision-scalable posit arithmetic unit is developed to enable application-dependent precision. The DPU is taped-out in 28nm CMOS, achieving a speedup of 5.1$\times$ and 20.6$\times$ over state-of-the-art CPU and GPU implementations on DAGs of sparse linear algebra and probabilistic machine learning workloads. This performance is achieved while operating at a power budget of 0.23W, as opposed to 55W and 98W of the CPU and GPU, resulting in a peak efficiency of 538 GOPS/W with DPU, which is 1350$\times$ and 9000$\times$ higher than the CPU and GPU, respectively. Thus, with specialized architecture, DPU enables low-power execution of irregular DAG workloads., Comment: IEEE Journal of Solid-State Circuits
Published: 2021
Full Text: View/download PDF

24. Taxonomy and Benchmarking of Precision-Scalable MAC Arrays Under Enhanced DNN Dataflow Representation

Author: Ibrahim, Ehab M., Mei, Linyan, and Verhelst, Marian
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Reduced-precision and variable-precision multiply-accumulate (MAC) operations provide opportunities to significantly improve energy efficiency and throughput of DNN accelerators with no/limited algorithmic performance loss, paving a way towards deploying AI applications on resource-constraint edge devices. Accordingly, various precision-scalable MAC array (PSMA) architectures were proposed recently. However, it is difficult to make a fair comparison between those alternatives, as each proposed PSMA is demonstrated in different systems and technologies. This work aims to provide a clear view of the design space of PSMA and offer insights for selecting the optimal architectures based on designers' needs. First, we introduce a precision-enhanced for-loop representation for DNN dataflows. Next, we use this new representation towards a comprehensive PSMA taxonomy, capable of systematically covering most prominent state-of-the-art PSMAs, as well as uncovering new PSMA architectures. Following that, we build a highly parameterized PSMA template that can be design-time configured into a huge subset of the design space spanned by the taxonomy. This allows to fairly and thoroughly benchmark 72 different PSMA architectures. We perform such studies in 28nm technology targeting run-time precision scalability from 8 to 2 bits, operating at 200 MHz and 1 GHz. Analyzing resulting energy and area breakdowns reveals key design guidelines for PSMA architectures.
Published: 2021
Full Text: View/download PDF

25. GRAPHOPT: constrained-optimization-based parallelization of irregular graphs

Author: Shah, Nimish, Meert, Wannes, and Verhelst, Marian
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Sparse, irregular graphs show up in various applications like linear algebra, machine learning, engineering simulations, robotic control, etc. These graphs have a high degree of parallelism, but their execution on parallel threads of modern platforms remains challenging due to the irregular data dependencies. The execution performance can be improved by efficiently partitioning the graphs such that the communication and thread synchronization overheads are minimized without hurting the utilization of the threads. To achieve this, this paper proposes GRAPHOPT, a tool that models the graph parallelization as a constrained optimization problem and uses the open Google OR-Tools solver to find good partitions. Several scalability techniques are developed to handle large real-world graphs with millions of nodes and edges. Extensive experiments are performed on the graphs of sparse matrix triangular solves (linear algebra) and sum-product networks (machine learning), respectively, showing a mean speedup of 2.0X and 1.8X over previous state-of-the-art libraries, demonstrating the effectiveness of the constrained-optimization-based graph parallelization.
Published: 2021
Full Text: View/download PDF

26. Acceleration of probabilistic reasoning through custom processor architecture

Author: Shah, Nimish, Olascoaga, Laura I. Galindez, Meert, Wannes, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models by combining Deep Learning with probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has an optimized datapath architecture and memory hierarchy optimized for sum-product networks execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.
Published: 2021
Full Text: View/download PDF

27. ProbLP: A framework for low-precision probabilistic inference

Author: Shah, Nimish, Olascoaga, Laura I. Galindez, Meert, Wannes, and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning, Mathematics - Numerical Analysis
Abstract: Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.
Published: 2021
Full Text: View/download PDF

28. Feed-Forward On-Edge Fine-tuning Using Static Synthetic Gradient Modules

Author: Neven, Robby, Verhelst, Marian, Tuytelaars, Tinne, and Goedemé, Toon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Training deep learning models on embedded devices is typically avoided since this requires more memory, computation and power over inference. In this work, we focus on lowering the amount of memory needed for storing all activations, which are required during the backward pass to compute the gradients. Instead, during the forward pass, static Synthetic Gradient Modules (SGMs) predict gradients for each layer. This allows training the model in a feed-forward manner without having to store all activations. We tested our method on a robot grasping scenario where a robot needs to learn to grasp new objects given only a single demonstration. By first training the SGMs in a meta-learning manner on a set of common objects, during fine-tuning, the SGMs provided the model with accurate gradients to successfully learn to grasp new objects. We have shown that our method has comparable results to using standard backpropagation.
Published: 2020

29. ZigZag: A Memory-Centric Rapid DNN Accelerator Design Space Exploration Framework

Author: Mei, Linyan, Houshmand, Pouya, Jain, Vikram, Giraldo, Sebastian, and Verhelst, Marian
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, C.1.4, C.3, C.4
Abstract: Building efficient embedded deep learning systems requires a tight co-design between DNN algorithms, memory hierarchy, and dataflow. However, owing to the large degrees of freedom in the design space, finding an optimal solution through the implementation of individual design points becomes infeasible. Recently, several estimation frameworks for fast design space exploration (DSE) have emerged, yet they either suffer from long runtimes or a limited exploration space. This work introduces ZigZag, a memory-centric rapid DNN accelerator DSE framework which extends the DSE with uneven mapping opportunities, in which operands at shared memory levels are no longer bound to use the same memory levels for each loop index. For this, ZigZag uses a memory-centric nested-for-loop format as a uniform representation to integrate algorithm, accelerator, and algorithm-to-accelerator mapping, and consists of three key components: 1) a latency-enhanced analytical Hardware Cost Estimator, 2) a Temporal Mapping Generator that supports even/uneven scheduling on any type of memory hierarchy, and 3) an Architecture Generator that explores the whole memory hierarchy design space. Benchmarking experiments against existing frameworks, together with three case studies at different design abstraction levels show the strength of ZigZag. Up to 33% more energy-efficient solutions are found by introducing ZigZag's uneven scheduling opportunities., Comment: 14 pages, 20 figures. Source code is available at https://github.com/ZigZag-Project/zigzag
Published: 2020

30. Benchmarking TinyML Systems: Challenges and Direction

Author: Banbury, Colby R., Reddi, Vijay Janapa, Lam, Max, Fu, William, Fazel, Amin, Holleman, Jeremy, Huang, Xinyuan, Hurtado, Robert, Kanter, David, Lokhmotov, Anton, Patterson, David, Pau, Danilo, Seo, Jae-sun, Sieracki, Jeff, Thakker, Urmish, Verhelst, Marian, and Yadav, Poonam
Subjects: Computer Science - Performance, Computer Science - Machine Learning
Abstract: Recent advancements in ultra-low-power machine learning (TinyML) hardware promises to unlock an entirely new class of smart applications. However, continued progress is limited by the lack of a widely accepted benchmark for these systems. Benchmarking allows us to measure and thereby systematically compare, evaluate, and improve the performance of systems and is therefore fundamental to a field reaching maturity. In this position paper, we present the current landscape of TinyML and discuss the challenges and direction towards developing a fair and useful hardware benchmark for TinyML workloads. Furthermore, we present our four benchmarks and discuss our selection methodology. Our viewpoints reflect the collective thoughts of the TinyMLPerf working group that is comprised of over 30 organizations., Comment: 6 pages, 1 figure, 3 tables
Published: 2020

31. A multi-layered energy consumption model for smart wireless acoustic sensor networks

Author: Dekkers, Gert, Rosas, Fernando, Lauwereins, Steven, Rajendran, Sreeraj, Pollin, Sofie, Vanrumste, Bart, van Waterschoot, Toon, Verhelst, Marian, and Karsmakers, Peter
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Smart sensing is expected to become a pervasive technology in smart cities and environments of the near future. These services are improving their capabilities due to integrated devices shrinking in size while maintaining their computational power, which can run diverse Machine Learning algorithms and achieve high performance in various data-processing tasks. One attractive sensor modality to be used for smart sensing are acoustic sensors, which can convey highly informative data while keeping a moderate energy consumption. Unfortunately, the energy budget of current wireless sensor networks is usually not enough to support the requirements of standard microphones. Therefore, energy efficiency needs to be increased at all layers --- sensing, signal processing and communication --- in order to bring wireless smart acoustic sensors into the market. To help to attain this goal, this paper introduces WASN-EM: an energy consumption model for wireless acoustic sensors networks (WASN), whose aim is to aid in the development of novel techniques to increase the energy-efficient of smart wireless acoustic sensors. This model provides a first step of exploration prior to custom design of a smart wireless acoustic sensor, and also can be used to compare the energy consumption of different protocols.
Published: 2018

32. BinarEye: An Always-On Energy-Accuracy-Scalable Binary CNN Processor With All Memory On Chip in 28nm CMOS

Author: Moons, Bert, Bankman, Daniel, Yang, Lita, Murmann, Boris, and Verhelst, Marian
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Neural and Evolutionary Computing
Abstract: This paper introduces BinarEye: a digital processor for always-on Binary Convolutional Neural Networks. The chip maximizes data reuse through a Neuron Array exploiting local weight Flip-Flops. It stores full network models and feature maps and hence requires no off-chip bandwidth, which leads to a 230 1b-TOPS/W peak efficiency. Its 3 levels of flexibility - (a) weight reconfiguration, (b) a programmable network depth and (c) a programmable network width - allow trading energy for accuracy depending on the task's requirements. BinarEye's full system input-to-label energy consumption ranges from 14.4uJ/f for 86% CIFAR-10 and 98% owner recognition down to 0.92uJ/f for 94% face detection at up to 1700 frames per second. This is 3-12-70x more efficient than the state-of-the-art at on-par accuracy., Comment: Presented at the 2018 IEEE Custom Integrated Circuits Conference (CICC). Presentation is available here: https://www.researchgate.net/publication/324452819_Presentation_on_Binareye_at_CICC
Published: 2018

33. Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

Author: Van keirsbilck, Matthijs, Moons, Bert, and Verhelst, Marian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Today's Automatic Speech Recognition systems only rely on acoustic signals and often don't perform well under noisy conditions. Performing multi-modal speech recognition - processing acoustic speech signals and lip-reading video simultaneously - significantly enhances the performance of such systems, especially in noisy environments. This work presents the design of such an audio-visual system for Automated Speech Recognition, taking memory and computation requirements into account. First, a Long-Short-Term-Memory neural network for acoustic speech recognition is designed. Second, Convolutional Neural Networks are used to model lip-reading features. These are combined with an LSTM network to model temporal dependencies and perform automatic lip-reading on video. Finally, acoustic-speech and visual lip-reading networks are combined to process acoustic and visual features simultaneously. An attention mechanism ensures performance of the model in noisy environments. This system is evaluated on the TCD-TIMIT 'lipspeaker' dataset for audio-visual phoneme recognition with clean audio and with additive white noise at an SNR of 0dB. It achieves 75.70% and 58.55% phoneme accuracy respectively, over 14 percentage points better than the state-of-the-art for all noise levels., Comment: Tech. report
Published: 2018

34. Minimum Energy Quantized Neural Networks

Author: Moons, Bert, Goetschalckx, Koen, Van Berckelaer, Nick, and Verhelst, Marian
Subjects: Computer Science - Neural and Evolutionary Computing, Computer Science - Hardware Architecture, Computer Science - Learning
Abstract: This work targets the automated minimum-energy optimization of Quantized Neural Networks (QNNs) - networks using low precision weights and activations. These networks are trained from scratch at an arbitrary fixed point precision. At iso-accuracy, QNNs using fewer bits require deeper and wider network architectures than networks using higher precision operators, while they require less complex arithmetic and less bits per weights. This fundamental trade-off is analyzed and quantified to find the minimum energy QNN for any benchmark and hence optimize energy-efficiency. To this end, the energy consumption of inference is modeled for a generic hardware platform. This allows drawing several conclusions across different benchmarks. First, energy consumption varies orders of magnitude at iso-accuracy depending on the number of bits used in the QNN. Second, in a typical system, BinaryNets or int4 implementations lead to the minimum energy solution, outperforming int8 networks up to 2-10x at iso-accuracy. All code used for QNN training is available from https://github.com/BertMoons., Comment: preprint for work presented at the 51st Asilomar Conference on Signals, Systems and Computers
Published: 2017

35. A 0.3-2.6 TOPS/W Precision-Scalable Processor for Real-Time Large-Scale ConvNets

Author: Moons, Bert and Verhelst, Marian
Subjects: Computer Science - Hardware Architecture
Abstract: A low-power precision-scalable processor for ConvNets or convolutional neural networks (CNN) is implemented in a 40nm technology. Its 256 parallel processing units achieve a peak 102GOPS running at 204MHz. To minimize energy consumption while maintaining throughput, this works is the first to both exploit the sparsity of convolutions and to implement dynamic precision-scalability enabling supply- and energy scaling. The processor is fully C-programmable, consumes 25-288mW at 204 MHz and scales efficiency from 0.3-2.6 real TOPS/W. This system hereby outperforms the state-of-the-art up to 3.9x in energy efficiency., Comment: Published at the Symposium on VLSI Circuits, 2016, Honolulu, HI, US
Published: 2016

36. Energy-Efficient ConvNets Through Approximate Computing

Author: Moons, Bert, De Brabandere, Bert, Van Gool, Luc, and Verhelst, Marian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently ConvNets or convolutional neural networks (CNN) have come up as state-of-the-art classification and detection algorithms, achieving near-human performance in visual detection. However, ConvNet algorithms are typically very computation and memory intensive. In order to be able to embed ConvNet-based classification into wearable platforms and embedded systems such as smartphones or ubiquitous electronics for the internet-of-things, their energy consumption should be reduced drastically. This paper proposes methods based on approximate computing to reduce energy consumption in state-of-the-art ConvNet accelerators. By combining techniques both at the system- and circuit level, we can gain energy in the systems arithmetic: up to 30x without losing classification accuracy and more than 100x at 99% classification accuracy, compared to the commonly used 16-bit fixed point number format., Comment: Published in IEEE Winter Conference on Applications of Computer Vision (WACV 2016)
Published: 2016
Full Text: View/download PDF

37. Understanding interdependency through complex information sharing

Author: Rosas, Fernando, Ntranos, Vasilis, Ellison, Christopher J., Pollin, Sofie, and Verhelst, Marian
Subjects: Computer Science - Information Theory
Abstract: The interactions between three or more random variables are often nontrivial, poorly understood, and yet, are paramount for future advances in fields such as network information theory, neuroscience, genetics and many others. In this work, we propose to analyze these interactions as different modes of information sharing. Towards this end, we introduce a novel axiomatic framework for decomposing the joint entropy, which characterizes the various ways in which random variables can share information. The key contribution of our framework is to distinguish between interdependencies where the information is shared redundantly, and synergistic interdependencies where the sharing structure exists in the whole but not between the parts. We show that our axioms determine unique formulas for all the terms of the proposed decomposition for a number of cases of interest. Moreover, we show how these results can be applied to several network information theory problems, providing a more intuitive understanding of their fundamental limits., Comment: 39 pages, 4 figures
Published: 2015
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

37 results on '"Verhelst, Marian"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources