Journal: ieee transactions on circuits & systems. part i: regular papers / Publication Type: Periodicals / Publication Year Range: Last 10 years / Search Limiters: Available in Library Collection / Topic: computer architecture and convolutional neural networks - Searchworks@Jio Institute Digital Library Search Results

Showing total 13 results

1. IECA: An In-Execution Configuration CNN Accelerator With 30.55 GOPS/mm² Area Efficiency.

Author: Huang, Boming, Huan, Yuxiang, Chu, Haoming, Xu, Jiawei, Liu, Lizheng, Zheng, Lirong, and Zou, Zhuo
Subjects: *CONVOLUTIONAL neural networks, *FINITE state machines, *TILES
Abstract: It remains challenging for a Convolutional Neural Network (CNN) accelerator to maintain high hardware utilization and low processing latency with restricted on-chip memory. This paper presents an In-Execution Configuration Accelerator (IECA) that realizes an efficient control scheme, exploring architectural data reuse, unified in-execution controlling, and pipelined latency hiding to minimize configuration overhead out of the computation scope. The proposed IECA achieves row-wise convolution with tiny distributed buffers and reduces the size of total on-chip memory by removing 40% of redundant memory storage with shared delay chains. By exploiting a reconfigurable Sequence Mapping Table (SMT) and Finite State Machine (FSM) control, the chip realizes cycle-accurate Processing Element (PE) control, automatic loop tiling and latency hiding without extra time slots for pre-configuration. Evaluated on AlexNet and VGG-16, the IECA retains over 97.3% PE utilization and over 95.6% memory access time hiding on average. The chip is designed and fabricated in a UMC 55-nm process running at a frequency of 250 MHz and achieves an area efficiency of 30.55 GOPS/mm² and 0.244 GOPS/KGE (kilo-gate-equivalent), which makes an over 2.0× and 2.1× improvement, respectively, compared with that of previous related works. Implementation of the IEC control scheme uses only a 0.55% area of the 2.75 mm² core. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

2. Dynamic Dataflow Scheduling and Computation Mapping Techniques for Efficient Depthwise Separable Convolution Acceleration.

Author: Li, Baoting, Wang, Hang, Zhang, Xuchong, Ren, Jie, Liu, Longjun, Sun, Hongbin, and Zheng, Nanning
Subjects: *CONVOLUTIONAL neural networks, *FIELD programmable gate arrays, *DESIGN techniques
Abstract: Depthwise separable convolution (DSC) has become one of the essential structures for lightweight convolutional neural networks. Nevertheless, its hardware architecture has not received much attention. Several previous hardware designs incur either high off-chip memory traffic or large on-chip memory usage, and hence have deficiency in terms of hardware efficiency as well as performance. This paper proposes two efficient dynamic design techniques, i.e. adaptive row-based dataflow scheduling and adaptive computation mapping, to achieve a much better trade-off between hardware efficiency and performance for DSC-based lightweight CNN accelerator. The effectiveness and efficiency of the proposed dynamic design techniques have been extensively evaluated using six DSC-based lightweight CNNs. Compared with the reference architectures, the simulation results show the proposed architectural techniques can at least reduce on-chip buffer size by 50.4% and improve the performance of convolution calculation by 1.18 × while maintaining the minimum off-chip memory traffic. MobileNetV2 is implemented on Zynq UltraScale+ ZCU102 SoC FPGA, and the results show the proposed accelerator can achieve 381.7 frames per second (fps), which is 1.43 × of the reference design, and it can save about 36.3% on-chip buffer size compared with the reference design, while maintaining the same off-chip memory traffic. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

3. High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization.

Author: Yuan, Tian, Liu, Weiqiang, Han, Jie, and Lombardi, Fabrizio
Subjects: *CONVOLUTIONAL neural networks, *HIGH performance computing, *FIELD programmable gate arrays, *PARALLEL processing, *IMAGE recognition (Computer vision), *SIGNAL convolution
Abstract: Convolutional neural networks (CNNs) have been widely used in image classification and recognition due to their effectiveness; however, CNNs use a large volume of weight data that is difficult to store in on-chip memory of embedded designs. Pruning can compress the CNN model at a small accuracy loss; however, a pruned CNN model operates slower when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed; a deep neural network (DNN) model is divided into “no-pruning layers (NP-layers)” and “pruning layers (P-layers)”. A NP-layer has a regular weights distribution for parallel computing and high performance. A P-layer is irregular due to pruning, but it generates a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the NP-layers. A shift-accumulator based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the P-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing a NP-P hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a 27.5× compression ratio is achieved with 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications, this is 1.8× superior than the state-of-the-art design found in the technical literature. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

4. A Precision-Scalable Energy-Efficient Convolutional Neural Network Accelerator.

Author: Liu, Wenjian, Lin, Jun, and Wang, Zhongfeng
Subjects: *CONVOLUTIONAL neural networks, *COMPUTATIONAL complexity, *COMPUTER architecture, *ENERGY consumption
Abstract: Quantization is a promising technique to compress the size of Convolutional Neural Network (CNN) models. Recently, various precision-scalable designs have been presented to reduce the computational complexity in CNNs. However, most of them adopt straightforward calculation scheme to implement the CNN, which causes high bandwidth requirement and low hardware utilization efficiency. This paper proposes a new precision-scalable architecture which can fully reduce the computational complexity in CNN inference and meanwhile has a finely simplified calculation scheme. Based on the proposed scheme, a well-optimized multiplier called Compositional Processing Element (C-PE) is devised. Compared with the previous multipliers, the new C-PE requires less area and power. Furthermore, two levels of optimization are introduced to the design to relieve the bandwidth problem and increase the hardware utilization efficiency. Implemented under the TSMC 90nm CMOS technology, the whole design achieves 6-68.1 fps in various precisions on VGG16 benchmark and a 49.8TOPS/W energy efficiency at 500MHz when scaled to 28nm, which is much better than previous precision-scalable ones. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

5. FPAP: A Folded Architecture for Energy-Quality Scalable Convolutional Neural Networks.

Author: Wang, Yizhi, Lin, Jun, and Wang, Zhongfeng
Subjects: *ARTIFICIAL neural networks, *MACRO processors, *FINITE impulse response filters
Abstract: Emerging convolutional neural networks (CNNs) tend to be designed with varied per-layer data widths and sparse representations. However, these two features, which bring many redundant computations, have not been exploited simultaneously in existing hardware architectures for CNNs. This paper proposes an energy-quality scalable architecture, namely folded precision-adjustable processor (FPAP), to eliminate all computational redundancies by using folding techniques. On one hand, FPAP decomposes the dominant multiply-accumulate (MAC) operations into multiple adds and folds them into single arithmetic unit. Only effective adds (or part of them) are then calculated serially. Thus, FPAP can adapt to different per-layer data widths and enable precision-adjustable approximate computing. Particularly, FPAP adaptively selects either activation or weight to be decomposed in every single MAC to minimize the total number of adds and clock cycles. On the other hand, a 1-D convolution is undertaken by a multi-tap transposed finite impulse response (FIR) filter, which is folded into one tap to skip MACs with zero weights or activations. Besides, a judicious delay element remapping scheme and a novel genetic algorithm-based kernel reallocation scheme, are developed to reduce the power consumption in a folded FIR filter and mitigate the load imbalance issue caused by irregular sparsity, respectively. With all these optimizations, FPAP is able to reach comparable or even faster processing speed over the corresponding unfolded design in sparse CNNs while consuming smaller area. Experimental results on real CNN models demonstrate that FPAP can scale its energy efficiency from 4.28 to 23.63 TOP/s/W, and area efficiency from 37.79 to 164.15 GOP/s/mm², respectively, under the TSMC 28-nm HPC CMOS technology. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

6. An Architecture to Accelerate Convolution in Deep Neural Networks.

Author: Ardakani, Arash, Condo, Carlo, Ahmadi, Mehdi, and Gross, Warren J.
Subjects: *ARTIFICIAL neural networks, *PATTERN recognition systems, *COMPUTER architecture
Abstract: In the past few years, the demand for real-time hardware implementations of deep neural networks (DNNs), especially convolutional neural networks (CNNs), has dramatically increased, thanks to their excellent performance on a wide range of recognition and classification tasks. When considering real-time action recognition and video/image classification systems, latency is of paramount importance. Therefore, applications strive to maximize the accuracy while keeping the latency under a given application-specific maximum: in most cases, this threshold cannot exceed a few hundred milliseconds. Until now, the research on DNNs has mainly focused on achieving a better classification or recognition accuracy, whereas very few works in literature take into account the computational complexity of the model. In this paper, we propose an efficient computational method, which is inspired by a computational core of fully connected neural networks, to process convolutional layers of state-of-the-art deep CNNs within strict latency requirements. To this end, we implemented our method customized for VGG and VGG-based networks which have shown state-of-the-art performance on different classification/recognition data sets. The implementation results in 65-nm CMOS technology show that the proposed accelerator can process convolutional layers of VGGNet up to 9.5 times faster than state-of-the-art accelerators reported to-date while occupying 3.5 mm². [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

7. A High Performance Multi-Bit-Width Booth Vector Systolic Accelerator for NAS Optimized Deep Learning Neural Networks.

Author: Huang, Mingqiang, Liu, Yucen, Man, Changhai, Li, Kai, Cheng, Quan, Mao, Wei, and Yu, Hao
Subjects: *ARTIFICIAL neural networks, *DEEP learning, *CONVOLUTIONAL neural networks, *FIELD programmable gate arrays, *MATRIX multiplications, *ELECTRONIC data processing
Abstract: Multi-bit-width convolutional neural network (CNN) maintains the balance between network accuracy and hardware efficiency, thus enlightening a promising method for accurate yet energy-efficient edge computing. In this work, we develop state-of-the-art multi-bit-width accelerator for NAS Optimized deep learning neural networks. To efficiently process the multi-bit-width network inferencing, multi-level optimizations have been proposed. Firstly, differential Neural Architecture Search (NAS) method is adopted for the high accuracy multi-bit-width network generation. Secondly, hybrid Booth based multi-bit-width multiply-add-accumulation (MAC) unit is developed for data processing. Thirdly, vector systolic array is proposed for effectively accelerating the matrix multiplications. With vector-style systolic dataflow, both the processing time and logic resources consumption can be reduced when compared with the classical systolic array. Finally, the proposed multi-bit-width CNN acceleration scheme has been practically deployed on FPGA platform of Xilinx ZCU102. Average performance on accelerating the full NAS optimized VGG16 network is 784.2 GOPS, and peak performance of the convolutional layer can reach as high as 871.26 GOPS for INT8, 1676.96 GOPS for INT4, and 2863.29 GOPS for INT2 respectively, which is among the best results in previous CNN accelerator benchmarks. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

8. Hybrid Stochastic-Binary Computing for Low-Latency and High-Precision Inference of CNNs.

Author: Chen, Zhiyuan, Ma, Yufei, and Wang, Zhongfeng
Subjects: *BINARY sequences
Abstract: The appealing property of low area, low power, and high bit error tolerance has made Stochastic Computing (SC) a promising alternative to conventional binary arithmetic for many computation intensive tasks, e.g., convolutional neural networks (CNNs). However, current SC-based CNN accelerators suffer from the intrinsic computation error and exponentially growing latency. In this work, we optimize both the architecture of SC multiply-and-accumulate (MAC) unit and the overall acceleration strategy of CNN accelerator to favor SC. A low-complexity bit-stream-extending method is proposed to suppress the computation error of SC and ensure the trained fix-point model can be deployed into SC-based hardware without fine-tuning. Besides, distribution-determined partition scheme is developed to design hybrid stochastic-binary computing (SBC) MAC unit which boosts the processing of bit streams at a minimum overhead. For the overall accelerator, the SBC-based MAC array is extended to reuse hardware resources and improve throughput, since the judiciously chosen loop unrolling strategy can better benefit SC operations. The proposed CNN accelerator with extended SBC-MAC array is synthesized and validated using TSMC 28nm CMOS on several representative CNNs, targeted at ImageNet dataset. Compared with precise binary implementation, our proposed design gains 44% area reduction and 50% power saving but induces only 4% additional computation latency and 0.5% accuracy degradation. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

9. A High-Level Modeling Framework for Estimating Hardware Metrics of CNN Accelerators.

Author: Juracy, Leonardo Rezende, Moreira, Matheus Trevisan, de Morais Amory, Alexandre, Hampel, Alexandre F., and Moraes, Fernando Gehm
Subjects: *CONVOLUTIONAL neural networks, *SPACE exploration
Abstract: GPUs became the reference platform for both training and inference phases of Convolutional Neural Networks (CNN) due to their tailored architecture to the CNN operators. However, GPUs are power-hungry architectures. A path to enable the deployment of CNNs in energy-constrained devices is adopting hardware accelerators for the inference phase. The design space exploration of CNNs using standard approaches, such as RTL, is limited due to their complexity. Thus, designers need frameworks enabling design space exploration that delivers accurate hardware estimation metrics to deploy CNNs. This work proposes a framework to explore CNNs design space, providing power, performance, and area (PPA) estimations. The heart of the framework is a system simulator. The system simulator front-end is TensorFlow, and the back-end is performance estimations obtained from the physical synthesis of hardware accelerators, not only from components like multipliers and adders. The first set of results evaluate the CNN accuracy using integer quantization, the accelerators PPA after physical synthesis, and the benefits of using a system simulator. These results allow a rich design space exploration, enabling selecting the best set of CNN parameters to meet the design constraints. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

10. CARLA: A Convolution Accelerator With a Reconfigurable and Low-Energy Architecture.

Author: Ahmadi, Mehdi, Vakili, Shervin, and Langlois, J. M. Pierre
Subjects: *CONVOLUTIONAL neural networks, *COMPUTER architecture, *APPLICATION-specific integrated circuits, *RANDOM access memory, *DEEP learning
Abstract: Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery–powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major problems have not yet been addressed effectively, particularly when the convolution layers have highly diverse structures: (1) minimizing energy-hungry off-chip DRAM data movements; (2) maximizing the utilization factor of processing resources to perform convolutions. This work thus proposes an energy-efficient architecture equipped with several optimized dataflows to support the structural diversity of modern CNNs. The proposed approach is evaluated on convolutional layers of VGGNet-16 and ResNet-50. Results show that the architecture achieves a Processing Element (PE) utilization factor of 98% for the majority of 3 × 3 and 1 × 1 convolutional layers, while limiting latency to 396.9 ms and 92.7 ms when performing convolutional layers of VGGNet-16 and ResNet-50, respectively. In addition, the proposed architecture benefits from the structured sparsity in ResNet-50 to reduce the latency to 42.5 ms when half of the channels are pruned. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

11. Hybrid Convolution Architecture for Energy-Efficient Deep Neural Network Processing.

Author: Kim, Suchang, Jo, Jihyuck, and Park, In-Cheol
Subjects: *COMPLEMENTARY metal oxide semiconductors, *SIGNAL convolution, *ENERGY consumption, *CONVOLUTIONAL neural networks
Abstract: Dynamic power is a major source of power dissipation for high speed designs. Domain isolation methodology is a recently-proposed technique for reducing dynamic power based on controlling the evaluation phase of dynamic logic (toggling control). This work demonstrates some design issues in the domain isolation methodology and explains why it is inefficient with pipelined systems. We propose fixes for its identified issues, which enables using the toggling control with pipelined systems in a more efficient way. A novel flow named “Power Reduction Flow” is proposed for reducing dynamic power of digital circuits. Our flow uses novel design analytical methods, novel “Dynamic Logic Modifier Flow”, and novel “Dynamic Logic Area Validation Flow” for reducing dynamic power with conditionally improving performance. The new design analytical methods are based on probability theory, SystemVerilog covergroups, and digital circuit modeling. A new event type perspective is also proposed to analyze designs to reduce dynamic power in them. Experimental results using TSMC 65 nm and low supply voltages show up to 59% power reduction compared to the original traditional techniques with improving circuit’s performance by 3 × of its original maximum operating frequency at the cost of an extra 12.3% increase in area. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

12. Time-Domain Computing in Memory Using Spintronics for Energy-Efficient Convolutional Neural Network.

Author: Zhang, Yue, Wang, Jinkai, Lian, Chenyu, Bai, Yining, Wang, Guanda, Zhang, Zhizhong, Zheng, Zhenyi, Chen, Lei, Zhang, Kun, Sirakoulis, Georgios, and Zhang, Youguang
Subjects: *CONVOLUTIONAL neural networks, *SPINTRONICS, *RANDOM access memory, *MEMORY, *MAGNETIC torque
Abstract: The data transfer bottleneck in Von Neumann architecture owing to the separation between processor and memory hinders the development of high-performance computing. The computing in memory (CIM) concept is widely considered as a promising solution for overcoming this issue. In this article, we present a time-domain CIM (TD-CIM) scheme using spintronics, which can be applied to construct the energy-efficient convolutional neural network (CNN). Basic Boolean logic operations are implemented through recording the bit-line output at different moments. A multi-addend addition mechanism is then introduced based on the TD-CIM circuit, which can eliminate the cascaded full adders. To further optimize the compatibility of TD-CIM circuit for CNN, we also propose a quantization method that transforms floating-point parameters of pre-trained CNN models into fixed-point parameters. Finally, we build a TD-CIM architecture integrating with a highly reconfigurable array of field-free spin-orbit torque magnetic random access memory (SOT-MRAM) and evaluate its benefits for the quantized CNN. By performing digit recognition with the MNIST dataset, we find that the delay and energy are respectively reduced by 1.2-2.7 times and 2.4 × 10³ - 1.1 × 10⁴ times compared with STT-CIM and CRAM based on spintronic memory. Finally, the recognition accuracy can reach 98.65% and 91.11% on MNIST and CIFAR-10, respectively. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

13. Fast and Accurate Inference on Microcontrollers With Boosted Cooperative Convolutional Neural Networks (BC-Net).

Author: Mocerino, Luca and Calimera, Andrea
Subjects: *CONVOLUTIONAL neural networks, *COMPUTER architecture, *MICROCONTROLLERS
Abstract: Arithmetic precision scaling is mandatory to deploy Convolutional Neural Networks (CNNs) on resource-constrained devices such as microcontrollers (MCUs), and quantization via fixed-point or binarization are the most adopted techniques today. Despite being born of the same concept of bit-width lowering, these two strategies differ substantially from each other, and hence are often conceived and implemented separately. However, their joint integration is feasible and, if properly implemented, can bring large savings and high processing efficiency. This work elaborates on this aspect, introducing a boosted collaborative mechanism that pushes CNNs towards higher performance and more predictive capability. Referred to as BC-Net, the proposed solution consists of a self-adaptive conditional scheme where a lightweight binary net and an 8-bit quantized net are trained to cooperate dynamically. Experiments conducted on four different CNN benchmarks deployed on off-the-shelf boards powered with the MCUs of the Cortex-M family by ARM show that BC-Nets outperform classical quantization and binarization when applied as separate techniques (up to 81.49% speed-up and up to 3.8% of accuracy improvement). The comparative analysis with a previously proposed cooperative method also demonstrates BC-Nets achieve substantial savings in terms of both performance (+19%) and accuracy (+3.45%). [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF
