Back to Search Start Over

O⁴-DNN: A Hybrid DSP-LUT-Based Processing Unit With Operation Packing and Out-of-Order Execution for Efficient Realization of Convolutional Neural Networks on FPGA Devices.

Authors :
Haghi, Pouya
Kamal, Mehdi
Afzali-Kusha, Ali
Pedram, Massoud
Source :
IEEE Transactions on Circuits & Systems. Part I: Regular Papers; Sep2020, Vol. 67 Issue 9, p3056-3069, 14p
Publication Year :
2020

Abstract

In this paper, we propose O4-DNN, a high-performance FPGA-based architecture for convolutional neural network (CNN) accelerators relying on operation packing and out-of-order (OoO) execution for DSP blocks augmented with LUT-based glue logic. The high-level architecture is comprised of a systolic array of processing elements (PEs), supporting output stationary dataflow. In this architecture, the computational unit of each PE is realized by using a DSP block as well as a small number of LUTs. Given the limited number of DSP blocks in FPGAs, the combination (DSP block and some LUTs) provides more computational power obtainable through each DSP block. The proposed computational unit performs eight convolutional operations on five input operands where one of them is an 8-bit weight and the others are four 8-bit input feature (IF) maps. In addition, to improve the energy efficiency of the proposed computational unit, we present an approximate form of the unit suitable for neural network applications. To reduce the memory bandwidth as well as increase the utilization of the computational units, a data reusing technique based on the weight sharing is also presented. To improve the performance of the proposed computational unit further, an addressing approach for computing the partial sums out-of-order is proposed. The efficacy of the architecture is assessed using two FPGA devices executing four state-of-the-art neural networks. Experimental results show that this architecture leads to, on average (up to), $2.5\times $ ($3.44\times$) higher throughput compared to a baseline structure. In addition, on average (maximum of), 12% (40%) energy efficiency improvement is achievable by employing the O4-DNN compared to the baseline structure. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
15498328
Volume :
67
Issue :
9
Database :
Complementary Index
Journal :
IEEE Transactions on Circuits & Systems. Part I: Regular Papers
Publication Type :
Periodical
Accession number :
145399757
Full Text :
https://doi.org/10.1109/TCSI.2020.2986350