HyCA: A Hybrid Computing Architecture for Fault-Tolerant Deep Learning.

Authors :
Liu, Cheng
Chu, Cheng
Xu, Dawen
Wang, Ying
Wang, Qianlong
Li, Huawei
Li, Xiaowei
Cheng, Kwang-Ting
Source :
IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems; Oct 2022, Vol. 41, Issue 10, pp. 3400-3413 (14 pages)
Publication Year :
2022

Abstract

Hardware faults on the regular 2-D computing array of a typical deep learning accelerator (DLA) can lead to dramatic prediction accuracy loss. To avoid excessive hardware overhead, prior redundancy design approaches typically assign each homogeneous redundant processing element (PE) to mitigate faulty PEs within a limited region of the 2-D computing array rather than across the entire array. However, they fail to recover the computing array when the number of faulty PEs in any region exceeds the number of redundant PEs in the same region. This mismatch problem worsens when the fault injection rate rises and the faults are unevenly distributed. To address the problem, we propose a hybrid computing architecture (HyCA) for fault-tolerant DLAs. It has a set of dot-production processing units (DPPUs) that recompute all the operations mapped to the faulty PEs, regardless of the faulty PE locations. According to our experiments, HyCA shows significantly higher reliability, scalability, and performance with less chip area penalty than the conventional redundancy approaches. Moreover, by taking advantage of the flexible recomputing, HyCA can also be used to scan the entire 2-D computing array and detect faulty PEs effectively at runtime. [ABSTRACT FROM AUTHOR]
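
The mismatch problem described in the abstract can be illustrated with a toy model. The sketch below is not the authors' implementation. It assumes a simplified mapping in which PE (r, c) produces one output element of a matrix product, a region-based scheme with a fixed number of spare PEs per array row, and a shared recomputing pool with a fixed capacity. All function names and parameters (`region_scheme_recovers`, `spares_per_row`, `capacity`) are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def array_output(A, B, faulty):
    """Toy 2-D array: PE (r, c) computes dot(A[r], column c of B).

    Outputs of faulty PEs are modeled as None (assumption: unusable result).
    """
    cols = list(zip(*B))
    return [
        [None if (r, c) in faulty else dot(A[r], cols[c])
         for c in range(len(cols))]
        for r in range(len(A))
    ]


def region_scheme_recovers(faulty, rows, spares_per_row=1):
    """Region-based redundancy (assumed: each row is a region with its own spares).

    Recovery fails as soon as any single row has more faults than spares.
    """
    per_row = [0] * rows
    for r, _c in faulty:
        per_row[r] += 1
    return all(n <= spares_per_row for n in per_row)


def recompute_faulty(out, A, B, faulty, capacity):
    """Location-independent recomputing with a shared pool of dot-product units.

    Returns the repaired output, or None if faults exceed the pool capacity.
    """
    if len(faulty) > capacity:
        return None
    cols = list(zip(*B))
    fixed = [row[:] for row in out]
    for r, c in faulty:
        fixed[r][c] = dot(A[r], cols[c])  # redo the work mapped to the faulty PE
    return fixed


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
faulty = {(0, 0), (0, 1)}  # two faults clustered in the same row

damaged = array_output(A, B, faulty)
print(region_scheme_recovers(faulty, rows=2, spares_per_row=1))  # False
print(recompute_faulty(damaged, A, B, faulty, capacity=2))       # [[19, 22], [43, 50]]
```

In this toy case both schemes have two units of redundancy in total, but the per-row scheme cannot cover two faults in one row, while the shared recomputing pool can, because it does not depend on where the faults are.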

Details

Language :
English
ISSN :
0278-0070
Volume :
41
Issue :
10
Database :
Complementary Index
Journal :
IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems
Publication Type :
Academic Journal
Accession number :
160651740
Full Text :
https://doi.org/10.1109/TCAD.2021.3124763