A Scalable Multi-TeraOPS Core for AI Training and Inference
- Source :
- IEEE Solid-State Circuits Letters, vol. 1, pp. 217-220
- Publication Year :
- 2018
- Publisher :
- Institute of Electrical and Electronics Engineers (IEEE)
Abstract
- This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across a range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A custom 16-b floating point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has also been developed for high model accuracy in training and inference, along with 1-b/2-b (binary/ternary) integer formats for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14-nm CMOS.
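- The abstract only specifies the bit split of the custom fp16 format (1 sign, 6 exponent, 9 mantissa bits). As a minimal sketch of how such a word might be interpreted, the Python snippet below decodes that layout; the exponent bias of 31 and the zero-encoding rule are assumptions for illustration, not details given in the letter.

```python
# Sketch: decoding a 16-bit word laid out as 1 sign / 6 exponent / 9 mantissa bits.
# The bias (31) and the all-zero-word-means-zero rule are assumptions; the
# abstract only states the bit widths of the format.

def decode_fp16_1_6_9(bits: int) -> float:
    """Interpret a 16-bit word under the assumed 1-6-9 floating-point layout."""
    sign = (bits >> 15) & 0x1        # 1 sign bit
    exponent = (bits >> 9) & 0x3F    # 6 exponent bits
    mantissa = bits & 0x1FF          # 9 mantissa bits

    bias = 31                        # assumed mid-range bias for a 6-bit exponent
    if exponent == 0 and mantissa == 0:
        value = 0.0                  # assumed: all-zero word encodes zero
    else:
        value = (1 + mantissa / 512.0) * 2.0 ** (exponent - bias)
    return -value if sign else value


if __name__ == "__main__":
    # 0x3E00 -> sign 0, exponent 31, mantissa 0 -> 1.0 under the assumed bias
    print(decode_fp16_1_6_9(0x3E00))
```

- Relative to IEEE half precision (5 exponent / 10 mantissa bits), the extra exponent bit roughly squares the representable dynamic range at the cost of one bit of mantissa precision, which is the trade-off the abstract ties to preserving model accuracy during training and inference.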
Details
- ISSN :
- 2573-9603
- Volume :
- 1
- Database :
- OpenAIRE
- Journal :
- IEEE Solid-State Circuits Letters
- Accession number :
- edsair.doi...........9fce01a03750f2d886bc8908899c58be