245 results for "William J. Dally"
Search Results
2. A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Quantization in 5 nm
- Author
-
Ben Keller, Rangharajan Venkatesan, Steve Dai, Stephen G. Tell, Brian Zimmer, Charbel Sakr, William J. Dally, C. Thomas Gray, and Brucek Khailany
- Subjects
Electrical and Electronic Engineering - Published
- 2023
- Full Text
- View/download PDF
3. OP-VENT
- Author
-
William J. Dally
- Subjects
General Earth and Planetary Sciences, General Environmental Science - Abstract
A mechanical ventilator keeps a patient with respiratory failure alive by pumping precisely controlled amounts of air (or an air/O2 mixture) at controlled pressure into the patient's lungs [3, 5]. During intake (inspiration), the ventilator meters the flow of air and the duration of the flow to deliver a controlled tidal volume of air (typically 50 to 800 mL). During the exhaust (expiration) phase, the flow is turned off and a path is opened to allow the patient to exhale to the atmosphere - possibly with a positive pressure maintained at the end of the expiratory period (PEEP). The timing of the breaths can be entirely managed by the ventilator, or a new breath can be initiated by the patient.
- Published
- 2022
- Full Text
- View/download PDF
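The abstract of entry 3 above describes the volume-controlled breathing cycle a ventilator implements. The sketch below is a minimal, illustrative timing and flow calculation for such a cycle, not the OP-VENT controller; the parameter names and default values are assumptions chosen only to make the arithmetic concrete.

```python
# Illustrative volume-controlled breath-cycle arithmetic (not the OP-VENT design).
# Assumed parameters: tidal volume, breaths per minute, I:E ratio, PEEP.

def breath_cycle(tidal_volume_ml=500.0, breaths_per_min=15.0, ie_ratio=0.5, peep_cmh2o=5.0):
    """Return the timing and constant inspiratory flow for one breath.

    ie_ratio is inspiratory time divided by expiratory time (0.5 means a 1:2 I:E ratio).
    """
    period_s = 60.0 / breaths_per_min                        # total breath period
    t_insp = period_s * ie_ratio / (1.0 + ie_ratio)          # inspiration time
    t_exp = period_s - t_insp                                # expiration time
    flow_lpm = (tidal_volume_ml / 1000.0) / (t_insp / 60.0)  # constant flow in L/min
    return {
        "period_s": period_s,
        "inspiration_s": t_insp,
        "expiration_s": t_exp,
        "flow_L_per_min": flow_lpm,
        "peep_cmH2O": peep_cmh2o,   # pressure held during the expiratory phase
    }

if __name__ == "__main__":
    # 500 mL at 15 breaths/min with a 1:2 I:E ratio -> ~1.33 s inspiration at ~22.5 L/min.
    print(breath_cycle())
```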
4. Evolution of the Graphics Processing Unit (GPU)
- Author
-
Stephen W. Keckler, David B. Kirk, and William J. Dally
- Subjects
Vertex (computer graphics), Fragment (computer graphics), Computer science, Graphics processing unit, Frame rate, High memory, Hardware and Architecture, Computer graphics (images), Smart camera, Electrical and Electronic Engineering, Graphics, Shader, Software - Abstract
Graphics processing units (GPUs) power today’s fastest supercomputers, are the dominant platform for deep learning, and provide the intelligence for devices ranging from self-driving cars to robots and smart cameras. They also generate compelling photorealistic images at real-time frame rates. GPUs have evolved by adding features to support new use cases. NVIDIA’s GeForce 256, the first GPU, was a dedicated processor for real-time graphics, an application that demands large amounts of floating-point arithmetic for vertex and fragment shading computations and high memory bandwidth. As real-time graphics advanced, GPUs became programmable. The combination of programmability and floating-point performance made GPUs attractive for running scientific applications. Scientists found ways to use early programmable GPUs by casting their calculations as vertex and fragment shaders. GPUs evolved to meet the needs of scientific users by adding hardware for simpler programming, double-precision floating-point arithmetic, and resilience.
- Published
- 2021
- Full Text
- View/download PDF
5. Frontier vs the Exascale Report: Why so long? and Are We Really There Yet?
- Author
-
Peter M. Kogge and William J. Dally
- Published
- 2022
- Full Text
- View/download PDF
6. A 0.297-pJ/bit 50.4-Gb/s/wire Inverter-Based Short-Reach Simultaneous Bidirectional Transceiver for Die-to-Die Interface in 5nm CMOS
- Author
-
Yoshinori Nishi, John W. Poulton, Walker J. Turner, Xi Chen, Sanquan Song, Brian Zimmer, Stephen G. Tell, Nikola Nedovic, John M. Wilson, William J. Dally, and C. Thomas Gray
- Subjects
Electrical and Electronic Engineering - Published
- 2022
- Full Text
- View/download PDF
7. A 17–95.6 TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization for Transformers in 5nm
- Author
-
Ben Keller, Rangharajan Venkatesan, Steve Dai, Stephen G. Tell, Brian Zimmer, William J. Dally, C. Thomas Gray, and Brucek Khailany
- Published
- 2022
- Full Text
- View/download PDF
8. Accelerating Chip Design With Machine Learning
- Author
-
Steve Dai, Ben Keller, William J. Dally, Brucek Khailany, Rangharajan Venkatesan, Alicia Klinefelter, Robert M. Kirby, Saad Godil, Yanqing Zhang, Haoxing Ren, and Bryan Catanzaro
- Subjects
Very-large-scale integration, Artificial neural network, Design space exploration, Computer science, Integrated circuit design, Machine learning, Convolutional neural network, Workflow, Hardware and Architecture, Logic gate, Graph (abstract data type), Artificial intelligence, Electrical and Electronic Engineering, Design methods, Software - Abstract
Recent advancements in machine learning provide an opportunity to transform chip design workflows. We review recent research applying techniques such as deep convolutional neural networks and graph-based neural networks in the areas of automatic design space exploration, power analysis, VLSI physical design, and analog design. We also present a future vision of an AI-assisted automated chip design workflow to aid designer productivity and automate optimization tasks.
- Published
- 2020
- Full Text
- View/download PDF
9. Domain-specific hardware accelerators
- Author
-
Yatish Turakhia, Song Han, and William J. Dally
- Subjects
General Computer Science, Computer science, Computational science, Domain (software engineering) - Abstract
DSAs gain efficiency from specialization and performance from parallelism.
- Published
- 2020
- Full Text
- View/download PDF
10. A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm
- Author
-
Joel Emer, Matthew Fojtik, C. Thomas Gray, Ben Keller, Stephen G. Tell, Priyanka Raina, Stephen W. Keckler, Alicia Klinefelter, William J. Dally, Brucek Khailany, Brian Zimmer, Jason Clemons, Rangharajan Venkatesan, Nan Jiang, Yanqing Zhang, Nathaniel Pinckney, and Yakun Sophia Shao
- Subjects
Computer science ,business.industry ,Multi-chip module ,Bandwidth (signal processing) ,Scalability ,Mesh networking ,Inference ,System on a chip ,Electrical and Electronic Engineering ,Chip ,business ,Computer hardware ,Efficient energy use - Abstract
Custom accelerators improve the energy efficiency, area efficiency, and performance of deep neural network (DNN) inference. This article presents a scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic chip are optimal for specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. Communication energy is minimized with large on-chip distributed weight storage and a hierarchical network-on-chip and network-on-package, and inference energy is minimized through extensive data reuse. The 16-nm prototype achieves 1.29-TOPS/mm2 area efficiency, 0.11 pJ/op (9.5 TOPS/W) energy efficiency, 4.01-TOPS peak performance for a one-chip system, and 127.8 peak TOPS and 1903 images/s ResNet-50 batch-1 inference for a 36-chip system.
- Published
- 2020
- Full Text
- View/download PDF
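Entry 10 above reports both 0.11 pJ/op and 9.5 TOPS/W. These are two views of the same quantity, since 1 pJ/op corresponds to 1 TOPS/W; the short check below uses only the reported numbers, and the small gap from 9.1 to 9.5 TOPS/W is consistent with 0.11 pJ/op being a rounded figure.

```python
# Energy per operation and operations per unit energy are reciprocals:
# 1 pJ/op = 1e-12 J/op  ->  1e12 op/J = 1 tera-op per joule = 1 TOPS/W.
pj_per_op = 0.11
tops_per_watt = 1.0 / pj_per_op          # ~9.09 TOPS/W from the rounded 0.11 pJ/op figure
print(f"{pj_per_op} pJ/op -> {tops_per_watt:.2f} TOPS/W (the paper reports 9.5 TOPS/W)")
```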
11. Energy Efficient On-Demand Dynamic Branch Prediction Models
- Author
-
Ehsan Atoofian, Amirali Baniasadi, Milad Mohammadi, Tor M. Aamodt, William J. Dally, and Song Han
- Subjects
Computer science, Fetch, Parallel computing, Supercomputer, Branch predictor, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Compiler, Cache, Software, Integer (computer science), Efficient energy use - Abstract
The branch predictor unit (BPU) is among the main energy-consuming components in out-of-order (OoO) processors. For integer applications, we find 16 percent of the processor energy is consumed by the BPU. The BPU is accessed in parallel with the instruction cache before it is known whether a fetch group contains control instructions. We find 85 percent of BPU lookups are done for non-branch operations, and of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate two variants of a branch prediction model that combines dynamic and static branch prediction to achieve energy improvements for power-constrained applications. These models, named on-demand branch prediction (ODBP) and path-based on-demand branch prediction (ODBP-PATH), are two novel prediction techniques that eliminate unnecessary BPU lookups using compiler-generated hints to identify instructions that can be more accurately predicted statically. ODBP-PATH is an implementation of ODBP that combines static and dynamic branch prediction based on the program path of execution. For a 4-wide OoO processor, ODBP-PATH delivers an 11 percent average energy-delay (ED) product improvement and a 9 percent average core energy saving on the SPEC Int 2006 benchmarks.
- Published
- 2020
- Full Text
- View/download PDF
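Entry 11 above (and entry 39 later in the list) describes skipping branch-prediction-unit (BPU) lookups for instructions that compiler hints mark as non-branches or as statically predictable. The sketch below is a schematic model of that decision only: the hint encoding and the toy 2-bit counter table are invented stand-ins, not the ODBP hardware.

```python
# Schematic on-demand branch prediction: consult the dynamic predictor only when
# a (hypothetical) compiler hint says the instruction is a hard-to-predict branch.
from collections import defaultdict

# Invented hint values for illustration.
NOT_BRANCH, STATIC_TAKEN, STATIC_NOT_TAKEN, DYNAMIC = range(4)

class ToyODBP:
    def __init__(self):
        # 2-bit saturating counters indexed by PC, standing in for the BPU tables.
        self.counters = defaultdict(lambda: 2)
        self.bpu_lookups = 0

    def predict(self, pc, hint):
        if hint == NOT_BRANCH:
            return None                 # no BPU lookup at all
        if hint == STATIC_TAKEN:
            return True                 # biased branch, predicted by the compiler hint
        if hint == STATIC_NOT_TAKEN:
            return False
        self.bpu_lookups += 1           # only DYNAMIC branches touch the BPU
        return self.counters[pc] >= 2

    def update(self, pc, hint, taken):
        if hint == DYNAMIC:             # train only the dynamically predicted branches
            c = self.counters[pc]
            self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

if __name__ == "__main__":
    bp = ToyODBP()
    stream = [(0x40, NOT_BRANCH, None), (0x44, STATIC_TAKEN, True),
              (0x48, DYNAMIC, True), (0x48, DYNAMIC, False)]
    for pc, hint, outcome in stream:
        _ = bp.predict(pc, hint)
        if outcome is not None:
            bp.update(pc, hint, outcome)
    print("BPU lookups performed:", bp.bpu_lookups)   # 2 of the 4 fetches
```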
12. Champagne: Automated Whole-Genome Phylogenomic Character Matrix Method Using Large Genomic Indels for Homoplasy-Free Inference
- Author
-
James K. Schull, Yatish Turakhia, James A. Hemker, William J. Dally, Gill Bejerano, and Barbara Holland
- Subjects
rare genomic changes, Mammals, Evolutionary Biology, Genome, homoplasy-free characters, Nucleotides, Human Genome, incomplete lineage sorting, phylogenomics, Genomics, phylogenetics, INDEL Mutation, Genetics, Animals, Biochemistry and Cell Biology, Ecology, Evolution, Behavior and Systematics, Phylogeny, Developmental Biology - Abstract
We present Champagne, a whole-genome method for generating character matrices for phylogenomic analysis using large genomic indel events. By rigorously picking orthologous genes and locating large insertion and deletion events, Champagne delivers a character matrix that considerably reduces homoplasy compared with morphological and nucleotide-based matrices, on both established phylogenies and difficult-to-resolve nodes in the mammalian tree. Champagne provides ample evidence in the form of genomic structural variation to support incomplete lineage sorting and possible introgression in Paenungulata and human–chimp–gorilla which were previously inferred primarily through matrices composed of aligned single-nucleotide characters. Champagne also offers further evidence for Myomorpha as sister to Sciuridae and Hystricomorpha in the rodent tree. Champagne harbors distinct theoretical advantages as an automated method that produces nearly homoplasy-free character matrices on the whole-genome scale.
- Published
- 2022
13. SPAA'21 Panel Paper: Architecture-Friendly Algorithms versus Algorithm-Friendly Architectures
- Author
-
Guy E. Blelloch, William J. Dally, Uzi Vishkin, Katherine Yelick, and Margaret Martonosi
- Subjects
Computer science, Parallel algorithm, Architecture, Algorithm, Panel discussion - Abstract
This paper provides preliminary statements from the panelists ahead of a panel discussion at the ACM SPAA 2021 conference on the topic of algorithm-friendly architectures versus architecture-friendly algorithms.
- Published
- 2021
- Full Text
- View/download PDF
14. Darwin: A Genomics Coprocessor
- Author
-
William J. Dally, Gill Bejerano, and Yatish Turakhia
- Subjects
Coprocessor, Speedup, Computer science, Molecular biophysics, Sequence assembly, Genomics, Parallel computing, Orders of magnitude (bit rate), Hardware and Architecture, Darwin (ADL), Human genome, Electrical and Electronic Engineering, Software - Abstract
Long-read sequencing is promising because it reveals the full spectrum of mutations in the human genome and produces more contiguous de novo assemblies. However, the high error rate of long reads imposes a computational barrier to genome assembly. Darwin, a specialized coprocessor that provides orders-of-magnitude speedup over conventional processors in long-read assembly, can eliminate this barrier.
- Published
- 2019
- Full Text
- View/download PDF
15. A 1.17-pJ/b, 25-Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication Using a Process- and Temperature-Adaptive Voltage Regulator
- Author
-
William J. Dally, C. Thomas Gray, John Wilson, Sudhir S. Kudva, John W. Poulton, Wenxu Zhao, Nikola Nedovic, Stephen G. Tell, Xi Chen, Walker J. Turner, Sunil Sudhakaran, Sanquan Song, and Brian Zimmer
- Subjects
Frequency response, Serial communication, Computer science, Transmitter, Electrical engineering, Voltage regulator, Phase-locked loop, CMOS, Electrical and Electronic Engineering, Transceiver, Jitter - Abstract
This paper describes a short-reach serial link to connect chips mounted on the same package or on neighboring packages on a printed circuit board (PCB). The link employs an energy-efficient, single-ended ground-referenced signaling scheme. Implemented in 16-nm FinFET CMOS technology, the link operates at a data rate of 25 Gb/s/pin with 1.17-pJ/bit energy efficiency and uses a simple but robust matched-delay clock forwarding scheme that cancels most sources of jitter. The modest frequency-dependent attenuation of short-reach links is compensated using an analog equalizer in the transmitter. The receiver includes active-inductor peaking in the input amplifier to improve overall receiver frequency response. The link employs a novel power supply regulation scheme at both ends that uses a PLL ring-oscillator supply voltage as a reference to flatten circuit speed and reduce power consumption variation across PVT. The link can be calibrated once at an arbitrary voltage and temperature, then track VT variation without the need for periodic re-calibration. The link operates over a 10-mm-long on-package channel with −4 dB of attenuation with 0.77-UI eye opening at bit-error rate (BER) of 10⁻¹⁵. A package-to-package link with 54 mm of PCB and 26 mm of on-package trace with −8.5 dB of loss at Nyquist operates with 0.42 UI of eye opening at BER of 10⁻¹⁵. Overall link die area is 686 µm × 565 µm with the transceiver circuitry taking up 20% of the area. The transceiver’s on-chip regulator is supplied from an off-chip 950-mV supply, while the support logic operates on a separate 850-mV supply.
- Published
- 2019
- Full Text
- View/download PDF
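Entry 15 above reports 1.17 pJ/bit at 25 Gb/s per pin; those two figures multiply directly into average link power per pin. The two-line check below uses only the reported numbers.

```python
# Average power per pin = energy per bit x bit rate.
energy_per_bit_j = 1.17e-12          # 1.17 pJ/bit
bit_rate = 25e9                      # 25 Gb/s per pin
print(f"{energy_per_bit_j * bit_rate * 1e3:.1f} mW per pin")   # about 29 mW
```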
16. LNS-Madam: Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update
- Author
-
Jiawei Zhao, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, Mustafa Ali, Ming-Yu Liu, Brucek Khailany, William J. Dally, and Anima Anandkumar
- Subjects
Computational Theory and Mathematics, Hardware and Architecture, Hardware Architecture (cs.AR), Software, Theoretical Computer Science, Machine Learning (cs.LG) - Abstract
Representing deep neural networks (DNNs) in low-precision is a promising approach to enable efficient acceleration and memory reduction. Previous methods that train DNNs in low-precision typically keep a copy of weights in high-precision during the weight updates. Directly training with low-precision weights leads to accuracy degradation due to complex interactions between the low-precision number systems and the learning algorithms. To address this issue, we develop a co-designed low-precision training framework, termed LNS-Madam, in which we jointly design a logarithmic number system (LNS) and a multiplicative weight update algorithm (Madam). We prove that LNS-Madam results in low quantization error during weight updates, leading to stable performance even if the precision is limited. We further propose a hardware design of LNS-Madam that resolves practical challenges in implementing an efficient datapath for LNS computations. Our implementation effectively reduces energy overhead incurred by LNS-to-integer conversion and partial sum accumulation. Experimental results show that LNS-Madam achieves comparable accuracy to full-precision counterparts with only 8 bits on popular computer vision and natural language tasks. Compared to FP32 and FP8, LNS-Madam reduces the energy consumption by over 90% and 55%, respectively.
- Published
- 2021
- Full Text
- View/download PDF
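Entry 16 above pairs a logarithmic number system with a multiplicative weight update so that the update becomes an addition in the log domain. The NumPy sketch below illustrates that pairing with a simple sign-based multiplicative rule on (sign, log2-magnitude) weights; the learning rate, quantization step, and exact update rule are assumptions for illustration, not the paper's LNS-Madam algorithm.

```python
# Illustrative multiplicative weight update carried out in a logarithmic number system:
# weights are stored as a sign and a quantized log2 magnitude, so the multiplicative
# update w <- w * 2^(-lr * sign(w) * sign(g)) is just an addition on the stored exponent.
import numpy as np

def lns_encode(w, step=0.125):
    sign = np.sign(w)
    logmag = np.round(np.log2(np.abs(w) + 1e-12) / step) * step   # quantized log2|w|
    return sign, logmag

def lns_decode(sign, logmag):
    return sign * np.exp2(logmag)

def multiplicative_update(sign, logmag, grad, lr=0.125, step=0.125):
    # Shrink |w| when the weight and gradient have the same sign (the descent direction
    # for a multiplicative update), grow |w| otherwise; the result stays on the LNS grid.
    delta = -lr * sign * np.sign(grad)
    logmag = np.round((logmag + delta) / step) * step
    return sign, logmag

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=5).astype(np.float32)
    g = rng.normal(size=5).astype(np.float32)
    s, m = lns_encode(w)
    s, m = multiplicative_update(s, m, g)
    print("before:", np.round(w, 3))
    print("after :", np.round(lns_decode(s, m), 3))
```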
17. Optimal Operation of a Plug-in Hybrid Vehicle with Battery Thermal and Degradation Model
- Author
-
Stephen Boyd, William J. Dally, Jongho Kim, John Fox, and Youngsuk Park
- Subjects
Battery (electricity), Work (thermodynamics), Control theory, Computer science, Thermal, State (computer science), Hybrid vehicle, Automotive engineering, Degradation (telecommunications) - Abstract
We propose a control method to optimally use fuel and battery resources for power-split plug-in hybrid vehicles (PHEVs) in the case of a predetermined driving route and associated energy demand profile. We integrate a battery thermal and degradation model and formulate a mixed-integer convex problem which can be approximately solved with standard efficient solvers. In simulation, we demonstrate that our controller can manage battery operation and state to avoid severe battery degradation, and balance fuel usage with battery degradation depending on the ambient temperature or energy demand profiles of the routes. Under various scenarios, the results are validated by the Autonomie software [1] and compared with the conventional existing CDCS controller and the earlier related work [2], which optimizes only for minimal fuel use and neglects battery degradation. Lastly, we show our controller is efficient enough to be computed on the on-board vehicle computer and applied in real time.
- Published
- 2020
- Full Text
- View/download PDF
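Entry 17 above (and entry 19 below) casts power-split control over a known route as a convex program. The CVXPY sketch below sets up a heavily simplified version that splits a known power-demand profile between engine and battery with a quadratic fuel cost and a throughput penalty standing in for degradation; the horizon, coefficients, and bounds are all assumptions, and the paper's mixed-integer formulation with a battery thermal model is far richer.

```python
# Toy convex power-split problem for a plug-in hybrid over a known demand profile.
# Requires: pip install cvxpy numpy
import cvxpy as cp
import numpy as np

T, dt = 60, 1.0                               # 60 one-second steps (assumed horizon)
rng = np.random.default_rng(1)
demand = np.clip(rng.normal(20.0, 8.0, T), 0, None)   # kW power demand along the route

engine = cp.Variable(T, nonneg=True)          # kW supplied from fuel
batt = cp.Variable(T)                         # kW from battery (negative means charging)
soc = cp.Variable(T + 1)                      # battery state of charge, kWh

capacity_kwh, soc0 = 8.0, 6.0
fuel_cost = cp.sum(0.05 * engine + 0.002 * cp.square(engine))   # assumed convex fuel model
degradation = 0.01 * cp.sum(cp.abs(batt))                        # assumed throughput penalty

constraints = [
    engine + batt == demand,                                 # meet demand at every step
    soc[0] == soc0,
    soc[1:] == soc[:-1] - batt * dt / 3600.0,                # SoC dynamics (kWh)
    soc >= 0.1 * capacity_kwh, soc <= capacity_kwh,
    cp.abs(batt) <= 30.0,                                    # battery power limit, kW
    engine <= 60.0,
]

prob = cp.Problem(cp.Minimize(fuel_cost + degradation), constraints)
prob.solve()
print("status:", prob.status, " objective:", round(float(prob.value), 2))
```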
18. SpArch: Efficient Architecture for Sparse Matrix Multiplication
- Author
-
Song Han, William J. Dally, Hanrui Wang, and Zhekai Zhang
- Subjects
Speedup, Computer science, Matrix representation, Parallel computing, Huffman coding, Matrix multiplication, Matrix (mathematics), Hardware Architecture (cs.AR), Scalability, Distributed, Parallel, and Cluster Computing (cs.DC), DRAM, Sparse matrix - Abstract
Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner-product-based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer-product-based approach suffers from poor output locality due to numerous partial product matrices. Inefficiency in the reuse of either input or output data leads to extensive and expensive DRAM access. To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality for both input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stage of partial matrices so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces the DRAM access by another 1.8x. We also resolve the increased input matrix read induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing the DRAM access by 1.5x. Evaluated on 20 benchmarks, SpArch reduces the total DRAM access by 2.8x over the previous state-of-the-art. On average, SpArch achieves 4x, 19x, 18x, 17x, 1285x speedup and 6x, 164x, 435x, 307x, 62x energy savings over OuterSpace, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively. (Funding: National Science Foundation (U.S.), Harnessing the Data Revolution, Award 1934700.)
- Published
- 2020
- Full Text
- View/download PDF
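Entry 18 above contrasts inner-product and outer-product SpGEMM and the cost of merging partial-product matrices. The sketch below is a plain Python outer-product SpGEMM on dict-of-dicts sparse matrices that merges each rank-1 partial product into the accumulator as soon as it is produced, the software analogue of merging partial matrices on chip immediately after they are produced; the data layout and sizes are assumptions, and SpArch's condensed representation, Huffman-tree scheduler, and prefetcher are not modeled.

```python
# Outer-product SpGEMM with eager merging of partial products.
# A and B are sparse matrices stored as {row: {col: value}}.

def spgemm_outer(A, B):
    # Column view of A so we can form the rank-1 outer product for each shared index k.
    A_cols = {}
    for i, row in A.items():
        for k, v in row.items():
            A_cols.setdefault(k, {})[i] = v

    C = {}
    for k, a_col in A_cols.items():          # one rank-1 partial-product matrix per k
        b_row = B.get(k)
        if not b_row:
            continue
        for i, a_ik in a_col.items():        # merge the rank-1 update into C immediately,
            c_row = C.setdefault(i, {})      # rather than materializing all partial matrices
            for j, b_kj in b_row.items():
                c_row[j] = c_row.get(j, 0.0) + a_ik * b_kj
    return C

if __name__ == "__main__":
    A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
    B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: -1.0}}
    print(spgemm_outer(A, B))   # expected: {0: {1: 2.0}, 1: {0: 15.0}}
```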
19. Optimal Operation of a Plug-In Hybrid Vehicle
- Author
-
John Fox, Nicholas Moehle, Jason A. Platt, and William J. Dally
- Subjects
Computer Networks and Communications, Computer science, Aerospace Engineering, Grid, Automotive engineering, Nonlinear system, Control theory, Automotive Engineering, Electric vehicle, Convex optimization, Fuel efficiency, Petroleum, Resource management, Electrical and Electronic Engineering, Convex function, Hybrid vehicle - Abstract
We present a convex optimization control method that has been shown in simulations to increase the fuel efficiency of a plug-in hybrid electric vehicle by over 10%. Using information on energy demand and energy use profiles, the problem is defined to preferentially use battery resources sourced from the grid over petroleum resources. We pose the general nonlinear optimal resource management problem over a predetermined route as a convex optimization problem using a reduced model of the vehicle. This problem is computationally efficient enough to be optimized “on the fly” on the on-board vehicle computer and is thus able to adapt to changing vehicle conditions in real time. Using this reduced model to generate control inputs for the detailed vehicle simulator Autonomie, we record efficiency gains of over 10% as compared to the industry-standard charge-depleting/charge-sustaining controller over synthetic mixed urban-suburban routes.
- Published
- 2018
- Full Text
- View/download PDF
20. CG-OoO
- Author
-
Milad Mohammadi, William J. Dally, and Tor M. Aamodt
- Subjects
Out-of-order execution, Exploit, Computer science, Instruction scheduling, Parallel computing, Scheduling (computing), Hardware and Architecture, Instruction pipeline, Granularity, Compiler, Software, Information Systems, Efficient energy use - Abstract
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose processor designed to achieve close to In-Order (InO) processor energy while maintaining Out-of-Order (OoO) performance. CG-OoO is an energy-performance-proportional architecture. Block-level code processing is at the heart of this architecture; CG-OoO speculates, fetches, schedules, and commits code at block-level granularity. It eliminates unnecessary accesses to energy-consuming tables and turns large tables into smaller, distributed tables that are cheaper to access. CG-OoO leverages compiler-level code optimizations to deliver efficient static code and exploits dynamic block-level and instruction-level parallelism. CG-OoO introduces Skipahead, a complexity effective, limited out-of-order instruction scheduling model. Through the energy efficiency techniques applied to the compiler and processor pipeline stages, CG-OoO closes 62% of the average energy gap between the InO and OoO baseline processors at the same area and nearly the same performance as the OoO. This makes CG-OoO 1.8× more efficient than the OoO on the energy-delay product inverse metric. CG-OoO meets the OoO nominal performance while trading off the peak scheduling performance for superior energy efficiency.
- Published
- 2017
- Full Text
- View/download PDF
21. MAGNet: A Modular Accelerator Generator for Neural Networks
- Author
-
Miaorong Wang, Nathaniel Pinckney, Brucek Khailany, Alicia Klinefelter, Rangharajan Venkatesan, Ben Keller, Jason Clemons, William J. Dally, Matthew Fojtik, Stephen W. Keckler, Brian Zimmer, Yakun Sophia Shao, Joel Emer, Priyanka Raina, Yanqing Zhang, and Steve Dai
- Subjects
Artificial neural network, Computer science, Dataflow, Design space exploration, Deep learning, Modular design, Software, Application-specific integrated circuit, Benchmark (computing), Artificial intelligence, Computer hardware - Abstract
Deep neural networks have been adopted in a wide range of application domains, leading to high demand for inference accelerators. However, the high cost associated with ASIC hardware design makes it challenging to build custom accelerators for different targets. To lower design cost, we propose MAGNet, a modular accelerator generator for neural networks. MAGNet takes a target application consisting of one or more neural networks along with hardware constraints as input and produces synthesizable RTL for a neural network accelerator ASIC as well as valid mappings for running the target networks on the generated hardware. MAGNet consists of three key components: (i) MAGNet Designer, a highly configurable architectural template designed in C++ and synthesizable by high-level synthesis tools. MAGNet Designer supports a wide range of design-time parameters such as different data formats, diverse memory hierarchies, and dataflows. (ii) MAGNet Mapper, an automated framework for exploring different software mappings for executing a neural network on the generated hardware. (iii) MAGNet Tuner, a design space exploration framework encompassing the designer, the mapper, and a deep learning framework to enable fast design space exploration and co-optimization of architecture and application. We demonstrate the utility of MAGNet by designing an inference accelerator optimized for image classification application using three different neural networks—AlexNet, ResNet, and DriveNet. MAGNet-generated hardware is highly efficient and leverages a novel multi-level dataflow to achieve 40 fJ/op and 2.8 TOPS/mm2 in a 16nm technology node for the ResNet-50 benchmark with
- Published
- 2019
- Full Text
- View/download PDF
22. Simba
- Author
-
Stephen W. Keckler, Priyanka Raina, Joel Emer, Nan Jiang, Nathaniel Pinckney, Stephen G. Tell, Yakun Sophia Shao, Brian Zimmer, Brucek Khailany, Alicia Klinefelter, William J. Dally, Yanqing Zhang, Matthew Fojtik, Jason Clemons, C. Thomas Gray, Ben Keller, and Rangharajan Venkatesan
- Subjects
Speedup, Computer science, Deep learning, Multi-chip module, Process (computing), Inference, Die (integrated circuit), Computer architecture, Artificial intelligence, Layer (object-oriented design) - Abstract
Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with batch size of one, delivering inference latency of 0.50 ms.
- Published
- 2019
- Full Text
- View/download PDF
23. A 0.11 PJ/OP, 0.32-128 Tops, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with A High-Productivity vlsi Methodology
- Author
-
Priyanka Raina, William J. Dally, C. Thomas Gray, Ben Keller, Joel Emer, Matthew Fojtik, Brian Zimmer, Yakun Sophia Shao, Alicia Klinefelter, Brucek Khailany, Yanqing Zhang, Nathaniel Pinckney, Jason Clemons, Nan Jiang, Rangharajan Venkatesan, Stephen G. Tell, and Stephen W. Keckler
- Subjects
Very-large-scale integration, Logic synthesis, Computer architecture, Artificial neural network, Computer science, Scalability, Multi-chip module, TOPS, Productivity - Published
- 2019
- Full Text
- View/download PDF
24. Analog/Mixed-Signal Hardware Error Modeling for Deep Learning Inference
- Author
-
William J. Dally, C. Thomas Gray, Miaorong Wang, Nikola Nedovic, Rangharajan Venkatesan, Brian Zimmer, Angad Rekhi, Ningxi Liu, and Brucek Khailany
- Subjects
Normalization (statistics), Computer science, Computation, Deep learning, Inference, Dot product, Mixed-signal integrated circuit, Artificial intelligence, Computer hardware, Energy (signal processing), Efficient energy use - Abstract
Analog/mixed-signal (AMS) computation can be more energy efficient than digital approaches for deep learning inference, but incurs an accuracy penalty from precision loss. Prior AMS approaches focus on small networks/datasets, which can maintain accuracy even with 2b precision. We analyze applicability of AMS approaches to larger networks by proposing a generic AMS error model, implementing it in an existing training framework, and investigating its effect on ImageNet classification with ResNet-50. We demonstrate significant accuracy recovery by exposing the network to AMS error during retraining, and we show that batch normalization layers are responsible for this accuracy recovery. We also introduce an energy model to predict the requirements of high-accuracy AMS hardware running large networks and use it to show that for ADC-dominated designs, there is a direct tradeoff between energy efficiency and network accuracy. Our model predicts that achieving <0.4% accuracy loss on ResNet-50 with AMS hardware requires a computation energy of at least ~300 fJ/MAC. Finally, we propose methods for improving the energy-accuracy tradeoff.
- Published
- 2019
- Full Text
- View/download PDF
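Entry 24 above recovers accuracy by exposing the network to AMS error during retraining. The PyTorch sketch below shows the generic pattern of wrapping a layer so that a noise term is injected into its output only while training; the Gaussian noise model, its scale, and the layer sizes are placeholders, not the paper's AMS error model.

```python
# Injecting a simple analog-error proxy into a layer's output during training.
import torch
import torch.nn as nn

class AMSNoisyLinear(nn.Module):
    """A linear layer whose output is perturbed by additive noise while training,
    standing in for a detailed analog/mixed-signal error model."""

    def __init__(self, in_features, out_features, rel_sigma=0.05):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.rel_sigma = rel_sigma            # assumed relative noise scale

    def forward(self, x):
        y = self.linear(x)
        if self.training:
            # Noise proportional to the per-batch output scale (illustrative choice).
            scale = y.detach().abs().mean() + 1e-8
            y = y + torch.randn_like(y) * (self.rel_sigma * scale)
        return y

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(AMSNoisyLinear(16, 32), nn.ReLU(), AMSNoisyLinear(32, 4))
    x = torch.randn(8, 16)
    model.train()
    noisy = model(x)                          # noise injected during training passes
    model.eval()
    clean = model(x)                          # no noise at inference time
    print("train/eval outputs differ:", not torch.allclose(noisy, clean))
```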
25. A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm
- Author
-
Brucek Khailany, Alicia Klinefelter, Joel Emer, Stephen W. Keckler, Priyanka Raina, Brian Zimmer, Rangharajan Venkatesan, Yanqing Zhang, Nathaniel Pinckney, Nan Jiang, Stephen G. Tell, Matthew Fojtik, William J. Dally, C. Thomas Gray, Ben Keller, Jason Clemons, and Yakun Sophia Shao
- Subjects
Artificial neural network, Computer science, Multi-chip module, Mesh networking, Scalability, Bandwidth (signal processing), Transceiver, TOPS, Computer hardware, Efficient energy use - Abstract
This work presents a scalable deep neural network (DNN) accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic die are limited to specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. The 16nm prototype achieves 1.29 TOPS/mm2, 0.11 pJ/op energy efficiency, 4.01 TOPS peak performance for a 1-chip system, and 127.8 peak TOPS and 2615 images/s ResNet-50 inference for a 36-chip system.
- Published
- 2019
- Full Text
- View/download PDF
26. A Fine-Grained GALS SoC with Pausible Adaptive Clocking in 16 nm FinFET
- Author
-
Tezaswi Raja, Nathaniel Pinckney, William J. Dally, Stephen G. Tell, Brucek Khailany, Alicia Klinefelter, Brian Zimmer, Ben Keller, Kevin Zhou, and Matthew Fojtik
- Subjects
Noise, Asynchronous communication, Computer science, Embedded system, Timing margin, Systems design, Independent clock, Timing closure, Power (physics), Efficient energy use - Abstract
Modern SoCs suffer from power supply noise that can require significant additional timing margin, reducing performance and energy efficiency. Globally asynchronous, locally synchronous (GALS) systems can mitigate the impact of power supply noise, as well as simplify system design by removing the need for global timing closure. This work presents a 4mm2 distributed accelerator engine with 19 independent clock domains implemented in a 16nm process. Local adaptive clock generators dynamically tolerate and mitigate power supply noise, resulting in a 10% improvement in performance at the same voltage compared to a globally clocked baseline. Pausible bisynchronous FIFOs enable low-latency global communication across an on-chip network via error-free clock domain crossings. The SoC functions robustly across a wide range of voltages, frequencies, and workloads, demonstrating the practical applicability of fine-grained GALS techniques for modern SoC design.
- Published
- 2019
- Full Text
- View/download PDF
27. A 2-to-20 GHz Multi-Phase Clock Generator with Phase Interpolators Using Injection-Locked Oscillation Buffers for High-Speed IOs in 16nm FinFET
- Author
-
Sudhir S. Kudva, William J. Dally, Brian Zimmer, Stephen G. Tell, Xi Chen, John W. Poulton, Walker J. Turner, Nikola Nedovic, C. Thomas Gray, Sanquan Song, and John Wilson
- Subjects
Phase-locked loop, Voltage-controlled oscillator, Materials science, Oscillation, Phase (waves), Electronic engineering, Inverter, Clock generator, Resistor, Power (physics) - Abstract
To support high-speed IOs, a 2-to-20 GHz cross-coupled inverter-based multi-phase PLL with phase interpolators using injection-locked oscillation buffers is presented. The proposed voltage-controlled oscillator (VCO) is made of two 4-stage ring oscillators (ROs), where the introduction of feedback and cross-coupled resistances enables the VCO to run at higher speeds than a conventional 4-stage RO. Injection-locked oscillation buffering is proposed to restore swing and reduce duty-cycle error (DCE) for VCO output buffers and PIs. Fabricated in a 16nm process, it runs 25% faster and consumes 20% less power than prior works, with less than 1.8% DCE at 20 GHz across the PI code range.
- Published
- 2019
- Full Text
- View/download PDF
28. Darwin-WGA: A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup
- Author
-
Gill Bejerano, Yatish Turakhia, William J. Dally, and Sneha D. Goenka
- Subjects
Coprocessor, Speedup, Computer science, Darwin (ADL), Genomics, Sensitivity (control systems), Computational biology, Genome - Published
- 2019
- Full Text
- View/download PDF
29. A 28 nm 2 Mbit 6 T SRAM With Highly Configurable Low-Voltage Write-Ability Assist Implementation and Capacitor-Based Sense-Amplifier Input Offset Compensation
- Author
-
Andreas J. Gotterba, Mahmut E. Sinangil, Jesse S. Wang, Matthew Fojtik, Jason Golbus, John W. Poulton, Brian Zimmer, Stephen G. Tell, C. Thomas Gray, William J. Dally, and Thomas Hastings Greer
- Subjects
Engineering, Offset (computer science), Input offset voltage, Sense amplifier, Amplifier, Capacitor, CMOS, Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Low voltage - Abstract
This paper presents a highly configurable low-voltage write-ability assist implementation along with a sense-amplifier offset reduction technique to improve SRAM read performance. The write-assist implementation combines negative bit-line (BL) and VDD collapse schemes in an efficient way to maximize Vmin improvements while saving on the area and energy overhead of these assists. The relative delay and pulse width of assist control signals are also designed with configurability to provide tuning of assist strengths. The sense-amplifier offset compensation scheme uses capacitors to store and negate the threshold mismatch of input transistors. A test chip fabricated in a 28 nm HP CMOS process demonstrates operation down to 0.5 V with write assists and more than 10% reduction in word-line pulsewidth with the offset-compensated sense amplifiers.
- Published
- 2016
- Full Text
- View/download PDF
30. Bandwidth-efficient deep learning
- Author
-
Song Han and William J. Dally
- Subjects
Bandwidth efficient, Artificial neural network, Computer science, Deep learning, Inference, Memory bandwidth, Power (physics), Computer engineering, Bandwidth (computing), Artificial intelligence - Abstract
Deep learning algorithms are achieving increasingly higher prediction accuracy on many machine learning tasks. However, this brute-force approach demands a huge amount of machine power to perform training and inference, and a huge amount of manpower to design the neural network models, which is inefficient. In this paper, we provide techniques to address these bottlenecks: saving memory bandwidth for inference by model compression, saving networking bandwidth for training by gradient compression, and saving engineer bandwidth for model design by using AI to automate the design of models.
- Published
- 2018
- Full Text
- View/download PDF
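Entries 30 and 32 list three bandwidth savings: model compression for inference, gradient compression for training, and automated model design. The NumPy sketch below illustrates the first two in their simplest forms, magnitude pruning of a weight tensor and top-k gradient sparsification; the thresholds and fractions are arbitrary examples, and the authors' actual pipelines involve considerably more machinery.

```python
# Minimal magnitude pruning and top-k gradient sparsification.
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude fraction of weights (model compression)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def topk_sparsify(grad, k_fraction=0.01):
    """Keep only the largest-magnitude gradients for communication (gradient compression)."""
    flat = grad.ravel()
    k = max(1, int(k_fraction * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]      # indices of the top-k entries
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape), idx            # values to send plus their indices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))
    g = rng.normal(size=(64, 64))
    pruned = magnitude_prune(w, sparsity=0.9)
    sparse_g, sent = topk_sparsify(g, k_fraction=0.01)
    print("weights kept  :", np.count_nonzero(pruned), "/", w.size)
    print("gradients sent:", sent.size, "/", g.size)
```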
31. Hardware-Enabled Artificial Intelligence
- Author
-
John W. Poulton, Brucek Khailany, William J. Dally, John Wilson, C. Thomas Gray, and Larry R. Dennison
- Subjects
Computer science, Deep learning, Bandwidth (signal processing), Inference, Object (computer science), System on a chip, Artificial intelligence, Computer hardware, Efficient energy use - Abstract
The current resurgence of artificial intelligence is due to advances in deep learning. Systems based on deep learning now exceed human capability in speech recognition [1], object classification [4], and playing games like Go [9]. Deep learning is enabled by powerful, efficient computing hardware [5]. The algorithms used have been around since the 1980s [7], but it has only been in the last few years - when powerful GPUs became available to train networks - that the technology has become practical. This paper discusses the circuit challenges in building deep-learning hardware both for inference and for training.
- Published
- 2018
- Full Text
- View/download PDF
32. INVITED: Bandwidth-Efficient Deep Learning
- Author
-
Song Han and William J. Dally
- Subjects
Artificial neural network, Bandwidth efficient, Computer science, Deep learning, Inference, Memory bandwidth, Power (physics), Computer engineering, Bandwidth (computing), Artificial intelligence - Abstract
Deep learning algorithms are achieving increasingly higher prediction accuracy on many machine learning tasks. However, this brute-force approach demands a huge amount of machine power to perform training and inference, and a huge amount of manpower to design the neural network models, which is inefficient. In this paper, we provide techniques to address these bottlenecks: saving memory bandwidth for inference by model compression, saving networking bandwidth for training by gradient compression, and saving engineer bandwidth for model design by using AI to automate the design of models.
- Published
- 2018
- Full Text
- View/download PDF
33. Ground-referenced signaling for intra-chip and short-reach chip-to-chip interconnects
- Author
-
Walker J. Turner, Matthew Fojtik, Wenxu Zhao, John Wilson, John W. Poulton, Sunil Sudhakaran, Stephen G. Tell, Xi Chen, Brian Zimmer, Sanquan Song, Sudhir S. Kudva, Nikola Nedovic, C. Thomas Gray, William J. Dally, Thomas Hastings Greer, and Rizwan Bashirullah
- Subjects
SIMPLE (military communications protocol), Computer science, Chip, Printed circuit board, Electric power transmission, Single-ended signaling, High speed serial link, Transceiver, Energy (signal processing), Computer hardware - Abstract
While high-speed single-ended signaling maximizes pin and wire utilization within on- and off-chip serial links, problems associated with conventional signaling methods result in energy inefficiencies. Ground-referenced signaling (GRS) solves many of the problems of single-ended signaling systems and can be adapted for signaling across RC-dominated channels and LC transmission lines. The combination of GRS and clock forwarding enables simple but efficient signaling across on-chip communication fabrics, off-chip organic packages, and off-package printed circuit boards. Various methodologies compatible with GRS are presented in this paper, including design considerations and circuit architectures. Experimental results for multiple generations of GRS-based serial links are presented, including a 16Gb/s 170fJ/b/mm on-chip link, a 20Gb/s 0.58pJ/b link across an organic package, and a 25Gb/s 1.17pJ/b link signaling over a printed circuit board.
- Published
- 2018
- Full Text
- View/download PDF
34. Darwin
- Author
-
Gill Bejerano, William J. Dally, and Yatish Turakhia
- Subjects
Speedup, Coprocessor, Computer science, Sequence assembly, Genomics, Sequence alignment, Parallel computing, Software, Mutation, Computer Graphics and Computer-Aided Design, Darwin (ADL), Mutation (genetic algorithm), Hardware acceleration, Human genome - Abstract
Genomics is transforming medicine and our understanding of life in fundamental ways. Genomics data, however, is far outpacing Moore's Law. Third-generation sequencing technologies produce 100X longer reads than second-generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. Over 1,300 CPU hours are required for reference-guided assembly of the human genome, and over 15,600 CPU hours are required for de novo assembly. This paper describes "Darwin", a co-processor for genomic sequence alignment that, without sacrificing sensitivity, provides up to 15,000X speedup over the state-of-the-art software for reference-guided assembly of third-generation reads. Darwin achieves this speedup through hardware/algorithm co-design, trading more easily accelerated alignment for less memory-intensive filtering, and by optimizing the memory system for filtering. Darwin combines a hardware-accelerated version of D-SOFT, a novel filtering algorithm, with a hardware-accelerated version of GACT, a novel alignment algorithm. GACT generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for the compute-intensive step. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.
- Published
- 2018
- Full Text
- View/download PDF
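Entry 34 above accelerates filtering (D-SOFT) and alignment (GACT). The sketch below is a plain Smith-Waterman local-alignment scoring kernel, included only to make concrete the dynamic-programming inner loop that GACT-style tiling targets; it implements neither GACT's constant-memory tiling nor D-SOFT, and the scoring parameters are arbitrary.

```python
# Plain Smith-Waterman local alignment score (the DP kernel that GACT-style tiling targets).
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, rows):
        curr = [0] * cols
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,
                          prev[j - 1] + sub,    # match / mismatch
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best

if __name__ == "__main__":
    # Two short reads sharing a common core; expect a positive local score.
    print(smith_waterman_score("ACACACTA", "AGCACACA"))
```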
35. A 1.17pJ/b 25Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication in 16nm CMOS using a process- and temperature-adaptive voltage regulator
- Author
-
William J. Dally, Walker J. Turner, Sanquan Song, Sunil Sudhakaran, Brian Zimmer, C. Thomas Gray, Sudhir S. Kudva, Stephen G. Tell, Xi Chen, John Wilson, Wenxu Zhao, John W. Poulton, and Nikola Nedovic
- Subjects
Packaging engineering, Serial communication, Computer science, Circuit design, Bandwidth (signal processing), Electrical engineering, Voltage regulator, Phase-locked loop, Data link, CMOS, Signal integrity, Electrical impedance - Abstract
Toward the end of the Moore's-law era, increases in system complexity will rely more heavily on packaging technology. Systems will increasingly comprise multiple chips that must be linked by high-speed data channels carrying a substantial fraction of on-chip bandwidth. To take advantage of inexpensive organic packages and conventional printed circuit (PC) boards, data links that are both energy and pin efficient are needed. A link between neighboring packages is by far the more challenging application due to increased cross-talk, signal attenuation, and reflections from impedance discontinuities. The combination of signal integrity challenges and production margining requires increased amplitude, equalization, ESD protection, and PVT-tolerant circuit design techniques.
- Published
- 2018
- Full Text
- View/download PDF
36. Reuse Distance-Based Probabilistic Cache Replacement
- Author
-
Tor M. Aamodt, Subhasis Das, and William J. Dally
- Subjects
Computer science, Adaptive replacement cache, Probabilistic logic, Parallel computing, Reuse, Metadata, Hardware and Architecture, Overhead (business), Cache, Cache algorithms, Software, DRAM, Information Systems - Abstract
This article proposes Probabilistic Replacement Policy (PRP), a novel replacement policy that evicts the line with minimum estimated hit probability under optimal replacement instead of the line with maximum expected reuse distance. The latter is optimal under the independent reference model of programs, which does not hold for last-level caches (LLC). PRP requires 7% and 2% metadata overheads in the cache and DRAM respectively. Using a sampling scheme makes DRAM overhead negligible, with minimal performance impact. Including detailed overhead modeling and equal cache areas, PRP outperforms SHiP, a state-of-the-art LLC replacement algorithm, by 4% for memory-intensive SPEC-CPU2006 benchmarks.
- Published
- 2015
- Full Text
- View/download PDF
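Entry 36 above evicts the line with the minimum estimated hit probability rather than the maximum expected reuse distance. The toy simulator below conveys only that flavor: it keeps a per-address history of observed reuse distances and, on a miss in a full fully-associative cache, evicts the resident line whose chance of being reused within a cache-capacity window looks smallest. The probability estimate, window, and prior are invented for illustration and are not the PRP estimator.

```python
# Toy "evict the line least likely to hit" cache, driven by per-address reuse history.
from collections import defaultdict

class ToyProbabilisticCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.resident = set()
        self.last_access = {}                      # addr -> time of last access
        self.history = defaultdict(list)           # addr -> observed reuse distances
        self.time = 0
        self.hits = self.misses = 0

    def _hit_probability(self, addr):
        # Fraction of past reuses of addr that arrived within one cache-capacity window.
        dists = self.history[addr]
        if not dists:
            return 0.5                             # assumed prior for lines with no history
        return sum(d <= self.capacity for d in dists) / len(dists)

    def access(self, addr):
        self.time += 1
        if addr in self.last_access:
            self.history[addr].append(self.time - self.last_access[addr])
        self.last_access[addr] = self.time

        if addr in self.resident:
            self.hits += 1
            return
        self.misses += 1
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=self._hit_probability)
            self.resident.remove(victim)
        self.resident.add(addr)

if __name__ == "__main__":
    cache = ToyProbabilisticCache(capacity=4)
    # A hot working set {0, 1, 2} mixed with streaming addresses that never repeat.
    trace = [0, 1, 2, 100, 0, 1, 2, 101, 0, 1, 2, 102, 0, 1, 2, 103]
    for a in trace:
        cache.access(a)
    print(f"hits={cache.hits} misses={cache.misses}")
```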
37. SLIP
- Author
-
William J. Dally, Tor M. Aamodt, and Subhasis Das
- Subjects
Computer science, Cache coloring, CPU cache, Pipeline burst cache, Parallel computing, Cache-oblivious algorithm, Cache pollution, Cache invalidation, Write-once, Cache hierarchy, Cache algorithms, Memory hierarchy, General Medicine, Smart Cache, Embedded system, Bus sniffing, Page cache, Cache - Abstract
Wire energy has become the major contributor to energy in large lower-level caches. While wire energy is related to wire latency, its costs are exposed differently in the memory hierarchy. We propose Sub-Level Insertion Policy (SLIP), a cache management policy which improves cache energy consumption by increasing the number of accesses from energy-efficient locations while simultaneously decreasing intra-level data movement. In SLIP, each cache level is partitioned into several cache sublevels of differing sizes. Then, the recent reuse distance distribution of a line is used to choose an energy-optimized insertion and movement policy for the line. The policy choice is made by a hardware unit that predicts the number of accesses and inter-level movements. Using a full-system simulation including OS interactions and hardware overheads, we show that SLIP saves 35% energy at the L2 and 22% energy at the L3 level and performs 0.75% better than a regular cache hierarchy in a single-core system. When configured to include a bypassing policy, SLIP reduces traffic to DRAM by 2.2%. This is achieved at the cost of storing 12b metadata per cache line (2.3% overhead), a 6b policy in the PTE, and 32b distribution metadata for each page in the DRAM (an overhead of 0.1%). Using SLIP in a multiprogrammed system saves 47% LLC energy and reduces traffic to DRAM by 5.5%.
- Published
- 2015
- Full Text
- View/download PDF
38. On-Chip Active Messages for Speed, Scalability, and Efficiency
- Author
-
R. Curtis Harting and William J. Dally
- Subjects
Computer science, Computational Theory and Mathematics, Shared memory, Hardware and Architecture, Signal Processing, Scalability, Operating system, Concurrent computing, Overhead (computing), Cache coherence, Energy (signal processing), Computer network - Abstract
This paper describes and quantifies the benefits of adding low-overhead active messages to many-core, cache-coherent chip-multiprocessors. The active messages we analyze are user defined and trigger the atomic execution of a custom software handler at the destination. Programmers can use these active messages to both move data with less overhead than cache coherency and, more importantly, explicitly send computation to data. Doing so greatly improves (11× speed, 4.8× energy) communication idioms such as shared object modification, reductions, data walks, point-to-point communication, and all-to-all communication. Active messages enhance program scalability: applications using them run 63 percent faster with 11 percent less energy on 256 cores. The relative benefits of active messages grow with larger numbers of cores.
- Published
- 2015
- Full Text
- View/download PDF
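Entry 38 above argues for sending computation to data instead of moving data through the coherence protocol. The sketch below mimics that pattern in plain Python with one message queue per simulated core: a sender enqueues a handler plus arguments, and the owning core runs the handler atomically against its local data. The core and handler structure is invented for illustration; real on-chip active messages are a hardware and runtime mechanism, not Python queues.

```python
# Software mock-up of "send the computation to the data" with per-core message queues.
from collections import deque

class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.local_data = {}                    # data owned (homed) by this core
        self.inbox = deque()                    # pending active messages

    def send(self, handler, *args):
        """Enqueue an active message: a handler to run at this core, plus arguments."""
        self.inbox.append((handler, args))

    def drain(self):
        """Run queued handlers one at a time, so each executes atomically on local data."""
        while self.inbox:
            handler, args = self.inbox.popleft()
            handler(self.local_data, *args)

# Example handlers: update a shared counter and read it back via a reply message.
def add_to_counter(data, key, amount):
    data[key] = data.get(key, 0) + amount

def read_counter(data, key, reply_core):
    reply_core.send(lambda d, k, v: d.__setitem__(k, v), f"reply_{key}", data.get(key, 0))

if __name__ == "__main__":
    home, requester = Core(0), Core(1)
    for _ in range(3):
        home.send(add_to_counter, "hits", 1)    # computation travels to the data's home core
    home.send(read_counter, "hits", requester)
    home.drain()
    requester.drain()
    print(requester.local_data)                 # {'reply_hits': 3}
```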
39. On-Demand Dynamic Branch Prediction
- Author
-
Song Han, Tor M. Aamodt, William J. Dally, and Milad Mohammadi
- Subjects
Speedup, Computer science, Speculative execution, Thread (computing), Parallel computing, Branch predictor, Power budget, Branch table, Hardware and Architecture, Compiler, Cache - Abstract
In out-of-order (OoO) processors, speculative execution with high branch prediction accuracy is employed to achieve good single-thread performance. In these processors, the branch prediction unit (BPU) tables are accessed in parallel with the instruction cache before it is known whether a fetch group contains branch instructions. For integer applications, we find 85 percent of BPU lookups are done for non-branch operations and, of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate on-demand branch prediction (ODBP), a novel technique that uses compiler-generated hints to identify those instructions that can be more accurately predicted statically and thereby eliminates unnecessary BPU lookups. We evaluate an implementation of ODBP that combines static and dynamic branch prediction. For a four-wide superscalar processor, ODBP delivers as much as 9 percent improvement in average energy-delay (ED) product, 7 percent core average energy saving, and 3 percent speedup. ODBP also enables the use of large BPUs for a given power budget.
- Published
- 2015
- Full Text
- View/download PDF
40. Fine-grained DRAM
- Author
-
Niladrish Chatterjee, Aditya Agrawal, Stephen W. Keckler, John Wilson, William J. Dally, Donghyuk Lee, and Mike O'Connor
- Subjects
Dynamic random-access memory, Computer science, Locality, High Bandwidth Memory, Memory controller, CAS latency, Universal memory, Embedded system, Bandwidth (computing), DRAM, Efficient energy use - Abstract
Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM. Satisfying this bandwidth demand within an acceptable energy budget is a challenge in these extreme-bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and improves the energy efficiency of DRAM by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2). These benefits are in large measure achieved by partitioning the DRAM die into many independent units, called grains, each of which has a local, adjacent I/O. This approach allows the bandwidth of all the banks in the DRAM to be used simultaneously, eliminating shared buses interconnecting the various banks. Furthermore, the on-DRAM data movement energy is significantly reduced due to the much shorter wiring distance between the cell array and the local I/O. This FGDRAM architecture readily lends itself to leveraging existing techniques to reduce the effective DRAM row size in an area-efficient manner, reducing wasteful row-activate energy in applications with low locality. In addition, when FGDRAM is paired with a memory controller optimized to exploit the additional concurrency provided by the independent grains, it improves GPU system performance by 19% over an iso-bandwidth and iso-capacity future HBM baseline. Thus, this energy-efficient, high-bandwidth FGDRAM architecture addresses the needs of future extreme-bandwidth memory systems.
- Published
- 2017
- Full Text
- View/download PDF
41. Exploring the Granularity of Sparsity in Convolutional Neural Networks
- Author
-
Xingyu Liu, Huizi Mao, Jeff Pool, William J. Dally, Wenshuo Li, Song Han, and Yu Wang
- Subjects
Hardware architecture ,Artificial neural network ,business.industry ,Computer science ,MathematicsofComputing_NUMERICALANALYSIS ,02 engineering and technology ,Machine learning ,computer.software_genre ,Convolutional neural network ,020202 computer hardware & architecture ,Statistics::Machine Learning ,Kernel (linear algebra) ,ComputingMethodologies_PATTERNRECOGNITION ,Compression (functional analysis) ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,020201 artificial intelligence & image processing ,Multiplication ,Granularity ,Artificial intelligence ,business ,Algorithm ,computer - Abstract
Sparsity reduces the computational complexity of DNNs by skipping multiplications with zeros. The granularity of sparsity affects both the efficiency of the hardware architecture and the prediction accuracy. In this paper we quantitatively measure the accuracy-sparsity relationship at different granularities. Coarse-grained sparsity yields a more regular sparsity pattern, making it easier to accelerate in hardware, and our experimental results show that coarse-grained sparsity has very little impact on the achievable sparsity ratio at no loss of accuracy. Moreover, due to the index-saving effect, coarse-grained sparsity obtains similar or even better compression rates than fine-grained sparsity at the same accuracy threshold. Our analysis, based on the framework of a recent sparse convolutional neural network (SCNN) accelerator, further demonstrates that it saves 30%-35% of memory references compared with fine-grained sparsity.
- Published
- 2017
- Full Text
- View/download PDF
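The trade-off described in the record above can be illustrated with a few lines of NumPy: prune a convolutional weight tensor once at element granularity and once at whole-kernel granularity to the same density, then compare how many index entries each scheme must store. Shapes, thresholds, and the density target are arbitrary choices for the example, not the authors' setup.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 32, 3, 3))          # (out_ch, in_ch, kH, kW)
    density = 0.4                                # keep ~40% of the weights

    # Fine-grained: prune individual weights by magnitude.
    thr = np.quantile(np.abs(W), 1 - density)
    fine_mask = np.abs(W) > thr

    # Coarse-grained: prune whole 3x3 kernels by their L1 norm.
    norms = np.abs(W).sum(axis=(2, 3))           # (out_ch, in_ch)
    kthr = np.quantile(norms, 1 - density)
    coarse_mask = np.broadcast_to((norms > kthr)[:, :, None, None], W.shape)

    # Both masks keep ~40% of weights, but the coarse mask needs one index
    # per surviving kernel instead of one per surviving element.
    print(f"density: fine={fine_mask.mean():.2f}, coarse={coarse_mask.mean():.2f}")
    print(f"index entries: fine={int(fine_mask.sum())}, coarse={int((norms > kthr).sum())}")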
42. Architecting an Energy-Efficient DRAM System for GPUs
- Author
-
Stephen W. Keckler, Minsoo Rhu, Donghyuk Lee, Daniel R. Johnson, Niladrish Chatterjee, Mike O'Connor, and William J. Dally
- Subjects
010302 applied physics ,Hardware_MEMORYSTRUCTURES ,Computer science ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Partition (database) ,CAS latency ,020202 computer hardware & architecture ,0103 physical sciences ,Datapath ,0202 electrical engineering, electronic engineering, information engineering ,Bandwidth (computing) ,Memory rank ,Throughput (business) ,Dram ,Efficient energy use - Abstract
This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors. In these systems, requests from thousands of concurrent threads compete for a limited number of DRAM row buffers. As a result, only a fraction of the data fetched into a row buffer is used, leading to significant energy overheads. Our proposed DRAM architecture exploits the hierarchical organization of a DRAM bank to reduce the minimum row-activation granularity. To avoid significant incremental area with this approach, we partition the DRAM datapath into a number of semi-independent subchannels. These narrow subchannels increase data-toggling energy, which we mitigate using a static data reordering scheme designed to lower the toggle rate. This design consumes 35% less energy than a conventional die-stacked DRAM, at a 2.6% area overhead. The resulting architecture, when augmented with an improved memory access protocol, can support parallel operations across the semi-independent subchannels, improving system performance by 13% on average across a range of workloads.
- Published
- 2017
- Full Text
- View/download PDF
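The toggle-energy issue on a narrow subchannel, mentioned in the record above, can be demonstrated with a toy toggle counter: for correlated data, the order in which bytes are serialized changes how many wires flip per transfer. The byte-lane grouping below is a generic illustration of a static reordering, not the specific scheme the paper uses.

    # Toy toggle counter for a byte-wide sub-interface, comparing word-serial
    # byte order against a byte-lane-grouped order for correlated data.
    def toggles(stream):
        return sum(bin(a ^ b).count("1") for a, b in zip(stream, stream[1:]))

    def to_bytes(word):
        return [(word >> (8 * k)) & 0xFF for k in range(4)]

    burst = [1000 + 3 * i for i in range(64)]     # high bytes rarely change

    word_serial = [b for w in burst for b in to_bytes(w)]
    lane_grouped = [to_bytes(w)[k] for k in range(4) for w in burst]

    print("word-serial toggles :", toggles(word_serial))
    print("lane-grouped toggles:", toggles(lane_grouped))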
43. Darwin: A Hardware-acceleration Framework for Genomic Sequence Alignment
- Author
-
William J. Dally, Gill Bejerano, Yatish Turakhia, and Kevin J. Zheng
- Subjects
Genetics ,Theoretical computer science ,Speedup ,Software ,business.industry ,Darwin (ADL) ,Mutation (genetic algorithm) ,Hardware acceleration ,Sequence assembly ,Genomics ,Biology ,business ,Alignment-free sequence analysis - Abstract
Genomics is set to transform medicine and our understanding of life in fundamental ways, but the growth in genomics data has been overwhelming, far outpacing Moore's Law. Third-generation sequencing technologies provide new insights into the genomic contribution to diseases with complex mutation events, but at prohibitively high computational cost. Over 1,300 CPU hours are required to align reads from a 54× coverage of the human genome to a reference (estimated using [1]), and over 15,600 CPU hours to assemble the reads de novo [2]. This paper proposes "Darwin", a hardware-accelerated framework for genomic sequence alignment that, without sacrificing sensitivity, provides 125× and 15.6× speedup over the state-of-the-art software counterparts for reference-guided and de novo assembly of third-generation sequencing reads, respectively. For pairwise alignment of sequences, Darwin is over 39,000× more energy-efficient than software. Darwin uses (i) a novel filtration strategy, called D-SOFT, to reduce the search space for sequence alignment at high speed, and (ii) a hardware-accelerated version of GACT, a novel algorithm that generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for the trace-back. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.
- Published
- 2017
- Full Text
- View/download PDF
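The filtration step in the record above can be sketched as diagonal-band seed counting: hash k-mers of the reference, look up non-overlapping seeds from the read, accumulate the bases covered per diagonal band, and keep bands above a threshold for full alignment. This is only in the spirit of D-SOFT; the k-mer size, band width, threshold, and scoring here are invented for the example.

    # Toy seed filter in the spirit of D-SOFT: bin seed hits by diagonal band
    # and keep candidate bands whose covered-base count exceeds a threshold.
    from collections import defaultdict

    def candidate_bands(reference, read, k=4, band=8, threshold=8):
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(i)

        covered = defaultdict(int)                   # band id -> bases covered
        for j in range(0, len(read) - k + 1, k):     # non-overlapping read seeds
            for i in index.get(read[j:j + k], []):
                covered[(i - j) // band] += k
        return [b for b, c in covered.items() if c >= threshold]

    ref = "ACGTACGTTTGACCAGGATTACAGGCGTGAGCCACCGCGCCCGGCCACGT"
    read = "GATTACAGGCGTGAGCCACC"
    print(candidate_bands(ref, read))                # one band survives filtration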
44. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
- Author
-
Stephen W. Keckler, Joel Emer, Minsoo Rhu, William J. Dally, Rangharajan Venkatesan, Antonio Puglielli, Angshuman Parashar, Brucek Khailany, and Anurag Mukkara
- Subjects
010302 applied physics ,FOS: Computer and information sciences ,Random access memory ,Artificial neural network ,Computer science ,Dataflow ,Computer Science - Neural and Evolutionary Computing ,Parallel computing ,02 engineering and technology ,General Medicine ,Convolutional neural network ,01 natural sciences ,020202 computer hardware & architecture ,Machine Learning (cs.LG) ,Computer Science - Learning ,Computer engineering ,0103 physical sciences ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,Neural and Evolutionary Computing (cs.NE) ,Accumulator (computing) ,Computer Science - Hardware Architecture ,Efficient energy use - Abstract
Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and the zero-valued activations that arise from the common ReLU operator applied during inference. Specifically, SCNN employs a novel dataflow that maintains the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to the multiplier array, where they are extensively reused. In addition, the multiplication products are accumulated in a novel accumulator array. Our results show that on contemporary neural networks, SCNN improves performance and energy by factors of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.
- Published
- 2017
- Full Text
- View/download PDF
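The heart of the dataflow described in the record above is an all-pairs product of nonzero weights and nonzero activations whose results are scattered to output coordinates. The 1-D sketch below shows that arithmetic functionally, under an assumed (position, value) compressed encoding; the real SCNN tiles, partitions, and accumulates this work across hardware units in ways the sketch ignores.

    # 1-D sparse convolution as a Cartesian-product dataflow:
    # compressed weights x compressed activations -> scatter-add into outputs.
    # A functional simplification, not the SCNN hardware dataflow itself.
    def sparse_conv1d(act, wgt, out_len):
        nz_act = [(i, a) for i, a in enumerate(act) if a != 0.0]
        nz_wgt = [(k, w) for k, w in enumerate(wgt) if w != 0.0]
        out = [0.0] * out_len
        for i, a in nz_act:                  # all-pairs product of nonzeros
            for k, w in nz_wgt:
                o = i - k                    # output coordinate for this pair
                if 0 <= o < out_len:
                    out[o] += a * w          # scatter-accumulate
        return out

    act = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0]     # mostly-zero activations (post-ReLU)
    wgt = [0.5, 0.0, -1.0]                   # pruned filter taps
    print(sparse_conv1d(act, wgt, out_len=len(act) - len(wgt) + 1))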
45. A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications
- Author
-
John W. Poulton, Thomas Hastings Greer, C. Thomas Gray, William J. Dally, John Eyles, Stephen G. Tell, Xi Chen, and John Wilson
- Subjects
Engineering ,Serial communication ,business.industry ,Electrical engineering ,Integrated circuit design ,Chip ,Die (integrated circuit) ,CMOS ,Low-power electronics ,Hardware_INTEGRATEDCIRCUITS ,Charge pump ,Electronic engineering ,Electrical and Electronic Engineering ,business ,Ground plane - Abstract
High-speed signaling over high-density interconnect on organic package substrates or silicon interposers offers an attractive solution to the off-chip bandwidth limitation faced by modern digital systems. In this paper, we describe a signaling system co-designed with the interconnect to take advantage of the characteristics of this environment and enable a high-speed, low-area, low-power die-to-die link. Ground-Referenced Signaling (GRS) is a single-ended signaling system that eliminates the major problems traditionally associated with single-ended design by using the ground plane as the reference and signaling above and below ground. This design employs a novel charge-pump driver that additionally eliminates simultaneous switching noise through data-independent current consumption. Silicon measurements from a test chip implementing two 16-lane links, with forwarded clocks, in a standard 28 nm process demonstrate 20 Gb/s operation at 0.54 pJ/bit over 4.5 mm organic substrate channels at a nominal 0.9 V supply. Timing margins at the receiver are >0.3 UI at a BER of 10^-12, and we estimate a BER of 10^-25 at the eye center.
- Published
- 2013
- Full Text
- View/download PDF
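The headline figure of merit in the record above translates directly into link power: energy per bit times data rate. The quick arithmetic below uses only the quoted 0.54 pJ/b, 20 Gb/s per lane, and 16-lane configuration; it is a sanity check, not a number reported in the paper.

    # Back-of-the-envelope link power from the quoted figures.
    pj_per_bit = 0.54
    gbps_per_lane = 20
    lanes = 16

    mw_per_lane = pj_per_bit * gbps_per_lane           # pJ/b x Gb/s = mW
    print(f"per lane : {mw_per_lane:.1f} mW")
    print(f"{lanes} lanes: {mw_per_lane * lanes:.0f} mW for {gbps_per_lane * lanes} Gb/s")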
46. Elastic Buffer Flow Control for On-Chip Networks
- Author
-
William J. Dally and George Michelogiannakis
- Subjects
Router ,Flow control (data) ,business.industry ,Computer science ,Throughput ,Buffer (optical fiber) ,Theoretical Computer Science ,Network on a chip ,Computational Theory and Mathematics ,Hardware and Architecture ,Embedded system ,business ,Software ,Computer network - Abstract
Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power on router buffers. In our past work, we developed elastic buffer (EB) flow control, which adds simple control logic in the channels so that pipeline flip-flops (FFs) act as EBs with two storage locations. Channels thereby act as distributed FIFOs, and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks with highly efficient custom SRAM buffers, EB networks provide up to 45 percent shorter cycle time, 16 percent more throughput per unit power, or 22 percent more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks; however, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router that provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Hybrid routers thus operate as EB routers in the common case and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21 percent more throughput per unit power than VC networks and 12 percent more than EB networks.
- Published
- 2013
- Full Text
- View/download PDF
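The channel-as-distributed-FIFO idea in the record above is easy to model behaviorally: every pipeline stage holds at most two flits and passes one downstream only when the next stage has room. The class and cycle-update order below are assumptions for a software toy, not the EB control logic from the paper.

    # Behavioral toy of an elastic-buffer channel: each stage is a two-entry
    # FIFO, so the pipelined channel as a whole acts as a distributed FIFO.
    class EBStage:
        CAPACITY = 2                        # pipeline FF pair used as storage

        def __init__(self):
            self.slots = []

        def ready(self):                    # can this stage accept a flit?
            return len(self.slots) < self.CAPACITY

        def push(self, flit):
            self.slots.append(flit)

        def pop(self):
            return self.slots.pop(0)

    def advance(stages, sink_ready):
        # One cycle: drain the tail if the sink is ready, then shift upstream
        # flits into any stage that has room (downstream-most first).
        if sink_ready and stages[-1].slots:
            print("delivered", stages[-1].pop())
        for up, down in zip(reversed(stages[:-1]), reversed(stages[1:])):
            if up.slots and down.ready():
                down.push(up.pop())

    channel = [EBStage() for _ in range(3)]
    for cycle in range(6):
        if channel[0].ready():
            channel[0].push(f"flit{cycle}")
        advance(channel, sink_ready=(cycle % 2 == 0))   # back-pressure half the time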
47. Current parking regulator for zero droop/overshoot load transient response
- Author
-
William J. Dally, Thomas Hastings Greer, C. Thomas Gray, and Sudhir S. Kudva
- Subjects
010302 applied physics ,Engineering ,business.industry ,020208 electrical & electronic engineering ,02 engineering and technology ,Inductor ,01 natural sciences ,law.invention ,Capacitor ,CMOS ,Control theory ,law ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Overshoot (signal) ,Voltage droop ,Transient response ,Transient (oscillation) ,business ,Voltage - Abstract
Supply voltage integrity during a load transition is a critical problem. Droop on the supply may lead to logic failures, and overshoot can reduce reliability. Voltage deviations from the nominal value force designers to apply design margins to ensure correct operation. This paper addresses two main causes of droop/overshoot on the supply line: the sluggish converter response and the parasitics between the converter and the load. We present the current parking regulator (CPR), a voltage down-converter with almost zero droop or overshoot during a load transient, along with implementation techniques to nullify the effect of the parasitics. The underlying principle is to avoid the inductor slewing time by parking sufficient excess current in the inductor, which is then available for immediate use when the need arises. The system design involves on-die, package, and PCB co-design to minimize the impact of parasitics. Measurement results show negligible droop/overshoot when the load current transitions from 0.8 A to 7.5 A and vice versa in 2 ns, with only 200 nF of load capacitance on the regulated output voltage node. The design was fabricated in a 28 nm CMOS technology and integrated in the same package as the load using an 8-layer substrate. It achieves a peak efficiency of 83% at 8 A load current.
- Published
- 2016
- Full Text
- View/download PDF
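The inductor slewing time that the record above avoids is set by V = L·di/dt. The worked number below uses an assumed inductor value and headroom voltage (not figures from the paper) to show why a conventional converter cannot follow a ~7 A step in 2 ns, while parked current can be steered to the load immediately.

    # Inductor current can only slew at di/dt = V_L / L, so a conventional
    # converter cannot track a 0.8 A -> 7.5 A step in 2 ns.
    # L and V_L are assumed values, not from the paper.
    L = 100e-9                 # assumed 100 nH power inductor
    V_L = 0.9                  # assumed volts available across the inductor
    delta_I = 7.5 - 0.8        # amps, the load step quoted in the abstract

    slew_time_ns = delta_I / (V_L / L) * 1e9
    print(f"inductor needs ~{slew_time_ns:.0f} ns to slew {delta_I:.1f} A")   # ~744 ns >> 2 ns
    # Current parking keeps the excess current circulating in the inductor and
    # steers it to the load on demand, sidestepping this slew-time limit.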
48. 8.6 A 6.5-to-23.3fJ/b/mm balanced charge-recycling bus in 16nm FinFET CMOS at 1.7-to-2.6Gb/s/wire with clock forwarding and low-crosstalk contraflow wiring
- Author
-
Stephen G. Tell, Xi Chen, William J. Dally, Thomas Hastings Greer, Matthew Fojtik, C. Thomas Gray, John W. Poulton, and John Wilson
- Subjects
Interconnection ,Engineering ,business.industry ,Voltage control ,020208 electrical & electronic engineering ,Bandwidth (signal processing) ,020206 networking & telecommunications ,02 engineering and technology ,Voltage regulator ,Crosstalk ,CMOS ,Hardware_INTEGRATEDCIRCUITS ,0202 electrical engineering, electronic engineering, information engineering ,Electronic engineering ,business ,Coding (social sciences) ,Voltage - Abstract
Signaling over chip-scale global interconnect consumes a growing fraction of total power in large processor chips as processes continue to shrink. Addressing this trend requires simple, low-energy, area-efficient signaling for high-bandwidth data buses. This paper describes a balanced charge-recycling bus (BCRB) that achieves quadratic power savings relative to signaling with full-swing CMOS repeaters. The scheme stacks two CMOS repeated-wire links: one operating in the Vtop domain, between Vdd and Vmid = Vdd/2, and the other in the Vbot domain, between Vmid and GND. Unlike previous work [1], no voltage regulator is required to maintain Vmid at Vdd/2 to compensate for differences in data activity between the Vtop and Vbot domains. The BCRB also uses simple single-ended signaling to achieve higher bandwidth per unit bus width than differential buses [2] and lower signaling energy than precharging schemes [3], since it takes full advantage of low switching activity and bus-invert coding.
- Published
- 2016
- Full Text
- View/download PDF
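The quadratic power savings claimed in the record above follow from CV² scaling once the two stacked half-swing domains recycle charge: the same charge drawn from Vdd serves one top-domain and one bottom-domain transition. The arithmetic below is an idealization that ignores the domain-balancing and crosstalk issues the paper addresses.

    # Idealized supply energy per wire transition, full swing vs. a stacked,
    # charge-recycling pair of half-swing domains. Ignores balancing overhead.
    C, Vdd = 1.0, 1.0                          # normalized capacitance and supply

    full_swing = C * Vdd ** 2                  # conventional full-swing repeater
    stacked_pair = C * (Vdd / 2) * Vdd         # charge C*Vdd/2 drawn from Vdd feeds
    per_wire = stacked_pair / 2                # one Vtop and one Vbot transition

    print(f"energy per transition vs. full swing: {per_wire / full_swing:.2f}")  # 0.25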
49. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
- Author
-
Yu Wang, Huazhong Yang, Dongliang Xie, Junlong Kang, Yubin Li, Song Han, Song Yao, Hong Luo, Huizi Mao, William J. Dally, Xin Li, and Yiming Hu
- Subjects
Hardware architecture ,FOS: Computer and information sciences ,Speedup ,Computer Science - Computation and Language ,business.industry ,Computer science ,Quantization (signal processing) ,Deep learning ,Speech recognition ,020208 electrical & electronic engineering ,02 engineering and technology ,Parallel computing ,020202 computer hardware & architecture ,Data flow diagram ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,Artificial intelligence ,Central processing unit ,business ,Field-programmable gate array ,Computation and Language (cs.CL) - Abstract
Long Short-Term Memory (LSTM) networks are widely used in speech recognition. To achieve higher prediction accuracy, machine learning scientists have built larger and larger models, which are both computation-intensive and memory-intensive. Deploying such bulky models results in high power consumption and a high total cost of ownership (TCO) for a data center. To speed up prediction and make it energy-efficient, we first propose a load-balance-aware pruning method that compresses the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy; the pruned model is friendly to parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model across PEs for parallelism and schedules the complicated LSTM dataflow. Finally, we design the hardware architecture, named the Efficient Speech Recognition Engine (ESE), that works directly on the compressed model. Implemented on a Xilinx XCKU060 FPGA running at 200 MHz, ESE achieves 282 GOPS working directly on the compressed LSTM network, corresponding to 2.52 TOPS on the uncompressed one, and processes a full LSTM for speech recognition with a power dissipation of 41 W. Evaluated on the LSTM speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations, and achieves 40x and 11.5x higher energy efficiency than the CPU and GPU, respectively.
- Published
- 2016
- Full Text
- View/download PDF
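Load-balance-aware pruning, as described in the record above, can be sketched as per-partition magnitude pruning: the rows handled by each PE are pruned to the same nonzero count, so no PE becomes a straggler in the parallel sparse kernel. The row interleaving and sparsity target below are assumptions for illustration, not the authors' exact method.

    import numpy as np

    # Schematic load-balance-aware pruning: every PE's share of the weight
    # matrix keeps the same number of nonzeros.
    def load_balanced_prune(W, num_pes=4, keep_ratio=0.1):
        W = W.copy()
        for pe in range(num_pes):
            rows = slice(pe, W.shape[0], num_pes)       # rows interleaved across PEs
            block = W[rows]
            k = int(block.size * keep_ratio)
            thr = np.sort(np.abs(block).ravel())[-k]    # per-PE magnitude threshold
            block[np.abs(block) < thr] = 0.0
        return W

    rng = np.random.default_rng(0)
    Wp = load_balanced_prune(rng.normal(size=(1024, 512)))
    print([int(np.count_nonzero(Wp[pe::4])) for pe in range(4)])   # equal per-PE counts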
50. EIE: Efficient Inference Engine on Compressed Deep Neural Network
- Author
-
Ardavan Pedram, Song Han, Mark Horowitz, Jing Pu, William J. Dally, Xingyu Liu, and Huizi Mao
- Subjects
FOS: Computer and information sciences ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,02 engineering and technology ,Parallel computing ,01 natural sciences ,0103 physical sciences ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,System on a chip ,Static random-access memory ,Computer Science - Hardware Architecture ,Throughput (business) ,010302 applied physics ,Random access memory ,Artificial neural network ,business.industry ,Deep learning ,General Medicine ,020202 computer hardware & architecture ,Uncompressed video ,Hardware acceleration ,Artificial intelligence ,business ,Dram ,Efficient energy use - Abstract
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power. The previously proposed Deep Compression makes it possible to fit large DNNs (AlexNet and VGGNet) entirely in on-chip SRAM by pruning redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy-efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy efficiency, and area efficiency.
- Published
- 2016
- Full Text
- View/download PDF
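The kernel accelerated in the record above is sparse matrix-vector multiplication over weights stored as small indices into a shared codebook, with zero activations skipped entirely. The sketch below is a functional illustration under an assumed column-compressed layout; it is not EIE's storage format, 4-bit run-length encoding, or hardware pipeline.

    import numpy as np

    # Functional sketch: compressed sparse weights (codebook indices) times a
    # sparse activation vector, skipping zero activations.
    def compress_columns(W, codebook):
        cols = []
        for j in range(W.shape[1]):
            rows = np.nonzero(W[:, j])[0]
            codes = [int(np.argmin(np.abs(codebook - W[i, j]))) for i in rows]
            cols.append(list(zip(rows.tolist(), codes)))     # (row, codebook index)
        return cols

    def spmv(cols, codebook, activations, n_rows):
        out = np.zeros(n_rows)
        for j, a in enumerate(activations):
            if a == 0.0:                       # skip zero activations (post-ReLU)
                continue
            for row, code in cols[j]:          # only stored (nonzero) weights
                out[row] += codebook[code] * a
        return out

    rng = np.random.default_rng(0)
    W = np.clip(rng.normal(size=(8, 6)), -2, 2) * (rng.random((8, 6)) < 0.25)
    codebook = np.linspace(-2, 2, 16)          # 16 shared weight values
    x = np.array([0.0, 1.2, 0.0, 0.0, 0.7, 0.0])
    y = spmv(compress_columns(W, codebook), codebook, x, n_rows=8)
    print(np.allclose(y, W @ x, atol=0.3))     # matches up to quantization error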