50 results for "William J. Dally"
Search Results
2. Accelerating Chip Design With Machine Learning.
- Author
-
Brucek Khailany, Haoxing Ren, Steve Dai, Saad Godil, Ben Keller, Robert Kirby, Alicia Klinefelter, Rangharajan Venkatesan, Yanqing Zhang, Bryan Catanzaro, and William J. Dally
- Published
- 2020
- Full Text
- View/download PDF
3. Darwin: A Genomics Coprocessor.
- Author
-
Yatish Turakhia, Gill Bejerano, and William J. Dally
- Published
- 2019
- Full Text
- View/download PDF
4. GPUs and the Future of Parallel Computing.
- Author
-
Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, and David Glasco
- Published
- 2011
- Full Text
- View/download PDF
5. The GPU Computing Era.
- Author
-
John Nickolls and William J. Dally
- Published
- 2010
- Full Text
- View/download PDF
6. Cost-Efficient Dragonfly Topology for Large-Scale Systems.
- Author
-
John Kim, William J. Dally, Steve Scott, and Dennis Abts
- Published
- 2009
- Full Text
- View/download PDF
7. A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm
- Author
-
Joel Emer, Matthew Fojtik, C. Thomas Gray, Ben Keller, Stephen G. Tell, Priyanka Raina, Stephen W. Keckler, Alicia Klinefelter, William J. Dally, Brucek Khailany, Brian Zimmer, Jason Clemons, Rangharajan Venkatesan, Nan Jiang, Yanqing Zhang, Nathaniel Pinckney, and Yakun Sophia Shao
- Subjects
Computer science, Multi-chip module, Bandwidth (signal processing), Scalability, Mesh networking, Inference, System on a chip, Electrical and Electronic Engineering, Chip, Computer hardware, Efficient energy use
- Abstract
Custom accelerators improve the energy efficiency, area efficiency, and performance of deep neural network (DNN) inference. This article presents a scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic chip are optimal for specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. Communication energy is minimized with large on-chip distributed weight storage and a hierarchical network-on-chip and network-on-package, and inference energy is minimized through extensive data reuse. The 16-nm prototype achieves 1.29-TOPS/mm² area efficiency, 0.11 pJ/op (9.5 TOPS/W) energy efficiency, 4.01-TOPS peak performance for a one-chip system, and 127.8 peak TOPS and 1903 images/s ResNet-50 batch-1 inference for a 36-chip system.
- Published
- 2020
- Full Text
- View/download PDF
8. Efficient Embedded Computing.
- Author
-
William J. Dally, James D. Balfour, David Black-Schaffer, James Chen, R. Curtis Harting, Vishal Parikh, JongSoo Park, and David Sheffield
- Published
- 2008
- Full Text
- View/download PDF
9. Research Challenges for On-Chip Interconnection Networks.
- Author
-
John D. Owens, William J. Dally, Ron Ho, Doddaballapur Narasimha-Murthy Jayasimha, Stephen W. Keckler, and Li-Shiuan Peh
- Published
- 2007
- Full Text
- View/download PDF
10. Champagne: Automated Whole-Genome Phylogenomic Character Matrix Method Using Large Genomic Indels for Homoplasy-Free Inference
- Author
-
James K. Schull, Yatish Turakhia, James A. Hemker, William J. Dally, Gill Bejerano, and Barbara Holland
- Subjects
rare genomic changes, Mammals, Evolutionary Biology, Genome, homoplasy-free characters, Nucleotides, Human Genome, incomplete lineage sorting, phylogenomics, Genomics, phylogenetics, INDEL Mutation, Genetics, Animals, Biochemistry and Cell Biology, Ecology, Evolution, Behavior and Systematics, Phylogeny, Developmental Biology
- Abstract
We present Champagne, a whole-genome method for generating character matrices for phylogenomic analysis using large genomic indel events. By rigorously picking orthologous genes and locating large insertion and deletion events, Champagne delivers a character matrix that considerably reduces homoplasy compared with morphological and nucleotide-based matrices, on both established phylogenies and difficult-to-resolve nodes in the mammalian tree. Champagne provides ample evidence in the form of genomic structural variation to support incomplete lineage sorting and possible introgression in Paenungulata and human–chimp–gorilla which were previously inferred primarily through matrices composed of aligned single-nucleotide characters. Champagne also offers further evidence for Myomorpha as sister to Sciuridae and Hystricomorpha in the rodent tree. Champagne harbors distinct theoretical advantages as an automated method that produces nearly homoplasy-free character matrices on the whole-genome scale.
- Published
- 2022
11. Stream Processors: Programmability and Efficiency.
- Author
-
William J. Dally, Ujval J. Kapasi, Brucek Khailany, Jung Ho Ahn, and Abhishek Das
- Published
- 2004
- Full Text
- View/download PDF
12. Programmable Stream Processors.
- Author
-
Ujval J. Kapasi, Scott Rixner, William J. Dally, Brucek Khailany, Jung Ho Ahn, Peter R. Mattson, and John D. Owens
- Published
- 2003
- Full Text
- View/download PDF
13. A Delay Model for Router Microarchitectures.
- Author
-
Li-Shiuan Peh and William J. Dally
- Published
- 2001
- Full Text
- View/download PDF
14. Energy Efficient On-Demand Dynamic Branch Prediction Models
- Author
-
Ehsan Atoofian, Amirali Baniasadi, Milad Mohammadi, Tor M. Aamodt, William J. Dally, and Song Han
- Subjects
Computer science, Fetch, Parallel computing, Supercomputer, Branch predictor, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Compiler, Cache, Software, Integer (computer science), Efficient energy use
- Abstract
The branch predictor unit (BPU) is among the main energy-consuming components in out-of-order (OoO) processors. For integer applications, we find 16 percent of the processor energy is consumed by the BPU. The BPU is accessed in parallel with the instruction cache before it is known if a fetch group contains control instructions. We find 85 percent of BPU lookups are done for non-branch operations, and, of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate two variants of a branch prediction model that combines dynamic and static branch prediction to achieve energy improvements for power-constrained applications. These models, named on-demand branch prediction (ODBP) and path-based on-demand branch prediction (ODBP-PATH), are two novel prediction techniques that eliminate unnecessary BPU lookups using compiler-generated hints to identify instructions that can be more accurately predicted statically. ODBP-PATH is an implementation of ODBP that combines static and dynamic branch prediction based on the program path of execution. For a 4-wide OoO processor, ODBP-PATH delivers an 11 percent average energy-delay (ED) product improvement and a 9 percent average core energy saving on the SPEC Int 2006 benchmarks.
- Published
- 2020
- Full Text
- View/download PDF
15. An Efficient, Protected Message Interface.
- Author
-
Whay Sing Lee, William J. Dally, Stephen W. Keckler, Nicholas P. Carter, and Andrew Chang
- Published
- 1998
- Full Text
- View/download PDF
16. Transmitter equalization for 4-Gbps signaling.
- Author
-
William J. Dally and John W. Poulton
- Published
- 1997
- Full Text
- View/download PDF
17. The message-driven processor: a multicomputer processing node with efficient mechanisms.
- Author
-
William J. Dally, Stuart Fiske, John S. Keen, Richard A. Lethin, Michael D. Noakes, Peter R. Nuth, Roy E. Davison, and Gregory A. Fyler
- Published
- 1992
- Full Text
- View/download PDF
18. A 1.17-pJ/b, 25-Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication Using a Process- and Temperature-Adaptive Voltage Regulator
- Author
-
William J. Dally, C. Thomas Gray, John Wilson, Sudhir S. Kudva, John W. Poulton, Wenxu Zhao, Nikola Nedovic, Stephen G. Tell, Xi Chen, Walker J. Turner, Sunil Sudhakaran, Sanquan Song, and Brian Zimmer
- Subjects
Frequency response, Serial communication, Computer science, Transmitter, Electrical engineering, Voltage regulator, Phase-locked loop, CMOS, Electrical and Electronic Engineering, Transceiver, Jitter
- Abstract
This paper describes a short-reach serial link to connect chips mounted on the same package or on neighboring packages on a printed circuit board (PCB). The link employs an energy-efficient, single-ended ground-referenced signaling scheme. Implemented in 16-nm FinFET CMOS technology, the link operates at a data rate of 25 Gb/s/pin with 1.17-pJ/bit energy efficiency and uses a simple but robust matched-delay clock forwarding scheme that cancels most sources of jitter. The modest frequency-dependent attenuation of short-reach links is compensated using an analog equalizer in the transmitter. The receiver includes active-inductor peaking in the input amplifier to improve overall receiver frequency response. The link employs a novel power supply regulation scheme at both ends that uses a PLL ring-oscillator supply voltage as a reference to flatten circuit speed and reduce power consumption variation across PVT. The link can be calibrated once at an arbitrary voltage and temperature, then track VT variation without the need for periodic re-calibration. The link operates over a 10-mm-long on-package channel with −4 dB of attenuation with a 0.77-UI eye opening at a bit-error rate (BER) of 10⁻¹⁵. A package-to-package link with 54 mm of PCB and 26 mm of on-package trace with −8.5 dB of loss at Nyquist operates with 0.42 UI of eye opening at a BER of 10⁻¹⁵. The overall link die area is 686 μm × 565 μm, with the transceiver circuitry taking up 20% of the area. The transceiver's on-chip regulator is supplied from an off-chip 950-mV supply, while the support logic operates on a separate 850-mV supply.
- Published
- 2019
- Full Text
- View/download PDF
19. A tracking clock recovery receiver for 4-Gbps signaling.
- Author
-
John Poulton, William J. Dally, and Steve Tell
- Published
- 1998
- Full Text
- View/download PDF
20. CG-OoO
- Author
-
Milad Mohammadi, William J. Dally, and Tor M. Aamodt
- Subjects
Out-of-order execution, Exploit, Computer science, Instruction scheduling, Parallel computing, Scheduling (computing), Hardware and Architecture, Instruction pipeline, Granularity, Compiler, Software, Information Systems, Efficient energy use
- Abstract
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose processor designed to achieve close to In-Order (InO) processor energy while maintaining Out-of-Order (OoO) performance. CG-OoO is an energy-performance-proportional architecture. Block-level code processing is at the heart of this architecture; CG-OoO speculates, fetches, schedules, and commits code at block-level granularity. It eliminates unnecessary accesses to energy-consuming tables and turns large tables into smaller, distributed tables that are cheaper to access. CG-OoO leverages compiler-level code optimizations to deliver efficient static code and exploits dynamic block-level and instruction-level parallelism. CG-OoO introduces Skipahead, a complexity effective, limited out-of-order instruction scheduling model. Through the energy efficiency techniques applied to the compiler and processor pipeline stages, CG-OoO closes 62% of the average energy gap between the InO and OoO baseline processors at the same area and nearly the same performance as the OoO. This makes CG-OoO 1.8× more efficient than the OoO on the energy-delay product inverse metric. CG-OoO meets the OoO nominal performance while trading off the peak scheduling performance for superior energy efficiency.
- Published
- 2017
- Full Text
- View/download PDF
21. Hot Chips 16: Power, Parallelism, and Memory Performance.
- Author
-
William J. Dally and Keith Diefendorff
- Published
- 2005
- Full Text
- View/download PDF
22. The bleeding edge.
- Author
-
Randall Rettberg, William J. Dally, and David E. Culler
- Published
- 1998
- Full Text
- View/download PDF
23. A 28 nm 2 Mbit 6 T SRAM With Highly Configurable Low-Voltage Write-Ability Assist Implementation and Capacitor-Based Sense-Amplifier Input Offset Compensation
- Author
-
Andreas J. Gotterba, Mahmut E. Sinangil, Jesse S. Wang, Matthew Fojtik, Jason Golbus, John W. Poulton, Brian Zimmer, Stephen G. Tell, C. Thomas Gray, William J. Dally, and Thomas Hastings Greer
- Subjects
Engineering, Offset (computer science), Input offset voltage, Sense amplifier, Amplifier, Capacitor, CMOS, Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Low voltage
- Abstract
This paper presents a highly configurable low-voltage write-ability assist implementation along with a sense-amplifier offset reduction technique to improve SRAM read performance. The write-assist implementation combines negative bit-line (BL) and VDD collapse schemes efficiently to maximize Vmin improvements while saving on the area and energy overhead of these assists. The relative delay and pulse width of the assist control signals are also configurable to allow tuning of assist strengths. The sense-amplifier offset compensation scheme uses capacitors to store and negate the threshold mismatch of the input transistors. A test chip fabricated in a 28 nm HP CMOS process demonstrates operation down to 0.5 V with write assists and a more than 10% reduction in word-line pulse width with the offset-compensated sense amplifiers.
- Published
- 2016
- Full Text
- View/download PDF
24. Reuse Distance-Based Probabilistic Cache Replacement
- Author
-
Tor M. Aamodt, Subhasis Das, and William J. Dally
- Subjects
Computer science, Adaptive replacement cache, Probabilistic logic, Parallel computing, Reuse, Metadata, Hardware and Architecture, Cache, Cache algorithms, Software, DRAM, Information Systems
- Abstract
This article proposes Probabilistic Replacement Policy (PRP), a novel replacement policy that evicts the line with minimum estimated hit probability under optimal replacement instead of the line with maximum expected reuse distance. The latter is optimal under the independent reference model of programs, which does not hold for last-level caches (LLC). PRP requires 7% and 2% metadata overheads in the cache and DRAM respectively. Using a sampling scheme makes DRAM overhead negligible, with minimal performance impact. Including detailed overhead modeling and equal cache areas, PRP outperforms SHiP, a state-of-the-art LLC replacement algorithm, by 4% for memory-intensive SPEC-CPU2006 benchmarks.
- Published
- 2015
- Full Text
- View/download PDF
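The eviction rule in the abstract above — evict the line with the lowest estimated hit probability rather than the largest expected reuse distance — can be illustrated with a toy model. This sketch is not the paper's PRP mechanism (which estimates hit probability under optimal replacement using sampled metadata); it is a hedged illustration of the core idea, with hit probability derived from an assumed reuse-distance histogram. The names `ToyPRPCache`, `hit_probability`, and the `window` parameter are invented here.

```python
def hit_probability(hist, age, window):
    """Toy estimate: given a line last touched `age` accesses ago and a
    reuse-distance histogram {distance: count}, the probability that its
    next reuse falls within the next `window` accesses."""
    total = sum(c for d, c in hist.items() if d > age)
    if total == 0:
        return 0.0
    near = sum(c for d, c in hist.items() if age < d <= age + window)
    return near / total

class ToyPRPCache:
    """Fully associative toy cache evicting the minimum-hit-probability line."""
    def __init__(self, ways, hist, window):
        self.ways, self.hist, self.window = ways, hist, window
        self.lines = {}  # tag -> age (accesses since last touch)

    def access(self, tag):
        for t in self.lines:          # every reference ages all resident lines
            self.lines[t] += 1
        if tag in self.lines:
            self.lines[tag] = 0
            return True               # hit
        if len(self.lines) >= self.ways:
            victim = min(self.lines, key=lambda t: hit_probability(
                self.hist, self.lines[t], self.window))
            del self.lines[victim]
        self.lines[tag] = 0
        return False                  # miss
```

With a bimodal histogram (most reuses at distance 2, a few at 100), a line that has already aged past the short-reuse mode gets a near-zero hit estimate and is evicted first, which is the behavior the abstract contrasts with expected-reuse-distance eviction.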
25. On-Demand Dynamic Branch Prediction
- Author
-
Song Han, Tor M. Aamodt, William J. Dally, and Milad Mohammadi
- Subjects
Speedup, Computer science, Speculative execution, Thread (computing), Parallel computing, Branch predictor, Power budget, Branch table, Hardware and Architecture, Compiler, Cache
- Abstract
In out-of-order (OoO) processors, speculative execution with high branch prediction accuracy is employed to achieve good single-thread performance. In these processors, the branch prediction unit (BPU) tables are accessed in parallel with the instruction cache before it is known whether a fetch group contains branch instructions. For integer applications, we find 85 percent of BPU lookups are done for non-branch operations, and, of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate on-demand branch prediction (ODBP), a novel technique that uses compiler-generated hints to identify those instructions that can be more accurately predicted statically, eliminating unnecessary BPU lookups. We evaluate an implementation of ODBP that combines static and dynamic branch prediction. For a four-wide superscalar processor, ODBP delivers as much as 9 percent improvement in average energy-delay (ED) product, 7 percent average core energy saving, and 3 percent speedup. ODBP also enables the use of large BPUs for a given power budget.
- Published
- 2015
- Full Text
- View/download PDF
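The filtering idea shared by the two ODBP abstracts — probe the BPU only when a fetch actually needs a dynamic prediction — can be sketched as a simple counting model. This is an illustration under assumed hint categories, not the papers' hardware; `simulate_odbp` and the hint labels are invented for this sketch.

```python
def simulate_odbp(stream):
    """Toy model of on-demand branch prediction (ODBP).

    stream: list of (is_branch, hint) pairs, where hint is one of
    'dynamic' (unbiased branch: needs a BPU lookup), 'static_taken' or
    'static_not_taken' (highly biased: the compiler hint suffices), or
    None (non-branch instruction).
    Returns (baseline_lookups, odbp_lookups).
    """
    baseline = len(stream)  # a baseline front end probes the BPU every fetch
    odbp = sum(1 for is_branch, hint in stream
               if is_branch and hint == 'dynamic')
    return baseline, odbp
```

Plugging in the reported mix (85 percent of lookups for non-branches, and 42 percent of the remainder for biased branches) leaves roughly 15 × 0.58 ≈ 8.7 percent of fetches still probing the BPU, which is the source of the energy saving the abstracts describe.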
26. A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications
- Author
-
John W. Poulton, Thomas Hastings Greer, C. Thomas Gray, William J. Dally, John Eyles, Stephen G. Tell, Xi Chen, and John Wilson
- Subjects
Engineering, Serial communication, Electrical engineering, Integrated circuit design, Chip, Die (integrated circuit), CMOS, Low-power electronics, Charge pump, Electronic engineering, Electrical and Electronic Engineering, Ground plane
- Abstract
High-speed signaling over high-density interconnect on organic package substrates or silicon interposers offers an attractive solution to the off-chip bandwidth limitation faced by modern digital systems. In this paper, we describe a signaling system co-designed with the interconnect to take advantage of the characteristics of this environment and enable a high-speed, low-area, low-power die-to-die link. Ground-Referenced Signaling (GRS) is a single-ended signaling system that eliminates the major problems traditionally associated with single-ended design by using the ground plane as the reference and signaling above and below ground. This design employs a novel charge-pump driver that additionally eliminates the issue of simultaneous switching noise through data-independent current consumption. Silicon measurements from a test chip implementing two 16-lane links, with forwarded clocks, in a standard 28 nm process demonstrate 20 Gb/s operation at 0.54 pJ/bit over 4.5 mm organic substrate channels at a nominal 0.9 V power supply voltage. Timing margins at the receiver are >0.3 UI at a BER of 10⁻¹². We estimate a BER of 10⁻²⁵ at the eye center.
- Published
- 2013
- Full Text
- View/download PDF
27. Elastic Buffer Flow Control for On-Chip Networks
- Author
-
William J. Dally and George Michelogiannakis
- Subjects
Router, Flow control (data), Computer science, Throughput, Buffer, Theoretical Computer Science, Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Software, Computer network
- Abstract
Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power on router buffers. In our past work, we developed elastic buffer (EB) flow control, which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks with highly efficient custom SRAM buffers, EB networks provide up to 45 percent shorter cycle time, 16 percent more throughput per unit power, or 22 percent more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router which provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21 percent more throughput per unit power than VC networks and 12 percent more than EB networks.
- Published
- 2013
- Full Text
- View/download PDF
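The channel organization described in the abstract above — pipeline flip-flops acting as distributed two-slot FIFOs — can be modeled behaviorally. The sketch below is a software analogy, not RTL: `ElasticBuffer`, `advance`, and the valid/ready handshake names are our assumptions for illustrating how flits ripple through a channel of elastic buffer stages.

```python
class ElasticBuffer:
    """Two-slot elastic buffer stage (toy model of a pipeline flip-flop
    whose master and slave latches serve as a depth-2 FIFO)."""
    def __init__(self):
        self.slots = []

    def ready(self):              # can accept a flit this cycle
        return len(self.slots) < 2

    def valid(self):              # has a flit to offer downstream
        return len(self.slots) > 0

    def push(self, flit):
        assert self.ready()
        self.slots.append(flit)

    def pop(self):
        assert self.valid()
        return self.slots.pop(0)

def advance(chain, inject=None):
    """One cycle of a channel built from EB stages: drain the tail, shift
    flits forward wherever the next stage has room, inject at the head."""
    out = chain[-1].pop() if chain[-1].valid() else None
    for i in range(len(chain) - 2, -1, -1):
        if chain[i].valid() and chain[i + 1].ready():
            chain[i + 1].push(chain[i].pop())
    if inject is not None and chain[0].ready():
        chain[0].push(inject)
    return out
```

Running three flits through a three-stage channel shows the distributed-FIFO behavior: one flit emerges per cycle after a fill latency equal to the channel depth, with no router input buffer involved.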
28. Evaluating Elastic Buffer and Wormhole Flow Control
- Author
-
Daniel U. Becker, William J. Dally, and George Michelogiannakis
- Subjects
Router, Flow control (data), Computer science, Throughput, Buffer, Theoretical Computer Science, Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Wormhole, Software, Computer network
- Abstract
With the emergence of on-chip networks, router buffer power has become a primary concern. Elastic buffer (EB) flow control utilizes existing pipeline flip-flops in the channels to implement distributed FIFOs, eliminating the need for input buffers at the routers. EB routers have been shown to be more efficient than virtual channel routers, as they do not require input buffers or complex logic for managing virtual channels and tracking credits. Wormhole routers are more comparable in terms of complexity because they also lack virtual channels. This paper compares EB and wormhole routers and explores novel hybrid designs to more closely examine the effect of design simplicity and input buffer cost. Our results show that EB routers have up to 25 percent smaller cycle time compared to wormhole and hybrid routers. Moreover, EB flow control requires 10 percent less energy to transfer a single bit through a router and offers three percent more throughput per unit energy as well as 62 percent more throughput per unit area. The main contributor to these results is the cost and delay overhead of the input buffer.
- Published
- 2011
- Full Text
- View/download PDF
29. Operand Registers and Explicit Operand Forwarding
- Author
-
William J. Dally, James Balfour, and R. Curtis Harting
- Subjects
Memory hierarchy, Computer science, Pipeline (computing), Parallel computing, Operand, Hardware and Architecture, Code generation, Routing (electronic design automation), Fixed-point arithmetic, Operand forwarding, Efficient energy use
- Abstract
Operand register files are small, inexpensive register files that are integrated with function units in the execute stage of the pipeline, effectively extending the pipeline operand registers into register files. Explicit operand forwarding lets software opportunistically orchestrate the routing of operands through the forwarding network to avoid writing ephemeral values to registers. Both mechanisms let software capture short-term reuse and locality close to the function units, improving energy efficiency by allowing a significant fraction of operands to be delivered from inexpensive registers that are integrated with the function units. An evaluation shows that capturing operand bandwidth close to the function units allows operand registers to reduce the energy consumed in the register files and forwarding network of an embedded processor by 61%, and allows explicit forwarding to reduce the energy consumed by 26%.
- Published
- 2009
- Full Text
- View/download PDF
30. Hierarchical Instruction Register Organization
- Author
-
James Balfour, Jongsoo Park, David Black-Schaffer, William J. Dally, and Vishal Parikh
- Subjects
Instruction set, Instruction register, Computer architecture, Indirection, Hardware and Architecture, Computer science, Very long instruction word, Overhead (computing)
- Abstract
This paper analyzes a range of architectures for efficient delivery of VLIW instructions for embedded media kernels. The analysis takes an efficient filter cache as a baseline and examines the benefits from 1) removing the tag overhead, 2) distributing the storage, 3) adding indirection, 4) adding efficient NOP generation, and 5) sharing instruction memory. The result is a hierarchical instruction register organization that provides a 56% energy and 40% area savings over an already efficient filter cache.
- Published
- 2008
- Full Text
- View/download PDF
31. An Energy-Efficient Processor Architecture for Embedded Systems
- Author
-
Jongsoo Park, James Balfour, David Black-Schaffer, William J. Dally, and Vishal Parikh
- Subjects
Instruction register, Instructions per cycle, Reduced instruction set computing, Computer science, Processor register, Transport triggered architecture, Microarchitecture, Addressing mode, Instruction set, Computer architecture, Hardware and Architecture, Embedded system
- Abstract
We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23× greater than that of an embedded RISC processor.
- Published
- 2008
- Full Text
- View/download PDF
32. Flattened Butterfly Topology for On-Chip Networks
- Author
-
William J. Dally, John Kim, and James Balfour
- Subjects
Butterfly network, Multi-core processor, Computer science, Energy consumption, Parallel computing, Network topology, Topology, Network on a chip, Hardware and Architecture, Latency (engineering), Computer network
- Abstract
With the trend toward increasing numbers of cores in multicore processors, the on-chip network that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks for on-chip interconnect and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly offers lower latency and energy consumption than conventional on-chip topologies. In addition, by properly using bypass channels in the flattened butterfly network, non-minimal routing can be employed without increasing latency or energy consumption.
- Published
- 2007
- Full Text
- View/download PDF
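The latency claim in the abstract above — fewer hops because high-radix routers shrink the network diameter — can be checked with a small hop-count comparison. This sketch compares average router-to-router hop counts in a k×k mesh against a 2-D flattened butterfly on the same router grid, where each row and each column is fully connected; the function names and setup are ours, not the paper's.

```python
import itertools

def avg_hops(k, dist):
    # average hop count over all ordered pairs of distinct routers in a k x k grid
    coords = list(itertools.product(range(k), repeat=2))
    pairs = [(a, b) for a in coords for b in coords if a != b]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def mesh_dist(a, b):
    # mesh: one hop per unit step in each dimension (Manhattan distance)
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def flattened_butterfly_dist(a, b):
    # flattened butterfly: rows and columns are fully connected, so at most
    # one hop per dimension in which the routers differ (diameter 2)
    return (a[0] != b[0]) + (a[1] != b[1])
```

For an 8×8 grid, the mesh averages more than five hops between routers while the flattened butterfly averages fewer than two, at the cost of higher-radix routers and longer channels — the trade the abstract describes.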
33. A 33-mW 8-Gb/s CMOS clock multiplier and CDR for highly integrated I/Os
- Author
-
R. Rathi, William J. Dally, R. Senthinathan, Hiok-Tiaq Ng, Trey Greer, M.-J.E. Lee, J. Edmondson, John W. Poulton, J. Tran, A. Nguyen, and Ramin Farjad-Rad
- Subjects
Offset (computer science), Vernier scale, Frequency multiplier, Electrical engineering, Voltage regulator, Multiplexer, Injection locking, CMOS, Electronic engineering, Electrical and Electronic Engineering, Clock recovery, Jitter, Clock multiplier
- Abstract
A 0.622–8-Gb/s clock and data recovery (CDR) circuit using injection locking for jitter suppression and phase interpolation in high-bandwidth system-on-chip solutions is described. A slave injection-locked oscillator (SILO) is locked to a tracking-aperture multiplying DLL (TA-MDLL) via a coarse phase-selection multiplexer (MUX). For the fine timing vernier, an interpolator DAC controls the injection strength of the MUX output into the SILO. This 1.2-V 0.13-μm CMOS CDR consumes 33 mW at 8 Gb/s. Die area including the voltage regulator is 0.08 mm². Recovered clock jitter is 49 ps pk-pk at a 200-ppm bit-rate offset.
- Published
- 2004
- Full Text
- View/download PDF
34. A second-order semidigital clock recovery circuit based on injection locking
- Author
-
Hiok-Tiaq Ng, Trey Greer, William J. Dally, Ramin Farjad-Rad, John W. Poulton, J. Edmondson, R. Senthinathan, R. Rathi, and M.-J.E. Lee
- Subjects
Phase-locked loop, Injection locking, Synchronous circuit, CMOS, Clock domain crossing, Electronic engineering, Electrical and Electronic Engineering, Clock skew, Digital clock, Jitter
- Abstract
A compact (1 mm × 160 μm) and low-power (80-mW) 0.18-μm CMOS 3.125-Gb/s clock and data recovery circuit is described. The circuit utilizes injection locking to filter out high-frequency reference clock jitter and multiplying delay-locked loop duty-cycle distortions. The injection-locked slave oscillator can have its output clocks interpolated by current-steering the injecting clocks. A second-order clock and data recovery loop is introduced to perform the interpolation and is capable of tracking frequency offsets while exhibiting low phase wander.
- Published
- 2003
- Full Text
- View/download PDF
35. Jitter transfer characteristics of delay-locked loops - theories and design techniques
- Author
-
Hiok-Tiaq Ng, Trey Greer, M.-J.E. Lee, R. Senthinathan, Ramin Farjad-Rad, William J. Dally, and John W. Poulton
- Subjects
Computer science, Control theory, Frequency domain, Bandwidth (signal processing), Electronic engineering, Electrical and Electronic Engineering, Chip, Jitter, Electronic circuit
- Abstract
This paper presents analyses and experimental results on the jitter transfer of delay-locked loops (DLLs). Through a z-domain model, we show that in a widely used DLL configuration, jitter peaking always exists and high-frequency jitter does not get attenuated as previous analyses suggest. This is true even in a first-order DLL and an overdamped second-order DLL. The amount of jitter peaking is shown to trade off with the tracking bandwidth and, therefore, the acquisition time. Techniques to reduce jitter amplification by loop filtering and phase filtering are discussed. Measurements from a prototype chip incorporating the discussed techniques confirm the prediction of the analytical model. In environments where the reference clock is noisy or where multiple timing circuits are cascaded, this jitter amplification effect should be carefully evaluated.
- Published
- 2003
- Full Text
- View/download PDF
36. A low-power multiplying DLL for low-jitter multigigahertz clock generation in highly integrated digital chips
- Author
-
Hiok-Tiaq Ng, William J. Dally, R. Rathi, John W. Poulton, M.-J.E. Lee, Ramin Farjad-Rad, and R. Senthinathan
- Subjects
Engineering, Noise (electronics), Phase-locked loop, CMOS, Low-power electronics, Phase noise, Electronic engineering, Electrical and Electronic Engineering, Jitter, Clock multiplier
- Abstract
A multiplying delay-locked loop (MDLL) for high-speed on-chip clock generation that overcomes the drawbacks of phase-locked loops (PLLs), such as jitter accumulation and high sensitivity to supply and substrate noise, is described. The MDLL design removes these drawbacks while maintaining the advantages of a PLL for multirate frequency multiplication. The design also uses a supply regulator and filter to further reduce on-chip jitter generation. The MDLL, implemented in 0.18-µm CMOS technology, occupies a total active area of 0.05 mm² and has a speed range of 200 MHz to 2 GHz with selectable multiplication ratios of M=4, 5, 8, and 10. The complete synthesizer, including the output clock buffers, dissipates 12 mW from a 1.8-V supply at 2.0 GHz. This MDLL architecture is used as the clock multiplier integrated on a single chip for a 72×72 STS-1 grooming switch and has a jitter of 1.73 ps (rms) and 13.1 ps (pk-pk).
- Published
- 2002
- Full Text
- View/download PDF
37. Low-power area-efficient high-speed I/O circuit techniques
- Author
-
M.-J.E. Lee, William J. Dally, and Patrick Chiang
- Subjects
Very-large-scale integration ,Engineering ,business.industry ,Amplifier ,Capacitive sensing ,Transmitter ,Hardware_PERFORMANCEANDRELIABILITY ,Integrated circuit design ,CMOS ,Low-power electronics ,Hardware_INTEGRATEDCIRCUITS ,Electronic engineering ,Inverter ,Electrical and Electronic Engineering ,business ,Hardware_LOGICDESIGN - Abstract
We present a 4-Gb/s I/O circuit that fits in 0.1 mm² of die area, dissipates 90 mW of power, and operates over 1 m of 7-mil, 0.5-oz PCB trace in a 0.25-µm CMOS technology. Swing reduction is used in an input-multiplexed transmitter to provide most of the speed advantage of an output-multiplexed architecture with significantly lower power and area. A delay-locked loop (DLL) using a supply-regulated inverter delay line gives very low jitter at a fraction of the power of a source-coupled delay line-based DLL. Receiver capacitive offset trimming decreases the minimum resolvable swing to 8 mV, greatly reducing the transmission energy without affecting the performance of the receive amplifier. These circuit techniques enable a high level of I/O integration to relieve the pin bandwidth bottleneck of modern VLSI chips.
- Published
- 2000
- Full Text
- View/download PDF
38. Concurrent event handling through multithreading
- Author
-
William J. Dally, W.S.L.S. Chatterjee, Andrew Chang, and Stephen W. Keckler
- Subjects
Speedup ,Process state ,Computer science ,business.industry ,Exception handling ,Thread (computing) ,computer.software_genre ,Theoretical Computer Science ,law.invention ,Super-threading ,Microprocessor ,Computational Theory and Mathematics ,Hardware and Architecture ,law ,Embedded system ,Multithreading ,Operating system ,Software system ,business ,computer ,Software ,Context switch - Abstract
Exceptions have traditionally been used to handle infrequently occurring and unpredictable events during normal program execution. Current trends in microprocessor and operating systems design continue to increase the cost of event handling. Because of the deep pipelines and wide out-of-order superscalar architectures of contemporary microprocessors, an event may need to nullify a large number of in-flight instructions. Large register files require existing software systems to save and restore a substantial amount of process state before executing an exception handler. At the same time, processors are executing in environments that supply higher event frequencies and demand higher performance. We have developed an alternative architecture, "concurrent event handling", that incorporates multithreading into event handling architectures. Instead of handling the event in the faulting thread's architectural and pipeline registers, the fault handler is forked into its own thread slot and executes concurrently with the faulting thread. Microbenchmark programs show a factor-of-3 speedup for concurrent event handling over a traditional architecture on code that takes frequent exceptions. We also demonstrate substantial speedups on two event-based applications. Concurrent event handling is implemented in MIT's MAP (Multi-ALU Processor) chip.
- Published
- 1999
- Full Text
- View/download PDF
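Entry 38's idea, forking the handler into its own thread slot instead of draining the faulting thread's pipeline, has a rough software analogy. The sketch below uses plain Python threads and is purely illustrative of the concurrency pattern, not of the MAP hardware mechanism:

```python
import threading, queue

handled = queue.Queue()              # results from concurrently run handlers

def handler(event_id):
    # Instead of unwinding the main thread's state, the handler runs
    # in its own thread "slot", concurrently with the main computation.
    handled.put(event_id)

def main_work():
    results, workers = [], []
    for i in range(3):               # each iteration "raises" an event
        t = threading.Thread(target=handler, args=(i,))
        t.start()
        workers.append(t)
        results.append(i * i)        # main computation continues at once
    for t in workers:                # collect handlers at the end
        t.join()
    return results

print(main_work())  # [0, 1, 4]
```

The key point mirrored here is that the main computation never stalls to save and restore state for the handler; the hardware version achieves this with dedicated thread slots rather than OS threads.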
39. The M-machine multicomputer
- Author
-
Andrew Chang, Yevgeny Gurevich, Nicholas P. Carter, Stephen W. Keckler, Whay S. Lee, William J. Dally, and Marco Fillo
- Subjects
Computer science ,CPU cache ,business.industry ,Mesh networking ,Thread (computing) ,Synchronization ,Theoretical Computer Science ,Software ,Multithreading ,Embedded system ,Systems architecture ,Cache ,business ,Information Systems - Abstract
The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are connected with a 3-D mesh network; each node is a multithreaded processor incorporating 9 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user accessible message passing system yields fast communication and synchronization between nodes. Rapid access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms attempt to maximize both single thread performance and overall system throughput. The architecture is complete and the MAP chip, which will serve as the M-Machine processing node, is currently being implemented.
- Published
- 1997
- Full Text
- View/download PDF
40. A universal parallel computer architecture
- Author
-
William J. Dally
- Subjects
Interconnection ,Computer Networks and Communications ,Computer science ,Uniform memory access ,Throughput ,Memory bandwidth ,Multiprocessing ,Theoretical Computer Science ,Computer architecture ,Parallel processing (DSP implementation) ,Hardware and Architecture ,Concurrent computing ,Network performance ,Software - Abstract
Advances in interconnection network performance and interprocessor interaction mechanisms enable the construction of fine-grain parallel computers in which the nodes are physically small and have a small amount of memory. This class of machines has a much higher ratio of processor to memory area and hence provides greater processor throughput and memory bandwidth per unit cost relative to conventional memory-dominated machines. This paper describes the technology and architecture trends motivating fine-grain architecture and the enabling technologies of high-performance interconnection networks and low-overhead interaction mechanisms. We conclude with a discussion of our experiences with the J-Machine, a prototype fine-grain concurrent computer.
- Published
- 1993
- Full Text
- View/download PDF
41. Deadlock-free adaptive routing in multicomputer networks using virtual channels
- Author
-
William J. Dally and H. Aoki
- Subjects
Interconnection ,Network packet ,Computer science ,business.industry ,Distributed computing ,Fault tolerance ,Adaptive routing ,Dependency graph ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Network performance ,business ,Computer network - Abstract
The use of adaptive routing in a multicomputer interconnection network improves network performance by using all available paths and provides fault tolerance by allowing messages to be routed around failed channels and nodes. Two deadlock-free adaptive routing algorithms are described. Both algorithms allocate virtual channels using a count of the number of dimension reversals a packet has performed to eliminate cycles in resource dependency graphs. The static algorithm eliminates cycles in the network channel dependency graph. The dynamic algorithm improves virtual channel utilization by permitting dependency cycles and instead eliminating cycles in the packet wait-for graph. It is proved that these algorithms are deadlock-free. Experimental measurements of their performance are presented.
- Published
- 1993
- Full Text
- View/download PDF
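The dimension-reversal bookkeeping described in entry 41 can be sketched as follows. The count increments whenever a packet adaptively routes in a dimension lower than one it has already traversed, and in the static scheme it indexes the virtual-channel class. Everything else, including channel selection and the dynamic wait-for-graph variant, is omitted, and the helper name is illustrative:

```python
def update_dr(chosen_dim, highest_dim, dr_count):
    # A "dimension reversal" happens when the packet adaptively routes
    # in a dimension lower than one it has already routed in.
    if chosen_dim < highest_dim:
        dr_count += 1
    return max(highest_dim, chosen_dim), dr_count

# A packet that routes dimensions 0 -> 2 -> 1 performs one reversal;
# in the static scheme, its virtual-channel class is its DR count.
hi, dr = -1, 0
for d in (0, 2, 1):
    hi, dr = update_dr(d, hi, dr)
print(dr)  # 1
```

Bounding the DR count (and hence the number of virtual-channel classes) is what breaks cycles in the resource dependency graph.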
42. A fast translation method for paging on top of segmentation
- Author
-
William J. Dally
- Subjects
Computer science ,Translation lookaside buffer ,Parallel computing ,Theoretical Computer Science ,Physical address ,Memory management ,Computational Theory and Mathematics ,Virtual address space ,Hardware and Architecture ,Virtual memory ,Paging ,Segmentation ,Algorithm design ,Software - Abstract
A description is presented of a fast, one-step translation method that implements paging on top of segmentation. This method translates a virtual address into a physical address, performing both the segmentation and paging translations, with a single TLB (translation lookaside buffer) read and a short add. Previous methods performed this translation in two steps and required two TLB reads and a long add. Using the fast method, the fine-grain protection and relocation of segmentation combined with paging can be provided with delay and complexity comparable to paging-only systems. This method allows small segments, particularly important in object-oriented programming systems, to be managed efficiently.
- Published
- 1992
- Full Text
- View/download PDF
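The one-step translation of entry 42 can be illustrated with a combined TLB whose entries fold the segment relocation into the frame base at fill time, so a hit costs one lookup plus a short add of the page offset. Field widths, the dictionary-as-TLB, and the example values are assumptions of this sketch, not the paper's design:

```python
PAGE_BITS = 12                       # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_BITS) - 1

# Combined TLB: (segment, virtual page number) -> physical frame base.
# The segment relocation is folded into the frame base when the entry
# is filled, so a hit needs no second lookup and no long add.
tlb = {(3, 0x42): 0x9A000}           # hypothetical example entry

def translate(segment, vaddr):
    vpn, offset = vaddr >> PAGE_BITS, vaddr & PAGE_MASK
    frame_base = tlb[(segment, vpn)] # single TLB read
    return frame_base + offset       # short add: offset < page size

print(hex(translate(3, 0x42ABC)))    # 0x9aabc
```

The two-step scheme this replaces would first add the full segment base (a long add) and then look the result up in a separate page-table TLB.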
43. Express cubes: improving the performance of k-ary n-cube interconnection networks
- Author
-
William J. Dally
- Subjects
Interconnection ,Packet switching ,Computational Theory and Mathematics ,Logarithm ,Hardware and Architecture ,Computer science ,Mesh networking ,Locality ,Parallel computing ,Cube ,Telecommunications network ,Software ,Theoretical Computer Science - Abstract
The author discusses express cubes, k-ary n-cube interconnection networks augmented by express channels that provide a short path for nonlocal messages. An express cube combines the logarithmic diameter of a multistage network with the wire-efficiency and ability to exploit locality of a low-dimensional mesh network. The insertion of express channels reduces the network diameter and thus the distance component of network latency. Wire length is increased, allowing networks to operate with latencies that approach the physical speed-of-light limitation rather than being limited by node delays. Express channels increase wire bisection in a manner that allows the bisection to be controlled independently of the choice of radix, dimension, and channel width. By increasing wire bisection to saturate the available wiring media, throughput can be substantially increased. With an express cube both latency and throughput are wire-limited and within a small factor of the physical limit on performance.
- Published
- 1991
- Full Text
- View/download PDF
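The hop-count savings behind entry 43's express channels can be approximated in one dimension: a distance-d trip takes roughly ⌊d/i⌋ express hops plus d mod i local hops when express channels skip i nodes. This is a simplification that ignores interchange placement and the paper's hierarchical variants:

```python
def hops(d, i):
    # ~d // i express hops, then d % i local hops to finish the trip;
    # i = 1 models a plain network with no express channels.
    return d // i + d % i

print(hops(60, 1), hops(60, 8))  # 60 11
```

The distance component of latency thus drops from O(d) toward O(d/i), which is how express channels push latency toward the wire (speed-of-light) limit rather than the node-delay limit.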
44. Performance analysis of k-ary n-cube interconnection networks
- Author
-
William J. Dally
- Subjects
Interconnection ,Computational Theory and Mathematics ,Hardware and Architecture ,Computer science ,Throughput ,Torus ,Parallel computing ,Deterministic routing ,Topology ,Telecommunications network ,Software ,Theoretical Computer Science - Abstract
VLSI communication networks are wire-limited, i.e. the cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. Communication networks of varying dimensions are analyzed under the assumption of constant wire bisection. Expressions for the latency, average-case throughput, and hot-spot throughput of k-ary n-cube networks with constant bisection that agree closely with experimental measurements are derived. It is shown that low-dimensional networks (e.g. tori) have lower latency and higher hot-spot throughput than high-dimensional networks (e.g. binary n-cubes) with the same bisection width.
- Published
- 1990
- Full Text
- View/download PDF
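Entry 44's headline result, that low-dimensional networks win under constant bisection, falls out of even a back-of-envelope model: average hop count grows as n·k/4 while the bisection-normalized channel width shrinks as k/2, so serialization dominates at high dimension. The constants below are illustrative, not the paper's exact expressions:

```python
def latency(N, n, L, t_node=1.0):
    # Rough k-ary n-cube latency under constant wire bisection.
    k = round(N ** (1 / n))        # radix for N nodes in n dimensions
    avg_hops = n * k / 4           # average hop count in a torus
    width = k / 2                  # bisection-normalized channel width
    return avg_hops * t_node + L / width

N, L = 4096, 256                   # 4096 nodes, 256-bit messages
for n in (2, 4, 6, 12):
    print(n, latency(N, n, L))     # latency rises with dimension
```

With these assumptions, a 2-D torus delivers a 256-bit message in 40 node-delay units while the binary 12-cube needs 262, matching the abstract's conclusion that tori beat binary n-cubes at equal bisection width.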
45. A hardware logic simulation system
- Author
-
Prathima Agrawal and William J. Dally
- Subjects
Very-large-scale integration ,Amdahl's law ,Event (computing) ,Computer science ,business.industry ,Logic simulation ,System testing ,Integrated circuit ,Parallel computing ,Mars Exploration Program ,Computer Graphics and Computer-Aided Design ,law.invention ,symbols.namesake ,law ,symbols ,Algorithm design ,Electrical and Electronic Engineering ,business ,Software ,Computer hardware - Abstract
Multiple-delay logic simulation algorithms developed for the microprogrammable accelerator for rapid simulations (MARS) hardware simulator are discussed. In particular, timing-analysis algorithms for event cancellations, spike and race analyses, and oscillation detection are described. It is shown how a reconfigurable set of processors, called processing elements (PEs), can be arranged in a pipelined configuration to implement these algorithms. The algorithms operate within the partitioned-memory, message-passing architecture of MARS. Three logic simulators, two multiple-delay and one unit-delay, have been implemented using slightly different configurations of the available PEs. In these simulators, VLSI chips are modeled at the gate level with accurate rise/fall delays assigned to each logic primitive. On-chip memory blocks are modeled functionally and are integrated into the simulation framework. The MARS hardware simulator has been tested on many VLSI chip designs and has demonstrated a speed improvement of about 50 times that of an Amdahl 5870 system running a production-quality software simulator while retaining the accuracy of simulations.
- Published
- 1990
- Full Text
- View/download PDF
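The style of computation a simulator like MARS accelerates, event-driven evaluation with per-gate delays and dropping of no-change events, can be shown in miniature. This toy implements only the core event loop, not MARS's algorithms (no spike/race analysis, no rise/fall delays, no pipelined PEs):

```python
import heapq

def simulate(gates, stimuli, horizon=100):
    # gates: name -> (fn, input_net_names, delay); stimuli: (time, net, value)
    nets = {g: 0 for g in gates}
    for _, ins, _ in gates.values():
        for i in ins:
            nets.setdefault(i, 0)
    fanout = {n: [g for g, (_, ins, _) in gates.items() if n in ins]
              for n in nets}
    events = list(stimuli)
    heapq.heapify(events)                      # time-ordered event queue
    while events:
        t, net, v = heapq.heappop(events)
        if t > horizon or nets[net] == v:
            continue                           # no value change: event dropped
        nets[net] = v
        for g in fanout[net]:                  # schedule affected gates
            fn, ins, delay = gates[g]
            heapq.heappush(events, (t + delay, g, fn(*(nets[i] for i in ins))))
    return nets

AND = lambda a, b: a & b
netlist = {"y": (AND, ["a", "b"], 2)}          # one AND gate, 2-unit delay
final = simulate(netlist, [(0, "a", 1), (1, "b", 1)])
print(final["y"])  # 1
```

Note how the glitch scheduled when only `a` was high arrives carrying the same value `y` already holds and is silently dropped, the simplest form of the event cancellation the abstract mentions.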
46. Buffer and Delay Bounds in High Radix Interconnection Networks
- Author
-
Arjun Singh and William J. Dally
- Subjects
Queueing theory ,Interconnection ,Intelligent Network ,Hardware and Architecture ,business.industry ,Network packet ,Computer science ,Bounding overwatch ,Parallel computing ,Latency (engineering) ,business ,Buffer (optical fiber) ,Computer network - Abstract
We apply recent results in queueing theory to propose a methodology for bounding the buffer depth and packet delay in high radix interconnection networks. While most work in interconnection networks has been focused on the throughput and average latency in such systems, few studies have been done providing statistical guarantees for buffer depth and packet delays. These parameters are key in the design and performance of a network. We present a methodology for calculating such bounds for a practical high radix network and through extensive simulations show its effectiveness for both bursty and non-bursty injection traffic. Our results suggest that modest speedups and buffer depths enable reliable networks without flow control to be constructed.
- Published
- 2004
- Full Text
- View/download PDF
47. Globally Adaptive Load-Balanced Routing on Tori
- Author
-
Brian Towles, Arjun Singh, Amit Gupta, and William J. Dally
- Subjects
Zone Routing Protocol ,Static routing ,Dynamic Source Routing ,Computer science ,business.industry ,Distributed computing ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Enhanced Interior Gateway Routing Protocol ,Policy-based routing ,Link-state routing protocol ,Hardware and Architecture ,Multipath routing ,Hardware_INTEGRATEDCIRCUITS ,Destination-Sequenced Distance Vector routing ,business ,Computer network - Abstract
We introduce a new method of adaptive routing on k-ary n-cubes, Globally Adaptive Load-Balance (GAL). GAL makes global routing decisions using global information. In contrast, most previous adaptive routing algorithms make local routing decisions using local information (typically channel queue depth). GAL senses global congestion using segmented injection queues to decide the directions to route in each dimension. It further load balances the network by routing in the selected directions adaptively. Using global information, GAL achieves the performance (latency and throughput) of minimal adaptive routing on benign traffic patterns and performs as well as the best obliviously load-balanced routing algorithm (GOAL) on adversarial traffic.
- Published
- 2004
- Full Text
- View/download PDF
48. Migration in Single Chip Multiprocessors
- Author
-
Kelly A. Shaw and William J. Dally
- Subjects
Reduction (complexity) ,Single chip ,Resource (project management) ,Ideal (set theory) ,Hardware and Architecture ,Computer science ,Locality ,Parallel computing - Abstract
Global communication costs in future single-chip multiprocessors will increase linearly with distance. In this paper, we revisit the issues of locality and load balance in order to take advantage of these new costs. We present a technique which simultaneously migrates data and threads based on vectors specifying locality and resource usage. This technique improves performance on applications with distinguishable locality and imbalanced resource usage. 64% of the ideal reduction in execution time was achieved on an application with these traits, while no improvement was obtained on a balanced application with little locality.
- Published
- 2002
- Full Text
- View/download PDF
49. MARS: A Multiprocessor-Based Programmable Accelerator
- Author
-
R. Tutundjian, Prathima Agrawal, William J. Dally, Anjur Sundaresan Krishnakumar, H. V. Jagadish, and W. C. Fischer
- Subjects
Very-large-scale integration ,Flexibility (engineering) ,business.industry ,Computer science ,Logic simulation ,Multiprocessing ,Mars Exploration Program ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Telecommunications network ,Acceleration ,Hardware and Architecture ,Embedded system ,Hardware acceleration ,Electrical and Electronic Engineering ,business ,Software - Abstract
MARS, short for microprogrammable accelerator for rapid simulations, is a multiprocessor-based hardware accelerator that can efficiently implement a wide range of computationally complex algorithms. In addition to accelerating many graph-related problem solutions, MARS is ideally suited for performing event-driven simulations of VLSI circuits. Its highly pipelined and parallel architecture yields a performance comparable to that of existing special-purpose hardware simulators. MARS has the added advantage of flexibility because its VLSI processors are custom-designed to be microprogrammable and reconfigurable. When programmed as a logic simulator, MARS should be able to achieve 1 million gate evaluations per second.
- Published
- 1987
- Full Text
- View/download PDF
50. The torus routing chip
- Author
-
William J. Dally and Charles L. Seitz
- Subjects
Very-large-scale integration ,Dynamic Source Routing ,Static routing ,Computer Networks and Communications ,business.industry ,Computer science ,Chip ,Theoretical Computer Science ,Computational Theory and Mathematics ,Link-state routing protocol ,Intel iPSC ,Hardware and Architecture ,Embedded system ,Hardware_INTEGRATEDCIRCUITS ,Routing (electronic design automation) ,business ,Wormhole switching - Abstract
The torus routing chip (TRC) is a self-timed chip that performs deadlock-free cut-through routing in k-ary n-cube multiprocessor interconnection networks using a new method of deadlock avoidance called virtual channels. A prototype TRC with byte-wide self-timed communication channels achieved on first silicon a throughput of 64 Mbit/s in each dimension, about an order of magnitude better performance than the communication networks used by machines such as the Caltech Cosmic Cube or Intel iPSC. The latency of the cut-through routing of only 150 ns per routing step largely eliminates message locality considerations in the concurrent programs for such machines. The design and testing of the TRC as a self-timed chip was no more difficult than it would have been for a synchronous chip.
- Published
- 1986
- Full Text
- View/download PDF
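The virtual-channel idea from entry 50 is easiest to see on a single ring: split each physical channel into two virtual channels and switch classes at a fixed "dateline", so neither class contains a cycle. The sketch below shows the assignment for a unidirectional ring; the TRC's scheme for k-ary n-cubes generalizes this per dimension, and the exact channel labeling here is illustrative:

```python
def assign_vcs(path, k):
    # Use VC 0 until the packet crosses the wrap-around link
    # (node k-1 -> node 0), and VC 1 afterwards: neither VC class
    # then contains a cycle, so cut-through routing cannot deadlock.
    vcs, vc = [], 0
    for cur, nxt in zip(path, path[1:]):
        if cur == k - 1 and nxt == 0:
            vc = 1                   # crossed the "dateline"
        vcs.append(vc)
    return vcs

print(assign_vcs([6, 7, 0, 1], k=8))  # [0, 1, 1]
```

Because a packet can only move from the VC-0 class to the VC-1 class and never back, the channel dependency graph over virtual channels is acyclic even though the physical ring is a cycle.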