47 results on "William J. Dally"
Search Results
2. A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Quantization in 5 nm
- Author
Ben Keller, Rangharajan Venkatesan, Steve Dai, Stephen G. Tell, Brian Zimmer, Charbel Sakr, William J. Dally, C. Thomas Gray, and Brucek Khailany
- Subjects
Electrical and Electronic Engineering - Published
- 2023
- Full Text
- View/download PDF
3. Evolution of the Graphics Processing Unit (GPU)
- Author
Stephen W. Keckler, David B. Kirk, and William J. Dally
- Subjects
Vertex (computer graphics), Fragment (computer graphics), Computer science, Graphics processing unit, Frame rate, High memory, Hardware and Architecture, Computer graphics (images), Smart camera, Electrical and Electronic Engineering, Graphics, Shader, Software - Abstract
Graphics processing units (GPUs) power today’s fastest supercomputers, are the dominant platform for deep learning, and provide the intelligence for devices ranging from self-driving cars to robots and smart cameras. They also generate compelling photorealistic images at real-time frame rates. GPUs have evolved by adding features to support new use cases. NVIDIA’s GeForce 256, the first GPU, was a dedicated processor for real-time graphics, an application that demands large amounts of floating-point arithmetic for vertex and fragment shading computations and high memory bandwidth. As real-time graphics advanced, GPUs became programmable. The combination of programmability and floating-point performance made GPUs attractive for running scientific applications. Scientists found ways to use early programmable GPUs by casting their calculations as vertex and fragment shaders. GPUs evolved to meet the needs of scientific users by adding hardware for simpler programming, double-precision floating-point arithmetic, and resilience.
- Published
- 2021
- Full Text
- View/download PDF
4. Accelerating Chip Design With Machine Learning
- Author
Steve Dai, Ben Keller, William J. Dally, Brucek Khailany, Rangharajan Venkatesan, Alicia Klinefelter, Robert M. Kirby, Saad Godil, Yanqing Zhang, Haoxing Ren, and Bryan Catanzaro
- Subjects
Very-large-scale integration, Artificial neural network, Design space exploration, Computer science, Integrated circuit design, Machine learning, Convolutional neural network, Workflow, Hardware and Architecture, Logic gate, Graph (abstract data type), Artificial intelligence, Electrical and Electronic Engineering, Design methods, Software - Abstract
Recent advancements in machine learning provide an opportunity to transform chip design workflows. We review recent research applying techniques such as deep convolutional neural networks and graph-based neural networks in the areas of automatic design space exploration, power analysis, VLSI physical design, and analog design. We also present a future vision of an AI-assisted automated chip design workflow to aid designer productivity and automate optimization tasks.
- Published
- 2020
- Full Text
- View/download PDF
5. A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm
- Author
Joel Emer, Matthew Fojtik, C. Thomas Gray, Ben Keller, Stephen G. Tell, Priyanka Raina, Stephen W. Keckler, Alicia Klinefelter, William J. Dally, Brucek Khailany, Brian Zimmer, Jason Clemons, Rangharajan Venkatesan, Nan Jiang, Yanqing Zhang, Nathaniel Pinckney, and Yakun Sophia Shao
- Subjects
Computer science, Multi-chip module, Bandwidth (signal processing), Scalability, Mesh networking, Inference, System on a chip, Electrical and Electronic Engineering, Chip, Computer hardware, Efficient energy use - Abstract
Custom accelerators improve the energy efficiency, area efficiency, and performance of deep neural network (DNN) inference. This article presents a scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic chip are optimal for specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. Communication energy is minimized with large on-chip distributed weight storage and a hierarchical network-on-chip and network-on-package, and inference energy is minimized through extensive data reuse. The 16-nm prototype achieves 1.29-TOPS/mm² area efficiency, 0.11 pJ/op (9.5 TOPS/W) energy efficiency, 4.01-TOPS peak performance for a one-chip system, and 127.8 peak TOPS and 1903 images/s ResNet-50 batch-1 inference for a 36-chip system.
- Published
- 2020
- Full Text
- View/download PDF
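As a quick aside, the headline efficiency figures in this abstract are mutually consistent, since pJ/op and TOPS/W are reciprocals of each other up to unit scaling. A minimal check (our own arithmetic, not from the paper):

```python
# Unit-conversion check relating the abstract's 9.5 TOPS/W and 0.11 pJ/op
# figures (our own arithmetic, not from the paper).

def pj_per_op(tops_per_watt: float) -> float:
    """Convert an efficiency in TOPS/W to energy per operation in pJ."""
    ops_per_joule = tops_per_watt * 1e12   # 1 TOPS/W = 1e12 ops per joule
    joules_per_op = 1.0 / ops_per_joule
    return joules_per_op * 1e12            # joules -> picojoules

print(round(pj_per_op(9.5), 2))            # prints 0.11
```

Running it confirms that 9.5 TOPS/W rounds to the quoted 0.11 pJ/op.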
6. Energy Efficient On-Demand Dynamic Branch Prediction Models
- Author
Ehsan Atoofian, Amirali Baniasadi, Milad Mohammadi, Tor M. Aamodt, William J. Dally, and Song Han
- Subjects
Computer science, Fetch, Parallel computing, Supercomputer, Branch predictor, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Compiler, Cache, Software, Integer (computer science), Efficient energy use - Abstract
The branch predictor unit (BPU) is among the main energy-consuming components in out-of-order (OoO) processors. For integer applications, we find 16 percent of the processor energy is consumed by the BPU. The BPU is accessed in parallel with the instruction cache before it is known whether a fetch group contains control instructions. We find 85 percent of BPU lookups are done for non-branch operations, and of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate two variants of a branch prediction model that combines dynamic and static branch prediction to achieve energy improvements for power-constrained applications. These models, named on-demand branch prediction (ODBP) and path-based on-demand branch prediction (ODBP-PATH), are two novel prediction techniques that eliminate unnecessary BPU lookups using compiler-generated hints to identify instructions that can be more accurately predicted statically. ODBP-PATH is an implementation of ODBP that combines static and dynamic branch prediction based on the program path of execution. For a 4-wide OoO processor, ODBP-PATH delivers an 11 percent average energy-delay (ED) product improvement and a 9 percent average core energy saving on the SPEC Int 2006 benchmarks.
- Published
- 2020
- Full Text
- View/download PDF
7. Darwin: A Genomics Coprocessor
- Author
William J. Dally, Gill Bejerano, and Yatish Turakhia
- Subjects
Coprocessor, Speedup, Computer science, Molecular biophysics, Sequence assembly, Genomics, Parallel computing, Orders of magnitude (bit rate), Hardware and Architecture, Darwin (ADL), Human genome, Electrical and Electronic Engineering, Software - Abstract
Long-read sequencing is promising because it reveals the full spectrum of mutations in the human genome and produces more contiguous de novo assemblies. But the high error rate of long reads imposes a computational barrier to genome assembly. Darwin, a specialized coprocessor that provides orders-of-magnitude speedup over conventional processors in long-read assembly, can eliminate this barrier.
- Published
- 2019
- Full Text
- View/download PDF
8. A 1.17-pJ/b, 25-Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication Using a Process- and Temperature-Adaptive Voltage Regulator
- Author
William J. Dally, C. Thomas Gray, John Wilson, Sudhir S. Kudva, John W. Poulton, Wenxu Zhao, Nikola Nedovic, Stephen G. Tell, Xi Chen, Walker J. Turner, Sunil Sudhakaran, Sanquan Song, and Brian Zimmer
- Subjects
Frequency response, Serial communication, Computer science, Transmitter, Electrical engineering, Voltage regulator, Phase-locked loop, CMOS, Electrical and Electronic Engineering, Transceiver, Jitter - Abstract
This paper describes a short-reach serial link to connect chips mounted on the same package or on neighboring packages on a printed circuit board (PCB). The link employs an energy-efficient, single-ended ground-referenced signaling scheme. Implemented in 16-nm FinFET CMOS technology, the link operates at a data rate of 25 Gb/s/pin with 1.17-pJ/bit energy efficiency and uses a simple but robust matched-delay clock forwarding scheme that cancels most sources of jitter. The modest frequency-dependent attenuation of short-reach links is compensated using an analog equalizer in the transmitter. The receiver includes active-inductor peaking in the input amplifier to improve overall receiver frequency response. The link employs a novel power supply regulation scheme at both ends that uses a PLL ring-oscillator supply voltage as a reference to flatten circuit speed and reduce power consumption variation across PVT. The link can be calibrated once at an arbitrary voltage and temperature, then track VT variation without the need for periodic re-calibration. The link operates over a 10-mm-long on-package channel with −4 dB of attenuation with 0.77-UI eye opening at a bit-error rate (BER) of 10⁻¹⁵. A package-to-package link with 54 mm of PCB and 26 mm of on-package trace with −8.5 dB of loss at Nyquist operates with 0.42 UI of eye opening at a BER of 10⁻¹⁵. Overall link die area is 686 µm × 565 µm, with the transceiver circuitry taking up 20% of the area. The transceiver’s on-chip regulator is supplied from an off-chip 950-mV supply, while the support logic operates on a separate 850-mV supply.
- Published
- 2019
- Full Text
- View/download PDF
9. Optimal Operation of a Plug-In Hybrid Vehicle
- Author
John Fox, Nicholas Moehle, Jason A. Platt, and William J. Dally
- Subjects
Computer Networks and Communications, Computer science, Aerospace Engineering, Grid, Automotive engineering, Nonlinear system, Control theory, Automotive Engineering, Electric vehicle, Convex optimization, Fuel efficiency, Petroleum, Resource management, Electrical and Electronic Engineering, Convex function, Hybrid vehicle - Abstract
We present a convex optimization control method that has been shown in simulations to increase the fuel efficiency of a plug-in hybrid electric vehicle by over 10%. Using information on energy demand and energy use profiles, the problem is formulated to preferentially use battery resources sourced from the grid over petroleum resources. We pose the general nonlinear optimal resource management problem over a predetermined route as a convex optimization problem using a reduced model of the vehicle. This problem is computationally efficient enough to be optimized “on the fly” on the on-board vehicle computer and is thus able to adapt to changing vehicle conditions in real time. Using this reduced model to generate control inputs for the detailed vehicle simulator Autonomie, we record efficiency gains of over 10% compared to the industry-standard charge-depleting/charge-sustaining controller over synthetic mixed urban-suburban routes.
- Published
- 2018
- Full Text
- View/download PDF
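To make the idea concrete, here is a toy sketch (our own illustration with made-up numbers, not the authors' vehicle model or solver): when per-segment fuel cost is modeled linearly, preferring grid-charged battery energy on the most fuel-expensive route segments is optimal, and the convex program collapses to a greedy allocation pass.

```python
# Toy version of the paper's idea (hypothetical numbers and a deliberately
# simplified linear fuel-cost model, not the authors' formulation): over a
# known route, spend grid-charged battery energy where engine fuel would
# cost the most.

def plan_battery_use(demand, fuel_cost, battery_budget, max_assist):
    """Allocate battery energy per segment to minimize total fuel cost.

    demand[t]      -- energy needed on segment t (kWh)
    fuel_cost[t]   -- fuel cost per kWh if the engine supplies segment t
    battery_budget -- total grid-charged battery energy available (kWh)
    max_assist     -- cap on battery energy usable on any one segment (kWh)
    """
    battery = [0.0] * len(demand)
    remaining = battery_budget
    # Spend battery on the most fuel-expensive segments first.
    for t in sorted(range(len(demand)), key=lambda t: -fuel_cost[t]):
        use = min(demand[t], max_assist, remaining)
        battery[t] = use
        remaining -= use
    fuel = sum(c * (d - b) for c, d, b in zip(fuel_cost, demand, battery))
    return battery, fuel

demand = [2.0, 3.0, 1.0, 4.0]         # per-segment energy, kWh (made up)
fuel_cost = [0.30, 0.60, 0.45, 0.60]  # urban segments cost more fuel per kWh
battery, fuel = plan_battery_use(demand, fuel_cost,
                                 battery_budget=5.0, max_assist=3.0)
print(battery, fuel)                  # prints [0.0, 3.0, 0.0, 2.0] 2.25
```

The real controller instead solves a convex program over a detailed reduced vehicle model, which is what lets it adapt on the fly.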
10. A 28 nm 2 Mbit 6T SRAM With Highly Configurable Low-Voltage Write-Ability Assist Implementation and Capacitor-Based Sense-Amplifier Input Offset Compensation
- Author
Andreas J. Gotterba, Mahmut E. Sinangil, Jesse S. Wang, Matthew Fojtik, Jason Golbus, John W. Poulton, Brian Zimmer, Stephen G. Tell, C. Thomas Gray, William J. Dally, and Thomas Hastings Greer
- Subjects
Engineering, Offset (computer science), Input offset voltage, Sense amplifier, Amplifier, Capacitor, CMOS, Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Low voltage - Abstract
This paper presents a highly configurable low-voltage write-ability assist implementation along with a sense-amplifier offset reduction technique to improve SRAM read performance. The write-assist implementation combines negative bit-line (BL) and VDD-collapse schemes in an efficient way to maximize Vmin improvements while saving on the area and energy overhead of these assists. The relative delay and pulse width of the assist control signals are also configurable to allow tuning of assist strengths. The sense-amplifier offset compensation scheme uses capacitors to store and negate the threshold mismatch of the input transistors. A test chip fabricated in a 28 nm HP CMOS process demonstrates operation down to 0.5 V with write assists and a more than 10% reduction in word-line pulse width with the offset-compensated sense amplifiers.
- Published
- 2016
- Full Text
- View/download PDF
11. On-Chip Active Messages for Speed, Scalability, and Efficiency
- Author
R. Curtis Harting and William J. Dally
- Subjects
Computer science, Computational Theory and Mathematics, Shared memory, Hardware and Architecture, Signal Processing, Scalability, Operating system, Concurrent computing, Overhead (computing), Cache coherence, Energy (signal processing), Computer network - Abstract
This paper describes and quantifies the benefits of adding low-overhead active messages to many-core, cache-coherent chip multiprocessors. The active messages we analyze are user defined and trigger the atomic execution of a custom software handler at the destination. Programmers can use these active messages both to move data with less overhead than cache coherency and, more importantly, to explicitly send computation to data. Doing so greatly improves communication idioms such as shared-object modification, reductions, data walks, point-to-point communication, and all-to-all communication (11× speed, 4.8× energy). Active messages enhance program scalability: applications using them run 63 percent faster with 11 percent less energy on 256 cores. The relative benefits of active messages grow with larger numbers of cores.
- Published
- 2015
- Full Text
- View/download PDF
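The core idiom in this abstract can be sketched in a few lines (our own illustration, not the paper's hardware): instead of pulling shared data to the requesting core through cache coherence, a core sends the computation to the core that owns the data, where a user-defined handler runs atomically.

```python
import queue
import threading

# Toy sketch of the active-message idiom (our own illustration): a reduction
# is homed on one core, and other cores send it "add" handlers instead of
# bouncing the shared value between caches.

class Core:
    def __init__(self):
        self.inbox = queue.Queue()   # incoming active messages
        self.data = {}               # data "homed" on this core
        self._lock = threading.Lock()

    def send(self, handler, *args):
        """Deliver an active message naming a handler to run at this core."""
        self.inbox.put((handler, args))

    def run_pending(self):
        """Drain the inbox, executing each handler atomically."""
        while not self.inbox.empty():
            handler, args = self.inbox.get()
            with self._lock:         # handlers execute atomically here
                handler(self, *args)

def add_handler(core, key, delta):
    # Example handler: contribute to a reduction homed at `core`.
    core.data[key] = core.data.get(key, 0) + delta

owner = Core()
for _ in range(4):                   # e.g., four cores each contribute 10
    owner.send(add_handler, "sum", 10)
owner.run_pending()
print(owner.data["sum"])             # prints 40
```

The point of the idiom is that the data never moves; only small handler invocations do.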
12. On-Demand Dynamic Branch Prediction
- Author
Song Han, Tor M. Aamodt, William J. Dally, and Milad Mohammadi
- Subjects
Speedup, Computer science, Speculative execution, Thread (computing), Parallel computing, Branch predictor, Power budget, Branch table, Hardware and Architecture, Compiler, Cache - Abstract
In out-of-order (OoO) processors, speculative execution with high branch prediction accuracy is employed to achieve good single-thread performance. In these processors, the branch prediction unit (BPU) tables are accessed in parallel with the instruction cache before it is known whether a fetch group contains branch instructions. For integer applications, we find 85 percent of BPU lookups are done for non-branch operations, and of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate on-demand branch prediction (ODBP), a novel technique that uses compiler-generated hints to identify those instructions that can be more accurately predicted statically, eliminating unnecessary BPU lookups. We evaluate an implementation of ODBP that combines static and dynamic branch prediction. For a four-wide superscalar processor, ODBP delivers as much as 9 percent improvement in average energy-delay (ED) product, 7 percent average core energy saving, and 3 percent speedup. ODBP also enables the use of large BPUs for a given power budget.
- Published
- 2015
- Full Text
- View/download PDF
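The mechanism this abstract describes can be sketched as a small fetch-stage filter (our own illustration, not the authors' microarchitecture): compiler hints mark each instruction, and the fetch logic consults the dynamic branch predictor (BPU) only when a hint calls for a dynamic prediction.

```python
# Toy sketch of on-demand branch prediction (ODBP): hint categories and the
# fetch-stage decision of whether to power up the BPU (our own illustration).

NON_BRANCH, BIASED_TAKEN, BIASED_NOT_TAKEN, DYNAMIC = range(4)

def fetch(instructions, bpu_predict):
    """Return (predictions, bpu_lookups) for a stream of (pc, hint) pairs."""
    lookups = 0
    predictions = []
    for pc, hint in instructions:
        if hint == NON_BRANCH:
            predictions.append(None)      # no BPU access at all
        elif hint == BIASED_TAKEN:
            predictions.append(True)      # static prediction, no lookup
        elif hint == BIASED_NOT_TAKEN:
            predictions.append(False)     # static prediction, no lookup
        else:                             # DYNAMIC: consult the BPU
            lookups += 1
            predictions.append(bpu_predict(pc))
    return predictions, lookups

stream = [(0, NON_BRANCH), (4, NON_BRANCH), (8, BIASED_TAKEN), (12, DYNAMIC)]
preds, lookups = fetch(stream, bpu_predict=lambda pc: True)
print(lookups)                            # prints 1: three lookups avoided
```

In the sketch, only one of four instructions pays the BPU lookup energy, which is the effect the paper exploits.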
13. A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications
- Author
John W. Poulton, Thomas Hastings Greer, C. Thomas Gray, William J. Dally, John Eyles, Stephen G. Tell, Xi Chen, and John Wilson
- Subjects
Engineering, Serial communication, Electrical engineering, Integrated circuit design, Chip, Die (integrated circuit), CMOS, Low-power electronics, Charge pump, Electronic engineering, Electrical and Electronic Engineering, Ground plane - Abstract
High-speed signaling over high-density interconnect on organic package substrates or silicon interposers offers an attractive solution to the off-chip bandwidth limitation faced by modern digital systems. In this paper, we describe a signaling system co-designed with the interconnect to take advantage of the characteristics of this environment and enable a high-speed, low-area, low-power die-to-die link. Ground-Referenced Signaling (GRS) is a single-ended signaling system that eliminates the major problems traditionally associated with single-ended design by using the ground plane as the reference and signaling above and below ground. This design employs a novel charge pump driver that additionally eliminates the issue of simultaneous switching noise through data-independent current consumption. Silicon measurements from a test chip implementing two 16-lane links, with forwarded clocks, in a standard 28 nm process demonstrate 20 Gb/s operation at 0.54 pJ/bit over 4.5 mm organic substrate channels at a nominal 0.9 V power supply voltage. Timing margins at the receiver are >0.3 UI at a BER of 10⁻¹². We estimate a BER of 10⁻²⁵ at the eye center.
- Published
- 2013
- Full Text
- View/download PDF
14. Elastic Buffer Flow Control for On-Chip Networks
- Author
William J. Dally and George Michelogiannakis
- Subjects
Router, Flow control (data), Computer science, Throughput, Buffer (optical fiber), Theoretical Computer Science, Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Software, Computer network - Abstract
Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power on router buffers. In our past work, we developed elastic buffer (EB) flow control, which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks with highly efficient custom SRAM buffers, EB networks provide an up to 45 percent shorter cycle time, 16 percent more throughput per unit power, or 22 percent more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router that provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case, and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21 percent more throughput per unit power than VC networks and 12 percent more than EB networks.
- Published
- 2013
- Full Text
- View/download PDF
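The elastic-buffer mechanism in this abstract can be modeled with a few lines of ready/valid handshaking (our own illustration, not the paper's circuits): each channel pipeline register gains two storage slots and simple control, so the channel itself behaves as a distributed FIFO.

```python
from collections import deque

# Toy sketch of elastic buffer (EB) flow control: a chain of channel stages,
# each with the two storage locations the abstract describes, moving flits
# downstream wherever the ready/valid handshake allows (our own illustration).

class ElasticStage:
    CAPACITY = 2                       # the two storage locations per EB

    def __init__(self):
        self.slots = deque()

    @property
    def ready(self):                   # can accept a flit from upstream
        return len(self.slots) < self.CAPACITY

    @property
    def valid(self):                   # has a flit to offer downstream
        return len(self.slots) > 0

def advance(stages):
    """One cycle: flits move one stage downstream where the handshake allows."""
    for i in range(len(stages) - 2, -1, -1):   # downstream stages drain first
        up, down = stages[i], stages[i + 1]
        if up.valid and down.ready:
            down.slots.append(up.slots.popleft())

channel = [ElasticStage() for _ in range(3)]
channel[0].slots.append("flit0")
channel[0].slots.append("flit1")       # a stage absorbs two flits before stalling
advance(channel)                       # flit0 advances; flit1 waits behind it
print([list(s.slots) for s in channel])
```

Because each stage holds up to two flits, a downstream stall is absorbed locally instead of propagating a bubble, which is why no router input buffer is needed.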
15. GPUs and the Future of Parallel Computing
- Author
Michael Garland, D. Glasco, William J. Dally, Brucek Khailany, and Stephen W. Keckler
- Subjects
Coprocessor, Computer architecture, Parallel processing (DSP implementation), Hardware and Architecture, Computer science, End-user computing, Next-generation network, Graphics processing unit, Parallel computing, Electrical and Electronic Engineering, Software - Abstract
This article discusses the capabilities of state-of-the-art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.
- Published
- 2011
- Full Text
- View/download PDF
16. Evaluating Elastic Buffer and Wormhole Flow Control
- Author
Daniel U. Becker, William J. Dally, and George Michelogiannakis
- Subjects
Router, Flow control (data), Computer science, Throughput, Buffer (optical fiber), Theoretical Computer Science, Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Wormhole, Software, Computer network - Abstract
With the emergence of on-chip networks, router buffer power has become a primary concern. Elastic buffer (EB) flow control utilizes existing pipeline flip-flops in the channels to implement distributed FIFOs, eliminating the need for input buffers at the routers. EB routers have been shown to be more efficient than virtual channel routers, as they do not require input buffers or complex logic for managing virtual channels and tracking credits. Wormhole routers are more comparable in terms of complexity because they also lack virtual channels. This paper compares EB and wormhole routers and explores novel hybrid designs to more closely examine the effect of design simplicity and input buffer cost. Our results show that EB routers have up to 25 percent smaller cycle time compared to wormhole and hybrid routers. Moreover, EB flow control requires 10 percent less energy to transfer a single bit through a router and offers 3 percent more throughput per unit energy as well as 62 percent more throughput per unit area. The main contributor to these results is the cost and delay overhead of the input buffer.
- Published
- 2011
- Full Text
- View/download PDF
17. The GPU Computing Era
- Author
John R. Nickolls and William J. Dally
- Subjects
Ubiquitous computing, Coprocessor, Computer science, Graphics processing unit, Parallel computing, Supercomputer, CUDA, Hardware and Architecture, Concurrent computing, Electrical and Electronic Engineering, Graphics, General-purpose computing on graphics processing units, Massively parallel, Software - Abstract
GPU computing is at a tipping point, becoming more widely used in demanding consumer applications and high-performance computing. This article describes the rapid evolution of GPU architectures, from graphics processors to massively parallel many-core multiprocessors, along with recent developments in GPU computing architectures and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications.
- Published
- 2010
- Full Text
- View/download PDF
18. Operand Registers and Explicit Operand Forwarding
- Author
William J. Dally, James Balfour, and R.C. Harting
- Subjects
Memory hierarchy, Computer science, Pipeline (computing), Parallel computing, Operand, Hardware and Architecture, Code generation, Routing (electronic design automation), Fixed-point arithmetic, Operand forwarding, Efficient energy use - Abstract
Operand register files are small, inexpensive register files that are integrated with function units in the execute stage of the pipeline, effectively extending the pipeline operand registers into register files. Explicit operand forwarding lets software opportunistically orchestrate the routing of operands through the forwarding network to avoid writing ephemeral values to registers. Both mechanisms let software capture short-term reuse and locality close to the function units, improving energy efficiency by allowing a significant fraction of operands to be delivered from inexpensive registers that are integrated with the function units. An evaluation shows that capturing operand bandwidth close to the function units allows operand registers to reduce the energy consumed in the register files and forwarding network of an embedded processor by 61%, and allows explicit forwarding to reduce the energy consumed by 26%.
- Published
- 2009
- Full Text
- View/download PDF
19. Efficient Embedded Computing
- Author
R.C. Harting, D. Black-Shaffer, J. Chen, James Balfour, William J. Dally, V. Parikh, D. Sheffield, and Jongsoo Park
- Subjects
General Computer Science, Reduced instruction set computing, Computer architecture, Application-specific integrated circuit, Computer science, Embedded system, Energy (signal processing) - Abstract
Hardwired ASICs, which are 50× more efficient than programmable processors, sacrifice programmability to meet the efficiency requirements of demanding embedded systems. Programmable processors use energy mostly to supply instructions and data to the arithmetic units, and several techniques can reduce instruction- and data-supply energy costs. Using these techniques in the Stanford ELM processor closes the gap with ASICs to within 3×.
- Published
- 2008
- Full Text
- View/download PDF
20. Hierarchical Instruction Register Organization
- Author
James Balfour, Jongsoo Park, David Black-Schaffer, William J. Dally, and V. Parikh
- Subjects
Instruction set, Instruction register, Kernel (linear algebra), Computer architecture, Indirection, Computer-integrated manufacturing, Hardware and Architecture, Computer science, Very long instruction word, Overhead (computing), Baseline (configuration management) - Abstract
This paper analyzes a range of architectures for efficient delivery of VLIW instructions for embedded media kernels. The analysis takes an efficient filter cache as a baseline and examines the benefits of 1) removing the tag overhead, 2) distributing the storage, 3) adding indirection, 4) adding efficient NOP generation, and 5) sharing instruction memory. The result is a hierarchical instruction register organization that provides 56% energy and 40% area savings over an already efficient filter cache.
- Published
- 2008
- Full Text
- View/download PDF
21. An Energy-Efficient Processor Architecture for Embedded Systems
- Author
Jongsoo Park, James Balfour, David Black-Schaffer, William J. Dally, and V. Parikh
- Subjects
Instruction register, Instructions per cycle, Reduced instruction set computing, Computer science, Processor register, Transport triggered architecture, Microarchitecture, Addressing mode, Instruction set, Computer architecture, Hardware and Architecture, Embedded system - Abstract
We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23× greater than an embedded RISC processor.
- Published
- 2008
- Full Text
- View/download PDF
22. Flattened Butterfly Topology for On-Chip Networks
- Author
William J. Dally, John Kim, and James Balfour
- Subjects
Butterfly network, Multi-core processor, Computer science, Energy consumption, Parallel computing, Network topology, Topology, Network on a chip, Hardware and Architecture, Butterfly, Latency (engineering), Computer network - Abstract
With the trend toward an increasing number of cores in multicore processors, the on-chip network that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly offers lower latency and energy consumption than conventional on-chip topologies. In addition, by properly using bypass channels in the flattened butterfly network, non-minimal routing can be employed without increasing latency or energy consumption.
- Published
- 2007
- Full Text
- View/download PDF
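The diameter reduction this abstract claims is easy to see numerically (our own back-of-the-envelope comparison, not the paper's evaluation): in a 2-D flattened butterfly, each router connects to every other router in its row and in its column, so any destination is at most 2 hops away, versus 2(k−1) hops across a k×k mesh.

```python
# Hop-count comparison for 64 nodes: k x k mesh vs. 2-D flattened butterfly
# (our own illustration of the topology's diameter advantage).

def mesh_hops(src, dst, k):
    """Minimal router-to-router hops in a k x k mesh (Manhattan distance)."""
    (x0, y0), (x1, y1) = divmod(src, k), divmod(dst, k)
    return abs(x0 - x1) + abs(y0 - y1)

def flattened_butterfly_hops(src, dst, k):
    """Minimal hops in a 2-D flattened butterfly: 0, 1 (same row/col), or 2."""
    (x0, y0), (x1, y1) = divmod(src, k), divmod(dst, k)
    if src == dst:
        return 0
    return 1 if (x0 == x1 or y0 == y1) else 2

k = 8                                    # 64 on-chip nodes
mesh_diameter = max(mesh_hops(0, d, k) for d in range(k * k))
fbfly_diameter = max(flattened_butterfly_hops(0, d, k) for d in range(k * k))
print(mesh_diameter, fbfly_diameter)     # prints 14 2
```

The trade-off, as the abstract notes, is that the flattened butterfly needs high-radix routers: each router here has 2(k−1) = 14 network ports instead of a mesh router's 4.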
23. A 20-Gb/s 0.13-µm CMOS serial link transmitter using an LC-PLL to directly drive the output multiplexer
- Author
Mark Horowitz, Patrick Chiang, Yangjin Oh, William J. Dally, M.-J.E. Lee, and R. Senthinathan
- Subjects
Physics, Electronic oscillator, Electrical engineering, Sample and hold, Multiplexer, Frequency divider, Phase-locked loop, CMOS, Optical Carrier transmission rates, Electrical and Electronic Engineering, Jitter - Abstract
A 20-Gb/s transmitter is implemented in 0.13-µm CMOS technology. An on-die 10-GHz LC oscillator phase-locked loop (PLL) creates two sinusoidal 10-GHz complementary clock phases as well as eight 2.5-GHz interleaved feedback divider clock phases. After a 2²⁰−1 pseudorandom bit sequence (PRBS) generator creates eight 2.5-Gb/s data streams, the eight 2.5-GHz interleaved clocks 4:1 multiplex the eight 2.5-Gb/s data streams into two 10-Gb/s data streams. 10-GHz analog sample-and-hold circuits retime the two 10-Gb/s data streams to be in phase with the 10-GHz complementary clocks. Two-tap equalization of the 10-Gb/s data streams compensates for bandwidth rolloff of the 10-Gb/s data outputs at the 10-GHz analog latches. A final 20-Gb/s 2:1 output multiplexer, clocked by the complementary 10-GHz clock phases, creates 20-Gb/s data from the two retimed 10-Gb/s data streams. The LC-VCO is integrated with the output multiplexer and analog latches, resonating the load and eliminating the need for clock buffers, reducing power-supply-induced jitter and static phase mismatch. Power, active die area, and jitter (rms/pk-pk) are 165 mW, 650 µm × 350 µm, and 2.37 ps/15 ps, respectively.
- Published
- 2005
- Full Text
- View/download PDF
24. A 33-mW 8-Gb/s CMOS clock multiplier and CDR for highly integrated I/Os
- Author
R. Rathi, William J. Dally, R. Senthinathan, Hiok-Tiaq Ng, Trey Greer, M.-J.E. Lee, J. Edmondson, John W. Poulton, J. Tran, A. Nguyen, and Ramin Farjad-Rad
- Subjects
Physics ,Offset (computer science) ,Vernier scale ,business.industry ,Frequency multiplier ,Electrical engineering ,Voltage regulator ,Multiplexer ,law.invention ,Injection locking ,CMOS ,law ,Electronic engineering ,Electrical and Electronic Engineering ,business ,Clock recovery ,Jitter ,CPU multiplier - Abstract
A 0.622-8-Gb/s clock and data recovery (CDR) circuit using injection locking for jitter suppression and phase interpolation in high-bandwidth system-on-chip solutions is described. A slave injection-locked oscillator (SILO) is locked to a tracking-aperture multiplying DLL (TA-MDLL) via a coarse phase-selection multiplexer (MUX). For the fine timing vernier, an interpolator DAC controls the injection strength of the MUX output into the SILO. This 1.2-V 0.13-µm CMOS CDR consumes 33 mW at 8 Gb/s. Die area including the voltage regulator is 0.08 mm². Recovered clock jitter is 49 ps pk-pk at a 200-ppm bit-rate offset.
- Published
- 2004
- Full Text
- View/download PDF
25. A second-order semidigital clock recovery circuit based on injection locking
- Author
-
Hiok-Tiaq Ng, Trey Greer, William J. Dally, Ramin Farjad-Rad, John W. Poulton, J. Edmondson, R. Senthinathan, R. Rathi, and M.-J.E. Lee
- Subjects
Phase-locked loop ,Injection locking ,Physics ,Synchronous circuit ,CMOS ,Clock domain crossing ,Electronic engineering ,Electrical and Electronic Engineering ,Clock skew ,Digital clock ,Jitter - Abstract
A compact (1 mm × 160 µm) and low-power (80-mW) 0.18-µm CMOS 3.125-Gb/s clock and data recovery circuit is described. The circuit utilizes injection locking to filter out high-frequency reference-clock jitter and multiplying delay-locked loop duty-cycle distortions. The injection-locked slave oscillator can have its output clocks interpolated by current-steering the injecting clocks. A second-order clock and data recovery loop is introduced to perform the interpolation and is capable of tracking frequency offsets while exhibiting low phase wander.
- Published
- 2003
- Full Text
- View/download PDF
26. Programmable stream processors
- Author
-
Ujval J. Kapasi, Peter Mattson, Scott Rixner, Jung Ho Ahn, William J. Dally, John D. Owens, and Brucek Khailany
- Subjects
Stream processing ,Flexibility (engineering) ,Current (stream) ,Concurrency control ,Configuration management ,General Computer Science ,Computer architecture ,Computer science ,Concurrency ,Locality - Abstract
The demand for flexibility in media processing motivates the use of programmable processors. Stream processing bridges the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications. The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. The performance of the Imagine stream processor on these media applications is given.
- Published
- 2003
- Full Text
- View/download PDF
27. Jitter transfer characteristics of delay-locked loops - theories and design techniques
- Author
-
Hiok-Tiaq Ng, Trey Greer, M.-J.E. Lee, R. Senthinathan, Ramin Farjad-Rad, William J. Dally, and John W. Poulton
- Subjects
Computer science ,Control theory ,Frequency domain ,Bandwidth (signal processing) ,Electronic engineering ,Electrical and Electronic Engineering ,Chip ,Jitter ,Electronic circuit - Abstract
This paper presents analyses and experimental results on the jitter transfer of delay-locked loops (DLLs). Through a z-domain model, we show that in a widely used DLL configuration, jitter peaking always exists and high-frequency jitter does not get attenuated as previous analyses suggest. This is true even in a first-order DLL and an overdamped second-order DLL. The amount of jitter peaking is shown to trade off with the tracking bandwidth and, therefore, the acquisition time. Techniques to reduce jitter amplification by loop filtering and phase filtering are discussed. Measurements from a prototype chip incorporating the discussed techniques confirm the prediction of the analytical model. In environments where the reference clock is noisy or where multiple timing circuits are cascaded, this jitter amplification effect should be carefully evaluated.
- Published
- 2003
- Full Text
- View/download PDF
28. A low-power multiplying DLL for low-jitter multigigahertz clock generation in highly integrated digital chips
- Author
-
Hiok-Tiaq Ng, William J. Dally, R. Rathi, John W. Poulton, M.-J.E. Lee, Ramin Farjad-Rad, and R. Senthinathan
- Subjects
Engineering ,business.industry ,Hardware_PERFORMANCEANDRELIABILITY ,Noise (electronics) ,Phase-locked loop ,CMOS ,Filter (video) ,Low-power electronics ,Phase noise ,Hardware_INTEGRATEDCIRCUITS ,Electronic engineering ,Electrical and Electronic Engineering ,business ,Jitter ,CPU multiplier - Abstract
A multiplying delay-locked loop (MDLL) for high-speed on-chip clock generation that overcomes the drawbacks of phase-locked loops (PLLs) such as jitter accumulation, high sensitivity to supply, and substrate noise is described. The MDLL design removes such drawbacks while maintaining the advantages of a PLL for multirate frequency multiplication. This design also uses a supply regulator and filter to further reduce on-chip jitter generation. The MDLL, implemented in 0.18-µm CMOS technology, occupies a total active area of 0.05 mm² and has a speed range of 200 MHz to 2 GHz with selectable multiplication ratios of M=4, 5, 8, 10. The complete synthesizer, including the output clock buffers, dissipates 12 mW from a 1.8-V supply at 2.0 GHz. This MDLL architecture is used as a clock multiplier integrated on a single chip for a 72×72 STS-1 grooming switch and has a jitter of 1.73 ps (rms) and 13.1 ps (pk-pk).
- Published
- 2002
- Full Text
- View/download PDF
29. A delay model for router microarchitectures
- Author
-
William J. Dally and Li-Shiuan Peh
- Subjects
Link state packet ,Router ,Flow control (data) ,business.industry ,Computer science ,computer.internet_protocol ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Virtual Router Redundancy Protocol ,Throughput ,Parallel computing ,Metrics ,Core router ,Hardware and Architecture ,Bridge router ,One-armed router ,Hardware_INTEGRATEDCIRCUITS ,Electrical and Electronic Engineering ,business ,computer ,Software ,Computer network - Abstract
This article introduces a router delay model that takes into account the pipelined nature of contemporary routers and proposes pipelines matched to the specific flow control method employed. Given the type of flow control and router parameters, the model returns router latency in technology-independent units and the number of pipeline stages as a function of cycle time. We apply this model to derive realistic pipelines for wormhole and virtual-channel routers and compare their performance. Contrary to the conclusions of previous models, our results show that the latency of a virtual-channel router does not increase as we scale the number of virtual channels up to 8 per physical channel. Our simulation results also show that a virtual-channel router gains throughput of up to 40% over a wormhole router.
- Published
- 2001
- Full Text
- View/download PDF
30. Imagine: media processing with streams
- Author
-
Peter Mattson, Brian Towles, Ujval J. Kapasi, John D. Owens, Brucek Khailany, A. Chang, William J. Dally, Scott Rixner, and J. Namkoong
- Subjects
Computer science ,business.industry ,Image processing ,Parallel computing ,STREAMS ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,FLOPS ,Microarchitecture ,Stream processing ,Hardware and Architecture ,Encoding (memory) ,Media processor ,Electrical and Electronic Engineering ,business ,Software ,Computer hardware - Abstract
The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.
- Published
- 2001
- Full Text
- View/download PDF
31. Low-power area-efficient high-speed I/O circuit techniques
- Author
-
M.-J.E. Lee, William J. Dally, and Patrick Chiang
- Subjects
Very-large-scale integration ,Engineering ,business.industry ,Amplifier ,Capacitive sensing ,Transmitter ,Hardware_PERFORMANCEANDRELIABILITY ,Integrated circuit design ,CMOS ,Low-power electronics ,Hardware_INTEGRATEDCIRCUITS ,Electronic engineering ,Inverter ,Electrical and Electronic Engineering ,business ,Hardware_LOGICDESIGN - Abstract
We present a 4-Gb/s I/O circuit that fits in 0.1 mm² of die area, dissipates 90 mW of power, and operates over 1 m of 7-mil 0.5-oz PCB trace in a 0.25-µm CMOS technology. Swing reduction is used in an input-multiplexed transmitter to provide most of the speed advantage of an output-multiplexed architecture with significantly lower power and area. A delay-locked loop (DLL) using a supply-regulated inverter delay line gives very low jitter at a fraction of the power of a source-coupled delay line-based DLL. Receiver capacitive offset trimming decreases the minimum resolvable swing to 8 mV, greatly reducing the transmission energy without affecting the performance of the receive amplifier. These circuit techniques enable a high level of I/O integration to relieve the pin bandwidth bottleneck of modern VLSI chips.
- Published
- 2000
- Full Text
- View/download PDF
32. Concurrent event handling through multithreading
- Author
-
William J. Dally, W.S.L.S. Chatterjee, Andrew Chang, and Stephen W. Keckler
- Subjects
Speedup ,Process state ,Computer science ,business.industry ,Exception handling ,Thread (computing) ,computer.software_genre ,Theoretical Computer Science ,law.invention ,Super-threading ,Microprocessor ,Computational Theory and Mathematics ,Hardware and Architecture ,law ,Embedded system ,Multithreading ,Operating system ,Software system ,business ,computer ,Software ,Context switch - Abstract
Exceptions have traditionally been used to handle infrequently occurring and unpredictable events during normal program execution. Current trends in microprocessor and operating systems design continue to increase the cost of event handling. Because of the deep pipelines and wide out-of-order superscalar architectures of contemporary microprocessors, an event may need to nullify a large number of in-flight instructions. Large register files require existing software systems to save and restore a substantial amount of process state before executing an exception handler. At the same time, processors are executing in environments that supply higher event frequencies and demand higher performance. We have developed an alternative architecture, "concurrent event handling", that incorporates multithreading into event handling architectures. Instead of handling the event in the faulting thread's architectural and pipeline registers, the fault handler is forked into its own thread slot and executes concurrently with the faulting thread. Microbenchmark programs show a factor-of-3 speedup for concurrent event handling over a traditional architecture on code that takes frequent exceptions. We also demonstrate substantial speedups on two event-based applications. Concurrent event handling is implemented in MIT's MAP (Multi-ALU Processor) chip.
- Published
- 1999
- Full Text
- View/download PDF
33. An efficient, protected message interface
- Author
-
Nicholas P. Carter, Whay S. Lee, William J. Dally, Andrew Chang, and Stephen W. Keckler
- Subjects
General Computer Science ,Shared memory ,Computer science ,Multithreading ,Distributed computing ,Interface (computing) ,Message passing ,Context (computing) ,Multiprocessing ,Message broker ,Network interface ,Interrupt - Abstract
With increasing demand for computing power, multiprocessing computers will become more common in the future. In these systems, the growing discrepancy between processor and memory technologies will cause tightly integrated message interfaces to be essential for achieving the necessary efficiency, which is especially important in light of the growing interest in software-distributed, shared memory systems. The authors conduct a performance evaluation of several primitive messaging mechanisms: dispatch mechanisms (how the processor reacts to message arrivals), memory-mapped versus register-mapped interfaces, and streaming versus buffered interfaces. They baseline these results against the MIT M-Machine and its tightly integrated message interfaces. They find that a message can be dispatched up to 18 times faster by reserving a hardware thread context for message reception instead of an interrupt-driven interface. They also find that the mapping decision is important, with integrated register-mapped interfaces as much as 3.5 times more efficient than conventional systems. To meet the challenges and exploit the opportunities presented by emerging multithreaded processor architectures, low-overhead mechanisms for protection against message corruption, interception, and starvation must be integral to the message system design. The authors hope that the simple messaging mechanisms presented can help provide a solution to these challenges.
- Published
- 1998
- Full Text
- View/download PDF
34. Transmitter equalization for 4-Gbps signaling
- Author
-
John W. Poulton and William J. Dally
- Subjects
business.industry ,Computer science ,Circuit design ,Amplifier ,Transmitter ,Electrical engineering ,Equalization (audio) ,Skew ,Digital clock manager ,Clock skew ,CMOS ,Hardware and Architecture ,Clock domain crossing ,Hardware_INTEGRATEDCIRCUITS ,Electrical and Electronic Engineering ,Telecommunications ,business ,Software ,Clock recovery ,Jitter - Abstract
Most digital systems today use full-swing, unterminated signaling methods that are unsuited for data rates over 100 MHz on 1-meter wires. We are currently developing 0.5-micron CMOS transmitter and receiver circuits that use active equalization to overcome the frequency-dependent attenuation of copper lines. The circuits will operate at 4 Gbps over up to 6 meters of 24-AWG twisted pair or up to 1 meter of 5-mil 0.5-oz. PC trace. In addition to frequency-dependent attenuation, timing uncertainty (skew and jitter) and receiver bandwidth are also major obstacles to high data rates. To address all of these issues, we've given our system the following characteristics: An active transmitter equalizer compensates for the frequency-dependent attenuation of the transmission line. The system performs closed-loop clock recovery independently for each signal line in a manner that cancels all clock and data skew and the low-frequency components of clock jitter. The delay line that generates the transmit and receive clocks (a 400-MHz clock with 10 equally spaced phases) uses several circuit techniques to achieve a total simulated jitter of less than 20 ps in the presence of supply and substrate noise. A clocked receive amplifier with a 50-ps aperture time senses the signal during the center of the eye at the receiver.
- Published
- 1997
- Full Text
- View/download PDF
35. Deadlock-free adaptive routing in multicomputer networks using virtual channels
- Author
-
William J. Dally and H. Aoki
- Subjects
Interconnection ,Network packet ,Computer science ,business.industry ,Distributed computing ,Fault tolerance ,Adaptive routing ,Dependency graph ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Network performance ,business ,Computer network - Abstract
The use of adaptive routing in a multicomputer interconnection network improves network performance by using all available paths and provides fault tolerance by allowing messages to be routed around failed channels and nodes. Two deadlock-free adaptive routing algorithms are described. Both algorithms allocate virtual channels using a count of the number of dimension reversals a packet has performed to eliminate cycles in resource dependency graphs. The static algorithm eliminates cycles in the network channel dependency graph. The dynamic algorithm improves virtual channel utilization by permitting dependency cycles and instead eliminating cycles in the packet wait-for graph. It is proved that these algorithms are deadlock-free. Experimental measurements of their performance are presented.
- Published
- 1993
- Full Text
- View/download PDF
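The static algorithm's virtual-channel allocation can be illustrated with a minimal Python sketch. This assumes the dimension-reversal (DR) count increments whenever a packet routes from a higher-numbered to a lower-numbered dimension, and that the DR count selects the virtual-channel class; the function name and the clamping rule are illustrative, not taken from the paper:

```python
def next_hop_vc(dr_count, current_dim, next_dim, num_vc_classes):
    """Sketch of dimension-reversal-based VC allocation (static algorithm).

    A hop from a higher- to a lower-numbered dimension counts as one
    dimension reversal (DR); a packet is restricted to the VC class given
    by its DR count, which orders channel acquisitions and eliminates
    cycles in the channel dependency graph.
    """
    if next_dim < current_dim:
        dr_count += 1
    # Clamp: packets that exhaust the adaptive classes fall into the
    # highest class, where routing proceeds deterministically.
    vc_class = min(dr_count, num_vc_classes - 1)
    return dr_count, vc_class
```

A packet that zig-zags between dimensions climbs through VC classes with each reversal, so no cyclic channel dependency can form within a class.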
36. Hot chips 12
- Author
-
M. Tremblay, William J. Dally, and A.J. Baum
- Subjects
Hardware and Architecture ,Computer science ,Metallurgy ,Parallel computing ,Electrical and Electronic Engineering ,Software ,Hot Chips - Published
- 2001
- Full Text
- View/download PDF
37. The message-driven processor: a multicomputer processing node with efficient mechanisms
- Author
-
John S. Keen, G. A. Fyler, Richard Lethin, Michael D. Noakes, William J. Dally, R. E. Davison, P.R. Nuth, and J.A.S. Fiske
- Subjects
Very-large-scale integration ,Network architecture ,Hardware_MEMORYSTRUCTURES ,36-bit ,Computer science ,business.industry ,Node (networking) ,Memory controller ,Instruction set ,Hardware and Architecture ,Synchronization (computer science) ,Systems architecture ,Electrical and Electronic Engineering ,business ,Software ,Computer hardware ,Dram - Abstract
The message-driven processor (MDP), a 36-b, 1.1-million transistor, VLSI microcomputer, specialized to operate efficiently in a multicomputer, is described. The MDP chip includes a processor, a 4096-word by 36-b memory, and a network port. An on-chip memory controller with error checking and correction (ECC) permits local memory to be expanded to one million words by adding external DRAM chips. The MDP incorporates primitive mechanisms for communication, synchronization, and naming which support most proposed parallel programming models. The MDP system architecture, instruction set architecture, network architecture, implementation, and software are discussed.
- Published
- 1992
- Full Text
- View/download PDF
38. A fast translation method for paging on top of segmentation
- Author
-
William J. Dally
- Subjects
Computer science ,Translation lookaside buffer ,Parallel computing ,Theoretical Computer Science ,Physical address ,Memory management ,Computational Theory and Mathematics ,Virtual address space ,Hardware and Architecture ,Virtual memory ,Paging ,Segmentation ,Algorithm design ,Software - Abstract
A description is presented of a fast, one-step translation method that implements paging on top of segmentation. This method translates a virtual address into a physical address, performing both the segmentation and paging translations, with a single TLB (translation lookaside buffer) read and a short add. Previous methods performed this translation in two steps and required two TLB reads and a long add. Using the fast method, the fine-grain protection and relocation of segmentation combined with paging can be provided with delay and complexity comparable to paging-only systems. This method allows small segments, particularly important in object-oriented programming systems, to be managed efficiently.
- Published
- 1992
- Full Text
- View/download PDF
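The one-step scheme in entry 38 can be sketched as follows. This is a minimal model, assuming a dict-backed TLB keyed by virtual page number whose entries already fold the segment relocation into the cached frame number; all names and the 12-bit page size are hypothetical:

```python
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1

def translate(vaddr, tlb):
    """One-step translation: a single TLB read plus a short add.

    Each TLB entry caches the *combined* segmentation + paging result,
    so no separate segment-table lookup or long base+offset add is
    needed on the critical path.
    """
    vpn = vaddr >> PAGE_BITS            # virtual page number (segment bits included)
    frame = tlb[vpn]                    # the single TLB read
    # The "short add": frame and offset occupy disjoint, aligned bit fields.
    return (frame << PAGE_BITS) | (vaddr & PAGE_MASK)
```

Because the segment translation is folded in at TLB-fill time, the hit path is as short as in a paging-only system.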
39. Express cubes: improving the performance of k-ary n-cube interconnection networks
- Author
-
William J. Dally
- Subjects
Interconnection ,Packet switching ,Computational Theory and Mathematics ,Logarithm ,Hardware and Architecture ,Computer science ,Mesh networking ,Locality ,Parallel computing ,Cube ,Telecommunications network ,Software ,Theoretical Computer Science - Abstract
The author discusses express cubes, k-ary n-cube interconnection networks augmented by express channels that provide a short path for nonlocal messages. An express cube combines the logarithmic diameter of a multistage network with the wire-efficiency and ability to exploit locality of a low-dimensional mesh network. The insertion of express channels reduces the network diameter and thus the distance component of network latency. Wire length is increased, allowing networks to operate with latencies that approach the physical speed-of-light limitation rather than being limited by node delays. Express channels increase wire bisection in a manner that allows the bisection to be controlled independently of the choice of radix, dimension, and channel width. By increasing wire bisection to saturate the available wiring media, throughput can be substantially increased. With an express cube both latency and throughput are wire-limited and within a small factor of the physical limit on performance.
- Published
- 1991
- Full Text
- View/download PDF
40. Performance analysis of k-ary n-cube interconnection networks
- Author
-
William J. Dally
- Subjects
Interconnection ,Computational Theory and Mathematics ,Hardware and Architecture ,Computer science ,Throughput ,Torus ,Parallel computing ,Deterministic routing ,Topology ,Telecommunications network ,Software ,Theoretical Computer Science - Abstract
VLSI communication networks are wire-limited, i.e. the cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. Communication networks of varying dimensions are analyzed under the assumption of constant wire bisection. Expressions for the latency, average case throughput, and hot-spot throughput of k-ary n-cube networks with constant bisection that agree closely with experimental measurements are derived. It is shown that low-dimensional networks (e.g. tori) have lower latency and higher hot-spot throughput than high-dimensional networks (e.g. binary n-cubes) with the same bisection width.
- Published
- 1990
- Full Text
- View/download PDF
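The flavor of the analysis in entry 40 can be reproduced with a simplified zero-load latency model. This sketch assumes unidirectional channels, unit node delay, and a channel width that grows with radix under a fixed wire bisection; the constants are illustrative, not the paper's exact derivation:

```python
def zero_load_latency(k, n, msg_bits, bisection_per_node=1.0, t_node=1.0):
    """Simplified zero-load latency of a k-ary n-cube: T = t_node * (D + L/W).

    D is the average hop count and W the channel width. Holding the wire
    bisection constant makes W grow with radix k, so low-dimensional
    networks pay fewer serialization cycles per message.
    """
    D = n * (k - 1) / 2.0                       # avg hops over unidirectional rings
    W = max(1.0, bisection_per_node * k / 2.0)  # width under constant bisection
    return t_node * (D + msg_bits / W)
```

For 256 nodes and 150-bit messages, a 16-ary 2-cube (torus) comes out well ahead of a binary 8-cube, consistent with the paper's conclusion that tori outperform binary n-cubes at equal bisection width.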
41. A hardware logic simulation system
- Author
-
Prathima Agrawal and William J. Dally
- Subjects
Very-large-scale integration ,Amdahl's law ,Event (computing) ,Computer science ,business.industry ,Logic simulation ,System testing ,Integrated circuit ,Parallel computing ,Mars Exploration Program ,Computer Graphics and Computer-Aided Design ,law.invention ,symbols.namesake ,law ,symbols ,Algorithm design ,Electrical and Electronic Engineering ,business ,Software ,Computer hardware - Abstract
Multiple-delay logic simulation algorithms developed for the microprogrammable accelerator for rapid simulations (MARS) hardware simulator are discussed. In particular, timing-analysis algorithms for event cancellations, spike and race analyses, and oscillation detection are described. It is shown how a reconfigurable set of processors, called processing elements (PEs), can be arranged in a pipelined configuration to implement these algorithms. The algorithms operate within the partitioned-memory, message-passing architecture of MARS. Three logic simulators, two multiple-delay and one unit-delay, have been implemented using slightly different configurations of the available PEs. In these simulators, VLSI chips are modeled at the gate level with accurate rise/fall delays assigned to each logic primitive. On-chip memory blocks are modeled functionally and are integrated into the simulation framework. The MARS hardware simulator has been tested on many VLSI chip designs and has demonstrated a speed improvement of about 50 times that of an Amdahl 5870 system running a production-quality software simulator while retaining the accuracy of simulations.
- Published
- 1990
- Full Text
- View/download PDF
42. Topology Optimization of Interconnection Networks
- Author
-
Amit Gupta and William J. Dally
- Subjects
Interconnection ,Critical distance ,Hardware and Architecture ,Network packet ,Computer science ,Distributed computing ,Topology optimization ,Logical topology ,Multistage interconnection networks ,Latency (engineering) ,Network topology - Abstract
This paper describes an automatic optimization tool that searches a family of network topologies to select the topology that best achieves a specified set of design goals while satisfying specified packaging constraints. Our tool uses a model of signaling technology that relates bandwidth, cost, and distance of links. This model captures the distance-dependent bandwidth of modern high-speed electrical links and the cost differential between electrical and optical links. Using our optimization tool, we explore the design space of hybrid Clos-torus (C-T) networks. For a representative set of packaging constraints we determine the optimal hybrid C-T topology to minimize cost and the optimal C-T topology to minimize latency for various packet lengths. We then use the tool to measure the sensitivity of the optimal topology to several important packaging constraints such as pin count and critical distance.
- Published
- 2006
- Full Text
- View/download PDF
43. Data Parallel Address Architecture
- Author
-
William J. Dally and Jung Ho Ahn
- Subjects
Hardware_MEMORYSTRUCTURES ,business.industry ,Sense amplifier ,Computer science ,Locality ,Uniform memory access ,Parallel computing ,Thread (computing) ,CAS latency ,Hardware and Architecture ,Memory rank ,Latency (engineering) ,business ,Dram ,Computer hardware - Abstract
Data parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of increasing latency. At the same time, the throughput of modern DRAMs is very sensitive to access patterns due to the time required to precharge and activate banks and to switch between read and write access. To achieve memory-reference parallelism, a system may simultaneously issue references from multiple reference threads. Alternatively, multiple references from a single thread can be issued in parallel. In this paper, we examine this tradeoff and show that allowing only a single thread to access DRAM at any given time significantly improves performance by increasing the locality of the reference stream and hence reducing precharge/activate operations and read/write turnaround. Simulations of scientific and multimedia applications show that generating multiple references from a single thread gives, on average, 17% better performance than generating references from two parallel threads.
- Published
- 2006
- Full Text
- View/download PDF
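The locality argument in entry 43 can be illustrated with a toy open-row model that counts activate operations for a single DRAM bank. This is a sketch, not the paper's simulator; the addresses, 1-KiB row size, and single-bank assumption are made up for illustration:

```python
def row_activations(stream, row_bits=10):
    """Count row activations for a reference stream (one-bank toy model)."""
    acts, open_row = 0, None
    for addr in stream:
        row = addr >> row_bits
        if row != open_row:          # row miss: precharge + activate
            acts += 1
            open_row = row
    return acts

# Two sequential per-thread streams living in distant rows.
a = list(range(0, 8192, 64))
b = list(range(1 << 20, (1 << 20) + 8192, 64))
# Fine-grained interleaving alternates between the two rows on every access.
interleaved = [x for pair in zip(a, b) for x in pair]
```

Issuing one thread's references at a time preserves row locality (a handful of activations per stream), while fine-grained interleaving thrashes the open row on every access, mirroring the paper's precharge/activate argument.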
44. Buffer and Delay Bounds in High Radix Interconnection Networks
- Author
-
Arjun Singh and William J. Dally
- Subjects
Queueing theory ,Interconnection ,Intelligent Network ,Hardware and Architecture ,business.industry ,Network packet ,Computer science ,Bounding overwatch ,Parallel computing ,Latency (engineering) ,business ,Buffer (optical fiber) ,Computer network - Abstract
We apply recent results in queueing theory to propose a methodology for bounding the buffer depth and packet delay in high radix interconnection networks. While most work in interconnection networks has been focused on the throughput and average latency in such systems, few studies have been done providing statistical guarantees for buffer depth and packet delays. These parameters are key in the design and performance of a network. We present a methodology for calculating such bounds for a practical high radix network and through extensive simulations show its effectiveness for both bursty and non-bursty injection traffic. Our results suggest that modest speedups and buffer depths enable reliable networks without flow control to be constructed.
- Published
- 2004
- Full Text
- View/download PDF
45. Globally Adaptive Load-Balanced Routing on Tori
- Author
-
Brian Towles, Arjun Singh, Amit Gupta, and William J. Dally
- Subjects
Zone Routing Protocol ,Static routing ,Dynamic Source Routing ,Computer science ,business.industry ,Distributed computing ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Enhanced Interior Gateway Routing Protocol ,Policy-based routing ,Link-state routing protocol ,Hardware and Architecture ,Multipath routing ,Hardware_INTEGRATEDCIRCUITS ,Destination-Sequenced Distance Vector routing ,business ,Computer network - Abstract
We introduce a new method of adaptive routing on k-ary n-cubes, Globally Adaptive Load-Balance (GAL). GAL makes global routing decisions using global information. In contrast, most previous adaptive routing algorithms make local routing decisions using local information (typically channel queue depth). GAL senses global congestion using segmented injection queues to decide the directions to route in each dimension. It further load balances the network by routing in the selected directions adaptively. Using global information, GAL achieves the performance (latency and throughput) of minimal adaptive routing on benign traffic patterns and performs as well as the best obliviously load-balanced routing algorithm (GOAL) on adversarial traffic.
- Published
- 2004
- Full Text
- View/download PDF
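GAL's per-dimension direction decision can be sketched as: route minimally unless the minimal direction's segmented injection queue signals congestion. The threshold, queue representation, and function name here are assumptions for illustration, not the paper's exact mechanism:

```python
def choose_direction(delta, q_minimal, q_nonminimal, threshold=4):
    """GAL-style direction choice in one torus dimension (sketch).

    delta: signed minimal-path offset to the destination.
    Route the short way around the ring unless its injection-queue
    occupancy exceeds the non-minimal queue's by more than `threshold`,
    which signals global congestion on that side.
    """
    if delta == 0:
        return 0                      # already aligned in this dimension
    minimal_dir = 1 if delta > 0 else -1
    if q_minimal - q_nonminimal > threshold:
        return -minimal_dir           # load-balance the long way around
    return minimal_dir
```

On benign traffic the queues stay short and every packet routes minimally; under adversarial traffic the congested side's queue backs up and packets spill onto non-minimal paths, which is how GAL matches minimal routing on benign patterns while load-balancing adversarial ones.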
46. Migration in Single Chip Multiprocessors
- Author
-
Kelly A. Shaw and William J. Dally
- Subjects
Reduction (complexity) ,Single chip ,Resource (project management) ,Ideal (set theory) ,Hardware and Architecture ,Computer science ,Locality ,Parallel computing - Abstract
Global communication costs in future single-chip multiprocessors will increase linearly with distance. In this paper, we revisit the issues of locality and load balance in order to take advantage of these new costs. We present a technique which simultaneously migrates data and threads based on vectors specifying locality and resource usage. This technique improves performance on applications with distinguishable locality and imbalanced resource usage. 64% of the ideal reduction in execution time was achieved on an application with these traits, while no improvement was obtained on a balanced application with little locality.
- Published
- 2002
- Full Text
- View/download PDF
47. MARS: A Multiprocessor-Based Programmable Accelerator
- Author
-
R. Tutundjian, Prathima Agrawal, William J. Dally, Anjur Sundaresan Krishnakumar, H. V. Jagadish, and W. C. Fischer
- Subjects
Very-large-scale integration ,Flexibility (engineering) ,business.industry ,Computer science ,Logic simulation ,Multiprocessing ,Mars Exploration Program ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Telecommunications network ,Acceleration ,Hardware and Architecture ,Embedded system ,Hardware acceleration ,Electrical and Electronic Engineering ,business ,Software - Abstract
MARS, short for microprogrammable accelerator for rapid simulations, is a multiprocessor-based hardware accelerator that can efficiently implement a wide range of computationally complex algorithms. In addition to accelerating many graph-related problem solutions, MARS is ideally suited for performing event-driven simulations of VLSI circuits. Its highly pipelined and parallel architecture yields a performance comparable to that of existing special-purpose hardware simulators. MARS has the added advantage of flexibility because its VLSI processors are custom-designed to be microprogrammable and reconfigurable. When programmed as a logic simulator, MARS should be able to achieve 1 million gate evaluations per second.
- Published
- 1987
- Full Text
- View/download PDF