47 results on "William J. Dally"
Search Results
2. A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Quantization in 5 nm
- Author
Ben Keller, Rangharajan Venkatesan, Steve Dai, Stephen G. Tell, Brian Zimmer, Charbel Sakr, William J. Dally, C. Thomas Gray, and Brucek Khailany
- Subjects
Electrical and Electronic Engineering - Published
- 2023
- Full Text
- View/download PDF
3. Evolution of the Graphics Processing Unit (GPU)
- Author
Stephen W. Keckler, David B. Kirk, and William J. Dally
- Subjects
Vertex (computer graphics), Fragment (computer graphics), Computer science, Graphics processing unit, Frame rate, High memory, Hardware and Architecture, Computer graphics (images), Smart camera, Electrical and Electronic Engineering, Graphics, Shader, Software - Abstract
Graphics processing units (GPUs) power today’s fastest supercomputers, are the dominant platform for deep learning, and provide the intelligence for devices ranging from self-driving cars to robots and smart cameras. They also generate compelling photorealistic images at real-time frame rates. GPUs have evolved by adding features to support new use cases. NVIDIA’s GeForce 256, the first GPU, was a dedicated processor for real-time graphics, an application that demands large amounts of floating-point arithmetic for vertex and fragment shading computations and high memory bandwidth. As real-time graphics advanced, GPUs became programmable. The combination of programmability and floating-point performance made GPUs attractive for running scientific applications. Scientists found ways to use early programmable GPUs by casting their calculations as vertex and fragment shaders. GPUs evolved to meet the needs of scientific users by adding hardware for simpler programming, double-precision floating-point arithmetic, and resilience.
- Published
- 2021
- Full Text
- View/download PDF
4. Accelerating Chip Design With Machine Learning
- Author
Steve Dai, Ben Keller, William J. Dally, Brucek Khailany, Rangharajan Venkatesan, Alicia Klinefelter, Robert M. Kirby, Saad Godil, Yanqing Zhang, Haoxing Ren, and Bryan Catanzaro
- Subjects
Very-large-scale integration, Artificial neural network, Design space exploration, Computer science, Integrated circuit design, Machine learning, Convolutional neural network, Workflow, Hardware and Architecture, Logic gate, Graph (abstract data type), Artificial intelligence, Electrical and Electronic Engineering, Design methods, Software - Abstract
Recent advancements in machine learning provide an opportunity to transform chip design workflows. We review recent research applying techniques such as deep convolutional neural networks and graph-based neural networks in the areas of automatic design space exploration, power analysis, VLSI physical design, and analog design. We also present a future vision of an AI-assisted automated chip design workflow to aid designer productivity and automate optimization tasks.
- Published
- 2020
- Full Text
- View/download PDF
5. A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm
- Author
Joel Emer, Matthew Fojtik, C. Thomas Gray, Ben Keller, Stephen G. Tell, Priyanka Raina, Stephen W. Keckler, Alicia Klinefelter, William J. Dally, Brucek Khailany, Brian Zimmer, Jason Clemons, Rangharajan Venkatesan, Nan Jiang, Yanqing Zhang, Nathaniel Pinckney, and Yakun Sophia Shao
- Subjects
Computer science, Multi-chip module, Bandwidth (signal processing), Scalability, Mesh networking, Inference, System on a chip, Electrical and Electronic Engineering, Chip, Computer hardware, Efficient energy use - Abstract
Custom accelerators improve the energy efficiency, area efficiency, and performance of deep neural network (DNN) inference. This article presents a scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic chip are optimal for specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. Communication energy is minimized with large on-chip distributed weight storage and a hierarchical network-on-chip and network-on-package, and inference energy is minimized through extensive data reuse. The 16-nm prototype achieves 1.29-TOPS/mm² area efficiency, 0.11 pJ/op (9.5 TOPS/W) energy efficiency, 4.01-TOPS peak performance for a one-chip system, and 127.8 peak TOPS and 1903 images/s ResNet-50 batch-1 inference for a 36-chip system.
- Published
- 2020
- Full Text
- View/download PDF
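As a quick aside, the headline efficiency figures in this abstract are mutually consistent, since pJ/op and TOPS/W are reciprocals of each other up to unit scaling. A minimal check (our own arithmetic, not from the paper):

```python
# Unit-conversion check relating the abstract's 9.5 TOPS/W and 0.11 pJ/op
# figures (our own arithmetic, not from the paper).

def pj_per_op(tops_per_watt: float) -> float:
    """Convert an efficiency in TOPS/W to energy per operation in pJ."""
    ops_per_joule = tops_per_watt * 1e12   # 1 TOPS/W = 1e12 ops per joule
    joules_per_op = 1.0 / ops_per_joule
    return joules_per_op * 1e12            # joules -> picojoules

print(round(pj_per_op(9.5), 2))            # prints 0.11
```

Running it confirms that 9.5 TOPS/W rounds to the quoted 0.11 pJ/op.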
6. Energy Efficient On-Demand Dynamic Branch Prediction Models
- Author
Ehsan Atoofian, Amirali Baniasadi, Milad Mohammadi, Tor M. Aamodt, William J. Dally, and Song Han
- Subjects
Computer science, Fetch, Parallel computing, Supercomputer, Branch predictor, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Compiler, Cache, Software, Integer (computer science), Efficient energy use - Abstract
The branch predictor unit (BPU) is among the main energy-consuming components in out-of-order (OoO) processors. For integer applications, we find 16 percent of the processor energy is consumed by the BPU. The BPU is accessed in parallel with the instruction cache before it is known whether a fetch group contains control instructions. We find 85 percent of BPU lookups are done for non-branch operations, and of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate two variants of a branch prediction model that combines dynamic and static branch prediction to achieve energy improvements for power-constrained applications. These models, named on-demand branch prediction (ODBP) and path-based on-demand branch prediction (ODBP-PATH), are two novel prediction techniques that eliminate unnecessary BPU lookups using compiler-generated hints to identify instructions that can be more accurately predicted statically. ODBP-PATH is an implementation of ODBP that combines static and dynamic branch prediction based on the program path of execution. For a 4-wide OoO processor, ODBP-PATH delivers an 11 percent average energy-delay (ED) product improvement and a 9 percent average core energy saving on the SPEC Int 2006 benchmarks.
- Published
- 2020
- Full Text
- View/download PDF
7. Darwin: A Genomics Coprocessor
- Author
William J. Dally, Gill Bejerano, and Yatish Turakhia
- Subjects
Coprocessor, Speedup, Computer science, Molecular biophysics, Sequence assembly, Genomics, Parallel computing, Orders of magnitude (bit rate), Hardware and Architecture, Darwin (ADL), Human genome, Electrical and Electronic Engineering, Software - Abstract
Long-read sequencing is promising because it reveals the full spectrum of mutations in the human genome and produces more contiguous de novo assemblies. But the high error rate of long reads imposes a computational barrier to genome assembly. Darwin, a specialized coprocessor that provides orders-of-magnitude speedup over conventional processors in long-read assembly, can eliminate this barrier.
- Published
- 2019
- Full Text
- View/download PDF
8. A 1.17-pJ/b, 25-Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication Using a Process- and Temperature-Adaptive Voltage Regulator
- Author
William J. Dally, C. Thomas Gray, John Wilson, Sudhir S. Kudva, John W. Poulton, Wenxu Zhao, Nikola Nedovic, Stephen G. Tell, Xi Chen, Walker J. Turner, Sunil Sudhakaran, Sanquan Song, and Brian Zimmer
- Subjects
Frequency response, Serial communication, Computer science, Transmitter, Electrical engineering, Voltage regulator, Phase-locked loop, CMOS, Electrical and Electronic Engineering, Transceiver, Jitter - Abstract
This paper describes a short-reach serial link to connect chips mounted on the same package or on neighboring packages on a printed circuit board (PCB). The link employs an energy-efficient, single-ended ground-referenced signaling scheme. Implemented in 16-nm FinFET CMOS technology, the link operates at a data rate of 25 Gb/s/pin with 1.17-pJ/bit energy efficiency and uses a simple but robust matched-delay clock forwarding scheme that cancels most sources of jitter. The modest frequency-dependent attenuation of short-reach links is compensated using an analog equalizer in the transmitter. The receiver includes active-inductor peaking in the input amplifier to improve overall receiver frequency response. The link employs a novel power supply regulation scheme at both ends that uses a PLL ring-oscillator supply voltage as a reference to flatten circuit speed and reduce power consumption variation across PVT. The link can be calibrated once at an arbitrary voltage and temperature, then track VT variation without the need for periodic re-calibration. The link operates over a 10-mm-long on-package channel with −4 dB of attenuation with 0.77-UI eye opening at a bit-error rate (BER) of 10⁻¹⁵. A package-to-package link with 54 mm of PCB and 26 mm of on-package trace with −8.5 dB of loss at Nyquist operates with 0.42 UI of eye opening at a BER of 10⁻¹⁵. Overall link die area is 686 µm × 565 µm, with the transceiver circuitry taking up 20% of the area. The transceiver’s on-chip regulator is supplied from an off-chip 950-mV supply, while the support logic operates on a separate 850-mV supply.
- Published
- 2019
- Full Text
- View/download PDF
9. Optimal Operation of a Plug-In Hybrid Vehicle
- Author
John Fox, Nicholas Moehle, Jason A. Platt, and William J. Dally
- Subjects
Computer Networks and Communications, Computer science, Aerospace Engineering, Grid, Automotive engineering, Nonlinear system, Control theory, Automotive Engineering, Electric vehicle, Convex optimization, Fuel efficiency, Petroleum, Resource management, Electrical and Electronic Engineering, Convex function, Hybrid vehicle - Abstract
We present a convex optimization control method that has been shown in simulations to increase the fuel efficiency of a plug-in hybrid electric vehicle by over 10%. Using information on energy demand and energy use profiles, the problem is formulated to preferentially use battery resources sourced from the grid over petroleum resources. We pose the general nonlinear optimal resource management problem over a predetermined route as a convex optimization problem using a reduced model of the vehicle. This problem is computationally efficient enough to be optimized “on the fly” on the on-board vehicle computer and is thus able to adapt to changing vehicle conditions in real time. Using this reduced model to generate control inputs for the detailed vehicle simulator Autonomie, we record efficiency gains of over 10% compared to the industry-standard charge-depleting/charge-sustaining controller over synthetic mixed urban-suburban routes.
- Published
- 2018
- Full Text
- View/download PDF
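To make the idea concrete, here is a toy sketch (our own illustration with made-up numbers, not the authors' vehicle model or solver): when per-segment fuel cost is modeled linearly, preferring grid-charged battery energy on the most fuel-expensive route segments is optimal, and the convex program collapses to a greedy allocation pass.

```python
# Toy version of the paper's idea (hypothetical numbers and a deliberately
# simplified linear fuel-cost model, not the authors' formulation): over a
# known route, spend grid-charged battery energy where engine fuel would
# cost the most.

def plan_battery_use(demand, fuel_cost, battery_budget, max_assist):
    """Allocate battery energy per segment to minimize total fuel cost.

    demand[t]      -- energy needed on segment t (kWh)
    fuel_cost[t]   -- fuel cost per kWh if the engine supplies segment t
    battery_budget -- total grid-charged battery energy available (kWh)
    max_assist     -- cap on battery energy usable on any one segment (kWh)
    """
    battery = [0.0] * len(demand)
    remaining = battery_budget
    # Spend battery on the most fuel-expensive segments first.
    for t in sorted(range(len(demand)), key=lambda t: -fuel_cost[t]):
        use = min(demand[t], max_assist, remaining)
        battery[t] = use
        remaining -= use
    fuel = sum(c * (d - b) for c, d, b in zip(fuel_cost, demand, battery))
    return battery, fuel

demand = [2.0, 3.0, 1.0, 4.0]         # per-segment energy, kWh (made up)
fuel_cost = [0.30, 0.60, 0.45, 0.60]  # urban segments cost more fuel per kWh
battery, fuel = plan_battery_use(demand, fuel_cost,
                                 battery_budget=5.0, max_assist=3.0)
print(battery, fuel)                  # prints [0.0, 3.0, 0.0, 2.0] 2.25
```

The real controller instead solves a convex program over a detailed reduced vehicle model, which is what lets it adapt on the fly.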
10. A 28 nm 2 Mbit 6T SRAM With Highly Configurable Low-Voltage Write-Ability Assist Implementation and Capacitor-Based Sense-Amplifier Input Offset Compensation
- Author
Andreas J. Gotterba, Mahmut E. Sinangil, Jesse S. Wang, Matthew Fojtik, Jason Golbus, John W. Poulton, Brian Zimmer, Stephen G. Tell, C. Thomas Gray, William J. Dally, and Thomas Hastings Greer
- Subjects
Engineering, Offset (computer science), Input offset voltage, Sense amplifier, Amplifier, Capacitor, CMOS, Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Low voltage - Abstract
This paper presents a highly configurable low-voltage write-ability assist implementation along with a sense-amplifier offset reduction technique to improve SRAM read performance. The write-assist implementation combines negative bit-line (BL) and VDD-collapse schemes in an efficient way to maximize Vmin improvements while saving on the area and energy overhead of these assists. The relative delay and pulse width of the assist control signals are also configurable to allow tuning of assist strengths. The sense-amplifier offset compensation scheme uses capacitors to store and negate the threshold mismatch of the input transistors. A test chip fabricated in a 28 nm HP CMOS process demonstrates operation down to 0.5 V with write assists and a more than 10% reduction in word-line pulse width with the offset-compensated sense amplifiers.
- Published
- 2016
- Full Text
- View/download PDF
11. On-Chip Active Messages for Speed, Scalability, and Efficiency
- Author
R. Curtis Harting and William J. Dally
- Subjects
Computer science, Computational Theory and Mathematics, Shared memory, Hardware and Architecture, Signal Processing, Scalability, Operating system, Concurrent computing, Overhead (computing), Cache coherence, Energy (signal processing), Computer network - Abstract
This paper describes and quantifies the benefits of adding low-overhead active messages to many-core, cache-coherent chip multiprocessors. The active messages we analyze are user defined and trigger the atomic execution of a custom software handler at the destination. Programmers can use these active messages both to move data with less overhead than cache coherency and, more importantly, to explicitly send computation to data. Doing so greatly improves communication idioms such as shared-object modification, reductions, data walks, point-to-point communication, and all-to-all communication (11× speed, 4.8× energy). Active messages enhance program scalability: applications using them run 63 percent faster with 11 percent less energy on 256 cores. The relative benefits of active messages grow with larger numbers of cores.
- Published
- 2015
- Full Text
- View/download PDF
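The core idiom in this abstract can be sketched in a few lines (our own illustration, not the paper's hardware): instead of pulling shared data to the requesting core through cache coherence, a core sends the computation to the core that owns the data, where a user-defined handler runs atomically.

```python
import queue
import threading

# Toy sketch of the active-message idiom (our own illustration): a reduction
# is homed on one core, and other cores send it "add" handlers instead of
# bouncing the shared value between caches.

class Core:
    def __init__(self):
        self.inbox = queue.Queue()   # incoming active messages
        self.data = {}               # data "homed" on this core
        self._lock = threading.Lock()

    def send(self, handler, *args):
        """Deliver an active message naming a handler to run at this core."""
        self.inbox.put((handler, args))

    def run_pending(self):
        """Drain the inbox, executing each handler atomically."""
        while not self.inbox.empty():
            handler, args = self.inbox.get()
            with self._lock:         # handlers execute atomically here
                handler(self, *args)

def add_handler(core, key, delta):
    # Example handler: contribute to a reduction homed at `core`.
    core.data[key] = core.data.get(key, 0) + delta

owner = Core()
for _ in range(4):                   # e.g., four cores each contribute 10
    owner.send(add_handler, "sum", 10)
owner.run_pending()
print(owner.data["sum"])             # prints 40
```

The point of the idiom is that the data never moves; only small handler invocations do.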
12. On-Demand Dynamic Branch Prediction
- Author
Song Han, Tor M. Aamodt, William J. Dally, and Milad Mohammadi
- Subjects
Speedup, Computer science, Speculative execution, Thread (computing), Parallel computing, Branch predictor, Power budget, Branch table, Hardware and Architecture, Compiler, Cache - Abstract
In out-of-order (OoO) processors, speculative execution with high branch prediction accuracy is employed to achieve good single-thread performance. In these processors, the branch prediction unit (BPU) tables are accessed in parallel with the instruction cache before it is known whether a fetch group contains branch instructions. For integer applications, we find 85 percent of BPU lookups are done for non-branch operations, and of the remaining lookups, 42 percent are done for highly biased branches that can be predicted statically with high accuracy. We evaluate on-demand branch prediction (ODBP), a novel technique that uses compiler-generated hints to identify those instructions that can be more accurately predicted statically, eliminating unnecessary BPU lookups. We evaluate an implementation of ODBP that combines static and dynamic branch prediction. For a four-wide superscalar processor, ODBP delivers as much as 9 percent improvement in average energy-delay (ED) product, 7 percent average core energy saving, and 3 percent speedup. ODBP also enables the use of large BPUs for a given power budget.
- Published
- 2015
- Full Text
- View/download PDF
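The mechanism this abstract describes can be sketched as a small fetch-stage filter (our own illustration, not the authors' microarchitecture): compiler hints mark each instruction, and the fetch logic consults the dynamic branch predictor (BPU) only when a hint calls for a dynamic prediction.

```python
# Toy sketch of on-demand branch prediction (ODBP): hint categories and the
# fetch-stage decision of whether to power up the BPU (our own illustration).

NON_BRANCH, BIASED_TAKEN, BIASED_NOT_TAKEN, DYNAMIC = range(4)

def fetch(instructions, bpu_predict):
    """Return (predictions, bpu_lookups) for a stream of (pc, hint) pairs."""
    lookups = 0
    predictions = []
    for pc, hint in instructions:
        if hint == NON_BRANCH:
            predictions.append(None)      # no BPU access at all
        elif hint == BIASED_TAKEN:
            predictions.append(True)      # static prediction, no lookup
        elif hint == BIASED_NOT_TAKEN:
            predictions.append(False)     # static prediction, no lookup
        else:                             # DYNAMIC: consult the BPU
            lookups += 1
            predictions.append(bpu_predict(pc))
    return predictions, lookups

stream = [(0, NON_BRANCH), (4, NON_BRANCH), (8, BIASED_TAKEN), (12, DYNAMIC)]
preds, lookups = fetch(stream, bpu_predict=lambda pc: True)
print(lookups)                            # prints 1: three lookups avoided
```

In the sketch, only one of four instructions pays the BPU lookup energy, which is the effect the paper exploits.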
13. A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications
- Author
John W. Poulton, Thomas Hastings Greer, C. Thomas Gray, William J. Dally, John Eyles, Stephen G. Tell, Xi Chen, and John Wilson
- Subjects
Engineering, Serial communication, Electrical engineering, Integrated circuit design, Chip, Die (integrated circuit), CMOS, Low-power electronics, Charge pump, Electronic engineering, Electrical and Electronic Engineering, Ground plane - Abstract
High-speed signaling over high-density interconnect on organic package substrates or silicon interposers offers an attractive solution to the off-chip bandwidth limitation faced by modern digital systems. In this paper, we describe a signaling system co-designed with the interconnect to take advantage of the characteristics of this environment and enable a high-speed, low-area, low-power die-to-die link. Ground-Referenced Signaling (GRS) is a single-ended signaling system that eliminates the major problems traditionally associated with single-ended design by using the ground plane as the reference and signaling above and below ground. This design employs a novel charge pump driver that additionally eliminates the issue of simultaneous switching noise through data-independent current consumption. Silicon measurements from a test chip implementing two 16-lane links, with forwarded clocks, in a standard 28 nm process demonstrate 20 Gb/s operation at 0.54 pJ/bit over 4.5 mm organic substrate channels at a nominal 0.9 V power supply voltage. Timing margins at the receiver are >0.3 UI at a BER of 10⁻¹². We estimate a BER of 10⁻²⁵ at the eye center.
- Published
- 2013
- Full Text
- View/download PDF
14. Elastic Buffer Flow Control for On-Chip Networks
- Author
William J. Dally and George Michelogiannakis
- Subjects
Router, Flow control (data), Computer science, Throughput, Buffer (optical fiber), Theoretical Computer Science, Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Software, Computer network - Abstract
Networks-on-chip (NoCs) were developed to meet the communication requirements of large-scale systems. The majority of current NoCs spend considerable area and power on router buffers. In our past work, we developed elastic buffer (EB) flow control, which adds simple control logic in the channels to use pipeline flip-flops (FFs) as EBs with two storage locations. This way, channels act as distributed FIFOs and input buffers are no longer required. Removing buffers and virtual channels (VCs) significantly simplifies router design. Compared to VC networks with highly efficient custom SRAM buffers, EB networks provide an up to 45 percent shorter cycle time, 16 percent more throughput per unit power, or 22 percent more throughput per unit area. EB networks provide traffic classes using duplicate physical subnetworks. However, this approach negates the cost gains or becomes infeasible for a large number of traffic classes. Therefore, in this paper we propose a hybrid EB-VC router that provides an arbitrary number of traffic classes by using an input buffer to drain flits facing severe contention or deadlock. Thus, hybrid routers operate as EB routers in the common case, and as VC routers when necessary. For this reason, the hybrid EB-VC scheme offers 21 percent more throughput per unit power than VC networks and 12 percent more than EB networks.
- Published
- 2013
- Full Text
- View/download PDF
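The elastic-buffer mechanism in this abstract can be modeled with a few lines of ready/valid handshaking (our own illustration, not the paper's circuits): each channel pipeline register gains two storage slots and simple control, so the channel itself behaves as a distributed FIFO.

```python
from collections import deque

# Toy sketch of elastic buffer (EB) flow control: a chain of channel stages,
# each with the two storage locations the abstract describes, moving flits
# downstream wherever the ready/valid handshake allows (our own illustration).

class ElasticStage:
    CAPACITY = 2                       # the two storage locations per EB

    def __init__(self):
        self.slots = deque()

    @property
    def ready(self):                   # can accept a flit from upstream
        return len(self.slots) < self.CAPACITY

    @property
    def valid(self):                   # has a flit to offer downstream
        return len(self.slots) > 0

def advance(stages):
    """One cycle: flits move one stage downstream where the handshake allows."""
    for i in range(len(stages) - 2, -1, -1):   # downstream stages drain first
        up, down = stages[i], stages[i + 1]
        if up.valid and down.ready:
            down.slots.append(up.slots.popleft())

channel = [ElasticStage() for _ in range(3)]
channel[0].slots.append("flit0")
channel[0].slots.append("flit1")       # a stage absorbs two flits before stalling
advance(channel)                       # flit0 advances; flit1 waits behind it
print([list(s.slots) for s in channel])
```

Because each stage holds up to two flits, a downstream stall is absorbed locally instead of propagating a bubble, which is why no router input buffer is needed.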
15. GPUs and the Future of Parallel Computing
- Author
Michael Garland, D. Glasco, William J. Dally, Brucek Khailany, and Stephen W. Keckler
- Subjects
Coprocessor, Computer architecture, Parallel processing (DSP implementation), Hardware and Architecture, Computer science, End-user computing, Next-generation network, Graphics processing unit, Parallel computing, Electrical and Electronic Engineering, Software - Abstract
This article discusses the capabilities of state-of-the-art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.
- Published
- 2011
- Full Text
- View/download PDF
16. Evaluating Elastic Buffer and Wormhole Flow Control
- Author
Daniel U. Becker, William J. Dally, and George Michelogiannakis
- Subjects
Router, Flow control (data), Computer science, Throughput, Buffer (optical fiber), Theoretical Computer Science, Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Wormhole, Software, Computer network - Abstract
With the emergence of on-chip networks, router buffer power has become a primary concern. Elastic buffer (EB) flow control utilizes existing pipeline flip-flops in the channels to implement distributed FIFOs, eliminating the need for input buffers at the routers. EB routers have been shown to be more efficient than virtual channel routers, as they do not require input buffers or complex logic for managing virtual channels and tracking credits. Wormhole routers are more comparable in terms of complexity because they also lack virtual channels. This paper compares EB and wormhole routers and explores novel hybrid designs to more closely examine the effect of design simplicity and input buffer cost. Our results show that EB routers have up to 25 percent smaller cycle time compared to wormhole and hybrid routers. Moreover, EB flow control requires 10 percent less energy to transfer a single bit through a router and offers 3 percent more throughput per unit energy as well as 62 percent more throughput per unit area. The main contributor to these results is the cost and delay overhead of the input buffer.
- Published
- 2011
- Full Text
- View/download PDF
17. The GPU Computing Era
- Author
John R. Nickolls and William J. Dally
- Subjects
Ubiquitous computing, Coprocessor, Computer science, Graphics processing unit, Parallel computing, Supercomputer, CUDA, Hardware and Architecture, Concurrent computing, Electrical and Electronic Engineering, Graphics, General-purpose computing on graphics processing units, Massively parallel, Software - Abstract
GPU computing is at a tipping point, becoming more widely used in demanding consumer applications and high-performance computing. This article describes the rapid evolution of GPU architectures, from graphics processors to massively parallel many-core multiprocessors, along with recent developments in GPU computing architectures and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications.
- Published
- 2010
- Full Text
- View/download PDF
18. Operand Registers and Explicit Operand Forwarding
- Author
William J. Dally, James Balfour, and R.C. Harting
- Subjects
Memory hierarchy, Computer science, Pipeline (computing), Parallel computing, Operand, Hardware and Architecture, Code generation, Routing (electronic design automation), Fixed-point arithmetic, Operand forwarding, Efficient energy use - Abstract
Operand register files are small, inexpensive register files that are integrated with function units in the execute stage of the pipeline, effectively extending the pipeline operand registers into register files. Explicit operand forwarding lets software opportunistically orchestrate the routing of operands through the forwarding network to avoid writing ephemeral values to registers. Both mechanisms let software capture short-term reuse and locality close to the function units, improving energy efficiency by allowing a significant fraction of operands to be delivered from inexpensive registers that are integrated with the function units. An evaluation shows that capturing operand bandwidth close to the function units allows operand registers to reduce the energy consumed in the register files and forwarding network of an embedded processor by 61%, and allows explicit forwarding to reduce the energy consumed by 26%.
- Published
- 2009
- Full Text
- View/download PDF
19. Efficient Embedded Computing
- Author
R.C. Harting, D. Black-Shaffer, J. Chen, James Balfour, William J. Dally, V. Parikh, D. Sheffield, and Jongsoo Park
- Subjects
General Computer Science, Reduced instruction set computing, Computer architecture, Application-specific integrated circuit, Computer science, Embedded system, Energy (signal processing) - Abstract
Hardwired ASICs, which are 50× more efficient than programmable processors, sacrifice programmability to meet the efficiency requirements of demanding embedded systems. Programmable processors use energy mostly to supply instructions and data to the arithmetic units, and several techniques can reduce instruction- and data-supply energy costs. Using these techniques in the Stanford ELM processor closes the gap with ASICs to within 3×.
- Published
- 2008
- Full Text
- View/download PDF
20. Hierarchical Instruction Register Organization
- Author
James Balfour, Jongsoo Park, David Black-Schaffer, William J. Dally, and V. Parikh
- Subjects
Instruction set, Instruction register, Kernel (linear algebra), Computer architecture, Indirection, Computer-integrated manufacturing, Hardware and Architecture, Computer science, Very long instruction word, Overhead (computing), Baseline (configuration management) - Abstract
This paper analyzes a range of architectures for efficient delivery of VLIW instructions for embedded media kernels. The analysis takes an efficient filter cache as a baseline and examines the benefits of 1) removing the tag overhead, 2) distributing the storage, 3) adding indirection, 4) adding efficient NOP generation, and 5) sharing instruction memory. The result is a hierarchical instruction register organization that provides 56% energy and 40% area savings over an already efficient filter cache.
- Published
- 2008
- Full Text
- View/download PDF
21. An Energy-Efficient Processor Architecture for Embedded Systems
- Author
Jongsoo Park, James Balfour, David Black-Schaffer, William J. Dally, and V. Parikh
- Subjects
Instruction register, Instructions per cycle, Reduced instruction set computing, Computer science, Processor register, Transport triggered architecture, Microarchitecture, Addressing mode, Instruction set, Computer architecture, Hardware and Architecture, Embedded system - Abstract
We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23× greater than an embedded RISC processor.
- Published
- 2008
- Full Text
- View/download PDF
22. Flattened Butterfly Topology for On-Chip Networks
- Author
William J. Dally, John Kim, and James Balfour
- Subjects
Butterfly network, Multi-core processor, Computer science, Energy consumption, Parallel computing, Network topology, Topology, Network on a chip, Hardware and Architecture, Butterfly, Latency (engineering), Computer network - Abstract
With the trend toward an increasing number of cores in multicore processors, the on-chip network that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly offers lower latency and energy consumption than conventional on-chip topologies. In addition, by properly using bypass channels in the flattened butterfly network, non-minimal routing can be employed without increasing latency or energy consumption.
- Published
- 2007
- Full Text
- View/download PDF
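The diameter reduction this abstract claims is easy to see numerically (our own back-of-the-envelope comparison, not the paper's evaluation): in a 2-D flattened butterfly, each router connects to every other router in its row and in its column, so any destination is at most 2 hops away, versus 2(k−1) hops across a k×k mesh.

```python
# Hop-count comparison for 64 nodes: k x k mesh vs. 2-D flattened butterfly
# (our own illustration of the topology's diameter advantage).

def mesh_hops(src, dst, k):
    """Minimal router-to-router hops in a k x k mesh (Manhattan distance)."""
    (x0, y0), (x1, y1) = divmod(src, k), divmod(dst, k)
    return abs(x0 - x1) + abs(y0 - y1)

def flattened_butterfly_hops(src, dst, k):
    """Minimal hops in a 2-D flattened butterfly: 0, 1 (same row/col), or 2."""
    (x0, y0), (x1, y1) = divmod(src, k), divmod(dst, k)
    if src == dst:
        return 0
    return 1 if (x0 == x1 or y0 == y1) else 2

k = 8                                    # 64 on-chip nodes
mesh_diameter = max(mesh_hops(0, d, k) for d in range(k * k))
fbfly_diameter = max(flattened_butterfly_hops(0, d, k) for d in range(k * k))
print(mesh_diameter, fbfly_diameter)     # prints 14 2
```

The trade-off, as the abstract notes, is that the flattened butterfly needs high-radix routers: each router here has 2(k−1) = 14 network ports instead of a mesh router's 4.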
23. A 20-Gb/s 0.13-µm CMOS serial link transmitter using an LC-PLL to directly drive the output multiplexer
- Author
Mark Horowitz, Patrick Chiang, Yangjin Oh, William J. Dally, M.-J.E. Lee, and R. Senthinathan
- Subjects
Physics, Electronic oscillator, Electrical engineering, Sample and hold, Multiplexer, Frequency divider, Phase-locked loop, CMOS, Optical Carrier transmission rates, Electrical and Electronic Engineering, Jitter - Abstract
A 20-Gb/s transmitter is implemented in 0.13-µm CMOS technology. An on-die 10-GHz LC oscillator phase-locked loop (PLL) creates two sinusoidal 10-GHz complementary clock phases as well as eight 2.5-GHz interleaved feedback divider clock phases. After a 2²⁰−1 pseudorandom bit sequence (PRBS) generator creates eight 2.5-Gb/s data streams, the eight 2.5-GHz interleaved clocks 4:1 multiplex the eight 2.5-Gb/s data streams into two 10-Gb/s data streams. 10-GHz analog sample-and-hold circuits retime the two 10-Gb/s data streams to be in phase with the 10-GHz complementary clocks. Two-tap equalization of the 10-Gb/s data streams compensates for bandwidth rolloff of the 10-Gb/s data outputs at the 10-GHz analog latches. A final 20-Gb/s 2:1 output multiplexer, clocked by the complementary 10-GHz clock phases, creates 20-Gb/s data from the two retimed 10-Gb/s data streams. The LC-VCO is integrated with the output multiplexer and analog latches, resonating the load and eliminating the need for clock buffers, reducing power-supply-induced jitter and static phase mismatch. Power, active die area, and jitter (rms/pk-pk) are 165 mW, 650 µm × 350 µm, and 2.37 ps/15 ps, respectively.
- Published
- 2005
- Full Text
- View/download PDF
24. A 33-mW 8-Gb/s CMOS clock multiplier and CDR for highly integrated I/Os
- Author
R. Rathi, William J. Dally, R. Senthinathan, Hiok-Tiaq Ng, Trey Greer, M.-J.E. Lee, J. Edmondson, John W. Poulton, J. Tran, A. Nguyen, and Ramin Farjad-Rad
- Subjects
Physics ,Offset (computer science) ,Vernier scale ,business.industry ,Frequency multiplier ,Electrical engineering ,Voltage regulator ,Multiplexer ,law.invention ,Injection locking ,CMOS ,law ,Electronic engineering ,Electrical and Electronic Engineering ,business ,Clock recovery ,Jitter ,CPU multiplier - Abstract
A 0.622-8-Gb/s clock and data recovery (CDR) circuit using injection locking for jitter suppression and phase interpolation in high-bandwidth system-on-chip solutions is described. A slave injection-locked oscillator (SILO) is locked to a tracking-aperture multiplying DLL (TA-MDLL) via a coarse phase-selection multiplexer (MUX). For the fine timing vernier, an interpolator DAC controls the injection strength of the MUX output into the SILO. This 1.2-V 0.13-µm CMOS CDR consumes 33 mW at 8 Gb/s. Die area including the voltage regulator is 0.08 mm². Recovered clock jitter is 49 ps pk-pk at a 200-ppm bit-rate offset.
- Published
- 2004
- Full Text
- View/download PDF
25. A second-order semidigital clock recovery circuit based on injection locking
- Author
-
Hiok-Tiaq Ng, Trey Greer, William J. Dally, Ramin Farjad-Rad, John W. Poulton, J. Edmondson, R. Senthinathan, R. Rathi, and M.-J.E. Lee
- Subjects
Phase-locked loop ,Injection locking ,Physics ,Synchronous circuit ,CMOS ,Clock domain crossing ,Electronic engineering ,Electrical and Electronic Engineering ,Clock skew ,Digital clock ,Jitter - Abstract
A compact (1 mm × 160 µm) and low-power (80-mW) 0.18-µm CMOS 3.125-Gb/s clock and data recovery circuit is described. The circuit utilizes injection locking to filter out high-frequency reference-clock jitter and multiplying delay-locked loop duty-cycle distortions. The injection-locked slave oscillator can have its output clocks interpolated by current-steering the injecting clocks. A second-order clock and data recovery loop is introduced to perform the interpolation and is capable of tracking frequency offsets while exhibiting low phase wander.
- Published
- 2003
- Full Text
- View/download PDF
26. Programmable stream processors
- Author
-
Ujval J. Kapasi, Peter Mattson, Scott Rixner, Jung Ho Ahn, William J. Dally, John D. Owens, and Brucek Khailany
- Subjects
Stream processing ,Flexibility (engineering) ,Current (stream) ,Concurrency control ,Configuration management ,General Computer Science ,Computer architecture ,Computer science ,Concurrency ,Locality - Abstract
The demand for flexibility in media processing motivates the use of programmable processors. Stream processing bridges the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications. The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. The performance of the Imagine stream processor on these media applications is given.
- Published
- 2003
- Full Text
- View/download PDF
27. Jitter transfer characteristics of delay-locked loops - theories and design techniques
- Author
-
Hiok-Tiaq Ng, Trey Greer, M.-J.E. Lee, R. Senthinathan, Ramin Farjad-Rad, William J. Dally, and John W. Poulton
- Subjects
Computer science ,Control theory ,Frequency domain ,Bandwidth (signal processing) ,Electronic engineering ,Electrical and Electronic Engineering ,Chip ,Jitter ,Electronic circuit - Abstract
This paper presents analyses and experimental results on the jitter transfer of delay-locked loops (DLLs). Through a z-domain model, we show that in a widely used DLL configuration, jitter peaking always exists and high-frequency jitter does not get attenuated as previous analyses suggest. This is true even in a first-order DLL and an overdamped second-order DLL. The amount of jitter peaking is shown to trade off with the tracking bandwidth and, therefore, the acquisition time. Techniques to reduce jitter amplification by loop filtering and phase filtering are discussed. Measurements from a prototype chip incorporating the discussed techniques confirm the prediction of the analytical model. In environments where the reference clock is noisy or where multiple timing circuits are cascaded, this jitter amplification effect should be carefully evaluated.
- Published
- 2003
- Full Text
- View/download PDF
28. A low-power multiplying DLL for low-jitter multigigahertz clock generation in highly integrated digital chips
- Author
-
Hiok-Tiaq Ng, William J. Dally, R. Rathi, John W. Poulton, M.-J.E. Lee, Ramin Farjad-Rad, and R. Senthinathan
- Subjects
Engineering ,business.industry ,Hardware_PERFORMANCEANDRELIABILITY ,Noise (electronics) ,Phase-locked loop ,CMOS ,Filter (video) ,Low-power electronics ,Phase noise ,Hardware_INTEGRATEDCIRCUITS ,Electronic engineering ,Electrical and Electronic Engineering ,business ,Jitter ,CPU multiplier - Abstract
A multiplying delay-locked loop (MDLL) for high-speed on-chip clock generation that overcomes the drawbacks of phase-locked loops (PLLs) such as jitter accumulation, high sensitivity to supply, and substrate noise is described. The MDLL design removes such drawbacks while maintaining the advantages of a PLL for multirate frequency multiplication. This design also uses a supply regulator and filter to further reduce on-chip jitter generation. The MDLL, implemented in 0.18-µm CMOS technology, occupies a total active area of 0.05 mm² and has a speed range of 200 MHz to 2 GHz with selectable multiplication ratios of M=4, 5, 8, 10. The complete synthesizer, including the output clock buffers, dissipates 12 mW from a 1.8-V supply at 2.0 GHz. This MDLL architecture is used as a clock multiplier integrated on a single chip for a 72×72 STS-1 grooming switch and has a jitter of 1.73 ps (rms) and 13.1 ps (pk-pk).
- Published
- 2002
- Full Text
- View/download PDF
29. A delay model for router microarchitectures
- Author
-
William J. Dally and Li-Shiuan Peh
- Subjects
Link state packet ,Router ,Flow control (data) ,business.industry ,Computer science ,computer.internet_protocol ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Virtual Router Redundancy Protocol ,Throughput ,Parallel computing ,Metrics ,Core router ,Hardware and Architecture ,Bridge router ,One-armed router ,Hardware_INTEGRATEDCIRCUITS ,Electrical and Electronic Engineering ,business ,computer ,Software ,Computer network - Abstract
This article introduces a router delay model that takes into account the pipelined nature of contemporary routers and proposes pipelines matched to the specific flow control method employed. Given the type of flow control and router parameters, the model returns router latency in technology-independent units and the number of pipeline stages as a function of cycle time. We apply this model to derive realistic pipelines for wormhole and virtual-channel routers and compare their performance. Contrary to the conclusions of previous models, our results show that the latency of a virtual-channel router does not increase as we scale the number of virtual channels up to 8 per physical channel. Our simulation results also show that a virtual-channel router gains throughput of up to 40% over a wormhole router.
- Published
- 2001
- Full Text
- View/download PDF
30. Imagine: media processing with streams
- Author
-
Peter Mattson, Brian Towles, Ujval J. Kapasi, John D. Owens, Brucek Khailany, A. Chang, William J. Dally, Scott Rixner, and J. Namkoong
- Subjects
Computer science ,business.industry ,Image processing ,Parallel computing ,STREAMS ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,FLOPS ,Microarchitecture ,Stream processing ,Hardware and Architecture ,Encoding (memory) ,Media processor ,Electrical and Electronic Engineering ,business ,Software ,Computer hardware - Abstract
The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.
- Published
- 2001
- Full Text
- View/download PDF
31. Low-power area-efficient high-speed I/O circuit techniques
- Author
-
M.-J.E. Lee, William J. Dally, and Patrick Chiang
- Subjects
Very-large-scale integration ,Engineering ,business.industry ,Amplifier ,Capacitive sensing ,Transmitter ,Hardware_PERFORMANCEANDRELIABILITY ,Integrated circuit design ,CMOS ,Low-power electronics ,Hardware_INTEGRATEDCIRCUITS ,Electronic engineering ,Inverter ,Electrical and Electronic Engineering ,business ,Hardware_LOGICDESIGN - Abstract
We present a 4-Gb/s I/O circuit that fits in 0.1 mm² of die area, dissipates 90 mW of power, and operates over 1 m of 7-mil 0.5-oz PCB trace in a 0.25-µm CMOS technology. Swing reduction is used in an input-multiplexed transmitter to provide most of the speed advantage of an output-multiplexed architecture with significantly lower power and area. A delay-locked loop (DLL) using a supply-regulated inverter delay line gives very low jitter at a fraction of the power of a source-coupled delay line-based DLL. Receiver capacitive offset trimming decreases the minimum resolvable swing to 8 mV, greatly reducing the transmission energy without affecting the performance of the receive amplifier. These circuit techniques enable a high level of I/O integration to relieve the pin bandwidth bottleneck of modern VLSI chips.
- Published
- 2000
- Full Text
- View/download PDF
32. Concurrent event handling through multithreading
- Author
-
William J. Dally, W.S.L.S. Chatterjee, Andrew Chang, and Stephen W. Keckler
- Subjects
Speedup ,Process state ,Computer science ,business.industry ,Exception handling ,Thread (computing) ,computer.software_genre ,Theoretical Computer Science ,law.invention ,Super-threading ,Microprocessor ,Computational Theory and Mathematics ,Hardware and Architecture ,law ,Embedded system ,Multithreading ,Operating system ,Software system ,business ,computer ,Software ,Context switch - Abstract
Exceptions have traditionally been used to handle infrequently occurring and unpredictable events during normal program execution. Current trends in microprocessor and operating systems design continue to increase the cost of event handling. Because of the deep pipelines and wide out-of-order superscalar architectures of contemporary microprocessors, an event may need to nullify a large number of in-flight instructions. Large register files require existing software systems to save and restore a substantial amount of process state before executing an exception handler. At the same time, processors are executing in environments that supply higher event frequencies and demand higher performance. We have developed an alternative architecture, "concurrent event handling", that incorporates multithreading into event handling architectures. Instead of handling the event in the faulting thread's architectural and pipeline registers, the fault handler is forked into its own thread slot and executes concurrently with the faulting thread. Microbenchmark programs show a factor-of-3 speedup for concurrent event handling over a traditional architecture on code that takes frequent exceptions. We also demonstrate substantial speedups on two event-based applications. Concurrent event handling is implemented in MIT's MAP (Multi-ALU Processor) chip.
- Published
- 1999
- Full Text
- View/download PDF
33. An efficient, protected message interface
- Author
-
Nicholas P. Carter, Whay S. Lee, William J. Dally, Andrew Chang, and Stephen W. Keckler
- Subjects
General Computer Science ,Shared memory ,Computer science ,Multithreading ,Distributed computing ,Interface (computing) ,Message passing ,Context (computing) ,Multiprocessing ,Message broker ,Network interface ,Interrupt - Abstract
With increasing demand for computing power, multiprocessing computers will become more common in the future. In these systems, the growing discrepancy between processor and memory technologies will cause tightly integrated message interfaces to be essential for achieving the necessary efficiency, which is especially important in light of the growing interest in software-distributed, shared memory systems. The authors conduct a performance evaluation of several primitive messaging mechanisms: dispatch mechanisms (how the processor reacts to message arrivals), memory-mapped versus register-mapped interfaces, and streaming versus buffered interfaces. They baseline these results against the MIT M-Machine and its tightly integrated message interfaces. They find that a message can be dispatched up to 18 times faster by reserving a hardware thread context for message reception instead of an interrupt-driven interface. They also find that the mapping decision is important, with integrated register-mapped interfaces as much as 3.5 times more efficient than conventional systems. To meet the challenges and exploit the opportunities presented by emerging multithreaded processor architectures, low-overhead mechanisms for protection against message corruption, interception, and starvation must be integral to the message system design. The authors hope that the simple messaging mechanisms presented can help provide a solution to these challenges.
- Published
- 1998
- Full Text
- View/download PDF
34. Transmitter equalization for 4-Gbps signaling
- Author
-
John W. Poulton and William J. Dally
- Subjects
business.industry ,Computer science ,Circuit design ,Amplifier ,Transmitter ,Electrical engineering ,Equalization (audio) ,Skew ,Digital clock manager ,Clock skew ,CMOS ,Hardware and Architecture ,Clock domain crossing ,Hardware_INTEGRATEDCIRCUITS ,Electrical and Electronic Engineering ,Telecommunications ,business ,Software ,Clock recovery ,Jitter - Abstract
Most digital systems today use full-swing, unterminated signaling methods that are unsuited for data rates over 100 MHz on 1-meter wires. We are currently developing 0.5-micron CMOS transmitter and receiver circuits that use active equalization to overcome the frequency-dependent attenuation of copper lines. The circuits will operate at 4 Gbps over up to 6 meters of 24-AWG twisted pair or up to 1 meter of 5-mil 0.5-oz. PC trace. In addition to frequency-dependent attenuation, timing uncertainty (skew and jitter) and receiver bandwidth are also major obstacles to high data rates. To address all of these issues, we've given our system the following characteristics: An active transmitter equalizer compensates for the frequency-dependent attenuation of the transmission line. The system performs closed-loop clock recovery independently for each signal line in a manner that cancels all clock and data skew and the low-frequency components of clock jitter. The delay line that generates the transmit and receive clocks (a 400-MHz clock with 10 equally spaced phases) uses several circuit techniques to achieve a total simulated jitter of less than 20 ps in the presence of supply and substrate noise. A clocked receive amplifier with a 50-ps aperture time senses the signal during the center of the eye at the receiver.
- Published
- 1997
- Full Text
- View/download PDF
35. Deadlock-free adaptive routing in multicomputer networks using virtual channels
- Author
-
William J. Dally and H. Aoki
- Subjects
Interconnection ,Network packet ,Computer science ,business.industry ,Distributed computing ,Fault tolerance ,Adaptive routing ,Dependency graph ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Network performance ,business ,Computer network - Abstract
The use of adaptive routing in a multicomputer interconnection network improves network performance by using all available paths and provides fault tolerance by allowing messages to be routed around failed channels and nodes. Two deadlock-free adaptive routing algorithms are described. Both algorithms allocate virtual channels using a count of the number of dimension reversals a packet has performed to eliminate cycles in resource dependency graphs. The static algorithm eliminates cycles in the network channel dependency graph. The dynamic algorithm improves virtual channel utilization by permitting dependency cycles and instead eliminating cycles in the packet wait-for graph. It is proved that these algorithms are deadlock-free. Experimental measurements of their performance are presented.
- Published
- 1993
- Full Text
- View/download PDF
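The static algorithm's virtual-channel allocation can be illustrated with a minimal Python sketch. This assumes the dimension-reversal (DR) count increments whenever a packet routes from a higher-numbered to a lower-numbered dimension, and that the DR count selects the virtual-channel class; the function name and the clamping rule are illustrative, not taken from the paper:

```python
def next_hop_vc(dr_count, current_dim, next_dim, num_vc_classes):
    """Sketch of dimension-reversal-based VC allocation (static algorithm).

    A hop from a higher- to a lower-numbered dimension counts as one
    dimension reversal (DR); a packet is restricted to the VC class given
    by its DR count, which orders channel acquisitions and eliminates
    cycles in the channel dependency graph.
    """
    if next_dim < current_dim:
        dr_count += 1
    # Clamp: packets that exhaust the adaptive classes fall into the
    # highest class, where routing proceeds deterministically.
    vc_class = min(dr_count, num_vc_classes - 1)
    return dr_count, vc_class
```

A packet that zig-zags between dimensions climbs through VC classes with each reversal, so no cyclic channel dependency can form within a class.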
36. Hot chips 12
- Author
-
M. Tremblay, William J. Dally, and A.J. Baum
- Subjects
Hardware and Architecture ,Computer science ,Metallurgy ,Parallel computing ,Electrical and Electronic Engineering ,Software ,Hot Chips - Published
- 2001
- Full Text
- View/download PDF
37. The message-driven processor: a multicomputer processing node with efficient mechanisms
- Author
-
John S. Keen, G. A. Fyler, Richard Lethin, Michael D. Noakes, William J. Dally, R. E. Davison, P.R. Nuth, and J.A.S. Fiske
- Subjects
Very-large-scale integration ,Network architecture ,Hardware_MEMORYSTRUCTURES ,36-bit ,Computer science ,business.industry ,Node (networking) ,Memory controller ,Instruction set ,Hardware and Architecture ,Synchronization (computer science) ,Systems architecture ,Electrical and Electronic Engineering ,business ,Software ,Computer hardware ,Dram - Abstract
The message-driven processor (MDP), a 36-b, 1.1-million transistor, VLSI microcomputer, specialized to operate efficiently in a multicomputer, is described. The MDP chip includes a processor, a 4096-word by 36-b memory, and a network port. An on-chip memory controller with error checking and correction (ECC) permits local memory to be expanded to one million words by adding external DRAM chips. The MDP incorporates primitive mechanisms for communication, synchronization, and naming which support most proposed parallel programming models. The MDP system architecture, instruction set architecture, network architecture, implementation, and software are discussed.
- Published
- 1992
- Full Text
- View/download PDF
38. A fast translation method for paging on top of segmentation
- Author
-
William J. Dally
- Subjects
Computer science ,Translation lookaside buffer ,Parallel computing ,Theoretical Computer Science ,Physical address ,Memory management ,Computational Theory and Mathematics ,Virtual address space ,Hardware and Architecture ,Virtual memory ,Paging ,Segmentation ,Algorithm design ,Software - Abstract
A description is presented of a fast, one-step translation method that implements paging on top of segmentation. This method translates a virtual address into a physical address, performing both the segmentation and paging translations, with a single TLB (translation lookaside buffer) read and a short add. Previous methods performed this translation in two steps and required two TLB reads and a long add. Using the fast method, the fine-grain protection and relocation of segmentation combined with paging can be provided with delay and complexity comparable to paging-only systems. This method allows small segments, particularly important in object-oriented programming systems, to be managed efficiently.
- Published
- 1992
- Full Text
- View/download PDF
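The one-step scheme in entry 38 can be sketched as follows. This is a minimal model, assuming a dict-backed TLB keyed by virtual page number whose entries already fold the segment relocation into the cached frame number; all names and the 12-bit page size are hypothetical:

```python
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1

def translate(vaddr, tlb):
    """One-step translation: a single TLB read plus a short add.

    Each TLB entry caches the *combined* segmentation + paging result,
    so no separate segment-table lookup or long base+offset add is
    needed on the critical path.
    """
    vpn = vaddr >> PAGE_BITS            # virtual page number (segment bits included)
    frame = tlb[vpn]                    # the single TLB read
    # The "short add": frame and offset occupy disjoint, aligned bit fields.
    return (frame << PAGE_BITS) | (vaddr & PAGE_MASK)
```

Because the segment translation is folded in at TLB-fill time, the hit path is as short as in a paging-only system.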
39. Express cubes: improving the performance of k-ary n-cube interconnection networks
- Author
-
William J. Dally
- Subjects
Interconnection ,Packet switching ,Computational Theory and Mathematics ,Logarithm ,Hardware and Architecture ,Computer science ,Mesh networking ,Locality ,Parallel computing ,Cube ,Telecommunications network ,Software ,Theoretical Computer Science - Abstract
The author discusses express cubes, k-ary n-cube interconnection networks augmented by express channels that provide a short path for nonlocal messages. An express cube combines the logarithmic diameter of a multistage network with the wire-efficiency and ability to exploit locality of a low-dimensional mesh network. The insertion of express channels reduces the network diameter and thus the distance component of network latency. Wire length is increased, allowing networks to operate with latencies that approach the physical speed-of-light limitation rather than being limited by node delays. Express channels increase wire bisection in a manner that allows the bisection to be controlled independently of the choice of radix, dimension, and channel width. By increasing wire bisection to saturate the available wiring media, throughput can be substantially increased. With an express cube both latency and throughput are wire-limited and within a small factor of the physical limit on performance.
- Published
- 1991
- Full Text
- View/download PDF
40. Performance analysis of k-ary n-cube interconnection networks
- Author
-
William J. Dally
- Subjects
Interconnection ,Computational Theory and Mathematics ,Hardware and Architecture ,Computer science ,Throughput ,Torus ,Parallel computing ,Deterministic routing ,Topology ,Telecommunications network ,Software ,Theoretical Computer Science - Abstract
VLSI communication networks are wire-limited, i.e. the cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. Communication networks of varying dimensions are analyzed under the assumption of constant wire bisection. Expressions for the latency, average case throughput, and hot-spot throughput of k-ary n-cube networks with constant bisection that agree closely with experimental measurements are derived. It is shown that low-dimensional networks (e.g. tori) have lower latency and higher hot-spot throughput than high-dimensional networks (e.g. binary n-cubes) with the same bisection width.
- Published
- 1990
- Full Text
- View/download PDF
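The flavor of the analysis in entry 40 can be reproduced with a simplified zero-load latency model. This sketch assumes unidirectional channels, unit node delay, and a channel width that grows with radix under a fixed wire bisection; the constants are illustrative, not the paper's exact derivation:

```python
def zero_load_latency(k, n, msg_bits, bisection_per_node=1.0, t_node=1.0):
    """Simplified zero-load latency of a k-ary n-cube: T = t_node * (D + L/W).

    D is the average hop count and W the channel width. Holding the wire
    bisection constant makes W grow with radix k, so low-dimensional
    networks pay fewer serialization cycles per message.
    """
    D = n * (k - 1) / 2.0                       # avg hops over unidirectional rings
    W = max(1.0, bisection_per_node * k / 2.0)  # width under constant bisection
    return t_node * (D + msg_bits / W)
```

For 256 nodes and 150-bit messages, a 16-ary 2-cube (torus) comes out well ahead of a binary 8-cube, consistent with the paper's conclusion that tori outperform binary n-cubes at equal bisection width.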
41. A hardware logic simulation system
- Author
-
Prathima Agrawal and William J. Dally
- Subjects
Very-large-scale integration ,Amdahl's law ,Event (computing) ,Computer science ,business.industry ,Logic simulation ,System testing ,Integrated circuit ,Parallel computing ,Mars Exploration Program ,Computer Graphics and Computer-Aided Design ,law.invention ,symbols.namesake ,law ,symbols ,Algorithm design ,Electrical and Electronic Engineering ,business ,Software ,Computer hardware - Abstract
Multiple-delay logic simulation algorithms developed for the microprogrammable accelerator for rapid simulations (MARS) hardware simulator are discussed. In particular, timing-analysis algorithms for event cancellations, spike and race analyses, and oscillation detection are described. It is shown how a reconfigurable set of processors, called processing elements (PEs), can be arranged in a pipelined configuration to implement these algorithms. The algorithms operate within the partitioned-memory, message-passing architecture of MARS. Three logic simulators, two multiple-delay and one unit-delay, have been implemented using slightly different configurations of the available PEs. In these simulators, VLSI chips are modeled at the gate level with accurate rise/fall delays assigned to each logic primitive. On-chip memory blocks are modeled functionally and are integrated into the simulation framework. The MARS hardware simulator has been tested on many VLSI chip designs and has demonstrated a speed improvement of about 50 times that of an Amdahl 5870 system running a production-quality software simulator while retaining the accuracy of simulations.
- Published
- 1990
- Full Text
- View/download PDF
42. Topology Optimization of Interconnection Networks
- Author
-
Amit Gupta and William J. Dally
- Subjects
Interconnection ,Critical distance ,Hardware and Architecture ,Network packet ,Computer science ,Distributed computing ,Topology optimization ,Logical topology ,Multistage interconnection networks ,Latency (engineering) ,Network topology - Abstract
This paper describes an automatic optimization tool that searches a family of network topologies to select the topology that best achieves a specified set of design goals while satisfying specified packaging constraints. Our tool uses a model of signaling technology that relates bandwidth, cost, and distance of links. This model captures the distance-dependent bandwidth of modern high-speed electrical links and the cost differential between electrical and optical links. Using our optimization tool, we explore the design space of hybrid Clos-torus (C-T) networks. For a representative set of packaging constraints we determine the optimal hybrid C-T topology to minimize cost and the optimal C-T topology to minimize latency for various packet lengths. We then use the tool to measure the sensitivity of the optimal topology to several important packaging constraints such as pin count and critical distance.
- Published
- 2006
- Full Text
- View/download PDF
43. Data Parallel Address Architecture
- Author
-
William J. Dally and Jung Ho Ahn
- Subjects
Hardware_MEMORYSTRUCTURES ,business.industry ,Sense amplifier ,Computer science ,Locality ,Uniform memory access ,Parallel computing ,Thread (computing) ,CAS latency ,Hardware and Architecture ,Memory rank ,Latency (engineering) ,business ,Dram ,Computer hardware - Abstract
Data parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of increasing latency. At the same time, the throughput of modern DRAMs is very sensitive to access patterns due to the time required to precharge and activate banks and to switch between read and write access. To achieve memory-reference parallelism, a system may simultaneously issue references from multiple reference threads. Alternatively, multiple references from a single thread can be issued in parallel. In this paper, we examine this tradeoff and show that allowing only a single thread to access DRAM at any given time significantly improves performance by increasing the locality of the reference stream and hence reducing precharge/activate operations and read/write turnaround. Simulations of scientific and multimedia applications show that generating multiple references from a single thread gives, on average, 17% better performance than generating references from two parallel threads.
- Published
- 2006
- Full Text
- View/download PDF
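The locality argument in entry 43 can be illustrated with a toy open-row model that counts activate operations for a single DRAM bank. This is a sketch, not the paper's simulator; the addresses, 1-KiB row size, and single-bank assumption are made up for illustration:

```python
def row_activations(stream, row_bits=10):
    """Count row activations for a reference stream (one-bank toy model)."""
    acts, open_row = 0, None
    for addr in stream:
        row = addr >> row_bits
        if row != open_row:          # row miss: precharge + activate
            acts += 1
            open_row = row
    return acts

# Two sequential per-thread streams living in distant rows.
a = list(range(0, 8192, 64))
b = list(range(1 << 20, (1 << 20) + 8192, 64))
# Fine-grained interleaving alternates between the two rows on every access.
interleaved = [x for pair in zip(a, b) for x in pair]
```

Issuing one thread's references at a time preserves row locality (a handful of activations per stream), while fine-grained interleaving thrashes the open row on every access, mirroring the paper's precharge/activate argument.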
44. Buffer and Delay Bounds in High Radix Interconnection Networks
- Author
-
Arjun Singh and William J. Dally
- Subjects
Queueing theory ,Interconnection ,Intelligent Network ,Hardware and Architecture ,business.industry ,Network packet ,Computer science ,Bounding overwatch ,Parallel computing ,Latency (engineering) ,business ,Buffer (optical fiber) ,Computer network - Abstract
We apply recent results in queueing theory to propose a methodology for bounding the buffer depth and packet delay in high radix interconnection networks. While most work in interconnection networks has been focused on the throughput and average latency in such systems, few studies have been done providing statistical guarantees for buffer depth and packet delays. These parameters are key in the design and performance of a network. We present a methodology for calculating such bounds for a practical high radix network and through extensive simulations show its effectiveness for both bursty and non-bursty injection traffic. Our results suggest that modest speedups and buffer depths enable reliable networks without flow control to be constructed.
- Published
- 2004
- Full Text
- View/download PDF
45. Globally Adaptive Load-Balanced Routing on Tori
- Author
-
Brian Towles, Arjun Singh, Amit Gupta, and William J. Dally
- Subjects
Zone Routing Protocol ,Static routing ,Dynamic Source Routing ,Computer science ,business.industry ,Distributed computing ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Enhanced Interior Gateway Routing Protocol ,Policy-based routing ,Link-state routing protocol ,Hardware and Architecture ,Multipath routing ,Hardware_INTEGRATEDCIRCUITS ,Destination-Sequenced Distance Vector routing ,business ,Computer network - Abstract
We introduce a new method of adaptive routing on k-ary n-cubes, Globally Adaptive Load-Balance (GAL). GAL makes global routing decisions using global information. In contrast, most previous adaptive routing algorithms make local routing decisions using local information (typically channel queue depth). GAL senses global congestion using segmented injection queues to decide the directions to route in each dimension. It further load balances the network by routing in the selected directions adaptively. Using global information, GAL achieves the performance (latency and throughput) of minimal adaptive routing on benign traffic patterns and performs as well as the best obliviously load-balanced routing algorithm (GOAL) on adversarial traffic.
- Published
- 2004
- Full Text
- View/download PDF
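GAL's per-dimension direction decision can be sketched as: route minimally unless the minimal direction's segmented injection queue signals congestion. The threshold, queue representation, and function name here are assumptions for illustration, not the paper's exact mechanism:

```python
def choose_direction(delta, q_minimal, q_nonminimal, threshold=4):
    """GAL-style direction choice in one torus dimension (sketch).

    delta: signed minimal-path offset to the destination.
    Route the short way around the ring unless its injection-queue
    occupancy exceeds the non-minimal queue's by more than `threshold`,
    which signals global congestion on that side.
    """
    if delta == 0:
        return 0                      # already aligned in this dimension
    minimal_dir = 1 if delta > 0 else -1
    if q_minimal - q_nonminimal > threshold:
        return -minimal_dir           # load-balance the long way around
    return minimal_dir
```

On benign traffic the queues stay short and every packet routes minimally; under adversarial traffic the congested side's queue backs up and packets spill onto non-minimal paths, which is how GAL matches minimal routing on benign patterns while load-balancing adversarial ones.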
46. Migration in Single Chip Multiprocessors
- Author
-
Kelly A. Shaw and William J. Dally
- Subjects
Reduction (complexity) ,Single chip ,Resource (project management) ,Ideal (set theory) ,Hardware and Architecture ,Computer science ,Locality ,Parallel computing - Abstract
Global communication costs in future single-chip multiprocessors will increase linearly with distance. In this paper, we revisit the issues of locality and load balance in order to take advantage of these new costs. We present a technique which simultaneously migrates data and threads based on vectors specifying locality and resource usage. This technique improves performance on applications with distinguishable locality and imbalanced resource usage. 64% of the ideal reduction in execution time was achieved on an application with these traits, while no improvement was obtained on a balanced application with little locality.
- Published
- 2002
- Full Text
- View/download PDF
47. MARS: A Multiprocessor-Based Programmable Accelerator
- Author
-
R. Tutundjian, Prathima Agrawal, William J. Dally, Anjur Sundaresan Krishnakumar, H. V. Jagadish, and W. C. Fischer
- Subjects
Very-large-scale integration ,Flexibility (engineering) ,business.industry ,Computer science ,Logic simulation ,Multiprocessing ,Mars Exploration Program ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Telecommunications network ,Acceleration ,Hardware and Architecture ,Embedded system ,Hardware acceleration ,Electrical and Electronic Engineering ,business ,Software - Abstract
MARS, short for microprogrammable accelerator for rapid simulations, is a multiprocessor-based hardware accelerator that can efficiently implement a wide range of computationally complex algorithms. In addition to accelerating many graph-related problem solutions, MARS is ideally suited for performing event-driven simulations of VLSI circuits. Its highly pipelined and parallel architecture yields a performance comparable to that of existing special-purpose hardware simulators. MARS has the added advantage of flexibility because its VLSI processors are custom-designed to be microprogrammable and reconfigurable. When programmed as a logic simulator, MARS should be able to achieve 1 million gate evaluations per second.
- Published
- 1987
- Full Text
- View/download PDF