30 results for "Vaughn Betz"
Search Results
2. Tensor Slices to the Rescue
- Author
-
Vaughn Betz, Samidh Mehta, Aman Arora, and Lizy K. John
- Subjects
Adder, Computer science, Parallel computing, Programmable logic device, Reduction (complexity), Hardware acceleration, Routing (electronic design automation), Crossbar switch, Field-programmable gate array, Digital signal processing
- Abstract
FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed, including adding specialized artificial intelligence (AI) processing engines, adding support for IEEE half-precision (fp16) math in DSP slices, and adding hard matrix multiplier blocks. In this paper, we describe replacing a small percentage of the FPGA's programmable logic area with Tensor Slices. These slices have arrays of processing elements at their heart, support multiple tensor operations and multiple dynamically-selectable precisions, and can be dynamically fractured into individual adders, multipliers and MACs (multiply-and-accumulate units). These tiles have a local crossbar at the inputs that helps ease the routing pressure caused by a large slice. By spending ~3% of the FPGA's area on Tensor Slices, we observe an average frequency increase of 2.45x and an average area reduction by 0.41x across several ML benchmarks, including a TPU-like design, compared to an Intel Agilex-like baseline FPGA. We also study the impact of spending area on Tensor Slices on non-ML applications, and observe an average reduction of 1% in frequency and an average increase of 1% in routing wirelength compared to the baseline across the non-ML benchmarks we studied. Adding these ML-specific coarse-grained hard blocks makes the proposed FPGA a much more efficient hardware accelerator for ML applications, while still keeping the vast majority of the FPGA's real estate programmable at fine grain.
- Published
- 2021
- Full Text
- View/download PDF
3. Session details: Poster Session I
- Author
-
Vaughn Betz
- Subjects
Multimedia, Computer science, Session (computer science)
- Published
- 2020
- Full Text
- View/download PDF
4. Using OpenCL to Enable Software-like Development of an FPGA-Accelerated Biophotonic Cancer Treatment Simulator
- Author
-
Vaughn Betz, Tanner Young-Schultz, Lothar Lilge, and Stephen J. Brown
- Subjects
Iterative method, Computer science, Pipeline (computing), Monte Carlo method, Parallel computing, Solver, Software, Stratix, Field-programmable gate array, Agile software development
- Abstract
The simulation of light propagation through tissues is important for medical applications, such as photodynamic therapy (PDT) for cancer treatment. To optimize PDT, an inverse problem, which works backwards from a desired distribution of light to the parameters that caused it, must be solved. These problems have no closed-form solution and therefore must be solved numerically using an iterative method. This involves running many forward light propagation simulations, which is time-consuming and computationally intensive. Currently, the fastest general software solver for this problem is FullMonteSW. It models complex 3D geometries with tetrahedral meshes and uses Monte Carlo techniques to model photon interactions with tissues. This work presents FullMonteFPGACL: an FPGA-accelerated version of FullMonteSW using an Intel Stratix 10 FPGA and the Intel FPGA SDK for OpenCL. FullMonteFPGACL has been validated and benchmarked using several models and achieves improvements in performance (4x) and energy-efficiency (11x) over the optimized and multi-threaded FullMonteSW implementation. We discuss methods for extending the design to improve the performance and energy-efficiency ratios to 16x and 17x, respectively. We achieved these gains by developing in an agile fashion using OpenCL to facilitate quick prototyping and hardware-software partitioning. However, achieving competitive area and performance required careful design of the hardware pipeline and expression of its structure in OpenCL. This led to a hybrid design style that can improve productivity when developing complex applications on an FPGA. (A minimal sketch of the inverse-problem loop described above follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
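To make the inverse-problem structure concrete, here is a minimal Python sketch of the loop the abstract describes: an iterative solver that repeatedly runs a forward light-propagation simulation. The toy exponential-attenuation forward model and all function names are stand-ins invented for illustration; they are not part of FullMonteSW or FullMonteFPGACL.

```python
# Hypothetical sketch of an inverse solver that repeatedly calls a forward
# light-propagation model. The forward model below is a toy stand-in for the
# expensive Monte Carlo step that FullMonteFPGACL accelerates.
import numpy as np

DEPTHS = np.linspace(0.0, 2.0, 50)           # cm, sample points in tissue

def forward_model(params):
    """Toy forward model: source power attenuated exponentially with depth."""
    power, mu = params
    return power * np.exp(-mu * DEPTHS)

def solve_inverse(target, params0, iters=200, step=0.05, seed=0):
    """Derivative-free search: propose a perturbation, keep it if the forward
    simulation gets closer to the desired dose. Every iteration needs a full
    forward run, which is why accelerating the forward solver matters."""
    rng = np.random.default_rng(seed)
    params = np.array(params0, dtype=float)
    best = np.linalg.norm(forward_model(params) - target)
    for _ in range(iters):
        trial = params + step * rng.standard_normal(2)
        err = np.linalg.norm(forward_model(trial) - target)
        if err < best:
            params, best = trial, err
    return params

target_dose = forward_model((1.5, 0.8))       # pretend desired dose distribution
print(solve_inverse(target_dose, (1.0, 1.0)))
```

Each candidate parameter set costs a full forward simulation, so the 4x-11x gains on the forward step reported above translate directly into faster treatment planning.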
5. HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs
- Author
-
Vaughn Betz and Mathew Hall
- Subjects
Artificial neural network, Computer science, Interface (computing), Parallel computing, Convolutional neural network, Hardware architecture, Stratix, Compiler, Field-programmable gate array, Throughput, Digital signal processing
- Abstract
We present both a novel Convolutional Neural Network (CNN) accelerator architecture and a network compiler for FPGAs that outperforms all prior work. Instead of having generic processing elements that together process one layer at a time, our network compiler statically partitions available device resources and builds custom-tailored hardware for each layer of a CNN. By building hardware for each layer we can pack our controllers into fewer lookup tables and use dedicated routing. These efficiencies enable our accelerator to utilize 2x the DSPs and operate at more than 2x the frequency of prior work on sparse CNN acceleration on FPGAs. We evaluate the performance of our architecture on both sparse Resnet-50 and dense MobileNet Imagenet classifiers on a Stratix 10 2800 FPGA. We find that the sparse Resnet-50 model achieves a throughput of 4550 images/s at a batch size of 1, which is nearly 4x the throughput of NVIDIA's fastest machine-learning-targeted GPU, the V100, and outperforms all prior work on FPGAs. (A sketch of the per-layer resource-partitioning idea follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
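The following hedged sketch illustrates the resource-partitioning idea in the abstract: give every CNN layer its own hardware, sized in proportion to that layer's work, so all layers run concurrently and throughput is set by the slowest one. The layer MAC counts and clock frequency are made up, the DSP total is roughly that of a Stratix 10 2800, and this illustrates the concept only, not the HPIPE compiler's actual algorithm.

```python
# Illustrative only: proportional DSP allocation across CNN layers and the
# resulting layer-pipelined throughput estimate. Not the HPIPE compiler.

def partition_dsps(layer_macs, total_dsps):
    """Give each layer a DSP share proportional to its MAC count (>= 1)."""
    total = sum(layer_macs)
    return [max(1, round(total_dsps * m / total)) for m in layer_macs]

def pipeline_throughput(layer_macs, alloc, freq_hz, macs_per_dsp_per_cycle=1):
    """With every layer running concurrently on its own hardware, the image
    rate is limited by the layer that needs the most cycles per image."""
    cycles = [m / (d * macs_per_dsp_per_cycle) for m, d in zip(layer_macs, alloc)]
    return freq_hz / max(cycles)

layer_macs = [118e6, 231e6, 57e6, 103e6]      # made-up per-layer MAC counts
alloc = partition_dsps(layer_macs, total_dsps=5760)
print(alloc, f"{pipeline_throughput(layer_macs, alloc, 600e6):.0f} images/s")
```

Balancing the allocation to the per-layer work is what keeps every layer's hardware busy; an unbalanced split would leave the fastest layers idle waiting on the slowest one.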
6. StateMover
- Author
-
Vaughn Betz and Sameh Attia
- Subjects
Computer science, Visibility, Porting, Controllability, Debugging, Overhead (computing), Observability, State (computer science), Field-programmable gate array, Computer hardware
- Abstract
Debugging consumes a large portion of FPGA design time, and with the growing complexity of traditional FPGA systems and the additional verification challenges posed by multiple FPGAs interacting within data centers, debugging productivity is becoming even more important. Current debugging flows either depend on simulation, which is extremely slow but has full visibility, or on hardware execution, which is fast but provides very limited control and visibility. In this paper, we present StateMover, a checkpointing-based debugging framework for FPGAs, which can move design state back and forth between an FPGA and a simulator in a seamless way. StateMover leverages the speed of hardware execution and the full visibility and ease-of-use of a simulator. This enables a novel debugging flow that has a software-like combination of speed with full observability and controllability. StateMover adds minimal hardware to the design to safely stop the design under test so that its state can be extracted or modified in an orderly manner. The added hardware has no timing overhead and a very small area overhead. StateMover currently supports Xilinx UltraScale devices, and its underlying techniques and tools can be ported to other device families that support configuration readback. Moving state between an FPGA and a simulator can be performed in a few seconds even for large FPGAs, enabling a new debugging flow. (A hypothetical outline of this flow follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
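A hypothetical outline of the debugging flow described above, with every object and method a placeholder rather than StateMover's real API: run fast on the FPGA, halt the design safely, read its state back, continue with full visibility in a simulator, and optionally push an edited state back to the hardware.

```python
# Placeholder outline only; none of these objects or methods are StateMover's
# actual interface. It shows the checkpoint-moving flow, not an implementation.

def debug_session(fpga, simulator, design, run_cycles):
    fpga.configure(design.bitstream)
    fpga.run(run_cycles)                   # fast hardware execution
    fpga.freeze()                          # added stop logic halts the DUT safely
    state = fpga.readback_state()          # registers + memories via config readback
    simulator.load(design.rtl)
    simulator.apply_state(state)           # resume from the same point in simulation
    simulator.interactive()                # full observability and controllability
    new_state = simulator.extract_state()  # optionally edit state in the simulator
    fpga.write_state(new_state)            # move it back to hardware
    fpga.resume()
```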
7. Math Doesn't Have to be Hard
- Author
-
Sadegh Yazdanshenas, Vaughn Betz, Mohamed Eldafrawy, and Andrew Boutros
- Subjects
Adder, Artificial neural network, Logic block, Computer science, Carry (arithmetic), Deep learning, Single-precision floating-point format, Computer engineering, Multiplier, Artificial intelligence, Field-programmable gate array
- Abstract
Recent work has shown that using low-precision arithmetic in Deep Neural Network (DNN) inference acceleration can yield large efficiency gains, with little or no accuracy degradation compared to half-precision or single-precision floating-point, by enabling more MAC operations per unit area. The most efficient precision is a complex function of the DNN application, structure and required accuracy, which makes the variable-precision capabilities of FPGAs very valuable. We propose three logic block architecture enhancements to increase the density and reduce the delay of multiply-accumulate (MAC) operations implemented in the soft fabric. Adding another level of carry chain to the ALM (the extra carry chain architecture) leads to a 1.5x increase in MAC density, while ensuring a small impact on general designs as it adds only 2.6% FPGA tile area and a representative critical path delay increase of 0.8%. On the other hand, our highest-impact option, which combines our 4-bit Adder architecture with a 9-bit Shadow Multiplier, increases MAC density by 6.1x, at the cost of larger tile area and representative critical path delay overheads of 16.7% and 9.8%, respectively. (A back-of-envelope comparison of these two options, derived only from these numbers, follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
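As a back-of-envelope comparison, the snippet below normalizes each option's MAC-density gain by its tile-area overhead, using only the numbers quoted in the abstract; the "MACs per unit area" metric is our own simplification, not a figure from the paper.

```python
# Derived only from the figures in the abstract above: compare the two logic
# block enhancements by MAC capability per unit of FPGA area (delay overheads
# are shown for context).
options = {
    "extra carry chain":
        {"density": 1.5, "area_overhead": 0.026, "delay_overhead": 0.008},
    "4-bit adder + 9-bit shadow multiplier":
        {"density": 6.1, "area_overhead": 0.167, "delay_overhead": 0.098},
}
for name, o in options.items():
    per_area = o["density"] / (1.0 + o["area_overhead"])
    print(f"{name}: ~{per_area:.2f}x MACs per unit area, "
          f"+{o['delay_overhead']:.1%} representative critical path delay")
```

By this crude metric the carry-chain option gives roughly a 1.46x gain per unit area and the shadow-multiplier option roughly 5.2x, at the cost of a noticeably larger delay overhead.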
8. HLS-based FPGA Acceleration of Light Propagation Simulation in Turbid Media
- Author
-
Yasmin Afsharnejad, Omar Ragheb, Vaughn Betz, Abdul-Amir Yassine, and Paul Chow
- Subjects
Speedup, FPGA acceleration, Turnaround time, Acceleration, Light propagation, Embedded system, Medicine, Field-programmable gate array, Throughput
- Abstract
Several clinical applications rely on understanding light transport in heterogeneous biological tissues. Researchers usually resort to Monte Carlo (MC) simulations to model the problem accurately. However, MC simulations require acceleration to achieve acceptable turnaround times, which motivates the use of Field-Programmable Gate Arrays (FPGAs) to accelerate the algorithm. Nevertheless, the long cycle of developing and verifying FPGA designs makes it challenging to model realistic tissues accurately and smoothly. To this end, we present a complete and highly optimized MC simulator for light propagation in 3D voxel-based biological tissue representations with floating-point operations, built using High-Level Synthesis (HLS). We provide practical guidelines for utilizing HLS to create efficient structures that help achieve the desired throughput, and we show where future work is needed to improve HLS. We use Vivado to implement the design on a Xilinx Kintex UltraScale FPGA running at 150 MHz. With a design time of 1.5 months, experimental results show a 3x speedup over the fastest software simulator published to date. (A simplified sketch of the underlying Monte Carlo photon loop follows this entry.)
- Published
- 2018
- Full Text
- View/download PDF
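For readers unfamiliar with the underlying algorithm, here is a textbook-style Monte Carlo photon random walk, reduced to a single homogeneous medium with isotropic scattering. The paper's simulator is far more detailed (3D voxelized heterogeneous tissue, anisotropic scattering, deep FPGA pipelines); this sketch only shows the kind of inner loop being accelerated, and the coefficients used are arbitrary.

```python
# Simplified, textbook-style photon-packet random walk in an infinite
# homogeneous medium with isotropic scattering. Not the paper's simulator.
import numpy as np

def simulate_photon(mu_a, mu_s, rng, w_min=1e-4):
    """Track one photon packet; return the total weight absorbed."""
    mu_t = mu_a + mu_s
    pos, w, absorbed = np.zeros(3), 1.0, 0.0
    direction = np.array([0.0, 0.0, 1.0])
    while w > w_min:
        step = -np.log(rng.random()) / mu_t        # sampled free path length
        pos = pos + step * direction
        dw = w * mu_a / mu_t                       # partial absorption at this site
        absorbed += dw
        w -= dw
        cos_t = 2 * rng.random() - 1               # isotropic scattering:
        phi = 2 * np.pi * rng.random()             # new direction uniform on sphere
        sin_t = np.sqrt(1 - cos_t**2)
        direction = np.array([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t])
    return absorbed

rng = np.random.default_rng(1)
print(np.mean([simulate_photon(0.5, 10.0, rng) for _ in range(1000)]))
```

Millions of such independent packets must be traced per simulation, which is why the loop maps well to a deeply pipelined FPGA datapath.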
9. Don't Forget the Memory
- Author
-
Kosuke Tatsumura, Sadegh Yazdanshenas, and Vaughn Betz
- Subjects
Block RAM, Computer science, Magnetic tunnelling, FPGA architecture, Embedded system, Static random-access memory, Architecture, Field-programmable gate array, Transistor sizing
- Abstract
While academic FPGA architecture exploration tools have become sufficiently advanced to enable a wide variety of explorations and optimizations on the soft fabric and routing, support for Block RAM (BRAM) has been very limited. In this paper, we present enhancements to the COFFE transistor sizing tool to facilitate automatic generation and optimization of BRAM for both SRAM and Magnetic Tunnelling Junction technologies. These new capabilities enable investigation of area, delay, and energy trends for various sizes of BRAM or different BRAM technologies. We also validate these trends against available commercial FPGA BRAM data. Furthermore, we demonstrate that BRAMs generated by COFFE can be used to carry out system-level architecture explorations using an area-oriented RAM-mapping flow and the Verilog-To-Routing flow.
- Published
- 2017
- Full Text
- View/download PDF
10. Take the Highway
- Author
-
Andrew Bitar, Vaughn Betz, and Mohamed S. Abdelfattah
- Subjects
Ethernet, Interconnection, Computer science, Interface (computing), JPEG, Embedded system, Network switch, Field-programmable gate array, Image compression
- Abstract
We explore the addition of a fast embedded network-on-chip (NoC) to augment the FPGA's existing wires and switches, and help interconnect large applications. A flexible interface between the FPGA fabric and the embedded NoC allows modules of varying widths and frequencies to transport data over the NoC. We study both latency-insensitive and latency-sensitive design styles and present the constraints for implementing each type of communication on the embedded NoC. Our application case study with image compression shows that an embedded NoC improves frequency by 10-80%, reduces utilization of scarce long wires by 40% and makes design easier and more predictable. Additionally, we leverage the embedded NoC in creating a programmable Ethernet switch that can support up to 819 Gb/s on FPGAs.
- Published
- 2015
- Full Text
- View/download PDF
11. Session details: Technical Session 7: Circuit Design
- Author
-
Vaughn Betz
- Subjects
Computer architecture, Computer science, Circuit design, Session (computer science), Field-programmable gate array
- Published
- 2015
- Full Text
- View/download PDF
12. Efficient and programmable Ethernet switching with a NoC-enhanced FPGA
- Author
-
Jeffrey Cassidy, Andrew Bitar, Natalie Enright Jerger, and Vaughn Betz
- Subjects
Ethernet, Computer science, Network packet, Packet injection, Network on a chip, Embedded system, Scalability, Network switch, Transceiver, Field-programmable gate array
- Abstract
Communications systems make heavy use of FPGAs; their programmability allows system designers to keep up with emerging protocols and their high-speed transceivers enable high bandwidth designs. While FPGAs are extensively used for packet parsing, inspection and classification, they have seen less use as the switch fabric between network ports. However, recent work has proposed embedding a network-on-chip (NoC) as a new "hard" resource on FPGAs, and we show that by properly leveraging such a NoC one can create a very efficient yet still highly programmable network switch. We compare a NoC-based 16×16 network switch for 10-Gigabit Ethernet traffic to a recent innovative FPGA-based switch fabric design. The NoC-based switch not only consumes 5.8× less logic area, but also reduces latency by 8.1×. We also show that using the FPGA's programmable interconnect to adjust the packet injection points into the NoC leads to significant performance improvements. A routing algorithm tailored to this application is shown to further improve switch performance and scalability. Overall, we show that an FPGA with a low-cost hard 64-node mesh NoC with 64-bit links can support a 16×16 switch with up to 948 Gbps in aggregate bandwidth, roughly matching the transceiver bandwidth on the latest FPGAs.
- Published
- 2014
- Full Text
- View/download PDF
13. CAD and routing architecture for interposer-based multi-FPGA systems
- Author
-
Vaughn Betz and Andre Hahn Pereira
- Subjects
Computer science, Dice, Signal, Die (integrated circuit), Embedded system, Interposer, Routing (electronic design automation), Field-programmable gate array, Electronic circuit
- Abstract
Interposer-based multi-FPGA systems are composed of multiple FPGA dice connected through a silicon interposer. Such devices allow larger FPGA systems to be built than one monolithic die can accommodate and are now commercially available. An open question, however, is how efficient such systems are compared to a monolithic FPGA, as the number of signals passing between dice is reduced and the signal delay between dice is increased in an interposer system vs. a monolithic FPGA. We create a new version of VPR to investigate the architecture of such systems, and show that by modifying the placement cost function to minimize the number of signals that must cross between dice we can reduce routing demand by 18% and delay by 2%. We also show that the signal count between dice and the signal delay between dice are key architecture parameters for interposer-based FPGA systems. We find that if an interposer supplies (between dice) 60% of the routing capacity that the normal (within-die) FPGA routing channels supply, there is little impact on the routability of circuits. Smaller routing capacities in the interposer do impact routability however: minimum channel width increases by 20% and 50% when an interposer supplies only 40% and 30% of the within-die routing, respectively. The interposer also impacts delay, increasing circuit delay by 34% on average for a 1 ns interposer signal delay and a four-die system. Reducing the interposer delay has a greater benefit in improving circuit speed than does reducing the number of dice in the system. (A toy version of such a crossing-aware placement cost follows this entry.)
- Published
- 2014
- Full Text
- View/download PDF
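A toy illustration of the placement-cost idea in the abstract: add a penalty for every die boundary a net must cross on top of a conventional bounding-box wirelength term, so the placer keeps tightly connected blocks on the same die. The cost form, the die model and the weight alpha are invented for illustration and are not the modified VPR cost function itself.

```python
# Illustrative crossing-aware placement cost: bounding-box wirelength plus a
# penalty per inter-die crossing a net requires. Not VPR's actual cost model.

def die_of(x, die_width):
    """Assume dice are tiled side by side along x, each die_width columns wide."""
    return x // die_width

def placement_cost(nets, placement, die_width, alpha=2.0):
    """nets: list of lists of block names; placement: block -> (x, y)."""
    cost = 0.0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        hpwl = (max(xs) - min(xs)) + (max(ys) - min(ys))   # half-perimeter wirelength
        dice = {die_of(x, die_width) for x in xs}
        crossings = len(dice) - 1                          # die boundaries this net spans
        cost += hpwl + alpha * crossings
    return cost

nets = [["a", "b"], ["b", "c", "d"]]
placement = {"a": (3, 5), "b": (10, 5), "c": (12, 7), "d": (30, 2)}
print(placement_cost(nets, placement, die_width=20))
```

Raising alpha trades a little extra within-die wirelength for fewer signals competing for the scarce interposer connections.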
14. Quantifying the cost and benefit of latency insensitive communication on FPGAs
- Author
-
Kevin E. Murray and Vaughn Betz
- Subjects
Cost–benefit analysis, Computer science, Interfacing, Embedded system, Latency (engineering), Timing closure, Field-programmable gate array, Implementation
- Abstract
Latency insensitive communication offers many potential benefits for FPGA designs, including easier timing closure by enabling automatic pipelining, and easier interfacing with embedded NoCs. However, it is important to understand the costs and trade-offs associated with any new design style. This paper presents optimized implementations of latency insensitive communication building blocks, quantifies their overheads in terms of area and frequency, and provides guidance to designers on how to generate high-speed and area-efficient latency insensitive systems. (A small behavioral model of a latency-insensitive stage follows this entry.)
- Published
- 2014
- Full Text
- View/download PDF
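As a rough behavioral model of the building blocks discussed above, the class below imitates a latency-insensitive (ready/valid) pipeline stage with a two-deep buffer: it can keep accepting data for a cycle after the consumer stalls, which is what allows such stages to be chained and pipelined freely. This is a software illustration of the handshake concept only, not the optimized RTL implementations the paper measures.

```python
# Behavioral model of a ready/valid stage with a 2-deep buffer. Illustration
# of the latency-insensitive handshake, not the paper's hardware blocks.
from collections import deque

class LatencyInsensitiveStage:
    def __init__(self, depth=2):
        self.buf = deque()
        self.depth = depth

    def ready(self):
        """Can the producer send data this cycle?"""
        return len(self.buf) < self.depth

    def cycle(self, upstream_valid, upstream_data, downstream_ready):
        """One clock tick: accept from upstream if space remains, then present
        data downstream. Returns (valid, data) seen by the consumer."""
        if upstream_valid and self.ready():
            self.buf.append(upstream_data)
        if self.buf and downstream_ready:
            return True, self.buf.popleft()
        return (len(self.buf) > 0), (self.buf[0] if self.buf else None)

stage = LatencyInsensitiveStage()
print(stage.cycle(True, "x0", downstream_ready=False))  # buffered, consumer stalled
print(stage.cycle(True, "x1", downstream_ready=True))   # consumer takes "x0"
```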
15. Are FPGAs suffering from the innovator's dilemma?
- Author
-
Jason Cong and Vaughn Betz
- Subjects
Dilemma, Creative destruction, Margin (finance), Computer science, Profit margin, Cash flow, Industrial organization, Barriers to entry
- Abstract
FPGAs constitute a highly profitable industry, with approximately $5 billion of sales per year. High barriers to entry keep most companies away, and enable high profit margins for the incumbents. The industry has grown greatly over the years, but still constitutes a small portion of the overall semiconductor market. This raises the question to be addressed by this panel: is the FPGA community innovating as much as it should, or is a bias to maintain high profit margins and protect the cash flow of the current FPGA market holding us back from exploring new ideas and products that could greatly expand the appeal of and market for FPGA-related technology? This would be a classic case of the innovator's dilemma defined by Clayton Christensen: it is difficult for a company to engage in creative destruction of a cash cow product. Our distinguished panel of experts will discuss whether we are seeing major innovation in architectures, design flows and applications, or only incremental improvements. We will also discuss if any new (possibly low margin) application domain is left out by the FPGA industry, what radical ideas should be explored, and whether large incumbents, new startups, academia or some combination are best able to attack these new areas.
- Published
- 2013
- Full Text
- View/download PDF
16. Comparing FPGA vs. custom cmos and the impact on processor microarchitecture
- Author
-
Jonathan Rose, Henry Wong, and Vaughn Betz
- Subjects
Multi-core processor, Adder, CMOS, Computer science, Embedded system, Pipeline (computing), Field-programmable gate array, Multiplexer, Microarchitecture
- Abstract
As soft processors are increasingly used in diverse applications, there is a need to evolve their microarchitectures in a way that suits the FPGA implementation substrate. This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates. We then use the results of these comparisons to infer how the microarchitecture of soft processors on FPGAs should be different from hard processors on custom CMOS. We find that the ratios of the area required by an FPGA to that of custom CMOS for different building blocks vary significantly more than the speed ratios. As area is often a key design constraint in FPGA circuits, area ratios have the most impact on microarchitecture choices. Complete processor cores have area ratios of 17-27x and delay ratios of 18-26x. Building blocks that have dedicated hardware support on FPGAs, such as SRAMs, adders, and multipliers, are particularly area-efficient (2-7x area ratio), while multiplexers and CAMs are particularly area-inefficient (>100x area ratio), leading to cheaper ALUs, larger caches of low associativity, and more expensive bypass networks than on similar hard processors. We also find that a low delay ratio for pipeline latches (12-19x) suggests soft processors should have pipeline depths 20% greater than hard processors of similar complexity.
- Published
- 2011
- Full Text
- View/download PDF
17. A comprehensive approach to modeling, characterizing and optimizing for metastability in FPGAs
- Author
-
David Neto, Deshanand Singh, Ryan Fung, Jeffrey Christopher Chromczak, David Lewis, Doris Chen, and Vaughn Betz
- Subjects
Digital electronics, Mean time between failures, Computer science, Transistor, CAD, Asynchronous communication, Metastability, Embedded system, Electronic engineering, Field-programmable gate array
- Abstract
Metastability is a phenomenon that can cause system failures in digital circuits. It may occur whenever signals are being transmitted across asynchronous or unrelated clock domains. The impact of metastability is increasing as process geometries shrink and supply voltages drop faster than transistor Vts. FPGA technologies are significantly affected, since leading-edge FPGAs are amongst the first devices to adopt the most recent process nodes. In this paper, we present a comprehensive suite of techniques for modeling, characterizing and optimizing metastability effects in FPGAs. We first discuss a theoretical model of metastability, and verify the predictions using both circuit-level simulations and board measurements. Next we show how designers have traditionally dealt with metastability problems and contrast that with the automatic CAD algorithms described in this paper that both analyze and optimize metastability-related issues. Through our detailed experimental results, we show that we can improve the metastability characteristics of a large suite of industrial benchmarks by an average of 268,000 times with our optimization techniques. (The standard first-order MTBF model is sketched after this entry for context.)
- Published
- 2010
- Full Text
- View/download PDF
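For context, the widely used first-order model of synchronizer mean time between failures (MTBF) is shown below; the paper develops, validates and applies a more detailed model, so treat this only as the standard starting point. All numeric values in the example are made up.

```python
# Standard first-order synchronizer MTBF model, for background only; this is
# not the more detailed model developed in the paper, and the numbers are
# illustrative rather than characterized values.
import math

def synchronizer_mtbf(t_settle, tau, t0, f_clk, f_data):
    """t_settle: slack available for the metastable node to resolve (s)
    tau:      regeneration time constant of the latch (s)
    t0:       metastability window parameter (s)
    f_clk, f_data: clock frequency and asynchronous data toggle rate (Hz)"""
    return math.exp(t_settle / tau) / (t0 * f_clk * f_data)

# Adding one more synchronizer stage, i.e. roughly one extra clock period of
# settling time, multiplies MTBF by exp(T_clk / tau): an exponential payoff.
print(synchronizer_mtbf(t_settle=2e-9, tau=20e-12, t0=1e-10,
                        f_clk=500e6, f_data=100e6))
```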
18. High-quality, deterministic parallel placement for FPGAs on commodity hardware
- Author
-
Ketan Padalia, Vaughn Betz, and Adrian Ludwin
- Subjects
Computer science, Process (computing), CAD, Parallel computing, Reduction (complexity), Software, Synchronization (computer science), Overhead (computing), Field-programmable gate array
- Abstract
In this paper, we describe the application of two parallelization strategies to the Quartus II FPGA placer. The first uses a pipelining approach and achieves speedups of 1.3x on two processing cores. The second uses a parallel moves approach and achieves speedups of 2.2x on four cores. Unlike all previous parallel moves algorithms, ours is deterministic and always gives the same answer as the serial version of the algorithm, without any significant reduction in performance. We also describe a process to quantify multi-core performance effects, such as memory subsystem limitations and explicit synchronization overhead, and fully describe these effects on a CAD tool for the first time. Memory limitations alone are found to cost up to 35% of total runtime. Unlike previous algorithms, our algorithms have negligible explicit synchronization overhead. These results are relevant both to CAD designers and to any developers seeking to parallelize existing software. (A generic sketch of deterministic parallel-move evaluation follows this entry.)
- Published
- 2008
- Full Text
- View/download PDF
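The sketch below illustrates the general "deterministic parallel moves" idea: evaluate a batch of candidate moves concurrently, then commit them one at a time in a fixed order, re-evaluating any move whose parallel evaluation may have been invalidated by an earlier commit. The conflict test is left to the caller, and this is a generic illustration rather than the Quartus II placer's actual algorithm.

```python
# Generic illustration of deterministic parallel-move evaluation in an
# annealing-style placer. Not the algorithm described in the paper.
from concurrent.futures import ThreadPoolExecutor

def anneal_batch(placement, moves, evaluate, accept, conflicts):
    """moves: ordered list of candidate moves; evaluate(placement, move) -> delta cost;
    accept(delta) -> bool; conflicts(move, committed) -> True if the move's
    parallel evaluation may be stale (e.g. it shares a block or a net with an
    already-committed move). Determinism comes from committing in list order,
    regardless of which worker finishes its evaluation first."""
    with ThreadPoolExecutor() as pool:
        deltas = list(pool.map(lambda m: evaluate(placement, m), moves))
    committed = []
    for move, delta in zip(moves, deltas):
        if conflicts(move, committed):
            delta = evaluate(placement, move)   # redo with up-to-date placement
        if accept(delta):
            block, new_loc = move
            placement[block] = new_loc
            committed.append(move)
    return placement
```

The speedup comes from doing the expensive cost evaluations in parallel; correctness (serial equivalence) rests entirely on a sound conflict test and a fixed commit order.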
19. Session details: CAD
- Author
-
Vaughn Betz
- Subjects
Computer science, CAD, Session (computer science), Field-programmable gate array
- Published
- 2007
- Full Text
- View/download PDF
20. Power-aware RAM mapping for FPGA embedded memory blocks
- Author
-
David Neto, Vaughn Betz, Thiagaraja Gopalsamy, and Russell Tessier
- Subjects
Memory management, Computer science, Embedded system, Interleaved memory, Registered memory, Semiconductor memory, Memory refresh, Memory map, Computer memory, Extended memory
- Abstract
Embedded memory blocks are important resources in contemporary FPGA devices. When targeting FPGAs, application designers often specify high-level memory functions which exhibit a range of sizes and control structures. These logical memories must be mapped to FPGA embedded memory resources such that physical design objectives are met. In this work, a set of power-aware logical-to-physical RAM mapping algorithms is described that converts user-defined memory specifications into on-chip FPGA memory block resources. These algorithms minimize RAM dynamic power by evaluating a range of possible embedded memory block mappings and selecting the most power-efficient choice. Our automated approach has been integrated into a commercial FPGA compiler and tested with 40 large FPGA benchmarks. Through experimentation, we show that, on average, embedded memory dynamic power can be reduced by 21% and overall core dynamic power can be reduced by 7% with a minimal loss (1%) in design performance. (A toy version of this power-aware mapping search follows this entry.)
- Published
- 2006
- Full Text
- View/download PDF
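A toy version of power-aware logical-to-physical RAM mapping: enumerate the physical block configurations that can realize a logical memory, estimate relative dynamic power for each, and keep the cheapest. The block shapes and the power model below are illustrative stand-ins, not the commercial compiler's data; the effect captured is that wide, shallow mappings activate more blocks per access while deep cascades add decode overhead.

```python
# Illustrative power-aware RAM mapping search; block configurations and the
# relative power model are invented for this sketch.
import math

BLOCK_CONFIGS = [(512, 36), (1024, 18), (2048, 9), (4096, 4)]   # (depth, width)

def candidate_mappings(depth, width):
    """Yield (config, rows, cols): rows of blocks cascaded in depth,
    cols of blocks tiled side by side to cover the word width."""
    for block_depth, block_width in BLOCK_CONFIGS:
        rows = math.ceil(depth / block_depth)
        cols = math.ceil(width / block_width)
        yield (block_depth, block_width), rows, cols

def estimated_power(rows, cols):
    """Crude relative model: only one depth-row of blocks is enabled per
    access (so 'cols' blocks toggle), and deeper cascades add decode/mux energy."""
    return cols + 0.1 * rows

def best_mapping(depth, width):
    best = None
    for cfg, rows, cols in candidate_mappings(depth, width):
        power = estimated_power(rows, cols)
        if best is None or power < best[-1]:
            best = (cfg, rows * cols, power)
    return best

print(best_mapping(depth=3000, width=16))   # -> config, total blocks, relative power
```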
21. Session details: Will power kill FPGAs?
- Author
-
Vaughn Betz, Jan M. Rabaey, Michael D. Hutton, Ronnie Vasishta, Steven K. Knapp, and Gary Scott Delp
- Subjects
Computer science, Embedded system, Session (computer science), Field-programmable gate array, Power
- Published
- 2006
- Full Text
- View/download PDF
22. Session details: FPGA circuit design and layout
- Author
-
Vaughn Betz
- Subjects
Computer science, Circuit design, Session (computer science), Field-programmable gate array, Computer hardware
- Published
- 2005
- Full Text
- View/download PDF
23. The Stratix II logic and routing architecture
- Author
-
Mark Bourgeault, David Galloway, Gregg William Baeckler, Elias Ahmed, Andy L. Lee, David Lewis, Boris Ratchev, Sandy Marquardt, Christopher F. Lane, Richard G. Cliff, Srinivas T. Reddy, Bruce B. Pedersen, Michael D. Hutton, Jay Schleicher, Paul Leventis, Giles Powell, Cameron McClintock, Kevin Stevens, David Cashman, Jonathan Rose, Ketan Padalia, Vaughn Betz, and Richard Yuan
- Subjects
Computer science, Multiplexer, Reduction (complexity), Embedded system, Stratix, Lookup table, Routing (electronic design automation), Performance improvement, Field-programmable gate array, Process migration
- Abstract
This paper describes the Altera Stratix II™ logic and routing architecture. This architecture features a novel adaptive logic module (ALM) that is based on a 6-LUT, but can be partitioned into two smaller LUTs to efficiently implement circuits containing a range of LUT sizes that arises in conventional synthesis flows. This provides a performance increase of 15% in the Stratix II architecture while reducing area by 2%. The ALM also includes a more powerful arithmetic structure that can perform two bits of arithmetic per ALM, and perform a sum of up to three inputs. The routing fabric adds a new set of fast inputs to the routing multiplexers for another 3% improvement in performance, while other improvements in routing efficiency cause another 6% reduction in area. These changes in combination with other circuit and architecture changes in Stratix II contribute 27% of an overall 51% performance improvement (including architecture and process improvement). The architecture changes reduce area by 10% in the same process, and by 50% after including process migration.
- Published
- 2005
- Full Text
- View/download PDF
24. The Stratix™ routing and logic architecture
- Author
-
Srinivas T. Reddy, David Lewis, Paul Leventis, Giles Powell, Richard G. Cliff, Christopher F. Lane, Chris Wysocki, Sandy Marquardt, Cameron McClintock, Andy L. Lee, David Jefferson, Bruce B. Pedersen, Jonathan Rose, and Vaughn Betz
- Subjects
Programmable logic device, Programmable Array Logic, Logic synthesis, Computer architecture, Computer science, Logic gate, Logic family, Logic optimization, Register-transfer level
- Abstract
This paper describes the Altera Stratix logic and routing architecture. The primary goals of the architecture were to achieve high performance and logic density. We give an overview of the entire device, and then focus on the logic and routing architecture. The Stratix logic architecture is based on a cluster of ten 4-input LUTs, and its routing consists of staggered routing lines. We describe the development of the routing architecture, including its directional bias and its direct-drive routing, which reduce both area and delay. The logic array block and logic cell designs are also described, along with new routing structures within the logic array block and new logic element features.
- Published
- 2003
- Full Text
- View/download PDF
25. Session details: Architecture Analysis and Automation
- Author
-
Vaughn Betz
- Subjects
Computer architecture, Computer science, Session (computer science), Architecture, Field-programmable gate array, Automation
- Published
- 2002
- Full Text
- View/download PDF
26. Automatic generation of FPGA routing architectures from high-level descriptions
- Author
-
Vaughn Betz and Jonathan Rose
- Subjects
Computer science, Reconfigurable computing, Computer architecture, Embedded system, Routing (electronic design automation), Architecture, Field-programmable gate array, FPGA prototype
- Abstract
In this paper we present a “high-level” FPGA architecture description language which lets FPGA architects succinctly and quickly describe an FPGA routing architecture. We then present an “architecture generator” built into the VPR CAD tool [1, 2] that converts this high-level architecture description into a detailed and completely specified flat FPGA architecture. This flat architecture is the representation with which CAD optimization and visualization modules typically work. By allowing FPGA researchers to specify an architecture at a high level, an architecture generator enables quick and easy “what-if” experimentation with a wide range of FPGA architectures. The net effect is a more fully optimized final FPGA architecture. In contrast, when FPGA architects are forced to use more traditional methods of describing an FPGA (such as the manual specification of every switch in the basic tile of the FPGA), far less experimentation can be performed in the same time, and the architectures experimented upon are likely to be highly similar, leaving important parts of the design space completely unexplored. This paper describes the automated routing architecture generation problem, and highlights the two key difficulties: creating an FPGA architecture that matches all of an FPGA architect's specifications, while simultaneously determining good values for the many unspecified portions of an FPGA so that a high quality FPGA results. We describe the method by which we generate FPGA routing architectures automatically, and present several examples.
- Published
- 2000
- Full Text
- View/download PDF
27. Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density
- Author
-
Alexander Marquardt, Jonathan Rose, and Vaughn Betz
- Subjects
Virtex, Computer science, Embedded system, CAD flow, Cluster, FLEX, Parallel computing, Field-programmable gate array
- Abstract
In 1999, most commercial FPGAs, like the Altera FLEX and Xilinx Virtex families, already had cluster-based logic blocks. However, the modeling and evaluation of these sorts of architectures was still in its infancy. In the previous year, Betz had shown that cluster-based logic blocks led to improved density. The real advantage of cluster-based logic blocks, though, was speed, as this paper demonstrates. In doing so, this paper opened up an entirely new research area, setting the framework for numerous packing algorithms that have become a fundamental part of any FPGA CAD flow.
- Published
- 1999
- Full Text
- View/download PDF
28. FPGA routing architecture
- Author
-
Jonathan Rose and Vaughn Betz
- Subjects
Router, Pass transistor logic, Logic block, Computer science, Transistor, Parallel computing, Segmentation, Routing (electronic design automation), Field-programmable gate array
- Abstract
In this work we investigate the routing architecture of FPGAs, focusing primarily on determining the best distribution of routing segment lengths and the best mix of pass transistor and tri-state buffer routing switches. While most commercial FPGAs contain many length 1 wires (wires that span only one logic block) we find that wires this short lead to FPGAs that are inferior in terms of both delay and routing area. Our results show instead that it is best for FPGA routing segments to have lengths of 4 to 8 logic blocks. We also show that 50% to 80% of the routing switches in an FPGA should be pass transistors, with the remainder being tri-state buffers. Architectures that employ the best segmentation distributions and the best mixes of pass transistor and tri-state buffer switches found in this paper are not only 11% to 18% faster than a routing architecture very similar to that of the Xilinx XC4000X but also considerably simpler. These results are obtained using an architecture investigation infrastructure that contains a fully timing-driven router and detailed area and delay models.
- Published
- 1999
- Full Text
- View/download PDF
29. The 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '14, Monterey, CA, USA - February 26 - 28, 2014
- Author
-
Vaughn Betz and George A. Constantinides
- Published
- 2014
- Full Text
- View/download PDF
30. The 2013 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '13, Monterey, CA, USA, February 11-13, 2013
- Author
-
Brad L. Hutchings and Vaughn Betz
- Published
- 2013
- Full Text
- View/download PDF