27 results for "Todd Austin"
Search Results
2. Cyclone
- Author
-
Pranav Kumar, Mohit Tiwari, Todd Austin, Shijia Wei, Austin Harris, and Prateek Sahu
- Subjects
Address space, Computer science, Bandwidth (signal processing), Speculative execution, Interference, Isolation, State (computer science), Cache, Computer network, Communication channel
- Abstract
Micro-architectural units like caches are notorious for leaking secrets across security domains. An attacker program can contend for on-chip state or bandwidth, and can even use speculative execution in processors to drive this contention; protecting against all contention-driven attacks is exceptionally challenging. Prior works mitigate contention channels through caches by partitioning the larger, lower-level caches or by looking for anomalous performance or contention behavior. Neither scales to the large number of fine-grained domains required by browsers and web services that place many domains within the same address space. We observe that cache contention channels have a unique property: contention leaks information only when it is cyclic, i.e., domain A interferes with domain B, followed by interference from B back to A. We propose to use this cyclic-interference property to detect micro-architectural attacks as anomalous cyclic interference. Unlike partitioning, our detection approach scales to many concurrent domains in a single address space; and unlike prior anomaly detectors, cyclic interference is robust to noise from benign interference. We track cyclic interference using non-intrusive detectors in an out-of-order core and stress test our prototype, Cyclone, with fine-grained isolation in browsers (against speculation-driven attacks) and coarse-grained isolation of cores (against covert channels embedded in database and machine-learning workloads). Full-system simulations on an ARM micro-architecture show close to perfect detection rates and 260-1000× lower false positives than using (state-of-the-art) contention alone, with slowdowns of only ~3.6%. (A brief illustrative sketch of the cyclic-interference check follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
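To make the cyclic-interference idea in the abstract above concrete, here is a minimal software sketch. The structure and names are hypothetical; the actual Cyclone detectors are non-intrusive hardware counters in an out-of-order core, not software. The sketch flags a pair of domains only when interference flows in both directions within an observation window.

```python
from collections import defaultdict

class CyclicInterferenceDetector:
    """Toy model of the cyclic-interference check: a pair of domains is
    suspicious only if A evicts B's state *and* B later evicts A's state
    within the same observation window (a deliberate simplification)."""

    def __init__(self):
        # interference[(a, b)] counts evictions where domain a displaced domain b
        self.interference = defaultdict(int)

    def record_eviction(self, evictor, victim):
        if evictor != victim:
            self.interference[(evictor, victim)] += 1

    def cyclic_pairs(self, threshold=1):
        """Return domain pairs with interference in both directions."""
        pairs = set()
        for (a, b), count in self.interference.items():
            if count >= threshold and self.interference.get((b, a), 0) >= threshold:
                pairs.add(frozenset((a, b)))
        return pairs

    def end_window(self):
        self.interference.clear()

# One-way interference (benign sharing) is ignored; two-way (cyclic)
# interference is reported as anomalous.
det = CyclicInterferenceDetector()
det.record_eviction("A", "B")   # A displaces B's cache line
det.record_eviction("B", "A")   # B displaces A's line -> cycle
det.record_eviction("C", "A")   # one-way only -> benign
print(det.cyclic_pairs())       # {frozenset({'A', 'B'})}
```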
3. Morpheus
- Author
-
Salessawi Ferede Yitbarek, Baris Kasikci, Mohit Tiwari, Misiker Tadesse Aga, Austin Harris, Valeria Bertacco, Zhixing Xu, Todd Austin, Sharad Malik, Zelalem Birhanu Aweke, Mark Gallagher, Shibo Chen, and Lauren Biernacki
- Subjects
Hardware architecture, Computer science, Encryption, Computer security, Security testing, Pointer (computer programming), Systems design, Moving target defense, Architecture
- Abstract
Attacks often succeed by abusing the gap between program-level and machine-level semantics: for example, by locating a sensitive pointer, exploiting a bug to overwrite this sensitive data, and hijacking the victim program's execution. In this work, we take secure system design on the offensive by continuously obfuscating information that attackers need but normal programs do not use, such as the representation of code and pointers or the exact location of code and data. Our secure hardware architecture, Morpheus, combines two powerful protections: ensembles of moving target defenses and churn. Ensembles of moving target defenses randomize key program values (e.g., relocating pointers and encrypting code and pointers), which forces attackers to extensively probe the system prior to an attack. To ensure attack probes fail, the architecture incorporates churn to transparently re-randomize program values underneath the running system. With frequent churn, systems quickly become impractically difficult to penetrate. We demonstrate Morpheus through a RISC-V-based prototype designed to stop control-flow attacks. Each moving target defense in Morpheus uses hardware support to individually offer more randomness at a lower cost than previous techniques. When ensembled with churn, Morpheus defenses offer strong protection against control-flow attacks, with our security testing and performance studies revealing: i) high-coverage protection for a broad array of control-flow attacks, including protections for advanced attacks and an attack disclosed after the design of Morpheus, and ii) negligible performance impacts (1%) with churn periods up to 50 ms, which our study estimates to be at least 5000x faster than the time necessary to possibly penetrate Morpheus. (A toy illustration of the churn idea follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
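A toy illustration of the ensemble-plus-churn idea described above. The names and the XOR encoding are purely hypothetical; the real Morpheus defenses are hardware mechanisms operating on pointer and code representations. The point is only that values an attacker would need are kept encoded under a key that is re-randomized every churn period, so anything probed earlier goes stale.

```python
import secrets
import threading

class ChurningPointerEncoder:
    """Toy model of a moving-target defense with churn: pointers are stored
    XOR-encoded under a key that is re-randomized every churn period."""

    def __init__(self):
        self._key = secrets.randbits(64)
        self._lock = threading.Lock()

    def encode(self, ptr):
        with self._lock:
            return ptr ^ self._key

    def decode(self, enc_ptr):
        with self._lock:
            return enc_ptr ^ self._key

    def churn(self, live_pointers):
        """Re-randomize the key and transparently re-encode live pointers,
        so any value an attacker probed earlier is now useless."""
        with self._lock:
            new_key = secrets.randbits(64)
            reencoded = [p ^ self._key ^ new_key for p in live_pointers]
            self._key = new_key
            return reencoded

# Usage: after churn, the previously observed encoded value no longer
# corresponds to the live key, but legitimate decoding still works.
enc = ChurningPointerEncoder()
p = enc.encode(0xDEADBEEF)
[p] = enc.churn([p])
assert enc.decode(p) == 0xDEADBEEF
```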
4. Vulnerability-tolerant secure architectures
- Author
-
Todd Austin, Sharad Malik, Mohit Tiwari, Valeria Bertacco, and Baris Kasikci
- Subjects
Computer science, Vulnerability, Computer security, Immune system, Software bug, Systems design, State (computer science), Speculation
- Abstract
Today, secure systems are built by identifying potential vulnerabilities and then adding protections to thwart the associated attacks. Unfortunately, the complexity of today's systems makes it impossible to prove that all attacks are stopped, so clever attackers find a way around even the most carefully designed protections. In this article, we take a sobering look at the state of secure system design and ask why the "security arms race" never ends. The answer lies in our inability to develop adequate security verification technologies. We then examine an advanced defensive system in nature, the human immune system, and we discover that it does not remove vulnerabilities; rather, it adds offensive measures to protect the body when its vulnerabilities are penetrated. We close the article with brief speculation on how the human immune system could inspire more capable secure system designs.
- Published
- 2018
- Full Text
- View/download PDF
5. SWAN
- Author
-
Pete Ehrett, Timothy Linscott, Todd Austin, and Valeria Bertacco
- Subjects
Computer science, Overhead (engineering), Ambiguity, Trojan, Decoy, Computer hardware
- Abstract
For the past decade, security experts have warned that malicious engineers could modify hardware designs to include hardware back-doors (trojans), which, in turn, could grant attackers full control over a system. Proposed defenses to detect these attacks have been outpaced by the development of increasingly small, but equally dangerous, trojans. To thwart trojan-based attacks, we propose a novel architecture that maps the security-critical portions of a processor design to a one-time-programmable, LUT-free fabric. The programmable fabric is automatically generated by analyzing the HDL of the targeted modules. We present our tools to generate the fabric and to map functionally equivalent designs onto it. By having a trusted party randomly select a mapping and configure each chip, we prevent an attacker from knowing the physical location of targeted signals at manufacturing time. In addition, we provide decoy options (canaries) for the mapping of security-critical signals, such that hardware trojans hitting a decoy are thwarted and exposed. Using this defense approach, any trojan capable of analyzing the entire configurable fabric must employ complex logic functions with a large silicon footprint, thus exposing it to detection by inspection. We evaluated our solution on a RISC-V BOOM processor and demonstrated that, by providing the ability to map each critical signal to 6 distinct locations on the chip, we can reduce the chance of attack success by an undetectable trojan by 99%, incurring only a 27% area overhead. (A back-of-the-envelope model of the randomized-mapping argument follows this entry.)
- Published
- 2018
- Full Text
- View/download PDF
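The toy Monte-Carlo model below (hypothetical, not the SWAN toolflow) captures only the guessing component of the randomized mapping: a trojan hard-wired to a single location hits the live signal roughly 1 time in 6 and otherwise lands where a decoy would expose it. The paper's reported 99% reduction reflects the full defense, not just this guessing probability.

```python
import random

def attack_success_rate(num_locations=6, trials=100_000, rng=random):
    """Monte-Carlo estimate of a fixed-location trojan's success probability
    when the trusted party picks one of `num_locations` mappings per chip."""
    trojan_target = 0  # attacker commits to one physical location
    hits = sum(rng.randrange(num_locations) == trojan_target
               for _ in range(trials))
    return hits / trials

# Roughly 1/6 (about 16.7%) of chips are hit by the naive trojan; on the
# rest it touches a decoy location and can be exposed.
print(f"~{attack_success_rate():.1%} success per chip")
```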
6. Reducing the overhead of authenticated memory encryption using delta encoding and ECC memory
- Author
-
Salessawi Ferede Yitbarek and Todd Austin
- Subjects
Random access memory, Delta encoding, Computer science, Encryption, ECC memory, Memory management, Embedded system, Error detection and correction, DRAM, Parity bit
- Abstract
Data stored in an off-chip memory, such as DRAM or non-volatile main memory, can potentially be extracted or tampered with by an attacker with physical access to a device. Protecting against such attacks requires storing message authentication codes (MACs) and counters, which incur a 22% storage overhead. In this work, we propose techniques for reducing these overheads. We first present a scheme that leverages ECC DRAMs to reduce MAC verification and storage overheads. We replace the parity bits in standard ECC with a combination of MAC and parity bits to provide both authentication and error correction. This eliminates the extra MAC storage and minimizes the verification overhead, as MACs can be read in parallel with data through the ECC bus. Next, we use efficient integer encodings to reduce the counter storage overhead by 6× while enhancing application performance. (A small sketch of the counter-compression idea follows this entry.)
- Published
- 2018
- Full Text
- View/download PDF
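A minimal sketch of the counter-compression idea, using an assumed base-plus-delta layout (the paper's exact integer encoding may differ): per-block counters in a region tend to cluster near a common value, so storing one wide base plus small deltas shrinks the counter storage.

```python
def delta_encode_counters(counters, delta_bits=8):
    """Encode per-block write counters as (base, small deltas).
    Returns None if any delta overflows the compact field, in which case a
    full-width encoding would be needed (a simplification of the scheme)."""
    base = min(counters)
    deltas = [c - base for c in counters]
    if max(deltas) >= (1 << delta_bits):
        return None  # deltas too large to compress
    return base, deltas

def delta_decode_counters(encoded):
    base, deltas = encoded
    return [base + d for d in deltas]

# 64 block counters for a 4 KB region: one 64-bit base plus 64 x 8-bit deltas
# is 576 bits instead of 64 x 64 = 4096 bits, roughly the order of compression
# behind a multi-fold counter-storage reduction.
counters = [1000 + i % 5 for i in range(64)]
enc = delta_encode_counters(counters)
assert delta_decode_counters(enc) == counters
```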
7. Regaining Lost Cycles with HotCalls
- Author
-
Valeria Bertacco, Todd Austin, and Ofir Weisse
- Subjects
Hardware security module, Speedup, Computer science, Cryptography, Cloud computing, Encryption, System call, Server, Operating system
- Abstract
Intel's SGX secure execution technology allows running computations on secret data using untrusted servers. While recent work has shown how to port applications and large-scale computations to run under SGX, the performance implications of using the technology remain an open question. We present the first comprehensive quantitative study of SGX performance. We show that straightforward use of the SGX library primitives for calling functions adds between 8,200 and 17,000 cycles of overhead, compared to 150 cycles for a typical system call. We quantify the performance impact of these library calls and show that in applications with a high system-call frequency, such as memcached, openVPN, and lighttpd, all of which have high-bandwidth network requirements, the performance degradation may be as high as 79%. We investigate the sources of this degradation by leveraging a new set of microbenchmarks for SGX-specific operations such as enclave entry-calls and out-calls, and encrypted memory I/O accesses. We use the insights gained from these analyses to design a new SGX interface framework, HotCalls. HotCalls are based on a synchronization spin-lock mechanism and provide a 13-27x speedup over the default interface. They can easily be integrated into existing code, making them a practical solution. Compared to a baseline SGX implementation of memcached, openVPN, and lighttpd, we show that using the new interface boosts throughput by 2.6-3.7x and reduces application latency by 62-74%. (A toy user-space analogue of the spin-lock call mechanism follows this entry.)
- Published
- 2017
- Full Text
- View/download PDF
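The abstract describes HotCalls as a spin-lock-based alternative to the expensive SGX ecall/ocall transitions. The sketch below is a hypothetical user-space analogue in Python, not the actual C/SGX implementation: a worker that is already "inside" waits on a shared request slot, so callers avoid an enter/exit transition on every request.

```python
import queue
import threading

class HotCallChannel:
    """Toy analogue of a HotCall: a worker thread that is already on the
    trusted side services a shared request queue, so callers never pay the
    expensive enclave enter/exit cost per request."""

    def __init__(self):
        self._requests = queue.Queue()

    def responder_loop(self):
        # In the real design this loop lives inside the enclave and
        # busy-waits on a shared-memory slot guarded by a spin lock.
        while True:
            func, args, done = self._requests.get()
            if func is None:
                break
            done["result"] = func(*args)
            done["event"].set()

    def call(self, func, *args):
        done = {"event": threading.Event()}
        self._requests.put((func, args, done))
        done["event"].wait()  # caller waits briefly instead of transitioning
        return done["result"]

channel = HotCallChannel()
threading.Thread(target=channel.responder_loop, daemon=True).start()
print(channel.call(lambda x: x * 2, 21))  # 42, with no costly transition
```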
8. Locking down insecure indirection with hardware-based control-data isolation
- Author
-
Reetuparna Das, William Arthur, Todd Austin, and Sahil Madeka
- Subjects
Indirection, Exploit, Memoization, Computer science, Program transformation, Parallel computing, Control flow, Embedded system, Code injection, Compiler, Cache, Isolation, Programmer
- Abstract
Arbitrary code injection remains a central issue in computer security, as attackers seek to exploit the software attack surface. A key component of many exploits today is the successful execution of a control-flow attack. Control-Data Isolation (CDI) has emerged as a technique that eliminates the root cause of contemporary control-flow attacks: indirect control-flow instructions. These instructions are replaced by direct control-flow edges dictated by the programmer and encoded into the application by the compiler. By removing the root cause of control-flow attacks, Control-Data Isolation sidesteps the vulnerabilities and restrictive threat models adopted by other solutions in this space (e.g., Control-Flow Integrity). The CDI approach, while eliminating contemporary control-flow attacks, introduces non-trivial overheads to validate indirect targets at runtime. In this work we introduce novel architectural support to accelerate the execution of CDI-compliant code. Through the addition of an edge cache, we are able to cache legal indirect target edges and eliminate nearly all execution overhead for indirection-free applications. We demonstrate that through memoization of compiler-confirmed control-flow transitions, overheads are reduced from 19% to 0.5% on average for Control-Data Isolated applications. Additionally, we show that the edge cache can efficiently do the double duty of predicting multi-way branch targets, thus providing speedups for some CDI-compliant executions compared to an architecture with unsophisticated indirect control prediction (e.g., a BTB). (A rough model of the edge-cache check follows this entry.)
- Published
- 2015
- Full Text
- View/download PDF
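A rough software model of the edge-cache check (hypothetical structure; the real design is a hardware cache in the core): an indirect transfer proceeds without overhead only if its (source, target) edge was confirmed by the compiler and is present in the cache.

```python
class EdgeCache:
    """Toy model of the CDI edge cache: caches compiler-confirmed
    (branch PC -> target PC) edges so validated indirect transfers skip
    the slow software check."""

    def __init__(self, legal_edges, capacity=256):
        self.legal_edges = legal_edges  # from the CDI-compiled binary
        self.cache = {}                 # (src, dst) -> True; LRU omitted
        self.capacity = capacity

    def check(self, src_pc, dst_pc):
        edge = (src_pc, dst_pc)
        if edge in self.cache:
            return True                 # fast path: no runtime overhead
        if edge in self.legal_edges:    # slow path: software validation
            if len(self.cache) >= self.capacity:
                self.cache.pop(next(iter(self.cache)))  # crude eviction
            self.cache[edge] = True
            return True
        return False                    # illegal edge: control-flow attack

legal = {(0x400100, 0x400800), (0x400100, 0x400900)}
ec = EdgeCache(legal)
assert ec.check(0x400100, 0x400800)      # legal indirect call
assert not ec.check(0x400100, 0x401337)  # injected target -> blocked
```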
9. Bridging the Moore's Law Performance Gap with Innovation Scaling
- Author
-
Todd Austin
- Subjects
Engineering, Moore's law, Dennard scaling, Performance gap, Scaling, Simulation
- Abstract
The end of Dennard scaling and the tyranny of Amdahl's law have created significant barriers to system scaling, leading to a gap between today's system performance and where Moore's law predicted it should be. I believe the solution to this problem is to scale innovation: finding better solutions to improve system performance and efficiency, and doing so more quickly than previously possible, could address the growing performance gap. In this talk, I will highlight a number of simple (and not so simple) ideas to address this challenge.
- Published
- 2015
- Full Text
- View/download PDF
10. EFFEX
- Author
-
Silvio Savarese, Robert Perricone, Andrew D. Jones, Jason Clemons, and Todd Austin
- Subjects
Speedup, Computer science, Feature extraction, Mobile computing, Scale-invariant feature transform, Pipeline (software), CUDA, Software, Computer architecture, Feature (computer vision), Embedded system, Memory architecture, Computer vision, Artificial intelligence
- Abstract
The deployment of computer vision algorithms in mobile applications is growing at a rapid pace. A primary component of the computer vision software pipeline is feature extraction, which identifies and encodes relevant image features. We present an embedded heterogeneous multicore design named EFFEX that incorporates novel functional units and memory architecture support, making it capable of increasing mobile vision performance while balancing power and area. We demonstrate this architecture running three common feature extraction algorithms, and show that it is capable of providing significant speedups at low cost. Our simulations show a speedup of as much as 14× for feature extraction with a decrease in energy of 40× for memory accesses.
- Published
- 2011
- Full Text
- View/download PDF
11. The potential of sampling for dynamic analysis
- Author
-
Todd Austin and Joseph L. Greathouse
- Subjects
Offset (computer science), Risk analysis (engineering), Computer science, Research community, Population, Data mining, Dynamic software
- Abstract
This paper presents an argument for distributing dynamic software analyses to large populations of users in order to locate bugs that cause security flaws. We review a collection of dynamic analysis systems and show that, despite a great deal of effort from the research community, their performance is still too low to allow their use in the field. We then show that there are effective sampling mechanisms for accelerating a wide range of powerful dynamic analyses. These mechanisms reduce the rate at which errors are observed by individual analyses, but this loss can be offset by the subsequent increase in test population. Nevertheless, there are unsolved issues in this domain that deserve attention if this technique is to be widely adopted. (A small numerical sketch of the sampling argument follows this entry.)
- Published
- 2011
- Full Text
- View/download PDF
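A small numerical sketch of the sampling argument, using entirely hypothetical numbers: if each user runs the heavyweight analysis on only a small random fraction of executions, per-user overhead drops, and the population as a whole still detects the bug with high probability.

```python
def population_detection_prob(per_run_detect_prob, sample_rate, users, runs_per_user):
    """Probability that at least one user in the population observes a bug
    when each execution is analyzed only with probability `sample_rate`."""
    p_miss_one_run = 1.0 - per_run_detect_prob * sample_rate
    return 1.0 - p_miss_one_run ** (users * runs_per_user)

# Hypothetical numbers: sampling 1% of executions slashes per-user overhead,
# yet 10,000 users x 100 runs each still catch the bug almost surely.
print(population_detection_prob(0.5, 0.01, users=10_000, runs_per_user=100))
```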
12. What input-language is the best choice for high level synthesis (HLS)?
- Author
-
Steve Svoboda, Todd Austin, and D.D. Gajski
- Subjects
Programming language, Computer science, Hardware description language, SystemVerilog, SystemC, High-level synthesis
- Abstract
As of 2010, over 30 of the world's top semiconductor/systems companies have adopted HLS. In 2009, SoC tape-outs containing IPs developed using HLS exceeded 50 for the first time. Now that the practicality and value of HLS are established, engineers are turning to the question of which input language works best. The answer is critical because it drives key decisions regarding the tool and methodology infrastructure companies will create around this new flow. ANSI-C/C++ advocates cite ease of learning and simulation speed. SystemC advocates make similar claims and point to SystemC's hardware-oriented features. Proponents of BSV (Bluespec SystemVerilog) claim that the language enhances architectural transparency and control. To maximize the benefits of HLS, companies must consider many factors and tradeoffs.
- Published
- 2010
- Full Text
- View/download PDF
13. Using introspective software-based testing for post-silicon debug and repair
- Author
-
Todd Austin and Kypros Constantinides
- Subjects
Computer science, Firmware, Application software, Maintenance engineering, Software quality, Instruction set, Microprocessor, Software, Embedded system, Microcode
- Abstract
As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common, to the point of threatening yield rates and product lifetimes. Introspective software mechanisms hold great promise to address these reliability challenges with both low cost and high coverage. To address these challenges, we have developed a novel instruction set enhancement, called Access-Control Extensions (ACE), which can access and control a microprocessor's internal state. Using ACE technology, special firmware can periodically probe the microprocessor during execution to locate run-time faults, repair design errors (even those discovered in the field), and streamline manufacturing tests.
- Published
- 2010
- Full Text
- View/download PDF
14. Session details: Computation in the post-Turing era
- Author
-
Todd Austin
- Subjects
Theoretical computer science, Computer science, Computation, Session (computer science), Turing
- Published
- 2009
- Full Text
- View/download PDF
15. Session details: Microarchitecture analysis and optimisation
- Author
-
Todd Austin and Georgi Gaydadjiev
- Subjects
Computer architecture, Computer science, Session (computer science), Microarchitecture
- Published
- 2008
- Full Text
- View/download PDF
16. Session details: Instruction-set optimisations
- Author
-
Georgi Gaydadjiev and Todd Austin
- Subjects
Instruction set, Multimedia, Computer science, Session (computer science)
- Published
- 2008
- Full Text
- View/download PDF
17. Session details: Clocks, scheduling, and stores
- Author
-
Todd Austin
- Subjects
Computer science, Scheduling (computing), Computer network
- Published
- 2007
- Full Text
- View/download PDF
18. Reliability-aware data placement for partial memory protection in embedded processors
- Author
-
Todd Austin and Mojtaba Mehrara
- Subjects
Page fault, Computer science, General protection fault, Embedded system, Fault coverage, Overhead (computing), Fault tolerance, Fault injection, Segmentation fault, Memory protection
- Abstract
Low-cost protection of embedded systems against soft errors has recently become a major concern. This issue is even more critical for memory elements, which are inherently more prone to transient faults. In this paper, we propose a reliability-aware data placement technique that partially protects embedded memory systems. We show that by adopting this method instead of traditional placement schemes with complete memory protection, an acceptable level of fault tolerance can be achieved while incurring less area and power overhead. In this approach, each variable in the program is placed in either a protected or a non-protected memory area according to a profile-driven liveness analysis of all memory variables. To measure the level of fault coverage, we inject faults into the memory during program execution in a Monte Carlo simulation framework, and we calculate the coverage of the partial protection scheme from the number of protected, failed, and crashed runs in the fault-injection experiment. (A simplified sketch of the placement decision follows this entry.)
- Published
- 2006
- Full Text
- View/download PDF
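A simplified sketch of the profile-driven placement decision. The scoring heuristic (live time per byte) is an assumption made for illustration; the paper's approach uses profile-driven liveness analysis of all memory variables. Variables whose live data is exposed to soft errors the longest are placed in the protected region until it fills up.

```python
def place_variables(variables, protected_capacity_bytes):
    """variables: list of (name, size_bytes, profiled_live_time).
    Greedily fill the protected memory region with the variables whose
    values stay live longest, since they are most exposed to soft errors."""
    protected, unprotected = [], []
    used = 0
    # Prioritize long-lived data; weighting by live time per byte is an
    # illustrative assumption, not the paper's exact criterion.
    for name, size, live_time in sorted(
            variables, key=lambda v: v[2] / v[1], reverse=True):
        if used + size <= protected_capacity_bytes:
            protected.append(name)
            used += size
        else:
            unprotected.append(name)
    return protected, unprotected

prog_vars = [("state", 64, 9.0e6), ("scratch", 256, 1.0e4), ("table", 128, 5.0e6)]
print(place_variables(prog_vars, protected_capacity_bytes=200))
# (['state', 'table'], ['scratch'])
```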
19. Robust low power computing in the nanoscale era
- Author
-
Todd Austin
- Subjects
Engineering, Emerging technologies, Robustness (computer science), Embedded system, Speculation, Scaling, Microarchitecture, Reliability engineering
- Abstract
This tutorial will present recent results in robust low-power computing. The perspective is microarchitectural: what limitations do power and reliability place on the microarchitecture, and what can the microarchitect do to reduce power and improve robustness? The tutorial will start with a technology overview that charts future trends in power and reliability. We will present a summary of prior research in dynamic power reduction in microarchitectures and give some examples of industrial solutions. We will also review prior research, by us and others, in microarchitectural reduction of leakage. While the continued scaling that Moore's law predicts is in many ways good for reducing power, scaling also reduces reliability by increasing uncertainty in device performance. Therefore, in order to take advantage of scaling, it will be necessary to compute in the presence of various types of silicon-related faults. Two that are particularly important are single-event upsets and, even more serious, gates that do not meet their specifications. We will review techniques to provide robustness in light of these trends. In particular, we will revisit techniques developed by the fault-tolerant community as well as newer ideas in timing speculation, exemplified by our Razor research. The tutorial is intended for computer architects and circuit designers interested in a better understanding of current reliability challenges and emerging technologies to address them.
- Published
- 2006
- Full Text
- View/download PDF
20. A second-generation sensor network processor with application-driven memory optimizations and out-of-order execution
- Author
-
M. Minuth, Leyla Nazhandali, Todd Austin, David Blaauw, Bo Zhai, and J. Olson
- Subjects
Instruction prefetch, Out-of-order execution, Computer science, Operand, Instruction set, Microprocessor, Embedded system, Memory architecture, Systems design, Wireless sensor network, Computer hardware
- Abstract
In this paper we present a second-generation sensor network processor that consumes less than one picojoule per instruction (typical processors use hundreds to thousands of picojoules per instruction). As in our first-generation design effort, we strive to build microarchitectures that minimize area to reduce leakage, maximize transistor utility to reduce the energy-optimal voltage, and optimize CPI for efficient processing. The new design builds on our previous work on a low-power subthreshold-voltage sensor processor, this time improving the design by focusing on the ISA, memory system design, and microarchitectural optimizations that reduce overall design size and improve energy per instruction. The new design employs 8-bit datapaths and an ultra-compact 12-bit-wide RISC instruction set architecture, which enables high code density via micro-operations and flexible operand modes. The design also features a unique memory architecture with a prefetch buffer and predecoded address bits, which permits both faster access to the memory and smaller instructions due to fewer address bits. To achieve efficient processing, the design incorporates branch speculation and out-of-order execution, but in a simplified form for reduced area and leakage-energy overheads. Using SPICE-level timing and power simulation, we find that these optimizations produce a number of Pareto-optimal designs with varied performance-energy tradeoffs. Our most efficient design is capable of running at 142 kHz (0.1 MIPS) while consuming only 600 fJ/instruction, allowing the processor to run continuously for 41 years on the energy stored in a miniature 1 g lithium-ion battery. Work is ongoing to incorporate this design into an intra-ocular pressure sensor. (A back-of-the-envelope check of this operating point follows this entry.)
- Published
- 2005
- Full Text
- View/download PDF
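The headline operating point can be sanity-checked with simple arithmetic, using only the figures quoted in the abstract and ignoring self-discharge and any other system loads:

```python
# Back-of-the-envelope check of the stated design point.
freq_hz = 142e3               # 142 kHz (~0.1 MIPS)
energy_per_instr_j = 600e-15  # 600 fJ/instruction

power_w = freq_hz * energy_per_instr_j
print(f"core power ~ {power_w * 1e9:.0f} nW")         # ~85 nW

seconds = 41 * 365.25 * 24 * 3600
print(f"41-year energy ~ {power_w * seconds:.0f} J")  # ~110 J, about 0.03 Wh,
# comfortably within the rough usable energy of a miniature 1 g lithium cell
# (battery shelf life and other loads are ignored in this estimate).
```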
21. Microarchitectural power modeling techniques for deep sub-micron microprocessors
- Author
-
Taeho Kgil, Nam Sung Kim, Valeria Bertacco, Todd Austin, and Trevor Mudge
- Subjects
Computer engineering, Computer science, Low-power electronics, Embedded system, Scalability, Constraint (computer-aided design), Logic simulation, Microarchitecture, Power
- Abstract
The need to perform early design studies that combine architectural simulation with power estimation has become critical as power has become a first-order design constraint. To satisfy this demand, several microarchitectural power simulators have been developed around SimpleScalar, a widely used microarchitectural performance simulator. They have proven to be very useful at providing insights into power/performance trade-offs. However, they are neither parameterized nor technology-scalable. In this paper, we propose more accurate parameterized power modeling techniques that reflect actual technology parameters as well as input switching events for memory and execution units. Compared to HSPICE, the proposed techniques show 93% and 91% accuracy for those blocks, with a much faster simulation time. We also propose a more realistic power modeling technique for external I/O. In general, our approach includes more detailed microarchitectural and circuit modeling than earlier simulators, without incurring a significant simulation time overhead; the overhead can be as small as a few percent. (A generic flavor of such a parameterized power model follows this entry.)
- Published
- 2004
- Full Text
- View/download PDF
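For flavor, here is what a generic parameterized, activity-aware power model looks like. This is a textbook dynamic-plus-leakage formulation with hypothetical numbers, not the specific models developed in the paper.

```python
def block_power(cap_f, vdd, freq_hz, activity, leak_a):
    """Generic per-block power model: dynamic switching power scaled by an
    input-dependent activity factor, plus static leakage.
    P = activity * C * Vdd^2 * f + Vdd * I_leak"""
    dynamic = activity * cap_f * vdd**2 * freq_hz
    static = vdd * leak_a
    return dynamic + static

# Hypothetical numbers for an execution unit: ~1.2 mW dynamic + 1 mW leakage.
print(block_power(cap_f=2e-12, vdd=1.0, freq_hz=2e9, activity=0.3, leak_a=1e-3))
```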
22. Designing robust microarchitectures
- Author
-
Todd Austin
- Subjects
Engineering ,business.industry ,Computation ,Microarchitecture ,law.invention ,Microprocessor ,Robustness (computer science) ,law ,Embedded system ,Systems design ,Verifiable secret sharing ,System on a chip ,business ,Error detection and correction - Abstract
A fault-tolerant approach to microprocessor design, developed at the University of Michigan, is presented. Our approach is based on the use of in-situ checker components that validate the functional and electrical characteristics of complex microprocessor designs. Two design techniques are highlighted: a low-cost double-sampling latch design capable of eliminating power-hungry voltage margins, and a formally verifiable checker co-processor that validates all computation produced by a complex microprocessor core. By adopting a "better than worst-case" approach to system design, it is possible to address reliability and uncertainty concerns that arise during design, manufacturing and system operation.
- Published
- 2004
- Full Text
- View/download PDF
23. Architectural optimizations for low-power, real-time speech recognition
- Author
-
Scott Mahlke, Todd Austin, and Rajeev Krishna
- Subjects
Application domain, Computer science, Speech recognition, Interface (computing), Task parallelism, Energy consumption, Power domains, Microarchitecture
- Abstract
The proliferation of computing technology to low-power domains such as hand-held devices has led to increased interest in portable interface technologies, with particular interest in speech recognition. The computational demands of robust, large-vocabulary speech recognition systems, however, are currently prohibitive for such low-power devices. This work begins an exploration of domain-specific characteristics of speech recognition that might be exploited to achieve the requisite performance within the power constraints of such devices. We focus primarily on architectural techniques to exploit the massive amounts of potential thread-level parallelism apparent in this application domain, and consider the performance/power trade-offs of such architectures. Our results show that a simple, multi-threaded, multi-pipelined processor architecture can significantly improve the performance of the time-consuming search phase of modern speech recognition algorithms, and may reduce overall energy consumption by drastically reducing static power dissipation. We also show that the primary hurdle to achieving these performance benefits is the data request rate into the memory system, and consider some initial solutions to this problem.
- Published
- 2003
- Full Text
- View/download PDF
24. Compiler controlled value prediction using branch predictor based confidence
- Author
-
Todd Austin and Eric D. Larson
- Subjects
Speedup, Computer science, Value (computer science), Compiler, Parallel computing, Branch predictor, Instruction-level parallelism, Branch misprediction, Critical path method
- Abstract
Value prediction breaks data dependencies in a program, thereby creating instruction-level parallelism that can increase program performance. Hardware-based value prediction techniques have been shown to increase speed, but at great cost, as designs include prediction tables, selection logic, and a confidence mechanism. This paper proposes compiler-controlled value prediction optimizations that obtain good speedups while keeping hardware costs low. The branch predictor is used to estimate the confidence of the value predictor for speculated instructions. This technique obtains a 4.6% speedup when completely implemented in software and a 15.2% speedup when minimal hardware support (a 1 KB predictor table) is added. We also explore the use of critical-path information to aid in the selection of value prediction candidates. The key result of our study is that programs with long dynamic dependence chains benefit from this technique, while programs with shorter chains benefit more from simple selection methods that favor optimization frequency. A new branch instruction that ignores innocuous value mispredictions is shown to eliminate unnecessary mispredictions when program semantics are not violated by confidence-branch mispredictions. (A toy model of the confidence-gated prediction follows this entry.)
- Published
- 2000
- Full Text
- View/download PDF
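A toy model of the core idea above, with hypothetical structure: a last-value prediction for a chosen instruction is used only when a saturating counter, standing in for the branch predictor entry of the verification branch, reports high confidence. The real scheme operates at the compiler/ISA level, not as a Python class.

```python
class CompilerControlledVP:
    """Toy model: last-value prediction gated by a 2-bit saturating counter
    that plays the role of the branch predictor's confidence for the
    value-verification branch."""

    def __init__(self):
        self.last_value = None
        self.confidence = 0  # 0..3, like a 2-bit branch predictor counter

    def maybe_predict(self):
        # Speculate only when the verify branch has recently been correct;
        # otherwise fall back to normal (non-speculative) execution.
        if self.confidence >= 2 and self.last_value is not None:
            return self.last_value
        return None

    def update(self, actual_value):
        correct = (actual_value == self.last_value)
        if correct:
            self.confidence = min(3, self.confidence + 1)
        else:
            self.confidence = max(0, self.confidence - 1)
        self.last_value = actual_value
        return correct

vp = CompilerControlledVP()
for v in [7, 7, 7, 7, 9, 9]:
    pred = vp.maybe_predict()
    ok = vp.update(v)
    print(f"value={v} predicted={pred} correct={ok}")
```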
25. Architectural support for fast symmetric-key cryptography
- Author
-
Todd Austin, Jerome Burke, and John W. McDonald
- Subjects
Triple DES, Speedup, Modular arithmetic, Computer science, Cryptography, Encryption, Permutation, Secure communication, Cipher, Computer engineering, Symmetric-key algorithm, Strong cryptography, Software
- Abstract
The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms necessary to implement accountability, accuracy, and confidentiality in communication. As demands for secure communication bandwidth grow, efficient cryptographic processing will become increasingly vital to good system performance. In this paper, we explore techniques to improve the performance of symmetric-key cipher algorithms. Eight popular strong encryption algorithms are examined in detail. Analysis reveals that the algorithms are computationally complex and contain little parallelism. Overall throughput on a high-end microprocessor is quite poor: a 600 MHz processor is incapable of saturating a T3 communication line with 3DES (triple DES) encrypted data. We introduce new instructions that improve the efficiency of the analyzed algorithms. Our approach adds instruction set support for fast substitutions, general permutations, rotates, and modular arithmetic. Performance analysis of the optimized ciphers shows an overall speedup of 59% over a baseline machine with rotate instructions and a 74% speedup over a baseline without rotates. Even higher speedups are demonstrated with optimized substitutions (S-boxes) and additional functional unit resources. Our analyses of the original and optimized algorithms suggest future directions for the design of high-performance programmable cryptographic processors.
- Published
- 2000
- Full Text
- View/download PDF
26. Classifying load and store instructions for memory renaming
- Author
-
Brad Calder, Glenn Reinman, Todd Austin, Gary Tyson, and Dean M. Tullsen
- Subjects
Profiling (computer programming), Memory address, Dependency, Computer science, Value (computer science), Compiler, Parallel computing, Bottleneck
- Abstract
Memory operations remain a significant bottleneck in dynamically scheduled pipelined processors, due in part to the inability to statically determine the existence of memory address dependencies. Hardware memory renaming techniques have been proposed to predict which stores a load might be dependent upon. These prediction techniques can be used to speculatively forward a value from a predicted store dependency to a load through a value prediction table. However, these techniques require large, time-consuming hardware tables. In this paper we propose a software-guided approach for identifying dependencies between store and load instructions, and the Load Marking (LM) architecture to communicate these dependencies to the hardware. Compiler analysis and profiles are used to find important store/load relationships, and these relationships are identified during execution via hints or an n-bit tag. For those loads that are not marked for renaming, we then use additional profiling information to further classify the loads into those that have accurate value prediction and those that do not. These classifications allow the processor to individually apply the most appropriate aggressive form of execution to each load.
- Published
- 1999
- Full Text
- View/download PDF
27. Cache-conscious data placement
- Author
-
Simmi John, Todd Austin, Brad Calder, and Chandra Krintz
- Subjects
Computer science, Cache coloring, CPU cache, Parallel computing, Cache pollution, Virtual address space, Locality of reference, Cache, Cache algorithms, Heap (data structure)
- Abstract
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal locality to different cache blocks in the virtual address space, eliminating cache conflicts. These code placement techniques can be applied directly to the problem of placing data for improved data cache performance. In this paper we present a general framework for Cache-Conscious Data Placement. This is a compiler-directed approach that creates an address placement for the stack (local variables), global variables, heap objects, and constants in order to reduce data cache misses. The placement of data objects is guided by a temporal relationship graph between objects generated via profiling. Our results show that profile-driven data placement significantly reduces the data miss rate, by 24% on average. (A condensed sketch of the placement step follows this entry.)
- Published
- 1998
- Full Text
- View/download PDF
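A condensed sketch of the core placement step, using a hypothetical graph representation: objects that the profile shows are hot at the same time are assigned to different cache sets so they do not conflict. This greedy coloring is a simplification of the paper's framework.

```python
from collections import defaultdict

def place_objects(objects, edges, num_sets):
    """Greedy cache-conscious placement sketch: `edges[(a, b)]` is the
    profiled temporal-affinity weight between objects a and b.  Each object
    is assigned the cache set where it conflicts least with already-placed
    objects that are hot at the same time."""
    affinity = defaultdict(float)
    for (a, b), w in edges.items():
        affinity[(a, b)] += w
        affinity[(b, a)] += w

    placement = {}
    for obj in objects:  # e.g. considered hottest-first
        conflict = [0.0] * num_sets
        for other, s in placement.items():
            conflict[s] += affinity[(obj, other)]
        placement[obj] = conflict.index(min(conflict))
    return placement

objs = ["stack_buf", "global_tbl", "heap_node", "const_pool"]
profile_edges = {("stack_buf", "global_tbl"): 120.0,
                 ("global_tbl", "heap_node"): 80.0,
                 ("stack_buf", "const_pool"): 5.0}
print(place_objects(objs, profile_edges, num_sets=2))
# {'stack_buf': 0, 'global_tbl': 1, 'heap_node': 0, 'const_pool': 1}
```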