Search Results (57 results)
2. Instruction-Set Accelerated Implementation of CRYSTALS-Kyber.
- Author
- Bisheh-Niasar, Mojtaba; Azarderakhsh, Reza; Mozaffari-Kermani, Mehran
- Subjects
- QUANTUM computers; ELLIPTIC curve cryptography; QUANTUM cryptography; CRYPTOGRAPHY; FIELD programmable gate arrays
- Abstract
Large-scale quantum computers will break classical public-key cryptography protocols using quantum algorithms such as Shor’s algorithm. Hence, designing quantum-safe cryptosystems to replace current classical algorithms is crucial. Fortunately, there are post-quantum candidates that are believed to be resistant to future attacks from quantum computers, and NIST is considering standardizing them. Among these candidates, lattice-based cryptography stands out due to its performance results as well as confidence in its security. There are few works in the literature evaluating the performance of lattice-based cryptography in hardware. In this paper, we focus on the Cryptographic Suite for Algebraic Lattices (CRYSTALS) key encapsulation mechanism known as Kyber, provide an instruction-set hardware architecture, and implement it on a Xilinx Artix-7 FPGA for performance evaluation and testing. Our proposed architecture provides an efficient, high-performance set of components to perform polynomial sampling, the number-theoretic transform (NTT), and point-wise multiplication to speed up lattice-based post-quantum cryptography (PQC). Implemented on ASIC, this architecture outperforms state-of-the-art implementations. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
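Editorial note on the record above: the NTT speedup it describes rests on the convolution theorem, where pointwise multiplication in the transform domain replaces O(n²) polynomial multiplication. A minimal Python sketch with toy parameters (q = 17, n = 8, ω = 2); Kyber itself uses q = 3329, n = 256 and a negacyclic NTT, which this sketch does not reproduce.

```python
# Toy sketch of NTT-based polynomial multiplication (cyclic convolution),
# the core primitive the Kyber accelerator implements in hardware.
# Parameters are illustrative (q=17, n=8), not Kyber's.

Q, N, W = 17, 8, 2          # prime modulus, length, primitive N-th root of unity mod Q
W_INV, N_INV = 9, 15        # 2^-1 and 8^-1 mod 17

def ntt(a, w):
    """Naive O(n^2) transform: A[k] = sum_j a[j] * w^(jk) mod Q."""
    return [sum(a[j] * pow(w, j * k, Q) for j in range(N)) % Q for k in range(N)]

def poly_mul_ntt(a, b):
    """Cyclic convolution via pointwise multiplication in the NTT domain."""
    fa, fb = ntt(a, W), ntt(b, W)
    fc = [(x * y) % Q for x, y in zip(fa, fb)]
    return [(c * N_INV) % Q for c in ntt(fc, W_INV)]   # inverse NTT

def poly_mul_naive(a, b):
    """Reference: schoolbook cyclic convolution mod x^N - 1, coefficients mod Q."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % Q
    return c

a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
assert poly_mul_ntt(a, b) == poly_mul_naive(a, b)
print(poly_mul_ntt(a, b))
```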
3. ParaML: A Polyvalent Multicore Accelerator for Machine Learning.
- Author
- Zhou, Shengyuan; Guo, Qi; Du, Zidong; Liu, Daofu; Chen, Tianshi; Li, Ling; Liu, Shaoli; Zhou, Jinhong; Temam, Olivier; Feng, Xiaobing; Zhou, Xuehai; Chen, Yunji
- Subjects
- MACHINE learning; MULTICORE processors; SUPPORT vector machines; K-nearest neighbor classification; PRINCIPAL components analysis; VECTOR quantization; COMPUTER architecture
- Abstract
In recent years, machine learning (ML) techniques have proven to be powerful tools in various emerging applications. Traditionally, ML techniques are processed on general-purpose CPUs and GPUs, but their energy efficiency is limited by their excessive support for flexibility. Hardware accelerators are an efficient alternative to CPUs/GPUs, but they are still limited in that they often accommodate only a single ML technique (family). However, different problems may require different ML techniques, which implies that such accelerators may achieve poor learning accuracy or even be ineffective. In this paper, we present a polyvalent accelerator architecture integrated with multiple processing cores, called ParaML, which accommodates ten representative ML techniques: k-means, k-nearest neighbors (k-NN), naive Bayes (NB), support vector machine (SVM), linear regression (LR), classification tree (CT), deep neural network (DNN), learning vector quantization (LVQ), Parzen window (PW), and principal component analysis (PCA). Benefiting from our thorough analysis of the computational primitives and locality properties of different ML techniques, the single-core ParaML can perform up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm² and consumes only 596 mW, as estimated by ICC and PrimeTime PX, respectively, on the post-synthesis netlist. Compared with the NVIDIA K20M GPU (28-nm process), the single-core ParaML (65-nm process) is 1.21× faster and reduces energy by 137.93×. We also compare the single-core ParaML with other accelerators. Compared with PRINS, the single-core ParaML achieves 72.09× and 2.57× energy benefits for k-NN and k-means, respectively, and speeds up each query in k-NN by 44.76×. Compared with EIE, the single-core ParaML achieves a 5.02× speedup and 4.97× energy benefit with 11.62× less area when evaluated with a dense DNN. Compared with TPU, the single-core ParaML achieves 2.45× better power efficiency (5647 GOP/W versus 2300 GOP/W) with 321.36× less area. Compared to the single-core version, the 8-core ParaML further improves the speedup, by up to 3.98×, with an area of 13.44 mm² and a power of 2036 mW. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
4. Quick-and-Dirty: An Architecture for High-Performance Temporary Short Writes in MLC PCM.
- Author
- Zhang, Mingzhe; Zhang, Lunkai; Jiang, Lei; Chong, Frederic T.; Liu, Zhiyong
- Subjects
- DYNAMIC random access memory; PULSE-code modulation
- Abstract
MLC PCM provides high-density data storage and extended data retention, making it a promising alternative to DRAM for main memory. However, its low write performance is a major obstacle to commercialization. One opportunity for improving the latency of MLC PCM writes is to use fewer SET iterations in a single write. Unfortunately, this comes at a cost: data written by these short writes have remarkably shorter retention times and thus need frequent refreshes. As a result, it is impractical to use these short-latency, short-retention writes globally. In this paper, we analyze the temporal behavior of write operations in typical applications and show that write operations are bursty in nature; that is, during some time intervals the memory is subject to a large number of writes, while during others hardly any memory operations take place. Based on this observation, we propose Quick-and-Dirty (QnD), a lightweight scheme to improve the performance of MLC PCM. When write performance becomes the system bottleneck, QnD performs some write operations using the short-latency, short-retention write mode. Then, when the memory system is relatively quiet, QnD uses idle-memory intervals to refresh the data written by short-latency, short-retention writes in order to mitigate the short-retention problem. Our experimental results show that QnD improves performance by 30.9 percent (geometric mean) while still providing acceptable memory lifetime (7.58 years, geometric mean). We also provide sensitivity studies of the aggressiveness, memory coverage, and granularity of the QnD technique. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
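A toy Python model of the policy this record describes; the threshold, latencies, and one-line-per-idle-slot refresh are illustrative assumptions, not values or mechanisms from the paper.

```python
from collections import deque

# Toy model of the Quick-and-Dirty (QnD) idea: short writes under bursts,
# refreshes during idle intervals. All constants are made up for illustration.
BURST_THRESHOLD = 4                     # queue depth that signals a write burst
FULL_LATENCY, SHORT_LATENCY = 8, 2      # cycles per long/short write

def run_qnd(trace):
    """trace: one list of line addresses to write per cycle ([] = idle cycle)."""
    pending, dirty, busy = deque(), set(), 0
    for writes in trace:
        pending.extend(writes)
        if busy:
            busy -= 1                         # memory occupied by a prior write
        elif pending:
            line = pending.popleft()
            if len(pending) >= BURST_THRESHOLD:
                busy = SHORT_LATENCY - 1      # bursty: quick-and-dirty write
                dirty.add(line)               # short retention: refresh later
            else:
                busy = FULL_LATENCY - 1       # quiet: normal long write
                dirty.discard(line)
        elif dirty:
            dirty.discard(next(iter(dirty)))  # idle: refresh one dirty line
            busy = FULL_LATENCY - 1
    return dirty                              # lines still awaiting refresh

burst = [[1, 2, 3, 4, 5, 6]] + [[]] * 60
print(run_qnd(burst))   # empty set: the idle tail refreshed every dirty line
```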
5. An Efficient Methodology for Mapping Quantum Circuits to the IBM QX Architectures.
- Author
- Zulehner, Alwin; Paler, Alexandru; Wille, Robert
- Subjects
- QUANTUM computers; QUANTUM gates; QUANTUM computing; SOFTWARE development tools; COMPUTER architecture; LOGIC circuits
- Abstract
In the past years, quantum computers have increasingly evolved from an academic idea to an upcoming reality. IBM’s project IBM Q can be seen as evidence of this progress. Launched in March 2017 with the goal of providing access to quantum computers for a broad audience, it has allowed users to conduct quantum experiments on a 5-qubit and, since June 2017, also on a 16-qubit quantum computer (called IBM QX2 and IBM QX3, respectively). Revised versions of these 5- and 16-qubit quantum computers (named IBM QX4 and IBM QX5, respectively) have been available since September 2017. In order to use these, the desired quantum functionality (e.g., provided in terms of a quantum circuit) has to be properly mapped so that the underlying physical constraints are satisfied—a complex task. This demands solutions to automatically and efficiently conduct this mapping process. In this paper, we propose a methodology that addresses this problem, i.e., maps the given quantum functionality to a realization that satisfies all constraints given by the architecture and, at the same time, keeps the overhead in terms of additionally required quantum gates minimal. The proposed methodology is generic, can easily be configured for similar future architectures, and is fully integrated into IBM’s SDK. Experimental evaluations show that the proposed approach clearly outperforms IBM’s own mapping solution. In fact, for many quantum circuits, the proposed approach determines a mapping to the IBM architecture within minutes, while IBM’s solution suffers from long runtimes and runs into a timeout of 1 h in several cases. As an additional benefit, the proposed approach yields mapped circuits with smaller costs (i.e., fewer additional gates are required). All implementations of the proposed methodology are publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
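To make the mapping problem above concrete: a naive greedy baseline in Python that inserts SWAPs along shortest coupling-graph paths so every CNOT acts on physically adjacent qubits. The ring coupling graph and the greedy strategy are illustrative assumptions; the paper's own method is a more sophisticated search that minimizes the added gates.

```python
from collections import deque

# Naive baseline: insert SWAPs so each 2-qubit gate touches coupled qubits.
COUPLING = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)}   # toy 5-qubit ring
ADJ = {q: set() for q in range(5)}
for a, b in COUPLING:
    ADJ[a].add(b); ADJ[b].add(a)

def shortest_path(src, dst):
    """BFS over the coupling graph."""
    prev, frontier = {src: None}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            break
        for v in ADJ[u]:
            if v not in prev:
                prev[v] = u
                frontier.append(v)
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def map_circuit(gates, n=5):
    """gates: list of (logical_control, logical_target). Returns physical ops."""
    phys = list(range(n))                 # phys[logical] = physical qubit
    out = []
    for lc, lt in gates:
        path = shortest_path(phys[lc], phys[lt])
        for u, v in zip(path, path[1:-1]):        # swap control toward target
            out.append(("SWAP", u, v))
            i, j = phys.index(u), phys.index(v)
            phys[i], phys[j] = phys[j], phys[i]
        out.append(("CNOT", phys[lc], phys[lt]))
    return out

print(map_circuit([(0, 2), (1, 3), (0, 4)]))
```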
6. Content Aware Refresh: Exploiting the Asymmetry of DRAM Retention Errors to Reduce the Refresh Frequency of Less Vulnerable Data.
- Author
- Wang, Shibo; Bojnordi, Mahdi Nazm; Guo, Xiaochen; Ipek, Engin
- Subjects
- ERROR correction (Information theory); DYNAMIC random access memory; MICROPROCESSORS; COMPUTER systems; COMPUTER architecture
- Abstract
DRAM refresh is responsible for significant performance and energy overheads in a wide range of computer systems, from mobile platforms to datacenters. With the growing demand for DRAM capacity and the worsening retention time characteristics of deeply scaled DRAM, refresh is expected to become an even more pronounced problem in future technology generations. This paper examines content aware refresh, a new technique that reduces the refresh frequency by exploiting the unidirectional nature of DRAM retention errors: assuming that a logical 1 and 0 respectively are represented by the presence and absence of charge, 1-to-0 failures are much more likely than 0-to-1 failures. As a result, in a DRAM system that uses a block error correcting code (ECC) to protect memory, blocks with fewer 1s can attain a specified reliability target (i.e., mean time to failure) with a refresh rate lower than that which is required for a block with all 1s. Leveraging this key insight, and without compromising memory reliability, the proposed content aware refresh mechanism refreshes memory blocks with fewer 1s less frequently. To keep the overhead of tracking multiple refresh rates manageable, refresh groups—groups of DRAM rows refreshed together—are dynamically arranged into one of a predefined number of refresh bins and refreshed at the rate determined by the ECC block with the greatest number of 1s in that bin. By tailoring the refresh rate to the actual content of a memory block rather than assuming a worst case data pattern, content aware refresh respectively outperforms DRAM systems that employ RAS-only Refresh, all-bank Auto Refresh, and per-bank Auto Refresh mechanisms by 12, 8, and 13 percent. It also reduces DRAM system energy by 15, 13, and 16 percent as compared to these systems. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
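The core mechanism above reduces to a popcount and a bin lookup per ECC block. A minimal Python sketch; the bin boundaries and refresh intervals are made-up illustrative values, not numbers from the paper.

```python
# Sketch of content-aware refresh binning: the refresh interval of a refresh
# group is set by the ECC block with the most 1s in its bin.
BINS = [(16, 256.0), (32, 192.0), (48, 128.0), (64, 64.0)]  # (max #1s, interval ms)

def refresh_interval(ecc_block: bytes) -> float:
    """More 1s -> more likely 1-to-0 retention failure -> shorter interval."""
    ones = sum(bin(byte).count("1") for byte in ecc_block)
    for max_ones, interval_ms in BINS:
        if ones <= max_ones:
            return interval_ms
    return BINS[-1][1]

def group_interval(blocks) -> float:
    """A refresh group is refreshed at the rate its worst-case block requires."""
    return min(refresh_interval(b) for b in blocks)

cold = bytes(8)               # all zeros: no charge to lose, longest interval
hot = bytes([0xFF] * 8)       # all ones: worst case, shortest interval
print(refresh_interval(cold), refresh_interval(hot), group_interval([cold, hot]))
```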
7. Reminiscences of Project Y and the ACS Project.
- Author
- Randell, Brian
- Subjects
- SUPERCOMPUTERS; COMPUTER industry; PROJECT management; TWENTIETH century; CORPORATE history; HISTORY
- Abstract
These reminiscences relate to the period that Brian Randell spent between 1964 and 1966 first at IBM Research working on Project Y and then in the IBM Systems Development Division on the resulting ACS Project--then-secret projects that aimed to build a supercomputer that would be 100 times faster than Stretch. Randell's account is based in part on his memory, but also makes extensive use of the small set of files that he had retained, mainly relating to patent applications. A scanned copy of one of these files, the paper "Dynamic Instruction Scheduling," that he coauthored with Lynn Conway, Don Rozenberg, and Don Senzig in February 1966, is available as an online Web extra https://s3.amazonaws.com/ieeecs.cdn.csdl.public/mags/an/2015/03/man2015030055s.pdf and in Newcastle University's online archive at www.cs.ncl.ac.uk/publications/trs/papers/891.pdf. Because of space constraints just the initial three pages of this paper are included in the present article, which also includes the text from a section on "Interrupts" that was added to the 1969 IBM San Jose Technical Report version of the 1966 paper. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
8. Analysis of Introducing Active Learning Methodologies in a Basic Computer Architecture Course.
- Author
- Arbelaitz, Olatz; Martin, Jose I.; Muguerza, Javier
- Subjects
- ACTIVE learning; COMPUTER architecture; INTERDISCIPLINARY education; ACADEMIC workload of students; PROJECT method in teaching; STUDENT interests; TEACHING methods
- Abstract
This paper presents an analysis of introducing active methodologies in the Computer Architecture course taught in the second year of the Computer Engineering Bachelor's degree program at the University of the Basque Country (UPV/EHU), Spain. The paper reports the experience from three academic years, 2011–2012, 2012–2013, and 2013–2014, in which three types of data were considered for analysis: students' dedication, as measured by time spent on the project, their marks, and their level of satisfaction. The study shows that about 86% of students are satisfied with the teaching methodology and are willing to continue using it in future courses. The study also shows that the active methodologies used contribute to the students' cross-curricular training and do not generate any great increase in student workload. Finally, a statistical analysis of the evolution of student performance showed that marks improved to a statistically significant extent after introducing active methodologies. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
9. Parallel H.264/AVC Fast Rate-Distortion Optimized Motion Estimation by Using a Graphics Processing Unit and Dedicated Hardware.
- Author
- Shahid, Muhammad Usman; Ahmed, Ashfaq; Martina, Maurizio; Masera, Guido; Magli, Enrico
- Subjects
- ESTIMATION theory; GRAPHICS processing units; COMPUTERS; FIELD programmable gate arrays; INTEGRATED circuits; MOTION estimation (Signal processing); RATE distortion theory
- Abstract
Heterogeneous systems on a single chip composed of a central processing unit, graphics processing unit (GPU), and field-programmable gate array (FPGA) are expected to emerge in the near future. In this context, the system on chip can be dynamically adapted to employ different architectures for the execution of data-intensive applications. Motion estimation (ME) is one such task that can be accelerated using an FPGA or GPU for a high-performance H.264/Advanced Video Coding encoder implementation. This paper presents an inherently parallel, low-complexity, rate-distortion (RD) optimized fast ME algorithm well suited for parallel implementations, eliminating various data dependencies caused by a reliance on spatial predictions. In addition, this paper provides details of the GPU and FPGA implementations of the parallel algorithm, using OpenCL and the Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL), respectively, and presents a practical performance comparison between the two implementations. The experimental results show that the proposed scheme achieves significant speedup on both GPU and FPGA, and has comparable RD performance with respect to the sequential fast ME algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
10. Conceptual Design of 3-D FDTD Dedicated Computer With Dataflow Architecture for High Performance Microwave Simulation.
- Author
- Kawaguchi, Hideki; Matsuoka, Shun-Suke
- Subjects
- CONCEPTUAL design; FINITE difference time domain method; DATA flow computing; COMPUTER architecture; MICROWAVES
- Abstract
For the practical use of microwave simulation in industry applications such as high-frequency product design, this paper presents a conceptual design of a 3-D finite-difference time-domain (FDTD) dedicated computer with a dataflow architecture, as one of the portable high-performance computing technologies. The basic concept of the dataflow architecture for the FDTD dedicated computer was already presented in 2003, for 2-D microwave simulations. The detailed design of a 3-D FDTD dataflow machine is considered in this paper. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
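For context, the computational kernel such a dedicated machine streams over the grid is the leapfrog Yee update. A minimal 1-D Python version (free space, normalized units, Courant number 1); the paper targets the much larger 3-D stencil, which this sketch only hints at.

```python
import math

# Minimal 1-D Yee FDTD kernel: each step updates E from the curl of H and
# H from the curl of E, the stencil a dataflow FDTD machine pipelines.
N, STEPS = 200, 80
ez = [0.0] * N          # electric field, sampled at integer grid points
hy = [0.0] * N          # magnetic field, sampled at half-integer points

for t in range(STEPS):
    for k in range(N - 1):                     # H update (staggered in space/time)
        hy[k] += ez[k + 1] - ez[k]
    for k in range(1, N):                      # E update
        ez[k] += hy[k] - hy[k - 1]
    ez[100] += math.exp(-((t - 30.0) / 10.0) ** 2)   # soft Gaussian source

# After 80 steps the pulse has propagated both ways from k = 100 without
# reaching the grid edges (no absorbing boundaries in this sketch).
print(max(ez), min(ez))
```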
11. Robotic Adherent Cell Injection for Characterizing Cell–Cell Communication.
- Author
- Liu, Jun; Siragam, Vinayakumar; Gong, Zheng; Chen, Jun; Fridman, Michael D.; Leung, Clement; Lu, Zhe; Ru, Changhai; Xie, Shaorong; Luo, Jun; Hamilton, Robert M.; Sun, Yu
- Subjects
- MEDICAL robotics; CELL membranes; MUSCLE cells; CELL lines; BIOLOGICAL membranes
- Abstract
Compared to robotic injection of suspended cells (e.g., embryos and oocytes), fewer attempts have been made to automate the injection of adherent cells (e.g., cancer cells and cardiomyocytes) due to their smaller size, highly irregular morphology, small thickness (a few micrometers), and large variations in thickness across cells. This paper presents a robotic system for automated microinjection of adherent cells. The system is embedded with several new capabilities: automatically locating micropipette tips; robustly detecting the contact of the micropipette tip with the cell-culturing surface and directly with the cell membrane; and precisely compensating for accumulative positioning errors. These new capabilities make it practical to perform adherent-cell microinjection truly via computer mouse clicking in front of a computer monitor, on hundreds to thousands of cells per experiment (versus the few to tens of cells of the state of the art). System operation speed, success rate, and cell viability rate were quantitatively evaluated based on robotic microinjection of over 4000 cells. This paper also reports the use of the new robotic system to perform cell–cell communication studies using large sample sizes. The gap junction function in a cardiac muscle cell line (HL-1 cells) was, for the first time, quantified with the system. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
12. Hardware-Based Trusted Computing Architectures for Isolation and Attestation.
- Author
- Maene, Pieter; Gotzfried, Johannes; de Clercq, Ruan; Muller, Tilo; Freiling, Felix; Verbauwhede, Ingrid
- Subjects
- COMPUTER architecture; INTERNET of things; EMBEDDED computer systems; COMPUTER input-output equipment; MALWARE; INDUSTRIAL controls manufacturing
- Abstract
Attackers target many different types of computer systems in use today, exploiting software vulnerabilities to take over the device and make it act maliciously. Reports of numerous attacks have been published, against the constrained embedded devices of the Internet of Things, mobile devices like smartphones and tablets, high-performance desktop and server environments, as well as complex industrial control systems. Trusted computing architectures give users and remote parties like software vendors guarantees about the behaviour of the software they run, protecting them against software-level attackers. This paper defines the security properties offered by them, and presents detailed descriptions of twelve hardware-based attestation and isolation architectures from academia and industry. We compare all twelve designs with respect to the security properties and architectural features they offer. The presented architectures have been designed for a wide range of devices, supporting different security properties. [ABSTRACT FROM PUBLISHER]
- Published
- 2018
- Full Text
- View/download PDF
13. A Fully Pipelined Hardware Architecture for Intra Prediction of HEVC.
- Author
- Min, Biao; Xu, Zhe; Cheung, Ray C. C.
- Subjects
- FIELD programmable gate arrays; COMPUTERS; COMPUTER architecture; OPTICAL resolution; VIDEO coding
- Abstract
Ultrahigh definition (UHD), such as 4K/8K, is becoming the mainstream video resolution. High Efficiency Video Coding (HEVC) is the emerging video coding standard for encoding and decoding UHD video. This paper first develops multiple techniques that allow the proposed hardware architecture for HEVC intra prediction to work in full pipeline. The proposed techniques include: 1) a novel buffer structure for reference samples; 2) a mode-dependent scanning order; and 3) an inverse method for reference sample extension. The buffer size is 3 Kb for the luma component and 3 Kb for the chroma components, providing sufficient access to the reference samples. Since the data dependency between two neighboring blocks is addressed by the mode-dependent scanning order, the proposed fully pipelined design can produce 4 pixels per clock cycle. As a result, the throughput of the proposed architecture is sufficient to support 3840 × 2160 video at 30 frames/s. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
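A quick back-of-envelope check of the throughput claim in this record, counting luma samples only; the paper's exact pixel accounting (and chroma handling) may differ.

```python
# At 4 pixels/cycle, 3840x2160 @ 30 frames/s needs only a modest clock.
# Luma only; 4:2:0 chroma adds 50% more samples through the same path.
width, height, fps, pixels_per_cycle = 3840, 2160, 30, 4
luma_rate = width * height * fps                      # 248,832,000 pixels/s
clock_mhz = luma_rate / pixels_per_cycle / 1e6
print(f"required clock ~ {clock_mhz:.1f} MHz")        # ~62.2 MHz, easily achievable
```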
14. HRT-PLRU: A New Paging Scheme for Executing Hard Real-Time Programs on NAND Flash Memory.
- Author
- We, Kyoung-Soo; Lee, Chang-Gun; Yi, Kyongsu; Lin, Kwei-Jay; Lee, Yun Sang
- Subjects
- REAL-time computing; COMPUTER software execution; NAND gates; FLASH memory; FEATURE extraction; EMBEDDED computer systems; RANDOM access memory
- Abstract
With the advanced features of next-generation vehicles, the real-time programs in automotive embedded systems are growing dramatically. For such large-volume program codes, this paper proposes a novel framework that uses high-density, low-cost nonvolatile memory, i.e., NAND flash memory, as a low-cost means of storing and executing hard real-time programs. One challenge is that NAND flash memory allows only 2-KB page-based read operations, not per-byte random accesses, which requires RAM as working storage for code execution. This paper proposes two solutions, a partitioned-RAM solution and a shared-RAM solution, that minimize the RAM size required to deterministically guarantee the deadlines of all the hard real-time tasks. The proposed solutions are verified with actual real-time programs for unmanned autonomous driving. To the best of our knowledge, this is the first work that allows NAND flash memory to be used for hard real-time program execution with minimal usage of RAM. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
15. Universal Hardware for Systems With Acceptable Representations as Low Order Polynomials.
- Author
- Burg, Ariel; Keren, Osnat
- Subjects
- COMPUTERS; POLYNOMIALS; ALGEBRA; COEFFICIENTS (Statistics); MATHEMATICAL variables
- Abstract
This paper presents a novel hardware architecture for adaptive systems whose exact specification is unknown. The architecture is suitable for linear and nonlinear systems whose inputs are real or complex signals (variables) and that have an acceptable representation as low-order polynomials in these variables. The implementation is based on using an a priori selected subset of Walsh spectral coefficients. The proposed architecture can acquire its target functionality and adapt itself to changing environments even if the number of variables, their order, and their precision are unknown in advance. This is in contrast to conventional multiply-and-accumulate (MAC) based architectures, where this information must be determined before the design and implementation of the system. In this context (of systems whose functionality is unknown), the delay and the implementation cost of the proposed architecture are significantly lower than those of MAC-based solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
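To illustrate the underlying mathematics of the record above: truncating a Walsh-Hadamard spectrum to indices of low Hamming weight is exactly a low-order polynomial approximation over binary inputs. A small Python sketch; the sample function and the kept order are illustrative, and the paper's hardware selects its coefficient subset a priori rather than computing a full transform.

```python
# Approximating a function by a subset of its Walsh spectral coefficients.
def fwht(a):
    """Fast Walsh-Hadamard transform; len(a) must be a power of two."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def approximate(samples, max_order):
    """Zero Walsh coefficients whose index has Hamming weight > max_order."""
    n = len(samples)
    spectrum = fwht(samples)
    kept = [c if bin(k).count("1") <= max_order else 0
            for k, c in enumerate(spectrum)]
    return [c / n for c in fwht(kept)]   # inverse WHT = forward WHT scaled by 1/n

f = [0, 1, 1, 3, 1, 3, 3, 5]             # some function sampled on {0,1}^3
print(approximate(f, max_order=2))       # close to f: only the order-3 term is dropped
```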
16. Underdesigned and Opportunistic Computing in Presence of Hardware Variability.
- Author
- Gupta, Puneet; Agarwal, Yuvraj; Dolecek, Lara; Dutt, Nikil; Gupta, Rajesh K.; Kumar, Rakesh; Mitra, Subhasish; Nicolau, Alexandru; Rosing, Tajana Simunic; Srivastava, Mani B.; Swanson, Steven; Sylvester, Dennis
- Subjects
- MICROELECTRONICS; ENERGY consumption; ELECTRONIC systems; MINIATURE electronic equipment; INFORMATION technology; COMPUTERS
- Abstract
Microelectronic circuits exhibit increasing variations in performance, power consumption, and reliability parameters across the manufactured parts and across use of these parts over time in the field. These variations have led to increasing use of overdesign and guardbands in design and test to ensure yield and reliability with respect to a rigid set of datasheet specifications. This paper explores the possibility of constructing computing machines that purposely expose hardware variations to various layers of the system stack including software. This leads to the vision of underdesigned hardware that utilizes a software stack that opportunistically adapts to a sensed or modeled hardware. The envisioned underdesigned and opportunistic computing (UnO) machines face a number of challenges related to the sensing infrastructure and software interfaces that can effectively utilize the sensory data. In this paper, we outline specific sensing mechanisms that we have developed and their potential use in building UnO machines. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
17. Supporting Undergraduate Computer Architecture Students Using a Visual MIPS64 CPU Simulator.
- Author
- Patti, Davide; Spadaccini, Andrea; Palesi, Maurizio; Fazzino, Fabrizio; Catania, Vincenzo
- Subjects
- ENGINEERING education; UNDERGRADUATES; CENTRAL processing units; COMPUTER architecture; DATA pipelining; USER interfaces; COMPUTER programming; SIMULATION methods & models
- Abstract
The topics of computer architecture are commonly taught using an assembly dialect as an example. The most commonly used textbooks in this field use the MIPS64 Instruction Set Architecture (ISA) to help students learn the fundamentals of computer architecture because of its orthogonality and its suitability for real-world applications. This paper shows how to use the EduMIPS64 visual CPU simulator as a supporting tool for teaching the standard topics covered by an undergraduate course in computer architecture. The proposed approach is first compared to other similar works in the field; then, after a short description of the simulator, the paper focuses on how it can be used for teaching specific topics in an undergraduate computer architecture course. This discussion is followed by a quantitative assessment of the suitability of the simulator by means of a survey completed by the students themselves; the results show that EduMIPS64 is suitable for the purpose for which it was built—that is, supporting the learning process of computer architecture topics. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
18. VisoMT: A Collaborative Multithreading Multicore Processor for Multimedia Applications with a Fast Data Switching Mechanism.
- Author
- Wei-Chun Ku; Shu-Hsuan Chou; Jui-Chin Chu; Chi-Lin Liu; Tien-Fu Chen; Jiun-In Guo; Jinn-Shyan Wang
- Subjects
- SWITCHING circuits; DIGITAL signal processing; INTEGRATED circuits; COMPUTER storage devices; DIGITAL communications; PARALLEL computers; COMPUTERS
- Abstract
Multithreading and multicore processing are powerful ways to take advantage of parallelism in applications in order to boost a system's performance. However, exploiting sufficient parallelism and achieving data locality with low communication overhead are still important research issues in embedded multithreading/multicore design. This paper introduces the design of a fast data-switching mechanism between multilevel storage structures in a new multicore architecture. This paper makes several contributions to the development of contemporary sophisticated multimedia applications with advanced standards such as H.264. The first contribution, collaborative multithreading, tightly unifies a reduced instruction set computer and a collaborative multithreading digital signal processor (DSP) in order to exploit high parallelism and provide sufficient computing power to applications. Each collaborative thread of our DSP is constructed by a heterogeneous simultaneous-multithreading single-instruction, multiple-data structure and four media processing cores, connected by a fast switch that provides a fast data exchange mechanism among correlative streams on a thread-level basis. Our second contribution is one-stop streaming processing, which aims to keep data in the system for as long as possible until it is no longer needed, thus making data access more efficient. Our third contribution is a chunk threading programming model, including a thread management library and threading communication directives for reducing data communication and synchronization overhead. By combining coarse-grained and fine-grained threading, programmers can choose various threading levels based on the amount of data exchange in a program. With our proposed techniques and an appropriate programming model, we can reduce processing time by 54.9% in H.264 video encoding (common intermediate format video at 16.574 frames/s) with the 1-virtual independent and streaming processing by open collaborative multithreading configuration, compared to the Texas Instruments C62 core, which has 8 function units. We realized our design as a prototype by chip implementation and fabricated it as a chip in the Taiwan Semiconductor Manufacturing Company Ltd. 0.13 μm process. The die size of the processor core is 16.12 mm², including 414k logic transistors and 34.4 kB of on-chip static random access memory. The processor runs at 180 MHz/1.2 V and consumes 245 mW according to post-simulation results. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
19. NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip.
- Author
- Bertozzi, Davide; Jalabert, Antoine; Murali, Srinivasan; Tamhankar, Rutuparna; Stergiou, Stergios; Benini, Luca; De Micheli, Giovanni
- Subjects
- COMPUTER architecture; INTEGRATED circuits industry; SWITCHING circuits; ELECTRONIC systems; TOPOLOGY; COMPUTERS
- Abstract
The growing complexity of customizable single-chip multiprocessors requires communication resources that can only be provided by a highly scalable communication infrastructure. This trend is exemplified by the growing number of Network-on-Chip (NoC) architectures that have been proposed recently for System-on-Chip (SoC) integration. Developing NoC-based systems tailored to a particular application domain is crucial for achieving high-performance, energy-efficient customized solutions. The effectiveness of this approach largely depends on the availability of an ad hoc design methodology that, starting from a high-level application specification, derives an optimized NoC configuration with respect to different design objectives and instantiates the selected application-specific on-chip micronetwork. Automatic execution of these design steps is highly desirable to increase SoC design productivity. This paper illustrates a complete synthesis flow, called NetChip, for customized NoC architectures, which partitions the development work into major steps (topology mapping, selection, and generation) and provides proper tools for their automatic execution (SUNMAP, xpipesCompiler). The entire flow leverages the flexibility of a fully reusable and scalable network components library called xpipes, consisting of highly parameterizable network building blocks (network interface, switches, switch-to-switch links) that are design-time tunable and composable to achieve arbitrary topologies and customized domain-specific NoC architectures. Several experimental case studies are presented in the paper, showing the powerful design space exploration capabilities of the proposed methodology and tools. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
20. Custom Wide Counterflow Pipelines for High-Performance Embedded Applications.
- Author
- Childers, Bruce R.; Davidson, Jack W.
- Subjects
- EMBEDDED computer systems; COMPUTERS; APPLICATION software; APPLICATION-specific instruction-set processors; COMPUTER architecture; HIGH performance computing
- Abstract
Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing applications (e.g., digital cameras, color printers, cellular phones, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar originally proposed a processor organization called the counterflow pipeline (CFP) as a general-purpose architecture. We observed that the CFP is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. This paper describes a new CFP architecture, called the wide counterflow pipeline (WCFP), that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This work presents a novel and practical application of the CFP to automatic and quick turnaround design of ASIPs. The paper introduces the WCFP architecture and describes several microarchitecture capabilities needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to four times better than that of ASIPs based on the CFP. Using an analytic cost model, we show that custom WCFPs do not unduly increase the cost of the original counterflow pipeline architecture, yet they retain the simplicity of the CFP. We also compare custom WCFPs to custom VLIW architectures and demonstrate that the WCFP is performance competitive with traditional VLIWs without requiring complicated global interconnection of functional devices. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
21. Multifunction Residue Architectures for Cryptography.
- Author
- Schinianakis, Dimitrios; Stouraitis, Thanos
- Subjects
- CRYPTOGRAPHY; POLYNOMIALS; COMPUTER arithmetic; COMPUTER architecture; COMPUTER algorithms; COMPUTER input-output equipment
- Abstract
A design methodology for incorporating Residue Number System (RNS) and Polynomial Residue Number System (PRNS) in Montgomery modular multiplication in GF(p) or GF(2^n) respectively, as well as a VLSI architecture of a dual-field residue arithmetic Montgomery multiplier are presented in this paper. An analysis of input/output conversions to/from residue representation, along with the proposed residue Montgomery multiplication algorithm, reveals common multiply-accumulate data paths both between the converters and between the two residue representations. A versatile architecture is derived that supports all operations of Montgomery multiplication in GF(p) and GF(2^n), input/output conversions, Mixed Radix Conversion (MRC) for integers and polynomials, dual-field modular exponentiation and inversion in the same hardware. Detailed comparisons with state-of-the-art implementations prove the potential of residue arithmetic exploitation in dual-field modular multiplication. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
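For reference, the integer Montgomery reduction (REDC) at the heart of such multipliers, in a minimal Python sketch. The paper's contribution is performing this in RNS/PRNS form for both GF(p) and GF(2^n), which this plain-integer sketch does not capture; the modulus and radix below are arbitrary illustrative choices.

```python
# Minimal integer Montgomery multiplication (REDC). Requires Python 3.8+
# for pow(n, -1, r); a word-serial hardware datapath is omitted.

def montgomery_setup(n, r_bits):
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r        # n' = -n^{-1} mod R
    return r, n_prime

def redc(t, n, r_bits, r, n_prime):
    """Given t < R*n, return t * R^{-1} mod n without dividing by n."""
    m = (t * n_prime) & (r - 1)           # m = t * n' mod R
    u = (t + m * n) >> r_bits             # exact: t + m*n is divisible by R
    return u - n if u >= n else u

n, r_bits = 2946901, 22                   # odd modulus, R = 2^22 > n
r, n_prime = montgomery_setup(n, r_bits)
a, b = 123456, 654321
a_mont = (a * r) % n                      # convert inputs to Montgomery form
b_mont = (b * r) % n
c_mont = redc(a_mont * b_mont, n, r_bits, r, n_prime)
assert redc(c_mont, n, r_bits, r, n_prime) == (a * b) % n   # convert back, check
print((a * b) % n)
```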
22. Parallel Architectures for Learning the RTRN and Elman Dynamic Neural Networks.
- Author
- Bilski, Jaroslaw; Smolag, Jacek
- Subjects
- PARALLEL processing; ARTIFICIAL neural networks; MACRO processors; COMPUTERS; COMPUTER networks
- Abstract
A major problem encountered by researchers of dynamic neural networks is the computational complexity that increases learning time. In this paper, parallel realizations of the RTRN and Elman networks are discussed. Both networks are examples of dynamic neural networks. The inherent parallelism of dynamic neural networks has been employed to accelerate the learning process. The proposed solution is based on a highly parallel three-dimensional architecture to speed up learning performance. The presented structures are suitable for efficient parallel realization in digital hardware or vector processors. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
23. Implementation of the Database Machine DIRECT.
- Author
- Boral, Haran; DeWitt, David J.; Friedland, Dina; Jarrell, Nancy F.; Wilkinson, W. Kevin
- Subjects
- DATABASES; ELECTRONIC systems; MULTIPROCESSORS; COMPUTERS; COMPUTER software; SOFTWARE engineering; COMPUTER systems
- Abstract
DIRECT is a multiprocessor database machine designed and implemented at the University of Wisconsin. This paper describes our experiences with the implementation of DIRECT. We start with a brief overview of the original machine proposal and how it differs from what was actually implemented. We then describe the structure of the DIRECT software. This includes software on host computers that interfaces with the database machine; software on the back-end controller of DIRECT; and software executed by the query processors. In addition to describing the structure of the software we will attempt to motivate and justify its design and implementation. We also discuss a number of implementation issues (e.g., debugging of the code across several machines). We conclude the paper with a list of the "lessons" we have learned from this experience. [ABSTRACT FROM AUTHOR]
- Published
- 1982
24. Enhanced Scaling-Free CORDIC.
- Author
- Jaime, Francisco J.; Sánchez, Miguel A.; Hormigo, Javier; Villalba, Julio; Zapata, Emilio L.
- Subjects
- COMPUTERS; DIGITAL electronics; ALGORITHMS; COMPUTER architecture; WIRELESS communications
- Abstract
The COordinate Rotation DIgital Computer (CORDIC) rotator is a well-known and widely used algorithm, owing to the way it carries out calculations such as trigonometric functions. The scale-factor compensation inherent to the CORDIC algorithm becomes an important drawback when trying to improve its benefits, although some authors have come up with a scaling-free version, which has been successfully implemented in wireless applications. However, this scaling-free CORDIC can still be significantly improved by modifying some of its parts; therefore, this paper presents an enhanced version of the scaling-free CORDIC. These enhancements have been implemented and tested, yielding new architectures that achieve 35% lower latency and a 36% reduction in area and power consumption compared to the original scaling-free architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
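For reference, the conventional rotation-mode CORDIC with its explicit scale-factor compensation K, i.e., the overhead that scaling-free formulations remove. A float-arithmetic Python sketch of the fixed-point iteration:

```python
import math

# Rotation-mode CORDIC: shift-and-add iterations plus a final multiply by K.
ITERS = 24
ALPHAS = [math.atan(2.0 ** -i) for i in range(ITERS)]
K = 1.0
for i in range(ITERS):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))     # overall gain compensation

def cordic_sin_cos(theta):
    """theta in radians, |theta| <= ~1.74 (CORDIC convergence range)."""
    x, y, z = 1.0, 0.0, theta
    for i in range(ITERS):
        d = 1.0 if z >= 0.0 else -1.0         # rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ALPHAS[i]
    return x * K, y * K                       # ~ (cos(theta), sin(theta))

c, s = cordic_sin_cos(0.7)
print(c - math.cos(0.7), s - math.sin(0.7))   # errors on the order of 2^-24
```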
25. Statistical Performance Comparisons of Computers.
- Author
- Chen, Tianshi; Guo, Qi; Temam, Olivier; Wu, Yue; Bao, Yungang; Xu, Zhiwei; Chen, Yunji
- Subjects
- COMPUTER performance; COMPARATIVE studies; COMPUTER architecture; DISTRIBUTION (Probability theory); RELIABILITY in engineering
- Abstract
As a fundamental task in computer architecture research, performance comparison has been continuously hampered by the variability of computer performance. In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance observations are compared regardless of the variability), or in the few cases directly addressed with t-statistics […] to 56.3 percent on SPEC CPU2006 or SPEC MPI2007, which demonstrates the necessity of using appropriate statistical techniques. This HPT framework has been implemented as open-source software and integrated into the PARSEC 3.0 benchmark suite. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
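The pitfall this record addresses can be made concrete with a paired t-statistic over repeated benchmark runs, rather than a bare comparison of means. Note the paper argues for going beyond exactly this kind of parametric test (its HPT framework is distribution-free); the stdlib-only Python sketch below, with invented run times, only shows why variability must enter the comparison at all.

```python
import math
import statistics

# Paired t-statistic over repeated runs of one benchmark on computers A and B.
# 2.262 is the two-sided 95% Student-t critical value for n-1 = 9 degrees of
# freedom; the run times are made up for illustration.
runs_a = [10.2, 10.5, 9.9, 10.1, 10.4, 10.3, 10.0, 10.6, 10.2, 10.3]  # seconds
runs_b = [9.8, 10.4, 9.7, 10.0, 10.1, 9.9, 9.6, 10.2, 10.0, 9.9]

diffs = [a - b for a, b in zip(runs_a, runs_b)]
n = len(diffs)
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
print(f"t = {t:.2f}; significant at 95%: {abs(t) > 2.262}")
```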
26. Inside the Virtual Robotics Challenge: Simulating Real-Time Robotic Disaster Response.
- Author
- Aguero, Carlos E.; Koenig, Nate; Chen, Ian; Boyer, Hugo; Peters, Steven; Hsu, John; Gerkey, Brian; Paepcke, Steffi; Rivero, Jose L.; Manzo, Justin; Krotkov, Eric; Pratt, Gill
- Subjects
- VIRTUAL reality; ROBOT control systems; COMPUTER software; SIMULATION methods & models; DEGREES of freedom
- Abstract
This paper presents the software framework established to facilitate cloud-hosted robot simulation. The framework addresses the challenges associated with conducting a task-oriented and real-time robot competition, the Defense Advanced Research Projects Agency (DARPA) Virtual Robotics Challenge (VRC), designed to mimic reality. The core of the framework is the Gazebo simulator, a platform to simulate robots, objects, and environments, as well as the enhancements made for the VRC to maintain a high fidelity simulation using a high degree of freedom and multisensor robot. The other major component used is the CloudSim tool, designed to enhance the automation of robotics simulation using existing cloud technologies. The results from the VRC and a discussion are also detailed in this work. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
27. Approaches and Tools Used to Teach the Computer Input/Output Subsystem: A Survey.
- Author
- Larraza-Mendiluze, Edurne; Garay-Vitoria, Nestor
- Subjects
- COMPUTER input-output equipment; COMPUTER surveys; UNDERGRADUATE programs; CURRICULUM; EDUCATION research; COMPUTER architecture; COMPUTER programming
- Abstract
This paper surveys how the computer input/output (I/O) subsystem is taught in introductory undergraduate courses. It is important to study the educational process of the computer I/O subsystem because, in the curricula recommendations, it is considered a core topic in the area of knowledge of computer architecture and organization (CAO). It is also a basic knowledge to be acquired in order to work in areas such as human–computer interaction (HCI) or embedded systems. Examination questions, course syllabi, and textbooks were analyzed to identify which teaching approaches are being used. Individuals teaching the I/O subsystem could choose between the options explained here, according to their intended learning outcomes. In addition, a literature survey was conducted on the development and use of tools to improve student understanding of I/O and to make the topic less abstract and more attractive. A goal is to indicate to computing education researchers that the majority of the literature reports experiences in developing or using different resources or educational methodologies, but that these are not based on a theory of learning. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
28. A computation and energy reduction technique for HEVC intra mode decision.
- Author
- Ozcan, Erdem; Kalali, Ercan; Adibelli, Yusuf; Hamzaoglu, Ilker
- Subjects
- VIDEO coding; VIDEO codecs; CODING theory; BIT rate; COMPUTERS
- Abstract
The High Efficiency Video Coding (HEVC) intra mode decision algorithm has very high computational complexity. Therefore, in this paper, a computation and energy reduction technique is proposed for reducing the amount of computation performed by Sum of Absolute Transformed Difference (SATD) calculations in HEVC intra mode decision, and thereby reducing the energy consumption of HEVC SATD calculation hardware without any PSNR loss or bit rate increase. The proposed technique reduces the energy consumption of HEVC SATD calculation hardware by up to 64.6%. It can therefore be used in portable consumer electronics products that require a real-time HEVC encoder. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
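For context, the SATD kernel whose hardware cost the technique above reduces: transform the residual block with a 4×4 Hadamard matrix and sum the absolute coefficients. A Python sketch; the final halving follows a common reference-software normalization, and conventions vary.

```python
# 4x4 Hadamard SATD on a residual block (Sylvester-ordered Hadamard matrix).
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(cur, ref):
    diff = [[cur[i][j] - ref[i][j] for j in range(4)] for i in range(4)]
    t = matmul(matmul(H, diff), H)          # H is symmetric, so H == H^T
    return sum(abs(v) for row in t for v in row) // 2

cur = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 61, 68, 104]]
ref = [[60, 60, 60, 60]] * 4
print(satd4x4(cur, ref))
```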
29. Facilitating Remote Laboratory Deployments Using a Relay Gateway Server Architecture.
- Author
- Melkonyan, Arsen; Gampe, Andreas; Pontual, Murillo; Huang, Grant; Akopian, David
- Subjects
- LABORATORIES; INTERNET in education; CLIENT/SERVER computing; COMPUTER network architectures; ENGINEERING laboratories; EDUCATIONAL evaluation; COMPUTER network resources
- Abstract
Hands-on experiments prepare students to deal with real-world problems and help to efficiently digest theoretical concepts and relate those to practical tasks. However, shortage of equipment, high costs, and the lack of human resources for laboratory maintenance and assistance decrease the implementation capacity of the hands-on training laboratories. At the same time, the Internet has become a common networking medium and is increasingly used to enhance education and training. In addition, experimental equipment at many sites is typically underutilized. Thus, remote laboratories accessible through the Internet can resolve cost and access constraints as they can be used at flexible times and from various locations. While many solutions have been proposed so far, this paper addresses an important issue of facilitating remote lab deployments by providing remote connectivity services to lab providers using a Relay Gateway Server architecture. A proof-of-concept solution is described which also includes other previously reported useful features. The system has been tested in engineering labs and student assessment is provided. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
30. Compiler-Directed Energy Reduction Using Dynamic Voltage Scaling and Voltage Islands for Embedded Systems.
- Author
- Ozturk, Ozcan; Kandemir, Mahmut; Chen, Guangyu
- Subjects
- EMBEDDED computer systems; COMPUTERS; SMART devices; MULTIPROCESSORS; ENERGY consumption; ENERGY management; ENERGY shortages
- Abstract
Addressing power and energy consumption related issues early in the system design flow ensures good design and minimizes iterations for faster turnaround time. In particular, optimizations at the software level, e.g., those supported by compilers, are very important for minimizing the energy consumption of embedded applications. Recent research demonstrates that voltage islands provide the flexibility to reduce power by selectively shutting down different regions of the chip and/or running select parts of the chip at different voltage/frequency levels. In contrast to most of the prior work on voltage islands, which mainly focused on architecture design and IP placement related issues, this paper studies the necessary software compiler support for voltage islands. Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping correlating multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide range of values of our major simulation parameters. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
31. A Project-Based Learning Approach to Programmable Logic Design and Computer Architecture.
- Author
- Kellett, Christopher M.
- Subjects
- ENGINEERING education; COMPUTER architecture; COMPUTER programming; PROGRAMMING languages; PROJECT method in teaching; FIELD programmable gate arrays; MICROPROCESSOR design & construction
- Abstract
This paper describes a course in programmable logic design and computer architecture as it is taught at the University of Newcastle, Australia. The course is designed around a major design project and has two supplemental assessment tasks that are also described. The context of the Computer Engineering degree program within which the course is taught is presented, and some student outcomes are discussed. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
32. A CMOS Low-Power Digital Polar Modulator System Integration for WCDMA Transmitter.
- Author
- Jung, In-Seok; Kim, Yong-Bin
- Subjects
- COMPLEMENTARY metal oxide semiconductors; SYSTEM integration; COMPUTERS; COMPUTER algorithms; PERFORMANCE evaluation; ENERGY consumption
- Abstract
This paper presents a novel low-power design and highly cost-effective chip implementation of a digital polar modulator for WCDMA transmitters using 0.35-μm mixed-mode CMOS technology. The proposed coordinate rotation digital computer (CORDIC) in the polar modulator converts rectangular coordinates to polar coordinates with significantly less hardware and power than the existing computation-intensive algorithms, employing a hard-wired pipeline strategy to increase performance and reduce hardware size. The proposed CORDIC performs a sequence of elementary rotations using shift and add operations without multiplications, providing a highly cost-effective solution. The separate distribution of angle constants to each adder permits a hard-wired solution instead of a lookup table, and all the shifters are hard-wired. Linear interpolators that extend the sampling rate to the WCDMA specification are used to decrease the operating frequency. The proposed approach reduces both size and power by integrating both the CORDIC and the power amplifier on the same die. The measured average power consumption is 27 mW with a 67-MHz clock and a 3-V power supply. [ABSTRACT FROM PUBLISHER]
- Published
- 2011
- Full Text
- View/download PDF
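The rectangular-to-polar conversion this modulator performs maps naturally onto vectoring-mode CORDIC: shift-add iterations drive y to zero while accumulating the angle. A float-model Python sketch; the actual design is fixed-point and pipelined, and this sketch assumes x > 0.

```python
import math

# Vectoring-mode CORDIC: (x, y) -> (magnitude, phase) with shifts and adds.
ITERS = 24
ALPHAS = [math.atan(2.0 ** -i) for i in range(ITERS)]
K = 1.0
for i in range(ITERS):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))     # compensates the iteration gain

def rect_to_polar(x, y):
    z = 0.0
    for i in range(ITERS):
        d = 1.0 if y < 0.0 else -1.0          # rotate so y is driven to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ALPHAS[i]
    return x * K, z                            # ~ (sqrt(x^2+y^2), atan2(y, x))

mag, ang = rect_to_polar(3.0, 4.0)
print(mag - 5.0, ang - math.atan2(4.0, 3.0))   # both near zero
```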
33. Use of a New Moodle Module for Improving the Teaching of a Basic Course on Computer Architecture.
- Author
- Trenas, María A.; Ramos, Julián; Gutierrez, Eladio D.; Romero, Sergio; Corbera, Francisco
- Subjects
- COMPUTER architecture; VHDL (Computer hardware description language); COMPUTERS; COMPUTER simulation; AUTOMATION; PROGRAMMING languages; TEACHERS; EDUCATION
- Abstract
This paper describes how a new Moodle module, called CTPracticals, is applied to the teaching of the practical content of a basic computer organization course. In the core of the module, an automatic verification engine enables it to process the VHDL designs automatically as they are submitted. Moreover, a straightforward modification of this engine would make it possible to extend its application to other programming languages. The module provides students with real-time knowledge of the state of their work by their accessing the result of the automatic assessment or feedback messages. Teachers have a constant global view of the status of their class and have available multiple options such as sending feedback messages to students, obtaining statistics, launching additional verifications in batch, and so on. Likewise, the module substantially improves some organizational aspects, and its design may help teachers to encourage teamwork. Its use partially frees teachers from certain routine work, saving time that can be devoted to teaching objectives and tutoring activities. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
34. Enhancement of Student Learning Through the Use of a Hinting Computer e-Learning System and Comparison With Human Teachers.
- Author
- Munoz-Merino, Pedro J.; Kloos, Carlos Delgado; Munoz-Organero, Mario
- Subjects
- MOBILE learning; COMPUTER assisted instruction; INTERNET in education; COMPARATIVE studies; COMPUTER architecture; ENGINEERING students; ENGINEERING teachers
- Abstract
This paper reports the results of an experiment in a Computer Architecture Laboratory course classroom session, in which students were divided into two groups for interaction both with a hinting e-learning system and with human teachers generating hints. The results show that there were high learning gains for both groups, demonstrating the effectiveness of the human teachers as well as of the computer-based hinting e-learning system even without the use of adaptive and personalization capabilities. In addition, in the worst case, the difference in favor of human teachers (with a low student-to-teacher ratio of 13.5 students per teacher) would not be significant with respect to the e-learning system, so the computer-based system can replace teachers without a significant loss of effectiveness. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
35. A Configurable Heterogeneous Multicore Architecture with Cellular Neural Network for Real-Time Object Recognition.
- Author
- Kwanho Kim; Seungjin Lee; Joo-Young Kim; Minsu Kim; Hoi-Jun Yoo
- Subjects
- COMPUTER architecture; SYSTEMS development; REAL-time control; NEURAL computers; COMPUTERS; ARTIFICIAL intelligence
- Abstract
As object recognition requires huge computation power to deal with complex image processing tasks, it is very challenging to meet real-time processing demands under low-power constraints for embedded systems. In this paper, a configurable heterogeneous multicore architecture with a dual-mode linear processor array and a cellular neural network on the network-on-chip platform is presented for real-time object recognition. The bio-inspired attention-based object recognition algorithm is devised to reduce computational complexity of the object recognition. The cellular neural network is utilized to accelerate the visual attention algorithm for selecting salient image regions rapidly. The dual-mode parallel processor is configured into single-instruction, multiple-data (SIMD) or multiple-instruction, multiple-data modes to perform data-intensive image processing operations while exploiting pixel-level and feature-level parallelisms required for the attention-based object recognition. The algorithm's hybrid parallelization strategy on the proposed architecture is adopted to obtain maximum performance improvement. The performance analysis results, using a cycle-accurate architecture simulator, show that the proposed architecture achieves a speedup of 2.8 times for the target algorithm over conventional massively parallel SIMD architecture at low hardware cost overhead. A prototype chip of the proposed architecture, fabricated in 0.13 μm complementary metal-oxide-semiconductor technology, achieves 22 frames/s real-time object recognition with less than 600 mW power consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
36. Embedded System Architecture for an WLAN-based Dual Mode Mobile Phone.
- Author
- Sung-Bong Jang; Young-Gab Kim; Hong-Seok Na; Doo-Kwon Baik
- Subjects
- EMBEDDED computer systems; COMPUTERS; COMPUTER architecture; WIRELESS LANs; CELL phones; INTERNET telephony; TELEPHONE systems; WIRELESS communications
- Abstract
This paper presents a new embedded system architecture (ESA) for improving voice quality in a WLAN/cellular dual-mode mobile phone. The proposed architecture is based on a dual-core scheme, and its main functional blocks comprise VoIP Remote Procedure Call (VRPC), an audio bridging scheme, and a Server-Assisted Call Management (SACM) algorithm. To illustrate the aims of the proposed approach, prototype systems are implemented and evaluated by measuring the average Mean Opinion Score (MOS) and Mouth-To-Ear (M2E) delay. The experimental results show that the proposed approach greatly improves voice quality. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
37. Efficient VLSI Architecture for Video Transcoding.
- Author
- Jian Huang; Jooheung Lee
- Subjects
- COMPUTER architecture; COMPUTERS; HIGH performance processors; ARRAY processors; VERY large scale circuit integration; COMPUTER systems
- Abstract
In this paper, we present a unified architecture that can perform the Discrete Cosine Transform (DCT), Inverse Discrete Cosine Transform (IDCT), and DCT-domain motion estimation and compensation (DCT-ME/MC). Our proposed architecture is a wavefront-array-based processor with a highly modular structure consisting of 8×8 Processing Elements (PEs). By utilizing statistical properties and arithmetic operations, it can be used as a high-performance hardware accelerator for video transcoding applications. We show how different core algorithms can be mapped onto the same hardware fabric and executed through the predefined PEs. In addition to the simplified design process of the proposed architecture and savings in hardware resources, we also demonstrate that a high throughput rate can be achieved for IDCT and DCT-MC by fully utilizing the sparseness property of the DCT coefficient matrix. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
38. A Floating-Point Unit for 4D Vector Inner Product with Reduced Latency.
- Author
- Donghyun Kim; Lee-Sup Kim
- Subjects
- COMPUTER graphics; FLOATING-point arithmetic; THREE-dimensional display systems; COMPUTER architecture; COMPUTERS; SYSTEMS development
- Abstract
This paper presents the algorithm and implementation of a new high-performance functional unit for the floating-point four-dimensional vector inner product (4D dot product; DP4), which is the most frequently performed operation in 3D graphics applications. The proposed IEEE-compliant DP4 unit computes Z = AB + CD + EF + GH in one pass and keeps the intermediate rounding compliant with IEEE-754 round-to-nearest-even. The intermediate rounding is merged with the shift alignment, and the intermediate carry-propagated addition and normalization are omitted to reduce latency in the proposed architecture. The proposed DP4 unit is implemented in 0.18-µm CMOS technology and has a 12.8-ns critical path delay, a 45.5 percent reduction compared to a previous DP4 implementation using discrete multipliers and adders. The proposed DP4 unit also reduces the cycle time of 3D graphics applications by 12.4 percent on average compared to a typical 3D graphics FPU based on four-way multiply-add-fused units. [ABSTRACT FROM AUTHOR] (See the illustrative code sketch after this entry.)
- Published
- 2009
- Full Text
- View/download PDF
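For entry 38, the operation itself is easy to state in software; the contribution is how the hardware evaluates it. The sketch below shows only the reference arithmetic: a discrete-unit path that rounds after every multiply and add (the result behaviour the proposed unit is designed to stay compatible with), plus a wider-accumulator variant that is a common software shortcut and is not what the paper's datapath does. Function names are illustrative assumptions.

/* Reference semantics of the DP4 operation Z = AB + CD + EF + GH. */

/* Discrete path: four FP multipliers and three adders, with rounding
 * to float (round-to-nearest-even by default) after every step.  This
 * models the behaviour the paper's fused one-pass unit preserves. */
float dp4_discrete(float a, float b, float c, float d,
                   float e, float f, float g, float h)
{
    float ab = a * b, cd = c * d, ef = e * f, gh = g * h;
    return (ab + cd) + (ef + gh);
}

/* Software shortcut: accumulate in double and round once at the end.
 * Not bit-identical to the discrete path and not the paper's method;
 * shown only to contrast where rounding can occur. */
float dp4_wide(float a, float b, float c, float d,
               float e, float f, float g, float h)
{
    double acc = (double)a * b + (double)c * d
               + (double)e * f + (double)g * h;
    return (float)acc;
}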
39. An Improved Scaled DCT Architecture.
- Author
-
Zhigang Wu, Jin Sha, Zhongfeng Wang, Li Li, and Minglun Gao
- Subjects
- *
DISCRETE cosine transforms , *COMPUTER architecture , *COMPUTER algorithms , *PROGRAMMING languages , *APPROXIMATION theory , *COMPUTERS - Abstract
This paper presents an efficient architecture for computing the eight-point 1-D scaled DCT (Discrete Cosine Transform) with a new algorithm based on a selected Loeffler DCT scheme whose multiplications are placed in the last stage. The proposed DCT architecture does not require any scaling compensation in the computation. Furthermore, a multiplication approximation method is developed, which is more efficient than traditional CORDIC (Coordinate Rotation Digital Computer)-based algorithms. Compared to the latest work [8], the proposed approach saves 14% of the addition operations for the same precision requirement, and the path delay can be significantly reduced as well. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
40. Immunet: Dependable Routing for Interconnection Networks with Arbitrary Topology.
- Author
-
Puente, Valentin, Gregorio, José Angel, Vallejo, Fernando, and Beivide, Ramón
- Subjects
- *
COMPUTERS , *NETWORK routers , *COMPUTER networks , *TOPOLOGY , *MULTIMEDIA systems , *PARALLEL computers , *COMPUTER architecture , *COMPUTER engineering - Abstract
A complete mechanism for tolerating multiple failures in parallel computer systems, denoted as Immunet, is described in this paper. Immunet can be applied to arbitrary topologies, either regular or irregular, exhibiting in both cases graceful performance degradation. Provided that the network remains connected, Immunet is able to deal with any number of failures regardless of their spatial and temporal distributions. Our mechanism operates on the basis of a dynamic network reconfiguration in response to failures. The network reconfiguration only employs local information recorded at the router nodes, which leads to a highly scalable system. In addition, its low cost and overhead permit a practicable hardware implementation. Finally, as Immunet does not require in-flight traffic to be discarded, the parallel applications running in the system can transparently circumvent network failures. Only packets stored in or traveling through a broken component need to be recovered by higher system levels. [ABSTRACT FROM AUTHOR]
- Published
- 2008
- Full Text
- View/download PDF
41. Multicore Curve-Based Cryptoprocessor with Reconfigurable Modular Arithmetic Logic Units over GF(2^n).
- Author
-
Sakiyama, Kazuo, Batina, Lejla, Preneel, Bart, and Verbauwhede, Ingrid
- Subjects
- *
MULTIPROCESSORS , *COMPUTERS , *ELECTRONIC data processing , *COMPUTER architecture , *COMPUTER input-output equipment , *COMPUTER systems , *SYSTEMS development , *ADAPTIVE computing systems , *PUBLIC key cryptography - Abstract
This paper presents a reconfigurable curve-based cryptoprocessor that accelerates scalar multiplication of Elliptic Curve Cryptography (ECC) and HyperElliptic Curve Cryptography (HECC) of genus 2 over GF(2^n). By allocating a number, a, of processing cores that embed reconfigurable Modular Arithmetic Logic Units (MALUs) over GF(2^n), the scalar multiplication of ECC/HECC can be accelerated by exploiting Instruction-Level Parallelism (ILP). The supported field size can be arbitrary up to a(n + 1) - 1. The superscaling feature is facilitated by defining a single instruction that can be used for all field operations and point/divisor operations. In addition, the cryptoprocessor is fully programmable and can handle various curve parameters and arbitrary irreducible polynomials. The cost, performance, and security trade-offs are thoroughly discussed for different hardware configurations and software programs. Synthesis results with a 0.13-μm CMOS technology show that the proposed reconfigurable cryptoprocessor runs at 292 MHz, while field sizes of up to 587 bits can be supported. The most compact and fastest configuration of our design is also synthesized with a fixed field size and irreducible polynomial. The results show that the scalar multiplication of ECC over GF(2^163) and HECC over GF(2^83) can be performed in 29 and 63 μs, respectively. [ABSTRACT FROM AUTHOR] (See the illustrative code sketch after this entry.)
- Published
- 2007
- Full Text
- View/download PDF
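The core field operation behind entry 41 is multiplication in GF(2^n) reduced by an irreducible polynomial, which is the kind of operation a Modular Arithmetic Logic Unit iterates in hardware. The sketch below is the textbook MSB-first shift-and-XOR version for word-sized fields; the field size, the example polynomial, and the function name are assumptions for illustration (the 163-bit and 83-bit fields in the paper would need multi-word arithmetic and, in hardware, a digit-serial datapath).

#include <stdint.h>

/* Multiply a*b in GF(2^n), reducing by the irreducible polynomial whose
 * low-order terms (below x^n) are given in r.  Works for n <= 63 here. */
uint64_t gf2n_mul(uint64_t a, uint64_t b, unsigned n, uint64_t r)
{
    uint64_t mask = ((uint64_t)1 << n) - 1;
    uint64_t acc = 0;

    for (int i = (int)n - 1; i >= 0; i--) {
        /* Shift the accumulator; if it overflows degree n-1, reduce. */
        int carry = (int)((acc >> (n - 1)) & 1);
        acc = (acc << 1) & mask;
        if (carry)
            acc ^= r;
        /* Add (XOR) a when the current bit of b is set (MSB first). */
        if ((b >> i) & 1)
            acc ^= a;
    }
    return acc & mask;
}

/* Example (an assumption for illustration, not a curve field from the
 * paper): GF(2^8) with x^8 + x^4 + x^3 + x + 1, i.e. r = 0x1B, as used
 * in AES.  gf2n_mul(0x57, 0x83, 8, 0x1B) returns 0xC1. */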
42. A Case for the VMEbus Architecture in Embedded Systems Education.
- Author
-
Ricks, Kenneth G. and Jackson, David Jeff
- Subjects
- *
COMPUTER architecture , *COMPUTER engineering , *COMPUTER science , *EMBEDDED computer systems , *COMPUTERS , *VME (Computer bus) , *COMPUTER buses - Abstract
The VMEbus is an IEEE standard architecture upon which many embedded and real-time systems are built. The VMEbus architecture has existed for nearly 25 years and has been used extensively for military, industrial, and aerospace applications. This paper describes the general characteristics of the VMEbus architecture, specifically relating these characteristics to aspects of embedded systems education included as components of the IEEE/ACM CE2004 computer engineering model curriculum. Portions of this model curriculum are currently being implemented at universities across the country as part of an increasing effort to address the need for embedded systems education. This evaluation will identify the strengths and weaknesses of this architecture as a general-purpose embedded systems educational tool. The VMEbus architecture is used in the laboratory component of an undergraduate embedded systems course at the University of Alabama (UA), Tuscaloosa. The assessment results evaluating its effectiveness are presented. Index Terms: Computer architecture, computer engineering education, educational technology, embedded systems. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
43. ATLAS DataFlow: The Read-Out Subsystem, Results From Trigger and Data-Acquisition System Testbed Studies and From Modeling.
- Author
-
Vermeulen, J., Abolins, M., Alexandrov, I., Amorim, A., Dos Anjos, A., Badescu, B., Barros, N., Beck, H. P., Blair, R., Burckhart-Chromek, D., Caprini, M., Ciobotaru, M., Corso-Radu, A., Cranfield, R., Crone, G., Dawson, J., Dobinson, R., Dobson, M., Drake, G., and Ermoline, Y.
- Subjects
- *
ATLAS (Computer program language) , *DATA transmission systems , *NETWORK operating system , *COMPUTER architecture , *COMPUTERS , *DATABASE management , *DATABASE administration , *MATHEMATICAL optimization , *REAL-time computing - Abstract
In the ATLAS experiment at the LHC, the output of read-out hardware specific to each subdetector will be transmitted to buffers, located on custom-made PCI cards ("ROBINs"). The data consist of fragments of events accepted by the first-level trigger at a maximum rate of 100 kHz. Groups of four ROBINs will be hosted in about 150 Read-Out Subsystem (ROS) PCs. Event data are forwarded on request via Gigabit Ethernet links and switches to the second-level trigger or to the Event Builder. In this paper, a discussion of the functionality and real-time properties of the ROS is combined with a presentation of measurement and modelling results for a testbed with a size of about 20% of the final DAQ system. Experimental results on strategies for optimizing the system performance, such as utilization of different network architectures and network transfer protocols, are presented for the testbed, together with extrapolations to the full system. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
44. High Performance Dense Ring Generators.
- Author
-
Mrugalski, Grzegorz, Mukherjee, Nilanjan, Rajski, Janusz, and Tyszer, Jerzy
- Subjects
- *
HIGH performance computing , *PHASE shifters , *COMPUTER architecture , *INTEGRATED circuits , *COMPUTER science , *ELECTRONIC equipment , *GENERATORS (Computer programs) , *AUTOMATIC programming (Computer science) , *COMPUTERS - Abstract
This paper presents an enhanced architecture of on-chip pseudorandom test pattern generators, test data decompressors, and test response compactors based on ring generators. The new structure is aimed at improving layout and routing properties while, at the same time, reducing propagation delays introduced by associated phase shifters. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
45. Design-Level Performance Prediction of Component-Based Applications.
- Author
-
Yan Liu, Fekete, Alan, and Gorton, Ian
- Subjects
- *
CORBA (Computer architecture) , *COMPUTER architecture , *COMPUTER software development , *BENCHMARKING (Management) , *COMPUTER software , *SYSTEMS design , *DIGITAL control systems , *COMPUTERS , *COMPUTER systems - Abstract
Server-side component technologies such as Enterprise JavaBeans (EJBs), .NET, and CORBA are commonly used in enterprise applications that have requirements for high performance and scalability. When designing such applications, architects must select a suitable component technology platform and application architecture to provide the required performance. This is challenging as no methods or tools exist to predict application performance without building a significant prototype version for subsequent benchmarking. In this paper, we present an approach to predict the performance of component-based server-side applications during the design phase of software development. The approach constructs a quantitative performance model for a proposed application. The model requires inputs from an application-independent performance profile of the underlying component technology platform, and a design description of the application. The results from the model allow the architect to make early decisions between alternative application architectures in terms of their performance and scalability. We demonstrate the method using an EJB application and validate predictions from the model by implementing two different application architectures and measuring their performance on two different implementations of the EJB platform. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
46. Distributed Data Cache Designs for Clustered VLIW Processors.
- Author
-
Gibert, Enric, Sánchez, Jesús, and González, Antonio
- Subjects
- *
MICROPROCESSORS , *COMPUTERS , *COMPUTER science , *TECHNOLOGY , *COMPUTER architecture - Abstract
Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in what we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular, we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible L0 buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
47. Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures.
- Author
-
Pande, Partha Pratim, Grecu, Cristian, Jones, Michael, Ivanov, André, and Saleh, Resve
- Subjects
- *
MULTIPROCESSORS , *ENERGY dissipation , *COMPUTER architecture , *INTEGRATED circuits , *COMPUTERS , *SILICON - Abstract
Multiprocessor system-on-chip (MP-SoC) platforms are emerging as an important trend for SoC design. Power and wire design constraints are forcing the adoption of new design methodologies for system-on-chip (SoC), namely, those that incorporate modularity and explicit parallelism. To enable these MP-SoC platforms, researchers have recently pursued scalable communication-centric interconnect fabrics, such as networks-on-chip (NoC), which possess many features that are particularly attractive for these platforms. These communication-centric interconnect fabrics are characterized by different trade-offs with regard to latency, throughput, energy dissipation, and silicon area requirements. In this paper, we develop a consistent and meaningful evaluation methodology to compare the performance and characteristics of a variety of NoC architectures. We also explore design trade-offs that characterize the NoC approach and obtain comparative results for a number of common NoC topologies. To the best of our knowledge, this is the first effort in characterizing different NoC architectures with respect to their performance and design trade-offs. To further illustrate our evaluation methodology, we map a typical multiprocessing platform to different NoC interconnect architectures and show how the system performance is affected by these design trade-offs. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
48. Design of High-Performance System-On-Chips Using Communication Architecture Tuners.
- Author
-
Lahiri, Kanishka, Raghunathan, Anand, Lakshminarayana, Ganesh, and Dey, Sujit
- Subjects
- *
COMPUTER architecture , *INTEGRATED circuits , *SYSTEMS design , *APPLICATION-specific integrated circuits , *ELECTRONIC systems , *COMPUTERS - Abstract
In this paper, we present a methodology for the design of high-performance system-on-chip communication architectures. The approach is based on the addition of a layer of circuitry called the communication architecture tuner (CAT) layer around an existing communication architecture topology. The added layer provides a system with the capability of adapting to runtime variability in the communication needs of its constituent components. For example, more critical data may be handled differently, leading to lower communication latencies. The CAT associated with each component monitors its internal state, analyzes the communication transactions it generates, and "predicts" the relative importance of the transactions in terms of their impact on system-level performance metrics. It then configures the protocol parameters of the underlying communication architecture (e.g., priorities, burst modes, etc.) to best suit the system's changing communication needs. We illustrate the issues and tradeoffs involved in the design of CAT-based communication architectures, and present algorithms that automate the key steps. Experiments with example systems indicate that performance metrics (e.g., number of missed deadlines, average processing time) for systems with CAT-based communication architectures are significantly (sometimes over an order of magnitude) better than those with conventional communication architectures. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
49. Array Regrouping and Its Use in Compiling Data-Intensive Embedded Applications.
- Author
-
de La Luz, Victor and Kandemir, Mahmut
- Subjects
- *
EMBEDDED computer systems , *COMPUTERS , *COMPILERS (Computer programs) , *CACHE memory , *COMPUTER storage devices , *COMPUTER architecture - Abstract
One of the key challenges facing computer architects and compiler writers is the increasing discrepancy between processor cycle times and main memory access times. To alleviate this problem in array-intensive embedded signal and video processing applications, compilers may employ either control-centric transformations that change the data access patterns of nested loops or data-centric transformations that modify the memory layouts of multidimensional arrays. Most of the memory layout optimizations proposed so far either modify the layout of each array independently or rely on explicit data reorganizations at runtime. This paper focuses on a compiler technique, called array regrouping, that automatically maps multiple arrays into a single data (array) space to improve the data access pattern. We present a mathematical framework that enables us to systematically derive suitable mappings for a given array-intensive embedded application. The framework divides the arrays accessed in a given program into several groups, and each group is independently layout-transformed to improve spatial locality and reduce the number of conflict misses. Compared to previous approaches, the proposed technique makes two new contributions: 1) it presents a graph-based formulation of the array regrouping problem, and 2) it demonstrates the potential benefits of this aggressive array-regrouping strategy in optimizing the behavior of embedded systems. Extensive experimental results demonstrate significant improvements in cache miss rates and execution times. An important advantage of this approach over previous techniques that target conflict misses is that it reduces conflict misses without increasing the data space requirements of the application being optimized. This is a very desirable property in many embedded/portable environments where data space requirements determine the minimum physical memory capacity. [ABSTRACT FROM AUTHOR] (See the illustrative code sketch after this entry.)
- Published
- 2004
- Full Text
- View/download PDF
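A hand-written before/after picture of the transformation described in entry 49, under the assumption of two arrays that are always indexed together: regrouping interleaves them into one array of structures, so each loop iteration touches a single cache line and the arrays can no longer conflict-miss against each other. The compiler framework in the paper derives such mappings automatically; the names and sizes below are illustrative.

/* Array regrouping, shown by hand: two arrays that are always accessed
 * with the same index are mapped into a single array of structs. */
#define N 1024

/* Before: separate arrays -- a[i] and b[i] may map to conflicting lines. */
float a[N], b[N];

float dot_separate(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += a[i] * b[i];
    return s;
}

/* After regrouping: one data space, interleaved element pairs. */
struct ab { float a, b; } ab[N];

float dot_regrouped(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += ab[i].a * ab[i].b;   /* both operands share a cache line */
    return s;
}

Because the two arrays are merged rather than padded or copied at runtime, the data-space requirement stays the same, which is the property the abstract highlights for memory-constrained embedded systems.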
50. Analysis of a Conflict Between Aggregation and Interface Negotiation in Microsoft's Component Object Model.
- Author
-
Sullivan, Kevin J., Marchukov, Mark, and Socha, John
- Subjects
- *
SOFTWARE engineering , *COMPUTER software , *SOFTWARE configuration management , *COMPUTER architecture , *COMPUTERS - Abstract
Many software projects today are based on the integration of independently designed software components that are acquired on the market, rather than developed within the projects themselves. A component standard, or integration architecture, is a set of design rules meant to ensure that such components can be integrated in defined ways without undue effort. The rules of a component standard define, among other things, component interoperability and composition mechanisms. Understanding the properties of such mechanisms and interactions between them is important for the successful development and integration of software components, as well as for the evolution of component standards. This paper presents a rigorous analysis of two such mechanisms: component aggregation and dynamic interface negotiation, which were first introduced in Microsoft's Component Object Model (COM). We show that interface negotiation does not function properly within COM aggregation boundaries. In particular, interface negotiation generally cannot be used to determine the identity and set of interfaces of aggregated components. This complicates integration within aggregates. We provide a mediator-based example, and show that the problem is in the sharing of interfaces inherent in COM aggregation. [ABSTRACT FROM AUTHOR]
- Published
- 1999
- Full Text
- View/download PDF