447 results
Search Results
2. Call For Papers: Energy Efficient Computing.
- Subjects
ENERGY consumption ,USER-centered system design ,PERFORMANCE evaluation ,CALORIC expenditure ,BIOTIC communities ,MANUSCRIPTS ,ENERGY management - Published
- 2011
- Full Text
- View/download PDF
3. Efficient Mitchell’s Approximate Log Multipliers for Convolutional Neural Networks.
- Author
-
Kim, Min Soo, Barrio, Alberto A. Del, Oliveira, Leonardo Tavares, Hermida, Roman, and Bagherzadeh, Nader
- Subjects
CONVOLUTIONAL neural networks ,ARTIFICIAL neural networks ,ENERGY consumption ,DESIGN techniques ,COMPUTER vision - Abstract
This paper proposes energy-efficient approximate multipliers based on Mitchell's log multiplication, optimized for performing inference on convolutional neural networks (CNNs). Various design techniques are applied to the log multiplier, including a fully-parallel LOD, efficient shift amount calculation, and exact zero computation. Additionally, truncation of the operands is studied to create a customizable log multiplier that further reduces energy consumption. The paper also proposes using one's complement to handle negative numbers, as an approximation of the two's complement used in prior works. The viability of the proposed designs is supported by detailed formal analysis as well as experimental results on CNNs. The experiments also provide insights into the effect of approximate multiplication in CNNs, identifying the importance of minimizing the range of error. The proposed customizable design at $w = 8$ saves up to 88 percent energy compared to the exact fixed-point multiplier at 32 bits, with a performance degradation of just 0.2 percent for the ImageNet ILSVRC2012 dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
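A minimal sketch of the core idea behind entry 3, Mitchell's log multiplication, for unsigned integers; the paper's leading-one detector (LOD) hardware, operand truncation, and one's-complement negative handling are omitted, and the float arithmetic here just stands in for fixed-point datapaths.

```python
def mitchell_multiply(a: int, b: int) -> int:
    """Approximate a * b via Mitchell's log2(1 + x) ~= x approximation."""
    if a == 0 or b == 0:                    # exact zero computation, as in the paper
        return 0
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1   # leading-one positions
    x1 = a / (1 << k1) - 1.0                # fractional part, in [0, 1)
    x2 = b / (1 << k2) - 1.0
    frac = x1 + x2                          # add the approximate logs
    if frac < 1.0:                          # antilog: 2^(k1+k2) * (1 + frac)
        approx = (1 << (k1 + k2)) * (1.0 + frac)
    else:                                   # fraction carries into the exponent
        approx = (1 << (k1 + k2 + 1)) * frac
    return int(approx)

if __name__ == "__main__":
    a, b = 200, 100
    print(mitchell_multiply(a, b), "vs exact", a * b)  # 18432 vs 20000
```

Mitchell's scheme always underestimates the product (error up to roughly 11 percent), which is one reason the range of error, rather than its mean, matters for CNN accuracy as the abstract notes.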
4. Towards the Integration of Reverse Converters into the RNS Channels.
- Author
-
Sousa, Leonel, Paludo, Rogerio, Martins, Paulo, and Pettenghi, Hector
- Subjects
NURSES ,NUMBER systems ,DIGITAL filters (Mathematics) ,WAVELETS (Mathematics) ,ENERGY consumption ,ELECTRIC network topology - Abstract
The conversion from a Residue Number System (RNS) to a weighted representation is a costly inter-modulo operation that introduces delay and area overhead to RNS processors, while also increasing power consumption. This paper proposes a new approach that decomposes the reverse conversion into operations that can be processed by the arithmetic units already present in the independent RNS channels. This leads to more effective reuse of the processor circuitry while enhancing parallelism. Experimental results show that, when the proposed techniques are applied to architectures based on ripple-carry adders for the traditional 3-moduli set, the delay is improved on average by 16 percent, the circuit area by 36 percent, and the power consumption by 47 percent. When carry-lookahead adder topologies are considered, these improvements average 45 percent for the circuit area and 58 percent for the power consumption, while the delay is only slightly reduced. The proposed techniques are applied to a use case in digital filtering, showing an increase in throughput/area of up to 1.25 times and average reductions in energy consumption of 15.6 percent. This work is a step toward the use of RNS in practice, since reverse conversion underpins other hard inter-modulo operations, like comparison, scaling, and division. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
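A worked sketch of what entry 4 calls the traditional 3-moduli set, {2^n - 1, 2^n, 2^n + 1}, with a plain Chinese Remainder Theorem reverse conversion; the paper's actual contribution, folding this conversion into the channel arithmetic units, is not reproduced here, and n = 4 is just an illustrative choice.

```python
from math import prod

n = 4
MODULI = (2**n - 1, 2**n, 2**n + 1)        # (15, 16, 17), pairwise coprime
M = prod(MODULI)                           # dynamic range: 4080

def to_rns(x: int) -> tuple:
    return tuple(x % m for m in MODULI)

def from_rns(residues: tuple) -> int:
    """Reverse conversion via the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m) is the modular inverse
    return total % M

if __name__ == "__main__":
    a, b = 1234, 567
    # Addition (and multiplication) run independently in each channel.
    s = tuple((ra + rb) % m for ra, rb, m in zip(to_rns(a), to_rns(b), MODULI))
    print(from_rns(s), "==", (a + b) % M)
```

The channel-wise operations are cheap and parallel; the expensive part is exactly the `from_rns` step, which motivates reusing the channel hardware for it.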
5. Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition With Hierarchical Tucker Tensor Decomposition.
- Author
-
Gong, Yu, Yin, Miao, Huang, Lingyi, Deng, Chunhua, and Yuan, Bo
- Subjects
PARTICIPATORY design ,ENERGY consumption ,ALGORITHMS ,HARDWARE - Abstract
Long short-term memory (LSTM) is a type of powerful deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment challenging, especially for video recognition tasks that require high-dimensional input data. Aiming to overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. To fully reap this attractive algorithmic benefit, we further develop the corresponding customized hardware architecture to support efficient execution of the proposed FDHT-LSTM model. With a carefully designed memory access scheme, the complicated matrix transformation can be efficiently supported by the underlying hardware on the fly, without any access conflicts. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM enjoys both an order-of-magnitude reduction (more than 1000×) in model size and significant accuracy improvement (0.6% to 12.7%) across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves 2.5×, 1.46×, and 2.41× increases in throughput, area efficiency, and energy efficiency, respectively, on the LSTM-Youtube workload. For the LSTM-UCF workload, our proposed design also outperforms TIE with 1.9× higher throughput, 1.83× higher energy efficiency, and comparable area efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
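Back-of-the-envelope arithmetic for the model-size problem entry 5 starts from: an LSTM layer carries roughly 4*(d_in + d_h)*d_h weights (four gates), which explodes for high-dimensional video inputs. The factorized count below is a generic low-rank stand-in, not the paper's FDHT construction, and the frame size and rank are assumptions.

```python
d_in, d_h = 160 * 120 * 3, 256             # e.g., a flattened RGB frame -> hidden 256
dense = 4 * (d_in + d_h) * d_h             # four gate matrices over [input; hidden]
print(f"dense LSTM weights: {dense / 1e6:.1f} M")

# Generic low-rank stand-in: factor the 4*d_h x (d_in + d_h) weight matrix with rank r.
r = 16
low_rank = (4 * d_h + (d_in + d_h)) * r
print(f"rank-{r} factorization: {low_rank / 1e6:.2f} M "
      f"({dense / low_rank:.0f}x smaller)")
```

Even this crude factorization yields a ~60× reduction; the fully decomposed hierarchical Tucker structure in the paper is what pushes the reduction past three orders of magnitude while preserving accuracy.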
6. EGCN: An Efficient GCN Accelerator for Minimizing Off-Chip Memory Access.
- Author
-
Han, Yunki, Park, Kangkyu, Jung, Youngbeom, and Kim, Lee-Sup
- Subjects
DYNAMIC random access memory ,REPRESENTATIONS of graphs ,MATRIX multiplications ,MEMORY ,RANDOM access memory ,ENERGY consumption - Abstract
As Graph Convolutional Networks (GCNs) have emerged as a promising solution for graph representation learning, designing specialized GCN accelerators has become an important challenge. An analysis of GCN workloads shows that the main bottleneck of GCN processing is not computation but the memory latency of intensive off-chip data transfer. Therefore, minimizing off-chip data transfer is the primary challenge for designing an efficient GCN accelerator. To address this challenge, optimization starts by viewing GCN processing as tiled matrix multiplication. In this paper, we optimize off-chip memory access from both the in-tile and out-of-tile perspectives. From the out-of-tile perspective, we find optimal tile configurations for given datasets and on-chip buffer capacity, then observe the dataflow across phases and layers. An inter-layer phase-fusion dataflow with optimal tile configuration reduces data transfer of intermediate outputs. From the in-tile perspective, because tiles are sparse, they contain redundant data that does not participate in computation; loading such redundant data is eliminated with hardware support. Finally, we introduce an efficient GCN inference accelerator, EGCN, specialized for minimizing off-chip memory access. EGCN achieves 41.9% off-chip DRAM access reduction, 1.49× speedup, and 1.95× energy efficiency improvement on average over state-of-the-art accelerators. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
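A minimal sketch of the tiled view of GCN aggregation that entry 6 builds on: the sparse adjacency matrix times the feature matrix, processed tile by tile, with all-zero tiles skipped. The tile size T stands in for on-chip buffer capacity; EGCN's phase-fusion dataflow and hardware-side redundant-load elimination are not modeled.

```python
import numpy as np

def tiled_spmm(A: np.ndarray, X: np.ndarray, T: int) -> np.ndarray:
    """Compute A @ X tile by tile, skipping tiles with no nonzero entries."""
    n, k = A.shape
    out = np.zeros((n, X.shape[1]))
    for i in range(0, n, T):
        for j in range(0, k, T):
            tile = A[i:i+T, j:j+T]
            if not tile.any():             # redundant (all-zero) tile: no compute,
                continue                   # no data transfer
            out[i:i+T] += tile @ X[j:j+T]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = (rng.random((64, 64)) < 0.05).astype(float)   # sparse adjacency
    X = rng.random((64, 16))                          # node features
    assert np.allclose(tiled_spmm(A, X, T=16), A @ X)
```

At GCN-typical sparsity most tiles are entirely zero, which is why tile configuration dominates off-chip traffic.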
7. PerfBound: Conserving Energy with Bounded Overheads in On/Off-Based HPC Interconnects.
- Author
-
Saravanan, Karthikeyan P. and Carpenter, Paul M.
- Subjects
HIGH performance computing ,ETHERNET ,SUPERCOMPUTERS ,ENERGY conservation ,INTERNET - Abstract
Energy and power are key challenges in high-performance computing. System energy efficiency must be significantly improved, and this requires greater efficiency in all subcomponents. An important target of optimization is the interconnect, since network links are always on, consuming power even during idle periods. A large number of HPC machines have a primary interconnect based on Ethernet (about 40 percent of TOP500 machines), which, since 2010, has included support for saving power via Energy Efficient Ethernet (EEE). Nevertheless, it is unlikely that HPC interconnects will use these energy-saving modes unless the performance overhead is known and small. This paper presents PerfBound, a self-contained technique to manage on/off-based networks such as EEE, minimizing interconnect link energy consumption subject to a bound on the performance degradation. PerfBound requires no changes to the applications, uses only local information already available at switches and NICs without introducing additional communication messages, and is compatible with multi-hop networks. PerfBound is evaluated using traces from a production supercomputer. For twelve out of fourteen applications, PerfBound delivers high energy savings, up to 70 percent for only 1 percent performance degradation. This paper also presents DynamicFastwake, which extends PerfBound to exploit multiple low-power states. DynamicFastwake achieves an energy-delay product 10 percent lower than the original PerfBound technique. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
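A toy sketch of the idea behind entry 7: put a link to sleep during idle gaps, but only while the accumulated wake-up penalty stays within a performance-degradation budget. PerfBound's actual local estimators and DynamicFastwake's multiple low-power states are omitted; every constant below is an illustrative assumption.

```python
def simulate(gaps_us, wake_us=5.0, sleep_threshold_us=20.0, bound=0.01):
    """gaps_us: observed idle gaps on a link. Returns (idle frac saved, slowdown)."""
    runtime = sum(gaps_us)           # crude stand-in for total application time
    budget = bound * runtime         # allowed slowdown, e.g., 1 percent
    saved, penalty = 0.0, 0.0
    for gap in gaps_us:
        if gap >= sleep_threshold_us and penalty + wake_us <= budget:
            saved += gap - wake_us   # link powered down for the rest of the gap
            penalty += wake_us       # the next message waits for wake-up
    return saved / runtime, penalty / runtime

if __name__ == "__main__":
    gaps = [3, 50, 8, 120, 40, 5, 300, 25] * 100     # synthetic idle gaps (us)
    energy_frac, slowdown = simulate(gaps)
    print(f"idle time recovered: {energy_frac:.1%}, slowdown: {slowdown:.2%}")
```

The key property is that the performance bound is enforced by construction: sleeping stops as soon as the wake-up penalties would exceed the budget.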
8. A Ferroelectric-Based Volatile/Non-Volatile Dual-Mode Buffer Memory for Deep Neural Network Accelerators.
- Author
-
Luo, Yandong, Luo, Yuan-Chun, and Yu, Shimeng
- Subjects
ARTIFICIAL neural networks ,FERROELECTRIC materials ,MEMORY ,NUTRIENT density ,STRAY currents ,ENERGY consumption ,COMPLEMENTARY metal oxide semiconductors - Abstract
Deep neural network (DNN) inference and training produce a large amount of intermediate data. To achieve high energy efficiency, a sufficiently large on-chip buffer is preferred, reducing energy- and time-consuming off-chip DRAM accesses. However, an SRAM buffer suffers from large area cost and high standby power due to its large cell size and high leakage current. Although embedded DRAM (eDRAM) offers higher memory density, its energy consumption is high due to frequent refresh operations, induced by the short refresh interval (40∼100 μs). In this paper, a dual-mode buffer memory based on the CMOS-compatible HfZrO2 ferroelectric material is proposed for DNN accelerators. It can operate in both a volatile eDRAM mode and a non-volatile ferroelectric RAM (FeRAM) mode. The functionality of the proposed dual-mode memory bit-cell design is verified using SPICE simulation with the multi-domain Preisach physical model. A data-lifetime-aware memory mode configuration protocol is proposed to optimize the buffer access energy for both DNN inference and training. Detailed circuit and architectural support for the dual-mode memory is presented. For DNN training with ferroelectric-field-effect-transistor (FeFET) and SRAM-based compute-in-memory (CIM) accelerators, the proposed dual-mode buffer design improves the overall energy efficiency by 92.2%∼98.7%, 44.1%∼47.6%, and 12.6%∼13.0% compared to baseline designs using an SRAM buffer with the same buffer area, and eDRAM and FeRAM with the same buffer capacity, respectively. For DNN inference with a tensor-processing-unit (TPU)-like systolic array, the energy efficiency during computing is improved by 40.7%∼45.6% and 18.4%∼29.6% compared to designs with eDRAM and FeRAM buffers, respectively. By storing persistent data in the non-volatile mode, the energy efficiency of the systolic array is improved by 2.3×∼5.5× over the SRAM-based design when standby is frequent. The chip area overhead of the dual-mode buffer design is 5.2%, 4.1%, and 7.2% for FeFET-based-CIM, SRAM-based-CIM, and systolic-array-based accelerators using an eDRAM buffer, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
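A back-of-the-envelope sketch of the data-lifetime-aware mode choice in entry 8: the volatile eDRAM mode pays a refresh for every retention interval the data stays alive, while the non-volatile FeRAM mode pays a higher write cost once. The energy numbers below are made-up placeholders, not the paper's measurements.

```python
def pick_mode(lifetime_us: float,
              retention_us: float = 70.0,       # eDRAM retention, within 40~100 us
              e_refresh_pj: float = 1.0,        # energy per refresh (placeholder)
              e_fe_write_extra_pj: float = 4.0  # extra FeRAM write cost (placeholder)
              ) -> str:
    refreshes = int(lifetime_us // retention_us)  # refreshes paid in eDRAM mode
    edram_cost = refreshes * e_refresh_pj
    return "FeRAM" if e_fe_write_extra_pj < edram_cost else "eDRAM"

if __name__ == "__main__":
    # Short-lived activations favor eDRAM mode; persistent weights favor FeRAM mode.
    for life in (30, 200, 1000):
        print(f"lifetime {life:>5} us -> {pick_mode(life)}")
```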
9. Compiler-Assisted Refresh Minimization for Volatile STT-RAM Cache.
- Author
-
Li, Qingan, He, Yanxiang, Li, Jianhua, Shi, Liang, Chen, Yiran, and Xue, Chun Jason
- Subjects
RANDOM access memory ,SPIN transfer torque ,INFORMATION storage & retrieval systems ,MATHEMATICAL models ,ENERGY consumption ,CACHE memory - Abstract
Spin-transfer torque RAM (STT-RAM) has been proposed for building on-chip caches because of its attractive features such as high storage density and ultra-low leakage power. However, long write latency and high write energy are two challenges for STT-RAM. Recently, researchers have proposed improving the write performance of STT-RAM by relaxing its non-volatility property. To avoid data losses resulting from volatility, refresh schemes have been proposed. However, refresh operations introduce additional overhead. In this paper, we propose to significantly reduce the number of refresh operations by rearranging program data layout at compile time. An N-refresh scheme is also proposed to further reduce the number of refreshes. Experimental results show that, on average, the proposed methods can reduce the number of refresh operations by 84.2 percent and reduce dynamic energy consumption by 38.0 percent for volatile STT-RAM caches, while incurring only 4.1 percent performance degradation. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
10. Signal Strength-Aware Adaptive Offloading with Local Image Preprocessing for Energy Efficient Mobile Devices.
- Author
-
Kim, Young Geun, Lee, Young Seo, and Chung, Sung Woo
- Subjects
ENERGY consumption ,IMAGE transmission ,IMAGE processing ,DATA transmission systems ,ESTIMATION theory - Abstract
To prolong the battery life of mobile devices, image processing applications often exploit offloading techniques, which run some or all of the computations on remote servers. Unfortunately, existing offloading techniques do not consider the fact that the data transmission time and energy consumption of wireless network interfaces increase exponentially as signal strength decreases. In this paper, we propose an adaptive offloading scheme for image processing applications that considers wireless signal strength. To improve the performance and energy efficiency of offloading, we also propose adaptively exploiting local preprocessing (executing image preprocessing on the local mobile device), again considering wireless signal strength; local preprocessing usually reduces the size of the image transmitted during offloading. Our proposed technique estimates the performance and energy consumption of the following three methods, depending on the wireless signal strength: 1) local execution (executing all the computations on the local mobile device), 2) offloading without local preprocessing, and 3) offloading with local preprocessing. Based on the estimated performance and energy consumption, our technique employs whichever of the three methods is expected to give the best performance or energy efficiency. In our evaluation on an off-the-shelf smartphone, when a user prefers performance to energy, our proposed technique improves performance by 27.1 percent compared to a conventional offloading technique that does not consider signal strength. On the other hand, when a user prefers energy to performance, our proposed technique saves system-wide (not just CPU or wireless network interface) energy consumption by 26.3 percent, on average, compared to the conventional offloading technique. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
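A minimal sketch of the three-way decision in entry 10: estimate time and energy for local execution, plain offloading, and offloading with local preprocessing, then pick by the user's preference. The bandwidth and radio-power models as functions of signal strength, and all the timing constants, are illustrative placeholders, not the paper's estimation models.

```python
def choose(signal_dbm: float, img_mb: float, prefer: str = "energy") -> str:
    # Weaker signal -> lower throughput and higher radio power (placeholder model).
    bw_mbps = max(0.5, 20 + 0.4 * (signal_dbm + 50))    # -50 dBm -> 20 Mbps
    radio_w = 1.0 + 0.02 * max(0, -(signal_dbm + 50))   # grows as signal fades
    shrink, pre_s, pre_j = 0.4, 0.3, 0.4                # local preprocessing costs
    server_s, local_s, cpu_w = 0.2, 2.0, 1.5

    t_tx = img_mb * 8 / bw_mbps                         # transmission time (s)
    options = {                                         # name -> (time s, energy J)
        "local":       (local_s, local_s * cpu_w),
        "offload":     (t_tx + server_s, t_tx * radio_w),
        "pre+offload": (pre_s + shrink * t_tx + server_s,
                        pre_j + shrink * t_tx * radio_w),
    }
    key = (lambda kv: kv[1][1]) if prefer == "energy" else (lambda kv: kv[1][0])
    return min(options.items(), key=key)[0]

if __name__ == "__main__":
    for dbm in (-50, -75, -95):
        print(dbm, "dBm ->", choose(dbm, img_mb=4))     # falls back to local at -95
```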
11. Polysynchronous Clocking: Exploiting the Skew Tolerance of Stochastic Circuits.
- Author
-
Najafi, M. Hassan, Lilja, David J., Riedel, Marc D., and Bazargan, Kia
- Subjects
CLOCK circuits (Electronics) ,ARITHMETIC functions ,CLOCK distribution networks ,ENERGY consumption ,INTEGRATED circuit design - Abstract
In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit streams. The method naturally and effectively tolerates very high clock skew. Exploiting this advantage, this paper introduces polysynchronous clocking, a design strategy in which clock domains are split at a very fine level. Each domain is synchronized by an inexpensive local clock. Alternatively, the skew requirements for a global clock distribution network can be relaxed, allowing a higher working frequency and hence lower latency. The benefits of both approaches are quantified. Polysynchronous clocking results in significant latency, area, and energy savings for a wide variety of applications. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
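A minimal sketch of the stochastic-computing representation entry 11 relies on: a value p in [0, 1] becomes a random bit stream with P(bit = 1) = p, and a single AND gate multiplies two independent streams. The skew tolerance follows because misaligning the streams barely changes the resulting bit counts.

```python
import random

def to_stream(p: float, n: int) -> list:
    """Encode p in [0, 1] as n random bits with P(bit) = p."""
    return [random.random() < p for _ in range(n)]

def from_stream(bits: list) -> float:
    return sum(bits) / len(bits)

if __name__ == "__main__":
    random.seed(1)
    n = 4096
    a, b = to_stream(0.5, n), to_stream(0.4, n)
    prod = [x and y for x, y in zip(a, b)]                # one AND gate per bit pair
    skewed = [x and y for x, y in zip(a, b[7:] + b[:7])]  # streams misaligned by 7 bits
    print(from_stream(prod), from_stream(skewed))         # both ~= 0.5 * 0.4 = 0.2
```

Because the streams are independent and random, the AND of a stream with a shifted copy of the other still estimates the same product, which is exactly the property that lets each clock domain run off a cheap local clock.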
12. BAFL: A Blockchain-Based Asynchronous Federated Learning Framework.
- Author
-
Feng, Lei, Zhao, Yiqi, Guo, Shaoyong, Qiu, Xuesong, Li, Wenjing, and Yu, Peng
- Subjects
ARTIFICIAL intelligence ,MACHINE learning ,CONSUMPTION (Economics) ,ENERGY consumption ,COLLABORATIVE learning ,ASYNCHRONOUS learning ,BLOCKCHAINS - Abstract
As an emerging distributed machine learning (ML) method, federated learning (FL) can protect data privacy through collaborative learning of artificial intelligence (AI) models across a large number of devices. However, inefficiency and vulnerability to poisoning attacks have held back FL performance. Therefore, a blockchain-based asynchronous federated learning (BAFL) framework is proposed to ensure the security and efficiency required by FL. The blockchain ensures that the model data cannot be tampered with, while asynchronous learning speeds up global aggregation. A novel entropy weight method is used to evaluate the participation rank and proportion of each device's locally trained model in BAFL. The energy consumption and local model update efficiency are balanced by adjusting the local training and communication delay and optimizing the block generation rate. Extensive evaluation results show that the proposed BAFL framework has higher efficiency and better performance in preventing poisoning attacks than other distributed ML methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
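A minimal sketch of the entropy weight method mentioned in entry 12: indicators that vary more across devices carry more weight. The indicator matrix below (rows are devices, columns are metrics such as accuracy or staleness) is synthetic, and this is the textbook method, not necessarily BAFL's exact variant.

```python
import numpy as np

def entropy_weights(X: np.ndarray) -> np.ndarray:
    """X: positive indicator matrix, rows = devices, cols = indicators."""
    P = X / X.sum(axis=0, keepdims=True)   # normalize each column to probabilities
    k = 1.0 / np.log(X.shape[0])
    logs = np.where(P > 0, np.log(np.maximum(P, 1e-300)), 0.0)
    e = -k * (P * logs).sum(axis=0)        # entropy per indicator, in [0, 1]
    d = 1.0 - e                            # divergence: higher = more informative
    return d / d.sum()                     # normalized indicator weights

if __name__ == "__main__":
    X = np.array([[0.90, 5.0],             # 3 devices, 2 indicators
                  [0.91, 1.0],
                  [0.89, 9.0]])
    print(entropy_weights(X))              # the dispersed second column dominates
```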
13. Fast and Energy-Efficient OLAP Data Management on Hybrid Main Memory Systems.
- Author
-
Hassan, Ahmad, Nikolopoulos, Dimitrios S., and Vandierendonck, Hans
- Subjects
DYNAMIC random access memory ,OLAP technology ,DATABASES - Abstract
This paper studies the problem of efficiently utilizing hybrid memory systems, consisting of both Dynamic Random Access Memory (DRAM) and novel Non-Volatile Memory (NVM), in database management systems (DBMS) for online analytical processing (OLAP) workloads. We present a methodology to determine the database operators that are responsible for most main memory accesses. Our analysis uses both cost models and empirical measurements. We develop heuristic decision procedures that allocate data in hybrid memory at the time the data buffers are allocated, depending on the expected memory access frequency. We implement these heuristics in the MonetDB column-oriented database and demonstrate improved performance and energy efficiency compared to state-of-the-art application-agnostic hybrid memory management techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
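A minimal sketch of the allocation-time heuristic described in entry 13: place the hottest buffers (highest expected accesses per byte) in DRAM until its capacity is exhausted, and the rest in NVM. Buffer names, sizes, and access counts are illustrative, not MonetDB internals.

```python
def place_buffers(buffers, dram_capacity_mb: float) -> dict:
    """buffers: list of (name, size_mb, expected_accesses). Returns name -> tier."""
    placement, used = {}, 0.0
    # Hottest per byte first: these benefit most from DRAM's cheaper accesses.
    for name, size, accesses in sorted(buffers, key=lambda b: -b[2] / b[1]):
        if used + size <= dram_capacity_mb:
            placement[name], used = "DRAM", used + size
        else:
            placement[name] = "NVM"
    return placement

if __name__ == "__main__":
    bufs = [("hash_table", 512, 9e8), ("scan_column", 2048, 1e8),
            ("aggregates", 128, 6e8), ("spill", 1024, 1e6)]
    print(place_buffers(bufs, dram_capacity_mb=1024))
    # -> aggregates and hash_table land in DRAM; the cold, large buffers in NVM
```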
14. The Parallel Multi-Mode Digraph Task Model for Energy-Aware Real-Time Heterogeneous Multi-Core Systems.
- Author
-
Zahaf, Houssam-Eddine, Lipari, Giuseppe, Bertogna, Marko, and Boulet, Pierre
- Subjects
DIRECTED graphs ,ASSIGNMENT problems (Programming) ,COMPUTATIONAL complexity ,ENERGY consumption ,TASKS ,PARALLEL programming - Abstract
Many task models have been proposed to express and analyze the behavior of real-time applications at different levels of precision. Most of them target sequential applications, with no support for parallelism. The digraph task model is one of the most general ones, as it allows modeling arbitrary directed graphs (digraphs) of sequential job releases. In this paper, we extend the digraph task model to support intra-task parallelism. For the proposed parallel multi-mode digraph model, we derive sufficient schedulability tests and a dichotomic search that reduces the tests' pessimism for a set of $n$ tasks on a heterogeneous single-ISA multi-core platform. To reduce the computational complexity of the schedulability test, we also propose heuristics for (i) partitioning parallel digraph tasks onto the heterogeneous cores, and (ii) assigning core operating frequencies to reduce the overall energy consumption while meeting real-time constraints. The effectiveness of the proposed approach is validated with an exhaustive set of simulations. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
15. Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance.
- Author
-
Candel, Francisco, Valero, Alejandro, Petit, Salvador, and Sahuquillo, Julio
- Subjects
CACHE memory ,GRAPHICS processing units ,MEMORY ,ENERGY consumption - Abstract
To support the massive number of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size increases considerably with each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size scales in neither performance nor energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that in most cases the way LLC misses are managed is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information for incoming blocks until they are fetched from main memory. The fetched blocks are then swapped with the victim blocks (i.e., those selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance for three main reasons: i) the lifetime of blocks being replaced is extended, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and much smaller area requirements. Experimental results show that the proposed FRC cache scales in performance with the number of GPU compute units and the LLC size: depending on the FRC size, performance improves by 30 to 67 percent for a modern baseline GPU card, and by 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average by 49 to 57 percent for the larger GPU. These benefits come with a small area increase (7.3 percent) over the LLC baseline. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
16. An Energy-Efficient Last-Level Cache Architecture for Process Variation-Tolerant 3D Microprocessors.
- Author
-
Kong, Joonho, Koushanfar, Farinaz, and Chung, Sung Woo
- Subjects
ENERGY consumption ,CACHE memory ,COMPUTER architecture ,MICROPROCESSORS ,PERFORMANCE evaluation ,FAULT-tolerant computing - Abstract
As process technologies evolve, tackling process variation problems is becoming more challenging in 3D (i.e., die-stacked) microprocessors. Process variation adversely affects the performance, power, and reliability of 3D microprocessors, which in turn results in yield losses. In particular, last-level caches (LLCs: L2 or L3 caches) are known to be the component most vulnerable to process variation in 3D microprocessors. In this paper, we propose a novel cache architecture that exploits narrow-width values to improve the yield of LLCs (here, L2 caches) in 3D microprocessors. Our proposed architecture disables faulty cache subparts and turns on only the portions that store meaningful data in the cache arrays, which results in high energy efficiency as well as high cache yield. In an energy- and performance-efficient manner, our proposed architecture significantly recovers not only SRAM cell failure-induced yield losses but also leakage-induced yield losses. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
17. Design of Hybrid Second-Level Caches.
- Author
-
Valero, Alejandro, Sahuquillo, Julio, Petit, Salvador, Lopez, Pedro, and Duato, Jose
- Subjects
CACHE memory ,HYBRID systems ,SYSTEMS design ,EMBEDDED computer systems ,ENERGY consumption - Abstract
In recent years, embedded dynamic random-access memory (eDRAM) technology has been implemented in last-level caches due to its low leakage energy consumption and high density. However, the fact that eDRAM presents slower access time than static RAM (SRAM) technology has prevented its inclusion in higher levels of the cache hierarchy. This paper proposes mingling SRAM and eDRAM banks within the data array of second-level (L2) caches. The main goal is to achieve the best trade-off among performance, energy, and area. To this end, two main directions have been followed. First, this paper explores the optimal percentage of banks for each technology. Second, the cache controller is redesigned to deal with performance and energy. Performance is addressed by keeping the most likely accessed blocks in fast SRAM banks. In addition, energy savings are further enhanced by avoiding unnecessary destructive reads of eDRAM blocks. Experimental results show that, compared to a conventional SRAM L2 cache, a hybrid approach requiring similar or even lower area speeds up performance on average by 5.9 percent, while saving 32 percent of total energy. For a 45 nm technology node, the energy-delay-area product confirms that a hybrid cache is a better design than the conventional SRAM cache regardless of the number of eDRAM banks, and also better than a conventional eDRAM cache when the number of SRAM banks is an eighth of the total number of cache banks. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
18. Localized Routing Approach to Bypass Holes in Wireless Sensor Networks.
- Author
-
Mostefaoui, Ahmed, Melkemi, Mahmoud, and Boukerche, Azzedine
- Subjects
TELECOMMUNICATION systems routing ,WIRELESS sensor networks ,DISTRIBUTED algorithms ,DATA packeting ,DATA transmission systems ,ENERGY consumption - Abstract
The geographic greedy forwarding (GF) technique has been widely used by many routing algorithms in sensor networks because of its high efficiency, which results from its local and memoryless nature. It thus ensures scalability, a fundamental requirement for protocol applicability to large-scale sensor networks with limited resources. Nevertheless, GF suffers from a serious drawback when packets, routed by geographic distance, cannot be delivered; i.e., the so-called "local minimum phenomenon". Previous research has tackled this problem, guaranteeing packet delivery by routing around the boundaries of the hole, but at the cost of excessive control overhead. In this paper, we propose a novel approach that exploits the GF technique while guaranteeing packet delivery (handling local minimum situations). Our approach is local and memoryless, and performs better than state-of-the-art approaches in its ability to guarantee packet delivery and to derive efficient routing paths. We provide proof of its correctness (the packet delivery guarantee) while showing, through simulations, its effectiveness in reducing path lengths, average end-to-end delays, and overall energy consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
19. ReDO: Cross-Layer Multi-Objective Design-Exploration Framework for Efficient Soft Error Resilient Systems.
- Author
-
Savino, Alessandro, Vallero, Alessandro, and Di Carlo, Stefano
- Subjects
SOFT errors ,MULTIPLE criteria decision making ,FAULT tolerance (Engineering) ,ENERGY consumption ,BAYESIAN analysis - Abstract
Designing soft-error-resilient systems is a complex engineering task, which nowadays follows a cross-layer approach. It requires careful planning of different fault-tolerance mechanisms at different system layers, from the technology up to the software domain. While these design decisions have a positive effect on the reliability of the system, they usually have a detrimental effect on its size, power consumption, performance, and cost. Design space exploration for cross-layer reliability is therefore a multi-objective search problem in which reliability must be traded off against other design dimensions. This paper proposes a cross-layer multi-objective design space exploration algorithm developed to help designers build soft-error-resilient electronic systems. The algorithm exploits a system-level Bayesian reliability estimation model to analyze the effect of different cross-layer combinations of protection mechanisms on the reliability of the full system. A new heuristic based on extremal optimization theory is used to efficiently explore the design space. Two exploration strategies are proposed. The first strategy optimizes the reliability of the system alone. It is suited to cases in which reaching a given reliability target is the sole goal, and focuses on finding a reduced set of system components that, when protected, allow the designer to reach the desired reliability level. As a positive side effect, by reducing the number of protected components, the overhead introduced by the fault-tolerance techniques is reduced as well. The second strategy jointly considers the effect that the introduced fault-tolerance mechanisms have on execution time, power, hardware area, and software size. This strategy supports exploration of the design space with multiple objectives across different design dimensions. An extended set of simulations shows the capability of this framework when applied both to benchmark applications and to realistic systems, providing optimized systems that outperform those obtained by applying state-of-the-art cross-layer reliability techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
20. A Stochastic Computational Multi-Layer Perceptron with Backward Propagation.
- Author
-
Liu, Yidong, Liu, Siting, Wang, Yanzhi, Lombardi, Fabrizio, and Han, Jie
- Subjects
STOCHASTIC analysis ,ARTIFICIAL neural networks ,ELECTRIC power consumption ,MULTILAYER perceptrons ,ALGORITHMS - Abstract
Stochastic computation has recently been proposed for implementing artificial neural networks with reduced hardware and power consumption, but at a decreased accuracy and processing speed. Most existing implementations are based on pre-training, such that the weights are predetermined for neurons at different layers; these implementations therefore lack the ability to update the values of the network parameters. In this paper, a stochastic computational multi-layer perceptron (SC-MLP) is proposed by implementing the backward propagation algorithm for updating the layer weights. Using extended stochastic logic (ESL), a reconfigurable stochastic computational activation unit (SCAU) is designed to implement different types of activation functions such as $\tanh$ and the rectifier function. A triple modular redundancy (TMR) technique is employed to reduce the random fluctuations in stochastic computation. A probability estimator (PE) and a divider based on the TMR and a binary search algorithm are further proposed, with progressive precision, to reduce the required stochastic sequence length. As a result, the latency and energy consumption of the SC-MLP are significantly reduced. The simulation results show that the proposed design is capable of implementing both the training and inference processes. For the classification of nonlinearly separable patterns, at a slight accuracy loss of 1.32-1.34 percent, the proposed design requires only 28.5-30.1 percent of the area and 18.9-23.9 percent of the energy consumption incurred by a design using floating-point arithmetic. Compared to a fixed-point implementation, the SC-MLP consumes a smaller area (40.7-45.5 percent) and lower energy (38.0-51.0 percent) with a similar processing speed and a slight accuracy drop of 0.15-0.33 percent. The area and energy consumption of the proposed design are 80.7-87.1 percent and 71.9-93.1 percent, respectively, of those of a binarized neural network (BNN) with similar accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
21. Bi-Objective Optimization of Data-Parallel Applications on Homogeneous Multicore Clusters for Performance and Energy.
- Author
-
Manumachu, Ravindranath Reddy and Lastovetsky, Alexey
- Subjects
CENTRAL processing units ,MULTICORE processors ,COMPUTER algorithms ,ELECTRIC power consumption ,MULTIDISCIPLINARY design optimization ,PARALLEL computers - Abstract
Performance and energy are now the most dominant objectives for optimization on modern parallel platforms composed of multicore CPU nodes. The existing intra-node and inter-node optimization methods employ a large set of decision variables but do not consider problem size as a decision variable, and they assume a linear relationship between performance and problem size and between energy consumption and problem size. Using experiments with real-life data-parallel applications on modern multicore CPUs, we demonstrate that these relationships have complex (non-linear and even non-convex) properties and, therefore, that problem size has become an important decision variable that can no longer be ignored. This key finding motivates our work in this paper. We first formulate the bi-objective optimization problem for performance and energy (BOPPE) for data-parallel applications on homogeneous clusters of modern multicore CPUs. It contains only one, heretofore unconsidered, decision variable: the problem size. We then present an efficient and exact global optimization algorithm called ALEPH that solves the BOPPE. It takes as inputs discrete functions of performance and dynamic energy consumption against problem size, and outputs the globally Pareto-optimal set of solutions. The solutions are workload distributions that achieve inter-node optimization of data-parallel applications for performance and energy. While existing solvers for BOPPE give only one solution when the problem size and number of processors are fixed, our algorithm gives a diverse set of globally Pareto-optimal solutions. The algorithm has time complexity of $O(m^2 \times p^2)$, where $m$ is the number of points in the discrete speed/energy function and $p$ is the number of available processors. We experimentally study the efficiency and scalability of our algorithm for two data-parallel applications, matrix multiplication and fast Fourier transform, on a modern multicore CPU and homogeneous clusters of such CPUs. Based on our experiments, we show that the average and maximum sizes of the globally Pareto-optimal sets determined by our algorithm are 15 and 34 for the first application, and 7 and 20 for the second. Compared with the load-balanced workload distribution solution, the average and maximum percentage improvements in (performance, energy) demonstrated for the first application are (13%, 97%) and (18%, 71%); for the second application, these improvements are (40%, 95%) and (22%, 127%). Assuming a 5 percent performance degradation from the optimum is acceptable, the average and maximum improvements in energy consumption demonstrated for the two applications are 9 and 44 percent, and 8 and 20 percent, respectively. Using the algorithm and its building blocks, we also present a study of the interplay between performance and energy. We demonstrate how ALEPH can be combined with DVFS-based Multi-Objective Optimization (MOP) methods to give a better set of (globally Pareto-optimal) solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
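A minimal sketch of the core step in entry 21: given discrete performance and dynamic-energy functions of problem size, keep only the globally Pareto-optimal sizes (those for which no other size is both faster and less energy-hungry). The time/energy table is synthetic and deliberately non-convex; ALEPH itself is the efficient exact algorithm, while this naive filter is only the definition.

```python
def pareto_front(points):
    """points: list of (size, time, energy). Returns the non-dominated points."""
    front = []
    for s, t, e in points:
        dominated = any(t2 <= t and e2 <= e and (t2 < t or e2 < e)
                        for _, t2, e2 in points)
        if not dominated:
            front.append((s, t, e))
    return sorted(front, key=lambda p: p[1])   # fastest first

if __name__ == "__main__":
    table = [(1024, 9.0, 50.0), (2048, 7.5, 64.0), (3072, 7.7, 46.0),
             (4096, 6.1, 70.0), (5120, 6.3, 90.0)]
    print(pareto_front(table))
    # -> [(4096, 6.1, 70.0), (2048, 7.5, 64.0), (3072, 7.7, 46.0)]
```

Note that the surviving set is not a single point: the non-convexity is exactly why a diverse Pareto set, rather than one "optimal" problem size, comes out.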
22. Non-Parametric RSS Prediction Based Energy Saving Scheme for Moving Smartphones.
- Author
-
Leu, Jenq-Shiou, Tung, Nguyen Hai, and Liu, Chun-Yao
- Subjects
NONPARAMETRIC estimation ,RSS feeds ,LOGICAL prediction ,ENERGY consumption ,SCHEME programming language ,SMARTPHONES ,WIRELESS Internet - Abstract
With the emergence of WiFi technology and network-based applications, the computing, communication, and sensing capabilities of smartphones are increasing rapidly, and the smartphone has emerged as a particularly appealing platform for pervasive network applications. However, WiFi entails considerable energy consumption on these battery-powered devices, so finding ways to reduce smartphone power consumption has become a critical issue. In this paper, we propose an adaptive limit-rate selection algorithm based on a non-parametric signal strength prediction scheme and analyze its potential for energy savings. By periodically monitoring the received signal strength (RSS) in diverse network environments, the proposed scheme applies weighted scatter-plot smoothing and kernel moving average algorithms to adaptively adjust file downloading and video streaming rates. Experimental results demonstrate that the proposed scheme saves between 5.7% and 13.9% energy compared to non-adaptive and non-prediction schemes when smartphone holders use the applications on the move. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
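A minimal sketch of the kernel moving average that entry 22 applies to periodic RSS samples before picking a rate; a Gaussian-weighted smoother over the recent window stands in here for the paper's combination of weighted scatter-plot smoothing and kernel averaging, and the bandwidth is an assumption.

```python
import math

def kernel_moving_average(samples, bandwidth: float = 3.0) -> float:
    """samples: RSS readings in dBm, oldest first. Smooths the latest point."""
    n = len(samples)
    num = den = 0.0
    for i, rss in enumerate(samples):
        # Gaussian kernel over sample age: recent readings weigh more.
        w = math.exp(-((n - 1 - i) / bandwidth) ** 2)
        num += w * rss
        den += w
    return num / den

if __name__ == "__main__":
    rss_window = [-62, -64, -63, -70, -66, -65, -80]   # last reading is an outlier
    smoothed = kernel_moving_average(rss_window)
    print(f"raw last: {rss_window[-1]} dBm, smoothed: {smoothed:.1f} dBm")
```

Smoothing keeps a single deep fade from triggering an unnecessary rate change, which is what makes the downstream rate selection stable enough to save energy.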
23. Leveraging Process Variation for Performance and Energy: In the Perspective of Overclocking.
- Author
-
Jang, Hyung Beom, Lee, Junhee, Kong, Joonho, Suh, Taeweon, and Chung, Sung Woo
- Subjects
PERFORMANCE evaluation ,ENERGY conservation ,RELIABILITY in engineering ,MICROPROCESSOR design & construction ,ELECTRIC potential ,ADAPTIVE computing systems - Abstract
Process variation is one of the most important factors to consider in modern microprocessor design, since it negatively affects the performance, power, and yield of microprocessors. By leveraging process variation, however, overclocking techniques can improve performance. As microprocessors have a substantial clock cycle time margin for yield, there is enough room for performance improvement through overclocking. In this paper, we adopt the F-overclocking technique, which increases clock frequency without changing supply voltage. Our experimental results show that the F-overclocking technique significantly improves performance as well as energy consumption. In addition, from the perspective of energy efficiency and reliability, the F-overclocking technique is superior to the conventional overclocking technique, which increases clock frequency and supply voltage together, while showing similar performance improvement. Furthermore, we propose an adaptive overclocking controller that dynamically applies the F-overclocking technique based on application characteristics. By adopting our adaptive overclocking controller, we further minimize the reliability loss caused by the F-overclocking technique. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
24. TLB Index-Based Tagging for Reducing Data Cache and TLB Energy Consumption.
- Author
-
Kim, Jesung, Lee, Jongmin, and Kim, Soontae
- Subjects
CACHE memory ,DATA reduction ,DATA mining ,ENERGY consumption ,HIGH performance processors - Abstract
Conventional cache tag matching identifies the requested data based on a memory address. However, this address-based tag matching is inefficient because it requires unnecessarily many tag bits. Previous studies show that translation look-aside buffer (TLB) index-based tagging (TLBIT) can be adopted in instruction caches because, due to spatial locality, there are not many different tags at a given moment, and those tags can be captured by TLBs. In the TLBIT scheme, extra TLB indices are added to each TLB entry, and conventional cache tags are replaced with TLB indices to identify the requested data in the cache. TLBIT reduces the number of required tag bits in tag arrays; therefore, cache energy consumption and area are decreased. In this paper, we show that naively adopting TLBIT for data caches is inefficient in terms of performance and energy consumption because of cache line searches and invalidations on TLB misses. To achieve the true potential of TLBIT, we propose four novel techniques: search zone, c-LRU, TLB buffer, and demand address fetching. The search zone reduces unnecessary cache line searching, and c-LRU reduces cache line invalidations. The TLB buffer prevents immediate cache line invalidations on TLB misses. Furthermore, we present demand address fetching to reduce energy consumption in the TLB. From our experiments, we observed that the proposed techniques reduce the overall dynamic energy consumption of the data cache by 14.3 percent on average. The overall tag array area and leakage power of the data cache are also reduced by 54 and 45 percent, respectively. The TLB energy consumption is reduced by 22.7 percent. The performance impact is small, less than 0.4 percent on average. We also demonstrate that TLBIT can be applied to large caches and set-associative TLBs. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
25. Quality Configurable Approximate DRAM.
- Author
-
Raha, Arnab, Sutar, Soubhagya, Jayakumar, Hrishikesh, and Raghunathan, Vijay
- Subjects
DYNAMIC random access memory ,DATA mining ,RESOURCE management ,INTEGRATED circuits ,ENERGY consumption - Abstract
Approximate computing is an emerging design paradigm that leverages the inherent error tolerance present in many applications to improve their power consumption and performance. Due to the forgiving nature of these error-resilient applications, precise input data is not always necessary for them to produce outputs of acceptable quality. This makes the memory subsystem (i.e., the place where data is stored) a suitable component for introducing approximations in return for substantial energy savings. Towards this end, this paper proposes a systematic methodology for constructing a quality configurable approximate DRAM system. Our design is based upon an extensive experimental characterization of memory errors as a function of the DRAM refresh rate. Leveraging the insights gathered from this characterization, we propose four novel strategies for partitioning the DRAM in a system into a number of quality bins based on the frequency, location, and nature of bit errors in each of the physical pages, while also taking into account the property of variable retention time exhibited by DRAM cells. During data allocation, critical data is placed in the highest quality bin (which contains only accurate pages) and approximate data is allocated to bins sorted in descending order of quality, with the refresh rate serving as the quality control knob. We validate our proposed scheme on several error-resilient applications implemented using an Altera Stratix IV GX FPGA-based Terasic TR4-230 development board containing a 1GB DDR3 DRAM module. Experimental results demonstrate a significant improvement in the energy-quality trade-off compared to previous work and show a reduction in DRAM refresh power of up to 73 percent on average with minimal loss in output quality. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
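A minimal sketch of the bin-and-allocate idea in entry 25: pages are sorted into quality bins by their observed bit-error counts at a relaxed refresh rate, critical data only ever lands in the accurate bin, and approximate data fills the remaining bins in descending order of quality. The error counts and bin thresholds are synthetic placeholders.

```python
def build_bins(page_errors: dict, thresholds=(0, 2, 8)) -> list:
    """page_errors: {page_id: bit errors at the relaxed refresh rate}."""
    bins = [[] for _ in range(len(thresholds) + 1)]
    for page, errs in page_errors.items():
        level = sum(errs > t for t in thresholds)   # 0 = accurate ... 3 = worst
        bins[level].append(page)
    return bins

def allocate(bins: list, critical: bool) -> int:
    if critical:
        return bins[0].pop()          # critical data: only fully accurate pages
    for b in bins[1:]:                # approximate data: best non-accurate bin first
        if b:
            return b.pop()
    return bins[0].pop()              # fall back to the accurate bin if needed

if __name__ == "__main__":
    errors = dict(enumerate([0, 0, 1, 3, 0, 12, 5, 2]))
    bins = build_bins(errors)
    print("bins:", bins)
    print("critical ->", allocate(bins, True), "approx ->", allocate(bins, False))
```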
26. Application-Guided Power-Efficient Fault Tolerance for H.264 Context Adaptive Variable Length Coding.
- Author
-
Shafique, Muhammad, Rehman, Semeen, Kriebel, Florian, Khan, Muhammad Usman Karim, Zatt, Bruno, Subramaniyan, Arun, Vizzotto, Bruno Boessio, and Henkel, Jorg
- Subjects
APPLICATION software ,ENERGY consumption ,MATHEMATICAL variables ,CODING theory ,FAULT tolerance (Engineering) - Abstract
This paper presents a fault-tolerance technique for H.264's Context-Adaptive Variable Length Coding (CAVLC) on unreliable computing hardware. Application-specific knowledge is leveraged at both the algorithm and architecture levels to protect the CAVLC process (especially context adaptation and the coding tables) in a reliable yet power-efficient manner. Specifically, statistical analysis of the coding syntax and video content properties is exploited for: (1) selective redundancy of coefficient/header data of video bitstreams; (2) partitioning the coding tables into various sub-tables to reduce the power overhead of fault tolerance; and (3) run-time power management of the memory parts storing the sub-tables and their parity computations. Experimental results demonstrate that leveraging application-specific knowledge reduces area and performance overhead by 2x compared to a double-parity table protection technique. For functional verification and area comparison, the complete H.264 CAVLC architecture is prototyped on a Xilinx Virtex-5 FPGA (though it is not limited to this platform). [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
27. Towards an Energy-Efficient Anomaly-Based Intrusion Detection Engine for Embedded Systems.
- Author
-
Viegas, Eduardo, Santin, Altair Olivo, Franca, Andre, Jasinski, Ricardo, Pedroni, Volnei A., and Oliveira, Luiz S.
- Subjects
ENERGY consumption ,INTRUSION detection systems (Computer security) ,EMBEDDED computer systems ,COMPUTER network security ,ENERGY measurement - Abstract
Nowadays, a significant part of all network accesses comes from embedded and battery-powered devices, which must be energy efficient. This paper demonstrates that a hardware (HW) implementation of network security algorithms can significantly reduce their energy consumption compared to an equivalent software (SW) version. The paper has four main contributions: (i) a new feature extraction algorithm, with low processing demands and suitable for hardware implementation; (ii) a feature selection method with two objectives—accuracy and energy consumption; (iii) detailed energy measurements of the feature extraction engine and three machine learning (ML) classifiers implemented in SW and HW—Decision Tree (DT), Naive-Bayes (NB), and k-Nearest Neighbors (kNN); and (iv) a detailed analysis of the tradeoffs in implementing the feature extractor and ML classifiers in SW and HW. The new feature extractor demands significantly less computational power, memory, and energy. Its SW implementation consumes only 22 percent of the energy used by a commercial product and its HW implementation only 12 percent. The dual-objective feature selection enabled an energy saving of up to 93 percent. Comparing the most energy-efficient SW implementation (new extractor and DT classifier) with an equivalent HW implementation, the HW version consumes only 5.7 percent of the energy used by the SW version. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
28. Collaborative Adaptation for Energy-Efficient Heterogeneous Mobile SoCs.
- Author
-
Singh, Amit Kumar, Basireddy, Karunakar Reddy, Prakash, Alok, Merrett, Geoff V., and Al-Hashimi, Bashir M.
- Subjects
SYSTEMS on a chip ,HETEROGENEOUS computing ,PHYSIOLOGICAL adaptation ,ENERGY consumption ,CENTRAL processing units ,GRAPHICS processing units - Abstract
Heterogeneous Mobile Systems-on-Chips (SoCs) containing CPU and GPU cores are becoming prevalent in embedded computing and need to execute applications concurrently. However, existing run-time management approaches do not perform adaptive mapping and thread-partitioning of applications while exploiting both CPU and GPU cores at the same time. In this paper, we propose an adaptive mapping and thread-partitioning approach for energy-efficient execution of concurrent OpenCL applications on both CPU and GPU cores while satisfying performance requirements. To start execution of concurrent applications, the approach makes mapping (number of cores and operating frequencies) and partitioning (distribution of threads between CPU and GPU) decisions to satisfy the performance requirement of each application. The mapping and partitioning decisions are made collaboratively, taking the processing capabilities of both the CPU and GPU cores into account, so that execution is balanced. During execution, adaptation is triggered when new application(s) arrive or an executing one finishes and frees cores. The adaptation process identifies a new mapping and thread-partitioning in a similar collaborative manner for the remaining applications, provided it leads to an improvement in energy efficiency. The proposed approach is experimentally validated on the Odroid-XU3 hardware platform with varying sets of applications. Results show an average energy saving of 37% compared to existing approaches, while satisfying the performance requirements. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
29. Per-Operation Reusability Based Allocation and Migration Policy for Hybrid Cache.
- Author
-
Oh, Minsik, Kim, Kwangsu, Choi, Duheon, Lee, Hyuk-Jun, and Chung, Eui-Young
- Subjects
CACHE memory ,COMPUTER software reusability ,COST functions ,RANDOM access memory - Abstract
Recently, hybrid caches consisting of SRAM and STT-RAM have attracted much attention as a future memory, since the two technologies complement each other's characteristics. Prior works focused on developing data allocation and migration techniques that consider write intensity to reduce write energy in the STT-RAM. However, these works often neglect the impact of the operation-specific reusability of a cache line. In this paper, we propose an energy-efficient per-operation reusability-based allocation and migration policy (ORAM) with a unified LRU replacement policy. First, to select an adequate memory type for allocation, we propose a cost function based on per-operation reusability (the gain from an allocated cache line and the loss from an evicted cache line for the different memory types), which exploits temporal locality. Besides, we present a migration policy (a victim and target cache line selection scheme) to resolve memory-type inconsistency between the replacement policy and the allocation policy, with further energy reduction. Experimental results show an average energy reduction in the LLC and the main memory of 12.3 and 21.2 percent, and improvements in latency and execution time of 21.2 and 8.8 percent, respectively, compared with a baseline hybrid cache management scheme. In addition, the Energy-Delay Product (EDP) is improved by 36.9 percent over the baseline. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
30. TAP: Reducing the Energy of Asymmetric Hybrid Last-Level Cache via Thrashing Aware Placement and Migration.
- Author
-
Luo, Jing-Yuan, Cheng, Hsiang-Yun, Lin, Ing-Chao, and Chang, Da-Wei
- Subjects
STATIC random access memory ,CACHE memory ,RANDOM access memory ,ENERGY consumption ,NONVOLATILE memory - Abstract
Emerging non-volatile memories (NVMs) have favorable properties, such as low leakage and high density, and have attracted a lot of attention in recent years. Among them, spin-transfer torque magnetoresistive random access memory (STT-MRAM) with SRAM-comparable read speed is a good candidate to build large last-level caches (LLCs). However, STT-MRAM suffers from long write latency and high write energy. To mitigate the impact of asymmetric read/write energy and latency, hybrid cache designs have been proposed to combine the merits of STT-MRAM and SRAM. In such a hybrid SRAM/STT-MRAM LLC, intelligent block placement and migration policies are needed to improve the energy efficiency. Prior studies map write-intensive blocks to SRAM and keep read-intensive blocks in STT-MRAM for reducing the energy consumption of hybrid LLCs. The write-intensive/read-intensive blocks are usually captured by sampling the address (PC) of memory access instructions or adding simple access counters in each cache line. Nevertheless, these prior approaches cannot fully capture the energy-harmful access behavior in STT-MRAM, especially the writes caused by repetitive data transfer between the LLC and upper-level caches. In this paper, we find that conflict misses in L2 often generate thrashing blocks which move back and forth between L2 and LLC. If dirty thrashing blocks that incur extensive writes are placed in STT-MRAM, energy consumption would excessively increase, especially when running memory-bound workloads. Thus, we propose a thrashing aware placement and migration policy (TAP) to tackle the challenge. TAP places dirty thrashing blocks into SRAM and migrates clean thrashing blocks from SRAM to STT-MRAM. Evaluation results show that TAP can provide significant energy savings with minimal performance loss. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
31. Multicore-Aware Virtual Machine Placement in Cloud Data Centers.
- Author
-
Mann, Zoltan Adam
- Subjects
VIRTUAL machine systems ,SERVER farms (Computer network management) ,CLOUD computing ,CONSTRAINT programming ,ENERGY consumption ,SERVICE level agreements - Abstract
Finding the best way to map virtual machines (VMs) to physical machines (PMs) in a cloud data center is an important optimization problem, with significant impact on costs, performance, and energy consumption. In most situations, the computational capacity of PMs and the computational load of VMs are a vital aspect to consider in the VM-to-PM mapping. Previous work modeled computational capacity and load as one-dimensional quantities. However, today's PMs have multiple processor cores, all of which can be shared by cores of multiple multicore VMs, leading to complex scheduling issues within a single PM, which the one-dimensional problem formulation cannot capture. In this paper, we argue that at least a simplified model of these scheduling issues should be taken into account during VM placement. We show how constraint programming techniques can be used to solve this problem, leading to significant improvement over non-multicore-aware VM placement. Several ways are presented to hybridize an exact constraint solver with common packing heuristics to derive an effective and scalable algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
32. Resource-Efficient Byzantine Fault Tolerance.
- Author
-
Distler, Tobias, Cachin, Christian, and Kapitza, Rudiger
- Subjects
FAULT tolerance (Engineering) ,ENERGY consumption ,FINITE state machines ,EMAIL systems ,COMPUTER network protocols - Abstract
One of the main reasons why Byzantine fault-tolerant (BFT) systems are currently not widely used lies in their high resource consumption: 3f+1 replicas are required to tolerate only f faults. Recent works have been able to reduce the minimum number of replicas to 2f+1 by relying on trusted subsystems that prevent a faulty replica from making conflicting statements to other replicas without being detected. Nevertheless, having been designed with the focus on fault handling, during normal-case operation these systems still use more resources than actually necessary to make progress in the absence of faults. This paper presents Resource-efficient Byzantine Fault Tolerance (ReBFT), an approach that minimizes the resource usage of a BFT system during normal-case operation by keeping f replicas in a passive mode. In contrast to active replicas, passive replicas neither participate in the agreement protocol nor execute client requests; instead, they are brought up to speed by verified state updates provided by active replicas. In case of suspected or detected faults, passive replicas are activated in a consistent manner. To underline the flexibility of our approach, we apply ReBFT to two existing BFT systems: PBFT and MinBFT. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
33. Node Scaling Analysis for Power-Aware Real-Time Tasks Scheduling.
- Author
-
Yu, Lei, Teng, Fei, and Magoules, Frederic
- Subjects
MULTICORE processors ,CONJOINT analysis ,PERFORMANCE evaluation ,SCHEDULING ,ENERGY consumption ,POWER aware computing ,REAL-time computing - Abstract
Multi-core processors achieve a trade-off between performance and power consumption by using Dynamic Voltage Scaling (DVS) techniques. In this paper, we study the power-efficient scheduling problem for real-time tasks on an identical multi-core system, and present the Node Scaling model to achieve power-aware scheduling. We prove that there is a bound speed that results in the minimal power consumption for a given task set, and that the maximal task utilization, u_max, in a task set is the key element that decides its minimal power consumption. Based on the value of u_max, we classify task sets into two categories, bounded and non-bounded task sets, and we prove the lower bound of power consumption for each type. Simulations based on Intel Xeon X5550 and PXA270 processors show that the Node Scaling model can achieve power-efficient scheduling when applied to existing algorithms such as EDF-FF and SPA2. The ratio of power reduction depends on a property of the multi-core processor, defined as the ratio of the bound speed to the maximal speed of the cores. As this ratio of speeds decreases, the ratio of power reduction increases for all the power-efficient algorithms. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
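The existence of a bound speed, as claimed in the abstract above, can be illustrated with the textbook DVS power model P(s) = P_static + c·s^alpha: the energy per unit of work E(s) = P(s)/s has a unique minimizer. The model, the constants, and the closed form below are standard assumptions for illustration, not the paper's exact formulation.

```python
# Worked numeric sketch of a "bound speed" under a common DVS power model.
P_static, c, alpha = 0.5, 1.0, 3.0   # illustrative constants

def energy_per_work(s):
    return (P_static + c * s**alpha) / s

# Setting dE/ds = 0 gives s* = (P_static / (c * (alpha - 1)))**(1/alpha):
# below s*, static energy dominates; above it, dynamic energy does.
s_bound = (P_static / (c * (alpha - 1))) ** (1.0 / alpha)
print(f"bound speed ~ {s_bound:.3f}")
print(f"E(s_bound) = {energy_per_work(s_bound):.3f}, "
      f"E(1.0) = {energy_per_work(1.0):.3f}")
```

Running at the maximal speed (s = 1.0) costs noticeably more energy per unit of work than running at the bound speed, which is why the ratio of bound speed to maximal speed governs the achievable power reduction.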
34. Concertina: Squeezing in Cache Content to Operate at Near-Threshold Voltage.
- Author
-
Ferreron, Alexandra, Suarez-Gracia, Dario, Alastruey-Benede, Jesus, Monreal-Arnal, Teresa, and Ibanez, Pablo
- Subjects
CACHE memory ,ELECTRIC potential ,ENERGY consumption ,RANDOM access memory ,RELIABILITY in engineering - Abstract
Scaling the supply voltage to values near the threshold voltage allows a dramatic decrease in the power consumption of processors; however, the lower the voltage, the higher the sensitivity to process variation, and, hence, the lower the reliability. Large SRAM structures, like the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high density requirements. In this paper, we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks so that they can be allocated to cache entries with faulty cells, enabling use of 100 percent of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. With a modest storage overhead, Concertina reaches the performance of an ideal system whose LLC does not suffer from parameter variation. Specifically, performance degrades by less than 2 percent even when using small SRAM cells, for which over 90 percent of cache entries have defective cells; this represents a notable improvement on previously proposed techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
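A minimal Python sketch of the null-subblock compression idea described above: only non-null subblocks need storage, so a compressed block can occupy a cache entry whose faulty subentries are skipped. The subblock layout, sizes, and fit rule are illustrative assumptions.

```python
# Hypothetical null-data compression and fault-aware fit check.
SUBBLOCKS = 8  # e.g., a 64-byte block split into 8-byte subblocks

def compress(block):
    """Return (presence bitmap, list of non-null subblocks)."""
    bitmap = [sb != b"\x00" * 8 for sb in block]
    return bitmap, [sb for sb, bit in zip(block, bitmap) if bit]

def fits(bitmap, faulty):
    """A block fits if its non-null subblocks can use healthy subentries."""
    needed = sum(bitmap)
    healthy = SUBBLOCKS - sum(faulty)
    return needed <= healthy

block = [b"\x00" * 8] * 6 + [b"\x2a" * 8] * 2            # mostly null data
bitmap, payload = compress(block)
faulty_entry = [True, False, True, False] + [False] * 4  # 2 faulty subentries
print(fits(bitmap, faulty_entry))  # True: 2 subblocks, 6 healthy slots
```

The insertion/replacement policy in the paper goes further, steering highly compressible blocks toward heavily faulty entries; the fit check above is only the underlying feasibility test.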
35. SqueezeFlow: A Sparse CNN Accelerator Exploiting Concise Convolution Rules.
- Author
-
Li, Jiajun, Jiang, Shuhao, Gong, Shijun, Wu, Jingya, Yan, Junchao, Yan, Guihai, and Li, Xiaowei
- Subjects
ARTIFICIAL neural networks ,ENERGY consumption ,MATHEMATICAL convolutions - Abstract
Convolutional Neural Networks (CNNs) have been widely used in machine learning tasks. While delivering state-of-the-art accuracy, CNNs are known as both compute- and memory-intensive. This paper presents the SqueezeFlow accelerator architecture that exploits sparsity of CNN models for increased efficiency. Unlike prior accelerators that trade complexity for flexibility, SqueezeFlow exploits concise convolution rules to benefit from the reduction of computation and memory accesses as well as the acceleration of existing dense architectures without intrusive PE modifications. Specifically, SqueezeFlow employs a PT-OS-sparse dataflow that removes the ineffective computations while maintaining the regularity of CNN computations. We present a full design down to the layout at 65 nm, with an area of 4.80 mm² and power of 536.09 mW. The experiments show that SqueezeFlow achieves a speedup of 2.9× on VGG16 compared to the dense architectures, with an area and power overhead of only 8.8 and 15.3 percent, respectively. On three representative sparse CNNs, SqueezeFlow improves the performance and energy efficiency by 1.8× and 1.5× over the state-of-the-art sparse accelerators. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
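The computation reduction that sparse CNN accelerators exploit can be shown with a generic zero-skipping convolution in Python: zero weights contribute nothing, so only the non-zero (weight, offset) pairs are processed. This is a plain zero-skipping loop for illustration, not the PT-OS-sparse dataflow named in the abstract.

```python
import numpy as np

def sparse_conv2d_valid(x, w):
    """2D 'valid' convolution that skips zero kernel weights."""
    kh, kw = w.shape
    nz = [(i, j, w[i, j]) for i in range(kh) for j in range(kw)
          if w[i, j] != 0]                     # compress the kernel once
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for i, j, wv in nz:                        # one shifted MAC per nonzero
        y += wv * x[i:i + oh, j:j + ow]
    return y

x = np.arange(16.0).reshape(4, 4)
w = np.array([[0.0, 1.0], [0.0, 0.0]])         # 75% sparse kernel
print(np.allclose(sparse_conv2d_valid(x, w),
                  x[0:3, 1:4]))                # single nonzero tap: True
```

Note how the inner loop stays regular (a dense shifted multiply-accumulate per nonzero), which mirrors the abstract's point about removing ineffective computation without destroying regularity.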
36. HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems.
- Author
-
Kuan, Kyle and Adegbija, Tosiron
- Subjects
STATIC random access memory ,MULTICORE processors ,RF values (Chromatography) ,COMMERCIAL buildings ,RANDOM access memory ,RECORDS management ,ENERGY consumption - Abstract
Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable for the last-level cache (LLC), which typically consumes a large area of silicon die. However, long write latency and high write energy remain challenges for implementing STT-RAMs in the CPU cache. An increasingly popular method for addressing this challenge involves trading off non-volatility for reduced write latency and write energy by relaxing the STT-RAM's data retention time. However, to maximize the energy-saving potential, the cache configuration, including the STT-RAM's retention time, must be dynamically adapted to executing applications' variable memory needs. In this paper, we propose a highly adaptable last-level STT-RAM cache (HALLS) that allows the LLC configuration and retention time to be adapted to applications' runtime execution requirements. We also propose low-overhead runtime tuning algorithms to dynamically determine the best (lowest-energy) cache configurations and retention times for executing applications. Compared to prior work, HALLS reduces the average energy consumption by 60.57 percent in a quad-core system, while introducing marginal latency overhead. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
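The tuning loop described above can be sketched as a search over candidate configurations scored by an energy model. The candidate values and the placeholder energy model below are invented for illustration; the paper's tuning algorithms are more refined than this exhaustive scan.

```python
# Hypothetical configuration search for an adaptable STT-RAM LLC.
CONFIGS = [(size, assoc, retention)
           for size in (2, 4, 8)               # MB
           for assoc in (4, 8)
           for retention in (10, 100, 1000)]   # ms

def estimate_energy(cfg, stats):
    size, assoc, retention = cfg
    # Placeholder model: leakage grows with size; write energy falls as
    # retention is relaxed downward; refresh cost rises as it shortens.
    leakage = 0.1 * size
    writes = stats["writes"] * (0.5 + 0.05 * retention ** 0.5)
    refresh = stats["blocks"] * stats["time_ms"] / retention
    return leakage + writes + refresh

stats = {"writes": 1e4, "blocks": 1e5, "time_ms": 100}  # profiled per epoch
best = min(CONFIGS, key=lambda cfg: estimate_energy(cfg, stats))
print("lowest-energy config (size MB, assoc, retention ms):", best)
```

The interesting tension the model captures is that shortening retention cheapens each write but forces more refreshes, so the best retention time genuinely depends on the application's write intensity.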
37. Energy-Efficient Permanent Fault Tolerance in Hard Real-Time Systems.
- Author
-
Mireshghallah, FatemehSadat, Bakhshalipour, Mohammad, Sadrosadati, Mohammad, and Sarbazi-Azad, Hamid
- Subjects
ENERGY consumption ,FAULT-tolerant computing ,REDUNDANCY in engineering ,RELIABILITY in engineering ,TASK analysis - Abstract
Triple Modular Redundancy (TMR) is a historical, long-used approach for masking various kinds of faults. By employing redundancy and analyzing the results of three separate executions of the same program, TMR is able to attain excellent levels of reliability. While TMR provides a desirable level of reliability, it suffers from the high power consumption of the redundant hardware, a severe detriment to its broad adoption. The energy consumption of TMR can be mitigated if its operations are divided into two stages and one stage is dropped in the absence of faults. Such an approach, evaluated in recent research, nevertheless quickly fails in the presence of permanent faults, as we show in this paper. In this work, we introduce Reactive TMR, a novel energy-efficient approach for tolerating both transient and permanent faults. The key idea is to detect and deactivate faulty components and re-assign their tasks to functioning ones. Using a combination of static scheduling and dynamic task management, our method decouples tasks from cores that are susceptible to faulty execution; hence, it inherently tolerates permanent faults and improves both reliability and energy efficiency. Through a detailed evaluation, we show that our proposal reduces the energy consumption of baseline TMR by 30 percent while preserving its reliability. Compared to the state-of-the-art proposal for TMR, our method adds hard-fault tolerance to the system while maintaining the same energy consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
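A toy Python sketch of the two ideas combined in the abstract above: run two copies first and invoke the third only on disagreement (the two-stage energy saving), and retire a core whose results repeatedly disagree (the reaction to a suspected permanent fault). The threshold, the vote, and the core pool are illustrative assumptions.

```python
from collections import Counter

suspect = Counter()
healthy = {0, 1, 2, 3}           # pool of cores
THRESHOLD = 3

def run_on(core, task):
    return task(core)            # stand-in for real execution on a core

def reactive_tmr(task):
    cores = sorted(healthy)[:3]
    a, b = run_on(cores[0], task), run_on(cores[1], task)
    if a == b:                   # common case: third execution dropped
        return a
    c = run_on(cores[2], task)   # tie-break with the third core
    result, _ = Counter([a, b, c]).most_common(1)[0]
    for core, r in zip(cores, (a, b, c)):
        if r != result:
            suspect[core] += 1
            if suspect[core] >= THRESHOLD:
                healthy.discard(core)   # suspected permanent fault: retire
    return result

# Core 1 computes incorrectly; after THRESHOLD disagreements it is retired
# and its work falls to the remaining healthy cores.
for _ in range(3):
    reactive_tmr(lambda core: 0 if core != 1 else 99)
print(healthy)   # {0, 2, 3}
```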
38. Segmented Tag Cache: A Novel Cache Organization for Reducing Dynamic Read Energy.
- Author
-
Kim, Moonsoo, Chang, Ik-Joon, and Lee, Hyuk-Jae
- Subjects
CACHE memory ,ENERGY consumption ,RANDOM access memory ,ORGANIZATION - Abstract
A set-associative cache organization is widely used to achieve a high hit rate in modern caches, leading to a considerable enhancement in system performance. In a conventional set-associative cache, the tag and data arrays are accessed simultaneously to achieve fast access, but doing so consumes a large amount of energy. Previous attempts at energy reduction, such as way prediction and sequential tag access, still incur substantial energy consumption due to the tag access. This paper presents a new cache organization termed Segmented Tag Cache (STC), which reduces the energy consumed during tag access. The proposed organization first introduces a new tag organization that supports partial tag access: the tag array is segmented into two parts, and a partial tag access reads only the low-order part of the tag. A cache access scheme suitable for the proposed tag organization is then developed. Under this scheme, the delay of the tag access is hidden by the data access delay, so the overall cache access time does not increase. Simulation results show that the STC reduces the energy-delay product by approximately 58 percent compared to conventional cache organizations, with a negligible performance penalty. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
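The partial-tag filtering described above can be illustrated in Python: compare the cheap low-order tag segment across all ways first, and read the high-order segment only for the ways that survive. The bit widths and data structures are illustrative assumptions, not the paper's exact organization.

```python
# Hypothetical segmented-tag lookup: cheap low-order filter, then full compare.
LOW_BITS = 6
LOW_MASK = (1 << LOW_BITS) - 1

class SetEntry:
    def __init__(self, tag, data):
        self.tag_lo = tag & LOW_MASK      # segmented tag array: low part
        self.tag_hi = tag >> LOW_BITS     # ... and high part
        self.data = data

def lookup(cache_set, tag):
    lo = tag & LOW_MASK
    candidates = [e for e in cache_set if e.tag_lo == lo]  # partial tag access
    for e in candidates:                  # full compare on survivors only
        if e.tag_hi == tag >> LOW_BITS:
            return e.data
    return None                           # miss

ways = [SetEntry(0b1010_000001, "A"), SetEntry(0b0110_000010, "B")]
print(lookup(ways, 0b1010_000001))        # "A": one high-tag compare needed
print(lookup(ways, 0b1111_111111))        # None: filtered out cheaply
```

The energy win comes from how rarely different tags collide in their low-order bits: most ways are rejected after reading only the small low-order segment.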
39. HyVE: Hybrid Vertex-Edge Memory Hierarchy for Energy-Efficient Graph Processing.
- Author
-
Dai, Guohao, Huang, Tianhao, Wang, Yu, Yang, Huazhong, and Wawrzynek, John
- Subjects
DYNAMIC random access memory ,MAGNITUDE (Mathematics) ,ENERGY consumption ,MEMORY ,NONVOLATILE memory - Abstract
High energy consumption of conventional memory modules (e.g., DRAMs) hinders further improvement of large-scale graph processing's energy efficiency. The emerging resistive random-access memory (ReRAM) has shown great potential as an energy-efficient memory module. However, the performance of ReRAMs suffers under data access patterns with poor locality and large amounts of written data, both of which are common in graph processing. In this paper, we propose HyVE, a Hybrid Vertex-Edge memory hierarchy for energy-efficient graph processing. In HyVE, we keep random accesses and data writes away from the ReRAM modules. HyVE can reduce memory energy consumption by 86.17 percent compared with conventional memory systems. We also propose data sharing and bank-level power-gating schemes, which improve energy efficiency by 1.60x and 1.53x. By analyzing the graph processing model on ReRAMs, we show that ReRAMs are well suited to read-intensive operations in graph processing (e.g., reading edges), while ReRAM crossbars are not suitable for processing edges because of heavy writing overheads. Our evaluations show that the optimized design achieves a two-orders-of-magnitude and a 5.90x energy efficiency improvement compared with CPU-based and conventional memory-hierarchy-based designs, respectively. Moreover, HyVE achieves a 2.83x energy reduction compared with the previous ReRAM-based graph processing architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
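The placement intuition behind a hybrid vertex-edge hierarchy can be shown with a small energy comparison in Python: read-mostly edge data suits ReRAM (cheap reads, costly writes), while frequently updated vertex data stays in DRAM. The energy constants and access profiles are made up for illustration.

```python
# Hypothetical per-access energies (relative units) for the two memories.
RERAM = {"read": 1.0, "write": 8.0}
DRAM  = {"read": 2.0, "write": 2.0}

def placement_energy(reads, writes, mem):
    return reads * mem["read"] + writes * mem["write"]

# Per-iteration access profile of a toy graph workload.
edges    = {"reads": 1_000_000, "writes": 0}          # streamed, read-only
vertices = {"reads": 200_000,   "writes": 150_000}    # updated every round

for name, acc in (("edges", edges), ("vertices", vertices)):
    e_reram = placement_energy(acc["reads"], acc["writes"], RERAM)
    e_dram  = placement_energy(acc["reads"], acc["writes"], DRAM)
    choice = "ReRAM" if e_reram < e_dram else "DRAM"
    print(f"{name}: ReRAM={e_reram:.0f}, DRAM={e_dram:.0f} -> {choice}")
```

Under these (invented) constants, edges land in ReRAM and vertices in DRAM, which is exactly the hybrid split the abstract argues for.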
40. Optimal Application Mapping and Scheduling for Network-on-Chips with Computation in STT-RAM Based Router.
- Author
-
Yang, Lei, Liu, Weichen, Guan, Nan, and Dutt, Nikil
- Subjects
NETWORK routers ,NONVOLATILE memory ,RANDOM access memory ,HEURISTIC algorithms ,ENERGY consumption - Abstract
Spin-Torque Transfer Magnetic RAM (STT-RAM), one of the emerging nonvolatile memory (NVM) technologies explored as a replacement for SRAM memory architectures, is particularly promising due to its fast access speed, high integration density, and zero standby power consumption. Recently, hybrid designs with SRAM and STT-RAM buffers for routers in Network-on-Chip (NoC) systems have been widely implemented to exploit the mutually complementary characteristics of the different memory technologies and to improve intra-router latency and system power consumption. With the realization of Processing-in-Memory enabled by STT-RAM, in this paper we offload execution from processors to the STT-RAM based on-chip routers to improve application performance. On top of the hybrid buffer design in routers, we further present system-level approaches, including an ILP model and polynomial-time heuristic algorithms, to fine-tune the application mapping and scheduling on NoCs, with the objective of improving system performance-energy efficiency. Network overhead caused by flit conflicts in conventional communication can be avoided by computing the contended flits in intermediate routers; meanwhile, the pressure of heavy workloads on processors can be relieved by transferring partial operations to routers, so that network latency and system power consumption are significantly reduced. Experimental results demonstrate that application schedule length and system energy consumption can be reduced by 35.62 and 32.87 percent on average, respectively, in extensive evaluation experiments on PARSEC benchmark applications. In particular, the improvements in application performance and energy efficiency, 36.44 and 33.19 percent on average, for the CNN application AlexNet verify the practicability and effectiveness of our approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
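As a companion to the mapping problem above, here is a tiny Python sketch of a polynomial-time greedy mapping heuristic: place the most heavily communicating task pairs on adjacent mesh nodes to cut hop counts. The greedy rule is a generic stand-in for intuition only, not the paper's ILP model or its heuristics.

```python
# Hypothetical greedy task-to-mesh mapping by communication volume.
MESH = 2   # 2x2 mesh of routers/cores

def hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])   # XY routing distance

def greedy_map(traffic):
    """traffic: {(task_i, task_j): volume}. Returns task -> (x, y)."""
    free = [(x, y) for x in range(MESH) for y in range(MESH)]
    placed = {}
    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        for t in (a, b):
            if t not in placed:
                # put t on the free node closest to its already-placed peer
                peers = [placed[p] for p in placed if p in (a, b)]
                anchor = peers[0] if peers else free[0]
                spot = min(free, key=lambda n: hops(n, anchor))
                placed[t] = spot
                free.remove(spot)
    return placed

traffic = {("A", "B"): 90, ("B", "C"): 40, ("C", "D"): 10}
print(greedy_map(traffic))   # heaviest pairs end up one hop apart
```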
41. Intra-Cluster Coalescing and Distributed-Block Scheduling to Reduce GPU NoC Pressure.
- Author
-
Wang, Lu, Zhao, Xia, Kaeli, David, Wang, Zhiying, and Eeckhout, Lieven
- Subjects
COMPUTER scheduling ,GRAPHICS processing units ,ENERGY consumption - Abstract
GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Sending redundant requests to access the same memory location wastes valuable NoC bandwidth: we find on average 19 percent (and up to 48 percent) of the requests to be redundant. To remove this redundant NoC traffic, we propose distributed-block scheduling together with intra-cluster coalescing (ICC) and the coalesced cache (CC), which coalesce L1 cache misses within and across the SMs of a cluster, respectively. Our evaluation results show that distributed-block scheduling, ICC and CC are complementary and improve both performance and energy consumption. We report an average performance improvement of 15 percent (and up to 67 percent), while reducing system energy by 6 percent (and up to 19 percent) and improving the energy-delay product (EDP) by 19 percent on average (and up to 53 percent), compared to state-of-the-art distributed CTA scheduling. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
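The coalescing idea in the abstract above resembles an MSHR-style pending-request table at the cluster's network port: if several SMs miss on the same address, only the first request crosses the NoC and the rest wait on the pending entry. The sketch below is a toy model with made-up names, not the ICC/CC hardware design itself.

```python
from collections import defaultdict

pending = defaultdict(list)   # address -> SMs waiting on it
noc_requests = 0

def send_to_memory(addr):
    pass                              # stand-in for the real NoC send

def l1_miss(sm_id, addr):
    global noc_requests
    if pending[addr]:                 # coalesce: request already in flight
        pending[addr].append(sm_id)
        return
    pending[addr].append(sm_id)
    noc_requests += 1                 # only the first miss uses the NoC
    send_to_memory(addr)

def on_reply(addr, data):
    for sm_id in pending.pop(addr):   # fan the reply out inside the cluster
        deliver(sm_id, addr, data)

def deliver(sm_id, addr, data):
    print(f"SM{sm_id} gets {addr:#x}")

for sm in (0, 1, 2, 3):               # four SMs sharing read-only data
    l1_miss(sm, 0x1000)
on_reply(0x1000, b"...")
print("NoC requests:", noc_requests)  # 1 instead of 4
```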
42. Promoting the Harmony between Sparsity and Regularity: A Relaxed Synchronous Architecture for Convolutional Neural Networks.
- Author
-
Lu, Wenyan, Yan, Guihai, Li, Jiajun, Gong, Shijun, Jiang, Shuhao, Wu, Jingya, and Li, Xiaowei
- Subjects
ARTIFICIAL neural networks ,ENERGY consumption ,NEURONS ,COMPUTER architecture - Abstract
There are two approaches to improve the performance of Convolutional Neural Networks (CNNs): 1) accelerating computation and 2) reducing the amount of computation. The acceleration approaches take the advantage of CNN computing regularity which enables abundant fine-grained parallelisms in feature maps, neurons, and synapses. Alternatively, reducing computations leverages the intrinsic sparsity of CNN neurons and synapses. The sparsity represents as the computing "bubbles", i.e., zero or tiny-valued neurons and synapses. These bubbles can be removed to reduce the volume of computations. Although distinctly different from each other in principle, we find that the two types of approaches are not orthogonal to each other. Even worse, they may conflict to each other when working together. The conditional branches introduced by some bubble-removing mechanisms in the original computations destroy the regularity of deeply nested loops, thereby impairing the intrinsic parallelisms. Therefore, enabling the synergy between the two types of approaches is critical to arrive at superior performance. This paper proposed a relaxed synchronous computing architecture, FlexFlow-Pro, to fulfill this purpose. Compared with the state-of-the-art accelerators, the FlexFlow-Pro gains more than 2.5× performance on average and 2× energy efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
43. A Profit Maximization Scheme with Guaranteed Quality of Service in Cloud Computing.
- Author
-
Mei, Jing, Li, Kenli, Ouyang, Aijia, and Li, Keqin
- Subjects
PROFIT maximization ,QUALITY of service ,CLOUD computing ,PERFORMANCE evaluation ,COMPARATIVE studies ,ENERGY consumption - Abstract
As an effective and efficient way to provide computing resources and services to customers on demand, cloud computing has become more and more popular. From cloud service providers' perspective, profit is one of the most important considerations, and it is mainly determined by the configuration of a cloud service platform under given market demand. However, a single long-term renting scheme is usually adopted to configure a cloud platform, which cannot guarantee the service quality and leads to serious resource waste. In this paper, a double resource renting scheme combining short-term renting with long-term renting is first designed to address these issues; it can effectively guarantee the quality of service of all requests and greatly reduce resource waste. Second, the service system is modeled as an M/M/m+D queuing model, and the performance indicators that affect the profit of our double renting scheme are analyzed, e.g., the average charge and the ratio of requests that need temporary servers. Third, a profit maximization problem is formulated for the double renting scheme, and the optimized configuration of a cloud platform is obtained by solving it. Finally, a series of calculations are conducted to compare the profit of our proposed scheme with that of the single renting scheme. The results show that our scheme can not only guarantee the service quality of all requests, but also obtain more profit than the single renting scheme. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
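To give a feel for this style of analysis, here is a worked Python sketch that picks the server count m maximizing profit = revenue - rental cost - a waiting penalty, using plain M/M/m formulas (Erlang C) as a simplification of the paper's M/M/m+D model with deadlines. All prices, rates, and the penalty term are invented.

```python
from math import factorial

lam, mu = 40.0, 5.0            # arrival rate, per-server service rate
price, cost = 1.0, 3.0         # revenue per request, cost per server

def erlang_c(m, a):
    """P(wait) for an M/M/m queue with offered load a = lam/mu (needs a < m)."""
    rho = a / m
    s = sum(a**k / factorial(k) for k in range(m))
    top = a**m / (factorial(m) * (1 - rho))
    return top / (s + top)

def profit(m):
    a = lam / mu
    if a >= m:
        return float("-inf")              # unstable queue: never profitable
    wq = erlang_c(m, a) / (m * mu - lam)  # mean waiting time in queue
    penalty = lam * wq * 0.5              # assumed QoS penalty per unit wait
    return lam * price - cost * m - penalty

best = max(range(1, 31), key=profit)
print("best server count:", best, "profit:", round(profit(best), 2))
```

The shape of the trade-off is the point: too few servers and the waiting penalty explodes; too many and rental cost eats the revenue, so profit peaks at an interior m.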
44. Online Energy Estimation of Relational Operations in Database Systems.
- Author
-
Xu, Zichen, Tu, Yi-Cheng, and Wang, Xiaorui
- Subjects
DATABASE management ,DATA libraries ,ENERGY consumption ,SEARCH algorithms ,CENTRAL processing units - Abstract
Data centers are well known to consume a large amount of energy. As databases are one of the major applications in a data center, building energy-aware database systems has become an active research topic recently. The quantification of the energy cost of database systems is an important task in design. In this paper, we report our recent efforts on this issue, with a focus on the energy cost estimation of query plans during query optimization. We start from building a series of physical models for energy estimation of individual relational operators based on their resource consumption patterns. As the execution of a query plan is a combination of multiple relational operators, we use the physical models as a basis for a comprehensive energy model for the entire query. To address the challenge of maintaining accuracy under system and workload dynamics, we develop an online scheme that dynamically adjusts model parameters based on statistical signal modeling. Our models are implemented in a real database management system and evaluated on a physical test bed. The results show that our solution achieves a high accuracy (worst-case error 13.7 percent) despite noises. Our models also help identify query plans with significantly higher energy efficiency. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
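The two-layer idea in entry 44 above, per-operator physical models plus an online correction as measurements arrive, can be sketched with a linear model over resource counters and an LMS-style update. The feature names, constants, and the gradient step are generic stand-ins for the paper's statistical signal modeling.

```python
import numpy as np

w = np.array([2.0, 0.5, 1.0])   # initial coeffs: [cpu_s, mb_read, mb_hashed]

def predict(ops):
    """Energy of a plan = sum of its operators' modeled energies."""
    return sum(float(w @ x) for x in ops)

def online_update(x, measured, eta=1e-4):
    """Nudge coefficients toward the observed energy (LMS-style step)."""
    global w
    err = measured - float(w @ x)
    w += eta * err * x

scan = np.array([0.2, 50.0, 0.0])     # counters for a toy scan operator
join = np.array([1.5, 10.0, 30.0])    # ... and a toy hash join
print("plan estimate:", predict([scan, join]))

# Feedback from the power meter gradually recalibrates the scan model.
for _ in range(200):
    online_update(scan, measured=40.0)
print("recalibrated scan estimate:", round(float(w @ scan), 2))
```

Keeping the model linear in measurable counters is what makes it usable inside a query optimizer: each candidate plan's energy estimate is just a dot product per operator.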
45. DCCS: Double Circular Caching Scheme for DRAM/PRAM Hybrid Cache.
- Author
-
Hsieh, Jen-Wei and Kuan, Yuan-Hung
- Subjects
CACHE memory ,INFORMATION storage & retrieval systems ,COMPARATIVE studies ,COMPUTER algorithms ,ENERGY consumption - Abstract
DRAM is widely adopted as a cache for secondary storage due to its small access latency. Compared with DRAM, PRAM has drawn a lot of attention recently, since it provides higher density and has no need to periodically refresh capacitor charge. The non-volatile nature of PRAM can even reduce compulsory misses, which cannot be avoided by a DRAM cache. However, PRAM cannot simply replace a DRAM cache due to its endurance issue, so a DRAM/PRAM hybrid cache becomes a good alternative to the traditional DRAM cache. The least recently used (LRU) replacement algorithm and the CLOCK-Pro algorithm work well for a traditional DRAM cache, but they cannot be directly applied to a DRAM/PRAM hybrid cache since they do not consider the characteristics of PRAM. This paper proposes a double circular caching scheme (DCCS) to manage the DRAM/PRAM hybrid cache. In our scheme, cached data migrate between the DRAM cache and the PRAM cache adaptively to achieve a good hit ratio, while frequent writes to the PRAM cache are avoided out of endurance concerns. The experimental results show that, compared with other caching schemes, our scheme can reduce PRAM write accesses by up to 87.10 percent for read-intensive access patterns and energy consumption by up to 44.90 percent for write-intensive access patterns. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
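A toy Python sketch of the migration policy idea behind such hybrid caches: serve writes from DRAM and pull pages that turn write-hot back out of PRAM, so PRAM mostly absorbs reads. The thresholds, capacities, and the LRU stand-in are invented; DCCS's double circular structure is not reproduced here.

```python
from collections import OrderedDict

dram, pram = OrderedDict(), OrderedDict()   # page -> write count
DRAM_CAP, WRITE_HOT = 2, 2

def access(page, is_write):
    if is_write:
        if page in pram:
            pram[page] += 1
            if pram[page] >= WRITE_HOT:     # write-hot: rescue from PRAM
                dram[page] = pram.pop(page)
        else:
            dram[page] = dram.get(page, 0) + 1
        while len(dram) > DRAM_CAP:         # demote coldest DRAM page
            victim, _ = dram.popitem(last=False)
            pram[victim] = 0                # it lands clean in PRAM
    else:
        (dram if page in dram else pram).setdefault(page, 0)

for p, w in [("a", True), ("b", True), ("c", True),
             ("a", True), ("a", True)]:
    access(p, w)
print("DRAM:", list(dram), "PRAM:", list(pram))  # write-hot 'a' is back in DRAM
```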
46. Dynamic Antenna Management for Uplink Energy Efficiency on 802.11n Mobile Devices.
- Author
-
Cheng, Sheng-Wei, Ku, Ling-Chia, and Hsiu, Pi-Cheng
- Subjects
ENERGY consumption ,CELL phones ,BANDWIDTHS ,COMPUTER interfaces ,MOBILE apps ,MATHEMATICAL optimization - Abstract
An increasing number of mobile devices are being equipped with 802.11n interfaces to support bandwidth-intensive applications; however, the improved bandwidth increases power consumption. To address the issue, researchers are focusing on antenna management. In this paper, we present a dynamic antenna management (DAM) scheme to improve the uplink energy efficiency on mobile devices whose packet workloads may vary significantly and frequently. First, we model antenna management as an optimization problem, with the objective of minimizing the energy required to transmit a sequence of variable-length packets with random arrival times. Then, we propose an optimal offline algorithm to solve the problem, as well as a competitive online algorithm that has a provable performance guarantee and allows compatible implementations on 802.11n mobile devices. To evaluate our scheme, we conducted extensive simulations based on real mobile user traces and application transmission patterns. Nearly all commercial 802.11n mobile devices support the power save mode (PSM). Our results demonstrate that DAM can improve the energy efficiency of PSM significantly at a cost of slight throughput degradation. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
47. Data Collection Maximization in Renewable Sensor Networks via Time-Slot Scheduling.
- Author
-
Ren, Xiaojiang, Liang, Weifa, and Xu, Wenzheng
- Subjects
ACQUISITION of data ,RENEWABLE energy sources ,WIRELESS sensor networks ,COMPUTER scheduling ,INTERNET traffic - Abstract
In this paper we study data collection in an energy-renewable sensor network for scenarios such as traffic monitoring on busy highways, where sensors are deployed along a predefined path (the highway) and a mobile sink travels along the path to collect data from one-hop sensors periodically. As sensors are powered by renewable energy sources, the time-varying characteristics of ambient energy sources pose great challenges to the design of efficient routing protocols for data collection in such networks. We first formulate a novel data collection maximization problem that adopts multi-rate data transmissions and performs transmission time-slot scheduling, and show that the problem is NP-hard. We then devise an offline algorithm with a provable approximation ratio for the problem by exploiting its combinatorial properties, assuming that the harvested energy at each node is given and link communications in the network are reliable. We extend the proposed algorithm, with minor modifications, to a general case of the problem in which the harvested energy at each sensor is not known in advance and link communications are unreliable. We then develop a fast, scalable online distributed algorithm for the problem in realistic sensor networks in which neither global knowledge of the network topology nor sensor profiles, such as sensor locations and harvested energy profiles, are given. Furthermore, we consider a special case of the problem in which each node has only a fixed transmission power, for which we propose an exact solution. We finally conduct extensive simulation experiments to evaluate the performance of the proposed algorithms. Experimental results demonstrate that the proposed algorithms are efficient and the solutions obtained are close to the optimum. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
48. Collaborative Mobile Charging.
- Author
-
Zhang, Sheng, Wu, Jie, and Lu, Sanglu
- Subjects
WIRELESS sensor nodes ,STORAGE battery charging ,WIRELESS sensor networks ,ENERGY transfer ,ENERGY consumption ,MOBILE communication systems - Abstract
The limited battery capacity of sensor nodes has become one of the most critical impediments to the deployment of wireless sensor networks (WSNs). Recent breakthroughs in wireless energy transfer and rechargeable lithium batteries provide a promising alternative for powering WSNs: mobile vehicles/robots carrying high-volume batteries serve as mobile chargers that periodically deliver energy to sensor nodes. In this paper, we consider how to schedule multiple mobile chargers to optimize energy usage effectiveness, such that no sensor runs out of energy. We introduce a novel charging paradigm, collaborative mobile charging, in which mobile chargers are allowed to intentionally transfer energy between themselves. To provide some intuitive insights into the problem structure, we first consider a scenario that satisfies three conditions, and propose a scheduling algorithm, PushWait, which is proven to be optimal and can cover a one-dimensional WSN of infinite length. Then, we remove the conditions one by one, investigating chargers' scheduling in a series of scenarios ranging from the most restricted one to a general 2D WSN. Through theoretical analysis and simulations, we demonstrate the advantages of the proposed algorithms in energy usage effectiveness and charging coverage. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
49. Virtual Shuffling for Efficient Data Movement in MapReduce.
- Author
-
Yu, Weikuan, Wang, Yandong, Que, Xinyu, and Xu, Cong
- Subjects
PARALLEL processing ,ENERGY conservation ,DATA analysis ,INFORMATION sharing ,ENERGY consumption - Abstract
MapReduce is a popular parallel processing framework for large-scale data analytics. To keep up with the increasing volume of datasets, it requires efficient I/O capability from the underlying computer systems to process and analyze data in two phases (mapping and reducing). Between these phases, MapReduce requires a shuffling phase to globally exchange the intermediate data generated by the mapping phase. We reveal that data shuffling, by physically moving segments of intermediate data across disks, causes significant I/O contention and compounds the I/O problem. In this paper, we propose a novel virtual shuffling strategy to enable efficient data movement and reduce I/O for MapReduce shuffling, thereby reducing power consumption and conserving energy. Virtual shuffling is realized through a combination of three techniques: a three-level segment table, near-demand merging, and dynamic and balanced merging subtrees. Our experimental results show that virtual shuffling significantly speeds up data movement in MapReduce and achieves faster job execution. In particular, its reduction in disk I/O accesses results in as much as 12 percent savings in power consumption for MapReduce programs. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
50. Adaptive Voltage Scaling with In-Situ Detectors in Commercial FPGAs.
- Author
-
Nunez-Yanez, Jose Luis
- Subjects
ADAPTIVE computing systems ,ELECTRIC potential ,FIELD programmable gate arrays ,DYNAMICAL systems ,ENERGY consumption - Abstract
This paper investigates the limits of adaptive voltage scaling (AVS) applied to commercial FPGAs that do not specifically support voltage adaptation. An adaptive power architecture based on a modified design flow is created with in-situ detectors and dynamic reconfiguration of clock management resources. AVS is a power-saving technique that enables a device to regulate its own voltage and frequency based on workload, process, and operating conditions in a closed-loop configuration. It results in significantly improved energy profiles compared with dynamic voltage and frequency scaling (DVFS), in which the device uses a number of pre-calculated valid working points. Deploying AVS in FPGAs with in-situ detectors yields power and energy savings exceeding 85 percent compared with nominal-voltage operation at the same frequency. The in-situ detector approach compares favorably with critical-path replication based on delay lines, since it avoids the need for cumbersome and error-prone delay-line calibration. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
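The closed-loop control described above reduces, at its core, to a simple feedback rule: keep lowering the supply voltage while the in-situ detectors report timing margin, and back off when a warning would fire. The voltages, step size, and detector stand-in below are illustrative assumptions, not the paper's measured hardware behavior.

```python
# Minimal sketch of an AVS control loop driven by in-situ detectors.
V_NOM, V_MIN, STEP = 1.00, 0.60, 0.01

def insitu_warning(v, workload):
    # Stand-in for real in-situ timing detectors: heavier workloads
    # (longer active paths) need more voltage margin.
    return v < 0.70 + 0.05 * workload

def avs_loop(workload):
    v = V_NOM
    while v - STEP >= V_MIN:
        if insitu_warning(v - STEP, workload):
            break                        # next step would erase the margin
        v -= STEP                        # safe: keep scaling down
    return v

for load in (0.2, 1.0):
    v = avs_loop(load)
    saving = 1 - (v / V_NOM) ** 2        # dynamic power ~ V^2 at fixed f
    print(f"load={load}: V={v:.2f} V, dynamic power saving ~ {saving:.0%}")
```

Because the detectors sense the device's actual silicon and operating conditions, the loop settles at a different voltage per chip and per workload, which is what distinguishes AVS from table-driven DVFS.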