26 results for "Manuel E. Acacio"
Search Results
2. SAWS: Simple and Adaptive Warp Scheduling for Improved Performance in Throughput Processors
- Author
-
Manuel E. Acacio and Francisco Muñoz Martínez
- Subjects
applied physics, Improved performance, Computer engineering, Computer science, electrical engineering, electronic engineering, information engineering, computer hardware & architecture, Scheduling (computing)
- Abstract
In this work, we address the challenge of designing an efficient warp scheduler for throughput processors by proposing SAWS (Simple and Adaptive Warp Scheduler). Unlike previous approaches, which target a particular type of application, SAWS considers several simple scheduling algorithms and tries to use the one that best fits each application, or each phase within an application. Through detailed simulations we demonstrate that a practical implementation of SAWS obtains IPC values that closely match those of the best scheduling algorithm in each case.
- Published
- 2018
- Full Text
- View/download PDF
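The adaptive idea in the SAWS abstract can be illustrated with a small sketch. All names and the candidate-policy set below are assumptions for illustration, not details from the paper: sample each simple scheduling policy for a short window and adopt the one with the best observed IPC for the current phase.

```python
# Hypothetical sketch of SAWS's selection idea: periodically sample each
# simple scheduling policy for a short window and keep the one with the
# highest observed IPC. Policy names ("RR", "GTO", ...) are placeholders.

def pick_best_policy(policies, measure_ipc, window=1000):
    """Run each candidate policy for `window` cycles and return the winner.

    policies    -- list of policy identifiers (e.g. "RR", "GTO")
    measure_ipc -- callback: (policy, window) -> observed IPC
    """
    best_policy, best_ipc = None, -1.0
    for policy in policies:
        ipc = measure_ipc(policy, window)
        if ipc > best_ipc:
            best_policy, best_ipc = policy, ipc
    return best_policy
```

In a real scheduler this sampling would repeat at phase boundaries, so a policy that stops fitting the workload is eventually replaced.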
3. Using Heterogeneous Networks to Improve Energy Efficiency in Direct Coherence Protocols for Many-Core CMPs
- Author
-
Alberto Ros, Manuel E. Acacio, and Ricardo Fernández-Pascual
- Subjects
Smart Cache, Hardware_MEMORYSTRUCTURES, Computer science, Cache invalidation, Bus sniffing, MESI protocol, Distributed computing, Page cache, Cache, Cache pollution, Cache algorithms
- Abstract
Direct coherence protocols have recently been proposed as an alternative to directory-based protocols for keeping cache coherence in many-core CMPs. Unlike directory-based protocols, in direct coherence the cache responsible for providing the requested data on a miss (i.e., the owner cache) is also tasked with keeping the up-to-date directory information and serializing the accesses of all cores to the block. In this way, these protocols send requests directly to the owner cache, thus avoiding the indirection caused by accessing a separate directory (usually in the home node). A hints mechanism ensures a high hit rate when predicting the current owner of a block, but at the price of significantly increasing network traffic and, consequently, energy consumption. In this work, we show that using a heterogeneous interconnection network composed of two kinds of links is enough to drastically reduce the energy consumed by hint messages, yielding significant improvements in energy efficiency.
- Published
- 2012
- Full Text
- View/download PDF
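The indirection saving that motivates direct coherence can be sketched with a toy hop-count model. The hop counts below are illustrative assumptions, not measurements from the paper: a correctly predicted owner is reached directly, while a misprediction forces a detour through the home node.

```python
# Toy model of the indirection avoided by direct coherence (hop counts are
# illustrative assumptions): with a correct owner prediction the request
# travels requester -> owner -> requester; otherwise it detours via the
# home node before reaching the owner.

def miss_latency_hops(predicted_owner, actual_owner):
    """Network hops for a cache miss under an owner-prediction (hints) scheme."""
    if predicted_owner == actual_owner:
        return 2   # requester -> owner, owner -> requester
    return 3       # mispredicted: requester -> home -> owner -> requester
```

The hints mechanism in the abstract raises the fraction of two-hop misses, which is why its extra prediction traffic is worth carrying on cheaper links.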
4. An Experience of Early Initiation to Parallelism in the Computing Engineering Degree at the University of Murcia, Spain
- Author
-
Javier Cuenca, M. Carmen Garrido, María-Eugenia Requena, José Guillén, Lorenzo Fernández, Juan A. Sánchez Laguna, Manuel E. Acacio, Juan Alejandro Palomino Benito, Domingo Giménez, Ricardo Fernández-Pascual, and Joaquín Cervera
- Subjects
Data parallelism, Computer science, Programming language, Parallel algorithm, Mathematics education, Parallelism (grammar), Task parallelism, Algorithm design, Instruction-level parallelism, Degree (music), Early initiation
- Abstract
This paper presents an ongoing experience in the early introduction of parallelism in the Computing Engineering Degree. Four second-year courses and a computing centre participate in the experience, and the courses are taught by three departments. Students are introduced to parallelism for the first time in the second year, and with our experience we aim to approach different topics of parallelism in a coordinated and practical way.
- Published
- 2012
- Full Text
- View/download PDF
5. Heterogeneous NoC Design for Efficient Broadcast-based Coherence Protocol Support
- Author
-
Mario Lodde, Manuel E. Acacio, and Jose Flich
- Subjects
Hardware_MEMORYSTRUCTURES, Computer science, Acknowledgement, Multiprocessing, Chip, MESIF protocol, Network on a chip, Embedded system, Cache, Latency (engineering), Cache coherence, Computer network
- Abstract
Chip Multiprocessors (CMPs) rely on a cache coherence protocol to keep cached data and main memory coherent. The Hammer coherence protocol is appealing because it eliminates most of the space overhead of a directory protocol. However, it generates much more traffic, stressing the NoC and worsening power consumption. Using a NoC with built-in broadcast support lowers network utilization, but it does not completely solve the problem, since acknowledgment messages are still sent from every core to the requestor of the memory access. In this paper we propose a simple control network that collects the acknowledgment messages and delivers them with a bounded, fixed latency, thus relieving the NoC of a large number of messages. Experimental results on a 16-tile system with the control network demonstrate that execution time improves by up to 17%, with an average improvement of about 7.5%. The control network has negligible area impact compared to the switches.
- Published
- 2012
- Full Text
- View/download PDF
6. π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory
- Author
-
Anurag Negi, José M. García, Ruben Titos-Gil, Manuel E. Acacio, and Per Stenström
- Subjects
Concurrency control, Transaction processing, Computer science, Workaround, Concurrency, Distributed computing, Conflict resolution, Scalability, Transactional memory, Commit
- Abstract
Lazy hardware transactional memory has been shown to be more efficient at extracting the available concurrency than its eager counterpart. However, it poses scalability challenges at commit time, since the existence of conflicts among concurrent transactions is not known prior to commit. Non-conflicting transactions may have to wait before committing, severely affecting performance in certain workloads. Early conflict detection can be employed to allow such transactions to commit simultaneously. In this paper we show that the potential of this technique has not yet been fully exploited, with design choices in prior work severely burdening common-case transactional execution in order to avoid some relatively uncommon correctness concerns. The paper quantifies the severity of the problem and develops π-TM, a design with early conflict detection and lazy conflict resolution. The design highlights how, with modest extensions to existing directory-based coherence protocols, information about possible conflicts can be used effectively to achieve true parallelism at commit without burdening the common case. We leverage the observation that contention is typically seen on only a small fraction of the shared data accessed by coarse-grained transactions. Pessimistic invalidation of such lines when committing or aborting therefore enables fast common-case execution. Our results show that π-TM performs consistently well and, in particular, far better than previous work on early conflict detection in lazy HTM. We also identify a pathological scenario that lazy designs with early conflict detection suffer from, and propose a simple hardware workaround to sidestep it.
- Published
- 2012
- Full Text
- View/download PDF
7. Dynamic Serialization: Improving Energy Consumption in Eager-Eager Hardware Transactional Memory Systems
- Author
-
Juan C. Fernández, Ruben Titos-Gil, Manuel E. Acacio, and Epifanio Gaona
- Subjects
Memory management, Record locking, Transaction processing, Computer science, Serialization, Synchronization (computer science), Operating system, Transactional memory, Energy consumption, Database transaction
- Abstract
In the search for new paradigms that simplify multithreaded programming, Transactional Memory (TM) is currently being advocated as a promising alternative to deadlock-prone lock-based synchronization. Thus, future many-core CMP architectures may need to provide hardware support for TM. At the same time, power dissipation constitutes a first-class consideration in multicore processor design. In this work, we propose Dynamic Serialization (DS), a new technique to improve energy consumption without degrading performance in applications with conflicting transactions. Our proposal, implemented on top of a hardware transactional memory system with an eager conflict-management policy, detects and serializes conflicting transactions dynamically: in case of conflict, one transaction is allowed to continue whilst the rest are stalled. Once the executing transaction has finished, it wakes up several of the stalled transactions. This brings important benefits in terms of energy consumption, owing to the reduction in wasted work that DS implies. Results for a 16-core CMP show that Dynamic Serialization obtains reductions of 10% on average in energy consumption (more than 20% in high-contention scenarios) without affecting, on average, execution time.
- Published
- 2012
- Full Text
- View/download PDF
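The serialization idea above can be mimicked in software as a rough analogy. The real mechanism lives in the HTM hardware; the class below is only a sketch of the policy: on conflict, one transaction keeps running while the others block instead of repeatedly aborting and wasting work, and the winner wakes the waiters when it commits.

```python
import threading

# Software analogy of Dynamic Serialization (illustrative only; the paper's
# mechanism is implemented in hardware): conflicting transactions stall
# instead of aborting, and are woken when the running transaction commits.

class ConflictSerializer:
    def __init__(self):
        self.cond = threading.Condition()
        self.owner = None          # id of the transaction allowed to run

    def acquire(self, txn_id):
        with self.cond:
            while self.owner is not None and self.owner != txn_id:
                self.cond.wait()   # stall: no wasted re-execution
            self.owner = txn_id

    def commit(self):
        with self.cond:
            self.owner = None
            self.cond.notify_all() # wake the stalled transactions
```

Stalling rather than aborting is exactly where the energy saving in the abstract comes from: stalled work costs little, aborted work is thrown away.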
8. Pi-TM: Pessimistic Invalidation for Scalable Lazy Hardware Transactional Memory
- Author
-
Anurag Negi, Per Stenström, José M. García, Manuel E. Acacio, and Ruben Titos-Gil
- Subjects
Memory management, Computer science, Transaction processing, Concurrency, Distributed computing, Scalability, Operating system, Transactional memory, Commit, Cache, Cache coherence
- Abstract
Lazy hardware transactional memory (HTM) allows better utilization of the available concurrency in transactional workloads than eager HTM, but poses challenges at commit time due to the requirement of en-masse publication of speculative updates to the global system state. Early conflict detection can be employed in lazy HTM designs to allow non-conflicting transactions to commit in parallel. Though this has the potential to improve performance, it has not been utilized effectively so far: prior work in the area burdens common-case transactional execution severely to avoid some relatively uncommon correctness concerns. In this work we investigate this problem and introduce a novel design, π-TM, which eliminates it. π-TM uses modest extensions to existing directory-based cache coherence protocols to keep a record of conflicting cache lines as a transaction executes. This information allows a consistent cache state to be maintained when transactions commit or abort. We observe that contention is typically seen on only a small fraction of the shared data accessed by coarse-grained transactions; in π-TM, the early conflict detection mechanisms imply additional work only when such contention actually exists. Thus, the design is able to avoid expensive core-to-core and core-to-directory communication for a large part of transactionally accessed data. Our evaluation shows major performance gains when compared to other HTM designs in this class, and competitive performance when compared to more complex lazy commit schemes.
- Published
- 2011
- Full Text
- View/download PDF
9. The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems
- Author
-
José M. García, Manuel E. Acacio, Per Stenström, Ruben Titos-Gil, and Anurag Negi
- Subjects
Computer science, Shared memory multiprocessor, Concurrency, Distributed computing, Scalability, Conflict resolution, Transactional memory, Non coherent, Software versioning
- Abstract
When supported in silicon, transactional memory (TM) promises to become a fast, simple and scalable parallel programming paradigm for future shared-memory multiprocessor systems. Among the multitude of hardware TM design points and policies that have been studied so far, lazy conflict-resolution designs often extract the most concurrency, but their inherent need for lazy versioning requires careful management of speculative updates. In this paper we study how coherent buffering, for example in private caches, as proposed in several hardware TM designs, can lead to inefficiencies. We then show how such inefficiencies can be substantially mitigated by complete or partial non-coherent buffering of speculative writes in dedicated structures or in suitably adapted standard per-core write buffers. These benefits are particularly noticeable in scenarios involving large coarse-grained transactions that write a lot of non-contended data in addition to actively shared data. We believe our analysis provides important insights into some overlooked aspects of TM behaviour and will prove useful to designers wishing to implement lazy TM schemes in hardware.
- Published
- 2011
- Full Text
- View/download PDF
10. GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs
- Author
-
Manuel E. Acacio, José L. Abellán, and Juan C. Fernández
- Subjects
applied physics, Memory hierarchy, Busy waiting, Computer science, Serialization, Distributed computing, Message passing, networking & telecommunications, Context (language use), Parallel computing, Data structure, Execution time, Lock (computer science), Data dependency, Multithreading, Synchronization (computer science), Scalability, electrical engineering, electronic engineering, information engineering
- Abstract
Synchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures, thus avoiding data-dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity that interferes with applications. On the other hand, lock contention causes serialization, which results in performance degradation. This paper proposes and evaluates GLocks, a hardware-supported implementation of highly-contended locks in the context of many-core CMPs. GLocks use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most efficient shared-memory-based lock implementation, for a set of microbenchmarks and real applications, quantifies the goodness of GLocks. Performance results show average reductions of 42% and 14% in execution time, average reductions of 76% and 23% in network traffic, and average reductions of 78% and 28% in the energy-delay² product (ED²P) metric for the full CMP, for the microbenchmarks and the real applications, respectively. In light of our performance results, we conclude that GLocks satisfy our initial working hypothesis: GLocks minimize the cache-coherence network traffic due to lock synchronization, which translates into reduced power consumption and execution time.
- Published
- 2011
- Full Text
- View/download PDF
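The token-based protocol behind GLocks can be approximated in software as a fairness model. The function below is a hypothetical sketch, not the paper's hardware protocol: a token circulates round-robin among cores over a (here simulated) dedicated network, and a core enters its critical section only while holding the token, which yields fair, starvation-free acquisition order.

```python
# Illustrative model of a token-based lock (GLocks' dedicated network and
# protocol are hardware; this only simulates the grant order): the token
# visits cores round-robin, and a requesting core acquires the lock when
# the token reaches it.

def token_ring_schedule(n_cores, requests):
    """Return the order in which requesting cores acquire the lock.

    requests -- set of core ids currently contending for the lock
    """
    order = []
    pending = set(requests)
    core = 0
    while pending:
        if core in pending:
            order.append(core)        # core holds the token: lock granted
            pending.remove(core)
        core = (core + 1) % n_cores   # pass the token to the next core
    return order
```

Because the token never skips a waiting core, no requester can be starved, which is the fairness property the abstract claims for the hardware design.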
11. Characterizing Energy Consumption in Hardware Transactional Memory Systems
- Author
-
Juan A. Fernández, Epifanio Gaona-Ramirez, Manuel E. Acacio, and Ruben Titos-Gil
- Subjects
Multi-core processor, Memory management, Record locking, Computer architecture, Computer science, Embedded system, Scalability, Transactional memory, System on a chip, Energy consumption, Synchronization
- Abstract
Transactional Memory is currently being advocated as a promising alternative to lock-based synchronization because it simplifies multithreaded programming. Thus, future many-core CMP architectures may need to provide hardware support for transactional memory. At the same time, power dissipation constitutes a first-class consideration in multicore processor design. In this work, we characterize the performance and energy consumption of two well-known hardware transactional memory systems that employ opposite policies for data versioning and conflict management. More specifically, we compare the LogTM-SE Eager-Eager system and a version of the Scalable TCC Lazy-Lazy system that enables parallel commits. To the best of our knowledge, this is the first characterization of hardware transactional memory systems in terms of energy consumption. To do so, we extended the GEMS simulator to estimate the energy consumed in the on-chip caches according to CACTI, and used the interconnection-network energy model provided by Orion 2. Results show that the energy consumption of the Eager-Eager system is 60% higher on average than in the Lazy-Lazy case, whereas performance differences between the two systems are 42% on average. Finally, we found that although on average Lazy-Lazy beats Eager-Eager, there are considerable deviations in performance depending on the particular characteristics of each application.
- Published
- 2010
- Full Text
- View/download PDF
12. A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs
- Author
-
Juan A. Fernández, José L. Abellán, and Manuel E. Acacio
- Subjects
Flow control (data), Interconnection, Network on a chip, Shared memory, Computer science, Distributed computing, Scalability, Context (language use), Parallel computing, Thread (computing), Synchronization
- Abstract
Barrier synchronization in shared memory parallel machines has been widely implemented through busy-waiting on shared variables. However, typical implementations of barrier synchronization tend to produce hot-spots in terms of memory and network contention, thus creating performance bottlenecks that become markedly more pronounced as the number of cores or processors increases. To overcome such limitations, we present a novel hardware-based barrier mechanism in the context of many-core CMPs. Our proposal is based on global interconnection lines (G-lines) and the S-CSMA technique, which have been recently used to enhance a flow control mechanism (EVC) in the context of networks-on-chip. Based on this technology, we have designed a simple and scalable G-line-based network that operates independently of the main data network, and that is aimed at carrying out barrier synchronizations efficiently. In the ideal case, our design takes only 4 cycles to perform a barrier synchronization once all cores or threads have arrived at the barrier. As a proof of concept, we examine the benefits of our proposal by comparing it with one of the best software approaches (a binary combining-tree barrier). To do so, we run several kernels and scientific applications on top of the Sim-PowerCMP performance simulator that models a 32-core CMP with a 2D-mesh network configuration. Our proposal entails average reductions in terms of execution time of 68% and 21% for kernels and scientific applications, respectively. Additionally, network traffic is also lowered by 74% and 18%, respectively.
- Published
- 2010
- Full Text
- View/download PDF
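For context on what the G-line network replaces, here is a minimal centralized sense-reversing barrier of the busy-waiting kind whose memory and network hot-spots the abstract describes. This is a generic textbook construction offered as background, not code from the paper (whose mechanism is hardware, and whose software baseline is a combining-tree barrier):

```python
import threading

# Minimal centralized sense-reversing barrier: every thread busy-waits on
# one shared flag, which is exactly the coherence hot-spot that motivates
# hardware barrier networks. Single-use sketch for illustration.

class SenseBarrier:
    def __init__(self, n):
        self.n = n
        self.count = n
        self.sense = False
        self.lock = threading.Lock()

    def wait(self):
        my_sense = not self.sense       # the value that signals release
        with self.lock:
            self.count -= 1
            last = self.count == 0
        if last:
            self.count = self.n         # reset before releasing everyone
            self.sense = my_sense       # flip: all waiters fall through
        else:
            while self.sense != my_sense:
                pass                    # busy-wait on the shared flag
```

Every waiter spinning on `self.sense` keeps that cache line bouncing between cores, which is why contention grows with core count and a dedicated signaling network pays off.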
13. Energy-Efficient Hardware Prefetching for CMPs Using Heterogeneous Interconnects
- Author
-
Antonio Flores, Juan L. Aragón, and Manuel E. Acacio
- Subjects
Interconnection, Hardware_MEMORYSTRUCTURES, Computer science, Parallel computing, Execution time, Energy conservation, Network on a chip, Power consumption, Embedded system, System on a chip, Latency (engineering), Computer hardware, Efficient energy use
- Abstract
In recent years, high-performance processor designs have evolved toward chip-multiprocessor (CMP) architectures that implement multiple processing cores on a single die. As the number of cores inside a CMP increases, the on-chip interconnection network has a significant impact on both overall performance and power consumption, as previous studies have shown. On the other hand, CMP designs are likely to be equipped with latency-hiding techniques such as hardware prefetching, in order to reduce the negative performance impact that high cache miss rates would otherwise cause. Unfortunately, the extra network messages that prefetching entails can drastically increase the power consumed in the interconnect. In this work, we show how to reduce the power (and energy) cost of prefetching in the context of tiled CMPs. Our proposal is based on the fact that the wires used in the on-chip interconnection network can be designed with varying latency, bandwidth and power characteristics. By using a heterogeneous interconnect, where low-power wires carry prefetched lines, significant energy savings can be obtained. Detailed simulations of a 16-core CMP show that our proposal reduces the power consumed by the interconnect by up to 30% (15-23% on average), with almost negligible cost in terms of execution time (average degradation of 2%).
- Published
- 2010
- Full Text
- View/download PDF
14. Distance-aware round-robin mapping for large NUCA caches
- Author
-
Marcelo Cintra, José M. García, Alberto Ros, and Manuel E. Acacio
- Subjects
Hardware_MEMORYSTRUCTURES, Cache coloring, Computer science, Parallel computing, Cache-oblivious algorithm, Cache pollution, Cache invalidation, Memory architecture, Page cache, Cache, Cache algorithms, Computer network
- Abstract
In many-core architectures, memory blocks are commonly assigned to the banks of a NUCA cache by following a physical mapping. This mapping assigns blocks to cache banks in a round-robin fashion, thus neglecting the distance between each block's NUCA bank and the cores that most frequently access it. This issue impacts both cache access latency and the amount of on-chip network traffic generated. On the other hand, first-touch mapping policies, which do take distance into account, can lead to an unbalanced utilization of cache banks and, consequently, to an increased number of expensive off-chip accesses. In this work, we propose the distance-aware round-robin mapping policy, an OS-managed policy that addresses the trade-off between cache access latency and number of off-chip accesses. Our policy tries to map the pages accessed by a core to its closest (local) bank, as a first-touch policy does. However, it also introduces an upper bound on the deviation of the distribution of memory pages among cache banks, which lessens the number of off-chip accesses. This trade-off is addressed without requiring any extra hardware structure. We also show that the private cache indexing commonly used in many-core architectures is not the most appropriate for OS-managed distance-aware mapping policies, and propose to employ different bits for such indexing. Using the GEMS simulator, we show that our proposal obtains average improvements in execution time of 11% for parallel applications and 14% for multi-programmed workloads, along with significant reductions in network traffic, over a traditional physical mapping. Moreover, compared to a first-touch mapping policy, our proposal improves average execution time by 5% for parallel applications and 6% for multi-programmed workloads, while slightly increasing on-chip network traffic.
- Published
- 2009
- Full Text
- View/download PDF
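The latency-versus-balance trade-off in the abstract can be sketched as a small page-placement function. The threshold handling below is an assumption for illustration; the paper's exact OS mechanism may differ: map each newly touched page to the requesting core's local bank unless that bank already exceeds the allowed deviation, in which case spill to the least-loaded bank.

```python
# Illustrative sketch of a distance-aware first-touch policy with a bounded
# load deviation (the bound check is an assumed formulation, not the
# paper's exact one).

def assign_bank(local_bank, bank_load, max_deviation):
    """Pick a cache bank for a newly touched page and update the loads.

    local_bank    -- bank closest to the core that first touched the page
    bank_load     -- mutable list: pages currently mapped to each bank
    max_deviation -- how far above the average load a bank may grow
    """
    average = sum(bank_load) / len(bank_load)
    if bank_load[local_bank] - average <= max_deviation:
        chosen = local_bank   # first-touch: keep the page close to its user
    else:
        # bound exceeded: rebalance toward the least-loaded bank
        chosen = min(range(len(bank_load)), key=bank_load.__getitem__)
    bank_load[chosen] += 1
    return chosen
```

With `max_deviation` set to infinity this degenerates to pure first-touch; with it set to zero it approaches balanced round-robin, which is the trade-off knob the policy exposes.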
15. Speculation-based conflict resolution in hardware transactional memory
- Author
-
Ruben Titos, Manuel E. Acacio, and José M. García
- Subjects
Scheme (programming language), Computer science, Event (computing), Concurrency, Multithreading, Distributed computing, Synchronization (computer science), Conflict resolution, Transactional memory, Speculation, Execution time
- Abstract
Conflict management is a key design dimension of hardware transactional memory (HTM) systems, and the implementation of efficient mechanisms for conflict detection and resolution becomes critical when conflicts are not a rare event. Current designs address this problem from two opposite perspectives, namely lazy and eager schemes. While the former approach is based on a purely optimistic view that is not well suited when conflicts become frequent, the latter is too pessimistic, resolving conflicts so conservatively that it often limits concurrency unnecessarily. In this paper, we present a hybrid, pseudo-optimistic scheme of conflict resolution for HTM systems that recaptures the concept of speculation to allow transactions to continue their execution past conflicting accesses. Simulation results show that our proposal is capable of combining the advantages of both classical approaches. For the STAMP transactional benchmarks, our hybrid scheme outperforms both eager and lazy systems, with average reductions in execution time of 8% and 17%, respectively, and it decreases network traffic by another 17% compared to the eager policy.
- Published
- 2009
- Full Text
- View/download PDF
16. A Parallel Implementation of the 2D Wavelet Transform Using CUDA
- Author
-
Gregorio Bernabé, Joaquín Franco, Juan A. Fernández, and Manuel E. Acacio
- Subjects
Computer graphics, CUDA, Multi-core processor, Speedup, Computer science, CUDA Pinned memory, Programming paradigm, Wavelet transform, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Software_PROGRAMMINGTECHNIQUES, Fast wavelet transform
- Abstract
There is a multicore platform currently attracting enormous attention due to its tremendous potential in terms of sustained performance: the NVIDIA Tesla boards. These cards, intended for general-purpose computing on graphics processing units (GPGPU), are used as data-parallel computing devices. They are based on the Compute Unified Device Architecture (CUDA), which is common to the latest NVIDIA GPUs. The bottom line is a multicore platform that provides an enormous potential performance benefit, driven by a non-traditional programming model. In this paper we try to provide some insight into the peculiarities of CUDA for scientific computing by means of a specific example. In particular, we show that the parallelization of the two-dimensional fast wavelet transform for the NVIDIA Tesla C870 achieves a speedup of 20.8 for an image size of 8192x8192, when compared with the fastest host-only implementation using OpenMP and including the data transfers between main memory and device memory.
- Published
- 2009
- Full Text
- View/download PDF
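To make the transform concrete, here is one level of the separable 2D Haar transform as a minimal host-side reference for the kind of kernel such a paper ports to the GPU. This is a standard textbook formulation, not the paper's own filter bank or CUDA kernels:

```python
import numpy as np

# One level of the separable 2D Haar wavelet transform: low/high-pass the
# rows, then low/high-pass the columns of each half, yielding the four
# subbands LL (approximation), LH, HL and HH (details).

def haar2d_level(img):
    """Split an even-sized 2D array into its LL, LH, HL, HH subbands."""
    a = img.astype(float)
    # rows: average (low-pass) and difference (high-pass) of adjacent pixels
    lo = (a[:, 0::2] + a[:, 1::2]) / 2.0
    hi = (a[:, 0::2] - a[:, 1::2]) / 2.0
    # columns: repeat the same split on each half
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return ll, lh, hl, hh
```

Each output pixel depends only on a small, fixed input neighbourhood, which is what makes the transform a natural fit for the data-parallel CUDA model discussed in the abstract.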
17. Address Compression and Heterogeneous Interconnects for Energy-Efficient High-Performance in Tiled CMPs
- Author
-
Juan L. Aragón, Antonio Flores, and Manuel E. Acacio
- Subjects
Interconnection, Computer science, Embedded system, Message passing, Hardware_INTEGRATEDCIRCUITS, System on a chip, Energy consumption, Efficient energy use
- Abstract
Previous studies have shown that the interconnection network of a chip multiprocessor (CMP) has a significant impact on both overall performance and energy consumption. Moreover, the wires used in such an interconnect can be designed with varying latency, bandwidth and power characteristics. In this work, we present a proposal for performance- and energy-efficient message management in tiled CMPs that combines address compression with a heterogeneous interconnect. Our proposal applies an address compression scheme that dynamically compresses the addresses within coherence messages, allowing for a significant area slack. The resulting area can be exploited to improve wire latency by using a heterogeneous interconnection network comprised of a small set of very-low-latency wires for critical short messages, in addition to baseline wires. Detailed simulations of a 16-core CMP show that our proposal obtains average improvements of 10% in execution time and 38% in the energy-delay² product (ED²P) of the interconnect.
- Published
- 2008
- Full Text
- View/download PDF
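One common way to compress addresses dynamically, sketched here purely as an illustration (the encoding, table size and bit split are assumptions, not the paper's exact scheme), is to keep a small table of recently seen high-order address bits: on a hit, only the table index and the low-order bits travel on the wire.

```python
# Hypothetical dictionary-based address compression sketch. The sender
# caches recently seen address prefixes; hits send a short (index, offset)
# pair, misses send the full address and update the table (MRU at front).

TABLE_SIZE = 8
LOW_BITS = 12   # low-order bits always sent verbatim (assumed split)

def compress(addr, table):
    """Return ("hit", index, low) or ("miss", prefix, low); updates table."""
    hi = addr >> LOW_BITS
    lo = addr & ((1 << LOW_BITS) - 1)
    if hi in table:
        return ("hit", table.index(hi), lo)   # short message
    table.insert(0, hi)                        # learn the new prefix
    del table[TABLE_SIZE:]                     # evict beyond capacity
    return ("miss", hi, lo)                    # full-length message
```

Because coherence traffic has strong address locality, most messages hit the table, and the bits saved per message are what create the "area slack" the abstract converts into faster wires.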
18. DiCo-CMP: Efficient cache coherency in tiled CMP architectures
- Author
-
José M. García, Manuel E. Acacio, and Alberto Ros
- Subjects
Multi-core processor, Hardware_MEMORYSTRUCTURES, Computer science, MESI protocol, ComputerSystemsOrganization_PROCESSORARCHITECTURES, MESIF protocol, Network on a chip, Write-once, Embedded system, Bus sniffing, Cache, Cache algorithms, Cache coherence, Computer network
- Abstract
Future CMP designs that integrate tens of processor cores on chip will be constrained by area and power. Area constraints make the use of a bus or a crossbar as the on-chip interconnection network impractical, and tiled CMPs organized around a direct interconnection network will probably be the architecture of choice. Power constraints make it impractical to rely on broadcasts (as Token-CMP does) or any other brute-force method for keeping cache coherence, so directory-based cache coherence protocols are currently being employed. Unfortunately, directory protocols introduce indirection to access the directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a novel cache coherence protocol especially suited to future tiled CMP architectures. In DiCo-CMP, the role of storing up-to-date sharing information and ensuring totally ordered accesses for every memory block is assigned to the cache that must provide the block on a miss. Therefore, DiCo-CMP reduces miss latency compared to a directory protocol by sending coherence messages directly from the requesting caches to those that must observe them (as brute-force protocols would do), and reduces network traffic compared to Token-CMP (and consequently power consumption in the interconnection network) by sending just one request message for each miss. Using an extended version of the GEMS simulator, we show that DiCo-CMP achieves improvements in execution time of up to 8% on average over a directory protocol, and reductions in network traffic of up to 42% on average compared to Token-CMP.
- Published
- 2008
- Full Text
- View/download PDF
19. CellStats: A Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE
- Author
-
Manuel E. Acacio, José L. Abellán, and Juan C. Fernández
- Subjects
Multi-core processor, Computer science, Distributed computing, Broadband, Atomic operations, Transfer mechanism, Thread (computing), Architecture, IBM, First generation
- Abstract
The Cell Broadband Engine (Cell BE) is a recent heterogeneous chip-multiprocessor (CMP) architecture jointly developed by IBM, Sony and Toshiba to offer very high performance, especially on game and multimedia applications. The significant number of processor cores it contains (nine in its first generation), their heterogeneity (they are of two different types), and the variety of synchronization and communication primitives offered to programmers make the task of developing efficient applications for the Cell BE very challenging. In this work, we present CellStats, a tool aimed at characterizing the performance of the main synchronization and communication primitives provided by the Cell BE under varying workloads. In particular, the current implementation of CellStats allows evaluating the DMA transfer mechanism, the read-modify-write atomic operations, the mailboxes, the signals, and the time taken by thread creation. As an example application of CellStats, we present a characterization of the Cell BE incorporated into the PlayStation 3. From this characterization, we extract some recommendations that can help programmers identify the most appropriate primitive under different assumptions.
- Published
- 2008
- Full Text
- View/download PDF
20. Characterization of Conflicts in Log-Based Transactional Memory (LogTM)
- Author
-
José M. García, Manuel E. Acacio, and J.R. Titos
- Subjects
Record locking ,Transactional leadership ,Transaction processing ,Semantics (computer science) ,Computer science ,Distributed computing ,Memory architecture ,Operating system ,Programming paradigm ,Software transactional memory ,Transactional memory ,computer.software_genre ,computer - Abstract
The difficulty of multithreaded programming remains a major obstacle for programmers to fully exploit multicore chips. Transactional memory has been proposed as an abstraction capable of ameliorating the challenges of traditional lock-based parallel programming. Hardware transactional memory (HTM) systems implement the necessary mechanisms to provide transactional semantics efficiently. In order to keep hardware simple, current HTM designs apply fixed policies that aim at optimizing the most expected application behaviour, and many of these proposals explicitly assume that commits will be clearly more frequent than aborts in future transactional workloads. This paper shows that some applications developed under the TM programming model are by nature prone to experience many conflicts. As a result, aborted transactions can become common and may seriously hurt performance. Our characterization, performed with truly transactional benchmarks on the LogTM system, shows that certain programs composed of large transactions do indeed suffer very high abort rates. Thus, if TM is to unburden developers from the programmability-performance trade-off, HTM systems must obtain good performance levels in the presence of frequent aborts, requiring more flexible data-versioning policies as well as more sophisticated recovery schemes.
- Published
- 2008
- Full Text
- View/download PDF
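LogTM's defining choice, relevant to the abort-rate findings above, is eager version management: transactions write memory in place and save old values in a per-thread undo log, so commits are cheap but every abort must roll the log back. A minimal software sketch of that policy follows; names are hypothetical, and real LogTM performs conflict detection in the cache-coherence hardware rather than with explicit write sets.

```python
class AbortError(Exception):
    pass

class EagerTM:
    """Toy eager-versioning TM: in-place writes plus an undo log."""

    def __init__(self):
        self.mem = {}
        self.write_sets = {}   # tx_id -> set of addresses written
        self.undo_logs = {}    # tx_id -> list of (addr, old_value)

    def begin(self, tx):
        self.write_sets[tx] = set()
        self.undo_logs[tx] = []

    def write(self, tx, addr, value):
        # Eager conflict detection: another in-flight transaction that
        # already wrote `addr` conflicts with this one.
        for other, ws in self.write_sets.items():
            if other != tx and addr in ws:
                raise AbortError(f"{tx} conflicts with {other} on {addr}")
        self.undo_logs[tx].append((addr, self.mem.get(addr)))
        self.mem[addr] = value          # write in place (eager versioning)
        self.write_sets[tx].add(addr)

    def commit(self, tx):
        # Commit is cheap: simply discard the undo log.
        del self.undo_logs[tx], self.write_sets[tx]

    def abort(self, tx):
        # Abort walks the undo log backwards, restoring old values;
        # this is the cost that frequent conflicts make dominant.
        for addr, old in reversed(self.undo_logs[tx]):
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        del self.undo_logs[tx], self.write_sets[tx]
```

The asymmetry is visible in the code: `commit` is a deletion, while `abort` is a loop over everything the transaction wrote, which is why workloads with large, conflict-prone transactions stress this design.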
21. A fault-tolerant directory-based cache coherence protocol for CMP architectures
- Author
-
José M. García, Ricardo Fernández-Pascual, Manuel E. Acacio, and José Duato
- Subjects
Interconnection ,Hardware_MEMORYSTRUCTURES ,business.industry ,Computer science ,Reliability (computer networking) ,Overhead (computing) ,Transient (computer programming) ,Fault tolerance ,business ,MESIF protocol ,Protocol (object-oriented programming) ,Cache coherence ,Computer network - Abstract
Current technology trends of increased scale of integration are pushing CMOS technology into the deep-submicron domain, enabling the creation of chips with a significantly greater number of transistors but also more prone to transient failures. Hence, computer architects will have to consider reliability as a prime concern for future chip-multiprocessor designs (CMPs). Since the interconnection network of future CMPs will use a significant portion of the chip real estate, it will be especially affected by transient failures. We propose to deal with this kind of failure at the level of the cache coherence protocol instead of ensuring the reliability of the network itself. Particularly, we have extended a directory-based cache coherence protocol to ensure correct program semantics even in the presence of transient failures in the interconnection network. Additionally, we show that our proposal has virtually no impact on execution time with respect to a non-fault-tolerant protocol, and entails only modest hardware and network-traffic overhead.
- Published
- 2008
- Full Text
- View/download PDF
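The core idea above, recovering from dropped coherence messages at the protocol level rather than making the network itself reliable, can be illustrated with a generic acknowledgment-and-timeout sketch. This is not the paper's protocol; the names and the loss model are hypothetical stand-ins.

```python
import random

def deliver(payload, drop_prob, rng):
    """A lossy channel: returns True if the message arrives."""
    return rng.random() >= drop_prob

def send_reliably(payload, drop_prob=0.3, max_tries=50, seed=42):
    """The sender keeps a copy of every unacknowledged message and
    retransmits on timeout; the receiver acks each delivery. A dropped
    request and a dropped ack both simply trigger a retransmission, so
    the receiver must handle duplicate messages idempotently."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        if deliver(payload, drop_prob, rng):              # request may be lost
            if deliver(("ack", payload), drop_prob, rng):  # ack may be lost
                return attempt
    raise RuntimeError("message never acknowledged")
```

The protocol-level cost pattern matches the abstract's claim: in the common no-loss case the only overhead is the ack traffic, while retransmission latency is paid only when a failure actually occurs.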
22. A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures
- Author
-
José Duato, Ricardo Fernández-Pascual, Manuel E. Acacio, and José M. García
- Subjects
Multi-core processor ,business.industry ,Computer science ,Embedded system ,Overhead (computing) ,Transient (computer programming) ,Fault tolerance ,Deadlock ,business ,Protocol (object-oriented programming) ,MESIF protocol ,Cache coherence - Abstract
It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMPs) that integrate several processor cores in a single chip are nowadays the best alternative for making more efficient use of the increasing number of transistors that can be placed on a single die. Hence, it is necessary to design new techniques to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in the absence of failures our proposal does not introduce overhead in terms of increased execution time over TOKENCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15%.
- Published
- 2007
- Full Text
- View/download PDF
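Token coherence, the base protocol extended above, keeps a fixed number T of tokens per memory block: a core needs all T tokens to write and at least one to read, so token conservation is precisely the safety invariant a fault-tolerant extension must restore when the network drops a token-carrying message. The counting sketch below uses hypothetical names and an illustrative recovery step; the paper's actual mechanism involves timeouts and a serialized token-recreation process.

```python
TOKENS_PER_BLOCK = 4  # illustrative value of T

class TokenBlock:
    def __init__(self, num_cores):
        # All tokens start at core 0, the initial owner.
        self.tokens = [TOKENS_PER_BLOCK] + [0] * (num_cores - 1)
        self.in_flight = 0  # tokens currently travelling in messages

    def send(self, src, count):
        assert self.tokens[src] >= count
        self.tokens[src] -= count
        self.in_flight += count

    def receive(self, dst, count):
        self.in_flight -= count
        self.tokens[dst] += count

    def drop(self, count):
        """A faulty network loses a message carrying `count` tokens."""
        self.in_flight -= count

    def recreate_lost(self):
        """Recovery: after a timeout, invalidate any surviving in-flight
        tokens and recreate the missing ones at core 0, so the global
        count returns to exactly TOKENS_PER_BLOCK."""
        missing = TOKENS_PER_BLOCK - sum(self.tokens) - self.in_flight
        self.tokens[0] += missing + self.in_flight
        self.in_flight = 0

    def can_write(self, core):
        return self.tokens[core] == TOKENS_PER_BLOCK

    def can_read(self, core):
        return self.tokens[core] >= 1
```

Without a recovery step, a single dropped message would leave the block permanently unwritable, since no core could ever again assemble all T tokens; that is the deadlock the abstract refers to.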
23. Sim-PowerCMP: A Detailed Simulator for Energy Consumption Analysis in Future Embedded CMP Architectures
- Author
-
Juan L. Aragón, Antonio Flores, and Manuel E. Acacio
- Subjects
Battery (electricity) ,Interconnection ,Multi-core processor ,business.industry ,Computer science ,Transistor ,Energy consumption ,Chip ,law.invention ,Microprocessor ,law ,Embedded system ,Scalability ,business ,Energy (signal processing) - Abstract
Continuous improvements in integration scale have led major microprocessor vendors to move to designs that integrate several processor cores on the same chip. Chip-multiprocessors (CMPs) constitute the architecture of choice in the high-performance embedded domain for several reasons, such as better levels of scalability and performance/energy ratio. On the other hand, higher clock frequencies and increasing transistor density have revealed power dissipation as a critical design issue, especially in embedded systems where reduced energy consumption directly translates into extended battery life. In this work we present Sim-PowerCMP, a detailed architecture-level power-performance simulation tool for CMP architectures that integrates several well-known contemporary simulators (RSIM, HotLeakage and Orion) into a single framework. As a use case of Sim-PowerCMP, we present a characterization of the energy efficiency of a CMP for parallel scientific applications, paying special attention to the energy consumed in the interconnect. Results for an 8- and 16-core CMP show that the contribution of the interconnection network to the total power is close to 20%, on average, and that the most energy-consuming messages are replies that carry data (almost 70% of total energy consumed in the interconnect).
- Published
- 2007
- Full Text
- View/download PDF
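The interconnect breakdown reported above reduces to per-message energy accounting: a message's energy grows with its size, and data replies carry a full cache line while control messages carry only a header. The toy model below shows just that bookkeeping; the constants are made up, whereas a tool like Sim-PowerCMP obtains them from detailed router models such as Orion.

```python
# Hypothetical per-flit energy (picojoules); real values depend on the
# technology node and router microarchitecture.
E_PER_FLIT = 5.0
CONTROL_FLITS = 1   # request/ack: header only
DATA_FLITS = 9      # reply: header plus a 64-byte cache line

def interconnect_energy(messages):
    """messages: iterable of (kind, hop_count) pairs, where kind is
    'control' or 'data'. Returns (total_energy, fraction_from_data)."""
    total = data = 0.0
    for kind, hops in messages:
        flits = DATA_FLITS if kind == "data" else CONTROL_FLITS
        e = flits * hops * E_PER_FLIT
        total += e
        if kind == "data":
            data += e
    return total, (data / total if total else 0.0)
```

Even this crude model reproduces the qualitative finding: with one control request and one data reply per miss, the reply dominates the interconnect energy simply because it carries many more flits.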
24. On the Evaluation of Dense Chip-Multiprocessor Architectures
- Author
-
Francisco J. Villa, José M. García, and Manuel E. Acacio
- Subjects
Interconnection ,Multi-core processor ,Hardware_MEMORYSTRUCTURES ,CPU cache ,business.industry ,Computer science ,Multiprocessing ,Directory ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Directory structure ,Computer architecture ,Embedded system ,Memory architecture ,Concurrent computing ,business - Abstract
Chip-multiprocessors (CMPs) have been revealed as the most promising way of making efficient use of current improvements in integration scale. Nowadays, commercial CMP releases integrate at most 8 processor cores onto the chip. However, 16 or more processor cores are expected to be offered in near-future Dense-CMP (D-CMP) systems. These architectures impose new design restrictions, and some topics, such as the cache-coherence problem, must be reviewed. In this paper we present an exhaustive performance evaluation of two recently proposed D-CMP architectures, placing special emphasis on the solution to the cache-coherence problem that each of them introduces. The Shared Bus Fabric (SBF) architecture features a snoop cache-coherence protocol and is based on a high-performance bus fabric interconnection network. The second architecture follows a directory-based approach and integrates a bi-dimensional mesh as the interconnection network. Our results show that the performance achieved by the SBF architecture is hard-limited by the bandwidth restrictions of the bus fabric. On the other hand, the directory-based architecture outperforms the SBF one, but presents some performance inefficiencies due to the additional indirection that the directory structure stored in the L2 cache level introduces.
- Published
- 2006
- Full Text
- View/download PDF
25. Optimizing a 3D-FWT Video Encoder for SMPs and HyperThreading Architectures
- Author
-
Gregorio Bernabé, José M. García, Ricardo Fernandez, and Manuel E. Acacio
- Subjects
POSIX Threads ,Computer science ,business.industry ,Maintainability ,Wavelet transform ,Parallel computing ,Software_PROGRAMMINGTECHNIQUES ,Functional decomposition ,Multithreading ,Embedded system ,Code (cryptography) ,business ,Encoder ,Implementation - Abstract
In this work we evaluate the implementation of a video encoder based on the 3D wavelet transform, optimized for HyperThreading technology and SMPs. We design several implementations of the parallel encoder with Pthreads and OpenMP using functional decomposition. Then, we compare them in terms of execution speed, ease of implementation, and maintainability of the resulting code. Our experiments show that while Pthreads provides the best results in terms of execution time, OpenMP can come close to optimal execution time without sacrificing the maintainability of the code.
- Published
- 2005
- Full Text
- View/download PDF
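Functional decomposition, the parallelization strategy named above, assigns each pipeline stage to its own thread rather than splitting the data across identical workers. The sketch below shows the pattern with Python threads and queues; the actual encoder decomposes the 3D-FWT stages in C with Pthreads/OpenMP, so the stage functions here are stand-ins.

```python
import threading
import queue

SENTINEL = None  # end-of-stream marker propagated down the pipeline

def stage(fn, inbox, outbox):
    """Each pipeline stage runs in its own thread, pulling items from
    its input queue and pushing transformed items downstream."""
    while (item := inbox.get()) is not SENTINEL:
        outbox.put(fn(item))
    outbox.put(SENTINEL)  # forward shutdown to the next stage

def run_pipeline(frames, stages):
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for f in frames:
        queues[0].put(f)
    queues[0].put(SENTINEL)
    out = []
    while (item := queues[-1].get()) is not SENTINEL:
        out.append(item)
    for t in threads:
        t.join()
    return out

# Stand-ins for transform and quantization stages of the encoder.
result = run_pipeline([1, 2, 3], [lambda x: x * 10, lambda x: x + 1])
```

Because stages overlap in time on different frames, throughput is limited by the slowest stage; balancing stage costs is the key tuning problem in this style of decomposition.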
26. Characterization of Conflicts in Log-Based Transactional Memory (LogTM)
- Author
-
Gil, J. Ruben Titos, primary, Sanchez, Manuel E. Acacio, additional, and Carrasco, Jose M. Garcia, additional
- Published
- 2008
- Full Text
- View/download PDF