Descriptor: "Multiprocessadors" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Multiprocessadors"' showing total 414 results


201. Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Unsal, Osman, Cristal Kestelman, Adrián, and Kestor, Gökçen
Abstract: Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of the parallel programming models that aim at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model, but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming. The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends. The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM designs, do not always m, Postprint (published version)
Published: 2013

202. Performance and power optimizations in chip multiprocessors for throughput-aware computation

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Ramírez Bellido, Alejandro, Valero Cortés, Mateo, and Vega, Augusto J.
Abstract: The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores in the same chip, for a total capacity of 32 execution contexts. The ever increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At the software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly parallel applications and operating systems capable of scheduling them correctly. At the hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace, a phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises because manufacturers find it difficult to lower operating voltages sufficiently with every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the effective cache capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication. The proposed vector register file is, Postprint (published version)
Published: 2013

203. Thread assignment of multithreaded network applications in multicore/multithreaded processors

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Radojković, Petar, Cakarevic, Vladimir, Verdú Mulà, Javier, Pajuelo González, Manuel Alejandro, Cazorla Almeida, Francisco Javier, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: © 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The introduction of multithreaded processors comprising a large number of cores with many shared resources makes thread scheduling, and in particular the optimal assignment of running threads to processor hardware contexts, one of the most promising ways to improve system performance. However, finding optimal thread assignments for workloads running on state-of-the-art multicore/multithreaded processors is an NP-complete problem. In this paper, we propose the BlackBox scheduler, a systematic method for thread assignment of multithreaded network applications running on multicore/multithreaded processors. The method requires minimal information about the target processor architecture and no data about the hardware requirements of the applications under study. The proposed method is evaluated with an industrial case study for a set of multithreaded network applications running on the UltraSPARC T2 processor. In most of the experiments, the proposed thread assignment method detected the best actual thread assignment in the evaluation sample. The method improved system performance by 5 to 48 percent with respect to load balancing algorithms used in state-of-the-art OSs, and by up to 60 percent with respect to a naive thread assignment., Peer Reviewed, Postprint (author's final draft)
Published: 2013

204. Comparison based sorting for systems with multiple GPUs

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Tanasic, Ivan, Vilanova, Lluís, Jorda, Marc, Cabezas, Javier, Gelado Fernandez, Isaac, Navarro, Nacho, and Hwu, Wen-mei W.
Abstract: As a basic building block of many applications, sorting algorithms that run efficiently on modern machines are key to the performance of these applications. With the recent shift to using GPUs for general purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produces a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU., Peer Reviewed, Postprint (published version)
Published: 2013
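
Illustrative note (not part of the catalog record): the two-phase structure described in the abstract above, a local sort on each device followed by merge steps that leave an evenly distributed, globally sorted array, can be sketched on the CPU as follows. This is a minimal sketch under stated assumptions: devices are modelled as plain Python lists, the function name `sort_across_devices` is hypothetical, and a simple k-way merge followed by an even re-split stands in for the paper's pivot selection algorithm, which the abstract does not specify.

```python
import heapq


def sort_across_devices(partitions):
    """Sort data that is split across several devices (modelled as lists).

    Hypothetical helper: a CPU stand-in, not the paper's CUDA implementation.
    """
    # Phase 1: every device sorts its own partition independently.
    local_runs = [sorted(p) for p in partitions]

    # Phase 2: merge the sorted runs into one globally ordered sequence.
    # A k-way merge replaces the paper's pivot-based merge steps here.
    merged = list(heapq.merge(*local_runs))

    # Redistribute evenly so each device holds one contiguous sorted slice.
    n, k = len(merged), len(partitions)
    bounds = [i * n // k for i in range(k + 1)]
    return [merged[bounds[i]:bounds[i + 1]] for i in range(k)]


result = sort_across_devices([[9, 1, 5], [4, 8, 2], [7, 3, 6], [0, 11, 10]])
```

After the call, concatenating the slices in `result` in device order gives the fully sorted sequence, and each device holds the same number of records.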

205. Implementació del run-time Nanos++ sobre GMAC

Author: Labarta Mancho, Jesús José, and Solà Vélez, Marçal
Abstract: The goal of this project is to implement a new version of the Nanos++ runtime, developed at BSC, which supports the OmpSs programming model. The new version must support GPU programming by using the GMAC library, replacing the current CUDA-based implementation.
Published: 2013

206. Communication bottelneck analysis on big data applications

Author: Nemirovsky, Mario, Solé Pareta, Josep, and Roca Marí, Damián
Abstract: [ENGLISH] Computers, and multicore processors in particular, need cache memory to improve memory bandwidth and overall performance. There are different types of cache (private and shared) divided into different levels of hierarchy. Keeping shared values in these caches coherent and consistent is a major performance bottleneck on multicore systems. To address this issue, there are several protocols that invalidate or update these values when a core needs to modify them. But these protocols require broadcast communication (or similar), which in today's NoCs represents a large cost in terms of cycles. In order to reduce this bottleneck, the first step in this research is to estimate the cost that these invalidations represent in terms of system performance. To obtain that estimate, it is necessary to use programs or simulators of a real process inside a multicore/multithreaded processor to visualize the communications between the cores and the effects of sharing part of the address space. The reason is that these invalidations are produced by keeping different copies of the same variable (shared space) coherent. Once we have a simulator that allows us to see the communications, we can build different configurations to emulate a real processor in different scenarios. With these cases, we can determine how the number of invalidations changes depending on the parameters of the system (number of cores, size of cache memories, etc.) and the applications that are running. The results can vary between applications, since each of them uses the shared memory space in a different way. With this information we can compile statistics to draw the first conclusions and lay the foundations for future work. These results also enable us to study the scalability of the current models to see what would happen if we had a processor with more than 1000 cores, because the current simulators do not support such high number o, [Translated from Spanish] Multicore chips are the reality in the field of computers. However, these systems present many problems that restrict their potential. This project studies the main one, the memory system and, more specifically, cache memory. It studies how well current solutions scale with the number of cores and draws the relevant conclusions., [Translated from Catalan] The current trend in processors is to integrate multiple cores within a single chip. These are known as chip multiprocessors (CMPs), but their design is full of problems. This project studies them, focusing on the memory system and, more specifically, on cache memory. In particular, it analyzes how coherence protocols work and how they scale with the number of cores. Finally, it concludes that current solutions are not suitable for a high number of cores.
Published: 2013

207. Deconstructing bus access control policies for real-time multicores

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Jalle Ibarra, Javier, Abella Ferrer, Jaume, Quiñones, Eduardo, Fossati, Luca, Zulianello, Marco, and Cazorla Almeida, Francisco Javier
Abstract: Multicores may satisfy the growing performance requirements of critical real-time systems, which has led industry to consider them for future real-time systems. In a multicore, the bus contention-control policy plays a key role in the system's performance and the tightness of the Worst-Case Execution Time (WCET) estimates. In this paper we develop analytical models of the contention suffered by requests from different tasks running on different cores under the two most-used contention-control policies, Time-Division Multiple Access (TDMA) and Interference-Aware Bus Arbiter (IABA), which allows us to compare them. We further show the benefits of having such models for real-time system designers and chip providers. Our results show that WCET estimates obtained with TDMA are slightly (2%) tighter than those obtained with IABA, at the cost of knowing the exact cycle at which every access of every task accesses the bus. However, average performance is 10% worse with TDMA than with IABA. Overall, IABA is the most appealing contention-control policy since it allows achieving tight WCET estimates and high average performance with little burden for the user., Postprint (published version)
Published: 2013

208. Physical-aware system-level design for tiled hierarchical chip multiprocessors

Author: Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya. ALBCOM - Algorismia, Bioinformàtica, Complexitat i Mètodes Formals, Cortadella, Jordi, San Pedro Martín, Javier de, Nikitin, Nikita, and Petit Silvestre, Jordi
Abstract: Tiled hierarchical architectures for Chip Multiprocessors (CMPs) represent a rapid way of building scalable and power-efficient many-core computing systems. At the early stages of the design of a CMP, physical parameters are often ignored and postponed to later design stages. In this work, the importance of physical-aware system-level exploration is investigated, and a strategy for deriving chip floorplans is described. Additionally, wire planning of the on-chip interconnect is performed, as its topology and organization affect the physical layout of the system. Traditional algorithms for floorplanning and wire planning are customized to include physical constraints specific to tiled hierarchical architectures. Over-the-cell routing is used as one of the major area-saving strategies. The combination of architectural exploration and physical planning is studied with an example, and the impact of the physical aspects on the selection of architectural parameters is evaluated., Postprint (author’s final draft)
Published: 2013

209. Measuring Operating System Overhead on CMT Processors

Author: Javier Verdú, Mateo Valero, Francisco J. Cazorla, Roberto Gioiosa, Petar Radojković, Alex Pajuelo, Mario Nemirovsky, Vladimir Cakarevic, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: CMT processor, Interrupt latency, Computer science, Interrupt handler, Context (computing), Compiladors (Programes d'ordinador), Sistemes operatius (Ordinadors), Multiprocessadors, computer.software_genre, Netra DPS, Simultaneous multithreading processors, Multithreading, Operating system, Interrupt priority level, High performance computing, Timer, Operating system overhead, Interrupt, Informàtica::Sistemes operatius [Àrees temàtiques de la UPC], computer, Interrupt request, Compilers (Computer programs)
Abstract: Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine the OS noise for High Performance Computing (HPC), especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this paper, we analyze the major sources of OS noise on a massive multithreading processor, the Sun UltraSPARC T1, running Linux and Solaris. Since a real system is too complex to analyze, we compare those results with a low-overhead runtime environment: the Netra Data Plane Software Suite (Netra DPS). Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context of the timer interrupt handler and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
Published: 2008
Full Text: View/download PDF

210. Proposta de gestió del sistema operatiu per a arquitectures heterogènies multiprocessador

Author: Joglar Huber, Javier, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Gil, Marisa
Subjects: Potència del hardware, Operating systems (Computers), Multiprocessors, Sistemes operatius (Ordinadors), Multiprocessadors, Adaptació de sistemes operatius, Informàtica::Hardware [Àrees temàtiques de la UPC]
Published: 2008

211. Scheduling in Multiprocessor System Using Genetic Algorithms

Author: Alamgir Hossain, A. Daradoumis, Fatos Xhafa, Keshav Dahal, B. Varghese, Ajith Abraham, Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, and Universitat Politècnica de Catalunya. ALBCOM - Algorismia, Bioinformàtica, Complexitat i Mètodes Formals
Subjects: Earliest deadline first scheduling, Schedule, Job shop scheduling, Computer science, Distributed computing, Processor scheduling, Real-time data processing, Multiprocessadors, Dynamic priority scheduling, Parallel computing, Genetic algorithms, Scheduling (computing), Algorismes genètics, Genetic algorithm, Performance evaluation, Multiprocessors, Uniprocessor system, Heuristics, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Temps real (Informàtica)
Abstract: Multiprocessors have emerged as a powerful computing means for running real-time applications, especially where a uniprocessor system would not be sufficient to execute all the tasks. The high performance and reliability of multiprocessors have made them a powerful computing resource. Such a computing environment requires an efficient algorithm to determine when and on which processor a given task should execute. This paper investigates dynamic scheduling of real-time tasks in a multiprocessor system to obtain a feasible solution using genetic algorithms combined with well-known heuristics, such as 'Earliest Deadline First' and 'Shortest Computation Time First'. A comparative study of the results obtained from simulations shows that a genetic algorithm can be used to schedule tasks so that they meet their deadlines and, in turn, to obtain high processor utilization.
Published: 2008
Full Text: View/download PDF
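
Illustrative note (not part of the catalog record): the 'Earliest Deadline First' heuristic named in the abstract above decides which task runs next, and a multiprocessor scheduler must also decide on which processor it runs. The following is a minimal sketch of that heuristic alone, not of the paper's genetic algorithm. The assumptions are that all tasks are ready at time 0, execution is non-preemptive, and each task goes to the processor that becomes free first; the function name `edf_schedule` and the task tuples are hypothetical.

```python
import heapq


def edf_schedule(tasks, processors):
    """Greedy Earliest-Deadline-First placement on identical processors.

    tasks: list of (name, exec_time, deadline), all ready at time 0.
    Returns (schedule, feasible), where schedule holds
    (name, processor, start, finish) tuples in dispatch order.
    """
    # Min-heap of (time the processor becomes free, processor id).
    free_at = [(0, p) for p in range(processors)]
    heapq.heapify(free_at)

    schedule, feasible = [], True
    # Dispatch tasks in order of increasing deadline.
    for name, exec_time, deadline in sorted(tasks, key=lambda t: t[2]):
        start, proc = heapq.heappop(free_at)
        finish = start + exec_time
        # The schedule is feasible only if every task meets its deadline.
        feasible = feasible and finish <= deadline
        schedule.append((name, proc, start, finish))
        heapq.heappush(free_at, (finish, proc))
    return schedule, feasible


tasks = [("A", 2, 4), ("B", 1, 2), ("C", 3, 6), ("D", 2, 5)]
schedule, feasible = edf_schedule(tasks, processors=2)
```

With two processors the four example tasks all finish before their deadlines, so `feasible` is true. On a single processor task C would finish at time 8 and miss its deadline of 6.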

212. Nebelung: execution environment for transactional OpenMP

Author: Miloš Milovanović, Roger Ferrer, Vladimir Gajinov, Osman S. Unsal, Adrian Cristal, Eduard Ayguadé, Mateo Valero, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Software transactional memory, Computer science, Runtime library, Compiler, 02 engineering and technology, Parallel computing, Software_PROGRAMMINGTECHNIQUES, computer.software_genre, Theoretical Computer Science, Instruction set, Runtime system, Informàtica [Àrees temàtiques de la UPC], Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Multiprocessors, Code generation, SIMD, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], 020207 software engineering, OpenMP, Multiprocessadors, 020202 computer hardware & architecture, Operating system, Programming paradigm, computer, Software, Information Systems
Abstract: Future generations of Chip Multiprocessors (CMP) will provide dozens or even hundreds of cores inside the chip. Writing applications that benefit from the massive computational power offered by these chips is not going to be an easy task for mainstream programmers who are used to sequential algorithms rather than parallel ones. This paper explores the possibility of using Transactional Memory (TM) in OpenMP, the industrial standard for writing parallel programs on shared-memory architectures, for C, C++ and Fortran. One of the major complexities in writing OpenMP applications is the use of critical regions (locks), atomic regions and barriers to synchronize the execution of parallel activities in threads. TM has been proposed as a mechanism that abstracts some of the complexities associated with concurrent access to shared data while enabling scalable performance. The paper presents a first proof-of-concept implementation of OpenMP with TM. Some language extensions to OpenMP are proposed to express transactions. These extensions are implemented in our source-to-source OpenMP Mercurium compiler and our Software Transactional Memory (STM) runtime system Nebelung that supports the code generated by Mercurium. Hardware Transactional Memory (HTM) or Hardware-assisted STM (HaSTM) are seen as possible paths to make the tandem TM-OpenMP more scalable. In the evaluation section we show the preliminary results. The paper finishes with a set of open issues that still need to be addressed, either in OpenMP or in the hardware/software implementations of TM.
Published: 2008

213. Planificación global en sistemas multiprocesador de tiempo real

Author: Banús Alsina, Josep María, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Labarta Mancho, Jesús José, Arenas, Àlex, and Labarta Mancho, Jesús
Subjects: Informàtica [Àrees temàtiques de la UPC], sistemes operatius, planificació de tasques, temps real, Multiprocessadors, Temps real (Informàtica)
Abstract: This thesis addresses the problem of scheduling real-time systems on shared-memory multiprocessor systems. According to the literature, this problem is NP-hard. In real-time applications, deadlines are imposed on task completion. Therefore, what matters is obtaining results on time rather than achieving high average performance. The traditional solution has been to partition the tasks at design time and treat the processors as isolated uniprocessors. The alternative, global scheduling of the multiprocessor, has an underdeveloped theory. The limit on system utilization with deadline guarantees is very low, around 50%, and the spare capacity can hardly be used to service aperiodic tasks. Thus, the main goal of this thesis is to develop global scheduling techniques that provide deadline guarantees and good service for aperiodic tasks while being able to use 100% of the processing capacity. First, we explored four possibilities of distribution: static or dynamic, for periodic or aperiodic tasks. We serviced aperiodic tasks with two different methods: with servers and without servers. In dynamic distributions, the server-based method ran into difficulties in sizing the servers and in guaranteeing deadlines. The methods without servers that were tested were the Slack Stealing and Total Bandwidth schedulers. Both could only be adapted for static scheduling of the periodic tasks. The simulations showed that local scheduling with Slack Stealing and a Next-Fit allocator for aperiodic tasks provides the best mean response times for the aperiodic tasks. However, when loads are very high, the response time increases sharply. All methods tested up to this point were dismissed for global scheduling. Second, the Dual Priority algorithm was adapted to global scheduling. We first analyzed its characteristics in uniprocessors and made various improvements. The algorithm depends on the off-line calculation of the worst-case response time of the periodic tasks, and the formula used to compute it in uniprocessors is not valid for multiprocessors. We therefore analyzed three methods for its calculation: an analytical method, a simulation method and an algorithmic method. The first yields values that are too pessimistic; the second yields tighter values that are sometimes too optimistic; the third is an approximate method that yields both optimistic and pessimistic values. Thus, this method does not guarantee deadlines and cannot be used in hard real-time systems. However, it is very suitable for soft real-time systems. In these systems, with on-line monitoring and dynamic adjustment of promotions, the number of missed deadlines is very low and the response time of aperiodic tasks is excellent. Finally, we present a hybrid solution between static task allocation and global scheduling. At design time, the periodic tasks are distributed among the processors and their promotions are calculated for local scheduling. At run time, a periodic task can run on any processor until the instant of its promotion, when it has to migrate to its own processor. This guarantees deadlines and allows a certain degree of dynamic load balancing. The flexibility provided by task promotions and load balancing is used (i) to admit periodic tasks that would otherwise not be schedulable, (ii) to serve aperiodic tasks and (iii) to serve aperiodic tasks with deadlines, or sporadic tasks. For the three cases, various methods of distributing the periodic tasks at design time were designed and analyzed. We also designed a method to reduce the number of migrations. The simulations showed that this method can reach loads very close to 100% with only periodic tasks, far from the 50% of global scheduling theory. The simulations with aperiodic tasks showed that their mean response time is very good. We designed an acceptance test for sporadic tasks such that, if a task is accepted, its deadline is guaranteed. The acceptance rate obtained in the experiments was over 80%. Finally, we devised a pre-runtime distribution method for periodic tasks that facilitates, at run time, the acceptance of a high percentage of sporadic tasks while maintaining a good average level of service for aperiodic tasks.
Published: 2008

214. Task management analysis on the CellBE

Author: Rico Carro, Alejandro, Ramírez Bellido, Alejandro, Valero Cortés, Mateo, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Multiprocessors, Multiprocessadors, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models, which allow the programmer to identify parallel tasks and leave the management of inter-task dependencies to the runtime, have been identified as a suitable model for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell architecture, in terms of its scalability to a higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to less than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution and larger caches on the task management overhead.
Published: 2008

215. Efficient resources assignment schemes for clustered multithreaded processors

Author: Fernando Latorre, José González, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Delay, Speedup, Parallel processing (Electronic computers), Data parallelism, Computer science, Processament en paral·lel (Ordinadors), Task parallelism, Surface-mount technology, Parallel computing, Multiprocessadors, Simultaneous multithreading, Energy consumption, Hardware, Multithreading, Simultaneous multithreading processors, Wire, Yarn, Parallelism (grammar), Parallel processing, Resource allocation, Proposals, Process design, Instruction-level parallelism, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: New feature sizes provide a larger number of transistors per chip that architects could use to further exploit instruction-level parallelism. However, these technologies also bring new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction-level parallelism is leading to diminishing returns, and therefore exploiting other sources of parallelism, such as thread-level parallelism, is needed to keep raising performance with reasonable hardware complexity. On the other hand, clustered architectures have been widely studied as a way to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that achieves an average speedup of 17.6% over Icount while improving fairness by 24%.
Published: 2008

216. Optimizing programming models for massively parallel computers

Author: Farreras Esclusa, Montse, Cortés, Toni, and Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Subjects: Productivitat, Informàtica [Àrees temàtiques de la UPC], MPI, MPPS, Programari -- Optimització de recursos, Multiprocessadors, Model de programació, Optimització, Arquitectura d'ordinadors, Programabilitat, Alt rendiment
Abstract: Since the invention of the transistor, clock frequency increase was the primary method of improving computing performance. As the reach of Moore's law came to an end, however, technology driven performance gains became increasingly harder to achieve, and the research community was forced to come up with innovative system architectures. Today increasing parallelism is the primary method of improving performance: single processors are being replaced by multiprocessor systems and multicore architectures. The challenge faced by computer architects is to increase performance while limited by cost and power consumption. The appearance of cheap and fast interconnection networks has promoted designs based on distributed memory computing. Most modern massively parallel computers, as reflected by the Top 500 list, are clusters of workstations using commodity processors connected by high speed interconnects. Today's massively parallel systems consist of hundreds of thousands of processors. Software technology to program these large systems is still in its infancy. Optimizing communication has become a key to overall system performance. To cope with the increasing burden of communication, the following methods have been explored: (i) Scalability in the messaging system: The messaging system itself needs to scale up to the 100K processor range. (ii) Scalable algorithms reducing communication: As the machine grows in size the amount of communication also increases, and the resulting overhead negatively impacts performance. New programming models and algorithms allow programmers to better exploit locality and reduce communication. (iii) Speed up communication: reducing and hiding communication latency, and improving bandwidth. 
Following the three items described above, this thesis contributes to the improvement of the communication system (i) by proposing a scalable memory management of the communication system that guarantees the correct reception of data and control-data, (ii) by proposing a language extension that allows programmers to better exploit data locality to reduce inter-node communication, and (iii) by presenting and evaluating a cache of remote addresses that aims to reduce control-data and exploit the native RDMA capabilities of the network, resulting in latency reduction and better overlap of communication and computation. Our contributions are analyzed in two different parallel programming models: Message Passing Interface (MPI) and Unified Parallel C (UPC). Many different programming models exist today, and the programmer usually needs to choose one or another depending on the problem and the machine architecture. MPI has been chosen because it is the de facto standard for parallel programming on distributed memory machines. UPC was considered because it constitutes a promising easy-to-use approach to parallelism. Since parallelism is everywhere, programmability is becoming important, and languages such as UPC are gaining attention as a potential future of high performance computing. Concerning the communication system, the languages chosen are relevant because, while MPI offers two-sided communication, UPC relies on a one-sided communication model. This difference potentially influences the communication system requirements of the language. These requirements, as well as our contributions, are analyzed and discussed for both programming models, and we state whether they apply to both.
Published: 2008

217. The implementation of XHiNoC based MPSoC

Author: Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Technische Universität Darmstadt, Ying, Haoyuan, and Mir Boada, Jordi
Abstract: Project carried out within a mobility programme with TU Darmstadt., [ANGLÈS] Nowadays, electronics is advancing at an extremely fast pace. New applications demand heavy computational loads and high speeds. The future of electronics is therefore linked to the concept of multitasking and to the integration of multiple processors on a single chip, known as MPSoC (Multi Processors System-on-chip). One of the main factors that directly affects chip performance is the interconnect system among the different cores. So far, several interconnect alternatives have been developed for systems with fewer than ten cores. However, the problem gets out of control when the number of processing elements exceeds 10. Problems then start to appear, such as performance bottlenecks, high electromagnetic interference that can disturb the interconnect functionality, and wiring complexity; in general, none of these alternatives is flexible. As a result, one solution that has gained particular strength is the NoC (Network on chip). The NoC is a communication infrastructure that provides better scalability and flexibility as well as high bandwidth. In addition, it is highly reconfigurable. This thesis consists of, first, a documentation chapter that examines two completely different networks-on-chip, HeMPS NoC and XHiNoC. Next, a design and implementation chapter covers the adaptation and commissioning of XHiNoC within the design platform known as HeMPS. This platform allows us to evaluate the NoC performance. Third, we present measurements and comparisons between the two NoCs and, finally, some conclusions drawn during the project., [CASTELLÀ] La electrónica de hoy en día está avanzando a un ritmo extremadamente elevado. Las nuevas aplicaciones exigen unas cargas computacionales y unas velocidades elevadísimas. 
Es por eso que el futuro de la electrónica está ligado con el concepto de multi-tarea y a la integración de multiprocesadores en un solo chip, el cual se conoce como MPSoC (Multi Processors System-on-chip). Uno de los principales factores que afectan directamente al rendimiento del chip es el sistema de interconexión entre los diferentes núcleos. Hasta el día de hoy se han desarrollado diferentes sistemas para chips que no llegan a la decena de núcleos. Ahora bien, el problema se desborda cuando el número de elementos integrados supera los 10. Entonces empiezan a aparecer problemas de cuello de botella, niveles altos de electromagnetismo y los sistemas de interconexión se complican. A raíz de todo esto, una de las soluciones que se ha impuesto con más fuerza es la conocida como NoC (Network-on-chip). La NoC es una de las infraestructuras de comunicación que proporciona una mayor escalabilidad y flexibilidad, aparte de un gran ancho de banda de comunicación. También posee un gran grado de reconfiguración. Este proyecto está basado en, primeramente, una parte de documentación donde se estudian dos networks-on-chip totalmente diferentes, HeMPS NoC y XHiNoC. A continuación hay una parte de diseño e implementación que consiste en la adaptación y puesta en marcha de XHiNoC dentro de la plataforma de diseño conocida como HeMPS, la cual nos permite evaluar el rendimiento. En tercer lugar encontramos unas medidas y unas comparativas entre las dos NoC y finalmente se extraen unas conclusiones de la realización del proyecto., [CATALÀ] L'electrònica d'avui en dia està avançant a un ritme extremadament alt. Les noves aplicacions exigeixen unes càrregues computacionals i unes velocitats elevadíssimes. És per això que el futur de l'electrònica està lligat al concepte de multitasca i a la integració de multiprocessadors en un sol xip, que és conegut com a MPSoCs (Multi Processors System-on-chip). 
Un dels principals factors que afecten directament el rendiment del xip és el sistema d'interconnexió entre els diferents nuclis. Fins al dia d'avui s'han desenvolupat diferents sistemes per a xips que no arriben a la desena de nuclis. Ara bé, el problema es descontrola quan el nombre d'elements integrats supera els 10. Llavors comencen a aparèixer problemes de coll d'ampolla, nivells alts d'electromagnetisme i els sistemes d'interconnexió es compliquen. Arran de tot això, una de les solucions que s'ha imposat amb més força és la coneguda com a NoC (Network-on-chip). La NoC és una infraestructura de comunicació que proveeix una millor escalabilitat i flexibilitat, a part d'un gran ample de banda de comunicació. A més posseeix un gran grau de reconfiguració. Aquest projecte està basat en, primerament, una part documentària on s'estudien dues networks-on-chip totalment diferents, HeMPS NoC i XHiNoC. Seguidament una part de disseny i implementació que consisteix en l'adaptació i posada en marxa de XHiNoC dins una plataforma de disseny coneguda com a HeMPS, la qual ens permetrà avaluar el seu rendiment. En tercer punt hi trobarem una part de mesures i comparatives entre les dues NoC i finalment unes conclusions extretes durant la realització del projecte.
Published: 2012

218. Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Bertrán, Ramon, Buyuktosunoglu, Alper, Gupta, Meeta S., González Tallada, Marc, and Bose, Pradip
Abstract: Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic post-silicon measurement-based analysis, is increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement-based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with a 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications)., Peer Reviewed, Postprint (author’s final draft)
Published: 2012

219. Static task mapping for tiled chip multiprocessors with multiple voltage islands

Author: Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya. ALBCOM - Algorismia, Bioinformàtica, Complexitat i Mètodes Formals, Nikitin, Nikita, and Cortadella, Jordi
Abstract: The complexity of large Chip Multiprocessors (CMP) makes design reuse a practical approach to reduce the manufacturing and design cost of high-performance systems. This paper proposes techniques for static task mapping onto general-purpose CMPs with multiple pre-defined voltage islands for power management. The CMPs are assumed to contain different classes of processing elements with multiple voltage/frequency execution modes to better cover a large range of applications. Task mapping is performed with awareness of both on-chip and off-chip memory traffic, and communication constraints such as the link and memory bandwidth. A novel mapping approach based on Extremal Optimization is proposed for large-scale CMPs. This new combinatorial optimization method has delivered very good results in quality and computational cost when compared to the classical simulated annealing., Peer Reviewed, Postprint (published version)
Published: 2012

220. Mitosis based speculative multithreaded architectures

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, González Colás, Antonio María, Codina Viñas, Josep Mª, Marcuello Pascual, Pedro, and Madriles Gimeno, Carles
Abstract: In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP), while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to benefit from multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on both these approaches, the techniques proposed so far have shown marginal performance improvements. 
In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-co, Postprint (published version)
Published: 2012

221. A Simulation framework for hierarchical Network-on-Chip systems

Author: Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics, Carmona Vargas, Josep, Cortadella, Jordi, and San Pedro Martín, Javier de
Abstract: Today, even the simplest laptop processor has at least four cores and a graphics card containing tens of cores. It is not hard to find more performance-oriented processors with hundreds of cores, and processors with thousands of cores are expected in the not very far future. In these and future processors, the design of the interconnection network between the cores and the memory subsystem is a key design aspect. Simple topologies like buses or rings provide great efficiency, but do not scale as well as meshes once the number of cores increases. We explore the use of hierarchical network designs as an alternative, where different topologies are stacked in a single network. The lowest layers use rings or buses, taking advantage of locality, while other layers use meshes or more complex topologies. To fully explore these and other chip multiprocessor design aspects, we build an interconnection network simulator that is capable of simulating arbitrary hierarchies of multiple network topologies. We propose using parameterizable automata as traffic sources, as a trade-off between full processor simulation and simulation using purely random traffic. By altering the automaton high-level parameters, changes in the processor workload can be simulated, such as the expected average memory traffic, the locality of the memory accesses, the additional traffic caused by different cache coherency protocols, etc.
Published: 2012

222. MLP-Aware Dynamic Cache Partitioning

Author: Moreto, Valero, Ramirez, Cazorla, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Cache storage, Multiprocessing systems, Computer science, Cache memory, Memòria cau, Task parallelism, Multiprocessadors, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Simultaneous multithreading, Shared resource, Multi-threading, Multithreading, Memòria ràpida de treball (Informàtica), Multiprocessors, Resource allocation, Cache, Cache hierarchy, Instruction-level parallelism, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: The limitation imposed by instruction-level parallelism (ILP) has motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. TLP paradigms such as simultaneous multithreading (SMT), chip multiprocessing (CMP) and combinations of both offer the opportunity to obtain higher throughputs. However, they also have to face the challenge of sharing resources of the architecture. Omitting resource control altogether can lead to undesired situations where one thread monopolizes all the resources and harms the other threads. Some studies deal with the resource sharing problem in SMTs for core-level resources like issue queues, registers, etc. In CMPs, resource sharing is lower than in SMT and is focused on the cache hierarchy.
Published: 2007
Full Text: View/download PDF

223. Efficient resource management in heterogeneous multiprocessor systems

Author: Merino Vidal, Julio M., Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Navarro, Nacho
Subjects: Multiprocessors, Multiprocessadors, Informàtica::Hardware [Àrees temàtiques de la UPC]
Published: 2007

224. Introducing runahead threads

Author: Ramírez García, Tanausu, Pajuelo González, Manuel Alejandro (ORCID 0000-0002-5510-6860), Santana Jaria, Oliverio J., Valero Cortés, Mateo (ORCID 0000-0003-2917-2482), Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Parallel processing (Electronic computers), Simultaneous multithreading processors, Processament en paral·lel (Ordinadors), Multiprocessadors, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: Simultaneous Multithreading processors share their resources among multiple threads in order to improve performance. However, a resource control policy is needed to avoid resource conflicts and prevent some threads from monopolizing them. Otherwise, resource conflicts would cause other threads to suffer from resource starvation, degrading the overall performance. This situation is especially sensitive for memory-bounded threads, because they hold a significant amount of resources while long-latency accesses are being served. Several fetch policies and resource control techniques have been proposed to overcome these problems by limiting the per-thread resource utilization. Nevertheless, this limitation is harmful for memory-bounded threads because it restricts the available memory-level parallelism that hides the long-latency memory accesses. In this paper, we propose Runahead threads in SMT scenarios as a valuable solution for both exploiting memory-level parallelism and reducing resource contention. This approach switches a memory-bounded, resource-eager thread to a speculative light thread, avoiding critical resource blocking among multiple threads. Furthermore, it improves thread-level parallelism by removing long-latency memory operations from the instruction window, releasing busy resources. We compare an SMT architecture using Runahead threads (SMTRA) to both state-of-the-art static fetch and dynamic resource control policies. Our results show that the SMTRA combination performs better, in terms of throughput and fairness, than any of the other policies.
Published: 2007

225. Online Prediction of Applications Cache Utility

Author: Alex Ramirez, Francisco J. Cazorla, Mateo Valero, Miquel Moreto, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: business.industry, Computer science, Cache coloring, CPU cache, Program compilers, Compiladors (Programes d'ordinador), Multiprocessadors, Cache-oblivious algorithm, Cache pollution, Multi-threading, Smart Cache, Reconfigurable architectures, Cache invalidation, Simultaneous multithreading processors, Embedded system, Cache, business, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Cache algorithms, Compilers (Computer programs)
Abstract: General purpose architectures are designed to offer high average performance regardless of the particular application that is being run. Performance and power inefficiencies appear as a consequence for some programs. Reconfigurable hardware (cache hierarchy, branch predictor, execution units, bandwidth, etc.) has been proposed to overcome these inefficiencies by dynamically adapting the architecture to the application needs. However, nearly all the proposals use indirect measures or heuristics of performance to decide new configurations, which may lead to inefficiencies. In this paper we propose a runtime mechanism that makes it possible to predict the throughput of an application on an architecture using a reconfigurable L2 cache. L2 cache size varies at way granularity, and we predict the performance of the same application on all other L2 cache sizes at the same time. Across the different L2 cache sizes we obtain an average error of 3.11%, a maximum error of 16.4% and a standard deviation of 3.7%. No profiling or operating system participation is needed in this mechanism. We also give a hardware implementation that reduces the hardware cost to under 0.4% of the total L2 size while maintaining high accuracy. This mechanism can be used to reduce power consumption in single-threaded architectures and improve performance in multithreaded architectures that dynamically partition shared L2 caches.
Published: 2007
Full Text: View/download PDF

226. Implicit transactional memory in chip multiprocessors

Author: Galluzzi, Marco, Vallejo, Enrique, Cristal Kestelman, Adrián, Vallejo, Fernando, Beivide Palacio, Ramon, Stenström, Per, Smith, James E., Valero Cortés, Mateo, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Multiprocessors, Implicit transaction, Multiprocessor, Memory consistency, Multiprocessadors, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Kilo-instruction
Abstract: Chip Multiprocessors (CMPs) are an efficient way of designing with and using the huge number of transistors on a chip. Different cores on a chip can compose a shared memory system with a very low-latency interconnect at a very low cost. Unfortunately, consistency models and synchronization styles of popular programming models for multiprocessors impose severe performance losses. Known architectural approaches to combat these losses are too complex, too specialized, or not transparent to the software. In this article, we introduce “implicit transactional memory” as a generalized architectural concept to remove such performance losses. We show how the concept of implicit transactions can be implemented at low complexity by leveraging the multi-checkpoint mechanism of the Kilo-Instruction Processor. By relying on a general speculation substrate, it supports even the strictest consistency model, sequential consistency, potentially as effectively as weaker models, and it allows multiple threads to speculatively execute critical sections, beyond barriers and event synchronizations.
Published: 2007

227. Virtual Cluster Scheduling Through the Scheduling Graph

Author: Jesús Sánchez, Josep M. Codina, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Rate-monotonic scheduling, Parallel processing (Electronic computers), Multiprocessing systems, Computer science, Processament en paral·lel (Ordinadors), Processor scheduling, Instruction scheduling, Workstation clusters, Multiprocessadors, Dynamic priority scheduling, Parallel computing, Round-robin scheduling, Fair-share scheduling, Deadline-monotonic scheduling, Scheduling (computing), Instruction sets, Two-level scheduling, Multiprocessors, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: This paper presents an instruction scheduling and cluster assignment approach for clustered processors. The proposed technique makes use of a novel representation named the scheduling graph, which describes all possible schedules. A powerful deduction process is applied to this graph, reducing at each step the set of possible schedules. In contrast to traditional list scheduling techniques, the proposed scheme tries to establish relations among instructions rather than assigning each instruction to a particular cycle. The main advantage is that wrong or poor schedules can be anticipated and discarded earlier. In addition, cluster assignment of instructions is performed using another novel concept called virtual clusters, which define sets of instructions that must execute in the same cluster. These clusters are managed during the deduction process to identify incompatibilities among instructions. The mapping of virtual to physical clusters is postponed until the scheduling of the instructions has been finalized. The advantages of this novel approach include: (1) accurate scheduling information when assigning, and (2) accurate information on the cluster assignment constraints imposed by scheduling decisions. We have implemented and evaluated the proposed scheme with superblocks extracted from SPECint95 and MediaBench. The results show that this approach produces better schedules than the previous state of the art. Speed-ups are up to 15%, with average speed-ups ranging from 2.5% (2-Clusters) to 9.5% (4-Clusters).
Published: 2007
Full Text: View/download PDF

228. 'Virtual malleability' applied to MPI jobs to improve their execution in a multiprogrammed environment

Author: Utrera Iglesias, Gladys Miriam, Labarta Mancho, Jesús José, Corbalán González, Julita, and Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Subjects: malleability, processor scheduling, parallel computing, Informàtica [Àrees temàtiques de la UPC], process scheduling, load balancing, message passing, mpi, Multiprocessadors, job scheduling, Programació en paral·lel (Informàtica), parallel applications
Abstract: This work focuses on the scheduling of MPI jobs executing on shared-memory multiprocessors (SMPs). The objective was to obtain the best response time in multiprogrammed multiprocessor systems using batch systems, assuming all jobs have the same priority. To that end, the work studied the benefits of supporting malleability in MPI jobs to reduce fragmentation and consequently improve system performance. The contributions made in this work can be summarized as follows:
· Virtual malleability: a mechanism where a job is assigned a dynamic processor partition in which the number of processes is greater than the number of processors. The partition size is modified at runtime according to external requirements, such as the load of the system, by varying the multiprogramming level, which makes the job contend for resources with itself. In addition, a mechanism decides at runtime whether to apply local or global process queues to an application, depending on the load balance among its processes.
· A job scheduling policy that takes decisions such as how many processes to start with and the maximum multiprogramming degree, based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and there are queued jobs, this algorithm analyzes whether it is better to start another job immediately or to wait until more resources are available.
· A new alternative to backfilling strategies for the problem of the execution-time window expiring. Virtual malleability is applied to the backfilled job, reducing its partition size without aborting or suspending it as in traditional backfilling.
The evaluation of this thesis followed a practical approach. All the proposals were implemented by modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions was studied under several types of workloads, varying machine utilization, communication and balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability for MPI jobs. An application obtained better performance when contending for resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account, obtaining better performance when the right queue type was applied to each application independently. The proposed job scheduling policy exploited virtual malleability by choosing, at the beginning of execution, parameters such as the number of processes and the maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilization. However, as the load increased, virtual malleability was not enough: when the machine is heavily loaded, jobs that have been shrunk are not able to expand, so they must run all the time with a partition smaller than the job size, which degrades performance. At this point the job scheduling policy therefore concentrated only on moldability. Fragmentation was also alleviated by applying backfilling techniques to the job scheduling algorithm. Virtual malleability proved to be an interesting improvement for the window-expiring problem. Backfilled jobs, even on a smaller partition, can continue execution, reducing the memory swapping generated by aborts and suspensions. In this way the queuing system does not have to reinsert the backfilled job in the queue and re-execute it in the future.
Published: 2007

229. ROB-free architecture proposal

Author: González, Isidro, Galluzzi, Marco, Cristal Kestelman, Adrián, Pajuelo González, Manuel Alejandro, Santana Jaria, Oliverio J., Valero Cortés, Mateo, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Renaming, Parallel processing (Electronic computers), Recovery, Checkpoint, Processament en paral·lel (Ordinadors), Multiprocessors, Multiprocessadors, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: This Technical Report was sent to the Advisory Committee of MICRO-40 (June 8th, 2007) for review and published in the Spanish Workshop on Parallelism in September 2006 and September 2007. Outstanding Technical Report: UPC-DAC-2002-43 (September 6th, 2002) 'Large virtual ROBs by processor checkpointing'. Modern processors improve performance by exploiting instruction-level parallelism (ILP), allowing hundreds of instructions in flight. However, they still face an important source of degradation: the increasing difference between processor and main memory speeds (the memory wall). To overcome this problem, recent proposals allow even more instructions in flight by replacing the re-order buffer (ROB) with a checkpointing mechanism and out-of-order retirement of the processor's resources. This relaxes other desirable features, such as precise recovery of the state on mispredicted branches or exceptions, and may re-execute correct-path instructions on a recovery.
Published: 2007

230. Análisis de rendimiento de aplicaciones en sistemas multicore

Author: Universitat Oberta de Catalunya and Hernández Barragán, Esteban de Jesús
Abstract: The rapid growth of multicore systems, and the diverse approaches they have taken, allow complex processes that previously could only be executed on supercomputers to run today on low-cost solutions, also called "commodity hardware". These solutions can be implemented using the best-selling processors in the mass consumer market (Intel and AMD). When these solutions are scaled to scientific computing requirements, it is essential to have methods to measure their performance and how they behave under different workloads. Because of the large number of workload types on the market, and even within scientific computing, it is necessary to establish "typical" measures that can support the evaluation and acquisition of solutions with a high degree of certainty of operation. This research proposes a practical approach to this evaluation and presents the results of tests run on machines with AMD and Intel multicore architectures.
Published: 2011

231. Estudio y evaluación de formatos de almacenamiento para matrices dispersas en arquitecturas multi-core

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Pasarin, Marc, Otero Calviño, Beatriz, and Herrero Zaragoza, José Ramón
Abstract: Postprint (published version)
Published: 2011

232. Optimización y Paralelización con SMPSs de un código de detección de patrones

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Labarta Mancho, Jesús José, and Marin Armengod, Jeremies
Abstract: Within the CEPBA-Tools performance analysis tools, which are centered on Paraver, a trace analyzer, a module has been developed to detect the periodicity of traces. Starting from a trace in Paraver format, this application performs a wavelet analysis of the signal, which makes it possible to detect its periodic zone and cut out the repetitive region with "quality". The goal of the project is to optimize this application so that it runs faster and to parallelize it with SMPSs, so that it can run efficiently on shared-memory or multicore machines.
Published: 2011

233. Operación stencil en plataformas multi-core y many-core

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Herrero Zaragoza, José Ramón, and Garcés Chapero, Bernardo
Abstract: The problems arising from power dissipation in sequential computing are making machines and systems with more processing cores increasingly popular. From small processors with a few cores, through clusters of several distributed sequential machines, to graphics processing units (GPUs) with several hundred cores that can be assigned general-purpose tasks, many algorithms are being adapted to these parallelization models. This work covers the analysis, implementation, optimization and parallelization of the 5-point and 27-point stencil operations, which originate in the solution of partial differential equations by a finite method and are of significant importance in science. The parallelization was carried out both on a multi-core system with two Intel Xeon E5520 processors and on an Nvidia GeForce GTX 295 graphics device with 240 CUDA cores. Regarding the optimization of the algorithm, a series of optimizations were applied to the sequential code, such as loop unrolling, common subexpression elimination and vectorization with SSE instructions. For the parallelization in the multi-core environment, and in order to exploit all the hardware resources of the system, different parallel programming models were tested, both those based on distributed-memory systems, such as MPI, and those based on shared-memory systems, such as OpenMP and POSIX Threads. For the implementation in the many-core environment, two different approaches to solving the problem were used: one employing the method that at first glance seems most practical, and the other structuring the tasks in a way better suited to the architecture used. The difference [...]
Published: 2011
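
Illustrative note: the 5-point stencil named in the abstract above can be sketched as follows. This is a minimal sequential sketch, not the authors' code. It assumes the common discrete-Laplacian form of the 5-point stencil (four neighbours minus four times the centre); the function name `laplacian_5pt` and the choice to leave boundary cells at zero are assumptions made for illustration only.

```python
def laplacian_5pt(grid):
    """Apply one sweep of a 5-point stencil to the interior of a 2D grid.

    Assumed form (hypothetical, for illustration): the discrete Laplacian,
    out[i][j] = up + down + left + right - 4 * centre.
    Boundary cells of the output are left at zero.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]
                         - 4 * grid[i][j])
    return out


# Usage: a single unit impulse in the centre of a 3x3 grid.
impulse = [[0, 0, 0],
           [0, 1, 0],
           [0, 0, 0]]
result = laplacian_5pt(impulse)
```

Each interior cell depends only on values from the input grid, so the two loops carry no dependencies between iterations. That property is what makes the operation a natural candidate for the loop unrolling, SSE vectorization, and OpenMP, MPI or CUDA parallelization that the abstract mentions.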

234. Operación stencil en CUDA

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Garcés, Bernardo, Herrero Zaragoza, José Ramón, and Otero Calviño, Beatriz
Abstract: The problems arising from power dissipation in sequential computing are making machines and systems with more processing cores increasingly popular, ranging from small processors with a few cores, through clusters of several distributed sequential machines, to graphics coprocessing devices with several hundred cores to which general-purpose tasks can be assigned. Many algorithms are being adapted to these parallelization models., Preprint
Published: 2011

235. Trace-driven simulation of multithreaded applications

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Rico Carro, Alejandro, Duran González, Alejandro, Cabarcas, Felipe, Etsion, Yoav, Ramírez Bellido, Alejandro, and Valero Cortés, Mateo
Abstract: Over the past few years, computer architecture research has moved towards execution-driven simulation, due to the inability of traces to capture timing-dependent thread execution interleaving. However, trace-driven simulation has many advantages over execution-driven simulation that are being missed in multithreaded application simulations. We present a methodology to properly simulate multithreaded applications using trace-driven environments. We distinguish the intrinsic application behavior from the computation for managing parallelism. Application traces capture the intrinsic behavior in the sections of code that are independent of the dynamic multithreaded nature, and the points where parallelism-management computation occurs. The simulation framework is composed of a trace-driven simulation engine and a dynamic-behavior component that implements the parallelism-management operations for the application. Then, at simulation time, these operations are reproduced by invoking their implementation in the dynamic-behavior component. The decisions made by these operations are based on the simulated architecture, which allows sections of code taken from the trace to be dynamically rescheduled to the target simulated components. As the captured sections of code are independent of the parallel state of the application, they can be simulated on the trace-driven engine, while the parallelism-management operations, which need to be re-executed, are carried out by the execution-driven component, thus achieving the best of both the trace-driven and execution-driven worlds. This simulation methodology creates several new research opportunities, including research on scheduling and other parallelism-management techniques for future architectures, and hardware support for programming models., Peer Reviewed, Postprint (published version)
Published: 2011

236. Circuit design of a dual-versioning L1 data cache for optimistic concurrency

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Seyedi, Azam, Armejach, Adrià, Cristal Kestelman, Adrián, Unsal, Osman Sabri, Hur, Ibrahim, and Valero Cortés, Mateo
Abstract: This paper proposes a novel L1 data cache design with dual-versioning SRAM cells (dvSRAM) for chip multi-processors (CMP) that implement optimistic concurrency proposals. In this new cache architecture, each dvSRAM cell has two cells, a main cell and a secondary cell, which keep two versions of the same data. These values can be accessed, modified, moved back and forth between the main and secondary cells within the access time of the cache. We design and simulate a 32-KB dual-versioning L1 data cache with 45nm CMOS technology at 2GHz processor frequency and 1V supply voltage, which we describe in detail. We also introduce three well-known use cases that make use of optimistic concurrency execution and that can benefit from our proposed design. Moreover, we evaluate one of the use cases to show the impact of the dual-versioning cell in both performance and energy consumption. Our experiments show that large speedups can be achieved with acceptable overall energy dissipation., Postprint (published version)
Published: 2011

237. A highly scalable parallel implementation of H.264

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Azevedo, Arnaldo, Juurlink, Ben, Meenderinck, Cor, Terechko, Andrei, Hoogerbrugge, Jan, Álvarez Mesa, Mauricio, Ramírez Bellido, Alejandro, and Valero Cortés, Mateo
Abstract: Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required developing a subscription mechanism, where MBs are subscribed to the kick-off lists associated with the reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, in which case the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase, since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat memory and latency issues with a negligible effect on the performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and the synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding. This work was performed while the fourth author was with NXP Semiconductors., Peer Reviewed, Postprint (author's final draft)
Published: 2011
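
Illustrative note: the speedup figures quoted in the abstract above can be read as parallel efficiency (speedup divided by core count). The short sketch below only redoes that arithmetic; the function name `parallel_efficiency` is an assumption, and since the abstract reports "more than 54", the first value is a lower bound.

```python
def parallel_efficiency(speedup, cores):
    """Return parallel efficiency, defined as speedup divided by core count."""
    return speedup / cores


# Figures taken from the abstract: 64 cores, speedups of 54 and 23.
new_eff = parallel_efficiency(54, 64)   # lower bound for the proposed approach
old_eff = parallel_efficiency(23, 64)   # previous intra-frame-only approach
print(f"proposed: at least {new_eff:.1%}, previous: {old_eff:.1%}")
```

On these figures the proposed implementation keeps at least about 84% of the ideal linear speedup on 64 cores, against roughly 36% for the previous approach.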

238. Symmetric rank-k update on clusters of multicore processors with SMPSs

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament de Resistència de Materials i Estructures a l'Enginyeria, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Badia Sala, Rosa Maria, Labarta Mancho, Jesús José, Marjanovic, Vladimir, Martín Huertas, Alberto Francisco, Mayo, Rafael, Quintana Ortí, Enrique Salvador, and Reyes, Ruymán
Abstract: We investigate the use of the SMPSs programming model to leverage task parallelism in the execution of a message-passing implementation of the symmetric rank-k update on clusters equipped with multicore processors. Our experience shows that the major difficulties in adapting the code to the MPI/SMPSs instance of this programming model are due to the usage of the conventional column-major layout of matrices in numerical libraries. On the other hand, the experimental results show a considerable increase in the performance and scalability of our solution when compared with the standard options based on the use of a pure MPI approach or a hybrid one that combines MPI/multi-threaded BLAS., Peer Reviewed, Postprint (published version)
Published: 2011
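
Illustrative note: the symmetric rank-k update named in the abstract above computes C := C + A·Aᵀ, and because the result is symmetric only one triangle needs to be updated. The sketch below is a plain sequential version under that definition, with the scaling factors of the BLAS routine fixed to 1. The function name `syrk_lower` and the list-of-rows storage are assumptions for illustration; this is not the MPI/SMPSs code the paper describes.

```python
def syrk_lower(C, A):
    """Update the lower triangle of C in place with C := C + A * A^T.

    A is an n x k matrix and C an n x n matrix, both stored as lists of rows.
    Entries strictly above the diagonal of C are not touched.
    """
    n, k = len(A), len(A[0])
    for i in range(n):
        for j in range(i + 1):  # lower triangle only, including the diagonal
            C[i][j] += sum(A[i][p] * A[j][p] for p in range(k))
    return C


# Usage: a 2 x 2 update starting from a zero matrix.
A = [[1, 2],
     [3, 4]]
C = syrk_lower([[0, 0], [0, 0]], A)
```

Each entry (i, j) of the lower triangle is an independent dot product of two rows of A, which is the kind of work that can be split into tasks over blocks of C.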

239. Fg-STP: fine-grain single thread partitioning on multicores

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors, Ranjan, Rakesh, Latorre Salinas, Fernando, Marcuello Pascual, Pedro, and González Colás, Antonio María
Abstract: Power and complexity issues have led the microprocessor industry to shift to Chip Multiprocessors (CMPs) in order to better utilize the additional transistors ensured by Moore's law. While parallel programs will be able to take most of the advantage of these CMPs, single-thread applications are not equipped to benefit from them. In this paper we propose Fine-Grain Single-Thread Partitioning (Fg-STP), a hardware-only scheme that takes advantage of CMP designs to speed up single-threaded applications. Our proposal improves single-thread performance by reconfiguring two cores so that they collaborate on the fetching and execution of the instructions. These cores are basically conventional out-of-order cores in which execution is orchestrated using dedicated hardware that has minimal and localized impact on the original design of the cores. This approach partitions the code at instruction granularity and differs from previous proposals in its extensive use of dependence speculation, replication and communication. These features are combined with the ability to look for parallelism in large instruction windows without any software intervention (no re-compilation or profiling hints are needed). These characteristics allow Fg-STP to speed up single-thread execution by 18% and 7% on average over similar hardware-only approaches like Core Fusion, on medium-sized and small-sized 2-core CMPs respectively, for Spec 2006 benchmarks., Peer Reviewed, Postprint (published version)
Published: 2011

240. Running stream-like programs on heterogeneous multi-core systems

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Ayguadé Parra, Eduard, Ramírez Bellido, Alejandro, and Carpenter, Paul Matthew
Abstract: All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. 
The algorithm also takes account of downstream comp, Postprint (published version)
Published: 2011

241. A low cost split-issue technique to improve performance of SMT clustered VLIW processors

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Gupta, Manoj, Sánchez Carracedo, Fermín, and Llosa Espuny, José Francisco
Abstract: Very Long Instruction Word (VLIW) processors are a popular choice in the embedded domain due to their hardware simplicity, low cost and low power consumption. Simultaneous MultiThreading (SMT) is a popular technique for improving processor performance. To maintain execution semantics, a VLIW instruction needs to be issued in its entirety, which restricts the opportunities in SMT. Split-issue at operation level is a technique that allows issuing a VLIW instruction in parts without breaking execution semantics. Issuing an instruction in parts allows the non-conflicting parts of an instruction to be issued along with other instructions and improves SMT performance. However, implementing split-issue at operation level requires complex structures and is not practical for an embedded VLIW processor. This paper proposes cluster-level split-issue, which implements split-issue at a cluster-level boundary for clustered VLIW processors. Cluster-level split-issue has a very low hardware overhead in contrast to split-issue at operation level. Experimental results show that cluster-level split-issue, despite being more restrictive than split-issue at operation level, achieves similar performance and improves SMT performance significantly., Postprint (published version)
Published: 2010

242. Improving cache Behavior in CMP architectures through cache partitioning techniques

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Valero Cortés, Mateo, Cazorla Almeida, Francisco Javier, and Moretó Planas, Miquel
Abstract: Extraordinary doctoral award, academic year 2009-2010, ICT field. Microprocessor design has changed significantly in the last few decades, moving from simple in-order single-core architectures to superscalar and vector architectures in order to extract the maximum available instruction-level parallelism. Executing several instructions from the same thread in parallel significantly improves the performance of an application. However, there is only a limited amount of parallelism available in each thread, because of data and control dependences. Furthermore, designing a high-performance, single, monolithic processor has become very complex due to power and chip latency constraints. These limitations have motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. Multithreaded processors allow executing different threads at the same time, sharing some hardware resources. There are several flavors of multithreaded processors that exploit TLP, such as chip multiprocessors (CMP), coarse-grain multithreading, fine-grain multithreading, simultaneous multithreading (SMT), and combinations of them. To improve cost and power efficiency, the computer industry has adopted multicore chips. In particular, CMP architectures have become the most common design decision (sometimes combined with multithreaded cores). Firstly, CMPs reduce design costs and average power consumption by promoting design reuse and simpler processor cores. For example, it is less complex to design a chip with many small, simple cores than a chip with fewer, larger, monolithic cores. Furthermore, simpler cores have fewer power-hungry centralized hardware structures. Secondly, CMPs reduce costs by improving hardware resource utilization. On a multicore chip, co-scheduled threads can share costly microarchitecture resources that would otherwise be underutilized. Higher resource utilization improves aggregate performance and enables lower-cost design alternatives. One of the resources, Award-winning, Postprint (published version)
Published: 2010

243. Filtering directory lookups in CMPs

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Bosque, Ana, Viñals Yufera, Víctor, Ibáñez, Pablo, and Llaberia Griñó, José M.
Abstract: Coherence protocols consume a significant fraction of power to determine which coherence action should take place. In this paper we focus on CMPs with a shared cache and a directory-based coherence protocol implemented as a duplicate of the local cache tags. We observe that a large fraction of directory lookups produce a miss, since the block looked up is not cached in any local cache. We propose to add a filter before the directory lookup in order to reduce the number of lookups to this structure. The filter identifies whether the current block was last accessed as data or as an instruction. With this information, looking up the whole directory can be avoided for most accesses. We evaluate the filter in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with a shared L2 cache. We show that a filter with a size of 3% of the tag array of the shared cache can avoid more than 70% of all comparisons performed by directory lookups, with a performance loss of just 0.2% for SPLASH2 and 1.5% for Specweb2005. On average, the number of 15-bit comparisons avoided per cycle is 54 out of 77 for SPLASH2 and 29 out of 41 for Specweb2005. In both cases, the filter requires less than one read of 1 bit per cycle. Postprint (published version)
Published: 2010

244. Implementación y evaluación de la Factorización de Cholesky mediante TBB y threads en arquitecturas multicore

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Otero Calviño, Beatriz, and Bordas Pérez, Francisco
Abstract: This project has four clearly distinct objectives: 1. Propose and implement an algorithm to perform the Cholesky factorization. 2. Optimize the proposed code using programming languages and software optimization techniques that improve the execution time of the Cholesky factorization. 3. Evaluate the suitability of Intel's Threading Building Blocks (TBB) for parallelizing the developed code. 4. Develop a heuristic that determines the best algorithm to run (depending on the order of the matrix) to guarantee maximum performance.
Published: 2010
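
For orientation, the Cholesky factorization named in this record decomposes a symmetric positive-definite matrix A into L·Lᵀ, with L lower triangular. The following is a minimal sequential sketch in Python of the standard column-by-column algorithm. It is not the project's code and does not use TBB or threads; the function name `cholesky` is chosen here for illustration only.

```python
import math


def cholesky(a):
    """Return the lower-triangular factor L with A = L * L^T.

    `a` is a square, symmetric positive-definite matrix given as a list of rows.
    Illustrative sequential version; assumes the input is symmetric.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            dot = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - dot) / l[j][j]
    return l


if __name__ == "__main__":
    A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
    for row in cholesky(A):
        print(row)
```

The rows of column j (the inner `for i` loop) are independent of one another, which is one place where a parallel version could split the work across threads.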

245. Desarrollo de un multiprocesador superescalar in-order en CycleSim

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Villavieja Prados, Carlos, and Álvarez Hernández, Jesús
Abstract: The evolution of computing has been impressive since its birth. A look at the history of computing shows the great changes that have taken place and the large increase in the computational capacity of computers up to the present day. This progress in supply is the result of an insatiable demand for computational power that continues today. Processors have now reached a physical limit that prevents them from evolving according to Moore's Law, forcing microprocessor manufacturers to investigate new techniques and technologies to meet market demand. Given these premises, a project arose to develop a simulator, based on the CycleSim infrastructure, capable of emulating the operation of a microprocessor, with the aim of studying how a processor's performance changes when the configuration of its different elements is modified. The project described in this report is one part of that larger project. Specifically, it focuses on developing an in-order superscalar CPU for the CycleSim infrastructure, starting from an existing initial version. It also increases the number of instructions recognized by the CPU so that more complete traces can be generated for simulation.
Published: 2010

246. Mapping parallel loops on multicore systems

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Tabik, Siham, Romero, Felipe, Utrera Iglesias, Gladys Miriam, and Plata, Oscar
Abstract: The compute nodes in contemporary HPC systems contain one or more multicore processors. As a result, these nodes constitute a shared-memory multiprocessor, often combining CMP and SMT concurrency technologies. This configuration introduces different levels of sharing in the cache hierarchy, resulting in non-uniform data-sharing overheads. In this paper we analyze the data-sharing patterns that a real multithreaded application exhibits when executing on a multicore system, with emphasis on the use of the shared last-level cache (LLC) by the concurrent threads. As a consequence of this study, we explore the loop mapping problem in such systems with the aim of optimizing the shared use of the LLC by all parallel threads. We propose a three-phase loop mapping strategy that deals with workload imbalances, minimizes cache-sharing interference, and maximizes intra-core and inter-core data reuse in the cache hierarchy. Preliminary results show some benefits of our approach. However, this is a work in progress and much more research is being done. Postprint (author’s final draft)
Published: 2010

247. Buffer sizing for self-timed stream programs on heterogeneous distributed memory multiprocessors

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Carpenter, Paul Matthew, Ramírez Bellido, Alejandro, and Ayguadé Parra, Eduard
Abstract: Stream programming is a promising way to expose concurrency to the compiler. A stream program is built from kernels that communicate only via point-to-point streams. The stream compiler statically allocates these kernels to processors, applying blocking, fission and fusion transformations. The compiler determines the sizes of the communication buffers, which affects performance since local memories can be small. In this paper, we propose a feedback-directed algorithm that determines the size of each communication buffer, based on i) the stream program that has been mapped onto processors, ii) feedback from an earlier execution, and iii) the memory constraints. The algorithm exposes a trade-off between throughput and latency. It is general, in that it applies to stream programs with unstructured stream graphs, and it supports variable execution times and communication rates. We show results for the StreamIt benchmarks and random graphs. For the StreamIt benchmarks, throughput is optimal after the first iteration. For random graphs with stochastic computation times, throughput is within 3% of optimal after four iterations. Compared with the previous general algorithm, by Basten and Hoogerbrugge, our algorithm has significantly better performance and latency., Postprint (published version)
Published: 2010

248. The VELOX transactional memory stack

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors, Cristal Kestelman, Adrián, Felber, Pascal, Riviere, Etienne, Moreira, Walter Maldonado, Harmanci, Derin, Marlier, Patrick, Diestelhorst, Stephan, Hohmuth, Michael, Pohlack, Martin, Afek, Yehuda, Tomić, Saša, Drepper, Ulrich, Gramoli, Vincent, Kapalka, Michal, Guerraoui, Rachid, Dragojevic, Aleksandar, Stenstrom, Per, Unsal, Osman Sabri, Hur, Ibrahim, Korland, Guy, Nowack, Martin, Riegel, Torvald, Shavit, Nir, and Fetzer, Christof
Abstract: The transactional memory programming paradigm could become the coordination methodology of choice for current and future multicore and many-core architectures. Transactional memory support spans a complete software and hardware stack, including programming language and hardware support, runtime and libraries, compilers, and application environments. The VELOX project has developed such a comprehensive transactional memory stack. Peer Reviewed, Postprint (published version)
Published: 2010

249. Hardware transactional memory with software-defined conflicts

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Titos Gil, Rubén, Acacio, Manuel E., García, José M., Harris, Tim, Cristal Kestelman, Adrián, Unsal, Osman Sabri, Hur, Ibrahim, and Valero Cortés, Mateo
Abstract: In this paper we propose conflict-defined blocks, a programming language construct that allows programmers to change the concept of conflict from one transaction to another, or even throughout the course of the same transaction. Defining conflicts in software makes it possible to remove dependencies which, though not necessary for the correct execution of the transactions, arise as a result of the coarse synchronization style encouraged by TM. Programmers take advantage of their knowledge about the problem and specify through conflict-defined blocks which types of dependencies are superfluous in a certain part of the transaction, in order to extract more performance out of coarse-grained transactions without having to write minimally synchronized code. Our experiments with several transactional benchmarks reveal that, using software-defined conflicts, the programmer achieves significant reductions in the number of aborted transactions and improves scalability. Peer Reviewed, Postprint (author's final draft)
Published: 2010

250. Adapting cache partitioning algorithms to pseudo-LRU replacement policies

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Kedzierski, Kamil, Moretó Planas, Miquel, Cazorla, Francisco, and Valero Cortés, Mateo
Abstract: Recent studies have shown that cache partitioning is an efficient technique to improve throughput, fairness and Quality of Service (QoS) in CMP processors. The cache partitioning algorithms proposed so far assume Least Recently Used (LRU) as the underlying replacement policy. However, it has been shown that true LRU imposes extraordinary complexity and area overheads when implemented on high-associativity caches, such as last-level caches. As a consequence, current processors available on the market use pseudo-LRU replacement policies, which provide behavior similar to LRU while reducing the hardware complexity. Thus, the LRU-based cache partitioning solutions presented so far cannot be applied to real CMP architectures. This paper proposes a complete partitioning system for caches using the pseudo-LRU replacement policy. In particular, the paper focuses on the pseudo-LRU implementations proposed by Sun Microsystems and IBM, called Not Recently Used (NRU) and Binary Tree (BT), respectively. We propose a high-accuracy profiling logic and a cache partitioning hardware for both schemes. We evaluate our proposals' hardware costs in terms of area and power, and compare them against the LRU partitioning algorithm. Overall, this paper presents two hardware techniques to adapt the existing cache partitioning algorithms to real replacement policies. The results show that our solutions impose negligible performance degradation with respect to LRU. Peer Reviewed, Postprint (published version)
Published: 2010
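
As background for the Binary Tree (BT) pseudo-LRU policy named in this record, the sketch below models the textbook tree-PLRU scheme for a single cache set in Python. It is an illustration under stated assumptions, not the profiling logic or partitioning hardware the paper proposes: the class name `TreePLRU`, its methods, and the bit convention (0 means the next victim is in the left subtree) are all choices made here.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU state for one cache set (illustrative model)."""

    def __init__(self, ways):
        if ways < 2 or ways & (ways - 1) != 0:
            raise ValueError("ways must be a power of two, at least 2")
        self.ways = ways
        # One bit per internal tree node, stored in heap order.
        # Assumed convention: 0 = victim is in the left half, 1 = right half.
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access: point every node on the path away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1  # accessed the left half, evict from the right next
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0  # accessed the right half, evict from the left next
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to the way that should be replaced."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


if __name__ == "__main__":
    plru = TreePLRU(4)
    for _ in range(4):
        v = plru.victim()
        print("evict way", v)
        plru.touch(v)  # the refilled way becomes most recently used
```

A set with N ways needs only N - 1 bits in this scheme, compared with the full ordering that true LRU must track, which is the hardware saving the abstract refers to.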
