Author: "Nemirovsky, Mario" / Search Limiters: Full Text - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Nemirovsky, Mario"' showing total 89 results

1. Disaggregated Computing. An Evaluation of Current Trends for Datacentres

Author: Meyer, Hugo, Sancho, José Carlos, Quiroga, Josue V., Zyulkyarov, Ferad, Roca, Damián, and Nemirovsky, Mario
Published: 2017
Full Text: View/download PDF

2. SABES: statistical available bandwidth estimation from passive TCP measurements

Author: Arcas Abella, Oriol, Montero Banegas, Diego Teodoro, Serral García, René, Ciaccia, Francesco, Romero, Iván, and Nemirovsky, Mario
Abstract: Estimating available network resources is fundamental when adapting the sending rate at both the application and transport layers. Traditional approaches rely either on active probing techniques or on iteratively adapting the average sending rate, as is the case for modern TCP congestion control algorithms. In this paper, we propose a statistical method based on the inter-packet arrival time analysis of TCP acknowledgments to estimate a path's available bandwidth. SABES first estimates the bottleneck link capacity by exploiting the TCP flow's slow-start traffic patterns. Then, a heuristic based on the capacity estimation provides an approximation of the end-to-end available bandwidth. Exhaustive experimentation in both simulations and real-world scenarios was conducted to validate our technique, and our results are promising. Furthermore, we train an artificial neural network to improve the estimation accuracy.
Published: 2020

3. SABES: Statistical Available Bandwidth EStimation from passive TCP measurements

Author: Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Ciaccia, Francesco, Romero Ruiz, Ivan, Arcas Abella, Oriol, Montero Banegas, Diego Teodoro, Serral Gracià, René, and Nemirovsky, Mario
Abstract: Estimating available network resources is fundamental when adapting the sending rate at both the application and transport layers. Traditional approaches rely either on active probing techniques or on iteratively adapting the average sending rate, as is the case for modern TCP congestion control algorithms. In this paper, we propose a statistical method based on the inter-packet arrival time analysis of TCP acknowledgments to estimate a path's available bandwidth. SABES first estimates the bottleneck link capacity by exploiting the TCP flow's slow-start traffic patterns. Then, a heuristic based on the capacity estimation provides an approximation of the end-to-end available bandwidth. Exhaustive experimentation in both simulations and real-world scenarios was conducted to validate our technique, and our results are promising. Furthermore, we train an artificial neural network to improve the estimation accuracy. This work was supported by the grant 2015DI023 as part of the Industrial PhD grants of AGAUR and Generalitat de Catalunya. Project co-financed by the Spanish Ministry of Ciencia, Innovación y Universidades with reference RTC-2017-6655-7 (FEDER). Peer Reviewed. Postprint (author's final draft).
Published: 2020

4. Definition of new WAN paradigms enabled by smart measurements

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Serral Gracià, René, Nemirovsky, Mario, and Ciaccia, Francesco
Abstract: Nowadays, massive amounts of data are being moved over the Internet thanks to data-hungry applications, Big Data, and multimedia content. Combined with reduced cost and increased reliability of high-speed broadband access, this means the whole Internet infrastructure is facing new challenges, especially when information crosses long geographical distances. That is the case for Wide Area Networks (WANs), which are typically traversed in enterprises with multi-site deployments. When a connection is established between geographically distant end-points over a path with high latency and high bandwidth, data flows over a so-called Long Fat Network. Currently, transport protocols in end-points are not able to exploit the resources of such links; notably, the most common TCP implementations still suffer from design flaws that limit their efficiency. More recent developments still suffer from low fairness in resource sharing and lack of global visibility. We identify SD-WAN as an SDN use case that can enable the adoption of new transport protocols, improving traffic behavior over WANs without the need to modify the end-points. In this thesis, we explore new approaches to network measurements that will enable both end-points and SD-WAN edge routers to gain visibility over the end-to-end network status. Such additional visibility promotes the development of smarter control mechanisms for network traffic. The preliminary study carried out covers TCP behavior over WANs and existing methodologies to control its traffic patterns and enforce rate throttling. We also identify a specific use case that poses challenges for WAN scenarios: the Split TCP connections in a Performance Enhancing Proxy (PEP). New control mechanisms to improve resource utilization and fairness are defined in this project.
Specifically, we propose a new approach called Receive Window Modulation (RWM) that allows edge routers to control the sending rate of a TCP connection by modifying the window advertised by the receiver. We show that this controller can improve the efficiency… Postprint (published version).
Published: 2020

5. Evaluating University-Business Collaboration at Science Parks: a Business Perspective

Author: Universitat Politècnica de Catalunya. Doctorat en Administració i Direcció d'Empreses, Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, Universitat Politècnica de Catalunya. KEMLG - Grup d'Enginyeria del Coneixement i Aprenentatge Automàtic, Olvera Herrera, Claudia, Piqué, Josep Ma, Cortés García, Claudio Ulises, and Nemirovsky, Mario
Abstract: Evaluating the performance of companies at University Science Parks (SPs) is essential for identifying the companies' needs and the feasibility of University-Business Collaboration (UBC). The companies' real needs are also of interest to universities and SPs, since they face the challenge of designing strategies that help them transfer knowledge more effectively. This research article focuses on Key Performance Indicators (KPIs) in UBC, and on the needs and business objectives of companies co-located at SPs in Spain and Mexico. This article (i) aims to identify the KPIs in UBC used by co-located companies at SPs, and (ii) explores the KPIs in UBC and the critical success factors of SPs. This article focuses on the perspective of companies, with a secondary focus on the perspectives of SPs and universities. For this study, data was collected through online company surveys in Spain and Mexico. Postprint (published version).
Published: 2020

6. Advances in the Hierarchical Emergent Behaviors (HEB) approach to autonomous vehicles

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Roca, Damian, Milito, Rodolfo, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: Widespread deployment of autonomous vehicles (AVs) presents formidable challenges in terms of handling scalability and complexity, particularly regarding vehicular reaction in the face of unforeseen corner cases. Hierarchical Emergent Behaviors (HEB) is a scalable architecture based on the concepts of emergent behaviors and hierarchical decomposition. It relies on a few simple but powerful rules to govern local vehicular interactions. Rather than requiring prescriptive programming of every possible scenario, HEB’s approach relies on global behaviors induced by the application of these local, well-understood rules. Our first two papers on HEB focused on a primal set of rules applied at the first hierarchical level. On the path to systematizing a solid design methodology, this paper proposes additional rules for the second level, studies through simulations the resultant richer set of emergent behaviors, and discusses the communication mechanisms between the different levels. Peer Reviewed. Postprint (author's final draft).
Published: 2020

7. HIRE: Hidden Inter-packet Red-shift Effect

Author: Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Ciaccia, Francesco, Romero Ruiz, Ivan, Serral Gracià, René, and Nemirovsky, Mario
Abstract: Over the years, different techniques have been proposed to detect the bottleneck bandwidth and available bandwidth of an end-to-end path. However, to the authors' knowledge, no work has been conducted on detecting which link or node on the path could be the narrow link. In this paper, we present a novel technique based on packet-pair dispersion analysis, whose objective is twofold. First, it estimates the narrow link capacity using a new approach that takes into account both inter-packet time and packet propagation delay. Second, it infers the specific hop in the end-to-end path that represents the narrow link. This is achieved by injecting packet trains with intermediate TTL-expiring packets, which decrease the train rate when they cross the narrow link (red-shift effect). We validate our approach in simulations, showing the tool's robustness in very complex scenarios. This work was supported by the Industrial PhD grant 2015DI023 of AGAUR and Gencat and the project Efficient Smart Multi Connected Networks, co-financed by the Spanish Ministry of Ciencia, Innovación y Universidades with reference RTC-2017-6655-7, the Spanish Agencia Estatal de Investigación and the European Regional Development Fund (FEDER). Peer Reviewed. Postprint (author's final draft).
Published: 2020

8. Evaluating University-Business Collaboration at Science Parks: a Business Perspective

Author: Olvera, Claudia, Piqué, Josep M., Cortés, Ulises, and Nemirovsky, Mario
Published: 2020
Full Text: View/download PDF

9. Improving TCP performance and reducing self-induced congestion with receive window modulation

Author: Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Ciaccia, Francesco, Arcas Abella, Oriol, Montero Banegas, Diego Teodoro, Romero Ruiz, Ivan, Milito, Rodolfo, Serral Gracià, René, and Nemirovsky, Mario
Abstract: We present a control module for software edge routers called Receive Window Modulation - RWM. Its main objective is to mitigate what we define as self-induced congestion: the result of traffic emission patterns at the source that cause buffering and packet losses in any of the intermediate routers along the path between the connection's endpoints. The controller modifies the receiver's TCP advertised window to match the computed bandwidth-delay product, based on the connection round-trip time estimation and the bandwidth locally available at the edge router. The implemented controller does not need any endpoint modification, allowing it to be deployed in corporate edge routers, increasing visibility and control capabilities. This scheme, when used in real-world experiments with loss-based congestion control algorithms such as CUBIC, is shown to optimize access link utilization and per-connection goodput, and to reduce latency variability and packet losses. This work was supported by the grant 2015DI023 in the framework of the Industrial PhD grants of AGAUR and Generalitat de Catalunya. Peer Reviewed. Postprint (author's final draft).
Published: 2019

10. Improving TCP performance and reducing self-induced congestion with receive window modulation

Author: Montero Banegas, Diego Teodoro, Nemirovsky, Mario Daniel, Serral Graciá, René, Romero, Ivan, Arcas Abella, Oriol, Milito, Rodolfo, and Ciaccia, Francesco
Abstract: We present a control module for software edge routers called Receive Window Modulation - RWM. Its main objective is to mitigate what we define as self-induced congestion: the result of traffic emission patterns at the source that cause buffering and packet losses in any of the intermediate routers along the path between the connection's endpoints. The controller modifies the receiver's TCP advertised window to match the computed bandwidth-delay product, based on the connection round-trip time estimation and the bandwidth locally available at the edge router. The implemented controller does not need any endpoint modification, allowing it to be deployed in corporate edge routers, increasing visibility and control capabilities. This scheme, when used in real-world experiments with loss-based congestion control algorithms such as CUBIC, is shown to optimize access link utilization and per-connection goodput, and to reduce latency variability and packet losses.
Published: 2019

11. The effectiveness of knowledge and technology transfer through university-business collaboration in science parks

Author: Universitat Politècnica de Catalunya. Departament d'Organització d'Empreses, Cortés García, Claudio Ulises, Nemirovsky, Mario, and Olvera, Claudia
Abstract: Science and Technology Parks (STPs) facilitate the flow of knowledge and technology among universities, R&D institutions, companies and markets, and foster the creation and growth of innovation-based companies. Among the variety of STPs, it is possible to identify two types: (i) Science Parks (SPs), which involve university shareholding, and (ii) Technology Parks (TPs), which are not owned by universities. This study takes into account only SPs, since they are closely linked to the university and are the bridge between a university and companies in the process of Knowledge and Technology Transfer (KTT). Evaluating the firms' performance in Science Parks is crucial for identifying the needs of the companies and the feasibility of University-Business Collaboration (UBC). The firms' real needs are also of interest to universities and Science Parks, since they face the challenge of designing strategies that help them transfer knowledge more effectively. While previous studies have focused on tenants' innovation performance on-Park and off-Park, very little research has taken into account the heterogeneity of Parks, which may affect the firms' performance. This research paper focuses on SPs in Spain and Mexico due to data availability. This paper (i) aims to identify the Key Performance Indicators (KPIs) in UBC used by co-located companies at SPs, and (ii) explores the performance measures (KPIs) in UBC and the critical success factors of SPs. For this study, data was collected through fifty-eight online company surveys in Spain and forty-two in Mexico.
The online surveys were sent to nine SPs in Mexico and Spain. This empirical analysis also uses fourteen semi-structured interviews with SP directors to explore KPIs and success factors of SPs in both countries. Postprint (published version).
Published: 2019

12. Evaluating University-Business Collaboration at Science Parks: a Business Perspective.

Author: Olvera, Claudia, Piqué, Josep M., Cortés, Ulises, and Nemirovsky, Mario
Subjects: BUSINESS parks, KNOWLEDGE transfer, RESEARCH parks, COOPERATIVE research, CRITICAL success factor, KEY performance indicators (Management)
Abstract: Copyright of Triple Helix is the property of Brill Academic Publishers and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Published: 2021
Full Text: View/download PDF

13. Tackling IoT ultra large scale systems: Fog computing in support of hierarchical emergent behaviors

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Roca, Damian, Milito, Rodolfo, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: The Internet of Things (IoT) marks a phase transition in the evolution of the Internet, distinguished by massive connectivity and interaction with the physical world. The organic evolution of IoT requires the consideration of three dimensions: scale, organization, and context. These dimensions are particularly relevant in Ultra Large Scale Systems (ULSS), of which autonomous vehicles are a prime example. Fog Computing is well positioned to support contextual awareness and communication, which are critical for ULSS. The design and orchestration of ULSS require fresh approaches and new organizing principles. A recent paper proposed Hierarchical Emergent Behaviors (HEB), an architecture that builds on established concepts of emergent behaviors and hierarchical decomposition and organization. HEB’s local rules induce emergent behaviors, i.e., useful behaviors not explicitly programmed. In this chapter we take a first step to validate HEB concepts through the study of two basic self-driven car “primitives”: exiting a platoon formation, and maneuvering in anticipation of obstacles beyond the range of on-board sensors. Fog nodes provide the critical contextual information required. Damian Roca's work was supported by a Doctoral Scholarship provided by Fundación La Caixa. This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493) and by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P). Peer Reviewed. Postprint (author's final draft).
Published: 2018

14. A general guide to applying machine learning to computer architecture

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Nemirovsky, Daniel, Arkose, Tugberk, Markovic, Nikola, Nemirovsky, Mario, Unsal, Osman Sabri, Cristal Kestelman, Adrián, and Valero Cortés, Mateo
Abstract: The resurgence of machine learning since the late 1990s has been enabled by significant advances in computing performance and the growth of big data. The ability of these algorithms to detect complex patterns in data that are extremely difficult to identify manually helps to produce effective predictive models. Whilst computer architects have been accelerating the performance of machine learning algorithms with GPUs and custom hardware, there have been few implementations leveraging these algorithms to improve computer system performance. The work that has been conducted, however, has produced considerably promising results. The purpose of this paper is to serve as a foundational base and guide to future computer architecture research seeking to make use of machine learning models for improving system efficiency. We describe a method that highlights when, why, and how to utilize machine learning models for improving system performance and provide a relevant example showcasing the effectiveness of applying machine learning in computer architecture. We describe a process of data generation every execution quantum and parameter engineering. This is followed by a survey of a set of popular machine learning models. We discuss their strengths and weaknesses and provide an evaluation of implementations for the purpose of creating a workload performance predictor for different core types in an x86 processor. The predictions can then be exploited by a scheduler for heterogeneous processors to improve the system throughput. The algorithms of focus are stochastic gradient descent based linear regression, decision trees, random forests, artificial neural networks, and k-nearest neighbors. This work has been supported by the European Research Council (ERC) Advanced Grant RoMoL (Grant Agreement 321253) and by the Spanish Ministry of Science and Innovation (contract TIN 2015-65316P). Peer Reviewed. Postprint (published version).
Published: 2018

15. Analysis and simulation of emergent architectures for internet of things

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Nemirovsky, Mario, Valero Cortés, Mateo, and Roca Marí, Damián
Abstract: The Internet of Things (IoT) promises a plethora of new services and applications supported by a wide range of devices that includes sensors and actuators. To reach its potential, IoT must break down the silos that limit applications' interoperability and hinder their manageability. These silos result from existing deployment techniques where each vendor sets up its own infrastructure, duplicating the hardware and increasing the costs. Fog Computing can serve as the underlying platform to support IoT applications, thus avoiding the silos. Each application becomes a system formed by IoT devices (e.g., sensors and actuators), an edge infrastructure (e.g., Fog Computing) and the Cloud. In order to improve several aspects of human lives, different systems can interact to correlate data, obtaining functionalities not achievable by any of the systems in isolation. We can then analyze the IoT as a whole system rather than a conjunction of isolated systems. Doing so leads to the building of Ultra-Large Scale Systems (ULSS), an extension of the concept of Systems of Systems (SoS), in several verticals including Autonomous Vehicles, Smart Cities, and Smart Grids. The scope of ULSS is large in the number of things and complex in the variety of applications, volume of data, and diversity of communication patterns. To handle this scale and complexity, in this thesis we propose Hierarchical Emergent Behaviors (HEB), a paradigm that builds on the concepts of emergent behavior and hierarchical organization. Rather than explicitly programming all possible situations in the vast space of ULSS scenarios, HEB relies on emergent behaviors induced by local rules that define the interactions of the "things" among themselves and with their environment. We discuss the modifications to classical IoT architectures required by HEB, as well as the new challenges.
Once these challenges, such as scalability and manageability, are addressed, we can illustrate HEB's usefulness dealing with an IoT-based U… Postprint (published version).
Published: 2018

16. iQ: an efficient and flexible queue-based simulation framework

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Roca, Damian, Nemirovsky, Daniel, Casas, Marc, Moreto Planas, Miquel, Valero Cortés, Mateo, Nemirovsky, Mario, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Roca, Damian, Nemirovsky, Daniel, Casas, Marc, Moreto Planas, Miquel, Valero Cortés, Mateo, and Nemirovsky, Mario
Abstract: Conventional system simulators are readily used by computer architects to design and evaluate their processor designs. These simulators provide reasonable levels of accuracy and execution detail but suffer from long simulation latencies and increased implementation complexity. In this work we propose iQ, a queue-based modeling technique that targets design space exploration and optimization studies at the core component level. iQ emulates processor elements by abstracting the implementation details into modular components composed of queue structures, delay parameters, probabilistic driven message generation and event control. Its easy reconfigurability makes iQ a highly flexible and powerful processor simulator. We have used iQ to build an Ivy Bridge and a Core 2 Duo processor model and have validated them against real hardware running SPEC CPU2006 Int, achieving average error rates of 9.55% and 8.93%., The authors would like to thank Mauricio Breternitz, Rodolfo Milito, and Vasilis Karakostas for their helpful reviews. Damian Roca's work was supported by a Doctoral Scholarship provided by Fundación La Caixa. This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493) and by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P)., Peer Reviewed, Postprint (author's final draft)
Published: 2017

17. Fog function virtualization: A flexible solution for IoT applications

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Roca, Damian, Quiroga, Josue V., Valero Cortés, Mateo, and Nemirovsky, Mario
Abstract: Internet of Things applications must carefully assess certain crucial factors such as the real-time and largely distributed nature of the “things”. Fog Computing provides an architecture to satisfy those requirements through nodes located from near the “things” to the edge. The problem comes with the integration of the Fog nodes into current infrastructures. This process requires the development of complex software solutions and prevents Fog growth. In this paper we propose three innovations to enhance Fog: (i) a new orchestration policy, (ii) the creation of constellations of nodes, and (iii) Fog Function Virtualization (FFV). Together, these will complement Fog to reach its true potential as a generic scalable platform, running multiple IoT applications simultaneously. Deploying a new service is reduced to the development of the application code, which democratizes the Fog Computing paradigm through ease of deployment and cost reduction., The authors thank Rodolfo Milito for his insightful comments and revisions. Damian Roca's work was supported by a Doctoral Scholarship provided by Fundación La Caixa. Josue V. Quiroga's work was supported by a Doctoral Scholarship provided by the Mexican National Council of Science and Technology (CONACyT). This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493) and by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P)., Peer Reviewed, Postprint (author's final draft)
Published: 2017

18. Disaggregated Computing. An Evaluation of Current Trends for Datacentres

Author: Barcelona Supercomputing Center, Meyer, Hugo, Sancho, Jose C., Quiroga, Josue V., Zyulkyarov, Ferad, Roca, Damian, and Nemirovsky, Mario
Abstract: Next generation data centers will likely be based on the emerging paradigm of disaggregated function-blocks-as-a-unit, departing from the current state of mainboard-as-a-unit. Multiple functional blocks or bricks such as compute, memory and peripheral will be spread through the entire system and interconnected via one or multiple high speed networks. The amount of memory available will be very large, distributed among multiple bricks. This new architecture brings various benefits that are desirable in today’s data centers such as fine-grained technology upgrade cycles, fine-grained resource allocation, and access to a larger amount of memory and accelerators. An analysis of the impact and benefits of memory disaggregation is presented in this paper. One of the biggest challenges when analyzing these architectures is that memory accesses should be modeled correctly in order to obtain accurate results. However, modeling every memory access would generate a high overhead that can make the simulation infeasible for real data center applications. A model to represent and analyze memory disaggregation has been designed and a statistics-based queuing-based full system simulator was developed to rapidly and accurately analyze application performance in disaggregated systems. With a mean error of 10%, simulation results pointed out that the network layers may introduce overheads that degrade applications’ performance by up to 66%. Initial results also suggest that low memory access bandwidth may degrade applications’ performance by up to 20%., This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 687632 (dReDBox project) and TIN2015-65316-P - Computacion de Altas Prestaciones VII., Peer Reviewed, Postprint (published version)
Published: 2017

19. Scalability of Broadcast Performance in Wireless Network-on-Chip

Author: Abadal, Sergi, primary, Mestres, Albert, additional, Nemirovsky, Mario, additional, Lee, Heekwan, additional, Gonzalez, Antonio, additional, Alarcon, Eduard, additional, and Cabellos-Aparicio, Albert, additional
Published: 2016
Full Text: View/download PDF

20. Emergent Behaviors in the Internet of Things: The Ultimate Ultra-Large-Scale System

Author: Roca, Damian, primary, Nemirovsky, Daniel, additional, Nemirovsky, Mario, additional, Milito, Rodolfo, additional, and Valero, Mateo, additional
Published: 2016
Full Text: View/download PDF

21. Scalability of broadcast performance in wireless network-on-chip

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. CBA - Sistemes de Comunicacions i Arquitectures de Banda Ampla, Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors, Universitat Politècnica de Catalunya. EPIC - Energy Processing and Integrated Circuits, Abadal Cavallé, Sergi, Mestres Sugrañes, Albert, Nemirovsky, Mario, Lee, Heekwan, González Colás, Antonio María, Alarcón Cot, Eduardo José, and Cabellos Aparicio, Alberto
Abstract: Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC., Peer Reviewed, Postprint (published version)
Published: 2016

22. Thread assignment in multicore/multithreaded processors: A statistical approach

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Radojković, Petar, Carpenter, Paul Matthew, Moretó Planas, Miquel, Cakarevic, Vladimir, Verdú Mulà, Javier, Pajuelo González, Manuel Alejandro, Cazorla Almeida, Francisco Javier, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works., The introduction of multicore/multithreaded processors, comprised of a large number of hardware contexts (virtual CPUs) that share resources at multiple levels, has made process scheduling, in particular assignment of running threads to available hardware contexts, an important aspect of system performance. Nevertheless, thread assignment of applications running on state-of-the art processors is an NP-complete problem. Over the years, numerous studies have proposed heuristic-based algorithms for thread assignment. Since the thread assignment problem is intractable, it is in general impossible to know the performance of the optimal assignment, so the room for improvement of a given algorithm is also unknown. It is therefore hard to decide whether to invest more effort and time to improve an algorithm that may already be close to optimal. In this paper, we present a statistical approach to the thread assignment problem. First, we present a method that predicts the performance of the optimal thread assignment, based on the observed performance of each thread assignment in a random sample. The method is based on Extreme Value Theory (EVT), a branch of statistics that analyses extreme deviations from the population mean. We also propose sample pruning, a method that significantly reduces the time required to apply the statistical method by reducing the number of candidate solutions that need to be measured. 
Finally, we show that, if no suitable heuristic-based algorithm is available, a sample of several thousand random thread assignments is enough to obtain, with high confidence, an assignment with performance close to optimal. The presented approach is architecture and application independent, and it can be used to address the thread assignment problem in various domains. It is especially well suited for systems in which the workload seldom changes. An example is network systems, which typically provide a constant set of services that are known in advance, with network, This work has been supported by the Spanish Ministry of Science and Innovation under grant TIN2012-34557, the HiPEAC Network of Excellence, and by the European Research Council under the European Union’s 7th FP, ERC Grant Agreement number 321253. Miquel Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047., Peer Reviewed, Postprint (author's final draft)
Published: 2016

23. Thread Assignment in Multicore/Multithreaded Processors: A Statistical Approach

Author: Radojkovic, Petar, Carpenter, Paul M., Moreto, Miquel, Cakarevic, Vladimir, Verdu, Javier, Pajuelo, Alex, Cazorla, Francisco J., Nemirovsky, Mario, and Valero, Mateo
Abstract: The introduction of multicore/multithreaded processors, comprised of a large number of hardware contexts (virtual CPUs) that share resources at multiple levels, has made process scheduling, in particular assignment of running threads to available hardware contexts, an important aspect of system performance. Nevertheless, thread assignment of applications running on state-of-the art processors is an NP-complete problem. Over the years, numerous studies have proposed heuristic-based algorithms for thread assignment. Since the thread assignment problem is intractable, it is in general impossible to know the performance of the optimal assignment, so the room for improvement of a given algorithm is also unknown. It is therefore hard to decide whether to invest more effort and time to improve an algorithm that may already be close to optimal. In this paper, we present a statistical approach to the thread assignment problem. First, we present a method that predicts the performance of the optimal thread assignment, based on the observed performance of each thread assignment in a random sample. The method is based on Extreme Value Theory (EVT), a branch of statistics that analyses extreme deviations from the population mean. We also propose sample pruning, a method that significantly reduces the time required to apply the statistical method by reducing the number of candidate solutions that need to be measured. Finally, we show that, if no suitable heuristic-based algorithm is available, a sample of several thousand random thread assignments is enough to obtain, with high confidence, an assignment with performance close to optimal. The presented approach is architecture and application independent, and it can be used to address the thread assignment problem in various domains. It is especially well suited for systems in which the workload seldom changes. 
An example is network systems, which typically provide a constant set of services that are known in advance, with network
Published: 2016

24. Emergent behaviors in the Internet of things: The ultimate ultra-large-scale system

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Roca, Damian, Nemirovsky, Daniel, Nemirovsky, Mario, Milito, Rodolfo, and Valero Cortés, Mateo
Abstract: To reach its potential, the Internet of Things (IoT) must break down the silos that limit applications' interoperability and hinder their manageability. Doing so leads to the building of ultra-large-scale systems (ULSS) in several areas, including autonomous vehicles, smart cities, and smart grids. The scope of ULSS is both large and complex. Thus, the authors propose Hierarchical Emergent Behaviors (HEB), a paradigm that builds on the concepts of emergent behavior and hierarchical organization. Rather than explicitly programming all possible decisions in the vast space of ULSS scenarios, HEB relies on the emergent behaviors induced by local rules at each level of the hierarchy. The authors discuss the modifications to classical IoT architectures required by HEB, as well as the new challenges. They also illustrate the HEB concepts in reference to autonomous vehicles. This use case paves the way to the discussion of new lines of research., Damian Roca's work was supported by a Doctoral Scholarship provided by Fundación La Caixa. This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493) and by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P)., Peer Reviewed, Postprint (author's final draft)
Published: 2016

25. Scalability of Broadcast Performance in Wireless Network-on-Chip

Author: Barcelona Supercomputing Center, Abadal Cavallé, Sergi, Mestres Sugrañes, Albert, Nemirovsky, Mario, Lee, Heekwan, Gonzalez, Antonio, Alarcón Cot, Eduardo José, and Cabellos Aparicio, Alberto
Abstract: Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC., The authors gratefully acknowledge support from INTEL’s Doctoral Student Honor Program, as well as from Samsung’s Global Research Outreach (GRO) program. This work has been also partially supported by the Catalan Government through a FI-AGAUR grant and by the Spanish State Ministry of Economy and Competitiveness under grant aid PCIN-2015-012., Peer Reviewed, Postprint (author's final draft)
Published: 2016

26. Improving the performance and energy-efficiency of virtual memory

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Nemirovsky, Mario, Ünsal, Osman, Cristal Kestelman, Adrián, and Karakostas, Vasileios
Abstract: Virtual memory improves programmer productivity, enhances process security, and increases memory utilization. However, virtual memory requires an address translation from the virtual to the physical address space on every memory operation. Page-based implementations of virtual memory divide physical memory into fixed size pages, and use a per-process page table to map virtual pages to physical pages. The hardware key component for accelerating address translation is the Translation Lookaside Buffer (TLB), that holds recently used mappings from the virtual to the physical address space. However, address translation still incurs high (i) performance overheads due to costly page table walks after TLB misses, and (ii) energy overheads due to frequent TLB lookups on every memory operation. This thesis quantifies these overheads and proposes techniques to mitigate them. In this thesis we argue that fixed size page-based approaches for address translation exhibit limited potential for improving TLB performance because they increase the TLB reach by a fixed amount. To overcome the limitations of such approaches, we introduce the concept of range translations and we show how they can significantly improve the performance and energy-efficiency of address translation. We first comprehensively quantify the address translation performance overhead on a collection of emerging scale-out applications. We show that address translation accounts for up to 16% of the total execution time. We find that huge pages may improve the application performance by reducing the time spent in page walks, enabling better exploitation of the available execution resources. However, the limited hardware support for huge pages in combination with the workloads' low memory locality leave ample space for performance optimizations. To reduce the performance overheads of address translation, we propose Redundant Memory Mappings (RMM). 
RMM provides an efficient alternative representation of many virtual-to, Virtual memory increases programmer productivity, provides process security, and increases memory utilization. However, virtual memory requires an address translation between the virtual and physical address spaces on every memory operation. The paged implementation of virtual memory divides physical memory into fixed-size pages. The main component for accelerating address translation is the TLB (Translation Lookaside Buffer). However, address translation carries a high performance cost, due to the need to walk the page table after a TLB miss, and an energy cost due to the frequent TLB lookups (one per memory operation). In this thesis we argue that page-based translation mechanisms have limited potential to increase TLB performance, mainly because the set of addresses that the TLB can translate can only be increased by a limited amount. To overcome these limitations, we introduce the concept of range translations and show how this mechanism can significantly improve the performance and energy efficiency of address translation. First, we quantify the performance loss due to translation in emerging applications that scale well when more processors are added. We show that in these applications address translation is responsible for up to 16% of the execution time. In addition, we show that large pages can improve application performance, enabling better use of the available resources. However, the limited hardware support for large pages, combined with workloads with low locality, leaves ample room for optimization.
To reduce the performance costs of address translation, we propose RMM (Redundant Memory Mappings). RMM is based on ranges of pages and off, Postprint (published version)
Published: 2016

27. Range Translations for Fast Virtual Memory

Author: Gandhi, Jayneel, primary, Karakostas, Vasileios, additional, Ayar, Furkan, additional, Cristal, Adrian, additional, Hill, Mark D., additional, McKinley, Kathryn S., additional, Nemirovsky, Mario, additional, Swift, Michael M., additional, and Unsal, Osman S., additional
Published: 2016
Full Text: View/download PDF

28. Thread Assignment in Multicore/Multithreaded Processors: A Statistical Approach

Author: Radojkovic, Petar, primary, Carpenter, Paul M., additional, Moreto, Miquel, additional, Cakarevic, Vladimir, additional, Verdu, Javier, additional, Pajuelo, Alex, additional, Cazorla, Francisco J., additional, Nemirovsky, Mario, additional, and Valero, Mateo, additional
Published: 2016
Full Text: View/download PDF

29. Broadcast-enabled massive multicore architectures: a wireless RF approach

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. CBA - Sistemes de Comunicacions i Arquitectures de Banda Ampla, Universitat Politècnica de Catalunya. EPIC - Energy Processing and Integrated Circuits, Abadal Cavallé, Sergi, Sheinman, Benny, Katz, Oded, Markish, Ofer, Elad, Danny, Fournier, Yvan, Roca, Damian, Hanzich, Mauricio, Houzeaux, Guillaume, Nemirovsky, Mario, Alarcón Cot, Eduardo José, and Cabellos Aparicio, Alberto
Abstract: Broadcast traditionally has been regarded as a prohibitive communication transaction in multiprocessor environments. Nowadays, such a constraint largely drives the design of architectures and algorithms all-pervasive in diverse computing domains, directly and indirectly leading to diminishing performance returns as the many-core era is approaching. Novel interconnect technologies could help revert this trend by offering, among others, improved broadcast support, even in large-scale chip multiprocessors. This article outlines the prospects of wireless on-chip communication technologies pointing toward low-latency (a few cycles) and energy-efficient broadcast (a few picojoules per bit). It also discusses the challenges and potential impact of adopting these technologies as key enablers of unconventional hardware architectures and algorithmic approaches, in the pathway of significantly improving the performance, energy efficiency, scalability, and programmability of many-core chips., Peer Reviewed, Postprint (author's final draft)
Published: 2015

30. Virtualized security at the network edge: a user-centric approach

Author: Kuusijarvi, Jarkko, Sassu, Roberto, Montero Banegas, Diego Teodoro, Lioy, Antonio, Basile, Cataldo, Serral Gracià, René, Risso, Fulvio, Ciaccia, Francesco, Jacquin, Ludovic, Georgiades, Michael, Shaw, Adrian, Charalambides, Savvas, Bosco, Francesca, Nemirovsky, Mario, Yannuzzi, Marcelo, and Pastor, Antonio
Abstract: The current device-centric protection model against security threats has serious limitations. On one hand, the proliferation of user terminals such as smartphones, tablets, notebooks, smart TVs, game consoles, and desktop computers makes it extremely difficult to achieve the same level of protection regardless of the device used. On the other hand, when various users share devices (e.g., parents and kids using the same devices at home), the setup of distinct security profiles, policies, and protection rules for the different users of a terminal is far from trivial. In light of this, this article advocates for a paradigm shift in user protection. In our model, protection is decoupled from users' terminals, and it is provided by the access network through a trusted virtual domain. Each trusted virtual domain provides unified and homogeneous security for a single user irrespective of the terminal employed. We describe a user-centric model where nontechnically savvy users can define their own profiles and protection rules in an intuitive way. We show that our model can harness the virtualization power offered by next-generation access networks, especially from network functions virtualization in the points of presence at the edge of telecom operators. We also analyze the distinctive features of our model, and the challenges faced based on the experience gained in the development of a proof of concept.
Published: 2015

31. Networking challenges and prospective impact of broadcast-oriented wireless networks-on-chip

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. CBA - Sistemes de Comunicacions i Arquitectures de Banda Ampla, Universitat Politècnica de Catalunya. EPIC - Energy Processing and Integrated Circuits, Abadal Cavallé, Sergi, Nemirovsky, Mario, Alarcón Cot, Eduardo José, and Cabellos Aparicio, Alberto
Abstract: The cost of broadcast has been constraining the design of manycore processors and of the algorithms that run upon them. However, as on-chip RF technologies allow the design of small-footprint and high-bandwidth antennas and transceivers, native low-latency (a few clock cycles) and low-power (a few pJ/bit) broadcast support through wireless communication can be envisaged. In this paper, we analyze the main networking design aspects and challenges of Broadcast-oriented Wireless Network-on-Chip (BoWNoC), which are basically reduced to the development of Medium Access Control (MAC) protocols able to handle hundreds of cores. We evaluate the broadcast performance and scalability of different MAC designs, to then discuss the impact that the proposed paradigm could exert on the performance, scalability and programmability of future manycore architectures, programming models and parallel algorithms., Peer Reviewed, Postprint (published version)
Published: 2015

32. NEMsCAM: A novel CAM cell based on nano-electro-mechanical switch and CMOS for energy efficient TLBs

Author: Barcelona Supercomputing Center, Seyedi, Azam, Karakostas, Vasileios, Cosemans, Stefan, Cristal Kestelman, Adrián, Nemirovsky, Mario, and Unsal, Osman
Abstract: In this paper we propose a novel Content Addressable Memory (CAM) cell, NEMsCAM, based on both Nano-electro-mechanical (NEM) switches and CMOS technologies. The memory component of the proposed CAM cell is designed with two complementary non-volatile NEM switches and located on top of the CMOS-based comparison component. As a use case for the NEMsCAM cell, we design first-level data and instruction Translation Lookaside Buffers (TLBs) with 16nm CMOS technology at 2GHz. The simulations show that the NEMsCAM TLB reduces the energy consumption per search operation (by 27%), write operation (by 41.9%) and standby mode (by 53.9%), and the area (by 40.5%) compared to a CMOS-only TLB with minimal performance overhead., We thank all anonymous reviewers for their insightful comments. This work is supported in part by the European Union (FEDER funds) under contract TIN2012-34557, and the European Union’s Seventh Framework Programme (FP7/2007-2013) under the ParaDIME project (GA no. 318693), Postprint (author's final draft)
Published: 2015

33. Virtualized security at the network edge: a user-centric approach

Author: Montero, Diego, primary, Yannuzzi, Marcelo, additional, Shaw, Adrian, additional, Jacquin, Ludovic, additional, Pastor, Antonio, additional, Serral-Gracia, Rene, additional, Lioy, Antonio, additional, Risso, Fulvio, additional, Basile, Cataldo, additional, Sassu, Roberto, additional, Nemirovsky, Mario, additional, Ciaccia, Francesco, additional, Georgiades, Michael, additional, Charalambides, Savvas, additional, Kuusijarvi, Jarkko, additional, and Bosco, Francesca, additional
Published: 2015
Full Text: View/download PDF

34. High level queuing architecture model for high-end processors

Author: Nemirovsky, Mario, Moreto Planas, Miquel, and Roca Marí, Damián
Abstract: We have developed a new kind of simulator based on queue models and statistical methods. It allows fast and accurate simulation, which makes it particularly useful for fast design space exploration. We have validated the model against a real chip, the Intel Ivy Bridge processor.
Published: 2014

35. Key ingredients in an IoT recipe: fog computing, cloud computing, and more fog computing

Author: Yannuzzi, Marcelo, Serral Gracià, René, Nemirovsky, Mario, Montero Banegas, Diego Teodoro, and Milito, Rodolfo A.
Abstract: This paper examines some of the most promising and challenging scenarios in IoT, and shows why current compute and storage models confined to data centers will not be able to meet the requirements of many of the applications foreseen for those scenarios. Our analysis is particularly centered on three interrelated requirements: 1) mobility; 2) reliable control and actuation; and 3) scalability, especially, in IoT scenarios that span large geographical areas and require real-time decisions based on data analytics. Based on our analysis, we expose the reasons why Fog Computing is the natural platform for IoT, and discuss the unavoidable interplay of the Fog and the Cloud in the coming years. In the process, we review some of the technologies that will require considerable advances in order to support the applications that the IoT market will demand.
Published: 2014

36. On the area and energy scalability of wireless network-on-chip: a model-based benchmarked design space exploration

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. CBA - Sistemes de Comunicacions i Arquitectures de Banda Ampla, Universitat Politècnica de Catalunya. EPIC - Energy Processing and Integrated Circuits, Abadal Cavallé, Sergi, Iannazzo Soteras, Mario Enrique, Nemirovsky, Mario, Cabellos Aparicio, Alberto, Lee, Heekwan, and Alarcón Cot, Eduardo José
Abstract: Networks-on-Chip (NoCs) are emerging as the way to interconnect the processing cores and the memory within a chip multiprocessor. As recent years have seen a significant increase in the number of cores per chip, it is crucial to guarantee the scalability of NoCs in order to prevent communication from becoming the next performance bottleneck in multicore processors. Among other alternatives, the concept of Wireless Network-on-Chip (WNoC) has been proposed, wherein on-chip antennas would provide native broadcast capabilities leading to enhanced network performance. Since energy consumption and chip area are the two primary constraints, this work aims to explore the area and energy implications of scaling a WNoC in terms of (a) the number of cores within the chip, and (b) the capacity of each link in the network. To this end, an integral design space exploration is performed, covering implementation aspects (area and energy), communication aspects (link capacity) and network-level considerations (number of cores and network architecture). The study is entirely based upon analytical models, which will allow benchmarking of the WNoC scalability against a baseline NoC. Eventually, this investigation will provide qualitative and quantitative guidelines for the design of future transceivers for wireless on-chip communication., Peer Reviewed, Postprint (author’s final draft)
Published: 2014

37. Measuring operating system overhead on Sun UltraSparc T1 processor

Author: Radojković, Petar, Cakarevic, Vladimir, Verdú Mulà, Javier, Pajuelo González, Manuel Alejandro, Gioiosa, Roberto, Cazorla Almeida, Francisco Javier, Nemirovsky, Mario, Valero Cortés, Mateo, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Multicore multithreaded processors, Software_OPERATINGSYSTEMS, Overhead, Operating systems (Computers), Linux, Sistemes operatius (Ordinadors), Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Informàtica::Sistemes operatius [Àrees temàtiques de la UPC], Solaris (Computer file), Operating systems
Abstract: Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine the OS noise for High Performance Computing, especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this study, we analyze sources of OS noise on a massive multithreading processor, the Sun UltraSPARC T1. We compare results, measured in Linux and Solaris, with the results provided by a low-overhead runtime environment that introduces almost no overhead in applications’ execution time. Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
Published: 2009

38. Graphene-Enabled Wireless Communication for Massive Multicore Architectures

Author: Abadal, Sergi, Alarcon, Eduard, Cabellos-Aparicio, Albert, Lemme, Max C., and Nemirovsky, Mario
Abstract: Current trends in microprocessor architecture design are leading towards a dramatic increase of core-level parallelization, wherein a given number of independent processors or cores are interconnected. Since the main bottleneck is foreseen to migrate from computation to communication, efficient and scalable means of inter-core communication are crucial for guaranteeing steady performance improvements in many-core processors. As the number of cores grows, it remains unclear whether initial proposals, such as the Network-on-Chip (NoC) paradigm, will meet the stringent requirements of this scenario. This position paper presents a new research area where massive multicore architectures have wireless communication capabilities at the core level. This goal is feasible by using graphene-based planar antennas, which can radiate signals at the Terahertz band while utilizing lower chip area than its metallic counterparts. To the best of our knowledge, this is the first work that discusses the utilization of graphene-enabled wireless communication for massive multicore processors. Such wireless systems enable broadcasting, multicasting, all-to-all communication, as well as significantly reduce many of the issues present in massively multicore environments, such as data coherency, consistency, synchronization and communication problems. Several open research challenges are pointed out related to implementation, communications and multicore architectures, which pave the way for future research in this multidisciplinary area., QC 20131220
Published: 2013
Full Text: View/download PDF

39. Improving the energy efficiency of hardware-assisted watchpoint systems

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Karakostas, Vasileios, Tomić, Saša, Unsal, Osman Sabri, Nemirovsky, Mario, and Cristal Kestelman, Adrián
Abstract: Hardware-assisted watchpoint systems enhance the execution of numerous dynamic software techniques, such as memory protection, module isolation, deterministic execution, and data race detection. In this paper, we show that previous hardware proposals may introduce significant energy overheads, and propose WatchPoint Filtering (WPF), a novel filtering mechanism that eliminates unnecessary watchpoint checks. We evaluate WPF on two state-of-the-art proposals for hardware-assisted watchpoints using two common memory checkers. WPF eliminates 83% of the watchpoint checks (up to 99.7%) and reduces 57% of the dynamic energy overhead (up to 78%) on average, without introducing additional performance execution overhead., Postprint (published version)
Published: 2013

40. Improving the effective use of multithreaded architectures : implications on compilation, thread assignment, and timing analysis

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Cazorla Almeida, Francisco Javier, Verdú Mulà, Javier, Pajuelo González, Manuel Alejandro, Nemirovsky, Mario, and Radojković, Petar
Abstract: This thesis presents cross-domain approaches that improve the effective use of multithreaded architectures. The contributions of the thesis can be classified in three groups. First, we propose several methods for thread assignment of network applications running in multithreaded network servers. Second, we analyze the problem of graph partitioning that is a part of the compilation process of multithreaded streaming applications. Finally, we present a method that improves the measurement-based timing analysis of multithreaded architectures used in time-critical environments. The following sections summarize each of the contributions. (1) Thread assignment on multithreaded processors: State-of-the-art multithreaded processors have different levels of resource sharing (e.g. between threads running on the same core and globally shared resources). Thus, the way that threads of a given workload are assigned to processors' hardware contexts determines which resources the threads share, which, in turn, may significantly affect the system performance. In this thesis, we demonstrate the importance of thread assignment for network applications running in multithreaded servers. We also present TSBSched and BlackBox scheduler, methods for thread assignment of multithreaded network applications running on processors with several levels of resource sharing. Finally, we propose a statistical approach to the thread assignment problem. In particular, we show that running a sample of several hundred or several thousand random thread assignments is sufficient to capture at least one out of 1% of the best-performing assignments with a very high probability. We also describe the method that estimates the optimal system performance for a given workload. We successfully applied TSBSched, BlackBox scheduler, and the presented statistical approach to a case study of thread assignment of multithreaded network applications running on the UltraSPARC T2 processor.
(2) Kernel partitioning of streami, Postprint (published version)
Published: 2013

41. Thread assignment of multithreaded network applications in multicore/multithreaded processors

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Radojković, Petar, Cakarevic, Vladimir, Verdú Mulà, Javier, Pajuelo González, Manuel Alejandro, Cazorla Almeida, Francisco Javier, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: © 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works., The introduction of multithreaded processors comprised of a large number of cores with many shared resources makes thread scheduling, and in particular optimal assignment of running threads to processor hardware contexts to become one of the most promising ways to improve the system performance. However, finding optimal thread assignments for workloads running in state-of-the-art multicore/multithreaded processors is an NP-complete problem. In this paper, we propose BlackBox scheduler, a systematic method for thread assignment of multithreaded network applications running on multicore/multithreaded processors. The method requires minimum information about the target processor architecture and no data about the hardware requirements of the applications under study. The proposed method is evaluated with an industrial case study for a set of multithreaded network applications running on the UltraSPARC T2 processor. In most of the experiments, the proposed thread assignment method detected the best actual thread assignment in the evaluation sample. The method improved the system performance from 5 to 48 percent with respect to load balancing algorithms used in state-of-the-art OSs, and up to 60 percent with respect to a naive thread assignment., Peer Reviewed, Postprint (author's final draft)
Published: 2013

42. Communication bottelneck analysis on big data applications

Author: Nemirovsky, Mario, Solé Pareta, Josep, and Roca Marí, Damián
Abstract: [ENGLISH] Computers, and multicore processors in particular, need cache memory to improve memory bandwidth and overall performance. There are different types of cache (private and shared) divided into different levels of hierarchy. Keeping coherence and consistency of shared values in these caches is a major performance bottleneck on multicore systems. To address this issue, there are several protocols that invalidate or update these values when a core needs to modify them. But these protocols require broadcast communication (or similar), which in today's NoCs represents a big cost in terms of cycles. In order to relieve this bottleneck, the first step in this research is to estimate what these invalidations represent in terms of system performance. To obtain that estimate, it is mandatory to use programs or simulators of a real process inside a multicore/multithreaded processor to visualize the communications between the cores and the effects of sharing a part of the address space. The reason is that these invalidations are produced by keeping the coherence between different copies of the same variable (shared space). Once we have a simulator that allows us to see the communications, we can set up different configurations to emulate a real processor in different scenarios. With these cases, we can observe how the number of invalidations changes depending on the parameters of the system (number of cores, size of cache memories, etc.) and the applications that are running. Because of this, the results can vary across applications, since each of them uses the shared memory space in a different way. With this information we can compile some statistics to draw the first conclusions and lay the foundations for future work. 
These results also enable us to study the scalability of the actual models to see what would happen if we have more than 1000-core processor because the actual simulators do not support such high number o, [SPANISH, translated] Multicore chips are the current reality in the field of computers. However, these systems present many problems that restrict their potential. This project studies the main one, the memory system and, more specifically, the cache memory. The scalability of current solutions with respect to the number of cores is studied, and the relevant conclusions are drawn., [CATALAN, translated] The current trend in processors is to integrate multiple cores within a single chip. These are known as multicore chips (CMP), but their design is full of problems. This project studies them, focusing on the memory system and, more specifically, on the cache memory. In particular, the operation of coherence protocols and their scalability with respect to the number of cores are analyzed. Finally, it is concluded that current solutions do not work for a high number of cores.
Published: 2013

43. Area and laser power scalability analysis in photonic networks-on-chip

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. CBA - Sistemes de Comunicacions i Arquitectures de Banda Ampla, Universitat Politècnica de Catalunya. GCO - Grup de Comunicacions Òptiques, Universitat Politècnica de Catalunya. EPIC - Energy Processing and Integrated Circuits, Abadal Cavallé, Sergi, Cabellos Aparicio, Alberto, Lázaro Villa, José Antonio, Nemirovsky, Mario, Alarcón Cot, Eduardo José, and Solé Pareta, Josep
Abstract: In the last decade, the field of microprocessor architecture has seen the rise of multicore processors, which consist of the interconnection of a set of independent processing units or cores in the same chip. As the number of cores per multiprocessor increases, the bandwidth and energy requirements for their interconnection networks grow exponentially and it is expected that conventional on-chip wires will not be able to meet such demands. Alternatively, nanophotonics has been regarded as a strong candidate for chip communication since it could provide high bandwidth with low area and energy footprints. However, issues such as the unavailability of efficient on-chip light sources or the difficulty of implementing all-optical buffering or header processing hinder the development of scalable photonic on-chip networks. In this paper, the area and laser power of several photonic on-chip network proposals are analytically modeled and their scalability is evaluated. Also, a graphene-based hybrid wireless/optical-wired approach is presented, aiming at enabling end-to-end photonic on-chip networks to scale beyond thousands of cores., Peer Reviewed, Postprint (published version)
Published: 2013

44. Thread assignment of multithreaded network applications in multicore/multithreaded processors

Author: Radojkovic, Petar, Cakarevic, Vladimir, Verdu, Javier, Pajuelo, Alex, Cazorla, Francisco J., Nemirovsky, Mario, and Valero, Mateo
Abstract: The introduction of multithreaded processors comprised of a large number of cores with many shared resources makes thread scheduling, and in particular optimal assignment of running threads to processor hardware contexts to become one of the most promising ways to improve the system performance. However, finding optimal thread assignments for workloads running in state-of-the-art multicore/multithreaded processors is an NP-complete problem. In this paper, we propose BlackBox scheduler, a systematic method for thread assignment of multithreaded network applications running on multicore/multithreaded processors. The method requires minimum information about the target processor architecture and no data about the hardware requirements of the applications under study. The proposed method is evaluated with an industrial case study for a set of multithreaded network applications running on the UltraSPARC T2 processor. In most of the experiments, the proposed thread assignment method detected the best actual thread assignment in the evaluation sample. The method improved the system performance from 5 to 48 percent with respect to load balancing algorithms used in state-of-the-art OSs, and up to 60 percent with respect to a naive thread assignment. © 1990-2012 IEEE.
Published: 2013

45. Thread Assignment of Multithreaded Network Applications in Multicore/Multithreaded Processors

Author: Radojkovic, Petar, primary, Cakarevic, Vladimir, additional, Verdu, Javier, additional, Pajuelo, Alex, additional, Cazorla, Francisco J., additional, Nemirovsky, Mario, additional, and Valero, Mateo, additional
Published: 2013
Full Text: View/download PDF

46. Graphene-enabled wireless communication for massive multicore architectures

Author: Abadal, Sergi, primary, Alarcón, Eduard, additional, Cabellos-Aparicio, Albert, additional, Lemme, Max, additional, and Nemirovsky, Mario, additional
Published: 2013
Full Text: View/download PDF

47. An abstraction methodology for the evaluation of multi-core multi-threaded architectures

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CNDS - Xarxes de Computadors i Sistemes Distribuïts, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Zilan, Ruken, Verdú Mulà, Javier, García Vidal, Jorge, Nemirovsky, Mario, Milito, Rodolfo, and Valero Cortés, Mateo
Abstract: As the evolution of multi-core multi-threaded processors continues, the complexity demanded to perform an extensive trade-off analysis, increases proportionally. Cycle-accurate or trace-driven simulators are too slow to execute the large amount of experiments required to obtain indicative results. To achieve a thorough analysis of the system, software benchmarks or traces are required. In many cases when an analysis is needed most, during the earlier stages of the processor design, benchmarks or traces are not available. Analytical models overcome these limitations but do not provide the fine grain details needed for a deep analysis of these architectures. In this work we present a new methodology to abstract processor architectures, at a level between cycle-accurate and analytical simulators. To apply our methodology we use queueing modeling techniques. Thus, we introduce Q-MAS, a queueing based tool targeting a real chip (the Ultra SPARC T2 processor) and aimed at facilitating the quantification of trade-offs during the design phase of multi-core multi-threaded processor architectures. The results demonstrate that Q-MAS, the tool that we developed, provides accurate results very close to the actual hardware, with a minimal cost of running what-if scenarios., Peer Reviewed, Postprint (published version)
Published: 2011

48. Thread to strand binding of parallel network applications in massive multi-threaded systems

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Radojković, Petar, Cakarevic, Vladimir, Verdú Mulà, Javier, Pajuelo González, Manuel Alejandro, Cazorla Almeida, Francisco Javier, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the Thread to Strand Binding or TSB. In this paper, we show that the TSB impact on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor, which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing aware TSB algorithm (TSBSched) that significantly facilitates the problem of thread to strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both the characteristics of the multithreaded processors under study and the structure of the software-pipelined applications. Once calibrated for a given processor architecture, our proposal does not require hardware knowledge on the side of the programmer, nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers., Peer Reviewed, Postprint (published version)
Published: 2010

49. Internet traffic and the behavior of processing workloads

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CNDS - Xarxes de Computadors i Sistemes Distribuïts, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Zilan, Ruken, Verdú Mulà, Javier, García Vidal, Jorge, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: Nowadays, the evolution of network services provided at the edge of Internet increases the requirements of network applications. Such applications result in complexities thus, the processors need to execute more complex workloads that can deal not only with the packet header, but also with the packet payload (e.g. Deep Packet Inspection). Unlike common routing applications that show similar processing among packets, next-generation of network applications present variations in the processing procedure among packets. Thus, different traffic behaviors can produce different process patterns and present different memory and processing requirements. The aim of this work is to present an ongoing work towards correlating Internet traffic features with variations of processing workloads on the next-generation of edge routers., Peer Reviewed, Postprint (published version)
Published: 2009

50. Measuring operating system overhead on Sun UltraSparc T1 processor

Author: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Radojković, Petar, Cakarevic, Vladimir, Verdú Mulà, Javier, Pajuelo González, Manuel Alejandro, Gioiosa, Roberto, Cazorla Almeida, Francisco Javier, Nemirovsky, Mario, and Valero Cortés, Mateo
Abstract: Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine the OS noise for High Performance Computing, especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this study, we analyze sources of OS noise on a massive multithreading processor, the Sun UltraSPARC T1. We compare results, measured in Linux and Solaris, with the results provided by a low-overhead runtime environment that introduces almost no overhead in applications’ execution time. Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor., Postprint (published version)
Published: 2009
