37 results for "Germán Llort"
Search Results
2. 15+ years of joint parallel application performance analysis/tools training with Scalasca/Score-P and Paraver/Extrae toolsets.
- Author
-
Brian J. N. Wylie, Judit Giménez, Christian Feld, Markus Geimer, Germán Llort, Sandra Méndez, Estanislao Mercadal, Anke Visser, and Marta García-Gasulla
- Published
- 2025
- Full Text
- View/download PDF
3. Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver.
- Author
-
Adrian Munera, Sara Royuela, Germán Llort, Estanislao Mercadal, Franck Wartel, and Eduardo Quiñones
- Published
- 2020
- Full Text
- View/download PDF
4. Analyzing the Efficiency of Hybrid Codes.
- Author
-
Judit Giménez, Estanislao Mercadal, Germán Llort, and Sandra Méndez
- Published
- 2020
- Full Text
- View/download PDF
5. Automating the Application Data Placement in Hybrid Memory Systems.
- Author
-
Harald Servat, Antonio J. Peña, Germán Llort, Estanislao Mercadal, Hans-Christian Hoppe, and Jesús Labarta
- Published
- 2017
- Full Text
- View/download PDF
6. Performance Analysis of Parallel Python Applications.
- Author
-
Michael Wagner, Germán Llort, Estanislao Mercadal, Judit Giménez, and Jesús Labarta
- Published
- 2017
- Full Text
- View/download PDF
7. The Secrets of the Accelerators Unveiled: Tracing Heterogeneous Executions Through OMPT.
- Author
-
Germán Llort, Antonio Filgueras, Daniel Jiménez-González, Harald Servat, Xavier Teruel, Estanislao Mercadal, Carlos Álvarez, Judit Giménez, Xavier Martorell, Eduard Ayguadé, and Jesús Labarta
- Published
- 2016
- Full Text
- View/download PDF
8. Large-Memory Nodes for Energy Efficient High-Performance Computing.
- Author
-
Darko Zivanovic, Milan Radulovic, Germán Llort, David Zaragoza, Janko Strassburg, Paul M. Carpenter, Petar Radojkovic, and Eduard Ayguadé
- Published
- 2016
- Full Text
- View/download PDF
9. Bio-Inspired Call-Stack Reconstruction for Performance Analysis.
- Author
-
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, and Jesús Labarta
- Published
- 2016
- Full Text
- View/download PDF
10. Low-Overhead Detection of Memory Access Patterns and Their Time Evolution.
- Author
-
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, and Jesús Labarta
- Published
- 2015
- Full Text
- View/download PDF
11. Detailed and simultaneous power and performance analysis.
- Author
-
Harald Servat, Germán Llort, Judit Giménez, and Jesús Labarta
- Published
- 2016
- Full Text
- View/download PDF
12. Identifying Code Phases Using Piece-Wise Linear Regressions.
- Author
-
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, and Jesús Labarta
- Published
- 2014
- Full Text
- View/download PDF
13. On the usefulness of object tracking techniques in performance analysis.
- Author
-
Germán Llort, Harald Servat, Juan Gonzalez, Judit Giménez, and Jesús Labarta
- Published
- 2013
- Full Text
- View/download PDF
14. On the Instrumentation of OpenMP and OmpSs Tasking Constructs.
- Author
-
Harald Servat, Xavier Teruel, Germán Llort, Alejandro Duran, Judit Giménez, Xavier Martorell, Eduard Ayguadé, and Jesús Labarta
- Published
- 2012
- Full Text
- View/download PDF
15. Unveiling Internal Evolution of Parallel Application Computation Phases.
- Author
-
Harald Servat, Germán Llort, Judit Giménez, Kevin A. Huck, and Jesús Labarta
- Published
- 2011
- Full Text
- View/download PDF
16. Trace Spectral Analysis toward Dynamic Levels of Detail.
- Author
-
Germán Llort, Marc Casas, Harald Servat, Kevin A. Huck, Judit Giménez, and Jesús Labarta
- Published
- 2011
- Full Text
- View/download PDF
17. Folding: Detailed Analysis with Coarse Sampling.
- Author
-
Harald Servat, Germán Llort, Judit Giménez, Kevin A. Huck, and Jesús Labarta
- Published
- 2011
- Full Text
- View/download PDF
18. On-line detection of large-scale parallel application's structure.
- Author
-
Germán Llort, Juan Gonzalez, Harald Servat, Judit Giménez, and Jesús Labarta
- Published
- 2010
- Full Text
- View/download PDF
19. Detailed Performance Analysis Using Coarse Grain Sampling.
- Author
-
Harald Servat, Germán Llort, Judit Giménez, and Jesús Labarta
- Published
- 2009
- Full Text
- View/download PDF
20. Scalability of Tracing and Visualization Tools.
- Author
-
Jesús Labarta, Judit Giménez, E. Martínez, P. González, Harald Servat, Germán Llort, and Xavier Aguilar
- Published
- 2005
21. Framework for a productive performance optimization.
- Author
-
Harald Servat, Germán Llort, Kevin A. Huck, Judit Giménez, and Jesús Labarta
- Published
- 2013
- Full Text
- View/download PDF
22. Analyzing the Efficiency of Hybrid Codes
- Author
-
Germán Llort, Estanislao Mercadal, Judit Gimenez, Sandra Mendez, and Barcelona Supercomputing Center
- Subjects
Computer science, Load modeling, Performance analysis, Parallel programming (Computer science), Thread (computing), Parallel computing, Scalability efficiency, Hybrid approach, Hybrid parallelization, Instruction set, CUDA, Shared memory, Scalability, Programming paradigm, High performance computing, Efficiency model - Abstract
Hybrid parallelization may be the only path for most codes to use HPC systems at a very large scale. Even at a small scale, with an increasing number of cores per node, combining MPI with some shared-memory thread-based library makes it possible to reduce the application's network requirements. Despite the benefits of a hybrid approach, it is not easy to achieve an efficient hybrid execution. This is not only because of the added complexity of combining two different programming models, but also because in many cases the code was initially designed with just one level of parallelization and later extended to a hybrid mode. This paper presents our model to diagnose the efficiency of hybrid applications, distinguishing the contribution of each parallel programming paradigm. The flexibility of the proposed methodology allows us to use it for different paradigms and scenarios, such as comparing the MPI+OpenMP and MPI+CUDA versions of the same code. This work has been partially developed under the scope of the POP CoE, which has received funding from the European Union's Horizon 2020 research and innovation programme (under grant agreements No. 676553 and 824080), and with the support of the Comisión Interministerial de Ciencia y Tecnología (CICYT) under contract No. PID2019-107255GB-C22. We also want to acknowledge the ChEESE CoE and the EDANYA group from Universidad de Málaga (www.uma.es/edanya), which granted us permission to report on the Tsunami-HySEA analysis.
- Published
- 2020
- Full Text
- View/download PDF
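The abstract above models global efficiency as a combination of per-paradigm contributions. As a minimal sketch of that idea, the multiplicative breakdown below (global efficiency as the product of an MPI-level and a thread-level efficiency) is an illustrative assumption, not the paper's exact formulation; all numbers are made up.

```python
# Hypothetical multiplicative hybrid-efficiency breakdown:
# efficiency = useful compute time / total allotted resource time,
# factored into an MPI-level and a thread-level contribution.
# Illustrative only; not the paper's exact model.

def parallel_efficiency(useful_time, total_time):
    """Fraction of allotted resource time spent on useful work."""
    return useful_time / total_time

def hybrid_efficiency(mpi_eff, thread_eff):
    """Model the global efficiency as the product of both levels."""
    return mpi_eff * thread_eff

# Example: ranks spend 90% of their time outside MPI, and threads
# do useful work 80% of that time.
mpi_eff = parallel_efficiency(useful_time=0.9, total_time=1.0)
thread_eff = parallel_efficiency(useful_time=0.8, total_time=1.0)
print(round(hybrid_efficiency(mpi_eff, thread_eff), 2))
```

Such a factorization makes it easy to see which level (message passing or threading) is responsible for a low global efficiency.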
23. Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver
- Author
-
Sara Royuela, Germán Llort, Estanislao Mercadal, Eduardo Quinones, Adrian Munera, Franck Wartel, and Barcelona Supercomputing Center
- Subjects
Parallel processing (Electronic computers), Computer science, Embedded systems, Analysis tools, Parallel programming models, OpenMP (Application program interface), Parallel programs (Computer programs), Tracing, Porting, Domain (software engineering), Software, Performance evaluation, Computer architecture, Layer (object-oriented design), Abstraction - Abstract
Cutting-edge functionalities in embedded systems require the use of parallel architectures to meet their performance requirements. This imposes the introduction of a new layer in the software stacks of embedded systems: the parallel programming model. Unfortunately, the tools used to analyze embedded systems fall short in characterizing the performance of parallel applications at the parallel programming model level, and in correlating it with information about non-functional requirements such as real-time behavior, energy, memory usage, etc. HPC tools, like Extrae, are designed with that level of abstraction in mind, but their main focus is on performance evaluation. Overall, providing insightful information about the performance of parallel embedded applications at the parallel programming model level, and relating it to the non-functional requirements, is of paramount importance to fully exploit the performance capabilities of parallel embedded architectures. This paper contributes to the state of the art of analysis tools for embedded systems by: (1) analyzing the particular constraints of embedded systems compared to HPC systems (e.g., static setting, restricted memory, limited drivers) to support HPC analysis tools; (2) porting Extrae, a powerful tracing tool from the HPC domain, to the GR740 platform, a SoC used in the space domain; and (3) augmenting Extrae with new features needed to correlate the parallel execution with the following non-functional requirements: energy, temperature and memory usage. Finally, the paper presents the usefulness of Extrae to characterize OpenMP applications and their non-functional requirements, evaluating different aspects of the applications running on the GR740. This work has been partially funded by the HP4S (High Performance Parallel Payload Processing for Space) project under the ESA-ESTEC ITI contract № 4000124124/18/NL/CRS.
- Published
- 2020
24. Performance Analysis of Parallel Python Applications
- Author
-
Judit Gimenez, Michael Wagner, Jesús Labarta, Estanislao Mercadal, Germán Llort, and Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
- Subjects
Computer science, Fortran, Paraver, Tracing, Tools, Extrae, Parallel processing (Electronic computers), Performance analysis, Python (programming language), High-level programming languages, HPC, High performance computing, Software engineering - Abstract
Python is progressively consolidating itself within the HPC community with its simple syntax, large standard library, and powerful third-party libraries for scientific computing that are especially attractive to domain scientists. Despite Python lowering the bar for accessing parallel computing, utilizing the capacities of HPC systems efficiently remains a challenging task. Yet, at the moment only a few supporting tools exist, and they provide merely basic information in the form of summarized profile data. In this paper, we present our efforts in developing event-based tracing support for Python within the performance monitor Extrae to provide detailed information and enable a profound performance analysis. We present concepts to record the complete communication behavior as well as to capture entry and exit of functions in Python to provide the corresponding application context. We evaluate our implementation in Extrae by analyzing the well-established electronic structure simulation package GPAW and demonstrate that the recorded traces provide information equivalent to that of traditional C or Fortran applications, therefore offering the same profound analysis capabilities for Python as well.
- Published
- 2017
- Full Text
- View/download PDF
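Capturing function entry and exit in Python, as the abstract above describes, is commonly built on the interpreter's profiling hook. The sketch below uses `sys.setprofile`; this mechanism is an assumption for illustration and not necessarily how Extrae implements its Python support.

```python
import sys
import time

# Minimal event-based function tracer using CPython's profiling
# hook. Illustrative only: it records timestamped entry/exit
# events for Python-level calls, which is the kind of data an
# event-based tracer needs.

events = []

def tracer(frame, event, arg):
    # Keep only Python-level call/return events (C calls show up
    # as "c_call"/"c_return" and are ignored here).
    if event in ("call", "return"):
        events.append((event, frame.f_code.co_name, time.perf_counter()))

def work():
    return sum(range(1000))

sys.setprofile(tracer)
work()
sys.setprofile(None)

for ev, name, _ts in events:
    print(ev, name)
```

Each recorded tuple corresponds to one trace event; a real tracer would emit these to a trace buffer instead of an in-memory list.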
25. Detailed and simultaneous power and performance analysis
- Author
-
Judit Gimenez, Harald Servat, Jesús Labarta, and Germán Llort
- Subjects
Distributed computing, Computer Networks and Communications, Computer science, Real-time computing, Processor architectures, Supercomputer, Exascale computing, Computer hardware & architecture, Power, Node (circuits), Instrumentation (computer programming), Software, Efficient energy use - Abstract
On the road to Exascale computing, both performance and power must be tackled at several levels, from the system down to the processor. The processor itself is mainly responsible for serial node performance and also for most of the energy consumed by the system. Thus, it is important to have tools that simultaneously analyze both performance and energy efficiency at the processor level.
- Published
- 2013
- Full Text
- View/download PDF
26. Automating the application data placement in hybrid memory systems
- Author
-
Hans-Christian Hoppe, Harald Servat, Germán Llort, Antonio J. Peña, Estanislao Mercadal, Jesús Labarta, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Barcelona Supercomputing Center
- Subjects
Heterogeneous memory, Flat memory model, Monitoring, Computer science, Memory address, Multiprocessors, Sampling, Instrumentation, High-bandwidth memory, Dynamic random-access memory, Distributed shared memory, Xeon, Parallel processing (Electronic computers), Resource management, Performance analysis, Uniform memory access, Physical address, Memory management, Computer architecture, Shared memory, PEBS, Distributed memory, Hybrid memory, Xeon Phi - Abstract
Multi-tiered memory systems, such as those based on Intel® Xeon Phi™ processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects into the available memory layers is key to shortening the time-to-solution, but the way developers and end-users determine the most appropriate memory tier in which to place the application data objects has not been properly addressed to date. In this paper we present a novel methodology to build an extensible framework to automatically identify and place the application's most relevant memory objects into the Intel Xeon Phi fast on-package memory. Our proposal works on top of in-production binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes the proposal valuable even for end-users who do not have the possibility of modifying the application source code. We demonstrate the value of a framework based on our methodology for several relevant HPC applications, using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal is able to identify the key objects to be promoted into fast on-package memory in order to optimize performance, even surpassing hardware-based solutions. This work has been performed in the Intel-BSC Exascale Lab. Antonio J. Peña is co-financed by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. We would like to thank Intel's DCG HEAT team for allowing us to access their computational resources. We also want to acknowledge this team, especially Larry Meadows and Jason Sewall, as well as Pardo Keppel, for the productive discussions. We thank Raphaël Léger for allowing us to access the MAXW-DGTD application and its input.
- Published
- 2017
- Full Text
- View/download PDF
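The placement decision the abstract above describes (promote the most relevant objects into limited fast memory) can be framed as a budgeted ranking problem. The greedy sketch below, ranking allocation sites by sampled access density, is an illustrative assumption and not the paper's actual algorithm; all names and numbers are hypothetical.

```python
# Hypothetical greedy placement: rank data objects by sampled
# access density (accesses per byte) and promote them into fast
# memory until its capacity budget is exhausted. Illustrative
# only; not the paper's methodology.

def place_objects(objects, fast_capacity):
    """objects: list of (name, size_bytes, sampled_accesses).
    Returns the set of object names promoted to fast memory."""
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    promoted, used = set(), 0
    for name, size, _accesses in ranked:
        if used + size <= fast_capacity:
            promoted.add(name)
            used += size
    return promoted

# Hot small objects win over a large, rarely touched one.
objs = [("grid", 400, 9000), ("halo", 100, 4000), ("log", 600, 100)]
print(sorted(place_objects(objs, fast_capacity=512)))
```

A real framework would feed this ranking from hardware sampling of in-production binaries rather than hand-written counts.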
27. Monitoring Heterogeneous Applications with the OpenMP Tools Interface
- Author
-
Carlos Alvarez, Harald Servat, Estanislao Mercadal, Judit Gimenez, Eduard Ayguadé, Jesús Labarta, Antonio Filgueras, Michael Wagner, Daniel Jiménez-González, Xavier Teruel, Germán Llort, and Xavier Martorell
- Subjects
Xeon, Computer science, Interface, Software engineering, Supercomputer, Runtime system, Computer architecture, Parallel programming model, Programming paradigm, Hardware acceleration, Instrumentation (computer programming) - Abstract
Heterogeneous systems are gaining importance in supercomputing, yet they are challenging to program, and developers require support tools to understand how well their accelerated codes perform and how they can be improved. The OpenMP Tools Interface (OMPT) is a new performance monitoring interface that is being considered for integration into the OpenMP standard. OMPT allows monitoring the execution of heterogeneous OpenMP applications by revealing the activity of the runtime through a standardized API, as well as facilitating the exchange of performance information between devices running accelerated codes and the analysis tool. In this paper we describe our efforts implementing the parts of the OMPT specification necessary to monitor accelerators. In particular, the integration of the OMPT features into our parallel runtime system and instrumentation framework helps obtain detailed performance information about the execution of the accelerated tasks issued to the devices to allow an insightful analysis. As a result of this analysis, the parallel runtime of the programming model has been improved. We focus on the evaluation of monitoring FPGA devices, studying the performance of a common kernel in scientific algorithms: matrix multiplication. Nonetheless, this development is also applicable to monitoring GPU accelerators and Intel® Xeon Phi™ co-processors operating under the OmpSs programming model.
- Published
- 2017
- Full Text
- View/download PDF
28. The secrets of the accelerators unveiled: tracing heterogeneous executions through OMPT
- Author
-
Xavier Teruel, Judit Gimenez, Germán Llort, Harald Servat, Xavier Martorell, Daniel Jiménez-González, Estanislao Mercadal, Jesús Labarta, Carlos Alvarez, Antonio Filgueras, Eduard Ayguadé, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Computer science, Programming languages (Electronic computers), Tracing, Software engineering, Algorithm analysis and problem complexity, Computer hardware, OMPT, Computer architecture, Processor architectures, Compilers, Parallel programming model, Interpreters, Hardware acceleration, System performance and evaluation - Abstract
Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program, and developers still lack powerful tools to gain understanding about how well their accelerated codes perform and how to improve them. With different types of hardware accelerators available, each with its own specific low-level API, there is not yet a clear consensus on a standard way to retrieve information about an accelerator's performance. To improve this scenario, OMPT is a novel performance monitoring interface that is being considered for integration into the OpenMP standard. OMPT allows analysis tools to monitor the execution of parallel OpenMP applications by providing detailed information about the activity of the runtime through a standard API. For accelerated devices, OMPT also facilitates the exchange of performance information between the runtime and the analysis tool. We implement the part of the OMPT specification that refers to the use of accelerators both in the Nanos++ parallel runtime system and in the Extrae tracing framework, obtaining detailed performance information about the execution of the tasks issued to the accelerated devices to later conduct insightful analysis. Our work extends previous efforts in the field to expose detailed information from the OpenMP and OmpSs runtimes regarding the activity and performance of task-based parallel applications. In this paper, we focus on the evaluation of FPGA devices, studying the performance of two common kernels in scientific algorithms: matrix multiplication and Cholesky decomposition. Furthermore, this development is seamlessly applicable to the analysis of GPGPU accelerators and Intel® Xeon Phi™ co-processors operating under the OmpSs programming model.
This work was partially supported by the European Union H2020 program through the AXIOM project (grant ICT-01-2014 GA 645496) and the Mont-Blanc 2 project, by the Ministerio de Economía y Competitividad, under contracts Computación de Altas Prestaciones VII (TIN2015-65316-P); Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under projects MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051) and 2009-SGR-980; the BSC-CNS Severo Ochoa program (SEV-2011-00067); the Intel-BSC Exascale Laboratory project; and the OMPT Working Group.
- Published
- 2016
29. Large-memory nodes for energy efficient high-performance computing
- Author
-
Janko Strassburg, David Zaragoza, Germán Llort, Eduard Ayguadé, Paul M. Carpenter, Petar Radojković, Milan Radulovic, Darko Zivanovic, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Computer systems organization, Computer science, Distributed computing, Distributed architectures, Energy consumption, Total cost of ownership, Supercomputer, Power and energy, Hardware, Server, Node (computer science), High performance computing, Efficient energy use - Abstract
Energy consumption is by far the most important contributor to HPC cluster operational costs, and it accounts for a significant share of the total cost of ownership. Advanced energy-saving techniques in HPC components have received significant research and development effort, but a simple measure that can dramatically reduce energy consumption is often overlooked. We show that, in capacity computing, where many small to medium-sized jobs have to be solved at the lowest cost, a practical energy-saving approach is to scale-in the application on large-memory nodes. We evaluate scaling-in; i.e. decreasing the number of application processes and compute nodes (servers) to solve a fixed-sized problem, using a set of HPC applications running in a production system. Using standard-memory nodes, we obtain average energy savings of 36%, already a huge figure. We show that the main source of these energy savings is a decrease in the node-hours (node_hours = #nodes x exe_time), which is a consequence of the more efficient use of hardware resources. Scaling-in is limited by the per-node memory capacity. We therefore consider using large-memory nodes to enable a greater degree of scaling-in. We show that the additional energy savings, of up to 52%, mean that in many cases the investment in upgrading the hardware would be recovered in a typical system lifetime of less than five years.
- Published
- 2016
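The energy argument in the abstract above rests on the node-hours relation it quotes (node_hours = #nodes x exe_time). A small worked sketch of that arithmetic follows; the node counts, runtimes, and per-node power figure are made-up illustrative numbers, not measurements from the paper.

```python
# Worked example of the node-hours metric from the abstract:
# node_hours = #nodes x exe_time. All numbers are illustrative.

def node_hours(nodes, exe_time_h):
    return nodes * exe_time_h

def energy_kwh(nodes, exe_time_h, node_power_kw):
    # Energy scales with node-hours at a given per-node power.
    return node_hours(nodes, exe_time_h) * node_power_kw

# Baseline: 16 standard nodes for 2 hours. Scaled-in: 4
# large-memory nodes for 6 hours (slower, but fewer node-hours).
baseline = energy_kwh(16, 2.0, node_power_kw=0.4)
scaled_in = energy_kwh(4, 6.0, node_power_kw=0.4)
savings = 1 - scaled_in / baseline
print(f"{savings:.0%}")  # 25%
```

Even with a 3x longer runtime, fewer nodes can mean fewer node-hours, which is the mechanism behind the savings the paper reports.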
30. Studying performance changes with tracking analysis
- Author
-
Harald Servat, Germán Llort, Juan Gonzalez, Judit Gimenez, Jesús Labarta, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Software, Operations research, Computer science, Real-time computing, Memory bandwidth, High performance computing, Tracking, Execution time - Abstract
Numerical simulation and modelling using High Performance Computing has evolved into an established technique in academic and industrial research. At the same time, the High Performance Computing infrastructure is becoming ever more complex. For instance, most of the current top systems around the world use thousands of nodes in which classical CPUs are combined with accelerator cards in order to enhance their compute power and energy efficiency. This complexity can only be mastered with adequate development and optimization tools. Key topics addressed by these tools include parallelization on heterogeneous systems, performance optimization for CPUs and accelerators, debugging of increasingly complex scientific applications, and optimization of energy usage in the spirit of green IT. This book represents the proceedings of the 8th International Parallel Tools Workshop, held October 1-2, 2014 in Stuttgart, Germany, a forum to discuss the latest advancements in parallel tools.
- Published
- 2015
31. Low-Overhead Detection of Memory Access Patterns and Their Time Evolution
- Author
-
Judit Gimenez, Harald Servat, Germán Llort, Jesús Labarta, and Juan Gonzalez
- Subjects
Source code, Low overhead, Computer science, Distributed computing, Time evolution, Instrumentation (computer programming), Data structure - Abstract
We present a performance analysis tool that reports the temporal evolution of the memory access patterns of in-production applications in order to help analysts understand the accesses to the application data structures. This information is captured using the Precise Event Based Sampling (PEBS) mechanism of recent Intel processors, and it is correlated with the source code and with the nature of the performance bottlenecks, if any. Consequently, this tool provides a complete approach that allows analysts to better understand the application behavior and leads them to improvements that take full advantage of the system's characteristics. We apply the tool to two optimized parallel applications and provide detailed insight into their memory access behavior, thus demonstrating the usefulness of the tool.
- Published
- 2015
- Full Text
- View/download PDF
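One way to recover an access pattern from the kind of sparse sampled addresses PEBS delivers, as in the abstract above, is stride inference over consecutive samples. The toy sketch below is an illustrative assumption about the analysis, not the paper's actual algorithm.

```python
from collections import Counter

# Toy stand-in for memory access pattern detection: infer the
# dominant stride from a sequence of sampled memory addresses.
# Illustrative only; not the paper's analysis.

def dominant_stride(addresses):
    """Return the most frequent delta between consecutive samples."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    stride, _count = Counter(deltas).most_common(1)[0]
    return stride

# A streaming access over 8-byte doubles, with one jump in the
# middle (e.g., the start of a new array).
samples = [0x1000, 0x1008, 0x1010, 0x1018, 0x2000, 0x2008, 0x2010]
print(dominant_stride(samples))  # 8
```

Tracking how this dominant stride changes over time windows is the kind of "time evolution" view the tool's abstract refers to.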
32. Detailed performance analysis using coarse grain sampling
- Author
-
Jesús Labarta, Harald Servat, Germán Llort, Judit Gimenez, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Computer science, Sampling (statistics), Parallel programming (Computer science), Tracing, Code, Instrumentation (computer programming), Data mining, Trace - Abstract
Performance evaluation tools enable analysts to shed light on how applications behave both from a general point of view and at concrete execution points, but they cannot provide detailed information beyond the monitored regions of code. Having the ability to determine when and which data has to be collected is crucial for a successful analysis. This is particularly true for trace-based tools, which can easily produce either unmanageably large traces or a shortage of information. In order to mitigate the well-known resolution vs. usability trade-off, we present a procedure that obtains fine-grain performance information using coarse-grain sampling, projecting performance metrics scattered all over the execution into thoroughly detailed representative areas. This mechanism has been incorporated into the MPItrace tracing suite, greatly extending the amount of performance information gathered from statically instrumented points with further periodic samples collected beyond them. We have applied this solution to the analysis of two applications to introduce a novel performance analysis methodology based on the combination of instrumentation and sampling techniques.
- Published
- 2014
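The projection idea in the abstract above — scattering coarse samples from many repetitions into one detailed representative region — can be sketched as a folding step. The sketch assumes fixed-length iterations for simplicity; the real technique works from instrumented phase boundaries, so this is illustrative only.

```python
# Toy version of folding coarse samples into one representative
# iteration: map each sample's timestamp to its offset within the
# iteration it fell in, accumulating detail across repetitions.
# Assumes fixed-length iterations; illustrative only.

def fold(samples, iteration_length):
    """samples: list of (timestamp, value). Returns (offset, value)
    pairs sorted by offset within the representative iteration."""
    folded = [(t % iteration_length, v) for t, v in samples]
    return sorted(folded)

# Three sparse samples per iteration, over three iterations of
# length 10, combine into nine points inside one folded iteration.
samples = [(t, t) for t in (1, 4, 7, 11, 14, 17, 21, 24, 27)]
print([off for off, _ in fold(samples, iteration_length=10)])
```

Each coarse sample lands at its phase-relative offset, so the folded iteration carries far more detail than any single pass through the loop.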
33. Framework for a productive performance optimization
- Author
-
Germán Llort, Judit Gimenez, Jesús Labarta, Harald Servat, Kevin Huck, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Source code, Computer Networks and Communications, Computer science, Parallel programming (Computer science), Supercomputers, Application tuning, Instrumentation, Sampling, Performance tools, Performance analysis, Microprocessor, Computer engineering, Hardware and Architecture, Performance models, Execution unit, Software - Abstract
Modern supercomputers deliver large computational power, but it is difficult for an application to exploit such power. One factor that limits application performance is single-node performance. While many performance tools use the microprocessor performance counters to provide insights into serial node performance issues, the complex semantics of these counters pose an obstacle to an inexperienced developer. We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. The output of the framework is precise, and it is capable of correlating performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With such information the developer can focus on the identified code and improve it, knowing which processor execution unit is degrading the performance. To demonstrate the usefulness of the framework we apply it to three already optimized applications using realistic inputs and, according to the results, modify their source code. By making modifications that require little effort, we successfully increase the applications' performance by 10% to 30% and thus shorten the time required to reach the solution and/or allow facing increased problem sizes.
- Published
- 2013
- Full Text
- View/download PDF
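The abstract above describes translating raw performance counters into metrics tied to processor functional units. The paper's exact counter set and formulas are not given here; the following is a minimal, hypothetical sketch of that mapping, with made-up counter names and readings:

```python
# Hypothetical sketch: turning raw hardware-counter readings into
# functional-unit-oriented metrics, in the spirit of the framework
# described above. Counter names and formulas are illustrative only.

def derive_metrics(counters):
    """Map raw counter values to simple per-unit efficiency metrics."""
    cycles = counters["cycles"]
    return {
        # Instructions retired per cycle: overall pipeline throughput.
        "ipc": counters["instructions"] / cycles,
        # Fraction of cycles the floating-point unit was kept busy.
        "fp_unit_busy": counters["fp_ops"] / cycles,
        # L1 data-cache miss ratio: pressure on the load/store unit.
        "l1d_miss_ratio": counters["l1d_misses"] / counters["loads"],
    }

# Readings for one small code region (made-up numbers).
region = {"cycles": 1_000_000, "instructions": 1_800_000,
          "fp_ops": 250_000, "l1d_misses": 40_000, "loads": 400_000}
m = derive_metrics(region)
```

A low `fp_unit_busy` combined with a high `l1d_miss_ratio` would point the developer at the memory subsystem rather than the arithmetic units, which is the kind of per-unit attribution the framework automates.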
34. On the instrumentation of OpenMP and OmpSs Tasking constructs
- Author
-
Judit Gimenez, Eduard Ayguadé, Jesús Labarta, Alejandro Duran, Germán Llort, Xavier Martorell, Harald Servat, Xavier Teruel, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Multi-core processor, Parallel processing (Electronic computers), Computer science, Processament en paral·lel (Ordinadors), Runtime library, Parallel computing, Shared memory, Parallel programming model, Parallelism (grammar), Overhead (computing), Instrumentation (computer programming), Compiler, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]
- Abstract
Parallelism has become more and more commonplace with the advent of multicore processors. Although different parallel programming models have arisen to exploit the computing capabilities of such processors, developing applications that benefit from these processors may not be easy. Worse, the performance achieved by the parallel version of the application may not be what the developer expected, as a result of a poor utilization of the resources offered by the processor. We present in this paper a fruitful synergy of a shared-memory parallel compiler and runtime, and a performance extraction library. The objective of this work is not only to shorten the performance analysis life-cycle when parallelizing an application, but also to enrich the analysis of the parallel application by incorporating data that is only known on the compiler and runtime side. Additionally, we present performance results obtained with the execution of instrumented applications and evaluate the overhead of the instrumentation.
- Published
- 2012
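The work above instruments tasking constructs so the runtime emits events that a performance library can consume. As a loose illustration (not the paper's actual library or event format), the analysis side of such instrumentation boils down to pairing begin/end events into per-task durations:

```python
# Illustrative sketch: pairing task-begin/task-end events emitted by
# an instrumented tasking runtime into per-task durations. The event
# tuple format (timestamp_us, kind, task_id) is an assumption made
# for this example, not the paper's actual interface.

def task_durations(events):
    """Return {task_id: duration_us} from a stream of begin/end events."""
    open_tasks, durations = {}, {}
    for ts, kind, tid in events:
        if kind == "begin":
            open_tasks[tid] = ts
        else:  # "end": close the matching open task
            durations[tid] = ts - open_tasks.pop(tid)
    return durations

# Task 2 is nested inside task 1, as tasking runtimes commonly allow.
trace = [(0, "begin", 1), (5, "begin", 2),
         (30, "end", 2), (42, "end", 1)]
d = task_durations(trace)
```

Comparing such durations between instrumented and uninstrumented runs is one simple way to quantify the instrumentation overhead the paper evaluates.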
35. Folding: Detailed Analysis with Coarse Sampling
- Author
-
Harald Servat, Kevin A. Huck, Jesús Labarta, Germán Llort, and Judit Gimenez
- Subjects
Computer science, Real-time computing, Sampling (statistics), Folding (DSP implementation), High frequency sampling, Computer engineering, Code (cryptography), Overhead (computing), Instrumentation (computer programming), Level of detail, TRACE (psycholinguistics)
- Abstract
Performance analysis tools help application users find the bottlenecks that prevent an application from running at full speed on current supercomputers. The level of detail and the accuracy of the performance tools are crucial to completely depict the nature of the bottlenecks. The details exposed depend not only on the nature of the tools (profile-based or trace-based) but also on the mechanism on which they rely (instrumentation or sampling) to gather information. In this paper, we present a mechanism called folding that combines both instrumentation and sampling for trace-based performance analysis tools. The folding mechanism takes advantage of long execution runs and low-frequency sampling to finely detail the evolution of the user code with minimal overhead on the application. The reports provided by the folding mechanism are extremely useful for understanding the behavior of a region of code at a very low level. We also present a practical study carried out in an in-production scenario with the folding mechanism, and show that the results of the folding resemble those of high-frequency sampling.
- Published
- 2012
- Full Text
- View/download PDF
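The core idea of folding, as the abstract describes it, is that sparse samples scattered across many iterations of a repetitive region can be projected onto one synthetic iteration, yielding a dense profile. A deliberately simplified sketch (the real mechanism detects iteration boundaries via instrumentation and interpolates metrics, which is omitted here):

```python
# Minimal sketch of the folding idea: samples taken at low frequency
# across many iterations are "folded" onto a single synthetic
# iteration. Fixed iteration length is an assumption for brevity;
# the real mechanism delimits iterations with instrumentation events.

def fold(samples, iteration_length):
    """samples: list of (timestamp, value). Returns the samples
    re-mapped to their offset within one iteration, sorted by offset."""
    return sorted((t % iteration_length, v) for t, v in samples)

# One sample per iteration, each landing at a different offset, so the
# folded view covers the iteration far more densely than any single pass.
samples = [(3, "a"), (107, "b"), (211, "c"), (315, "d")]
folded = fold(samples, 100)
```

Four coarse samples become four distinct points inside one 100-unit iteration, which is how long runs plus low-frequency sampling approximate high-frequency detail.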
36. On-line detection of large-scale parallel application's structure
- Author
-
Harald Servat, Juan Antonio Rodríguez González, Jesús Labarta, Juan Gonzalez-Garcia, Germán Llort, Judit Gimenez, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Informàtica::Arquitectura de computadors::Arquitectures distribuïdes [Àrees temàtiques de la UPC], Computer science, Scale (chemistry), Distributed computing, Volume (computing), Parallel programming (Computer science), Enginyeria de la telecomunicació [Àrees temàtiques de la UPC], Pattern clustering, Programació en paral·lel (Informàtica), Task (project management), Anàlisi de conglomerats, Cluster analysis, Parallel processing (DSP implementation), Parallel processing, Algorithm design, Data mining, TRACE (psycholinguistics)
- Abstract
With larger and larger systems being constantly deployed, trace-based performance analysis of parallel applications has become a daunting task. Even if the amount of performance data gathered per single process is small, traces rapidly become unmanageable when merging together the information collected from all processes. In general, an efficient analysis of such a large volume of data requires a prior filtering step that directs the analyst's attention towards what is meaningful for understanding the observed application behavior. Furthermore, the iterative nature of most scientific applications usually ends up producing repetitive information. Discarding irrelevant data aims at reducing both the size of traces and the time required to perform the analysis and deliver results. In this paper, we present an on-line analysis framework that relies on clustering techniques to intelligently select the most relevant information for understanding how the application behaves, while keeping the trace volume at a reasonable size.
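The clustering step the abstract refers to groups computation bursts with similar performance behavior so that only representatives need to be kept. As a toy stand-in (the actual framework uses density-based clustering on richer metrics), a greedy distance threshold over (duration, IPC) points captures the flavor:

```python
# Toy sketch of the clustering step: group computation bursts by
# similarity in (duration, IPC) space. Plain greedy distance
# thresholding stands in for the density-based clustering the real
# on-line framework employs.

def cluster(bursts, eps):
    """bursts: list of (duration, ipc) points. Assign each burst to
    the first cluster whose representative lies within Euclidean
    distance eps; otherwise start a new cluster. Returns labels."""
    reps, labels = [], []
    for d, i in bursts:
        for k, (rd, ri) in enumerate(reps):
            if ((d - rd) ** 2 + (i - ri) ** 2) ** 0.5 <= eps:
                labels.append(k)
                break
        else:
            reps.append((d, i))       # burst opens a new cluster
            labels.append(len(reps) - 1)
    return labels

# Two short high-IPC bursts and two long low-IPC bursts collapse
# into two clusters, so two representatives suffice for the trace.
bursts = [(10.0, 1.2), (10.5, 1.1), (90.0, 0.4), (89.5, 0.5)]
labels = cluster(bursts, eps=2.0)
```

Keeping one detailed representative per cluster and discarding the repetitive rest is what lets the framework bound the trace volume on-line.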
37. Bio-inspired call-stack reconstruction for performance analysis
- Author
-
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, Jesús Labarta, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Barcelona Supercomputing Center
- Subjects
Source code, Call stack, Computer science, Distributed computing, Context (language use), Call-stack analysis, Sampling (signal processing), Code (cryptography), Multiprocessors, Instrumentation (computer programming), Sampling, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Instrumentation, Performance analysis, Process (computing), Multiprocessadors, Multi-sequence alignment, Algorithm design
- Abstract
The correlation of performance bottlenecks and their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance, ultimately enabling optimizations of the code. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst to comprehend the application behavior. Some tools modify the call-stack during run-time to diminish the collection expense, but at the cost of resulting in non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterpart. To address it, we capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several previously unseen in-production applications to describe them in fine detail, and to optimize them with tiny modifications based on the analyses.

We thankfully acknowledge Mathis Bode for giving us access to the Arts CF binaries, and Miguel Castrillo and Kim Serradell for their valuable insight regarding Nemo. We would like to thank Forschungszentrum Jülich for the computation time on their Blue Gene/Q system. This research has been partially funded by the CICYT under contracts No. TIN2012-34557 and TIN2015-65316-P.
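The reconstruction idea in the abstract above is that short call-stack segments (up to three frames) captured at different samples can be stitched back into a longer call path by their overlaps, much as sequence alignment stitches reads. A hedged sketch with a simple greedy overlap rule (the paper's actual multi-sequence alignment algorithm is more elaborate, and the function names here are invented):

```python
# Sketch of overlap-based call-path reconstruction from short
# call-stack segments. Greedy maximal suffix/prefix matching stands
# in for the paper's multi-sequence-alignment-inspired algorithm.

def stitch(segments):
    """Extend a call path left to right by the longest overlap
    between the path's suffix and each next segment's prefix."""
    path = list(segments[0])
    for seg in segments[1:]:
        best = 0
        for k in range(min(len(path), len(seg)), 0, -1):
            if path[-k:] == list(seg[:k]):
                best = k          # longest suffix/prefix overlap found
                break
        path.extend(seg[best:])   # append only the non-overlapping tail
    return path

# Three 3-frame samples (hypothetical function names) sharing
# two-frame overlaps reconstruct a five-frame call path.
segs = [("main", "solve", "dot"), ("solve", "dot", "axpy"),
        ("dot", "axpy", "daxpy_kernel")]
path = stitch(segs)
```

Capturing only three frames keeps per-sample cost low and portable, while the alignment recovers the deep paths needed to map measurements back to source code.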