Author: "Orti, Enrique" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Orti, Enrique"' showing total 440 results

1. Parallel Reduced Order Modeling for Digital Twins using High-Performance Computing Workflows

Author: de Parga, S. Ares, Bravo, J. R., Sibuet, N., Hernandez, J. A., Rossi, R., Boschert, Stefan, Quintana-Ortí, Enrique S., Tomás, Andrés E., Tatu, Cristian Cătălin, Vázquez-Novoa, Fernando, Ejarque, Jorge, and Badia, Rosa M.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The integration of Reduced Order Models (ROMs) with High-Performance Computing (HPC) is critical for developing digital twins, particularly for real-time monitoring and predictive maintenance of industrial systems. This paper describes a comprehensive, HPC-enabled workflow for developing and deploying projection-based ROMs (PROMs). We use PyCOMPSs' parallel framework to efficiently execute ROM training simulations, employing parallel Singular Value Decomposition (SVD) algorithms such as randomized SVD, Lanczos SVD, and full SVD based on Tall-Skinny QR. In addition, we introduce a partitioned version of the hyper-reduction scheme known as the Empirical Cubature Method. Despite the widespread use of HPC for PROMs, there is a significant lack of publications detailing comprehensive workflows for building and deploying end-to-end PROMs in HPC environments. Our workflow is validated through a case study focusing on the thermal dynamics of a motor. The PROM is designed to deliver a real-time prognosis tool that could enable rapid and safe motor restarts post-emergency shutdowns under different operating conditions for further integration into digital twins or control systems. To facilitate deployment, we use the HPC Workflow as a Service strategy and Functional Mock-Up Units to ensure compatibility and ease of integration across HPC, edge, and cloud environments. The outcomes illustrate the efficacy of combining PROMs and HPC, establishing a precedent for scalable, real-time digital twin applications across multiple industries.
Published: 2024

2. Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning

Author: Lei, Jie and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: This paper investigates the design of parallel general matrix multiplication (GEMM) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to port standard optimization techniques applied in the high-performance realization of GEMM on CPUs to the Versal ACAP. In particular, 1) we address the flexible exploitation of the Versal ACAP multi-level memory hierarchy; 2) we delve into the efficient use of the vector units in the AIE tiles, proposing an architecture-specific micro-kernel for mixed precision arithmetic to address the strong demand for adaptive-precision inference in deep learning; and 3) we introduce a parallel design for GEMM that spans multiple AIE tiles, enhancing the computational throughput. We conduct experimental profiling, with up to 32 AI Engines, that demonstrates the high parallel scalability of the solution.
Comment: 12 pages
Published: 2024

3. Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

Author: Ramírez, Cristian, Castelló, Adrián, Martínez, Héctor, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Hardware Architecture
Abstract: The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the matrix-matrix multiplication (GEMM) kernel on processors designed to operate at the edge. Our simulator adheres to the modern implementations of GEMM, advocated by GotoBLAS2, BLIS, OpenBLAS, etc., to carefully account for the amount of data transfers across the memory hierarchy of different algorithmic variants of the kernel. A small collection of experiments provides the necessary data to calibrate the simulator and deliver highly accurate estimations of the execution time for a given processor architecture.
Comment: 12 pages, 2 Tables, 6 Figures
Published: 2024
Full Text: View/download PDF

4. Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors

Author: Tomas, Andres E., Quintana-Orti, Enrique S., and Anzt, Hartwig
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, G.4.5
Abstract: We investigate the solution of low-rank matrix approximation problems using the truncated SVD. For this purpose, we develop and optimize GPU implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which can be assembled using numerical kernels from existing high-performance linear algebra libraries. Furthermore, the experiments with several sparse matrices arising in representative real-world applications and synthetic dense test matrices reveal a performance advantage of the block Lanczos algorithm when targeting the same approximation accuracy.
Comment: 16 pages, 4 figures
Published: 2024
Full Text: View/download PDF

5. GreenLightningAI: An Efficient AI System with Decoupled Structural and Quantitative Knowledge

Author: Duato, Jose, Mestre, Jose I., Dolz, Manuel F., and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, I.2, I.2.6
Abstract: The number and complexity of artificial intelligence (AI) applications are growing relentlessly. As a result, even with the many algorithmic and mathematical advances experienced over past decades as well as the impressive energy efficiency and computational capacity of current hardware accelerators, training the most powerful and popular deep neural networks comes at very high economic and environmental costs. Recognising that additional optimisations of conventional neural network training are very difficult, this work takes a radically different approach by proposing GreenLightningAI, a new AI system design consisting of a linear model that is capable of emulating the behaviour of deep neural networks by subsetting the model for each particular sample. The new AI system stores the information required to select the system subset for a given sample (referred to as structural information) separately from the linear model parameters (referred to as quantitative knowledge). In this paper we present a proof of concept, showing that the structural information stabilises far earlier than the quantitative knowledge. Additionally, we show experimentally that the structural information can be kept unmodified when re-training the AI system with new samples while still achieving a validation accuracy similar to that obtained when re-training a neural network with similar size. Since the proposed AI system is based on a linear model, multiple copies of the model, trained with different datasets, can be easily combined. This enables faster and greener (re)-training algorithms, including incremental re-training and federated incremental re-training.
Comment: 15 pages, 7 figures
Published: 2023

6. Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

Author: Alaejos, Guillermo, Castelló, Adrián, Alonso-Jordá, Pedro, Igual, Francisco D., Martínez, Héctor, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Computation and Language
Abstract: We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). In addition, we fully automate the generation process, by also leveraging the Apache TVM framework to derive a complete variety of the processor-specific micro-kernels for GEMM. This is in contrast with the convention in high performance libraries, which hand-encode a single micro-kernel per architecture using Assembly code. Overall, the combination of our TVM-generated blocked algorithms and micro-kernels for GEMM 1) improves portability, maintainability and, in general, streamlines the software life cycle; 2) provides high flexibility to easily tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par (or even superior for specific matrix shapes) with that of hand-tuned libraries; and 3) features a small memory footprint.
Comment: 35 pages, 22 figures. Submitted to ACM TOMS
Published: 2023

7. Co-Design of the Dense Linear Algebra Software Stack for Multicore Processors

Author: Martínez, Héctor, Catalán, Sandra, Igual, Francisco D., Herrero, José R., Rodríguez-Sánchez, Rafael, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: This paper advocates for an intertwined design of the dense linear algebra software stack that breaks down the strict barriers between the high-level, blocked algorithms in LAPACK (Linear Algebra PACKage) and the low-level, architecture-dependent kernels in BLAS (Basic Linear Algebra Subprograms). Specifically, we propose customizing the GEMM (general matrix multiplication) kernel, which is invoked from the blocked algorithms for relevant matrix factorizations in LAPACK, to improve performance on modern multicore processors with hierarchical cache memories. To achieve this, we leverage an analytical model to dynamically adapt the cache configuration parameters of the GEMM to the shape of the matrix operands. Additionally, we accommodate a flexible development of architecture-specific micro-kernels that allow us to further improve the utilization of the cache hierarchy. Our experiments on two platforms, equipped with ARM (NVIDIA Carmel, Neon) and x86 (AMD EPYC, AVX2) multi-core processors, demonstrate the benefits of this approach in terms of better cache utilization and, in general, higher performance. However, they also reveal the delicate balance between optimizing for multi-threaded parallelism versus cache usage.
Published: 2023

8. Toward matrix multiplication for deep learning inference on the Xilinx Versal

Author: Lei, Jie, Flich, José, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address this scenario by demonstrating that the principles underlying the modern realization of the general matrix multiplication (GEMM) in conventional processor architectures are also valid to achieve high performance for the type of operations that arise in deep learning (DL) on an exotic accelerator such as the AI Engine (AIE) tile embedded in Xilinx Versal platforms. In particular, our prototype implementation of the GEMM kernel, on a Xilinx Versal VCK190, delivers performance close to 86.7% of the theoretical peak that can be expected on an AIE tile, for 16-bit integer operands.
Comment: 11 pages
Published: 2023

9. Inference with Transformer Encoders on ARM and RISC-V Multicore Processors

Author: Martínez, Héctor, Igual, Francisco D., Rodríguez-Sánchez, Rafael, Catalán, Sandra, Castelló, Adrián, Quintana-Ortí, Enrique S., Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Carretero, Jesus, editor, Shende, Sameer, editor, Garcia-Blas, Javier, editor, Brandic, Ivona, editor, Olcoz, Katzalin, editor, and Schreiber, Martin, editor
Published: 2024
Full Text: View/download PDF

10. Tall-and-Skinny QR Factorization for Clusters of GPUs Using High-Performance Building Blocks

Author: Tomás, Andrés E., Quintana-Ortí, Enrique S., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Zeinalipour, Demetris, editor, Blanco Heras, Dora, editor, Pallis, George, editor, Herodotou, Herodotos, editor, Trihinas, Demetris, editor, Balouek, Daniel, editor, Diehl, Patrick, editor, Cojean, Terry, editor, Fürlinger, Karl, editor, Kirkeby, Maja Hanne, editor, Nardelli, Matteo, editor, and Di Sanzo, Pierangelo, editor
Published: 2024
Full Text: View/download PDF

11. Enabling Dynamic and Intelligent Workflows for HPC, Data Analytics, and AI Convergence

Author: Ejarque, Jorge, Badia, Rosa M., Albertin, Loïc, Aloisio, Giovanni, Baglione, Enrico, Becerra, Yolanda, Boschert, Stefan, Berlin, Julian R., D'Anca, Alessandro, Elia, Donatello, Exertier, François, Fiore, Sandro, Flich, José, Folch, Arnau, Gibbons, Steven J, Koldunov, Nikolay, Lordan, Francesc, Lorito, Stefano, Løvholt, Finn, Macías, Jorge, Marozzo, Fabrizio, Michelini, Alberto, Monterrubio-Velasco, Marisol, Pienkowska, Marta, de la Puente, Josep, Queralt, Anna, Quintana-Ortí, Enrique S., Rodríguez, Juan E., Romano, Fabrizio, Rossi, Riccardo, Rybicki, Jedrzej, Kupczyk, Miroslaw, Selva, Jacopo, Talia, Domenico, Tonini, Roberto, Trunfio, Paolo, and Volp, Manuela
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The evolution of High-Performance Computing (HPC) platforms enables the design and execution of progressively larger and more complex workflow applications in these systems. The complexity comes not only from the number of elements that compose the workflows but also from the type of computations they perform. While traditional HPC workflows target simulations and modelling of physical phenomena, current needs require in addition data analytics (DA) and artificial intelligence (AI) tasks. However, the development of these workflows is hampered by the lack of proper programming models and environments that support the integration of HPC, DA, and AI, as well as the lack of tools to easily deploy and execute the workflows in HPC systems. To progress in this direction, this paper presents use cases where complex workflows are required and investigates the main issues to be addressed for the HPC/DA/AI convergence. Based on this study, the paper identifies the challenges of a new workflow platform to manage complex workflows. Finally, it proposes a development approach for such a workflow platform addressing these challenges in two directions: first, by defining a software stack that provides the functionalities to manage these complex workflows; and second, by proposing the HPC Workflow as a Service (HPCWaaS) paradigm, which leverages the software stack to facilitate the reusability of complex workflows in federated HPC infrastructures. Proposals presented in this work are subject to study and development as part of the EuroHPC eFlows4HPC project.
Published: 2022
Full Text: View/download PDF

12. Insights into the anticancer photodynamic activity of Ir(III) and Ru(II) polypyridyl complexes bearing β-carboline ligands

Author: Sanz-Villafruela, Juan, Bermejo-Casadesus, Cristina, Zafon, Elisenda, Martínez-Alonso, Marta, Durá, Gema, Heras, Aranzazu, Soriano-Díaz, Iván, Giussani, Angelo, Ortí, Enrique, Tebar, Francesc, Espino, Gustavo, and Massaguer, Anna
Published: 2024
Full Text: View/download PDF

13. Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures

Author: Martínez, Héctor, Catalán, Sandra, Castelló, Adrián, and Quintana-Ortí, Enrique S.
Published: 2024
Full Text: View/download PDF

14. High performance and energy efficient inference for deep learning on ARM processors

Author: Castelló, Adrián, Barrachina, Sergio, Dolz, Manuel F., Quintana-Ortí, Enrique S., and Juan, Pau San
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, Computer Science - Performance
Abstract: We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several high-level transformations of the original framework, such as the development and integration of Cython routines to exploit thread-level parallelism; the design and development of micro-kernels for the matrix multiplication, vectorized with ARM's NEON intrinsics, that can accommodate layer fusion; and the appropriate selection of several cache configuration parameters tailored to the memory hierarchy of the target ARM processors. Our experiments evaluate both inference throughput (measured in processed images/s) and inference latency (i.e., time-to-response) as well as energy consumption per image when varying the level of thread parallelism and the processor power modes. The experiments with the new inference engine are reported for the ResNet50 v1.5 model on the ImageNet dataset from the MLPerf suite using the ARM v8.2 cores in the NVIDIA Jetson AGX Xavier board. These results show superior performance compared with the widely used TFLite from Google and slightly inferior results when compared with ArmNN, the native library from ARM for DNN inference.
Comment: 13 pages, 7 figures
Published: 2021

15. GEMM-Like Convolution for Deep Learning Inference on the Xilinx Versal

Author: Lei, Jie, Martínez, Héctor, Flich, José, Quintana-Ortí, Enrique S., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bienz, Amanda, editor, Weiland, Michèle, editor, Baboulin, Marc, editor, and Kruse, Carola, editor
Published: 2023
Full Text: View/download PDF

16. Performance Analysis of Convolution Algorithms for Deep Learning on Edge Processors

Author: Alonso-Jordá, Pedro, Martínez, Héctor, Quintana-Ortí, Enrique S., Ramírez, Cristian, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wyrzykowski, Roman, editor, Dongarra, Jack, editor, Deelman, Ewa, editor, and Karczewski, Konrad, editor
Published: 2023
Full Text: View/download PDF

17. Resiliency in Numerical Algorithm Design for Extreme Scale Simulations

Author: Agullo, Emmanuel, Altenbernd, Mirco, Anzt, Hartwig, Bautista-Gomez, Leonardo, Benacchio, Tommaso, Bonaventura, Luca, Bungartz, Hans-Joachim, Chatterjee, Sanjay, Ciorba, Florina M., DeBardeleben, Nathan, Drzisga, Daniel, Eibl, Sebastian, Engelmann, Christian, Gansterer, Wilfried N., Giraud, Luc, Goeddeke, Dominik, Heisig, Marco, Jezequel, Fabienne, Kohl, Nils, Li, Xiaoye Sherry, Lion, Romain, Mehl, Miriam, Mycek, Paul, Obersteiner, Michael, Quintana-Orti, Enrique S., Rizzi, Francesco, Ruede, Ulrich, Schulz, Martin, Fung, Fred, Speck, Robert, Stals, Linda, Teranishi, Keita, Thibault, Samuel, Thoennes, Dominik, Wagner, Andreas, and Wohlmuth, Barbara
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, D.4.5, G.4, G.1, D.4.4
Abstract: This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.
Comment: 45 pages, 3 figures, submitted to The International Journal of High Performance Computing Applications
Published: 2020

18. Compressed Basis GMRES on High Performance GPUs

Author: Aliaga, José I., Anzt, Hartwig, Grützmacher, Thomas, Quintana-Ortí, Enrique S., and Tomás, Andrés E.
Subjects: Computer Science - Mathematical Software
Abstract: Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in all current computer architectures, motivating the recent investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This paper introduces a new communication-reduction strategy for the (Krylov) GMRES solver that advocates for decoupling the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory access, the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a lower volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating point as well as fixed point formats with little impact on the convergence of the iterative process. We develop a high performance implementation of the "compressed basis GMRES" solver in the Ginkgo sparse linear algebra library and using a large set of test problems from the SuiteSparse matrix collection we demonstrate robustness and performance advantages on a modern NVIDIA V100 GPU of up to 50% over the standard GMRES solver that stores all data in IEEE double precision.
Published: 2020
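
The decoupling described in this abstract (a compact storage format for the basis, full precision for the arithmetic) can be illustrated with a minimal sketch. This is not the Ginkgo implementation and not the authors' code: the function names `normalize`, `compress` and `dot_compressed` are hypothetical, IEEE single precision stands in for the customized storage formats, and everything runs serially on the CPU.

```python
import math
from array import array


def normalize(v):
    """Scale v to unit 2-norm, so every entry is bounded by 1 in magnitude."""
    norm = math.sqrt(math.fsum(x * x for x in v))
    return [x / norm for x in v]


def compress(v):
    """Storage format: keep the basis vector in single precision (array type 'f')."""
    return array("f", v)


def dot_compressed(stored, w):
    """Arithmetic precision: widen each stored entry to double before multiplying."""
    return math.fsum(float(s) * x for s, x in zip(stored, w))


basis_vector = normalize([float(i) for i in range(1, 9)])
stored = compress(basis_vector)

# Reading the compressed vector moves fewer bytes than a double-precision copy,
# while the dot product is still accumulated in double precision.
bytes_compressed = stored.itemsize * len(stored)
bytes_double = array("d", basis_vector).itemsize * len(basis_vector)
self_dot = dot_compressed(stored, basis_vector)
```

Because the entries of a normalized vector are bounded by 1, the rounding introduced by the narrower storage type stays small relative to the result, which is the property the abstract relies on.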

19. Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing

Author: Anzt, Hartwig, Cojean, Terry, Flegar, Goran, Göbel, Fritz, Grützmacher, Thomas, Nayak, Pratik, Ribizel, Tobias, Tsai, Yuhsiang Mike, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Mathematical Software, D.2, G.1.3, G.4
Abstract: In this paper, we present Ginkgo, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, Ginkgo's design principle abstracts all functionality as "linear operators", motivating the notation of a "linear operator algebra library". Ginkgo's current focus is oriented towards providing sparse linear algebra functionality for high performance GPU architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific back ends and provide details on extensibility and sustainability measures. We also demonstrate Ginkgo's usability by providing examples on how to use its functionality inside the MFEM and deal.ii finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo's high performance on state-of-the-art GPU architectures.
Comment: Preprint submitted to ACM Transactions on Mathematical Software
Published: 2020

20. Reproducibility of Parallel Preconditioned Conjugate Gradient in Hybrid Programming Environments

Author: Iakymchuk, Roman, Barreda, Maria, Graillat, Stef, Aliaga, Jose I., and Quintana-Orti, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Mathematical Software
Abstract: The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility/non-associativity, secure them, and combine this with the sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver a reproducible number of iterations, residuals, direct errors, and vector-solutions with an overhead of less than 37.7% on 768 cores.
Published: 2020
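
The reproducibility problem this abstract refers to comes from the non-associativity of floating-point addition: the same values summed in a different order can round differently. The sketch below is an illustration only and does not use ExBLAS. It assumes Python's standard `math.fsum`, which tracks exact partial sums and rounds once at the end, as a serial stand-in for "preserving every bit until the final rounding"; the helper names `naive_sum` and `order_independent_sum` are hypothetical.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation: every addition rounds, so the order matters."""
    total = 0.0
    for x in values:
        total += x
    return total


def order_independent_sum(values):
    """Exact accumulation with a single final rounding (stand-in for ExBLAS)."""
    return math.fsum(values)


data = [1e16, 1.0, -1e16, 1.0]
shuffled = list(reversed(data))

# The naive results depend on the order in which the terms arrive, as they
# would when threads or MPI ranks combine partial sums in varying order.
naive_forward = naive_sum(data)
naive_backward = naive_sum(shuffled)

# The exactly accumulated results agree regardless of the order.
exact_forward = order_independent_sum(data)
exact_backward = order_independent_sum(shuffled)
```

In a parallel solver the ordering of partial sums varies between runs, which is why the dot products and reductions are the parts that need to be secured.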

21. High Performance and Portable Convolution Operators for ARM-based Multicore Processors

Author: Juan, Pablo San, Castelló, Adrián, Dolz, Manuel F., Alonso-Jordá, Pedro, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Performance, B.8, C.4, I.2, I.4
Abstract: The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.
Published: 2020
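
The baseline that this abstract improves on, an explicit IM2COL transform followed by a GEMM, can be sketched as follows. This is a minimal single-channel illustration in plain Python with unit stride and no padding, not the BLIS-based algorithm from the paper; the function names `im2col`, `gemm` and `conv2d_im2col` are hypothetical.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one column of a matrix."""
    h, w = len(image), len(image[0])
    oh, ow = h - kh + 1, w - kw + 1
    # Workspace with kh*kw rows and oh*ow columns: the memory cost noted above.
    cols = [[0.0] * (oh * ow) for _ in range(kh * kw)]
    for i in range(oh):
        for j in range(ow):
            for di in range(kh):
                for dj in range(kw):
                    cols[di * kw + dj][i * ow + j] = image[i + di][j + dj]
    return cols, oh, ow


def gemm(a, b):
    """Plain triple-loop matrix product C = A * B."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            aip = a[i][p]
            for j in range(n):
                c[i][j] += aip * b[p][j]
    return c


def conv2d_im2col(image, filters):
    """Convolution (cross-correlation, as in deep learning) via IM2COL + GEMM."""
    kh, kw = len(filters[0]), len(filters[0][0])
    cols, oh, ow = im2col(image, kh, kw)
    # One row per filter, flattened in the same order as the patch rows.
    fmat = [[f[di][dj] for di in range(kh) for dj in range(kw)] for f in filters]
    out = gemm(fmat, cols)
    # Reshape each output row back into an oh x ow feature map.
    return [[row[i * ow:(i + 1) * ow] for i in range(oh)] for row in out]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
filters = [[[1, 0], [0, 1]]]
feature_maps = conv2d_im2col(image, filters)
workspace, _, _ = im2col(image, 2, 2)
workspace_entries = sum(len(row) for row in workspace)
```

Even in this tiny case the unfolded workspace holds 16 entries for a 9-entry image, which shows why the intermediate matrix becomes a problem for large layers.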

22. DMR API: Improving cluster productivity by turning applications into malleable

Author: Iserte, Sergio, Mayo, Rafael, Quintana-Orti, Enrique S., Beltran, Vicenc, and Peña, Antonio J.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Adaptive workloads can change on-the-fly the configuration of their jobs, in terms of number of processes. In order to carry out these job reconfigurations, we have designed a methodology which enables a job to communicate with the resource manager and, through the runtime, to change its number of MPI ranks. The collaboration between both the workload manager (aware of the queue of jobs and the resource allocation) and the parallel runtime (able to transparently handle the processes and the program data) is crucial for our throughput-aware malleability methodology. Hence, when a job triggers a reconfiguration, the resource manager will check the cluster status and return an action: an expansion, if there are spare resources; a shrink, if queued jobs can be initiated; or none, if no change can improve the global productivity. In this paper, we describe the internals of our framework and how it is capable of reducing the global workload completion time along with providing a smarter usage of the underlying resources. For this purpose, we present a thorough study of the adaptive workloads processing by showing the detailed behavior of our framework in representative experiments and the low overhead that our reconfiguration involves.
Published: 2020
Full Text: View/download PDF

23. Exploiting nested task-parallelism in the $\mathcal{H}$-LU factorization

Author: Carratalá-Sáez, Rocío, Christophersen, Sven, Aliaga, José I., Beltran, Vicenç, Börm, Steffen, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Mathematical Software, Computer Science - Distributed, Parallel, and Cluster Computing, Mathematics - Numerical Analysis, 68W10, 65N38, 65F05
Abstract: We address the parallelization of the LU factorization of hierarchical matrices ($\mathcal{H}$-matrices) arising from boundary element methods. Our approach exploits task-parallelism via the OmpSs programming model and runtime, which discovers the data-flow parallelism intrinsic to the operation at execution time, via the analysis of data dependencies based on the memory addresses of the tasks' operands. This is especially challenging for $\mathcal{H}$-matrices, as the structures containing the data vary in dimension during the execution. We tackle this issue by decoupling the data structure from that used to detect dependencies. Furthermore, we leverage the support for weak operands and early release of dependencies, recently introduced in OmpSs-2, to accelerate the execution of parallel codes with nested task-parallelism and fine-grain tasks.
Published: 2019
Full Text: View/download PDF

24. Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

Author: Catalán, Sandra, Igual, Francisco D., Herrero, José R., Rodríguez-Sánchez, Rafael, and Quintana-Ortí, Enrique S.
Published: 2023
Full Text: View/download PDF

25. Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Author: Barrachina, Sergio, Castelló, Adrián, Dolz, Manuel F., Low, Tze Meng, Martínez, Héctor, Quintana-Ortí, Enrique S., Sridhar, Upasana, and Tomás, Andrés E.
Published: 2023
Full Text: View/download PDF

26. QR Factorization Using Malleable BLAS on Multicore Processors

Author: Castelló, Adrián, Catalán, Sandra, Igual, Francisco D., Quintana-Ortí, Enrique S., Rodríguez-Sánchez, Rafael, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Anzt, Hartwig, editor, Bienz, Amanda, editor, Luszczek, Piotr, editor, and Baboulin, Marc, editor
Published: 2022
Full Text: View/download PDF

27. Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

Author: Ramírez, Cristian, Castelló, Adrián, Martínez, Héctor, Quintana-Ortí, Enrique S., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Anzt, Hartwig, editor, Bienz, Amanda, editor, Luszczek, Piotr, editor, and Baboulin, Marc, editor
Published: 2022
Full Text: View/download PDF

28. Approximate Computing for Scientific Applications

Author: Anzt, Hartwig, Casas, Marc, Malossi, A. Cristiano I., Quintana-Ortí, Enrique S., Scheidegger, Florian, Zhuang, Sicong, Bosio, Alberto, editor, Ménard, Daniel, editor, and Sentieys, Olivier, editor
Published: 2022
Full Text: View/download PDF

29. Computational Modelling of Supramolecular Polymers

Author: Doncel-Giménez, Azahara, Calbo, Joaquín, Ortí, Enrique, Aragó, Juan, Aboudzadeh, M. Ali, editor, and Frontera, Antonio, editor
Published: 2022
Full Text: View/download PDF

30. Programming Parallel Dense Matrix Factorizations with Look-Ahead and OpenMP

Author: Catalán, Sandra, Castelló, Adrián, Igual, Francisco D., Rodríguez-Sánchez, Rafael, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Mathematical Software
Abstract: We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multithreaded version of BLAS. This approach is also different from the more sophisticated runtime-assisted implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified with a high level of abstraction, the actual implementation can be easily derived from them, paving the road to deriving a high performance implementation of a considerable fraction of LAPACK functionality on any multicore platform with an OpenMP-like runtime. Comment: 28 pages
Published: 2018

31. Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors

Author: Barrachina, Sergio, Dolz, Manuel F., San Juan, Pablo, and Quintana-Ortí, Enrique S.
Published: 2022
Full Text: View/download PDF

32. High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS

Author: Castelló, Adrián, Barrachina, Sergio, Dolz, Manuel F., Quintana-Ortí, Enrique S., Juan, Pau San, and Tomás, Andrés E.
Published: 2022
Full Text: View/download PDF

33. Look-Ahead in the Two-Sided Reduction to Compact Band Forms for Symmetric Eigenvalue Problems and the SVD

Author: Rodríguez-Sánchez, Rafael, Catalán, Sandra, Herrero, José R., Quintana-Ortí, Enrique S., and Tomás, Andrés E.
Subjects: Computer Science - Mathematical Software, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case we revisit the reduction to symmetric band form while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band form, replacing the conventional reduction method that produces a triangular-band output. In both cases, we describe algorithmic variants of the standard Level-3 BLAS-based procedures, enhanced with look-ahead, to overcome the performance bottleneck imposed by the panel factorization. Furthermore, our solutions employ an algorithmic block size that differs from the target bandwidth, illustrating the important performance benefits of this decision. Finally, we show that our alternative compact band form for the SVD is key to introducing an effective look-ahead strategy into the corresponding reduction procedure.
Published: 2017

34. A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization with Partial Pivoting

Author: Catalán, Sandra, Herrero, José R., Quintana-Ortí, Enrique S., Rodríguez-Sánchez, Rafael, and van de Geijn, Robert
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Mathematical Software, Computer Science - Performance
Abstract: We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the second task, and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on a six-core Intel Xeon processor show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.
Published: 2016

35. Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs

Author: Aliaga, José Ignacio, Anzt, Hartwig, Quintana-Ortí, Enrique S., Tomás, Andrés E., Tsai, Yuhsiang M., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Balis, Bartosz, editor, B. Heras, Dora, editor, Antonelli, Laura, editor, Bracciali, Andrea, editor, Gruber, Thomas, editor, Hyun-Wook, Jin, editor, Kuhn, Michael, editor, Scott, Stephen L., editor, Unat, Didem, editor, and Wyrzykowski, Roman, editor
Published: 2021
Full Text: View/download PDF

36. Machine learning for optimal selection of sparse triangular system solvers on GPUs

Author: Dufrechou, Ernesto, Ezzatti, Pablo, Freire, Manuel, and Quintana-Ortí, Enrique S.
Published: 2021
Full Text: View/download PDF

37. Multiprecision Block-Jacobi for Iterative Triangular Solves

Author: Goebel, Fritz, Anzt, Hartwig, Cojean, Terry, Flegar, Goran, Quintana-Ortí, Enrique S., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Malawski, Maciej, editor, and Rzadca, Krzysztof, editor
Published: 2020
Full Text: View/download PDF

38. Structure-Aware Calculation of Many-Electron Wave Function Overlaps on Multicore Processors

Author: Davidović, Davor, Quintana-Ortí, Enrique S., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wyrzykowski, Roman, editor, Deelman, Ewa, editor, Dongarra, Jack, editor, and Karczewski, Konrad, editor
Published: 2020
Full Text: View/download PDF

39. Architecture-Aware Optimization of an HEVC decoder on Asymmetric Multicore Processors

Author: Rodríguez-Sánchez, Rafael and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Multimedia, Computer Science - Performance
Abstract: Low-power asymmetric multicore processors (AMPs) attract considerable attention due to their appealing performance-power ratio for energy-constrained environments. However, these processors pose a significant programming challenge due to the integration of cores with different performance capabilities, asking for an asymmetry-aware scheduling solution that carefully distributes the workload. The recent HEVC standard, which offers several high-level parallelization strategies, is an important application that can benefit from an implementation tailored for the low-power AMPs present in many current mobile or hand-held devices. In this scenario, we present an architecture-aware implementation of an HEVC decoder that embeds a criticality-aware scheduling strategy tuned for a Samsung Exynos 5422 system-on-chip furnished with an ARM big.LITTLE AMP. The performance and energy efficiency of our solution are further enhanced by exploiting the NEON vector engine available in the ARM big.LITTLE architecture. Experimental results show real-time 1080p HEVC decoding at 24 frames/sec and a reduction in energy consumption of over 20%.
Published: 2016

40. Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors

Author: Catalán, Sandra, Herrero, José R., Igual, Francisco D., Rodríguez-Sánchez, Rafael, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Mathematical Software, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power hungry versus slow/energy efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM big.LITTLE processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach.
Published: 2015

41. Efficient update of determinants for many-electron wave function overlaps

Author: Alonso-Jordá, Pedro, Davidović, Davor, Sapunar, Marin, Herrero, José R., and Quintana-Ortí, Enrique S.
Published: 2021
Full Text: View/download PDF

42. Revisiting Conventional Task Schedulers to Exploit Asymmetry in ARM big.LITTLE Architectures for Dense Linear Algebra

Author: Costero, Luis, Igual, Francisco D., Olcoz, Katzalin, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Dealing with asymmetry in the architecture opens a plethora of questions from the perspective of scheduling task-parallel applications, and there exist early attempts to address this problem via ad-hoc strategies embedded into a runtime framework. In this paper we take a different path, which consists in addressing the complexity of the problem at the library level, via a few asymmetry-aware fundamental kernels, hiding the architecture heterogeneity from the task scheduler. For the specific domain of dense linear algebra, we show that this is not only possible but delivers much higher performance than a naive approach based on an asymmetry-oblivious scheduler. Furthermore, this solution also outperforms an ad-hoc asymmetry-aware scheduler furnished with sophisticated scheduling techniques.
Published: 2015

43. Performance and Energy Optimization of Matrix Multiplication on Asymmetric big.LITTLE Processors

Author: Catalán, Sandra, Igual, Francisco D., Mayo, Rafael, Piñuel, Luis, Quintana-Ortí, Enrique S., and Rodríguez-Sánchez, Rafael
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Asymmetric processors have emerged as an appealing technology for severely energy-constrained environments, especially in the mobile market where heterogeneity in applications is mainstream. In addition, given the growing interest in ultra low-power architectures for high performance computing, this type of platform is also being investigated in the road towards the implementation of energy-efficient high-performance scientific applications. In this paper, we propose a first step towards a complete implementation of the BLAS interface adapted to asymmetric ARM big.LITTLE processors, analyzing the trade-offs between performance and energy efficiency when compared to existing homogeneous (symmetric) multi-threaded BLAS implementations. Our experimental results reveal important gains in performance while maintaining the energy efficiency of homogeneous solutions by efficiently exploiting all the resources of the asymmetric processor. Comment: Presented at HiPEAC 2015, Amsterdam. Foundation of the Asymmetric BLIS implementation
Published: 2015

44. Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

Author: Catalán, Sandra, Igual, Francisco D., Mayo, Rafael, Rodríguez-Sánchez, Rafael, and Quintana-Ortí, Enrique S.
Subjects: Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Mathematical Software, Computer Science - Numerical Analysis
Abstract: Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
Published: 2015

45. Evaluating Asymmetric Multicore Systems-on-Chip using Iso-Metrics

Author: Chalios, Charalampos, Nikolopoulos, Dimitrios S., and Quintana-Orti, Enrique S.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, C.1.3, G.4
Abstract: The end of Dennard scaling has pushed power consumption into a first order concern for current systems, on par with performance. As a result, near-threshold voltage computing (NTVC) has been proposed as a potential means to tackle the limited cooling capacity of CMOS technology. Hardware operating in NTV consumes significantly less power, at the cost of lower frequency, and thus reduced performance, as well as increased error rates. In this paper, we investigate whether a low-power system-on-chip, consisting of ARM's asymmetric big.LITTLE technology, can be an alternative to conventional high performance multicore processors in terms of power/energy in an unreliable scenario. For our study, we use the Conjugate Gradient solver, an algorithm representative of the computations performed by a large range of scientific and engineering codes. Comment: Presented at HiPEAC EEHCO '15, 6 pages
Published: 2015

46. Cholesky and Gram-Schmidt Orthogonalization for Tall-and-Skinny QR Factorizations on Graphics Processors

Author: Tomás, Andrés E., Quintana-Ortí, Enrique S., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, and Yahyapour, Ramin, editor
Published: 2019
Full Text: View/download PDF

47. Reproducibility strategies for parallel Preconditioned Conjugate Gradient

Author: Iakymchuk, Roman, Barreda, Maria, Wiesenberger, Matthias, Aliaga, José I., and Quintana-Ortí, Enrique S.
Published: 2020
Full Text: View/download PDF

48. Improved Accuracy and Parallelism for MRRR-based Eigensolvers -- A Mixed Precision Approach

Author: Petschow, Matthias, Quintana-Orti, Enrique, and Bientinesi, Paolo
Subjects: Computer Science - Numerical Analysis, Computer Science - Mathematical Software
Abstract: The real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based on a reduction to tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR) is among the fastest methods. Although fast, the solvers based on MRRR do not deliver the same accuracy as competing methods like Divide & Conquer or the QR algorithm. In this paper, we demonstrate that the use of mixed precisions leads to improved accuracy of MRRR-based eigensolvers with limited or no performance penalty. As a result, we obtain eigensolvers that are not only equally or more accurate than the best available methods, but also, in most circumstances, faster and more scalable than the competition.
Published: 2013

49. Concurrent and Accurate RNA Sequencing on Multicore Platforms

Author: Martínez, Héctor, Tárraga, Joaquín, Medina, Ignacio, Barrachina, Sergio, Castillo, Maribel, Dopazo, Joaquín, and Quintana-Ortí, Enrique S.
Subjects: Quantitative Biology - Genomics, Computer Science - Distributed, Parallel, and Cluster Computing, Quantitative Biology - Quantitative Methods, D.1.3, J.3
Abstract: In this paper we introduce a novel parallel pipeline for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, named HPG-Aligner, leverages the speed of the Burrows-Wheeler Transform to map a large number of RNA fragments (reads) rapidly, as well as the accuracy of the Smith-Waterman algorithm, which is employed to deal with conflictive reads. The aligner is complemented with a careful strategy to detect splice junctions based on the division of RNA reads into short segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing useful information for the successful alignment of the complete reads. Experimental results on platforms with AMD and Intel multicore processors report the remarkable parallel performance of HPG-Aligner on short and long RNA reads; it outperforms a state-of-the-art aligner such as TopHat 2, built on top of Bowtie and Bowtie 2, in both execution time and sensitivity.
Published: 2013

50. Evaluating the NVIDIA Tegra Processor as a Low-Power Alternative for Sparse GPU Computations

Author: Aliaga, José I., Dufrechou, Ernesto, Ezzatti, Pablo, Quintana-Ortí, Enrique S., Barbosa, Simone Diniz Junqueira, Series editor, Chen, Phoebe, Series editor, Filipe, Joaquim, Series editor, Kotenko, Igor, Series editor, Sivalingam, Krishna M., Series editor, Washio, Takashi, Series editor, Yuan, Junsong, Series editor, Zhou, Lizhu, Series editor, Mocskos, Esteban, editor, and Nesmachnow, Sergio, editor
Published: 2018
Full Text: View/download PDF
