Descriptor: "D.1.3" - Searchworks@Jio Institute Digital Library Search Results

Your search for descriptor "D.1.3" returned 1,366 results.

1. Assembly of FETI dual operator using CUDA

Author: Homola, Jakub, Vavřík, Radim, Meca, Ondřej, Brzobohatý, Tomáš, and Říha, Lubomír
Subjects: Computer Science - Mathematical Software, D.1.3, G.1.3, G.4, I.3.1
Abstract: FETI is a numerical method used to solve engineering problems. It builds on the ideas of domain decomposition, which makes it highly scalable and capable of efficiently utilizing whole supercomputers. One of the most time-consuming parts of the FETI solver is the application of the dual operator F in every iteration of the solver. It is traditionally performed on the CPU using an implicit approach of applying the individual sparse matrices that form F right-to-left. Another approach is to apply the dual operator explicitly, which primarily involves a simple dense matrix-vector multiplication and can be efficiently performed on the GPU. However, this requires additional preprocessing on the CPU where the dense matrix is assembled, which makes the explicit approach beneficial only after hundreds of iterations are performed. In this paper, we use the GPU to accelerate the assembly process as well. This significantly shortens the preprocessing time, thus decreasing the number of solver iterations needed to make the explicit approach beneficial. With a proper configuration, we only need a few tens of iterations to achieve speedup relative to the implicit CPU approach. Compared to the CPU-only explicit approach, we achieved up to 10x speedup for the preprocessing and 25x for the application., Comment: 10 pages, 12 figures, submitted for review to PDSEC 2025 workshop, part of IPDPS 2025 conference
Published: 2025

2. Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems

Author: Wang, Wenyi, Gonthier, Maxime, Nookala, Poornima, Pan, Haochen, Foster, Ian, Raicu, Ioan, and Chard, Kyle
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3
Abstract: Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using Barcelona OpenMP Task Suite (BOTS) benchmarks. Results from the first and second advances demonstrate up to 1522.8x performance improvement compared to the original GNU OpenMP. Further improvements from lock-less load balancing show up to 4x improvement compared to GNU OpenMP using XQueue. Through a rich set of profiling and instrumentation tools, we are able to investigate the runtime behavior of GNU OpenMP and improve its performance on fine-grained tasks by many orders of magnitude., Comment: 13 pages, 11 figures, preprint, accepted by IPDPS2025
Published: 2025

3. Complementing an imperative process algebra with a rely/guarantee logic

Author: Middelburg, C. A.
Subjects: Computer Science - Logic in Computer Science, D.1.3, D.2.4, F.1.2, F.3.1
Abstract: This paper concerns the relation between imperative process algebra and rely/guarantee logic. An imperative process algebra is complemented by a rely/guarantee logic that can be used to reason about how data change in the course of a process. The imperative process algebra used is the extension of ACP (Algebra of Communicating Processes) that was used earlier in a paper about the relation between imperative process algebra and Hoare logic. A complementing rely/guarantee logic that concerns judgments of partial correctness is treated in detail. The adaptation of this logic to weak and strong total correctness is also addressed. A simple example is given that suggests that a rely/guarantee logic is more suitable as a complementing logic than a Hoare logic if interfering parallel processes are involved., Comment: 30 pages, Sections 2 and 3 of this paper are abridged versions of Sections 2 and 3 of arXiv:1906.04491
Published: 2025

4. Work-Efficient Parallel Non-Maximum Suppression Kernels

Author: Oro, David, Fernández, Carles, Martorell, Xavier, and Hernando, Javier
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3, I.4.8
Abstract: In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our parallel NMS algorithm is capable of clustering 1024 simultaneous detected objects per frame in roughly 1 ms on both NVIDIA Tegra X1 and NVIDIA Tegra X2 on-die GPUs, while taking 2 ms on NVIDIA Tegra K1. Furthermore, our proposed parallel greedy NMS algorithm yields a 14x-40x speed up when compared to state-of-the-art NMS methods that require learning a CNN from annotated data., Comment: Code: https://github.com/hertasecurity/gpu-nms
Published: 2025

5. Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

Author: Li, Yinghan, Li, Yifei, Zhang, Jiejing, Chen, Bujiao, Chen, Xiaotong, Duan, Lian, Jin, Yejun, Li, Zheng, Liu, Xuanyu, Wang, Haoyu, Wang, Wente, Wang, Yajie, Yang, Jiacheng, Zhang, Peiyang, Zheng, Laiwen, and Yu, Wenyuan
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, D.1.3, I.2.6
Abstract: It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on GPUs. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized and efficient CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on NVIDIA H800 GPU and 95% on NVIDIA H20 GPU., Comment: 11 pages
Published: 2025

6. FedAlign: Federated Domain Generalization with Cross-Client Feature Alignment

Author: Gupta, Sunny, Sutar, Vinay, Singh, Varunav, and Sethi, Amit
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Distributed, Parallel, and Cluster Computing, I.2.6, C.1.4, D.1.3, I.5.1, H.3.4, I.2.10, I.4.0, I.4.1, I.4.2, I.4.6, I.4.7, I.4.8, I.4.9, I.4.10, I.5.2, I.5.4, J.2, I.2.11
Abstract: Federated Learning (FL) offers a decentralized paradigm for collaborative model training without direct data sharing, yet it poses unique challenges for Domain Generalization (DG), including strict privacy constraints, non-i.i.d. local data, and limited domain diversity. We introduce FedAlign, a lightweight, privacy-preserving framework designed to enhance DG in federated settings by simultaneously increasing feature diversity and promoting domain invariance. First, a cross-client feature extension module broadens local domain representations through domain-invariant feature perturbation and selective cross-client feature transfer, allowing each client to safely access a richer domain space. Second, a dual-stage alignment module refines global feature learning by aligning both feature embeddings and predictions across clients, thereby distilling robust, domain-invariant features. By integrating these modules, our method achieves superior generalization to unseen domains while maintaining data privacy and operating with minimal computational and communication overhead., Comment: 9 pages, 4 figures
Published: 2025

7. Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

Author: Devarakonda, Aditya and Kannan, Ramakrishnan
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, 68W10, G.1.6, D.1.3
Abstract: Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D s-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) which attains a continuous performance trade-off between the two baseline algorithms. We present theoretical analysis which shows the convergence, computation, communication, and memory trade-offs between s-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of 5.3x over s-step SGD and speedups up to 121x over FedAvg when used to solve binary classification tasks using the convex, logistic regression model on datasets obtained from the LIBSVM repository.
Published: 2025

8. The B2Scala Tool: Integrating Bach in Scala with Security in Mind

Author: Ouardi, Doha, Barkallah, Manel, and Jacquet, Jean-Marie
Subjects: Computer Science - Programming Languages, Computer Science - Multiagent Systems, Computer Science - Symbolic Computation, D.1.3, D.2.4, D.3.3
Abstract: Process algebras have been widely used to verify security protocols in a formal manner. However, they mostly focus on synchronous communication based on the exchange of messages. We present an alternative approach relying on asynchronous communication obtained through information available on a shared space. More precisely, this paper first proposes an embedding in Scala of a Linda-like language, called Bach. It consists of a Domain Specific Language, internal to Scala, that allows us to experiment with programs developed in Bach while benefiting from the Scala ecosystem, in particular from its type system as well as program fragments developed in Scala. Moreover, we introduce a logic that allows us to restrict the executions of programs to those meeting logic formulae. Our work is illustrated on the Needham-Schroeder security protocol, for which we manage to automatically rediscover the man-in-the-middle attack first identified by G. Lowe., Comment: In Proceedings ICE 2024, arXiv:2412.07570
Published: 2024

9. A Gentle Overview of Asynchronous Session-based Concurrency: Deadlock Freedom by Typing

Author: Heuvel, Bas van den and Pérez, Jorge A.
Subjects: Computer Science - Programming Languages, Computer Science - Logic in Computer Science, D.1.3, D.2.4, D.3.1
Abstract: While formal models of concurrency tend to focus on synchronous communication, asynchronous communication is relevant in practice. In this paper, we will discuss asynchronous communication in the context of session-based concurrency, the model of computation in which session types specify the structure of the two-party protocols implemented by the channels of a communicating process. We overview recent work on addressing the challenge of ensuring the deadlock-freedom property for message-passing processes that communicate asynchronously in cyclic process networks governed by session types. We offer a gradual presentation of three typed process frameworks and outline how they may be used to guarantee deadlock freedom for a concurrent functional language with sessions., Comment: In Proceedings ICE 2024, arXiv:2412.07570
Published: 2024

10. Tensor-product vertex patch smoothers for biharmonic problems

Author: Witte, Julius, Cui, Cu, Bonizzoni, Francesca, and Kanschat, Guido
Subjects: Mathematics - Numerical Analysis, 65Y10, 65N55, 65N30, 74K20, G.1.8, D.1.3
Abstract: We discuss vertex patch smoothers as overlapping domain decomposition methods for fourth order elliptic partial differential equations. We show that they are numerically very efficient and yield high convergence rates. Furthermore, we discuss low rank tensor approximations for their efficient implementation. Our experiments demonstrate that the inexact local solver yields a method which converges fast and uniformly with respect to mesh refinement. The multiplicative smoother shows superior performance in terms of solution efficiency, requiring fewer iterations. However, in three-dimensional cases, the additive smoother outperforms its multiplicative counterpart due to the latter's lower potential for parallelism. Additionally, the solver infrastructure supports a mixed-precision approach, executing the multigrid preconditioner in single precision while performing the outer iteration in double precision, thereby increasing throughput by up to 70 percent.
Published: 2024

11. Cascaded Prediction and Asynchronous Execution of Iterative Algorithms on Heterogeneous Platforms

Author: Gao, Jianhua, Liu, Bingjie, Wang, Yizhuo, Ji, Weixing, and Huang, Hua
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Mathematical Software, 68-02, 68W10, 65F50, A.1, D.1.3, G.1.3
Abstract: Owing to the diverse scales and varying distributions of sparse matrices arising from practical problems, a multitude of choices are present in the design and implementation of sparse matrix-vector multiplication (SpMV). Researchers have proposed many machine learning-based optimization methods for SpMV. However, these efforts only support one area of sparse matrix format selection, SpMV algorithm selection, or parameter configuration, and rarely consider the large time overhead associated with feature extraction, model inference, and compression format conversion. This paper introduces a machine learning-based cascaded prediction method for SpMV computations that spans various computing stages and hierarchies. In addition, an asynchronous and concurrent computing model has been designed and implemented for runtime model prediction and iterative algorithm solving on heterogeneous computing platforms. It not only offers comprehensive support for the iterative algorithm-solving process leveraging machine learning technology, but also effectively mitigates the preprocessing overheads. Experimental results demonstrate that the cascaded prediction introduced in this paper accelerates SpMV by 1.33x on average, and the iterative algorithm, enhanced by cascaded prediction and asynchronous execution, is accelerated by 2.55x on average., Comment: 12 pages, 9 figures, 7 tables
Published: 2024

12. Precision-Aware Iterative Algorithms Based on Group-Shared Exponents of Floating-Point Numbers

Author: Gao, Jianhua, Shen, Jiayuan, Zhang, Yuxiang, Ji, Weixing, and Huang, Hua
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Mathematics - Numerical Analysis, 68-02, 68W10, 65F50, A.1, D.1.3, G.1.3
Abstract: Iterative solvers are frequently used in scientific applications and engineering computations. However, the memory-bound Sparse Matrix-Vector (SpMV) kernel computation hinders the efficiency of iterative algorithms. As modern hardware increasingly supports low-precision computation, the mixed-precision optimization of iterative algorithms has garnered widespread attention. Nevertheless, existing mixed-precision methods pose challenges, including format conversion overhead, tight coupling between storage and computation representation, and the need to store multiple precision copies of data. This paper proposes a floating-point representation based on the group-shared exponent and segmented storage of the mantissa, enabling higher bit utilization of the representation vector and fast switches between different precisions without needing multiple data copies. Furthermore, a stepped mixed-precision iterative algorithm is proposed. Our experimental results demonstrate that, compared with existing floating-point formats, our approach significantly improves iterative algorithms' performance and convergence residuals., Comment: 13 pages, 9 figures
Published: 2024

13. An Evaluation of Massively Parallel Algorithms for DFA Minimization

Author: Martens, Jan and Wijs, Anton
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Logic in Computer Science, D.1.3, F.4.3
Abstract: We study parallel algorithms for the minimization of Deterministic Finite Automata (DFAs). In particular, we implement four different massively parallel algorithms on Graphics Processing Units (GPUs). Our results confirm the expectations that the algorithm with the theoretically best time complexity is not practically suitable to run on GPUs due to the large amount of resources needed. We empirically verify that parallel partition refinement algorithms from the literature perform better in practice, even though their time complexity is worse. Lastly, we introduce a novel algorithm based on partition refinement with an extra parallel partial transitive closure step and show that on specific benchmarks it has better run-time complexity and performs better in practice., Comment: In Proceedings GandALF 2024, arXiv:2410.21884
Published: 2024

14. Final Report for CHESS: Cloud, High-Performance Computing, and Edge for Science and Security

Author: Tallent, Nathan, Strube, Jan, Guo, Luanzheng, Lee, Hyungro, Firoz, Jesun, Ghosh, Sayan, Fang, Bo, Bel, Oceane, Spurgeon, Steven, Akers, Sarah, Doty, Christina, and Cromwell, Erol
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Performance, Electrical Engineering and Systems Science - Systems and Control, C.2.4, C.4, D.1.3, J.2, K.6.4
Abstract: Automating the theory-experiment cycle requires effective distributed workflows that utilize a computing continuum spanning lab instruments, edge sensors, computing resources at multiple facilities, data sets distributed across multiple information sources, and potentially cloud. Unfortunately, the obvious methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets over time fail to achieve scientific requirements for performance, energy, security, and reliability. Furthermore, achieving the best use of continuum resources depends upon the efficient composition and execution of workflow tasks, i.e., combinations of numerical solvers, data analytics, and machine learning. Pacific Northwest National Laboratory's LDRD "Cloud, High-Performance Computing (HPC), and Edge for Science and Security" (CHESS) has developed a set of interrelated capabilities for enabling distributed scientific workflows and curating datasets. This report describes the results and successes of CHESS from the perspective of open science.
Published: 2024

15. A Study of Performance Portability in Plasma Physics Simulations

Author: Ruzicka, Josef, Asch, Christian, Meneses, Esteban, Rampp, Markus, and Laure, Erwin
Subjects: Physics - Plasma Physics, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance, Physics - Computational Physics, 68Q85, D.1.3
Abstract: The high-performance computing (HPC) community has recently seen a substantial diversification of hardware platforms and their associated programming models. From traditional multicore processors to highly specialized accelerators, vendors and tool developers back up the relentless progress of those architectures. In the context of scientific programming, it is fundamental to consider performance portability frameworks, i.e., software tools that allow programmers to write code once and run it on different computer architectures without sacrificing performance. We report here on the benefits and challenges of performance portability using a field-line tracing simulation and a particle-in-cell code, two relevant applications in computational plasma physics with applications to magnetically-confined nuclear-fusion energy research. For these applications we report performance results obtained on four HPC platforms with server-class CPUs from Intel (Xeon) and AMD (EPYC), and high-end GPUs from Nvidia and AMD, including the latest Nvidia H100 GPU and the novel AMD Instinct MI300A APU. Our results show that both Kokkos and OpenMP are powerful tools to achieve performance portability and decent "out-of-the-box" performance, even for the very latest hardware platforms. For our applications, Kokkos provided performance portability to the broadest range of hardware architectures from different vendors., Comment: 15 pages, 8 figures, this is a pre-print to be published in the Latin America High Performance Computing Conference (CARLA) 2024 proceedings
Published: 2024

16. MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices

Author: Hamdi, Mohamed Amine, Daghero, Francesco, Sarda, Giuseppe Maria, Van Delm, Josse, Symons, Arne, Benini, Luca, Verhelst, Marian, Pagliari, Daniele Jahier, and Burrello, Alessio
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, I.2.2, D.1.3
Abstract: Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. On the opposite side, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, resulting in the generation of general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework enhanced with hardware cost models can compete with and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite, MATCH reduces inference latency by up to 60.88 times on DIANA, compared to using the plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by 2.15 times compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergistically exploits the DNN accelerator and the eight-core cluster available on board., Comment: 13 pages, 11 figures, 4 tables
Published: 2024

17. Agent-based modeling for realistic reproduction of human mobility and contact behavior to evaluate test and isolation strategies in epidemic infectious disease spread

Author: Kerkmann, David, Korf, Sascha, Nguyen, Khoa, Abele, Daniel, Schengen, Alain, Gerstein, Carlotta, Göbbert, Jens Henrik, Basermann, Achim, Kühn, Martin J., and Meyer-Hermann, Michael
Subjects: Computer Science - Multiagent Systems, Computer Science - Distributed, Parallel, and Cluster Computing, Physics - Physics and Society, I.6.4, I.6.5, D.1.3
Abstract: Agent-based models have proven to be useful tools in supporting decision-making processes in different application domains. The advent of modern computers and supercomputers has enabled these bottom-up approaches to realistically model human mobility and contact behavior. The COVID-19 pandemic showcased the urgent need for detailed and informative models that can answer research questions on transmission dynamics. We present a sophisticated agent-based model to simulate the spread of respiratory diseases. The model is highly modularized and can be used on various scales, from a small collection of buildings up to cities or countries. Although it is not the focus of this paper, the model has undergone performance engineering on a single core and provides an efficient intra- and inter-simulation parallelization for time-critical decision-making processes. In order to allow answering research questions at individual-level resolution, nonpharmaceutical intervention strategies such as face masks or venue closures can be implemented for particular locations or agents. In particular, we allow for sophisticated testing and isolation strategies to study the effects of minimally invasive infectious disease mitigation. With realistic human mobility patterns for the region of Brunswick, Germany, we study the effects of different interventions between March 1 and May 30, 2021 in the SARS-CoV-2 pandemic. Our analyses suggest that symptom-independent testing has limited impact on the mitigation of disease dynamics if the dark figure in symptomatic cases is high. Furthermore, we found that quarantine length is more important than quarantine efficiency but that, with sufficient symptomatic control, even short quarantines can have a substantial effect., Comment: 35 pages, 13 figures, to be submitted to Elsevier
Published: 2024

18. FedStein: Enhancing Multi-Domain Federated Learning Through James-Stein Estimator

Author: Gupta, Sunny, Jangid, Nikita, and Sethi, Amit
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Distributed, Parallel, and Cluster Computing, I.2.6, C.1.4, D.1.3, I.5.1, H.3.4, I.2.10, I.4.0, I.4.1, I.4.2, I.4.6, I.4.7, I.4.8, I.4.9, I.4.10, I.5.2, I.5.4, J.2, I.2.11
Abstract: Federated Learning (FL) facilitates data privacy by enabling collaborative in-situ training across decentralized clients. Despite its inherent advantages, FL faces significant challenges of performance and convergence when dealing with data that is not independently and identically distributed (non-i.i.d.). While previous research has primarily addressed the issue of skewed label distribution across clients, this study focuses on the less explored challenge of multi-domain FL, where client data originates from distinct domains with varying feature distributions. We introduce a novel method designed to address these challenges, FedStein: Enhancing Multi-Domain Federated Learning Through the James-Stein Estimator. FedStein uniquely shares only the James-Stein (JS) estimates of batch normalization (BN) statistics across clients, while maintaining local BN parameters. The non-BN layer parameters are exchanged via standard FL techniques. Extensive experiments conducted across three datasets and multiple models demonstrate that FedStein surpasses existing methods such as FedAvg and FedBN, with accuracy improvements exceeding 14% in certain domains, leading to enhanced domain generalization. The code is available at https://github.com/sunnyinAI/FedStein, Comment: 12 pages, 2 figures. Accepted at International Workshop on Federated Foundation Models In Conjunction with NeurIPS 2024 (FL@FM-NeurIPS'24)
Published: 2024

19. FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch

Author: Gupta, Sunny, Jindal, Mohit, Kashyap, Pankhi, Jeevan, Pranav, and Sethi, Amit
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Distributed, Parallel, and Cluster Computing, Mathematics - Optimization and Control, I.2.6, C.1.4, D.1.3, I.5.1, H.3.4
Abstract: Federated learning faces a critical challenge in balancing communication efficiency with rapid convergence, especially for second-order methods. While Newton-type algorithms achieve linear convergence in communication rounds, transmitting full Hessian matrices is often impractical due to quadratic complexity. We introduce Federated Learning with Enhanced Nesterov-Newton Sketch (FLeNS), a novel method that harnesses both the acceleration capabilities of Nesterov's method and the dimensionality reduction benefits of Hessian sketching. FLeNS approximates the centralized Newton's method without relying on the exact Hessian, significantly reducing communication overhead. By combining Nesterov's acceleration with adaptive Hessian sketching, FLeNS preserves crucial second-order information while retaining the rapid convergence characteristics. Our theoretical analysis, grounded in statistical learning, demonstrates that FLeNS achieves super-linear convergence rates in communication rounds - a notable advancement in federated optimization. We provide rigorous convergence guarantees and characterize tradeoffs between acceleration, sketch size, and convergence speed. Extensive empirical evaluation validates our theoretical findings, showcasing FLeNS's state-of-the-art performance with reduced communication requirements, particularly in privacy-sensitive and edge-computing scenarios. The code is available at https://github.com/sunnyinAI/FLeNS, Comment: 10 pages, 3 figures, 2 Tables
Published: 2024

20. Handling expression evaluation under interference

Author: Hayes, Ian J., Jones, Cliff B., and Meinicke, Larissa A.
Subjects: Computer Science - Logic in Computer Science, Computer Science - Software Engineering, F.3.1, D.1.3
Abstract: Hoare-style inference rules for program constructs permit the copying of expressions and tests from program text into logical contexts. It is known that this requires care even for sequential programs but further issues arise for concurrent programs because of potential interference to the values of variables. The "rely-guarantee" approach does tackle the issue of recording acceptable interference and offers a way to provide safe inference rules. This paper shows how the algebraic presentation of rely-guarantee ideas can clarify and formalise the conditions for safely re-using expressions and tests from program text in logical contexts for reasoning about programs., Comment: 17 pages, 1 figure
Published: 2024

21. DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

Author: Gonzalez-Escribano, Arturo, García-Álvarez, Diego, and Cámara, Jesús
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, K.3.2, D.1.3
Abstract: We present an assignment for a full Parallel Computing course. Since 2017/2018, we have proposed a different problem each academic year to illustrate various methodologies for approaching the same computational problem using different parallel programming models. They are designed to be parallelized using shared-memory programming with OpenMP, distributed-memory programming with MPI, and GPU programming with CUDA or OpenCL. The problem chosen for this year implements a brute-force solution for exact DNA sequence alignment of multiple patterns. The program searches for exact coincidences of multiple nucleotide strings in a long DNA sequence. The sequential implementation is designed to be clear and understandable to students while offering many opportunities for parallelization and optimization. This assignment addresses key concepts many students find difficult to apply in practical scenarios: race conditions, reductions, collective operations, and point-to-point communications. It also covers the problem of parallel generation of pseudo-random sequences and strategies to notify and stop speculative computations when matches are found. This assignment serves as an exercise that reinforces basic knowledge and prepares students for more complex parallel computing concepts and structures. It has been successfully implemented as a practical assignment in a Parallel Computing course in the third year of a Computer Engineering degree program. Supporting materials for this and previous assignments in this series are publicly available., Comment: 3 pages, 1 figure, 1 artifact and reproducibility appendix. Accepted for presentation at EduHPC-24: Workshop on Education for High-Performance Computing, to be held during Supercomputing 2024 conference
Published: 2024

22. Conversational Concurrency

Author: Garnock-Jones, Tony
Subjects: Computer Science - Programming Languages, D.3.3, D.3.1, D.1.3, D.1.1, D.4.7, E.1
Abstract: Concurrent computations resemble conversations. In a conversation, participants direct utterances at others and, as the conversation evolves, exploit the known common context to advance the conversation. Similarly, collaborating software components share knowledge with each other in order to make progress as a group towards a common goal. This dissertation studies concurrency from the perspective of cooperative knowledge-sharing, taking the conversational exchange of knowledge as a central concern in the design of concurrent programming languages. In doing so, it makes five contributions: 1. It develops the idea of a common dataspace as a medium for knowledge exchange among concurrent components, enabling a new approach to concurrent programming. While dataspaces loosely resemble both "fact spaces" from the world of Linda-style languages and Erlang's collaborative model, they significantly differ in many details. 2. It offers the first crisp formulation of cooperative, conversational knowledge-exchange as a mathematical model. 3. It describes two faithful implementations of the model for two quite different languages. 4. It proposes a completely novel suite of linguistic constructs for organizing the internal structure of individual actors in a conversational setting. The combination of dataspaces with these constructs is dubbed Syndicate. 5. It presents and analyzes evidence suggesting that the proposed techniques and constructs combine to simplify concurrent programming. The dataspace concept stands alone in its focus on representation and manipulation of conversational frames and conversational state and in its integral use of explicit epistemic knowledge. The design is particularly suited to integration of general-purpose I/O with otherwise-functional languages, but also applies to actor-like settings more generally., Comment: PhD dissertation
Published: 2024

23. Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL

Author: de Castro, Manuel, Andújar, Francisco J., Osorio, Roberto R., Carratalá-Sáez, Rocío, and Llanos, Diego R.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance, D.1.3
Abstract: As the interest in FPGA-based accelerators for HPC applications increases, new challenges also arise, especially concerning different programming and portability issues. This paper aims to provide a snapshot of the current state of the FPGA tooling and its problems. To do so, we evaluate the performance portability of two frameworks for developing FPGA solutions for HPC (SYCL and OpenCL) when using them to port a highly-parallel application to FPGAs, using both ND-range and single-task type of kernels. The developer's general recommendation when using FPGAs is to develop single-task kernels for them, as they are commonly regarded as more suited for such hardware. However, we discovered that, when using high-level approaches such as OpenCL and SYCL to program a highly-parallel application with no FPGA-tailored optimizations, ND-range kernels significantly outperform single-task codes. Specifically, while SYCL struggles to produce efficient FPGA implementations of applications described as single-task codes, its performance excels with ND-range kernels, a result that was unexpectedly favorable.
Published: 2024

24. Stream parallel skeleton optimization

Author: Aldinucci, Marco and Danelutto, Marco
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3, D.3.2, C.1.3
Abstract: We discuss the properties of the composition of stream parallel skeletons such as pipelines and farms. By looking at the ideal performance figures assumed to hold for these skeletons, we show that any stream parallel skeleton composition can always be rewritten into an equivalent "normal form" skeleton composition, delivering a service time which is equal to or even better than the service time of the original skeleton composition, and achieving a better utilization of the processors used. The normal form is defined as a single farm built around a sequential worker code. Experimental results are discussed that validate this normal form.
Published: 2024

25. Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures

Author: Chillarón, Mónica, Quintana-Ortí, Gregorio, Vidal, Vicente, and Martinsson, Per-Gunnar
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance, 68-04, 68W10, 15-04, G.1.3, G.4, C.4, D.1.3, F.2.1
Abstract: Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined, inconsistent, or both. In such cases, one generally seeks to compute the least squares solution that minimizes the residual of the problem, which can be further defined as the solution with smallest norm in cases where the coefficient matrix has a nontrivial nullspace. This work presents several new techniques for solving least squares problems involving coefficient matrices that are so large that they do not fit in main memory. The implementations include both CPU and GPU variants. All techniques rely on complete orthogonal decompositions that guarantee that both conditions of a least squares solution are met, regardless of the rank properties of the matrix. Specifically, they rely on the recently proposed "randUTV" algorithm that is particularly effective in strongly communication-constrained environments. A detailed precision and performance study reveals that the new methods, that operate on data stored on disk, are competitive with state-of-the-art methods that store all data in main memory., Comment: 26 pages, 12 figures
Published: 2024

26. Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach

Author: Xu, Yao and Cooperman, Gene
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3
Abstract: MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component in any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, a decade ago, suffered from two difficult problems: (i) reliance on a specific MPI implementation tied to a specific network technology; and (ii) failure to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was already solved in 2019 by MANA's introduction of split processes. Problem (ii) (efficient runtime overhead) is solved in this work. This paper introduces an approach that avoids these limitations, employing a novel topological sort to algorithmically determine a safe future synchronization point. The algorithm is valid for both blocking and non-blocking collective communication in MPI. We demonstrate the efficacy and scalability of our approach through both micro-benchmarks and a set of five real-world MPI applications, notably including the widely used VASP (Vienna Ab Initio Simulation Package), which is responsible for 11% of the workload on the Perlmutter supercomputer at Lawrence Berkeley National Laboratory. VASP was previously cited as a special challenge for checkpointing, in part due to its multi-algorithm codes., Comment: 22 pages, 9 figures and 1 table, accepted to IEEE Cluster'24
Published: 2024

27. Parallel Strategies for Best-First Generalized Planning

Author: Fernández-Alburquerque, Alejandro and Segovia-Aguas, Javier
Subjects: Computer Science - Artificial Intelligence, I.2.8, D.1.3
Abstract: In recent years, there has been renewed interest in closing the performance gap between state-of-the-art planning solvers and generalized planning (GP), a research area of AI that studies the automated synthesis of algorithmic-like solutions capable of solving multiple classical planning instances. One of the current advancements has been the introduction of Best-First Generalized Planning (BFGP), a GP algorithm based on a novel solution space that can be explored with heuristic search, one of the foundations of modern planners. This paper evaluates the application of parallel search techniques to BFGP, another critical component in closing the performance gap. We first discuss why BFGP is well suited for parallelization and some of its differentiating characteristics from classical planners. Then, we propose two simple shared-memory parallel strategies with good scaling with the number of cores., Comment: 3 pages
Published: 2024

28. Scalable Dual Coordinate Descent for Kernel Methods

Author: Shao, Zishan and Devarakonda, Aditya
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, 65Y05, D.1.3, G.4, F.2.1
Abstract: Dual Coordinate Descent (DCD) and Block Dual Coordinate Descent (BDCD) are important iterative methods for solving convex optimization problems. In this work, we develop scalable DCD and BDCD methods for the kernel support vector machines (K-SVM) and kernel ridge regression (K-RR) problems. On distributed-memory parallel machines the scalability of these methods is limited by the need to communicate every iteration. On modern hardware where communication is orders of magnitude more expensive, the running time of the DCD and BDCD methods is dominated by communication cost. We address this communication bottleneck by deriving $s$-step variants of DCD and BDCD for solving the K-SVM and K-RR problems, respectively. The $s$-step variants reduce the frequency of communication by a tunable factor of $s$ at the expense of additional bandwidth and computation. The $s$-step variants compute the same solution as the existing methods in exact arithmetic. We perform numerical experiments to illustrate that the $s$-step variants are also numerically stable in finite-arithmetic, even for large values of $s$. We perform theoretical analysis to bound the computation and communication costs of the newly designed variants, up to leading order. Finally, we develop high performance implementations written in C and MPI and present scaling experiments performed on a Cray EX cluster. The new $s$-step variants achieved strong scaling speedups of up to $9.8\times$ over existing methods using up to $512$ cores.
Published: 2024

29. Vahana.jl -- A framework (not only) for large-scale agent-based models

Author: Fürst, Steffen, Conrad, Tim, Jaeger, Carlo, and Wolf, Sarah
Subjects: Computer Science - Multiagent Systems, Computer Science - Distributed, Parallel, and Cluster Computing, 37E25, D.1.3, I.6.5, J.4
Abstract: Agent-based models (ABMs) offer a powerful framework for understanding complex systems. However, their computational demands often become a significant barrier as the number of agents and complexity of the simulation increase. Traditional ABM platforms often struggle to fully exploit modern computing resources, hindering the development of large-scale simulations. This paper presents Vahana.jl, a high performance computing open source framework that aims to address these limitations. Building on the formalism of synchronous graph dynamical systems, Vahana.jl is especially well suited for models with a focus on (social) networks. The framework seamlessly supports distribution across multiple compute nodes, enabling simulations that would otherwise be beyond the capabilities of a single machine. Implemented in Julia, Vahana.jl leverages the interactive Read-Eval-Print Loop (REPL) environment, facilitating rapid model development and experimentation.
Published: 2024

30. Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Author: Pekkilä, Johannes, Lappi, Oskar, Robertsén, Fredrik, and Korpi-Lagg, Maarit J.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, 65Y10 (Primary), 35-04, 76W05, 85-08 (Secondary), G.4, I.3.1, G.1.8, D.1.3, D.3.4, J.2
Abstract: Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent introduction of AMD-manufactured graphics processors to the world's fastest supercomputers, tuning strategies established for previous hardware generations must be re-evaluated. In this study, we evaluate the performance and energy efficiency of stencil computations on modern datacenter graphics processors, and propose a tuning strategy for fusing cache-heavy stencil kernels. The studied cases comprise both synthetic and practical applications, which involve the evaluation of linear and nonlinear stencil functions in one to three dimensions. Our experiments reveal that AMD and Nvidia graphics processors exhibit key differences in both hardware and software, necessitating platform-specific tuning to reach their full computational potential., Comment: 21 pages, 14 figures. Submitted to Concurrency and Computation: Practice and Experience
Published: 2024

31. Porting the grid-based 3D+3V hybrid-Vlasov kinetic plasma simulation Vlasiator to heterogeneous GPU architectures

Author: Battarbee, Markus, Papadakis, Konstantinos, Ganse, Urs, Hokkanen, Jaro, Kotipalo, Leo, Pfau-Kempf, Yann, Alho, Markku, and Palmroth, Minna
Subjects: Physics - Computational Physics, Physics - Plasma Physics, Physics - Space Physics, J.2, D.1.3
Abstract: Vlasiator is a space plasma simulation code which models near-Earth ion-kinetic dynamics in three spatial and three velocity dimensions. It is highly parallelized, modeling the Vlasov equation directly through the distribution function, discretized on a Cartesian grid, instead of the more common particle-in-cell approach. Modeling near-Earth space, plasma properties span several orders of magnitude in temperature, density, and magnetic field strength. In order to fit the required six-dimensional grids in memory, Vlasiator utilizes a sparse block-based velocity mesh, where chunks of velocity space are added or deleted based on the advection requirements of the Vlasov solver. In addition, the spatial mesh is adaptively refined through cell-based octree refinement. In this paper, we describe the design choices of porting Vlasiator to heterogeneous CPU/GPU architectures. We detail the memory management, algorithmic changes, and kernel construction as well as our unified codebase approach, resulting in portability to both NVIDIA and AMD hardware (CUDA and HIP languages, respectively). In particular, we showcase a highly parallel block adjustment approach allowing efficient re-ordering of a sparse velocity mesh. We detail pitfalls we have overcome and lay out a plan for optimization to facilitate future exascale simulations using multi-node GPU supercomputing., Comment: 31 pages, 12 figures, submitted to ASTRONUM 2024 - the 16th International Conference on Numerical Modeling of Space Plasma Flows
Published: 2024

32. Construction of a Byzantine Linearizable SWMR Atomic Register from SWSR Atomic Registers

Author: Kshemkalyani, Ajay D., Piduguralla, Manaswini, Peri, Sathya, and Misra, Anshuman
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, C.2.4, D.1.3
Abstract: The SWMR atomic register is a fundamental building block in shared memory distributed systems and implementing it from SWSR atomic registers is an important problem. While this problem has been solved in crash-prone systems, it has received less attention in Byzantine systems. Recently, Hu and Toueg gave such an implementation of the SWMR register from SWSR registers. While their definition of register linearizability is consistent with the definition of Byzantine linearizability of a concurrent history of Cohen and Keidar, it has these drawbacks. (1) If the writer is Byzantine, the register is linearizable no matter what values the correct readers return. (2) It ignores values written consistently by a Byzantine writer. We need a stronger notion of a correct write operation. (3) It allows a value written to just one or a few readers' SWSR registers to be returned, thereby not validating the intention of the writer to write that value honestly. (4) Its notion of a "current" value returned by a correct reader is not related to the most recent value written by a correct write operation of a Byzantine writer. We need a more up-to-date version of the value that can be returned by a correct reader. In this paper, we give a stronger definition of a Byzantine linearizable register that overcomes the above drawbacks. Then we give a construction of a Byzantine linearizable SWMR atomic register from SWSR registers that meets our stronger definition. The construction is correct when $n>3f$, where $n$ is the number of readers, $f$ is the maximum number of Byzantine readers, and the writer can also be Byzantine. The construction relies on a public-key infrastructure., Comment: 18 pages
Published: 2024

33. Local Adjoints for Simultaneous Preaccumulations with Shared Inputs

Author: Blühdorn, Johannes and Gauger, Nicolas R.
Subjects: Computer Science - Mathematical Software, D.1.3, G.1.4, G.4, J.2
Abstract: In shared-memory parallel automatic differentiation, inputs that are shared among simultaneous thread-local preaccumulations lead to data races if Jacobians are accumulated with a single, shared vector of adjoint variables. In this work, we discuss the benefits and tradeoffs of re-enabling such preaccumulations by a transition to suitable local adjoints. We propose different vector- and map-based approaches for storing local adjoint variables and analyze them with respect to memory consumption, memory allocation, and adjoint variable access times in the context of simultaneous preaccumulations in multiple threads. We implement the approaches in CoDiPack and benchmark them in parallel discrete adjoint computations in the multiphysics simulation suite SU2., Comment: 12 pages, 5 figures. Updated and extended all parts of the paper
Published: 2024

34. Restructuring a concurrent refinement algebra

Author: Hayes, Ian J., Meinicke, Larissa A., and Evangelou-Oost, Naso
Subjects: Computer Science - Logic in Computer Science, F.3.1, D.1.3
Abstract: The concurrent refinement algebra has been developed to support rely/guarantee reasoning about concurrent programs. The algebra supports atomic commands and defines parallel composition as a synchronous operation, as in Milner's SCCS. In order to allow specifications to be combined, the algebra also provides a weak conjunction operation, which is also a synchronous operation that shares many properties with parallel composition. The three main operations, sequential composition, parallel composition and weak conjunction, all respect a (weak) quantale structure over a lattice of commands. Further structure involves combinations of pairs of these operations: sequential/parallel, sequential/weak conjunction and parallel/weak conjunction, each pair satisfying a weak interchange law similar to Concurrent Kleene Algebra. Each of these pairs satisfies a common biquantale structure. Additional structure is added via compatible sets of commands, including tests, atomic commands and pseudo-atomic commands. These allow stronger (equality) interchange and distributive laws. This paper describes the result of restructuring the algebra to better exploit these commonalities. The algebra is implemented in Isabelle/HOL.
Published: 2024

35. Data reification in a concurrent rely-guarantee algebra

Author: Meinicke, Larissa A., Hayes, Ian J., and Jones, Cliff B.
Subjects: Computer Science - Logic in Computer Science, Computer Science - Software Engineering, F.3.1, D.1.3
Abstract: Specifications of significant systems can be made short and perspicuous by using abstract data types; data reification can provide a clear, stepwise, development history of programs that use more efficient concrete representations. Data reification (or "refinement") techniques for sequential programs are well established. This paper applies these ideas to concurrency, in particular, an algebraic theory supporting rely-guarantee reasoning about concurrency. A concurrent version of the Galler-Fischer equivalence relation data structure is used as an example.
Published: 2024

36. Hybrid parallel discrete adjoints in SU2

Author: Blühdorn, Johannes, Gomes, Pedro, Aehle, Max, and Gauger, Nicolas R.
Subjects: Computer Science - Mathematical Software, D.1.3, G.1.4, G.4, J.2
Abstract: The open-source multiphysics suite SU2 features discrete adjoints by means of operator overloading automatic differentiation (AD). While both primal and discrete adjoint solvers support MPI parallelism, hybrid parallelism using both MPI and OpenMP has only been introduced for the primal solvers so far. In this work, we enable hybrid parallel discrete adjoint solvers. Coupling SU2 with OpDiLib, an add-on for operator overloading AD tools that extends AD to OpenMP parallelism, marks a key step in this endeavour. We identify the affected parts of SU2's advanced AD workflow and discuss the required changes and their tradeoffs. Detailed performance studies compare MPI parallel and hybrid parallel discrete adjoints in terms of memory and runtime and unveil key performance characteristics. We showcase the effectiveness of performance optimizations and highlight perspectives for future improvements. At the same time, this study demonstrates the applicability of OpDiLib in a large code base and its scalability on large test cases, providing valuable insights for future applications both within and beyond SU2., Comment: 28 pages, 9 figures, 2 listings; new layout, revised section structure, polishing and small updates
Published: 2024

37. A Systematic Literature Survey of Sparse Matrix-Vector Multiplication

Author: Gao, Jianhua, Liu, Bingjie, Ji, Weixing, and Huang, Hua
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, 68-02, 68W10, 65F50, A.1, D.1.3, G.1.3
Abstract: Sparse matrix-vector multiplication (SpMV) is a crucial computing kernel with widespread applications in iterative algorithms. Over the past decades, research on SpMV optimization has made remarkable strides, giving rise to various optimization contributions. However, a comprehensive and systematic literature survey that introduces, analyzes, discusses, and summarizes the advancements of SpMV in recent years is currently lacking. Aiming to fill this gap, this paper compares existing techniques and analyzes their strengths and weaknesses. We begin by highlighting two representative applications of SpMV, then conduct an in-depth overview of the important techniques that optimize SpMV on modern architectures, which we specifically classify as classic, auto-tuning, machine learning, and mixed-precision-based optimization. We also elaborate on the hardware-based architectures, including CPU, GPU, FPGA, processing-in-memory, heterogeneous, and distributed platforms. We present a comprehensive experimental evaluation that compares the performance of state-of-the-art SpMV implementations. Based on our findings, we identify several challenges and point out future research directions. This survey is intended to provide researchers with a comprehensive understanding of SpMV optimization on modern architectures and provide guidance for future work., Comment: 34 pages, 18 figures, 16 tables
Published: 2024

38. How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures

Author: von Geijer, Kåre and Tsigas, Philippas
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3, E.1
Abstract: The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of elastic relaxation and consequently present the Lateral structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the Lateral, we design novel elastically relaxed, lock-free queues and stacks capable of reconfiguring relaxation during run time. We establish linearizability and define upper bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs hold up against state-of-the-art statically relaxed designs, while also swiftly managing trade-offs between relaxation and operational latency. We also outline how to use the Lateral to design elastically relaxed lock-free counters and deques.
Published: 2024

39. Reasoning about distributive laws in a concurrent refinement algebra

Author: Meinicke, Larissa A. and Hayes, Ian J.
Subjects: Computer Science - Logic in Computer Science, F.3.1, D.1.3
Abstract: Distributive laws are important for algebraic reasoning in arithmetic and logic. They are equally important for algebraic reasoning about concurrent programs. In existing theories such as Concurrent Kleene Algebra, only partial correctness is handled, and many of its distributive laws are weak, in the sense that they are only refinements in one direction, rather than equalities. The focus of this paper is on strengthening our theory to support the proof of strong distributive laws that are equalities, and in doing so come up with laws that are quite general. Our concurrent refinement algebra supports total correctness by allowing both finite and infinite behaviours. It supports the rely/guarantee approach of Jones by encoding rely and guarantee conditions as rely and guarantee commands. The strong distributive laws may then be used to distribute rely and guarantee commands over sequential compositions and into (and out of) iterations. For handling data refinement of concurrent programs, strong distributive laws are essential., Comment: 20 pages, 1 Figure
Published: 2024

40. Rhizomes and Diffusions for Processing Highly Skewed Graphs on Fine-Grain Message-Driven Systems

Author: Chandio, Bibrak Qamar, Srivastava, Prateek, Brodowicz, Maciej, Swany, Martin, and Sterling, Thomas
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Data Structures and Algorithms, C.1.4, C.3, C.4, D.1.3
Abstract: The paper provides a unified co-design of 1) a programming and execution model that allows spawning tasks from within the vertex data at runtime, 2) language constructs for actions that send work to where the data resides, combining parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives, and 3) an innovative vertex-centric data structure, using the concept of Rhizomes, that parallelizes both the out- and in-degree load of vertex objects across many cores and yet provides a single programming abstraction to the vertex objects. The data structure hierarchically parallelizes the out-degree load of vertices and the in-degree load laterally. The rhizomes internally communicate and remain consistent, using event-driven synchronization mechanisms, to provide a unified and correct view of the vertex. Simulated experimental results show performance gains for BFS, SSSP, and PageRank on large chip sizes for the tested input graph datasets containing highly skewed degree distribution. The improvements come from the ability to express and create fine-grain dynamic computing tasks in the form of actions, language constructs that aid the compiler to generate code that the runtime system uses to optimally schedule tasks, and the data structure that shares both in- and out-degree compute workload among memory-processing elements., Comment: arXiv admin note: text overlap with arXiv:2402.02576
Published: 2024

41. Exploring the Design Space for Message-Driven Systems for Dynamic Graph Processing using CCA

Author: Chandio, Bibrak Qamar, Brodowicz, Maciej, and Sterling, Thomas
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, C.1.5, C.4, C.5, D.1.3
Abstract: Computer systems that have been successfully deployed for dense regular workloads fall short of achieving scalability and efficiency when applied to irregular and dynamic graph applications. Conventional computing systems rely heavily on static, regular, numeric intensive computations while High Performance Computing systems executing parallel graph applications exhibit little locality, spatial or temporal, and are fine-grained and memory intensive. With the strong interest in AI, which depends on these very different use cases, combined with the end of Moore's Law at nanoscale, dramatic alternatives in architecture and underlying execution models are required. This paper identifies an innovative non-von Neumann architecture, Continuum Computer Architecture (CCA), that redefines the nature of computing structures to yield powerful innovations in computational methods to deliver a new generation of highly parallel hardware architecture. CCA reflects a genus of highly parallel architectures that, while varying in specific quantities (e.g., memory blocks), share a multitude of attributes not found in typical von Neumann machines. Among these are memory-centric components, message-driven asynchronous flow control, and lightweight out-of-order execution across a global name space. Together these innovative non-von Neumann architectural properties guided by a new original execution model will deliver the new future path for extending beyond the von Neumann model. This paper documents a series of interrelated experiments that together establish future directions for next generation non-von Neumann architectures, especially for graph processing.
Published: 2024

42. Programming Distributed Collective Processes in the eXchange Calculus

Author: Audrito, Giorgio, Casadei, Roberto, Damiani, Ferruccio, Torta, Gianluca, and Viroli, Mirko
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, Computer Science - Programming Languages, D.1.3, F.1.1, F.4.3, I.2.11, J.7
Abstract: Recent trends like the Internet of Things (IoT) suggest a vision of dense and multi-scale deployments of computing devices in nearly all kinds of environments. A prominent engineering challenge revolves around programming the collective adaptive behaviour of such computational ecosystems. This requires abstractions able to capture concepts like ensembles (dynamic groups of cooperating devices) and collective tasks (joint activities carried out by ensembles). In this work, we consider collections of devices interacting with neighbours and that execute in nearly-synchronised sense-compute-interact rounds, where the computation is given by a single program mapping sensing values and incoming messages to output and outcoming messages. To support programming whole computational collectives, we propose the abstraction of a distributed collective process, which can be used to define at once the ensemble formation logic and its collective task. We formalise the abstraction in the eXchange Calculus (XC), a core functional language based on neighbouring values (maps from neighbours to values) where state and interaction is handled through a single primitive, exchange, and provide a corresponding implementation in the FCPP language. Then, we exercise distributed collective processes using two case studies: multi-hop message propagation and distributed monitoring of spatial properties. Finally, we discuss the features of the abstraction and its suitability for different kinds of distributed computing applications., Comment: 41 pages, 17 figures
Published: 2024

43. On the relativistic viability of multi-automaton systems: essential concepts, challenges and prospects

Author: Băbeanu, Alexandru-Ionuţ
Subjects: Physics - General Physics, D.1.3, E.1, G.2.3, I.2.11, I.6.8, J.2
Abstract: Our understanding of the Universe breaks down for very small spacetime intervals, corresponding to an extremely high level of granularity (and energy), commonly referred to as the "Planck scale". At this fundamental level, there are attempts of describing physics in terms of interacting automata that perform classical, deterministic computation. On one hand, various mathematical arguments have already illustrated how quantum laws (which describe elementary particles and interactions) could in principle arise as low-granularity approximations of automata-based systems. On the other hand, understanding how such systems might give rise to relativistic laws (which describe spacetime and gravity) remains a major problem. I explain here a few ideas that seem crucial for overcoming this problem, along with related algorithmic challenges that need to be addressed. Giving emphasis to meaningful computational counterparts of locality and general covariance, I outline basic ingredients of a distributed communication-rewiring protocol that would allow us to construct multi-automaton models that are viable from a relativistic perspective. I also explain how viable models can be evaluated using a variety of criteria, and discuss related aspects pertaining to the falsifiability and plausibility of the automata paradigm.
Comment: 7 pages, 1 figure
Published: 2024

44. Report of the DOE/NSF Workshop on Correctness in Scientific Computing, June 2023, Orlando, FL

Author: Gokhale, Maya, Gopalakrishnan, Ganesh, Mayo, Jackson, Nagarakatte, Santosh, Rubio-González, Cindy, and Siegel, Stephen F.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Software Engineering, B.8.1, C.1.4, D.0.3, D.0.4, D.1.3, D.2.1, D.2.5, D.3.1, G.1.2, J.2
Abstract: This report is a digest of the DOE/NSF Workshop on Correctness in Scientific Computing (CSC'23) held on June 17, 2023, as part of the Federated Computing Research Conference (FCRC) 2023. CSC was conceived by DOE and NSF to address the growing concerns about correctness among those who employ computational methods to perform large-scale scientific simulations. These concerns have escalated, given the complexity, scale, and heterogeneity of today's HPC software and hardware. If correctness is not proactively addressed, there is the risk of producing flawed science on top of unacceptable productivity losses faced by computational scientists and engineers. HPC systems are beginning to include data-driven methods, including machine learning and surrogate models, and their impact on overall HPC system correctness was also felt urgent to discuss. Stakeholders of correctness in this space were identified to belong to several sub-disciplines of computer science; from computer architecture researchers who design special-purpose hardware that offers high energy efficiencies; numerical algorithm designers who develop efficient computational schemes based on reduced precision as well as reduced data movement; all the way to researchers in programming language and formal methods who seek methodologies for correct compilation and verification. To include attendees with such a diverse set of backgrounds, CSC was held during the Federated Computing Research Conference (FCRC) 2023.
Comment: 36 pages. DOE/NSF Workshop on Correctness in Scientific Computing (CSC 2023) was a PLDI 2023 workshop
Published: 2023

45. FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems

Author: Randall, Thomas, Allen, Tyler, and Ge, Rong
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing, I.2.7, D.1.3, G.4
Abstract: Word2Vec remains one of the highly-impactful innovations in the field of Natural Language Processing (NLP) that represents latent grammatical and syntactical information in human text with dense vectors in a low dimension. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
Comment: 12 pages, 7 figures, 7 tables, the definitive version of this work is published in the Proceedings of the ACM International Conference on Supercomputing 2021, available at https://doi.org/10.1145/3447818.3460373
Published: 2023

46. High Performance Multiple Sequence Alignment Algorithms for Comparison of Microbial Genomes

Author: Helal, Manal, El-Gindy, Hossam, Gaeta, Bruno, and Sinchenko, Vitali
Subjects: Quantitative Biology - Genomics, Computer Science - Distributed, Parallel, and Cluster Computing, D.1.3
Abstract: Advances in gene sequencing have enabled in silico analyses of microbial genomes and have led to the revision of concepts of microbial taxonomy and evolution. We explore deficiencies in existing multiple sequence global alignment algorithms and introduce a new indexing scheme to partition the dynamic programming algorithm hypercube scoring tensor over processors based on the dependency between partitions to be scored in parallel. The performance of algorithms is compared in the study of rpoB gene sequences of Mycoplasma species.
Published: 2023

47. Computing the k-th Eigenvalue of Symmetric $H^2$-Matrices

Author: Apriansyah, M. Ridwan and Yokota, Rio
Subjects: Mathematics - Numerical Analysis, G.1.2, G.1.3, G.4, D.1.3
Abstract: The numerical solution of eigenvalue problems is essential in various application areas of scientific and engineering domains. In many problem classes, the practical interest is only a small subset of eigenvalues so it is unnecessary to compute all of the eigenvalues. Notable examples are the electronic structure problems where the $k$-th smallest eigenvalue is closely related to the electronic properties of materials. In this paper, we consider the $k$-th eigenvalue problems of symmetric dense matrices with low-rank off-diagonal blocks. We present a linear time generalized LDL decomposition of $\mathcal{H}^2$ matrices and combine it with the bisection eigenvalue algorithm to compute the $k$-th eigenvalue with controllable accuracy. In addition, if more than one eigenvalue is required, some of the previous computations can be reused to compute the other eigenvalues in parallel. Numerical experiments show that our method is more efficient than the state-of-the-art dense eigenvalue solver in LAPACK/ScaLAPACK and ELPA. Furthermore, tests on electronic state calculations of carbon nanomaterials demonstrate that our method outperforms the existing HSS-based bisection eigenvalue algorithm on 3D problems.
Comment: 14 pages, 11 figures
Published: 2023

48. Efficient Algorithms for Monte Carlo Particle Transport on AI Accelerator Hardware

Author: Tramm, John, Allen, Bryce, Yoshii, Kazutomo, Siegel, Andrew, and Wilson, Leighton
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance, D.1.3, J.2
Abstract: The recent trend toward deep learning has led to the development of a variety of highly innovative AI accelerator architectures. One such architecture, the Cerebras Wafer-Scale Engine 2 (WSE-2), features 40 GB of on-chip SRAM, making it a potentially attractive platform for latency- or bandwidth-bound HPC simulation workloads. In this study, we examine the feasibility of performing continuous energy Monte Carlo (MC) particle transport on the WSE-2 by porting a key kernel from the MC transport algorithm to Cerebras's CSL programming model. New algorithms for minimizing communication costs and for handling load balancing are developed and tested. The WSE-2 is found to run 130 times faster than a highly optimized CUDA version of the kernel run on an NVIDIA A100 GPU -- significantly outpacing the expected performance increase given the difference in transistor counts between the architectures.
Published: 2023

49. A Performance-Portable SYCL Implementation of CRK-HACC for Exascale

Author: Rangel, Esteban M., Pennycook, S. John, Pope, Adrian, Frontiere, Nicholas, Ma, Zhiqiang, and Madananth, Varsha
Subjects: Computer Science - Performance, Astrophysics - Cosmology and Nongalactic Astrophysics, Computer Science - Distributed, Parallel, and Cluster Computing, D.2.7, D.2.8, D.1.3, J.2
Abstract: The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches. To this end, this paper evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.
Comment: 12 pages, 13 figures, 2023 International Workshop on Performance, Portability & Productivity in HPC
Published: 2023

50. Compiler Testing With Relaxed Memory Models

Author: Geeson, Luke and Smith, Lee
Subjects: Computer Science - Programming Languages, Computer Science - Hardware Architecture, Computer Science - Software Engineering, D.1.3, B.1.2, B.1.4, D.2.5
Abstract: Finding bugs is key to the correctness of compilers in wide use today. If the behaviour of a compiled program, as allowed by its architecture memory model, is not a behaviour of the source program under its source model, then there is a bug. This holds for all programs, but we focus on concurrency bugs that occur only with two or more threads of execution. We focus on testing techniques that detect such bugs in C/C++ compilers. We seek a testing technique that automatically covers concurrency bugs up to fixed bounds on program sizes and that scales to find bugs in compiled programs with many lines of code. Otherwise, a testing technique can miss bugs. Unfortunately, the state-of-the-art techniques are yet to satisfy all of these properties. We present the Téléchat compiler testing tool for concurrent programs. Téléchat compiles a concurrent C/C++ program and compares source and compiled program behaviours using source and architecture memory models. We make three claims: Téléchat improves the state-of-the-art at finding bugs in code generation for multi-threaded execution, it is the first public description of a compiler testing tool for concurrency that is deployed in industry, and it is the first tool that takes a significant step towards the desired properties. We provide experimental evidence suggesting Téléchat finds bugs missed by other state-of-the-art techniques, case studies indicating that Téléchat satisfies the properties, and reports of our experience deploying Téléchat in industry regression testing.
Comment: 12 pages, Accepted to IEEE/ACM International Symposium on Code Generation and Optimization
Published: 2023
