5,392 results for "Heterogeneous computing"
Search Results
2. CPU–GPU heterogeneous code acceleration of a finite volume Computational Fluid Dynamics solver
- Author
-
Xue, Weicheng, Wang, Hongyu, and Roy, Christopher J.
- Published
- 2024
- Full Text
- View/download PDF
3. Domain Decomposition for the Numerical Solution of the Cahn-Hilliard Equation
- Author
-
Prokhorov, Dmitry, Voevodin, Vladimir, editor, Antonov, Alexander, editor, and Nikitenko, Dmitry, editor
- Published
- 2025
- Full Text
- View/download PDF
4. A Novel Mixed Precision Defect Correction Solver for Heterogeneous Computing
- Author
-
Delorme, Yann T., Wasserman, Mark, Zameret, Alon, Ding, Zhaohui, Weiland, Michèle, editor, Neuwirth, Sarah, editor, Kruse, Carola, editor, and Weinzierl, Tobias, editor
- Published
- 2025
- Full Text
- View/download PDF
5. Deductive Verification of SYCL in VerCors
- Author
-
Wittingen, Ellen, Huisman, Marieke, Şakar, Ömer, Madeira, Alexandre, editor, and Knapp, Alexander, editor
- Published
- 2025
- Full Text
- View/download PDF
6. Parallel Implementation of Sieving Algorithm on Heterogeneous CPU-GPU Computing Architectures
- Author
-
Wu, Mengsi, Li, Pei, Chen, Jiageng, Yao, Shixiong, Xia, Zhe, editor, and Chen, Jiageng, editor
- Published
- 2025
- Full Text
- View/download PDF
7. Optimal task partitioning to minimize failure in heterogeneous computational platform.
- Author
-
Narayana, Divyaprabha Kabbal and Babu, Sudarshan Tekal Subramanyam
- Subjects
ENERGY consumption, RESOURCE allocation, CLOUD computing, COMPUTING platforms, CONSUMPTION (Economics), HETEROGENEOUS computing - Abstract
The increased energy consumption of heterogeneous cloud platforms raises carbon emissions and reduces system reliability, making workload scheduling an extremely challenging process. The dynamic voltage-frequency scaling (DVFS) technique provides an efficient mechanism for improving the energy efficiency of cloud platforms; however, employing DVFS reduces reliability and increases the failure rate of resource scheduling. Most current workload scheduling methods fail to optimize energy and reliability together on a central processing unit-graphics processing unit (CPU-GPU) heterogeneous computing platform; as a result, reducing energy consumption and task failure are the prime issues this work aims to address. This work introduces task failure minimization (TFM) through optimal task partitioning (OTP) for workload scheduling on the CPU-GPU cloud computational platform. TFM-OTP introduces a task partitioning model for the CPU-GPU pair; it then provides a DVFS-based energy consumption model. Finally, the energy-load optimization problem is defined, and the optimal resource allocation design is presented. The experiment is conducted on two standard workloads, namely the SIPHT and CyberShake workloads. The results show that the proposed TFM-OTP model reduces energy consumption by 30.35%, makespan by 70.78%, and task failure energy overhead by 83.7% in comparison with the energy minimized scheduling (EMS) approach. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
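The DVFS trade-off at the heart of the entry above can be made concrete with a toy model. Below is a minimal C++ sketch assuming the cubic dynamic-power law and the exponential fault-rate model common in the DVFS-reliability literature; all constants are illustrative assumptions, not values from the paper.

```cpp
#include <cmath>
#include <cstdio>

// Illustrative DVFS model: dynamic power ~ C * f^3 (voltage scales with f),
// and the transient-fault rate grows exponentially as frequency is scaled
// down -- a common assumption in the DVFS-reliability literature.
struct DvfsModel {
    double c_eff   = 1.0;   // effective switched capacitance (normalized)
    double lambda0 = 1e-6;  // fault rate at f_max (faults per cycle-time unit)
    double d       = 3.0;   // sensitivity of fault rate to voltage scaling
    double f_min   = 0.4;   // normalized frequency bounds
    double f_max   = 1.0;

    // Energy to execute `cycles` work at normalized frequency f.
    double energy(double cycles, double f) const {
        double time = cycles / f;          // lower f -> longer execution
        return c_eff * f * f * f * time;   // P ~ C*f^3, E = P*t
    }
    // Probability the task completes without a transient fault.
    double reliability(double cycles, double f) const {
        double lambda = lambda0 *
            std::pow(10.0, d * (f_max - f) / (f_max - f_min));
        return std::exp(-lambda * cycles / f);
    }
};

int main() {
    DvfsModel m;
    // Scan frequencies: energy falls as f drops, but so does reliability.
    for (double f = m.f_min; f <= m.f_max + 1e-9; f += 0.1)
        std::printf("f=%.1f  E=%.3f  R=%.8f\n",
                    f, m.energy(1e6, f), m.reliability(1e6, f));
}
```

The scan makes the conflict visible: the energy-optimal frequency and the reliability-optimal frequency differ, which is why the paper treats the two jointly.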
8. Exploring data flow design and vectorization with oneAPI for streaming applications on CPU+GPU
- Author
-
Campos, Cristian, Asenjo, Rafael, and Navarro, Angeles
- Abstract
In recent times, oneAPI has emerged as a competitive framework for optimizing streaming applications on heterogeneous CPU+GPU architectures, since it provides portability and performance thanks to the SYCL programming language and efficient parallel libraries such as oneTBB. However, this approach opens up a wealth of implementation alternatives for this type of application: from how to design the data flow to how to exploit data parallelism. Choosing the best alternative is not trivial, so in this paper we analyze them and contribute an analytical model based on queueing theory that helps in the on-line selection of the alternative that maximizes the throughput and the occupancy of the CPU and GPU compute units. We explore the design space offered by: a) different APIs to define the data flow (parallel_pipeline and Flow Graph from oneTBB, and SYCL events from SYCL); b) alternative kernel implementations to express data parallelism (SYCL, AVX and std::simd); and c) the mapping of the kernels onto the available computing resources (CPU cores and GPU). The results show that the std::simd library can be 1.54x faster, 3% more energy efficient, and requires 7.36x less programming effort than AVX, and that implementations that enable asynchronous offloading of tasks to the devices, such as those based on the SYCL events and Flow Graph APIs, outperform the other APIs, being up to 1.10x faster and up to 1.18x more energy efficient. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
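In its simplest form, the queueing-based selection idea in the entry above reduces to a bottleneck calculation: a streaming pipeline's throughput is bounded by its slowest stage, so the CPU/GPU mapping that minimizes the worst per-stage service time wins. A minimal sketch with hypothetical stage timings (the paper's calibrated model also accounts for occupancy):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative throughput model: each stage has a measured per-item service
// time on CPU and on GPU, and pipeline throughput is limited by the slowest
// stage, as in a simple queueing-network view. Numbers are assumptions.
struct Stage { double t_cpu_ms, t_gpu_ms; };

double throughput(const std::vector<Stage>& st, unsigned mask) {
    // mask bit i == 1 -> stage i runs on the GPU.
    double bottleneck = 0.0;
    for (size_t i = 0; i < st.size(); ++i)
        bottleneck = std::max(bottleneck,
            (mask >> i & 1u) ? st[i].t_gpu_ms : st[i].t_cpu_ms);
    return 1000.0 / bottleneck; // items per second
}

int main() {
    std::vector<Stage> st = {{4.0, 1.5}, {2.0, 3.0}, {6.0, 2.5}};
    unsigned best = 0; double best_tp = 0.0;
    for (unsigned mask = 0; mask < (1u << st.size()); ++mask) {
        double tp = throughput(st, mask);
        if (tp > best_tp) { best_tp = tp; best = mask; }
    }
    std::printf("best mapping mask=0x%x, throughput=%.1f items/s\n",
                best, best_tp);
}
```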
9. An autotuning approach to select the inter-GPU communication library on heterogeneous systems.
- Author
-
Cámara, Jesús, Cuenca, Javier, Galindo, Victor, Vicente, Arturo, and Boratto, Murilo
- Abstract
In this work, an automatic optimisation approach for parallel routines on multi-GPU systems is presented. Several inter-GPU communication libraries (such as CUDA-Aware MPI or NCCL) are used with a set of routines to perform the numerical operations among the GPUs located on the compute nodes. The main objective is the selection of the most appropriate communication library, the number of GPUs to be used and the workload to be distributed among them in order to reduce the cost of data movements, which represent a large percentage of the total execution time. To this end, a hierarchical modelling of the execution time of each routine to be optimised is proposed, combining experimental and theoretical approaches. The results show that near-optimal decisions are taken in all the scenarios analysed. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
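The hierarchical time model in the autotuning entry above can be sketched as a compute-plus-communication cost minimized over (library, GPU count) pairs. The coefficients below are hypothetical placeholders which, in the paper's approach, would come from offline benchmarking of each communication library:

```cpp
#include <cstdio>
#include <limits>
#include <vector>

// Illustrative selection of an inter-GPU communication library: routine cost
// is modeled as compute time (shrinks with more GPUs) plus communication
// time (latency/bandwidth terms per library), and the tuner picks the
// (library, #GPUs) pair minimizing modeled time.
struct CommLib {
    const char* name;
    double alpha_ms;        // per-message latency
    double beta_ms_per_mb;  // inverse bandwidth
};

double modeled_time(double work_ms, double data_mb,
                    const CommLib& lib, int gpus) {
    double compute = work_ms / gpus;  // idealized even split of the work
    double comm = (gpus - 1) * (lib.alpha_ms + lib.beta_ms_per_mb * data_mb);
    return compute + comm;
}

int main() {
    std::vector<CommLib> libs = {{"CUDA-Aware-MPI", 0.050, 0.12},
                                 {"NCCL",           0.020, 0.09}};
    double work_ms = 800.0, data_mb = 64.0;
    double best = std::numeric_limits<double>::max();
    const CommLib* best_lib = nullptr; int best_g = 1;
    for (const auto& lib : libs)
        for (int g = 1; g <= 8; g *= 2) {
            double t = modeled_time(work_ms, data_mb, lib, g);
            if (t < best) { best = t; best_lib = &lib; best_g = g; }
        }
    std::printf("choose %s with %d GPUs (%.2f ms modeled)\n",
                best_lib->name, best_g, best);
}
```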
10. HFSL: heterogeneity split federated learning based on client computing capabilities.
- Author
-
Wu, Nengwu, Zhao, Wenjie, Chen, Yuxiang, Xiao, Jiahong, Wang, Jin, Liang, Wei, Li, Kuan-Ching, and Sukhija, Nitin
- Abstract
With the rapid growth of the internet of things (IoT) and smart devices, edge computing has emerged as a critical technology for processing massive amounts of data and protecting user privacy. Split federated learning, an emerging distributed learning framework, enables model training without data leaving local devices, effectively preventing data leakage and misuse. However, the disparity in the computational capabilities of edge devices necessitates partitioning models according to the least capable client, resulting in a significant portion of the computational load being offloaded to a more capable server-side infrastructure and thereby incurring substantial training overheads. This work proposes a novel method for split federated learning targeting heterogeneous endpoints to address these challenges. The method addresses the problem of heterogeneous training across different clients by adding auxiliary layers, enhances the accuracy of heterogeneous model split training using self-distillation techniques, and leverages the global model from the previous round to mitigate accuracy degradation during federated aggregation. We conducted validations on the CIFAR-10 dataset and compared with the existing SL, SFLV1, and SFLV2 methods; our HFSL2 method improved accuracy by 3.81%, 13.94%, and 6.19%, respectively. Validations were also carried out on the HAM10000, FashionMNIST, and MNIST datasets, through which we found that our algorithm can effectively enhance aggregation accuracy under heterogeneous computing capabilities. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
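A core mechanism in split federated learning is choosing the cut layer per client rather than forcing all clients to the weakest device's split point. A minimal sketch of capability-aware cut selection follows, with hypothetical layer costs and budgets; the paper's auxiliary-layer and self-distillation machinery is not modeled:

```cpp
#include <cstdio>
#include <vector>

// Illustrative split-point selection: each client keeps as many leading
// layers as its compute budget allows; the remaining layers run server-side.
int pick_cut_layer(const std::vector<double>& layer_gflops, double budget) {
    double acc = 0.0;
    int cut = 0; // number of layers kept on the client
    for (size_t i = 0; i < layer_gflops.size(); ++i) {
        acc += layer_gflops[i];
        if (acc > budget) break;
        cut = static_cast<int>(i) + 1;
    }
    return cut; // layers [0, cut) run on-device, the rest on the server
}

int main() {
    std::vector<double> layers = {0.2, 0.4, 0.8, 1.6, 1.6}; // per-layer cost
    for (double budget : {0.5, 1.5, 5.0})
        std::printf("client budget %.1f GFLOPs -> cut after layer %d\n",
                    budget, pick_cut_layer(layers, budget));
}
```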
11. Performance optimization of heterogeneous computing for large-scale dynamic graph data.
- Author
-
Wang, Haifeng, Guo, Wenkang, and Zhang, Ming
- Abstract
The performance of machines is not fully exploited when processing large-scale dynamic graphs in heterogeneous GPU clusters. To improve the performance of graph computing, a distributed heterogeneous engine (DHE) has been designed. A new heterogeneous graph partitioning algorithm is implemented, which achieves load balancing among internal nodes in GPU clusters. DHE introduces a synergy model to quantify the co-computing performance of heterogeneous processors and designs computing pipelines to optimize memory-access performance. The graph algorithms PageRank, CC, SSSP and K-core are selected for the experiments. Under the same conditions, DHE's partitioning algorithm improves scalability and load-balancing ability compared to other graph partitioning algorithms. The heterogeneous computing pipeline exhibits better memory access than the basic engine. The experiments show that the synergy of this system converges stably to 1. Compared with the other three graph partitioning algorithms, the system reduces processing time by approximately 20-30% and improves performance by a factor of 1.2. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
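Heterogeneous graph partitioning of the kind described above must account for unequal device speeds. The sketch below captures only the load-balancing term, assigning each vertex to the partition with the lowest capacity-normalized load; real partitioners, including the paper's, also minimize edge cuts:

```cpp
#include <cstdio>
#include <vector>

// Illustrative capacity-aware vertex assignment: each partition (GPU) has a
// relative compute capacity, and every vertex is placed on the partition
// with the lowest normalized load, so faster devices get more work.
int main() {
    std::vector<double> capacity = {1.0, 2.0, 4.0}; // relative GPU speeds
    std::vector<double> load(capacity.size(), 0.0);
    std::vector<int> owner(12);

    for (size_t v = 0; v < owner.size(); ++v) {
        size_t best = 0;
        for (size_t p = 1; p < capacity.size(); ++p)
            if (load[p] / capacity[p] < load[best] / capacity[best])
                best = p;
        owner[v] = static_cast<int>(best);
        load[best] += 1.0; // unit vertex weight
    }
    for (size_t p = 0; p < capacity.size(); ++p)
        std::printf("partition %zu: load %.0f (capacity %.0f)\n",
                    p, load[p], capacity[p]);
}
```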
12. An efficient heterogeneous parallel password recovery system on MT-3000.
- Author
-
Luo, Yongtao, Liu, Jie, Gong, Chunye, and Li, Tun
- Abstract
Password-based recovery is a widely used method for regaining access to applications or services when passwords are lost or forgotten. It is commonly used in electronic forensics by law enforcement agencies, information acquisition in the commercial sector, and data recovery for individuals. However, as encryption algorithms and complex passwords become more prevalent for security purposes, traditional CPU-based and GPU-based password recovery systems are struggling to meet the time-sensitive requirements of deciphering, and there is an urgent need for a more efficient password recovery system. Therefore, this paper presents an efficient heterogeneous parallel password recovery system based on the MT-3000 multi-zone processor. According to the architectural features of the MT-3000, this system adopts a heterogeneous multi-level parallelism strategy, including inter-acceleration-cluster data parallelism through MPI, intra-acceleration-core data parallelism through the hthreads APIs, and instruction-level parallelism through very long instruction words. Additionally, this system utilizes a unified task allocation mechanism that assigns the initialization and comparison-verification tasks to the CPU side, while the accelerator side executes the hash iterations. This approach ensures that the system achieves optimal performance while maintaining its efficiency. The experimental analysis and results confirm that the proposed system significantly improves recovery efficiency compared to traditional CPU-based and GPU-based systems, and also has an advantage in deciphering speed over the most popular hybrid CPU-FPGA-based systems. Furthermore, it offers superior scalability, allowing for expansion to more compute nodes, making it a good solution for large-scale password recovery needs. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
13. Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection.
- Author
-
Xu, Jia, Pu, Han, and Wang, Dong
- Subjects
CONVOLUTIONAL neural networks, ARTIFICIAL neural networks, HETEROGENEOUS computing, MAP design, ENERGY consumption, CACHE memory - Abstract
Reconfigurable processor-based acceleration of deep convolutional neural network (DCNN) algorithms has emerged as a widely adopted technique, with sparse neural network acceleration receiving particular attention as an active research area. However, many computing devices that claim high computational power still struggle to execute neural network algorithms with optimal efficiency, low latency, and minimal power consumption. Consequently, there remains significant potential for further exploration into improving the efficiency, latency, and power consumption of neural network accelerators across diverse computational scenarios. This paper investigates three key techniques for hardware acceleration of sparse neural networks. The main contributions are as follows: (1) Most neural network inference tasks are typically executed on general-purpose computing devices, which often fail to deliver high energy efficiency and are not well-suited for accelerating sparse convolutional models. In this work, we propose a specialized computational circuit for the convolutional operations of sparse neural networks. This circuit is designed to detect and eliminate the computational effort associated with zero values in the sparse convolutional kernels, thereby enhancing energy efficiency. (2) The data access patterns in convolutional neural networks introduce significant pressure on the high-latency off-chip memory access process. Due to issues such as data discontinuity, the data reading unit often fails to fully exploit the available bandwidth during off-chip read and write operations. In this paper, we analyze bandwidth utilization in the context of convolutional accelerator data handling and propose a strategy to improve off-chip access efficiency. Specifically, we leverage a compiler optimization plugin developed for Vitis HLS, which automatically identifies and optimizes on-chip bandwidth utilization. (3) In coefficient-based accelerators, the synchronous operation of individual computational units can significantly hinder efficiency. Previous approaches have achieved asynchronous convolution by designing separate memory units for each computational unit; however, this method consumes a substantial amount of on-chip memory resources. To address this issue, we propose a shared feature map cache design for asynchronous convolution in the accelerators presented in this paper. This design resolves address access conflicts when multiple computational units concurrently access a set of caches by utilizing a hash-based address indexing algorithm. Moreover, the shared cache architecture reduces data redundancy and conserves on-chip resources. Using the optimized accelerator, we successfully executed ResNet50 inference on an Intel Arria 10 GX 1150 FPGA, achieving a throughput of 497 GOPS, or an equivalent computational power of 1579 GOPS, with a power consumption of only 22 watts. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
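The multi-bank hash selection of contribution (3) above can be illustrated in a few lines: a hash of the requested address selects the bank, and requests colliding on the same bank serialize. The hash function and bank count here are assumptions for illustration, not the paper's design:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Illustrative multi-bank hash selection: several compute units request
// feature-map addresses in the same cycle; an address hash picks the bank,
// and same-bank conflicts cost extra cycles.
constexpr int kBanks = 4;

int bank_of(std::uint32_t addr) {
    // Cheap XOR-fold hash to spread nearby addresses across banks.
    return static_cast<int>((addr ^ (addr >> 2) ^ (addr >> 5)) % kBanks);
}

int main() {
    // Addresses requested concurrently by 4 compute units in one cycle.
    std::array<std::uint32_t, 4> req = {0x100, 0x104, 0x108, 0x10C};
    std::array<int, kBanks> hits{};
    for (auto a : req) hits[bank_of(a)]++;

    int cycles = 1;
    for (int h : hits) if (h > cycles) cycles = h; // conflicts serialize
    for (auto a : req)
        std::printf("addr 0x%03X -> bank %d\n",
                    static_cast<unsigned>(a), bank_of(a));
    std::printf("cycle count for this access group: %d\n", cycles);
}
```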
14. Industrial Internet of Water Things architecture for data standardization based on blockchain and digital twin technology.
- Author
-
Mohammed, Mazin Abed, Lakhan, Abdullah, Abdulkareem, Karrar Hameed, Abd Ghani, Mohd Khanapi, Marhoon, Haydar Abdulameer, Kadry, Seifedine, Nedoma, Jan, Martinek, Radek, and Zapirain, Begonya Garcia
- Subjects
DIGITAL twins, HETEROGENEOUS computing, WATER distribution, SMART cities, WATER management - Abstract
• We present a hybrid model called CNN-LSTM. The CNN-LSTM model is designed to efficiently extract and process data from diverse sources, thereby addressing the issue of data modality. • We devise a cross-platform runtime environment based on the Message Queuing Telemetry Transport (MQTT) protocol. This environment allows for the seamless processing of diverse data sources and types through heterogeneous computing nodes in IIoWT. By leveraging MQTT, the devised runtime environment ensures interoperability and facilitates the execution of data across heterogeneous computing nodes. • We devise a blockchain scheme that utilizes SHA-256 for data hashing. This scheme incorporates a proof-of-work mechanism and ensures data validity across heterogeneous nodes in IIoWT. The Industrial Internet of Water Things (IIoWT) has recently emerged as a leading architecture for efficient water distribution in smart cities. Its primary purpose is to ensure high-quality drinking water for various institutions and households. However, the existing IIoWT architecture has many challenges. One of the paramount challenges is achieving data standardization and data fusion across the multiple monitoring institutions responsible for assessing water quality and quantity. This paper introduces the Industrial Internet of Water Things System for Data Standardization based on Blockchain and Digital Twin Technology. The main objective of this study is to design a new IIoWT architecture in which data standardization, interoperability, and data security among different water institutions are met. We devise a digital twin-enabled cross-platform environment using the Message Queuing Telemetry Transport (MQTT) protocol to achieve seamless interoperability in heterogeneous computing. In water management, we encounter different types of data from various sensors. Therefore, we propose a CNN-LSTM and blockchain data transactional (BCDT) scheme for processing valid data across different nodes. Through simulation results, we demonstrate that the proposed IIoWT architecture significantly reduces processing time while improving the accuracy of data standardization within the water distribution management system. Overall, this paper presents a comprehensive approach to tackling the challenges of data standardization and security in the IIoWT architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
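The blockchain scheme above combines SHA-256 hashing with a proof-of-work mechanism. The sketch below shows only the proof-of-work loop; std::hash stands in for SHA-256 purely to keep the example self-contained, and it is not cryptographically secure. The difficulty mask is likewise an assumption:

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Illustrative proof-of-work loop: increment a nonce until the digest of
// (block || nonce) falls below a difficulty target. std::hash is a stand-in
// for the real SHA-256 digest and must not be used for actual security.
int main() {
    std::string block = "water-quality-batch-42";
    std::hash<std::string> h; // stand-in for SHA-256
    const std::uint64_t difficulty_mask = 0xFFFFull << 48; // top 16 bits

    for (std::uint64_t nonce = 0;; ++nonce) {
        std::uint64_t digest = h(block + std::to_string(nonce));
        if ((digest & difficulty_mask) == 0) { // "hash below target"
            std::printf("nonce %llu yields digest %016llx\n",
                        static_cast<unsigned long long>(nonce),
                        static_cast<unsigned long long>(digest));
            break;
        }
    }
}
```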
15. Pareto Approximation Empirical Results of Energy-Aware Optimization for Precedence-Constrained Task Scheduling Considering Switching Off Completely Idle Machines.
- Author
-
Castán Rocha, José Antonio, Santiago, Alejandro, García-Ruiz, Alejandro H., Terán-Villanueva, Jesús David, Martínez, Salvador Ibarra, and Treviño Berrones, Mayra Guadalupe
- Subjects
LANGUAGE models, MULTI-objective optimization, DIRECTED acyclic graphs, SUPERCOMPUTERS, ENERGY consumption - Abstract
Recent advances in cloud computing, large language models, and deep learning have started a race to build massive High-Performance Computing (HPC) centers worldwide. These centers' energy consumption grows in proportion to their computing capabilities; for example, according to the TOP500 organization, the HPC centers Frontier, Aurora, and Supercomputer Fugaku report energy consumptions of 22,786 kW, 38,698 kW, and 29,899 kW, respectively. Energy-aware scheduling is currently a topic of interest to many researchers. However, as far as we know, this work is the first approach that considers the idle energy consumption of the HPC units and the possibility of turning off unused units entirely, driven by a quantitative objective function. We found that even when turning off unused machines, the objectives of makespan and energy consumption still conflict, hence the problem's multi-objective optimization nature. This work presents empirical results for AGEMOEA, AGEMOEA2, GWASFGA, MOCell, MOMBI, MOMBI2, NSGA2, and SMS-EMOA. The best-performing algorithm is MOCell on the 400 real scheduling problem tests. In contrast, the best-performing algorithm is GWASFGA on a small-instance synthetic testbed. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. Assessment of Driver Stress using Multimodal Wearable Signals and Self-Attention Networks.
- Author
-
Pavan, Kaveti and Ganapathy, Nagarajan
- Subjects
COMBINED modality therapy, CONVOLUTIONAL neural networks, ARTIFICIAL neural networks, ELECTROCARDIOGRAPHY, HETEROGENEOUS computing - Abstract
Assessment of driver stress, crucial for road safety, can greatly benefit from the analysis of multimodal physiological signals. However, fusing such heterogeneous data poses significant challenges, particularly in intermediate fusion, where noise can also be fused. In this study, we address this challenge by exploring a 1D convolutional neural network (CNN) with self-attention mechanisms on multimodal data. Electrocardiogram (ECG) signals (256 Hz) and respiration (RESP) signals (128 Hz) were obtained from ten subjects using textile electrodes while driving in different scenarios, namely normal driving and phone usage (calling). The obtained multimodal data are preprocessed and then applied to a self-attention mechanism (SAM) CNN (SAMcNN) to identify driver stress. Experiments are validated using leave-one-subject-out cross-validation. The proposed approach is capable of classifying driver stress. It is observed that shorter segments yield an accuracy of 64.16% compared to longer segment lengths. Thus, exploring self-attention mechanisms for multimodal signals using wearable shirts facilitates non-intrusive monitoring in real-world driving scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Deep learning parallel approach using CUDA technology.
- Author
-
Rakhimov, Mekhriddin, Zaripova, Dilnoza, Javliev, Shakhzod, and Karimberdiyev, Jakhongir
- Subjects
ARTIFICIAL intelligence, CENTRAL processing units, GRAPHICS processing units, HETEROGENEOUS computing, COMPUTER systems, DEEP learning - Abstract
With the development of technology, the field of artificial intelligence and the speed of computers are advancing day by day; deep learning and artificial intelligence in particular are important in almost all fields related to education. When working with images, deep learning applications can help extract valuable features from large amounts of data drawn from images, videos, text, speech signals, and sensors. In addition, working with image-based data requires a large amount of time to perform numerical operations. Therefore, in this paper, we aim to use heterogeneous computing systems and parallel processing to improve the time efficiency of image processing. In machine learning, we consider using central processing units (CPUs) and graphics processing units (GPUs) for image preprocessing, and we use heterogeneous computing systems with technologies such as OpenCL and CUDA to achieve faster results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
18. Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey.
- Author
-
Liu, Zhihong, Xu, Xin, Qiao, Peng, and Li, Dongsheng
- Subjects
REINFORCEMENT learning, ARTIFICIAL neural networks, DEEP reinforcement learning, DISTRIBUTED artificial intelligence, PATTERN recognition systems, HETEROGENEOUS computing, PARALLEL algorithms - Published
- 2025
- Full Text
- View/download PDF
19. TAMM: Tensor algebra for many-body methods.
- Author
-
Mutlu, Erdal, Panyala, Ajay, Gawande, Nitin, Bagusetty, Abhishek, Glabe, Jeffrey, Kim, Jinsung, Kowalski, Karol, Bauman, Nicholas P., Peng, Bo, Pathak, Himadri, Brabec, Jiri, and Krishnamoorthy, Sriram
- Subjects
TENSOR algebra, UNIVERSAL algebra, GRAPHICS processing units, COMPUTING platforms, MODULAR construction, HETEROGENEOUS computing - Abstract
Tensor algebra operations such as contractions in computational chemistry consume a significant fraction of the computing time on large-scale computing platforms. The widespread use of tensor contractions between large multi-dimensional tensors in describing electronic structure theory has motivated the development of multiple tensor algebra frameworks targeting heterogeneous computing platforms. In this paper, we present Tensor Algebra for Many-body Methods (TAMM), a framework for productive and performance-portable development of scalable computational chemistry methods. TAMM decouples the specification of the computation from the execution of these operations on available high-performance computing systems. With this design choice, the scientific application developers (domain scientists) can focus on the algorithmic requirements using the tensor algebra interface provided by TAMM, whereas high-performance computing developers can direct their attention to various optimizations on the underlying constructs, such as efficient data distribution, optimized scheduling algorithms, and efficient use of intra-node resources (e.g., graphics processing units). The modular structure of TAMM allows it to support different hardware architectures and incorporate new algorithmic advances. We describe the TAMM framework and our approach to the sustainable development of scalable ground- and excited-state electronic structure methods. We present case studies highlighting the ease of use, including the performance and productivity gains compared to other frameworks. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
20. NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators
- Author
-
Jeman Park, Misun Yu, Jinse Kwon, Junmo Park, Jemin Lee, and Yongin Kwon
- Subjects
ai accelerator, deep learning compiler, heterogeneous computing, model quantization, multi-level ir, Telecommunication, TK5101-6720, Electronics, TK7800-8360 - Abstract
Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general-purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing-in-memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency, but they require more tailored optimizations. To address these limitations, we propose the NEST compiler (NEST-C), a novel DL framework that improves the deployment and performance of models across various AI accelerators. NEST-C leverages profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST-C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient DL model deployment in modern AI applications.
- Published
- 2024
- Full Text
- View/download PDF
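Profiling-based quantization, one of the NEST-C techniques listed above, generically works by calibrating a per-tensor scale from activation ranges observed during a profiling run. A minimal symmetric int8 sketch of that idea, not NEST-C's actual calibration pass:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative profiling-based symmetric int8 quantization: a calibration
// run records the max absolute activation value, which fixes the scale
// used when quantizing at deployment time.
struct QuantParams { double scale; };

QuantParams calibrate(const std::vector<float>& profile) {
    double max_abs = 0.0;
    for (float v : profile) max_abs = std::max<double>(max_abs, std::fabs(v));
    return {max_abs > 0.0 ? max_abs / 127.0 : 1.0};
}

std::int8_t quantize(float v, const QuantParams& q) {
    double r = std::round(v / q.scale);
    return static_cast<std::int8_t>(std::clamp(r, -128.0, 127.0));
}

int main() {
    std::vector<float> activations = {-0.9f, 0.1f, 0.4f, 2.5f, -1.7f};
    QuantParams q = calibrate(activations);
    for (float v : activations)
        std::printf("%+.2f -> %d (dequant %+.3f)\n", v, quantize(v, q),
                    quantize(v, q) * q.scale);
}
```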
21. Adaptive federated learning for resource-constrained IoT devices through edge intelligence and multi-edge clustering.
- Author
-
Mughal, Fahad Razaque, He, Jingsha, Das, Bhagwan, Dharejo, Fayaz Ali, Zhu, Nafei, Khan, Surbhi Bhatia, and Alzahrani, Saeed
- Subjects
COMPUTER network traffic, ARTIFICIAL intelligence, FEDERATED learning, EDGE computing, HETEROGENEOUS computing - Abstract
In the rapidly growing Internet of Things (IoT) landscape, federated learning (FL) plays a crucial role in enhancing the performance of heterogeneous edge computing environments due to its scalability, robustness, and low energy consumption. However, one of the major challenges in such environments is the efficient selection of edge nodes and the optimization of resource allocation, especially in dynamic and resource-constrained settings. To address this, we propose a novel architecture called Multi-Edge Clustered and Edge AI Heterogeneous Federated Learning (MEC-AI HetFL), which leverages multi-edge clustering and AI-driven node communication. This architecture enables edge AI nodes to collaborate, dynamically selecting significant nodes and optimizing global learning tasks with low complexity. Compared to existing solutions like EdgeFed, FedSA, FedMP, and H-DDPG, MEC-AI HetFL improves resource allocation, quality score, and learning accuracy, offering up to 5 times better performance in heterogeneous and distributed environments. The solution is validated through simulations and network traffic tests, demonstrating its ability to address the key challenges in IoT edge computing deployments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. Human Activity Recognition Based on Point Clouds from Millimeter-Wave Radar.
- Author
-
Lim, Seungchan, Park, Chaewoon, Lee, Seongjoo, and Jung, Yunho
- Subjects
HUMAN activity recognition, IMAGE recognition (Computer vision), HETEROGENEOUS computing, GATE array circuits, POINT cloud - Abstract
Human activity recognition (HAR) technology is related to human safety and convenience, so it is crucial that it infer human activity accurately. Furthermore, it must consume low power at all times while detecting human activity and be inexpensive to operate. For this purpose, a low-power and lightweight design of the HAR system is essential. In this paper, we propose a low-power and lightweight HAR system using point-cloud data collected by radar. The proposed HAR system uses a pillar feature encoder that converts 3D point-cloud data into a 2D image and a classification network based on depth-wise separable convolution for lightweighting. The proposed classification network achieved an accuracy of 95.54%, with 25.77 M multiply-accumulate operations and 22.28 K network parameters implemented in a 32-bit floating-point format. This network achieved 94.79% accuracy with 4-bit quantization, which reduced memory usage to 12.5% of that of the existing 32-bit format network. In addition, we implemented a lightweight HAR system optimized for low-power design on a heterogeneous computing platform, a Zynq UltraScale+ ZCU104 device, through hardware-software implementation. It took 2.43 ms of execution time to perform one frame of HAR on the device, and the system consumed 3.479 W of power when running. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. A hybrid genetic-based task scheduling algorithm for cost-efficient workflow execution in heterogeneous cloud computing environment.
- Author
-
Khademi Dehnavi, Mohsen, Broumandnia, Ali, Hosseini Shirvani, Mirsaeid, and Ahanian, Iman
- Subjects
COST control, HETEROGENEOUS computing, NP-hard problems, GENETIC algorithms, CLOUD computing, WORKFLOW management systems - Abstract
Many businesses utilize cost-efficient cloud services to execute their industrial and scientific workflow applications. Business continuity is a very important issue for both cloud users and providers. For reliable workflow execution, engaging reliable resources is a challenging job that underpins business continuity. In addition, the lowest execution time and monetary cost are two cost features that attract users to providers. In this regard, the task scheduling algorithm is very prominent in reducing costs in favor of users and providers. To address the issue, a system framework and different cost-type models are suggested. The task scheduling issue is then formulated as an optimization problem with an overall cost-reduction viewpoint. To solve this NP-hard problem, a hybrid genetic algorithm (HGA) is presented for reliable and cost-efficient task scheduling of workflow execution in a heterogeneous cloud computing environment. The proposed HGA has several phases, chief among them new crossover and mutation operators for global search and a walking-around procedure to enhance the quality of local-search solutions. It strikes a good balance between local and global searches in a huge search space, which leads to efficient results. To verify the proposed hybrid algorithm, it has been tested in twelve different scenarios with datasets of variable communication-to-computation ratios. The results of extensive simulations on the twelve dataset scenarios prove that HGA significantly dominates other state-of-the-art algorithms in terms of three prominent cost metrics, namely makespan, monetary cost, and failure cost, with cost reductions of 14.10%, 18.70%, and 42.30%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. A resource optimization scheduling model and algorithm for heterogeneous computing clusters based on GNN and RL.
- Author
-
Zhang, Zhen, Xu, Chen, Liu, Kun, Xu, Shaohua, and Huang, Long
- Subjects
GRAPH neural networks, CONVOLUTIONAL neural networks, REINFORCEMENT learning, HETEROGENEOUS computing, COMPUTER workstation clusters, LOAD balancing (Computer networks) - Abstract
In the realm of heterogeneous computing, the efficient allocation of resources is pivotal for optimizing system performance. However, user-submitted tasks are often complex and have varied resource demands. Moreover, the dynamic nature of resource states in such platforms, coupled with variations in resource types and capabilities, results in a highly intricate system environment. To this end, we propose a scheduling algorithm based on hierarchical reinforcement learning, namely MD-HRL. The algorithm simultaneously harmonizes task completion time, device power consumption, and load balancing. It contains a high-level agent (H-Agent) for task selection and a low-level agent (L-Agent) for resource allocation. The H-Agent leverages multi-hop attention graph neural networks (MAGNA) and one-dimensional convolutional neural networks (1DCNN) to encode the information of tasks and resources. Kolmogorov-Arnold networks are then employed to integrate these representations while calculating subtask priority scores. The L-Agent exploits a double deep Q network to approximate the best strategy and objective function, thereby optimizing the task-to-resource mapping in a dynamic environment. Experimental results demonstrate that MD-HRL outperforms several state-of-the-art baselines. It reduces makespan by 12.54%, improves load balancing by 5.83%, and lowers power consumption by 6.36% on average compared with the second-best method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. Dynamic service provisioning in heterogeneous fog computing architecture using deep reinforcement learning.
- Author
-
Alizadeh Govarchinghaleh, Yaghoub and Sabaei, Masoud
- Subjects
DEEP reinforcement learning, REINFORCEMENT learning, HEURISTIC algorithms, HETEROGENEOUS computing, INTERNET of things - Abstract
The exponential growth of IoT devices and the surge in the data volume, coupled with the rise of latency-sensitive applications, has led to a heightened interest in fog computing to meet user demands. In this context, the service provisioning problem consists of dynamically selecting desirable fog computing nodes and routing user traffic to these nodes. Given that the fog computing layer is composed of heterogeneous nodes, which vary in resource capacity, availability, and power sources, the service provisioning problem becomes challenging. Existing solutions, often using classical optimization approaches or heuristic algorithms due to the NP-hardness of the problem, have struggled to address the issue effectively, particularly in accounting for the heterogeneity of fog nodes and uncertainty of the ad hoc fog nodes. These techniques show exponential computation times and deal only with small network scales. To overcome these issues, we are motivated to replace these approaches with deep reinforcement learning (DRL) techniques, specifically employing the proximal policy optimization (PPO) algorithm to understand the dynamic behavior of the environment. The main objective of the proposed DRL-based dynamic service provisioning (DDSP) algorithm is minimizing service provisioning costs while considering service delay constraints, the uncertainty of ad hoc fog nodes, and the heterogeneity of both ad hoc and dedicated fog nodes. Extensive simulations demonstrate that our approach provides a near-optimal solution with high efficiency. Notably, our proposed algorithm selects more stable fog nodes for service provisioning and successfully minimizes cost even with uncertainty regarding ad hoc fog nodes, compared to heuristic algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. LATA: learning automata-based task assignment on heterogeneous cloud computing platform.
- Author
-
Gheisari, Soulmaz and ShokrZadeh, Hamid
- Subjects
WIDE area networks, VIRTUAL machine systems, DIRECTED acyclic graphs, CLOUD computing, HETEROGENEOUS computing - Abstract
A cloud computing environment is a distributed system where idle resources are accessible across a wide area network, such as the Internet. Due to the diverse specifications of these resources, computational clouds exhibit high heterogeneity. Task scheduling, the process of dispatching cloud applications onto processing nodes, becomes a critical challenge in such environments. Ensuring high utilization in this heterogeneous environment entails identifying suitable machines or virtual machines capable of efficiently executing jobs, constituting a multi-objective optimization problem. This paper proposes a dynamic Learning Automata-based Task Assignment algorithm, named LATA, to address this challenge. In the algorithm, each application is represented as a Directed Acyclic Graph, with tasks as nodes and data dependencies as edges. Initially, tasks are grouped based on their data dependencies to consolidate independent tasks into one group. Subsequently, a variable-structure learning automaton is assigned to each group of tasks to identify appropriate task-machine combinations. The primary objectives of LATA include minimizing makespan and energy consumption by facilitating efficient task placement to achieve load balance and maximize resource utilization. Additionally, an enhancement is proposed, involving the use of a different grouping policy prior to task assignment to further improve performance. Computer simulation results demonstrate the superior performance of the proposed algorithms in highly heterogeneous environments compared to state-of-the-art algorithms. Notably, total execution time and energy consumption decrease by up to 50% and 37%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
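The variable-structure learning automaton underlying LATA can be illustrated with the classic linear reward-inaction (L_RI) update rule. The reward signal and learning rate below are assumptions for illustration; LATA's task grouping and multi-objective reward are not modeled:

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Illustrative variable-structure learning automaton: an action (machine
// choice) is sampled from a probability vector, and the L_RI rule moves
// probability mass toward actions that earn a reward.
int main() {
    std::mt19937 rng(42);
    std::vector<double> p(4, 0.25); // action probabilities over 4 machines
    const double a = 0.1;           // learning rate

    for (int step = 0; step < 200; ++step) {
        std::discrete_distribution<int> pick(p.begin(), p.end());
        int action = pick(rng);
        bool rewarded = (action == 2); // pretend machine 2 is the best fit

        if (rewarded) { // L_RI: reinforce the rewarded action
            for (int j = 0; j < (int)p.size(); ++j)
                p[j] = (j == action) ? p[j] + a * (1.0 - p[j])
                                     : p[j] * (1.0 - a);
        } // on penalty, L_RI leaves the probabilities unchanged
    }
    for (int j = 0; j < (int)p.size(); ++j)
        std::printf("P(machine %d) = %.3f\n", j, p[j]);
}
```

The update preserves the probability simplex (the entries still sum to 1), so the automaton converges toward the consistently rewarded machine without any explicit normalization step.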
27. Adaptive mesh refinement for a particle level simulation of supercritical water fluidized bed reactors.
- Author
-
Su, Haozhe, Jin, Hui, Wang, Bingcheng, and Guo, Liejin
- Subjects
SUPERCRITICAL water, GRANULAR flow, HETEROGENEOUS computing, SCALABILITY, FLUIDS - Abstract
High-precision computational fluid dynamics-discrete element method (CFD-DEM) simulation serves as a foundation for the detailed design of supercritical water fluidized bed reactors (SCWFBRs). However, as the scale of these reactors increases, the computational domain and reacting particles expand exponentially. Traditional adaptive mesh refinement is constrained in dealing with dense phase zones in SCWFBRs, as the nonlinear characteristics of the dense reacting particle flow render local-field error estimation ineffective. To address this limitation, an efficient and accurate CFD-DEM framework is established, integrating adaptive mesh refinement and heterogeneous computing. The mesh adaptation strategy is based on whole-field error estimation and is defined by a set of cell-marking criteria. The effect of the criteria on the accuracy and efficiency of the computational process is evaluated. This method reduces the demand for computational resources, thereby enabling the simulation of large-scale reactors at the particle level. The optimized adaptation strategy maintains an error of 1.67% while saving 51.59% of mesh cells required and 28.4% computation time. The primary factors influencing computational accuracy are identified as fluid velocity gradient and the particle reaction rate. The scalability of the adaptation strategy is also validated. For a fivefold radial scaled-up reactor, the cell count is reduced by 46.86% and the computation time by 22.2%, while the maximum error is limited to 2.79%. This work provides a solution that can consider both the computational scale and accuracy, thereby enabling the detailed design of large-scale SCWFBRs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. Implementation of an Improved LeNet Traffic Sign Multi-classification Heterogeneous Accelerator.
- Author
-
YANG Yongjie, ZHENG Juntai, MA Li, and YANG Hao
- Subjects
TRAFFIC signs & signals, HETEROGENEOUS computing, PARALLEL processing, PARALLEL programming, ELECTRONIC data processing - Abstract
An implementation of a traffic sign multi-classification heterogeneous accelerator based on an improved LeNet is proposed. The accelerator utilizes an ARM+FPGA heterogeneous platform, deploying the forward inference of the improved LeNet on the FPGA for parallel computing. On the FPGA side, the AXI-Stream protocol is employed with DMA to achieve high-speed data streaming, and techniques such as array partitioning and multi-level pipelining are utilized for parallel data processing. On the ARM side, the PYNQ framework is used for data updates and accelerator scheduling. Experimental results on GTSRB demonstrate that the proposed design achieves an average inference time of 14.489 ms at a working clock frequency of 50 MHz, compared to 710 ms on the MCU, a speedup of up to 49 times. This design provides significant assistance for edge applications involving traffic sign multi-classification. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. TrinitySLAM: On-board Real-time Event-image Fusion SLAM System for Drones.
- Author
-
Cai, Xinjun, Xu, Jingao, Deng, Kuntian, Lan, Hongbo, Wu, Yue, Zhuge, Xiangwen, and Yang, Zheng
- Subjects
FLIGHT control systems, HETEROGENEOUS computing, COMPUTING platforms, HIGH-speed aeronautics, IMAGE sensors, HIGH dynamic range imaging - Abstract
Drones have gained extensive popularity in diverse smart applications, and visual Simultaneous Localization and Mapping (SLAM) technology is commonly used to estimate the six-degrees-of-freedom pose for drone flight control systems. However, traditional image-based SLAM cannot ensure the flight safety of drones, especially in challenging environments such as high-speed flight and high dynamic range scenarios. The event camera, a new vision sensor, holds the potential to enable drones to overcome these challenging scenarios if fused with image-based SLAM. Unfortunately, the computational demands of event-image fusion SLAM have grown manifold compared with image-based SLAM. Existing research on visual SLAM acceleration cannot achieve real-time operation of event-image fusion SLAM on on-board computing platforms for drones. To fill this gap, we present TrinitySLAM, a high-accuracy, real-time, low-energy-consumption event-image fusion SLAM acceleration framework utilizing Xilinx Zynq, an on-board heterogeneous computing platform. The key innovations of TrinitySLAM include a fine-grained computation allocation strategy, several novel hardware-software co-acceleration designs, and an efficient data exchange mechanism. We fully implement TrinitySLAM on the latest Zynq UltraScale+ platform and evaluate its performance on one custom-made drone dataset and four official datasets covering various scenarios. Comprehensive experiments show that TrinitySLAM improves pose estimation accuracy by 28% with half the end-to-end latency and a 1.2x reduction in energy consumption compared with the most comparable state-of-the-art heterogeneous computing platform acceleration baseline. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. OPTIMIZING HADOOP DATA LOCALITY: PERFORMANCE ENHANCEMENT STRATEGIES IN HETEROGENEOUS COMPUTING ENVIRONMENTS.
- Author
-
SI-YEONG KIM and TAI-HOON KIM
- Subjects
MACHINE learning, HETEROGENEOUS distributed computing, HETEROGENEOUS computing, DISTRIBUTED computing, ELECTRONIC data processing - Abstract
As organizations increasingly harness big data for analytics and decision-making, the efficient processing of massive datasets becomes paramount. Hadoop, a widely adopted distributed computing framework, excels at processing large-scale data. However, its performance is contingent on effective data locality, which becomes challenging in heterogeneous computing environments comprising diverse hardware resources. This research addresses the imperative of enhancing Hadoop's data-locality performance in heterogeneous computing environments. The study explores strategies to optimize data placement and task scheduling, considering the diverse characteristics of nodes within the infrastructure. Through a comprehensive analysis of Hadoop's data locality algorithms and their impact on performance, this work proposes novel approaches to mitigate challenges associated with disparate hardware capabilities. The proposed work uses the Weighted Extreme Learning Machine (WELM) combined with the Firefly Algorithm (WELM-FF). The integration of WELM with the Firefly Algorithm holds promise for enhancing machine learning models in the context of large-scale data processing. The research employs a combination of theoretical analysis and practical experiments to evaluate the effectiveness of the proposed enhancements. Factors such as network latency, disk I/O, and CPU capabilities are taken into account to develop a holistic framework for improving data locality and, consequently, overall Hadoop performance. The findings presented in this study contribute valuable insights to the field of distributed computing, offering practical recommendations for organizations seeking to maximize the efficiency of their Hadoop deployments in heterogeneous computing environments. By addressing the intricacies of data locality, this research strives to enhance the scalability and performance of Hadoop clusters, thereby facilitating more effective utilization of big data resources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
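Hadoop's data-locality preference, which the entry above optimizes, follows a node-local, then rack-local, then off-rack order. A minimal sketch of that placement policy follows; the free-slot tie-break is a simplified stand-in for the paper's WELM-FF scoring:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative locality-aware task placement: prefer a node holding the
// input split's replica, then a node in the same rack, then any node.
struct Node { std::string name, rack; double free_slots; };

const Node* place(const std::vector<Node>& nodes,
                  const std::vector<std::string>& replica_nodes,
                  const std::string& replica_rack) {
    auto pick = [&](auto pred) -> const Node* {
        const Node* best = nullptr;
        for (const auto& n : nodes)
            if (n.free_slots > 0 && pred(n) &&
                (!best || n.free_slots > best->free_slots))
                best = &n;
        return best;
    };
    if (auto* n = pick([&](const Node& n) {
            return std::find(replica_nodes.begin(), replica_nodes.end(),
                             n.name) != replica_nodes.end(); }))
        return n; // node-local
    if (auto* n = pick([&](const Node& n) { return n.rack == replica_rack; }))
        return n; // rack-local
    return pick([](const Node&) { return true; }); // off-rack fallback
}

int main() {
    std::vector<Node> nodes = {{"n1", "r1", 0}, {"n2", "r1", 3}, {"n3", "r2", 5}};
    const Node* n = place(nodes, {"n1"}, "r1"); // replica on busy n1
    std::printf("scheduled on %s (rack %s)\n", n->name.c_str(), n->rack.c_str());
}
```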
31. GAPS: GPU-accelerated processing service for SM9.
- Author
-
Xu, Wenhan, Ma, Hui, and Zhang, Rui
- Subjects
BATCH processing, GRAPHICS processing units, MATHEMATICAL optimization, HETEROGENEOUS computing, INTERNET of things - Abstract
SM9 was established in 2016 as a Chinese official identity-based cryptographic (IBC) standard, and became an ISO standard in 2021. It is well known that IBC is suitable for Internet of Things (IoT) applications, since centralized processing of client data (e.g., in an IoT cloud) is often done by gateways. However, due to the limited computation resources inside IoT devices, the performance of SM9 becomes a bottleneck in practical usage. The existing SM9 implementations are often CPU-based, with relatively low latency and low throughput. Consequently, a pivotal challenge for SM9 in large-scale applications is how to reduce latency while maximizing throughput for numerous concurrent inputs. After a systematic analysis of the SM9 algorithms, we apply optimization techniques including precomputation, resource caching, and parallelization to reduce the overhead of SM9. In this work, we introduce the first practical implementation of SM9 and its underlying SM9_P256 curve on GPU. Our GPU implementation combines multiple algorithms and low-level optimizations tailored for the GPU's single instruction, multiple threads architecture in order to achieve high throughput for SM9. Based on these, we propose GAPS, a high-performance Cryptography as a Service (CaaS) for SM9. GAPS adopts a heterogeneous computing architecture that flexibly schedules the inputs across two implementation platforms: a CPU for the low-latency processing of sporadic inputs, and a GPU for the high-throughput processing of batch inputs. According to our benchmark, GAPS only takes a few milliseconds to process a single SM9 request in idle mode. Moreover, when operating in its batch processing mode, GAPS can generate 2,038,071 private keys, 248,239 signatures or 238,001 ciphertexts per second. The results show that GAPS scales seamlessly across inputs of different sizes, preliminarily demonstrating the efficacy of our solution. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
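GAPS's two-path scheduling described above, CPU for sporadic low-latency requests and GPU for high-throughput batches, can be sketched as a simple threshold policy. All cost constants below are assumptions, not the paper's measurements:

```cpp
#include <cstdio>
#include <vector>

// Illustrative CPU/GPU dispatch policy: sporadic arrivals take the
// low-latency CPU path; large concurrent batches are offloaded to the GPU,
// whose launch and transfer overheads amortize across the batch.
void dispatch(const std::vector<int>& burst) {
    const size_t gpu_threshold = 32;    // batch size where the GPU wins
    const double cpu_ms = 2.0;          // per-request CPU latency
    const double gpu_launch_ms = 5.0;   // fixed kernel/transfer overhead
    const double gpu_per_req_ms = 0.05; // marginal GPU cost per request

    if (burst.size() < gpu_threshold) {
        std::printf("CPU path: %zu reqs, ~%.1f ms each\n",
                    burst.size(), cpu_ms);
    } else {
        double total = gpu_launch_ms + gpu_per_req_ms * burst.size();
        std::printf("GPU path: %zu reqs in one batch, ~%.1f ms total\n",
                    burst.size(), total);
    }
}

int main() {
    dispatch(std::vector<int>(3));    // sporadic -> low-latency CPU path
    dispatch(std::vector<int>(2048)); // bulk -> high-throughput GPU batch
}
```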
32. MOTORS: multi-objective task offloading and resource scheduling algorithm for heterogeneous fog-cloud computing scenario.
- Author
-
Shukla, Prashant and Pandey, Sudhakar
- Subjects
HETEROGENEOUS computing, COMPUTER systems, SEARCH algorithms, COMPUTING platforms, COST control, WORKFLOW - Abstract
Along with the rising popularity of pay-as-you-go cloud services, many businesses and communities are deploying their business or scientific workflow applications on cloud-based computing platforms. The primary responsibility of cloud service providers is to reduce the monetary cost and execution time of Infrastructure as a Service (IaaS) cloud services. The majority of current solutions for cost and makespan reduction were developed for conventional cloud platforms and are incompatible with heterogeneous computing systems (HCS) that have service-based resource management approaches and pricing models. Fog-cloud infrastructures (FCI) have emerged as desirable target areas for workflow automation across several fields of application. In heterogeneous FCI, the execution of workflows involving tasks with different properties might influence performance in terms of resource usage. The primary goal of this research is to efficiently offload the computational tasks and optimally schedule the workflow in such a diverse computing environment. In this article, we present a novel strategy for building an environment that includes techniques for offloading and scheduling while balancing competing demands from the user and the resource providers. To address the issue of uncertainty, our approach incorporates a fuzzy dominance-based task clustering and offloading technique. To construct a suitable execution sequence of tasks that helps to limit the precedence relationship, preserving dependency constraints among the tasks, a novel algorithm for task segmentation is employed. To manage the problem's complexity, a hybrid heuristic based on the Harmony Search Algorithm (HSA) and the Genetic Algorithm (GA) is used for resource scheduling. Multi-objective optimization with three competing objectives is investigated in heterogeneous FCI. The derived fitness function includes minimization of makespan and cost along with maximization of resource utilization. We performed experimental research using five workflow datasets in order to investigate and verify the efficacy of our proposed technique. We contrasted our proposed strategy with the primary, closely comparable strategies. Extensive testing using scientific workflows confirms the effectiveness of our offloading approach. Our solution provided substantially better cost-makespan tradeoffs, while achieving significantly lower energy consumption, and executes marginally faster than the existing algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. NEST‐C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators.
- Author
-
Park, Jeman, Yu, Misun, Kwon, Jinse, Park, Junmo, Lee, Jemin, and Kwon, Yongin
- Subjects
ARTIFICIAL intelligence, HETEROGENEOUS computing, COMPUTER systems, ENERGY consumption, COMPILERS (Computer programs) - Abstract
Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general‐purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing‐in‐memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency but they require more tailored optimizations. To address these limitations, we propose the NEST compiler (NEST‐C), a novel DL framework that improves the deployment and performance of models across various AI accelerators. NEST‐C leverages profiling‐based quantization, dynamic graph partitioning, and multi‐level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST‐C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient DL model deployment in modern AI applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
34. Energy efficiency assessment in advanced driver assistance systems with real-time image processing on custom Xilinx DPUs.
- Author
-
Tatar, Güner and Bayar, Salih
- Abstract
The rapid advancement in embedded AI, driven by the integration of deep neural networks (DNNs) into embedded systems for real-time image and video processing, has been notably pushed by AI-specific platforms like AMD Xilinx Vitis AI on the MPSoC-FPGA platform. This platform utilizes a configurable Deep Processing Unit (DPU) for scalable resource utilization and operating frequencies. Our study employed a detailed methodology to assess the impact of various DPU configurations and frequencies on resource utilization and energy consumption. The findings reveal that increasing the DPU frequency enhances resource utilization efficiency and improves performance. Conversely, lower frequencies significantly reduce resource utilization with only a marginal decrease in performance. These trade-offs are influenced not only by frequency but also by variations in DPU parameters. These findings are critical for developing energy-efficient AI-driven systems for Advanced Driver Assistance Systems (ADAS) based on real-time video processing. By leveraging the capabilities of Xilinx Vitis AI deployed on the Kria KV260 MPSoC platform, we explore the intricacies of optimizing energy efficiency through multi-task learning in real-time ADAS applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. DGCQN: a RL and GCN combined method for DAG scheduling in edge computing.
- Author
-
Qin, Bin, Lei, Qinyang, and Wang, Xin
- Subjects
EDGE computing, REINFORCEMENT learning, DIRECTED acyclic graphs, CONVOLUTIONAL neural networks, HETEROGENEOUS computing, SCHEDULING - Abstract
Edge computing is an emerging paradigm that enables low-latency and high-performance computing at the network edge. However, effectively scheduling complex and interdependent tasks on heterogeneous and dynamic edge computing nodes presents a significant challenge in meeting users' real-time response requirements. To solve this problem, a DGCQN scheduling network that leverages reinforcement learning and graph convolutional neural networks to learn an optimal scheduling strategy is proposed in this paper. The proposed method embeds the graph structure of Directed Acyclic Graph (DAG) tasks and node information of Kubernetes (K8s) clusters into a Q-value function, guiding the DQN network in selecting the best action at each step. The method is evaluated across various DAG tasks and edge computing scenarios. Compared with HEFT, DQN, and GOSU, the task completion time of the proposed method is reduced by about 20%, 10%, and 1.5%, respectively. The results demonstrate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. Load-balancing scheduling strategy for Spark heterogeneous clusters.
- Author
-
陶宇炜 and 谢爱娟
- Subjects
DYNAMIC loads, DEAD loads (Mechanics), HETEROGENEOUS computing, DATA distribution, SCHEDULING - Abstract
Aiming at the problem that the Spark scalable distributed platform considers neither the computing capabilities of heterogeneous cluster nodes nor load balance during job task scheduling, which degrades system performance, this paper constructs a load-balancing scheduling policy for heterogeneous cluster nodes in the Spark environment. Each heterogeneous cluster node predicts the data distribution characteristics using a sampling algorithm and divides the data into balanced partitions. According to a weighted combination of static load and dynamic load, each node obtains its real-time load and dynamically schedules job tasks. Finally, the Wordcount, TeraSort, and K-means benchmarks were used for comparative analysis on a running heterogeneous cluster. Experimental results show that this algorithm can reduce execution time significantly and improve the performance of the heterogeneous cluster. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. Enhancing Kokkos with OpenACC.
- Author
-
Valero-Lara, Pedro, Lee, Seyong, Gonzalez-Tallada, Marc, Denny, Joel, Teranishi, Keita, and Vetter, Jeffrey S.
- Subjects
- *
LATTICE Boltzmann methods , *PARALLEL programming , *HETEROGENEOUS computing , *C++ - Abstract
C++ template metaprogramming has emerged as a prominent approach for achieving performance portability in heterogeneous computing. Kokkos represents a notable paradigm in this domain, offering programmers a suite of high-level abstractions for generic programming while deferring much of the device-specific code generation and optimization to the compiler through template specializations. Kokkos furnishes a range of device-specific code specializations across multiple back ends, including CUDA and HIP. Diverging from conventional back ends, the OpenACC implementation presents a high-level, multicompiler, multidevice, and directive-based programming model. This paper presents recent advancements in the OpenACC back end for Kokkos (i.e., KokkACC) and focuses on its integration into the Kokkos ecosystem, exploration of automatic device selection capabilities to enhance productivity, and performance evaluation on modern hardware such as NVIDIA H100 GPUs. The study includes implementation details and a thorough performance assessment across various computational benchmarks, including minibenchmarks (AXPY and DOT product), miniapps (LULESH, MiniFE, and SNAP-LAMMPS), and a scientific kernel based on the lattice Boltzmann method. [ABSTRACT FROM AUTHOR]
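For context, the AXPY minibenchmark mentioned above looks as follows in Kokkos; the same parallel_for dispatches to CUDA, HIP, or, with KokkACC, OpenACC, depending on how Kokkos was built. A generic sketch, not code from the paper.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    const double alpha = 2.0;
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);
    // AXPY: y = alpha*x + y. The back end executing this lambda (CUDA,
    // HIP, OpenACC via KokkACC, ...) is selected when Kokkos is built.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = alpha * x(i) + y(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```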
- Published
- 2024
- Full Text
- View/download PDF
38. GPU-HADVPPM4HIP V1.0: using the heterogeneous-compute interface for portability (HIP) to speed up the piecewise parabolic method in the CAMx (v6.10) air quality model on China's domestic GPU-like accelerator.
- Author
-
Cao, Kai, Wu, Qizhong, Wang, Lingling, Guo, Hengliang, Wang, Nan, Cheng, Huaqiong, Tang, Xiao, Li, Dongxing, Liu, Lina, Li, Dongqing, Wu, Hao, and Wang, Lanning
- Subjects
- *
MESSAGE passing (Computer science) , *GRAPHICS processing units , *AIR quality , *EARTH sciences , *HETEROGENEOUS computing - Abstract
Graphics processing units (GPUs) are becoming a compelling acceleration strategy for geoscience numerical models due to their powerful computing performance. In this study, AMD's heterogeneous-compute interface for portability (HIP) was used to port the GPU acceleration version of the piecewise parabolic method (PPM) solver (GPU-HADVPPM) from NVIDIA GPUs to China's domestic GPU-like accelerators, yielding GPU-HADVPPM4HIP. Furthermore, a multi-level hybrid parallelism scheme was introduced to improve the total computational performance of the HIP version of the CAMx (Comprehensive Air Quality Model with Extensions; CAMx-HIP) model on China's domestic heterogeneous cluster. The experimental results show that the acceleration effect of GPU-HADVPPM on the different GPU accelerators is more apparent when the computing scale is larger, and the maximum speedup of GPU-HADVPPM on the domestic GPU-like accelerator is 28.9×. Hybrid parallelism with the message passing interface (MPI) and HIP achieves up to a 17.2× speedup when configuring 32 CPU cores and GPU-like accelerators on the domestic heterogeneous cluster. OpenMP technology is further introduced, reducing the computation time of the CAMx-HIP model by 1.9×. More importantly, comparing the simulation results of GPU-HADVPPM on NVIDIA GPUs and on domestic GPU-like accelerators shows that the results on the domestic GPU-like accelerators exhibit smaller differences than those on the NVIDIA GPUs. Furthermore, we show that the data transfer efficiency between CPU and GPU has an essential impact on heterogeneous computing and point out that optimizing it is one of the critical directions for improving the computing efficiency of geoscience numerical models on heterogeneous clusters in the future. [ABSTRACT FROM AUTHOR]
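For orientation, HIP code is source-portable across AMD-style and NVIDIA GPUs; a minimal kernel launch looks as below. The toy three-point smoothing kernel merely stands in for the much more involved PPM advection solver, and error checking is omitted.

```cpp
#include <hip/hip_runtime.h>

// Toy 1D smoothing kernel; a stand-in for the PPM solver, which is far
// more involved (illustrative only).
__global__ void smooth(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i > 0 && i < n - 1)
    out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

int main() {
  const int n = 1 << 20;
  float *d_in, *d_out;
  hipMalloc(&d_in, n * sizeof(float));
  hipMalloc(&d_out, n * sizeof(float));
  // Launch with 256-thread blocks; the same source builds for AMD or
  // NVIDIA targets with the HIP toolchain.
  hipLaunchKernelGGL(smooth, dim3((n + 255) / 256), dim3(256), 0, 0,
                     d_in, d_out, n);
  hipDeviceSynchronize();
  hipFree(d_in);
  hipFree(d_out);
  return 0;
}
```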
- Published
- 2024
- Full Text
- View/download PDF
39. FPGA Accelerators for Computing Interatomic Potential-Based Molecular Dynamics Simulation for Gold Nanoparticles: Exploring Different Communication Protocols.
- Author
-
Patel, Ankitkumar, Vasudevan, Srivathsan, and Bulusu, Satya
- Subjects
ARTIFICIAL neural networks ,FIELD programmable gate arrays ,PCI bus (Computer bus) ,HETEROGENEOUS computing ,GOLD nanoparticles - Abstract
Molecular Dynamics (MD) simulation for computing Interatomic Potential (IAP) is a very important High-Performance Computing (HPC) application. MD simulation on particles of experimental relevance takes huge computation time, despite using an expensive high-end server. Heterogeneous computing, a combination of a Field Programmable Gate Array (FPGA) and a computer, is proposed as a solution to compute MD simulations efficiently. In such heterogeneous computation, communication between the FPGA and the computer is necessary. One such MD simulation, explained in the paper, is the Artificial Neural Network (ANN)-based IAP computation of gold (Au147 and Au309) nanoparticles. MD simulation calculates the forces between atoms and the total energy of the chemical system. This work proposes the novel design and implementation of an ANN-IAP-based MD simulation for Au147 and Au309 using communication protocols, such as Universal Asynchronous Receiver-Transmitter (UART) and Ethernet, for communication between the FPGA and the host computer. To improve the latency of MD simulation through heterogeneous computing, both protocols were explored in MD simulations of 50,000 cycles. In this study, computation times of 17.54 and 18.70 h were achieved with UART and Ethernet, respectively, compared to the conventional server time of 29 h for Au147 nanoparticles. The results pave the way for the development of a Lab-on-a-chip application. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
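To make entry 39's host-side link concrete: on a POSIX host, a UART channel to an FPGA can be opened and used roughly as below. This is a hypothetical sketch; the device path /dev/ttyUSB0, the 115200 baud rate, and the raw float payload are assumptions, not the paper's protocol.

```cpp
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Open the serial device connecting host and FPGA (path is an assumption).
  int fd = open("/dev/ttyUSB0", O_RDWR | O_NOCTTY);
  if (fd < 0) { std::perror("open"); return 1; }

  termios tio{};
  tcgetattr(fd, &tio);
  cfmakeraw(&tio);             // raw mode: pass bytes through unmodified
  cfsetispeed(&tio, B115200);  // baud rate is an assumption
  cfsetospeed(&tio, B115200);
  tcsetattr(fd, TCSANOW, &tio);

  float coords[3] = {0.0f, 1.2f, -0.7f};  // toy payload: one atom position
  write(fd, coords, sizeof coords);        // host -> FPGA
  float force[3] = {};
  // FPGA -> host; a real driver would loop until all bytes arrive.
  read(fd, force, sizeof force);
  std::printf("F = (%g, %g, %g)\n", force[0], force[1], force[2]);
  close(fd);
  return 0;
}
```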
40. Delay-Aware and Energy-Efficient Task Scheduling Using Strength Pareto Evolutionary Algorithm II in Fog-Cloud Computing Paradigm.
- Author
-
Daghayeghi, Atousa and Nickray, Mohsen
- Subjects
EVOLUTIONARY algorithms ,HETEROGENEOUS computing ,SCHEDULING ,ENERGY consumption ,CLOUD computing - Abstract
The exponential growth of technology and the advent of the Internet of Things (IoT) paradigm have caused large volumes of data to be continuously generated by intelligent devices. One common feature of these devices is their limited capabilities; hence, they are not able to process the large volumes of data they generate. However, processing these data in the cloud leads to high latency and high power consumption, so providing services to latency-sensitive IoT applications in the cloud is a challenging issue. Fog computing, as a complement to the cloud, allows data to be processed near IoT devices. However, the resources in the fog layer are heterogeneous. Thus, the proper distribution of tasks among heterogeneous nodes while serving each task within the intended deadline is of great importance. In this paper, we present a task scheduling model for the fog-cloud paradigm, which formulates the task scheduling problem as a multi-objective optimization problem with the aim of minimizing service response time and the total energy consumption of the system, while considering deadline and load-balancing constraints. Since the task scheduling problem is NP-hard, we propose a modified version of the Strength Pareto Evolutionary Algorithm II (SPEA-II) with customized operators to achieve the optimal scheduling strategy. The experimental results reveal that the proposed scheme outperforms several benchmark algorithms in terms of service response time and energy consumption. Furthermore, by optimally distributing tasks among heterogeneous computing nodes, it achieves better resource utilization and reduces the percentage of missed-deadline tasks. [ABSTRACT FROM AUTHOR]
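At the heart of SPEA-II is the Pareto dominance relation over the two objectives minimized here, service response time and energy; a minimal sketch (with assumed field names) follows.

```cpp
// Pareto dominance test over the two objectives the paper minimizes:
// service response time and total energy. SPEA-II derives its strength
// and fitness values from exactly this relation between candidate
// schedules. Field names are assumptions.
struct Objectives {
  double response_time;
  double energy;
};

// Returns true if schedule a dominates schedule b: a is no worse in
// both objectives and strictly better in at least one.
bool dominates(const Objectives& a, const Objectives& b) {
  bool no_worse = a.response_time <= b.response_time && a.energy <= b.energy;
  bool strictly_better = a.response_time < b.response_time || a.energy < b.energy;
  return no_worse && strictly_better;
}
```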
- Published
- 2024
- Full Text
- View/download PDF
41. Design and implementation of a scalable correlator based on ROACH2 + GPU cluster for the Tianlai 96-dual-polarization antenna array.
- Author
-
Wang, Zhao, Li, Ji-Xia, Zhang, Ke, Wu, Feng-Quan, Tian, Hai-Jun, Niu, Chen-Hui, Zhang, Ju-Yong, Chen, Zhi-Ping, Yu, Dong-Jin, Chen, Xue-Lei, Werthimer, Dan, and An, Tao
- Subjects
- *
DIGITAL signal processing , *RADIO telescopes , *HETEROGENEOUS computing , *SIGNAL processing , *RADIO astronomy - Abstract
The digital correlator is one of the most crucial data processing components of a radio telescope array. As the scale of radio interferometric arrays grows, many efforts have been devoted to developing cost-effective and scalable correlators in the field of radio astronomy. In this paper, a 192-input digital correlator with six CASPER ROACH2 boards and seven GPU servers has been deployed as the digital signal processing system for the Tianlai cylinder pathfinder located at Hongliuxia observatory. The correlator handles 192 input signals (96 dual-polarization) over a 125-MHz bandwidth, with full-Stokes output. It inherits the advantages of the CASPER system: low cost, high performance, modular scalability, and a heterogeneous computing architecture. With a rapidly deployable ROACH2 digital sampling system, a commercially expandable 10 Gigabit switching network system, and a flexibly upgradable GPU computing system, the correlator forms a low-cost and easily upgradable system, poised to support scalable large-scale interferometric arrays in the future. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. BPaaS placement over optimum cloud availability zones.
- Author
-
Hedhli, Ameni, Mezni, Haithem, and Ben Said, Lamjed
- Subjects
- *
MACHINE learning , *BUSINESS process management , *INFORMATION networks , *HETEROGENEOUS computing , *SOFTWARE as a service - Abstract
Business Process as a Service (BPaaS) has recently emerged from the synergy between business process management and cloud computing, allowing companies to outsource and migrate their businesses to the cloud. BPaaS management refers to the set of operations (decomposition, customization, placement, etc.) that maintain a high quality of the deployed cloud-based businesses. Like its ancestor SaaS, BPaaS placement consists of dispersing a BPaaS's composing fragments over multiple cloud availability zones (CAZ). The latter are characterized by their huge, diverse, and dynamic data, which are exploited to select the high-performance servers holding BPaaS fragments while preserving their constraints. These fragments' relations and their placement schemes constitute a dynamic BPaaS information network. However, the few existing BPaaS solutions adopt a static placement strategy, whereas it is important to take the dynamic and uncertain nature of the CAZ into account. Also, current solutions do not properly model the BPaaS environment. To offer an efficient BPaaS placement scheme, we combine prediction and learning capabilities, which help identify the migrating fragments and their new hosting servers. We first model the BPaaS context as a heterogeneous information network. Then, we apply an incremental representation learning approach to facilitate its processing. Using the principles of proximity-aware representation learning, we infer useful knowledge regarding BPaaS fragments and the available servers at different CAZ. Finally, based on the degree of closeness between the BPaaS environment's entities (e.g., fragments, servers), we select the optimum cloud availability zone to which the resource-consuming BPaaS fragments are migrated according to a proposed placement scheme. The obtained results were very promising compared to traditional BPaaS placement solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. Revisiting Database Indexing for Parallel and Accelerated Computing: A Comprehensive Study and Novel Approaches.
- Author
-
Abbasi, Maryam, Bernardo, Marco V., Váz, Paulo, Silva, José, and Martins, Pedro
- Subjects
- *
PROCESS capability , *HETEROGENEOUS computing , *DATABASES , *REINFORCEMENT learning , *COMPUTER systems - Abstract
While the importance of indexing strategies for optimizing query performance in database systems is widely acknowledged, the impact of rapidly evolving hardware architectures on indexing techniques has been an underexplored area. As modern computing systems increasingly leverage parallel processing capabilities, multi-core CPUs, and specialized hardware accelerators, traditional indexing approaches may not fully capitalize on these advancements. This comprehensive experimental study investigates the effects of hardware-conscious indexing strategies tailored for contemporary and emerging hardware platforms. Through rigorous experimentation on a real-world database environment using the industry-standard TPC-H benchmark, this research evaluates the performance implications of indexing techniques specifically designed to exploit parallelism, vectorization, and hardware-accelerated operations. By examining approaches such as cache-conscious B-Tree variants, SIMD-optimized hash indexes, and GPU-accelerated spatial indexing, the study provides valuable insights into the potential performance gains and trade-offs associated with these hardware-aware indexing methods. The findings reveal that hardware-conscious indexing strategies can significantly outperform their traditional counterparts, particularly in data-intensive workloads and large-scale database deployments. Our experiments show improvements ranging from 32.4% to 48.6% in query execution time, depending on the specific technique and hardware configuration. However, the study also highlights the complexity of implementing and tuning these techniques, as they often require intricate code optimizations and a deep understanding of the underlying hardware architecture. Additionally, this research explores the potential of machine learning-based indexing approaches, including reinforcement learning for index selection and neural network-based index advisors. While these techniques show promise, with performance improvements of up to 48.6% in certain scenarios, their effectiveness varies across different query types and data distributions. By offering a comprehensive analysis and practical recommendations, this research contributes to the ongoing pursuit of database performance optimization in the era of heterogeneous computing. The findings inform database administrators, developers, and system architects on effective indexing practices tailored for modern hardware, while also paving the way for future research into adaptive indexing techniques that can dynamically leverage hardware capabilities based on workload characteristics and resource availability. [ABSTRACT FROM AUTHOR]
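As one concrete flavor of the hardware-conscious techniques studied, the sketch below probes a hash bucket of eight 32-bit keys with a single AVX2 comparison; it is an illustrative reconstruction of the general SIMD-probe idea, not code from the study, and assumes an AVX2-capable CPU and a GCC/Clang compiler.

```cpp
#include <immintrin.h>
#include <cstdint>

// Probe one key against 8 bucket slots with a single AVX2 comparison
// instead of a scalar loop. Compile with -mavx2.
int find_in_bucket(const int32_t slots[8], int32_t key) {
  __m256i probe = _mm256_set1_epi32(key);                  // broadcast key
  __m256i vals  = _mm256_loadu_si256(
      reinterpret_cast<const __m256i*>(slots));            // load 8 slots
  __m256i eq    = _mm256_cmpeq_epi32(probe, vals);         // lane-wise compare
  int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));  // one bit per lane
  return mask ? __builtin_ctz(mask) : -1;                  // first match or -1
}
```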
- Published
- 2024
- Full Text
- View/download PDF
44. Optimizing Three-Dimensional Stencil-Operations on Heterogeneous Computing Environments.
- Author
-
Herrmann, Nina, Dieckmann, Justus, and Kuchen, Herbert
- Subjects
- *
HETEROGENEOUS computing , *LATTICE Boltzmann methods , *COMPUTATIONAL fluid dynamics , *PARALLEL programming , *LATTICE gas - Abstract
Complex algorithms and enormous data sets require parallel execution of programs to attain results in a reasonable amount of time. Both aspects are combined in the domain of three-dimensional stencil operations, for example, in computational fluid dynamics. This work contributes to the research on high-level parallel programming by discussing a generalizable implementation of a three-dimensional stencil skeleton that works in heterogeneous computing environments. Two exemplary programs, a gas simulation with the Lattice Boltzmann method and a mean blur, are executed in a multi-node, multi-GPU environment, demonstrating the runtime improvements of heterogeneous computing environments over a sequential program. [ABSTRACT FROM AUTHOR]
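For reference, the loop nest below is the kind of 7-point three-dimensional stencil such a skeleton abstracts; it is a sequential sketch for clarity, whereas the skeleton distributes exactly this computation across nodes and GPUs.

```cpp
#include <vector>

// A 7-point 3D stencil sweep: each interior cell is replaced by the
// average of itself and its six face neighbors. Shown sequentially;
// the skeleton's job is to parallelize this loop nest.
void stencil7(const std::vector<double>& in, std::vector<double>& out,
              int nx, int ny, int nz) {
  auto idx = [=](int x, int y, int z) { return (z * ny + y) * nx + x; };
  for (int z = 1; z < nz - 1; ++z)
    for (int y = 1; y < ny - 1; ++y)
      for (int x = 1; x < nx - 1; ++x)
        out[idx(x, y, z)] = (in[idx(x - 1, y, z)] + in[idx(x + 1, y, z)] +
                             in[idx(x, y - 1, z)] + in[idx(x, y + 1, z)] +
                             in[idx(x, y, z - 1)] + in[idx(x, y, z + 1)] +
                             in[idx(x, y, z)]) / 7.0;
}
```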
- Published
- 2024
- Full Text
- View/download PDF
45. Sequence alignment software migration and performance evaluation based on DPCT.
- Author
-
LI Pei-zhen, ZHANG Yang, and CHEN Wen-bo
- Abstract
This paper explores the process of migrating CUDA programs to DPC++ using the GASAL2 sequence alignment software. The DPCT tool is utilized during the migration process to automatically convert CUDA APIs to DPC++ APIs. However, the migrated code still requires adaptation and modification to compile and run correctly. This paper evaluates the effectiveness of the DPCT tool in migrating CUDA programs to DPC++ and demonstrates the high-efficiency performance of DPC++ across different architectures. Experiments show that the migrated program maintains the accuracy of the original program and can run on heterogeneous devices with the Intel GPU architecture without code modification. At the same time, the migrated DPC++-based GASAL2 reaches approximately 90-95% of the heterogeneous computing performance of the original CUDA-based GASAL2, fully demonstrating the feasibility of DPC++ heterogeneous programming. The results provide a promising solution for cross-platform heterogeneous programming that fully utilizes a wider range of hardware. [ABSTRACT FROM AUTHOR]
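For readers unfamiliar with the target model, the snippet below is a hand-written DPC++/SYCL analogue of a trivial CUDA kernel, the general flavor of code DPCT emits; the real migrated GASAL2 kernels are far more involved.

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;  // device picked by the default selector (CPU or GPU)
  const int n = 1024;
  int* data = sycl::malloc_shared<int>(n, q);  // USM, visible to host and device
  // Equivalent of a CUDA kernel indexed by blockIdx.x*blockDim.x+threadIdx.x:
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    data[i] = static_cast<int>(i[0]) * 2;
  }).wait();
  sycl::free(data, q);
  return 0;
}
```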
- Published
- 2024
- Full Text
- View/download PDF
46. An integrated DEM code for tracing the entire regolith mass movement on asteroids.
- Author
-
Song, Zhijun, Yu, Yang, Soldini, Stefania, Cheng, Bin, and Michel, Patrick
- Subjects
- *
ASTEROIDS , *REGOLITH , *DISCRETE element method , *HETEROGENEOUS computing , *SURFACE topography , *PARALLEL algorithms , *GRAVITATIONAL fields - Abstract
This paper presents a general strategy for tracking the scale-spanning movement of asteroid regolith materials. It tracks mass movement on an asteroid at a realistic scale, under conditions of high-resolution asteroid surface topography (submeter level) and actual regolith particle sizes. To overcome the exponential memory expansion caused by the enlarged computational domain, we improved the conventional cell-linked list method so that it can be applied to arbitrarily large computational domains around asteroids. An efficient contact detection algorithm for particles and polyhedral shape models of asteroids is presented, which avoids traversing all surface triangles and thus allows us to model high-resolution surface topography. A parallel algorithm based on Compute Unified Device Architecture for the gravitational field of the asteroid is presented. Leveraging heterogeneous computing features, further architectural optimization overlaps computations of the long-range and short-range interactions, resulting in a near doubling of computational efficiency compared to the code lacking architectural optimizations. Using the above strategy, a specific high-fidelity discrete element method code that integrates key mechanical models, including the irregular gravitational field, the interparticle and particle-surface interactions, and the coupled dynamics between the particles and the asteroid, is developed to track the asteroid regolith mass movement. As a test, we simulated the landslide of a sand pile on the asteroid's surface during spin-up. The simulation results demonstrate that the code can track the mass movement of regolith particles on the surface of the asteroid, from local landslides to mass leakage, with good accuracy. [ABSTRACT FROM AUTHOR]
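The conventional cell-linked list method the authors extend works roughly as below: particles are binned into cells no smaller than the interaction cutoff so that neighbor searches visit only adjacent cells. A minimal sketch of the base scheme, assuming coordinates already shifted into a non-negative domain; all names are illustrative.

```cpp
#include <vector>

// Classic head/next cell-linked list construction: head[c] holds the
// first particle in cell c; next[p] chains to the next particle in the
// same cell. Neighbor searches then visit only the 27 surrounding cells.
void build_cell_list(const std::vector<double>& x, const std::vector<double>& y,
                     const std::vector<double>& z, double cell,
                     int ncx, int ncy, int ncz,
                     std::vector<int>& head, std::vector<int>& next) {
  head.assign(ncx * ncy * ncz, -1);
  next.assign(x.size(), -1);
  for (int p = 0; p < static_cast<int>(x.size()); ++p) {
    int cx = static_cast<int>(x[p] / cell);
    int cy = static_cast<int>(y[p] / cell);
    int cz = static_cast<int>(z[p] / cell);
    int c = (cz * ncy + cy) * ncx + cx;
    next[p] = head[c];  // push particle onto its cell's singly linked list
    head[c] = p;
  }
}
```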
- Published
- 2024
- Full Text
- View/download PDF
47. NIST CSF-2.0 Compliant GPU Shader Execution.
- Author
-
Lungu, Nelson, Al Rababah, Ahmad Abdulqadir, Dash, Bibhuti Bhusan, Syed, Asif Hassan, Barik, Lalbihari, Rout, Suchismita, Tembo, Simon, Lubobya, Charles, and Patra, Sudhansu Shekhar
- Subjects
HETEROGENEOUS computing ,BEST practices ,INTERNET security ,PROTOTYPES - Abstract
This article introduces a mechanism for ensuring trusted GPU shader execution that adheres to the NIST Cybersecurity Framework (CSF) 2.0 standard. The CSF is a set of best practices for reducing cybersecurity risks. We focus on the CSF's identification, protection, detection, and response mechanisms for GPU-specific security. To this end, we exploit recent advancements in side-channel analysis and hardware-assisted security for the real-time and introspective monitoring of shader execution. We prototype our solution and measure its performance across different GPU platforms. The evaluation results demonstrate the effectiveness of the proposed mechanism in detecting anomalous shader behaviors while incurring only modest runtime overhead. Integrating the CSF 2.0 principles into the proposed GPU shader pipeline yields an organizational recipe for securing heterogeneous computing resources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
48. CAL: Core-Aware Lock for the big.LITTLE Multicore Architecture.
- Author
-
Nie, Shiqiang, Liu, Yingming, Niu, Jie, and Wu, Weiguo
- Subjects
CENTRAL processing units ,HETEROGENEOUS computing ,CELL phones ,REDUCED instruction set computers ,ALGORITHMS - Abstract
The concept of "all cores are created equal" has been popular for several decades due to its simplicity and effectiveness in CPU (Central Processing Unit) design: the more cores the CPU has, the higher the performance of the host and the higher its power consumption. However, power saving is also a key goal for servers in data centers and for embedded devices (e.g., mobile phones). The big.LITTLE multicore architecture, which contains high-performance cores (namely, big cores) and power-efficient cores (namely, little cores), has been developed by ARM (Advanced RISC Machine) and Intel to trade off performance and power efficiency. Facing this new heterogeneous computing architecture, traditional lock algorithms, which are designed for homogeneous computing architectures, cannot work optimally and run into performance issues caused by the difference between big and little cores. In our preliminary experiment, we observed that, on the big.LITTLE multicore architecture, all these lock algorithms exhibit sub-optimal performance. The FIFO-based (First In First Out) locks experience throughput degradation, while competition-based locks fall into two categories: big-core-friendly locks, whose tail latency increases significantly, and little-core-friendly locks, for which not only does tail latency increase but throughput is also degraded. Motivated by this observation, we propose a Core-Aware Lock for the big.LITTLE multicore architecture, named CAL, which gives each core an equal opportunity to access the critical section of a program. The core idea of CAL is to take the slowdown ratio as the metric by which lock requests from big and little cores are reordered. Evaluations on benchmarks and a real-world application named LevelDB confirm that CAL achieves its fairness goals on a heterogeneous computing architecture without sacrificing the performance of the big cores. Compared to several traditional lock algorithms, CAL's fairness increases by up to 67%, and its throughput is 26% higher than FIFO-based locks and 53% higher than competition-based locks, respectively. In addition, the tail latency of CAL always stays at a low level. [ABSTRACT FROM AUTHOR]
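A minimal sketch of the reordering idea at CAL's core: hand the lock to the waiter with the largest slowdown ratio instead of the earliest arrival. How the ratio is measured, and all names below, are assumptions rather than the paper's implementation.

```cpp
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>

// Lock handoff ordered by slowdown ratio rather than FIFO arrival, so
// little cores are not starved by faster big cores. A sketch only; a
// production lock would avoid the global mutex and spurious wakeups.
class CoreAwareLock {
  std::mutex m_;
  std::condition_variable cv_;
  bool held_ = false;
  // Waiters sorted by descending slowdown ratio; value is a core id.
  std::multimap<double, int, std::greater<double>> waiters_;

 public:
  void lock(int core_id, double slowdown_ratio) {
    std::unique_lock<std::mutex> lk(m_);
    auto it = waiters_.emplace(slowdown_ratio, core_id);
    // Proceed only when free AND we are the highest-priority waiter.
    cv_.wait(lk, [&] { return !held_ && waiters_.begin() == it; });
    waiters_.erase(it);
    held_ = true;
  }

  void unlock() {
    { std::lock_guard<std::mutex> lk(m_); held_ = false; }
    cv_.notify_all();  // the top-ranked waiter will win the re-check
  }
};
```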
- Published
- 2024
- Full Text
- View/download PDF
49. Quantum optimization algorithms: Energetic implications.
- Author
-
Hong Enriquez, Rolando P., Badia, Rosa M., Chapman, Barbara, Bresniker, Kirk, Pakin, Scott, Mishra, Alok, Bruel, Pedro, Dhakal, Aditya, Rattihalli, Gourav, Hogade, Ninad, Frachtenberg, Eitan, and Milojicic, Dejan
- Subjects
OPTIMIZATION algorithms ,QUANTUM computers ,QUANTUM computing ,STOCHASTIC approximation ,QUANTUM annealing ,QUBITS - Abstract
Summary: Since the dawn of quantum computing (QC), theoretical developments like Shor's algorithm proved the conceptual superiority of QC over traditional computing. However, such quantum supremacy claims are difficult to achieve in practice because of the technical challenges of realizing noiseless qubits. In the near future, QC applications will need to rely on noisy quantum devices that offload part of their work to classical devices. One way to achieve this is by using parameterized quantum circuits in optimization or even in machine learning tasks. The energy requirements of quantum algorithms have not yet been studied extensively. In this article, we explore several optimization algorithms using both theoretical insights and numerical experiments to understand their impact on energy consumption. Specifically, we highlight why and how algorithms like quantum natural gradient descent, simultaneous perturbation stochastic approximations, or circuit learning methods are at least 2× to 4× more energy efficient than their classical counterparts; why feedback-based quantum optimization is energy-inefficient; and how techniques like Rosalin can improve the energy efficiency of other algorithms by a factor of ≥20×. Finally, we use the NchooseK high-level programming model to run optimization problems on both gate-based quantum computers and quantum annealers. Empirical data indicate that these optimization problems run faster, have better success rates, and consume less energy on quantum annealers than on their gate-based counterparts. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. Optimization of uncertain dependent task mapping on heterogeneous computing platforms.
- Author
-
Zhang, Jing and Han, Zhanwei
- Subjects
- *
COMPUTING platforms , *DIRECTED acyclic graphs , *HETEROGENEOUS computing , *HEURISTIC algorithms , *PRODUCTION scheduling , *MONTE Carlo method , *DETERMINISTIC algorithms - Abstract
Dependent tasks are typically modeled using directed acyclic graphs (DAGs), and scheduling algorithms based on DAGs have been extensively researched. Most of the existing algorithms assume that task or communication durations are deterministic. Nevertheless, any delay in task execution or communication can significantly affect the scheduling results. Aiming at minimizing the DAGs' makespan, a heuristic algorithm called heterogeneous optimistic complete time (HOCT) is proposed. The algorithm assumes that the task characteristic values are modeled randomly. It calculates task priorities based on the acceleration ratio and allocates computing resources using an optimistic execution timetable. Then, a Monte-Carlo simulation-based scheduling algorithm built on top of HOCT is proposed. Experimental results show that the proposed algorithm achieves a better makespan for stochastic DAGs. It also provides a scheduling solution that is more robust to unpredictability than the critical-path-on-a-processor, heterogeneous-earliest-finish-time-no-cross, and parental-prioritization-earliest-finish-time algorithms. [ABSTRACT FROM AUTHOR]
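The Monte-Carlo layer can be pictured as repeatedly sampling task durations and replaying a candidate schedule, as in the sketch below; normally distributed durations and the data layout are assumptions, not the paper's model.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Monte-Carlo estimate of a schedule's expected makespan under
// stochastic task durations. Tasks are given in topological order;
// preds[i] lists task i's predecessors and proc[i] its processor.
double expected_makespan(const std::vector<std::vector<int>>& preds,
                         const std::vector<int>& proc,
                         const std::vector<double>& mean,
                         const std::vector<double>& stddev,
                         int nproc, int trials = 1000) {
  std::mt19937 rng(42);
  double total = 0.0;
  for (int t = 0; t < trials; ++t) {
    std::vector<double> finish(preds.size(), 0.0), free_at(nproc, 0.0);
    for (std::size_t i = 0; i < preds.size(); ++i) {
      double ready = 0.0;  // all predecessors must have finished
      for (int p : preds[i]) ready = std::max(ready, finish[p]);
      std::normal_distribution<double> dur(mean[i], stddev[i]);
      double start = std::max(ready, free_at[proc[i]]);
      finish[i] = start + std::max(0.0, dur(rng));  // sampled duration
      free_at[proc[i]] = finish[i];
    }
    total += *std::max_element(finish.begin(), finish.end());
  }
  return total / trials;
}
```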
- Published
- 2024
- Full Text
- View/download PDF