6,498 results for "General-purpose computing on graphics processing units"
Search Results
2. Evaluation of NVIDIA Xavier NX Platform for Real-Time Image Processing for Plasma Diagnostics †.
- Author
-
Jabłoński, Bartłomiej, Makowski, Dariusz, Perek, Piotr, Nowak vel Nowakowski, Patryk, Sitjes, Aleix Puig, Jakubowski, Marcin, Gao, Yu, and Winter, Axel
- Subjects
- *
PLASMA diagnostics , *IMAGE processing , *CENTRAL processing units , *PLASMA materials processing , *NUCLEAR fusion , *SYSTEMS on a chip , *GRAPHICS processing units - Abstract
Machine protection is a core task of real-time image diagnostics aiming for steady-state operation in nuclear fusion devices. The paper evaluates the applicability of the newest low-power NVIDIA Jetson Xavier NX platform for image plasma diagnostics. This embedded NVIDIA Tegra System-on-a-Chip (SoC) integrates a Graphics Processing Unit (GPU) and Central Processing Unit (CPU) on a single chip. The hardware differences and features compared to the previous NVIDIA Jetson TX2 are signified. Implemented algorithms detect thermal events in real-time, utilising the high parallelism provided by the embedded General-Purpose computing on Graphics Processing Units (GPGPU). The performance and accuracy are evaluated on the experimental data from the Wendelstein 7-X (W7-X) stellarator. Strike-line and reflection events are primarily investigated, yet benchmarks for overload hotspots, surface layers and visualisation algorithms are also included. Their detection might allow for automating real-time risk evaluation incorporated in the divertor protection system in W7-X. For the first time, the paper demonstrates the feasibility of complex real-time image processing in nuclear fusion applications on low-power embedded devices. Moreover, GPU-accelerated reference processing pipelines yielding higher accuracy compared to the literature results are proposed, and remarkable performance improvement resulting from the upgrade to the Xavier NX platform is attained. [ABSTRACT FROM AUTHOR]
- Published
- 2022
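The detection pipelines themselves are not part of this record. As a rough, generic illustration of threshold-based hot-spot detection (thresholding an infrared frame and grouping connected pixels, with made-up temperature values, not the authors' W7-X algorithms), a NumPy/SciPy sketch could look like this:
```python
# Hypothetical sketch: flag "hotspot" regions in an IR frame by thresholding
# and grouping connected pixels. Not the W7-X pipeline; the temperatures and
# threshold below are made-up illustrative values.
import numpy as np
from scipy import ndimage

def detect_hotspots(frame, threshold_k=600.0, min_area=20):
    """Return bounding boxes of connected regions hotter than threshold_k."""
    mask = frame > threshold_k                  # binary overload mask
    labels, n = ndimage.label(mask)             # connected-component labelling
    boxes = []
    for region in ndimage.find_objects(labels):
        if region is None:
            continue
        ys, xs = region
        if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:  # bounding-box area filter
            boxes.append((ys.start, xs.start, ys.stop, xs.stop))
    return boxes

# toy frame: 300 K background with one synthetic hot patch
frame = np.full((480, 640), 300.0)
frame[100:120, 200:240] = 750.0
print(detect_hotspots(frame))
```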
3. Iterative Parallel Sampling RRT for Racing Car Simulation
- Author
-
Gomes, Samuel, Dias, João, Martinho, Carlos, Oliveira, Eugénio, editor, Gama, João, editor, Vale, Zita, editor, and Lopes Cardoso, Henrique, editor
- Published
- 2017
4. Simulation of variational Gaussian process NARX models with GPGPU.
- Author
-
Krivec, Tadej, Papa, Gregor, and Kocijan, Juš
- Subjects
GAUSSIAN processes ,MONTE Carlo method ,AUTOREGRESSIVE models ,BIG data ,DYNAMICAL systems ,GRAPHICS processing units ,GENETIC programming - Abstract
Gaussian processes (GP) regression is a powerful probabilistic tool for modeling nonlinear dynamical systems. The downside of the method is its cubic computational complexity with respect to the training data that can be partially reduced using pseudo-inputs. The dynamics can be represented with an autoregressive model, which simplifies the training to that of the static case. When simulating an autoregressive model, the uncertainty is propagated through a nonlinear function and the simulation cannot be evaluated in closed-form. This paper combines the variational methods of GP approximations with a nonlinear autoregressive model with exogenous inputs (NARX) to form variational GP (VGP-NARX) models. We show how VGP-NARX models, on average, better approximate a full GP-NARX model than more commonly used GP-NARX (FITC) model on 10 chaotic time-series. The modeling capabilities of VGP-NARX models are compared with the existing approaches on two benchmarks for modeling nonlinear dynamical systems. The advantage of general-purpose computing on graphics processing units (GPGPU) for Monte Carlo simulation on large validation data sets is addressed. • Variational approximation of GP-NARX models for dynamic system identification. • Comparison of GP-NARX approximations for modeling chaotic time-series. • Comparison of VGP-NARX models with existing approaches on dynamic benchmark data. • General-purpose computing on graphics processing units for GP-NARX simulation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
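The GPGPU advantage mentioned in the abstract comes from the fact that Monte Carlo simulation of an autoregressive model propagates many independent sample trajectories, which parallelize trivially. A minimal NumPy sketch of this propagation, with a hand-written stand-in for the trained (V)GP-NARX one-step predictor, might look like:
```python
# Minimal sketch of Monte Carlo simulation of a NARX model: each sample
# trajectory feeds its own noisy one-step prediction back as lagged input.
# `one_step` is a placeholder, not a trained variational GP predictor.
import numpy as np

def one_step(lagged_y, u):
    """Stand-in predictor: returns a predictive mean and variance per sample."""
    mean = 0.7 * lagged_y[:, 0] - 0.2 * lagged_y[:, 1] + 0.5 * u
    var = np.full_like(mean, 0.01)
    return mean, var

def simulate(u_seq, n_samples=10_000, lags=2, seed=0):
    rng = np.random.default_rng(seed)
    y = np.zeros((n_samples, lags))               # lagged outputs per sample
    out = []
    for u in u_seq:                                # exogenous input sequence
        mean, var = one_step(y, u)
        y_next = rng.normal(mean, np.sqrt(var))    # propagate uncertainty by sampling
        out.append(y_next)
        y = np.column_stack([y_next, y[:, :-1]])   # shift the lag window
    return np.stack(out)                           # shape (T, n_samples)

traj = simulate(np.sin(np.linspace(0, 6, 50)))
print(traj.mean(axis=1)[:5], traj.std(axis=1)[:5])
```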
5. Evaluation of NVIDIA Xavier NX Platform for Real-Time Image Processing for Plasma Diagnostics
- Author
-
Bartłomiej Jabłoński, Dariusz Makowski, Piotr Perek, Patryk Nowak vel Nowakowski, Aleix Puig Sitjes, Marcin Jakubowski, Yu Gao, Axel Winter, and the W7-X Team
- Subjects
graphics processing unit ,general-purpose computing on graphics processing units ,image processing ,plasma diagnostics ,embedded system ,Technology - Abstract
Machine protection is a core task of real-time image diagnostics aiming for steady-state operation in nuclear fusion devices. The paper evaluates the applicability of the newest low-power NVIDIA Jetson Xavier NX platform for image plasma diagnostics. This embedded NVIDIA Tegra System-on-a-Chip (SoC) integrates a Graphics Processing Unit (GPU) and Central Processing Unit (CPU) on a single chip. The hardware differences and features compared to the previous NVIDIA Jetson TX2 are signified. Implemented algorithms detect thermal events in real-time, utilising the high parallelism provided by the embedded General-Purpose computing on Graphics Processing Units (GPGPU). The performance and accuracy are evaluated on the experimental data from the Wendelstein 7-X (W7-X) stellarator. Strike-line and reflection events are primarily investigated, yet benchmarks for overload hotspots, surface layers and visualisation algorithms are also included. Their detection might allow for automating real-time risk evaluation incorporated in the divertor protection system in W7-X. For the first time, the paper demonstrates the feasibility of complex real-time image processing in nuclear fusion applications on low-power embedded devices. Moreover, GPU-accelerated reference processing pipelines yielding higher accuracy compared to the literature results are proposed, and remarkable performance improvement resulting from the upgrade to the Xavier NX platform is attained.
- Published
- 2022
6. Acceleration of cardiac tissue simulation with graphic processing units
- Author
-
Sato, Daisuke, Xie, Yuanfang, Weiss, James N., Qu, Zhilin, Garfinkel, Alan, and Sanderson, Allen R.
- Subjects
Engineering ,Computer Applications ,Imaging / Radiology ,Human Physiology ,Biomedical Engineering ,General-purpose computing on graphics processing units ,Whole heart simulation ,Excitable media - Abstract
In this technical note we show the promise of using graphic processing units (GPUs) to accelerate simulations of electrical wave propagation in cardiac tissue, one of the more demanding computational problems in cardiology. We have found that the computational speed of two-dimensional (2D) tissue simulations with a single commercially available GPU is about 30 times faster than with a single 2.0 GHz Advanced Micro Devices (AMD) Opteron processor. We have also simulated wave conduction in the three-dimensional (3D) anatomic heart with GPUs where we found the computational speed with a single GPU is 1.6 times slower than with a 32-central processing unit (CPU) Opteron cluster. However, a cluster with two or four GPUs is faster than the CPU-based cluster. These results demonstrate that a commodity personal computer is able to perform a whole heart simulation of electrical wave conduction within times that enable the investigators to interact more easily with their simulations.
- Published
- 2009
7. Hybrid Group Anomaly Detection for Sequence Data: Application to Trajectory Data Analytics
- Author
-
Youcef Djenouri, Alberto Cano, Asma Belhadi, Jerry Chun-Wei Lin, and Gautam Srivastava
- Subjects
Speedup ,Computer science ,Mechanical Engineering ,Anomaly detection ,GPU computing ,computer.software_genre ,Computer Science Applications ,k-nearest neighbors algorithm ,Automotive Engineering ,Outlier ,Data analysis ,Sequence databases ,Pruning (decision trees) ,Data mining ,General-purpose computing on graphics processing units ,Cluster analysis ,computer - Abstract
Many research areas depend on group anomaly detection, which helps maintain the security and privacy of the data involved. To address a deficiency of the existing literature on outlier detection, this paper proposes a novel hybrid framework for identifying group anomalies in sequence data. It proposes two approaches for efficiently solving this problem: i) a hybrid data-mining-based algorithm consisting of three main phases: first, a clustering algorithm is applied to derive the micro-clusters; second, the kNN algorithm is applied to each micro-cluster to calculate the candidate group outliers; third, a pattern-mining framework is applied to the candidate group outliers as a pruning strategy to generate the groups of outliers; and ii) a GPU-based approach, which benefits from massively parallel GPU computing to boost the runtime of the hybrid data-mining-based algorithm. Extensive experiments on different sequence databases were conducted to show the advantages of our proposed model. The results clearly show the efficiency of the GPU-based approach, which reaches a speedup of 451 when directly compared to a sequential approach. In addition, both approaches outperform the baseline methods for group detection.
- Published
- 2022
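As a toy illustration of the first two phases named in the abstract (micro-clustering followed by per-cluster kNN scoring of outlier candidates), the following scikit-learn sketch works on generic feature vectors; the pattern-mining pruning phase and the GPU version are not reproduced:
```python
# Toy sketch of micro-clustering plus per-cluster kNN distances to shortlist
# outlier candidates. Generic feature vectors stand in for the paper's
# trajectory representation; thresholds are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def outlier_candidates(X, n_clusters=8, k=5, quantile=0.95):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    candidates = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) <= k:                       # tiny micro-cluster: keep all as candidates
            candidates.extend(idx.tolist())
            continue
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[idx])
        dist, _ = nn.kneighbors(X[idx])         # column 0 is the point itself
        score = dist[:, 1:].mean(axis=1)        # mean distance to the k nearest neighbours
        cut = np.quantile(score, quantile)
        candidates.extend(idx[score > cut].tolist())
    return candidates

X = np.random.default_rng(1).normal(size=(1000, 4))
print(len(outlier_candidates(X)))
```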
8. CU.POKer: Placing DNNs on WSE With Optimal Kernel Sizing and Efficient Protocol Optimization
- Author
-
Xiaopeng Zhang, Jingsong Chen, Fangzhou Wang, Lixin Liu, Evangeline F. Y. Young, Bentian Jiang, and Jinwei Liu
- Subjects
Scheme (programming language) ,Artificial neural network ,Computer science ,business.industry ,Deep learning ,Computer Graphics and Computer-Aided Design ,Floorplan ,Computer architecture ,Kernel (statistics) ,Artificial intelligence ,Electrical and Electronic Engineering ,General-purpose computing on graphics processing units ,Heuristics ,business ,computer ,Protocol (object-oriented programming) ,Software ,computer.programming_language - Abstract
The tremendous growth in deep learning (DL) applications has created an exponential demand for computing power, which leads to the rise of AI-specific hardware. Targeted towards accelerating computation-intensive deep learning applications, AI hardware, including but not limited to GPGPU, TPU, ASICs, etc., has been adopted ubiquitously. As a result, domain-specific CAD tools play more and more important roles and have been deeply involved in both the design and compilation stages of modern AI hardware. Recently, the ISPD 2020 contest introduced a special challenge targeting the physical mapping of neural network workloads onto the largest commercial deep learning accelerator, the CS-1 Wafer-Scale Engine (WSE). In this paper, we propose CU.POKer, a high-performance engine fully customized for WSE's DNN workload placement challenge. A provably optimal placeable kernel candidate searching scheme and a data-flow-aware placement tool are developed accordingly to ensure state-of-the-art quality on real industrial benchmarks. Experimental results on the ISPD 2020 contest evaluation suites demonstrate the superiority of our proposed framework over not only the state-of-the-art (SOTA) placer but also the conventional heuristics used in general floorplanning.
- Published
- 2022
9. AnyDSL: a partial evaluation framework for programming high-performance libraries
- Author
-
Roland Leißa, André Müller, Bertil Schmidt, Richard Membarth, Klaas Boesche, Arsène Pérard-Gayot, Sebastian Hack, and Philipp Slusallek
- Subjects
Intermediate language ,Computer science ,020207 software engineering ,Image processing ,02 engineering and technology ,Parallel computing ,Partial evaluation ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Code generation ,Ray tracing (graphics) ,General-purpose computing on graphics processing units ,Safety, Risk, Reliability and Quality ,Implementation ,Software ,Compile time - Abstract
This paper advocates programming high-performance code using partial evaluation. We present a clean-slate programming system with a simple, annotation-based, online partial evaluator that operates on a CPS-style intermediate representation. Our system exposes code generation for accelerators (vectorization/parallelization for CPUs and GPUs) via compiler-known higher-order functions that can be subjected to partial evaluation. This way, generic implementations can be instantiated with target-specific code at compile time. In our experimental evaluation we present three extensive case studies from image processing, ray tracing, and genome sequence alignment. We demonstrate that using partial evaluation, we obtain high-performance implementations for CPUs and GPUs from one language and one code base in a generic way. The performance of our codes is mostly within 10% of, and often closer to, the performance of multi man-year, industry-grade, manually-optimized expert codes that are considered to be among the top contenders in their fields.
- Published
- 2023
10. G-PICS: A Framework for GPU-Based Spatial Indexing and Query Processing
- Author
-
Zhila Nouri Lewis and Yi-Cheng Tu
- Subjects
Spatial query ,Computational Theory and Mathematics ,Parallel processing (DSP implementation) ,Computer science ,Spatial database ,Dynamic data ,Parallel algorithm ,Parallel computing ,General-purpose computing on graphics processing units ,Spatial analysis ,Massively parallel ,Computer Science Applications ,Information Systems - Abstract
Support for efficient spatial data storage and retrieval has become a vital component in almost all spatial database systems. While GPUs have become a mainstream platform for high-throughput data processing in recent years, exploiting the massively parallel processing power of GPUs is non-trivial. Current approaches that parallelize one query at a time have low work efficiency and cannot make good use of GPU resources. On the other hand, many spatial database systems could receive a large number of queries simultaneously. In this paper, we present a comprehensive framework named G-PICS for parallel processing of concurrent spatial queries on GPUs. G-PICS encapsulates efficient parallel algorithms for constructing a variety of spatial trees with different space partitioning methods. G-PICS also provides highly optimized programs for processing major spatial query types, and such programs can be accessed via an API that could be further extended to implement user-defined algorithms. While support for dynamic data inputs is missing in existing work, G-PICS implements efficient parallel algorithms for bulk updates of data. Furthermore, G-PICS is designed to work in a Multi-GPU environment to support datasets beyond the size of a single GPU's global memory. Empirical evaluation of G-PICS shows significant performance improvement over the state-of-the-art GPU and parallel CPU-based spatial query processing systems.
- Published
- 2022
11. Comparative analysis of software optimization methods in context of branch predication on GPUs
- Author
-
I. Yu. Sesin and R. G. Bolbakov
- Subjects
Information theory ,Computer science ,Adaptive optimization ,Speculative execution ,predication ,Context (language use) ,Software performance testing ,Branch predictor ,computer.software_genre ,Branch predication ,Computer engineering ,general-purpose computing for graphical processing units ,optimizing compilers ,General Earth and Planetary Sciences ,Compiler ,Q350-390 ,General-purpose computing on graphics processing units ,computer ,General Environmental Science - Abstract
General Purpose computing for Graphical Processing Units (GPGPU) technology is a powerful tool for offloading parallel data processing tasks to Graphical Processing Units (GPUs). This technology finds use in a variety of domains, from science and commerce to hobbyist projects. GPU-run general-purpose programs will inevitably run into performance issues stemming from branch predication. Predication is a GPU feature that makes both sides of a conditional branch execute, masking the results of the incorrect branch. This leads to considerable performance losses for GPU programs that have large amounts of code hidden away behind conditional operators. This paper focuses on the analysis of existing approaches to improving software performance in the context of relieving the aforementioned performance loss. A description of these approaches is provided, along with their upsides, downsides, the extent of their applicability, and whether they address the outlined problem. Covered approaches include: optimizing compilers, JIT-compilation, branch predictors, speculative execution, adaptive optimization, run-time algorithm specialization, and profile-guided optimization. It is shown that the aforementioned methods mostly cater to CPU-specific issues and are generally not applicable as far as branch-predication performance loss is concerned. Lastly, we outline the need for a separate performance-improving approach that addresses the specifics of branch predication and the GPGPU workflow.
- Published
- 2021
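The branch-predication cost discussed above is easy to reproduce. The following Numba CUDA kernel (purely illustrative, not taken from the article, and requiring a CUDA-capable GPU) contains a data-dependent branch; threads of the same warp that disagree on the condition force both paths to be executed with masking:
```python
# Illustrative Numba CUDA kernel with a data-dependent branch. Neighbouring
# threads of one warp that disagree on the condition cause both branch bodies
# to be executed with predication/masking, which is the performance issue the
# article analyses. Requires the numba package and a CUDA-capable GPU.
import numpy as np
from numba import cuda

@cuda.jit
def divergent_kernel(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        if x[i] > 0.0:            # threads in a warp may disagree here...
            out[i] = x[i] * 2.0   # ...so the warp executes this path...
        else:
            out[i] = -x[i]        # ...and this one, masking inactive lanes

x = np.random.default_rng(0).normal(size=1 << 20).astype(np.float32)
out = np.zeros_like(x)
threads = 256
blocks = (x.shape[0] + threads - 1) // threads
divergent_kernel[blocks, threads](x, out)    # Numba transfers the host arrays automatically
print(out[:4])
```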
12. An Effective GPGPU Visual Secret Sharing by Contrast-Adaptive ConvNet Super-Resolution
- Author
-
M. Raviraja Holla and Alwyn R. Pais
- Subjects
Computer science ,business.industry ,media_common.quotation_subject ,Contrast (vision) ,Computer vision ,Artificial intelligence ,Electrical and Electronic Engineering ,General-purpose computing on graphics processing units ,business ,Superresolution ,Secret sharing ,Computer Science Applications ,media_common - Published
- 2021
13. K-Scheduler: dynamic intra-SM multitasking management with execution profiles on GPUs
- Author
-
Yoonhee Kim and Sejin Kim
- Subjects
Computer Networks and Communications ,business.industry ,Computer science ,Distributed computing ,Multiprocessing ,Cloud computing ,Workload ,Shared resource ,Scheduling (computing) ,Resource (project management) ,Human multitasking ,General-purpose computing on graphics processing units ,business ,Software - Abstract
Data centers and cloud environments have recently started providing graphic processing unit (GPU)-based infrastructure services. Actual general purpose GPU (GPGPU) applications have low GPU utilization, unlike GPU-friendly applications. To improve the resource utilization of GPUs, there is the need for the concurrent execution of different applications while sharing resources in a streaming multiprocessor (SM). However, it is difficult to predict the execution performance of applications because resource contention can be caused by intra-SM multitasking. Furthermore, it is crucial to find the best resource partitioning and an execution set of applications that show the best performance among many applications. To address this, the current paper proposes K-Scheduler, a multitasking placement scheduler based on the intra-SM resource-use characteristics of applications. First, the resource-use and multitasking characteristics of applications are analyzed according to their classification and their individual execution characteristics. Rules for concurrent execution are derived according to each observation, and scheduling is performed according to the corresponding rules. The results verified that the total workload execution performance of K-Scheduler improved by 18% compared to previous studies, and individual execution performance improved by 32%.
- Published
- 2021
14. A User-Centric Data Protection Method for Cloud Storage Based on Invertible DWT
- Author
-
Gerard Memmi, Meikang Qiu, Han Qiu, Hassan N. Noura, and Zhong Ming
- Subjects
Discrete wavelet transform ,Security analysis ,Computer Networks and Communications ,Computer science ,End user ,business.industry ,Distributed computing ,Cloud computing ,Encryption ,Computer Science Applications ,Hardware and Architecture ,Data Protection Act 1998 ,General-purpose computing on graphics processing units ,business ,Cloud storage ,Software ,Information Systems - Abstract
Protection of end users' data stored in Cloud servers has become an important issue in today's Cloud environments. In this paper, we present a novel data protection method combining the Selective Encryption (SE) concept with fragmentation and dispersion on storage. Our method is based on the invertible Discrete Wavelet Transform (DWT) to divide agnostic data into three fragments with three different levels of protection. These three fragments can then be dispersed over different storage areas with different levels of trustworthiness to protect end users' data by resisting possible leaks in Clouds. Thus, our method optimizes the storage cost by saving expensive, private, and secure storage space and utilizing cheap but less trustworthy storage space. An intensive security analysis is performed to verify the high protection level of our method. Additionally, efficiency is demonstrated by an implementation that deploys tasks between the CPU and a General-Purpose Graphics Processing Unit (GPGPU) in an optimized manner.
- Published
- 2021
15. gIM: GPU Accelerated RIS-Based Influence Maximization Algorithm
- Author
-
Soheil Shahrouz, Saber Salehkaleybar, and Matin Hashemi
- Subjects
FOS: Computer and information sciences ,Approximation theory ,Computational complexity theory ,Computer science ,Parallel algorithm ,Graph theory ,Maximization ,Expected value ,Computer Science - Distributed, Parallel, and Cluster Computing ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Distributed, Parallel, and Cluster Computing (cs.DC) ,General-purpose computing on graphics processing units ,Greedy algorithm ,Algorithm - Abstract
Given a social network modeled as a weighted graph $G$, the influence maximization problem seeks $k$ vertices to become initially influenced, to maximize the expected number of influenced nodes under a particular diffusion model. The influence maximization problem has been proven to be NP-hard, and most proposed solutions to the problem are approximate greedy algorithms, which can guarantee a tunable approximation ratio for their results with respect to the optimal solution. The state-of-the-art algorithms are based on the Reverse Influence Sampling (RIS) technique, which can offer both computational efficiency and a non-trivial $(1-\frac{1}{e}-\epsilon)$-approximation ratio guarantee for any $\epsilon > 0$. RIS-based algorithms, despite their lower computational cost compared to other methods, still require long running times to solve the problem in large-scale graphs with low values of $\epsilon$. In this paper, we present a novel and efficient parallel implementation of a RIS-based algorithm, namely IMM, on GPU. The proposed GPU-accelerated influence maximization algorithm, named gIM, can significantly reduce the running time on large-scale graphs with low values of $\epsilon$. Furthermore, we show that the gIM algorithm can solve other variations of the IM problem, only by applying minor modifications. Experimental results show that the proposed solution reduces the runtime by a factor of up to $220 \times$. The source code of gIM is publicly available online. (Accepted for publication in IEEE Transactions on Parallel and Distributed Systems (TPDS).)
- Published
- 2021
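A compact CPU-only sketch of the underlying RIS idea (sample reverse reachable sets under the independent cascade model, then greedily pick the k nodes covering the most sets) is shown below; the IMM sample-size bounds and gIM's GPU parallelization are not reproduced:
```python
# CPU-only sketch of Reverse Influence Sampling: sample reverse reachable (RR)
# sets under the independent-cascade model, then greedily pick the k nodes
# covering the most RR sets. Not gIM's GPU implementation.
import random
from collections import defaultdict

def rr_set(in_edges, n, rng):
    """One RR set: reverse BFS from a random root, keeping each in-edge with prob p."""
    root = rng.randrange(n)
    seen, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u, p in in_edges.get(v, ()):      # edges u -> v with propagation probability p
            if u not in seen and rng.random() < p:
                seen.add(u)
                frontier.append(u)
    return seen

def ris_greedy(in_edges, n, k, n_samples=20_000, seed=0):
    rng = random.Random(seed)
    rr_sets = [rr_set(in_edges, n, rng) for _ in range(n_samples)]
    covers = defaultdict(set)                 # node -> indices of RR sets containing it
    for i, s in enumerate(rr_sets):
        for v in s:
            covers[v].add(i)
    seeds, covered = [], set()
    for _ in range(k):                        # greedy maximum coverage over RR sets
        best = max(range(n), key=lambda v: len(covers[v] - covered))
        seeds.append(best)
        covered |= covers[best]
    return seeds

edges = {2: [(0, 0.3), (1, 0.3)], 3: [(2, 0.5)]}   # toy graph given as in-edge lists
print(ris_greedy(edges, n=4, k=2))
```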
16. An image generator based on neural networks in GPU
- Author
-
Halamo Reis, Thiago W. Silva, Elmar U. K. Melcher, Alisson V. Brito, and Antonio Marcus Nogueira Lima
- Subjects
Distributed Computing Environment ,Artificial neural network ,Computer Networks and Communications ,Computer science ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Graphics processing unit ,computer.software_genre ,Image (mathematics) ,Set (abstract data type) ,CUDA ,Hardware and Architecture ,Media Technology ,Data mining ,General-purpose computing on graphics processing units ,computer ,Software ,Generator (mathematics) - Abstract
Existing image databases offer only limited diversity of images, and in many situations no suitable image base is available at all, which requires additional effort in capturing images and creating datasets. Many of these datasets contain only a single object in each image, but the scenario in which projects must operate in production often requires several objects per image. Thus, it is necessary to expand original datasets into more complex ones with specific combinations of objects to achieve the goal of the application. This work proposes a technique for image generation to extend an initial dataset. It has been designed generically, to work with various kinds of images and to create a dataset from a few initial images. The generated set of images is used in a distributed environment, in which it is possible to perform image generation and produce datasets with specific images for certain applications. The generation of images combines two methods: generation by deformation and generation by a neural network. The main contributions of this work are the specification and implementation of an image-generating component that can be easily integrated with heterogeneous devices capable of parallel computing, such as General-Purpose Graphics Processing Units (GPGPU). In contrast to existing methods, the proposed approach enlarges an initial image bank by combining the two generation methods. Experiments on the generation of handwritten digits are presented to validate the proposed approach. The generator was designed with CUDA and GPU-optimized libraries, such as TensorFlow-specific modules. The results obtained can optimize the integration process through the simulation of possible stimulus choices, avoiding problems in the image-generation test phase.
- Published
- 2021
17. Study and evaluation of automatic GPU offloading method from various language applications
- Author
-
Yoji Yamato
- Subjects
Computer Networks and Communications ,business.industry ,Gate array ,Computer science ,Graphics processing unit ,Central processing unit ,General-purpose computing on graphics processing units ,business ,Field-programmable gate array ,Software ,Computer hardware - Abstract
Heterogeneous hardware other than a small-core central processing unit (CPU) is increasingly being used, such as a graphics processing unit (GPU), field-programmable gate array (FPGA) or many-core ...
- Published
- 2021
18. Letting future programmers experience performance-related tasks
- Author
-
David Bednárek, Martin Kruliš, and Jakub Yaghob
- Subjects
Computer Networks and Communications ,business.industry ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Theoretical Computer Science ,Software ,Artificial Intelligence ,Hardware and Architecture ,Computer cluster ,0202 electrical engineering, electronic engineering, information engineering ,Decomposition (computer science) ,Parallelism (grammar) ,020201 artificial intelligence & image processing ,Product (category theory) ,General-purpose computing on graphics processing units ,Software engineering ,business ,Set (psychology) ,Range (computer programming) - Abstract
Programming courses usually focus on software-engineering problems like software decomposition and code maintenance. While computer-science lessons emphasize algorithm complexity, technological problems are usually neglected although they may significantly affect the performance in terms of wall time. As the technological problems are best explained by hands-on experience, we present a set of homework assignments focused on a range of technologies from instruction-level parallelism to GPU programming to cluster computing. These assignments are a product of a decade of development and testing on live subjects – the students of three performance-related software courses at the Faculty of Mathematics and Physics of the Charles University in Prague.
- Published
- 2021
19. Regional soft error vulnerability and error propagation analysis for GPGPU applications
- Author
-
Ömer Faruk Karadaş and Isil Oz
- Subjects
Propagation of uncertainty ,Computer science ,Reliability (computer networking) ,Vulnerability ,Fault tolerance ,Fault injection ,Theoretical Computer Science ,Reliability engineering ,Soft error ,Hardware and Architecture ,Vulnerability assessment ,General-purpose computing on graphics processing units ,Software ,Information Systems - Abstract
The wide use of GPUs for general-purpose computations as well as graphics programs makes soft errors a critical concern. Evaluating the soft error vulnerability of GPGPU programs and employing efficient fault tolerance techniques for more reliable execution become more important. Protecting only the most error-sensitive program regions maintains an acceptable reliability level by eliminating the large performance overheads due to redundant operations. Therefore, fine-grained regional soft error vulnerability analysis is crucial for the systems targeting both performance and reliability. In this work, we present a regional fault injection framework and perform a detailed error propagation analysis to evaluate the soft error vulnerability of GPGPU applications. We evaluate both intra-kernel and inter-kernel vulnerabilities for a set of programs and quantify the severity of the data corruptions by considering metrics other than SDC rates. Our experimental study demonstrates that the code regions inside GPGPU programs exhibit different characteristics in terms of soft error vulnerability and the soft errors corrupting the variables propagate into the program output in several ways. We present the potential impact of our analysis by discussing the usage scenarios after we compile our observations acquired from our empirical work.
- Published
- 2021
20. Performance evaluation of GPU- and cluster-computing for parallelization of compute-intensive tasks
- Author
-
Peter Mandl, Alexander Döschl, and Max-Emanuel Keller
- Subjects
CUDA ,Computer Networks and Communications ,Distributed algorithm ,Computer science ,Computer cluster ,Scalability ,Spark (mathematics) ,Graphics processing unit ,Brute-force search ,Parallel computing ,General-purpose computing on graphics processing units ,Information Systems - Abstract
Purpose: This paper aims to evaluate different approaches for the parallelization of compute-intensive tasks. The study compares a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and resilient distributed data set (RDD) (Apache Spark) paradigms and a graphics processing unit (GPU) approach with Numba for compute unified device architecture (CUDA). Design/methodology/approach: The paper uses a simple but computationally intensive puzzle as a case study for experiments. To find all solutions using brute force search, 15! permutations had to be computed and tested against the solution rules. The experimental application comprises a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms and a GPU approach with Numba for CUDA. The implementations were benchmarked on Amazon-EC2 instances for performance and scalability measurements. Findings: The comparison of the solutions with Apache Hadoop and Apache Spark under Amazon EMR showed that the processing time measured in CPU minutes with Spark was up to 30% lower, while the performance of Spark especially benefits from an increasing number of tasks. With the CUDA implementation, more than 16 times faster execution is achievable for the same price compared to the Spark solution. Apart from the multi-threaded implementation, the processing times of all solutions scale approximately linearly. Finally, several application suggestions for the different parallelization approaches are derived from the insights of this study. Originality/value: There are numerous studies that have examined the performance of parallelization approaches. Most of these studies deal with processing large amounts of data or mathematical problems. This work, in contrast, compares these technologies on their ability to implement computationally intensive distributed algorithms.
- Published
- 2021
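A common trick for splitting such a 15! brute-force search across threads, MapReduce tasks or GPU threads is to unrank a linear index into a permutation, so that each worker enumerates its own index range independently. The sketch below illustrates this; the puzzle's actual solution rules are not given in the abstract, so a placeholder predicate is used:
```python
# Sketch of the usual trick for splitting an n! brute-force search across
# workers or GPU threads: unrank a linear index into a permutation so every
# worker can reconstruct its own slice without communication.
# `satisfies_rules` is a placeholder, not the paper's real puzzle rules.
from math import factorial

def unrank_permutation(index, n):
    """Map an index in [0, n!) to the index-th permutation of 0..n-1 (factorial base)."""
    items = list(range(n))
    perm = []
    for pos in range(n, 0, -1):
        f = factorial(pos - 1)
        q, index = divmod(index, f)
        perm.append(items.pop(q))
    return perm

def satisfies_rules(perm):
    return perm[0] < perm[-1]                 # placeholder rule, not the real puzzle

def search_chunk(start, stop, n=15):
    """Work item for one thread/task: test permutations with ranks [start, stop)."""
    return [r for r in range(start, stop) if satisfies_rules(unrank_permutation(r, n))]

total = factorial(15)                          # 1,307,674,368,000 candidate permutations
chunk = total // 1024                          # ranks handled by each of 1024 workers
print(chunk, len(search_chunk(0, 10_000)))     # worker 0's range is [0, chunk); test a small slice
```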
21. Efficient Flow Processing in 5G-Envisioned SDN-Based Internet of Vehicles Using GPUs
- Author
-
Varun G. Menon, Ali Najafi, Ghulam Muhammad, Milad Rafiee, Mohammad Reza Khosravi, and Mahdi Abbasi
- Subjects
Instruction set ,CUDA ,Packet switching ,Search algorithm ,Computer science ,Network packet ,Mechanical Engineering ,Automotive Engineering ,Tuple space ,Throughput ,Parallel computing ,General-purpose computing on graphics processing units ,Computer Science Applications - Abstract
In the 5G-envisioned Internet of Vehicles (IoV), a significant volume of data is exchanged through networks between intelligent transport systems (ITS) and clouds or fogs. With the introduction of Software-Defined Networking (SDN), these data-handling challenges are addressed by high-speed, flow-based processing of data in network systems. To classify flows of packets in an SDN network, high-throughput packet classification systems are needed. Although software packet classifiers are cheaper and more flexible than hardware classifiers, they deliver only limited performance. A key idea to resolve this problem is parallelizing packet classification on graphics processing units (GPUs). In this paper, we study parallel forms of the Tuple Space Search and Pruned Tuple Space Search algorithms for flow classification on GPUs using CUDA (Compute Unified Device Architecture). The key idea behind the proposed methodology is to transfer the stream of packets from host memory to the global memory of the CUDA device and then assign each packet to a classifier thread. To evaluate the proposed method, the GPU-based versions of the algorithms were implemented on two different CUDA devices, and two different CPU-based implementations of the algorithms were used as references. Experimental results showed that GPU computing enhances the performance of Pruned Tuple Space Search remarkably more than Tuple Space Search. Moreover, the results evince the computational efficiency of the proposed method for parallelizing packet classification algorithms.
- Published
- 2021
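For reference, the classic Tuple Space Search scheme groups rules by their (source prefix length, destination prefix length) tuple into hash tables and probes every tuple per packet. The CPU-only Python sketch below illustrates this for two-field rules (one rule per hash key for brevity); the paper's CUDA parallelization and the pruned variant are not reproduced:
```python
# CPU-only sketch of Tuple Space Search over two-field (src/dst prefix) rules:
# rules are grouped by their (src_len, dst_len) tuple into hash tables, and a
# packet lookup probes every tuple. The GPU version assigns one packet per thread.
from collections import defaultdict

class TupleSpaceClassifier:
    def __init__(self):
        self.tables = defaultdict(dict)           # (src_len, dst_len) -> {key: (prio, action)}

    def add_rule(self, src, src_len, dst, dst_len, prio, action):
        key = (src >> (32 - src_len) if src_len else 0,
               dst >> (32 - dst_len) if dst_len else 0)
        self.tables[(src_len, dst_len)][key] = (prio, action)

    def classify(self, src, dst):
        best = None
        for (sl, dl), table in self.tables.items():   # probe every tuple
            key = (src >> (32 - sl) if sl else 0,
                   dst >> (32 - dl) if dl else 0)
            hit = table.get(key)
            if hit and (best is None or hit[0] > best[0]):
                best = hit                            # keep the highest-priority match
        return best[1] if best else "default"

clf = TupleSpaceClassifier()
clf.add_rule(0x0A000000, 8, 0x00000000, 0, prio=10, action="to_fog")   # 10.0.0.0/8 -> any
print(clf.classify(0x0A010203, 0xC0A80001))
```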
22. QuanTaichi
- Author
-
Weiwei Xu, Qiang Dai, Frédo Durand, William T. Freeman, Yuanming Hu, Xuanda Yang, Mingkuan Xu, Jiafeng Liu, and Ye Kuang
- Subjects
Domain-specific language ,Computer science ,Bandwidth (signal processing) ,Scalability ,Overhead (computing) ,Compiler ,General-purpose computing on graphics processing units ,computer.software_genre ,Computer Graphics and Computer-Aided Design ,computer ,Data type ,Sparse matrix ,Computational science - Abstract
High-resolution simulations can deliver great visual quality, but they are often limited by available memory, especially on GPUs. We present a compiler for physical simulation that can achieve both high performance and significantly reduced memory costs, by enabling flexible and aggressive quantization. Low-precision ("quantized") numerical data types are used and packed to represent simulation states, leading to reduced memory space and bandwidth consumption. Quantized simulation allows higher resolution simulation with less memory, which is especially attractive on GPUs. Implementing a quantized simulator that has high performance and packs the data tightly for aggressive storage reduction would be extremely labor-intensive and error-prone using a traditional programming language. To make the creation of quantized simulation practical, we have developed a new set of language abstractions and a compilation system. A suite of tailored domain-specific optimizations ensure quantized simulators often run as fast as the full-precision simulators, despite the overhead of encoding-decoding the packed quantized data types. Our programming language and compiler, based on Taichi , allow developers to effortlessly switch between different full-precision and quantized simulators, to explore the full design space of quantization schemes, and ultimately to achieve a good balance between space and precision. The creation of quantized simulation with our system has large benefits in terms of memory consumption and performance, on a variety of hardware, from mobile devices to workstations with high-end GPUs. We can simulate with levels of resolution that were previously only achievable on systems with much more memory, such as multiple GPUs. For example, on a single GPU, we can simulate a Game of Life with 20 billion cells (8× compression per pixel), an Eulerian fluid system with 421 million active voxels (1.6× compression per voxel), and a hybrid Eulerian-Lagrangian elastic object simulation with 235 million particles (1.7× compression per particle). At the same time, quantized simulations create physically plausible results. Our quantization techniques are complementary to existing acceleration approaches of physical simulation: they can be used in combination with these existing approaches, such as sparse data structures, for even higher scalability and performance.
- Published
- 2021
23. Parallelizing High-Frequency Trading using GPGPU
- Author
-
Lalitha Ramchandar, Aditya Anil, A. Balasundaram, and Ashwin Sudha Arun
- Subjects
Speedup ,business.industry ,Computer science ,Distributed computing ,Deep learning ,Graphics processing unit ,Feature (computer vision) ,Market data ,Key (cryptography) ,Artificial intelligence ,General-purpose computing on graphics processing units ,High-frequency trading ,business ,Engineering (miscellaneous) - Abstract
The world of trading and markets has evolved greatly. With the aid of technology, traders and trading establishments use trading platforms to perform various transactions. They are able to utilize several effective algorithms to analyse market data and identify the key points required to carry out a successful trading operation. High-frequency trading (HFT) platforms are capable of such operations and are used by traders, investors and establishments to make their operations easier and faster. To accommodate heavy processing and a high frequency of transactions, we integrate the concept of parallelism and combine it with the processing power of the GPU, using a general-purpose graphics processing unit (GPGPU), to enhance the speedup of the system. Our approach provides high processing power without incurring further costs for hardware upgrades. Deep learning and machine learning methods additionally provide assistance features for the traders using this platform.
- Published
- 2021
24. iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures
- Author
-
Xiaoguang Guo, Chenyang Zhang, Xiao Zhang, Xiaoyong Du, Bingsheng He, and Feng Zhang
- Subjects
020203 distributed computing ,Computer science ,business.industry ,Process (engineering) ,Overhead (engineering) ,02 engineering and technology ,Machine learning ,computer.software_genre ,Task (computing) ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Task analysis ,Artificial intelligence ,General-purpose computing on graphics processing units ,business ,computer - Abstract
Utilizing heterogeneous accelerators, especially GPUs, to accelerate machine learning tasks has shown to be a great success in recent years. GPUs bring huge performance improvements to machine learning and greatly promote its widespread adoption. However, the discrete CPU-GPU architecture design with high PCIe transmission overhead decreases the GPU computing benefits in machine learning training tasks. To overcome such limitations, hardware vendors release CPU-GPU integrated architectures with shared unified memory. In this article, we design a benchmark suite for machine learning training on CPU-GPU integrated architectures, called iMLBench, covering a wide range of machine learning applications and kernels. We mainly explore two features of integrated architectures: 1) zero-copy, which means that the PCIe overhead has been eliminated for machine learning tasks, and 2) co-running, which means that the CPU and the GPU co-run together to process a single machine learning task. Our experimental results on iMLBench show that the integrated architecture brings an average 7.1× performance improvement over the original implementations. Specifically, the zero-copy design brings a 4.65× performance improvement, and co-running brings a 1.78× improvement. Moreover, integrated architectures exhibit promising results from both the performance-per-dollar and energy perspectives, achieving a 6.50× performance-price ratio and 4.06× energy efficiency relative to discrete GPUs. The benchmark is open-sourced at https://github.com/ChenyangZhang-cs/iMLBench.
- Published
- 2021
25. Critical Path Isolation and Bit-Width Scaling Are Highly Compatible for Voltage Over-Scalable Design
- Author
-
TaiYu Cheng, Jun Nagayama, Yutaka Masuda, Yoichi Momiyama, Masanori Hashimoto, and Tohru Ishihara
- Subjects
Reduction (complexity) ,Computer science ,Electronic engineering ,Key (cryptography) ,Isolation (database systems) ,General-purpose computing on graphics processing units ,Design methods ,Critical path method ,Power (physics) ,Voltage - Abstract
This work proposes a design methodology that saves power under voltage over-scaling (VOS) operation. The key idea of the proposed design methodology is to combine critical path isolation (CPI) and bit-width scaling (BWS) under a constraint on computational quality, e.g., Peak Signal-to-Noise Ratio (PSNR). Conventional CPI inherently cannot reduce the delay of intrinsic critical paths (CPs), which may significantly restrict the power saving effect. The proposed methodology, on the other hand, tries to reduce both intrinsic and non-intrinsic CPs. Therefore, our design dramatically reduces the supply voltage and power dissipation while satisfying the quality constraint. Moreover, to reduce the co-design exploration space, the proposed methodology utilizes the exclusiveness of the paths targeted by CPI and BWS, where CPI aims at reducing the minimum supply voltage of non-intrinsic CPs, and BWS focuses on intrinsic CPs in arithmetic units. From this key exclusiveness, the proposed design splits the simultaneous optimization problem into three sub-problems: (1) the determination of the bit-width reduction, (2) the timing optimization for non-intrinsic CPs, and (3) the investigation of the minimum supply voltage of the BWS- and CPI-applied circuit under the quality constraint, for reducing power dissipation. Thanks to the problem splitting, the proposed methodology can efficiently find a quality-constrained minimum-power design. Evaluation results show that CPI and BWS are highly compatible and that they significantly enhance the efficacy of VOS. In a case study of a GPGPU processor, the proposed design reduces power dissipation by 42.7% for an image-processing workload and by 51.2% for a neural-network inference workload. (DATE 2021, 1-5 Feb. 2021, Grenoble, France, virtual.)
- Published
- 2021
26. PERFORMANCE ANALYSIS OF OPENFOAM-BASED CFD SOLVERS USING GPGPU
- Author
-
Seoeum Han, Bok Jik Lee, and Hwanghui Jeong
- Subjects
business.industry ,Computer science ,Parallel computing ,Computational fluid dynamics ,General-purpose computing on graphics processing units ,business - Published
- 2021
27. Study and evaluation of improved automatic GPU offloading method
- Author
-
Yoji Yamato
- Subjects
Computer Networks and Communications ,Computer science ,0102 computer and information sciences ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Evolutionary computation ,010201 computation theory & mathematics ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Graphics ,General-purpose computing on graphics processing units ,Field-programmable gate array ,Software - Abstract
With the slowing down of Moore's law, the use of hardware other than CPUs, such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), is increasing. However, when using het...
- Published
- 2021
28. Flynn’s Reconciliation
- Author
-
Nicolas Weber, Roberto Bifulco, and Daniel Thuerck
- Subjects
010302 applied physics ,020203 distributed computing ,Computer science ,02 engineering and technology ,Parallel computing ,computer.software_genre ,Supercomputer ,01 natural sciences ,Set (abstract data type) ,Programming idiom ,CUDA ,Hardware and Architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Compiler ,SIMD ,General-purpose computing on graphics processing units ,computer ,Software ,Information Systems - Abstract
A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels’ implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community’s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH’s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2× faster than hand-tuned kernels across architectures.
- Published
- 2021
29. FusionCL: a machine-learning based approach for OpenCL kernel fusion to increase system performance
- Author
-
Radu Prodan, Usman Ahmed, Yasir Noman Khalid, Muhammad Azhar Iqbal, Muhammad Arshad Islam, and Muhammad Aleem
- Subjects
Numerical Analysis ,Speedup ,Computer science ,business.industry ,020206 networking & telecommunications ,02 engineering and technology ,Machine learning ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Scheduling (computing) ,Set (abstract data type) ,Reduction (complexity) ,Computational Mathematics ,Computational Theory and Mathematics ,Kernel (statistics) ,Classifier (linguistics) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,Graphics ,General-purpose computing on graphics processing units ,business ,computer ,Software - Abstract
Employing general-purpose graphics processing units (GPGPU) with the help of OpenCL has greatly reduced the execution time of data-parallel applications by taking advantage of the massive available parallelism. However, when an application with a small data size is executed on a GPU, GPU resources are wasted because the application cannot fully utilize the GPU compute cores. Due to the lack of operating system support on GPUs, there is no mechanism to share a GPU between two kernels. In this paper, we propose a GPU sharing mechanism between two kernels that increases GPU occupancy and, as a result, reduces the execution time of a job pool. However, if a pair of kernels competes for the same set of resources (i.e., both applications are compute-intensive or memory-intensive), kernel fusion may also result in a significant increase in the execution time of the fused kernels. Therefore, it is pertinent to select an optimal pair of kernels for fusion that will result in a significant speedup over their serial execution. This research presents FusionCL, a machine learning-based GPU sharing mechanism for pairs of OpenCL kernels. FusionCL identifies each pair of kernels (from the job pool) that is a suitable candidate for fusion using a machine learning-based fusion suitability classifier. Thereafter, from all the candidates, it selects the pair of kernels that will produce the maximum speedup after fusion over their serial execution, using a fusion speedup predictor. The experimental evaluation shows that the proposed kernel fusion mechanism reduces execution time by 2.83× when compared to a baseline scheduling scheme. When compared to the state-of-the-art, the reduction in execution time is up to 8%.
- Published
- 2021
30. A unified schedule policy of distributed machine learning framework for CPU-GPU cluster
- Author
-
Xiaochun Tang, Quan Zhao, and Ziyu Zhu
- Subjects
Schedule ,Computer science ,business.industry ,General Engineering ,TL1-4050 ,GPU cluster ,Machine learning ,computer.software_genre ,cpu-gpu tasks ,Scheduling (computing) ,Task (computing) ,clustering algorithm ,Resource (project management) ,distribution ,Artificial intelligence ,Central processing unit ,General-purpose computing on graphics processing units ,unified scheduler ,Cluster analysis ,business ,computer ,Motor vehicles. Aeronautics. Astronautics - Abstract
With the widespread use of GPU hardware, more and more distributed machine learning applications have begun to use CPU-GPU hybrid cluster resources to improve the efficiency of their algorithms. However, existing distributed machine learning scheduling frameworks either only consider task scheduling on CPU resources or only consider task scheduling on GPU resources. Even when the difference between CPU and GPU resources is considered, it is difficult to improve the resource usage of the entire system. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to efficiently schedule the tasks in a job. In the full paper, we propose a CPU-GPU hybrid cluster schedule framework in detail. First, according to the different characteristics of CPU and GPU computing power, the data is divided into data fragments of different sizes to match the CPU and GPU computing resources. Second, the paper introduces the task scheduling method for the CPU-GPU hybrid setting. Finally, the proposed method is verified at the end of the paper. Our verification with K-Means shows that using the CPU-GPU hybrid computing framework can increase the performance of K-Means by about 1.5 times. As the number of GPUs increases, the performance of K-Means can be significantly improved.
- Published
- 2021
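The first step described in the abstract, sizing data fragments to match CPU and GPU throughput, amounts to a proportional split. A tiny sketch with made-up throughput numbers (not the paper's actual scheduler):
```python
# Tiny sketch of proportional data splitting: size the CPU and GPU fragments
# according to measured per-device throughput so both finish a K-Means
# iteration at roughly the same time. Throughput numbers are made up;
# this is not the framework described in the record above.
def split_points(n_points, cpu_pts_per_s, gpu_pts_per_s_each, n_gpus):
    total_rate = cpu_pts_per_s + n_gpus * gpu_pts_per_s_each
    cpu_share = round(n_points * cpu_pts_per_s / total_rate)
    gpu_share = (n_points - cpu_share) // n_gpus if n_gpus else 0
    remainder = n_points - cpu_share - gpu_share * n_gpus
    return cpu_share + remainder, [gpu_share] * n_gpus   # give leftovers to the CPU

cpu_n, gpu_ns = split_points(10_000_000, cpu_pts_per_s=2e6, gpu_pts_per_s_each=9e6, n_gpus=2)
print(cpu_n, gpu_ns)   # 1000000 points on the CPU, 4500000 per GPU
```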
31. XB-SIM∗: A simulation framework for modeling and exploration of ReRAM-based CNN acceleration design
- Author
-
Youhui Zhang, Xiang Fei, and Weimin Zheng
- Subjects
Multidisciplinary ,Artificial neural network ,Computer science ,Concurrency ,Batch processing ,General-purpose computing on graphics processing units ,Chip ,Convolutional neural network ,Electronic circuit ,Resistive random-access memory ,Computational science - Abstract
Resistive Random Access Memory (ReRAM)-based neural network accelerators have potential to surpass their digital counterparts in computational efficiency and performance. However, design of these accelerators faces a number of challenges including imperfections of the ReRAM device and a large amount of calculations required to accurately simulate the former. We present XB-SIM*, a simulation framework for ReRAM-crossbar-based Convolutional Neural Network (CNN) accelerators. XB-SIM* can be flexibly configured to simulate the accelerator's structure and clock-driven behaviors at the architecture level. This framework also includes an ReRAM-aware Neural Network (NN) training algorithm and a CNN-oriented mapper to train an NN and map it onto the simulated design efficiently. Behavior of the simulator has been verified by the corresponding circuit simulation of a real chip. Furthermore, a batch processing mode of the massive calculations that are required to mimic the behavior of ReRAM-crossbar circuits is proposed to fully apply the computational concurrency of the mapping strategy. On CPU/GPGPU, this batch processing mode can improve the simulation speed by up to 5.02× or 34.29×. Within this framework, comprehensive architectural exploration and end-to-end evaluation have been achieved, which provide some insights for systemic optimization.
- Published
- 2021
32. GENRE (GPU Elastic-Net REgression): A CUDA-Accelerated Package for Massively Parallel Linear Regression with Elastic-Net Regularization
- Author
-
Brett Byram and Christopher Khan
- Subjects
Elastic net regularization ,CUDA ,Computer science ,Linear regression ,Cyclic coordinate descent ,General-purpose computing on graphics processing units ,MATLAB ,computer ,Massively parallel ,Regression ,computer.programming_language ,Computational science - Published
- 2022
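The package's CUDA kernels are not reproduced here; the update they parallelize is the standard cyclic coordinate-descent step for the elastic-net objective, shown below as a plain NumPy reference implementation:
```python
# Plain-NumPy reference for the cyclic coordinate-descent update that packages
# like GENRE parallelise on the GPU (the textbook elastic-net update, not the
# package's CUDA code). Objective:
#   (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2)
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam=0.1, alpha=0.5, n_iter=200):
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ b                              # residual kept up to date
    for _ in range(n_iter):
        for j in range(p):                     # cyclic sweep over coordinates
            r += X[:, j] * b[j]                # remove coordinate j from the fit
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            r -= X[:, j] * b[j]                # add the updated coordinate back
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
beta_true = np.zeros(50)
beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.normal(size=200)
print(np.round(elastic_net_cd(X, y)[:8], 2))   # first five coefficients near 2, rest near 0
```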
33. Search by triplet: An efficient local track reconstruction algorithm for parallel architectures
- Author
-
Niko Neufeld, Daniel Hugo Cámpora Pérez, Agustín Riscos Núñez, RS: FSE DACS, and Dept. of Advanced Computing Sciences
- Subjects
FOS: Computer and information sciences ,Parallel computing ,General Computer Science ,Physics::Instrumentation and Detectors ,Computer science ,FOS: Physical sciences ,Tracking (particle physics) ,Track reconstruction ,SIMD ,High Energy Physics - Experiment ,Theoretical Computer Science ,Computational science ,High Energy Physics - Experiment (hep-ex) ,Data acquisition ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) ,Detectors and Experimental Techniques ,Amortized analysis ,High throughput computing ,Large Hadron Collider ,Detector ,Heterogeneous architectures ,GPGPU ,Process (computing) ,Reconstruction algorithm ,Computing and Computers ,Modeling and Simulation ,General-purpose computing on graphics processing units ,SIMT ,Particle Physics - Experiment - Abstract
Millions of particles are collided every second at the LHCb detector placed inside the Large Hadron Collider at CERN. The particles produced as a result of these collisions pass through various detecting devices which will produce a combined raw data rate of up to 40 Tbps by 2021. These data will be fed through a data acquisition system which reconstructs individual particles and filters the collision events in real time. This process will occur in a heterogeneous farm employing exclusively off-the-shelf CPU and GPU hardware, in a two stage process known as High Level Trigger. The reconstruction of charged particle trajectories in physics detectors, also referred to as track reconstruction or tracking, determines the position, charge and momentum of particles as they pass through detectors. The Vertex Locator subdetector (VELO) is the closest such detector to the beamline, placed outside of the region where the LHCb magnet produces a sizable magnetic field. It is used to reconstruct straight particle trajectories which serve as seeds for reconstruction of other subdetectors and to locate collision vertices. The VELO subdetector will detect up to $10^9$ particles every second, which need to be reconstructed in real time in the High Level Trigger. We present Search by triplet, an efficient track reconstruction algorithm. Our algorithm is designed to run efficiently across parallel architectures. We extend on previous work and explain the algorithm evolution since its inception. We show the scaling of our algorithm under various situations, and analyse its amortized time in terms of complexity for each of its constituent parts and profile its performance. Our algorithm is the current state-of-the-art in VELO track reconstruction on SIMT architectures, and we qualify its improvements over previous results.
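To make the triplet-seeding idea concrete, here is a minimal CUDA sketch of forming hit triplets across three consecutive VELO-like layers by brute force. The hit layout, tolerance parameter and kernel name are illustrative assumptions, not the authors' Search by triplet implementation, which uses sorted hits and windowed searches instead of an exhaustive scan.

// Illustrative triplet seeding: one thread per hit in the first layer,
// brute-force search in the next two layers for compatible extensions.
struct Hit { float x, y, z; };

__global__ void seed_triplets(const Hit* l0, int n0,
                              const Hit* l1, int n1,
                              const Hit* l2, int n2,
                              float tol, int3* triplets, int* n_triplets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n0) return;
    for (int j = 0; j < n1; ++j) {
        // Extrapolate the (l0[i], l1[j]) segment to the z of the third layer.
        float dz01 = l1[j].z - l0[i].z;
        if (fabsf(dz01) < 1e-6f) continue;
        float tx = (l1[j].x - l0[i].x) / dz01;
        float ty = (l1[j].y - l0[i].y) / dz01;
        for (int k = 0; k < n2; ++k) {
            float dz = l2[k].z - l1[j].z;
            float dx = l2[k].x - (l1[j].x + tx * dz);
            float dy = l2[k].y - (l1[j].y + ty * dz);
            if (dx * dx + dy * dy < tol * tol) {
                int slot = atomicAdd(n_triplets, 1);  // append the triplet
                triplets[slot] = make_int3(i, j, k);
            }
        }
    }
}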
- Published
- 2022
34. Highly efficient lattice Boltzmann multiphase simulations of immiscible fluids at high-density ratios on CPUs and GPUs through code generation
- Author
-
Markus Holzer, Ulrich Rüde, Martin Bauer, and Harald Köstler
- Subjects
Computer science ,business.industry ,Multiphase flow ,Lattice Boltzmann methods ,Reynolds number ,010103 numerical & computational mathematics ,Computational fluid dynamics ,01 natural sciences ,010305 fluids & plasmas ,Theoretical Computer Science ,Computational physics ,symbols.namesake ,CUDA ,Hardware and Architecture ,0103 physical sciences ,symbols ,Fluid dynamics ,Code generation ,ddc:004 ,0101 mathematics ,General-purpose computing on graphics processing units ,business ,Software - Abstract
A high-performance implementation of a multiphase lattice Boltzmann method based on the conservative Allen-Cahn model supporting high-density ratios and high Reynolds numbers is presented. Meta-programming techniques are used to generate optimized code for CPUs and GPUs automatically. The coupled model is specified in a high-level symbolic description and optimized through automatic transformations. The memory footprint of the resulting algorithm is reduced through the fusion of compute kernels. A roofline analysis demonstrates the excellent efficiency of the generated code on a single GPU. The resulting single GPU code has been integrated into the multiphysics framework waLBerla to run massively parallel simulations on large domains. Communication hiding and GPUDirect-enabled MPI yield near-perfect scaling behavior. Scaling experiments are conducted on the Piz Daint supercomputer with up to 2048 GPUs, simulating several hundred fully resolved bubbles. Further, validation of the implementation is shown in a physically relevant scenario—a three-dimensional rising air bubble in water.
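The memory-footprint reduction through kernel fusion mentioned above can be illustrated with a generic CUDA sketch; this is a toy example under assumed placeholder updates, not the code generated for waLBerla. Merging two passes keeps the intermediate value in a register instead of routing it through global memory.

// Unfused: two kernels, the intermediate field 'tmp' goes through global memory.
__global__ void stage1(const float* in, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 0.5f * (in[i] + 1.0f);   // placeholder update
}
__global__ void stage2(const float* tmp, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] * tmp[i];         // placeholder update
}

// Fused: one kernel, the intermediate value lives in a register, roughly
// halving the global-memory traffic for this toy example.
__global__ void fused(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = 0.5f * (in[i] + 1.0f);
        out[i] = t * t;
    }
}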
- Published
- 2021
35. Efficient Buffer Overflow Detection on GPU
- Author
-
Dong Li, Hao Chen, Jianhua Sun, and Bang Di
- Subjects
Hardware_MEMORYSTRUCTURES ,Computer science ,Byte ,Parallel computing ,Instruction set ,Memory management ,Computational Theory and Mathematics ,Hardware and Architecture ,Multithreading ,Signal Processing ,Overhead (computing) ,Central processing unit ,General-purpose computing on graphics processing units ,Garbage collection ,Buffer overflow - Abstract
Rich thread-level parallelism of GPU has motivated co-running GPU kernels on a single GPU. However, when GPU kernels co-run, it is possible that one kernel can leverage buffer overflow to attack another kernel running on the same GPU. There is very limited work aiming to detect buffer overflow for GPU. Existing work has either large performance overhead or limited capability in detecting buffer overflow. In this article, we introduce GMODx, a runtime software system that can detect GPU buffer overflow. GMODx performs always-on monitoring on allocated memory based on a canary-based design. First, for the fine-grained memory management, GMODx introduces a set of byte arrays to store buffer information for overflow detection. Techniques, such as lock-free accesses to the byte arrays, delayed memory free, efficient memory reallocation, and garbage collection for the byte arrays, are proposed to achieve high performance. Second, for the coarse-grained memory management, GMODx utilizes unified memory to delegate the always-on monitoring to the CPU. To reduce performance overhead, we propose several techniques, including customized list data structure and specific optimizations against the unified memory. For micro-benchmarking, our experiments show that GMODx is capable of detecting buffer overflow for the fine-grained memory management without performance loss, and that it incurs small runtime overhead (4.2 percent on average and up to 9.7 percent) for the coarse-grained memory management. For real workloads, we deploy GMODx on the TensorFlow framework, it only causes 0.8 percent overhead on average (up to 1.8 percent).
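A canary-based design of this kind can be sketched as follows. The layout, canary value and kernel name are illustrative assumptions, not GMODx's actual data structures: each allocation is padded with a known word, and a checking kernel reports any allocation whose canary has been overwritten.

// Each entry of canary_ptrs points at a canary word placed next to an
// allocated buffer; a corrupted canary indicates an overflow.
#include <cstdint>

__global__ void check_canaries(const uint32_t* const* canary_ptrs,
                               int n, uint32_t canary, int* overflow_flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && *canary_ptrs[i] != canary) {
        atomicExch(overflow_flag, 1);  // report that some buffer overflowed
    }
}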
- Published
- 2021
36. pdlADMM: An ADMM-based framework for parallel deep learning training with efficiency
- Author
-
Dongsheng Li, Xicheng Lu, Zhi-hui Yang, and Lei Guan
- Subjects
0209 industrial biotechnology ,Computational complexity theory ,Computer science ,business.industry ,Cognitive Neuroscience ,Deep learning ,Training (meteorology) ,02 engineering and technology ,Parallel computing ,Computer Science Applications ,020901 industrial engineering & automation ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,020201 artificial intelligence & image processing ,State (computer science) ,Artificial intelligence ,General-purpose computing on graphics processing units ,business - Abstract
The Alternating Direction Method of Multipliers (ADMM) has proven to be a useful alternative to the popular gradient-based optimizers and has been successfully applied to train DNN models. However, existing ADMM-based approaches generally do not achieve a good trade-off between rapid convergence and fast training, nor do they support parallel DNN training with multiple GPUs. These drawbacks seriously hinder them from effectively training DNN models on modern GPU computing platforms, which are usually equipped with multiple GPUs. In this paper, we propose pdlADMM, which can effectively train DNNs in a data-parallel manner. The key insight of pdlADMM is that it derives efficient solutions for each sub-problem by comprehensively considering three main factors: computational complexity, convergence, and suitability for parallel computing. As the number of GPUs grows, pdlADMM retains rapid convergence while the computational complexity on each GPU tends to decline. Extensive experiments demonstrate the effectiveness of our proposal. Compared to two other state-of-the-art ADMM-based approaches, pdlADMM converges significantly faster, obtains better accuracy, and achieves very competitive training speed at the same time.
- Published
- 2021
37. Instruction Prefetch for Improving GPGPU Performance
- Author
-
Zhikui Chen, Yuxin Wang, Pengcheng Wang, Jianli Cao, and He Guo
- Subjects
Instruction prefetch ,Computer architecture ,Computer science ,Applied Mathematics ,Signal Processing ,Electrical and Electronic Engineering ,General-purpose computing on graphics processing units ,Computer Graphics and Computer-Aided Design - Published
- 2021
38. A method for decompilation of AMD GCN kernels to OpenCL
- Author
-
Kristina Igorevna Mihajlenko, Mikhail Andreevich Lukin, and Andrey Stankevich
- Subjects
FOS: Computer and information sciences ,Control and Optimization ,Source code ,Decompiler ,Computer science ,media_common.quotation_subject ,computer.software_genre ,Data type ,Preprocessor ,Software analysis pattern ,D.3.m ,computer.programming_language ,media_common ,68N20 ,Computer Science - Programming Languages ,Assembly language ,Programming language ,Python (programming language) ,Computer Science Applications ,Human-Computer Interaction ,Computer Science - Distributed, Parallel, and Cluster Computing ,Control and Systems Engineering ,Distributed, Parallel, and Cluster Computing (cs.DC) ,General-purpose computing on graphics processing units ,computer ,Software ,Programming Languages (cs.PL) ,Information Systems - Abstract
Introduction: Decompilers are useful tools for software analysis and support in the absence of source code. They are available for many hardware architectures and programming languages. However, none of the existing decompilers support modern AMD GPU architectures such as AMD GCN and RDNA. Purpose: We aim at developing the first assembly decompiler tool for a modern AMD GPU architecture that generates code in the OpenCL language, which is widely used for programming GPGPUs. Results: We developed the algorithms for the following operations: preprocessing assembly code, searching data accesses, extracting system values, decompiling arithmetic operations and recovering data types. We also developed templates for decompilation of branching operations. Practical relevance: We implemented the presented algorithms in Python as a tool called OpenCLDecompiler, which supports a large subset of AMD GCN instructions. This tool automatically converts disassembled GPGPU code into the equivalent OpenCL code, which reduces the effort required to analyze assembly code.
- Published
- 2021
39. Static detection of uncoalesced accesses in GPU programs
- Author
-
Joseph Devietti, Omar Navarro Leija, Nimit Singhania, and Rajeev Alur
- Subjects
Correctness ,Computer science ,Suite ,020208 electrical & electronic engineering ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Static analysis ,Theoretical Computer Science ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Architecture ,General-purpose computing on graphics processing units ,Programmer ,Software ,Abstraction (linguistics) - Abstract
GPU programming has become popular due to the high computational capabilities of GPUs. Obtaining significant performance gains with GPU is however challenging and the programmer needs to be aware of various subtleties of the GPU architecture. One such subtlety lies in accessing GPU memory, where certain access patterns can lead to poor performance. Such access patterns are referred to as uncoalesced global memory accesses. This work presents a light-weight compile-time static analysis to identify such accesses in GPU programs. The analysis relies on a novel abstraction which tracks the access pattern across multiple threads. The abstraction enables quick prediction while providing correctness guarantees. We have implemented the analysis in LLVM and compare it against a dynamic analysis implementation. The static analysis identifies 95 pre-existing uncoalesced accesses in Rodinia, a popular benchmark suite of GPU programs, and finishes within seconds for most programs, in comparison to the dynamic analysis which finds 69 accesses and takes orders of magnitude longer to finish.
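The kind of access pattern such an analysis flags can be shown with a small CUDA example; the kernels below are illustrative and are not taken from the paper or from the Rodinia suite.

// Coalesced: consecutive threads touch consecutive addresses, so a warp's
// loads are served by a few wide memory transactions.
__global__ void coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: a large stride scatters a warp's accesses across many memory
// segments; a static analysis can flag the 'i * stride' index expression.
__global__ void strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];
}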
- Published
- 2021
40. Idempotence-Based Preemptive GPU Kernel Scheduling for Embedded Systems
- Author
-
Euiseong Seo, Hwansoo Han, Hyunjun Kim, Hyeonsu Lee, and Cheolgi Kim
- Subjects
Software_OPERATINGSYSTEMS ,Source code ,Job shop scheduling ,Computer science ,business.industry ,media_common.quotation_subject ,Priority scheduling ,Preemption ,Processor scheduling ,02 engineering and technology ,Execution time ,020202 computer hardware & architecture ,Theoretical Computer Science ,Scheduling (computing) ,Software ,Computational Theory and Mathematics ,Kernel (image processing) ,Hardware and Architecture ,Embedded system ,Idempotence ,0202 electrical engineering, electronic engineering, information engineering ,General-purpose computing on graphics processing units ,business ,media_common - Abstract
Mission-critical embedded systems simultaneously run multiple graphics-processing-unit (GPU) computing tasks with different criticality and timeliness requirements. Considerable research effort has been dedicated to supporting the preemptive priority scheduling of GPU kernels. However, hardware-supported preemption leads to lengthy scheduling delays and complicated designs, and most software approaches depend on the voluntary yielding of GPU resources from restructured kernels. We propose a preemptive GPU kernel scheduling scheme that harnesses the idempotence property of kernels. The proposed scheme distinguishes idempotent kernels through static source code analysis. If a kernel is not idempotent, then GPU kernels are transactionized at the operating system (OS) level. Both idempotent and transactionized kernels can be aborted at any point during their execution and rolled back to their initial state for reexecution. Therefore, low-priority kernel instances can be preempted for high-priority kernel instances and reexecuted after the GPU becomes available again. Our evaluation using the Rodinia benchmark suite showed that the proposed approach limits the preemption delay to 18 μs in the 99.9th percentile, with an average delay in execution time of less than 10 percent for high-priority tasks under a heavy load in most cases.
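Idempotence here means a kernel can be aborted and re-executed from its initial state without changing the final result. A minimal illustration in CUDA (not from the paper) of an idempotent and a non-idempotent kernel:

// Idempotent: the input buffer is never overwritten, so re-running the
// kernel after an abort produces the same output.
__global__ void scale(const float* in, float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Not idempotent: the buffer is read and updated in place, so a partial
// execution followed by a restart would apply the update twice. Such a
// kernel would have to be transactionized before it can be safely preempted.
__global__ void accumulate(float* data, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += a;
}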
- Published
- 2021
41. GPU-friendly data structures for real time simulation
- Author
-
Benoît Ozell and Vincent Magnoux
- Subjects
Computer science ,Computation ,Graphics hardware ,Cutting simulation ,0206 medical engineering ,02 engineering and technology ,Computational science ,lcsh:TA168 ,Reduction (complexity) ,Real-time simulation ,0202 electrical engineering, electronic engineering, information engineering ,Physically-based simulation ,Engineering (miscellaneous) ,Haptic technology ,Applied Mathematics ,020207 software engineering ,GPU computing ,Data structure ,020601 biomedical engineering ,Computer Science Applications ,lcsh:Systems engineering ,Modeling and Simulation ,Surgery simulation ,Graph (abstract data type) ,General-purpose computing on graphics processing units ,lcsh:Mechanics of engineering. Applied mechanics ,lcsh:TA349-359 ,Research Article - Abstract
Simulators for virtual surgery training need to perform complex calculations very quickly to provide realistic haptic and visual interactions with a user. The complexity is further increased by the addition of cuts to virtual organs, such as would be needed for performing tumor resection. A common method for achieving large performance improvements is to make use of the graphics hardware (GPU) available on most general-use computers. Programming GPUs requires data structures that are more rigid than on conventional processors (CPU), making that data more difficult to update. We propose a new method for structuring graph data, which is commonly used for physically based simulation of soft tissue during surgery, and deformable objects in general. Our method aligns all nodes of the graph in memory, independently from the number of edges they contain, allowing for local modifications that do not affect the rest of the structure. Our method also groups memory transfers so as to avoid updating the entire graph every time a small cut is introduced in a simulated organ. We implemented our data structure as part of a simulator based on a meshless method. Our tests show that the new GPU implementation, making use of the new graph structure, achieves a 10 times improvement in computation times compared to the previous CPU implementation. The grouping of data transfers into batches allows for an 80–90% reduction in the amount of data transferred for each graph update, but accounts only for a small improvement in performance. The data structure itself is simple to implement and allows simulating increasingly complex models that can be cut at interactive rates.
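A fixed-stride node layout of the kind described, where every node reserves the same number of edge slots so that a local cut only rewrites that node's slots, might look like the sketch below. The field names, the padding constant and the host-side cut routine are assumptions for illustration, not the authors' data structure.

// Every node owns MAX_EDGES slots regardless of its actual degree; unused
// slots hold -1. Cutting an edge only touches the two nodes involved, and
// the rest of the array never moves, so only small regions are re-uploaded.
constexpr int MAX_EDGES = 16;          // illustrative upper bound on degree

struct NodeSoA {
    float* px; float* py; float* pz;   // node positions, one entry per node
    int*   edges;                      // n_nodes * MAX_EDGES neighbor indices
};

// Operates on the host copy; afterwards only the two touched slot ranges
// need to be transferred to the GPU.
void cut_edge(NodeSoA g, int a, int b) {
    for (int s = 0; s < MAX_EDGES; ++s) {
        if (g.edges[a * MAX_EDGES + s] == b) g.edges[a * MAX_EDGES + s] = -1;
        if (g.edges[b * MAX_EDGES + s] == a) g.edges[b * MAX_EDGES + s] = -1;
    }
}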
- Published
- 2021
42. A GPU-assisted NFV framework for intrusion detection system
- Author
-
Carlos Natalino, Diego Lisboa Cardoso, and Igor M. Araujo
- Subjects
Computer Networks and Communications ,Computer science ,Packet processing ,Process (computing) ,020206 networking & telecommunications ,Throughput ,02 engineering and technology ,Intrusion detection system ,CUDA ,Computer architecture ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Network performance ,General-purpose computing on graphics processing units ,Virtual network - Abstract
The network function virtualization (NFV) paradigm advocates the replacement of specific-purpose hardware supporting packet processing by general-purpose hardware, reducing costs and bringing more flexibility and agility to the network operation. However, this shift can degrade the network performance due to the non-optimal packet processing capabilities of the general-purpose hardware. Meanwhile, graphics processing units (GPUs) have been deployed in many data centers (DCs) due to their broad use in, e.g., machine learning (ML). These GPUs can be leveraged to accelerate the packet processing capability of virtual network functions (vNFs), but the delay introduced can be an issue for some applications. Our work proposes a framework for packet processing acceleration using GPUs to support vNF execution. We validate the proposed framework with a case study, analyzing the benefits of using a GPU to support the execution of an intrusion detection system (IDS) as a vNF and evaluating the traffic intensities at which our framework brings the most benefits. Results show that the throughput of the system increases from 50 Mbps to 1 Gbps by employing our framework while reducing the central processing unit (CPU) resource usage by almost 40%. The results indicate that GPUs are a good candidate for accelerating packet processing in vNFs.
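The GPU side of such a framework can be pictured as a batch kernel that inspects many packets in parallel. The simplified single-byte signature match below is an assumption for illustration, not the paper's IDS; a real vNF would apply many rules and return verdicts to the CPU pipeline, but the parallel structure is the same.

// One thread per packet: scan the payload for a single byte signature.
__global__ void match_signature(const unsigned char* payloads, const int* lengths,
                                int max_len, int n_packets,
                                unsigned char sig, int* verdicts)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_packets) return;
    const unsigned char* pkt = payloads + (size_t)p * max_len;  // packed batch
    int hit = 0;
    for (int b = 0; b < lengths[p]; ++b)
        if (pkt[b] == sig) { hit = 1; break; }
    verdicts[p] = hit;   // 1 if the signature was found in this packet
}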
- Published
- 2021
43. Fast GPU 3D diffeomorphic image registration
- Author
-
Malte Brunn, Naveen Himthani, Andreas Mang, Miriam Mehl, and George Biros
- Subjects
FOS: Computer and information sciences ,Computer Networks and Communications ,Computer science ,Fast Fourier transform ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Image registration ,02 engineering and technology ,Article ,Theoretical Computer Science ,Computational science ,Image (mathematics) ,Diffeomorphic image registration ,Artificial Intelligence ,FOS: Electrical engineering, electronic engineering, information engineering ,FOS: Mathematics ,0202 electrical engineering, electronic engineering, information engineering ,Mathematics - Optimization and Control ,Image and Video Processing (eess.IV) ,020206 networking & telecommunications ,Electrical Engineering and Systems Science - Image and Video Processing ,Solver ,Computer Science::Graphics ,Computer Science - Distributed, Parallel, and Cluster Computing ,Optimization and Control (math.OC) ,Hardware and Architecture ,Computer Science::Mathematical Software ,020201 artificial intelligence & image processing ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Diffeomorphism ,General-purpose computing on graphics processing units ,68U10, 49J20, 35Q93, 65K10, 65F08, 76D55 ,Software ,Interpolation - Abstract
3D image registration is one of the most fundamental and computationally expensive operations in medical image analysis. Here, we present a mixed-precision, Gauss-Newton-Krylov solver for diffeomorphic registration of two images. Our work extends the publicly available CLAIRE library to GPU architectures. Despite the importance of image registration, only a few implementations of large-deformation diffeomorphic registration packages support GPUs. Our contributions are new algorithms that significantly reduce the run time of the two main computational kernels in CLAIRE: calculation of derivatives and scattered-data interpolation. We (i) deploy highly optimized, mixed-precision GPU kernels for the evaluation of scattered-data interpolation, (ii) replace Fast-Fourier-Transform (FFT)-based first-order derivatives with optimized 8th-order finite differences, and (iii) compare with state-of-the-art CPU and GPU implementations. As a highlight, we demonstrate that we can register $256^3$ clinical images in less than 6 seconds on a single NVIDIA Tesla V100. This amounts to over 20× speed-up over the current version of CLAIRE and over 30× speed-up over existing GPU implementations.
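An 8th-order central first derivative of the kind that replaces the FFT-based derivatives uses a 9-point stencil. The 1D CUDA sketch below is an illustration only (interior points, standard 8th-order central-difference weights); the library applies such stencils along each axis of a 3D image.

// 8th-order accurate central difference for d/dx on interior points.
__global__ void ddx_8th(const float* f, float* df, int n, float inv_h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 4 || i >= n - 4) return;    // skip boundary points for brevity
    float d = (4.0f / 5.0f)   * (f[i + 1] - f[i - 1])
            - (1.0f / 5.0f)   * (f[i + 2] - f[i - 2])
            + (4.0f / 105.0f) * (f[i + 3] - f[i - 3])
            - (1.0f / 280.0f) * (f[i + 4] - f[i - 4]);
    df[i] = d * inv_h;                  // inv_h = 1 / grid spacing
}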
- Published
- 2021
44. Shifted Sorting-based k-Approximate Nearest Neighbor Searching Algorithm with Extra Loops Based on Permutation for Better Accuracy
- Author
-
Taejung Park
- Subjects
CUDA ,Permutation ,Search algorithm ,Computer science ,Sorting ,General-purpose computing on graphics processing units ,Cluster analysis ,Algorithm ,k-nearest neighbors algorithm - Published
- 2021
45. Optimizing Neural Networks for Efficient FPGA Implementation: A Survey
- Author
-
Abdessalem Ben Abdelali, Riadh Ayachi, and Yahia Said
- Subjects
Artificial neural network ,business.industry ,Computer science ,Applied Mathematics ,Computation ,Deep learning ,Bandwidth (signal processing) ,02 engineering and technology ,01 natural sciences ,Computer Science Applications ,010101 applied mathematics ,Computer architecture ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,020201 artificial intelligence & image processing ,Artificial intelligence ,Applications of artificial intelligence ,0101 mathematics ,General-purpose computing on graphics processing units ,business ,Field-programmable gate array - Abstract
Deep learning has become key to the development of artificial intelligence applications and has been used successfully to solve computer vision tasks. However, deep learning algorithms are based on Deep Neural Networks (DNNs) with many hidden layers, which require substantial computation and storage. Thus, general-purpose graphics processing units (GPGPUs) are strong candidates for DNN development and inference because of their large number of processing cores and large integrated memory. On the other hand, the main disadvantage of GPGPUs is their high power consumption. In real-world applications, the processing unit is an embedded system with limited power and computation resources. In recent years, the Field Programmable Gate Array (FPGA) has become a serious alternative that can outperform GPGPUs because of its flexible architecture and low power consumption. However, FPGAs offer only small integrated memory and limited bandwidth. Fitting DNNs into FPGAs therefore requires optimization techniques at several levels, such as the network level, the hardware level, and the implementation-tool level. In this paper, we review and evaluate the existing optimization techniques to provide a complete overview of FPGA-based DNN accelerators.
- Published
- 2021
46. On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond
- Author
-
Grigori Fursin, Anton Lokhmotov, Bruno Carpentieri, Fabiana Zollo, Marco Cianfriglia, Damiano Perri, Osvaldo Gervasi, Paolo Sylos Labini, Cedric Nugteren, and Flavio Vella
- Subjects
supervised classification ,Computer science ,Generalization ,Decision tree ,tuning, GPU computing, performance optimization, supervised classification, neural networks, predictive models ,02 engineering and technology ,performance optimization ,01 natural sciences ,Convolution ,Set (abstract data type) ,tuning ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,010302 applied physics ,020203 distributed computing ,Settore INF/01 - Informatica ,Artificial neural network ,business.industry ,Deep learning ,GPU computing ,neural networks ,predictive models ,Computer engineering ,Hardware and Architecture ,Artificial intelligence ,General-purpose computing on graphics processing units ,Heuristics ,business ,Software ,Information Systems - Abstract
Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations, or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation, and we study the performance, accuracy, and generalization ability of the models. Our insights allow us to improve the performance of computationally expensive deep learning primitives on high-end GPUs as well as low-power embedded GPU architectures on three different libraries. Experimental results show significant improvements in the target applications, from 50% up to 300%, compared to auto-tuned and highly optimized vendor-based heuristics, using simple decision-tree- and MLP-based models.
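In practice, such a trained model acts as a selector in the library's dispatch path. The toy host-side decision tree below only illustrates the idea; the split thresholds and variant names are invented and are not the models or libraries studied in the paper.

// Chooses a kernel variant from input-shape features, the way an exported
// decision tree would after training on benchmark data.
enum KernelVariant { TILED_32x32, TILED_64x64, IM2COL_GEMM };

KernelVariant select_variant(int m, int n, int k, int batch) {
    // Invented split points; a real model learns these from measurements.
    if (batch >= 32 && k >= 512) return IM2COL_GEMM;
    if (m >= 1024 && n >= 1024)  return TILED_64x64;
    return TILED_32x32;
}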
- Published
- 2021
47. Reducing Energy in GPGPUs through Approximate Trivial Bypassing
- Author
-
Ehsan Atoofian, Ali Jannesari, and Zayan Shaikh
- Subjects
Power management ,Power gating ,Hardware and Architecture ,Computer science ,Dynamic demand ,Register file ,Execution unit ,Parallel computing ,General-purpose computing on graphics processing units ,Operand ,Software ,Integer (computer science) - Abstract
General-purpose computing using graphics processing units (GPGPUs) is an attractive option for acceleration of applications with massively data-parallel tasks. While the performance of modern GPGPUs is increasing rapidly, the power consumption of these devices is becoming a major concern. In particular, execution units and the register file are among the top three most power-hungry components in GPGPUs. In this work, we exploit trivial instructions to reduce power consumption in GPGPUs. Trivial instructions are those that do not need computation, e.g., multiplication by one. We found that, during the course of a program's execution, a GPGPU executes many trivial instructions, and executing them wastes power unnecessarily. We propose trivial bypassing, which skips the execution of trivial instructions and avoids unnecessary allocation of resources for them. By power gating execution units and skipping trivial computations, trivial bypassing reduces both static and dynamic power. Trivial bypassing also reduces the dynamic energy of the register file by avoiding register-file accesses for the source and/or destination operands of trivial instructions. While trivial bypassing reduces the energy of GPGPUs, it has a detrimental impact on performance, as a power-gated execution unit requires several cycles to resume its normal operation. Conventional warp schedulers are oblivious to the status of execution units. We propose a new warp scheduler that prioritizes warps based on the availability of execution units, along with a set of new power management techniques to further reduce the performance penalty of power gating. To increase the energy saving of trivial bypassing, we also propose approximating the operands of instructions. We offer a set of new techniques to approximate both integer and floating-point instructions and increase the pool of trivial instructions. Our evaluations using a diverse set of benchmarks reveal that our proposed techniques are able to reduce the energy of execution units by 11.2% and the dynamic energy of the register file by 12.2% with minimal performance and quality degradation.
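Trivial bypassing is a microarchitectural technique, so it cannot be reproduced in software; the small host-side predicates below only illustrate which operand patterns make an instruction trivial, i.e., cases where the result is known without using the execution units. The function names are illustrative assumptions.

// Operand patterns whose results need no computation: x*0, x*1, x+0.
// Hardware would detect these at the operand-collection stage and skip
// the execution unit entirely.
bool is_trivial_mul(float a, float b) {
    return a == 0.0f || b == 0.0f || a == 1.0f || b == 1.0f;
}
bool is_trivial_add(float a, float b) {
    return a == 0.0f || b == 0.0f;
}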
- Published
- 2021
48. Modeling and analyzing evaluation cost of CUDA kernels
- Author
-
Stefan K. Muller and Jan Hoffmann
- Subjects
Soundness ,Correctness ,Semantics (computer science) ,Computer science ,Programming language ,Task parallelism ,020207 software engineering ,02 engineering and technology ,Software_PROGRAMMINGTECHNIQUES ,computer.software_genre ,Set (abstract data type) ,CUDA ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,General-purpose computing on graphics processing units ,Safety, Risk, Reliability and Quality ,Execution model ,computer ,Software - Abstract
General-purpose programming on GPUs (GPGPU) is becoming increasingly in vogue as applications such as machine learning and scientific computing demand high throughput in vector-parallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers. This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.
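One of the CUDA-specific cost sources such a logic must account for is warp divergence. The toy kernel below (not from the paper's benchmark suite) shows the pattern: a resource analysis has to bound the cost of both branch bodies, because threads of the same warp serialize over the two paths.

// A divergent branch: lanes of a warp take different paths depending on
// the data, so the warp executes both sides one after the other.
__global__ void divergent(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0) out[i] = in[i] * 3;   // taken by the even-valued lanes
    else                out[i] = in[i] - 7;   // taken by the odd-valued lanes
}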
- Published
- 2021
49. Practical Resilience Analysis of GPGPU Applications in the Presence of Single- and Multi-Bit Faults
- Author
-
Bin Nie, Lishan Yang, Adwait Jog, and Evgenia Smirni
- Subjects
Computer science ,02 engineering and technology ,Parallel computing ,Power budget ,020202 computer hardware & architecture ,Theoretical Computer Science ,Instruction set ,Computational Theory and Mathematics ,Kernel (image processing) ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Graphics ,General-purpose computing on graphics processing units ,Software - Abstract
Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general purpose GPU (GPGPU) applications is especially challenging because unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space in the order of billions, even for some simple applications and even when considering the occurrence of just a single-bit fault. We present a systematic way to progressively prune the fault site space aiming to dramatically reduce the number of fault injections such that assessment for GPGPU application error resilience becomes practical. The key insight behind our proposed methodology stems from the fact that while GPGPU applications spawn a lot of threads, many of them execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by careful analysis. We identify important features across a set of 10 applications (16 kernels) from Rodinia and Polybench suites and conclude that threads can be primarily classified based on the number of the dynamic instructions they execute. We therefore achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, a subset of loop iterations within the representative threads, and a subset of destination register bit positions. The above steps result in a tremendous reduction of fault sites by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications. We show the effectiveness of the proposed progressive pruning technique for a single-bit model and illustrate its application to even more challenging cases with three distinct multi-bit fault models.
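The first pruning step, keeping one representative thread per dynamic-instruction-count class and injecting faults only into the representatives, can be sketched on the host side. The profiling input and function name below are assumptions for illustration, not the paper's tooling.

// One representative thread is kept per dynamic-instruction-count class;
// fault injection then targets only the representatives instead of every
// thread, shrinking the fault-site space dramatically.
#include <unordered_map>
#include <vector>

std::vector<int> pick_representatives(const std::vector<long>& dyn_insn_count)
{
    std::unordered_map<long, int> rep;          // count -> first thread id seen
    for (int tid = 0; tid < (int)dyn_insn_count.size(); ++tid)
        rep.emplace(dyn_insn_count[tid], tid);  // emplace keeps the first occurrence
    std::vector<int> out;
    for (const auto& kv : rep) out.push_back(kv.second);
    return out;
}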
- Published
- 2021
50. A Parallel Cyclic Reduction Algorithm for Pentadiagonal Systems with Application to a Convection-Dominated Heston PDE
- Author
-
Chittaranjan Mishra and Abhijit Ghosh
- Subjects
Computational Mathematics ,Stability conditions ,Parallelizable manifold ,Applied Mathematics ,Parallel algorithm ,Parallel computing ,General-purpose computing on graphics processing units ,Mathematics ,Cyclic reduction ,Convection dominated - Abstract
Based on the parallel cyclic reduction technique, a promising new parallel algorithm is designed for pentadiagonal systems. Subject to fulfilling stability conditions, this highly parallelizable al...
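To convey the flavor of the method, here is one parallel-cyclic-reduction step in CUDA for the simpler tridiagonal case, written as an analogue; the paper's pentadiagonal scheme is not reproduced here. Each step eliminates the neighbors at distance s from every equation, and after the stride has doubled log2(n) times each unknown is decoupled.

// One PCR step for a[i]*x[i-s] + b[i]*x[i] + c[i]*x[i+s] = d[i].
// Out-of-range neighbors are treated as zero rows.
__global__ void pcr_step(const float* a, const float* b, const float* c,
                         const float* d, float* a2, float* b2, float* c2,
                         float* d2, int n, int s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int lo = i - s, hi = i + s;
    float alpha = (lo >= 0) ? -a[i] / b[lo] : 0.0f;   // eliminates x[i-s]
    float gamma = (hi <  n) ? -c[i] / b[hi] : 0.0f;   // eliminates x[i+s]
    a2[i] = (lo >= 0) ? alpha * a[lo] : 0.0f;
    c2[i] = (hi <  n) ? gamma * c[hi] : 0.0f;
    b2[i] = b[i] + ((lo >= 0) ? alpha * c[lo] : 0.0f)
                 + ((hi <  n) ? gamma * a[hi] : 0.0f);
    d2[i] = d[i] + ((lo >= 0) ? alpha * d[lo] : 0.0f)
                 + ((hi <  n) ? gamma * d[hi] : 0.0f);
}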
- Published
- 2021