309 results on '"*CUDA (Computer architecture)"'
Search Results
2. Numerical Simulation of Compressible Flows on Heterogeneous Computational Architecture.
- Author
-
Kashkovsky, Alexander V., Shershnev, Anton A., and Vashchenkov, Pavel V.
- Subjects
- *
SUPERCOMPUTERS , *INTEL computers , *SOURCE code , *MOTHERBOARDS , *CUDA (Computer architecture) - Abstract
A technology for adapting the HyCFS numerical code, originally developed for supercomputers with graphics processing units (GPUs), to other computational platforms, such as conventional CPU-based systems and new supercomputers based on Intel Xeon Phi co-processors, is presented. The main idea of the adaptation is to use OpenMP threads instead of CUDA threads. This approach makes it possible to use a unified source code across different platforms. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
3. Iteration Methods Comparison in Parallel Implementation of the Two-Dimensional Liquid Convection Problem.
- Author
-
Popov, V. N. and Tsivinskaya, Yu. S.
- Subjects
- *
ITERATIVE methods (Mathematics) , *CONVECTIVE flow , *JACOBI method , *LINEAR equations , *CUDA (Computer architecture) - Abstract
A two-dimensional liquid convection problem is implemented using parallel computations based on CUDA technology. The Jacobi method and the conjugate gradient method are used to solve the systems of linear equations (SLEs) with sparse matrices obtained by approximating the initial equations. The calculation time on the central processor of a PC is compared with the time when using GPUs, and the acceleration of the calculations as the number of unknowns increases is estimated. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
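The Jacobi method mentioned in the abstract above updates every unknown using only the previous iterate, which is why it maps naturally onto one-thread-per-unknown CUDA kernels. A minimal serial sketch of the iteration (the system and tolerances are illustrative, not taken from the paper):

```python
def jacobi(A, b, tol=1e-10, max_iter=10_000):
    """Solve A x = b iteratively; A should be strictly diagonally dominant."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        # Each component of the new iterate depends only on the previous
        # iterate, so all n updates are independent -- in a GPU version,
        # one CUDA thread per unknown.
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x

# Small illustrative system with exact solution (1, 2, 3).
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]
x = jacobi(A, b)
```

The conjugate gradient method the authors also benchmark converges faster for symmetric positive-definite systems, but its inner products require reductions, which is why the two methods can behave differently on a GPU.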
4. Numerical Modeling of Seismic Wave Propagation Generated by the Electromagnetic Pulse Source in Layered Medium.
- Author
-
Varygina, M. P. and Chentsov, E. P.
- Subjects
- *
ELECTROMAGNETIC pulses , *THEORY of wave motion , *RHEOLOGY , *FINITE differences , *SEISMIC waves , *CUDA (Computer architecture) - Abstract
The processes of seismic wave propagation generated by the non-explosive electromagnetic pulse source "Yenisei" are investigated. The rheological properties of the layered medium, including fractured interlayers, are taken into account. For the numerical implementation, a computational algorithm based on the space-variable two-cyclic splitting method in combination with monotone finite-difference schemes is developed. Parallel software for the analysis of the velocity and stress fields in a layered medium is designed. Parallelisation of the computations is performed with CUDA technology for supercomputers with graphics accelerators. The results of numerical computations of seismic wave propagation are shown. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
5. Accelerating binary biclustering on platforms with CUDA-enabled GPUs.
- Author
-
González-Domínguez, Jorge and Expósito, Roberto R.
- Subjects
- *
GRAPHICS processing units , *DATA mining , *CUDA (Computer architecture) , *SIMULTANEOUS multithreading processors , *DOCUMENT clustering - Abstract
• Biclustering is an extremely time-consuming data mining task for large datasets. • CUBiBit, a tool to accelerate biclustering on CUDA-enabled GPUs, is presented. • CUBiBit also includes multi-GPU support. • CUBiBit is up to 116x faster than the fastest available tool when using three K20 GPUs. Data mining is nowadays essential in many scientific fields to extract valuable information from large input datasets and transform it into an understandable structure. For instance, biclustering techniques are very useful in identifying subsets of two-dimensional data where both rows and columns are correlated. However, some biclustering techniques have become extremely time-consuming when processing very large datasets, which nowadays prevents their use in many areas of research and industry (such as bioinformatics) that have experienced an explosive growth in the amount of available data. In this work we present CUBiBit, a tool that accelerates the search for relevant biclusters on binary data by exploiting the computational capabilities of CUDA-enabled GPUs as well as the several CPU cores available in most current systems. The experimental evaluation has shown that CUBiBit is up to 116 times faster than the fastest state-of-the-art tool, BiBit, in a system with two Intel Sandy Bridge processors (16 CPU cores) and three NVIDIA K20 GPUs. CUBiBit is publicly available to download from https://sourceforge.net/projects/cubibit. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
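The bit-level trick behind BiBit-style binary biclustering, which the entry above accelerates on GPUs, can be sketched in a few lines (this is an illustrative reconstruction, not the authors' CUBiBit code): each 0/1 row is packed into an integer, the AND of two rows yields a candidate column set, and membership tests for the remaining rows become single bitwise operations.

```python
def pack_rows(matrix):
    """Pack each 0/1 row into a single integer, one bit per column."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in matrix]

def biclusters(matrix, min_cols=2):
    """Seed biclusters from row pairs; bitwise AND gives the column set."""
    rows = pack_rows(matrix)
    found = {}
    n = len(rows)
    for i in range(n):
        for j in range(i + 1, n):
            pattern = rows[i] & rows[j]          # candidate column set
            if bin(pattern).count("1") < min_cols:
                continue
            # Every row whose bits cover the pattern belongs to the bicluster.
            members = tuple(k for k in range(n) if rows[k] & pattern == pattern)
            found[pattern] = members
    return found

m = [[1, 1, 0, 1],
     [1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 1]]
result = biclusters(m)
```

The pairwise seeding loop is what the GPU parallelizes well: every (i, j) pair is independent, and each membership test is a handful of integer instructions.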
6. Comparison of Parallel Data Processing and its Performance.
- Author
-
Botor, Tomáš and Habiballa, Hashim
- Subjects
- *
PARALLEL processing , *INFORMATION storage & retrieval systems , *CUDA (Computer architecture) , *C++ , *SOURCE code , *GRAPHICS processing units - Abstract
This research is focused on the parallelization of C++ code using the TBB library, OpenMP directives and CUDA technology. A comparison of the results is also included. Based on these results, we can recommend the best technology for source-code optimization. We present experimental results on the computational efficiency of several parallelization techniques, including GPU execution units. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
7. CUDA Application on Two-Dimensional CFD Problems.
- Author
-
Tsivinskaya, Yu. S.
- Subjects
- *
CUDA (Computer architecture) , *COMPUTATIONAL fluid dynamics , *MEASUREMENT of flow velocity , *LINEAR equations , *APPROXIMATION theory - Abstract
Parallel computations based on CUDA technology are applied to solve two-dimensional problems. Heat distribution and velocity-field calculation in a liquid are considered. Various iterative methods are used to solve the simultaneous linear equations with sparse matrices obtained by approximating the initial equations. The efficiency of using GPUs to accelerate the calculations is estimated and is found to grow as the number of unknowns increases. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
8. Calculation of longitudinal collective instabilities with mbtrack-cuda.
- Author
-
Xu, Haisheng, Locans, Uldis, Adelmann, Andreas, and Stingelin, Lukas
- Subjects
- *
PARTICLE beam instabilities , *CENTRAL processing units , *PROTON accelerators , *CUDA (Computer architecture) , *LONGITUDINAL method - Abstract
Abstract Macroparticle tracking is a prominent method for studying collective beam instabilities in accelerators. However, the heavy computational load often limits the capabilities of tracking codes. One widely used macroparticle tracking code for simulating collective instabilities in storage rings is mbtrack. The Message Passing Interface (MPI) has already been implemented in mbtrack to accelerate the simulations. However, many Central Processing Unit (CPU) threads are required in mbtrack for the analysis of coupled-bunch instabilities. Therefore, computer clusters or desktops with many CPU cores are needed. Since these are not always available, we employ a Graphics Processing Unit (GPU) with a CUDA programming interface as an alternative to run such simulations on a stand-alone workstation. All the heavy computations have been moved to the GPU. The benchmarks confirm that mbtrack-cuda can be used to analyze coupled-bunch instabilities of 484 bunches. Compared to mbtrack on an 8-core CPU, a 36-core CPU and a cluster, mbtrack-cuda is faster for simulations of up to 3 bunches. For 363 bunches, mbtrack-cuda requires approximately six times the execution time of the cluster and twice that of the 36-core CPU. The multibunch instability analysis demonstrates that the length of the ion-cleaning gap does not have a substantial influence, at least at 3/4 filling. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
9. An effective parallelization algorithm for DEM generalization based on CUDA.
- Author
-
Wu, Qianjiao, Chen, Yumin, Wilson, John P., Liu, Xuejun, and Li, Huifang
- Subjects
- *
PARALLEL algorithms , *CUDA (Computer architecture) , *GENERALIZATION , *CRITICAL point (Thermodynamics) , *COMPUTER algorithms - Abstract
Abstract An effective parallelization algorithm based on the compute-unified-device-architecture (CUDA) is developed for DEM generalization, which is critical to multi-scale terrain analysis. It aims to efficiently retrieve the critical points for generating coarser-resolution DEMs that maximally maintain the significant terrain features. CUDA is embedded into a multi-point algorithm to provide a parallel-multi-point algorithm for enhancing its computing efficiency. The outcomes are compared with the ANUDEM, compound and maximum z-tolerance methods, and the results demonstrate that the proposed algorithm reduces response time by up to 96% compared to the other methods. In terms of RMSE, it performs better than ANUDEM and needs half the number of points to maintain the same RMSE. The mean slope and surface roughness are reduced by less than 1% in the tested cases. The parallel algorithm provides better streamline matching. Given its high computing efficiency, the proposed algorithm can retrieve more critical points to meet the demands of higher precision. Highlights • We present a parallelization method for DEM generalization based on CUDA. • We propose a parallel-multi-point algorithm to extract the critical points from the DEM. • The method reduces response time by up to 96% compared with three existing methods. • The method can better sustain the drainage features during the generalization process than three existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
10. On the effect of using rCUDA to provide CUDA acceleration to Xen virtual machines.
- Author
-
Prades, Javier, Reaño, Carlos, and Silla, Federico
- Subjects
- *
CUDA (Computer architecture) , *VIRTUAL machine systems , *GRAPHICS processing units , *ENERGY consumption , *INFORMATION sharing - Abstract
Nowadays, many data centers use virtual machines (VMs) in order to achieve a more efficient use of hardware resources. The use of VMs provides a reduction in equipment and maintenance expenses as well as a lower electricity consumption. Nevertheless, current virtualization solutions, such as Xen, do not easily provide graphics processing units (GPUs) to applications running in the virtualized domain with the flexibility usually required in data centers (i.e., managing virtual GPU instances and concurrently sharing them among several VMs). Therefore, the execution of GPU-accelerated applications within VMs is hindered by this lack of flexibility. In this regard, remote GPU virtualization solutions may address this concern. In this paper we analyze the use of the remote GPU virtualization mechanism to accelerate scientific applications running inside Xen VMs. We conduct our study with six different applications, namely CUDA-MEME, CUDASW++, GPU-BLAST, LAMMPS, a triangle count application, referred to as TRICO, and a synthetic benchmark used to emulate different application behaviors. Our experiments show that the use of remote GPU virtualization is a feasible approach to address the current concerns of sharing GPUs among several VMs, featuring a very low overhead if an InfiniBand fabric is already present in the cluster. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
11. Real‐time parallel 3D multiple particle tracking with single molecule centrifugal force microscopy.
- Author
-
KOU, L., JIN, L., LEI, H., HU, C., LI, H., and HU, X.
- Subjects
- *
GRAPHICS processing units , *CUDA (Computer architecture) , *TRACKING algorithms , *SINGLE molecules , *IMAGE processing - Abstract
Summary: Real‐time tracking of multiple particles is key for quantitative analysis of dynamic biophysical processes and materials science via time‐lapse microscopy image data, especially for single molecule biophysical techniques such as magnetic tweezers and centrifugal force microscopy. However, real‐time multiple particle tracking with high resolution is limited by current imaging processes and tracking algorithms. Here, we demonstrate 1 nm resolution in three dimensions in real time with a graphics processing unit (GPU) based on the compute unified device architecture (CUDA) parallel computing framework, instead of only a central processing unit (CPU). We also explore the trade‐offs between processing speed and the size of the utilized regions of interest, and a maximum speedup of 137 over the CPU is achieved with the GPU. Moreover, we apply this method with our recently self‐built centrifugal force microscope (CFM) in experiments that track multiple DNA‐tethered particles. Our approach paves the way for high‐throughput single molecule techniques with high resolution and efficiency. Lay Description: Particles are widely used as probes in the life sciences through their motions. In single molecule techniques such as optical tweezers and magnetic tweezers, microbeads are used to study intermolecular or intramolecular interactions via bead tracking. Tracking the motions of multiple beads can also be used to study cell–cell or cell–ECM interactions in traction force microscopy. Particle tracking is therefore of key importance in this research. However, parallel 3D multiple particle tracking in real time with high resolution is a challenge, owing either to the algorithms or to the programs involved. Here, we combine the performance of a CPU and a CUDA‐based GPU in a hybrid implementation of particle tracking. In this way, a speedup of 137 is obtained compared with the previous CPU‐only program, without loss of accuracy. Moreover, we improve on previous designs and build a new centrifugal force microscope for parallel multiple single molecule force spectroscopy research. We then employed our program with the centrifugal force microscope for a DNA stretching study. Our results not only demonstrate the application of this program in single molecule techniques, but also indicate the capability of centrifugal force microscopy for multiple single molecule studies. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
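The per-particle localization step that such trackers parallelize (typically one GPU thread block per region of interest) can be illustrated with a simple intensity-weighted centre of mass; this sketch is a generic example, not the authors' 1 nm algorithm, and the pixel values are made up:

```python
def centroid(roi):
    """Intensity-weighted centre of mass (x, y) of a 2D region of interest."""
    total = sum(sum(row) for row in roi)
    # Weight each row index by the row's total intensity, and each column
    # index by the pixel intensity, then normalize.
    cy = sum(y * sum(row) for y, row in enumerate(roi)) / total
    cx = sum(x * v for row in roi for x, v in enumerate(row)) / total
    return cx, cy

# A symmetric spot centred on pixel (1, 1).
roi = [[0, 1, 0],
       [1, 4, 1],
       [0, 1, 0]]
cx, cy = centroid(roi)
```

Because every region of interest is processed independently, throughput scales with the number of tracked particles, which is the parallelism the GPU exploits.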
12. Hardware Accelerated Semantic Declarative Memory Systems through CUDA and MapReduce.
- Author
-
Edmonds, Mark, Atahary, Tanvir, Douglass, Scott, and Taha, Tarek
- Subjects
- *
EXPLICIT memory , *CUDA (Computer architecture) , *INFORMATION retrieval , *SEMANTIC networks (Information theory) , *SERVICE-oriented architecture (Computer science) - Abstract
Declarative memory enables cognitive agents to effectively store and retrieve factual memory in real-time. Increasing the capacity of a real-time agent's declarative memory increases an agent's ability to interact intelligently with its environment but requires a scalable retrieval system. This work represents an extension of the Accelerated Declarative Memory (ADM) system, referred to as Hardware Accelerated Declarative Memory (HADM), to execute retrievals on a GPU. HADM also presents improvements over ADM's CPU execution and considers critical behavior for indefinitely running declarative memories. The negative effects of a constant maximum associative strength are considered, and mitigating solutions are proposed. HADM utilizes a GPU to process the entire semantic network in parallel during retrievals, yielding significantly faster declarative retrievals. The resulting GPU-accelerated retrievals show an average speedup of approximately 70 times over the previous Service Oriented Architecture Declarative Memory (soaDM) implementation and an average speedup of approximately 5 times over ADM. HADM is the first GPU-accelerated declarative memory system in existence. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
13. ARC: adaptive ray-tracing with CUDA, a new ray tracing code for parallel GPUs.
- Author
-
Hartley, Blake and Ricotti, Massimo
- Subjects
- *
RAY tracing , *GRAPHICS processing units , *CUDA (Computer architecture) , *COMPUTER simulation , *METAPHYSICAL cosmology - Abstract
We present the methodology of a photon-conserving, spatially-adaptive, ray-tracing radiative transfer algorithm, designed to run on multiple parallel graphics processing units (GPUs). Each GPU has thousands of computing cores, making them ideally suited to the task of tracing independent rays. This ray-tracing implementation is speed competitive with approximate momentum methods, even with thousands of ionization sources, without sacrificing accuracy and resolution. Here, we validate our implementation with the selection of tests presented in the 'cosmological radiative transfer codes comparison project,' to demonstrate the correct behaviour of the code. We also present a selection of benchmarks to demonstrate the performance and computational scaling of the code. As expected, our method scales linearly with the number of sources and with the square of the dimension of the 3D computational grid. Our current implementation is scalable to an arbitrary number of nodes possessing GPUs, but is limited to a uniform resolution 3D grid. Cosmological simulations of reionization with tens of thousands of radiation sources and intergalactic volumes sampled with 1024³ grid points take about 30 days on 64 GPUs to reach complete reionization. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
14. Multi-device Controllers: A Library to Simplify Parallel Heterogeneous Programming.
- Author
-
Moreton-Fernandez, Ana, Gonzalez-Escribano, Arturo, and Llanos, Diego R.
- Subjects
- *
COMPUTER programming , *HIGH performance computing , *MOTHERBOARDS , *CUDA (Computer architecture) , *KERNEL functions - Abstract
Current HPC clusters are composed of several machines with different computation capabilities and different kinds and families of accelerators. Programming efficiently for these heterogeneous systems has become an important challenge. There are many proposals to simplify the programming and management of accelerator devices, and of hybrid programming mixing accelerators and CPU cores. However, in many cases, portability compromises efficiency on different devices, and there are details concerning the coordination of different types of devices that still have to be tackled by the programmer. In this work, we introduce the Multi-Controller, an abstract entity implemented in a library that coordinates the management of heterogeneous devices, including accelerators with different capabilities and sets of CPU cores. Our proposal improves on state-of-the-art solutions, simplifying data partition, mapping, and the transparent deployment of both simple generic kernels portable across different device types and specialized implementations defined and optimized using specific native or vendor programming models (such as CUDA for NVIDIA's GPUs, or OpenMP for CPU cores). The run-time system automatically selects and deploys the most appropriate implementation of each kernel for each device, managing data movements and hiding the launch details. The results of an experimental study with five case studies indicate that our abstraction allows the development of flexible and highly efficient programs that adapt to the heterogeneous environment. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
15. Parallel SIMD CPU and GPU Implementations of Berlekamp-Massey Algorithm and Its Error Correction Application.
- Author
-
Mohebbi, Hamidreza
- Subjects
- *
CUDA (Computer architecture) , *GRAPHICS processing units , *BERLEKAMP-Massey algorithm , *ERROR correction (Information theory) , *CRYPTOGRAPHY - Abstract
The Berlekamp-Massey algorithm finds the shortest linear feedback shift register for a binary input sequence. A wide range of applications, such as cryptography and digital signal processing, use this algorithm. This research proposes novel parallel mechanisms offered by heterogeneous CPU and GPU hardware in order to achieve the best possible performance for the BMA. The proposed bitwise implementation of the BMA is almost 35 times faster than state-of-the-art implementations. Further improvement is achieved by using SIMD instructions, which provide data-level parallelism: this approach is 4.6 and 35 times faster than the bitwise CPU and state-of-the-art implementations, respectively. To achieve the highest possible speedup on a multi-core architecture, a multi-threading implementation is also introduced. By leveraging OpenMP, we were able to obtain a speedup of 10 times on a 12-core server. A GPU device with thousands of processing cores can bring a great speedup over the best CPU implementation. Two other parallel mechanisms offered by the GPU are concurrent kernel execution and streaming, which achieve speedups of 14.5 and 2.2 times compared to the serial CPU and typical CUDA implementations, respectively. The performance of the OpenMP code with SIMD instructions is also compared with the GPU stream implementation. The effectiveness of the proposed method is evaluated in a real-world error correction application, where it achieves a speedup of 6.8 times. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
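For reference, a compact serial version of the Berlekamp-Massey algorithm over GF(2) is shown below. This is the textbook formulation, not the paper's bitwise SIMD or CUDA code, but it is the computation those implementations accelerate:

```python
def berlekamp_massey(s):
    """Return the length L and connection polynomial c (coefficient list,
    c[0] = 1) of the shortest LFSR generating the binary sequence s."""
    n = len(s)
    c, b = [0] * n, [0] * n      # current and previous connection polynomials
    c[0] = b[0] = 1
    L, m = 0, -1
    for i in range(n):
        # Discrepancy: does the current LFSR predict bit i correctly?
        d = s[i]
        for j in range(1, L + 1):
            d ^= c[j] & s[i - j]
        if d:                    # prediction failed: update the register
            t, p = c[:], i - m
            for j in range(n - p):
                c[j + p] ^= b[j]
            if 2 * L <= i:
                L, b, m = i + 1 - L, t, i
    return L, c[:L + 1]

# Bits produced by the recurrence s[n] = s[n-2] XOR s[n-3]:
# the shortest LFSR has length 3 and connection polynomial 1 + x^2 + x^3.
L, c = berlekamp_massey([1, 0, 0, 1, 0, 1, 1, 1])
```

The inner discrepancy loop over `c[j] & s[i-j]` is the part that the paper's bitwise implementation packs into machine words, trading the bit-at-a-time XORs for word-wide operations.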
16. A fast GPU-based hybrid algorithm for addition chains.
- Author
-
Bahig, Hatem M. and AbdElbari, Khaled A.
- Subjects
- *
GRAPHICS processing units , *BRANCH & bound algorithms , *COMPUTER algorithms , *CUDA (Computer architecture) , *CLOUD computing - Abstract
A graphics processing unit (GPU) has been widely used to accelerate discrete optimization problems. In this paper, we introduce a novel hybrid parallel algorithm to generate a shortest addition chain for a positive integer e. The main idea of the proposed algorithm is to divide the search tree into a sequence of three subtrees: top, middle, and bottom. The top subtree works using a branch and bound depth first strategy. The middle subtree works using a branch and bound breadth first strategy, while the bottom subtree works using a parallel (GPU) branch and bound depth first strategy. Our experimental results show that, compared to the fastest sequential algorithm for generating a shortest addition chain, we improve the generation by about 70% using one GPU (NVIDIA GeForce GTX 770). For generating all shortest addition chains, the percentage of the improvement is about 50%. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
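A serial analogue of the search the entry above parallelizes can be sketched as iterative deepening over star addition chains, where each new element is the sum of the last element and some earlier one (star chains are optimal for small targets; the paper's three-subtree GPU strategy is not reproduced here, and this sketch is illustrative only):

```python
def shortest_addition_chain(e):
    """Shortest star addition chain 1 = a0 < a1 < ... < ar = e,
    found by iterative deepening depth-first search with pruning."""
    def dfs(chain, steps_left):
        last = chain[-1]
        if last == e:
            return chain[:]
        # Prune: even doubling at every remaining step cannot reach e.
        if last << steps_left < e:
            return None
        for a in chain[::-1]:            # try the largest summand first
            nxt = last + a
            if nxt <= e:
                chain.append(nxt)
                found = dfs(chain, steps_left - 1)
                chain.pop()
                if found:
                    return found
        return None

    depth = 0
    while True:
        result = dfs([1], depth)
        if result:
            return result
        depth += 1

# A 5-addition chain for 15, such as [1, 2, 4, 5, 10, 15].
chain = shortest_addition_chain(15)
```

The doubling-bound prune is the same kind of branch-and-bound cut the paper applies in each of its subtrees; on the GPU, many subtree branches are explored by independent threads instead of by backtracking.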
17. CUDA-based parallelization of a bio-inspired model for fast object classification.
- Author
-
Hernández, Daniel E., Olague, Gustavo, Hernández, Benjamín, and Clemente, Eddie
- Subjects
- *
CUDA (Computer architecture) , *COMPUTER vision , *REAL-time computing , *GRAPHICS processing units , *IMAGE processing - Abstract
The need for highly accurate classification systems capable of working in real-time applications has increased in recent years. Nowadays, several computer vision tasks apply a classification step as part of larger systems, hence requiring classification models that work at a fast pace. This has made real-time object classification interesting to several research communities. In this paper, we propose to accelerate a bio-inspired model for object classification, which has given very good results when compared with other state-of-the-art proposals, using the compute unified device architecture (CUDA) and exploiting the computational capabilities of graphics processing units. The classification model that is used is called the artificial visual cortex, a novel bio-inspired approach to image classification. In this work, we show that through an implementation of this model in the CUDA framework it is possible to achieve real-time functionality. As a result, the proposed system is able to process images on average up to 90 times faster than the original system. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
18. CUDA accelerated implementation of parallel dynamic relaxation.
- Author
-
Iványi, P.
- Subjects
- *
CUDA (Computer architecture) , *RELAXATION (Nuclear physics) , *RELAXATION spectroscopy , *GRAPHICS processing units , *PARALLEL algorithms - Abstract
Highlights • New parallel implementation of dynamic relaxation. • Dynamic relaxation method based on nodal calculation. • Parallel implementation in CUDA. Abstract The dynamic relaxation method has been widely used for the design and analysis of cable-membrane structures. The method iteratively determines a static solution, and it has already been parallelized with the MPI environment to speed up the analysis process. This paper discusses a new parallelization approach, which is programmed with the NVIDIA CUDA API and executed on GPU systems. Since a GPU has a large number of cores and a memory separate from the host computer, the original dynamic relaxation method has to be reorganized. The paper also discusses performance measurements of the dynamic relaxation method on GPU systems. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
19. TELEPORT: Hardware/software alternative to CUDA shared memory programming.
- Author
-
Lashgar, Ahmad, Atoofian, Ehsan, and Baniasadi, Amirali
- Subjects
- *
COMPUTER software , *COMPUTER input-output equipment , *CUDA (Computer architecture) , *COMPUTER storage devices , *COMPUTER programming - Abstract
Abstract Using software-managed cache in CUDA programming provides significant potential to improve memory efficiency. Employing this feature requires the programmer to identify data tiles associated with thread blocks and bring them to the cache explicitly. Despite the advantages, the development effort required to exploit this feature can be significant. The goal of this paper is to reduce this effort while maintaining the associated benefits. To this end, we first investigate static precalculability in memory accesses for GPGPU workloads, at the thread block granularity. We show that a significant share of addresses can be precalculated knowing thread block identifiers. We build on this observation and introduce TELEPORT. TELEPORT is a novel hardware/software scheme for delivering performance competitive to software-managed cache programming, but at no extra development effort. On the software side, TELEPORT's static analyzer parses the kernel and finds precalculable memory accesses. We introduce Runtime API calls to pass this information to hardware. On the hardware side, this information is used to fetch the data required for each thread block into shared memory before the thread block starts execution. With this hardware support, TELEPORT outperforms hand-written CUDA code as a result of the associated DRAM row locality improvement. Investigating a wide set of benchmarks, we show that TELEPORT improves performance of hand-written implementations, on average, by 32% while reducing development effort by 2.5X. Our estimations show that the hardware overhead associated with TELEPORT is below 1%. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
20. A GPU-based DEM-FEM computational framework for tire-sand interaction simulations.
- Author
-
Zheng, Zumei, Zang, Mengyan, Chen, Shunhua, and Zeng, Haiyang
- Subjects
- *
GRAPHICS processing units , *FINITE element method , *CUDA (Computer architecture) , *KINEMATICS , *DISPLACEMENT (Mechanics) - Abstract
Highlights • A GPU-based discrete-finite element method is developed using CUDA FORTRAN. • The effectiveness and scalability of the method are validated. • The developed method is applied to tire-sand interaction simulations. • A speedup of more than 15 is achieved in tire-sand interaction simulation. • Simulation results of tire-sand interaction cases agree well with experimental results. Abstract Recently, the discrete-finite element method (DEM-FEM) has proved to be an advanced technique for solid-particle interaction simulations. However, its low computational efficiency limits its application to practical engineering problems. To address this, this paper proposes a Graphics Processing Unit (GPU) based DEM-FEM, covering contact detection, force calculation and information update, in the Compute Unified Device Architecture (CUDA) FORTRAN environment. Three numerical examples are performed to validate the efficiency, effectiveness and scalability of the developed method. On this basis, the GPU-based DEM-FEM is extended to a computing platform, and programmed into our in-house code PDFP-OVS for a typical solid-particle interaction problem, i.e. the running performance of a pneumatic tire on granular sand. Numerical results show that a speedup of more than 15 can be achieved. The simulation results are found to be in good agreement with the experimental results in terms of the gross tractive effort, the drawbar pull and the running resistance, which validates the capacity of the platform for analysing the travel performance of the tire on granular sand. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
21. Optimizing data transmission and access of the incremental clustering algorithm using CUDA: A case study.
- Author
-
Chen, Chunlei, Wang, Chengduan, Hou, Jinkui, Zhang, Peng, Zhang, Yonghui, Wang, Lei, and Dai, Jiangyan
- Subjects
- *
DATA transmission systems , *MATHEMATICAL optimization , *CUDA (Computer architecture) , *CLUSTER analysis (Statistics) , *COMPUTER algorithms - Abstract
Incremental clustering algorithms find wide application in real-time streaming data processing and massive data analysis. Such algorithms need to continuously load data, and thus data transmission and access can induce non-negligible time overhead. Additionally, we have proposed two algorithms to exploit high data parallelism for incremental clustering on CUDA-enabled GPGPUs: the Top-down (TD) algorithm and the Moderate-granularity (MG) algorithm. In this paper, we adopt the TD and MG algorithms as a case study to optimize data transmission and access based on CUDA. First, we reinterpret the two algorithms from the point of view of overlapping read/write and computing operations at the CUDA-warp level. Second, we adjust the flow of the TD and MG algorithms to enhance data locality, so that shared memory can be fully utilized. Third, we reorder the input data points to raise the data rate of global memory through coalesced memory access. Fourth, we hide part of the data transmission latency by running multiple CUDA streams. Experimental results validated the efficiency of our optimizations. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
22. Effectiveness of inside/outside determination in relation to 3D non-convex shapes using CUDA.
- Author
-
Kodama, Satoshi
- Subjects
- *
CUDA (Computer architecture) , *THREE-dimensional imaging , *VIRTUAL reality , *COMPUTER algorithms , *TRIGONOMETRIC functions - Abstract
The ability to accurately perform inside/outside determination in relation to three-dimensional (3D) non-convex shapes is an extremely important technology when analysing the structure of objects. However, because 3D shapes inevitably have a large number of data points, a considerable amount of time is required to analyse a shape; therefore, high-speed analytical methods have been established. Because the methods generally used in games and virtual reality/augmented reality do not analyse the structures of the objects, they are capable of high-speed depiction. However, in cases where the structure itself needs to be analysed, it is necessary to accurately evaluate whether a particular point is inside or outside of the object. In this study, we employed an algorithm that uses solid angles to precisely analyse the 3D structure of objects. Given that potential problems emerge regarding processing speed when trigonometric functions are used to perform inside/outside determination with solid angles, we parallelized the algorithm using the Compute Unified Device Architecture to resolve these issues. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
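As a CPU-side reference for the solid-angle method described in the abstract above, the sketch below sums the signed solid angle subtended at a query point by each triangle of a closed mesh (the Van Oosterom and Strackee formula): the total is about ±4π for interior points and about 0 for exterior ones, and this works for non-convex shapes as well. The tetrahedron test mesh and all function names are illustrative assumptions, not material from the article.

```python
import math

def sub(u, v): return tuple(a - b for a, b in zip(u, v))
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])
def norm(u): return math.sqrt(dot(u, u))

def solid_angle(p, tri):
    """Signed solid angle subtended at p by triangle tri (Van Oosterom-Strackee)."""
    ra, rb, rc = (sub(v, p) for v in tri)
    la, lb, lc = norm(ra), norm(rb), norm(rc)
    num = dot(ra, cross(rb, rc))
    den = la*lb*lc + dot(ra, rb)*lc + dot(ra, rc)*lb + dot(rb, rc)*la
    return 2.0 * math.atan2(num, den)

def is_inside(p, triangles):
    # Total signed solid angle: about +/-4*pi inside a closed mesh, about 0 outside.
    total = sum(solid_angle(p, t) for t in triangles)
    return abs(total) > 2.0 * math.pi

# Toy closed mesh: a unit tetrahedron with consistently oriented faces.
V = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
tris = [tuple(V[i] for i in f) for f in [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]]
inside_pt = is_inside((0.1, 0.1, 0.1), tris)   # interior point
outside_pt = is_inside((1.0, 1.0, 1.0), tris)  # exterior point
```

The per-triangle `atan2` evaluations are the trigonometric cost that the article parallelizes across CUDA threads; here they are just a Python loop.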
23. GPU parallel neural hierarchical multi objective solver for burst routing and wavelength assignment.
- Author
-
Larhlimi, Abderrahim and Mestari, Mohammed
- Subjects
- *
LINEAR programming , *ARTIFICIAL neural networks , *CUDA (Computer architecture) , *MATHEMATICAL optimization , *PROBLEM solving - Abstract
Optical Burst Switching (OBS) is a promising technology for the next generation of Transparent Optical Networks (TON). However, many scientific challenges remain to be overcome, such as the problem of Burst Routing and Wavelength Assignment (BRWA) with several conflicting objectives and constraints. In this paper, we first formulate the BRWA as a Multi-Objective Integer Linear Programming (MO-ILP) optimization problem. In the formulated problem, the proposed BRWA policy satisfies several constraints in order to guarantee the high-speed process management required by transparent optical traffic. Then, since the obtained ILP problem contains a large number of optical constraints and conflicting objectives, we propose an exact parallel Neural Hierarchical (epNH) MO-ILP solution with a Graphics Processing Unit (GPU) parallel implementation using the Compute Unified Device Architecture (CUDA). This also allows searching for multiple solutions concurrently, reducing processing cost, building hybrid interfaces to other search techniques, and achieving better overall effectiveness. In addition, our architecture, based on Artificial Neural Networks (ANN), allows flexibility and scalability: the processing time remains fixed regardless of the input size. Our BRWA GPU-based epNH MO-ILP solver rests on the joint use of advanced MO-ILP optimization methods, the large-scale inherent parallelism of ANNs, and the CUDA-GPU High-Performance Computing (HPC) architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
24. Evaluation of the computational efficacy in GPU-accelerated simulations of spiking neurons.
- Author
-
Fujita, Kazuhisa, Okuno, Shun, and Kashimori, Yoshiki
- Subjects
- *
COMPUTERS in economics , *NEURAL circuitry , *GRAPHICS processing units , *CUDA (Computer architecture) , *COMPUTER software - Abstract
To understand the mechanisms of information processing in biological neural networks, computer simulation of large-scale spiking neural networks is an important method. However, because of the high computational cost, such simulation traditionally requires high-performance computing on a supercomputer or a computer cluster. Recently, hardware for parallel computing, such as multi-core CPUs and graphics cards with graphics processing units (GPUs), has become standard in gaming computers and workstations. Parallel computing on this hardware is therefore becoming widespread, giving us powerful computing resources for simulating large-scale spiking neural networks. However, it is not clear how much performance a new GPU yields in such a simulation. In this study, we compared computation times between CPU- and GPU-based methods in simulations of neuronal models. We developed simulation programs for computing systems equipped with a gaming graphics card with a new architecture (the NVIDIA GTX 1080) and with an accelerator board using a GPU (the NVIDIA Tesla K20C). Our results show that these systems can simulate a large number of neurons faster than CPU-based systems. Furthermore, we investigated the accuracy of simulations using single-precision floating point. The single-precision results were accurate enough compared with double-precision results, but the chaotic neuronal response calculated by a GPU in single precision differed prominently from that calculated by a CPU in double precision. The difference in chaotic dynamics appeared even when double precision was used on the GPU.
In conclusion, the GPU-based computing system exhibits higher computing performance than the CPU-based system, even when data transfer from the graphics card to host memory is included. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
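The sensitivity of chaotic dynamics to numerical precision noted in the abstract above can be reproduced with a toy experiment. The sketch below iterates the logistic map, a stand-in chaotic system (an assumption; it is not the authors' neuron model), once in double precision and once in emulated single precision: the two orbits agree at first, then diverge to order-one differences.

```python
import struct

def to_f32(x):
    """Round a Python float (double precision) to IEEE-754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

def logistic_orbit(x0, r=3.9, n=100, single=False):
    xs = [x0]
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        if single:
            x = to_f32(x)  # emulate a kernel that keeps its state in float32
        xs.append(x)
    return xs

d64 = logistic_orbit(0.2)
f32 = logistic_orbit(0.2, single=True)
early_gap = abs(d64[5] - f32[5])                                # still tiny
late_gap = max(abs(a - b) for a, b in zip(d64[60:], f32[60:]))  # order one
```

The per-step rounding error of float32 is amplified exponentially by the positive Lyapunov exponent, which is the same mechanism behind the GPU/CPU discrepancy the authors report.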
25. A program slicing-based method for effective detection of coincidentally correct test cases.
- Author
-
Feyzi, Farid and Parsa, Saeed
- Subjects
- *
EMOTIONAL or Behavior Disorder Scale , *COMPUTER software testing , *COMPUTER software , *GRAPHICS processing units , *CUDA (Computer architecture) - Abstract
Despite the proven applicability of spectrum-based fault localization (SBFL) methods, their effectiveness may be degraded by the presence of coincidental correctness, which occurs when faults fail to propagate, i.e., their execution does not result in failures. This article aims at improving SBFL effectiveness by mitigating the effect of coincidentally correct test cases. Given a test suite in which each test has been classified as failing or passing, and in which each faulty program has a single bug, we present a program slicing-based technique to identify a set of program entities that directly affect the program output when executed with failing test cases, called failure candidate causes (FCC). We then use the FCC set to identify test cases that can be marked as coincidentally correct. These tests are identified based on two heuristics: the average suspiciousness score of the statements that directly affect the program output, and the coverage ratio of those statements. To evaluate our approach, we used several evaluation metrics and conducted extensive experiments on programs containing single and multiple bugs, including both real and seeded faults. The empirical results demonstrate that the proposed heuristics can alleviate the coincidental correctness problem and improve the accuracy of SBFL techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
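To make the heuristic concrete, the sketch below uses the Ochiai formula, a common SBFL suspiciousness metric (an assumption; the abstract does not commit to a specific metric), and flags passing tests whose covered statements are, on average, highly suspicious. The coverage sets stand in for the slicing-derived FCC set, and all test data and the threshold are hypothetical.

```python
import math

def ochiai(stmt, failing, passing):
    """Ochiai suspiciousness: ef / sqrt(total_failing * (ef + ep))."""
    ef = sum(1 for cov in failing if stmt in cov)  # failing tests covering stmt
    ep = sum(1 for cov in passing if stmt in cov)  # passing tests covering stmt
    if ef == 0:
        return 0.0
    return ef / math.sqrt(len(failing) * (ef + ep))

# Hypothetical coverage data; statement 's3' is the (single) fault.
failing = [{'s1', 's3'}, {'s2', 's3'}]
passing = [{'s1', 's2'}, {'s1', 's3'}]  # the second test covers the fault yet passes

sus = {s: ochiai(s, failing, passing) for s in {'s1', 's2', 's3'}}

def avg_suspiciousness(cov):
    return sum(sus[s] for s in cov) / len(cov)

# Passing tests whose covered statements look suspicious on average are
# candidates for coincidental correctness (threshold chosen for illustration).
flagged = [cov for cov in passing if avg_suspiciousness(cov) > 0.55]
```

Relabeling or removing the flagged tests before recomputing suspiciousness is what lets the SBFL ranking recover from the diluted scores.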
26. Parallelized Implementation of Dynamical Particle System.
- Author
-
Mašek, Jan, Frantík, Petr, and Vořechovský, Miroslav
- Subjects
- *
DYNAMICAL systems , *SIMULATION methods & models , *CUDA (Computer architecture) , *APPLIED mathematics , *COMPUTER architecture - Abstract
The paper presents approaches to implementing the solution of a discrete dynamical system of mutually repelling particles. Two platforms, a single-threaded Java process and a parallelized CUDA C solution, are employed for the dynamical simulation. The qualities of both platforms are discussed and explained, and their performance in solving two proposed interaction laws is compared. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
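A minimal serial sketch of such a dynamical particle system, assuming an inverse-square repulsion law and damped explicit Euler integration (the article's two interaction laws and integrator are not specified in the abstract):

```python
def step(pos, vel, dt=0.01, k=1.0, damping=0.98):
    """One damped explicit Euler step for mutually repelling particles in 2D."""
    n = len(pos)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):          # O(n^2) pairwise repulsion
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d2 = dx * dx + dy * dy + 1e-12
            d = d2 ** 0.5
            f = k / d2          # inverse-square law (assumed)
            fx, fy = f * dx / d, f * dy / d
            forces[i][0] += fx; forces[i][1] += fy
            forces[j][0] -= fx; forces[j][1] -= fy
    for i in range(n):
        vel[i][0] = damping * (vel[i][0] + dt * forces[i][0])
        vel[i][1] = damping * (vel[i][1] + dt * forces[i][1])
        pos[i][0] += dt * vel[i][0]
        pos[i][1] += dt * vel[i][1]

# Two particles released at rest drift apart under repulsion.
pos = [[0.0, 0.0], [0.5, 0.0]]
vel = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(100):
    step(pos, vel)
gap = pos[1][0] - pos[0][0]
```

The O(n²) pairwise force loop is the part that maps naturally onto CUDA threads in a parallelized variant.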
27. Study of Homogeneity and Inhomogeneity Phantom in CUDA EGS for Small Field Dosimetry.
- Author
-
Yani, Sitti, Rhani, Mohamad Fahdillah, Haryanto, Freddy, and Arif, Idam
- Subjects
- *
IMAGING phantoms , *RADIATION dosimetry , *CUDA (Computer architecture) , *X-ray imaging , *MONTE Carlo method - Abstract
CUDA EGS is a CUDA implementation for simulating photon transport in a material, based on a Monte Carlo algorithm for X-ray imaging. The objective of this study was to investigate the effect of inhomogeneities in an inhomogeneous phantom for small-field dosimetry (1×1, 2×2, 3×3, 4×4 and 5×5 cm²). Two phantoms, a homogeneous and an inhomogeneous one, were used. The interactions in both phantoms were dominated by Compton interaction and multiple scattering. CUDA EGS can represent the inhomogeneity effect in small-field dosimetry by comparing the grayscale curves of the homogeneous and inhomogeneous phantoms. The grayscale curve of the inhomogeneous phantom is not symmetric because of the presence of different materials in the phantom. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
28. GPU-BASED HEAT TRANSFER MODEL.
- Author
-
Szénási, Sándor and Felde, Imre
- Subjects
- *
HEAT transfer , *BOUNDARY value problems , *HEURISTIC algorithms , *CUDA (Computer architecture) , *MATERIALS science - Abstract
This paper presents a novel heat transfer model for the estimation of heat transfer coefficients without any information about the thermal boundary conditions. The massively parallel implementation of the model (using the latest CUDA 8.0 framework and NVIDIA graphics cards) makes it possible to use multiple graphics accelerators to execute the heat transfer simulations. This leads to very impressive performance, making it possible to run thousands of simulations within a reasonable time. There are several problems in the field of materials science without any known analytical solution, where the only possible way is to use heuristic searches (Genetic Algorithms, Particle Swarm Optimisation, Fireworks, etc.). In some of these cases, the fitness/cost function calculation needs a heat transfer simulation to generate a cooling curve, which is usually the most resource-consuming part of the algorithm. Our novel heat transfer model makes it possible to use these methods efficiently. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
29. ASSESSMENT OF GPU COMPUTATIONAL ENHANCEMENT TO WAVE FIELD MODELING.
- Author
-
Gwizdz, Damian and Franczyk, Anna
- Subjects
- *
PERFORMANCE evaluation , *GRAPHICS processing units , *SOUND waves , *FINITE difference method , *CUDA (Computer architecture) - Abstract
In this paper, we present a performance assessment of the implementation of a wave-source location algorithm based on wave field modelling on Graphics Processing Units (GPUs). The numerical algorithm of wave-source location, based on the acoustic wave equation, was implemented using the finite difference method. To achieve better wave-source location, the algorithm was implemented on staggered grid meshes. The parallelization was carried out using a GPU approach built on NVIDIA's Compute Unified Device Architecture (CUDA). Several tests were made to establish the efficiency of the proposed parallelization method. Performance measurements were carried out to determine both the scalability of a single modelling step and the performance of the entire application. Tests were performed not only for different sizes of the computational model but also for different GPU cards. This type of testing allowed a reliable assessment of the computational enhancement provided by the GPU implementation of the wave field modelling algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
31. On Some Questions in Computer Modeling of the Reachability Sets Constructing Problems.
- Author
-
Ushakov, V. N., Parshikov, G. V., and Matviychuk, A. R.
- Subjects
- *
REACHABLE sets (Set theory) , *PROBLEM solving , *APPROXIMATION theory , *GRAPHICS processing units , *CUDA (Computer architecture) - Abstract
The research considers the problem of constructing the reachability sets of a non-linear dynamical system in n-dimensional Euclidean space on a fixed time interval. Approximate methods for constructing reachability sets are considered, and accuracy estimates for these methods are given. The research includes computational experiments on computer modeling of the described construction methods, using algorithms implemented for two computing technologies: CPU and GPU (using CUDA technology). The approaches to computer modeling of the problem are described and compared, and the CPU-based modeling results are compared with those obtained on the GPU with CUDA. The research also discusses some side issues that appeared during computer modeling and algorithm implementation, as well as ways to eliminate these issues or reduce their impact. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
32. MSGD: A Novel Matrix Factorization Approach for Large-Scale Collaborative Filtering Recommender Systems on GPUs.
- Author
-
Li, Hao, Li, Kenli, An, Jiyao, and Li, Keqin
- Subjects
- *
GRAPHICS processing units , *FACTORIZATION of operators , *CUDA (Computer architecture) , *COMPUTER architecture , *FACTORIZATION , *STOCHASTIC processes - Abstract
Real-time accurate recommendation of large-scale recommender systems is a challenging task. Matrix factorization (MF), as one of the most accurate and scalable techniques to predict missing ratings, has become popular in the collaborative filtering (CF) community. Currently, stochastic gradient descent (SGD) is one of the most famous approaches for MF. However, it is non-trivial to parallelize SGD for large-scale CF MF problems due to the dependence on the user and item pair, which can cause parallelization over-writing. To remove the dependence on the user and item pair, we propose a multi-stream SGD (MSGD) approach, for which the update process is theoretically convergent. On that basis, we propose a Compute Unified Device Architecture (CUDA) parallelization MSGD (CUMSGD) approach. CUMSGD can obtain high parallelism and scalability on [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
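The per-rating SGD update that MSGD parallelizes can be written serially as below. The tiny rating list and hyperparameters are hypothetical, and the sketch keeps the user-item pair dependence that the paper's multi-stream scheme is designed to remove.

```python
import random

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Plain serial SGD for matrix factorization: r_ui ~ P[u] . Q[i]."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)  # this coupled pair update is
                Q[i][f] += lr * (e * pu - reg * qi)  # the dependence MSGD removes
    return P, Q

def predict(P, Q, u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Hypothetical (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 2.0)]
P, Q = sgd_mf(ratings, 3, 3)
rmse = (sum((r - predict(P, Q, u, i)) ** 2
            for u, i, r in ratings) / len(ratings)) ** 0.5
```

Two updates touching the same user row or item column cannot safely run concurrently, which is exactly the over-writing problem the abstract mentions.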
33. Research on the simulation of PF-LBM model based on MPI+CUDA mixed granularity parallel.
- Author
-
Zhu, Changsheng, Liu, Jieqiong, Feng, Li, and Deng, Xin
- Subjects
- *
SIMULATION methods & models , *LEAN body mass , *GRANULAR materials , *CUDA (Computer architecture) , *MAGNETIC particle imaging - Abstract
A microstructure numerical model is a computationally intensive problem, for which simulation times are long and simulation scales small. To address these two problems, we use MPI+CUDA hybrid-granularity heterogeneous parallel computing to implement a dendrite growth simulation with a 3D PF-LBM phase-field model. The Message Passing Interface (MPI) is used for coarse-granularity division, to break through the limitation on simulation scale of a single machine. Within each node, fine-grained division is implemented with Compute Unified Device Architecture (CUDA) parallelism to realize complete intra-node parallelism and improve overall computational efficiency. In addition, a "pseudo three-dimensional array" programming method is introduced for CUDA programming, and the CUDA random number generation method is improved, in order to simplify CUDA array programming and reduce random number generation time. Experiments show that, at the same simulation scale, the speed-up ratio with 21-node MPI+CUDA was 57, a 54% increase over 21-node MPI alone. With comparable computing efficiency, the largest simulation scale with 21-node MPI+CUDA was 420³, which is 13 times that of a single GPU. Therefore, the MPI+CUDA hybrid-granularity parallel method proposed in this paper combines the high computational efficiency of the GPU with the scale expansion of MPI. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
34. Quad-RRT: A real-time GPU-based global path planner in large-scale real environments.
- Author
-
Hidalgo-Paniagua, Alejandro, Bandera, Juan Pedro, Ruiz-de-Quintanilla, Manuel, and Bandera, Antonio
- Subjects
- *
GRAPHICS processing units , *ROBOTIC path planning , *CUDA (Computer architecture) , *DISTRIBUTED computing , *COMPUTER algorithms - Abstract
During the last decade, sampling-based methods for motion and path planning have gained increasing interest. Specifically, in the field of robotics, approaches based on the Rapidly-exploring Random Tree (RRT) algorithm have become the customary technique for solving the single-query motion planning problem. However, dynamic large maps still represent a challenging scenario for these methods to produce fast enough results. Taking advantage of an NVidia CUDA-enabled Graphics Processing Unit (GPU), we present quad-RRT, an extension of the bi-directional strategy to speed up the RRT when dealing with large-scale, bidimensional (2D) maps. Designed for modern GPUs, quad-RRT computes four trees instead of the two built by bidirectional approaches. This modification aims at balancing the direct searching ability of these methods with the parallel exploration of the parts of the map on both sides of the path joining the initial and goal poses. Experimental results demonstrate that the proposed algorithm provides a significant speedup when dealing with large-scale maps densely populated by obstacles, compared to other implementations of the RRT. Hence, the algorithm can have a high impact in the field of inspection path planning for distributed infrastructure. It is also a promising approach to allow new-generation robots, designed to work in unconstrained environments, to dynamically plan large-scale paths. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
35. Stochastic first passage time accelerated with CUDA.
- Author
-
Pierro, Vincenzo, Troiano, Luigi, Mejuto, Elena, and Filatrella, Giovanni
- Subjects
- *
CUDA (Computer architecture) , *GRAPHICS processing units , *JOSEPHSON junctions , *ATOMIC force microscopy , *STOCHASTIC analysis - Abstract
The time to pass a threshold, estimated by the numerical integration of stochastic trajectories, is an interesting physical quantity, for instance in Josephson junctions and atomic force microscopy, where the full trajectory is not accessible. We propose an algorithm suitable for efficient implementation on graphics processing units in a CUDA environment. The proposed approach, for well-balanced loads, achieves almost perfect scaling with the number of available threads and processors, and allows an acceleration of about 400× with a GTX980 GPU with respect to a standard multicore CPU. This method allows off-the-shelf GPUs to tackle problems that are otherwise prohibitive, such as thermal activation in slowly tilted potentials. In particular, we demonstrate that it is possible to simulate the switching current distributions of Josephson junctions on the timescale of actual experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
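A serial sketch of the estimator that the paper accelerates: Euler-Maruyama integration of many stochastic trajectories, each stopped at its first threshold crossing. Drifted Brownian motion is used here instead of the Josephson washboard potential (an assumption, chosen because its mean first-passage time to level b is exactly b/μ, which makes the sketch checkable).

```python
import random

def first_passage(mu, sigma, b, dt=1e-3, rng=None):
    """Euler-Maruyama for dx = mu*dt + sigma*dW, stopped at the barrier b."""
    rng = rng or random.Random()
    x, t = 0.0, 0.0
    while x < b:
        x += mu * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
    return t

rng = random.Random(42)
times = [first_passage(1.0, 0.5, 1.0, rng=rng) for _ in range(500)]
mean_fpt = sum(times) / len(times)  # theory: b/mu = 1.0 for drifted Brownian motion
```

Since each trajectory is independent and only its exit time is kept, one CUDA thread per trajectory gives the embarrassingly parallel, well-balanced load the abstract describes.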
36. GPU-Accelerated Boussinesq Model Using Compute Unified Device Architecture FORTRAN.
- Author
-
Kim, Boram, Oh, Chanyoung, Yi, Youngmin, and Kim, Dae-Hong
- Subjects
- *
GRAPHICS processing units , *CUDA (Computer architecture) , *BOUSSINESQ equations , *COMPUTER architecture , *COMPUTING platforms - Abstract
ABSTRACT Kim, B.; Oh, C.; Yi, Y., and Kim, D.-H., 2018. GPU-accelerated Boussinesq model using Compute Unified Device Architecture FORTRAN. In: Shim, J.-S.; Chun, I., and Lim, H.S. (eds.), Proceedings from the International Coastal Symposium (ICS) 2018 (Busan, Republic of Korea). Journal of Coastal Research, Special Issue No. 85, pp. 1176–1180. Coconut Creek (Florida), ISSN 0749-0208. Graphics Processing Units (GPUs) have a large number of arithmetic units, and their associated structures, specialized for graphics processing, make their computational performance much faster than that of Central Processing Units (CPUs). Nowadays, many numerical models implemented in FORTRAN are applied to real field-scale problems, which require huge computational resources and long simulation times. In this study, a GPU version of a Boussinesq equation model was implemented using Compute Unified Device Architecture (CUDA) FORTRAN. The computed results of the GPU-CUDA FORTRAN Boussinesq model were verified by comparison with those of a CPU-based Boussinesq model that had already been verified on many benchmark tests. Exact agreement, up to round-off magnitude, was observed in the comparison. The GPU-CUDA FORTRAN Boussinesq model was about 20 times faster than the CPU-based code. In addition, as the computational domain becomes larger, the computational advantage of the GPU-CUDA FORTRAN version over the CPU version increases further. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
37. A modified ZS thinning algorithm by a hybrid approach.
- Author
-
Ben Boudaoud, Lynda, Solaiman, Basel, and Tari, Abdelkamel
- Subjects
- *
THINNING algorithms , *ALGORITHMS , *IMAGE processing , *GRAPHICS processing units , *CUDA (Computer architecture) - Abstract
Thinning is one of the most important techniques in the field of image processing. It is applied to erode the image of an object layer by layer until a skeleton is left. Several thinning algorithms for obtaining the skeleton of a binary image have already been proposed in the literature. This paper investigates several well-known parallel thinning algorithms and proposes a modified version of the most widely used one, the ZS algorithm. The proposed modified ZS (MZS) algorithm is implemented and compared against seven existing algorithms. Experimental results and performance evaluation, using different image databases, confirm the proposed MZS algorithm's improvement over the seven examined algorithms in terms of both result quality and computational speed. Moreover, for an efficient implementation on Graphics Processing Units, a parallel model of the MZS algorithm is proposed, using the Compute Unified Device Architecture (CUDA) as a parallel programming model. Evaluation results show that the parallel version of the proposed algorithm is, on average, more than 21 times faster than the traditional sequential CPU version. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
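For reference, the original ZS algorithm that the proposed MZS modifies can be sketched as follows (this is the classic baseline, not the paper's MZS variant). Each of the two sub-iterations marks deletable pixels from the same snapshot and only then deletes them, which is what makes the scheme naturally parallel:

```python
def zhang_suen(image):
    """Classic two-subiteration Zhang-Suen thinning on a 0/1 raster."""
    h, w = len(image), len(image[0])
    img = [row[:] for row in image]

    def neighbours(y, x):
        # P2..P9, clockwise starting from the north neighbour
        return [img[y-1][x], img[y-1][x+1], img[y][x+1], img[y+1][x+1],
                img[y+1][x], img[y+1][x-1], img[y][x-1], img[y-1][x-1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            marks = []
            for y in range(1, h - 1):
                for x in range(1, w - 1):
                    if img[y][x] != 1:
                        continue
                    P = neighbours(y, x)
                    B = sum(P)                                   # non-zero neighbours
                    A = sum(1 for i in range(8)                  # 0->1 transitions
                            if P[i] == 0 and P[(i + 1) % 8] == 1)
                    if step == 0:
                        cond = P[0]*P[2]*P[4] == 0 and P[2]*P[4]*P[6] == 0
                    else:
                        cond = P[0]*P[2]*P[6] == 0 and P[0]*P[4]*P[6] == 0
                    if 2 <= B <= 6 and A == 1 and cond:
                        marks.append((y, x))   # mark now, delete after the scan
            for y, x in marks:
                img[y][x] = 0
                changed = True
    return img

# A 3-pixel-thick horizontal bar thins down to a line in its middle row.
bar = [[0] * 10 for _ in range(7)]
for y in (2, 3, 4):
    for x in range(2, 8):
        bar[y][x] = 1
thin = zhang_suen(bar)
```

Because each mark decision reads only the previous snapshot, the double pixel loop maps directly onto one CUDA thread per pixel.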
38. High performance filtering for big datasets from Airborne Laser Scanning with CUDA technology.
- Author
-
Błaszczak-Bąk, Wioleta, Janowski, Artur, and Srokosz, Piotr
- Subjects
- *
CLOUD computing , *INFORMATION filtering , *BIG data , *OPTICAL scanners , *CUDA (Computer architecture) - Abstract
There are many studies on the problems of processing the big datasets provided by Airborne Laser Scanning (ALS). The processing of point clouds is often executed in stages or on fragments of the measurement set. Therefore, solutions that enable processing the entire cloud at once in a simple, fast, efficient way are the subject of much research. In this paper, the authors propose to use General-Purpose computation on Graphics Processing Units (GPGPU) to process the big datasets obtained from ALS. GPGPU performs general-purpose computation on GPUs (Graphics Processing Units), hardware originally designed for computer graphics. This study was based on the Compute Unified Device Architecture (CUDA) programming model, which facilitates the development of applications for GPUs. CUDA programming was used to carry out filtration based on the adaptive TIN model method in the initial stage of processing a big ALS dataset. Results of the analysis showed that GPGPU can be used for the filtration of ALS point clouds and significantly speeds up calculations for big datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
39. Almost optimal column-wise prefix-sum computation on the GPU.
- Author
-
Tokura, Hiroki, Fujita, Toru, Nakano, Koji, Ito, Yasuaki, and Bordim, Jacir L.
- Subjects
- *
GRAPHICS processing units , *PARALLEL algorithms , *CUDA (Computer architecture) , *IMAGE processing , *EUCLIDEAN distance - Abstract
Row-wise and column-wise prefix-sum computation of a matrix has many applications in the area of image processing, such as computation of the summed-area table and the Euclidean distance map. It is known that the prefix-sums of a one-dimensional array can be computed efficiently on the GPU. Hence, row-wise prefix-sums of a matrix can also be computed efficiently on the GPU by executing this prefix-sum algorithm for every row in parallel. However, the same approach does not work well for computing column-wise prefix-sums, because inefficient strided access to the global memory is performed. The main contribution of this paper is to present an almost optimal column-wise prefix-sum algorithm on the GPU. Quite surprisingly, experimental results using the NVIDIA TITAN X show that our column-wise prefix-sum algorithm runs only 2-6% slower than matrix duplication. Thus, our column-wise prefix-sum algorithm is almost optimal. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
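The access-order point is easy to illustrate on the CPU: the sketch below computes column-wise prefix sums by sweeping row-major, so each pass reads a whole contiguous row, the cache-friendly analogue of coalesced global-memory access (this is an illustrative serial sketch, not the paper's GPU kernel).

```python
def colwise_prefix_sums(mat):
    """Column-wise prefix sums computed in a row-major sweep: every pass touches
    one contiguous row instead of striding down a column."""
    out = [mat[0][:]]
    for row in mat[1:]:
        prev = out[-1]
        out.append([p + v for p, v in zip(prev, row)])
    return out

sums = colwise_prefix_sums([[1, 2], [3, 4], [5, 6]])
# sums == [[1, 2], [4, 6], [9, 12]]
```

A naive column-by-column loop computes the same values but jumps by the row stride on every access, which is the pattern the abstract identifies as the GPU bottleneck.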
40. High-throughput Ant Colony Optimization on graphics processing units.
- Author
-
Cecilia, José M., Llanes, Antonio, Abellán, José L., Gómez-Luna, Juan, Chang, Li-Wen, and Hwu, Wen-Mei W.
- Subjects
- *
ANT algorithms , *GRAPHICS processing units , *TRAVELING salesman problem , *CUDA (Computer architecture) , *PHEROMONES - Abstract
Nowadays, computer researchers can face ever more complex scientific problems by using hardware and software co-design. One successful approach is exploring novel massively parallel nature-inspired algorithms, such as the Ant Colony Optimization (ACO) algorithm, through the exploitation of high-throughput accelerators such as GPUs, which are designed to provide high levels of parallelism and a low energy-per-instruction cost through heavy vectorization. In this paper, we demonstrate how to take advantage of contemporary hardware-based CUDA vectorization to optimize the ACO algorithm when applied to the Traveling Salesman Problem (TSP). Several parallel designs are proposed and analyzed on two different CUDA architectures. Our results reveal that our vectorization approaches can obtain good performance on these architectures. Moreover, atomic operations are also studied, showing good benefits on the latest generations of CUDA architectures. This work lays the groundwork for future developments of the ACO algorithm on high-performance platforms. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
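The ACO loop being vectorized can be sketched serially as a tiny Ant System for the TSP: probabilistic tour construction weighted by pheromone and inverse distance, followed by pheromone evaporation and deposit. Parameter values and the 4-city instance are illustrative assumptions, not the paper's benchmarks.

```python
import math
import random

def aco_tsp(dist, n_ants=20, n_iter=50, alpha=1.0, beta=2.0, rho=0.5, seed=1):
    """Tiny Ant System for the TSP: tours are built city by city with probability
    proportional to tau^alpha * (1/d)^beta; pheromone then evaporates and each
    ant deposits 1/length on the edges of its tour."""
    n = len(dist)
    rng = random.Random(seed)
    tau = [[1.0] * n for _ in range(n)]
    best_tour, best_len = None, float('inf')
    for _ in range(n_iter):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            visited = set(tour)
            while len(tour) < n:
                i = tour[-1]
                cand = [j for j in range(n) if j not in visited]
                w = [(tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta) for j in cand]
                j = rng.choices(cand, weights=w)[0]
                tour.append(j)
                visited.add(j)
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((length, tour))
            if length < best_len:
                best_len, best_tour = length, tour
        for i in range(n):                     # evaporation
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        for length, tour in tours:             # deposit
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                tau[a][b] += 1.0 / length
                tau[b][a] += 1.0 / length
    return best_tour, best_len

# Unit square: the optimal tour is the perimeter, length 4.
S2 = math.sqrt(2.0)
dist = [[0.0, 1.0, S2, 1.0],
        [1.0, 0.0, 1.0, S2],
        [S2, 1.0, 0.0, 1.0],
        [1.0, S2, 1.0, 0.0]]
best_tour, best_len = aco_tsp(dist)
```

The independent per-ant tour constructions and the all-edges pheromone update are the two phases that GPU versions map to threads, with atomics guarding the concurrent deposits.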
41. On GPU-CUDA as preprocessing of fuzzy-rough data reduction by means of singular value decomposition.
- Author
-
Cuomo, Salvatore, Galletti, Ardelio, Marcellino, Livia, Navarra, Guglielmo, and Toraldo, Gerardo
- Subjects
- *
GRAPHICS processing units , *CUDA (Computer architecture) , *FUZZY logic , *SINGULAR value decomposition , *DATA reduction - Abstract
Data reduction algorithms often produce inaccurate results due to the loss of relevant information. Recently, the singular value decomposition (SVD) method has been used as a preprocessing step in order to deal with high-dimensional data and achieve fuzzy-rough reduct convergence on higher-dimensional datasets. Although SVD offers well-known, attractive properties, its high computational cost remains a critical issue. In this work, we present a parallel implementation of the SVD algorithm on graphics processing units using the CUDA programming model. Our approach is based on an iterative parallel version of QR factorization by means of Givens rotations, using the Sameh and Kuck scheme. Our results show significant performance improvements with respect to the CPU version, encouraging its use for this expensive processing of data. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
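The building block of the Givens-rotation QR factorization mentioned above is a 2×2 rotation that zeroes one matrix entry; the Sameh and Kuck scheme orders many such independent rotations so they can be applied in parallel. A serial sketch of one rotation follows (illustrative names, not the authors' CUDA code):

```python
import math

def givens(a, b):
    """Coefficients (c, s) with [[c, s], [-s, c]] applied to (a, b) giving (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def apply_givens(M, i, j, c, s):
    """Rotate rows i and j of M in place."""
    for col in range(len(M[0])):
        mi, mj = M[i][col], M[j][col]
        M[i][col] = c * mi + s * mj
        M[j][col] = -s * mi + c * mj

# Zero out the subdiagonal entry M[1][0]: one elementary QR step.
M = [[3.0, 1.0], [4.0, 2.0]]
c, s = givens(M[0][0], M[1][0])
apply_givens(M, 0, 1, c, s)
# M[1][0] is now numerically zero and M[0][0] holds r = hypot(3, 4) = 5
```

Rotations acting on disjoint row pairs commute, which is what lets the parallel scheme apply whole waves of them simultaneously on the GPU.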
42. Multi-level Fast Multipole Algorithm for 3-D Homogeneous Dielectric Objects Using MPI-CUDA on GPU Cluster.
- Author
-
Tuan Phan, Nghia Tran, and Kilic, Ozlem
- Subjects
- *
GRAPHICS processing units , *CLUSTER analysis (Statistics) , *COMPUTER algorithms , *MESSAGE passing (Computer science) , *CUDA (Computer architecture) , *COMPUTER programming - Abstract
The implementation of the Multi-level Fast Multipole Algorithm (MLFMA) on a 13-node Graphics Processing Unit (GPU) cluster using the Message Passing Interface (MPI) and CUDA programming is presented. The performance achievements are investigated in terms of accuracy, speedup, and scalability. The experimental results demonstrate that our MLFMA implementation on GPUs is much faster (up to 66×) than the CPU implementation, without trading off accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2018
43. CUDA-MPI Implementation of Fast Multipole Method on GPU Clusters for Dielectric Objects.
- Author
-
Nghia Tran, Tuan Phan, and Kilic, Ozlem
- Subjects
- *
CUDA (Computer architecture) , *MESSAGE passing (Computer science) , *GRAPHICS processing units , *DIELECTRIC materials , *FAST multipole method - Abstract
This paper investigates the Fast Multipole Method (FMM) for large-scale electromagnetic scattering problems involving dielectric objects. The algorithm is implemented on a Graphics Processing Unit (GPU) cluster using CUDA programming and the Message Passing Interface (MPI). Its performance is investigated in terms of accuracy, speedup, and scalability. The details of the implementation and the performance achievements are shown and analyzed, demonstrating a scalable parallelization while maintaining a good degree of accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2018
44. Parallel Realization of Element by Element Analysis of Eddy Current Field Based on Graphic Processing Unit.
- Author
-
Dongyang Wu, Xiuke Yan, Renyuan Tang, Dexin Xie, and Ziyan Ren
- Subjects
- *
PARALLEL processing , *EDDY currents (Electric) , *GRAPHICS processing unit performance , *FINITE element method software , *JACOBIAN matrices , *CUDA (Computer architecture) , *COMPUTER software - Abstract
The element-by-element parallel finite element method (EbE-PFEM) applied to engineering eddy current problems is presented in this paper. Unlike the classical finite element method (FEM), the EbE method needs to store only the element matrices, thereby saving memory. The element-by-element conjugate gradient (EbE-CG) method is used to solve the equations, which are discretized at the element level. Considering the ill-conditioned character of the system equations, the highly parallel Jacobi preconditioning (JP) method is used to accelerate convergence. In addition, the process of handling boundary conditions based on EbE theory is introduced. To validate the method, a 2D eddy current problem in the complex frequency domain is used. The numerical analysis is carried out on graphics processing units (GPUs) with a Compute Unified Device Architecture (CUDA) parallel programming model to accelerate the computation. The results demonstrate that the JP method and the GPU platform are effective in solving eddy current fields with improved convergence. [ABSTRACT FROM AUTHOR]
- Published
- 2018
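The Jacobi-preconditioned CG iteration referred to above can be sketched serially as below. For clarity this uses an explicit dense matrix; in EbE-PFEM the matrix-vector product would instead be accumulated element by element, and the small SPD system here is hypothetical.

```python
def jacobi_pcg(A, b, tol=1e-10, max_iter=200):
    """Conjugate gradient with a Jacobi (diagonal) preconditioner on dense lists."""
    n = len(b)
    minv = [1.0 / A[i][i] for i in range(n)]      # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = b[:]                                      # r = b - A*x0 with x0 = 0
    z = [minv[i] * r[i] for i in range(n)]
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [minv[i] * r[i] for i in range(n)]
        rz, rz_old = sum(ri * zi for ri, zi in zip(r, z)), rz
        p = [zi + (rz / rz_old) * pi for zi, pi in zip(z, p)]
    return x

# Hypothetical small SPD system.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = jacobi_pcg(A, b)
residual = max(abs(bi - sum(Ai[j] * x[j] for j in range(3))) for Ai, bi in zip(A, b))
```

The Jacobi preconditioner is attractive on GPUs precisely because applying it is a pointwise product, with no sequential triangular solve.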
45. Accelerating nearest neighbor partitioning neural network classifier based on CUDA.
- Author
-
Wang, Lin, Zhu, Xuehui, Yang, Bo, Guo, Jifeng, Liu, Shuangrong, Li, Meihui, Zhu, Jian, and Abraham, Ajith
- Subjects
- *
ARTIFICIAL neural networks , *NEAREST neighbor analysis (Statistics) , *CUDA (Computer architecture) , *PERFORMANCE evaluation , *GRAPHICS processing units - Abstract
The nearest neighbor partitioning (NNP) method is a high-performance approach for improving traditional neural network classifiers. However, constructing the NNP model is very time-consuming, particularly for large data sets, which limits its range of application. In this study, a parallel NNP method based on the Compute Unified Device Architecture (CUDA) is proposed to accelerate NNP. In this method, blocks are used to evaluate potential neural networks, and threads are used to perform parallel subtasks. Experimental results show that the proposed parallel method improves the performance of the NNP neural network classifier. Furthermore, the application of parallel NNP to the performance evaluation of cement microstructure indicates that the proposed approach performs favorably. [ABSTRACT FROM AUTHOR]
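The parallel pattern described above (one CUDA block per candidate model, threads for subtasks) can be illustrated on the CPU. This is a hypothetical sketch, not the paper's method: the nearest-centroid scoring, the data, and the thread-pool stand-in for CUDA blocks are all illustrative assumptions.

```python
# Hypothetical CPU sketch of the evaluation pattern: many candidate
# nearest-neighbor partitions are scored concurrently (CUDA blocks in the
# paper, a thread pool here), and the best-scoring candidate is kept.
from concurrent.futures import ThreadPoolExecutor

def classify(point, centroids):
    # Label of the nearest centroid; centroids is [(x, y, label), ...].
    return min(centroids,
               key=lambda c: (c[0] - point[0]) ** 2 + (c[1] - point[1]) ** 2)[2]

def accuracy(centroids, data):
    # data is [((x, y), label), ...]; fraction of points labeled correctly.
    hits = sum(classify(p, centroids) == y for p, y in data)
    return hits / len(data)

def best_partition(candidates, data):
    # Score every candidate partition concurrently, return the best index.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda c: accuracy(c, data), candidates))
    return max(zip(scores, range(len(candidates))))[1]
```

On a GPU the same structure becomes one block per candidate and one thread per data point, which is where the reported speedup comes from.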
- Published
- 2018
- Full Text
- View/download PDF
46. ESBMC-GPU: A context-bounded model checking tool to verify CUDA programs.
- Author
-
Monteiro, Felipe R., da S. Alves, Erickson H., Silva, Isabela S., Ismail, Hussama I., Cordeiro, Lucas C., and de Lima Filho, Eddie B.
- Subjects
- *
CUDA (Computer architecture) , *GRAPHICS processing units , *COMPUTER software development , *COMPUTER programming , *PARALLEL computers - Abstract
The Compute Unified Device Architecture (CUDA) is a programming model for exploiting the advantages of graphics processing unit (GPU) devices through parallelization and specialized functions and features. Nonetheless, as on other development platforms, errors may occur during traditional software creation processes and may even compromise the execution of an entire system. To address this problem, ESBMC-GPU was developed as an extension to the Efficient SMT-Based Context-Bounded Model Checker (ESBMC). In summary, ESBMC-GPU processes input code against an abstract representation of the standard CUDA libraries, with the goal of checking a set of desired properties. Experimental results show that ESBMC-GPU correctly verified 85% of the chosen benchmarks and outperformed other existing GPU verifiers in verifying data-race conditions, array out-of-bounds violations, assertive statements, pointer safety, and the use of specific CUDA features. [ABSTRACT FROM AUTHOR]
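The essence of context-bounded checking is exploring all interleavings of concurrent steps up to a bound and testing a property in each. A toy Python enumeration (explicit-state, not ESBMC's SMT encoding; the lost-update example is an assumed illustration, not one of the paper's benchmarks) shows the idea:

```python
# Toy illustration of bounded interleaving exploration: enumerate every
# interleaving of two threads' atomic steps and check a property in each.
from itertools import combinations

def interleavings(n1, n2):
    # Yield schedules as (thread_id, step_index) sequences, by choosing
    # which of the n1 + n2 slots belong to thread 0.
    total = n1 + n2
    for pos in combinations(range(total), n1):
        seq, i1, i2 = [], 0, 0
        for t in range(total):
            if t in pos:
                seq.append((0, i1)); i1 += 1
            else:
                seq.append((1, i2)); i2 += 1
        yield seq

def check_increment_race():
    # Each thread does x = x + 1 as two atomic steps: load, then store.
    # Return a counterexample schedule where final x != 2, else None.
    for seq in interleavings(2, 2):
        x, reg = 0, [0, 0]
        for tid, step in seq:
            if step == 0:
                reg[tid] = x          # load shared x into thread register
            else:
                x = reg[tid] + 1      # store register + 1 back to x
        if x != 2:
            return seq                # lost update found
    return None
```

A model checker like ESBMC-GPU encodes this search symbolically and bounds the number of context switches, which keeps verification tractable for real kernels.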
- Published
- 2018
- Full Text
- View/download PDF
47. Fast GPU-based Monte Carlo code for SPECT/CT reconstructions generates improved Lu images.
- Author
-
Rydén, T., Heydorn Lagerlöf, J., Hemmingsson, J., Marin, I., Svensson, J., Båth, M., Gjertsson, P., and Bernhardt, P.
- Subjects
- *
MONTE Carlo method , *SINGLE-photon emission computed tomography , *GRAPHICS processing units , *TOMOGRAPHY image quality , *CUDA (Computer architecture) - Abstract
Background: Full Monte Carlo (MC)-based SPECT reconstructions have strong potential for correcting image-degrading factors, but the reconstruction times are long. The objective of this study was to develop a highly parallel Monte Carlo code for fast, ordered subset expectation maximization (OSEM) reconstructions of SPECT/CT images. The MC code was written in the Compute Unified Device Architecture language for a computer with four graphics processing units (GPUs) (GeForce GTX Titan X, Nvidia, USA). This enabled simulation of parallel photon emissions from the voxel matrix (128 or 256). Each computed tomography (CT) number was converted to attenuation coefficients for photoabsorption, coherent scattering, and incoherent scattering. For photon scattering, the deflection angle was determined by the differential scattering cross sections. An angular response function was developed and used to model the accepted angles for photon interaction with the crystal, and a detector scattering kernel was used to model photon scattering in the detector. Predefined energy and spatial resolution kernels for the crystal were used. The MC code was implemented in the OSEM reconstruction of clinical and phantom Lu SPECT/CT images. The Jaszczak image quality phantom was used to evaluate the performance of the MC reconstruction in comparison with attenuation-corrected (AC) OSEM reconstructions and attenuation-corrected OSEM reconstructions with resolution recovery correction (RRC). Results: The performance of the MC code was 3200 million photons/s. The number of photons emitted per voxel required to obtain a sufficiently low noise level in the simulated image was 200 for a 128 voxel matrix. With this number of emitted photons per voxel, the MC-based OSEM reconstruction with ten subsets was performed within 20 s/iteration. The images converged after around six iterations, so the reconstruction time was around 3 min. The activity recovery for the spheres in the Jaszczak phantom was clearly improved with MC-based OSEM reconstruction: e.g., it was 88% for the largest sphere, versus 66% for AC-OSEM and 79% for RRC-OSEM. Conclusion: The GPU-based MC code generated an MC-based SPECT/CT reconstruction within a few minutes, and reconstructed patient images of Lu-DOTATATE treatments revealed clearly improved resolution and contrast. [ABSTRACT FROM AUTHOR]
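The expectation-maximization update at the core of OSEM can be sketched compactly. This is the standard MLEM iteration on a tiny fixed system matrix, for illustration only: the paper's code splits projections into subsets (OSEM) and obtains its forward projections from GPU Monte Carlo simulation rather than a stored matrix.

```python
# Minimal MLEM update sketch (OSEM additionally cycles over projection
# subsets; the paper's forward model is Monte Carlo, here A is fixed).

def mlem(A, y, n_iter=50):
    # A[i][j]: probability that an emission in voxel j is detected in bin i.
    # y[i]: measured counts in detector bin i.
    nb, nv = len(A), len(A[0])
    x = [1.0] * nv                                   # uniform initial estimate
    sens = [sum(A[i][j] for i in range(nb)) for j in range(nv)]   # sensitivity
    for _ in range(n_iter):
        # Forward-project the current estimate into the detector bins.
        fp = [sum(A[i][j] * x[j] for j in range(nv)) for i in range(nb)]
        ratio = [y[i] / fp[i] if fp[i] > 0 else 0.0 for i in range(nb)]
        # Back-project the measured/estimated ratio and apply it
        # multiplicatively, normalized by the voxel sensitivity.
        bp = [sum(A[i][j] * ratio[i] for i in range(nb)) for j in range(nv)]
        x = [x[j] * bp[j] / sens[j] for j in range(nv)]
    return x
```

Because the update preserves non-negativity and total counts, it is well suited to the Poisson statistics of SPECT data; the MC forward projector simply replaces `A` with simulated photon transport.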
- Published
- 2018
- Full Text
- View/download PDF
48. Vision‐Based Natural Frequency Identification Using Laser Speckle Imaging and Parallel Computing.
- Author
-
Park, KyeongTaek, Torbol, Marco, and Kim, Sehwan
- Subjects
- *
GRAPHICS processing units , *CENTRAL processing units , *DETECTORS , *ALGORITHMS , *CUDA (Computer architecture) , *INTEGRATED circuits - Abstract
This study focuses on the identification of the natural frequencies of structures through analysis of the speckle pattern that a laser creates and a camera records. The laser pointer spreads its light over a target area on the structure and creates the speckle pattern. Ambient vibrations affect the pattern, and the camera records the changes. The stream of images is fed into a graphics processing unit (GPU). The developed parallel code includes several algorithms: speckle contrast imaging (SCI), speckle flow imaging (SFI), and an innovative application of k-means clustering that uses the geometric centroid of each cluster as a virtual sensor. Tracking each centroid through the images in the time domain creates a vibration signal. The signals from different clusters are processed together to extract the natural frequencies of the structure. This study applies the proposed method to different sample structures, both in the laboratory and in the field, to demonstrate that the obtained signals are reliable and easy to handle. The GPU technology enhances the performance of the entire method and allows real-time processing. All these features create an inexpensive, portable, and efficient tool to inspect any structure or its components. [ABSTRACT FROM AUTHOR]
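The virtual-sensor idea reduces to a simple pipeline: a centroid coordinate tracked over frames forms a time signal, and the dominant spectral peak gives a natural frequency. A small sketch on synthetic data (the signal, frame rate, and plain DFT are assumptions for illustration; the paper processes real cluster centroids on the GPU):

```python
# Sketch: a tracked centroid coordinate over frames -> dominant frequency.
import cmath
import math

def dominant_frequency(signal, fps):
    # Return the frequency (Hz) of the largest-magnitude DFT bin.
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]        # remove the DC offset
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):                   # one-sided spectrum
        X = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        if abs(X) > best_mag:
            best_k, best_mag = k, abs(X)
    return best_k * fps / n                      # bin index -> Hz

# Synthetic centroid y-coordinate oscillating at 5 Hz, sampled at 100 fps.
frames = [240.0 + 3.0 * math.sin(2 * math.pi * 5.0 * t / 100.0)
          for t in range(200)]
```

The GPU version parallelizes the clustering and the per-cluster transforms, which is what makes real-time operation feasible.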
- Published
- 2018
- Full Text
- View/download PDF
49. Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces.
- Author
-
Eghtesad, Adnan, Knezevic, Marko, Germaschewski, Kai, Beyerlein, Irene J., and Hunter, Abigail
- Subjects
- *
GRAPHICS processing units , *CUDA (Computer architecture) , *FOURIER transforms , *HIGH performance computing , *CENTRAL processing units - Abstract
We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of portable OpenACC directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library, CUFFT, to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable owing to OpenACC-CUDA interoperability, in which calls to CUDA functions are replaced with OpenACC data regions for the host central processing unit (CPU) and the device (GPU). A comprehensive benchmark study has been conducted comparing several FFT routines: the Numerical Recipes FFT (FOURN), the Fastest Fourier Transform in the West (FFTW), and CUFFT, the last of which exploits the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified against the analytical solution for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-core Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code. [ABSTRACT FROM AUTHOR]
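All three benchmarked routines (FOURN, FFTW, CUFFT) implement variants of the same divide-and-conquer recursion. A minimal radix-2 Cooley-Tukey FFT in Python (an illustrative sketch, not any of those libraries' code) shows the structure the libraries optimize:

```python
# Minimal radix-2 decimation-in-time Cooley-Tukey FFT.
import cmath

def fft(x):
    # len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])   # recurse on even/odd samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + w                 # butterfly combine
        out[k + n // 2] = even[k] - w
    return out
```

Production libraries replace the recursion with iterative, cache- and hardware-aware schedules; CUFFT additionally maps the butterflies onto thousands of GPU threads, which is the source of the speedups reported above.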
- Published
- 2018
- Full Text
- View/download PDF
50. High Performance Exact Triangle Counting on GPUs.
- Author
-
Bisson, Mauro and Fatica, Massimiliano
- Subjects
- *
GRAPHICS processing units , *BIG data , *CUDA (Computer architecture) , *COMPUTER algorithms , *KERNEL operating systems - Abstract
This paper presents a GPU implementation of the graph triangle counting operation based on the set intersection algorithm. The algorithm is implemented in four kernels optimized for different types of graphs, in code delivering performance higher than the current state of the art without preprocessing the input graph. At runtime, a lightweight heuristic selects which kernel to run based on the specific input graph. In contrast to previous works, the presented approach takes advantage of a set intersection operation implemented via bitmaps. Moreover, the simplicity of the approach keeps the code small and of limited engineering complexity. [ABSTRACT FROM AUTHOR]
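The bitmap set intersection at the heart of the approach can be sketched in a few lines of Python. Here each adjacency row is a single arbitrary-precision integer used as a bitmask; the paper's kernels do the equivalent with fixed-width bitmap words processed by GPU threads, so this is an illustration of the idea, not the paper's code.

```python
# Triangle counting via bitmap set intersection: a popcount of the AND of
# two neighbor bitmaps counts the common neighbors closing triangles.

def count_triangles(edges, n):
    # edges: undirected edge list without self-loops; n: number of vertices.
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v                  # set bit v in u's neighbor bitmap
        adj[v] |= 1 << u
    total = 0
    for u, v in edges:
        # Common neighbors of u and v each close one triangle over (u, v).
        total += bin(adj[u] & adj[v]).count("1")
    return total // 3                     # each triangle is seen once per edge
```

The per-edge intersections are independent, so on a GPU each edge (or each bitmap word) maps naturally to a thread, with the popcount done in hardware.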
- Published
- 2017
- Full Text
- View/download PDF