148 results for "Prefix sum"
Search Results
2. When does 0–1 Principle Hold for Prefix Sums?
- Author
-
Morihata, Akimasa
- Subjects
- SUFFIXES & prefixes (Grammar), PROGRAMMING languages, BINARY sequences
- Abstract
Knuth's 0–1 principle argues that the correctness of any swap-based sorting network can be verified by testing arbitrary sequences over Boolean values (i.e., 0 and 1). Voigtländer (Proceedings of the 35th ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL 2008, San Francisco, California, USA, January 7–12, 2008. ACM, New York, NY, pp 29–35, 2008. https://doi.org/10.1145/1328438.1328445) proved a similar result for prefix-sum networks that consist of associative binary operators: the correctness can be verified by testing arbitrary sequences and associative binary operators over three values, namely 0, 1, and 2. He raised the question of whether testing over Boolean values is sufficient if the binary operator is idempotent in addition to associative. This paper answers his question. First, there is an incorrect prefix-sum network for associative idempotent operators, the flaw of which cannot be detected by testing over Boolean values. Second, testing over Boolean values is sufficient if the binary operators are restricted to commutative in addition to associative and idempotent. [ABSTRACT FROM AUTHOR]
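The testing principle in this abstract can be sketched with a toy checker: represent a prefix-sum network as a sequence of combine steps and verify it exhaustively over short sequences drawn from a small alphabet. The network encoding, the function names, and the choice of `max` as a sample associative, idempotent, commutative operator are illustrative assumptions, not the paper's formalism (which quantifies over all associative operators).

```python
from itertools import product

def run_network(network, xs, op):
    """Apply a prefix-sum network: each step (i, j) with i < j
    replaces xs[j] by op(xs[i], xs[j])."""
    xs = list(xs)
    for i, j in network:
        xs[j] = op(xs[i], xs[j])
    return xs

def reference_scan(xs, op):
    """Sequential inclusive scan used as the correctness reference."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def verify(network, n, alphabet, op):
    """Exhaustively test the network over all length-n sequences
    drawn from `alphabet`."""
    return all(
        run_network(network, xs, op) == reference_scan(xs, op)
        for xs in product(alphabet, repeat=n)
    )

# A correct serial prefix-sum network for 4 inputs.
serial4 = [(0, 1), (1, 2), (2, 3)]
```

For instance, `verify(serial4, 4, (0, 1, 2), max)` holds, while a network that skips position 2, such as `[(0, 1), (2, 3), (1, 3)]`, already fails on Boolean inputs.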
- Published
- 2023
- Full Text
- View/download PDF
3. Formal verification of parallel prefix sum and stream compaction algorithms in CUDA.
- Author
-
Safari, Mohsen and Huisman, Marieke
- Subjects
- COMPACTING, PARALLEL algorithms, SUFFIXES & prefixes (Grammar), HIGH performance computing, ROCK glaciers, ALGORITHMS
- Abstract
• We add CUDA verification support to the VerCors verifier. • We verify data race-freedom and functional correctness of CUDA implementations of two parallel prefix sum algorithms. • We verify data race-freedom and functional correctness of CUDA implementations of stream compaction algorithm. GPUs are an important part of any High Performance Computing (HPC) architecture. To make optimal use of the specifics of a GPU architecture, we need programming models that naturally support the parallel execution model of a GPU. CUDA and OpenCL are two widely used examples of such programming models. Furthermore, we also need to redesign algorithms such that they adhere to this parallel programming model, and we need to be able to prove the correctness of these redesigned algorithms. In this paper we study two examples of such parallelized algorithms, and we discuss how to prove their correctness (data race freedom and (partial) functional correctness) using the VerCors program verifier. First of all, we prove the correctness of two parallel algorithms solving the prefix sum problem. Second, we show how such a prefix sum algorithm is used as a basic block in a stream compaction algorithm, and we prove correctness of this stream compaction algorithm, taking advantage of the earlier correctness proof for the prefix sum algorithm. The proofs as described in this paper are developed over the CUDA implementations of these algorithms. In earlier work, we had already shown correctness of a more high-level version of the algorithm. This paper discusses how we add support to reason about CUDA programs in VerCors, and it then shows how we can redo the verification at the level of the CUDA code. We also discuss some practical challenges that we had to address to prove correctness of the actual CUDA-level verifications. [ABSTRACT FROM AUTHOR]
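The algorithms verified in the paper are CUDA kernels; as an illustrative stand-in, the sketch below contrasts a sequential specification with a Blelloch-style work-efficient exclusive scan (the classic up-sweep/down-sweep scheme, here run sequentially). Agreement of the two is the kind of functional-correctness property such a proof establishes. All names are assumptions of this sketch, not the authors' code.

```python
def exclusive_scan_spec(xs):
    """Sequential specification: out[i] = sum of xs[0..i-1]."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def blelloch_scan(xs):
    """Work-efficient exclusive scan (up-sweep + down-sweep).
    Length must be a power of two. The inner loops' iterations are
    mutually independent, which is what a CUDA kernel runs in parallel."""
    a, n = list(xs), len(xs)
    d = 1
    while d < n:                       # up-sweep (reduce)
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2
    a[n - 1] = 0
    d = n // 2
    while d >= 1:                      # down-sweep
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a
```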
- Published
- 2022
- Full Text
- View/download PDF
4. Work-Stealing Prefix Scan: Addressing Load Imbalance in Large-Scale Image Registration.
- Author
-
Copik, Marcin, Grosser, Tobias, Hoefler, Torsten, Bientinesi, Paolo, and Berkels, Benjamin
- Subjects
- IMAGE registration, SUFFIXES & prefixes (Grammar), ELECTRON microscopy, ALGORITHMS, NANOSTRUCTURED materials, MECHANICAL properties of condensed matter
- Abstract
Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this article, we study the recursive registration of a series of electron microscopy images, a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce the time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling the derivation of material properties at the nanoscale for long microscopy image series. [ABSTRACT FROM AUTHOR]
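The core trick of recasting a seemingly sequential recursion as a prefix scan can be sketched on a toy example: a linear recurrence x_k = a_k * x_(k-1) + b_k becomes a scan over (a, b) pairs under an associative composition operator. The registration operator in the paper is more complex; this affine version is only an assumed stand-in for the principle.

```python
def compose(f, g):
    """Compose affine maps x -> a*x + b, represented as (a, b).
    Applying f then g gives g(f(x)); this composition is associative,
    which is what lets a sequential recurrence become a parallel scan."""
    (a1, b1), (a2, b2) = f, g
    return (a2 * a1, a2 * b1 + b2)

def scan(xs, op):
    """Sequential inclusive scan under an arbitrary associative op."""
    out, acc = [xs[0]], xs[0]
    for x in xs[1:]:
        acc = op(acc, x)
        out.append(acc)
    return out

# Recurrence x_k = a_k * x_{k-1} + b_k with x_0 given.
steps = [(2, 1), (1, 3), (3, 0), (1, -2)]  # (a_k, b_k) pairs
x0 = 5

# Sequential evaluation...
xs_seq, x = [], x0
for a, b in steps:
    x = a * x + b
    xs_seq.append(x)

# ...equals applying the scanned (composed) maps to x0.
xs_scan = [a * x0 + b for a, b in scan(steps, compose)]
```

Because `compose` is associative, the scan itself could be computed with any parallel scan algorithm.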
- Published
- 2022
- Full Text
- View/download PDF
5. Computer vision algorithms acceleration using graphic processors NVIDIA CUDA.
- Author
-
Afif, Mouna, Said, Yahia, and Atri, Mohamed
- Subjects
- COMPUTER vision, COMPUTER algorithms, CENTRAL processing units, PROGRAMMING languages, APPLICATION software, GRAPHICS processing units
- Abstract
Using graphics processing units (GPUs) alongside the central processing unit to accelerate algorithms and applications that demand extensive computational resources has become a major trend in recent years. In this paper, we propose a GPU-accelerated method to parallelize different computer vision tasks. We report on parallelism and acceleration in computer vision applications, give an overview of the NVIDIA CUDA GPU programming model, and then examine the GPU architecture and the optimizations used for time-consuming stages. We introduce high-speed computer vision algorithms implemented with NVIDIA's programming framework, the Compute Unified Device Architecture (CUDA). We achieve significant accelerations for our computer vision algorithms and demonstrate that using CUDA as a GPU programming language can improve efficiency and speedups; for some tasks and image sizes, the measured speedup factors reach 8061, 5991, and 722. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
6. Optimal Parallel Prefix Sum Computation on Three-Dimensional Torus Network
- Author
-
Gupta, Ashish
- Published
- 2021
- Full Text
- View/download PDF
7. Constant Rounds Almost Linear Complexity Multi-party Computation for Prefix Sum
- Author
-
Ohara, Kazuma, Ohta, Kazuo, Suzuki, Koutarou, Yoneyama, Kazuki, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Kobsa, Alfred, editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Weikum, Gerhard, editor, Pointcheval, David, editor, and Vergnaud, Damien, editor
- Published
- 2014
- Full Text
- View/download PDF
8. Formal verification of parallel prefix sum and stream compaction algorithms in CUDA
- Author
-
Mohsen Safari and Marieke Huisman (Formal Methods and Tools, Digital Society Institute)
- Subjects
- Stream compaction, Separation logic, General Computer Science, Deductive verification, Prefix sum, UT-Hybrid-D, CUDA, GPU verification, Theoretical Computer Science
- Abstract
GPUs are an important part of any High Performance Computing (HPC) architecture. To make optimal use of the specifics of a GPU architecture, we need programming models that naturally support the parallel execution model of a GPU. CUDA and OpenCL are two widely used examples of such programming models. Furthermore, we also need to redesign algorithms such that they adhere to this parallel programming model, and we need to be able to prove the correctness of these redesigned algorithms. In this paper we study two examples of such parallelized algorithms, and we discuss how to prove their correctness (data race freedom and (partial) functional correctness) using the VerCors program verifier. First of all, we prove the correctness of two parallel algorithms solving the prefix sum problem. Second, we show how such a prefix sum algorithm is used as a basic block in a stream compaction algorithm, and we prove correctness of this stream compaction algorithm, taking advantage of the earlier correctness proof for the prefix sum algorithm. The proofs as described in this paper are developed over the CUDA implementations of these algorithms. In earlier work, we had already shown correctness of a more high-level version of the algorithm. This paper discusses how we add support to reason about CUDA programs in VerCors, and it then shows how we can redo the verification at the level of the CUDA code. We also discuss some practical challenges that we had to address to prove correctness of the actual CUDA-level verifications.
- Published
- 2022
9. Algorithms of Basic Communication Operation on the Biswapped Network
- Author
-
Wei, Wenhong, Xiao, Wenjun, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bubak, Marian, editor, van Albada, Geert Dick, editor, Dongarra, Jack, editor, and Sloot, Peter M. A., editor
- Published
- 2008
- Full Text
- View/download PDF
10. Parallel Prefix (Scan) Algorithms for MPI
- Author
-
Sanders, Peter, Träff, Jesper Larsson, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Mohr, Bernd, editor, Träff, Jesper Larsson, editor, Worringen, Joachim, editor, and Dongarra, Jack, editor
- Published
- 2006
- Full Text
- View/download PDF
11. Optimized Parallel Prefix Sum Algorithm on Optoelectronic Biswapped-Torus Architecture
- Author
-
Ashish Gupta
- Subjects
- biswapped-torus, biswapped, Computer science, Computation, Torus, Information technology, Parallel prefix, hyper hexa-cell, Optoelectronics, Prefix sum, Architecture, optoelectronic, otis
- Abstract
The Biswapped-Torus is a recently reported, node-symmetric optoelectronic member of the biswapped-framework family. In this paper, an optimized parallel approach is presented for prefix sum computation on the [Formula: see text] Biswapped-Torus. The proposed parallel algorithm requires a total of 7[Formula: see text] electronic moves and three optical moves for odd network sizes, or 7[Formula: see text] electronic moves and three optical moves for even network sizes. Its performance is also compared with that of recently reported optimal prefix sum algorithms on the [Formula: see text] Biswapped-Mesh and the [Formula: see text]-dimensional Biswapped Hyper Hexa-cell. The comparative analysis shows that the Biswapped-Torus maps the prefix sum faster, requiring fewer communication moves than the grid-based traditional architecture of the biswapped family, the Biswapped-Mesh. Moreover, it has the architectural benefit of node symmetry, which eases embedding, mapping, and the design of routing algorithms. Compared to the symmetric counterpart of the biswapped family, the Biswapped Hyper Hexa-Cell, the Biswapped-Torus is cost-efficient but requires comparatively more communication moves for mapping the prefix sum.
- Published
- 2020
12. A completely parallel surface reconstruction method for particle-based fluids
- Author
-
Wencong Yang and Chengying Gao
- Subjects
- Surface (mathematics), Computer science, Computation, Pipeline (computing), Grid, Computer Graphics and Computer-Aided Design, Redundancy (engineering), Prefix sum, Computer Vision and Pattern Recognition, Scalar field, Algorithm, Software, Surface reconstruction
- Abstract
We present a novel surface reconstruction pipeline that significantly improves reconstruction efficiency while preserving high-quality details for particle-based liquid simulation. Our surface reconstruction algorithm is a fully parallel narrow-band method. At the beginning of reconstruction, we develop a spatial-hashing, grid-based strategy to identify surface particles, which is much more precise and simpler than the smoothed color field. These precise surface particles ensure accurate extraction of the scalar field in the narrow band around the surface without any redundancy, which brings a great performance improvement for the subsequent reconstruction stages. Furthermore, to obtain better computational performance, we carefully analyze the potential race conditions and conditional branches of each reconstruction step across parallel threads and devise a completely parallel reconstruction method combined with an exclusive prefix sum algorithm. Our method is straightforward to implement. Experimental results demonstrate that it runs up to a dozen times faster than the state of the art in narrow-band fluid surface reconstruction, especially for large-scale particle-based fluids.
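The exclusive prefix sum mentioned above is the standard building block for allocating output slots without write conflicts. A minimal stream-compaction sketch (not the authors' code; all names are illustrative) shows the pattern: the scan of the keep-flags gives each kept item its output index, so all writes are independent and parallel-friendly.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] = sum of flags[0..i-1]."""
    out, acc = [], 0
    for f in flags:
        out.append(acc)
        acc += f
    return out

def compact(items, keep):
    """Stream compaction: retain items[i] where keep(items[i]) holds.
    The exclusive prefix sum of the 0/1 flags yields each kept item's
    destination index, so the scatter loop has no write conflicts."""
    flags = [1 if keep(x) else 0 for x in items]
    idx = exclusive_scan(flags)
    out = [None] * sum(flags)
    for i, x in enumerate(items):
        if flags[i]:
            out[idx[i]] = x
    return out
```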
- Published
- 2020
13. Using Office Applications to Study the Fenwick Tree When Preparing Students in Grades 8–9 for Olympiads in Informatics
- Subjects
prefix sum, segment tree, IT Olympiads, Fenwick tree, electronic learning resources, methods of teaching informatics, IT training methodology, olympiad in informatics
- Abstract
The article considers the use of MS Excel, a popular electronic learning resource, to explain the Fenwick tree when preparing students for olympiads in informatics. Published in: Vestnik of the Moscow City Pedagogical University. Series: Informatics and Informatization of Education, Issue 1 (56), 2022.
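A minimal Fenwick (binary indexed) tree of the kind the article teaches, sketched here in Python rather than in spreadsheet form: both point updates and prefix-sum queries cost O(log n) by walking the binary representation of the index.

```python
class FenwickTree:
    """Fenwick (binary indexed) tree over n values, 1-indexed.
    add() and prefix_sum() both run in O(log n)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta to position i."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)          # jump to the next responsible node

    def prefix_sum(self, i):
        """Sum of positions 1..i."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)          # strip the lowest set bit
        return s

    def range_sum(self, lo, hi):
        """Sum of positions lo..hi (inclusive)."""
        return self.prefix_sum(hi) - self.prefix_sum(lo - 1)
```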
- Published
- 2022
- Full Text
- View/download PDF
14. Ultrafast CPU/GPU Kernels for Density Accumulation in Placement
- Author
-
Yibo Lin, Jing Mai, and Zizheng Guo
- Subjects
- Data dependency, Computer science, Kernel (statistics), Prefix sum, Electronic design automation, Workload, Parallel computing, Physical design, Primitive operation, Ultrashort pulse
- Abstract
Density accumulation is a widely-used primitive operation in physical design, especially for placement. Iterative invocation in the optimization flow makes it one of the runtime bottlenecks. Accelerating density accumulation is challenging due to data dependency and workload imbalance. In this paper, we propose efficient CPU/GPU kernels for density accumulation by decomposing the problem into two phases: constant-time density collection for each instance and a linear-time prefix sum. We develop CPU and GPU dedicated implementations, and demonstrate promising efficiency benefits on tasks from large-scale placement problems.
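The two-phase decomposition described above (constant-time per-instance collection followed by a linear-time prefix sum) can be sketched in one dimension; the real kernels operate on 2D placement density maps, so this is only an assumed simplification with illustrative names.

```python
def density_map_1d(positions, num_bins):
    """Two-phase density accumulation:
    (1) each instance adds its contribution to its bin, constant time
        per instance and embarrassingly parallel (atomics on a GPU);
    (2) a single prefix sum turns per-bin counts into cumulative
        density."""
    bins = [0] * num_bins
    for p in positions:
        bins[p] += 1                  # phase 1: collection
    cumulative, acc = [], 0
    for c in bins:
        acc += c                      # phase 2: prefix sum
        cumulative.append(acc)
    return bins, cumulative
```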
- Published
- 2021
15. Parallel Makespan Calculation for Flow Shop Scheduling Problem with Minimal and Maximal Idle Time
- Author
-
Jarosław Rudy
- Subjects
- Computer science, CUDA, Parallel computing, maximal idle time, minimal idle time, Discrete optimization, Prefix sum, prefix sums, Job shop scheduling, Flow shop scheduling, flowshop scheduling, Computer experiment, Simulated annealing, Time complexity
- Abstract
In this paper, a flow shop scheduling problem with minimal and maximal machine idle time and the goal of minimizing makespan is considered. The mathematical model of the problem is presented. A generalization of the prefix sum, called the job shift scan, is proposed for computing the required shifts for overlapping jobs. A work-efficient algorithm for computing the job shift scan in parallel in the PRAM model with n processors is proposed and its time complexity of O(log n) is proven. Then, an algorithm for computing the makespan in O(m log n) time in parallel, using the prefix sum and the job shift scan, is proposed. Computer experiments were conducted on a GPU using the CUDA platform. The results indicate multi-thread GPU vs. single-thread GPU speedups of up to 350 and 1000 for the job shift scan and makespan calculation algorithms, respectively. Multi-thread GPU vs. single-thread CPU speedups of up to 4.5 and 14.7, respectively, were observed as well. Experiments on Taillard-based problem instances, using a simulated annealing solver that employs the parallel makespan calculation, show that the method performs many more iterations within the given time limit and obtains better results than the non-parallel version.
- Published
- 2021
- Full Text
- View/download PDF
16. Efficient deterministic parallel algorithms for integer sorting
- Author
-
Chen, Lin, Goos, G., editor, Hartmanis, J., editor, Barstow, D., editor, Brauer, W., editor, Brinch Hansen, P., editor, Gries, D., editor, Luckham, D., editor, Moler, C., editor, Pnueli, A., editor, Seegmüller, G., editor, Stoer, J., editor, Wirth, N., editor, Akl, S. G., editor, Fiala, F., editor, and Koczkodaj, W. W., editor
- Published
- 1990
- Full Text
- View/download PDF
17. Differentially Private Real-time Streaming Data Publication Based on Sliding Window under Exponential Decay
- Author
-
Yan Gao, Yingjie Wu, Xin Huang, Lan Sun, and Chen Ge
- Subjects
- Theoretical computer science, Range query (data structures), Computer science, Fenwick tree, Tree (data structure), Sliding window protocol, Prefix sum, Differential privacy, Exponential decay, Time complexity
- Abstract
Continuous answering of range queries on streaming data provides useful information for many practical applications, but also carries a risk of privacy disclosure. Existing research on differentially private streaming data publication mostly focuses on boosting query accuracy, pays less attention to query efficiency, and ignores the effect of timeliness on data weight. In this paper, we propose an effective algorithm for differentially private streaming data publication under an exponential decay model. First, by introducing a Fenwick tree to divide and reorganize the data items in the stream, we achieve constant time complexity for inserting a new item and retrieving a prefix sum, and time complexity linear in the number of data items for building the tree. We then exploit the matrix mechanism to handle correlated queries and reduce the global sensitivity, and choose a suitable diagonal matrix to further improve range query accuracy. Finally, to account for exponential decay, every data item is weighted by the decay factor. Combining the Fenwick tree with the matrix optimization, we present a complete algorithm for differentially private real-time streaming data publication. Experiments compare our algorithm with similar algorithms for streaming data release under exponential decay. The results show that our algorithm effectively improves query efficiency while preserving query quality.
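The exponential-decay weighting used above can be sketched on its own: the decayed prefix sum S_t = sum over i <= t of decay^(t-i) * x_i folds into the recurrence S_t = decay * S_(t-1) + x_t, so each arriving item costs O(1). This illustrates only the weighting scheme, not the paper's private publication algorithm, and all names are illustrative.

```python
def decayed_prefix_sums(stream, decay):
    """Streaming sums under exponential decay: after item t,
    S_t = sum_{i <= t} decay**(t - i) * x_i, maintained incrementally
    as S_t = decay * S_{t-1} + x_t (O(1) per arrival)."""
    out, s = [], 0.0
    for x in stream:
        s = decay * s + x
        out.append(s)
    return out
```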
- Published
- 2019
18. Lambda calculus with algebraic simplification for reduction parallelisation: Extended study
- Author
-
Akimasa Morihata
- Subjects
- Simply typed lambda calculus, Computer science, Algebra, Reduction (complexity), Tree (data structure), Operator (computer programming), Prefix sum, Lambda calculus, Commutative property, Associative property, Software
- Abstract
Parallel reduction is a major component of parallel programming and widely used for summarisation and aggregation. It is not well understood, however, what sorts of non-trivial summarisations can be implemented as parallel reductions. This paper develops a calculus named λAS, a simply typed lambda calculus with algebraic simplification. This calculus provides a foundation for studying a parallelisation of complex reductions by equational reasoning. Its key feature is δ abstraction. A δ abstraction is observationally equivalent to the standard λ abstraction, but its body is simplified before the arrival of its arguments using algebraic properties such as associativity and commutativity. In addition, the type system of λAS guarantees that simplifications due to δ abstractions do not lead to serious overheads. The usefulness of λAS is demonstrated on examples of developing complex parallel reductions, including those containing more than one reduction operator, loops with conditional jumps, prefix sum patterns and even tree manipulations.
- Published
- 2021
19. Parallel Differential Evolutionary Particle Filtering Algorithm Based on the CUDA Unfolding Cycle
- Author
-
Kaijie Huang and Jie Cao
- Subjects
- Computer Networks and Communications, Computer science, Computation, Thread (computing), Cyclic prefix, CUDA, Nonlinear system, Parallel processing (DSP implementation), Telecommunication, Prefix sum, Particle filter, Algorithm, Information Systems
- Abstract
To address the low efficiency of the prefix-sum reduction during execution of the parallel differential evolutionary particle filtering algorithm, a filtering algorithm based on CUDA loop-unrolled prefix sums is proposed. Unrolling the loop and unrolling the warps removes the thread divergence and thread idleness present in the parallel prefix sum, optimizes the loop, and improves prefix-sum execution efficiency. By introducing this parallel strategy, the differential evolutionary particle filtering algorithm is implemented in parallel, with the improved prefix-sum computation executed on the GPU during the algorithm's update step. Analysis of large data sets shows that the parallel differential evolutionary particle filtering algorithm with the improved prefix-sum reduction can effectively improve the filtering of nonlinear system states and the real-time performance in heterogeneous parallel processing systems.
- Published
- 2021
20. High performance CUDA-based implementation for the 2D version of the Maximum Subarray Problem (MSP).
- Author
-
Saleh, Salah, Abdellah, Marwan, Raouf, Ahmed A. Abdel, and Kadah, Yasser M.
- Abstract
The Maximum Subarray Problem (MSP) finds the segment of an array that has the maximum sum over all possible contiguous segments. Applications of this problem exist in various fields such as genomic sequence analysis, data mining, and computer vision. Several optimal linear-time solutions exist for the 1D version; however, the known upper bounds for the 2D version are cubic or near-cubic time, which makes it a problem of high complexity. In this work, a stage-by-stage, high-performance Graphics Processing Unit (GPU)-based implementation for solving the 2D version of the problem in linear time, relying on the Compute Unified Device Architecture (CUDA) technology, is presented. It achieves more than a 7x speedup compared to a single-threaded sequential implementation on the Central Processing Unit (CPU) for an array of size 512×512. [ABSTRACT FROM PUBLISHER]
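A CPU reference for the cubic-time 2D MSP baseline that such GPU work accelerates: column prefix sums collapse each strip of rows into a 1D array, where Kadane's linear-time algorithm applies. This is the textbook formulation, not the authors' implementation.

```python
def max_subarray_2d(grid):
    """Maximum-sum subrectangle in O(rows^2 * cols): for each pair of
    top and bottom rows, collapse the strip to 1D using column prefix
    sums, then run Kadane's 1D algorithm on the collapsed strip."""
    rows, cols = len(grid), len(grid[0])
    # prefix[r][c] = sum of column c over rows 0..r-1
    prefix = [[0] * cols for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            prefix[r + 1][c] = prefix[r][c] + grid[r][c]
    best = grid[0][0]
    for top in range(rows):
        for bot in range(top, rows):
            strip = [prefix[bot + 1][c] - prefix[top][c]
                     for c in range(cols)]
            cur = acc = strip[0]       # Kadane on the collapsed strip
            for v in strip[1:]:
                acc = max(v, acc + v)
                cur = max(cur, acc)
            best = max(best, cur)
    return best
```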
- Published
- 2012
- Full Text
- View/download PDF
21. Formal Verification of Parallel Prefix Sum
- Author
-
Safari, Mohsen, Oortwijn, Wytse, Joosten, Sebastiaan, and Huisman, Marieke (Formal Methods and Tools, Digital Society Institute); proceedings editors: Lee, Ritchie, Jha, Susmit, Mavridou, Anastasia, and Giannakopoulou, Dimitra
- Subjects
- Multi-core processor, Correctness, Computer science, Parallel algorithm, Thread (computing), Parallel computing, Separation logic, Parallel prefix, Prefix sum, Formal verification
- Abstract
Lecture Notes in Computer Science, vol. 12229, ISSN 0302-9743 / 1611-3349. In: NASA Formal Methods: 12th International Symposium, NFM 2020, Moffett Field, CA, USA, May 11–15, 2020, Proceedings, ISBN 978-3-030-55753-9 / 978-3-030-55754-6.
- Published
- 2020
22. Practical trade-offs for the prefix-sum problem
- Author
-
Giulio Ermanno Pibiri and Rossano Venturini
- Subjects
- Theoretical computer science, Settore INF/01 - Informatica, Computer science, Trade offs, Data structure, SIMD, performance evaluation, Range (mathematics), Integer, efficiency, prefix-sum, Prefix sum, Data Structures and Algorithms, Software
- Abstract
Given an integer array A, the prefix-sum problem is to answer sum(i) queries that return the sum of the elements in A[0..i], knowing that the integers in A can be changed. It is a classic problem in data structure design, with a wide range of applications in computing from coding to databases. In this work, we propose and compare several practical solutions to this problem, showing that new trade-offs between the performance of queries and updates can be achieved on modern hardware. (Accepted by Software: Practice and Experience, 2020.)
- Published
- 2020
23. A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment.
- Author
-
Ha, Sang-Won and Han, Tack-Don
- Subjects
- COMPUTER networks, SCALABILITY, PARALLEL computers, COMPUTER algorithms, NETWORK performance, COMPUTER architecture, COMPUTATIONAL complexity
- Abstract
The parallel scan is a basic tool that is used to parallelize algorithms which appear to have serial dependencies. The performance of these algorithms relies heavily on the efficiency of the parallel scan that is being used. To maintain work efficiency, current parallelization methods either sacrifice the overall depth or limit the scalability. In this study, we present a parallel scan method that is derived from the Han-Carlson parallel prefix graph and is both a work-efficient and a depth-optimal process. In this method, the depth is increased by a small constant value above the lower bound; therefore, the amount of computation and/or memory access is effectively reduced. We also employ a novel cascaded thread-block execution method to exploit the single-program-multiple-data (SPMD) nature of the compute unified device architecture (CUDA) environment developed by NVIDIA. The proposed method facilitates the low-latency interthread accessible shared memory and the single-instruction-multiple-thread (SIMT) characteristics of the graphics hardware to reduce high-latency global memory access and costly barrier synchronization. Our experimental results demonstrate an average speed up of approximately 40 and 10 percent over the CUDA data parallel primitives (CUDPP) library derivation of the Kogge-Stone prefix tree and an implementation of Merrill and Grimshaw's method with coarser combination of the Kogge-Stone graph and the Brent-Kung prefix graph, respectively. [ABSTRACT FROM PUBLISHER]
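For contrast with the prefix graphs mentioned above, the depth-optimal end of the design space can be sketched with the plain Kogge-Stone recurrence (not the paper's Han-Carlson hybrid): log2(n) rounds, and within a round every position updates independently, which is the pattern a GPU executes with one thread per element.

```python
def kogge_stone_scan(xs):
    """Inclusive scan via the Kogge-Stone recurrence: in round d,
    position i >= d combines with position i - d. All updates within
    a round read only the previous round's values, so they are
    mutually independent (parallelizable)."""
    a, n = list(xs), len(xs)
    d = 1
    while d < n:
        prev = a[:]                  # snapshot of the previous round
        for i in range(d, n):
            a[i] = prev[i - d] + prev[i]
        d *= 2
    return a
```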
- Published
- 2013
- Full Text
- View/download PDF
24. HIGH PERFORMANCE AND SCALABLE RADIX SORTING:: A CASE STUDY OF IMPLEMENTING DYNAMIC PARALLELISM FOR GPU COMPUTING.
- Author
-
Merrill, Duane and Grimshaw, Andrew
- Subjects
- CASE studies, GRAPHICS processing units, CENTRAL processing units, COMPUTER algorithms, SORTING (Electronic computers), MICROPROCESSORS
- Abstract
The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs. [ABSTRACT FROM AUTHOR]
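The histogram-scan-scatter structure underlying radix sorting can be sketched sequentially: each pass counts one digit, takes an exclusive prefix sum of the bin counts to get each bin's starting offset, then scatters keys stably. This is a generic LSD radix sort for non-negative integers, not the paper's GPU implementation.

```python
def radix_sort(keys, bits=32, radix_bits=4):
    """LSD radix sort of non-negative integers. Per pass: histogram
    one digit, exclusive-scan the bin counts into scatter offsets,
    then scatter keys stably into the output."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        counts = [0] * (mask + 1)
        for k in keys:                         # histogram this digit
            counts[(k >> shift) & mask] += 1
        offsets, acc = [], 0
        for c in counts:                       # exclusive prefix sum
            offsets.append(acc)
            acc += c
        out = [0] * len(keys)
        for k in keys:                         # stable scatter
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```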
- Published
- 2011
- Full Text
- View/download PDF
25. Performing Efficient NURBS Modeling Operations on the GPU.
- Author
-
Krishnamurthy, Adarsh, Khardekar, Rahul, McMains, Sara, Haller, Kirk, and Elber, Gershon
- Subjects
- ALGORITHMS, GRAPHICS processing units, COMPUTER simulation, COMPUTER graphics, CAD/CAM systems
- Abstract
We present algorithms for evaluating and performing modeling operations on NURBS surfaces using the programmable fragment processor on the Graphics Processing Unit (GPU). We extend our GPU-based NURBS evaluator, which evaluates NURBS surfaces, to compute exact normals for either standard or rational B-spline surfaces for use in rendering and geometric modeling. We build on these calculations in our new GPU algorithms to perform standard modeling operations such as inverse evaluations, ray intersections, and surface-surface intersections on the GPU. Our modeling algorithms run in real time, enabling the user to sketch on the actual surface to create new features. In addition, the designer can edit the surface by interactively trimming it without the need for retessellation. Our GPU-accelerated algorithm for surface-surface intersection operations with NURBS surfaces can output intersection curves in the model space as well as in the parametric spaces of both intersecting surfaces at interactive rates. We also extend our surface-surface intersection algorithm to evaluate self-intersections in NURBS surfaces. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
26. Generic functional parallel algorithms: scan and FFT
- Author
-
Conal Elliott
- Subjects
020203 distributed computing ,Functional programming ,Theoretical computer science ,Generic programming ,Fast Fourier transform ,Search engine indexing ,Parallel algorithm ,02 engineering and technology ,Data type ,Product (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Safety, Risk, Reliability and Quality ,Software ,Mathematics - Abstract
Parallel programming, whether imperative or functional, has long focused on arrays as the central data type. Meanwhile, typed functional programming has explored a variety of data types, including lists and various forms of trees. Generic functional programming decomposes these data types into a small set of fundamental building blocks: sum, product, composition, and their associated identities. Definitions over these few fundamental type constructions then automatically assemble into algorithms for an infinite variety of data types—some familiar and some new. This paper presents generic functional formulations for two important and well-known classes of parallel algorithms: parallel scan (generalized prefix sum) and fast Fourier transform (FFT). Notably, arrays play no role in these formulations. Consequent benefits include a simpler and more compositional style, much use of common algebraic patterns, and freedom from the possibility of run-time indexing errors. The functional generic style also clearly reveals deep commonality among what otherwise appear to be quite different algorithms. Instantiating the generic formulations, two well-known algorithms for each of parallel scan and FFT naturally emerge, as well as two possibly new algorithms.
- Published
- 2017
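The "generalized prefix sum" of the abstract is a scan parameterized by any associative operator. A small list-based sketch of one classic divide-and-conquer formulation (the paper works type-generically over functor building blocks, not lists):

```python
def scan(op, xs):
    """Inclusive scan (generalized prefix sum) over any associative
    operator `op`, in a divide-and-conquer shape: scan both halves,
    then fold the left half's total into the right half. List-based
    sketch; the paper's formulations are generic over data types."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    mid = n // 2
    left = scan(op, xs[:mid])
    right = scan(op, xs[mid:])
    t = left[-1]                 # total of the left half
    return left + [op(t, y) for y in right]
```

Only associativity is required, so the same definition works for non-commutative operators such as string concatenation.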
27. Parallel Dynamics Computation Using Prefix Sum Operations
- Author
-
Jia Pan, Yang Yajue, and Yuanqing Wu
- Subjects
FOS: Computer and information sciences ,020203 distributed computing ,Control and Optimization ,Recursion ,Computer science ,Mechanical Engineering ,Computation ,Biomedical Engineering ,02 engineering and technology ,Computer Science Applications ,Human-Computer Interaction ,Computer Science - Robotics ,CUDA ,020303 mechanical engineering & transports ,0203 mechanical engineering ,Parallel processing (DSP implementation) ,Artificial Intelligence ,Control and Systems Engineering ,Multithreading ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Computer Vision and Pattern Recognition ,General-purpose computing on graphics processing units ,Robotics (cs.RO) ,Algorithm - Abstract
A new parallel framework for fast computation of inverse and forward dynamics of articulated robots based on prefix sums (scans) is proposed. We first re-investigate the well-known recursive Newton–Euler formulation for robot dynamics and show that the forward–backward propagation process for robot inverse dynamics is equivalent to two scan operations on certain semigroups. We then show that state-of-the-art forward dynamics algorithms can also be cast almost completely into a sequence of scan operations, with the unscannable parts clearly identified. This suggests a serial–parallel hybrid approach for systems with a moderate number of links. We implement our scan-based algorithms on the Nvidia CUDA platform and compare their performance with multithreaded CPU-based recursive algorithms, demonstrating a computational acceleration.
- Published
- 2017
28. Multiple Addition and Prefix Sum on a Linear Array with a Reconfigurable Pipelined Bus System.
- Author
-
Datta, Amitava
- Subjects
- *
ALGORITHMS , *ALGEBRA , *LINEAR algebra , *COMPUTER architecture , *COMPUTER programming , *MATHEMATICS - Abstract
We present several fast algorithms for multiple addition and prefix sum on the Linear Array with a Reconfigurable Pipelined Bus System (LARPBS), a recently proposed architecture based on optical buses. Our algorithm for adding N integers runs on an N log M-processor LARPBS in O(log* N) time, where log* N is the number of times the logarithm has to be taken to reduce N below 1 and M is the largest integer in the input. Our addition algorithm improves the time complexity of several matrix multiplication algorithms proposed by Li, Pan and Zheng (IEEE Trans. Parallel and Distributed Systems, 9(8):705–720, 1998). We also present several fast algorithms for computing prefix sums of N integers on the LARPBS. For integers with bounded magnitude, our first algorithm for prefix sum computation runs in O(log log N) time using N processors and in O(1) time using N^{1+ℇ} processors, for 0 ≤ ℇ < 1. For integers with unbounded magnitude, the first algorithm for multiple addition runs in O(log log N log* N) time using N log M processors, where M is the largest integer in the input. Our second algorithm for multiple addition runs in O(log* N) time using N^{1+ℇ} log M processors, for 0 ≤ ℇ < 1. We also show suitable extensions of our algorithms to real numbers. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
29. Summation and Routing on a Partitioned Optical Passive Stars Network with Large Group Size.
- Author
-
Datta, Amitave and Soundaralakshmi, Subbiah
- Subjects
- *
MICROPROCESSORS , *NETWORK routers , *SIMULATION methods & models , *ALGORITHMS , *NETWORK processors - Abstract
In a Partitioned Optical Passive Stars (POPS) network, n = dg processors are divided into g groups of d processors each, and such a POPS network is denoted by POPS(d, g). There is an optical passive star (OPS) coupler between every pair of groups. Hence, a POPS(d, g) requires g² couplers. It is likely that, in a practical system, the number of couplers will be less than the number of processors, i.e., d > √n > g, and the number of groups will be smaller than the number of processors in a group. Hence, it is important to design fast algorithms for basic operations on such POPS networks with large group size. We present fast algorithms for data sum, prefix sum, and permutation routing on a POPS(d, g) such that d > √n > g. Our data sum and prefix sum algorithms improve upon the best known algorithms for these problems designed by Sahni [14]. Permutation routing can be solved on a POPS network by simulating a hypercube sorting algorithm. Our algorithm for permutation routing is more efficient compared to this simulated hypercube sorting algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2003
- Full Text
- View/download PDF
30. Compact Fenwick trees for dynamic ranking and selection
- Author
-
Stefano Marchini and Sebastiano Vigna
- Subjects
FOS: Computer and information sciences ,Theoretical computer science ,Computer science ,020207 software engineering ,02 engineering and technology ,Implicit data structure ,Bit array ,Data structure ,Fenwick tree ,Ranking (information retrieval) ,Tree (data structure) ,Computer Science - Data Structures and Algorithms ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Prefix sum ,Data Structures and Algorithms (cs.DS) ,Software - Abstract
The Fenwick tree is a classical implicit data structure that stores an array in such a way that modifying an element, accessing an element, computing a prefix sum and performing a predecessor search on prefix sums all take logarithmic time. We introduce a number of variants which improve the classical implementation of the tree: in particular, we can reduce its size when an upper bound on the array elements is known, and we can perform much faster predecessor searches. Our aim is to use our variants to implement an efficient dynamic bit vector: our structure is able to perform updates, ranking and selection in logarithmic time, with a space overhead on the order of a few percent, outperforming existing data structures with the same purpose. Along the way, we highlight the pernicious interplay between the arithmetic behind the Fenwick tree and the structure of current CPU caches, suggesting simple solutions that improve performance significantly.
- Published
- 2019
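For reference, the classical Fenwick tree operations the paper improves on can be sketched as follows (a plain textbook implementation, not the authors' compact variants):

```python
class FenwickTree:
    """Classical Fenwick (binary indexed) tree, 1-based; a textbook
    baseline for the operations the paper makes compact and faster."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add `delta` to element i in O(log n)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i          # jump to the next node covering i

    def prefix_sum(self, i):
        """Sum of elements 1..i in O(log n)."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i          # strip the lowest set bit
        return s

    def search(self, target):
        """Predecessor search on prefix sums: the largest index i with
        prefix_sum(i) <= target, assuming all elements are nonnegative."""
        pos, rem = 0, target
        for p in range(self.n.bit_length(), -1, -1):
            nxt = pos + (1 << p)
            if nxt <= self.n and self.tree[nxt] <= rem:
                pos = nxt
                rem -= self.tree[nxt]
        return pos
```

The `i & -i` arithmetic that drives all three operations is exactly the cache-unfriendly access pattern the paper analyzes.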
31. Enabling prefix sum parallelism pattern for recurrences with principled function reconstruction
- Author
-
Gagan Agrawal, Yang Xia, and Peng Jiang
- Subjects
050101 languages & linguistics ,Many core ,Computer science ,Computation ,05 social sciences ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,020201 artificial intelligence & image processing ,0501 psychology and cognitive sciences ,02 engineering and technology ,Parallel computing ,Merge (version control) - Abstract
Much research work has been done over the last several decades to parallelize loops with recurrences. Recently, a sampling-and-reconstruction method was proposed to parallelize a broad class of loops with recurrences in an automated fashion, with a practical runtime approach. Although the parallelized codes achieve linear scalability across multi-core architectures, the sequential merge inherent to this method makes it not scalable on many-core architectures, such as GPUs. At the same time, existing parallel merge approaches used for simple reduction loops cannot be directly and correctly applied to this method. Based on this observation, we propose new methods to merge partial results in parallel on GPUs and achieve linear scalability. Our approach involves refined runtime-checking rules to avoid unnecessary runtime check failures and reduce the overhead of reprocessing. We also propose a sample convergence technique to reduce the number of sample points so that communication and computation overhead is reduced. Finally, based on GPU architectural features, we develop optimization techniques to further improve performance. Our evaluation on a set of representative algorithms shows that our parallel merge implementation is substantially more efficient than sequential merge, and achieves linear scalability on different GPUs.
- Published
- 2019
32. Decomposing and re-composing lightweight compression schemes - and why it matters
- Author
-
Rozenberg, E. (Eyal) and Rozenberg, E. (Eyal)
- Abstract
We argue for a richer view of the space of lightweight compression schemes for columnar DBMSes: We demonstrate how even simple schemes used in DBMSes decompose into constituent schemes through a columnar perspective on their decompression. With our concrete examples, we touch briefly on what follows from these and other decompositions: Composition of alternative compression schemes as well as other practical and analytical implications.
- Published
- 2018
- Full Text
- View/download PDF
33. HLS-based optimization and design space exploration for applications with variable loop bounds
- Author
-
Jason Cong and Young-kyu Choi
- Subjects
010302 applied physics ,Speedup ,Design space exploration ,Computer science ,02 engineering and technology ,Parallel computing ,01 natural sciences ,020202 computer hardware & architecture ,Reduction (complexity) ,Variable (computer science) ,Gate array ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Point (geometry) ,Field-programmable gate array - Abstract
In order to further increase the productivity of field-programmable gate array (FPGA) programmers, several design space exploration (DSE) frameworks for high-level synthesis (HLS) tools have been recently proposed to automatically determine the FPGA design parameters. However, one of the common limitations found in these tools is that they cannot find a design point with large speedup for applications with variable loop bounds. The reason is that loops with variable loop bounds cannot be efficiently parallelized or pipelined with simple insertion of HLS directives. Also, making highly accurate prediction of cycles and resource consumption on the entire design space becomes a challenging task because of the inaccuracy of the HLS tool cycle prediction and the wide design space. In this paper we present an HLS-based FPGA optimization and DSE framework that produces a high-performance design even in the presence of variable loop bounds. We propose code transformations that increase the utilization of the compute resources for variable loops, including several computation patterns with loop-carried dependency such as floating-point reduction and prefix sum. In order to rapidly perform DSE with high accuracy, we describe a resource and cycle estimation model constructed from the information obtained from the actual HLS synthesis. Experiments on applications with variable loop bounds in Polybench benchmarks with Vivado HLS show that our framework improves the baseline implementation by 75X on average and outperforms current state-of-the-art DSE frameworks.
- Published
- 2018
34. A Prefix-Sum-Based Rabin-Karp Implementation for Multiple Pattern Matching on GPGPU
- Author
-
Jacir Luiz Bordim, Yasuaki Ito, Lucas S. N. Nunes, and Koji Nakano
- Subjects
Speedup ,Matching (graph theory) ,Computer science ,Graphics processing unit ,020206 networking & telecommunications ,02 engineering and technology ,Parallel computing ,Rabin–Karp algorithm ,Hash table ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,020201 artificial intelligence & image processing ,Pattern matching ,General-purpose computing on graphics processing units - Abstract
In recent years, the Graphics Processing Unit (GPU) has been increasingly used for general-purpose processing, referred to as General-Purpose GPU (GPGPU) computing. GPUs have been used by the scientific community in several areas, including cryptography, sorting, graphs and sequence alignment. The main contribution of this work is to propose a parallel Rabin-Karp implementation for handling multiple pattern matching. Given a text of length n and p patterns of length m, the proposed prefix-sum-based Rabin-Karp algorithm finds all occurrences of the patterns in the text in O(m + q + n/τ + nm/q) time, where q is a sufficiently large prime number and τ is the available number of threads. The proposed GPGPU implementation uses the prefix-sums algorithm as the basis to maximize parallelization, along with a hash table to speed up match verification. The proposed solution allows the comparison of multiple patterns at once without a significant difference in running time. For n = 2^27 and m ranging from 10 to 30, experimental results show that the proposed implementation provides speedups surpassing 2 times for a single pattern and surpassing 10 times for p = 256 patterns as compared to an OpenMP implementation on a multicore CPU.
- Published
- 2018
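A sequential sketch of multiple-pattern Rabin-Karp with a hash table of pattern fingerprints, the scheme the paper parallelizes (the prefix-sum part of the paper computes the rolling hashes in parallel; this sketch slides the window serially, and the modulus and base are illustrative choices):

```python
def rabin_karp_multi(text, patterns, q=1_000_003, base=256):
    """Find all occurrences of equal-length patterns via rolling hashes
    and a fingerprint hash table; candidates are verified by direct
    comparison to rule out hash collisions. Serial sketch only."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns)
    n = len(text)
    if n < m:
        return []
    # Hash every pattern into a table keyed by fingerprint.
    table = {}
    for p in patterns:
        h = 0
        for ch in p:
            h = (h * base + ord(ch)) % q
        table.setdefault(h, set()).add(p)
    # Roll a window hash across the text.
    msb = pow(base, m - 1, q)
    h = 0
    for ch in text[:m]:
        h = (h * base + ord(ch)) % q
    hits = []
    for i in range(n - m + 1):
        if h in table and text[i:i + m] in table[h]:
            hits.append((i, text[i:i + m]))
        if i + m < n:
            h = ((h - ord(text[i]) * msb) * base + ord(text[i + m])) % q
    return hits
```

Adding more patterns only grows the hash table, not the per-window work, which is why matching many patterns costs about the same as matching one.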
35. Massively parallel computation of linear recurrence equations with graphics processing units
- Author
-
Kyuyeon Hwang, Donghwan Lee, and Wonyong Sung
- Subjects
010302 applied physics ,Computer science ,Parallel algorithm ,02 engineering and technology ,Parallel computing ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Software_PROGRAMMINGTECHNIQUES ,01 natural sciences ,020202 computer hardware & architecture ,Titan (supercomputer) ,Constant (computer programming) ,0103 physical sciences ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,SIMD ,Graphics ,Massively parallel - Abstract
Graphics processing units (GPUs) show very high performance when executing many parallel programs; however, their use in solving linear recurrence equations is considered difficult because of the sequential nature of the problem. Previously developed parallel algorithms, such as recursive doubling and multi-block processing, do not show high efficiency on GPUs because of poor scalability with the number of threads. In this work, we have developed a highly efficient GPU-based algorithm for recurrences using a thread-level parallel (TLP) approach, instead of conventional thread-block level parallel (TBLP) methods. The proposed TLP method executes all of the threads as independently as possible to improve the computational efficiency and employs a hierarchical structure for inter-thread communication. Not only constant but also time-varying coefficient recurrence equations are implemented on NVIDIA GTX285, GTX580 and GTX TITAN X GPUs, and their performance is compared with results on single-core and multi-core SIMD CPU-based PCs.
- Published
- 2018
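A first-order linear recurrence x_i = a_i x_{i-1} + b_i becomes scannable once each step is viewed as an affine map and steps are combined by (associative) composition; this is the standard trick behind parallel formulations such as recursive doubling. A serial sketch of that formulation (the scan loop could equally be evaluated in parallel):

```python
def compose(f, g):
    """Compose affine maps: applying f = (a1, b1) then g = (a2, b2)
    gives x -> a2*(a1*x + b1) + b2. Composition is associative, which
    is what makes the recurrence scannable."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def recurrence_scan(coeffs, x0):
    """Solve x_i = a_i * x_{i-1} + b_i via an inclusive scan over affine
    maps; applying each prefix map to x0 yields x_i. Serial sketch of
    the scan that parallel methods evaluate by recursive doubling."""
    out, acc = [], (1, 0)        # (1, 0) is the identity affine map
    for f in coeffs:
        acc = compose(acc, f)
        a, b = acc
        out.append(a * x0 + b)
    return out
```

The same construction handles both constant and time-varying coefficients, since each step carries its own (a_i, b_i) pair.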
36. The Alternating Stock Size Problem and the Gasoline Puzzle
- Author
-
Alantha Newman, Heiko Röglin, and Johanna Seif
- Subjects
FOS: Computer and information sciences ,Doubly stochastic matrix ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,0211 other engineering and technologies ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,02 engineering and technology ,Combinatorics ,Mathematics (miscellaneous) ,0502 economics and business ,Computer Science - Data Structures and Algorithms ,Prefix sum ,Data Structures and Algorithms (cs.DS) ,[INFO]Computer Science [cs] ,Stock (geology) ,ComputingMilieux_MISCELLANEOUS ,Mathematics ,050210 logistics & transportation ,021103 operations research ,000 Computer science, knowledge, general works ,Rounding ,05 social sciences ,Approximation algorithm ,Prefix ,Linear programming relaxation ,Computer Science ,Negative number - Abstract
Given a set S of integers whose sum is zero, consider the problem of finding a permutation of these integers such that (i) all prefix sums of the ordering are nonnegative and (ii) the maximum value of a prefix sum is minimized. Kellerer et al. call this problem the stock size problem and showed that it can be approximated to within 3/2. They also showed that an approximation ratio of 2 can be achieved via several simple algorithms. We consider a related problem, which we call the alternating stock size problem , where the numbers of positive and negative integers in the input set S are equal. The problem is the same as that shown earlier, but we are additionally required to alternate the positive and negative numbers in the output ordering. This problem also has several simple 2-approximations. We show that it can be approximated to within 1.79. Then we show that this problem is closely related to an optimization version of the gasoline puzzle due to Lovász, in which we want to minimize the size of the gas tank necessary to go around the track. We present a 2-approximation for this problem, using a natural linear programming relaxation whose feasible solutions are doubly stochastic matrices. Our novel rounding algorithm is based on a transformation that yields another doubly stochastic matrix with special properties, from which we can extract a suitable permutation.
- Published
- 2018
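To make the objective concrete: an ordering is feasible when every prefix sum is nonnegative, and its cost is the maximum prefix sum reached. A small checker (illustrative; it evaluates a given ordering, it is not one of the approximation algorithms discussed above):

```python
def stock_size(order):
    """Evaluate an ordering for the stock size problem: every prefix sum
    must stay nonnegative, and the cost is the peak prefix sum reached.
    Checker only; not the 3/2- or 1.79-approximation algorithms."""
    s, peak = 0, 0
    for v in order:
        s += v
        if s < 0:
            raise ValueError("infeasible ordering: prefix sum went negative")
        peak = max(peak, s)
    return peak
```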
37. Efficient Sequential and Parallel Algorithms for Estimating Higher Order Spectra
- Author
-
Sanguthevar Rajasekaran, Nalini Ravishanker, Zigeng Wang, Abdullah-Al Mamun, and Xingyu Cai
- Subjects
0301 basic medicine ,FOS: Computer and information sciences ,Computer science ,Parallel algorithm ,020206 networking & telecommunications ,02 engineering and technology ,03 medical and health sciences ,030104 developmental biology ,Computer Science - Distributed, Parallel, and Cluster Computing ,0202 electrical engineering, electronic engineering, information engineering ,Code (cryptography) ,Feature (machine learning) ,Prefix sum ,Multiplication ,Trispectrum ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Bispectrum ,Algorithm ,Smoothing - Abstract
Polyspectral estimation is a problem of great importance in the analysis of nonlinear time series, with applications in biomedical signal processing, communications, geophysics, image, radar, sonar and speech processing, etc. Higher order spectra (HOS) have been used in unsupervised and supervised clustering in big data scenarios, in testing for Gaussianity, to suppress Gaussian noise, to characterize nonlinearities in time series data, and so on. Any algorithm for computing the $k$th order spectra of a time series of length $n$ needs $\Omega(n^{k-1})$ time, since the output size will be $\Omega(n^{k-1})$ as well. Given that we live in an era of big data, $n$ could be very large, in which case sequential algorithms might take unacceptable amounts of time. Thus it is essential to develop parallel algorithms. There is also room for improving existing sequential algorithms, and parallel algorithms in the literature are nongeneric. In this paper we offer generic sequential algorithms for computing higher order spectra that are asymptotically faster than any published algorithm for HOS. Further, we offer memory-efficient algorithms. We also present optimal parallel implementations of these algorithms on parallel computing models such as the PRAM and the mesh. We provide experimental results on our sequential and parallel algorithms. Our parallel implementation achieves very good speedups.
- Published
- 2018
38. Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs
- Author
-
Hao Wang, Jeffrey S. Vetter, Seyong Lee, Kaixi Hou, and Wu-chun Feng
- Subjects
010302 applied physics ,Wavefront ,020203 distributed computing ,Correctness ,Memory hierarchy ,Computer science ,Computation ,Locality ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Compensation (engineering) ,0103 physical sciences ,Synchronization (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Overhead (computing) - Abstract
Wavefront loops are widely used in many scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, due to the data dependencies in wavefront loops, it is challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions can only optimize for these factors to a limited extent. For example, tiling-based methods optimize memory access but may result in load imbalance; while compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for it, suffer from both global synchronization overhead and limited generality. In this paper, we first prove under which circumstances breaking data dependencies and properly changing the sequence of computation operators in our compensation-based method does not affect the correctness of results. Based on this analysis, we design a highly efficient compensation-based parallelism on GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and combines with the tiling method to optimize memory access and synchronization. The performance results on the NVIDIA K80 and P100 GPU platforms demonstrate that our method can achieve significant improvements for four types of real-world application kernels over the state-of-the-art research.
- Published
- 2018
39. Decomposing and re-composing lightweight compression schemes - and why it matters
- Author
-
Eyal Rozenberg
- Subjects
Decompression ,Metric spaces ,Scatter ,Patched frame of reference ,Piecewise polynomials ,Computer science ,Columnar processing ,Rle ,02 engineering and technology ,Data_CODINGANDINFORMATIONTHEORY ,Column space ,Maximum norm ,Analytic DBMS ,Simple (abstract algebra) ,020204 information systems ,Compression (functional analysis) ,Compression scheme ,Polynomial models ,0202 electrical engineering, electronic engineering, information engineering ,Compression scheme decomposition ,Run length encoding ,DBMS ,L-infinity norm ,Patching ,For ,Lightweight compression ,Perspective (graphical) ,Columar ,Compression ,Modeling ,Gather ,Low degree polynomials ,Run position encoding ,Function decomposition ,Columnar compression ,Column store ,Metric space ,Computer engineering ,Delta ,L-0 norm ,Prefix sum ,Pfor ,Run-length encoding ,RPE ,Frame of reference ,Step functions - Abstract
We argue for a richer view of the space of lightweight compression schemes for columnar DBMSes: We demonstrate how even simple schemes used in DBMSes decompose into constituent schemes through a columnar perspective on their decompression. With our concrete examples, we touch briefly on what follows from these and other decompositions: Composition of alternative compression schemes as well as other practical and analytical implications.
- Published
- 2018
40. Almost Optimal Column-wise Prefix-sum Computation on the GPU
- Author
-
Yasuaki Ito, Koji Nakano, Toru Fujita, and Hiroki Tokura
- Subjects
020203 distributed computing ,Computer science ,Computation ,Parallel algorithm ,020207 software engineering ,Image processing ,02 engineering and technology ,Parallel computing ,Summed area table ,CUDA ,Matrix (mathematics) ,Titan (supercomputer) ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Computer Science::Formal Languages and Automata Theory - Abstract
The row-wise and column-wise prefix-sum computation of a matrix has many applications in the area of image processing, such as computation of the summed area table and the Euclidean distance map. It is known that the prefix-sums of a 1-dimensional array can be computed efficiently on the GPU. Hence, the row-wise prefix-sums of a matrix can also be computed efficiently on the GPU by executing this prefix-sum algorithm for every row in parallel. However, the same approach does not work well for computing the column-wise prefix-sums, because it performs inefficient strided access to the global memory. The main contribution of this paper is to present an almost optimal column-wise prefix-sum algorithm on the GPU. Since all elements in an input matrix must be read and the resulting prefix-sums must be written, computation of the column-wise prefix-sums cannot be faster than simple matrix duplication in the global memory of the GPU. Quite surprisingly, experimental results using an NVIDIA TITAN X show that our column-wise prefix-sum algorithm runs only 2–6% slower than matrix duplication. Thus, our column-wise prefix-sum algorithm is almost optimal.
- Published
- 2018
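In a row-major layout, the column-wise direction is the one with non-contiguous memory access, which is exactly the paper's concern on GPUs. A simple CPU-side sketch that keeps the inner work contiguous by walking rows in the outer loop:

```python
def column_prefix_sums(matrix):
    """Column-wise inclusive prefix sums of a row-major matrix (list of
    row lists). The outer loop walks rows so each step touches one
    contiguous row, mirroring the coalesced-access concern the paper
    addresses on GPUs. Illustrative sketch, not the paper's algorithm."""
    if not matrix:
        return []
    acc = [0] * len(matrix[0])   # running column sums
    out = []
    for row in matrix:
        acc = [a + x for a, x in zip(acc, row)]
        out.append(acc)
    return out
```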
41. A fast non-volatile memory aware algorithm for generating random scale-free networks
- Author
-
Mi-Yen Yeh, Tei-Wei Kuo, and Cheng-Chin Tu
- Subjects
010302 applied physics ,Random access memory ,Computer science ,Distributed computing ,Scale-free network ,02 engineering and technology ,Degree distribution ,Data structure ,01 natural sciences ,020202 computer hardware & architecture ,Data modeling ,Phase-change memory ,Non-volatile memory ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Heap (data structure) - Abstract
In this paper, we propose a method to realize the Barabasi-Albert (BA) model for generating large scale-free networks with preferential attachment. To our knowledge, the existing implementations of the BA model are still not very efficient because they fail to manage the temporary data of the network generating process properly, ignoring the inherent power-law degree distribution. To address this problem, we propose to leverage data structures including a prefix sum max heap and index arrays, which can efficiently manage nodes with different numbers of connections. The proposed method is also friendly to computing systems with non-volatile memory (NVM) as main memory. Reducing long-latency write operations is the key to improving the efficiency of NVM, and the proposed method ultimately saves not only read operations but also a significant number of writes. We compare our proposed method with the baseline methods by generating networks of size from 10^2 to 10^8 nodes. Experiment results show that the proposed method can save up to 50% of write counts. Furthermore, when using phase change memory, a new byte-addressable non-volatile memory, as main memory, the proposed method can be almost two times faster.
- Published
- 2017
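The core primitive behind such structures is degree-proportional sampling: a prefix-sum array over node degrees turns preferential attachment into a binary search. A minimal sketch (the paper's prefix sum max heap additionally supports cheap updates; this version rebuilds the prefix array on every draw for clarity):

```python
import bisect
import random

def build_prefix(degrees):
    """Inclusive prefix sums of the degree array."""
    prefix, total = [], 0
    for d in degrees:
        total += d
        prefix.append(total)
    return prefix

def preferential_pick(degrees, rng):
    """Sample a node index with probability proportional to its degree
    by binary search on the prefix-sum array. Sketch only: the paper's
    structure avoids rebuilding this array on every update."""
    prefix = build_prefix(degrees)
    r = rng.randrange(prefix[-1])
    return bisect.bisect_right(prefix, r)
```

A node of degree zero occupies an empty interval in the prefix array, so it is never selected, matching preferential attachment.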
42. High-Performance and Scalable GPU Graph Traversal
- Author
-
Michael Garland, Andrew S. Grimshaw, and Duane Merrill
- Subjects
Power graph analysis ,SPQR tree ,Dense graph ,Computer science ,Breadth-first search ,Parallel algorithm ,Graph theory ,Parallel computing ,Graph ,Computer Science Applications ,Tree traversal ,Graph bandwidth ,Computational Theory and Mathematics ,Hardware and Architecture ,Modeling and Simulation ,Graph traversal ,Prefix sum ,Software ,Implicit graph ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
- Published
- 2015
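The role prefix sums play in such traversals: summing the frontier's out-degrees yields exact write offsets, so each producer knows where to place its neighbors in a compact array without contention. A serial sketch of one level-synchronous step, using a hypothetical adjacency-dict representation:

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS where each frontier is expanded by first
    prefix-summing out-degrees into write offsets, then gathering all
    outgoing edges compactly. Serial sketch of the fine-grained GPU
    scheme; `adj` maps each vertex to its neighbor list (illustrative)."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        # Exclusive prefix sum of out-degrees -> scatter offsets.
        offsets, total = [], 0
        for v in frontier:
            offsets.append(total)
            total += len(adj[v])
        # Gather all outgoing edges into one compact array.
        edges = [None] * total
        for v, off in zip(frontier, offsets):
            for j, w in enumerate(adj[v]):
                edges[off + j] = w
        # Filter to unvisited vertices to form the next frontier.
        frontier = []
        for w in edges:
            if w not in dist:
                dist[w] = level
                frontier.append(w)
    return dist
```

On a GPU, the gather loop becomes one thread (or warp) per frontier vertex, with the offsets guaranteeing disjoint writes.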
43. A Parallel RLE Entropy Coding Technique for DICOM Images on GPGPU
- Author
-
E. Sudarshan, C. Shoba Bindu, and Ch. Satyanarayana
- Subjects
DICOM ,CUDA ,symbols.namesake ,Redundancy (information theory) ,Computer science ,Encoding (memory) ,symbols ,Prefix sum ,Parallel computing ,Entropy encoding ,General-purpose computing on graphics processing units ,Huffman coding - Abstract
The Electronic Healthcare Record (EHR) is the patient's health care data history, which retains information about diseases and their DICOM image reports. This application is widely used in the medical field, and a lossless entropy coding technique is therefore employed adaptively to remove the redundancy of an image at a precise level. Run-Length Encoding (RLE) is the basic entropy coding method adopted for DICOM images. To accelerate the entropy coding process, we deploy the parallel CUDA architecture with GPU technology: the parallelized RLE entropy coding technique is implemented on a GPGPU platform using CUDA programming. The proposed parallel RLE entropy coding technique is up to 5X faster than serial RLE and 2.25X faster than parallel Huffman coding, and the parallel RLE runs in O(log N) time.
- Published
- 2017
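The way a prefix sum makes RLE data-parallel can be sketched sequentially (an illustrative model of the idea, not the paper's CUDA code): flag each run head, scan the flags to obtain every element's output run index, then let each element deposit into its run's slot independently:

```python
def rle_encode(data):
    """Run-length encoding structured around a prefix sum, the same
    decomposition that makes the GPU version data-parallel."""
    n = len(data)
    if n == 0:
        return []
    # 1 where a new run starts, 0 elsewhere.
    heads = [1] + [1 if data[i] != data[i - 1] else 0 for i in range(1, n)]
    # Inclusive prefix sum of head flags = 1-based run index per element.
    idx = []
    total = 0
    for h in heads:
        total += h
        idx.append(total)
    runs = [[None, 0] for _ in range(total)]
    # Each element contributes to its own run's slot; on a GPU the
    # length increments would be atomic adds.
    for i in range(n):
        runs[idx[i] - 1][0] = data[i]
        runs[idx[i] - 1][1] += 1
    return [(sym, length) for sym, length in runs]
```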
44. Efficient Simulation of Nested Hollow Sphere Intersections
- Author
-
Till Köster, Kalyan S. Perumalla, and Adelinde M. Uhrmacher
- Subjects
Theoretical computer science ,Computer science ,Computation ,0206 medical engineering ,Parallel algorithm ,020207 software engineering ,02 engineering and technology ,Bottleneck ,Computational science ,CUDA ,Intersection ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Nesting (computing) ,Collision detection ,020602 bioinformatics ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
In the particle-based simulation of cell-biological systems in continuous space, a key performance bottleneck is the computation of all possible intersections between particles, which typically relies on solid-sphere approaches for collision detection. The behavior of cell-biological systems is influenced by dynamic hierarchical nesting, such as the forming of, transport within, and merging of vesicles. Existing collision detection algorithms are not designed for these types of spatial cell-biological models: nearly all existing high-performance parallel algorithms focus on solid-sphere interactions, and the known algorithms for solid-sphere intersections return more intersections than actually occur with nested hollow spheres. Here we define a new problem of computing the intersections among arbitrarily nested hollow spheres of possibly different sizes, thicknesses, positions, and nesting levels. We describe a new algorithm designed to solve this nested hollow sphere intersection problem and implement it for parallel execution on graphics processing units (GPUs). We present first results on runtime performance and scaling to hundreds of thousands of spheres, and compare the performance with that of a leading solid-object intersection package also running on GPUs.
- Published
- 2017
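The pairwise predicate underlying the new problem statement reduces to interval arithmetic; the following sketch (my own formulation, not the paper's GPU algorithm) shows why solid-sphere tests over-report: a small solid sphere sitting inside the hollow of a larger sphere overlaps it as a solid but never touches its shell.

```python
def shells_intersect(d, a_in, a_out, b_in, b_out):
    """Do two spherical shells intersect? Shell A occupies radii
    [a_in, a_out] around its centre, shell B occupies [b_in, b_out],
    and the centres are distance d apart. A common point exists iff
    some r_a in [a_in, a_out] and r_b in [b_in, b_out] satisfy the
    triangle inequality |r_a - r_b| <= d <= r_a + r_b."""
    # Smallest achievable |r_a - r_b| over the two radius intervals.
    gap = max(0.0, a_in - b_out, b_in - a_out)
    return gap <= d <= a_out + b_out
```

A solid sphere is the special case with inner radius 0; the `gap` term is exactly what the solid-sphere test (`d <= a_out + b_out` alone) discards.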
45. Parallel patterns: prefix sum
- Author
-
Li-Wen Chang and Juan Gómez-Luna
- Subjects
010302 applied physics ,Analysis of parallel algorithms ,Computer science ,Parallel algorithm ,Process (computing) ,020207 software engineering ,Work efficiency ,02 engineering and technology ,01 natural sciences ,Reduction (complexity) ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Algorithm - Abstract
This chapter introduces parallel scan (prefix sum), an important parallel computation pattern, and the concept of work efficiency for parallel algorithms. It introduces three styles of kernels: Kogge-Stone, Brent-Kung, and a two-phase hybrid. Each of these kernels presents a different tradeoff in terms of work efficiency, speed, and complexity. The chapter then introduces two hierarchical parallel scan algorithms that are designed to process arbitrarily long input lists while maintaining work efficiency.
- Published
- 2017
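A sequential model of the work-efficient (Brent-Kung style) scan described above, with its up-sweep and down-sweep phases made explicit; the padding to a power of two and the generic operator are conventional choices, not specific to the chapter:

```python
def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Work-efficient exclusive scan in the up-sweep / down-sweep
    style; a sequential model of the O(n)-work kernel."""
    n = len(a)
    if n == 0:
        return []
    size = 1 << (n - 1).bit_length()        # pad to a power of two
    tree = list(a) + [identity] * (size - n)
    # Up-sweep (reduce): build partial sums in place.
    d = 1
    while d < size:
        for i in range(2 * d - 1, size, 2 * d):
            tree[i] = op(tree[i - d], tree[i])
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down.
    tree[size - 1] = identity
    d = size // 2
    while d >= 1:
        for i in range(2 * d - 1, size, 2 * d):
            t = tree[i - d]
            tree[i - d] = tree[i]
            tree[i] = op(t, tree[i])
        d //= 2
    return tree[:n]
```

Both sweeps perform O(n) applications of `op` in total, which is the work-efficiency advantage over the O(n log n) Kogge-Stone pattern; on hardware, each `for` loop body runs as one parallel step.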
46. Sparse Prefix Sums
- Author
-
Michael Shekelyan, Johann Gamper, and Anton Dignös
- Subjects
Prefix ,Constant (computer programming) ,Computer science ,Prefix sum ,Overhead (computing) ,Algorithm - Abstract
The prefix sum approach is a powerful technique to answer range-sum queries over multi-dimensional arrays in constant time by requiring only a few look-ups in an array of precomputed prefix sums. In this paper, we propose the sparse prefix sum approach that is based on relative prefix sums and exploits sparsity in the data to vastly reduce the storage costs for the prefix sums. The proposed approach has desirable theoretical properties and works well in practice. It is the first approach achieving constant query time with sub-linear update costs and storage costs for range-sum queries over sparse low-dimensional arrays. Experiments on real-world data sets show that the approach reduces storage costs by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
- Published
- 2017
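The dense baseline that sparse prefix sums improve upon can be sketched in two dimensions: precompute a table with a zero border, after which any rectangular range sum is four look-ups by inclusion-exclusion (this is the classic dense scheme, not the paper's sparse structure):

```python
def build_prefix(grid):
    """Dense 2-D prefix-sum table P with a zero border:
    P[i][j] = sum of grid[0..i-1][0..j-1]."""
    rows, cols = len(grid), len(grid[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def range_sum(P, r1, c1, r2, c2):
    """Sum over grid[r1..r2][c1..c2] in O(1) by inclusion-exclusion."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]
```

The table costs as much storage as the data itself even when the grid is mostly zeros, which is the overhead the sparse approach attacks.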
47. Calculating Parallel Programs in Coq using List Homomorphisms
- Author
-
Julien Tesson, Wadoud Bousdira, Frédéric Loulergue, Université d'Orléans (UO), Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Université d'Orléans (UO)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA), Design, study and implementation of languages for proofs and programs (PI.R2), Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique Fondamentale (IRIF (UMR_8243)), Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire d'Algorithmique Complexité et Logique (LACL), Université Paris-Est Créteil Val-de-Marne - Paris 12 (UPEC UP12), ANR-10-INTB-0205,PAPDAS,Developpement de programmes parallèles avec des squelettes algorithmiques(2010), Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université d'Orléans (UO)
- Subjects
Functional programming ,Source code ,[INFO.INFO-PL]Computer Science [cs]/Programming Languages [cs.PL] ,Computer science ,Programming language ,media_common.quotation_subject ,Proof assistant ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Theoretical Computer Science ,Bulk synchronous parallel ,Set (abstract data type) ,020204 information systems ,TheoryofComputation_LOGICSANDMEANINGSOFPROGRAMS ,Core (graph theory) ,0202 electrical engineering, electronic engineering, information engineering ,Prefix sum ,Algorithmic skeleton ,computer ,Software ,Information Systems ,media_common - Abstract
International audience; SyDPaCC is a set of libraries for the Coq proof assistant. It allows one to write naive functional programs (i.e., with high complexity) that are considered as specifications, and to transform them into more efficient versions. These more efficient versions can then be automatically parallelised before being extracted from Coq into source code for the functional language OCaml, together with calls to the Bulk Synchronous Parallel ML (BSML) library. In this paper we present a new core version of SyDPaCC for the development of parallel programs correct-by-construction, using the theory of list homomorphisms and algorithmic skeletons implemented and verified in Coq. The framework is illustrated on the maximum prefix sum problem.
- Published
- 2017
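The maximum prefix sum example works because mps becomes a list homomorphism once it is tupled with the segment sum; below is a plain Python rendering of that associative combine (SyDPaCC itself works in Coq/OCaml, so this only illustrates the underlying algebra):

```python
from functools import reduce

def hom_combine(left, right):
    """Combine (mps, sum) summaries of two adjacent list segments.
    This operator is associative, which is what makes maximum prefix
    sum divide-and-conquer (hence BSP) parallelisable."""
    mps_l, sum_l = left
    mps_r, sum_r = right
    # The best prefix either stays in the left segment, or spans all
    # of the left segment plus a prefix of the right one.
    return (max(mps_l, sum_l + mps_r), sum_l + sum_r)

def mps(xs):
    """Maximum prefix sum (the empty prefix counts, so mps >= 0)."""
    if not xs:
        return 0
    best, _ = reduce(hom_combine, [(max(x, 0), x) for x in xs])
    return best
```

mps alone is not a homomorphism; the tupling with the sum is the standard "generalise, then derive" step that the calculational approach mechanises.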
48. Linear-time accurate lattice algorithms for tail conditional expectation
- Author
-
William W. Y. Hsu, Jan-Ming Ho, Bryant Chen, and Ming-Yang Kao
- Subjects
High Energy Physics::Lattice ,Extrapolation ,Trinomial ,Conditional expectation ,Computer Science Applications ,Computational Mathematics ,Interpolation error ,Lattice (order) ,Prefix sum ,Computer Vision and Pattern Recognition ,Algorithm ,Time complexity ,Finance ,Value at risk ,Mathematics - Abstract
This paper proposes novel lattice algorithms to compute the tail conditional expectation of European calls and puts in linear time. We incorporate the prefix-sum technique into tilting, trinomial, and extrapolation algorithms, as well as syntheses of these algorithms. Furthermore, we introduce fractional-step lattices to help reduce interpolation error in the extrapolation algorithms. We demonstrate the efficiency and accuracy of these algorithms with numerical results. A key finding is that combining the techniques of tilting lattices, extrapolation, and fractional steps substantially increases speed and accuracy.
- Published
- 2014
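The linear-time role of the prefix sums can be illustrated with a toy discrete analogue (names and setting are my own, and far simpler than the lattice algorithms in the paper): precompute running sums of probability and probability-weighted value over sorted terminal values, so the tail conditional expectation beyond any threshold needs only a binary search and two look-ups:

```python
import bisect

def tail_expectation_table(values, probs):
    """Given sorted terminal values and their probabilities, precompute
    suffix sums (prefix sums from the right) of probability and of
    probability-weighted value, so E[X | X >= t] is answered with one
    binary search per threshold t."""
    n = len(values)
    p_suf = [0.0] * (n + 1)
    pv_suf = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        p_suf[i] = p_suf[i + 1] + probs[i]
        pv_suf[i] = pv_suf[i + 1] + probs[i] * values[i]

    def tce(t):
        i = bisect.bisect_left(values, t)
        if p_suf[i] == 0.0:
            return 0.0            # no mass in the tail
        return pv_suf[i] / p_suf[i]

    return tce
```

Building the table is O(n), after which every threshold query is O(log n), mirroring how a single linear pass over a lattice's terminal nodes can serve many tail computations.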
49. Using Prefix Sums to Implement Complex Queries in a Multidimensional Data Space
- Subjects
query ,lcsh:T58.5-58.64 ,lcsh:Information technology ,multi-dimensional space ,prefix sum ,Computer Science::Databases ,the data - Abstract
The paper considers the problem of developing and applying optimization techniques for aggregate queries over multidimensional databases by finding total values in a selected range of a d-dimensional cube. A method of implementing complex queries is described that provides high-speed performance for all requests to the system via the formation of an additional array of prefix sums. An example of computing prefix sums for the analysis of environmental data on emissions is given.
- Published
- 2014
50. AutoProof meets some verification challenges
- Author
-
Martin Nordio, Julian Tschannen, and Carlo A. Furia
- Subjects
Functional verification ,Computer science ,business.industry ,Programming language ,Automated Program verification ,Verification challenges ,Experience report ,Eiffel ,Functional correctness ,computer.software_genre ,Binary search tree ,Theory of computation ,Prefix sum ,Verification ,Software engineering ,business ,computer ,Software ,Information Systems ,computer.programming_language - Abstract
AutoProof is an automatic verifier for functional properties of programs written in Eiffel. This paper illustrates some of AutoProof's capabilities when tackling the three challenges of the VerifyThis verification competition held at FM 2012, as well as three other problems proposed in related events. AutoProof's design focuses on making it practically applicable with reduced user effort. Tackling the challenges demonstrates to what extent this design goal is met in the current implementation: while some of AutoProof's current limitations prevent us from verifying the complete specification of the prefix sum and binary search tree algorithms, we can still prove some partial properties on interesting special cases, with the advantage of requiring little or no specification. International Journal on Software Tools for Technology Transfer, 17 (6), ISSN:1433-2779, ISSN:1433-2787
- Published
- 2014