DENSE MATRIX-VECTOR MULTIPLICATION ON THE CUDA ARCHITECTURE.

Authors :: FUJIMOTO, NORIYUKI
Source :: Parallel Processing Letters. Dec 2008, Vol. 18 Issue 4, p511-530. 20p. 8 Diagrams, 1 Chart, 5 Graphs.
Publication Year :: 2008
Abstract: Recently, GPUs have acquired the ability to perform fast general-purpose computation by running thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experiments are conducted on a PC with a GeForce 8800GTX GPU and a 2.0 GHz Intel Xeon E5335 CPU. The results show that the proposed algorithm runs up to 11.19 times faster than NVIDIA's BLAS library CUBLAS 1.1 on the GPU and up to 35.15 times faster than the Intel Math Kernel Library 9.1 on a single x86 core with SSE3 SIMD instructions. The performance of Jacobi's iterative method for solving linear equations, measured including the data transfer time between CPU and GPU, shows that the proposed algorithm is practical for real applications. [ABSTRACT FROM AUTHOR]
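
Illustrative sketch (not from the paper): the abstract does not describe the proposed GPU algorithm, so the Python code below only shows, on the CPU, what a dense matrix-vector product computes and how Jacobi's method can be built on top of it. The function names `matvec` and `jacobi`, the row-list matrix layout, the zero initial guess, and the fixed iteration count are all assumptions made for this example.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x, with A stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def jacobi(A, b, iterations=50):
    """Jacobi iteration for A x = b, starting from x = 0.

    Update rule: x_new[i] = (b[i] - sum over j != i of A[i][j] * x[j]) / A[i][i].
    Assumes every diagonal entry A[i][i] is nonzero; convergence is guaranteed
    for strictly diagonally dominant A, not in general.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        Ax = matvec(A, x)  # one dense product per sweep
        # Ax[i] already contains A[i][i] * x[i], so the update rule can be
        # rewritten as x[i] + (b[i] - Ax[i]) / A[i][i].
        x = [x[i] + (b[i] - Ax[i]) / A[i][i] for i in range(n)]
    return x


# Small diagonally dominant system whose exact solution is [1, 2].
A = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]
solution = jacobi(A, b)
```

Each Jacobi sweep performs exactly one dense matrix-vector product, which is why the speed of that product largely determines the speed of the solver as a whole.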