1. Novel accelerated methods for convolution neural network with matrix core.
- Authors
- Guo, Yijie; Lu, Lu; Zhu, Songxiang
- Subjects
- *CONVOLUTIONAL neural networks; *DEEP learning; *PARALLEL programming; *MATRICES (Mathematics)
- Abstract
The powerful parallel computing capability of GPUs and the recent development of matrix processing units offer new opportunities to improve the performance of convolutional neural networks (CNNs) on GPUs. For the Winograd convolution algorithm, the most widely used and best-performing convolution method in CNNs, several tuning efforts already exist, but they all neglect the matrix operation units and therefore fail to exploit the GPU's full computing resources. This paper presents a single-precision accelerated CNN solution on GPUs. Guided by architectural indicators, the optimal data layout, grid division, and block division are derived. To handle the variety of padding configurations that arise in practice, an efficient dynamic padding scheme is designed, and a pipelined algorithm with operator fusion is implemented on top of the matrix cores. AMD's deep learning acceleration library MIOpen serves as the baseline. Using several convolutional layers of ResNet50 as experimental input, the evaluation shows that the proposed approach outperforms MIOpen with a speedup of 1.41x on the MI210 and reaches 74% of the GPU's single-precision peak performance. Applied to the training and inference of ResNet50, it achieves a speedup of 1.68x. [ABSTRACT FROM AUTHOR]
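For readers unfamiliar with the Winograd algorithm the abstract builds on, the following is a minimal sketch of the standard one-dimensional F(2,3) minimal-filtering case (two outputs of a 3-tap filter). The transform matrices `B_T`, `G`, and `A_T` are the well-known Winograd/Toom-Cook constants, not taken from the paper; the paper's contribution lies in mapping the multi-channel 2-D version of this computation onto GPU matrix cores, which this sketch does not attempt.

```python
def matvec(M, v):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Standard Winograd F(2,3) transforms: input (B^T), filter (G), output (A^T).
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G   = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]

def winograd_f23(d, g):
    """Two outputs of a 3-tap convolution over a 4-element input tile,
    using 4 elementwise multiplies instead of the direct method's 6."""
    U = matvec(G, g)                    # transformed filter
    V = matvec(B_T, d)                  # transformed input tile
    M = [u * v for u, v in zip(U, V)]   # elementwise (Hadamard) product
    return matvec(A_T, M)               # inverse transform to outputs

def direct_conv(d, g):
    """Reference: direct sliding-window correlation."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
```

In the 2-D, multi-channel setting, the elementwise-product stage becomes a batch of small matrix multiplications across input channels; that GEMM-shaped stage is the part a Winograd implementation can hand to the GPU's matrix cores, which is where the paper's optimizations apply.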
- Published
- 2023