Author: "Cheng, Xiaohe" / Publication Type: Electronic Resources - Searchworks@Jio Institute Digital Library Search Results

Your search for "Cheng, Xiaohe" returned 9 results (three distinct titles, each returned three times).

1. A hypervisor for shared-memory FPGA platforms

Author: Ma, Jiacheng; Zuo, Gefei; Loughlin, Kevin; Cheng, Xiaohe; Liu, Yanqiang; Eneyew, Abel Mulugeta; Qi, Zhengwei; and Kasikci, Baris
Abstract: Cloud providers widely deploy FPGAs as application-specific accelerators for customer use. These providers seek to multiplex their FPGAs among customers via virtualization, thereby reducing running costs. Unfortunately, most virtualization support is confined to FPGAs that expose a restrictive, host-centric programming model in which accelerators cannot issue direct memory accesses (DMAs). The host-centric model incurs high runtime overhead for workloads that exhibit pointer chasing. Thus, FPGAs are beginning to support a shared-memory programming model in which accelerators can issue DMAs. However, virtualization support for shared-memory FPGAs is limited. This paper presents Optimus, the first hypervisor that supports scalable shared-memory FPGA virtualization. Optimus offers both spatial multiplexing and temporal multiplexing to provide efficient and flexible sharing of each accelerator on an FPGA. To share the FPGA-CPU interconnect at a high clock frequency, Optimus implements a multiplexer tree. To isolate each guest's address space, Optimus introduces the technique of page table slicing as a hardware-software co-design. To support preemptive temporal multiplexing, Optimus provides an accelerator preemption interface. We show that Optimus supports eight physical accelerators on a single FPGA and improves the aggregate throughput of twelve real-world benchmarks by 1.98x-7x. © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Published: 2020

4. Scalable and practical locking with shuffling

Author: Kashyap, Sanidhya; Calciu, Irina; Cheng, Xiaohe; Min, Changwoo; and Kim, Taesoo
Abstract: Locks are an essential building block for high-performance multicore system software. To meet performance goals, lock algorithms have evolved towards specialized solutions for architectural characteristics (e.g., NUMA). However, in practice, applications run on different server platforms and exhibit widely diverse behaviors that evolve with time (e.g., number of threads, number of locks). This creates performance and scalability problems for locks optimized for a single scenario and platform. For example, popular spinlocks suffer from excessive cache-line bouncing in NUMA systems, while scalable, NUMA-aware locks exhibit sub-par single-thread performance. In this paper, we identify four dominating factors that impact the performance of lock algorithms. We then propose a new technique, shuffling, that can dynamically accommodate all these factors, without slowing down the critical path of the lock. The key idea of shuffling is to re-order the queue of threads waiting to acquire the lock in accordance with some pre-established policy. For best performance, this work is done off the critical path, by the waiter threads. Using shuffling, we demonstrate how to achieve NUMA-awareness and implement an efficient parking/wake-up strategy, without any auxiliary data structure, mostly off the critical path. The evaluation shows that our family of locks based on shuffling improves the throughput of real-world applications up to 12.5×, with impressive memory footprint reduction compared with the recent lock algorithms. © 2019 Copyright held by the owner/author(s).
Published: 2019
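
The queue re-ordering idea in the abstract above can be illustrated with a minimal sketch. It assumes a NUMA-grouping policy and a plain list as the wait queue; the names `Waiter` and `shuffle_by_socket` are hypothetical and do not come from the paper. It shows only the ordering policy, not a working lock, and says nothing about the synchronization or fairness limits a real lock would need.

```python
from typing import List, NamedTuple


class Waiter(NamedTuple):
    thread_id: int
    socket: int  # hypothetical NUMA socket id of the waiting thread


def shuffle_by_socket(queue: List[Waiter]) -> List[Waiter]:
    """Return a reordered copy of the wait queue.

    Assumed policy: waiters on the same socket as the head of the queue are
    moved up directly behind it; all other waiters keep their relative order.
    """
    if not queue:
        return []
    head = queue[0]
    same_socket = [w for w in queue[1:] if w.socket == head.socket]
    other_socket = [w for w in queue[1:] if w.socket != head.socket]
    return [head] + same_socket + other_socket


if __name__ == "__main__":
    waiting = [Waiter(1, 0), Waiter(2, 1), Waiter(3, 0), Waiter(4, 1), Waiter(5, 0)]
    print([w.thread_id for w in shuffle_by_socket(waiting)])
```

Grouping waiters from one socket means consecutive lock holders tend to share a socket, which is the kind of NUMA-awareness the abstract refers to.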

7. Optimizing the Fast Fourier Transform using Mixed Precision on Tensor Core Hardware

Author: Sorna, Anumeena; Cheng, Xiaohe; D'Azevedo, Eduardo; Wong, Kwai; and Tomov, Stanimire
Abstract: The Fast Fourier Transform is a fundamental tool in scientific and technical computation. The highly parallelizable nature of the algorithm makes it a suitable candidate for GPU acceleration. This paper focuses on exploiting the speedup due to using the half precision multiplication capability of the latest GPUs' tensor core hardware without significantly degrading the precision of the Fourier Transform result. We develop an algorithm that dynamically splits the input single precision dataset into two half precision sets at the lowest level, uses half precision multiplication, and recombines the result at a later step. This work paves the way for using tensor cores for high precision inputs.
Published: 2018
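
The split-and-recombine step described in the abstract above can be sketched for a single scalar. This is an assumption-laden illustration, not the paper's algorithm: the helper names are hypothetical, half precision is emulated with the standard-library `struct` format `e`, and ordinary Python floats stand in for the higher-precision accumulator.

```python
import struct
from typing import Tuple


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def split_to_halves(x: float) -> Tuple[float, float]:
    """Split x into a half-precision leading part and a half-precision residual."""
    hi = to_half(x)       # leading part, representable in half precision
    lo = to_half(x - hi)  # what the leading part missed, also rounded to half
    return hi, lo


def mixed_precision_product(w: float, x: float) -> float:
    """Multiply using only half-precision operands, then recombine the two parts."""
    w_half = to_half(w)
    hi, lo = split_to_halves(x)
    # Two multiplications with half-precision inputs, summed at higher precision.
    return w_half * hi + w_half * lo


if __name__ == "__main__":
    x = 0.1234567
    hi, lo = split_to_halves(x)
    print("error with one half:  ", abs(hi - x))
    print("error with two halves:", abs(hi + lo - x))
```

Keeping the residual as a second half-precision value recovers most of the accuracy that a single rounding to half precision would lose.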

