Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling
- Authors
- Po-Han Wang, Hsiang-Yun Cheng, Li-Jhan Chen, and Chia-Lin Yang
- Subjects
- Computer science, Locality, Parallel computing, Thread (computing), Win32 Thread Information Block, Instruction set, Shared memory, Hardware and Architecture, Locality of reference, Cache, Cache algorithms
- Abstract
Modern GPGPUs support the concurrent execution of thousands of threads to provide an energy-efficient platform. However, the massive multithreading of GPGPUs incurs serious cache contention, as the cache lines brought in by one thread can easily be evicted by other threads in the small shared cache. In this paper, we propose a software-hardware cooperative approach that exploits the spatial locality among different thread blocks to better utilize the precious cache capacity. Through dynamic locality estimation and thread block scheduling, we can capture more performance improvement opportunities than prior work that only explores the spatial locality between consecutive thread blocks. Evaluations across diverse GPGPU applications show that, on average, our locality-aware scheduler provides performance improvements of 25 and 9 percent over the commonly employed round-robin scheduler and a state-of-the-art scheduler, respectively.
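The abstract's key observation is inter-thread-block spatial locality: neighboring thread blocks often touch overlapping cache lines, so dispatching them to the same streaming multiprocessor lets the later block hit in lines the earlier one already fetched. The CUDA kernel below is a minimal illustrative sketch, not code from the paper; the kernel name and the 3-point stencil access pattern are assumptions chosen only to make the cache-line sharing at block boundaries concrete.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// 3-point stencil: thread block b reads in[b*B - 1 .. b*B + B], so
// consecutive blocks touch the same boundary cache lines (illustrative
// example, not from the paper). A scheduler that dispatches blocks b and
// b+1 to the same SM lets b+1 reuse lines b already brought into L1;
// round-robin dispatch scatters them across SMs and forfeits that reuse.
__global__ void stencil3(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20, threads = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    stencil3<<<(n + threads - 1) / threads, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %f\n", out[1]);  // expect 1.000000
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With 256 threads per block and 128-byte cache lines, blocks b and b+1 share the line holding the elements around index 256*(b+1). Per the abstract, the paper's dynamic locality estimation aims to capture such sharing between arbitrary, not only consecutive, thread blocks at run time.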
- Published
- 2017