Author: "Peng, Hongwu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Peng, Hongwu"' showing total 102 results

1. RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks

Author: Xie, Xi, Luo, Yuebo, Peng, Hongwu, and Ding, Caiwen
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Top-k algorithms are essential in various applications, from high-performance computing and information retrieval to big data and neural network model training. This paper introduces RTop-K, a highly efficient parallel row-wise top-k selection algorithm designed for GPUs. RTop-K employs a Binary Search-based approach to optimize resource allocation and provides a scalable solution that significantly accelerates top-k operations. We perform a theoretical analysis of the effects of early stopping in our algorithm, demonstrating that it maintains the accuracy of neural network models while enhancing performance. Comprehensive tests show that our GPU implementation of RTop-K outperforms other row-wise top-k GPU implementations, with minimal impact on testing accuracy when early stopping is applied. Notably, RTop-K achieves speed increases ranging from 4.245$\times$ to 9.506$\times$ with early stopping, and 3.936$\times$ without early stopping, compared to state-of-the-art implementations. The proposed methods offer significant improvements in the training and inference of Graph Neural Networks (GNNs), addressing critical challenges in latency and throughput on GPU platforms., Comment: Need to improve the experiment part
Published: 2024

2. APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking

Author: Jin, Can, Peng, Hongwu, Zhao, Shiyu, Wang, Zhenting, Xu, Wujiang, Han, Ligong, Zhao, Jiahui, Zhong, Kai, Rajasekaran, Sanguthevar, and Metaxas, Dimitris N.
Subjects: Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs. Code is available at https://github.com/jincan333/APEER.
Published: 2024

3. SSNet: A Lightweight Multi-Party Computation Scheme for Practical Privacy-Preserving Machine Learning Service in the Cloud

Author: Duan, Shijin, Wang, Chenghong, Peng, Hongwu, Luo, Yukui, Wen, Wujie, Ding, Caiwen, and Xu, Xiaolin
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: As privacy-preserving becomes a pivotal aspect of deep learning (DL) development, multi-party computation (MPC) has gained prominence for its efficiency and strong security. However, the practice of current MPC frameworks is limited, especially when dealing with large neural networks, exemplified by the prolonged execution time of 25.8 seconds for secure inference on ResNet-152. The primary challenge lies in the reliance of current MPC approaches on additive secret sharing, which incurs significant communication overhead with non-linear operations such as comparisons. Furthermore, additive sharing suffers from poor scalability on party size. In contrast, the evolving landscape of MPC necessitates accommodating a larger number of compute parties and ensuring robust performance against malicious activities or computational failures. In light of these challenges, we propose SSNet, which, for the first time, employs Shamir's secret sharing (SSS) as the backbone of an MPC-based ML framework. We meticulously develop all framework primitives and operations for secure DL models tailored to seamlessly integrate with the SSS scheme. SSNet demonstrates the ability to scale up party numbers straightforwardly and embeds strategies to authenticate the computation correctness without incurring significant performance overhead. Additionally, SSNet introduces masking strategies designed to reduce communication overhead associated with non-linear operations. We conduct comprehensive experimental evaluations on commercial cloud computing infrastructure from Amazon AWS, as well as across diverse prevalent DNN models and datasets. SSNet demonstrates a substantial performance boost, achieving speed-ups ranging from 3x to 14x compared to SOTA MPC frameworks. Moreover, SSNet also represents the first framework that is evaluated on a five-party computation setup, in the context of secure DL inference., Comment: 16 pages, 9 figures
Published: 2024

4. Learning from Teaching Regularization: Generalizable Correlations Should be Easy to Imitate

Author: Jin, Can, Che, Tong, Peng, Hongwu, Li, Yiyuan, Metaxas, Dimitris N., and Pavone, Marco
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Generalization remains a central challenge in machine learning. In this work, we propose Learning from Teaching (LoT), a novel regularization technique for deep neural networks to enhance generalization. Inspired by the human ability to capture concise and abstract patterns, we hypothesize that generalizable correlations are expected to be easier to imitate. LoT operationalizes this concept to improve generalization of the main model with auxiliary student learners. The student learners are trained by the main model and, in turn, provide feedback to help the main model capture more generalizable and imitable correlations. Our experimental results across several domains, including Computer Vision, Natural Language Processing, and methodologies like Reinforcement Learning, demonstrate that the introduction of LoT brings significant benefits compared to training models on the original dataset. The results suggest the effectiveness and efficiency of LoT in identifying generalizable information at the right scales while discarding spurious data correlations, thus making LoT a valuable addition to current machine learning. Code is available at https://github.com/jincan333/LoT.
Published: 2024

5. Zero-Space Cost Fault Tolerance for Transformer-based Language Models on ReRAM

Author: Li, Bingbing, Yuan, Geng, Wang, Zigeng, Huang, Shaoyi, Peng, Hongwu, Behnam, Payman, Wen, Wujie, Liu, Hang, and Ding, Caiwen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Hardware Architecture
Abstract: Resistive Random Access Memory (ReRAM) has emerged as a promising platform for deep neural networks (DNNs) due to its support for parallel in-situ matrix-vector multiplication. However, hardware failures, such as stuck-at-fault defects, can result in significant prediction errors during model inference. While additional crossbars can be used to address these failures, they come with storage overhead and are not efficient in terms of space, energy, and cost. In this paper, we propose a fault protection mechanism that incurs zero space cost. Our approach includes: 1) differentiable structure pruning of rows and columns to reduce model redundancy, 2) weight duplication and voting for robust output, and 3) embedding duplicated most significant bits (MSBs) into the model weight. We evaluate our method on nine tasks of the GLUE benchmark with the BERT model, and experimental results prove its effectiveness.
Published: 2024

6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Author: Cai, Tianle, Li, Yuhong, Geng, Zhengyang, Peng, Hongwu, Lee, Jason D., Chen, Deming, and Dao, Tri
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x., Comment: The code for this implementation is available at https://github.com/FasterDecoding/Medusa
Published: 2024

7. MaxK-GNN: Extremely Fast GPU Kernel Design for Accelerating Graph Neural Networks Training

Author: Peng, Hongwu, Xie, Xi, Shivdikar, Kaustubh, Hasan, MD Amit, Zhao, Jiahui, Huang, Shaoyi, Khan, Omer, Kaeli, David, and Ding, Caiwen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, I.2, C.5
Abstract: In the acceleration of deep neural network training, the GPU has become the mainstream platform. GPUs face substantial challenges on GNNs, such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges but memory traffic is still significant. We argue that drastic performance improvements can only be achieved by the vertical optimization of algorithm and system innovations, rather than treating the speedup optimization as an "after-thought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, an advanced high-performance GPU training system integrating algorithm and system innovation. (i) We introduce the MaxK nonlinearity and provide a theoretical analysis of MaxK nonlinearity as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and index of the feature matrix after nonlinearity; (ii) We design a coalescing enhanced forward computation with row-wise product-based SpGEMM Kernel using CBSR for input feature matrix fetching and strategic placement of a sparse output accumulation buffer in shared memory; (iii) We develop an optimized backward computation with outer product-based and SSpMM Kernel. We conduct extensive evaluations of MaxK-GNN and report the end-to-end system run-time. Experiments show that MaxK-GNN system could approach the theoretical speedup limit according to Amdahl's law. We achieve comparable accuracy to SOTA GNNs, but at a significantly increased speed: 3.22/4.24 times speedup (vs. theoretical limits, 5.52/7.27 times) on Reddit compared to DGL and GNNAdvisor implementations., Comment: ASPLOS 2024 accepted publication
Published: 2023

8. Advanced Large Language Model (LLM)-Driven Verilog Development: Enhancing Power, Performance, and Area Optimization in Code Synthesis

Author: Thorat, Kiran, Zhao, Jiahui, Liu, Yaotian, Peng, Hongwu, Xie, Xi, Lei, Bin, Zhang, Jeff, and Ding, Caiwen
Subjects: Computer Science - Machine Learning
Abstract: The increasing use of Advanced Language Models (ALMs) in diverse sectors, particularly due to their impressive capability to generate top-tier content following linguistic instructions, forms the core of this investigation. This study probes into ALMs' deployment in electronic hardware design, with a specific emphasis on the synthesis and enhancement of Verilog programming. We introduce an innovative framework, crafted to assess and amplify ALMs' productivity in this niche. The methodology commences with the initial crafting of Verilog programming via ALMs, succeeded by a distinct dual-stage refinement protocol. The premier stage prioritizes augmenting the code's operational and linguistic precision, while the latter stage is dedicated to aligning the code with Power-Performance-Area (PPA) benchmarks, a pivotal component in proficient hardware design. This bifurcated strategy, merging error remediation with PPA enhancement, has yielded substantial upgrades in the caliber of ALM-created Verilog programming. Our framework achieves an 81.37% rate in linguistic accuracy and 62.0% in operational efficacy in programming synthesis, surpassing current leading-edge techniques, such as 73% in linguistic accuracy and 46% in operational efficacy. These findings illuminate ALMs' aptitude in tackling complex technical domains and signal a positive shift in the mechanization of hardware design operations.
Published: 2023

9. Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Author: Peng, Hongwu, Ding, Caiwen, Geng, Tong, Choudhury, Sutanay, Barker, Kevin, and Li, Ang
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Performance, C.4
Abstract: The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks. This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field., Comment: ICPE 2024 accepted publication
Published: 2023

10. LinGCN: Structural Linearized Graph Convolutional Network for Homomorphically Encrypted Inference

Author: Peng, Hongwu, Ran, Ran, Luo, Yukui, Zhao, Jiahui, Huang, Shaoyi, Thorat, Kiran, Geng, Tong, Wang, Chenghong, Xu, Xiaolin, Wen, Wujie, and Ding, Caiwen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, E.3, I.2, B.0
Abstract: The growth of Graph Convolution Network (GCN) model sizes has revolutionized numerous applications, surpassing human performance in areas such as personal healthcare and financial systems. The deployment of GCNs in the cloud raises privacy concerns due to potential adversarial attacks on client data. To address security concerns, Privacy-Preserving Machine Learning (PPML) using Homomorphic Encryption (HE) secures sensitive client data. However, it introduces substantial computational overhead in practical applications. To tackle those challenges, we present LinGCN, a framework designed to reduce multiplication depth and optimize the performance of HE-based GCN inference. LinGCN is structured around three key elements: (1) A differentiable structural linearization algorithm, complemented by a parameterized discrete indicator function, co-trained with model weights to meet the optimization goal. This strategy promotes fine-grained node-level non-linear location selection, resulting in a model with minimized multiplication depth. (2) A compact node-wise polynomial replacement policy with a second-order trainable activation function, steered towards superior convergence by a two-level distillation approach from an all-ReLU based teacher model. (3) An enhanced HE solution that enables finer-grained operator fusion for node-wise activation functions, further reducing multiplication level consumption in HE-based inference. Our experiments on the NTU-XVIEW skeleton joint dataset reveal that LinGCN excels in latency, accuracy, and scalability for homomorphically encrypted inference, outperforming solutions such as CryptoGCN. Remarkably, LinGCN achieves a 14.2x latency speedup relative to CryptoGCN, while preserving an inference accuracy of 75% and notably reducing multiplication depth., Comment: NeurIPS 2023 accepted publication
Published: 2023

11. Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks

Author: Xie, Xi, Peng, Hongwu, Hasan, Amit, Huang, Shaoyi, Zhao, Jiahui, Fang, Haowen, Zhang, Wei, Geng, Tong, Khan, Omer, and Ding, Caiwen
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning, I.2, B.6, C.3
Abstract: Graph Convolutional Networks (GCNs) are pivotal in extracting latent information from graph data across various domains, yet their acceleration on mainstream GPUs is challenged by workload imbalance and memory access irregularity. To address these challenges, we present Accel-GCN, a GPU accelerator architecture for GCNs. The design of Accel-GCN encompasses: (i) a lightweight degree sorting stage to group nodes with similar degree; (ii) a block-level partition strategy that dynamically adjusts warp workload sizes, enhancing shared memory locality and workload balance, and reducing metadata overhead compared to designs like GNNAdvisor; (iii) a combined warp strategy that improves memory coalescing and computational parallelism in the column dimension of dense matrices. Utilizing these principles, we formulated a kernel for sparse matrix multiplication (SpMM) in GCNs that employs block-level partitioning and combined warp strategy. This approach augments performance and multi-level memory efficiency and optimizes memory bandwidth by exploiting memory coalescing and alignment. Evaluation of Accel-GCN across 18 benchmark graphs reveals that it outperforms cuSPARSE, GNNAdvisor, and graph-BLAST by factors of 1.17 times, 1.86 times, and 2.94 times respectively. The results underscore Accel-GCN as an effective solution for enhancing GCN computational efficiency., Comment: ICCAD 2023 accepted publication
Published: 2023

12. AutoReP: Automatic ReLU Replacement for Fast Private Network Inference

Author: Peng, Hongwu, Huang, Shaoyi, Zhou, Tong, Luo, Yukui, Wang, Chenghong, Wang, Zigeng, Zhao, Jiahui, Xie, Xi, Li, Ang, Geng, Tony, Mahmood, Kaleel, Wen, Wujie, Xu, Xiaolin, and Ding, Caiwen
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning, E.3, I.2, B.0
Abstract: The growth of the Machine-Learning-As-A-Service (MLaaS) market has highlighted clients' data privacy and security issues. Private inference (PI) techniques using cryptographic primitives offer a solution but often have high computation and communication costs, particularly with non-linear operators like ReLU. Many attempts to reduce ReLU operations exist, but they may need heuristic threshold selection or cause substantial accuracy loss. This work introduces AutoReP, a gradient-based approach to lessen non-linear operators and alleviate these issues. It automates the selection of ReLU and polynomial functions to speed up PI applications and introduces distribution-aware polynomial approximation (DaPa) to maintain model expressivity while accurately approximating ReLUs. Our experimental results demonstrate significant accuracy improvements of 6.12% (94.31%, 12.9K ReLU budget, CIFAR-10), 8.39% (74.92%, 12.9K ReLU budget, CIFAR-100), and 9.45% (63.69%, 55K ReLU budget, Tiny-ImageNet) over current state-of-the-art methods, e.g., SNL. Moreover, AutoReP is applied to EfficientNet-B2 on the ImageNet dataset and achieves 75.55% accuracy with a 176.1 times ReLU budget reduction., Comment: ICCV 2023 accepted publication
Published: 2023

13. PASNet: Polynomial Architecture Search Framework for Two-party Computation-based Secure Neural Network Deployment

Author: Peng, Hongwu, Zhou, Shanglin, Luo, Yukui, Xu, Nuo, Duan, Shijin, Ran, Ran, Zhao, Jiahui, Wang, Chenghong, Geng, Tong, Wen, Wujie, Xu, Xiaolin, and Ding, Caiwen
Subjects: Computer Science - Cryptography and Security, E.3, I.2, B.0
Abstract: Two-party computation (2PC) is promising to enable privacy-preserving deep learning (DL). However, the 2PC-based privacy-preserving DL implementation comes with high comparison protocol overhead from the non-linear operators. This work presents PASNet, a novel systematic framework that enables low latency, high energy efficiency & accuracy, and security-guaranteed 2PC-DL by integrating the hardware latency of the cryptographic building block into the neural architecture search loss function. We develop a cryptographic hardware scheduler and the corresponding performance model for Field Programmable Gate Arrays (FPGA) as a case study. The experimental results demonstrate that our light-weighted model PASNet-A and heavily-weighted model PASNet-B achieve 63 ms and 228 ms latency on private inference on ImageNet, which are 147 and 40 times faster than the SOTA CryptGPU system, and achieve 70.54% & 78.79% accuracy and more than 1000 times higher energy efficiency., Comment: DAC 2023 accepted publication; a short version was published at the AAAI 2023 workshop on DL-Hardware Co-Design for AI Acceleration: RRNet: Towards ReLU-Reduced Neural Network for Two-party Computation Based Private Inference
Published: 2023

14. RRNet: Towards ReLU-Reduced Neural Network for Two-party Computation Based Private Inference

Author: Peng, Hongwu, Zhou, Shanglin, Luo, Yukui, Xu, Nuo, Duan, Shijin, Ran, Ran, Zhao, Jiahui, Huang, Shaoyi, Xie, Xi, Wang, Chenghong, Geng, Tong, Wen, Wujie, Xu, Xiaolin, and Ding, Caiwen
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning, I.2
Abstract: The proliferation of deep learning (DL) has led to the emergence of privacy and security concerns. To address these issues, secure Two-party computation (2PC) has been proposed as a means of enabling privacy-preserving DL computation. However, in practice, 2PC methods often incur high computation and communication overhead, which can impede their use in large-scale systems. To address this challenge, we introduce RRNet, a systematic framework that aims to jointly reduce the overhead of MPC comparison protocols and accelerate computation through hardware acceleration. Our approach integrates the hardware latency of cryptographic building blocks into the DNN loss function, resulting in improved energy efficiency, accuracy, and security guarantees. Furthermore, we propose a cryptographic hardware scheduler and corresponding performance model for Field Programmable Gate Arrays (FPGAs) to further enhance the efficiency of our framework. Experiments show RRNet achieved a much higher ReLU reduction performance than all SOTA works on the CIFAR-10 dataset., Comment: This work is an updated version of arXiv:2209.09424; the original version has been withdrawn
Published: 2023

15. Dynamic Sparse Training via Balancing the Exploration-Exploitation Trade-off

Author: Huang, Shaoyi, Lei, Bowen, Xu, Dongkuan, Peng, Hongwu, Sun, Yue, Xie, Mimi, and Ding, Caiwen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Over-parameterization of deep neural networks (DNNs) has shown high prediction accuracy for many applications. Although effective, the large number of parameters hinders its popularity on resource-limited devices and has an outsize environmental impact. Sparse training (using a fixed number of nonzero weights in each iteration) could significantly mitigate the training costs by reducing the model size. However, existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies, resulting in local minima and low accuracy. In this work, we consider the dynamic sparse training as a sparse connectivity search problem and design an exploitation and exploration acquisition function to escape from local optima and saddle points. We further design an acquisition function and provide the theoretical guarantees for the proposed method and clarify its convergence property. Experimental results show that sparse models (up to 98% sparsity) obtained by our proposed method outperform the SOTA sparse training methods on a wide variety of deep learning tasks. On VGG-19 / CIFAR-100, ResNet-50 / CIFAR-10, ResNet-50 / CIFAR-100, our method has even higher accuracy than dense models. On ResNet-50 / ImageNet, the proposed method has up to 8.2% accuracy improvement compared to SOTA sparse training methods.
Published: 2022

16. Aerial Manipulation Using a Novel Unmanned Aerial Vehicle Cyber-Physical System

Author: Ding, Caiwu, Peng, Hongwu, Lu, Lu, and Ding, Caiwen
Subjects: Computer Science - Robotics, C.3, I.4
Abstract: Unmanned Aerial Vehicles (UAVs) are attaining more and more maneuverability and sensory ability as a promising teleoperation platform for intelligent interaction with the environment. This work presents a novel 5-degree-of-freedom (DoF) unmanned aerial vehicle (UAV) cyber-physical system for aerial manipulation. This UAV's body is capable of exerting powerful propulsion force in the longitudinal direction, decoupling the translational dynamics and the rotational dynamics on the longitudinal plane. A high-level impedance control law is proposed to drive the vehicle for trajectory tracking and interaction with the environment. In addition, a vision-based real-time target identification and tracking method, integrating a YOLO v3 real-time object detector with feature tracking and morphological operations, is proposed to be implemented onboard the vehicle with the support of model compression techniques to eliminate latency caused by video wireless transmission and heavy computation burden on traditional teleoperation platforms., Comment: Newsletter of IEEE Technical Committee on Cyber-Physical Systems
Published: 2022

17. PolyMPCNet: Towards ReLU-free Neural Architecture Search in Two-party Computation Based Private Inference

Author: Peng, Hongwu, Zhou, Shanglin, Luo, Yukui, Duan, Shijin, Xu, Nuo, Ran, Ran, Huang, Shaoyi, Wang, Chenghong, Geng, Tong, Li, Ang, Wen, Wujie, Xu, Xiaolin, and Ding, Caiwen
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning, I.2, E.3, C.3
Abstract: The rapid growth and deployment of deep learning (DL) has witnessed emerging privacy and security concerns. To mitigate these issues, secure multi-party computation (MPC) has been discussed, to enable the privacy-preserving DL computation. In practice, they often come at very high computation and communication overhead, and potentially prohibit their popularity in large scale systems. Two orthogonal research trends have attracted enormous interest in addressing the energy efficiency in secure deep learning, i.e., overhead reduction of MPC comparison protocol, and hardware acceleration. However, they either achieve a low reduction ratio and suffer from high latency due to limited computation and communication saving, or are power-hungry as existing works mainly focus on general computing platforms such as CPUs and GPUs. In this work, as the first attempt, we develop a systematic framework, PolyMPCNet, of joint overhead reduction of MPC comparison protocol and hardware acceleration, by integrating hardware latency of the cryptographic building block into the DNN loss function to achieve high energy efficiency, accuracy, and security guarantee. Instead of heuristically checking the model sensitivity after a DNN is well-trained (through deleting or dropping some non-polynomial operators), our key design principle is to enforce exactly what is assumed in the DNN design -- training a DNN that is both hardware efficient and secure, while escaping the local minima and saddle points and maintaining high accuracy. More specifically, we propose a straight through polynomial activation initialization method for cryptographic hardware friendly trainable polynomial activation function to replace the expensive 2P-ReLU operator. We develop a cryptographic hardware scheduler and the corresponding performance model for Field Programmable Gate Arrays (FPGA) platform., Comment: Uploaded a new version of the paper in another new submission: RRNet: Towards ReLU-Reduced Neural Network for Two-party Computation Based Private Inference [arXiv:2302.02292]
Published: 2022

18. Towards Sparsification of Graph Neural Networks

Author: Peng, Hongwu, Gurevin, Deniz, Huang, Shaoyi, Geng, Tong, Jiang, Weiwen, Khan, Omer, and Ding, Caiwen
Subjects: Computer Science - Machine Learning, I.2, C.4
Abstract: As real-world graphs expand in size, larger GNN models with billions of parameters are deployed. High parameter count in such models makes training and inference on graphs expensive and challenging. To reduce the computational and memory costs of GNNs, optimization methods such as pruning the redundant nodes and edges in input graphs have been commonly adopted. However, model compression, which directly targets the sparsification of model layers, has been mostly limited to traditional Deep Neural Networks (DNNs) used for tasks such as image classification and object detection. In this paper, we utilize two state-of-the-art model compression methods (1) train and prune and (2) sparse training for the sparsification of weight layers in GNNs. We evaluate and compare the efficiency of both methods in terms of accuracy, training sparsity, and training FLOPs on real-world graphs. Our experimental results show that on the ia-email, wiki-talk, and stackoverflow datasets for link prediction, sparse training with much lower training FLOPs achieves a comparable accuracy with the train and prune method. On the brain dataset for node classification, sparse training uses a lower number of FLOPs (less than 1/7 of the FLOPs of the train and prune method) and preserves a much better accuracy performance under extreme model sparsity., Comment: ICCD 2022 Paper
Published: 2022

19. A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Author: Peng, Hongwu, Huang, Shaoyi, Chen, Shiyang, Li, Bingbing, Geng, Tong, Li, Ang, Jiang, Weiwen, Wen, Wujie, Bi, Jinbo, Liu, Hang, and Ding, Caiwen
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture, I.2, B.6, C.3
Abstract: Transformers are considered one of the most important deep learning models since 2018, in part because it establishes state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkable triumphs, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead where inputs need to be zero-padded to the maximum sentence length in the batch to accommodate the parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration. Particularly, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings the complexity of attention-based models down to linear complexity and alleviates the off-chip memory traffic. The proposed length-aware resource hardware scheduling algorithm dynamically allocates the hardware resources to fill up the pipeline slots and eliminates bubbles for NLP tasks. Experiments show that our design has very small accuracy loss and has 80.2 $\times$ and 2.6 $\times$ speedup compared to CPU and GPU implementation, and 4 $\times$ higher energy efficiency than state-of-the-art GPU accelerator optimized via CUBLAS GEMM., Comment: 2022 59th ACM/IEEE Design Automation Conference (DAC)
Published: 2022
Full Text: View/download PDF

20. An Automatic and Efficient BERT Pruning for Edge AI Systems

Author: Huang, Shaoyi, Liu, Ning, Liang, Yueying, Peng, Hongwu, Li, Hongjia, Xu, Dongkuan, Xie, Mimi, and Ding, Caiwen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: With the yearning for deep learning democratization, there are increasing demands to implement Transformer-based natural language processing (NLP) models on resource-constrained devices for low-latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with efficient evaluation to select a "good" sub-network candidate (with high accuracy) given the overall pruning ratio constraints. Our proposed method requires no human experts experience and achieves a better accuracy performance on many NLP tasks. Our experimental results on General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms the state-of-the-art (SOTA) hand-crafted pruning methods on BERT$_{\mathrm{BASE}}$. On QNLI and RTE, we obtain 75\% and 42.8\% more overall pruning ratio while achieving higher accuracy. On MRPC, we obtain a 4.6 higher score than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we can achieve a 40\% higher pruning ratio with a very small loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that after model compression, the inference time of a single BERT$_{\mathrm{BASE}}$ encoder on Xilinx Alveo U200 FPGA board has a 1.83$\times$ speedup compared to Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPU, which shows the reasonableness of deploying the proposed method generated subnets of BERT$_{\mathrm{BASE}}$ model on computation restricted devices.
Published: 2022

21. Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

Author: Qi, Panjie, Sha, Edwin Hsing-Mean, Zhuge, Qingfeng, Peng, Hongwu, Huang, Shaoyi, Kong, Zhenglun, Song, Yuhong, and Li, Bingbing
Subjects: Computer Science - Machine Learning, C.3, I.2
Abstract: State-of-the-art Transformer-based models, with gigantic parameters, are difficult to be accommodated on resource constrained embedded devices. Moreover, with the development of technology, more and more embedded devices are available to run a Transformer model. For a Transformer model with different constraints (tight or loose), it can be deployed onto devices with different computing power. However, in previous work, designers did not choose the best device among multiple devices. Instead, they just used an existing device to deploy model, which was not necessarily the best fit and may lead to underutilization of resources. To address the deployment challenge of Transformer and the problem to select the best device, we propose an algorithm & hardware closed-loop acceleration framework. Given a dataset, a model, latency constraint LC and accuracy constraint AC, our framework can provide a best device satisfying both constraints. In order to generate a compressed model with high sparsity ratio, we propose a novel pruning technique, hierarchical pruning (HP). We optimize the sparse matrix storage format for HP matrix to further reduce memory usage for FPGA implementation. We design a accelerator that takes advantage of HP to solve the problem of concurrent random access. Experiments on Transformer and TinyBert model show that our framework can find different devices for various LC and AC, covering from low-end devices to high-end devices. Our HP can achieve higher sparsity ratio and is more flexible than other sparsity pattern. Our framework can achieve 37x, 1.9x, 1.7x speedup compared to CPU, GPU and FPGA, respectively.
Published: 2021

22. Detecting Gender Bias in Transformer-based Models: A Case Study on BERT

Author: Li, Bingbing, Peng, Hongwu, Sainju, Rajat, Yang, Junhuan, Yang, Lei, Liang, Yueying, Jiang, Weiwen, Wang, Binghui, Liu, Hang, and Ding, Caiwen
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, I.2, I.7, H.0
Abstract: In this paper, we propose a novel gender bias detection method by utilizing attention map for transformer-based models. We 1) give an intuitive gender bias judgement method by comparing the different relation degree between the genders and the occupation according to the attention scores, 2) design a gender bias detector by modifying the attention module, 3) insert the gender bias detector into different positions of the model to present the internal gender bias flow, and 4) draw the consistent gender bias conclusion by scanning the entire Wikipedia, a BERT pretraining dataset. We observe that 1) the attention matrices, Wq and Wk introduce much more gender bias than other modules (including the embedding layer) and 2) the bias degree changes periodically inside of the model (attention matrix Q, K, V, and the remaining part of the attention layer (including the fully-connected layer, the residual connection, and the layer normalization module) enhance the gender bias while the averaged attentions reduces the bias).
Published: 2021

23. Optimizing FPGA-based Accelerator Design for Large-Scale Molecular Similarity Search

Author: Peng, Hongwu, Chen, Shiyang, Wang, Zhepeng, Yang, Junhuan, Weitze, Scott A., Geng, Tong, Li, Ang, Bi, Jinbo, Song, Minghu, Jiang, Weiwen, Liu, Hang, and Ding, Caiwen
Subjects: Computer Science - Hardware Architecture, B.0, I.0
Abstract: Molecular similarity search has been widely used in drug discovery to identify structurally similar compounds from large molecular databases rapidly. With the increasing size of chemical libraries, there is growing interest in the efficient acceleration of large-scale similarity search. Existing works mainly focus on CPU and GPU to accelerate the computation of the Tanimoto coefficient in measuring the pairwise similarity between different molecular fingerprints. In this paper, we propose and optimize an FPGA-based accelerator design on exhaustive and approximate search algorithms. On exhaustive search using BitBound & folding, we analyze the similarity cutoff and folding level relationship with search speedup and accuracy, and propose a scalable on-the-fly query engine on FPGAs to reduce the resource utilization and pipeline interval. We achieve a 450 million compounds-per-second processing throughput for a single query engine. On approximate search using hierarchical navigable small world (HNSW), a popular algorithm with high recall and query speed. We propose an FPGA-based graph traversal engine to utilize a high throughput register array based priority queue and fine-grained distance calculation engine to increase the processing capability. Experimental results show that the proposed FPGA-based HNSW implementation has a 103385 query per second (QPS) on the Chembl database with 0.92 recall and achieves a 35x speedup than the existing CPU implementation on average. To the best of our knowledge, our FPGA-based implementation is the first attempt to accelerate molecular similarity search algorithms on FPGA and has the highest performance among existing approaches., Comment: ICCAD 2021
Published: 2021

24. Binary Complex Neural Network Acceleration on FPGA

Author: Peng, Hongwu, Zhou, Shanglin, Weitze, Scott, Li, Jiaxin, Islam, Sahidul, Geng, Tong, Li, Ang, Zhang, Wei, Song, Minghu, Xie, Mimi, Liu, Hang, and Ding, Caiwen
Subjects: Computer Science - Machine Learning, B.0, C.3, I.2
Abstract: Being able to learn from complex data with phase information is imperative for many signal processing applications. Today's real-valued deep neural networks (DNNs) have shown efficiency in latent information analysis but fall short when applied to the complex domain. Deep complex networks (DCN), in contrast, can learn from complex data, but have high computational costs; therefore, they cannot satisfy the instant decision-making requirements of many deployable systems dealing with short observations or short signal bursts. Recently, Binarized Complex Neural Network (BCNN), which integrates DCNs with binarized neural networks (BNN), shows great potential in classifying complex data in real-time. In this paper, we propose a structural pruning based accelerator of BCNN, which is able to provide more than 5000 frames/s inference throughput on edge devices. The high performance comes from both the algorithm and hardware sides. On the algorithm side, we conduct structural pruning to the original BCNN models and obtain 20 $\times$ pruning rates with negligible accuracy loss; on the hardware side, we propose a novel 2D convolution operation accelerator for the binary complex neural network. Experimental results show that the proposed design works with over 90% utilization and is able to achieve the inference throughput of 5882 frames/s and 4938 frames/s for complex NIN-Net and ResNet-18 using CIFAR-10 dataset and Alveo U280 Board., Comment: ASAP 2021, 8 pages
Published: 2021

25. Improving DNN Fault Tolerance using Weight Pruning and Differential Crossbar Mapping for ReRAM-based Edge AI

Author: Yuan, Geng, Liao, Zhiheng, Ma, Xiaolong, Cai, Yuxuan, Kong, Zhenglun, Shen, Xuan, Fu, Jingyan, Li, Zhengang, Zhang, Chengming, Peng, Hongwu, Liu, Ning, Ren, Ao, Wang, Jinhui, and Wang, Yanzhi
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Performance
Abstract: Recent research demonstrated the promise of using resistive random access memory (ReRAM) as an emerging technology to perform inherently parallel analog domain in-situ matrix-vector multiplication -- the intensive and key computation in deep neural networks (DNNs). However, hardware failure, such as stuck-at-fault defects, is one of the main concerns that impedes the ReRAM devices to be a feasible solution for real implementations. The existing solutions to address this issue usually require an optimization to be conducted for each individual device, which is impractical for mass-produced products (e.g., IoT devices). In this paper, we rethink the value of weight pruning in ReRAM-based DNN design from the perspective of model fault tolerance. And a differential mapping scheme is proposed to improve the fault tolerance under a high stuck-on fault rate. Our method can tolerate almost an order of magnitude higher failure rate than the traditional two-column method in representative DNN tasks. More importantly, our method does not require extra hardware cost compared to the traditional two-column mapping scheme. The improvement is universal and does not require the optimization process for each individual device., Comment: In Proceedings of the 22nd International Symposium on Quality Electronic Design (ISQED), 2021
Published: 2021

26. Si-IGBT and SiC-MOSFET hybrid switch-based 1.7 kV half-bridge power module

Author: Deshpande, Amol, Paul, Riya, Imran Emon, Asif, Yuan, Zhao, Peng, Hongwu, and Luo, Fang
Published: 2022
Full Text: View/download PDF

27. Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Author: Peng, Hongwu, primary, Ding, Caiwen, additional, Geng, Tong, additional, Choudhury, Sutanay, additional, Barker, Kevin, additional, and Li, Ang, additional
Published: 2024
Full Text: View/download PDF

28. AQ2PNN: Enabling Two-party Privacy-Preserving Deep Neural Network Inference with Adaptive Quantization

Author: Luo, Yukui, primary, Xu, Nuo, additional, Peng, Hongwu, additional, Wang, Chenghong, additional, Duan, Shijin, additional, Mahmood, Kaleel, additional, Wen, Wujie, additional, Ding, Caiwen, additional, and Xu, Xiaolin, additional
Published: 2023
Full Text: View/download PDF

29. PASNet: Polynomial Architecture Search Framework for Two-party Computation-based Secure Neural Network Deployment

Author: Peng, Hongwu, primary, Zhou, Shanglin, additional, Luo, Yukui, additional, Xu, Nuo, additional, Duan, Shijin, additional, Ran, Ran, additional, Zhao, Jiahui, additional, Wang, Chenghong, additional, Geng, Tong, additional, Wen, Wujie, additional, Xu, Xiaolin, additional, and Ding, Caiwen, additional
Published: 2023
Full Text: View/download PDF

30. Dynamic Sparse Training via Balancing the Exploration-Exploitation Trade-off

Author: Huang, Shaoyi, primary, Lei, Bowen, additional, Xu, Dongkuan, additional, Peng, Hongwu, additional, Sun, Yue, additional, Xie, Mimi, additional, and Ding, Caiwen, additional
Published: 2023
Full Text: View/download PDF

31. Design and Validation of a MVDC Isolated Active Voltage Injection Based HCB

Author: Mirza, Abdul Basit, primary, Azadeh, Yalda, additional, Peng, Hongwu, additional, Li, Yang, additional, Kaplun, John, additional, and Luo, Fang, additional
Published: 2023
Full Text: View/download PDF

32. Towards Sparsification of Graph Neural Networks

Author: Peng, Hongwu, primary, Gurevin, Deniz, additional, Huang, Shaoyi, additional, Geng, Tong, additional, Jiang, Weiwen, additional, Khan, Omer, additional, and Ding, Caiwen, additional
Published: 2022
Full Text: View/download PDF

33. CoDG-ReRAM: An Algorithm-Hardware Co-design to Accelerate Semi-Structured GNNs on ReRAM

Author: Luo, Yixuan, primary, Behnam, Payman, additional, Thorat, Kiran, additional, Liu, Zhuo, additional, Peng, Hongwu, additional, Huang, Shaoyi, additional, Zhou, Shu, additional, Khan, Omer, additional, Tumanov, Alexey, additional, Ding, Caiwen, additional, and Geng, Tong, additional
Published: 2022
Full Text: View/download PDF

34. A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining

Author: Peng, Hongwu, primary, Huang, Shaoyi, additional, Chen, Shiyang, additional, Li, Bingbing, additional, Geng, Tong, additional, Li, Ang, additional, Jiang, Weiwen, additional, Wen, Wujie, additional, Bi, Jinbo, additional, Liu, Hang, additional, and Ding, Caiwen, additional
Published: 2022
Full Text: View/download PDF

35. An Automatic and Efficient BERT Pruning for Edge AI Systems

Author: Huang, Shaoyi, primary, Liu, Ning, additional, Liang, Yueying, additional, Peng, Hongwu, additional, Li, Hongjia, additional, Xu, Dongkuan, additional, Xie, Mimi, additional, and Ding, Caiwen, additional
Published: 2022
Full Text: View/download PDF

36. Optimizing FPGA-based Accelerator Design for Large-Scale Molecular Similarity Search (Special Session Paper)

Author: Peng, Hongwu, primary, Chen, Shiyang, additional, Wang, Zhepeng, additional, Yang, Junhuan, additional, Weitze, Scott A., additional, Geng, Tong, additional, Li, Ang, additional, Bi, Jinbo, additional, Song, Minghu, additional, Jiang, Weiwen, additional, Liu, Hang, additional, and Ding, Caiwen, additional
Published: 2021
Full Text: View/download PDF

37. Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

Author: Qi, Panjie, primary, Sha, Edwin Hsing-Mean, additional, Zhuge, Qingfeng, additional, Peng, Hongwu, additional, Huang, Shaoyi, additional, Kong, Zhenglun, additional, Song, Yuhong, additional, and Li, Bingbing, additional
Published: 2021
Full Text: View/download PDF

38. An Isolated Voltage Injection Based Hybrid Circuit Breaker for MVDC Applications

Author: Mirza, Abdul Basit, primary, Azadeh, Yalda, additional, Peng, Hongwu, additional, and Luo, Fang, additional
Published: 2021
Full Text: View/download PDF

39. Space-Charge Accumulation and Its Impact on High-Voltage Power Module Partial Discharge Under DC and PWM Waves: Testing and Modeling

Author: Wang, Yalin, primary, Ding, Yi, additional, Yuan, Zhao, additional, Peng, Hongwu, additional, Wu, Jiandong, additional, Yin, Yi, additional, Han, Tao, additional, and Luo, Fang, additional
Published: 2021
Full Text: View/download PDF

40. Vertically Stacked, Flip-Chip Wide Bandgap MOSFET Co-Optimized for Reliability and Switching Performance

Author: Montazeri, Mahsa, primary, Huitink, David R., additional, Wallace, Andrea, additional, Peng, Hongwu, additional, Seal, Sayan, additional, Luo, Fang, additional, and Mantooth, H. Alan, additional
Published: 2021
Full Text: View/download PDF

41. Binary Complex Neural Network Acceleration on FPGA : (Invited Paper)

Author: Peng, Hongwu, primary, Zhou, Shanglin, additional, Weitze, Scott, additional, Li, Jiaxin, additional, Islam, Sahidul, additional, Geng, Tong, additional, Li, Ang, additional, Zhang, Wei, additional, Song, Minghu, additional, Xie, Mimi, additional, Liu, Hang, additional, and Ding, Caiwen, additional
Published: 2021
Full Text: View/download PDF

42. Investigation on Conducted EMI for Single and Parallel Connected Inverters

Author: ul-Hassan, Mustafeez, primary, Emon, Asif Imran, additional, Peng, Hongwu, additional, Kushan, Choksi, additional, and Luo, Fang, additional
Published: 2021
Full Text: View/download PDF

43. Accommodating Transformer onto FPGA

Author: Qi, Panjie, primary, Song, Yuhong, additional, Peng, Hongwu, additional, Huang, Shaoyi, additional, Zhuge, Qingfeng, additional, and Sha, Edwin Hsing-Mean, additional
Published: 2021
Full Text: View/download PDF

44. HMC-TRAN

Author: Huang, Shaoyi, primary, Chen, Shiyang, additional, Peng, Hongwu, additional, Manu, Daniel, additional, Kong, Zhenglun, additional, Yuan, Geng, additional, Yang, Lei, additional, Wang, Shusen, additional, Liu, Hang, additional, and Ding, Caiwen, additional
Published: 2021
Full Text: View/download PDF

45. A Three-phase 450 kVA SiC-MOSFET Based Inverter With High Efficiency and High Power Density By Using 3L-TNPC

Author: Yuan, Zhao, primary, Emon, Asif Imran, additional, Wang, Zhongjing, additional, Peng, Hongwu, additional, Narayanasamy, Balaji, additional, Hassan, Mustafeez, additional, Wang, Yalin, additional, Deshpande, Amol, additional, and Luo, Fang, additional
Published: 2021
Full Text: View/download PDF

46. Design of Partial-discharge-free Busbar for More-electric Aircraft Application with Low Pressure Condition

Author: Yuan, Zhao, primary, Wang, Yalin, additional, Wang, Zhongjing, additional, Emon, Asif Imran, additional, Peng, Hongwu, additional, Hassan, Mustafeez, additional, Narayanasamy, Balaji, additional, and Luo, Fang, additional
Published: 2021
Full Text: View/download PDF

47. Partial Discharge Testing Platform for High Voltage Power Module Packaging Under Square Wave Excitation

Author: Wang, Yalin, primary, Yuan, Zhao, additional, Peng, Hongwu, additional, Ding, Yi, additional, Yin, Yi, additional, and Luo, Fang, additional
Published: 2021
Full Text: View/download PDF

48. Improving DNN Fault Tolerance using Weight Pruning and Differential Crossbar Mapping for ReRAM-based Edge AI

Author: Yuan, Geng, primary, Liao, Zhiheng, additional, Ma, Xiaolong, additional, Cai, Yuxuan, additional, Kong, Zhenglun, additional, Shen, Xuan, additional, Fu, Jingyan, additional, Li, Zhengang, additional, Zhang, Chengming, additional, Peng, Hongwu, additional, Liu, Ning, additional, Ren, Ao, additional, Wang, Jinhui, additional, and Wang, Yanzhi, additional
Published: 2021
Full Text: View/download PDF

49. Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning

Author: Peng, Hongwu, primary, Huang, Shaoyi, additional, Geng, Tong, additional, Li, Ang, additional, Jiang, Weiwen, additional, Liu, Hang, additional, Wang, Shusen, additional, and Ding, Caiwen, additional
Published: 2021
Full Text: View/download PDF

50. Zero-Phase-Filtering based Digital Active EMI Filter

Author: Narayanasamy, Balaji, primary, Peng, Hongwu, additional, Yuan, Zhao, additional, Luo, Fang, additional, and Chu, Yongbin, additional
Published: 2020
Full Text: View/download PDF
