Author: "Han, Yinhe" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Han, Yinhe"' showing total 326 results

1. COMET: Towards Partical W4A4KV4 LLMs Serving

Author: Liu, Lian, Ren, Haimeng, Cheng, Long, Xu, Zhaohui, Pan, Yudong, Wang, Mengdi, Li, Xiaowei, Han, Yinhe, and Wang, Ying
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning
Abstract: Quantization is a widely-used compression technology to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. However, prevalent quantization methods, such as 8-bit weight-activation or 4-bit weight-only quantization, achieve limited performance improvements due to poor support for low-precision (e.g., 4-bit) activation. This work, for the first time, realizes practical W4A4KV4 serving for LLMs, fully utilizing the INT4 tensor cores on modern GPUs and reducing the memory bottleneck caused by the KV cache. Specifically, we propose a novel fine-grained mixed-precision quantization algorithm (FMPQ) that compresses most activations into 4-bit with negligible accuracy loss. To support mixed-precision matrix multiplication for W4A4 and W4A8, we develop a highly optimized W4Ax kernel. Our approach introduces a novel mixed-precision data layout to facilitate access and fast dequantization for activation and weight tensors, utilizing the GPU's software pipeline to hide the overhead of data loading and conversion. Additionally, we propose fine-grained streaming multiprocessor (SM) scheduling to achieve load balance across different SMs. We integrate the optimized W4Ax kernel into our inference framework, COMET, and provide efficient management to support popular LLMs such as LLaMA-3-70B. Extensive evaluations demonstrate that, when running LLaMA family models on a single A100-80G-SMX4, COMET achieves a kernel-level speedup of 2.88× over cuBLAS and a 2.02× throughput improvement compared to TensorRT-LLM from an end-to-end framework perspective.
Comment: 14 pages, 12 figures
Published: 2024

2. BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

Author: Xu, Yongqi, Lee, Yujian, Yi, Gao, Liu, Bosheng, Chen, Yucong, Liu, Peng, Wu, Jigang, Chen, Xiaoming, and Han, Yinhe
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep neural networks (DNNs) are powerful for cognitive tasks such as image classification, object detection, and scene segmentation. One drawback, however, is their significantly high computational complexity and memory consumption, which makes them infeasible to run in real time on embedded platforms because of the limited hardware resources. Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden owing to its capability to effectively capture the broad data distribution of DNN models. Unfortunately, prior works on BFP-based quantization empirically choose the block size and the precision that preserve accuracy. In this paper, we develop a BFP-based bitwidth-aware analytical modeling framework (called "BitQ") for the best BFP implementation of DNN inference on embedded platforms. We formulate and resolve an optimization problem to identify the optimal BFP block size and bitwidth distribution by the trade-off of both accuracy and performance loss. Experimental results show that compared with an equal bitwidth setting, the BFP DNNs with optimized bitwidth allocation provide efficient computation, preserving accuracy on famous benchmarks. The source code and data are available at https://github.com/Cheliosoops/BitQ.
Published: 2024

3. KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems

Author: Wang, Zixuan, Yu, Bo, Zhao, Junzhe, Sun, Wenhao, Hou, Sai, Liang, Shuai, Hu, Xing, Han, Yinhe, and Gan, Yiming
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: Embodied AI agents responsible for executing interconnected, long-sequence household tasks often face difficulties with in-context memory, leading to inefficiencies and errors in task execution. To address this issue, we introduce KARMA, an innovative memory system that integrates long-term and short-term memory modules, enhancing large language models (LLMs) for planning in embodied agents through memory-augmented prompting. KARMA distinguishes between long-term and short-term memory, with long-term memory capturing comprehensive 3D scene graphs as representations of the environment, while short-term memory dynamically records changes in objects' positions and states. This dual-memory structure allows agents to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. Short-term memory employs strategies for effective and adaptive memory replacement, ensuring the retention of critical information while discarding less pertinent data. Compared to state-of-the-art embodied agents enhanced with memory, our memory-augmented embodied AI agent improves success rates by 1.3x and 2.3x in Composite Tasks and Complex Tasks within the AI2-THOR simulator, respectively, and enhances task execution efficiency by 3.4x and 62.7x. Furthermore, we demonstrate that KARMA's plug-and-play capability allows for seamless deployment on real-world robotic systems, such as mobile manipulation platforms. Through this plug-and-play memory system, KARMA significantly enhances the ability of embodied agents to generate coherent and contextually appropriate plans, making the execution of complex household tasks more efficient. The experimental videos from the work can be found at https://youtu.be/4BT7fnw9ehs.
Published: 2024

4. SuperEncoder: Towards Universal Neural Approximate Quantum State Preparation

Author: Zhao, Yilun, Wang, Bingmeng, Jiang, Wenle, Pan, Xiwei, Li, Bing, Han, Yinhe, and Wang, Ying
Subjects: Quantum Physics, Computer Science - Machine Learning
Abstract: Numerous quantum algorithms operate under the assumption that classical data has already been converted into quantum states, a process termed Quantum State Preparation (QSP). However, achieving precise QSP requires a circuit depth that scales exponentially with the number of qubits, making it a substantial obstacle in harnessing quantum advantage. Recent research suggests using a Parameterized Quantum Circuit (PQC) to approximate a target state, offering a more scalable solution with reduced circuit depth compared to precise QSP. Despite this, the need for iterative updates of circuit parameters results in a lengthy runtime, limiting its practical application. In this work, we demonstrate that it is possible to leverage a pre-trained neural network to directly generate the QSP circuit for arbitrary quantum state, thereby eliminating the significant overhead of online iterations. Our study makes a steady step towards a universal neural designer for approximate QSP.
Published: 2024

5. Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation

Author: Chang, Kaiyan, Chen, Zhirong, Zhou, Yunhao, Zhu, Wenlong, Wang, Kun, Xu, Haobo, Li, Cangyuan, Wang, Mengdi, Liang, Shengwen, Li, Huawei, Han, Yinhe, and Wang, Ying
Subjects: Computer Science - Hardware Architecture, Computer Science - Artificial Intelligence
Abstract: Natural language interfaces have exhibited considerable potential in the automation of Verilog generation derived from high-level specifications through the utilization of large language models, garnering significant attention. Nevertheless, this paper elucidates that visual representations contribute essential contextual information critical to design intent for hardware architectures possessing spatial complexity, potentially surpassing the efficacy of natural-language-only inputs. Expanding upon this premise, our paper introduces an open-source benchmark for multi-modal generative models tailored for Verilog synthesis from visual-linguistic inputs, addressing both singular and complex modules. Additionally, we introduce an open-source visual and natural language Verilog query language framework to facilitate efficient and user-friendly multi-modal queries. To evaluate the performance of the proposed multi-modal hardware generative AI in Verilog generation tasks, we compare it with a popular method that relies solely on natural language. Our results demonstrate a significant accuracy improvement in the multi-modal generated Verilog compared to queries based solely on natural language. We hope to reveal a new approach to hardware design in the large-hardware-design-model era, thereby fostering a more diversified and productive approach to hardware design.
Comment: Accepted by ICCAD 2024
Published: 2024

6. Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-Design

Author: Huang, Yiyang, Hao, Yuhui, Yu, Bo, Yan, Feng, Yang, Yuxin, Min, Feng, Han, Yinhe, Ma, Lin, Liu, Shaoshan, Liu, Qiang, and Gan, Yiming
Subjects: Computer Science - Hardware Architecture, Computer Science - Robotics
Abstract: Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely based on the interest of algorithm developers, where robot actions are divided into a discrete frame-basis. Such an execution pipeline creates high latency and energy consumption. This paper proposes Corki, an algorithm-architecture co-design framework for real-time embodied AI robot control. Our idea is to decouple LLM inference, robotic control and data communication in the embodied AI robots compute pipeline. Instead of predicting action for one single frame, Corki predicts the trajectory for the near future to reduce the frequency of LLM inference. The algorithm is coupled with a hardware that accelerates transforming trajectory into actual torque signals used to control robots and an execution pipeline that parallels data communication with computation. Corki largely reduces LLM inference frequency by up to 8.0x, resulting in up to 3.6x speed up. The success rate improvement can be up to 17.3%. Code is provided for re-implementation. https://github.com/hyy0613/Corki
Published: 2024

7. Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework

Author: Chang, Kaiyan, Wang, Kun, Yang, Nan, Wang, Ying, Jin, Dantong, Zhu, Wenlong, Chen, Zhirong, Li, Cangyuan, Yan, Hao, Zhou, Yunhao, Zhao, Zhuoliang, Cheng, Yuan, Pan, Yudong, Liu, Yiqi, Wang, Mengdi, Liang, Shengwen, Han, Yinhe, Li, Huawei, and Li, Xiaowei
Subjects: Computer Science - Hardware Architecture, Computer Science - Artificial Intelligence, Computer Science - Programming Languages
Abstract: Recent advances in large language models have demonstrated their potential for automated generation of hardware description language (HDL) code from high-level prompts. Researchers have utilized fine-tuning to enhance the ability of these large language models (LLMs) in the field of Chip Design. However, the lack of Verilog data hinders further improvement in the quality of Verilog generation by LLMs. Additionally, the absence of a Verilog and Electronic Design Automation (EDA) script data augmentation framework significantly increases the time required to prepare the training dataset for LLM trainers. This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts. For Verilog generation, it translates Verilog files to an abstract syntax tree and then maps nodes to natural language with a predefined template. For Verilog repair, it uses predefined rules to generate the wrong Verilog file and then pairs EDA Tool feedback with the right and wrong Verilog file. For EDA Script generation, it uses an existing LLM (GPT-3.5) to obtain the description of the Script. To evaluate the effectiveness of our data augmentation method, we finetune Llama2-13B and Llama2-7B models using the dataset generated by our augmentation framework. The results demonstrate a significant improvement in the Verilog generation tasks with LLMs. Moreover, the accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark. Our 13B model (ChipGPT-FT) has a pass rate improvement compared with GPT-3.5 in Verilog generation and outperforms in EDA script (i.e., SiliconCompiler) generation with only 200 EDA script data.
Comment: DAC 2024
Published: 2024
Full Text: View/download PDF

8. PIMSYN: Synthesizing Processing-in-memory CNN Accelerators

Author: Li, Wanqian, Sun, Xiaotian, Wang, Xinyu, Wang, Lei, Han, Yinhe, and Chen, Xiaoming
Subjects: Computer Science - Hardware Architecture
Abstract: Processing-in-memory architectures have been regarded as a promising solution for CNN acceleration. Existing PIM accelerator designs rely heavily on the experience of experts and require significant manual design overhead. Manual design cannot effectively optimize and explore architecture implementations. In this work, we develop an automatic framework PIMSYN for synthesizing PIM-based CNN accelerators, which greatly facilitates architecture design and helps generate energy-efficient accelerators. PIMSYN can automatically transform CNN applications into execution workflows and hardware construction of PIM accelerators. To systematically optimize the architecture, we embed an architectural exploration flow into the synthesis framework, providing a more comprehensive design space. Experiments demonstrate that PIMSYN improves the power efficiency by several times compared with existing works. PIMSYN can be obtained from https://github.com/lixixi-jook/PIMSYN-NN.
Published: 2024

9. PIMSIM-NN: An ISA-based Simulation Framework for Processing-in-Memory Accelerators

Author: Wang, Xinyu, Sun, Xiaotian, Han, Yinhe, and Chen, Xiaoming
Subjects: Computer Science - Hardware Architecture
Abstract: Processing-in-memory (PIM) has shown extraordinary potential in accelerating neural networks. To evaluate the performance of PIM accelerators, we present an ISA-based simulation framework including a dedicated ISA targeting neural networks running on PIM architectures, a compiler, and a cycle-accurate configurable simulator. Compared with prior works, this work decouples software algorithms and hardware architectures through the proposed ISA, providing a more convenient way to evaluate the effectiveness of software/hardware optimizations. The simulator adopts an event-driven simulation approach and has better support for hardware parallelism. The framework is open-sourced at https://github.com/wangxy-2000/pimsim-nn.
Published: 2024

10. Dadu-RBD: Robot Rigid Body Dynamics Accelerator with Multifunctional Pipelines

Author: Yang, Yuxin, Chen, Xiaoming, and Han, Yinhe
Subjects: Computer Science - Robotics, Computer Science - Hardware Architecture
Abstract: Rigid body dynamics is a key technology in the robotics field. In trajectory optimization and model predictive control algorithms, there are usually a large number of rigid body dynamics computing tasks. Using CPUs to process these tasks consumes a lot of time, which will affect the real-time performance of robots. To this end, we propose a multifunctional robot rigid body dynamics accelerator, named RBDCore, to address the performance bottleneck. By analyzing different functions commonly used in robot dynamics calculations, we summarize their reuse relationship and optimize them according to the hardware. Based on this, RBDCore can fully reuse common hardware modules when processing different computing tasks. By dynamically switching the dataflow path, RBDCore can accelerate various dynamics functions without reconfiguring the hardware. We design Structure-Adaptive Pipelines for RBDCore, which can greatly improve the throughput of the accelerator. Robots with different structures and parameters can be optimized specifically. Compared with the state-of-the-art CPU, GPU dynamics libraries and FPGA accelerator, RBDCore can significantly improve the performance.
Published: 2023

11. PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators

Author: Sun, Xiaotian, Wang, Xinyu, Li, Wanqian, Wang, Lei, Han, Yinhe, and Chen, Xiaoming
Subjects: Computer Science - Hardware Architecture, Computer Science - Emerging Technologies
Abstract: Crossbar-based PIM DNN accelerators can provide massively parallel in-situ operations. A specifically designed compiler is important to achieve high performance for a wide variety of DNN workloads. However, some key compilation issues such as parallelism considerations, weight replication selection, and array mapping methods have not been solved. In this work, we propose PIMCOMP - a universal compilation framework for NVM crossbar-based PIM DNN accelerators. PIMCOMP is built on an abstract PIM accelerator architecture, which is compatible with the widely used Crossbar/IMA/Tile/Chip hierarchy. On this basis, we propose four general compilation stages for crossbar-based PIM accelerators: node partitioning, weight replicating, core mapping, and dataflow scheduling. We design two compilation modes with different inter-layer pipeline granularities to support high-throughput and low-latency application scenarios, respectively. Our experimental results show that PIMCOMP yields improvements of 1.6× and 2.4× in throughput and latency, respectively, relative to PUMA.
Published: 2023

12. ChipGPT: How far are we from natural language hardware design

Author: Chang, Kaiyan, Wang, Ying, Ren, Haimeng, Wang, Mengdi, Liang, Shengwen, Han, Yinhe, Li, Huawei, and Li, Xiaowei
Subjects: Computer Science - Artificial Intelligence, Computer Science - Hardware Architecture, Computer Science - Programming Languages
Abstract: As large language models (LLMs) like ChatGPT exhibited unprecedented machine intelligence, it also shows great performance in assisting hardware engineers to realize higher-efficiency logic design via natural language interaction. To estimate the potential of the hardware design process assisted by LLMs, this work attempts to demonstrate an automated design environment that explores LLMs to generate hardware logic designs from natural language specifications. To realize a more accessible and efficient chip development flow, we present a scalable four-stage zero-code logic design framework based on LLMs without retraining or finetuning. At first, the demo, ChipGPT, begins by generating prompts for the LLM, which then produces initial Verilog programs. Second, an output manager corrects and optimizes these programs before collecting them into the final design space. Eventually, ChipGPT will search through this space to select the optimal design under the target metrics. The evaluation sheds some light on whether LLMs can generate correct and complete hardware logic designs described by natural language for some specifications. It is shown that ChipGPT improves programmability, and controllability, and shows broader design optimization space compared to prior work and native LLMs alone.
Published: 2023

13. Depth-NeuS: Neural Implicit Surfaces Learning for Multi-view Reconstruction Based on Depth Information Optimization

Author: Jiang, Hanqi, Zeng, Cheng, Chen, Runnan, Liang, Shuai, Han, Yinhe, Gao, Yichao, and Wang, Conglin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, methods for neural surface representation and rendering, for example NeuS, have shown that learning neural implicit surfaces through volume rendering is becoming increasingly popular and making good progress. However, these methods still face some challenges. Existing methods lack a direct representation of depth information, which makes object reconstruction unrestricted by geometric features, resulting in poor reconstruction of objects with texture and color features. This is because existing methods only use surface normals to represent implicit surfaces without using depth information. Therefore, these methods cannot model the detailed surface features of objects well. To address this problem, we propose a neural implicit surface learning method called Depth-NeuS based on depth information optimization for multi-view reconstruction. In this paper, we introduce depth loss to explicitly constrain SDF regression and introduce geometric consistency loss to optimize for low-texture areas. Specific experiments show that Depth-NeuS outperforms existing technologies in multiple scenarios and achieves high-quality surface reconstruction in multiple scenarios.
Comment: 9 pages
Published: 2023

14. Depth-NeuS: Neural Implicit Surfaces Learning for Multi-view Reconstruction Based on Depth Information Optimization

Author: Wen, Siqi, Jiang, Hanqi, Zeng, Cheng, Chen, Runnan, Yuan, Jidong, Liang, Shuai, Han, Yinhe, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Huang, De-Shuang, editor, Zhang, Xiankun, editor, and Guo, Jiayang, editor
Published: 2024
Full Text: View/download PDF

15. Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

Author: Cao, Weidong, Zhao, Yilong, Boloor, Adith, Han, Yinhe, Zhang, Xuan, and Jiang, Li
Subjects: Computer Science - Hardware Architecture, Computer Science - Emerging Technologies, Computer Science - Machine Learning
Abstract: Processing-in-memory (PIM) architectures have demonstrated great potential in accelerating numerous deep learning tasks. Particularly, resistive random-access memory (RRAM) devices provide a promising hardware substrate to build PIM accelerators due to their abilities to realize efficient in-situ vector-matrix multiplications (VMMs). However, existing PIM accelerators suffer from frequent and energy-intensive analog-to-digital (A/D) conversions, severely limiting their performance. This paper presents a new PIM architecture to efficiently accelerate deep learning tasks by minimizing the required A/D conversions with analog accumulation and neural approximated peripheral circuits. We first characterize the different dataflows employed by existing PIM accelerators, based on which a new dataflow is proposed to remarkably reduce the required A/D conversions for VMMs by extending shift and add (S+A) operations into the analog domain before the final quantizations. We then leverage a neural approximation method to design both analog accumulation circuits (S+A) and quantization circuits (ADCs) with RRAM crossbar arrays in a highly-efficient manner. Finally, we apply them to build an RRAM-based PIM accelerator (i.e., Neural-PIM) upon the proposed analog dataflow and evaluate its system-level performance. Evaluations on different benchmarks demonstrate that Neural-PIM can improve energy efficiency by 5.36x (1.73x) and speed up throughput by 3.43x (1.59x) without losing accuracy, compared to the state-of-the-art RRAM-based PIM accelerators, i.e., ISAAC (CASCADE).
Comment: 14 pages, 13 figures, Published in IEEE Transactions on Computers
Published: 2022
Full Text: View/download PDF

16. The Big Chip: Challenge, model and architecture

Author: Han, Yinhe, Xu, Haobo, Lu, Meixuan, Wang, Haoran, Huang, Junpei, Wang, Ying, Wang, Yujie, Min, Feng, Liu, Qi, Liu, Ming, and Sun, Ninghui
Published: 2024
Full Text: View/download PDF

17. Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity and Temporal-Consistency Video Prediction

Author: Jin, Beibei, Hu, Yu, Tang, Qiankun, Niu, Jingyu, Shi, Zhiping, Han, Yinhe, and Li, Xiaowei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video prediction is a pixel-wise dense prediction task to infer future frames based on past frames. Missing appearance details and motion blur are still two major problems for current predictive models, which lead to image distortion and temporal inconsistency. In this paper, we point out the necessity of exploring multi-frequency analysis to deal with the two problems. Inspired by the frequency band decomposition characteristic of Human Vision System (HVS), we propose a video prediction network based on multi-level wavelet analysis to deal with spatial and temporal information in a unified manner. Specifically, the multi-level spatial discrete wavelet transform decomposes each video frame into anisotropic sub-bands with multiple frequencies, helping to enrich structural information and reserve fine details. On the other hand, multi-level temporal discrete wavelet transform which operates on time axis decomposes the frame sequence into sub-band groups of different frequencies to accurately capture multi-frequency motions under a fixed frame rate. Extensive experiments on diverse datasets demonstrate that our model shows significant improvements on fidelity and temporal consistency over state-of-the-art works.
Comment: Accepted by CVPR2020
Published: 2020

18. Communication Lower Bound in Convolution Accelerators

Author: Chen, Xiaoming, Han, Yinhe, and Wang, Yu
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture
Abstract: In current convolutional neural network (CNN) accelerators, communication (i.e., memory access) dominates the energy consumption. This work provides comprehensive analysis and methodologies to minimize the communication for CNN accelerators. For the off-chip communication, we derive the theoretical lower bound for any convolutional layer and propose a dataflow to reach the lower bound. This fundamental problem has never been solved by prior studies. The on-chip communication is minimized based on an elaborate workload and storage mapping scheme. We in addition design a communication-optimal CNN accelerator architecture. Evaluations based on the 65nm technology demonstrate that the proposed architecture nearly reaches the theoretical minimum communication in a three-level memory hierarchy and it is computation dominant. The gap between the energy efficiency of our accelerator and the theoretical best value is only 37-87%.
Published: 2019

19. STC-NAS: Fast neural architecture search with source-target consistency

Author: Sun, Zihao, Hu, Yu, Yang, Longxing, Lu, Shun, Mei, Jilin, Han, Yinhe, and Li, Xiaowei
Published: 2022
Full Text: View/download PDF

20. Reconfiguration algorithms for synchronous communication on switch based degradable arrays

Author: Wu, Yalan, Wu, Jigang, Liu, Peng, Han, Yinhe, and Srikanthan, Thambipillai
Published: 2022
Full Text: View/download PDF

21. Bi-stage multi-modal 3D instance segmentation method for production workshop scene

Author: Tang, Zaizuo, Chen, Guangzhu, Han, Yinhe, Liao, Xiaojuan, Ru, Qingjun, and Wu, Yuanyuan
Published: 2022
Full Text: View/download PDF

22. Survey on chiplets: interface, interconnect and integration methodology

Author: Ma, Xiaohan, Wang, Ying, Wang, Yujie, Cai, Xuyi, and Han, Yinhe
Published: 2022
Full Text: View/download PDF

23. Multi-modal feature fusion for 3D object detection in the production workshop

Author: Hou, Rui, Chen, Guangzhu, Han, Yinhe, Tang, Zaizuo, and Ru, Qingjun
Published: 2022
Full Text: View/download PDF

24. PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training

Author: Wang, Haoran, primary, Wang, Lei, additional, Xu, Haobo, additional, Wang, Ying, additional, Li, Yuming, additional, and Han, Yinhe, additional
Published: 2024
Full Text: View/download PDF

25. ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache Optimizations

Author: Lu, Xiaoyang, primary, Long, Boyu, additional, Chen, Xiaoming, additional, Han, Yinhe, additional, and Sun, Xian-He, additional
Published: 2024
Full Text: View/download PDF

26. GNN-PIM: A Processing-in-Memory Architecture for Graph Neural Networks

Author: Wang, Zhao, Guan, Yijin, Sun, Guangyu, Niu, Dimin, Wang, Yuhao, Zheng, Hongzhong, Han, Yinhe, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Dong, Dezun, editor, Gong, Xiaoli, editor, Li, Cunlu, editor, Li, Dongsheng, editor, and Wu, Junjie, editor
Published: 2020
Full Text: View/download PDF

27. GAS: General-Purpose In-Memory-Computing Accelerator for Sparse Matrix Multiplication

Author: Zhang, Xiaoyu, primary, Li, Zerun, additional, Liu, Rui, additional, Chen, Xiaoming, additional, and Han, Yinhe, additional
Published: 2024
Full Text: View/download PDF

28. Mathematical Framework for Optimizing Crossbar Allocation for ReRAM-based CNN Accelerators

Author: Li, Wanqian, primary, Han, Yinhe, additional, and Chen, Xiaoming, additional
Published: 2023
Full Text: View/download PDF

29. Frequency-Domain Inference Acceleration for Convolutional Neural Networks Using ReRAMs

Author: Liu, Bosheng, primary, Jiang, Zhuoshen, additional, Wu, Yalan, additional, Wu, Jigang, additional, Chen, Xiaoming, additional, Liu, Peng, additional, Zhou, Qingguo, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

30. Breaking the von Neumann bottleneck: architecture-level processing-in-memory technology

Author: Zou, Xingqi, Xu, Sheng, Chen, Xiaoming, Yan, Liang, and Han, Yinhe
Published: 2021
Full Text: View/download PDF

31. PANG: A Pattern-Aware GCN Accelerator for Universal Graphs

Author: Du, Yibo, primary, Wang, Ying, additional, Liang, Shengwen, additional, Li, Huawei, additional, Li, Xiaowei, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

32. Hardware-Software Co-Design for Content-Based Sparse Attention

Author: Tang, Rui, primary, Zhang, Xiaoyu, additional, Liu, Rui, additional, Luo, Zhejian, additional, Chen, Xiaoming, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

33. Dadu-RBD: Robot Rigid Body Dynamics Accelerator with Multifunctional Pipelines

Author: Yang, Yuxin, primary, Chen, Xiaoming, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

34. FSPA: An FeFET-based Sparse Matrix-Dense Vector Multiplication Accelerator

Author: Zhang, Xiaoyu, primary, Li, Zerun, additional, Liu, Rui, additional, Chen, Xiaoming, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

35. APPEND: Rethinking ASIP Synthesis in the Era of AI

Author: Li, Cangyuan, primary, Wang, Ying, additional, Li, Huawei, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

36. PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators

Author: Sun, Xiaotian, primary, Wang, Xinyu, additional, Li, Wanqian, additional, Wang, Lei, additional, Han, Yinhe, additional, and Chen, Xiaoming, additional
Published: 2023
Full Text: View/download PDF

37. RBDCore: Robot Rigid Body Dynamics Accelerator with Multifunctional Pipelines

Author: Yang, Yuxin, Chen, Xiaoming, and Han, Yinhe
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Robotics (cs.RO)
Abstract: Rigid body dynamics is a key technology in the robotics field. In trajectory optimization and model predictive control algorithms, there are usually a large number of rigid body dynamics computing tasks. Using CPUs to process these tasks consumes a lot of time, which will affect the real-time performance of robots. To this end, we propose a multifunctional robot rigid body dynamics accelerator, named RBDCore, to address the performance bottleneck. By analyzing different functions commonly used in robot dynamics calculations, we summarize their reuse relationship and optimize them according to the hardware. Based on this, RBDCore can fully reuse common hardware modules when processing different computing tasks. By dynamically switching the dataflow path, RBDCore can accelerate various dynamics functions without reconfiguring the hardware. We design Structure-Adaptive Pipelines for RBDCore, which can greatly improve the throughput of the accelerator. Robots with different structures and parameters can be optimized specifically. Compared with the state-of-the-art CPU, GPU dynamics libraries and FPGA accelerator, RBDCore can significantly improve the performance.
Published: 2023

38. Statistical energy optimization on voltage–frequency island based MPSoCs in the presence of process variations

Author: Jin, Song, Han, Yinhe, and Pei, Songwei
Published: 2016
Full Text: View/download PDF

39. The Abacus Turn Model

Author: Fu, Binzhang, Han, Yinhe, Li, Huawei, Li, Xiaowei, Palesi, Maurizio, editor, and Daneshtalab, Masoud, editor
Published: 2014
Full Text: View/download PDF

40. An Automatic Neural Network Architecture-and-Quantization Joint Optimization Framework for Efficient Model Inference

Author: Liu, Lian, Wang, Ying, Zhao, Xiandong, Chen, Weiwei, Li, Huawei, Li, Xiaowei, and Han, Yinhe
Abstract: Efficient deep learning models, especially those optimized for edge devices, offer benefits ranging from low inference latency to efficient energy consumption. Two classical techniques for efficient model inference are lightweight neural architecture search (NAS), which automatically designs compact network models, and quantization, which reduces the bit-precision of neural network models. As a consequence, joint design for both neural architecture and quantization precision settings is becoming increasingly popular. There are three main aspects that affect the performance of the joint optimization between neural architecture and quantization: 1) quantization precision selection (QPS); 2) quantization-aware training (QAT); and 3) NAS. However, existing works address at most two of these aspects, resulting in suboptimal performance. To this end, we propose a novel automatic optimization framework, DAQU, that allows jointly searching for Pareto-optimal neural architecture and quantization precision combinations among more than $10^{47}$ quantized subnet models. To overcome the instability of the conventional automatic optimization framework, DAQU incorporates a warm-up strategy to reduce the accuracy gap among different neural architectures, and a precision-transfer training approach to maintain flexibility among different quantization precision settings. Our experiments show that the quantized lightweight neural networks generated by DAQU consistently outperform state-of-the-art NAS and quantization joint optimization methods.
Published: 2024
Full Text: View/download PDF

41. Accelerating DNN-based 3D point cloud processing for mobile computing

Author: Liu, Bosheng, Chen, Xiaoming, Han, Yinhe, Li, Jiajun, Xu, Haobo, and Li, Xiaowei
Published: 2019
Full Text: View/download PDF

42. Accelerating Convolutional Neural Networks in Frequency Domain via Kernel-Sharing Approach

Author: Liu, Bosheng, primary, Liang, Hongyi, additional, Wu, Jigang, additional, Chen, Xiaoming, additional, Liu, Peng, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

43. Towards Effective Neural Architecture Selection: The Promise of Self-Attention

Author: Sun, Zihao, primary, Hu, Yu, additional, Yang, Longxing, additional, Lu, Shun, additional, Mei, Jilin, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

44. IVP: An Intelligent Video Processing Architecture for Video Streaming

Author: Gao, Chengsi, primary, Wang, Ying, additional, Han, Yinhe, additional, Chen, Weiwei, additional, and Zhang, Lei, additional
Published: 2023
Full Text: View/download PDF

45. An Automatic Neural Network Architecture-and-Quantization Joint Optimization Framework for Efficient Model Inference

Author: Liu, Lian, primary, Wang, Ying, additional, Zhao, Xiandong, additional, Chen, Weiwei, additional, Li, Huawei, additional, Li, Xiaowei, additional, and Han, Yinhe, additional
Published: 2023
Full Text: View/download PDF

46. FeCrypto: Instruction Set Architecture for Cryptographic Algorithms Based on FeFET-based In-memory Computing

Author: Liu, Rui, primary, Zhang, Xiaoyu, additional, Xie, Zhiwen, additional, Wang, Xinyu, additional, Li, Zerun, additional, Chen, Xiaoming, additional, Han, Yinhe, additional, and Tang, Minghua, additional
Published: 2023
Full Text: View/download PDF

47. Dadu-SV: Accelerate Stereo Vision Processing on NPU

Author: Min, Feng, primary, Wang, Ying, additional, Xu, Haobo, additional, Huang, Junpei, additional, Wang, Yujie, additional, Zou, Xingqi, additional, Lu, Meixuan, additional, and Han, Yinhe, additional
Published: 2022
Full Text: View/download PDF

48. Re-FeMAT: A Reconfigurable Multifunctional FeFET-Based Memory Architecture

Author: Zhang, Xiaoyu, primary, Liu, Rui, additional, Song, Tao, additional, Yang, Yuxin, additional, Han, Yinhe, additional, and Chen, Xiaoming, additional
Published: 2022
Full Text: View/download PDF

49. Amphis: Managing Reconfigurable Processor Architectures With Generative Adversarial Learning

Author: Chen, Weiwei, primary, Wang, Ying, additional, Xu, Ying, additional, Gao, Chengsi, additional, Han, Yinhe, additional, and Zhang, Lei, additional
Published: 2022
Full Text: View/download PDF

50. GIA

Author: Li, Fuping, primary, Wang, Ying, additional, Cheng, Yuanqing, additional, Wang, Yujie, additional, Han, Yinhe, additional, Li, Huawei, additional, and Li, Xiaowei, additional
Published: 2022
Full Text: View/download PDF
