Author: "Wu, Yongwei" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search for "Wu, Yongwei" returned 4 results.

1. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Author: Qin, Ruoyu, Li, Zheming, He, Weiran, Zhang, Mingxing, Wu, Yongwei, Zheng, Weimin, and Xu, Xinran
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Hardware Architecture
Abstract: Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
Comment: 23 pages, 13 figures
Published: 2024

2. Efficient and Economic Large Language Model Inference with Attention Offloading

Author: Chen, Shaoyuan, Lin, Yutong, Zhang, Mingxing, and Wu, Yongwei
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.
Published: 2024
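
The abstract above states that the attention computation can be split over multiple devices. The following is a minimal sketch of why such a split can be exact, not the Lamina implementation: each shard of the KV cache produces a partial result (running maximum, sum of exponentials, weighted value sum), and the partials are merged with the standard log-sum-exp rescaling. The function names, the pure-Python lists standing in for device memory, and the even sharding scheme are all assumptions made for illustration.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over one KV shard.

    Returns (max_score, sum_exp, weighted_sum) so that shards can be merged later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # shifted for numerical stability
    dim = len(values[0])
    acc = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return m, sum(weights), acc


def merge(partials):
    """Combine per-shard partial results into the exact softmax-weighted output."""
    m_all = max(m for m, _, _ in partials)
    denom = 0.0
    acc = [0.0] * len(partials[0][2])
    for m, d, a in partials:
        f = math.exp(m - m_all)  # rescale each shard to the global maximum
        denom += f * d
        acc = [x + f * y for x, y in zip(acc, a)]
    return [x / denom for x in acc]


def full_attention(q, keys, values):
    """Reference: attention computed over the whole KV cache at once."""
    return merge([partial_attention(q, keys, values)])


def split_attention(q, keys, values, num_shards):
    """Hypothetical offloaded variant: each shard could live on a different device."""
    size = math.ceil(len(keys) / num_shards)
    partials = [
        partial_attention(q, keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    return merge(partials)
```

In this sketch only three small quantities per shard cross the device boundary, regardless of how many keys the shard holds, which is consistent with the abstract's remark that the required communication bandwidth is manageable.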

3. A Case for Asymmetric Non-Volatile Memory Architecture

Author: Ma, Teng, Zhang, Mingxing, Chen, Kang, Qian, Xuehai, and Wu, Yongwei
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The byte-addressable Non-Volatile Memory (NVM) is a promising technology since it simultaneously provides DRAM-like performance, disk-like capacity, and persistency. The current NVM deployment is symmetric, where NVM devices are directly attached to servers. Due to the higher density, NVM provides larger capacity and can be shared among servers. Unfortunately, in the symmetric setting, the availability of NVM devices is affected by the specific machine they are attached to. High availability can be realized by replicating data to NVM on a remote machine. However, it requires full replication of the data structure in local memory, limiting the size of the working set. This paper rethinks NVM deployment and makes a case for the asymmetric NVM architecture, which decouples servers from persistent data storage. In the proposed AsymNVM architecture, NVM devices (back-end nodes) can be shared by multiple servers (front-end nodes) and provide recoverable persistent data structures. The asymmetric architecture is made possible by RDMA, and follows the recent industry trend of resource disaggregation. We build the AsymNVM framework based on the AsymNVM architecture that implements: 1) high performance persistent data structure update; 2) NVM data management; 3) concurrency control; and 4) crash-consistency and replication. The central idea is to use operation logs to reduce the stall due to RDMA writes and enable efficient batching and caching in front-end nodes. To evaluate performance, we construct eight widely used data structures and two applications based on the AsymNVM framework, and use traces of industry workloads. In a cluster with ten machines, the results show that AsymNVM achieves comparable performance to the best possible symmetric architecture while avoiding all the drawbacks with disaggregation. Compared to the baseline AsymNVM, speedup brought by the proposed optimizations is 6~22x.
Comment: 18 pages
Published: 2018

4. RFP: A Remote Fetching Paradigm for RDMA-Accelerated Systems

Author: Su, Maomeng, Zhang, Mingxing, Chen, Kang, Wu, Yongwei, and Li, Guoliang
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Remote Direct Memory Access (RDMA) is an efficient way to improve the performance of traditional client-server systems. Currently, there are two main design paradigms for RDMA-accelerated systems. The first allows the clients to directly operate the server's memory and totally bypasses the CPUs at the server side. The second follows the traditional server-reply paradigm, which asks the server to write results back to the clients. However, the first method has to expose the server's memory and needs tremendous re-design of upper-layer software, which is complex, unsafe, error-prone, and inefficient. The second cannot achieve high input/output operations per second (IOPS), because it employs out-bound RDMA-write at the server side, which is not efficient. We find that the performance of out-bound RDMA-write and in-bound RDMA-read is asymmetric and the latter is 5 times faster than the former. Based on this observation, we propose a novel design paradigm named Remote Fetching Paradigm (RFP). In RFP, the server is still responsible for processing requests from the clients. However, counter-intuitively, instead of sending results back to the clients through out-bound RDMA-write, the server only writes the results in local memory buffers, and the clients use in-bound RDMA-read to remotely fetch these results. Since in-bound RDMA-read achieves much higher IOPS than out-bound RDMA-write, our model is able to bring higher performance than the traditional models. In order to prove the effectiveness of RFP, we design and implement an RDMA-accelerated in-memory key-value store following the RFP model. To further improve the IOPS, we propose an optimization mechanism that combines status checking and result fetching. Experimental results show that RFP can improve the IOPS by 160%~310% against state-of-the-art models for in-memory key-value stores.
Comment: 11 pages, 10 figures; Key Words: RDMA and InfiniBand, Remote Fetching Paradigm, IOPS, and Small Data
Published: 2015
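
The control flow described in the RFP abstract can be sketched as an in-process simulation. This is an illustrative toy under stated assumptions, not the authors' system and not real RDMA: the `Server`, `ResultSlot`, and `try_fetch` names are hypothetical, `submit` stands in for however a request reaches the server, and `remote_read` stands in for a one-sided in-bound RDMA read. The single read returns the ready flag together with the payload, mirroring the abstract's idea of combining status checking with result fetching.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ResultSlot:
    """Per-client result buffer that lives in the server's own memory."""
    ready: bool = False
    payload: bytes = b""


class Server:
    """Toy key-value server following the remote-fetching control flow."""

    def __init__(self, num_clients):
        self.store = {}
        self.pending = deque()
        self.slots = [ResultSlot() for _ in range(num_clients)]

    def submit(self, client_id, request):
        # Stand-in for request delivery; marks the client's slot as not ready.
        self.slots[client_id].ready = False
        self.pending.append((client_id, request))

    def process_one(self):
        # The server still processes the request, but writes the result only
        # into its local buffer instead of pushing it to the client.
        if not self.pending:
            return
        client_id, (op, key, value) = self.pending.popleft()
        if op == "put":
            self.store[key] = value
            result = b"OK"
        else:
            result = self.store.get(key, b"")
        slot = self.slots[client_id]
        slot.payload = result
        slot.ready = True

    def remote_read(self, client_id):
        # Stand-in for a one-sided read: status and payload come back together.
        slot = self.slots[client_id]
        return slot.ready, slot.payload


def try_fetch(server, client_id):
    """Client side: one remote read; returns the payload, or None if not ready."""
    ready, payload = server.remote_read(client_id)
    return payload if ready else None
```

A client in this sketch would call `submit`, then call `try_fetch` repeatedly until it returns something other than `None`; the server never initiates a transfer toward the client.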
