1. KVDirect: Distributed Disaggregated LLM Inference
- Authors
Chen, Shiyang; Jiang, Rain; Yu, Dezhi; Xu, Jinlai; Chao, Mengyuan; Meng, Fanlong; Jiang, Chenyu; Xu, Wei; and Liu, Hang
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Machine Learning; Computer Science - Performance
- Abstract
Large Language Models (LLMs) have become the new foundation for many applications, taking society by storm. Disaggregated inference, which separates the prefill and decode stages, is a promising approach to improving hardware utilization and service quality. However, due to inefficient inter-node communication, existing systems restrict disaggregated inference to a single node, limiting resource allocation flexibility and reducing service capacity. This paper introduces KVDirect, which optimizes KV cache transfer to enable distributed disaggregated LLM inference. KVDirect achieves this through the following contributions. First, we propose a novel tensor-centric communication mechanism that reduces the synchronization overhead in traditional distributed GPU systems. Second, we design a custom communication library to support dynamic GPU resource scheduling and efficient KV cache transfer. Third, we introduce a pull-based KV cache transfer strategy that reduces GPU resource idling and improves latency (illustrated by the sketch after this entry). Finally, we implement KVDirect as an open-source LLM inference framework. Our evaluation demonstrates that KVDirect reduces per-request latency by 55% compared to the baseline across diverse workloads under the same resource constraints.
- Published
2024
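To make the pull-based idea in the abstract concrete, here is a minimal sketch, not KVDirect's actual implementation. All names (`kv_registry`, `prefill`, `decode_loop`) are hypothetical, Python threads stand in for GPU workers, and plain lists stand in for KV tensors that a real system would move between nodes (e.g., via RDMA). The point it illustrates is only the abstract's claim: the prefill side publishes its finished KV cache and moves on, and the decode side initiates the transfer when it has a free slot, so neither side idles waiting on the other.

```python
import queue
import threading
import time

# Hypothetical in-memory stand-in for a distributed metadata store.
kv_registry: dict[str, list[float]] = {}
registry_lock = threading.Lock()
pending: queue.Queue[str] = queue.Queue()  # request IDs with a ready KV cache

def prefill(req_id: str, prompt: str) -> None:
    """Prefill stage: compute the KV cache, publish it, move on.

    In a pull model the prefill worker never blocks on the decode side;
    it registers the finished cache and is immediately free for the next
    prompt, which is what keeps prefill utilization high.
    """
    kv_cache = [float(len(t)) for t in prompt.split()]  # toy stand-in for KV tensors
    with registry_lock:
        kv_registry[req_id] = kv_cache
    pending.put(req_id)
    print(f"[prefill] {req_id}: published {len(kv_cache)} KV blocks")

def decode_loop(n_requests: int) -> None:
    """Decode stage: pull a KV cache only when a decode slot is free.

    The receiver initiates the transfer (the 'pull'), so the sender is
    never stalled holding a half-finished rendezvous with the receiver.
    """
    for _ in range(n_requests):
        req_id = pending.get()  # wait until some request's cache is ready
        with registry_lock:
            kv_cache = kv_registry.pop(req_id)
        print(f"[decode] {req_id}: pulled {len(kv_cache)} blocks, decoding")
        time.sleep(0.05)  # simulate decode work

if __name__ == "__main__":
    decoder = threading.Thread(target=decode_loop, args=(3,))
    decoder.start()
    for i, prompt in enumerate(["a b c", "d e", "f g h i"]):
        prefill(f"req-{i}", prompt)
    decoder.join()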