
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

Authors:
Yu, Minchen
Yang, Rui
Jia, Chaobo
Su, Zhaoyuan
Yao, Sheng
Lan, Tingfeng
Yang, Yuchen
Cheng, Yue
Wang, Wei
Wang, Ao
Chen, Ruichuan
Publication Year: 2025

Abstract

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate the dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system that achieves fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
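To make the "execute-while-load" idea concrete, the following is a minimal, self-contained Python sketch of overlapping model loading with inference. It is not λScale's implementation: the layer count, timing constants, and the load_model/run_inference functions are hypothetical, and simple sleeps stand in for RDMA transfers and GPU computation. The point it illustrates is that the forward pass can begin as soon as the first layers arrive, blocking only on layers that have not yet been received.

```python
import threading
import time

NUM_LAYERS = 8                  # hypothetical model depth
LOAD_TIME_PER_LAYER = 0.05      # simulated per-layer transfer time (seconds)
COMPUTE_TIME_PER_LAYER = 0.02   # simulated per-layer forward-pass time (seconds)

# One event per layer, set once that layer is resident on the node.
loaded = [threading.Event() for _ in range(NUM_LAYERS)]

def load_model():
    """Background 'multicast receiver': marks layers ready one by one."""
    for i in range(NUM_LAYERS):
        time.sleep(LOAD_TIME_PER_LAYER)   # stand-in for an RDMA transfer
        loaded[i].set()
        print(f"[loader] layer {i} resident")

def run_inference(request: str) -> str:
    """Forward pass that starts before the full model has arrived."""
    activation = request
    for i in range(NUM_LAYERS):
        loaded[i].wait()                    # block only if layer i is not yet received
        time.sleep(COMPUTE_TIME_PER_LAYER)  # stand-in for the layer's computation
        activation = f"{activation}->L{i}"
    return activation

if __name__ == "__main__":
    threading.Thread(target=load_model, daemon=True).start()
    start = time.time()
    out = run_inference("x")
    sequential = NUM_LAYERS * (LOAD_TIME_PER_LAYER + COMPUTE_TIME_PER_LAYER)
    print(f"[infer] finished in {time.time() - start:.2f}s "
          f"(vs. ~{sequential:.2f}s if the model were fully loaded before executing)")
```

In this toy setup, overlapping load and execution finishes close to the pure transfer time, whereas the load-then-execute baseline pays transfer and compute costs back to back; λPipe generalizes this overlap across multiple receiving nodes by pipelining layers over the multicast tree.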

Details

Database: arXiv
Publication Type: Report
Accession number: edsarx.2502.09922
Document Type: Working Paper