Towards Pareto optimal throughput in small language model serving
- Author
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Barcelona Supercomputing Center; Universitat Politècnica de Catalunya. CROMAI - Computing Resources Orchestration and Management for AI
García Recasens, Pol; Zhu, Yue; Wang, Chen; Lee, Eun Kyung; Tardieu, Olivier; Youssef, Alaa; Torres Viñals, Jordi; Berral García, Josep Lluís
- Abstract
Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally expensive and memory-intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who can now serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy consumption. Our analysis offers a new perspective on serving, highlighting that the small memory footprint of SLMs makes it possible to reach Pareto-optimal throughput within the resource capacity of a single accelerator. Building on this, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.

This work has been partially financed by grant agreement EUHORIZON GA.101095717 and by the EU-HORIZON MSCA programme under grant agreement EU-HORIZON MSCA GA.101086248. It has also been partially financed by the Generalitat de Catalunya (AGAUR) under grant agreement 2021-SGR-00478, and by the Spanish Ministry of Science (MICINN), the Research State Agency (AEI) and the European Regional Development Funds (ERDF/FEDER) under grant agreement PID2021-126248OB-I00, MCIN/AEI/10.13039/501100011033/FEDER, UE.

Peer Reviewed, Postprint (author's final draft)
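To make "Pareto-optimal throughput" concrete, the sketch below filters a set of serving configurations down to those that no other configuration beats on both objectives at once. It assumes the two objectives are throughput (higher is better) and per-request latency (lower is better). The `ServingConfig` type, its fields (replica count, batch size), and all numbers are hypothetical toy values chosen for illustration. They are not measurements from this paper, and this generic dominance filter is not presented as the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    # Hypothetical serving setup on a single accelerator (illustrative only).
    replicas: int       # number of model replicas sharing the accelerator
    batch_size: int     # maximum batch size per replica
    throughput: float   # tokens/s, higher is better
    latency: float      # seconds per request, lower is better


def dominates(a: ServingConfig, b: ServingConfig) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.throughput >= b.throughput and a.latency <= b.latency
    strictly_better = a.throughput > b.throughput or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(configs: list[ServingConfig]) -> list[ServingConfig]:
    """Keep the configurations not dominated by any other (input order preserved)."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up toy numbers, NOT results from the paper.
EXAMPLE_CONFIGS = [
    ServingConfig(replicas=1, batch_size=8, throughput=1000.0, latency=0.5),
    ServingConfig(replicas=1, batch_size=32, throughput=2500.0, latency=1.2),
    ServingConfig(replicas=2, batch_size=16, throughput=2400.0, latency=1.5),
    ServingConfig(replicas=2, batch_size=32, throughput=3200.0, latency=1.4),
    ServingConfig(replicas=1, batch_size=64, throughput=2600.0, latency=2.0),
]

front = pareto_front(EXAMPLE_CONFIGS)

if __name__ == "__main__":
    for c in front:
        print(c)
```

In this toy data, the two-replica, batch-32 configuration reaches the highest throughput while dominating the single-replica, batch-64 one. This is the kind of trade-off that replication on a single accelerator might expose, though the actual behaviour depends on the model, hardware, and serving stack.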
- Published
- 2024