EuroLLM: Multilingual Language Models for Europe

Authors :
Martins, Pedro Henrique
Fernandes, Patrick
Alves, João
Guerreiro, Nuno M.
Rei, Ricardo
Alves, Duarte M.
Pombal, José
Farajian, Amin
Faysse, Manuel
Klimaszewski, Mateusz
Colombo, Pierre
Haddow, Barry
de Souza, José G. C.
Birch, Alexandra
Martins, André F. T.
Publication Year :
2024

Abstract

The quality of open-weight LLMs has improved significantly, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2409.16235
Document Type :
Working Paper