MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Authors :: Bañón, Marta
Esplà-Gomis, Miquel
Forcada, Mikel L.
García-Romero, Cristian
Kuzman, Taja
Ljubešić, Nikola
van Noord, Rik
Sempere, Leopoldo Pla
Ramírez-Sánchez, Gema
Rupnik, Peter
Suchomel, Vít
Toral, Antonio
van der Werff, Tobias
Zaragoza, Jaume
Macken, Lieve
Rufener, Andrew
Van den Bogaert, Joachim
Daems, Joke
Tezcan, Arda
Vanroy, Bram
Fonteyne, Margot
Barrault, Loic
Costa-Jussa, Marta R.
Kemp, Ellie
Pilos, Spyridon
Declercq, Christophe
Koponen, Maarit
Scarton, Carolina
Moniz, Helena
Computational Linguistics (CL)
Source :: EAMT 2022-Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, 303-304
Publication Year :: 2022
Publisher :: European Association for Machine Translation, 2022.
Abstract: We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.

Language :: English
Database :: OpenAIRE
Journal :: EAMT 2022-Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, 303-304
Accession number :: edsair.narcis........5a0a68e6c4f890c520eb13d3be112a97
