MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
- Source :
- EAMT 2022: Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pp. 303-304
- Publication Year :
- 2022
- Publisher :
- European Association for Machine Translation, 2022.
Abstract
- We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.
Details
- Language :
- English
- Database :
- OpenAIRE
- Journal :
- EAMT 2022: Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pp. 303-304
- Accession number :
- edsair.narcis........5a0a68e6c4f890c520eb13d3be112a97