Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information

Authors :
Crespo, Maria Clara Ramos Morales
Rocha, Maria Lina de Souza Jeannine
Sturzeneker, Mariana Lourenço
Serras, Felipe Ribas
de Mello, Guilherme Lamartine
Costa, Aline Silva
Palma, Mayara Feliciano
Mesquita, Renata Morais
Guets, Raquel de Paula
da Silva, Mariana Marques
Finger, Marcelo
de Sousa, Maria Clara Paixão
Namiuti, Cristiane
Monte, Vanessa Martins do
Publication Year :
2023

Abstract

This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts, under construction using a web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus is intended to serve both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, contributing towards removing Portuguese from the set of low-resource languages. Here we present the corpus construction methodology, comparing it with other existing methodologies, as well as the corpus's current state: Carolina's first public version has 653,322,577 tokens, distributed over 7 broad types. Each text is annotated with several different metadata categories in its header, which we developed using TEI annotation standards. We also present ongoing derivative works and invite NLP researchers to contribute their own.

Comment: 14 pages, 3 figures, 1 appendix

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2303.16098
Document Type :
Working Paper