Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Authors :
Paschoal, André F. A.
Pirozelli, Paulo
Freire, Valdinei
Delgado, Karina V.
Peres, Sarajane M.
José, Marcos M.
Nakasato, Flávio
Oliveira, André S.
Brandão, Anarosa A. F.
Costa, Anna H. R.
Cozman, Fabio G.
Publication Year :
2022
Publisher :
arXiv, 2022.

Abstract

Current research in natural language processing is highly dependent on carefully produced corpora. Most existing resources focus on English; some resources focus on languages such as Chinese and French; few resources deal with more than one language. This paper presents the Pirá dataset, a large set of questions and answers about the ocean and the Brazilian coast in both Portuguese and English. Pirá is, to the best of our knowledge, the first QA dataset with supporting texts in Portuguese, and, perhaps more importantly, the first bilingual QA dataset that includes this language. The Pirá dataset consists of 2261 properly curated question/answer (QA) sets in both languages. The QA sets were manually created based on two corpora: abstracts related to the Brazilian coast and excerpts of United Nations reports about the ocean. The QA sets were validated in a peer-review process with the dataset contributors. We discuss some of the advantages as well as limitations of Pirá, as this new resource can support a set of tasks in NLP such as question-answering, information retrieval, and machine translation.
https://github.com/C4AI/Pira

Details

Database :
OpenAIRE
Accession number :
edsair.doi...........86d7d3eeda851a3f9b7ac044a0f96216
Full Text :
https://doi.org/10.48550/arxiv.2202.02398