Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Authors :: Paschoal, André F. A.
Pirozelli, Paulo
Freire, Valdinei
Delgado, Karina V.
Peres, Sarajane M.
José, Marcos M.
Nakasato, Flávio
Oliveira, André S.
Brandão, Anarosa A. F.
Costa, Anna H. R.
Cozman, Fabio G.
Publication Year :: 2022
Publisher :: arXiv, 2022.
Abstract: Current research in natural language processing is highly dependent on carefully produced corpora. Most existing resources focus on English; some resources focus on languages such as Chinese and French; few resources deal with more than one language. This paper presents the Pirá dataset, a large set of questions and answers about the ocean and the Brazilian coast, in both Portuguese and English. Pirá is, to the best of our knowledge, the first QA dataset with supporting texts in Portuguese and, perhaps more importantly, the first bilingual QA dataset that includes this language. The Pirá dataset consists of 2261 properly curated question/answer (QA) sets in both languages. The QA sets were manually created based on two corpora: abstracts related to the Brazilian coast and excerpts of United Nations reports about the ocean. The QA sets were validated in a peer-review process with the dataset contributors. We discuss some of the advantages as well as limitations of Pirá, as this new resource can support a set of tasks in NLP such as question-answering, information retrieval, and machine translation.
https://github.com/C4AI/Pira

Subjects :: FOS: Computer and information sciences
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
Computation and Language (cs.CL)
