1. Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean
- Author
Paschoal, André F. A., Pirozelli, Paulo, Freire, Valdinei, Delgado, Karina V., Peres, Sarajane M., José, Marcos M., Nakasato, Flávio, Oliveira, André S., Brandão, Anarosa A. F., Costa, Anna H. R., and Cozman, Fabio G.
- Subjects
FOS: Computer and information sciences, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Computation and Language (cs.CL)
- Abstract
Current research in natural language processing is highly dependent on carefully produced corpora. Most existing resources focus on English; some resources focus on languages such as Chinese and French; few resources deal with more than one language. This paper presents the Pirá dataset, a large set of questions and answers about the ocean and the Brazilian coast both in Portuguese and English. Pirá is, to the best of our knowledge, the first QA dataset with supporting texts in Portuguese, and, perhaps more importantly, the first bilingual QA dataset that includes this language. The Pirá dataset consists of 2261 properly curated question/answer (QA) sets in both languages. The QA sets were manually created based on two corpora: abstracts related to the Brazilian coast and excerpts of United Nations reports about the ocean. The QA sets were validated in a peer-review process with the dataset contributors. We discuss some of the advantages as well as limitations of Pirá, a new resource that can support a set of tasks in NLP such as question-answering, information retrieval, and machine translation. The dataset is available at https://github.com/C4AI/Pira
- Published
- 2022