Pirá: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean
- Publication Year :
- 2022
- Publisher :
- arXiv, 2022.
Abstract
- Current research in natural language processing is highly dependent on carefully produced corpora. Most existing resources focus on English; some resources focus on languages such as Chinese and French; few resources deal with more than one language. This paper presents the Pirá dataset, a large set of questions and answers about the ocean and the Brazilian coast in both Portuguese and English. Pirá is, to the best of our knowledge, the first QA dataset with supporting texts in Portuguese, and, perhaps more importantly, the first bilingual QA dataset that includes this language. The Pirá dataset consists of 2261 properly curated question/answer (QA) sets in both languages. The QA sets were manually created based on two corpora: abstracts related to the Brazilian coast and excerpts of United Nations reports about the ocean. The QA sets were validated in a peer-review process with the dataset contributors. We discuss some of the advantages as well as limitations of Pirá, as this new resource can support a set of tasks in NLP such as question-answering, information retrieval, and machine translation.
- https://github.com/C4AI/Pira
Details
- Database :
- OpenAIRE
- Accession number :
- edsair.doi...........86d7d3eeda851a3f9b7ac044a0f96216
- Full Text :
- https://doi.org/10.48550/arxiv.2202.02398