Benchmarks for Pirá 2.0, a Reading Comprehension Dataset on the ocean, the Brazilian coast, and climate change

Authors :: Paulo Pirozelli; Marcos M. José; Igor Silveira; Flávio Nakasato; Sarajane M. Peres; Anarosa A. F. Brandão; Anna H. R. Costa; Fabio G. Cozman
Publication Year :: 2022
Publisher :: Research Square Platform LLC, 2022.
Abstract: Pirá is a recently developed reading comprehension dataset focused on the ocean, the Brazilian coast, and climate change. No detailed set of baselines has yet been built for this dataset, which hinders its use by researchers. In this paper, we define five benchmarks over the Pirá dataset, covering machine reading comprehension, information retrieval, open question answering, answer triggering, and multiple choice question answering. As part of this effort, we have produced a curated version of the original dataset, in which we fixed a number of grammar issues, repetitions, and other shortcomings. Furthermore, the dataset, now called Pirá 2.0, has been extended in several new directions to support the aforementioned benchmark tasks: translation of supporting texts into Portuguese, classification labels on answerability, multiple choice candidates, and automatic paraphrases of questions and answers. The results described in this paper provide a reference point for researchers working with Pirá 2.0. Our results show that Pirá 2.0 is a very challenging dataset, particularly useful for testing the ability of current machine learning models to acquire expert scientific knowledge.
