Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry.

Authors :
Cordeiro, Fábio Corrêa
da Silva, Patrícia Ferreira
Tessarollo, Alexandre
Freitas, Cláudia
de Souza, Elvis
da Silva Magalhaes Gomes, Diogo
Souza, Renato Rocha
Coelho, Flávio Codeço
Source :
Computers & Geosciences. Nov 2024, Vol. 193, 1p.
Publication Year :
2024

Abstract

Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable knowledge buried in these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general-domain corpora extracted from the Internet and primarily written in English. This paper presents Petro NLP, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. We connected an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The Petro NLP resources comprise: (i) Petro KGraph, a knowledge graph populated with entities and relations commonly found in technical reports; and (ii) Petrolês, PetroGold, PetroNER, and PetroRE, sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.

• Petro NLP is a set of natural language processing and information extraction resources.
• It is specialized in the oil and gas industry in Portuguese.
• We formally defined a geoscientific ontology and populated a knowledge graph.
• We present corpora with morphosyntactic, named-entity, and relation annotations.
• These are key resources to train machine learning models for information extraction.

[ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
0098-3004
Volume :
193
Database :
Academic Search Index
Journal :
Computers & Geosciences
Publication Type :
Academic Journal
Accession number :
180772116
Full Text :
https://doi.org/10.1016/j.cageo.2024.105714