Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry.
- Source :
- Computers & Geosciences. Nov2024, Vol. 193, pN.PAG-N.PAG. 1p.
- Publication Year :
- 2024
Abstract
- Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable information among these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general domain corpora extracted from the Internet and primarily written in English. This paper presents Petro NLP, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. We connected an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The Petro NLP resources comprise: (i) Petro KGraph: a knowledge graph populated with entities and relations commonly found in technical reports; and (ii) Petrolês, PetroGold, PetroNER, and PetroRE: sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.
• Petro NLP is a set of natural language processing and information extraction resources.
• It is specialized in the oil and gas industry in Portuguese.
• We formally defined a geoscientific ontology and populated a knowledge graph.
• We present corpora with morphosyntactic, named-entity, and relation annotations.
• These are key resources to train machine learning models for information extraction.
[ABSTRACT FROM AUTHOR]
- Subjects :
- *MACHINE learning
*KNOWLEDGE graphs
*DATA mining
*CORPORA
*NATURAL resources
Details
- Language :
- English
- ISSN :
- 00983004
- Volume :
- 193
- Database :
- Academic Search Index
- Journal :
- Computers & Geosciences
- Publication Type :
- Academic Journal
- Accession number :
- 180772116
- Full Text :
- https://doi.org/10.1016/j.cageo.2024.105714