CnGeoPLM: Contextual knowledge selection and embedding with pretrained language representation model for the geoscience domain.

Authors :
Ma, Kai
Zheng, Shuai
Tian, Miao
Qiu, Qinjun
Tan, Yongjian
Hu, Xinxin
Li, HaiYan
Xie, Zhong
Source :
Earth Science Informatics. Dec 2023, Vol. 16, Issue 4, p3629-3646. 18p.
Publication Year :
2023

Abstract

Several recent efforts have been devoted to enhancing pretrained language models (PLMs) with extra heterogeneous knowledge from knowledge graphs (KGs) and have achieved consistent improvements on various knowledge-driven natural language processing (NLP) tasks. However, regardless of the domain-specific knowledge that PLMs require, most of these knowledge-enhanced PLMs lack a Chinese corpus large enough to support large-scale pretraining, especially for domain-specific PLMs. Herein, we harvest and process public geological reports (mainly reports from the National Geological Archives of China), abstracts of papers from 34 geological journals, and abstracts of geological dissertations from 161 universities and scientific institutions. This paper introduces a large-scale Chinese geological corpus (GeoCorpus) containing approximately 232 MB of training data with a total of 243 million characters. Based on GeoCorpus, we also introduce a Chinese pretrained language representation model for the geological domain (CnGeoPLM). The model is initialized with the parameters of generic Bidirectional Encoder Representations from Transformers (BERT) and then undergoes a second stage of training on GeoCorpus. We evaluate CnGeoPLM against BERT on three tasks: geological named entity recognition (GeoNER), geological entity relationship extraction (GeoRE), and geological entity clustering (GeoClu). In both the GeoNER and GeoRE experiments, CnGeoPLM improves on BERT to varying degrees across the evaluation metrics. In the GeoClu experiment, CnGeoPLM produces markedly more effective and discriminative entity clusters than BERT. These results show that CnGeoPLM has a stronger representation ability on geological NLP tasks than BERT. [ABSTRACT FROM AUTHOR]
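The abstract describes initializing from a generic Chinese BERT and continuing training on the domain corpus. The sketch below is a minimal illustration of that kind of second-stage (domain-adaptive) pretraining using the HuggingFace Transformers and Datasets libraries; the base checkpoint, corpus path, masking probability, and hyperparameters are assumptions for illustration, not details taken from the paper.

# Minimal sketch of second-stage (domain-adaptive) masked-language-model
# pretraining of a generic Chinese BERT on a domain corpus.
# The model name, corpus path, and hyperparameters are illustrative assumptions.
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Initialize parameters from a generic Chinese BERT checkpoint (assumed).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Domain corpus assumed to be plain-text files, one passage per line (hypothetical path).
dataset = load_dataset("text", data_files={"train": "geocorpus/*.txt"})["train"]

def tokenize(batch):
    # Truncate to BERT's maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cngeoplm",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()

# Save the domain-adapted checkpoint for downstream tasks such as NER,
# relation extraction, or entity clustering.
model.save_pretrained("cngeoplm")
tokenizer.save_pretrained("cngeoplm")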

Details

Language :
English
ISSN :
18650473
Volume :
16
Issue :
4
Database :
Academic Search Index
Journal :
Earth Science Informatics
Publication Type :
Academic Journal
Accession number :
174096730
Full Text :
https://doi.org/10.1007/s12145-023-01112-6