CnGeoPLM: Contextual knowledge selection and embedding with pretrained language representation model for the geoscience domain.
- Source :
- Earth Science Informatics. Dec 2023, Vol. 16, Issue 4, p3629-3646. 18p.
- Publication Year :
- 2023
Abstract
- Several recent efforts have been devoted to enhancing pretrained language models (PLMs) with extra heterogeneous knowledge from knowledge graphs (KGs), achieving consistent improvements on various knowledge-driven natural language processing (NLP) tasks. However, most of these knowledge-enhanced PLMs disregard the domain-specific knowledge that PLMs require, and no large-scale Chinese corpus exists to support such PLMs, especially domain-specific ones. Herein, we harvest and process public geological reports (mainly reports from the National Geological Archives of China), abstracts of papers from 34 geological journals, and abstracts of geological dissertations from 161 universities and scientific institutions. This paper introduces a large-scale Chinese geological corpus (GeoCorpus) containing approximately 232 MB of training data with a total of 243 million characters. Based on GeoCorpus, we also introduce a Chinese pretrained language representation model for the geological domain (CnGeoPLM). The model is initialized with the parameters of generic Bidirectional Encoder Representations from Transformers (BERT) and then undergoes a second stage of pretraining on GeoCorpus. We evaluate CnGeoPLM against BERT on three tasks: geological named entity recognition (GeoNER), geological entity relationship extraction (GeoRE), and geological entity clustering (GeoClu). On both GeoNER and GeoRE, CnGeoPLM improves over BERT on the evaluation metrics to varying degrees. On GeoClu, CnGeoPLM produces markedly more discriminative entity clusters than BERT. These results show that CnGeoPLM has stronger representation ability on geological NLP tasks than BERT. [ABSTRACT FROM AUTHOR]
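- The second-stage pretraining described in the abstract follows BERT's masked-language-model (MLM) objective applied to the domain corpus. As a hedged illustration (not code from the paper; function and parameter names are hypothetical), the sketch below implements the standard BERT masking scheme: 15% of token positions are selected, and of those, 80% are replaced with the [MASK] id, 10% with a random token, and 10% left unchanged.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption of a token-id sequence (illustrative sketch).

    Each position is selected with probability `mask_prob`; of the selected
    positions, 80% become `mask_id`, 10% a random token id, 10% unchanged.
    Returns (corrupted_ids, labels), where labels hold the original id at
    selected positions and -100 elsewhere (the id conventionally ignored
    by the cross-entropy loss).
    """
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    corrupted, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            labels.append(tid)          # model must predict the original id
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(mask_id)                 # 80%: [MASK]
            elif roll < 0.9:
                corrupted.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                corrupted.append(tid)                     # 10%: unchanged
        else:
            labels.append(-100)         # position excluded from the loss
            corrupted.append(tid)
    return corrupted, labels
```

- In practice this corruption is applied per batch over the domain corpus while continuing to train from a generic Chinese BERT checkpoint; e.g., `mask_id=103` and `vocab_size=21128` would match the bert-base-chinese vocabulary (values given here as assumptions for illustration).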
- Subjects :
- *LANGUAGE models
*NATURAL language processing
*KNOWLEDGE graphs
Details
- Language :
- English
- ISSN :
- 1865-0473
- Volume :
- 16
- Issue :
- 4
- Database :
- Academic Search Index
- Journal :
- Earth Science Informatics
- Publication Type :
- Academic Journal
- Accession number :
- 174096730
- Full Text :
- https://doi.org/10.1007/s12145-023-01112-6