
Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model.

Authors :
Razaq, Abdur
Halim, Zahid
Ur Rahman, Atta
Sikandar, Kholla
Source :
Multimedia Tools & Applications; Sep 2024, Vol. 83 Issue 30, p74205-74232, 28p
Publication Year :
2024

Abstract

With the emergence of Artificial Intelligence (AI) technologies that generate new text and paraphrase existing text, identifying genuinely written text has become an important research problem. Past approaches to this problem require a significant volume of human-labeled data. Most approaches in the literature target either noisy text or clean text. Noisy text, which may contain misspelled or incomplete words, includes chat conversations, blog text, mobile text messages, and messenger exchanges. Clean text is free from these characteristics. As research articles do not contain noisy data, we propose a model that focuses on clean text for the identification of paraphrases in research articles. To address the problem of paraphrase detection, this work presents a novel fine-tuned model based on Bidirectional Encoder Representations from Transformers (BERT). For word representation, Global Vectors (GloVe) embeddings and contextualized Embeddings from Language Models (ELMo) are employed. The model is first evaluated without preprocessing and then evaluated again with a preprocessing step applied. Extensive experiments are performed to evaluate the proposed model on two benchmark datasets, namely Microsoft Research Paraphrase (MSRP) and Quora Question Pairs (Quora). The proposed model is compared with four closely related state-of-the-art works. The results show that fine-tuned BERT using ELMo embeddings with preprocessing produces promising outcomes. The paraphrase identification rates achieved on the MSRP and Quora datasets are 86.51% and 94.32%, respectively, which are better than those of the competing methods. The proposed solution identifies paraphrased text with higher accuracy and has applications in multiple domains that require genuinely written documents. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1380-7501
Volume :
83
Issue :
30
Database :
Complementary Index
Journal :
Multimedia Tools & Applications
Publication Type :
Academic Journal
Accession number :
179395171
Full Text :
https://doi.org/10.1007/s11042-024-18359-w