
A Method for Constructing a Sentence-Filtering Dataset for Web Corpora (웹 말뭉치에 대한 문장 필터링 데이터 셋 구축 방법).

Authors :
남충현
장경식
Source :
Journal of the Korea Institute of Information & Communication Engineering; Nov 2021, Vol. 25, Issue 11, pp. 1505-1511 (7 pages)
Publication Year :
2021

Abstract

Pretrained models that perform well across many natural language processing tasks learn the linguistic patterns of sentences from a large corpus during training, which allows each token in an input sentence to be represented by an appropriate feature vector. One way to build the corpus needed to train a pretrained model is to collect text with a web crawler. However, sentences found on the web follow widely varying patterns, so part or all of a sentence may consist of unnecessary words. In this paper, we propose a method for constructing a dataset for filtering, with neural network models, sentences that contain unnecessary words in a corpus collected from the web. Using this method, we constructed a dataset of 2,330 sentences. We also evaluated neural network models on the constructed dataset; the BERT model performed best, with an accuracy of 93.75%. [ABSTRACT FROM AUTHOR]
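
The filtering and evaluation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `looks_noisy` is a hypothetical keyword heuristic standing in for a trained classifier such as BERT, and the function names and sample sentences are assumptions made only for illustration.

```python
from typing import Callable, Iterable, List


def looks_noisy(sentence: str) -> bool:
    """Hypothetical stand-in for a trained classifier (e.g. BERT).

    Flags a sentence when it contains one of a few illustrative markers
    of web boilerplate. A real system would call a trained model here.
    """
    noisy_markers = ("http://", "https://", "Click here", "©")
    return any(marker in sentence for marker in noisy_markers)


def filter_sentences(
    sentences: Iterable[str], is_noisy: Callable[[str], bool]
) -> List[str]:
    """Keep only the sentences the classifier does not flag as noisy."""
    return [s for s in sentences if not is_noisy(s)]


def accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Illustrative sentences only; they are not taken from the paper's dataset.
    crawled = [
        "The model learns token representations from a large corpus.",
        "Click here to subscribe to our newsletter!",
        "Read more at https://example.com/article",
    ]
    print(filter_sentences(crawled, looks_noisy))
```

Passing the classifier in as a callable keeps the filtering step independent of the model, so the heuristic could be swapped for any trained classifier that maps a sentence to a boolean label.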

Details

Language :
Korean
ISSN :
2234-4772
Volume :
25
Issue :
11
Database :
Complementary Index
Journal :
Journal of the Korea Institute of Information & Communication Engineering
Publication Type :
Academic Journal
Accession number :
155093026
Full Text :
https://doi.org/10.6109/jkiice.2021.25.11.1505