A Method for Constructing a Sentence Filtering Dataset for Web Corpora.

Authors :: 남충현
장경식
Source :: Journal of the Korea Institute of Information & Communication Engineering; Nov2021, Vol. 25 Issue 11, p1505-1511, 7p
Publication Year :: 2021
Abstract: Pretrained models, which perform well across many natural language processing tasks, learn the linguistic patterns of sentences from a large corpus during training, so that each token in an input sentence can be represented by an appropriate feature vector. One way to build the corpus needed to train such a model is to collect text with a web crawler. However, sentences on the web follow varied patterns, and part or all of a sentence may consist of unnecessary words. In this paper, we propose a method for constructing a dataset that neural network models can use to filter sentences containing unnecessary words out of a corpus collected from the web. The resulting dataset contains 2,330 sentences in total. We also evaluated neural network models on the constructed dataset; the BERT model performed best, with an accuracy of 93.75%. [ABSTRACT FROM AUTHOR]

Subjects :: NATURAL language processing
ARTIFICIAL neural networks
TASK performance
DEEP learning
CORPORA

Language :: Korean
ISSN :: 2234-4772
Volume :: 25
Issue :: 11
Database :: Complementary Index
Journal :: Journal of the Korea Institute of Information & Communication Engineering
Publication Type :: Academic Journal
Accession number :: 155093026
Full Text :: https://doi.org/10.6109/jkiice.2021.25.11.1505