Sentence Filtering Dataset Construction Method for Web Corpora (웹 말뭉치에 대한 문장 필터링 데이터 셋 구축 방법).
- Source :
- Journal of the Korea Institute of Information & Communication Engineering; Nov 2021, Vol. 25 Issue 11, p1505-1511, 7p
- Publication Year :
- 2021
Abstract
- Pretrained models, which perform well on a variety of natural language processing tasks, have the advantage of learning the linguistic patterns of sentences from a large corpus during training, so that each token in an input sentence can be represented by an appropriate feature vector. One way to build the corpus needed to train a pretrained model is to collect text with a web crawler. However, because sentences on the web follow widely varying patterns, part or all of a sentence may consist of unnecessary words. In this paper, we propose a method for constructing a dataset that neural network models can use to filter sentences containing unnecessary words out of a corpus collected from the web. Using this method, we constructed a dataset of 2,330 sentences. We also evaluated neural network models on the constructed dataset; the BERT model performed best, with an accuracy of 93.75%. [ABSTRACT FROM AUTHOR]
Details
- Language :
- Korean
- ISSN :
- 2234-4772
- Volume :
- 25
- Issue :
- 11
- Database :
- Complementary Index
- Journal :
- Journal of the Korea Institute of Information & Communication Engineering
- Publication Type :
- Academic Journal
- Accession number :
- 155093026
- Full Text :
- https://doi.org/10.6109/jkiice.2021.25.11.1505