Performance evaluation of text-mining models with Hindi stopwords lists
- Source :
- Journal of King Saud University - Computer and Information Sciences. 34:2771-2786
- Publication Year :
- 2022
- Publisher :
- Elsevier BV, 2022.
Abstract
- News portals, government websites, and social media sites now generate a massive amount of digitized Hindi text. Stopword removal is an important step in text mining that helps improve system performance. This paper constructs corpus-specific stopword lists for Hindi text documents using statistical and knowledge-based methods. To prepare a stopword list, the proposed method takes the word rankings produced by the different methods and normalizes their outcomes using a vote-ranking method based on social choice theory. We further propose a method for evaluating the prepared stopword lists and investigate their behavior using text-mining models. We also compare our prepared stopword lists with the baselines and conclude that the technique that fetches the best features does not necessarily identify the candidate stop words. To the best of our knowledge, the proposed approach guarantees the removal of candidate stop words and has the least information dissipation.
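The abstract does not say which vote-ranking rule the authors use. As a rough illustration of how several word rankings can be merged into one list, the sketch below uses a Borda count, a standard rule from social choice theory that is assumed here and may differ from the paper's method. The function name `borda_aggregate`, the three input rankings, and the cutoff of three words are all hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge several ranked word lists into one ranking with a Borda count.

    Each input list is ordered from most to least stopword-like. A word
    earns n points for first place in a list of length n, n - 1 for
    second place, and so on; a word missing from a list earns nothing
    from that list.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, word in enumerate(ranking):
            scores[word] += n - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))


# Hypothetical rankings from three scoring methods (illustrative data only,
# not taken from the paper).
by_term_frequency = ["का", "है", "और", "भारत", "सरकार"]
by_document_frequency = ["है", "का", "और", "सरकार", "भारत"]
by_entropy = ["का", "और", "है", "भारत", "सरकार"]

combined = borda_aggregate(
    [by_term_frequency, by_document_frequency, by_entropy]
)

# Assumed cutoff: treat the three top-ranked words as candidate stopwords.
candidate_stopwords = combined[:3]
print(candidate_stopwords)
```

With these inputs the function words rank above the content words in the merged list, even though the three methods disagree on the exact order.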
- Subjects :
- Normalization (statistics)
Hindi
Stop words
General Computer Science
Computer science
Networking & telecommunications
Engineering and technology
Textual information
Text mining
Evaluation methods
Electrical engineering, electronic engineering, information engineering
Artificial intelligence & image processing
Social media
Artificial intelligence
Social choice theory
Natural language processing
Details
- ISSN :
- 1319-1578
- Volume :
- 34
- Database :
- OpenAIRE
- Journal :
- Journal of King Saud University - Computer and Information Sciences
- Accession number :
- edsair.doi...........2046b5d4fda51227385df5730a4b9760