1. The process of summarization in the pre-processing stage in order to improve measurement of texts when clustering.
- Author
Guelpeli, Marcus V. C., Garcia, Ana Cristina B., and Branco, Antonio Horta
- Abstract
This work introduces the Cassiopeia model, which allows for knowledge discovery in textual bases and is used for text mining in distinct and/or antagonistic domains. The most relevant contributions include the use of summarized texts as input to the pre-processing stage of clusterization, language independence through the use of stop words, and the treatment of high dimensionality, a problem that is inherent to text mining. In the knowledge extraction stage, the texts are clustered and reclustered according to a similarity criterion. With the results obtained, the study hopes to show the impact of including summarization in the process of text clusterization. The experiments conducted in this study indicate that text clusterization using summaries is in fact much more effective than direct clusterization of texts in their entirety, as measured by internal and external measures traditionally employed in the field of text clusterization. Finally, the post-processing stage creates clusters of summarized texts with a high degree of informativity, a quality that is inherent to summarization. The clusters are highly esteemed with the indexed words. This fact is due to the process proposed by the Cassiopeia model, which allows for strong similarity among the clustered texts. In the future, this similarity will allow for the creation of categories based on the word indices of each cluster. [ABSTRACT FROM PUBLISHER]
- Published
- 2011