Semantic clustering documents using Mini Batch K-means method.

Authors :: Haji, Vaman
Ibrahim, Rowaida
Jacksi, Karwan
Source :: AIP Conference Proceedings. 2024, Vol. 2944 Issue 1, p1-9. 9p.
Publication Year :: 2024
Abstract: The continued growth of the Internet has significantly increased the number of electronic text documents, and grouping these documents into meaningful collections has become an important problem. Earlier approaches grouped documents by statistical characteristics and relied more on syntactic notions than on semantic concepts, so documents that were semantically related were often not grouped together. Semantic similarity techniques are applied to solve this problem. This article introduces an approach for clustering text documents semantically. Document synopses are extracted from the Wikipedia and IMDB databases and combined, and NLTK is then employed to process them. Next, a vector space model is built using TF-IDF, yielding a numerical TF-IDF matrix. Finally, the Mini Batch K-means method is used for clustering. The results are compared with previous work on several datasets of different sizes, and various evaluation metrics are applied to determine the best approach among them.
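
The pipeline described in the abstract (text preprocessing, TF-IDF vectorization, Mini Batch K-means clustering) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: the toy synopses, the simple alphabetic tokenizer (a stand-in for NLTK preprocessing), the smoothed IDF formula, and all function names and parameter values (`tfidf_matrix`, `mini_batch_kmeans`, `batch_size`, `n_iter`, `seed`) are assumptions chosen for illustration. The clustering step follows the usual mini-batch update, in which each sampled point pulls its nearest center toward itself with a per-center learning rate of 1/count.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (assumed stand-in for NLTK preprocessing).
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def tfidf_matrix(docs):
    """Return (vocab, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for w, count in Counter(toks).items():
            # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1
            vec[index[w]] = count * (math.log((1 + n) / (1 + df[w])) + 1.0)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        rows.append([x / norm for x in vec])
    return vocab, rows


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    # Index of the center closest to the point (squared Euclidean distance).
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def mini_batch_kmeans(rows, k, batch_size=4, n_iter=50, seed=0):
    """Cluster rows into k groups using mini-batch center updates."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]  # initialize from k data points
    counts = [0] * k
    for _ in range(n_iter):
        batch = [rng.choice(rows) for _ in range(batch_size)]
        # Assign the whole batch first, then update the centers.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    labels = [nearest(r, centers) for r in rows]
    return labels, centers


# Invented toy synopses, for illustration only.
docs = [
    "A detective hunts a killer through the city at night.",
    "The detective follows the killer across the city.",
    "A police detective investigates a murder in the city.",
    "Two lovers meet in Paris and fall in love.",
    "A romance blooms when lovers reunite in Paris.",
    "Young lovers in Paris struggle to keep their love alive.",
]

vocab, rows = tfidf_matrix(docs)
labels, centers = mini_batch_kmeans(rows, k=2)
for label, doc in zip(labels, docs):
    print(label, doc)
```

Because each iteration touches only a small random batch rather than the full matrix, the update cost per step is independent of the collection size, which is the usual motivation for Mini Batch K-means on large document sets. On real data, library implementations of TF-IDF and Mini Batch K-means would normally replace these hand-written functions.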