Canopy with k-means clustering algorithm for big data analytics.

Authors :: Sagheer, Noor S.
Yousif, Suhad A.
Cakalli, Huseyin
Kocinac, Ljubisa D. R.
Ashyralyev, Allaberen
Harte, Robin
Dik, Mehmet
Canak, Ibrahim
Kandemir, Hacer Sengul
Tez, Mujgan
Gurtug, Ozay
Savas, Ekrem
Akay, Kadri Ulas
Ucgun, Filiz Cagatay
Uyaver, Sahin
Ashyralyyev, Charyyar
Sezer, Sefa Anil
Turkoglu, Arap Duran
Onvural, Oruc Raif
Sahin, Hakan
Source :: AIP Conference Proceedings; 2020, Vol. 2334 Issue 1, p1-4, 4p
Publication Year :: 2020
Abstract: Big Data is now gathered from many sources and in many formats, which makes it difficult to analyze with traditional methods. Apache Hadoop is a robust solution for storing and processing large datasets: HDFS (Hadoop Distributed File System) handles storage, and MapReduce handles processing. Clustering algorithms are among the essential methods for discovering new patterns in big data. In this paper, we use the canopy clustering algorithm provided by Apache Mahout, a distributed machine learning library, as a preprocessing step for the k-means clustering algorithm. The results show that using Canopy as a preprocessing step speeds up the handling of a massive-scale healthcare insurance dataset and reduces the execution time of k-means by providing initial centroids for the given dataset. [ABSTRACT FROM AUTHOR]
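
The idea of using canopy clustering to supply initial centroids for k-means can be illustrated with a minimal single-machine sketch in plain Python. This is not the Mahout or MapReduce implementation used in the paper. The function names (`canopy_centers`, `kmeans`), the thresholds `t1` and `t2`, and the toy two-dimensional points are all assumptions made for illustration, and the data bears no relation to the healthcare insurance dataset.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def canopy_centers(points, t1, t2):
    """One cheap pass that proposes cluster centers.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (points this close to a seed cannot seed a new canopy).
    Both values are illustrative assumptions and require t1 > t2 >= 0.
    """
    if not t1 > t2 >= 0:
        raise ValueError("expected t1 > t2 >= 0")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every point within t1 of the seed belongs to this canopy.
        members = [p for p in points if dist(seed, p) <= t1]
        centers.append(mean(members))
        # Points within t2 (including the seed itself) stop being candidates.
        remaining = [p for p in remaining if dist(seed, p) > t2]
    return centers


def kmeans(points, centroids, max_iter=100):
    """Lloyd's k-means starting from the supplied centroids.

    Returns the final centroids and the number of iterations performed.
    """
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            return centroids, iteration
        centroids = updated
    return centroids, max_iter


# Toy data: two well-separated groups (illustrative only).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

seeds = canopy_centers(points, t1=3.0, t2=1.5)
final_centroids, iterations = kmeans(points, seeds)
print(len(seeds), "canopy centers;", iterations, "k-means iteration(s)")
```

In this sketch the canopy pass also fixes the number of clusters, so k does not have to be chosen separately. Because the seeds already lie near the group means, k-means has little left to adjust, which matches the intuition behind the reported reduction in execution time.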