1. Village-Net Clustering: A Rapid approach to Non-linear Unsupervised Clustering of High-Dimensional Data
- Author
Ballal, Aditya; Datta, Esha; DePaul, Gregory A.; Carlsson, Erik; Chen-Izu, Ye; López, Javier E.; and Izu, Leighton T.
- Subjects
Computer Science - Machine Learning, Quantitative Biology - Quantitative Methods, Statistics - Machine Learning
- Abstract
Clustering large high-dimensional datasets with diverse variables is essential for extracting high-level latent information from these datasets. Here, we developed an unsupervised clustering algorithm that we call "Village-Net". Village-Net is specifically designed to cluster high-dimensional data effectively without a priori knowledge of the number of existing clusters. The algorithm operates in two phases. First, using K-Means clustering, it divides the dataset into distinct subsets that we refer to as "villages". Next, a weighted network is created that captures the proximity relationships between villages, with each node representing a village. To achieve optimal clustering, we process this network using the Walk-likelihood Community Finder (WLCF), a community detection algorithm developed by one of our team members. A salient feature of Village-Net Clustering is its ability to autonomously determine an optimal number of clusters for further analysis based on inherent characteristics of the data. We present extensive benchmarking on extant real-world datasets with known ground-truth labels to showcase its competitive performance, particularly in terms of the normalized mutual information (NMI) score, when compared to other state-of-the-art methods. The algorithm is computationally efficient, with a time complexity of O(N*k*d), where N is the number of instances, k is the number of villages, and d is the dimension of the dataset, which makes it well suited to large-scale datasets.
Comment: Software available at https://villagenet.streamlit.app/
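The two-phase structure described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' implementation. The edge-weight rule (counting points that lie near the border between their two nearest villages), the parameters `border_ratio` and `min_weight`, and the function names are all assumptions made for illustration. WLCF is not reproduced here; a plain connected-components merge (union-find) stands in for the community detection step.

```python
import math
import random
from collections import defaultdict


def kmeans_centroids(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: N * k distance evaluations, each costing O(d).
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def village_cluster(points, k, border_ratio=0.5, min_weight=1, seed=0):
    """Phase 1: k-means villages. Phase 2: merge villages joined by an edge.

    Assumed (hypothetical) edge rule: a point adds one unit of weight between
    its nearest and second-nearest villages when it lies close to their border.
    Requires k >= 2.
    """
    centroids = kmeans_centroids(points, k, seed=seed)
    village = []
    weight = defaultdict(int)
    for p in points:
        order = sorted(range(k), key=lambda j: math.dist(p, centroids[j]))
        a, b = order[0], order[1]
        village.append(a)
        d1, d2 = math.dist(p, centroids[a]), math.dist(p, centroids[b])
        if d2 > 0 and d1 / d2 >= border_ratio:
            weight[frozenset((a, b))] += 1

    # Stand-in for WLCF: connected components over sufficiently heavy edges.
    parent = list(range(k))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge, w in weight.items():
        if w >= min_weight:
            a, b = edge
            parent[find(a)] = find(b)
    return [find(v) for v in village]


# Usage: two well-separated 2-D blobs, deliberately over-segmented into 6 villages.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
blob_b = [(rng.gauss(100, 1), rng.gauss(100, 1)) for _ in range(60)]
labels = village_cluster(blob_a + blob_b, k=6)
print(len(set(labels)), "clusters found")
```

In this sketch the number of final clusters is not passed in; it emerges from how the villages connect, which mirrors the abstract's claim that the cluster count is determined from the data. The dominant cost is the k-means assignment step, which performs N*k distance computations in d dimensions per iteration.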
- Published
- 2025