Clustering High-dimensional Noisy Categorical and Mixed Data

Authors :
Tian, Zhiyi
Publication Year :
2021
Publisher :
Purdue University Graduate School, 2021.

Abstract

Clustering is an unsupervised learning technique widely used to group data into homogeneous clusters. For many real-world data sets containing categorical values, existing algorithms are often computationally costly in high dimensions, do not work well on noisy data with missing values, and rarely provide theoretical guarantees on clustering accuracy. In this thesis, we propose a general categorical data encoding method and a computationally efficient spectral-based algorithm to cluster high-dimensional noisy categorical (nominal or ordinal) data. Under a statistical model for data on m attributes from n subjects in r clusters with missing probability epsilon, we show that our algorithm exactly recovers the true clusters with high probability when mn(1 - epsilon) >= C M r^2 log^3 M, with M = max(n, m) and a fixed constant C. Moreover, we show that mn(1 - epsilon)^2 >= r*delta/2 with 0 < delta
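As a rough illustration of the exact-recovery condition stated in the abstract, the following sketch evaluates it for given problem sizes, reading the condition as mn(1 - epsilon) >= C M r^2 log^3 M. The function name `meets_recovery_condition` is hypothetical, the default `C=1.0` is only a placeholder (the abstract says C is a fixed constant but does not give its value), and the use of the natural logarithm is an assumption.

```python
import math


def meets_recovery_condition(n, m, r, epsilon, C=1.0):
    """Check mn(1 - epsilon) >= C * M * r^2 * log^3(M), with M = max(n, m).

    n: number of subjects, m: number of attributes, r: number of clusters,
    epsilon: probability that an entry is missing.
    C is a placeholder; the abstract does not state its value.
    The natural logarithm is an assumption.
    """
    if not 0 <= epsilon < 1:
        raise ValueError("epsilon must lie in [0, 1)")
    M = max(n, m)
    # Expected number of observed (non-missing) entries in the n-by-m table.
    expected_observed = m * n * (1 - epsilon)
    # Right-hand side of the condition.
    threshold = C * M * r**2 * math.log(M) ** 3
    return expected_observed >= threshold
```

For example, `meets_recovery_condition(10_000, 10_000, r=2, epsilon=0.5)` compares the expected number of observed entries with the threshold for a 10,000 by 10,000 table in which half of the entries are missing. Because C is unknown here, the result only indicates how the two sides scale, not whether the guarantee actually applies.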

Subjects

Subjects :
Statistics
FOS: Mathematics

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....2fc106cf35d5bfa879c1ea8196150f9b
Full Text :
https://doi.org/10.25394/pgs.15058008