Reproducible clustering with non-Euclidean distances: a simulation and case study

Authors :
Staples, Lauren
Ring, Janelle
Fontana, Scott
Stradwick, Christina
DeMaio, Joe
Ray, Herman
Zhang, Yifan
Zhang, Xinyan
Source :
International Journal of Data Science and Analytics; 20240101, Issue: Preprints p1-20, 20p
Publication Year :
2024

Abstract

Some applications of categorical sequence clustering require path connectivity, for example clustering DNA sequences, click paths through web-user sessions, or paths of care represented as sequences of patient medical billing codes. K-means and k-medoids clustering with non-Euclidean distance metrics, such as the Jaccard or edit distance, maintain this path connectivity. Although k-means and k-medoids clustering with the Jaccard and edit distances have enjoyed success in these domains, the limits of accurate cluster recovery under these conditions have not yet been defined. As a first step toward this goal, we performed a simulation study using k-means and k-medoids clustering with non-Euclidean distances and show that performance deteriorates once noise reaches a certain level and as the number of clusters increases. However, we identify initialization strategies that improve cluster recovery in the presence of noise. We use the Prediction Strength method of Tibshirani and Walther (J Comput Graph Stat 14(3):511–528, 2005), which sets up a hypothesis test of whether the data have clustering structure (that is, whether the clusters are reproducible), with the null hypothesis that they have none. We then applied the framework to perinatal episodes of care, where the clusters split reproducibly and organically between Cesarean and vaginal deliveries; this is not itself a clinical finding, but it is a sensible validation of the approach. Further visualization of the clusters revealed subclusters that split along groups of physicians, cost, and risk scores, which motivates the future work we outline on improving the framework's resolution.
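
The abstract does not describe an implementation, so the following is only a minimal illustrative sketch, in Python, of the ingredients it names: the Jaccard and edit distances on categorical sequences, k-medoids clustering under an arbitrary distance, and a single-split version of the Prediction Strength idea. All function names are hypothetical. The sketch assumes that the Jaccard distance is taken over the sets of codes in each sequence, that k-medoids uses a simple alternating update rather than the authors' initialization strategies, and that one train/test split stands in for the repeated splits a real analysis would average over. K-means is not shown.

```python
import random
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the sets of codes in two sequences (assumed form)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 0.0 if not union else 1.0 - len(sa & sb) / len(union)


def edit_distance(a, b):
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def assign(items, medoids, dist):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist(x, medoids[c])) for x in items]


def k_medoids(items, k, dist, rng, max_iter=50):
    """Alternating k-medoids with random initialization (not the paper's strategy)."""
    medoids = rng.sample(items, k)
    for _ in range(max_iter):
        labels = assign(items, medoids, dist)
        new_medoids = []
        for c in range(k):
            members = [x for x, lab in zip(items, labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid for an empty cluster
                continue
            # The new medoid is the member with the smallest total distance to the rest.
            new_medoids.append(min(members, key=lambda m: sum(dist(m, x) for x in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(items, medoids, dist)


def prediction_strength(train, test, k, dist, rng):
    """Single-split prediction strength: for each test cluster, the share of its
    pairs that the training medoids also place together; return the minimum."""
    train_medoids, _ = k_medoids(train, k, dist, rng)
    _, test_labels = k_medoids(test, k, dist, rng)
    predicted = assign(test, train_medoids, dist)
    scores = []
    for c in range(k):
        idx = [i for i, lab in enumerate(test_labels) if lab == c]
        pairs = list(combinations(idx, 2))
        if not pairs:
            continue  # clusters with fewer than two members contribute no pairs
        scores.append(sum(predicted[i] == predicted[j] for i, j in pairs) / len(pairs))
    return min(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy sequences of made-up codes, purely for illustration.
    train = ["ABCA", "ABBC", "CABA", "XYZX", "YZZX", "XXYZ"]
    test = ["BCAB", "ACCB", "ZYXY", "YXZZ"]
    rng = random.Random(0)
    for k in (1, 2, 3):
        print(k, prediction_strength(train, test, k, jaccard_distance, rng))
```

Values near 1 suggest that the clusters found in the test half are reproduced by the training half; lower values suggest the chosen number of clusters is not supported by the data. Passing `edit_distance` instead of `jaccard_distance` makes the same code sensitive to the order of codes rather than only their presence.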

Details

Language :
English
ISSN :
2364-415X and 2364-4168
Issue :
Preprints
Database :
Supplemental Index
Journal :
International Journal of Data Science and Analytics
Publication Type :
Periodical
Accession number :
ejs63650363
Full Text :
https://doi.org/10.1007/s41060-023-00429-1