HEADSS: HiErArchical Data Splitting and Stitching software for non-distributed clustering algorithms

Authors :
D.A. Crake
N.C. Hambly
R.G. Mann
Source :
Crake, D A, Hambly, N C & Mann, R G 2023, 'HEADSS: HiErArchical Data Splitting and Stitching software for non-distributed clustering algorithms', Astronomy and Computing, vol. 43, 100709, pp. 1-9. https://doi.org/10.1016/j.ascom.2023.100709
Publication Year :
2023

Abstract

The increase in data volume is challenging the suitability of non-distributed and non-scalable algorithms, despite advancements in hardware. An example of this challenge is clustering. Considering that optimal clustering algorithms scale poorly with increased data volume or are intrinsically non-distributed, accurate clustering of large datasets is increasingly resource-heavy, relying on substantial and expensive compute nodes. This scenario forces users to choose between accuracy and scalability. In this work, we introduce HiErArchical Data Splitting and Stitching (HEADSS), a Python package designed to facilitate clustering at scale. By automating the splitting and stitching, it allows repeatable handling, and removal, of edge effects. We implement HEADSS in conjunction with HDBSCAN, where we achieve orders of magnitude reduction in single node memory requirements for both non-distributed and distributed implementations, with the latter offering similar order of magnitude reductions in total run times while recovering analogous accuracy. Furthermore, our method establishes a hierarchy of features by using a subset of clustering features to split the data.
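
The split-then-stitch idea described in the abstract can be illustrated with a minimal sketch. This is not the HEADSS API or the authors' algorithm: the function names (`gap_cluster`, `split_with_overlap`, `stitch`), the one-dimensional data, the simple gap-based clusterer standing in for HDBSCAN, and the "keep a cluster only if its centre lies in the region's core" rule are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Region = Tuple[float, float, List[float]]  # (core_lo, core_hi, members incl. margin)


def gap_cluster(points: Sequence[float], max_gap: float = 1.0) -> List[List[float]]:
    """Stand-in clusterer (NOT HDBSCAN): start a new cluster when the gap
    between consecutive sorted 1-D points exceeds max_gap."""
    clusters: List[List[float]] = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def split_with_overlap(points: Sequence[float], lo: float, hi: float,
                       n_regions: int, overlap: float) -> List[Region]:
    """Split [lo, hi) into equal-width core regions; each region also
    receives the points within `overlap` of its core boundaries."""
    width = (hi - lo) / n_regions
    regions: List[Region] = []
    for i in range(n_regions):
        core_lo = lo + i * width
        core_hi = core_lo + width
        members = [p for p in points if core_lo - overlap <= p < core_hi + overlap]
        regions.append((core_lo, core_hi, members))
    return regions


def stitch(regions: Sequence[Region],
           cluster_fn: Callable[[Sequence[float]], List[List[float]]]) -> List[List[float]]:
    """Cluster each region independently, then keep a cluster only if its
    centre falls inside that region's core. A cluster seen by two overlapping
    regions is therefore kept exactly once (hypothetical stitching rule)."""
    kept: List[List[float]] = []
    for core_lo, core_hi, members in regions:
        for cluster in cluster_fn(members):
            centre = sum(cluster) / len(cluster)
            if core_lo <= centre < core_hi:
                kept.append(cluster)
    return kept


# The middle cluster straddles the region boundary at 5.0.
points = [0.5, 1.0, 1.5, 4.4, 4.8, 5.2, 8.0, 8.5]
result = stitch(split_with_overlap(points, 0.0, 10.0, n_regions=2, overlap=1.0),
                gap_cluster)
```

In this toy case each region is clustered on its own, so only a fraction of the data needs to be held at once, and the overlap margin lets the cluster that straddles the boundary be recovered whole rather than cut in two. The sketch assumes the overlap is wider than any cluster's extent; a narrower margin would truncate clusters at the region edges.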

Details

Language :
English
Database :
OpenAIRE
Journal :
Astronomy and Computing
Accession number :
edsair.doi.dedup.....4dbedb4093fb4dce7dbb6a0852e5bb9d
Full Text :
https://doi.org/10.1016/j.ascom.2023.100709