1. Optimizing Parallel Clustering Throughput in Shared Memory
- Author
- Michael Gowanlock, Victor Pankratius, and David M. Blair
- Subjects
- Clustering high-dimensional data; DBSCAN; Distributed computing; Meteorology & atmospheric sciences; Computer science; Correlation clustering; Engineering and technology; Natural sciences; Biclustering; Data stream clustering; Computational Theory and Mathematics; Hardware and Architecture; CURE data clustering algorithm; Signal Processing; Electrical engineering, electronic engineering, information engineering; Canopy clustering algorithm; Data mining; Cluster analysis; Heuristics; Earth and related environmental sciences
- Abstract
- This article studies the optimization of parallel clustering throughput in the context of variant-based parallelism, which exploits commonalities and reuse among variant computations for multithreading scalability. This direction is motivated by challenging scientific applications where scientists have to execute multiple runs of clustering algorithms with different parameters to determine which ones best explain phenomena observed in empirical data. To make this process more efficient, we propose a novel set of optimizations to maximize the throughput of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a frequently used algorithm for scientific data mining in astronomy, geoscience, and many other fields. Our approach executes multiple algorithm variants in parallel, computes clusters concurrently, and leverages heuristics to maximize the reuse of results from completed variants. As scientific datasets continue to grow, maximizing clustering throughput with our techniques may accelerate the search and identification of natural phenomena of interest with computational support, i.e., Computer-Aided Discovery. We present evaluations on a whole spectrum of datasets, such as geoscience data on space weather phenomena, astronomical data from the Sloan Digital Sky Survey on intermediate-redshift galaxies, as well as synthetic datasets to characterize performance properties. Selected results show a 1,115 percent performance improvement due to indexing tailored for variant-based clustering, and a 2,209 percent performance improvement when applying all of our proposed optimizations.
- Published
- 2017