
Parallel clustering algorithm based on grid density and locality-sensitive hash functions.

Authors :
毛伊敏
陶涛
曹文梁
Source :
Application Research of Computers / Jisuanji Yingyong Yanjiu. May 2021, Vol. 38 Issue 5, p1422-1427. 6p.
Publication Year :
2021

Abstract

To address the sensitivity to initial centers, high inter-node communication overhead, and low clustering efficiency of partition-based big data clustering algorithms, this paper proposes PBGDLSH-MR, a partition-based clustering algorithm built on MapReduce that uses grid density and locality-sensitive hash functions. First, it proposes GDS (grid density strategy) to obtain the initial cluster centers from the initial data set, which avoids the sensitivity caused by random selection of initial cluster centers. Second, it proposes DP-LSH (data partitioning based on locality-sensitive hash functions) to map closely related data objects into the same sub-dataset and obtain the data partitions in the map stage. It also designs a formula, SI (similarity improvement), to evaluate the data partitioning results, which reduces the communication overhead between nodes. In addition, it designs AGS (adaptive grouping strategy) to handle data skew in the data partitions, which improves clustering efficiency. Finally, based on MapReduce, it mines the cluster centers in parallel to generate the final clustering results. The experimental results show that PBGDLSH-MR achieves better clustering results and better parallel performance on big data. [ABSTRACT FROM AUTHOR]
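
The two core ideas named in the abstract, seeding cluster centers from dense grid cells and routing similar objects to the same partition with a locality-sensitive hash, can be illustrated with a minimal sketch. The abstract does not give the definitions of GDS, DP-LSH, SI, or AGS, so the code below is not the authors' method: it uses a plain uniform grid and a generic Gaussian random-projection hash, and every function name and parameter (grid_density_seeds, make_lsh, lsh_partition, cell_size, bucket_width) is a hypothetical choice made for illustration.

```python
import math
import random
from collections import defaultdict


def grid_density_seeds(points, cell_size, k):
    """Assumed stand-in for a grid density strategy:
    return the centroids of the k most populated grid cells."""
    cells = defaultdict(list)
    for p in points:
        # Map each point to the integer coordinates of its grid cell.
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(p)
    # Densest cells first; ties keep insertion order (sorted is stable).
    densest = sorted(cells.values(), key=len, reverse=True)[:k]
    return [
        tuple(sum(coord) / len(members) for coord in zip(*members))
        for members in densest
    ]


def make_lsh(dim, bucket_width, rng):
    """Generic Gaussian-projection LSH: h(v) = floor((a . v + b) / w).
    Nearby vectors tend to receive the same hash value."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, bucket_width)

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / bucket_width)

    return h


def lsh_partition(points, num_partitions, bucket_width, seed=0):
    """Assign each point to a partition id derived from its LSH bucket."""
    rng = random.Random(seed)
    h = make_lsh(len(points[0]), bucket_width, rng)
    parts = defaultdict(list)
    for p in points:
        # Python's % with a positive modulus is non-negative,
        # so negative bucket ids still map into range(num_partitions).
        parts[h(p) % num_partitions].append(p)
    return dict(parts)


if __name__ == "__main__":
    data = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
            (10.1, 10.2), (10.3, 10.4), (10.2, 10.1),
            (5.5, 5.5)]
    print(grid_density_seeds(data, cell_size=1.0, k=2))
    print(lsh_partition(data, num_partitions=4, bucket_width=4.0))
```

In a MapReduce setting, the partition id would plausibly serve as the map output key so that each reducer clusters one sub-dataset. This simple seeding can pick adjacent cells from the same dense region, and a single hash function gives no balance guarantee; the abstract attributes the handling of data skew to AGS, which is not modeled here.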

Details

Language :
Chinese
ISSN :
1001-3695
Volume :
38
Issue :
5
Database :
Academic Search Index
Journal :
Application Research of Computers / Jisuanji Yingyong Yanjiu
Publication Type :
Academic Journal
Accession number :
150306843
Full Text :
https://doi.org/10.19734/j.issn.1001-3695.2020.04.0109