
A Novel Clustering-Based Sampling Approach for Minimum Sample Set in Big Data Environment.

Authors :
Zhao, Jia
Sun, Jia
Zhai, Yunan
Ding, Yan
Wu, Chunyi
Hu, Ming
Source :
International Journal of Pattern Recognition & Artificial Intelligence; Feb2018, Vol. 32 Issue 2, p-1, 20p
Publication Year :
2018

Abstract

Data volumes are expanding rapidly, which makes it difficult to extract valuable information from big data. Most existing data mining algorithms handle big data problems only at large time and space costs. This paper focuses on the sampling problem for big data and proposes an efficient heuristic cluster sampling algorithm, called CSA (Cluster Sampling Arithmetic). Many earlier researchers used random sampling to extract an initial sample set from the original data and then processed that sample in various ways to obtain a corresponding minimum sample set, which is taken to represent the original big data set. However, the initial random sampling step strongly affects the final results, lowering their comprehensiveness and quality and lengthening the processing time. For this reason, CSA uses clustering, rather than the random sampling common in the current literature, to obtain the minimum sample set. CSA performs cluster analysis on the original data set and selects the center of each cluster as a member of the minimum sample set. The aim is to ensure that the sample distribution matches the characteristics of the original data, to preserve data integrity, and to reduce processing time. The max-min distance method from pattern recognition is integrated into the clustering process to obtain the cluster centers and to keep the algorithm from falling into local optima. Experimental results show that, compared with existing work, CSA efficiently reflects the characteristics of the original data and reduces data processing time. The resulting minimum sample set also performs well when used with classification algorithms. [ABSTRACT FROM AUTHOR]
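
The abstract does not spell out the steps of CSA, so the following is only a minimal sketch of the general idea it describes: choose cluster centers with the classic max-min distance heuristic and use those centers as the reduced sample set. The function name `max_min_centers`, the threshold parameter `theta`, and the choice of the first point as the initial center are assumptions made for illustration, not details taken from the paper.

```python
import math


def max_min_centers(points, theta=0.5):
    """Select cluster centers with the max-min distance heuristic.

    Illustrative sketch only; `theta` (0 < theta <= 1) is an assumed
    stopping threshold, not a parameter documented in the abstract.
    """
    if not 0 < theta <= 1:
        raise ValueError("theta must be in (0, 1]")
    if not points:
        return []

    # Assumption: start from the first point as the initial center.
    centers = [0]
    # min_dist[i] is the distance from point i to its nearest chosen center.
    min_dist = [math.dist(p, points[0]) for p in points]

    # The farthest point from the first center sets the reference distance.
    base = max(min_dist)
    if base == 0.0:
        return [points[0]]  # all points coincide

    while True:
        # Candidate: the point whose nearest center is farthest away.
        candidate = max(range(len(points)), key=min_dist.__getitem__)
        if min_dist[candidate] < theta * base:
            break  # every remaining point is close enough to some center
        centers.append(candidate)
        # Refresh nearest-center distances with the new center included.
        for i, p in enumerate(points):
            d = math.dist(p, points[candidate])
            if d < min_dist[i]:
                min_dist[i] = d

    return [points[i] for i in centers]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0, 20), (1, 20)]
    print(max_min_centers(data, theta=0.5))
```

In this sketch the centers are actual data points, so the returned list can serve directly as a small representative sample. A smaller `theta` yields more centers and therefore a larger sample.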

Details

Language :
English
ISSN :
0218-0014
Volume :
32
Issue :
2
Database :
Complementary Index
Journal :
International Journal of Pattern Recognition & Artificial Intelligence
Publication Type :
Academic Journal
Accession number :
126169732
Full Text :
https://doi.org/10.1142/S0218001418500039