A Novel Clustering-Based Sampling Approach for Minimum Sample Set in Big Data Environment.
- Source :
- International Journal of Pattern Recognition & Artificial Intelligence; Feb2018, Vol. 32 Issue 2, p-1, 20p
- Publication Year :
- 2018
Abstract
- Data volumes are expanding rapidly, which makes it difficult to extract valuable information from big data. Most existing data mining algorithms handle big data problems only at large time and space costs. This paper focuses on the sampling problem for big data and puts forward an efficient heuristic cluster sampling algorithm, called CSA (Cluster Sampling Arithmetic). Many earlier researchers used random sampling to extract an initial sample set from the original data and then processed that sample in various ways to obtain the corresponding minimum sample set, which is regarded as a representation of the original big data set. However, the final results are severely affected by the initial random sampling step, which lowers the comprehensiveness and quality of the results and lengthens processing time. In contrast to the random sampling methods in the current literature, CSA uses clustering to obtain the minimum sample set: it performs cluster analysis on the original data set and selects the center of each cluster as a member of the minimum sample set. The aim is to ensure that the sample distribution matches the characteristics of the original data, to preserve data integrity, and to reduce processing time. The max-min distance method from pattern recognition is integrated into the clustering process to obtain the cluster centers and to keep the algorithm from falling into a local optimum. Experimental results show that, compared with existing work, CSA efficiently reflects the characteristics of the original data and reduces data processing time. The resulting minimum sample set also performs well when used with classification algorithms. [ABSTRACT FROM AUTHOR]
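The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea it describes: choose cluster centers with the classic max-min distance rule and treat those centers as the reduced sample set. It is not the authors' CSA implementation. The function names `max_min_centers` and `assign`, the threshold parameter `theta`, and the choice of the first point as the initial seed are all assumptions made for illustration.

```python
import math


def max_min_centers(points, theta=0.5):
    """Select cluster centers with the max-min distance rule.

    Assumptions (not from the abstract): the first point seeds the
    process, and a candidate becomes a new center only if its distance
    to the nearest existing center exceeds theta * d12, where d12 is
    the distance between the first two centers.
    """
    centers = [points[0]]

    # Second center: the point farthest from the seed.
    second = max(points, key=lambda p: math.dist(p, centers[0]))
    d12 = math.dist(second, centers[0])
    if d12 == 0:
        return centers  # all points coincide with the seed
    centers.append(second)

    while True:
        # For each point take the distance to its nearest center (the "min"),
        # then pick the point for which that distance is largest (the "max").
        candidate = max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)
        )
        gap = min(math.dist(candidate, c) for c in centers)
        if gap <= theta * d12:
            break  # no remaining point is far enough to start a new cluster
        centers.append(candidate)

    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]
```

In this sketch the returned centers play the role of the reduced sample set, and `assign` shows which original points each center stands for. A smaller `theta` yields more centers and therefore a larger sample; how CSA actually sets its stopping criterion is not stated in the abstract.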
- Subjects :
- DATA mining
BIG data
SAMPLING (Process)
ALGORITHMS
ELECTRONIC data processing
Details
- Language :
- English
- ISSN :
- 02180014
- Volume :
- 32
- Issue :
- 2
- Database :
- Complementary Index
- Journal :
- International Journal of Pattern Recognition & Artificial Intelligence
- Publication Type :
- Academic Journal
- Accession number :
- 126169732
- Full Text :
- https://doi.org/10.1142/S0218001418500039