1. A characterization of big data benchmarks
- Author
- Ye Li, Bei Zhendong, Bai Xue, Xiong Wen, Zou Yubin, Fan Zhang, Cheng-Zhong Xu, Juanjuan Zhao, and Zhibin Yu
- Subjects
Computer science, Suite, Big data, Workload, Benchmarking, Infant Stage, Data science, Data set, Benchmark (computing), Redundancy (engineering), Data mining
- Abstract
Recently, big data has evolved into a buzzword in both academia and industry all over the world. Benchmarks are important tools for evaluating an IT system. However, benchmarking big data systems is more challenging than ever before. First, big data systems are still in their infancy and consequently are not well understood. Second, big data systems are more complicated than previous systems such as a single-node computing platform. While some researchers have started to design benchmarks for big data systems, they do not consider the redundancy among their benchmarks. Moreover, they use artificial input data sets rather than real-world data for their benchmarks. It is therefore unclear whether these benchmarks can be used to precisely evaluate the performance of big data systems. In this paper, we first analyze the redundancy among benchmarks from ICTBench, HiBench, and typical workloads from real-world applications, namely spatio-temporal data analysis for the Shenzhen transportation system. Subsequently, we present an initial idea of a big data benchmark suite for spatio-temporal data. There are three findings in this work: (1) redundancy exists in these pioneering benchmark suites, and some of the benchmarks can be removed safely; (2) the workload behavior of trajectory data analysis applications is dramatically affected by their input data sets; and (3) the benchmarks created for academic research cannot represent the cases of real-world applications.
- Published
- 2013