High-efficient analysis system for massive data of alpine grassland based on Hive

Authors :
LI Liangdan
YE Sha
XIE Xia
HU Yueming
XIE Jianwen
ZHOU Wu
YOU Xiaomin
Source :
Journal of Agricultural Resources and Environment, Vol 38, Iss 6, Pp 1152-1163 (2021)
Publication Year :
2021
Publisher :
Agro-Environmental Protection Institute, Ministry of Agriculture, 2021.

Abstract

Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, which in turn requires supporting data. This paper designs and implements a Hive-based high-performance system for analyzing massive alpine grassland data, which can store and analyze such data reliably and efficiently. First, the platform was designed on the Hadoop, Hive, and Sqoop environments and built through steps such as node and cluster configuration. Then, data ETL (Extract-Transform-Load) and data storage were completed by filling missing data with the EM (Expectation-Maximization) algorithm, importing the data, and storing it in partitions. Finally, the system implemented a fuzzy query function through mixed-function coding and achieved its intended effect. The results showed that as file size increased, the overall data size increased and the total storage and reading time of the system rose steadily; however, the average running time (the average time for processing 1 MB of data) decreased, reflecting the system's strong ability to process large amounts of data in parallel as the data volume grew. Using alpine grassland quadrat monitoring data and some simulated data from the counties of Qinghai Province in 2014, with a total data volume of approximately 39.58 million records (7.56 GB), the data query efficiency of the Hive cluster was compared with that of the relational database SQL Server. When the query data volume was 39.58 million records, the Hive cluster's query time was 67.8% of SQL Server's. As the data volume increased, the system's query efficiency exceeded that of SQL Server. The ecological data of alpine grassland were analyzed and processed through HiveQL, and a corresponding control experiment was carried out. The comparison showed that the Hive data analysis produced the same processing results as the control experiment. In summary, applying distributed data warehouse technology to the storage and analysis of massive alpine grassland data is a significant improvement over traditional data storage and analysis technologies. The system processes massive data efficiently and is highly extensible, and it can meet the storage and analysis requirements of massive alpine grassland data well.
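
The "average running time" metric in the abstract (time per 1 MB of data) can be illustrated with a short sketch. The cost model below, a fixed job start-up overhead plus a cost linear in data size, is an assumption made only for illustration; the abstract does not state it, and the overhead, per-MB cost, and file sizes are hypothetical values, not results from the paper. The largest size approximates the 7.56 GB data set, taking 1 GB as 1000 MB.

```python
def average_time_per_mb(total_seconds: float, size_mb: float) -> float:
    """Average running time: seconds spent per 1 MB of data."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return total_seconds / size_mb


def modelled_total_time(size_mb: float,
                        fixed_overhead_s: float = 30.0,
                        per_mb_s: float = 0.05) -> float:
    """Hypothetical cost model (an assumption, not from the paper):
    a fixed start-up overhead plus a cost that grows linearly with size."""
    return fixed_overhead_s + per_mb_s * size_mb


# Hypothetical file sizes in MB; 7560 MB is roughly 7.56 GB.
sizes_mb = [100, 1000, 7560]
totals = [modelled_total_time(s) for s in sizes_mb]
averages = [average_time_per_mb(t, s) for t, s in zip(totals, sizes_mb)]

for size, total, avg in zip(sizes_mb, totals, averages):
    print(f"{size:>6} MB  total {total:8.1f} s  average {avg:.4f} s/MB")
```

Under this assumed model, total time rises with data size while the per-MB average falls, because the fixed overhead is spread over more data. This matches the direction of the trend the abstract reports, though the real system's behavior may have other causes.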

Details

Language :
Chinese
ISSN :
2095-6819
Volume :
38
Issue :
6
Database :
OpenAIRE
Journal :
Journal of Agricultural Resources and Environment
Accession number :
edsair.doajarticles..94ab76ccd347a081403e4f27443c2849