1. Distributed log information processing with Map-Reduce: A case study from raw data to final models
- Author
- Gang Liu and Mingyue Luo
- Subjects
Distributed database, Computer science, Machine learning, Data modeling, Software, Data extraction, Scalability, Data pre-processing, Artificial intelligence, Data mining, Cluster analysis, Raw data
- Abstract
With the rapid development of the Internet, e-commerce websites now routinely work with log datasets of up to a few terabytes in size. Removing messy data promptly and at low cost, and extracting the useful information that remains, is a problem we have to face. The mining process involves several steps, from pre-processing the raw data to establishing the final models. In this paper we describe our method for solving this problem with Map-Reduce. Hadoop [7] is an open-source Map-Reduce implementation for reliable, scalable, distributed computing. We implement several applications on Hadoop: data extraction, a sum operation, a join operation, and a clustering algorithm. These can be applied to data pre-processing and to detecting users with the same interests. In particular, we focus on clustering. We combine a clustering algorithm that integrates SOM (Self-Organizing Map) and fuzzy logic [13] with Map-Reduce, and we call the result MRSF. With a Hadoop cluster, the heavy computation of MRSF jobs can be accommodated simply by adding more nodes to the cluster. Our experiments show that MRSF scales well and can process and analyze extremely large datasets efficiently.
- Published
- 2010