1. Parallel and Distributed Powerset Generation Using Big Data Processing
- Author
Youssef M. Essa, Ahmed El-Mahalawy, Gamal Attiya, and Ayman El-Sayed
- Subjects
Electronic computers. Computer science, QA75.5-76.95, Cybernetics, Q300-390
- Abstract
Data mining algorithms are increasingly important because they allow stakeholders to get a 360-degree view of their customers. Recently, the powerset has become the core of many algorithms and techniques across data mining domains, as it provides optimal solutions for many data mining problems. Nevertheless, it is challenging to use in many cases because the size of the powerset grows exponentially with the number of elements in the input set. Constructing the powerset of a huge dataset on a single machine causes an out-of-memory exception. From a business perspective, enterprises running large-scale data projects therefore need to invest heavily in high-performance infrastructure for powerset generation, and invest further in a standby system to keep the service available if the high-performance machines break down. Furthermore, existing powerset techniques are designed for structured data and are not suited to intensive processing over in-memory unstructured data stores. This paper therefore tackles most of the problems that hinder deploying powerset algorithms on Big Data and presents a series of pruning techniques that can greatly improve the efficiency of powerset construction. The approach allows enterprises to explore huge data volumes, gain business insights in near real time, and reduce infrastructure costs.
- Published
2019
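
The exponential growth described in the abstract can be made concrete with a small single-machine sketch. This is not the authors' parallel algorithm or any of their pruning techniques; it only illustrates, under stated assumptions, why materializing a full powerset exhausts memory (a set of n elements has 2^n subsets) and how even a simple size cap shrinks the search space. The function name `powerset` and the `max_size` parameter are hypothetical, introduced here for illustration.

```python
from itertools import chain, combinations
from typing import Iterable, Iterator, Optional, Tuple


def powerset(items: Iterable, max_size: Optional[int] = None) -> Iterator[Tuple]:
    """Lazily yield subsets of `items` as tuples, smallest subsets first.

    `max_size` is a hypothetical pruning knob for this sketch only:
    subsets with more than `max_size` elements are never generated.
    """
    pool = list(items)
    limit = len(pool) if max_size is None else min(max_size, len(pool))
    # One combinations() iterator per subset size, chained lazily so the
    # full powerset is never held in memory at once.
    return chain.from_iterable(combinations(pool, k) for k in range(limit + 1))


if __name__ == "__main__":
    # Full powerset of a 3-element set: 2**3 = 8 subsets.
    print(list(powerset(["a", "b", "c"])))

    # 20 elements already give 2**20 = 1,048,576 subsets; capping the
    # subset size at 2 leaves 1 + 20 + 190 = 211 candidates.
    print(sum(1 for _ in powerset(range(20), max_size=2)))
```

Because the generator is lazy, subsets can be consumed one at a time or handed to workers in batches, which is the general shape a distributed implementation would need; how the paper actually partitions and prunes the work is described in its full text, not here.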