5,134 results for '"MapReduce"'
Search Results
2. A multi-threaded particle swarm optimization-kmeans algorithm based on MapReduce.
- Author
- Wang, Xikang, Wang, Tongxi, and Xiang, Hua
- Subjects
- *PARTICLE swarm optimization, *K-means clustering, *BIG data, *COMPUTATIONAL complexity, *RESEARCH personnel, *PARALLEL algorithms
- Abstract
The particle swarm optimization-K-Means algorithm was proposed by researchers to improve the clustering accuracy of the K-Means algorithm. However, it adds considerable computational burden, and its efficiency is low when dealing with large data sets. To solve this problem, a parallel particle swarm K-Means algorithm based on MapReduce with multi-threading is proposed. The algorithm performs parallel computation by dividing the particle swarm into several equal-sized sub-populations based on the number of available nodes in the cluster and distributing them across the nodes. It uses multi-threaded execution in the evaluation stage, which has the highest computational complexity in the evolutionary process. Experiments show that although splitting the population affects the optimization effect to some extent, the proposed algorithm can still effectively optimize the clustering results of the K-Means algorithm, and its computational efficiency is significantly improved compared with the serial particle swarm optimization K-Means algorithm and a MapReduce-based non-multithreaded variant. In the experiment with the largest dataset and a configuration of 16 nodes, the proposed algorithm is 58 times faster than the serial algorithm. Furthermore, computing efficiency can be improved further on clusters with more CPU cores. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
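The sub-population split and threaded fitness evaluation that this abstract describes can be sketched as follows (a minimal 1-D illustration with hypothetical helper names; note that CPython threads only overlap for native or I/O-bound fitness code, so a real implementation would rely on JVM threads or native libraries):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def split_population(particles, n_nodes):
    """Divide the swarm into equal-sized sub-populations, one per cluster node."""
    size = math.ceil(len(particles) / n_nodes)
    return [particles[i:i + size] for i in range(0, len(particles), size)]

def kmeans_sse(centers, points):
    """Fitness of one particle (a list of 1-D centers): sum of squared
    distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def evaluate_subpopulation(subpop, points, n_threads=4):
    """Multi-threaded fitness evaluation, the costliest stage per the abstract."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda particle: kmeans_sse(particle, points), subpop))
```

Each node would run `evaluate_subpopulation` on its own shard of the swarm, which is the parallelism the MapReduce layer provides in the paper.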
3. From Micro-benchmarks to Machine Learning: Unveiling the Efficiency and Scalability of Hadoop and Spark.
- Author
- Hebabaze, Salah Eddine, El Ghmary, Mohamed, El Bouabidi, Hamid, Maftah, Sara, and Amnai, Mohamed
- Subjects
- ELECTRONIC data processing, MACHINE learning, BIG data, WEB-based user interfaces, INTERNET searching
- Abstract
With the exponential growth of data, the demand for efficient and scalable data processing solutions has become paramount. Hadoop and Spark, pivotal components of the open-source Big Data landscape, have been put to the test in this study. We conducted a comprehensive performance analysis of Hadoop and Spark in virtualized environments, evaluating their prowess across a suite of benchmarks. The benchmarks encompassed a spectrum of workloads, from micro-benchmarks such as Sort, WordCount, and TeraSort to web search tasks such as PageRank and machine learning endeavors including Naive Bayes and K-means. The central focus was to gauge their performance, efficiency, and resource utilization. The findings of this study underscore the benefits of Spark's in-memory processing, demonstrating its superiority over Hadoop in various scenarios. Spark excels in machine learning and web search applications, particularly when handling smaller inputs. Its efficient memory management and support for multiple iterations make it a strong choice. In resource-constrained environments or when dealing with large input files and limited memory, Hadoop may still hold an edge. The design and implementation of data processing solutions in virtualized environments should carefully consider the specific demands of each framework. This study not only presents a performance comparison of Hadoop and Spark across different benchmarks but also emphasizes the vital implications for designing and deploying data processing solutions in virtualized settings. It serves as a cornerstone for informed decision-making, paving the way for optimized algorithms and techniques in the dynamic landscape of big data processing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
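WordCount, one of the micro-benchmarks used above, is the canonical MapReduce example; a minimal in-process sketch of its map, shuffle and reduce stages (function names are illustrative):

```python
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word, as a WordCount mapper would.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Group values by key, mimicking the framework's shuffle/sort stage.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
```

Hadoop materialises the shuffle to disk between stages, which is exactly the overhead that Spark's in-memory execution avoids for iterative workloads.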
4. Input/output optimization scheduler for cloud-based map reduce framework.
- Author
- Naaz, Farha and Banu, Sameena
- Subjects
- QUALITY of service, CLOUD computing, BIG data, PRODUCTION scheduling, WORKFLOW
- Abstract
Hadoop MapReduce (HMR) provides the most common MapReduce (MR) framework, and it is available as open source. Over the past ten years, MR has been a popular computational framework for evaluating unstructured and semi-structured big data and executing applications. Memory and input/output (I/O) overhead are just two of the many problems affecting the current HMR scheduler system. This study aims to improve system resource use, including real-time data processing, by creating a memory I/O optimized scheduler (MIOOS) for HMR. MIOOS reduces disk I/O seeks by analyzing overall memory management. Additionally, the MIOOS makespan approach is used to reduce the occurrence of problems in intermediary tasks. Both MIOOS and the current approach are assessed using complex scientific workflow applications with extreme task inter-dependencies. Further, the comparison study demonstrates that the MIOOS framework outperforms the current approach in makespan and overall memory usage. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. A Parallel Genetic K-Means Algorithm based on the Island Model.
- Author
- Xikang Wang, Tongxi Wang, Hua Xiang, and Lan Huang
- Abstract
The K-Means clustering algorithm is widely employed in cluster analysis but is known for its sensitivity to initial center selection and its tendency to become trapped in local optima. These limitations have prompted researchers to explore optimization techniques. The genetic K-Means algorithm (GKA) leverages the optimization capabilities of genetic algorithms to enhance the clustering performance of K-Means. However, this improvement comes at the cost of increased computational complexity, rendering the algorithm less efficient for large-scale datasets. To address these issues, we propose a parallel genetic K-Means algorithm (PGKA) based on the island model. In PGKA, the overall population is partitioned into multiple sub-populations of equal size, each evolving independently on different nodes. The evolutionary process is divided into several generations, with sub-populations exchanging information between generations to preserve diversity. We employ multi-threaded computation to maximize the CPU utilization for the most computationally intensive part of the algorithm, the fitness computation. Additionally, we have modified certain evolutionary operators to better suit the optimization of the K-Means algorithm. Experimental results demonstrate that the proposed algorithm achieves superior clustering accuracy compared to other recently proposed GKA variants. It also significantly enhances computational efficiency relative to the serial GKA and the non-multi-threaded PGKA. Specifically, with a configuration of 16 nodes, PGKA is 12.4 times faster than the serial version when tested on the largest dataset in our experiments. Furthermore, the speedup can be further improved on clusters with more CPU cores. [ABSTRACT FROM AUTHOR]
- Published
- 2024
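The inter-generation information exchange described above is commonly implemented as ring migration between islands; a minimal sketch (the paper's actual migration policy is not specified in the abstract):

```python
def migrate(islands, n_migrants=1):
    """Ring migration: each island sends copies of its best individuals to the
    next island, which replaces its worst individuals with them.
    Individuals are (fitness, genome) pairs; lower fitness is better."""
    outgoing = [sorted(isl)[:n_migrants] for isl in islands]  # best per island
    new_islands = []
    for i, isl in enumerate(islands):
        received = outgoing[(i - 1) % len(islands)]      # from the previous island
        survivors = sorted(isl)[:len(isl) - n_migrants]  # drop the worst
        new_islands.append(survivors + [tuple(m) for m in received])
    return new_islands
```

Because migrants are copies, a strong solution found on one island spreads around the ring over successive generations while each island keeps evolving independently.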
6. H-mrk-means: Enhanced Heuristic mrk-means for Linear Time Clustering of Big Data Using Hybrid Meta-heuristic Algorithm.
- Author
- Puri, Digvijay and Gupta, Deepak
- Subjects
- METAHEURISTIC algorithms, HEURISTIC algorithms, TIME series analysis, SAMPLING (Process), HEURISTIC
- Abstract
Big data generally comes in large volumes with mixed categories of attributes, both categorical and numerical. For such data, the k-prototypes algorithm has been adapted to the MapReduce structure, providing a better solution for huge data ranges. However, k-prototypes must compute the distances between every data point and all cluster centres. Moreover, many of these distance computations are redundant, as data points often remain in the same clusters after a few iterations. Nowadays, one of the efficient solutions for clustering huge-scale datasets is k-means. However, k-means is not intrinsically suited to MapReduce because of its iterative nature: every iteration must run as an independent MapReduce job, which leads to high Input/Output (I/O) overhead at each iteration. This research paper presents a novel enhanced linear-time clustering method for handling big data, called Heuristic mrk-means (H-mrk-means), using optimized k-means on the MapReduce model. To manage big data that is time-series in nature, a sampling and MapReduce framework is adopted, which utilizes different machines for processing data. Before initiating the main clustering process, a sampling step extracts the noteworthy information. The two main phases of the developed method are the map phase (divide and conquer) and the reduce phase (final clustering). In the map phase, the data are divided into diverse chunks stored on assigned machines. In the reduce phase, data clustering is performed. Here, the cluster centroids are tuned with the help of the hybrid Tunicate-Deer Hunting Optimization (T-DHO) algorithm using a newly derived objective function. This optimal tuning enhances clustering efficiency compared with normal iterative k-means and mrk-means clustering. 
The experimental evaluation on varied counts of chunks shows that the proposed H-mrk-means attains higher-quality clustering results and faster execution times than other clustering approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. A Reliable Multimetric Straggling Task Detection.
- Author
- Ajibade, Lukuman Saheed, Bakar, Kamalrulnizam Abu, Yusuf, Muhammed Nura, and Isyaku, Babangida
- Subjects
- BIG data, ELECTRONIC data processing, ALGORITHMS
- Abstract
One of the most difficult issues in using MapReduce for parallelising and distributing large-scale data processing is detecting straggling tasks, defined as recognising processes that are running on weak nodes. When the two steps in the Map phase (copy, combine) and the three stages in the Reduce phase (shuffle, sort, and reduce) are included, the overall execution time is the sum of the execution times of these five stages. The main objective of this study is to calculate the remaining time to complete a task, the time taken, and the straggler(s) detected in parallel execution. The suggested Reliable Multimetric Straggling Task Detection (RMSTD) method uses Progress Score (PS), Progress Rate (PR), and Remaining Time (RT) metrics to detect straggling tasks. The results have been compared with popular algorithms in this domain, such as Longest Approximate Time to End (LATE) and Combinatory Late-Machine (CLM), and the method has been demonstrated to be capable of detecting straggling tasks, accurately estimating execution time, and supporting task acceleration. RMSTD outperforms LATE by 23.30% and CLM by 19.51%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
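The abstract does not give RMSTD's exact formulas; the PS/PR/RT metrics it names are conventionally defined as in the LATE literature, which can be sketched as:

```python
def remaining_time(progress_score, elapsed):
    """PR = PS / elapsed ; RT = (1 - PS) / PR, the standard LATE-style estimate."""
    progress_rate = progress_score / elapsed
    return (1.0 - progress_score) / progress_rate

def find_stragglers(tasks, slow_factor=2.0):
    """Flag tasks whose estimated remaining time exceeds `slow_factor` times
    the median remaining time of their wave.
    `tasks` maps task-id -> (progress score in [0, 1], elapsed seconds)."""
    rts = {tid: remaining_time(ps, t) for tid, (ps, t) in tasks.items()}
    median = sorted(rts.values())[len(rts) // 2]
    return [tid for tid, rt in rts.items() if rt > slow_factor * median]
```

The `slow_factor` threshold is an illustrative choice; schedulers tune it to trade false speculative launches against late straggler detection.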
8. Traffic flow prediction based on improved LSTM and mobile big data in smart cities.
- Author
- Yao, T. and Yang, C.
- Subjects
- *CITY traffic, *TRAFFIC flow, *TRAFFIC congestion, *SMART cities, *URBAN policy
- Abstract
Urban traffic problems are one of the major challenges in the development of smart cities. To effectively alleviate traffic congestion and improve urban traffic management, a distributed big data analytics platform based on the MapReduce computing framework is investigated to process urban traffic data containing millions of traffic flow, speed and timestamp records. The study proposes an improved Long Short-Term Memory (LSTM) neural network algorithm and describes the network structure, training process and parameter tuning in detail. The effectiveness of the improved algorithm was verified through comparative analysis with the traditional ARIMA, linear regression and support vector regression algorithms. The results showed that the improved LSTM algorithm achieved an accuracy of 97.62% in traffic flow prediction, which was significantly better than other algorithms. The results demonstrate that the method can provide reliable technical support and decision-making basis for traffic management in smart cities. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques
- Author
- Mohammed Bergui, Soufiane Hourri, Said Najah, and Nikola S. Nikolov
- Subjects
- Hadoop, MapReduce, Big data, Performance modelling, Runtime prediction, Machine learning, Computer engineering. Computer hardware, TK7885-7895, Information technology, T58.5-58.64, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Abstract Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.
- Published
- 2024
- Full Text
- View/download PDF
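The study's best model is a random forest, but the shape of the runtime-prediction problem can be illustrated with the linear-regression baseline from its comparison; the job traces below are fabricated for illustration:

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for one feature: runtime ≈ a + b * input_size."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical job traces: (input size in GB, observed runtime in seconds).
sizes = [1.0, 2.0, 4.0, 8.0]
runtimes = [30.0, 50.0, 90.0, 170.0]   # deliberately linear: 10 + 20 * size
a, b = fit_simple_ols(sizes, runtimes)
predicted = a + b * 16.0               # predicted runtime for a 16 GB job
```

A random forest generalises this by learning non-linear interactions between features such as input size, mapper/reducer counts and memory configuration, which is why it won the paper's comparison.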
10. Ensemble Model for Stock Price Forecasting: MapReduce Framework for Big Data Handling: An Optimal Trained Hybrid Model for Classification.
- Author
- Senthamil Selvi, R., Sankari, V., Ramya, N., and Selvi, M.
- Subjects
- *STOCK price forecasting, *FEATURE extraction, *CLASSIFICATION, *BIG data, *EXTRACTION techniques
- Abstract
This study examines how huge data can be handled and classified. A novel big data classification paradigm is introduced through the work's preprocessing, feature extraction and classification techniques. Data normalization is carried out at the preprocessing stage. The MapReduce framework is then utilized to manage the massive data. Statistical features (mean, median, min/max and SD), higher-order statistical features (skewness, kurtosis and enhanced entropy), and correlation-based features are all extracted prior to classification. A hybrid Bi-LSTM and deep maxout classification model classifies the data during the reduction stage. To ensure classification accuracy, training is carried out with the new Hybrid Butterfly Positioned Coot Optimization (HBPCO) algorithm. The proposed method's accuracy of 97.45% beats the methods of NN (85.13%), CNN (83.78%), RNN (78.37%), Bi-LSTM (82.43%) and SVM (87.83%). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
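The statistical feature block named above (excluding the enhanced-entropy and correlation-based features) can be sketched as:

```python
import math

def statistical_features(values):
    """Mean, median, min, max, SD, skewness and kurtosis, as listed in the abstract."""
    n = len(values)
    mean = sum(values) / n
    ordered = sorted(values)
    median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2  # average middle pair
    var = sum((v - mean) ** 2 for v in values) / n          # population variance
    sd = math.sqrt(var)
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * sd ** 4)
    return {"mean": mean, "median": median, "min": ordered[0], "max": ordered[-1],
            "sd": sd, "skewness": skew, "kurtosis": kurt}
```

In a MapReduce pipeline each mapper would compute partial sums (count, Σx, Σx², Σx³, Σx⁴) per record block, and a reducer would combine them into these global moments.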
11. A new service composition method in the cloud‐based Internet of things environment using a grey wolf optimization algorithm and MapReduce framework.
- Author
- Vakili, Asrin, Al‐Khafaji, Hamza Mohammed Ridha, Darbandi, Mehdi, Heidari, Arash, Jafari Navimipour, Nima, and Unal, Mehmet
- Subjects
- OPTIMIZATION algorithms, INTERNET of things, SOFTWARE as a service, QUALITY of service, ALGORITHMS, CLOUD computing, SOFTWARE maintenance
- Abstract
Summary: Cloud computing is quickly becoming a common commercial model for software delivery and services, enabling companies to save maintenance, infrastructure, and labor expenses. Also, Internet of Things (IoT) apps are designed to ease developers' and users' access to networks of smart services, devices, and data. Although cloud services give nearly infinite resources, their reach is constrained. Designing coherent and organized apps is made possible by integrating the cloud and IoT. Expanding facilities by combining services is a critical component of this technology. Various services may be presented in this environment based on the user's demands. Considering their Quality of Service (QoS) attributes, discovering the appropriate available atomic services to construct the needed composite service with their collaboration in an orchestration model is an NP‐hard issue. This article suggests a service composition method using Grey Wolf Optimization (GWO) and MapReduce framework to compose services with optimized QoS. The simulation outcomes illustrate cost, availability, response time, and energy‐saving improvements through the suggested approach. Comparing the suggested technique to three baseline algorithms, the average gain is a 40% improvement in energy savings, a 14% decrease in response time, an 11% increase in availability, and a 24% drop in cost. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
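A minimal sketch of the canonical Grey Wolf Optimizer update on a toy sphere function (the paper applies GWO to QoS-aware service composition, whose objective is far more elaborate):

```python
import random

def gwo_minimize(f, dim, n_wolves=10, iters=50, lb=-5.0, ub=5.0, seed=1):
    """Canonical GWO sketch: wolves move toward the alpha, beta and delta
    leaders; coefficient `a` decays linearly from 2 to 0.
    Returns the best position found and the per-iteration best fitness."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_wolves)]
    leaders = [w[:] for w in sorted(wolves, key=f)[:3]]   # alpha, beta, delta
    history = []
    for t in range(iters):
        a = 2.0 * (1 - t / iters)
        for w in wolves:
            for d in range(dim):
                estimates = []
                for leader in leaders:
                    r1, r2 = rng.random(), rng.random()
                    A, C = 2 * a * r1 - a, 2 * r2
                    estimates.append(leader[d] - A * abs(C * leader[d] - w[d]))
                w[d] = min(ub, max(lb, sum(estimates) / 3))
        # keep previous leaders in the pool so the best-so-far never worsens
        leaders = [w[:] for w in sorted(wolves + leaders, key=f)[:3]]
        history.append(f(leaders[0]))
    return leaders[0], history

best, history = gwo_minimize(lambda x: sum(v * v for v in x), dim=3)
```

For service composition, each position would encode one candidate service per abstract task, and `f` would aggregate cost, availability, response time and energy into a single QoS score.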
12. Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques.
- Author
- Bergui, Mohammed, Hourri, Soufiane, Najah, Said, and Nikolov, Nikola S.
- Subjects
- REGRESSION trees, JOB performance, PREDICTION models, RANDOM forest algorithms, DECISION trees, MACHINE learning, ELECTRONIC data processing
- Abstract
Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
13. An Improved Parallel Clustering Method Based on K-Means for Electricity Consumption Patterns.
- Author
- Yang, Yuehua and Wu, Yun
- Subjects
- *CONSUMPTION (Economics), *PATTERN recognition systems, *ELECTRIC power distribution, *ELECTRIC power consumption, *K-means clustering, *CLUSTER analysis (Statistics), *PARALLEL programming, *PARALLEL processing
- Abstract
Electricity consumption pattern recognition is the foundation of intelligent electricity distribution data analysis. However, as the scale of electricity consumption data increases, traditional clustering analysis methods encounter bottlenecks such as low computation speed and processing efficiency. To meet the need for efficient mining of massive electricity consumption data, this paper presents a parallel processing method for density-based k-means clustering. First, an initial cluster center selection method based on data sample density is proposed to avoid inaccurate initial center selection, which can cause clustering to fall into local optima. The dispersion degree of the data samples within each cluster is also used as an important reference for determining the number of clusters. Subsequently, parallelization of density calculation and clustering of data samples was achieved based on the MapReduce model. Experiments conducted on Hadoop clusters show that the proposed parallel processing method is efficient and feasible, and can provide favorable support for intelligent power allocation decisions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
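The density-based initial center selection can be sketched as follows (1-D, with hypothetical parameters; the paper's additional dispersion-based choice of the number of clusters is omitted):

```python
def density(point, data, radius):
    """Number of samples within `radius` of `point` (1-D for brevity)."""
    return sum(1 for q in data if abs(q - point) <= radius)

def initial_centers(data, k, radius, min_separation):
    """Pick high-density points as initial centers, skipping candidates that
    sit too close to an already chosen center."""
    ranked = sorted(data, key=lambda p: -density(p, data, radius))
    centers = []
    for p in ranked:
        if all(abs(p - c) > min_separation for c in centers):
            centers.append(p)
        if len(centers) == k:
            break
    return centers
```

In the MapReduce version, the per-point density counts are exactly the kind of independent computation that mappers parallelise, with a reducer performing the ranked selection.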
14. On the use of big data frameworks in big service management.
- Author
- Ghedass, Fedia and Ben Charrada, Faouzi
- Subjects
- *BIG data, *AUTONOMIC computing, *KNOWLEDGE graphs, *PROCESS capability, *QUALITY of service, *CLOUD computing
- Abstract
Over the last few years, big data have emerged as a paradigm for processing and analyzing a large volume of data. Coupled with other paradigms, such as cloud computing, service computing, and Internet of Things, big data processing takes advantage of the underlying cloud infrastructure, which allows hosting and managing massive amounts of data, while service computing allows various data sources to be processed and delivered as on‐demand services. This synergy between multiple paradigms has led to the emergence of big services, as a cross‐domain, large‐scale, and big data‐centric service model. Apart from the adaptation issues (e.g., need of high reaction to changes) inherited from other service models, the massiveness and heterogeneity of big services add a new factor of complexity to the way such a large‐scale service ecosystem is managed in case of execution deviations. Indeed, big services are often subject to frequent deviations at both the functional (e.g., service failure, QoS degradation, and IoT resource unavailability) and data (e.g., data source unavailability or access restrictions) levels. Handling these execution problems is beyond the capacity of traditional web/cloud service management tools, and the majority of big service approaches have targeted specific management operations, such as selection and composition. To maintain a moderate state and high quality of their cross‐domain execution, big services should be continuously monitored and managed in a scalable and autonomous way. To cope with the absence of self‐management frameworks for large‐scale services, the goal of this work is to design an autonomic management solution that takes the whole control of big services in an autonomous and distributed lifecycle process. We combine autonomic computing and big data processing paradigms to endow big services with self‐* and parallel processing capabilities. 
The proposed management framework takes advantage of the well‐known MapReduce programming model and Apache Spark and manages big service's related data using knowledge graph technology. We also define a scalable embedding model that allows processing and learning latent big service knowledge in a distributed manner. Finally, a cooperative decision mechanism is defined to trigger non‐conflicting management policies in response to the captured deviations of the running big service. Big services' management tasks (monitoring, embedding, and decision), as well as the core modules (autonomic managers' controller, embedding module, and coordinator), are implemented on top of Apache Spark as MapReduce jobs, while the processed data are represented as resilient distributed dataset (RDD) structures. To exploit the shared information exchanged between the workers and the master node (coordinator), and for further resolution of conflicts between management policies, we endowed the proposed framework with a lightweight communication mechanism that allows transferring useful knowledge between the running map‐reduce tasks and filtering inappropriate intermediate data (e.g., conflicting actions). The experimental results proved the increased quality of embeddings and the high performance of autonomic managers in a parallel and cooperative setting, thanks to the shared knowledge. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
15. Handling uncertainty using optimal clustering with rough sets‐based rule generation model for data classification.
- Author
- Bhukya, Hanumanthu and Sadanandam, Manchala
- Subjects
- *CLASSIFICATION, *DATA modeling, *INFORMATION design, *BIG data, *BARNACLES, *ROUGH sets, *AMBIGUITY
- Abstract
In recent times, MapReduce has become a popular tool for handling big data. At the same time, uncertainty is related to arbitrariness, fuzziness, ambiguity, irregularity and incomplete knowledge. In rough set (RS) theory, the uncertain behaviour of the data in the dataset of interest is managed by using upper and lower approximate sets and classification accuracy. The RS model is integrated with a data clustering technique for optimal outcomes. With this motivation, this study designs an Optimal Clustering with RS‐Based Rule Generation Model (OC‐RSRGM) for data classification in a MapReduce environment. The OC‐RSRGM technique aims to generate an optimal set of rules using RS theory for the data classification process, and it involves a two‐stage process, namely Optimal Fuzzy c‐Means Clustering (OFCM) and RSRGM‐based rule generation with classification. The OFCM technique is derived to overcome the local-optimum problem of the Fuzzy c‐Means (FCM) model using the Barnacles Mating Optimizer (BMO). It provides decision‐makers with all the information needed to design appropriate mechanisms to support their decision‐making activities. The Hadoop MapReduce tool is used to handle big data. The proposed method combines FCM, BMO and RS theory to accomplish effective decision‐making. The OC‐RSRGM technique can be applied to continuous-valued datasets where data points do not offer any class details and might be uncertain. To validate the performance of the OC‐RSRGM technique, a detailed experimental analysis is carried out to highlight its advantages. The proposed OC‐RSRGM technique obtained an effective outcome with a computation time (CT) of 5.43 s. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
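The upper and lower approximations that RS theory uses to manage uncertainty can be computed directly from an indiscernibility partition:

```python
def approximations(equivalence_classes, target):
    """Lower/upper approximations of `target` under an indiscernibility partition."""
    target = set(target)
    lower, upper = set(), set()
    for eq_class in equivalence_classes:
        eq = set(eq_class)
        if eq <= target:
            lower |= eq            # class certainly inside the target set
        if eq & target:
            upper |= eq            # class possibly inside the target set
    accuracy = len(lower) / len(upper) if upper else 1.0
    return lower, upper, accuracy
```

The boundary region `upper - lower` holds exactly the uncertain objects, and the accuracy ratio is the classification-accuracy measure the abstract refers to.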
16. Parallel and streaming wavelet neural networks for classification and regression under apache spark.
- Author
- Eduru, Harindra Venkatesh, Vivek, Yelleti, Ravi, Vadlamani, and Shankar, Orsu Shiva
- Subjects
- *BIG data, *CLASSIFICATION, *GAS detectors, *GAUSSIAN function
- Abstract
Wavelet neural networks (WNN) have been applied in many fields to solve regression as well as classification problems. After the advent of big data, as data gets generated at a brisk pace, it is imperative to analyze it as soon as it is generated, because the nature of the data may change dramatically over short time intervals. Big data is all-pervasive and poses computational challenges for data scientists. Therefore, in this paper, we build an efficient Scalable, Parallelized Wavelet Neural Network (SPWNN) which employs the parallel stochastic gradient descent (SGD) algorithm. SPWNN is designed and developed under both static and streaming environments in the horizontal parallelization framework, and is implemented using Morlet and Gaussian functions as activation functions. This study is conducted on big datasets such as gas sensor data, which has more than 4 million samples, and medical research data, which has more than 10,000 features and is high-dimensional in nature. The experimental analysis indicates that in the static environment, SPWNN with the Morlet activation function outperformed SPWNN with Gaussian on the classification datasets, while in the case of regression no clear trend was observed. In contrast, in the streaming environment, Gaussian outperformed Morlet on classification and Morlet outperformed Gaussian on regression datasets. Overall, the proposed SPWNN architecture achieved a speedup of 1.22-1.78. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
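The Morlet and Gaussian activations mentioned above, and a single wavelet neuron using them, can be sketched as follows (the Gaussian form is assumed to be the plain bell curve; the paper may use a different variant):

```python
import math

def morlet(x):
    """Real Morlet wavelet, a common WNN activation: cos(5x) * exp(-x^2 / 2)."""
    return math.cos(5 * x) * math.exp(-x * x / 2)

def gaussian(x):
    """Gaussian activation: exp(-x^2 / 2)."""
    return math.exp(-x * x / 2)

def wnn_neuron(inputs, weights, translation, dilation, act=morlet):
    """One wavelet neuron: activate the translated/dilated weighted sum."""
    z = sum(w * v for w, v in zip(weights, inputs))
    return act((z - translation) / dilation)
```

In the SPWNN setting, parallel SGD would update `weights`, `translation` and `dilation` on horizontal data partitions and average the resulting gradients or models.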
17. From Stars to Diamonds: Counting and Listing Almost Complete Subgraphs in Large Networks.
- Author
- Finocchi, Irene, Garcia, Renan Leon, and Sinaimeri, Blerina
- Subjects
- *GRAPH connectivity, *COMPUTER algorithms, *COMPUTER network management, *SUBGRAPHS, *DISTRIBUTED computing
- Abstract
Listing dense subgraphs is a fundamental task with a variety of network analytics applications. A lot of research has been done focusing on k-cliques, i.e. complete subgraphs on k nodes. However, requiring complete connectivity between the nodes of a subgraph may be too restrictive in many real applications. Hence, in this paper, we consider a natural relaxation of cliques, called k-diamonds and defined as cliques of size k with one missing edge. We first provide a sequential algorithm that, in O(nm^((k-1)/2)) time, counts and lists all the k-diamonds in large graphs, for any constant k ≥ 4. A parallel extension of the sequential algorithm is then proposed and analyzed in a MapReduce-style model, achieving the same local and total space usage of the state-of-the-art algorithms for k-cliques. The running time is optimal on dense graphs and O(√m) larger than k-clique counting if the graph is sparse. Our algorithms compute induced diamonds by analyzing the structure of directed stars formed by the graph nodes and their neighbors. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
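A k-diamond for k = 4 is a 4-node subset inducing exactly 5 of the 6 possible edges; a brute-force checker makes the definition concrete (O(n^4), far slower than the paper's O(nm^((k-1)/2)) algorithm, but useful for validating results on small graphs):

```python
from itertools import combinations

def count_diamonds(nodes, edges):
    """Count induced 4-node diamonds: cliques of size 4 with one missing edge,
    i.e. exactly 5 of the 6 possible edges present (brute force, small graphs)."""
    edge_set = {frozenset(e) for e in edges}
    count = 0
    for quad in combinations(nodes, 4):
        present = sum(1 for pair in combinations(quad, 2)
                      if frozenset(pair) in edge_set)
        if present == 5:
            count += 1
    return count
```

Note that a full K4 (all 6 edges) is not counted: the diamond must be *induced*, which matches the paper's requirement of exactly one missing edge.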
18. Handling Massive Sparse Data in Recommendation Systems.
- Author
- Chetana, V. Lakshmi and Seetha, Hari
- Subjects
- RECOMMENDER systems, GAUSSIAN mixture models, ONLINE education, USER experience, SATISFACTION, CULTURAL industries
- Abstract
Collaborative filtering-based recommendation systems have become significant in various domains due to their ability to provide personalised recommendations. In e-commerce, these systems analyse the browsing history and purchase patterns of users to recommend items. In the entertainment industry, collaborative filtering helps platforms like Netflix and Spotify recommend movies, shows and songs based on users' past preferences and ratings. This technology also finds significance in online education, where it assists in suggesting relevant courses and learning materials based on a user's interests and previous learning behaviour. Even though much research has been done in this domain, the problems of sparsity and scalability in collaborative filtering still exist. Data sparsity refers to users expressing too few preferences on items, which makes it difficult to understand their preferences. Recommendation systems must respond quickly to keep users engaged, which is challenging as data volumes grow rapidly. Sparsity affects recommendation accuracy, while scalability influences the complexity of processing the recommendations. The motivation behind this paper is to design efficient algorithms that address the sparsity and scalability problems and in turn provide a better user experience and increased user satisfaction. This paper proposes two separate, novel approaches that deal with both problems. In the first approach, an improved autoencoder is used to address sparsity, and its outcome is then processed in a parallel and distributed manner using a MapReduce-based k-means clustering algorithm with the Elbow method. Since the k-means clustering technique uses a predetermined number of clusters, it may not improve accuracy; the Elbow method therefore identifies the optimal number of clusters for the k-means algorithm. 
In the second approach, a MapReduce-based Gaussian Mixture Model (GMM) with Expectation-Maximization (EM) is proposed to handle large volumes of sparse data. Both the proposed algorithms are implemented using MovieLens 20M and Netflix movie recommendation datasets to generate movie recommendations and are compared with the other state-of-the-art approaches. For comparison, metrics like RMSE, MAE, precision, recall, and F-score are used. The outcomes demonstrate that the second proposed strategy outperformed state-of-the-art approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. High-speed parallel segmentation algorithms of MeanShift for litchi canopies based on Spark and Hadoop.
- Author
-
Xiong, Hongyi, Wang, Jianhua, Xiao, Yiming, Xiao, Fangjun, Huang, Renhuang, Hong, Licong, Wu, Bofei, Zhou, Jinfeng, Long, Yongbin, and Lan, Yubin
- Subjects
PARALLEL algorithms ,REMOTE sensing ,LITCHI ,COMPUTATIONAL complexity ,ALGORITHMS ,IMAGE segmentation - Abstract
The MeanShift algorithm is a nonparametric method based on gradient ascent. Its adaptability, multi-scale analysis capabilities, and robustness let it handle complex variations in lychee orchard scenes and in lychee tree canopies, so it is widely used to segment drone-based remote sensing images of lychee orchards. However, because of its high computational complexity, MeanShift is inefficient on large-scale drone remote sensing images of lychee tree canopies, and the resulting low segmentation efficiency hampers its broader application. To address these issues, this study proposes high-speed MeanShift parallel segmentation algorithms for drone remote sensing images of lychee tree canopies based on the MapReduce and Spark distributed computing frameworks. A cluster of four nodes running Hadoop and Spark was set up, and 4000 drone remote sensing images were used as test data to evaluate the algorithms. Experimental results show that the MapReduce-based MeanShift algorithm reduced task execution time by 86.1% compared with the traditional MeanShift algorithm, while the Spark-based version reduced it by 88.0%, without compromising segmentation accuracy. The MeanShift parallel segmentation algorithms on the Hadoop and Spark platforms overcome the single-machine execution bottleneck and significantly enhance computational speed. [ABSTRACT FROM AUTHOR]
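For readers unfamiliar with the underlying method, a minimal single-machine mean-shift sketch (flat kernel, naive neighbour search) conveys the mode-seeking step that the paper parallelises; it is not the authors' Spark/Hadoop code, and the merge threshold of half the bandwidth is an illustrative convention.

```python
import numpy as np

def mean_shift(X, bandwidth, iters=30, tol=1e-4):
    """Flat-kernel mean shift: each point climbs to the mean of its
    bandwidth-neighbourhood until it stops moving (a density mode)."""
    modes = X.astype(float).copy()
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            nbrs = X[np.linalg.norm(X - m, axis=1) <= bandwidth]
            shifted[i] = nbrs.mean(axis=0)
        done = np.linalg.norm(shifted - modes) < tol
        modes = shifted
        if done:
            break
    # merge modes that converged close together into cluster labels
    labels, centers = [], []
    for m in modes:
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels.append(j)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return np.array(labels), np.array(centers)
```

The per-point neighbourhood search is what makes the serial algorithm expensive on large images, and it is exactly the part that distributes naturally across a cluster.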
- Published
- 2024
- Full Text
- View/download PDF
20. A Map Reduce-based big data clustering using swarm-inspired meta-heuristic algorithms.
- Author
-
Tekieh, R. and Beheshti, Z.
- Subjects
METAHEURISTIC algorithms ,PARTICLE swarm optimization ,BIG data ,BIOLOGICALLY inspired computing ,PARALLEL algorithms ,SEARCH algorithms ,DISTRIBUTED computing - Abstract
Clustering is one of the important methods in data analysis. For big data, clustering is difficult because of the volume of data and the complexity of clustering algorithms; methods that can cluster large amounts of data in a reasonable time are therefore required. MapReduce is a powerful programming model that allows parallel algorithms to run in distributed computing environments. In this study, an improved Artificial Bee Colony (ABC) algorithm based on a MapReduce clustering model (MR-CWABC) is proposed. A weighted average of candidate solutions, used in place of greedy selection, improves the local and global search of ABC. The improved algorithm is implemented according to the MapReduce model on the Hadoop framework to assign samples to clusters so that cluster compactness and separation are preserved. The proposed method is compared with well-known bio-inspired algorithms such as Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC) and the Gravitational Search Algorithm (GSA), each implemented on the MapReduce model on the Hadoop framework. The results show that MR-CWABC is well-suited for big data while maintaining clustering quality: it improves the average F-measure by 7.13%, 7.71%, and 6.77% over MR-CABC, MR-CPSO, and MR-CGSA, respectively. [ABSTRACT FROM AUTHOR]
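The map/reduce decomposition shared by such clustering papers can be simulated in-process: the mapper keys each point by its nearest centre, the shuffle groups by key, and the reducer averages each group. The sketch below shows only this generic pattern (here as one k-means-style step), not the MR-CWABC bee-colony logic.

```python
from collections import defaultdict

def mapper(point, centroids):
    """Map: emit (index of nearest centroid, point) — the shuffle key."""
    j = min(range(len(centroids)),
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
    return j, point

def reducer(key, points):
    """Reduce: the new centroid is the mean of all points under this key."""
    n = len(points)
    return key, tuple(sum(xs) / n for xs in zip(*points))

def kmeans_mapreduce_step(data, centroids):
    """One map -> shuffle -> reduce round, simulated in a single process."""
    groups = defaultdict(list)
    for point in data:
        k, v = mapper(point, centroids)
        groups[k].append(v)                      # simulated shuffle/sort
    return dict(reducer(k, pts) for k, pts in groups.items())
```

In a real Hadoop job the mapper and reducer run on different nodes and the framework performs the grouping, but the key/value contract is the same.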
- Published
- 2024
- Full Text
- View/download PDF
21. PDCNN-MRW: a parallel Winograd convolutional neural network algorithm base on MapReduce.
- Author
-
Wen, Zhanqing, Mao, Yimin, and Dai, Jingguo
- Abstract
Parallel deep convolutional neural network (DCNN) algorithms have been widely used in the field of big data, but several problems remain: excessive computation of redundant features, insufficient convolution performance, and poor merging of parallelized parameters. To address these problems, a parallel DCNN algorithm based on MapReduce and Winograd convolution (PDCNN-MRW) is proposed in this paper, comprising three parts. First, a feature selection method based on cosine similarity and normalized mutual information (FS-CSNMI) is proposed, which reduces redundant feature computation between channels. Next, a parallel Winograd convolution method based on MapReduce (PWC-MR) is presented to address the insufficient convolution performance by reducing the number of multiplications. Finally, a load balancing method based on a multiway tree and task migration (LB-MTTM) is developed, which improves parameter merging by balancing the load between nodes and reducing the cluster's response time. We compared the PDCNN-MRW algorithm with other algorithms on four datasets: ISIC 2019, BloodCellImages, PatchCamelyon, and ImageNet 1K. Experiments show that the proposed algorithm has lower training costs and higher efficiency than other parallel DCNN algorithms. [ABSTRACT FROM AUTHOR]
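Winograd convolution reduces multiplications, as the abstract notes. The classic 1-D F(2,3) tile, which computes two outputs of a 3-tap filter with 4 multiplications instead of 6, illustrates the idea; this is the standard textbook transform, not the paper's PWC-MR method.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap correlation over a
    4-element input tile using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # r0 = d0*g0 + d1*g1 + d2*g2,  r1 = d1*g0 + d2*g1 + d3*g2
    return (m1 + m2 + m3, m2 - m3 - m4)
```

The filter-side combinations (g0 + g1 + g2)/2 and (g0 - g1 + g2)/2 depend only on the weights, so in a CNN they are computed once per filter and amortised over every tile, which is where the savings come from.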
- Published
- 2024
- Full Text
- View/download PDF
22. Extracting Common DNA Segments from the Complete Genomes of 7538 Viruses and Five Selected Mammals
- Author
-
Wang, Jing-Doo, Wang, Yi-Chun, Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Nguyen, Ngoc-Than, editor, Franczyk, Bogdan, editor, Ludwig, André, editor, Nunez, Manuel, editor, Treur, Jan, editor, Vossen, Gottfried, editor, and Kozierkiewicz, Adrianna, editor
- Published
- 2024
- Full Text
- View/download PDF
23. Analyzing Twitter Data Using Apache Hive—A Big Data Technology Exploration
- Author
-
Sharma, Kanhaiya, Kapshe, Mansi, Bhargava, Parth, Trivedi, Prakhar, Changde, Sanika, Mishra, Om, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Chillarige, Raghavendra Rao, editor, Distefano, Salvatore, editor, and Rawat, Sandeep Singh, editor
- Published
- 2024
- Full Text
- View/download PDF
24. mRMR Feature Selection to Handle High Dimensional Datasets: Vertical Partitioning Based Iterative MapReduce Framework
- Author
-
Yelleti, Vivek, Prasad, P. S. V. S. Sai, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Abraham, Ajith, editor, Bajaj, Anu, editor, Hanne, Thomas, editor, Siarry, Patrick, editor, and Ma, Kun, editor
- Published
- 2024
- Full Text
- View/download PDF
25. Stateful MapReduce Framework for mRMR Feature Selection Using Horizontal Partitioning
- Author
-
Yelleti, Vivek, Sai Prasad, P. S. V. S., Hartmanis, Juris, Founding Editor, Goos, Gerhard, Series Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Ghosh, Ashish, editor, King, Irwin, editor, Bhattacharyya, Malay, editor, Sankar Ray, Shubhra, editor, and K. Pal, Sankar, editor
- Published
- 2024
- Full Text
- View/download PDF
26. Resource Management in Hadoop Clusters at the Storing Level of Hadoop Distributed File System
- Author
-
Goyal, Mani, Garg, Navneet, Oberoi, Neelam, Mehta, Vaishali, Bansal, Jagdish Chand, Series Editor, Deep, Kusum, Series Editor, Nagar, Atulya K., Series Editor, Vimal, Vrince, editor, Perikos, Isidoros, editor, Mukherjee, Amrit, editor, and Piuri, Vincenzo, editor
- Published
- 2024
- Full Text
- View/download PDF
27. A Review of Research on Big Data Technology
- Author
-
Wang, Ruochen, Zeng, Ziyang, Yang, Zhirui, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Wang, Yue, editor, Zou, Jiaqi, editor, Xu, Lexi, editor, Ling, Zhilei, editor, and Cheng, Xinzhou, editor
- Published
- 2024
- Full Text
- View/download PDF
28. An implementation of GPU accelerated mapreduce: using hadoop with openCL for breast cancer detection and compute-intensive jobs
- Author
-
Ouhakki, Hamza and Elmoufidi, Abdelali
- Published
- 2024
- Full Text
- View/download PDF
29. Integrated normal discriminant analysis in mapreduce for diabetic chronic disease prediction using bivariant deep neural networks
- Author
-
Ramani, R., Dhinakaran, D., Edwin Raja, S., Thiyagarajan, M., and Selvaraj, D.
- Published
- 2024
- Full Text
- View/download PDF
30. Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce.
- Author
-
Lawrance, Josephine Usha, Jesudhasan, Jesu Vedha Nayahi, and Thampiraj Rittammal, Jerald Beno
- Subjects
BIG data ,TECHNOLOGICAL innovations ,DATA privacy ,DATA mining ,DATA management ,INFORMATION scientists - Abstract
The amount of data on the internet is steadily growing due to recent technological advancements in cyber-physical-social systems, sensor networks, and communication technologies. Many information scientists, policy makers, and decision makers attempt to mine this vast amount of data for critical decisions and planned business moves. The increasing amount of big data also increases privacy issues and data breaches, so proper data management is essential for all organizations that handle sensitive information and large volumes of data. Data anonymization is a promising method for protecting individual privacy, but it typically incurs significant information loss. Recently, data anonymization based on data mining techniques has shown significant improvement in data utility. However, when applied to big data, clustering-based anonymization has serious scalability issues, and cluster formation on large data sets is time-consuming. This paper proposes a Parallel Fuzzy C-Means Clustering based Anonymization Algorithm (FCMCAA) using the Hadoop MapReduce framework to ensure the privacy of large volumes of structured data. The results demonstrate that the algorithm performs better in terms of F-measure and classification accuracy, yielding 91% accuracy. It is also scalable, handling huge volumes of structured data while maintaining a high level of privacy with minimal information loss. [ABSTRACT FROM AUTHOR]
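The fuzzy C-means updates at the heart of such an algorithm are compact. Below is a plain NumPy sketch of the standard membership and centre updates (with fuzzifier m), not the FCMCAA anonymization pipeline itself.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-9):
    """Fuzzy C-Means membership matrix: u[i, j] is how strongly point i
    belongs to cluster j; each row sums to 1 (fuzzifier m controls softness)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))
    return 1.0 / ratio.sum(axis=2)

def fcm_centers(X, u, m=2.0):
    """Centre update: membership-weighted mean of the data."""
    w = u ** m
    return (w.T @ X) / w.sum(axis=0)[:, None]
```

Alternating these two updates until the memberships stabilise is the serial algorithm; the MapReduce version distributes the per-point membership computation across nodes and reduces the weighted sums.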
- Published
- 2024
- Full Text
- View/download PDF
31. A Deep Learning Modified Neural Network(DLMNN) based proficient sentiment analysis technique on Twitter data.
- Author
-
S, Neelakandan, Paulraj, D., Ezhumalai, P., and Prakash, M.
- Subjects
- *
SENTIMENT analysis , *DEEP learning , *BIG data , *DATABASES , *PARTICLE swarm optimization , *SOCIAL media , *USER-generated content - Abstract
The rapid growth of social media over the internet generates massive information in real-time scenarios, which has a striking impact on big data analysis and has elevated the use of emotions and sentiments in social media. This paper offers a proficient sentiment analysis technique for Twitter data. Preprocessing of the Twitter database includes stemming, tokenisation, number removal, and stop-word removal. The preprocessed words are then passed into HDFS (Hadoop Distributed File System), where repeated words are eliminated using the MapReduce technique. Emoticon and non-emoticon features are extracted, and the resulting features are ranked by their intended meaning. Classification is then performed using the DLMNN (Deep Learning Modified Neural Network). The experimental results were examined using output parameters such as accuracy, recall, precision, F-score, and execution time against conventional techniques such as ANN, SVM, K-Means and DCNN. The evaluation shows that DLMNN achieved the best performance, with precision of 95.78%, recall of 95.84%, F-score of 95.87% and accuracy of 91.65% compared with the existing models. [ABSTRACT FROM AUTHOR]
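The preprocessing steps listed (tokenisation, number removal, stop-word removal, stemming) can be sketched as follows. The stop-word list and the crude suffix stemmer here are illustrative stand-ins for whatever the paper actually used (a real pipeline would use a full list and a proper stemmer such as Porter's).

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "was"}

def simple_stem(token):
    """Crude suffix stripping — a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tweet):
    """Tokenise, drop numbers and stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", tweet.lower())  # tokenise + number removal
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]
```

Each tweet reduces to a short list of normalised stems, which is the form the feature-extraction and classification stages consume.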
- Published
- 2024
- Full Text
- View/download PDF
32. Enhancing Data Analytics in Environmental Sensing Through Cloud IoT Integration.
- Author
-
Verma, Rohan, Taneja, Harsh, Singh, Kiran Deep, and Singh, Prabh Deep
- Subjects
- *
INTERNET of things , *ENVIRONMENTAL monitoring , *DATA security , *CLOUD computing , *VIRTUAL machine systems - Abstract
Transformational advances in environmental sensing have been made possible by the convergence of Cloud Computing and the Internet of Things (IoT). The potential for these technologies is to work together and improve environmental monitoring data analytics. Integrating IoT in the cloud creates a robust environment for advanced analytics by overcoming previous hurdles in data granularity, real-time monitoring, and geographical coverage. These developments aren't without their own set of difficulties, however, including data security, interoperability, and ethical concerns. This paper explores integrating Cloud IoT with environmental sensing. The paper addresses these issues and emphasises the need to establish ethical data-gathering methods, implement standardised communication protocols, and address privacy concerns. The study concludes by examining the challenges, concerns, and potential of integrating Cloud IoT. The integration of Cloud Computing and IoT not only transforms environmental sensing but also provides a solid foundation for collaborative research, data-driven decision-making, and environmentally conscious management. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. Big data clustering based on spark chaotic improved particle swarm optimization.
- Author
-
Boushaki, Saida Ishak, Mahammed, Brahim Hadj, Bendjeghaba, Omar, and Mosbah, Messaoud
- Subjects
PARTICLE swarm optimization ,BIG data - Abstract
In recent years, the surge in continuously accelerating data generation has given rise to the prominence of big data technology. The MapReduce architecture, situated at the core of this technology, provides a robust parallel environment. Spark, a leading framework in the big data landscape, extends the capabilities of the traditional MapReduce model. Coping with big data, especially in the realm of clustering, requires more efficient techniques. Meta-heuristic-based clustering, known for offering global solutions within reasonable time frames, emerges as a promising approach. This paper introduces a parallel-distributed clustering algorithm for big data within the Spark Framework, named Spark, chaotic improved PSO (S-CIPSO). Centered on particle swarm optimization (PSO), the proposed algorithm is enhanced with a chaotic map and an efficient procedure. Test results, conducted on both real and artificial datasets, establish the superior performance and quality of clustering results achieved by the proposed approach. Additionally, the scalability and robustness of S-CIPSO are validated, demonstrating its effectiveness in handling large-scale datasets. [ABSTRACT FROM AUTHOR]
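A logistic map is a common choice for the "chaotic map" such papers mention; the abstract does not specify which map S-CIPSO uses, so the sketch below merely illustrates chaotic initialization of a PSO swarm under that assumption.

```python
def logistic_map(n, x0=0.7, r=4.0):
    """Logistic map x_{t+1} = r*x_t*(1 - x_t); at r = 4 it behaves
    chaotically on (0, 1), a common substitute for uniform random
    numbers in chaos-enhanced PSO variants."""
    xs, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        xs.append(x)
    return xs

def chaotic_swarm(n_particles, dim, lo, hi):
    """Initialise particle positions by scaling the chaotic
    sequence into the search range [lo, hi]."""
    seq = iter(logistic_map(n_particles * dim))
    return [[lo + (hi - lo) * next(seq) for _ in range(dim)]
            for _ in range(n_particles)]
```

The appeal over plain pseudo-random initialization is that chaotic sequences are deterministic yet non-repeating, which some studies report improves swarm diversity.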
- Published
- 2024
- Full Text
- View/download PDF
34. Data-Oriented Operating System for Big Data and Cloud.
- Author
-
Kessler, Selwyn Darryl, Ng, Kok-Why, and Haw, Su-Cheng
- Subjects
BIG data ,COMPUTERS ,DATA management ,ENGINEERING design ,SYSTEMS design ,HARD disks - Abstract
An Operating System (OS) is a critical piece of software that manages a computer's hardware and resources, acting as the intermediary between the computer and the user. Existing OSs are not designed for Big Data and Cloud Computing, resulting in inefficient data processing and management. This paper proposes a simplified and improved kernel on an x86 system designed for Big Data and Cloud Computing purposes. The proposed design benefits from improved Input/Output (I/O) performance and applies data-oriented design to traditional data management, improving data processing speed by reducing memory-access overheads. The OS incorporates a data-oriented design to "modernize" various Data Science and management aspects. The resulting OS contains a basic input/output system (BIOS) bootloader that boots into Intel 32-bit protected mode, a text display terminal, 4 GB paging memory, a 4096-byte heap block size, a Hard Disk Drive (HDD) I/O Advanced Technology Attachment (ATA) driver, and more. I/O scheduling algorithm prototypes demonstrate how a simple Sweeping algorithm is superior to more conventionally known I/O scheduling algorithms. A MapReduce prototype is implemented using the Message Passing Interface (MPI) for big data purposes. An attempt was also made to optimize binary search using modern performance engineering and data-oriented design. [ABSTRACT FROM AUTHOR]
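The "Sweeping" scheduler the abstract refers to is presumably a SCAN/elevator-style policy; a minimal sketch under that assumption services all requests in the current direction of head travel, then reverses once.

```python
def sweep_schedule(requests, head, direction="up"):
    """SCAN/elevator ('sweeping') order: service requests in the current
    direction of head travel, then reverse once at the far end."""
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down if direction == "up" else down + up

def seek_distance(order, head):
    """Total head movement for a given service order."""
    total, pos = 0, head
    for r in order:
        total += abs(r - pos)
        pos = r
    return total
```

Compared with first-come-first-served, sweeping avoids zig-zagging across the platter, which is why it usually wins on total seek distance.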
- Published
- 2024
- Full Text
- View/download PDF
35. Advancing big data clustering with fuzzy logic-based IMV-FCA and ensemble approach.
- Author
-
Dandugala, L. S. and Vani, K. S.
- Subjects
- *
BIG data , *TIME complexity , *ELECTRONIC data processing , *FUZZY algorithms , *DISTRIBUTED databases , *PARALLEL processing - Abstract
Big data analysis (BDA) is the process of gathering, reviewing, and analyzing large amounts of data to find patterns, insights, and market trends that help businesses make more effective choices. Quick, efficient access to this data allows businesses to be flexible in developing strategies to hold onto their competitive edge. To analyze massive amounts of data quickly through parallel processing, the Hadoop software framework employs the MapReduce methodology. BDA requires solid computational resources, which are not always available, so developing new clustering techniques that can handle this kind of data processing has become crucial. Therefore, this research presents a novel, effective fuzzy-based Improved Multiview Fuzzy C-Means Algorithm (IMV-FCA) to boost the clustering strategy. The fuzzy-based IMV-FCA clustering combines an ensemble of the MobileNet V2 model and a three-layered stacked Bidirectional LSTM (MVSBiLSTM) to increase computing speed and effectiveness, and introduces a function that calculates the separation between each cluster center and a given instance to assist clustering. By simulating shared memory space and parallelizing on the MapReduce framework on the Hadoop cloud computing platform, a distributed database is utilized to improve the method's effectiveness while reducing its time complexity. Experiments compared the proposed approach with existing approaches on three standard datasets; the proposed approach yields greater performance across various metrics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. Exploring the integration of big data analytics in landscape visualization and interaction design.
- Author
-
Yang, Xiaoqing, Sitharan, Roopesh, Sharji, Elyna Amir, and Feng, He
- Subjects
- *
BIG data , *DATA visualization , *DATA integration , *SUSTAINABLE urban development , *DATA privacy , *URBAN planning - Abstract
The exponential growth of urban data presents significant challenges in efficiently analyzing and gaining actionable insights for urban planning and design. This paper proposes a big data analytics framework using a MapReduce-based parallel FP-growth (MP-PFP) algorithm, leveraging tools like Hadoop, MapReduce, and distributed crawlers to uncover patterns and trends from large-scale, heterogeneous urban datasets. A key contribution is the integration of diverse data types, from socio-economic datasets to environmental parameters, into a consistent analysis framework. The methodology employs frequent pattern mining algorithms on a scalable analytics platform to process behavior data and derive planning directives. Additionally, data visualization and parametric analysis techniques transform raw statistics into interactive 3D landscape representations that expose environmental site attributes. Specifically, the MapReduce capabilities enable distributed parallel processing of vast urban data volumes, ensuring speed and efficiency. The data visualization module creates immersive VR representations of urban landscapes, allowing interactive modifications. Advanced simulation techniques are incorporated to model the impact of planning directives on multiple landscape attributes. The framework is designed as a scalable, customizable solution that can integrate diverse urban data sources with customizable analytics, modeling and visualization modules through APIs. Comparative evaluations demonstrate a classification accuracy improvement from 68% to 93% over prevailing approaches. The framework has proven superior in data integration, real-time responsiveness, and accurately modeling the dynamic complexities of urban landscapes. The quantifiable simulations empower designers to make more informed planning decisions aligned with community needs. 
Despite ongoing data accuracy and privacy concerns, the methodology shows promising capabilities in harnessing urban big data to drive intelligent, sustainable urban development through its integration of data-driven insights, computational analysis, and interactive visualization. It brings impactful innovations to the future of urban informatics and planning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. Diabetes classification using MapReduce-based capsule network.
- Author
-
Arun, G. and Marimuthu, C. N.
- Subjects
DEEP learning ,CAPSULE neural networks ,BIG data ,DIABETES ,ACQUISITION of data ,CLASSIFICATION - Abstract
Big data analytics is a complex exploratory process for uncovering hidden information in vast collections of data. It draws enormous information from diverse sources, and analytics distils focused knowledge from the collected noisy data. For diabetes, there exists a massive collection of patient data relating to significant information on patient health and its critical nature. To validate and analyse this data and obtain the desired information about a patient and their health risk, the study uses big-data-based deep learning analytics: a deep learning model, the capsule network (CapsNet), executed on a MapReduce framework. The CapsNet within the MapReduce framework classifies instances via proper regulations. After suitable training on the training dataset, the model optimally classifies instances to detect the nature of a patient's risk. Validation on the test dataset shows that the proposed CapsNet-based MapReduce model obtains higher accuracy, recall, and F-score than conventional MapReduce and deep learning models. [ABSTRACT FROM AUTHOR]
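CapsNet's characteristic "squash" non-linearity, from the original capsule-network formulation rather than this paper's MapReduce specifics, can be written directly: it preserves a capsule vector's direction while mapping its length into [0, 1) so that length can encode probability.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """CapsNet 'squash': v = (|s|^2 / (1 + |s|^2)) * s / |s|.
    Keeps the vector's direction; its length saturates below 1."""
    sq_norm = (s ** 2).sum(axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```

For an input of length 5 the output length is 25/26, close to 1, while short vectors are shrunk nearly to zero, which is how a capsule expresses confidence in the entity it represents.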
- Published
- 2024
- Full Text
- View/download PDF
38. RA-MRS: A high efficient attribute reduction algorithm in big data
- Author
-
Linzi Yin, Ke Cao, Zhaohui Jiang, and Zhanqi Li
- Subjects
MapReduce ,Big data ,Vibration Optimization ,Marked Reduction Set ,Attribute reduction ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
An efficient attribute reduction algorithm capable of handling high-dimensional big data is one of the hot topics of rough set theory; related work typically achieves reduction with |C| MapReduce jobs. In this paper, we present the definition of a marked reduction set and propose a more efficient attribute reduction algorithm (RA-MRS). RA-MRS includes a batch processing phase and a vibration optimization phase, which together reduce the number of jobs from |C| to log2|C|. Additionally, we provide an effective MapReduce-based judgment strategy that uses Java's exception processing mechanism to interrupt and advance the current job. Finally, the proposed algorithm is implemented in parallel on the Spark computing framework. The experimental results show that the proposed RA-MRS algorithm is over 99% faster than the classical PAAR_PR algorithm and 70% faster than the algorithm in the literature (Yin et al., 2021).
- Published
- 2024
- Full Text
- View/download PDF
39. MP-SPILDL: A Massively Parallel Inductive Logic Learner in Description Logic
- Author
-
Eyad Algahtani
- Subjects
Machine learning ,inductive logic programming ,description logic ,GPU ,big data ,MapReduce ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
This article presents MP-SPILDL, a massively parallel inductive logic learner in Description Logic (DL). MP-SPILDL is a scalable inductive Logic Programming (ILP) algorithm that exploits existing Big Data infrastructure to perform large-scale inductive logic learning in DL (the $\mathcal {ALCQI}^{\mathcal {(D)}}$ DL language in particular). MP-SPILDL targets accelerating both hypothesis search and hypothesis evaluation by aggregating the computing power of multi-core CPUs with their vector/SIMD instructions and multi-GPUs in a Hadoop cluster. In terms of hypothesis search, MP-SPILDL employs a novel MapReduce-based algorithm that performs distributed parallel hypothesis search. MP-SPILDL also employs a novel MapReduce-based procedure that eliminates all redundant hypotheses generated after each learning iteration. Moreover, MP-SPILDL utilizes deterministic ordering of hypotheses’ operands to avoid exploring redundant areas of the search space, similar to the DL-Learner, the state of the art in DL-based ILP literature. In terms of hypothesis evaluation, MP-SPILDL performs parallel hypothesis evaluation, which uses all CPU cores combined with their vector instructions and all multi-GPUs of all machines in the Hadoop cluster. According to the experimental results using an Apache Spark implementation on a Hadoop cluster of three worker machines (36 total CPU cores, 7 total GPUs), MP-SPILDL achieved speedups of up to 13.3 folds using parallel beam search with $beamWidth = 32$ and CPU-based vectorized hypothesis evaluation – the best-case scenario. On small datasets such as Michalski’s trains, MP-SPILDL achieved a slower performance than the baseline, representing the worst-case scenario.
- Published
- 2024
- Full Text
- View/download PDF
40. MP-HTHEDL: A Massively Parallel Hypothesis Evaluation Engine in Description Logic
- Author
-
Eyad Algahtani
- Subjects
Machine learning ,inductive logic programming ,description logic ,GPU ,big data ,MapReduce ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
We present MP-HTHEDL, a massively parallel hypothesis evaluation engine for inductive learning in description logic (DL). MP-HTHEDL is an extension of our previous work, HT-HEDL, which also targets improving hypothesis evaluation performance for inductive logic programming (ILP) algorithms that use DL as their representation language. Unlike HT-HEDL, MP-HTHEDL is a massively parallel approach that improves hypothesis evaluation performance through horizontal scaling, by exploiting the computing capabilities of all CPUs and GPUs from networked machines in Hadoop clusters. Many modern CPUs have extended instruction sets for accelerating specific types of computations, especially data-parallel or vector computations. For CPU-based hypothesis evaluation, MP-HTHEDL employs vectorized multiprocessing as opposed to HT-HEDL's vectorized multithreading, though both MP-HTHEDL and HT-HEDL combine the classical scalar processing of multi-core CPUs with the extended vector instructions of each CPU core. This combination of scalar and vector processing extracts more performance from the CPUs. According to experimental results from an Apache Spark implementation on a Hadoop cluster of 3 worker nodes with a total of 36 CPU cores and 7 GPUs, the pure scalar processing power of the multi-core CPUs yielded a speedup of up to ~25.4-fold. When the scalar processing was combined with the extended vector instructions of those multi-core CPUs, the gains increased from ~25.4-fold to ~67-fold on the same cluster of 3 worker nodes; these large speedups are achieved using CPU-based processing only. In terms of GPU-based evaluation, MP-HTHEDL achieved a speedup of up to ~161-fold using the GPUs from the same 3 worker nodes.
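One classic way to vectorise hypothesis evaluation, in the spirit of the data-parallel evaluation described (the paper's own kernels are not shown here), is to encode each hypothesis's example coverage as a bitset and score it with AND + popcount.

```python
def coverage(hypothesis_bits, positive_bits, negative_bits):
    """One bit per training example: bit i of hypothesis_bits is set iff
    the hypothesis covers example i. AND + popcount scores all examples
    at once — the scalar analogue of SIMD/GPU hypothesis evaluation."""
    tp = bin(hypothesis_bits & positive_bits).count("1")  # covered positives
    fp = bin(hypothesis_bits & negative_bits).count("1")  # covered negatives
    return tp, fp
```

A 64-bit word evaluates 64 examples per instruction, and SIMD lanes or GPU threads widen that further, which is where the reported multi-fold speedups come from.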
- Published
- 2024
- Full Text
- View/download PDF
41. Elevating Big Data Privacy: Innovative Strategies and Challenges in Data Abundance
- Author
-
Mohamed Elkawkagy, E. Elwan, Albandari Alsumayt, Heba Elbeh, and Sumayh S. Aljameel
- Subjects
Big data ,big data privacy ,fine-grained encryption ,perturbation technique ,Hadoop ,MapReduce ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The exponential growth of big data has ushered in transformative possibilities across various sectors, but it has also raised formidable privacy concerns. This article delves into the pressing need for enhancing big data privacy and explores innovative approaches to address this critical issue. In recent years, big data has been characterized by its immense volume, high velocity, and diverse data sources. These attributes have enabled organizations to gain unprecedented insights but have also exposed sensitive information to potential breaches. As such, ensuring the privacy of individuals and sensitive data within big data sets has emerged as a paramount concern. This article first elucidates the multifaceted nature of big data privacy, emphasizing its encompassment of privacy, confidentiality, integrity, and availability. It also acknowledges the challenges posed by existing privacy-preserving techniques, which often fall short of providing comprehensive protection for large and diverse data sets. The core focus of this article lies in presenting novel strategies and technologies designed to improve big data privacy. This article presents an innovative framework that combines advanced encryption methods, including fine-grained encryption techniques and differential privacy mechanisms specifically designed for the distinct traits of big data, such as noise-based techniques. To achieve this, the dataset is categorized into key attributes, sensitive attributes, quasi attributes, and insensitive attributes. The fine-grained technique then encrypts the key and sensitive attributes, while the differential privacy mechanism perturbs the quasi attributes. To further substantiate the effectiveness of the proposed technique, this article references empirical findings that demonstrate tangible improvements in big data privacy protection.
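The differential-privacy side of such a framework typically rests on the Laplace mechanism; the sketch below is a generic ε-DP count query, not this article's specific scheme for quasi attributes.

```python
import numpy as np

def dp_count(true_count, epsilon, rng=None):
    """Laplace mechanism for an epsilon-DP count query: a count has
    sensitivity 1, so adding Laplace(1/epsilon) noise suffices."""
    rng = np.random.default_rng() if rng is None else rng
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
```

Smaller ε means larger noise and stronger privacy; the noisy answers are unbiased, so aggregates over many queries still converge to the truth.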
- Published
- 2024
- Full Text
- View/download PDF
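The attribute-wise protection described in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record fields and schema are hypothetical, SHA-256 hashing stands in for a real fine-grained encryption scheme, and stdlib inverse-transform sampling supplies the Laplace noise used for differential-privacy perturbation of quasi attributes.

```python
import hashlib
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def protect_record(record, schema, epsilon=1.0, sensitivity=1.0):
    """Apply category-specific protection to one record.

    schema maps attribute name -> "key" | "sensitive" | "quasi" | "insensitive".
    Key/sensitive attributes are masked (hashing stands in for real
    fine-grained encryption); numeric quasi attributes receive Laplace
    noise calibrated to sensitivity/epsilon; insensitive attributes
    pass through unchanged.
    """
    out = {}
    for name, value in record.items():
        category = schema.get(name, "insensitive")
        if category in ("key", "sensitive"):
            out[name] = hashlib.sha256(str(value).encode()).hexdigest()
        elif category == "quasi":
            out[name] = value + laplace_noise(sensitivity / epsilon)
        else:
            out[name] = value
    return out

# Hypothetical patient record and attribute categorization.
schema = {"patient_id": "key", "diagnosis": "sensitive",
          "age": "quasi", "city": "insensitive"}
protected = protect_record(
    {"patient_id": "P-17", "diagnosis": "T2D", "age": 54, "city": "Fez"},
    schema)
```

The key attribute and sensitive attribute are irreversibly masked, the quasi attribute carries calibrated noise, and the insensitive attribute is untouched, mirroring the four-way categorization the article describes.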
42. Diabetes classification using MapReduce-based capsule network
- Author
-
G. Arun and C. N. Marimuthu
- Subjects
Capsnets, MapReduce, classification, big data, network, framework, Control engineering systems. Automatic machinery (General), TJ212-225, Automation, T59.5 - Abstract
Big data analytics is a complex exploratory process that uncovers hidden information from vast collections of data. It often draws enormous information from diverse sources, and the use of analytics extracts focused knowledge from the collected noisy data. In the case of diabetes data, there exists a massive collection of patient records relating to significant information on patient health and its critical nature. In order to validate and analyse the data and obtain the desired information about a patient and their health risk from this vast collection, the study uses big-data-based deep learning analytics. Specifically, a deep learning model, the capsule network (CapsNet), is executed on a MapReduce framework. The CapsNet within the MapReduce framework enables the classification of instances via proper regulations. After suitable training on the training dataset, this model enables optimal classification of instances to detect the nature of a patient's risk. The validation conducted on the test dataset shows that the proposed CapsNet-based MapReduce model achieves higher accuracy, recall, and F-score than conventional MapReduce and deep learning models.
- Published
- 2024
- Full Text
- View/download PDF
43. Replication-Based Query Management for Resource Allocation Using Hadoop and MapReduce over Big Data
- Author
-
Ankit Kumar, Neeraj Varshney, Surbhi Bhatiya, and Kamred Udham Singh
- Subjects
big data, hadoop, mapreduce, resource allocation, query management, Electronic computers. Computer science, QA75.5-76.95 - Abstract
We live in an age where data about everything around us is being created. Data generation rates are growing so fast that they create pressure to implement cost-effective and straightforward data storage and retrieval processes. The MapReduce model is used to create parallel, distributed algorithms that process large datasets across a cluster. Hadoop's MapReduce strategy, developed by the non-commercial community, is extended here with a new algorithm that resolves, for commercial applications, the problem of disproportionate or skewed results across a Hadoop cluster. The work addresses job scheduling, matching data positions across matrices, clustering prior to selection, and accurate mapping, keeping internally related data close together to reduce running and execution times. The mapper output and supporting components have been implemented in the map and reduce functions, and the input key/value pairs and output key/value pairs of each execution stage have been defined. This paper focuses on evaluating this technique for the efficient retrieval of large volumes of data. The technique provides the capabilities needed for a massive database of information, from storage and indexing techniques to query distribution, scalability, and performance in heterogeneous environments. The results show that the proposed work reduces the data processing time by 30%.
- Published
- 2023
- Full Text
- View/download PDF
44. Big data clustering using fuzzy based energy efficient clustering and MobileNet V2.
- Author
-
Dandugala, Lakshmi Srinivasulu and Vani, Koneru Suvarna
- Subjects
- BIG data, OPTIMIZATION algorithms, TIME complexity, SOFTWARE frameworks, DISTRIBUTED databases, PARALLEL processing - Abstract
Big data analytics (BDA) is a systematic way to analyze and detect various patterns, relationships, and trends in vast amounts of data. Big data analysis and processing require significant effort, techniques, and equipment. The Hadoop framework uses the MapReduce approach to perform large-scale data analysis with parallel processing and generate results as quickly as possible. One of the main issues with traditional algorithms is their long execution time and difficulty in processing large amounts of data. Items within a cluster should be highly correlated with each other, while items in different clusters should not. An optimization algorithm for clustering is a technique for effectively allocating limited resources. For processing large amounts of data with many dimensions, conventional optimization approaches are insufficient; this can be avoided by using a fuzzy method. In this paper, we propose a fuzzy-based energy-efficient clustering approach to enhance the clustering mechanism. In summary, fuzzy-based energy-efficient clustering introduces a function that measures the distance between the cluster center and each instance, which aids clustering, and we then apply the MobileNet V2 model to improve efficiency and speed up computation. To enhance the method's performance and reduce its time complexity, a distributed database simulates the shared memory space and the computation is parallelized on the MapReduce framework on the Hadoop cloud computing platform. The proposed approach is evaluated using performance metrics such as Accuracy, Precision, Adjusted Rand Index (ARI), Recall, F1-Score, and Normalized Mutual Information (NMI). The experimental findings indicate that the proposed approach outperforms existing techniques in terms of clustering accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
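The distance-based membership function mentioned in the abstract above can be illustrated with the standard fuzzy c-means membership update (a generic sketch, not the paper's exact function): a point's membership in each cluster is inversely related to its distance from that cluster's center, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)).

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    d_i is the Euclidean distance from the point to center i;
    memberships always sum to 1 across the clusters.
    """
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):            # point coincides with a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists)
            for d_i in dists]

# A point closer to the first of two toy centers gets the larger membership.
u = fuzzy_memberships((1.0, 1.0), [(0.0, 0.0), (3.0, 3.0)])
# u == [0.8, 0.2]
```

With fuzzifier m = 2 the exponent is 2, so membership falls off with the squared distance ratio; larger m produces softer (more uniform) assignments.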
45. Monitoring and Mitigating Climate-Induced Natural Disasters with Cloud IoT.
- Author
-
Taneja, Harsh, Verma, Rohan, Singh, Prabh Deep, and Singh, Kiran Deep
- Subjects
- NATURAL disasters, INTERNET of things, CLIMATE change mitigation, DECISION support systems, CALIFORNIA wildfires - Abstract
Innovative solutions are necessary for efficient monitoring and mitigation techniques as climate change increases the frequency and severity of natural catastrophes. In this study, we investigate how Cloud Internet of Things (IoT) might help mitigate climate-related calamities. Through the use of cloud computing and sensor networks, Cloud IoT facilitates the capture, processing, and distribution of data in near-real time. This strategy promotes catastrophe planning, response, and recovery by offering early warnings, predictive insights, and simplified communication. The article emphasises the cumulative influence of Cloud IoT components such as sensor networks, data analytics, decision support systems, and remote control in the context of disaster management. Cloud IoT is useful in real-world scenarios, such as the tracking of floods in Bangladesh and the identification of wildfires in California. These cases show how the technology may prevent injuries and preserve property through early warnings and well-coordinated responses. Despite its potential, it faces obstacles including ensuring the safety of data and dealing with issues related to the necessary infrastructure. In conclusion, including Cloud IoT in disaster management provides a cost-effective, scalable, and efficient solution, greatly contributing to constructing resilient communities and creating a sustainable future in the face of climate-induced natural catastrophes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Lightweight MapReduce Application Service Integrity Auditing on the Cloud.
- Author
-
Li, Panpan, Xie, Zhengxia, Wang, Zengkai, and Zhou, Zhigang
- Subjects
- SERVICE level agreements, AUDITING, COMPLIANCE auditing, DATA integrity, CLOUD computing - Abstract
Due to the advantages of MapReduce in both cost-saving and effectiveness for large-scale computation-intensive applications, a growing number of tenants are considering migrating their applications to the MapReduce cloud. For financial gain, an untrusted cloud service provider (CSP) may violate the service level agreement (SLA) when providing service to tenants. Therefore, essential security mechanisms are needed to protect the integrity of MapReduce services. To address this issue, this paper proposes a novel service accountability scheme to audit SLA compliance for MapReduce on the cloud. The proposed scheme inserts auditing points into tenants' applications and constructs a tamper-proof list at run time. By verifying the auditing list, the MapReduce cloud can be made responsible and accountable, and it can be detected whether the CSP should be held responsible for its service. Finally, the proposed scheme is deployed in experimental business processes, and its effectiveness is evaluated. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
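A common way to build the kind of tamper-proof run-time list described in the abstract above is a hash chain, where each entry commits to its predecessor's hash. The sketch below is a generic illustration of that idea (the paper's concrete construction may differ): tampering with any recorded auditing-point event breaks every subsequent hash during verification.

```python
import hashlib
import json

def append_audit(chain, event):
    """Append an auditing-point event to a hash-chained log."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    chain.append({"event": event, "prev": prev_hash,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})
    return chain

def verify_audit(chain):
    """Re-derive every hash; return True only if the whole chain is intact."""
    prev_hash = "0" * 64
    for entry in chain:
        body = json.dumps({"event": entry["event"], "prev": prev_hash},
                          sort_keys=True)
        if (entry["prev"] != prev_hash or
                entry["hash"] != hashlib.sha256(body.encode()).hexdigest()):
            return False
        prev_hash = entry["hash"]
    return True

# Hypothetical auditing points recorded during a MapReduce run.
log = []
for step in ("map:start", "map:done", "reduce:done"):
    append_audit(log, step)
assert verify_audit(log)

log[1]["event"] = "map:skipped"      # simulated tampering by an untrusted CSP
assert not verify_audit(log)
```

Because each hash covers the previous hash, the auditor only needs the final list to detect any modification, insertion, or deletion earlier in the run.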
47. MapReduce-based distributed tensor clustering algorithm.
- Author
-
Zhang, Hongjun, Li, Peng, Meng, Fanshuo, Fan, Weibei, and Xue, Zhuangzhuang
- Subjects
- K-means clustering, CLUSTER analysis (Statistics), COMPUTER science, ALGORITHMS, PARTITION functions, PARALLEL processing, PARALLEL programming - Abstract
Cluster analysis is one of the most fundamental methods in data mining, and it has been widely used in economics, the social sciences, and computer science. However, with the rapid development of Internet technology, the volume of data required for various web applications has grown rapidly, making traditional clustering analysis methods face technical challenges. How to obtain useful information from a large amount of data quickly and efficiently is an urgent problem in many industrial fields. With the continuous development of cloud computing technology, large amounts of data can be processed quickly and efficiently. Hadoop is an open-source distributed cloud computing platform with HDFS (Hadoop Distributed File System) and MapReduce at its core. HDFS provides massive data storage, while MapReduce uses the MapReduce programming model to achieve parallel processing. Compared with traditional parallel programming models, it provides basic functions such as data partitioning, task scheduling, and parallel processing, making it possible for users to develop distributed applications on their own without understanding the fundamentals of distributed computing, thus facilitating the design of parallel programs. The K-means algorithm is a typical clustering analysis method that is widely used in industry, but its number of iterations increases significantly as the data volume grows, reducing computational efficiency. In order to better support cluster analysis of large-scale data, this paper first implements a parallelized algorithm based on MapReduce on the Hadoop platform using the basic idea of MapReduce, and improves the K-means algorithm to address its blind random selection of initial cluster centers and its tendency to fall into local optima. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
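The core of a MapReduce K-means iteration, as described in the abstract above, is: the map phase assigns each point to its nearest center, and the reduce phase averages the points per center to produce updated centers. A minimal single-process sketch of one such iteration (illustrative only, not the paper's distributed implementation):

```python
import math
from collections import defaultdict

def kmeans_map(points, centers):
    """Map phase: emit (nearest-center index, point) pairs."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, p

def kmeans_reduce(pairs, dim):
    """Reduce phase: average the points assigned to each center index."""
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for idx, p in pairs:
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += p[d]
    return {i: tuple(s / counts[i] for s in sums[i]) for i in counts}

# One iteration over a toy dataset with two well-separated groups.
points = [(0, 0), (0, 1), (9, 9), (10, 10)]
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = kmeans_reduce(kmeans_map(points, centers), dim=2)
# new_centers == {0: (0.0, 0.5), 1: (9.5, 9.5)}
```

On Hadoop, each iteration is one MapReduce job; a combiner that pre-sums points per center on each mapper node is the standard trick for cutting shuffle traffic, and the iteration repeats until the centers stop moving.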
48. 3D Interior Design System Model Based on Computer Virtual Reality Technology.
- Author
-
Yu Dai
- Subjects
- COMPUTER simulation, INTERIOR decoration, INFORMATION storage & retrieval systems, WEB services, DATABASES, VIRTUAL reality software, VIRTUAL reality - Abstract
Globally, data volume increases exponentially with the proliferation of cloud computing. MapReduce has emerged as a prominent solution for handling this unprecedented growth efficiently, as it processes both structured and unstructured data. The dynamic landscape of virtual reality has seen a significant shift towards technology-driven approaches, with data analytics and personalized learning becoming increasingly important. This paper introduces an innovative framework that leverages the power of Hadoop and MapReduce to elevate 3D virtual reality experiences within diverse VR cloud settings. It presents the development of an efficient Cache-Based MapReduce framework (CMF) in which cache algorithms are used to process queries on large-scale cloud-based data effectively. The Hadoop system processes data in single-node Hadoop clusters (pseudo-distributed) as well as heterogeneous Hadoop clusters (fully distributed nodes) within Amazon Web Services (AWS). The experimental analysis is evaluated on the SmallGutenberg and LargeGutenberg databases. The developed model achieves an average reduction in jobs of 48.01% with a reduction in execution time of 51.99%. For 7-node, 9-node, 15-node, and 20-node clusters, the CMF's reduction in execution time is measured as 49.91%, 51.38%, 54.71%, and 45.29%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2023
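The abstract above does not detail the CMF's cache algorithm, but the general pattern of caching MapReduce query results can be sketched with a small LRU cache: a hit returns the stored result without re-running the job, and a miss runs the job and stores its output. Names here (query strings, the stand-in job function) are hypothetical.

```python
from collections import OrderedDict

class QueryResultCache:
    """A small LRU cache for query results over large-scale data.

    On a hit, the stored result is returned and the job is skipped;
    on a miss, the supplied job function runs and its output is stored,
    evicting the least recently used entry when capacity is exceeded.
    """
    def __init__(self, capacity=128):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, query, run_job):
        if query in self._store:
            self.hits += 1
            self._store.move_to_end(query)       # mark as most recently used
            return self._store[query]
        self.misses += 1
        result = run_job(query)
        self._store[query] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)      # evict least recently used
        return result

cache = QueryResultCache(capacity=2)
wordcount = lambda q: {"hadoop": 3}              # stand-in for a real MapReduce job
cache.get("count words in corpus", wordcount)    # miss: job runs
cache.get("count words in corpus", wordcount)    # hit: served from cache
```

The reported execution-time reductions come from exactly this effect: repeated or overlapping queries skip the full MapReduce pass once their results are cached.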
49. Performance Improvement through Novel Adaptive Node and Container Aware Scheduler with Resource Availability Control in Hadoop YARN.
- Author
-
Manjaly, J. S. and Subbulakshmi, T.
- Subjects
BIG data, ARTIFICIAL neural networks, CONVOLUTIONAL neural networks, MACHINE learning, DEEP learning, ARTIFICIAL intelligence - Abstract
The default scheduler of Apache Hadoop demonstrates operational inefficiencies when connecting external sources and processing transformation jobs. This paper proposes a novel scheduler, the Adaptive Node and Container Aware Scheduler (ANACRAC), to enhance the performance of the Hadoop Yet Another Resource Negotiator (YARN) scheduler by aligning cluster resources to the demands of real-world applications. The approach leverages user-provided configurations as a unique design to apportion nodes, or containers within nodes, according to application thresholds. Additionally, it gives applications the flexibility to select which nodes' resources they want to manage, and it adds limits that prevent threshold breaches when additional jobs are added. Node and container awareness can be utilized individually or in combination to increase efficiency. On top of this, the resource availability within nodes and containers can also be inspected. This paper also addresses the elasticity of containers and their self-adaptiveness depending on the job type. The results show that a 15%-20% performance improvement was achieved with the node and container awareness features of ANACRAC, and it has been validated that the ANACRAC scheduler demonstrates a 70%-90% performance improvement compared with the default Fair scheduler. Experimental results also demonstrated a performance improvement in the range of 60% to 200% when applications were connected to external interfaces under high workloads. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
50. Visualization of hyperspectral images on parallel and distributed platform: Apache Spark.
- Author
-
Zbakh, Abdelali, Bennani, Mohamed Taj, Souri, Adnan, and El Hichami, Outman
- Subjects
DATA visualization, HYPERSPECTRAL imaging systems, PRINCIPAL components analysis, PARALLEL algorithms - Abstract
The field of hyperspectral image storage and processing has undergone a remarkable evolution in recent years. Visualizing these images is a challenge when the number of bands exceeds three, since direct visualization using the standard red, green and blue (RGB) or hue, saturation and lightness (HSL) systems is not feasible. One potential solution to this problem is to reduce the dimensionality of the image to three dimensions and then assign each dimension to a color. Conventional tools and algorithms have become incapable of producing results within a reasonable time. In this paper, we present a new distributed method for visualizing hyperspectral images based on principal component analysis (PCA) and implemented in a distributed parallel environment (Apache Spark). With the proposed method, large hyperspectral images are visualized in less time and with the same quality as with the classical visualization method. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
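The final display step of the approach described above, projecting each pixel's band vector onto three principal components and mapping each projected axis to a color channel, can be sketched as follows. This is a toy illustration: the component vectors are assumed to be already computed (e.g. by a distributed PCA step in Spark), and the 5-band pixels and identity-like basis below are hypothetical.

```python
def to_rgb(pixels, components):
    """Project high-band pixels onto three principal components and
    min-max normalize each projected axis into [0, 255] for display.

    `components` holds three band-length basis vectors; each pixel is
    a sequence of band intensities.
    """
    projected = [[sum(p * c for p, c in zip(px, comp)) for comp in components]
                 for px in pixels]
    rgb = []
    for axis in range(3):
        vals = [row[axis] for row in projected]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0                 # avoid divide-by-zero on flat axes
        rgb.append([round(255 * (v - lo) / span) for v in vals])
    # Transpose back to per-pixel (r, g, b) tuples.
    return list(zip(*rgb))

# Two toy 5-band pixels and a hypothetical 3-component basis.
pixels = [(0.1, 0.5, 0.9, 0.2, 0.4), (0.8, 0.3, 0.1, 0.6, 0.2)]
components = [(1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 0, 0)]
rgb = to_rgb(pixels, components)
# rgb == [(0, 255, 255), (255, 0, 0)]
```

In the distributed setting, the covariance/PCA computation and the per-pixel projection both parallelize naturally over partitions of the image, which is what makes the Spark implementation faster than the classical single-machine method.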