Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark.

Authors :
Ramirez-Gallego, Sergio
Krawczyk, Bartosz
Garcia, Salvador
Wozniak, Michal
Benitez, Jose Manuel
Herrera, Francisco
Source :
IEEE Transactions on Systems, Man & Cybernetics. Systems. Oct2017, Vol. 47 Issue 10, p2727-2739. 13p.
Publication Year :
2017

Abstract

Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods displaying a high computational efficacy, with ability to continuously update their structure and handle ever-arriving big number of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously update and remove outdated examples from the case-base. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. Experimental study conducted on a set of real-life massive data streams proves the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
2168-2216
Volume :
47
Issue :
10
Database :
Academic Search Index
Journal :
IEEE Transactions on Systems, Man & Cybernetics. Systems
Publication Type :
Academic Journal
Accession number :
125206992
Full Text :
https://doi.org/10.1109/TSMC.2017.2700889