Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark.

Authors :: Ramirez-Gallego, Sergio
Krawczyk, Bartosz
Garcia, Salvador
Wozniak, Michal
Benitez, Jose Manuel
Herrera, Francisco
Source :: IEEE Transactions on Systems, Man, and Cybernetics: Systems. Oct. 2017, Vol. 47, Issue 10, pp. 2727-2739. 13 pp.
Publication Year :: 2017
Abstract: Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods displaying high computational efficiency, with the ability to continuously update their structure and handle an ever-arriving large number of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams demonstrates the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data. [ABSTRACT FROM AUTHOR]
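
Illustrative note: the abstract does not spell out the algorithm, so the following is only a minimal single-machine sketch of the general idea of an incremental nearest neighbor classifier whose case-base is updated continuously and pruned of old examples. Everything in it is an assumption made for illustration: the class name `SlidingWindowNN`, the `max_cases` parameter, and the first-in-first-out eviction rule, which is a simple stand-in for the paper's instance selection method. It uses a linear scan rather than the distributed metric-space ordering, and it does not use Apache Spark.

```python
from collections import deque
import math


class SlidingWindowNN:
    """Toy 1-NN classifier over a bounded case-base (hypothetical, illustrative only)."""

    def __init__(self, max_cases):
        # Assumption: a deque with maxlen drops the oldest case once the
        # case-base is full. This FIFO rule is NOT the paper's selection method.
        self.cases = deque(maxlen=max_cases)

    def update(self, features, label):
        """Add one labelled instance from the stream to the case-base."""
        self.cases.append((tuple(features), label))

    def predict(self, features):
        """Return the label of the closest stored case, or None if the case-base is empty."""
        if not self.cases:
            return None
        query = tuple(features)
        # Linear scan with Euclidean distance; a real system would index the cases.
        nearest = min(self.cases, key=lambda case: math.dist(case[0], query))
        return nearest[1]
```

A stream would be processed by calling `predict` on each incoming instance and then `update` once its label becomes available, so that predictions always rely on the most recent `max_cases` examples.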