Distributed nearest neighbor classification - Python Projects

Distributed nearest neighbor classification for large-scale multi-label data on Spark - 2018

Research Area: Big Data

Abstract:

Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification, and one of its most renowned methods is the multi-label k nearest neighbor (Ml-knn). The traditional implementations of this method are not feasible for large-scale multi-label data due to its complexity and memory restrictions. We propose a distributed Ml-knn implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between the quality of the predictions and runtimes on 22 benchmark datasets, and compares the scalability using different sizes of data. The results indicate that the tree-based index strategy outperforms the other approaches, having a speedup of up to 266x for the largest dataset, while achieving an accuracy equivalent to the exact methods. This strategy enables Ml-knn to scale efficiently with respect to the size of the problem.
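
As a rough illustration of the MapReduce idea mentioned in the abstract, the sketch below runs a nearest neighbor search over partitioned data on a single machine: each partition reports its own k nearest candidates (map), and the candidate lists are merged down to a global top k (reduce). It is not the authors' Spark implementation. The function names (local_top_k, merge_top_k, predict_labels), the squared Euclidean distance, and the simple majority vote over neighbor labels are all assumptions made for this example; in particular, the vote is a simplification and not the Ml-knn decision rule.

```python
import heapq
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_top_k(partition, query, k):
    """Map step: the k nearest (distance, labels) pairs inside one partition."""
    scored = [(squared_distance(features, query), labels)
              for features, labels in partition]
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def merge_top_k(candidates_a, candidates_b, k):
    """Reduce step: keep the k nearest pairs from two candidate lists."""
    return heapq.nsmallest(k, candidates_a + candidates_b,
                           key=lambda pair: pair[0])


def predict_labels(partitions, query, k):
    """Merge per-partition candidates, then vote on each label.

    Assumption: a label is predicted when more than half of the merged
    neighbors carry it. This stands in for the Ml-knn rule, which is
    not reproduced here.
    """
    merged = []
    for partition in partitions:
        merged = merge_top_k(merged, local_top_k(partition, query, k), k)
    votes = Counter(label for _, labels in merged for label in labels)
    return {label for label, count in votes.items() if count > len(merged) / 2}


# Hypothetical toy data: two partitions of (features, label set) pairs.
partitions = [
    [((0.0, 0.0), {"a"}), ((5.0, 5.0), {"b"})],
    [((0.1, 0.0), {"a", "c"}), ((0.0, 0.2), {"a", "c"}), ((6.0, 6.0), {"b"})],
]
predicted = predict_labels(partitions, (0.0, 0.0), k=3)
```

Because each partition returns at most k candidates, the merge step only ever handles 2k items at a time, which is what makes this pattern attractive when the training data does not fit on one machine.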

Keywords:

Author(s) Name: Jorge Gonzalez-Lopez, Sebastián Ventura and Alberto Cano

Journal name: Future Generation Computer Systems

Conference name:

Publisher name: ELSEVIER

DOI: 10.1016/j.future.2018.04.094

Volume Information: Volume 87, October 2018, Pages 66-82

Paper Link: https://www.sciencedirect.com/science/article/abs/pii/S0167739X17327759
