Research Area:  Big Data
Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification and one of its most renowned methods is the multi-label k nearest neighbor (Ml-knn). The traditional implementations of this method are not feasible for large-scale multi-label data due to its complexity and memory restrictions. We propose a distributed Ml-knn implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between the quality of the predictions and runtimes on 22 benchmark datasets, and compares the scalability using different sizes of data. The results indicate that the tree-based index strategy outperforms the other approaches, having a speedup of up to 266x for the largest dataset, while achieving an accuracy equivalent to the exact methods. This strategy enables Ml-knn to scale efficiently with respect to the size of the problem.
Keywords:  
Author(s) Name:  Jorge Gonzalez-Lopez, Sebastián Ventura and Alberto Cano
Journal name:  Future Generation Computer Systems
Conference name:  
Publisher name:  ELSEVIER
DOI:  10.1016/j.future.2018.04.094
Volume Information:  Volume 87, October 2018, Pages 66-82
Paper Link:   https://www.sciencedirect.com/science/article/abs/pii/S0167739X17327759
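
Illustrative sketch:  The abstract describes distributing the nearest neighbor search that Ml-knn depends on. For orientation, the following is a minimal single-machine sketch of Ml-knn with brute-force neighbor search, i.e. the kind of exact baseline that becomes infeasible at scale. It is not the authors' Spark implementation. The function names (`knn_indices`, `fit_mlknn`, `predict_mlknn`), the Euclidean distance, and the smoothing value `s=1.0` are assumptions made for illustration only.

```python
import math


def knn_indices(X, query, k, skip=None):
    """Return indices of the k rows of X closest to query (brute force, Euclidean)."""
    dists = [(math.dist(x, query), i) for i, x in enumerate(X) if i != skip]
    dists.sort()
    return [i for _, i in dists[:k]]


def fit_mlknn(X, Y, k=3, s=1.0):
    """Estimate Ml-knn priors and neighbor-count statistics from training data.

    X: list of feature tuples; Y: list of 0/1 label lists (one entry per label).
    s is a Laplace smoothing constant (assumed value, not taken from the paper).
    """
    m, q = len(X), len(Y[0])
    # Smoothed prior probability that an instance carries label l.
    prior = [(s + sum(y[l] for y in Y)) / (2 * s + m) for l in range(q)]
    # has[l][j]: number of training instances WITH label l whose k neighbors
    # contain exactly j instances with label l; lacks[l][j]: same, WITHOUT label l.
    has = [[0] * (k + 1) for _ in range(q)]
    lacks = [[0] * (k + 1) for _ in range(q)]
    for i in range(m):
        nbrs = knn_indices(X, X[i], k, skip=i)  # exclude the instance itself
        for l in range(q):
            j = sum(Y[n][l] for n in nbrs)
            if Y[i][l]:
                has[l][j] += 1
            else:
                lacks[l][j] += 1
    return {"X": X, "Y": Y, "k": k, "s": s,
            "prior": prior, "has": has, "lacks": lacks}


def predict_mlknn(model, query):
    """Predict a 0/1 label vector for query with the maximum a posteriori rule."""
    X, Y, k, s = model["X"], model["Y"], model["k"], model["s"]
    nbrs = knn_indices(X, query, k)
    labels = []
    for l, p1 in enumerate(model["prior"]):
        j = sum(Y[n][l] for n in nbrs)  # neighbors of the query that have label l
        has, lacks = model["has"][l], model["lacks"][l]
        like1 = (s + has[j]) / (s * (k + 1) + sum(has))
        like0 = (s + lacks[j]) / (s * (k + 1) + sum(lacks))
        labels.append(1 if p1 * like1 > (1 - p1) * like0 else 0)
    return labels
```

The cost is dominated by `knn_indices`, which scans every training instance for each query. That search step is what the three strategies named in the abstract (broadcasting instances, a tree-based index, hash tables) replace in a distributed setting.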