The K-means clustering algorithm groups similar objects into a given number of clusters. It refines the cluster center points iteratively until the clusters converge, that is, until the centers stop moving and the intra-cluster deviation no longer decreases. The MapReduce framework is used to cluster large sets of data points. The source code below processes input files that contain the data points and the initial center points. In the Mapper class, the map function first reads the centers from the centers.txt file. It then reads the input rows (the data) from the input.txt file and calculates the distance from each row to each center. For each row, it produces an output pair with the cluster id of the nearest center as the key and the coordinates of the row as the value. After the shuffle and sort phase completes, the output is transferred to the reducer. In the Reducer class, the reduce function calculates the mean of the coordinates that belong to each cluster, and the mean values are written to the centers.txt file as the new centers.
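// Map function: find the nearest center for each point
Map Phase input: <k1, v1>
k1 - line number
v1 - point (coordinates of the row)
// Find the center nearest to the point
For each center
    Compute the distance from the center to the point
    Keep the center with the minimum distance
Map Phase output: <k2, v2>
k2 - cluster id of the nearest center
v2 - point (coordinates of the row)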
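The map step can be sketched in plain Python, outside of any Hadoop API. This is only an illustration under stated assumptions: the names nearest_center and map_point are hypothetical, rows are assumed to be comma-separated numbers, the distance is assumed to be Euclidean (the text does not say which distance is used), and the cluster id is assumed to be the position of the center in the list.

```python
import math


def nearest_center(point, centers):
    """Return the index of the center closest to `point`.

    Assumption: Euclidean distance; the cluster id is the list index.
    """
    best_id, best_dist = -1, math.inf
    for cluster_id, center in enumerate(centers):
        dist = math.dist(point, center)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id


def map_point(line_number, line, centers):
    """Map step: <k1, v1> -> <k2, v2>.

    k1 is the line number (unused here), v1 is the raw row.
    Assumption: the row holds comma-separated coordinates.
    """
    point = tuple(float(x) for x in line.split(","))
    return nearest_center(point, centers), point
```

For example, with centers (0, 0) and (10, 10), the row "1.0,2.0" is assigned to cluster 0 and the row "9,8" to cluster 1.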
// Reduce function: compute the new cluster centers
Reduce Phase input: <k2, List<v2>>
Calculate the mean value of the v2 points
new center point = mean value
k3 - new center point
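The reduce step can be sketched in the same way. The function name reduce_cluster is hypothetical, and returning the cluster id together with the new center is a convenience of this sketch rather than something the pseudocode specifies. The sketch assumes every cluster passed in has at least one point, which matches how a reducer is only called for keys that received values.

```python
def reduce_cluster(cluster_id, points):
    """Reduce step: <k2, List<v2>> -> new center point.

    The new center is the coordinate-wise mean of the points in the cluster.
    Assumption: `points` is non-empty.
    """
    points = list(points)
    count = len(points)
    # zip(*points) groups the values of each dimension together.
    new_center = tuple(sum(dim) / count for dim in zip(*points))
    return cluster_id, new_center
```

For the points (1, 2) and (3, 4) in cluster 0, the new center is (2.0, 3.0).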
Repeat the process until the clusters converge.
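The iteration until convergence can be illustrated with a small in-memory driver. This is a sketch, not the Hadoop job itself: a dictionary stands in for the shuffle and sort phase, and the names kmeans_iteration and run_kmeans, the tolerance value, and the max_iterations cap are all assumptions made for the example. Convergence is taken here to mean that no center moves by more than the tolerance between two passes.

```python
import math
from collections import defaultdict


def kmeans_iteration(points, centers):
    """One map/shuffle/reduce pass; returns the updated list of centers."""
    groups = defaultdict(list)  # stands in for the shuffle and sort phase
    for point in points:
        # Map: assign the point to its nearest center (Euclidean, assumed).
        cluster_id = min(range(len(centers)),
                         key=lambda i: math.dist(point, centers[i]))
        groups[cluster_id].append(point)
    # Assumption: a center that received no points keeps its old position.
    new_centers = list(centers)
    for cluster_id, members in groups.items():
        # Reduce: the new center is the coordinate-wise mean of the members.
        new_centers[cluster_id] = tuple(
            sum(dim) / len(members) for dim in zip(*members))
    return new_centers


def run_kmeans(points, centers, tolerance=1e-6, max_iterations=100):
    """Repeat passes until no center moves by more than `tolerance`."""
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_centers = kmeans_iteration(points, centers)
        shift = max(math.dist(old, new)
                    for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return centers, iteration
```

In the MapReduce setting, each pass of the loop would correspond to one job run, with the new centers written back to centers.txt before the next run.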
First column – Document id
Second column – cluster id