This project aims to improve the scalability of the k-means clustering algorithm and to compare the scalability of the proposed approach on several large datasets. To scale well, k-means must form clusters faster, that is, with a shorter running time. To achieve this, this work proposes selecting the initial centroids as close as possible to the ideal cluster centroids, so that the algorithm takes less time to converge.
To reduce the runtime of the k-means algorithm.
To achieve scalability.
The k-means algorithm can be further improved by using a sampling process to draw subsets of the big data. Processing these subsets yields sets of cluster centers, which can then be used to cluster the original dataset. These cluster centroids are selected using the distribution-based merge clustering algorithm.
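
The sampling idea can be illustrated with a minimal sketch in pure Python. It rests on several assumptions that are not taken from the project itself: the function names (`lloyd`, `sampled_init`) and the parameters `n_subsets` and `subset_size` are hypothetical, points are tuples of numbers, and the merge step is a stand-in that simply runs plain k-means over the pooled subset centers. It is not the distribution-based merge clustering algorithm, whose details are not given here.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    # Coordinate-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def lloyd(points, centroids, max_iter=100):
    """Plain Lloyd's k-means from the given starting centroids.

    Returns (centroids, number_of_iterations_used)."""
    centroids = list(centroids)
    for it in range(1, max_iter + 1):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            groups[j].append(p)
        # Update step: an empty cluster keeps its previous centroid.
        new = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if new == centroids:
            return new, it
        centroids = new
    return centroids, max_iter

def sampled_init(points, k, n_subsets=5, subset_size=200, rng=None):
    """Estimate k initial centroids from random subsets of the data.

    Assumption: pooled subset centers are merged with plain k-means,
    used here only as a placeholder for the project's merge algorithm."""
    rng = rng or random.Random(0)
    pooled = []
    for _ in range(n_subsets):
        subset = rng.sample(points, min(subset_size, len(points)))
        centers, _ = lloyd(subset, rng.sample(subset, k))
        pooled.extend(centers)
    merged, _ = lloyd(pooled, rng.sample(pooled, k))
    return merged
```

A possible use is to call `sampled_init(points, k)` once and pass the result to `lloyd(points, init)` on the full dataset. The iteration count returned by `lloyd` can then be compared against a run started from randomly chosen centroids to see whether the sampled start converges sooner on a given dataset.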
