How to Determine the Optimal Number of Clusters in K-Means Using the Silhouette Method in Python?
Condition for Finding Optimal Number of Clusters in K-Means Algorithm Using Silhouette Method
Description:
The K-means clustering algorithm is a popular unsupervised machine learning technique used to
partition data into distinct groups or clusters. One challenge with K-means is determining the
optimal number of clusters (k). While the elbow method is widely used for this purpose,
the silhouette method is often easier to interpret, because it accounts for both
intra-cluster cohesion and inter-cluster separation in a single score.
Silhouette Method:
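The silhouette method measures, for each point, how similar it is to its own cluster compared with the nearest neighboring cluster.
For a point i, let a(i) be the mean distance to the other points in its cluster and b(i) the mean distance to the points in the
nearest other cluster. The silhouette coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which ranges from -1 to 1.
A score close to 1 indicates that points are tightly grouped and well separated from other clusters; a score near 0 indicates
overlapping clusters, and a negative score suggests points may be assigned to the wrong cluster.
To choose k, calculate the average silhouette score for a range of candidate k values and select the one that maximizes it.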
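To make the definition concrete, the following sketch computes the silhouette coefficient by hand on a tiny toy dataset and compares it with scikit-learn's `silhouette_samples`. The data values and the helper name `silhouette_by_hand` are illustrative assumptions, not part of the original article.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy 1-D data with two obvious groups (illustrative values only)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_by_hand(X, labels):
    """Hypothetical helper: s(i) = (b - a) / max(a, b) for every point."""
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # singleton cluster: score is 0 by convention
        # a: mean distance to the other points in the same cluster
        a = dists[i, same].mean()
        # b: smallest mean distance to the points of any other cluster
        b = min(
            dists[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


manual = silhouette_by_hand(X, labels)
print("Per-point silhouette (manual): ", manual)
print("Per-point silhouette (sklearn):", silhouette_samples(X, labels))
print("Average silhouette score:", silhouette_score(X, labels))
```

The average of the per-point values is what `silhouette_score` reports, and it is this average that gets compared across different values of k.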
Step-by-Step Process
Import Required Libraries:
Import libraries for data manipulation (e.g., NumPy, Pandas), clustering (e.g., KMeans), and visualization (e.g., Matplotlib, Seaborn).
Load and Prepare Dataset:
Load a dataset suitable for clustering, such as the Iris dataset.
Preprocess the Data:
Standardize or normalize the features to ensure all features contribute equally to the clustering process.
Run K-Means for Different Values of k:
Perform K-Means clustering for a range of k values (e.g., from 2 to 10).
Compute Silhouette Scores for Each k:
Calculate the silhouette score for each k value and evaluate the quality of clustering.
Plot Silhouette Scores vs. k:
Plot the silhouette scores for different values of k to visually determine the optimal number of clusters.
Visualize the Clusters and Silhouette Scores:
Use PCA for dimensionality reduction and plot the clustering result. Optionally, plot a silhouette plot.
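Evaluate Against Known Labels (Optional):
If ground-truth labels are available (as with Iris), compare the cluster assignments against them using external metrics such as the adjusted Rand index. Cluster IDs are arbitrary, so plain classification accuracy is only meaningful after mapping clusters to classes.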
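The final evaluation step can be sketched as follows when true labels exist. This is a minimal example that assumes the adjusted Rand index (ARI) and normalized mutual information (NMI) as the external metrics; the article does not name specific ones. Fixing k=3 and `n_init=10` are also assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# k=3 is chosen here only because Iris contains three species
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Both metrics ignore how cluster IDs are numbered, so no label mapping is needed
ari = adjusted_rand_score(iris.target, labels)
nmi = normalized_mutual_info_score(iris.target, labels)
print(f"Adjusted Rand index: {ari:.3f}")
print(f"Normalized mutual information: {nmi:.3f}")
```

Because these metrics need ground-truth labels, they serve as a sanity check on benchmark datasets rather than as a way to choose k in a genuinely unsupervised setting.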
Why Should We Choose the Silhouette Method?
Quantitative Metric:
Unlike the elbow method, which relies on visual interpretation, the silhouette score gives a
numerical measure of the quality of clustering.
Well-defined clusters:
It provides insight not only into how tight the clusters are, but also how well-separated they are
from other clusters.
Versatile:
It can be used with any clustering algorithm that assigns each point to a cluster, not just K-Means.
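As an illustration of that versatility, the sketch below applies the same silhouette-based selection to agglomerative (hierarchical) clustering instead of K-Means. The choice of `AgglomerativeClustering` and the range of k values are assumptions made for this example.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

Only the labels and the feature matrix are passed to `silhouette_score`, which is why the metric does not depend on how the clusters were produced.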
Sample Source Code
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import seaborn as sns
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Range of k values to try
k_values = range(2, 11)
# Store silhouette scores for each k
silhouette_scores = []
# Perform K-Means clustering for different k values
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    score = silhouette_score(X_scaled, kmeans.labels_)
    silhouette_scores.append(score)
# Plot Silhouette Scores for different k
plt.figure(figsize=(8,6))
plt.plot(k_values, silhouette_scores, marker='o', color='b', linestyle='--')
plt.title('Silhouette Scores for Different Values of K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()
# Find the optimal k (k with highest silhouette score)
optimal_k = k_values[np.argmax(silhouette_scores)]
print(f"Optimal number of clusters (k): {optimal_k}")
# Perform K-Means with the optimal k
kmeans_optimal = KMeans(n_clusters=optimal_k, random_state=42)
y_kmeans = kmeans_optimal.fit_predict(X_scaled)
# Visualize the clusters using PCA (to reduce to 2D)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Plot the clustering result
plt.figure(figsize=(8,6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y_kmeans, palette='Set2', s=100, edgecolor='black')
plt.title(f"K-Means Clustering with Optimal K={optimal_k}")
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()
# Silhouette plot for the final clustering
from sklearn.metrics import silhouette_samples