K-means Clustering
K-means is a popular unsupervised learning algorithm that partitions data into k clusters by grouping similar data points together and identifying underlying patterns in the data.
Resources: Scikit-learn K-means | Pattern Recognition and Machine Learning - Chapter 9
Summary
K-means clustering is an unsupervised machine learning algorithm that aims to partition n observations into k clusters where each observation belongs to the cluster with the nearest centroid (cluster center). It's one of the simplest and most popular clustering algorithms.
Key characteristics:
- Unsupervised: No labeled data required
- Centroid-based: Uses cluster centers to define clusters
- Hard clustering: Each point belongs to exactly one cluster
- Iterative: Uses an expectation-maximization-style approach
- Distance-based: Typically uses Euclidean distance
Applications:
- Customer segmentation in marketing
- Image segmentation and compression
- Market research and analysis
- Data preprocessing for other ML algorithms
- Gene sequencing analysis
- Recommendation systems
When to use K-means:
- When you know the approximate number of clusters
- When clusters are roughly spherical and similar in size
- When you need interpretable results
- When computational efficiency is important
Intuition
How K-means Works
K-means follows a simple iterative process:
- Initialize k cluster centroids randomly
- Assign each data point to the nearest centroid
- Update centroids by calculating the mean of assigned points
- Repeat steps 2-3 until centroids stop moving significantly
Mathematical Foundation
1. Objective Function
K-means minimizes the Within-Cluster Sum of Squares (WCSS):
\[J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2\]
Where:
- \(J\) is the objective function to minimize
- \(k\) is the number of clusters
- \(C_i\) is the set of points in cluster \(i\)
- \(\mu_i\) is the centroid of cluster \(i\)
- \(||x - \mu_i||^2\) is the squared Euclidean distance
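As a quick check, the objective can be computed directly from the cluster assignments; the minimal sketch below (variable names are illustrative) reproduces scikit-learn's inertia_ value.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Illustrative data; any (n_samples, n_features) array works
X_toy, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)
labels, centroids = km_toy.labels_, km_toy.cluster_centers_
# WCSS: sum of squared distances from each point to its assigned centroid
wcss = sum(np.sum((X_toy[labels == i] - centroids[i]) ** 2) for i in range(3))
print(wcss, km_toy.inertia_)  # the two values agree up to floating-point error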
2. Centroid Update Rule
The centroid of cluster \(i\) is updated as:
\[\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x\]
Where \(|C_i|\) is the number of points in cluster \(i\).
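A minimal sketch of this update on toy data: each new centroid is simply the mean of the points currently assigned to it.
import numpy as np
# Six 2-D points already assigned to clusters 0 and 1
X_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
# mu_i = (1 / |C_i|) * sum of the points in C_i
new_centroids = np.array([X_pts[labels == i].mean(axis=0) for i in range(2)])
print(new_centroids)  # [[0.333 0.333] [5.333 5.333]]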
3. Distance Calculation
Euclidean distance between point \(x\) and centroid \(\mu_i\):
\[||x - \mu_i|| = \sqrt{\sum_{j=1}^{d} (x_j - \mu_{ij})^2}\]
Where \(d\) is the number of dimensions.
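For reference, a short sketch computing the same distance three equivalent ways (explicit formula, np.linalg.norm, and SciPy's cdist, which the from-scratch implementation below also uses).
import numpy as np
from scipy.spatial.distance import cdist
x = np.array([1.0, 2.0, 3.0])    # a single point (d = 3)
mu = np.array([0.0, 0.0, 1.0])   # a centroid
d1 = np.sqrt(np.sum((x - mu) ** 2))        # explicit formula
d2 = np.linalg.norm(x - mu)                # vector norm
d3 = cdist(x[None, :], mu[None, :])[0, 0]  # pairwise-distance helper
print(d1, d2, d3)  # all three equal sqrt(1 + 4 + 4) = 3.0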
4. Convergence Criteria
The algorithm stops when:
- Centroids don't move significantly: \(||\mu_i^{(t+1)} - \mu_i^{(t)}|| < \epsilon\)
- The maximum number of iterations is reached
- No points change cluster assignments
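A minimal sketch of the first stopping rule, assuming old_centroids and new_centroids are (k, d) arrays from consecutive iterations.
import numpy as np
tol = 1e-4  # epsilon in the convergence criterion
old_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
new_centroids = np.array([[0.00001, 0.0], [5.0, 5.00002]])
# Total centroid movement between iterations; stop once it falls below tol
shift = np.linalg.norm(new_centroids - old_centroids)
print(shift, shift < tol)  # tiny shift, so the check reports convergence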
Algorithm Complexity
- Time Complexity: \(O(n \cdot k \cdot d \cdot t)\)
  - \(n\): number of data points
  - \(k\): number of clusters
  - \(d\): number of dimensions
  - \(t\): number of iterations
- Space Complexity: \(O(n \cdot d + k \cdot d)\)
Implementation using Libraries
Using Scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.decomposition import PCA
# Generate sample data
np.random.seed(42)
X, y_true = make_blobs(
    n_samples=300, 
    centers=4, 
    n_features=2, 
    cluster_std=0.8,
    random_state=42
)
# Basic K-means clustering
kmeans = KMeans(
    n_clusters=4,
    init='k-means++',      # Smart initialization
    n_init=10,             # Number of initializations
    max_iter=300,          # Maximum iterations
    tol=1e-4,              # Convergence tolerance
    random_state=42
)
# Fit the model
kmeans.fit(X)
# Get predictions and centroids
y_pred = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_  # WCSS value
print(f"Inertia (WCSS): {inertia:.2f}")
print(f"Silhouette Score: {silhouette_score(X, y_pred):.3f}")
# Visualize results
plt.figure(figsize=(15, 5))
# Original data with true labels
plt.subplot(1, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
plt.title('True Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
# K-means results
plt.subplot(1, 3, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.7)
plt.scatter(centroids[:, 0], centroids[:, 1], 
           marker='x', s=200, linewidths=3, color='red')
plt.title('K-means Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
# Comparison
plt.subplot(1, 3, 3)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.5, label='True')
plt.scatter(centroids[:, 0], centroids[:, 1], 
           marker='x', s=200, linewidths=3, color='red', label='Centroids')
plt.title('Comparison')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.tight_layout()
plt.show()
# Compare with the ground-truth labels (ARI, not classification accuracy)
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
Determining Optimal Number of Clusters
Elbow Method
def plot_elbow_method(X, max_k=10):
    """Plot elbow method to find optimal k"""
    wcss = []
    k_range = range(1, max_k + 1)
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    plt.figure(figsize=(10, 6))
    plt.plot(k_range, wcss, 'bo-')
    plt.title('Elbow Method for Optimal k')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
    plt.grid(True, alpha=0.3)
    # Find elbow point using second derivative
    second_derivative = np.diff(wcss, 2)
    elbow_point = np.argmax(second_derivative) + 2
    plt.axvline(x=elbow_point, color='red', linestyle='--', 
                label=f'Elbow at k={elbow_point}')
    plt.legend()
    plt.show()
    return wcss
wcss_values = plot_elbow_method(X, max_k=10)
Silhouette Method
def plot_silhouette_method(X, max_k=10):
    """Plot silhouette scores for different k values"""
    silhouette_scores = []
    k_range = range(2, max_k + 1)
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X)
        score = silhouette_score(X, labels)
        silhouette_scores.append(score)
    plt.figure(figsize=(10, 6))
    plt.plot(k_range, silhouette_scores, 'go-')
    plt.title('Silhouette Method for Optimal k')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.grid(True, alpha=0.3)
    # Find optimal k
    optimal_k = k_range[np.argmax(silhouette_scores)]
    plt.axvline(x=optimal_k, color='red', linestyle='--', 
                label=f'Optimal k={optimal_k}')
    plt.legend()
    plt.show()
    return silhouette_scores
silhouette_values = plot_silhouette_method(X, max_k=10)
Real-world Example: Iris Dataset
# Load Iris dataset
iris = load_iris()
X_iris = iris.data
y_true_iris = iris.target
# Standardize features
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)
# Apply K-means
kmeans_iris = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred_iris = kmeans_iris.fit_predict(X_iris_scaled)
# Evaluate
silhouette_iris = silhouette_score(X_iris_scaled, y_pred_iris)
ari_iris = adjusted_rand_score(y_true_iris, y_pred_iris)
print(f"Iris Dataset Results:")
print(f"Silhouette Score: {silhouette_iris:.3f}")
print(f"Adjusted Rand Index: {ari_iris:.3f}")
# Visualize using PCA for dimensionality reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_iris_scaled)
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_true_iris, cmap='viridis', alpha=0.7)
plt.title('True Species (PCA)')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.subplot(1, 3, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred_iris, cmap='viridis', alpha=0.7)
plt.title('K-means Clusters (PCA)')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
# Feature importance
feature_names = iris.feature_names
centroids_original = scaler.inverse_transform(kmeans_iris.cluster_centers_)
plt.subplot(1, 3, 3)
for i, centroid in enumerate(centroids_original):
    plt.plot(feature_names, centroid, 'o-', label=f'Cluster {i}')
plt.title('Cluster Centroids (Original Features)')
plt.xlabel('Features')
plt.ylabel('Values')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
# Print cluster characteristics
print("\nCluster Centroids (Original Scale):")
for i, centroid in enumerate(centroids_original):
    print(f"Cluster {i}:")
    for feature, value in zip(feature_names, centroid):
        print(f"  {feature}: {value:.2f}")
    print()
Mini-batch K-means for Large Datasets
from sklearn.cluster import MiniBatchKMeans
import time
# Generate larger dataset
X_large, _ = make_blobs(n_samples=10000, centers=5, n_features=10, random_state=42)
# Compare standard K-means vs Mini-batch K-means
print("Comparing K-means vs Mini-batch K-means:")
# Standard K-means
start_time = time.time()
kmeans_standard = KMeans(n_clusters=5, random_state=42, n_init=10)
labels_standard = kmeans_standard.fit_predict(X_large)
time_standard = time.time() - start_time
# Mini-batch K-means
start_time = time.time()
kmeans_mini = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=42)
labels_mini = kmeans_mini.fit_predict(X_large)
time_mini = time.time() - start_time
print(f"Standard K-means - Time: {time_standard:.3f}s, Inertia: {kmeans_standard.inertia_:.2f}")
print(f"Mini-batch K-means - Time: {time_mini:.3f}s, Inertia: {kmeans_mini.inertia_:.2f}")
print(f"Speedup: {time_standard/time_mini:.1f}x")
# Compare clustering quality
ari_comparison = adjusted_rand_score(labels_standard, labels_mini)
print(f"Agreement between methods (ARI): {ari_comparison:.3f}")
From Scratch Implementation
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
class KMeansFromScratch:
    """K-means clustering implementation from scratch"""
    def __init__(self, k=3, max_iters=100, tol=1e-4, init='k-means++', random_state=None):
        """
        Initialize K-means clusterer
        Parameters:
        -----------
        k : int, default=3
            Number of clusters
        max_iters : int, default=100
            Maximum number of iterations
        tol : float, default=1e-4
            Tolerance for convergence
        init : str, default='k-means++'
            Initialization method ('random' or 'k-means++')
        random_state : int, default=None
            Random seed for reproducibility
        """
        self.k = k
        self.max_iters = max_iters
        self.tol = tol
        self.init = init
        self.random_state = random_state
        # Initialize attributes
        self.centroids = None
        self.labels = None
        self.inertia_ = None
        self.n_iter_ = 0
    def _initialize_centroids(self, X):
        """Initialize centroids using specified method"""
        if self.random_state is not None:
            np.random.seed(self.random_state)
        n_samples, n_features = X.shape
        if self.init == 'random':
            # Random initialization
            min_vals = np.min(X, axis=0)
            max_vals = np.max(X, axis=0)
            centroids = np.random.uniform(min_vals, max_vals, (self.k, n_features))
        elif self.init == 'k-means++':
            # K-means++ initialization for better initial centroids
            centroids = np.zeros((self.k, n_features))
            # Choose first centroid randomly
            centroids[0] = X[np.random.randint(n_samples)]
            for i in range(1, self.k):
                # Calculate distances from each point to nearest centroid
                distances = np.array([min([np.linalg.norm(x - c)**2 for c in centroids[:i]]) 
                                    for x in X])
                # Choose next centroid with probability proportional to squared distance
                probabilities = distances / distances.sum()
                cumulative_probabilities = probabilities.cumsum()
                r = np.random.rand()
                for j, p in enumerate(cumulative_probabilities):
                    if r < p:
                        centroids[i] = X[j]
                        break
        else:
            raise ValueError("init must be 'random' or 'k-means++'")
        return centroids
    def _assign_clusters(self, X, centroids):
        """Assign each point to the nearest centroid"""
        # Calculate distances from each point to each centroid
        distances = cdist(X, centroids, metric='euclidean')
        # Assign each point to the nearest centroid
        labels = np.argmin(distances, axis=1)
        return labels
    def _update_centroids(self, X, labels):
        """Update centroids based on current cluster assignments"""
        centroids = np.zeros((self.k, X.shape[1]))
        for i in range(self.k):
            # Find points belonging to cluster i
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                # Update centroid as mean of cluster points
                centroids[i] = np.mean(cluster_points, axis=0)
            else:
                # Keep old centroid if no points assigned to cluster
                centroids[i] = self.centroids[i]
        return centroids
    def _calculate_inertia(self, X, labels, centroids):
        """Calculate within-cluster sum of squares (inertia)"""
        inertia = 0
        for i in range(self.k):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                inertia += np.sum((cluster_points - centroids[i])**2)
        return inertia
    def fit(self, X):
        """Fit K-means clustering to data"""
        # Initialize centroids
        self.centroids = self._initialize_centroids(X)
        # Store convergence history
        self.centroid_history = [self.centroids.copy()]
        self.inertia_history = []
        for iteration in range(self.max_iters):
            # Assign points to clusters
            labels = self._assign_clusters(X, self.centroids)
            # Update centroids
            new_centroids = self._update_centroids(X, labels)
            # Calculate inertia
            inertia = self._calculate_inertia(X, labels, new_centroids)
            self.inertia_history.append(inertia)
            # Check for convergence (track iteration count even if we never converge)
            centroid_shift = np.linalg.norm(new_centroids - self.centroids)
            self.n_iter_ = iteration + 1
            if centroid_shift < self.tol:
                break
            # Update centroids
            self.centroids = new_centroids
            self.centroid_history.append(self.centroids.copy())
        # Final assignments
        self.labels = self._assign_clusters(X, self.centroids)
        self.inertia_ = self._calculate_inertia(X, self.labels, self.centroids)
        return self
    def predict(self, X):
        """Predict cluster labels for new data"""
        if self.centroids is None:
            raise ValueError("Model must be fitted before predicting")
        return self._assign_clusters(X, self.centroids)
    def fit_predict(self, X):
        """Fit model and return cluster labels"""
        self.fit(X)
        return self.labels
    def plot_convergence(self):
        """Plot convergence of inertia over iterations"""
        if not hasattr(self, 'inertia_history'):
            raise ValueError("Model must be fitted first")
        plt.figure(figsize=(10, 6))
        plt.plot(range(1, len(self.inertia_history) + 1), self.inertia_history, 'b-o')
        plt.title('K-means Convergence')
        plt.xlabel('Iteration')
        plt.ylabel('Inertia (WCSS)')
        plt.grid(True, alpha=0.3)
        plt.show()
    def plot_clusters(self, X):
        """Visualize clusters (works for 2D data)"""
        if X.shape[1] != 2:
            raise ValueError("Visualization only works for 2D data")
        if self.labels is None:
            raise ValueError("Model must be fitted first")
        plt.figure(figsize=(10, 8))
        # Plot points colored by cluster
        colors = plt.cm.viridis(np.linspace(0, 1, self.k))
        for i in range(self.k):
            cluster_points = X[self.labels == i]
            plt.scatter(cluster_points[:, 0], cluster_points[:, 1], 
                       c=[colors[i]], alpha=0.7, label=f'Cluster {i}')
        # Plot centroids
        plt.scatter(self.centroids[:, 0], self.centroids[:, 1], 
                   c='red', marker='x', s=200, linewidths=3, label='Centroids')
        plt.title(f'K-means Clustering (k={self.k})')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()
# Example usage of custom implementation
print("Testing custom K-means implementation:")
# Generate sample data
np.random.seed(42)
X_test, y_true_test = make_blobs(n_samples=300, centers=3, n_features=2, 
                                cluster_std=1.0, random_state=42)
# Fit custom K-means
kmeans_custom = KMeansFromScratch(k=3, max_iters=100, init='k-means++', random_state=42)
labels_custom = kmeans_custom.fit_predict(X_test)
print(f"Custom K-means converged in {kmeans_custom.n_iter_} iterations")
print(f"Final inertia: {kmeans_custom.inertia_:.2f}")
# Compare with sklearn
from sklearn.cluster import KMeans
kmeans_sklearn = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_sklearn = kmeans_sklearn.fit_predict(X_test)
print(f"Sklearn K-means inertia: {kmeans_sklearn.inertia_:.2f}")
print(f"Agreement between implementations: {adjusted_rand_score(labels_custom, labels_sklearn):.3f}")
# Visualize results
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_true_test, cmap='viridis', alpha=0.7)
plt.title('True Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.subplot(1, 3, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=labels_custom, cmap='viridis', alpha=0.7)
plt.scatter(kmeans_custom.centroids[:, 0], kmeans_custom.centroids[:, 1], 
           marker='x', s=200, linewidths=3, color='red')
plt.title('Custom K-means')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.subplot(1, 3, 3)
plt.scatter(X_test[:, 0], X_test[:, 1], c=labels_sklearn, cmap='viridis', alpha=0.7)
plt.scatter(kmeans_sklearn.cluster_centers_[:, 0], kmeans_sklearn.cluster_centers_[:, 1], 
           marker='x', s=200, linewidths=3, color='red')
plt.title('Sklearn K-means')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.tight_layout()
plt.show()
# Plot convergence
kmeans_custom.plot_convergence()
Assumptions and Limitations
Assumptions
- Spherical Clusters: K-means assumes clusters are roughly spherical and have similar sizes
- Similar Variance: Clusters should have similar variance in all directions
- Isotropic Clusters: Equal variance in all dimensions
- Fixed Number of Clusters: You must specify k in advance
- Euclidean Distance: Uses Euclidean distance metric (sensitive to scale)
Limitations
- Sensitive to Initialization: Can converge to local optima
- Requires Preprocessing: Sensitive to feature scaling and outliers
- Difficulty with Non-spherical Clusters: Performs poorly on elongated or irregular shapes
- Fixed k: Need to know or estimate the number of clusters
- Sensitive to Outliers: Outliers can significantly affect centroids
- Equal Cluster Size Assumption: Tends to create clusters of similar sizes
Comparison with Other Clustering Algorithms
| Algorithm | Advantages | Disadvantages | Best Use Cases | 
|---|---|---|---|
| K-means | Fast, simple, works well with spherical clusters | Requires k, sensitive to initialization, assumes spherical clusters | Customer segmentation, image compression | 
| Hierarchical | No need to specify k, creates dendrogram | Slow O(n³), sensitive to noise | Small datasets, understanding cluster hierarchy |
| DBSCAN | Finds arbitrary shaped clusters, robust to outliers | Sensitive to parameters, struggles with varying densities | Anomaly detection, irregular shaped clusters | 
| Gaussian Mixture | Soft clustering, handles elliptical clusters | More complex, requires knowing k | When cluster overlap is expected | 
When NOT to Use K-means
- Non-spherical clusters: Use DBSCAN or spectral clustering (see the sketch after this list)
- Varying cluster densities: Use DBSCAN
- Unknown number of clusters: Use hierarchical clustering or DBSCAN
- Categorical data: Use K-modes or mixed-type clustering
- High-dimensional data: Consider dimensionality reduction first
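To make the first two points concrete, here is a short sketch comparing K-means and DBSCAN on make_moons, an illustrative non-spherical dataset (the eps and min_samples values are assumptions tuned for this toy data, not general-purpose settings).
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score
# Two interleaving half-moons: clearly two clusters, but not spherical
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=42)
# K-means cuts the moons with a straight boundary
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
# DBSCAN follows the arbitrary shapes (parameters chosen for this toy data)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)
print("K-means ARI:", adjusted_rand_score(y_moons, km_labels))  # typically well below 1
print("DBSCAN ARI: ", adjusted_rand_score(y_moons, db_labels))  # typically close to 1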
Interview Questions
1. Explain the K-means algorithm step by step.
Answer:
K-means follows these steps:
- Initialization: Choose k cluster centers (centroids) randomly or using k-means++
- Assignment: Assign each data point to the nearest centroid based on Euclidean distance
- Update: Recalculate centroids as the mean of all points assigned to each cluster
- Convergence Check: Repeat steps 2-3 until centroids stop moving significantly or max iterations reached
Mathematical formulation:
- Objective: Minimize \(J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2\)
- Centroid update: \(\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x\)
Time complexity: \(O(n \cdot k \cdot d \cdot t)\), where n = samples, k = clusters, d = dimensions, t = iterations
2. What are the main assumptions and limitations of K-means?
Answer:
Assumptions:
- Clusters are spherical and have similar sizes
- Features have similar variances (isotropic)
- The number of clusters k is known
- Data is continuous and suitable for Euclidean distance
Limitations:
- Sensitive to initialization (can converge to local optima)
- Requires specifying k in advance
- Assumes spherical clusters of similar size
- Sensitive to outliers and feature scaling
- Poor performance on non-convex clusters
- Hard clustering (each point belongs to exactly one cluster)
3. How do you determine the optimal number of clusters (k)?
Answer:
Methods to determine optimal k:
- Elbow Method: Plot WCSS vs k, look for the "elbow" point
- Silhouette Method: Choose k with highest average silhouette score
- Gap Statistic: Compare within-cluster dispersion with expected dispersion
- Information Criteria: Use AIC/BIC for model selection
- Domain Knowledge: Use business/domain expertise
Elbow Method Example:
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    wcss.append(kmeans.fit(X).inertia_)
# Plot and look for elbow
4. What is the difference between K-means and K-means++?
Answer:
K-means++ is an initialization method, not a different algorithm:
Standard K-means initialization:
- Randomly selects k points as initial centroids
- Can lead to poor convergence and local optima
- Results may vary significantly between runs
K-means++ initialization:
- First centroid chosen randomly
- Subsequent centroids chosen with probability proportional to the squared distance from the nearest existing centroid
- Provides better initial centroids, leading to better final clustering
- More consistent results across multiple runs
- Typically converges faster with a better final objective value
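A short sketch of the practical difference using scikit-learn's init parameter; with a single initialization (n_init=1), the random variant is more likely to end up with a worse (higher) inertia:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X_demo, _ = make_blobs(n_samples=500, centers=6, cluster_std=0.7, random_state=0)
# One run each so the effect of the initialization is visible
km_pp = KMeans(n_clusters=6, init='k-means++', n_init=1, random_state=0).fit(X_demo)
km_rand = KMeans(n_clusters=6, init='random', n_init=1, random_state=0).fit(X_demo)
print("k-means++ inertia:", km_pp.inertia_)
print("random inertia:   ", km_rand.inertia_)  # often, though not always, higher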
5. How does K-means handle outliers and what can you do about it?
Answer:
How K-means handles outliers:
- Outliers significantly affect centroid positions since centroids are calculated as means
- Can cause centroids to shift away from the main cluster mass
- May create clusters around outliers
- Reduces overall clustering quality
Solutions:
- Preprocessing (see the sketch after this list):
  - Remove outliers using IQR, Z-score, or isolation forest
  - Use robust scaling instead of standard scaling
- Alternative algorithms:
  - K-medoids (uses actual data points, medoids, as centers instead of means)
  - DBSCAN (treats outliers as noise)
- Outlier-aware variants:
  - Trimmed K-means (removes a certain percentage of the farthest points)
  - Robust K-means with M-estimators
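A minimal sketch of the Z-score preprocessing option on toy data with a few injected outliers (the threshold of 3 is a common but arbitrary choice):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Toy data with a handful of extreme points appended as outliers
X_raw, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
X_raw = np.vstack([X_raw, [[50, 50], [-40, 60], [60, -55]]])
# Z-score filter: keep rows whose features all lie within 3 standard deviations
z = np.abs((X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0))
X_clean = X_raw[(z < 3).all(axis=1)]
print(f"Rows removed as outliers: {len(X_raw) - len(X_clean)}")
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_clean)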
6. Compare K-means with Hierarchical clustering.
Answer:
| Aspect | K-means | Hierarchical | 
|---|---|---|
| k specification | Must specify k | No need to specify k | 
| Time complexity | O(nkdt) | O(n³) for agglomerative |
| Shape assumption | Spherical clusters | Any shape | 
| Scalability | Good for large datasets | Poor for large datasets | 
| Deterministic | No (depends on initialization) | Yes | 
| Output | Flat partitioning | Dendrogram hierarchy | 
| Interpretability | Cluster centers | Hierarchy of merges | 
| Memory usage | Low | High O(n²) |
7. What is the objective function of K-means and how is it optimized?
Answer:
Objective Function (Within-Cluster Sum of Squares): \(J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2\)
Optimization:
- K-means uses Lloyd's algorithm (an expectation-maximization-style procedure)
- E-step: Assign points to the nearest centroids (minimize J w.r.t. cluster assignments)
- M-step: Update centroids as cluster means (minimize J w.r.t. centroids)
Key properties:
- Guaranteed to converge (the objective function decreases monotonically)
- May converge to a local minimum; convergence to the global optimum is not guaranteed
- Each step reduces or maintains the objective value
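The monotone-decrease property can be checked empirically with the KMeansFromScratch class from the From Scratch section above, which records the objective after every iteration (a sketch assuming that class and make_blobs are already available):
# Verify the objective never increases across Lloyd iterations
X_check, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km_check = KMeansFromScratch(k=4, init='k-means++', random_state=0)
km_check.fit(X_check)
history = km_check.inertia_history
non_increasing = all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
print("Objective non-increasing across iterations:", non_increasing)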
8. How do you evaluate the quality of K-means clustering?
Answer:
Internal Metrics (no ground truth needed):
- Silhouette Score: Measures how similar points are to their own cluster vs other clusters; range [-1, 1], higher is better
- Within-Cluster Sum of Squares (WCSS): Lower is better
- Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance
External Metrics (with ground truth):
- Adjusted Rand Index (ARI): Measures similarity to true clustering
- Normalized Mutual Information (NMI): Information-theoretic measure
- Fowlkes-Mallows Index: Geometric mean of precision and recall
Example:
from sklearn.metrics import silhouette_score, calinski_harabasz_score
silhouette = silhouette_score(X, labels)
ch_score = calinski_harabasz_score(X, labels)
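For the external metrics, a small sketch assuming ground-truth labels y_true and predicted labels are available (as in the Iris example above):
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
# y_true: ground-truth labels, labels: K-means cluster assignments
ari = adjusted_rand_score(y_true, labels)
nmi = normalized_mutual_info_score(y_true, labels)
fmi = fowlkes_mallows_score(y_true, labels)
print(f"ARI: {ari:.3f}, NMI: {nmi:.3f}, FMI: {fmi:.3f}")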
9. What is Mini-batch K-means and when would you use it?
Answer:
Mini-batch K-means:
- Variant that uses small random batches of data for updates
- Updates centroids using only a subset of data points in each iteration
- Significantly faster than standard K-means
- Slightly lower quality but much more scalable
When to use:
- Large datasets where standard K-means is too slow
- Online/streaming data scenarios
- When approximate results are acceptable
- Limited computational resources
Trade-offs:
- Pros: Much faster, memory efficient, good for large datasets
- Cons: Slightly less accurate, may need more iterations for convergence
Example:
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=100)
10. How does feature scaling affect K-means clustering?
Answer:
Impact of feature scaling:
- K-means uses Euclidean distance, which is sensitive to feature scales
- Features with larger scales dominate the distance calculation
- Can lead to poor clustering where high-scale features determine the clusters
Example:
# Age: 20-80, Income: 20000-100000
# Income will dominate distance calculation
Solutions:
- StandardScaler: mean = 0, std = 1
  from sklearn.preprocessing import StandardScaler
  X_scaled = StandardScaler().fit_transform(X)
- MinMaxScaler: scales features to [0, 1]
  from sklearn.preprocessing import MinMaxScaler
  X_scaled = MinMaxScaler().fit_transform(X)
- RobustScaler: uses median and IQR (robust to outliers)
Best practice: Always scale features before applying K-means
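A minimal sketch of the effect on two synthetic features with very different scales: without scaling, the income axis dominates the distance; with scaling, both features contribute:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
rng = np.random.default_rng(42)
# Two age groups; income is pure noise but lives on a much larger scale
age = np.concatenate([rng.normal(25, 3, 200), rng.normal(60, 3, 200)])
income = rng.normal(60000, 15000, 400)
X_mixed = np.column_stack([age, income])
y_group = np.array([0] * 200 + [1] * 200)
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_mixed)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(X_mixed))
print("ARI without scaling:", adjusted_rand_score(y_group, labels_raw))    # typically near 0
print("ARI with scaling:   ", adjusted_rand_score(y_group, labels_scaled)) # typically near 1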
Examples
Example 1: Customer Segmentation
# Generate customer data
np.random.seed(42)
n_customers = 1000
# Features: Age, Income, Spending Score
ages = np.random.normal(40, 12, n_customers)
incomes = np.random.normal(60000, 20000, n_customers)
spending_scores = np.random.normal(50, 25, n_customers)
# Create DataFrame
customer_data = pd.DataFrame({
    'Age': ages,
    'Annual_Income': incomes,
    'Spending_Score': spending_scores
})
# Add some correlation (higher income -> higher spending for some customers)
mask = np.random.choice(n_customers, size=int(0.6 * n_customers), replace=False)
customer_data.loc[mask, 'Spending_Score'] += (customer_data.loc[mask, 'Annual_Income'] - 60000) / 1000
print("Customer Data Summary:")
print(customer_data.describe())
# Prepare data for clustering
X_customers = customer_data.values
scaler = StandardScaler()
X_customers_scaled = scaler.fit_transform(X_customers)
# Determine optimal number of clusters
def analyze_optimal_k(X, max_k=10):
    """Analyze optimal k using multiple methods"""
    wcss = []
    silhouette_scores = []
    k_range = range(2, max_k + 1)
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X)
        wcss.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(X, labels))
    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    # Elbow method
    ax1.plot(k_range, wcss, 'bo-')
    ax1.set_title('Elbow Method')
    ax1.set_xlabel('Number of Clusters (k)')
    ax1.set_ylabel('WCSS')
    ax1.grid(True, alpha=0.3)
    # Silhouette method
    ax2.plot(k_range, silhouette_scores, 'ro-')
    ax2.set_title('Silhouette Analysis')
    ax2.set_xlabel('Number of Clusters (k)')
    ax2.set_ylabel('Silhouette Score')
    ax2.grid(True, alpha=0.3)
    # Find optimal k
    optimal_k_silhouette = k_range[np.argmax(silhouette_scores)]
    ax2.axvline(x=optimal_k_silhouette, color='red', linestyle='--', 
                label=f'Optimal k={optimal_k_silhouette}')
    ax2.legend()
    plt.tight_layout()
    plt.show()
    return optimal_k_silhouette, silhouette_scores
optimal_k, _ = analyze_optimal_k(X_customers_scaled, max_k=8)
print(f"Optimal number of clusters: {optimal_k}")
# Apply K-means with optimal k
kmeans_customers = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
customer_segments = kmeans_customers.fit_predict(X_customers_scaled)
# Add cluster labels to original data
customer_data['Segment'] = customer_segments
# Analyze segments
print(f"\nCustomer Segmentation Results (k={optimal_k}):")
print(f"Silhouette Score: {silhouette_score(X_customers_scaled, customer_segments):.3f}")
segment_analysis = customer_data.groupby('Segment').agg({
    'Age': ['mean', 'std'],
    'Annual_Income': ['mean', 'std'],
    'Spending_Score': ['mean', 'std'],
    'Segment': 'count'
}).round(2)
segment_analysis.columns = ['Age_Mean', 'Age_Std', 'Income_Mean', 'Income_Std', 
                           'Spending_Mean', 'Spending_Std', 'Count']
print("\nSegment Characteristics:")
print(segment_analysis)
# Visualize segments
fig = plt.figure(figsize=(18, 12))
# 2D scatter plots
feature_pairs = [
    ('Age', 'Annual_Income'),
    ('Age', 'Spending_Score'),
    ('Annual_Income', 'Spending_Score')
]
for i, (feat1, feat2) in enumerate(feature_pairs):
    ax = fig.add_subplot(2, 3, i+1)
    for segment in range(optimal_k):
        segment_data = customer_data[customer_data['Segment'] == segment]
        ax.scatter(segment_data[feat1], segment_data[feat2], 
                  alpha=0.6, label=f'Segment {segment}')
    ax.set_xlabel(feat1)
    ax.set_ylabel(feat2)
    ax.set_title(f'{feat1} vs {feat2}')
    ax.legend()
    ax.grid(True, alpha=0.3)
# 3D scatter plot
ax = fig.add_subplot(2, 3, 4, projection='3d')
colors = plt.cm.viridis(np.linspace(0, 1, optimal_k))
for segment in range(optimal_k):
    segment_data = customer_data[customer_data['Segment'] == segment]
    ax.scatter(segment_data['Age'], segment_data['Annual_Income'], 
              segment_data['Spending_Score'], c=[colors[segment]], 
              alpha=0.6, label=f'Segment {segment}')
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')
ax.set_title('3D Customer Segments')
ax.legend()
# Segment size distribution
ax = fig.add_subplot(2, 3, 5)
segment_counts = customer_data['Segment'].value_counts().sort_index()
ax.bar(range(optimal_k), segment_counts.values, color=colors[:optimal_k])
ax.set_xlabel('Segment')
ax.set_ylabel('Number of Customers')
ax.set_title('Segment Size Distribution')
ax.set_xticks(range(optimal_k))
# Radar chart for segment characteristics
ax = fig.add_subplot(2, 3, 6, projection='polar')
# Normalize features for radar chart
centroids_original = scaler.inverse_transform(kmeans_customers.cluster_centers_)
features = ['Age', 'Annual_Income', 'Spending_Score']
# Normalize each feature to 0-1 scale for visualization
normalized_centroids = np.zeros_like(centroids_original)
for i, feature in enumerate(features):
    min_val = customer_data[feature].min()
    max_val = customer_data[feature].max()
    normalized_centroids[:, i] = (centroids_original[:, i] - min_val) / (max_val - min_val)
angles = np.linspace(0, 2*np.pi, len(features), endpoint=False).tolist()
angles += angles[:1]  # Complete the circle
for segment in range(optimal_k):
    values = normalized_centroids[segment].tolist()
    values += values[:1]  # Complete the circle
    ax.plot(angles, values, 'o-', linewidth=2, label=f'Segment {segment}')
    ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(features)
ax.set_title('Segment Characteristics (Normalized)')
ax.legend()
plt.tight_layout()
plt.show()
# Business insights
print("\nBusiness Insights:")
for segment in range(optimal_k):
    segment_data = segment_analysis.loc[segment]
    print(f"\nSegment {segment} ({segment_data['Count']} customers):")
    print(f"  Average Age: {segment_data['Age_Mean']:.1f} years")
    print(f"  Average Income: ${segment_data['Income_Mean']:,.0f}")
    print(f"  Average Spending Score: {segment_data['Spending_Mean']:.1f}")
    # Generate insights based on characteristics
    if segment_data['Age_Mean'] < 35 and segment_data['Spending_Mean'] > 60:
        print("  -> Young high spenders - target for premium products")
    elif segment_data['Income_Mean'] > 70000 and segment_data['Spending_Mean'] < 40:
        print("  -> High income, low spending - potential for marketing campaigns")
    elif segment_data['Age_Mean'] > 50 and segment_data['Spending_Mean'] > 60:
        print("  -> Mature high spenders - focus on quality and service")
    else:
        print("  -> Standard customers - balanced approach")
Example 2: Image Color Quantization
from sklearn.datasets import load_sample_image
import matplotlib.image as mpimg
def quantize_image_colors(image_path=None, n_colors=8):
    """Reduce number of colors in an image using K-means"""
    # Load image (use sample image if path not provided)
    if image_path is None:
        # Use sklearn sample image
        china = load_sample_image("china.jpg")
        image = china / 255.0  # Normalize to [0, 1]
    else:
        image = mpimg.imread(image_path)
        if image.max() > 1:
            image = image / 255.0
    print(f"Original image shape: {image.shape}")
    # Reshape image to be a list of pixels
    original_shape = image.shape
    image_2d = image.reshape(-1, 3)  # Flatten to (n_pixels, 3)
    print(f"Number of pixels: {image_2d.shape[0]:,}")
    print(f"Original colors: {len(np.unique(image_2d, axis=0)):,} unique colors")
    # Apply K-means clustering
    print(f"Reducing to {n_colors} colors using K-means...")
    kmeans = KMeans(n_clusters=n_colors, random_state=42, n_init=10)
    labels = kmeans.fit_predict(image_2d)
    # Replace each pixel with its cluster center
    quantized_colors = kmeans.cluster_centers_[labels]
    quantized_image = quantized_colors.reshape(original_shape)
    # Calculate compression ratio
    original_unique_colors = len(np.unique(image_2d, axis=0))
    compression_ratio = original_unique_colors / n_colors
    print(f"Compression ratio: {compression_ratio:.1f}x")
    print(f"Final colors: {n_colors}")
    # Visualize results
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    # Original image
    axes[0, 0].imshow(image)
    axes[0, 0].set_title(f'Original Image\n({original_unique_colors:,} colors)')
    axes[0, 0].axis('off')
    # Quantized image
    axes[0, 1].imshow(quantized_image)
    axes[0, 1].set_title(f'Quantized Image\n({n_colors} colors)')
    axes[0, 1].axis('off')
    # Difference
    difference = np.abs(image - quantized_image)
    axes[0, 2].imshow(difference)
    axes[0, 2].set_title('Absolute Difference')
    axes[0, 2].axis('off')
    # Color palette
    palette = kmeans.cluster_centers_.reshape(1, n_colors, 3)
    axes[1, 0].imshow(palette)
    axes[1, 0].set_title('Color Palette')
    axes[1, 0].axis('off')
    # Color distribution
    unique_labels, counts = np.unique(labels, return_counts=True)
    colors_rgb = kmeans.cluster_centers_
    axes[1, 1].bar(range(n_colors), counts, color=colors_rgb, alpha=0.8)
    axes[1, 1].set_title('Color Frequency')
    axes[1, 1].set_xlabel('Color Index')
    axes[1, 1].set_ylabel('Pixel Count')
    # MSE plot for different number of colors
    color_range = range(2, 17)
    mse_values = []
    for n_c in color_range:
        temp_kmeans = KMeans(n_clusters=n_c, random_state=42, n_init=5)
        temp_labels = temp_kmeans.fit_predict(image_2d)
        temp_quantized = temp_kmeans.cluster_centers_[temp_labels]
        mse = np.mean((image_2d - temp_quantized) ** 2)
        mse_values.append(mse)
    axes[1, 2].plot(color_range, mse_values, 'bo-')
    axes[1, 2].axvline(x=n_colors, color='red', linestyle='--', 
                       label=f'Selected k={n_colors}')
    axes[1, 2].set_title('MSE vs Number of Colors')
    axes[1, 2].set_xlabel('Number of Colors')
    axes[1, 2].set_ylabel('Mean Squared Error')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return quantized_image, kmeans
# Apply color quantization
quantized_img, color_kmeans = quantize_image_colors(n_colors=16)
print("\nColor palette (RGB values):")
for i, color in enumerate(color_kmeans.cluster_centers_):
    print(f"Color {i}: RGB({color[0]:.3f}, {color[1]:.3f}, {color[2]:.3f})")
References
Academic Papers
- Lloyd, S. (1982). "Least squares quantization in PCM". IEEE Transactions on Information Theory
- Arthur, D. & Vassilvitskii, S. (2007). "K-means++: The advantages of careful seeding". SODA '07
- MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations"
Books
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Chapter 14.3
- Bishop, C. (2006). Pattern Recognition and Machine Learning. Chapter 9
- Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 25
Online Resources
- Scikit-learn K-means Documentation
- K-means Clustering: Algorithm, Applications, Evaluation Methods
- An Introduction to Statistical Learning with R - Chapter 10
- Stanford CS229 Machine Learning Course Notes
Tutorials and Guides
- K-means Clustering in Python: A Practical Guide
- Clustering Algorithms Comparison
- How to Determine the Optimal Number of Clusters