Skip to content

πŸ“˜ Collaborative Filtering

Collaborative filtering is a recommendation technique that predicts user preferences by analyzing the behavior and preferences of similar users or items, leveraging the collective intelligence of the user community.

Resources: Surprise Documentation | Netflix Prize Paper | Recommender Systems Handbook

✍️ Summary

Collaborative Filtering (CF) is a method used in recommendation systems that makes automatic predictions about user preferences by collecting preferences from many users. The underlying assumption is that users who agreed in the past will agree in the future, and they will like similar kinds of items.

Key concepts: - User-based CF: Find similar users and recommend items they liked - Item-based CF: Find similar items to those the user has liked - Matrix Factorization: Decompose user-item interaction matrix into latent factors

Applications: - Movie recommendations (Netflix, IMDb) - Product recommendations (Amazon, eBay) - Music recommendations (Spotify, Pandora) - Social media content (Facebook, Twitter) - News recommendations (Google News) - Book recommendations (Goodreads)

Collaborative filtering works well when you have sufficient user interaction data but doesn't require knowledge about item content.

🧠 Intuition

Mathematical Foundation

Collaborative filtering can be formulated as a matrix completion problem. Given a user-item rating matrix \(R\) where \(R_{ui}\) represents the rating user \(u\) gave to item \(i\):

\[R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix}\]

Where many entries are missing (unobserved ratings).

User-Based Collaborative Filtering

The similarity between users \(u\) and \(v\) can be measured using:

Cosine Similarity: \(\(\text{sim}(u,v) = \frac{\sum_{i \in I} r_{ui} \cdot r_{vi}}{\sqrt{\sum_{i \in I} r_{ui}^2} \cdot \sqrt{\sum_{i \in I} r_{vi}^2}}\)\)

Pearson Correlation: \(\(\text{sim}(u,v) = \frac{\sum_{i \in I} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I} (r_{vi} - \bar{r}_v)^2}}\)\)

Prediction formula: \(\(\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot (r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|}\)\)

Item-Based Collaborative Filtering

Similar to user-based, but focuses on item similarities:

\[\hat{r}_{ui} = \frac{\sum_{j \in N(i)} \text{sim}(i,j) \cdot r_{uj}}{\sum_{j \in N(i)} |\text{sim}(i,j)|}\]

Matrix Factorization

Decompose the rating matrix \(R\) into two lower-dimensional matrices: \(\(R \approx P \times Q^T\)\)

Where: - \(P \in \mathbb{R}^{m \times k}\) represents user latent factors - \(Q \in \mathbb{R}^{n \times k}\) represents item latent factors - \(k\) is the number of latent factors

Objective function: \(\(\min_{P,Q} \sum_{(u,i) \in \text{observed}} (r_{ui} - p_u^T q_i)^2 + \lambda(||P||^2 + ||Q||^2)\)\)

πŸ”’ Implementation using Libraries

Using Surprise Library

import pandas as pd
import numpy as np
from surprise import Dataset, Reader, SVD, KNNBasic, accuracy
from surprise.model_selection import train_test_split, cross_validate
from collections import defaultdict

# Sample dataset creation
def create_sample_data():
    """Create sample movie ratings dataset"""
    np.random.seed(42)

    users = [f'User_{i}' for i in range(1, 101)]
    movies = [f'Movie_{i}' for i in range(1, 51)]

    # Generate ratings with some pattern
    ratings = []
    for user in users:
        # Each user rates 10-30 movies
        n_ratings = np.random.randint(10, 31)
        user_movies = np.random.choice(movies, n_ratings, replace=False)

        for movie in user_movies:
            # Add some user bias and item bias
            user_bias = np.random.normal(0, 0.5)
            movie_bias = np.random.normal(0, 0.3)
            rating = np.clip(3 + user_bias + movie_bias + np.random.normal(0, 0.8), 1, 5)
            ratings.append([user, movie, round(rating, 1)])

    return pd.DataFrame(ratings, columns=['user', 'item', 'rating'])

# Create and prepare data
df = create_sample_data()
print("Sample data:")
print(df.head())

# Surprise dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# 1. Matrix Factorization (SVD)
print("\n1. Matrix Factorization (SVD)")
svd = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
svd.fit(trainset)

# Make predictions
predictions = svd.test(testset)
print(f"RMSE: {accuracy.rmse(predictions):.4f}")

# 2. User-based Collaborative Filtering
print("\n2. User-based Collaborative Filtering")
user_based = KNNBasic(sim_options={'name': 'cosine', 'user_based': True}, k=20)
user_based.fit(trainset)

predictions_user = user_based.test(testset)
print(f"RMSE: {accuracy.rmse(predictions_user):.4f}")

# 3. Item-based Collaborative Filtering  
print("\n3. Item-based Collaborative Filtering")
item_based = KNNBasic(sim_options={'name': 'cosine', 'user_based': False}, k=20)
item_based.fit(trainset)

predictions_item = item_based.test(testset)
print(f"RMSE: {accuracy.rmse(predictions_item):.4f}")

# Get recommendations for a user
def get_recommendations(model, user_id, trainset, n_recommendations=5):
    """Get top N recommendations for a user"""
    # Get list of all items
    all_items = set([item for (_, item, _) in trainset.all_ratings()])

    # Get items the user has already rated
    user_items = set([item for (user, item, _) in trainset.all_ratings() if user == user_id])

    # Get items the user hasn't rated
    unrated_items = all_items - user_items

    # Predict ratings for unrated items
    predictions = []
    for item in unrated_items:
        pred = model.predict(user_id, item)
        predictions.append((item, pred.est))

    # Sort by predicted rating
    predictions.sort(key=lambda x: x[1], reverse=True)

    return predictions[:n_recommendations]

# Example recommendations
user_id = trainset.to_raw_uid(0)  # First user in trainset
recommendations = get_recommendations(svd, user_id, trainset)
print(f"\nTop 5 recommendations for {user_id}:")
for item, rating in recommendations:
    print(f"  {item}: {rating:.2f}")

Using scikit-learn

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import NMF
import pandas as pd

class CollaborativeFilteringScratch:
    def __init__(self, method='user_based'):
        self.method = method
        self.user_similarity = None
        self.item_similarity = None
        self.user_mean = None

    def fit(self, ratings_matrix):
        """
        Fit collaborative filtering model
        ratings_matrix: pandas DataFrame with users as rows, items as columns
        """
        self.ratings_matrix = ratings_matrix.fillna(0)
        self.user_mean = ratings_matrix.mean(axis=1)

        if self.method == 'user_based':
            # Calculate user similarity matrix
            self.user_similarity = cosine_similarity(self.ratings_matrix)
            np.fill_diagonal(self.user_similarity, 0)  # Remove self-similarity

        elif self.method == 'item_based':
            # Calculate item similarity matrix
            self.item_similarity = cosine_similarity(self.ratings_matrix.T)
            np.fill_diagonal(self.item_similarity, 0)

    def predict_user_based(self, user_idx, item_idx, k=20):
        """Predict rating using user-based collaborative filtering"""
        if self.ratings_matrix.iloc[user_idx, item_idx] > 0:
            return self.ratings_matrix.iloc[user_idx, item_idx]

        # Find k most similar users
        similarities = self.user_similarity[user_idx]
        similar_users = np.argsort(similarities)[::-1][:k]

        # Remove users who haven't rated this item
        similar_users = [u for u in similar_users 
                        if self.ratings_matrix.iloc[u, item_idx] > 0]

        if not similar_users:
            return self.user_mean.iloc[user_idx]

        # Calculate weighted average
        numerator = sum(similarities[u] * 
                       (self.ratings_matrix.iloc[u, item_idx] - self.user_mean.iloc[u])
                       for u in similar_users)
        denominator = sum(abs(similarities[u]) for u in similar_users)

        if denominator == 0:
            return self.user_mean.iloc[user_idx]

        return self.user_mean.iloc[user_idx] + numerator / denominator

    def predict_item_based(self, user_idx, item_idx, k=20):
        """Predict rating using item-based collaborative filtering"""
        if self.ratings_matrix.iloc[user_idx, item_idx] > 0:
            return self.ratings_matrix.iloc[user_idx, item_idx]

        # Find k most similar items that the user has rated
        similarities = self.item_similarity[item_idx]
        user_rated_items = [i for i in range(len(similarities))
                           if self.ratings_matrix.iloc[user_idx, i] > 0]

        if not user_rated_items:
            return self.user_mean.iloc[user_idx]

        # Sort by similarity and take top k
        similar_items = sorted(user_rated_items, 
                             key=lambda x: similarities[x], reverse=True)[:k]

        # Calculate weighted average
        numerator = sum(similarities[i] * self.ratings_matrix.iloc[user_idx, i]
                       for i in similar_items)
        denominator = sum(abs(similarities[i]) for i in similar_items)

        if denominator == 0:
            return self.user_mean.iloc[user_idx]

        return numerator / denominator

# Example usage with sample data
np.random.seed(42)
n_users, n_items = 20, 15

# Create sample ratings matrix (sparse)
ratings = np.random.choice([0, 1, 2, 3, 4, 5], 
                          size=(n_users, n_items), 
                          p=[0.7, 0.05, 0.05, 0.1, 0.05, 0.05])
ratings_df = pd.DataFrame(ratings, 
                         index=[f'User_{i}' for i in range(n_users)],
                         columns=[f'Item_{i}' for i in range(n_items)])

# Replace 0s with NaN to represent missing ratings
ratings_df = ratings_df.replace(0, np.nan)

print("Sample ratings matrix:")
print(ratings_df.head())

# Fit models
cf_user = CollaborativeFilteringScratch(method='user_based')
cf_user.fit(ratings_df)

cf_item = CollaborativeFilteringScratch(method='item_based')  
cf_item.fit(ratings_df)

# Make predictions
user_idx, item_idx = 0, 5
pred_user = cf_user.predict_user_based(user_idx, item_idx)
pred_item = cf_item.predict_item_based(user_idx, item_idx)

print(f"\nPredictions for User_0, Item_5:")
print(f"User-based CF: {pred_user:.2f}")
print(f"Item-based CF: {pred_item:.2f}")

βš™οΈ From Scratch Implementation

import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
import math

class CollaborativeFilteringFromScratch:
    """
    Complete implementation of Collaborative Filtering from scratch
    Includes User-based, Item-based, and Matrix Factorization approaches
    """

    def __init__(self, approach='user_based', n_factors=10, learning_rate=0.01, 
                 regularization=0.01, n_epochs=100):
        self.approach = approach
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.regularization = regularization
        self.n_epochs = n_epochs

        # Will be populated during training
        self.ratings_matrix = None
        self.user_mean = None
        self.item_mean = None
        self.global_mean = None
        self.user_factors = None
        self.item_factors = None
        self.user_bias = None
        self.item_bias = None

    def pearson_correlation(self, x, y):
        """Calculate Pearson correlation coefficient"""
        # Remove NaN values
        mask = ~(np.isnan(x) | np.isnan(y))
        if np.sum(mask) < 2:
            return 0

        x_clean, y_clean = x[mask], y[mask]

        if len(x_clean) == 0 or np.std(x_clean) == 0 or np.std(y_clean) == 0:
            return 0

        return np.corrcoef(x_clean, y_clean)[0, 1] if len(x_clean) > 1 else 0

    def cosine_similarity(self, x, y):
        """Calculate cosine similarity"""
        # Replace NaN with 0 for cosine similarity
        x_clean = np.nan_to_num(x)
        y_clean = np.nan_to_num(y)

        dot_product = np.dot(x_clean, y_clean)
        norm_x = np.linalg.norm(x_clean)
        norm_y = np.linalg.norm(y_clean)

        if norm_x == 0 or norm_y == 0:
            return 0

        return dot_product / (norm_x * norm_y)

    def fit(self, ratings_df):
        """
        Fit the collaborative filtering model
        ratings_df: DataFrame with users as index, items as columns
        """
        self.ratings_matrix = ratings_df.copy()
        self.users = list(ratings_df.index)
        self.items = list(ratings_df.columns)
        self.n_users = len(self.users)
        self.n_items = len(self.items)

        # Calculate means
        self.user_mean = ratings_df.mean(axis=1, skipna=True)
        self.item_mean = ratings_df.mean(axis=0, skipna=True)
        self.global_mean = ratings_df.stack().mean()

        if self.approach == 'matrix_factorization':
            self._fit_matrix_factorization()

    def _fit_matrix_factorization(self):
        """Fit matrix factorization using gradient descent"""
        # Initialize factors and biases
        np.random.seed(42)
        self.user_factors = np.random.normal(0, 0.1, (self.n_users, self.n_factors))
        self.item_factors = np.random.normal(0, 0.1, (self.n_items, self.n_factors))
        self.user_bias = np.zeros(self.n_users)
        self.item_bias = np.zeros(self.n_items)

        # Get all known ratings
        known_ratings = []
        for i, user in enumerate(self.users):
            for j, item in enumerate(self.items):
                rating = self.ratings_matrix.loc[user, item]
                if not np.isnan(rating):
                    known_ratings.append((i, j, rating))

        # Gradient descent
        for epoch in range(self.n_epochs):
            total_error = 0

            for user_idx, item_idx, rating in known_ratings:
                # Predict rating
                prediction = (self.global_mean + 
                             self.user_bias[user_idx] + 
                             self.item_bias[item_idx] +
                             np.dot(self.user_factors[user_idx], 
                                   self.item_factors[item_idx]))

                # Calculate error
                error = rating - prediction
                total_error += error ** 2

                # Update biases
                self.user_bias[user_idx] += self.learning_rate * (
                    error - self.regularization * self.user_bias[user_idx])
                self.item_bias[item_idx] += self.learning_rate * (
                    error - self.regularization * self.item_bias[item_idx])

                # Update factors
                user_factors_old = self.user_factors[user_idx].copy()
                self.user_factors[user_idx] += self.learning_rate * (
                    error * self.item_factors[item_idx] - 
                    self.regularization * self.user_factors[user_idx])
                self.item_factors[item_idx] += self.learning_rate * (
                    error * user_factors_old - 
                    self.regularization * self.item_factors[item_idx])

            # Print progress
            if (epoch + 1) % 20 == 0:
                rmse = np.sqrt(total_error / len(known_ratings))
                print(f"Epoch {epoch + 1}/{self.n_epochs}, RMSE: {rmse:.4f}")

    def predict(self, user, item, k=20):
        """Predict rating for user-item pair"""
        if user not in self.users or item not in self.items:
            return self.global_mean

        user_idx = self.users.index(user)
        item_idx = self.items.index(item)

        # If rating already exists, return it
        existing_rating = self.ratings_matrix.loc[user, item]
        if not np.isnan(existing_rating):
            return existing_rating

        if self.approach == 'user_based':
            return self._predict_user_based(user_idx, item_idx, k)
        elif self.approach == 'item_based':
            return self._predict_item_based(user_idx, item_idx, k)
        elif self.approach == 'matrix_factorization':
            return self._predict_matrix_factorization(user_idx, item_idx)
        else:
            return self.global_mean

    def _predict_user_based(self, user_idx, item_idx, k):
        """User-based collaborative filtering prediction"""
        target_user_ratings = self.ratings_matrix.iloc[user_idx].values
        similarities = []

        # Calculate similarities with all other users
        for i, other_user in enumerate(self.users):
            if i == user_idx:
                continue

            other_user_ratings = self.ratings_matrix.iloc[i].values
            similarity = self.pearson_correlation(target_user_ratings, other_user_ratings)

            # Only consider users who have rated this item
            if not np.isnan(self.ratings_matrix.iloc[i, item_idx]) and similarity > 0:
                similarities.append((i, similarity))

        # Sort by similarity and take top k
        similarities.sort(key=lambda x: x[1], reverse=True)
        top_similar = similarities[:k]

        if not top_similar:
            return self.user_mean.iloc[user_idx]

        # Calculate weighted average
        numerator = sum(sim * (self.ratings_matrix.iloc[user_i, item_idx] - 
                              self.user_mean.iloc[user_i])
                       for user_i, sim in top_similar)
        denominator = sum(abs(sim) for _, sim in top_similar)

        if denominator == 0:
            return self.user_mean.iloc[user_idx]

        return self.user_mean.iloc[user_idx] + numerator / denominator

    def _predict_item_based(self, user_idx, item_idx, k):
        """Item-based collaborative filtering prediction"""
        target_item_ratings = self.ratings_matrix.iloc[:, item_idx].values
        similarities = []

        # Calculate similarities with all other items
        for j, other_item in enumerate(self.items):
            if j == item_idx:
                continue

            other_item_ratings = self.ratings_matrix.iloc[:, j].values
            similarity = self.pearson_correlation(target_item_ratings, other_item_ratings)

            # Only consider items that this user has rated
            if not np.isnan(self.ratings_matrix.iloc[user_idx, j]) and similarity > 0:
                similarities.append((j, similarity))

        # Sort by similarity and take top k
        similarities.sort(key=lambda x: x[1], reverse=True)
        top_similar = similarities[:k]

        if not top_similar:
            return self.item_mean.iloc[item_idx]

        # Calculate weighted average
        numerator = sum(sim * self.ratings_matrix.iloc[user_idx, item_j]
                       for item_j, sim in top_similar)
        denominator = sum(abs(sim) for _, sim in top_similar)

        if denominator == 0:
            return self.item_mean.iloc[item_idx]

        return numerator / denominator

    def _predict_matrix_factorization(self, user_idx, item_idx):
        """Matrix factorization prediction"""
        prediction = (self.global_mean + 
                     self.user_bias[user_idx] + 
                     self.item_bias[item_idx] +
                     np.dot(self.user_factors[user_idx], 
                           self.item_factors[item_idx]))
        return prediction

    def get_recommendations(self, user, n_recommendations=5):
        """Get top N recommendations for a user"""
        if user not in self.users:
            return []

        user_idx = self.users.index(user)
        user_ratings = self.ratings_matrix.loc[user]

        # Find items the user hasn't rated
        unrated_items = user_ratings[user_ratings.isna()].index.tolist()

        # Predict ratings for unrated items
        predictions = []
        for item in unrated_items:
            pred_rating = self.predict(user, item)
            predictions.append((item, pred_rating))

        # Sort by predicted rating
        predictions.sort(key=lambda x: x[1], reverse=True)

        return predictions[:n_recommendations]

# Example usage and testing
if __name__ == "__main__":
    # Create sample data
    np.random.seed(42)
    users = [f'User_{i}' for i in range(15)]
    items = [f'Movie_{i}' for i in range(10)]

    # Create ratings matrix with missing values
    ratings_data = {}
    for user in users:
        ratings_data[user] = {}
        for item in items:
            # 60% chance of having a rating
            if np.random.random() > 0.6:
                ratings_data[user][item] = np.random.randint(1, 6)
            else:
                ratings_data[user][item] = np.nan

    ratings_df = pd.DataFrame(ratings_data).T
    print("Sample ratings matrix:")
    print(ratings_df.head())

    # Test different approaches
    approaches = ['user_based', 'item_based', 'matrix_factorization']

    for approach in approaches:
        print(f"\n{'='*50}")
        print(f"Testing {approach.replace('_', ' ').title()}")
        print('='*50)

        cf_model = CollaborativeFilteringFromScratch(approach=approach, n_epochs=50)
        cf_model.fit(ratings_df)

        # Test predictions
        test_user = 'User_0'
        test_item = 'Movie_5'

        prediction = cf_model.predict(test_user, test_item)
        print(f"Prediction for {test_user} -> {test_item}: {prediction:.2f}")

        # Get recommendations
        recommendations = cf_model.get_recommendations(test_user, n_recommendations=3)
        print(f"\nTop 3 recommendations for {test_user}:")
        for item, rating in recommendations:
            print(f"  {item}: {rating:.2f}")

⚠️ Assumptions and Limitations

Assumptions

  1. User Consistency: Users have consistent preferences over time
  2. Transitivity: If user A is similar to user B, and user B likes item X, then user A will also like item X
  3. Sufficient Data: Enough user-item interactions exist for meaningful patterns
  4. Rating Reliability: User ratings accurately reflect their true preferences

Limitations

1. Cold Start Problems

  • New Users: Cannot make recommendations for users with no rating history
  • New Items: Cannot recommend items with no ratings
  • Solution: Use hybrid approaches combining content-based filtering

2. Data Sparsity

  • Most user-item matrices are extremely sparse (95%+ missing values)
  • Few overlapping ratings between users make similarity calculations unreliable
  • Solution: Matrix factorization, dimensionality reduction

3. Scalability Issues

  • User-based CF: O(mn) for m users, n items per prediction
  • Similarity calculations become expensive with large datasets
  • Solution: Approximate algorithms, sampling, clustering

4. Gray Sheep Problem

  • Users with unique tastes don't match well with any group
  • Hard to find similar users for recommendations
  • Solution: Content-based or demographic filtering

5. Filter Bubble

  • Recommends similar items to what user already likes
  • Reduces serendipity and diversity
  • Solution: Add randomness, diversity metrics

Comparison with Other Approaches

Aspect User-Based CF Item-Based CF Matrix Factorization
Interpretability High High Low
Scalability Poor Better Good
Accuracy Medium Medium High
Cold Start Poor Poor Better
Sparsity Handling Poor Better Good

πŸ’‘ Interview Questions

Q1: What is collaborative filtering and how does it differ from content-based filtering?

Answer:

Collaborative filtering predicts user preferences based on behavior of similar users, while content-based filtering uses item features.

Key differences: - Data Required: CF needs user behavior data; content-based needs item features - Recommendations: CF can recommend items dissimilar in content but liked by similar users - Cold Start: CF struggles with new users/items; content-based can handle new items - Serendipity: CF provides more surprising recommendations - Domain Knowledge: CF doesn't require domain expertise; content-based does

Q2: Explain the cold start problem in collaborative filtering and potential solutions.

Answer:

Cold start occurs when there's insufficient data for new users or items.

Types: 1. New User: No rating history β†’ Cannot find similar users 2. New Item: No ratings β†’ Cannot recommend to anyone 3. New System: Few users/items overall

Solutions: - Hybrid Systems: Combine with content-based filtering - Demographic Filtering: Use age, gender, location for new users - Popular Items: Recommend trending/popular items to new users - Active Learning: Ask new users to rate popular items - Side Information: Use implicit feedback (views, clicks, time spent)

Q3: What are the advantages and disadvantages of user-based vs item-based collaborative filtering?

Answer:

User-Based CF: - Advantages: Intuitive, good for diverse recommendations, works well with user communities - Disadvantages: Poor scalability (users grow faster than items), unstable (user preferences change)

Item-Based CF: - Advantages: Better scalability, more stable (item relationships don't change often), pre-computable - Disadvantages: Less diverse recommendations, may create filter bubbles

When to use: - User-based: Small user base, community-driven platforms, need for diversity - Item-based: Large user base, stable item catalog, need for stability

Q4: How does matrix factorization work in collaborative filtering? What are its benefits?

Answer:

Matrix factorization decomposes the user-item rating matrix R into two lower-dimensional matrices P (user factors) and Q (item factors):

\[R \approx P \times Q^T\]

How it works: 1. Initialize P and Q with random values 2. For each known rating, predict: \(\hat{r}_{ui} = p_u^T q_i\) 3. Minimize error: \(\min \sum (r_{ui} - p_u^T q_i)^2 + \lambda(||P||^2 + ||Q||^2)\) 4. Update factors using gradient descent

Benefits: - Handles sparsity better than neighborhood methods - More scalable than user/item-based approaches - Can incorporate biases and side information - Discovers latent factors automatically - Better accuracy on sparse datasets

Q5: What evaluation metrics would you use for a recommendation system?

Answer:

Accuracy Metrics: - RMSE/MAE: For rating prediction tasks - Precision/Recall: For top-N recommendations - F1-Score: Harmonic mean of precision and recall - AUC: Area under ROC curve for binary relevance

Ranking Metrics: - NDCG: Normalized Discounted Cumulative Gain - MAP: Mean Average Precision - MRR: Mean Reciprocal Rank

Beyond Accuracy: - Coverage: Percentage of items that can be recommended - Diversity: Variety in recommendations - Novelty: How unknown recommended items are - Serendipity: Surprising but relevant recommendations - Business Metrics: Click-through rate, conversion rate, user engagement

Q6: How would you handle the scalability challenges in collaborative filtering?

Answer:

Techniques for scalability:

  1. Dimensionality Reduction:
  2. SVD, NMF for matrix factorization
  3. Clustering users/items to reduce computation

  4. Sampling Strategies:

  5. Sample subset of similar users/items
  6. Negative sampling for implicit feedback

  7. Approximate Algorithms:

  8. Locality Sensitive Hashing (LSH) for similarity
  9. Randomized algorithms

  10. Distributed Computing:

  11. MapReduce implementations
  12. Spark MLlib for large-scale CF

  13. Preprocessing:

  14. Pre-compute item-item similarities (more stable)
  15. Use incremental learning algorithms
Q7: What is the difference between explicit and implicit feedback? How do you handle each?

Answer:

Explicit Feedback: - Direct ratings (1-5 stars, thumbs up/down) - Advantages: Clear preference signal - Disadvantages: Sparse, biased (only engaged users rate)

Implicit Feedback: - Indirect behavior (views, clicks, purchases, time spent) - Advantages: Abundant, all users generate it - Disadvantages: Noisy, positive-only (no explicit negatives)

Handling Strategies: - Explicit: Standard CF algorithms, handle missing as unknown - Implicit: Treat confidence as rating strength, generate negative samples, use specialized algorithms (BPR, WARP)

Example transformation for implicit: - View time β†’ confidence score - Multiple purchases β†’ higher preference - Recent activity β†’ higher weight

Q8: How would you detect and prevent data quality issues in collaborative filtering?

Answer:

Common Issues: 1. Fake Reviews/Ratings: Artificially inflate/deflate ratings 2. Rating Bias: Users with extreme rating patterns 3. Data Sparsity: Very few ratings per user/item 4. Temporal Effects: Preferences change over time

Detection Methods: - Statistical analysis (rating distributions, user patterns) - Anomaly detection algorithms - Graph-based analysis (unusual rating patterns) - Temporal analysis (sudden rating spikes)

Prevention/Mitigation: - User verification and reputation systems - Rate limiting and CAPTCHA - Weighted ratings by user trustworthiness - Temporal weighting (recent ratings more important) - Robust algorithms less sensitive to outliers

Q9: How would you design a recommendation system for a new e-commerce platform?

Answer:

Initial Phase (Cold Start): 1. Popular Items: Show trending/bestselling products 2. Content-Based: Use product features, categories, descriptions 3. Demographic: Age, gender, location-based recommendations

Growth Phase: 1. Simple CF: User-based or item-based with sufficient data 2. Hybrid Approach: Combine content-based and collaborative 3. Implicit Feedback: Views, cart additions, purchases

Mature Phase: 1. Matrix Factorization: Handle large sparse matrices 2. Deep Learning: Neural collaborative filtering, autoencoders 3. Real-time: Online learning, session-based recommendations

System Design Considerations: - A/B testing framework for algorithm comparison - Real-time vs batch processing - Scalable infrastructure (distributed computing) - Business metrics alignment

Q10: What are some advanced techniques in modern collaborative filtering?

Answer:

Deep Learning Approaches: 1. Neural Collaborative Filtering: Replace dot product with neural network 2. Autoencoders: Learn user/item representations 3. RNNs/LSTMs: Model sequential behavior 4. Graph Neural Networks: Leverage user-item graph structure

Advanced Matrix Factorization: - Non-negative Matrix Factorization: Interpretable factors - Bayesian Matrix Factorization: Uncertainty quantification - Tensor Factorization: Multi-dimensional data (user-item-context)

Multi-Armed Bandits: - Exploration vs exploitation in recommendations - Contextual bandits for personalization

Reinforcement Learning: - Long-term user satisfaction optimization - Dynamic recommendation strategies

Fairness and Bias Mitigation: - Demographic parity in recommendations - Bias-aware collaborative filtering

🧠 Examples

Example 1: Movie Recommendation System

# Real-world example using MovieLens dataset structure
import pandas as pd
import numpy as np

# Sample movie data (simplified MovieLens format)
movies_data = {
    'movie_id': [1, 2, 3, 4, 5],
    'title': ['Toy Story', 'Jumanji', 'Heat', 'Casino', 'Sabrina'],
    'genres': ['Animation|Children|Comedy', 'Adventure|Children|Fantasy', 
               'Action|Crime|Thriller', 'Crime|Drama', 'Comedy|Romance']
}

ratings_data = {
    'user_id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
    'movie_id': [1, 2, 3, 1, 4, 2, 3, 5, 1, 5, 3, 4],
    'rating': [5, 4, 3, 4, 5, 3, 4, 5, 5, 4, 3, 4]
}

movies_df = pd.DataFrame(movies_data)
ratings_df = pd.DataFrame(ratings_data)

# Create user-item matrix
user_item_matrix = ratings_df.pivot(index='user_id', 
                                   columns='movie_id', 
                                   values='rating')

print("User-Item Rating Matrix:")
print(user_item_matrix)

# Apply collaborative filtering
cf_model = CollaborativeFilteringFromScratch(approach='item_based')
cf_model.fit(user_item_matrix)

# Get recommendations for User 1
recommendations = cf_model.get_recommendations(1, n_recommendations=2)
print(f"\nRecommendations for User 1:")
for movie_id, predicted_rating in recommendations:
    movie_title = movies_df[movies_df['movie_id'] == movie_id]['title'].values[0]
    print(f"  {movie_title}: {predicted_rating:.2f}")

Output:

User-Item Rating Matrix:
movie_id    1    2    3    4    5
user_id                         
1         5.0  4.0  3.0  NaN  NaN
2         4.0  NaN  NaN  5.0  NaN
3         NaN  3.0  4.0  NaN  5.0
4         5.0  NaN  NaN  NaN  4.0
5         NaN  NaN  3.0  4.0  NaN

Recommendations for User 1:
  Casino: 4.21
  Sabrina: 3.87

Example 2: Performance Comparison

# Compare different approaches on synthetic data
from sklearn.metrics import mean_squared_error
import time

# Generate larger synthetic dataset
np.random.seed(42)
n_users, n_items = 100, 50
sparsity = 0.1  # 10% of entries are filled

# Create synthetic ratings with latent factors
true_user_factors = np.random.normal(0, 1, (n_users, 5))
true_item_factors = np.random.normal(0, 1, (n_items, 5))
true_ratings = np.dot(true_user_factors, true_item_factors.T)

# Add noise and sparsity
mask = np.random.random((n_users, n_items)) < sparsity
observed_ratings = true_ratings + np.random.normal(0, 0.5, (n_users, n_items))
observed_ratings = np.clip(observed_ratings, 1, 5)  # Clip to rating scale
observed_ratings[~mask] = np.nan

# Convert to DataFrame
ratings_df = pd.DataFrame(observed_ratings, 
                         index=[f'User_{i}' for i in range(n_users)],
                         columns=[f'Item_{i}' for i in range(n_items)])

# Split train/test
train_mask = np.random.random((n_users, n_items)) < 0.8
test_mask = mask & ~train_mask

train_df = ratings_df.copy()
train_df[~train_mask] = np.nan

# Test different approaches
approaches = ['user_based', 'item_based', 'matrix_factorization']
results = {}

for approach in approaches:
    print(f"\nTesting {approach}...")
    start_time = time.time()

    model = CollaborativeFilteringFromScratch(approach=approach, n_epochs=50)
    model.fit(train_df)

    # Make predictions on test set
    predictions = []
    actuals = []

    for i in range(n_users):
        for j in range(n_items):
            if test_mask[i, j]:
                user = f'User_{i}'
                item = f'Item_{j}'
                pred = model.predict(user, item)
                actual = ratings_df.iloc[i, j]

                predictions.append(pred)
                actuals.append(actual)

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(actuals, predictions))
    training_time = time.time() - start_time

    results[approach] = {
        'RMSE': rmse,
        'Training Time': training_time,
        'Predictions': len(predictions)
    }

    print(f"RMSE: {rmse:.4f}")
    print(f"Training Time: {training_time:.2f}s")

# Display results
print(f"\n{'='*60}")
print("PERFORMANCE COMPARISON")
print('='*60)
print(f"{'Approach':<20} {'RMSE':<10} {'Time (s)':<10}")
print('-'*40)
for approach, metrics in results.items():
    print(f"{approach.replace('_', ' ').title():<20} {metrics['RMSE']:<10.4f} {metrics['Training Time']:<10.2f}")

This comprehensive implementation demonstrates how collaborative filtering works in practice, handles real-world challenges, and provides a foundation for building production recommendation systems.

πŸ“š References

  1. Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook. Springer.

  2. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.

  3. Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009.

  4. Sarwar, B., et al. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th international conference on World Wide Web.

  5. Netflix Prize Documentation: Netflix Prize

  6. Surprise Library Documentation: Surprise

  7. MovieLens Datasets: GroupLens Research

  8. Collaborative Filtering Tutorial: Towards Data Science

  9. Matrix Factorization: Netflix Tech Blog

  10. Modern Recommender Systems: RecSys Conference Proceedings