🎯 Overfitting and Underfitting

Overfitting and Underfitting are fundamental concepts in machine learning that describe how well a model generalizes to unseen data - the central challenge in building reliable predictive models.

Resources: Scikit-learn Model Selection | ESL Chapter 7 | Bias-Variance Tradeoff Paper

📊 Summary

Overfitting occurs when a model learns the training data too well, capturing noise and specific patterns that don't generalize to new data. Underfitting happens when a model is too simple to capture the underlying patterns in the data.

Key Characteristics:

Overfitting: - High training accuracy, low validation/test accuracy - Model memorizes training data instead of learning patterns - Complex models with too many parameters - Poor generalization to unseen data

Underfitting: - Low training accuracy, low validation/test accuracy
- Model is too simple to capture underlying patterns - High bias, unable to learn from training data - Consistent poor performance across all datasets

Applications: - Model selection and hyperparameter tuning - Regularization technique selection - Architecture design for neural networks - Feature engineering decisions - Cross-validation strategy - Early stopping criteria

Related Concepts: - Bias-Variance Tradeoff: Fundamental framework explaining overfitting/underfitting - Model Complexity: Key factor determining fitting behavior - Regularization: Primary technique to prevent overfitting - Cross-Validation: Method to detect and measure fitting issues

🧠 Intuition

How Overfitting and Underfitting Work

Imagine you're learning to recognize handwritten digits. An underfitted model might only look at basic features like "has curves" or "has straight lines" - too simple to distinguish between different digits. An overfitted model might memorize every tiny detail of each training example, including pen pressure variations and paper texture, making it fail on new handwriting styles.

The ideal model finds the right balance - learning the essential patterns that generalize well without memorizing irrelevant details.

Mathematical Foundation

1. Bias-Variance Decomposition

The expected prediction error can be decomposed as: \(\(E[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2\)\)

Where: - Bias: Error from oversimplifying assumptions - Variance: Error from sensitivity to training data variations
- Irreducible Error (\(\sigma^2\)): Inherent noise in the problem

2. Model Complexity vs Error

\[\text{Training Error} = \frac{1}{n}\sum_{i=1}^{n}L(y_i, \hat{f}(x_i))\]

\[\text{Generalization Error} = E[L(y, \hat{f}(x))]\]

As model complexity increases: - Training error decreases monotonically - Generalization error follows a U-shaped curve - Optimal complexity minimizes generalization error

3. VC Dimension and Generalization

For a model class with VC dimension \(d\) and \(n\) training samples: \(\(\text{Generalization Error} \leq \text{Training Error} + \sqrt{\frac{d\log(n) - \log(\delta)}{n}}\)\)

This bound shows that complex models (high \(d\)) need more data to generalize well.

4. Learning Curves

Training and validation error as functions of: - Sample size: \(\text{Error}(n)\) - Model complexity: \(\text{Error}(\lambda)\) where \(\lambda\) controls complexity

🛠️ Implementation using Libraries

Scikit-learn Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, validation_curve, learning_curve
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import seaborn as sns

# Generate synthetic dataset
np.random.seed(42)
def generate_data(n_samples=100, noise=0.3):
    X = np.linspace(0, 1, n_samples).reshape(-1, 1)
    y = 1.5 * X.ravel() + np.sin(1.5 * np.pi * X.ravel()) + np.random.normal(0, noise, n_samples)
    return X, y

X, y = generate_data(n_samples=100, noise=0.3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create models with different complexities
models = {
    'Underfitting (degree=1)': Pipeline([
        ('poly', PolynomialFeatures(degree=1)),
        ('ridge', Ridge(alpha=0.1))
    ]),
    'Good Fit (degree=3)': Pipeline([
        ('poly', PolynomialFeatures(degree=3)),
        ('ridge', Ridge(alpha=0.1))
    ]),
    'Overfitting (degree=15)': Pipeline([
        ('poly', PolynomialFeatures(degree=15)),
        ('ridge', Ridge(alpha=0.01))
    ])
}

# Train and evaluate models
results = {}
X_plot = np.linspace(0, 1, 100).reshape(-1, 1)

plt.figure(figsize=(15, 5))
for i, (name, model) in enumerate(models.items(), 1):
    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    results[name] = {
        'train_score': train_score,
        'test_score': test_score,
        'predictions': model.predict(X_plot)
    }

    plt.subplot(1, 3, i)
    plt.scatter(X_train, y_train, alpha=0.6, label='Training Data')
    plt.scatter(X_test, y_test, alpha=0.6, label='Test Data')
    plt.plot(X_plot, results[name]['predictions'], 'r-', linewidth=2)
    plt.title(f'{name}\nTrain R²: {train_score:.3f}, Test R²: {test_score:.3f}')
    plt.legend()
    plt.xlabel('X')
    plt.ylabel('y')

plt.tight_layout()
plt.show()

# Print results
print("Model Performance Comparison:")
print("-" * 50)
for name, result in results.items():
    print(f"{name:25s} | Train R²: {result['train_score']:.3f} | Test R²: {result['test_score']:.3f}")

Learning Curves Analysis

def plot_learning_curves(estimator, X, y, title):
    """Plot learning curves to diagnose overfitting/underfitting"""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1, 
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )

    train_scores_mean = -train_scores.mean(axis=1)
    train_scores_std = train_scores.std(axis=1)
    val_scores_mean = -val_scores.mean(axis=1)
    val_scores_std = val_scores.std(axis=1)

    plt.figure(figsize=(8, 6))
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training Error')
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color='r')

    plt.plot(train_sizes, val_scores_mean, 'o-', color='g', label='Validation Error')
    plt.fill_between(train_sizes, val_scores_mean - val_scores_std,
                     val_scores_mean + val_scores_std, alpha=0.1, color='g')

    plt.xlabel('Training Set Size')
    plt.ylabel('Mean Squared Error')
    plt.title(f'Learning Curves: {title}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Analyze different model complexities
for name, model in models.items():
    plot_learning_curves(model, X, y, name)

Validation Curves for Hyperparameter Tuning

def plot_validation_curve(estimator, X, y, param_name, param_range, title):
    """Plot validation curve for hyperparameter tuning"""
    train_scores, val_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range,
        cv=5, scoring='neg_mean_squared_error', n_jobs=-1
    )

    train_scores_mean = -train_scores.mean(axis=1)
    val_scores_mean = -val_scores.mean(axis=1)

    plt.figure(figsize=(8, 6))
    plt.semilogx(param_range, train_scores_mean, 'o-', color='r', label='Training Error')
    plt.semilogx(param_range, val_scores_mean, 'o-', color='g', label='Validation Error')
    plt.xlabel(param_name)
    plt.ylabel('Mean Squared Error')
    plt.title(f'Validation Curve: {title}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Ridge regularization parameter tuning
ridge_model = Pipeline([
    ('poly', PolynomialFeatures(degree=10)),
    ('ridge', Ridge())
])

alpha_range = np.logspace(-4, 2, 20)
plot_validation_curve(ridge_model, X, y, 'ridge__alpha', alpha_range, 'Ridge Alpha')

🔧 From Scratch Implementation

Simple Overfitting Detection Framework

import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple, Callable

class FittingAnalyzer:
    """
    A class to analyze and detect overfitting/underfitting patterns
    """

    def __init__(self, random_state: int = 42):
        self.random_state = random_state
        np.random.seed(random_state)
        self.history = {}

    def generate_polynomial_data(self, n_samples: int = 100, 
                               noise: float = 0.3, 
                               true_degree: int = 3) -> Tuple[np.ndarray, np.ndarray]:
        """Generate synthetic polynomial data for testing"""
        X = np.linspace(0, 1, n_samples).reshape(-1, 1)

        # True function: polynomial of specified degree
        if true_degree == 3:
            y_true = 1.5 * X.ravel() + np.sin(1.5 * np.pi * X.ravel())
        else:
            # Generate random polynomial coefficients
            coeffs = np.random.normal(0, 1, true_degree + 1)
            y_true = sum(coeffs[i] * (X.ravel() ** i) for i in range(true_degree + 1))

        # Add noise
        y = y_true + np.random.normal(0, noise, n_samples)

        return X, y, y_true

    def polynomial_features(self, X: np.ndarray, degree: int) -> np.ndarray:
        """Create polynomial features up to specified degree"""
        n_samples = X.shape[0]
        n_features = degree + 1

        # Create polynomial feature matrix
        X_poly = np.ones((n_samples, n_features))
        for i in range(1, degree + 1):
            X_poly[:, i] = (X[:, 0] ** i)

        return X_poly

    def ridge_regression_fit(self, X: np.ndarray, y: np.ndarray, 
                           alpha: float = 0.01) -> np.ndarray:
        """Fit ridge regression with L2 regularization"""
        # Add regularization to prevent singular matrix
        I = np.eye(X.shape[1])
        I[0, 0] = 0  # Don't regularize intercept

        # Ridge regression solution: (X^T X + λI)^(-1) X^T y
        coefficients = np.linalg.solve(X.T @ X + alpha * I, X.T @ y)

        return coefficients

    def predict(self, X: np.ndarray, coefficients: np.ndarray) -> np.ndarray:
        """Make predictions using fitted coefficients"""
        return X @ coefficients

    def mean_squared_error(self, y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Calculate mean squared error"""
        return np.mean((y_true - y_pred) ** 2)

    def r2_score(self, y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Calculate R² score"""
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        ss_res = np.sum((y_true - y_pred) ** 2)
        return 1 - (ss_res / ss_tot)

    def train_test_split(self, X: np.ndarray, y: np.ndarray, 
                        test_size: float = 0.3) -> Tuple[np.ndarray, ...]:
        """Split data into training and testing sets"""
        n_samples = X.shape[0]
        n_test = int(n_samples * test_size)

        # Random indices for test set
        test_indices = np.random.choice(n_samples, n_test, replace=False)
        train_indices = np.setdiff1d(np.arange(n_samples), test_indices)

        return (X[train_indices], X[test_indices], 
                y[train_indices], y[test_indices])

    def analyze_model_complexity(self, X: np.ndarray, y: np.ndarray,
                                max_degree: int = 15) -> dict:
        """Analyze different polynomial degrees to show overfitting/underfitting"""
        X_train, X_test, y_train, y_test = self.train_test_split(X, y)

        degrees = range(1, max_degree + 1)
        train_errors = []
        test_errors = []
        train_r2s = []
        test_r2s = []

        for degree in degrees:
            # Create polynomial features
            X_train_poly = self.polynomial_features(X_train, degree)
            X_test_poly = self.polynomial_features(X_test, degree)

            # Fit model
            coeffs = self.ridge_regression_fit(X_train_poly, y_train)

            # Make predictions
            y_train_pred = self.predict(X_train_poly, coeffs)
            y_test_pred = self.predict(X_test_poly, coeffs)

            # Calculate metrics
            train_mse = self.mean_squared_error(y_train, y_train_pred)
            test_mse = self.mean_squared_error(y_test, y_test_pred)
            train_r2 = self.r2_score(y_train, y_train_pred)
            test_r2 = self.r2_score(y_test, y_test_pred)

            train_errors.append(train_mse)
            test_errors.append(test_mse)
            train_r2s.append(train_r2)
            test_r2s.append(test_r2)

        results = {
            'degrees': degrees,
            'train_errors': train_errors,
            'test_errors': test_errors,
            'train_r2s': train_r2s,
            'test_r2s': test_r2s
        }

        self.history['complexity_analysis'] = results
        return results

    def plot_complexity_analysis(self, results: dict = None):
        """Plot the complexity analysis results"""
        if results is None:
            results = self.history.get('complexity_analysis')
            if results is None:
                raise ValueError("No complexity analysis results found. Run analyze_model_complexity first.")

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

        # Plot MSE vs complexity
        ax1.plot(results['degrees'], results['train_errors'], 'o-', 
                label='Training Error', color='blue')
        ax1.plot(results['degrees'], results['test_errors'], 'o-', 
                label='Validation Error', color='red')
        ax1.set_xlabel('Polynomial Degree (Model Complexity)')
        ax1.set_ylabel('Mean Squared Error')
        ax1.set_title('Error vs Model Complexity')
        ax1.legend()
        ax1.grid(True, alpha=0.3)

        # Plot R² vs complexity
        ax2.plot(results['degrees'], results['train_r2s'], 'o-', 
                label='Training R²', color='blue')
        ax2.plot(results['degrees'], results['test_r2s'], 'o-', 
                label='Validation R²', color='red')
        ax2.set_xlabel('Polynomial Degree (Model Complexity)')
        ax2.set_ylabel('R² Score')
        ax2.set_title('R² vs Model Complexity')
        ax2.legend()
        ax2.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

    def detect_overfitting(self, train_score: float, test_score: float, 
                          threshold: float = 0.1) -> str:
        """Detect overfitting based on train-test performance gap"""
        gap = train_score - test_score

        if gap > threshold and test_score < 0.7:
            return "Overfitting detected"
        elif train_score < 0.6 and test_score < 0.6:
            return "Underfitting detected"
        else:
            return "Good fit"

# Example usage
if __name__ == "__main__":
    # Initialize analyzer
    analyzer = FittingAnalyzer(random_state=42)

    # Generate synthetic data
    X, y, y_true = analyzer.generate_polynomial_data(n_samples=100, noise=0.2)

    # Analyze model complexity
    print("Analyzing model complexity...")
    results = analyzer.analyze_model_complexity(X, y, max_degree=15)

    # Plot results
    analyzer.plot_complexity_analysis()

    # Find optimal complexity
    optimal_idx = np.argmin(results['test_errors'])
    optimal_degree = results['degrees'][optimal_idx]

    print(f"\nOptimal polynomial degree: {optimal_degree}")
    print(f"Test R² at optimal complexity: {results['test_r2s'][optimal_idx]:.3f}")

    # Detect fitting issues for different complexities
    for i, degree in enumerate([1, optimal_degree, 15]):
        if i < len(results['train_r2s']):
            status = analyzer.detect_overfitting(
                results['train_r2s'][degree-1], 
                results['test_r2s'][degree-1]
            )
            print(f"Degree {degree}: {status}")

⚠️ Assumptions and Limitations

Overfitting Assumptions and Limitations

Assumptions: - Training data is representative of the target population - Test/validation sets are independent and identically distributed - The underlying function exists and is learnable - Sufficient data is available to assess generalization

Limitations: - Data-dependent: Overfitting detection depends on data quality and quantity - Model-specific: Different models overfit in different ways - Metric sensitivity: Choice of evaluation metric affects overfitting detection - Temporal effects: Models may overfit to specific time periods in time series data

Prevention Techniques Limitations

Regularization: - May underfit if regularization is too strong - Requires hyperparameter tuning - Different regularization types (L1, L2) have different effects

Cross-validation: - Computationally expensive for large datasets - May not capture all generalization patterns - Assumes data is i.i.d. (problematic for time series)

Early stopping: - Requires validation set, reducing training data - May stop too early or too late - Sensitive to learning rate and optimization dynamics

Comparison with Other Approaches

vs. Statistical Model Selection: - Advantages: More flexible, works with complex models - Disadvantages: Less theoretical guarantees, more empirical

vs. Bayesian Methods: - Advantages: Simpler implementation, faster computation - Disadvantages: Less principled uncertainty quantification

vs. Ensemble Methods: - Advantages: Interpretable individual models - Disadvantages: May still overfit collectively

❓ Interview Questions

1. What is the fundamental difference between overfitting and underfitting? How do they relate to the bias-variance tradeoff?

Answer: - Overfitting: High variance, low bias - model memorizes training data, fails on new data - Underfitting: High bias, low variance - model too simple to capture underlying patterns - Bias-Variance Tradeoff: - Overfitting: Low training error, high test error (high variance) - Underfitting: High training error, high test error (high bias) - Optimal model: Balance between bias and variance - Total Error = Bias² + Variance + Irreducible Error - Goal: Find the sweet spot that minimizes total expected error

2. How would you detect overfitting in a machine learning model? Provide multiple approaches.

Answer: - Training vs Validation Performance: - Large gap between training and validation accuracy - Training error decreases while validation error increases - Learning Curves: - Training curve continues decreasing - Validation curve plateaus or increases - Cross-Validation: - High variance in cross-validation scores - Mean CV score much lower than training score - Regularization Response: - Model performance improves significantly with regularization - Very sensitive to hyperparameter changes - Statistical Tests: - Significant difference in performance metrics - Bootstrap confidence intervals don't overlap

3. What are the main techniques to prevent overfitting? Explain how each works.

Answer: - Regularization (L1/L2): - Adds penalty term to loss function - L1: Promotes sparsity, L2: Shrinks weights - Controls model complexity - Cross-Validation: - Better estimate of generalization performance - Helps in hyperparameter tuning - Early Stopping: - Stop training when validation error increases - Prevents memorization of training data - Data Augmentation: - Increases effective dataset size - Reduces overfitting to specific training examples - Dropout (Neural Networks): - Randomly deactivates neurons during training - Prevents co-adaptation of features - Ensemble Methods: - Combines multiple models - Reduces variance through averaging

4. In a neural network, you observe that training accuracy reaches 99% but validation accuracy is only 70%. What would you do?

Answer: - Immediate Actions: - Add regularization (L2, dropout) - Reduce model complexity (fewer layers/neurons) - Implement early stopping - Data-Related Solutions: - Collect more training data - Implement data augmentation - Check for data leakage - Architecture Changes: - Use batch normalization - Reduce learning rate - Use different optimizer - Monitoring Strategy: - Plot learning curves - Monitor multiple metrics - Use cross-validation for hyperparameter tuning - Validation: - Ensure train/validation split is appropriate - Check for distribution shift

5. How does the amount of training data affect overfitting and underfitting?

Answer: - More Data Generally: - Reduces overfitting (more examples to learn from) - Allows for more complex models without overfitting - Improves generalization capability - Overfitting with Limited Data: - Models memorize small training sets easily - High variance in model performance - Need simpler models or regularization - Underfitting Scenarios: - Even with more data, simple models may underfit - Complex relationships require complex models regardless of data size - Learning Curves Analysis: - Overfitting: Large gap between training/validation that persists - Underfitting: Both curves plateau at poor performance - Good fit: Curves converge to good performance

6. Explain the concept of model complexity and how it relates to overfitting. How do you choose the right complexity?

Answer: - Model Complexity Definition: - Number of parameters/features in the model - Flexibility of the model to fit different patterns - Measured by VC dimension, degrees of freedom, etc. - Relationship to Overfitting: - Higher complexity → Higher risk of overfitting - Lower complexity → Higher risk of underfitting - Sweet spot depends on data size and problem complexity - Choosing Right Complexity: - Validation curves: Plot performance vs complexity parameter - Cross-validation: Use CV to select optimal hyperparameters - Information criteria: AIC, BIC for statistical models - Regularization path: Analyze performance across regularization strengths - Practical Guidelines: - Start simple, increase complexity if needed - Use domain knowledge to guide complexity choices - Consider computational constraints

7. What is the difference between training error, validation error, and test error? How do they help diagnose overfitting?

Answer: - Training Error: - Error on data used to train the model - Always optimistic estimate of true performance - Decreases as model complexity increases - Validation Error: - Error on held-out data during model development - Used for hyperparameter tuning and model selection - Estimates generalization performance - Test Error: - Error on completely unseen data - Final unbiased estimate of model performance - Should only be used once at the end - Overfitting Diagnosis: - Overfitting: Training error << Validation error - Underfitting: Training error H Validation error (both high) - Good fit: Training error H Validation error H Test error (all reasonable) - Best Practices: - Never tune based on test error - Use validation error for all model development decisions - Report test error as final performance estimate

8. In time series forecasting, how does overfitting manifest differently than in traditional ML problems?

Answer: - Temporal Dependencies: - Models can overfit to specific time patterns - Random CV splits break temporal structure - Need time-aware validation (walk-forward, time series CV) - Common Overfitting Patterns: - Memorizing seasonal patterns that don't generalize - Over-relying on recent data points - Fitting noise in historical data - Detection Methods: - Use time-series cross-validation - Monitor performance on future time periods - Check residual patterns for autocorrelation - Prevention Techniques: - Use simpler models for shorter horizons - Apply temporal regularization - Implement proper feature engineering - Use ensemble methods with different time windows - Validation Strategy: - Split data chronologically - Use expanding or sliding window validation - Test on multiple future periods

9. How do ensemble methods help with overfitting? What are their limitations?

Answer: - How Ensembles Help: - Variance Reduction: Averaging reduces individual model variance - Error Diversification: Different models make different mistakes - Robustness: Less sensitive to outliers or noise - Regularization Effect: Combining models acts as implicit regularization - Types of Ensembles: - Bagging: Reduces variance (Random Forest) - Boosting: Reduces bias (AdaBoost, Gradient Boosting) - Stacking: Learns optimal combination of models - Limitations: - Increased Complexity: Harder to interpret and debug - Computational Cost: More expensive to train and predict - Diminishing Returns: Adding more models may not help - Can Still Overfit: Ensemble can collectively overfit - Best Practices: - Use diverse base models - Apply regularization to ensemble combination - Monitor ensemble performance on validation data - Consider ensemble size vs performance tradeoff

10. You have a dataset with 1000 samples and are training a neural network with 1 million parameters. What issues might you face and how would you address them?

Answer: - Primary Issue: Severe overfitting due to parameter/sample ratio (1000:1) - Expected Problems: - Model will memorize training data - Very poor generalization performance - High variance in predictions - Unstable training dynamics - Solutions: - Data: Collect more data, use data augmentation - Architecture: Reduce network size, use simpler models - Regularization: Heavy dropout, L2 regularization, batch normalization - Training: Early stopping, lower learning rates - Alternative Approaches: Transfer learning, pre-trained models - Monitoring Strategy: - Use aggressive cross-validation - Monitor training/validation gap closely - Consider using simpler models as baselines - Rule of Thumb: - Generally need 10x more samples than parameters - For deep learning, often need much more - Consider domain complexity when sizing models

📝 Examples

Real-World Example: House Price Prediction

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# Generate realistic house price dataset
np.random.seed(42)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Add meaningful feature names
feature_names = ['Size_sqft', 'Bedrooms', 'Age_years', 'Location_score', 'Condition_score']
X_df = pd.DataFrame(X, columns=feature_names)

# Make target more realistic (house prices in thousands)
y = np.abs(y) * 10 + 300  # Prices between $300K - $800K approximately

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.3, random_state=42)

print("House Price Prediction: Overfitting vs Underfitting Analysis")
print("=" * 60)

# Model 1: Underfitting (too simple)
print("\n1. UNDERFITTING EXAMPLE:")
print("-" * 30)

# Use only one feature (house size)
simple_model = LinearRegression()
simple_model.fit(X_train[['Size_sqft']], y_train)

train_score_simple = simple_model.score(X_train[['Size_sqft']], y_train)
test_score_simple = simple_model.score(X_test[['Size_sqft']], y_test)

print(f"Simple Model (Size only):")
print(f"Training R²: {train_score_simple:.3f}")
print(f"Test R²: {test_score_simple:.3f}")
print(f"Performance Gap: {abs(train_score_simple - test_score_simple):.3f}")
print("Analysis: Both scores are low → UNDERFITTING")

# Model 2: Good fit
print("\n2. GOOD FIT EXAMPLE:")
print("-" * 30)

good_model = Ridge(alpha=1.0)
good_model.fit(X_train, y_train)

train_score_good = good_model.score(X_train, y_train)
test_score_good = good_model.score(X_test, y_test)

print(f"Ridge Model (All features):")
print(f"Training R²: {train_score_good:.3f}")
print(f"Test R²: {test_score_good:.3f}")
print(f"Performance Gap: {abs(train_score_good - test_score_good):.3f}")
print("Analysis: Both scores reasonable, small gap → GOOD FIT")

# Model 3: Overfitting (too complex)
print("\n3. OVERFITTING EXAMPLE:")
print("-" * 30)

# Create high-degree polynomial features
overfit_model = Pipeline([
    ('poly', PolynomialFeatures(degree=8, include_bias=False)),
    ('linear', LinearRegression())
])

overfit_model.fit(X_train, y_train)

train_score_overfit = overfit_model.score(X_train, y_train)
test_score_overfit = overfit_model.score(X_test, y_test)

print(f"Polynomial Model (degree=8):")
print(f"Training R²: {train_score_overfit:.3f}")
print(f"Test R²: {test_score_overfit:.3f}")
print(f"Performance Gap: {abs(train_score_overfit - test_score_overfit):.3f}")
print("Analysis: High training score, low test score → OVERFITTING")

# Cross-validation analysis
print("\n4. CROSS-VALIDATION ANALYSIS:")
print("-" * 30)

models = {
    'Simple': Pipeline([('select', 'passthrough'), ('model', LinearRegression())]),
    'Good Fit': Ridge(alpha=1.0),
    'Complex': Pipeline([('poly', PolynomialFeatures(degree=8)), ('model', LinearRegression())])
}

for name, model in models.items():
    if name == 'Simple':
        cv_scores = cross_val_score(LinearRegression(), X_train[['Size_sqft']], y_train, cv=5, scoring='r2')
    else:
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

    print(f"{name:12s}: Mean CV R² = {cv_scores.mean():.3f} (±{cv_scores.std():.3f})")

# Learning curves visualization
def plot_learning_curve_example():
    train_sizes = np.linspace(0.1, 1.0, 10)

    models_to_plot = {
        'Simple (Underfit)': LinearRegression(),
        'Good Fit (Ridge)': Ridge(alpha=1.0),
        'Complex (Overfit)': Pipeline([
            ('poly', PolynomialFeatures(degree=8)),
            ('model', LinearRegression())
        ])
    }

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    for i, (name, model) in enumerate(models_to_plot.items()):
        train_errors = []
        val_errors = []

        for train_size in train_sizes:
            n_train = int(train_size * len(X_train))
            X_subset = X_train.iloc[:n_train] if name != 'Simple (Underfit)' else X_train[['Size_sqft']].iloc[:n_train]
            y_subset = y_train[:n_train]

            # Fit model
            model.fit(X_subset, y_subset)

            # Training error
            train_pred = model.predict(X_subset)
            train_mse = np.mean((y_subset - train_pred) ** 2)
            train_errors.append(train_mse)

            # Validation error (use a separate validation set)
            X_val = X_test if name != 'Simple (Underfit)' else X_test[['Size_sqft']]
            val_pred = model.predict(X_val)
            val_mse = np.mean((y_test - val_pred) ** 2)
            val_errors.append(val_mse)

        axes[i].plot(train_sizes * len(X_train), train_errors, 'o-', label='Training Error', color='blue')
        axes[i].plot(train_sizes * len(X_train), val_errors, 'o-', label='Validation Error', color='red')
        axes[i].set_xlabel('Training Set Size')
        axes[i].set_ylabel('Mean Squared Error')
        axes[i].set_title(f'Learning Curve: {name}')
        axes[i].legend()
        axes[i].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

print("\n5. LEARNING CURVES:")
print("-" * 30)
print("Plotting learning curves for visual analysis...")
plot_learning_curve_example()

# Practical recommendations
print("\n6. PRACTICAL RECOMMENDATIONS:")
print("-" * 30)
print("For this house price prediction problem:")
print("" Simple model: Add more features (bedrooms, age, location)")
print("" Good fit model: Current Ridge regression is appropriate")  
print("" Complex model: Reduce polynomial degree or increase regularization")
print("" Consider collecting more data if available")
print("" Feature engineering might help more than complex models")

Output Analysis: - Underfitting: Simple model using only house size shows poor performance on both training and test data - Good Fit: Ridge regression with all features shows balanced performance - Overfitting: High-degree polynomial model shows perfect training performance but poor test performance - Learning Curves: Reveal the characteristic patterns of each fitting scenario

📚 References

Books:
The Elements of Statistical Learning - Hastie, Tibshirani, Friedman
Pattern Recognition and Machine Learning - Bishop
Hands-On Machine Learning - Aurélien Géron
Papers:
A Few Useful Things to Know About Machine Learning - Domingos
Understanding the Bias-Variance Tradeoff
Online Resources:
Scikit-learn Model Evaluation
Andrew Ng's Machine Learning Course - Stanford
Fast.ai Practical Deep Learning
Documentation:
Scikit-learn Cross-Validation
TensorFlow Regularization
PyTorch Model Selection