📘 Gradient Boosting

Gradient Boosting is an ensemble machine learning technique that builds models sequentially, where each new model corrects the errors made by the previous models, creating a strong predictor from many weak learners.

Resources: Scikit-learn Gradient Boosting | XGBoost Documentation | Original Paper by Friedman

✍️ Summary

Gradient Boosting is a machine learning ensemble method that combines multiple weak predictors (typically decision trees) to create a strong predictor. It works by iteratively adding models that predict the residual errors of the previous models, gradually improving the overall prediction.

Key characteristics: - Sequential learning: Models are built one after another - Error correction: Each model focuses on correcting previous mistakes - Gradient descent: Uses gradient descent to minimize loss function - Flexible: Can handle different loss functions and data types - High accuracy: Often achieves state-of-the-art performance

Applications: - Ranking problems (web search, recommendation systems) - Regression tasks with complex patterns - Classification with high accuracy requirements - Feature importance analysis - Kaggle competitions (very popular) - Financial modeling and risk assessment

Popular implementations: - Gradient Boosting Machines (GBM): Original implementation - XGBoost: Extreme Gradient Boosting (optimized) - LightGBM: Microsoft's fast implementation - CatBoost: Handles categorical features well

🧠 Intuition

How Gradient Boosting Works

Imagine you're trying to hit a target with arrows. After your first shot, you see where you missed and adjust your aim for the second shot. Gradient Boosting works similarly - each new model tries to correct the "mistakes" (residuals) of the combined previous models.

Mathematical Foundation

1. General Algorithm

Given training data \((x_i, y_i)\) for \(i = 1, ..., n\), Gradient Boosting learns a function \(F(x)\) that minimizes a loss function \(L(y, F(x))\).

Algorithm steps: 1. Initialize with a constant: \(F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)\) 2. For \(m = 1\) to \(M\) (number of iterations): - Compute negative gradients: \(r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\) - Fit weak learner \(h_m(x)\) to targets \(r_{im}\) - Find optimal step size: \(\gamma_m = \arg\min_\gamma \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))\) - Update: \(F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)\)

2. Loss Functions

For Regression: - Squared Loss: \(L(y, F) = \frac{1}{2}(y - F)^2\), negative gradient: \(r = y - F\) - Absolute Loss: \(L(y, F) = |y - F|\), negative gradient: \(r = \text{sign}(y - F)\) - Huber Loss: Combines squared and absolute loss

For Classification: - Logistic Loss: \(L(y, F) = \log(1 + e^{-yF})\) - Exponential Loss: \(L(y, F) = e^{-yF}\) (AdaBoost)

3. Regularization

To prevent overfitting: - Learning rate \(\nu\): \(F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)\) - Tree depth: Limit complexity of weak learners - Subsampling: Use random subset of data for each iteration - Feature subsampling: Use random subset of features

Key Insights

Residual fitting: Each model predicts what previous models missed
Gradient descent: Follows gradient to minimize loss function
Bias-variance tradeoff: Reduces bias while controlling variance
Sequential dependency: Cannot be parallelized easily (unlike Random Forest)

🔢 Implementation using Libraries

Using Scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression, make_classification, load_boston
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report
from sklearn.tree import DecisionTreeRegressor
import seaborn as sns

# Regression Example
# Generate sample data
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    noise=0.1,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(
    n_estimators=100,          # number of boosting stages
    learning_rate=0.1,         # shrinkage parameter
    max_depth=3,              # max depth of individual trees
    min_samples_split=20,      # min samples to split
    min_samples_leaf=10,       # min samples in leaf
    subsample=0.8,            # fraction of samples for each tree
    random_state=42
)

gb_regressor.fit(X_train, y_train)

# Make predictions
y_pred = gb_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.3f}")

# Plot feature importance
feature_importance = gb_regressor.feature_importances_
indices = np.argsort(feature_importance)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), feature_importance[indices])
plt.title("Feature Importance - Gradient Boosting")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.show()

# Classification Example
X_class, y_class = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42
)

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

# Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_classifier.fit(X_train_c, y_train_c)

# Predictions and evaluation
y_pred_c = gb_classifier.predict(X_test_c)
accuracy = accuracy_score(y_test_c, y_pred_c)
print(f"Classification Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_c))

# Plot learning curve
test_scores = []
train_scores = []

for i, pred in enumerate(gb_regressor.staged_predict(X_test)):
    test_scores.append(mean_squared_error(y_test, pred))

for i, pred in enumerate(gb_regressor.staged_predict(X_train)):
    train_scores.append(mean_squared_error(y_train, pred))

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(test_scores) + 1), test_scores, label='Test Error')
plt.plot(range(1, len(train_scores) + 1), train_scores, label='Train Error')
plt.xlabel('Boosting Iterations')
plt.ylabel('Mean Squared Error')
plt.title('Gradient Boosting Learning Curve')
plt.legend()
plt.show()

Using XGBoost (Advanced Implementation)

import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load dataset
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create XGBoost regressor
xgb_regressor = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# Fit model
xgb_regressor.fit(X_train, y_train)

# Predictions
y_pred_xgb = xgb_regressor.predict(X_test)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
print(f"XGBoost MSE: {mse_xgb:.3f}")

# Plot feature importance
xgb.plot_importance(xgb_regressor, max_num_features=10)
plt.title("XGBoost Feature Importance")
plt.show()

⚙️ From Scratch Implementation

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

class GradientBoostingRegressor:
    """
    Gradient Boosting Regressor implementation from scratch.
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        """
        Initialize Gradient Boosting Regressor.

        Parameters:
        -----------
        n_estimators : int, number of boosting stages
        learning_rate : float, shrinkage parameter
        max_depth : int, maximum depth of individual trees
        """
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.models = []
        self.initial_prediction = None

    def fit(self, X, y):
        """
        Fit gradient boosting model.

        Parameters:
        -----------
        X : array-like, shape = [n_samples, n_features]
        y : array-like, shape = [n_samples]
        """
        # Initialize with mean of target values
        self.initial_prediction = np.mean(y)

        # Initialize predictions with constant
        predictions = np.full_like(y, self.initial_prediction, dtype=float)

        for i in range(self.n_estimators):
            # Compute negative gradients (residuals for squared loss)
            residuals = y - predictions

            # Fit weak learner to residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)

            # Make predictions with current tree
            tree_predictions = tree.predict(X)

            # Update overall predictions
            predictions += self.learning_rate * tree_predictions

            # Store the model
            self.models.append(tree)

    def predict(self, X):
        """
        Make predictions using the fitted model.

        Parameters:
        -----------
        X : array-like, shape = [n_samples, n_features]

        Returns:
        --------
        predictions : array-like, shape = [n_samples]
        """
        # Start with initial prediction
        predictions = np.full(X.shape[0], self.initial_prediction, dtype=float)

        # Add predictions from each tree
        for model in self.models:
            predictions += self.learning_rate * model.predict(X)

        return predictions

    def staged_predict(self, X):
        """
        Predict at each stage for plotting learning curves.

        Parameters:
        -----------
        X : array-like, shape = [n_samples, n_features]

        Yields:
        -------
        predictions : array-like, shape = [n_samples]
        """
        predictions = np.full(X.shape[0], self.initial_prediction, dtype=float)
        yield predictions.copy()

        for model in self.models:
            predictions += self.learning_rate * model.predict(X)
            yield predictions.copy()

# Example usage
if __name__ == "__main__":
    # Generate sample data
    X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)

    # Split data
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create and train our model
    gb_scratch = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, max_depth=3)
    gb_scratch.fit(X_train, y_train)

    # Make predictions
    y_pred_scratch = gb_scratch.predict(X_test)

    # Calculate MSE
    mse_scratch = np.mean((y_test - y_pred_scratch) ** 2)
    print(f"From-scratch MSE: {mse_scratch:.3f}")

    # Compare with sklearn
    from sklearn.ensemble import GradientBoostingRegressor as SklearnGB
    sklearn_gb = SklearnGB(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)
    sklearn_gb.fit(X_train, y_train)
    y_pred_sklearn = sklearn_gb.predict(X_test)
    mse_sklearn = np.mean((y_test - y_pred_sklearn) ** 2)
    print(f"Sklearn MSE: {mse_sklearn:.3f}")

    # Plot comparison
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    plt.scatter(y_test, y_pred_scratch, alpha=0.6)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
    plt.xlabel('True Values')
    plt.ylabel('Predicted Values')
    plt.title('From Scratch Implementation')

    plt.subplot(1, 2, 2)
    plt.scatter(y_test, y_pred_sklearn, alpha=0.6)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
    plt.xlabel('True Values')
    plt.ylabel('Predicted Values')
    plt.title('Sklearn Implementation')

    plt.tight_layout()
    plt.show()

⚠️ Assumptions and Limitations

Assumptions

Weak learners are better than random: Each base model should perform slightly better than chance
Sequential dependency: Models are built sequentially, not independently
Gradient computability: Loss function must be differentiable
Sufficient data: Needs adequate training data to avoid overfitting

Limitations

1. Overfitting Risk

Can easily overfit with too many iterations
Requires careful tuning of hyperparameters
Solution: Use validation set, early stopping, and regularization

2. Sequential Training

Cannot parallelize training (unlike Random Forest)
Slower to train on large datasets
Alternative: Use parallelized versions like LightGBM

3. Hyperparameter Sensitivity

Performance highly dependent on hyperparameter tuning
Many parameters to optimize (learning rate, depth, iterations)
Solution: Use automated hyperparameter tuning

4. Memory Usage

Stores all weak learners
Can become memory-intensive with many iterations
Solution: Limit number of estimators

5. Prediction Time

Slower prediction than single models
Each prediction requires all weak learners
Trade-off: Accuracy vs speed

Comparison with Other Methods

Method	Accuracy	Speed	Interpretability	Parallelization
Gradient Boosting	High	Slow	Low	No (training)
Random Forest	High	Fast	Medium	Yes
Single Decision Tree	Medium	Fast	High	N/A
Linear Models	Low-Medium	Very Fast	High	Yes
Neural Networks	High	Variable	Low	Yes

When to Use vs Avoid

Use Gradient Boosting when: - High accuracy is crucial - You have sufficient computational resources - Data is not extremely noisy - You can invest time in hyperparameter tuning

Avoid Gradient Boosting when: - Real-time predictions are critical - Interpretability is most important - Training time is heavily constrained - Data is very noisy or has many outliers

💡 Interview Questions

1. How does Gradient Boosting differ from Random Forest?

Answer: - Training: Gradient Boosting builds trees sequentially where each tree corrects errors of previous ones, while Random Forest builds trees independently in parallel - Overfitting: Gradient Boosting is more prone to overfitting due to sequential error correction, Random Forest reduces overfitting through averaging - Speed: Random Forest is faster to train due to parallelization, Gradient Boosting is sequential - Bias-Variance: Gradient Boosting reduces bias primarily, Random Forest reduces variance - Hyperparameters: Gradient Boosting has more critical hyperparameters (learning rate, n_estimators) that need careful tuning

2. Explain the mathematical intuition behind gradient boosting. How does it use gradients?

Answer: Gradient Boosting minimizes a loss function using gradient descent in function space: - At each iteration, it computes negative gradients of the loss function with respect to current predictions - These gradients represent the direction of steepest decrease in the loss - A new weak learner is trained to predict these gradients (residuals for squared loss) - The predictions are updated by adding the new model's output scaled by a learning rate - This process is repeated until convergence or max iterations reached

For squared loss: gradient = y - F(x) (actual residual) For logistic loss: gradient = y - p(x) (probability residual)

3. What role does the learning rate play in Gradient Boosting? How do you choose it?

Answer: The learning rate (η) controls how much each weak learner contributes to the final prediction: - Small η (0.01-0.1): More conservative updates, requires more iterations but often better generalization - Large η (0.3-1.0): Faster learning but higher overfitting risk - Trade-off: Lower learning rate with more estimators often yields better results - Selection: Use validation curves or cross-validation to find optimal value - Common practice: Start with η=0.1, then try η=0.05 with 2x estimators or η=0.2 with 0.5x estimators

4. How do you prevent overfitting in Gradient Boosting?

Answer: Multiple regularization techniques: - Learning rate: Lower values (0.01-0.1) prevent overfitting - Tree depth: Limit max_depth (3-8) to keep weak learners simple - Subsampling: Use fraction of data for each tree (0.5-0.8) - Feature subsampling: Use random subset of features per split - Early stopping: Monitor validation error and stop when it starts increasing - Minimum samples: Set min_samples_split and min_samples_leaf - Cross-validation: Use CV to select optimal number of estimators

5. What are the advantages of XGBoost over traditional Gradient Boosting?

Answer: XGBoost improvements: - Regularization: Built-in L1 and L2 regularization in objective function - Missing values: Handles missing values automatically by learning best direction - Parallelization: Parallel tree construction (not sequential like boosting stages) - Speed: Optimized implementation with caching and approximation algorithms - Memory efficiency: Block structure for out-of-core computation - Cross-validation: Built-in cross-validation during training - Flexibility: More loss functions and evaluation metrics - Pruning: Bottom-up tree pruning removes splits with negative gain

6. How would you tune hyperparameters for a Gradient Boosting model?

Answer: Systematic approach: 1. Start with defaults: n_estimators=100, learning_rate=0.1, max_depth=3 2. Tune tree parameters: max_depth (3-10), min_samples_split (10-50) 3. Optimize learning rate and estimators: Lower learning_rate, increase n_estimators 4. Add regularization: subsample (0.6-0.9), max_features 5. Use techniques: Grid search, random search, or Bayesian optimization 6. Validation: Use time-series split for temporal data, stratified CV for classification 7. Monitor: Plot validation curves to detect overfitting

Example order: max_depth → n_estimators & learning_rate → subsampling → feature selection

7. Explain the difference between AdaBoost and Gradient Boosting.

Answer: Key differences: - Error focus: AdaBoost reweights misclassified samples, Gradient Boosting fits residuals - Loss function: AdaBoost uses exponential loss, Gradient Boosting can use various losses - Weight updates: AdaBoost changes sample weights, Gradient Boosting changes predictions - Flexibility: Gradient Boosting works with any differentiable loss, AdaBoost is more restrictive - Outlier sensitivity: AdaBoost very sensitive to outliers, Gradient Boosting less so - Base learners: AdaBoost typically uses stumps, Gradient Boosting uses deeper trees - Applications: AdaBoost mainly classification, Gradient Boosting both regression and classification

8. How do you interpret feature importance in Gradient Boosting?

Answer: Feature importance calculation: - Frequency-based: How often a feature is used for splits across all trees - Gain-based: Average improvement in objective function when feature is used - Permutation importance: Decrease in model performance when feature values are randomly permuted - SHAP values: Game-theoretic approach showing contribution of each feature to predictions

Interpretation tips: - Higher values indicate more important features - Consider feature interactions and multicollinearity - Use multiple importance measures for robustness - Validate importance with domain knowledge

9. When would you choose Gradient Boosting over Deep Learning?

Answer: Choose Gradient Boosting when: - Tabular data: Works exceptionally well on structured data - Small to medium datasets: Less prone to overfitting than deep learning - Interpretability needed: Feature importance and decision paths are clearer - No image/text data: Deep learning excels with unstructured data - Quick deployment: Faster to train and tune than neural networks - Limited computational resources: Less GPU dependency - Heterogeneous features: Mix of numerical and categorical features - Proven track record: Dominates many Kaggle tabular competitions

10. How does the choice of loss function affect Gradient Boosting performance?

Answer: Loss function impacts: - Squared loss: Sensitive to outliers, good for normal residuals - Absolute loss: Robust to outliers, good for heavy-tailed distributions - Huber loss: Combines benefits of both, balanced approach - Logistic loss: For classification, provides probability estimates - Custom loss: Can optimize specific business metrics

Selection guidelines: - Analyze residual distribution - Consider outlier presence - Match business objective (e.g., quantile loss for different percentiles) - Use validation to compare different loss functions

🧠 Examples

Real-world Example: House Price Prediction

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load Boston housing dataset
boston = load_boston()
X, y = boston.data, boston.target
feature_names = boston.feature_names

# Create DataFrame for better visualization
df = pd.DataFrame(X, columns=feature_names)
df['PRICE'] = y

print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"Features: {list(feature_names)}")
print("\nFirst few rows:")
print(df.head())

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train Gradient Boosting model
gb_model = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    min_samples_split=20,
    min_samples_leaf=10,
    subsample=0.8,
    random_state=42
)

gb_model.fit(X_train, y_train)

# Make predictions
y_pred = gb_model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"\nModel Performance:")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.3f}")

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': gb_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(15, 5))

# Plot 1: Predictions vs Actual
plt.subplot(1, 3, 1)
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title(f'Predictions vs Actual (R² = {r2:.3f})')

# Plot 2: Feature Importance
plt.subplot(1, 3, 2)
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.xlabel('Feature Importance')
plt.title('Top 10 Most Important Features')
plt.gca().invert_yaxis()

# Plot 3: Learning Curve
plt.subplot(1, 3, 3)
train_scores, test_scores = [], []
for pred_train, pred_test in zip(gb_model.staged_predict(X_train), 
                                gb_model.staged_predict(X_test)):
    train_scores.append(mean_squared_error(y_train, pred_train))
    test_scores.append(mean_squared_error(y_test, pred_test))

plt.plot(range(1, len(train_scores) + 1), train_scores, label='Train MSE')
plt.plot(range(1, len(test_scores) + 1), test_scores, label='Test MSE')
plt.xlabel('Boosting Iterations')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curve')
plt.legend()

plt.tight_layout()
plt.show()

# Residual analysis
residuals = y_test - y_pred

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')

plt.subplot(1, 2, 2)
plt.hist(residuals, bins=20, alpha=0.7)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')

plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
print(feature_importance.head())

# Business insights
print("\nBusiness Insights:")
print("1. LSTAT (% lower status population) is the most important predictor")
print("2. RM (average rooms per dwelling) significantly affects price")
print("3. DIS (distance to employment centers) impacts housing values")
print("4. The model explains {:.1f}% of price variation".format(r2 * 100))

Example: Multi-class Classification

from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train gradient boosting classifier
gb_clf = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_clf.fit(X_train, y_train)

# Predictions
y_pred = gb_clf.predict(X_test)
y_pred_proba = gb_clf.predict_proba(X_test)

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=wine.target_names, 
            yticklabels=wine.target_names)
plt.title('Confusion Matrix - Wine Classification')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Feature importance for classification
feature_importance_clf = pd.DataFrame({
    'feature': wine.feature_names,
    'importance': gb_clf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 5 Most Important Features for Wine Classification:")
print(feature_importance_clf.head())

📚 References

Books

"The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman - Chapter 10
"Hands-On Machine Learning" by Aurélien Géron - Ensemble Methods chapter
"Pattern Recognition and Machine Learning" by Christopher Bishop
"Machine Learning: A Probabilistic Perspective" by Kevin Murphy