π Gradient Boosting
Gradient Boosting is an ensemble machine learning technique that builds models sequentially, where each new model corrects the errors made by the previous models, creating a strong predictor from many weak learners.
Resources: Scikit-learn Gradient Boosting | XGBoost Documentation | Original Paper by Friedman
βοΈ Summary
Gradient Boosting is a machine learning ensemble method that combines multiple weak predictors (typically decision trees) to create a strong predictor. It works by iteratively adding models that predict the residual errors of the previous models, gradually improving the overall prediction.
Key characteristics: - Sequential learning: Models are built one after another - Error correction: Each model focuses on correcting previous mistakes - Gradient descent: Uses gradient descent to minimize loss function - Flexible: Can handle different loss functions and data types - High accuracy: Often achieves state-of-the-art performance
Applications: - Ranking problems (web search, recommendation systems) - Regression tasks with complex patterns - Classification with high accuracy requirements - Feature importance analysis - Kaggle competitions (very popular) - Financial modeling and risk assessment
Popular implementations: - Gradient Boosting Machines (GBM): Original implementation - XGBoost: Extreme Gradient Boosting (optimized) - LightGBM: Microsoft's fast implementation - CatBoost: Handles categorical features well
π§ Intuition
How Gradient Boosting Works
Imagine you're trying to hit a target with arrows. After your first shot, you see where you missed and adjust your aim for the second shot. Gradient Boosting works similarly - each new model tries to correct the "mistakes" (residuals) of the combined previous models.
Mathematical Foundation
1. General Algorithm
Given training data \((x_i, y_i)\) for \(i = 1, ..., n\), Gradient Boosting learns a function \(F(x)\) that minimizes a loss function \(L(y, F(x))\).
Algorithm steps: 1. Initialize with a constant: \(F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)\) 2. For \(m = 1\) to \(M\) (number of iterations): - Compute negative gradients: \(r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\) - Fit weak learner \(h_m(x)\) to targets \(r_{im}\) - Find optimal step size: \(\gamma_m = \arg\min_\gamma \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))\) - Update: \(F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)\)
2. Loss Functions
For Regression: - Squared Loss: \(L(y, F) = \frac{1}{2}(y - F)^2\), negative gradient: \(r = y - F\) - Absolute Loss: \(L(y, F) = |y - F|\), negative gradient: \(r = \text{sign}(y - F)\) - Huber Loss: Combines squared and absolute loss
For Classification: - Logistic Loss: \(L(y, F) = \log(1 + e^{-yF})\) - Exponential Loss: \(L(y, F) = e^{-yF}\) (AdaBoost)
3. Regularization
To prevent overfitting: - Learning rate \(\nu\): \(F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)\) - Tree depth: Limit complexity of weak learners - Subsampling: Use random subset of data for each iteration - Feature subsampling: Use random subset of features
Key Insights
- Residual fitting: Each model predicts what previous models missed
- Gradient descent: Follows gradient to minimize loss function
- Bias-variance tradeoff: Reduces bias while controlling variance
- Sequential dependency: Cannot be parallelized easily (unlike Random Forest)
π’ Implementation using Libraries
Using Scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression, make_classification, load_boston
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report
from sklearn.tree import DecisionTreeRegressor
import seaborn as sns
# Regression Example
# Generate sample data
X, y = make_regression(
n_samples=1000,
n_features=10,
n_informative=5,
noise=0.1,
random_state=42
)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create and train Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(
n_estimators=100, # number of boosting stages
learning_rate=0.1, # shrinkage parameter
max_depth=3, # max depth of individual trees
min_samples_split=20, # min samples to split
min_samples_leaf=10, # min samples in leaf
subsample=0.8, # fraction of samples for each tree
random_state=42
)
gb_regressor.fit(X_train, y_train)
# Make predictions
y_pred = gb_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.3f}")
# Plot feature importance
feature_importance = gb_regressor.feature_importances_
indices = np.argsort(feature_importance)[::-1]
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), feature_importance[indices])
plt.title("Feature Importance - Gradient Boosting")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.show()
# Classification Example
X_class, y_class = make_classification(
n_samples=1000,
n_features=10,
n_informative=5,
n_redundant=2,
n_clusters_per_class=1,
random_state=42
)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_class, y_class, test_size=0.2, random_state=42
)
# Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
gb_classifier.fit(X_train_c, y_train_c)
# Predictions and evaluation
y_pred_c = gb_classifier.predict(X_test_c)
accuracy = accuracy_score(y_test_c, y_pred_c)
print(f"Classification Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_c))
# Plot learning curve
test_scores = []
train_scores = []
for i, pred in enumerate(gb_regressor.staged_predict(X_test)):
test_scores.append(mean_squared_error(y_test, pred))
for i, pred in enumerate(gb_regressor.staged_predict(X_train)):
train_scores.append(mean_squared_error(y_train, pred))
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(test_scores) + 1), test_scores, label='Test Error')
plt.plot(range(1, len(train_scores) + 1), train_scores, label='Train Error')
plt.xlabel('Boosting Iterations')
plt.ylabel('Mean Squared Error')
plt.title('Gradient Boosting Learning Curve')
plt.legend()
plt.show()
Using XGBoost (Advanced Implementation)
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
# Load dataset
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create XGBoost regressor
xgb_regressor = xgb.XGBRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
subsample=0.8,
colsample_bytree=0.8,
random_state=42
)
# Fit model
xgb_regressor.fit(X_train, y_train)
# Predictions
y_pred_xgb = xgb_regressor.predict(X_test)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
print(f"XGBoost MSE: {mse_xgb:.3f}")
# Plot feature importance
xgb.plot_importance(xgb_regressor, max_num_features=10)
plt.title("XGBoost Feature Importance")
plt.show()
βοΈ From Scratch Implementation
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
class GradientBoostingRegressor:
"""
Gradient Boosting Regressor implementation from scratch.
"""
def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
"""
Initialize Gradient Boosting Regressor.
Parameters:
-----------
n_estimators : int, number of boosting stages
learning_rate : float, shrinkage parameter
max_depth : int, maximum depth of individual trees
"""
self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.max_depth = max_depth
self.models = []
self.initial_prediction = None
def fit(self, X, y):
"""
Fit gradient boosting model.
Parameters:
-----------
X : array-like, shape = [n_samples, n_features]
y : array-like, shape = [n_samples]
"""
# Initialize with mean of target values
self.initial_prediction = np.mean(y)
# Initialize predictions with constant
predictions = np.full_like(y, self.initial_prediction, dtype=float)
for i in range(self.n_estimators):
# Compute negative gradients (residuals for squared loss)
residuals = y - predictions
# Fit weak learner to residuals
tree = DecisionTreeRegressor(max_depth=self.max_depth)
tree.fit(X, residuals)
# Make predictions with current tree
tree_predictions = tree.predict(X)
# Update overall predictions
predictions += self.learning_rate * tree_predictions
# Store the model
self.models.append(tree)
def predict(self, X):
"""
Make predictions using the fitted model.
Parameters:
-----------
X : array-like, shape = [n_samples, n_features]
Returns:
--------
predictions : array-like, shape = [n_samples]
"""
# Start with initial prediction
predictions = np.full(X.shape[0], self.initial_prediction, dtype=float)
# Add predictions from each tree
for model in self.models:
predictions += self.learning_rate * model.predict(X)
return predictions
def staged_predict(self, X):
"""
Predict at each stage for plotting learning curves.
Parameters:
-----------
X : array-like, shape = [n_samples, n_features]
Yields:
-------
predictions : array-like, shape = [n_samples]
"""
predictions = np.full(X.shape[0], self.initial_prediction, dtype=float)
yield predictions.copy()
for model in self.models:
predictions += self.learning_rate * model.predict(X)
yield predictions.copy()
# Example usage
if __name__ == "__main__":
# Generate sample data
X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train our model
gb_scratch = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, max_depth=3)
gb_scratch.fit(X_train, y_train)
# Make predictions
y_pred_scratch = gb_scratch.predict(X_test)
# Calculate MSE
mse_scratch = np.mean((y_test - y_pred_scratch) ** 2)
print(f"From-scratch MSE: {mse_scratch:.3f}")
# Compare with sklearn
from sklearn.ensemble import GradientBoostingRegressor as SklearnGB
sklearn_gb = SklearnGB(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)
sklearn_gb.fit(X_train, y_train)
y_pred_sklearn = sklearn_gb.predict(X_test)
mse_sklearn = np.mean((y_test - y_pred_sklearn) ** 2)
print(f"Sklearn MSE: {mse_sklearn:.3f}")
# Plot comparison
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_scratch, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('From Scratch Implementation')
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_sklearn, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('Sklearn Implementation')
plt.tight_layout()
plt.show()
β οΈ Assumptions and Limitations
Assumptions
- Weak learners are better than random: Each base model should perform slightly better than chance
- Sequential dependency: Models are built sequentially, not independently
- Gradient computability: Loss function must be differentiable
- Sufficient data: Needs adequate training data to avoid overfitting
Limitations
1. Overfitting Risk
- Can easily overfit with too many iterations
- Requires careful tuning of hyperparameters
- Solution: Use validation set, early stopping, and regularization
2. Sequential Training
- Cannot parallelize training (unlike Random Forest)
- Slower to train on large datasets
- Alternative: Use parallelized versions like LightGBM
3. Hyperparameter Sensitivity
- Performance highly dependent on hyperparameter tuning
- Many parameters to optimize (learning rate, depth, iterations)
- Solution: Use automated hyperparameter tuning
4. Memory Usage
- Stores all weak learners
- Can become memory-intensive with many iterations
- Solution: Limit number of estimators
5. Prediction Time
- Slower prediction than single models
- Each prediction requires all weak learners
- Trade-off: Accuracy vs speed
Comparison with Other Methods
Method | Accuracy | Speed | Interpretability | Parallelization |
---|---|---|---|---|
Gradient Boosting | High | Slow | Low | No (training) |
Random Forest | High | Fast | Medium | Yes |
Single Decision Tree | Medium | Fast | High | N/A |
Linear Models | Low-Medium | Very Fast | High | Yes |
Neural Networks | High | Variable | Low | Yes |
When to Use vs Avoid
Use Gradient Boosting when: - High accuracy is crucial - You have sufficient computational resources - Data is not extremely noisy - You can invest time in hyperparameter tuning
Avoid Gradient Boosting when: - Real-time predictions are critical - Interpretability is most important - Training time is heavily constrained - Data is very noisy or has many outliers
π‘ Interview Questions
1. How does Gradient Boosting differ from Random Forest?
Answer: - Training: Gradient Boosting builds trees sequentially where each tree corrects errors of previous ones, while Random Forest builds trees independently in parallel - Overfitting: Gradient Boosting is more prone to overfitting due to sequential error correction, Random Forest reduces overfitting through averaging - Speed: Random Forest is faster to train due to parallelization, Gradient Boosting is sequential - Bias-Variance: Gradient Boosting reduces bias primarily, Random Forest reduces variance - Hyperparameters: Gradient Boosting has more critical hyperparameters (learning rate, n_estimators) that need careful tuning
2. Explain the mathematical intuition behind gradient boosting. How does it use gradients?
Answer: Gradient Boosting minimizes a loss function using gradient descent in function space: - At each iteration, it computes negative gradients of the loss function with respect to current predictions - These gradients represent the direction of steepest decrease in the loss - A new weak learner is trained to predict these gradients (residuals for squared loss) - The predictions are updated by adding the new model's output scaled by a learning rate - This process is repeated until convergence or max iterations reached
For squared loss: gradient = y - F(x) (actual residual) For logistic loss: gradient = y - p(x) (probability residual)
3. What role does the learning rate play in Gradient Boosting? How do you choose it?
Answer: The learning rate (Ξ·) controls how much each weak learner contributes to the final prediction: - Small Ξ· (0.01-0.1): More conservative updates, requires more iterations but often better generalization - Large Ξ· (0.3-1.0): Faster learning but higher overfitting risk - Trade-off: Lower learning rate with more estimators often yields better results - Selection: Use validation curves or cross-validation to find optimal value - Common practice: Start with Ξ·=0.1, then try Ξ·=0.05 with 2x estimators or Ξ·=0.2 with 0.5x estimators
4. How do you prevent overfitting in Gradient Boosting?
Answer: Multiple regularization techniques: - Learning rate: Lower values (0.01-0.1) prevent overfitting - Tree depth: Limit max_depth (3-8) to keep weak learners simple - Subsampling: Use fraction of data for each tree (0.5-0.8) - Feature subsampling: Use random subset of features per split - Early stopping: Monitor validation error and stop when it starts increasing - Minimum samples: Set min_samples_split and min_samples_leaf - Cross-validation: Use CV to select optimal number of estimators
5. What are the advantages of XGBoost over traditional Gradient Boosting?
Answer: XGBoost improvements: - Regularization: Built-in L1 and L2 regularization in objective function - Missing values: Handles missing values automatically by learning best direction - Parallelization: Parallel tree construction (not sequential like boosting stages) - Speed: Optimized implementation with caching and approximation algorithms - Memory efficiency: Block structure for out-of-core computation - Cross-validation: Built-in cross-validation during training - Flexibility: More loss functions and evaluation metrics - Pruning: Bottom-up tree pruning removes splits with negative gain
6. How would you tune hyperparameters for a Gradient Boosting model?
Answer: Systematic approach: 1. Start with defaults: n_estimators=100, learning_rate=0.1, max_depth=3 2. Tune tree parameters: max_depth (3-10), min_samples_split (10-50) 3. Optimize learning rate and estimators: Lower learning_rate, increase n_estimators 4. Add regularization: subsample (0.6-0.9), max_features 5. Use techniques: Grid search, random search, or Bayesian optimization 6. Validation: Use time-series split for temporal data, stratified CV for classification 7. Monitor: Plot validation curves to detect overfitting
Example order: max_depth β n_estimators & learning_rate β subsampling β feature selection
7. Explain the difference between AdaBoost and Gradient Boosting.
Answer: Key differences: - Error focus: AdaBoost reweights misclassified samples, Gradient Boosting fits residuals - Loss function: AdaBoost uses exponential loss, Gradient Boosting can use various losses - Weight updates: AdaBoost changes sample weights, Gradient Boosting changes predictions - Flexibility: Gradient Boosting works with any differentiable loss, AdaBoost is more restrictive - Outlier sensitivity: AdaBoost very sensitive to outliers, Gradient Boosting less so - Base learners: AdaBoost typically uses stumps, Gradient Boosting uses deeper trees - Applications: AdaBoost mainly classification, Gradient Boosting both regression and classification
8. How do you interpret feature importance in Gradient Boosting?
Answer: Feature importance calculation: - Frequency-based: How often a feature is used for splits across all trees - Gain-based: Average improvement in objective function when feature is used - Permutation importance: Decrease in model performance when feature values are randomly permuted - SHAP values: Game-theoretic approach showing contribution of each feature to predictions
Interpretation tips: - Higher values indicate more important features - Consider feature interactions and multicollinearity - Use multiple importance measures for robustness - Validate importance with domain knowledge
9. When would you choose Gradient Boosting over Deep Learning?
Answer: Choose Gradient Boosting when: - Tabular data: Works exceptionally well on structured data - Small to medium datasets: Less prone to overfitting than deep learning - Interpretability needed: Feature importance and decision paths are clearer - No image/text data: Deep learning excels with unstructured data - Quick deployment: Faster to train and tune than neural networks - Limited computational resources: Less GPU dependency - Heterogeneous features: Mix of numerical and categorical features - Proven track record: Dominates many Kaggle tabular competitions
10. How does the choice of loss function affect Gradient Boosting performance?
Answer: Loss function impacts: - Squared loss: Sensitive to outliers, good for normal residuals - Absolute loss: Robust to outliers, good for heavy-tailed distributions - Huber loss: Combines benefits of both, balanced approach - Logistic loss: For classification, provides probability estimates - Custom loss: Can optimize specific business metrics
Selection guidelines: - Analyze residual distribution - Consider outlier presence - Match business objective (e.g., quantile loss for different percentiles) - Use validation to compare different loss functions
π§ Examples
Real-world Example: House Price Prediction
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load Boston housing dataset
boston = load_boston()
X, y = boston.data, boston.target
feature_names = boston.feature_names
# Create DataFrame for better visualization
df = pd.DataFrame(X, columns=feature_names)
df['PRICE'] = y
print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"Features: {list(feature_names)}")
print("\nFirst few rows:")
print(df.head())
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train Gradient Boosting model
gb_model = GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.1,
max_depth=4,
min_samples_split=20,
min_samples_leaf=10,
subsample=0.8,
random_state=42
)
gb_model.fit(X_train, y_train)
# Make predictions
y_pred = gb_model.predict(X_test)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"\nModel Performance:")
print(f"RMSE: {rmse:.2f}")
print(f"RΒ² Score: {r2:.3f}")
# Feature importance analysis
feature_importance = pd.DataFrame({
'feature': feature_names,
'importance': gb_model.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(15, 5))
# Plot 1: Predictions vs Actual
plt.subplot(1, 3, 1)
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title(f'Predictions vs Actual (RΒ² = {r2:.3f})')
# Plot 2: Feature Importance
plt.subplot(1, 3, 2)
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.xlabel('Feature Importance')
plt.title('Top 10 Most Important Features')
plt.gca().invert_yaxis()
# Plot 3: Learning Curve
plt.subplot(1, 3, 3)
train_scores, test_scores = [], []
for pred_train, pred_test in zip(gb_model.staged_predict(X_train),
gb_model.staged_predict(X_test)):
train_scores.append(mean_squared_error(y_train, pred_train))
test_scores.append(mean_squared_error(y_test, pred_test))
plt.plot(range(1, len(train_scores) + 1), train_scores, label='Train MSE')
plt.plot(range(1, len(test_scores) + 1), test_scores, label='Test MSE')
plt.xlabel('Boosting Iterations')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curve')
plt.legend()
plt.tight_layout()
plt.show()
# Residual analysis
residuals = y_test - y_pred
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=20, alpha=0.7)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
plt.tight_layout()
plt.show()
print("\nTop 5 Most Important Features:")
print(feature_importance.head())
# Business insights
print("\nBusiness Insights:")
print("1. LSTAT (% lower status population) is the most important predictor")
print("2. RM (average rooms per dwelling) significantly affects price")
print("3. DIS (distance to employment centers) impacts housing values")
print("4. The model explains {:.1f}% of price variation".format(r2 * 100))
Example: Multi-class Classification
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train gradient boosting classifier
gb_clf = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
gb_clf.fit(X_train, y_train)
# Predictions
y_pred = gb_clf.predict(X_test)
y_pred_proba = gb_clf.predict_proba(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=wine.target_names,
yticklabels=wine.target_names)
plt.title('Confusion Matrix - Wine Classification')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# Feature importance for classification
feature_importance_clf = pd.DataFrame({
'feature': wine.feature_names,
'importance': gb_clf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 5 Most Important Features for Wine Classification:")
print(feature_importance_clf.head())
π References
Books
- "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman - Chapter 10
- "Hands-On Machine Learning" by AurΓ©lien GΓ©ron - Ensemble Methods chapter
- "Pattern Recognition and Machine Learning" by Christopher Bishop
- "Machine Learning: A Probabilistic Perspective" by Kevin Murphy
Papers
- Greedy Function Approximation: A Gradient Boosting Machine - Jerome Friedman (2001)
- XGBoost: A Scalable Tree Boosting System - Chen & Guestrin (2016)
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree - Microsoft (2017)
Online Resources
- Scikit-learn Gradient Boosting Guide
- XGBoost Documentation
- LightGBM Documentation
- CatBoost Documentation
- Gradient Boosting Explained - Video by StatQuest
Tutorials and Blogs
- Complete Guide to Parameter Tuning in Gradient Boosting
- Understanding Gradient Boosting - Interactive explanation
- Kaggle Learn: Gradient Boosting