1. Scikit-learn Cheat Sheet
- 1. Scikit-learn Cheat Sheet
- 1.1 Getting Started
- 1.2 Data Preprocessing
- 1.3 Model Selection and Training
- 1.4 Model Evaluation
- 1.5 Hyperparameter Tuning
- 1.6 Pipelines
- 1.7 Ensemble Methods
- 1.8 Dimensionality Reduction
- 1.9 Model Inspection
- 1.10 Calibration
- 1.11 Dummy Estimators
- 1.12 Multi-label Classification
- 1.13 Multi-class Classification (One-vs-Rest)
- 1.14 Outlier Detection
- 1.15 Semi-Supervised Learning
- 1.16 Tips and Best Practices
This cheat sheet provides a broad overview of the Scikit-learn (sklearn) machine learning library, covering essential concepts, code snippets, and best practices for model building, training, evaluation, and deployment. It aims to be a one-stop reference for common tasks.
1.1 Getting Started
1.1.1 Installation
pip install scikit-learn
1.1.2 Importing Scikit-learn
import sklearn
from sklearn import datasets # For built-in datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
1.2 Data Preprocessing
1.2.1 Loading Data
1.2.1.1 Built-in Datasets
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
# load_boston() was deprecated and removed in scikit-learn 1.2; use fetch_california_housing() instead
california_housing = datasets.fetch_california_housing()
X = california_housing.data
y = california_housing.target
digits = datasets.load_digits()
X = digits.data
y = digits.target
1.2.1.2 From Pandas DataFrame
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("your_data.csv")
X = df.drop("target_column", axis=1)
y = df["target_column"]
1.2.2 Splitting Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training, 20% testing
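For classification problems, a stratified split keeps class proportions similar in the train and test sets; a minimal variant of the call above:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # stratify on the class labels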
1.2.3 Feature Scaling
1.2.3.1 Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
1.2.3.2 Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
1.2.3.3 Robust Scaling
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
1.2.3.4 Normalization
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_normalized = normalizer.fit_transform(X_train)
X_test_normalized = normalizer.transform(X_test)
1.2.4 Handling Missing Values
1.2.4.1 Imputation (SimpleImputer)
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(strategy="mean") # Replace missing values with the mean
# Other strategies: "median", "most_frequent", "constant"
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
1.2.4.2 Imputation (KNNImputer)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
1.2.4.3 Dropping Missing Values
import pandas as pd
# Assuming X_train and X_test are pandas DataFrames
X_train_dropped = X_train.dropna()
X_test_dropped = X_test.dropna()
1.2.5 Encoding Categorical Features
1.2.5.1 One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Assuming X_train and X_test are pandas DataFrames
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # use sparse=False on scikit-learn < 1.2
X_train_encoded = encoder.fit_transform(X_train[['categorical_feature']])
X_test_encoded = encoder.transform(X_test[['categorical_feature']])
# Or, using pandas (align test columns with train columns afterwards):
X_train_encoded = pd.get_dummies(X_train, columns=['categorical_feature'])
X_test_encoded = pd.get_dummies(X_test, columns=['categorical_feature'])
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0) # avoid mismatched dummy columns
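Numeric and categorical preprocessing are often combined in a single ColumnTransformer; a minimal sketch, where 'num_feature' and 'categorical_feature' are hypothetical column names:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['num_feature']), # scale numeric columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['categorical_feature']) # encode categorical columns
])
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)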
1.2.5.2 Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train[['ordinal_feature']])
X_test_encoded = encoder.transform(X_test[['ordinal_feature']])
1.2.5.3 Label Encoding (for target variable)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
1.2.6 Feature Engineering
1.2.6.1 Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
1.2.6.2 Custom Transformers
from sklearn.preprocessing import FunctionTransformer
import numpy as np
def log_transform(x):
    return np.log1p(x)
log_transformer = FunctionTransformer(log_transform)
X_train_log = log_transformer.fit_transform(X_train)
X_test_log = log_transformer.transform(X_test)
1.2.7 Feature Selection
1.2.7.1 VarianceThreshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1) # Remove features with variance below 0.1
X_train_selected = selector.fit_transform(X_train)
X_test_selected = selector.transform(X_test)
1.2.7.2 SelectKBest
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5) # Select top 5 features
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
1.2.7.3 SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(penalty="l1", solver='liblinear')
selector = SelectFromModel(estimator)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
1.2.7.4 RFE (Recursive Feature Elimination)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
1.3 Model Selection and Training
1.3.1 Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
1.3.2 Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear') # Add solver for smaller datasets
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
1.3.3 Support Vector Machines (SVM)
from sklearn.svm import SVC, SVR
# For classification
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
# For regression
model = SVR(kernel='linear', C=1.0)
model.fit(X_train, y_train)
1.3.4 Decision Trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# For classification
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
# For regression
model = DecisionTreeRegressor(max_depth=5)
model.fit(X_train, y_train)
1.3.5 Random Forest
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# For classification
model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
# For regression
model = RandomForestRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
1.3.6 Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
# For classification
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
# For regression
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
1.3.7 K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
# For classification
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# For regression
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)
1.3.8 Naive Bayes
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
1.3.9 Clustering (K-Means)
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=42, n_init='auto') # n_init='auto' requires scikit-learn >= 1.2
model.fit(X_train)
labels = model.predict(X_test)
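Cluster quality can be checked with the silhouette score (ranges from -1 to 1, higher is better); a minimal sketch:
from sklearn.metrics import silhouette_score
score = silhouette_score(X_train, model.labels_) # labels_ holds the cluster assigned to each training sample
print(f"Silhouette score: {score:.2f}")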
1.3.10 Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
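The variance captured by each component helps when choosing n_components; for example:
print(pca.explained_variance_ratio_) # variance explained by each component
print(pca.explained_variance_ratio_.sum()) # total variance retained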
1.3.11 Model Persistence
import joblib
# Save the model
joblib.dump(model, 'my_model.pkl')
# Load the model
loaded_model = joblib.load('my_model.pkl')
1.4 Model Evaluation
1.4.1 Regression Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
1.4.2 Classification Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary') # for multi-class, use average='macro' or 'weighted'
recall = recall_score(y_test, y_pred, average='binary')
f1 = f1_score(y_test, y_pred, average='binary')
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
1.4.3 ROC Curve and AUC
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# For binary classification: y_pred_proba is the predicted probability of the positive class
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
1.4.4 Cross-Validation
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
# K-Fold Cross-Validation
cv_scores = cross_val_score(model, X, y, cv=5) # 5-fold cross-validation
# Stratified K-Fold (for classification)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=cv)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation score: {cv_scores.mean():.2f}")
1.4.5 Learning Curves
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, test_mean, label='Cross-validation score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()
1.4.6 Validation Curves
from sklearn.model_selection import validation_curve
import numpy as np
import matplotlib.pyplot as plt
param_range = np.logspace(-6, -1, 5)
# param_name must be a hyperparameter of the estimator, e.g. "gamma" for an SVC
train_scores, test_scores = validation_curve(
    model, X, y, param_name="gamma", param_range=param_range,
    cv=5, scoring="accuracy")
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(param_range, train_mean, label='Training score')
plt.plot(param_range, test_mean, label='Cross-validation score')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.1)
plt.xscale('log')
plt.xlabel('Parameter Value')
plt.ylabel('Score')
plt.legend()
plt.title('Validation Curve')
plt.show()
1.5 Hyperparameter Tuning
1.5.1 GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf'],
'gamma': [0.1, 1, 'scale', 'auto']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', verbose=2)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.2f}")
best_model = grid_search.best_estimator_
1.5.2 RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
param_dist = {
'n_estimators': randint(10, 200),
'max_depth': [3, 5, 10, None],
'min_samples_split': randint(2, 11),
'min_samples_leaf': randint(1, 11),
'bootstrap': [True, False]
}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist,
n_iter=20, cv=5, scoring='accuracy', random_state=42, verbose=2)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.2f}")
best_model = random_search.best_estimator_
1.6 Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline([
('scaler', StandardScaler()),
('svm', SVC())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
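Pipeline steps can be tuned directly with GridSearchCV using the step__parameter naming convention; a minimal sketch reusing the pipeline above:
from sklearn.model_selection import GridSearchCV
param_grid = {'svm__C': [0.1, 1, 10], 'svm__kernel': ['linear', 'rbf']} # 'svm' is the step name in the pipeline
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)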
1.7 Ensemble Methods
1.7.1 Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
base_estimator = DecisionTreeClassifier(max_depth=5)
bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42) # use base_estimator= on scikit-learn < 1.2
bagging.fit(X_train, y_train)
1.7.2 Boosting (AdaBoost)
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
adaboost.fit(X_train, y_train)
1.7.3 Stacking
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
estimators = [
('svm', SVC(kernel='linear', C=1.0)),
('dt', DecisionTreeClassifier(max_depth=5))
]
final_estimator = LogisticRegression()
stacking = StackingClassifier(estimators=estimators, final_estimator=final_estimator)
stacking.fit(X_train, y_train)
1.7.4 Voting Classifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
estimator1 = LogisticRegression(solver='liblinear')
estimator2 = SVC(kernel='linear', C=1.0, probability=True) # probability=True for soft voting
voting = VotingClassifier(estimators=[('lr', estimator1), ('svc', estimator2)], voting='soft') # 'hard' for majority voting
voting.fit(X_train, y_train)
1.8 Dimensionality Reduction
1.8.1 PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
1.8.2 Linear Discriminant Analysis (LDA)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train) # Supervised, needs y_train
X_test_lda = lda.transform(X_test)
1.8.3 t-distributed Stochastic Neighbor Embedding (t-SNE)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train) # Usually only fit_transform
1.8.4 Non-negative Matrix Factorization (NMF)
from sklearn.decomposition import NMF
nmf = NMF(n_components=2, random_state=42) # input data must be non-negative
X_train_nmf = nmf.fit_transform(X_train)
X_test_nmf = nmf.transform(X_test)
1.9 Model Inspection
1.9.1 Feature Importances
# For tree-based models (RandomForest, GradientBoosting)
importances = model.feature_importances_
print(importances)
# For linear models (LogisticRegression, LinearRegression)
coefficients = model.coef_
print(coefficients)
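Pairing importances with column names makes them easier to read; a minimal sketch, assuming feature_names holds the training column names (e.g. list(X_train.columns)):
import pandas as pd
importance_series = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False) # feature_names is an assumed variable, not defined above
print(importance_series.head(10))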
1.9.2 Partial Dependence Plots
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X_train, features=[0, 1]) # plot for features 0 and 1
# On scikit-learn < 1.2, use plot_partial_dependence from sklearn.inspection instead
1.9.3 Permutation Importance
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)
1.10 Calibration
from sklearn.calibration import CalibratedClassifierCV
calibrated_model = CalibratedClassifierCV(model, method='isotonic', cv=5) # 'sigmoid' is another method
calibrated_model.fit(X_train, y_train)
1.11 Dummy Estimators
from sklearn.dummy import DummyClassifier, DummyRegressor
# For classification
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
# For regression
dummy_reg = DummyRegressor(strategy="mean")
dummy_reg.fit(X_train, y_train)
1.12 Multi-label Classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
multilabel_model = MultiOutputClassifier(RandomForestClassifier())
multilabel_model.fit(X_train, y_train) # y_train is a 2D array of shape (n_samples, n_labels)
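For quick experiments, a synthetic multi-label target can be generated; a minimal, self-contained sketch:
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
X_ml, y_ml = make_multilabel_classification(n_samples=200, n_classes=3, random_state=42) # y_ml has shape (200, 3)
demo_model = MultiOutputClassifier(RandomForestClassifier()).fit(X_ml, y_ml)
print(demo_model.predict(X_ml[:2])) # one predicted value per label for each sample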
1.13 Multi-class Classification (One-vs-Rest)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
ovr_model = OneVsRestClassifier(SVC(kernel='linear'))
ovr_model.fit(X_train, y_train)
1.14 Outlier Detection
from sklearn.ensemble import IsolationForest
outlier_detector = IsolationForest(random_state=42)
outlier_detector.fit(X_train)
outliers = outlier_detector.predict(X_test) # 1 for inliers, -1 for outliers
1.15 Semi-Supervised Learning
from sklearn.semi_supervised import LabelPropagation
label_prop_model = LabelPropagation()
label_prop_model.fit(X_train, y_train) # y_train can contain -1 for unlabeled samples
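Unlabeled samples are marked with -1 before fitting; a minimal sketch, assuming the last 50 training samples are treated as unlabeled:
import numpy as np
y_train_partial = np.array(y_train).copy()
y_train_partial[-50:] = -1 # -1 marks unlabeled samples
label_prop_model.fit(X_train, y_train_partial)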
1.16 Tips and Best Practices
- Data Preprocessing: Always preprocess your data (scaling, encoding, handling missing values) before training a model.
- Cross-Validation: Use cross-validation to get a reliable estimate of your model's performance.
- Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your model.
- Pipelines: Use pipelines to streamline your workflow and prevent data leakage.
- Model Persistence: Save your trained models using joblib or pickle.
- Feature Importance: Use feature importance techniques to understand which features are most important for your model.
- Regularization: Use regularization techniques (e.g., L1/L2 penalties) to prevent overfitting; a minimal example follows this list.
- Ensemble Methods: Combine multiple models to improve performance.
- Choose the Right Model: Select a model that is appropriate for your data and task.
- Evaluate Your Model: Use appropriate evaluation metrics for your task.
- Understand Your Data: Spend time exploring and understanding your data before building a model.
- Start Simple: Begin with a simple model and gradually increase complexity.
- Iterate: Machine learning is an iterative process. Experiment with different models, features, and hyperparameters.
- Document Your Work: Keep track of your experiments and results.
- Use Version Control: Use Git to track changes to your code.
- Use Virtual Environments: Isolate project dependencies.
- Read the Documentation: The Scikit-learn documentation is excellent.
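As referenced in the Regularization tip above, a minimal sketch of L2- and L1-regularized linear models:
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0) # L2 penalty; larger alpha means stronger regularization
lasso = Lasso(alpha=0.1) # L1 penalty; can drive some coefficients exactly to zero
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)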