Scikit-Learn Mastery Guide: Complete Machine Learning in Python 🤖
Scikit-learn is the most popular and comprehensive machine learning library in Python, providing simple and efficient tools for data mining and data analysis. Whether you’re a beginner taking your first steps into machine learning or an experienced practitioner looking to implement production-ready models, this guide will take you from basic concepts to advanced techniques.
In this comprehensive ML engineering guide, we’ll explore every major algorithm in scikit-learn with practical examples using built-in datasets, demonstrating real-world applications and best practices for modern machine learning workflows.

Master Machine Learning with Python’s Most Trusted Library
🎥 Machine Learning in Python: Complete Crash Course (2-Part Series)
These two lectures provide a complete, hands-on introduction to machine learning in Python using scikit-learn.
Watch both parts to master the fundamentals and practical workflow of ML in Python!
Table of Contents
- Why Scikit-Learn is Essential for ML Engineers
- Installation with Conda Environment
- Benefits of Using Scikit-Learn
- ML Basics and Core Concepts
- Supervised Learning Algorithms
- Unsupervised Learning Algorithms
- Model Selection and Evaluation
- Data Preprocessing and Feature Engineering
- ML Pipelines and Automation
- Best Practices and Performance Tips
- Real-World Applications
- Conclusion
Why Scikit-Learn is Essential for ML Engineers
- Industry Standard: Used by data scientists, ML engineers, and researchers worldwide
- Comprehensive Library: Covers supervised, unsupervised, and semi-supervised learning
- Production Ready: Optimized for performance with robust, well-tested implementations
- Consistent API: All estimators follow the same interface pattern (fit, predict, transform)
- Excellent Documentation: Comprehensive examples and theoretical background
- Active Community: Continuous development with regular releases and improvements
- Integration Friendly: Works seamlessly with NumPy, Pandas, and visualization libraries
Installation with Conda Environment
Setting up a proper environment is crucial for ML projects. Here’s how to install scikit-learn using conda:
Step 1: Create a New Conda Environment
conda create -n ml-env python=3.9
# Activate the environment
conda activate ml-env
Step 2: Install Scikit-Learn and Dependencies
conda install scikit-learn pandas numpy matplotlib seaborn jupyter
# Alternative: Install from conda-forge (recommended for latest versions)
conda install -c conda-forge scikit-learn pandas numpy matplotlib seaborn
# Install additional ML libraries
conda install -c conda-forge xgboost lightgbm plotly
Step 3: Verify Installation
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
# Test with a simple example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Load sample data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3, random_state=42
)
# Create and train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Installation successful! Test accuracy: {accuracy:.2f}")
Managing Dependencies
conda env export > ml-environment.yml
# Create environment from file
conda env create -f ml-environment.yml
# List installed packages
conda list
# Update scikit-learn
conda update scikit-learn
Benefits of Using Scikit-Learn
🚀 Ease of Use
Consistent API across all algorithms makes learning and switching between models seamless. The fit/predict pattern is intuitive and reduces cognitive overhead.
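For example, swapping one classifier for another is a one-line change because every estimator exposes the same methods. A minimal sketch (the model choices here are arbitrary):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)):
    model.fit(X, y)  # same training call for every estimator
    print(type(model).__name__, model.predict(X[:3]))  # same prediction call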
⚡ Performance
Optimized implementations in C/Cython provide excellent performance. Many algorithms support parallel processing and are memory-efficient.
🔧 Complete Toolkit
Everything you need for ML workflows: preprocessing, feature selection, model training, evaluation, and hyperparameter tuning in one package.
📊 Built-in Datasets
Comes with classic datasets for learning and benchmarking, making it easy to start experimenting immediately.
🏭 Production Ready
Battle-tested in production environments with robust error handling, extensive testing, and stable APIs.
ML Basics and Core Concepts
Understanding the Scikit-Learn API
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
# Load a sample dataset
iris = load_iris()
print("Dataset shape:", iris.data.shape)
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
# The standard ML workflow
X, y = iris.data, iris.target
# 1. Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 2. Preprocess (optional)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Class distribution: {np.bincount(y_train)}")
Essential ML Concepts
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Supervised Learning - we have target labels
from sklearn.svm import SVC
supervised_model = SVC()
supervised_model.fit(X_train, y_train)
predictions = supervised_model.predict(X_test)
print(f"Supervised accuracy: {accuracy_score(y_test, predictions):.3f}")
# Unsupervised Learning - no target labels
unsupervised_model = KMeans(n_clusters=3, random_state=42)
clusters = unsupervised_model.fit_predict(X)
print(f"Cluster centers shape: {unsupervised_model.cluster_centers_.shape}")
# Dimensionality Reduction
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Reduced data shape: {X_reduced.shape}")
Supervised Learning Algorithms
Supervised learning uses labeled data to learn a mapping from inputs to outputs. Let’s explore each major algorithm with practical examples.
1. Linear Regression
Linear Regression – Predicting Continuous Values
Best for: Predicting continuous numerical values with linear relationships
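Under the hood, ordinary least squares fits the linear model ŷ = Xw + b by choosing the weights w and intercept b that minimize the residual sum of squares, Σᵢ (yᵢ − ŷᵢ)²; the fitted values surface as coef_ and intercept_ in the code below.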
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Load California housing dataset (Boston dataset is deprecated)
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Use only a few features for simplicity
feature_names = housing.feature_names
X_simple = X[:, [0, 2, 3]]  # MedInc, AveRooms, AveBedrms
feature_names_simple = [feature_names[i] for i in [0, 2, 3]]
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X_simple, y, test_size=0.2, random_state=42
)
# Create and train the model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Make predictions
y_pred = lr_model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Linear Regression Results:")
print(f"Mean Squared Error: {mse:.3f}")
print(f"R² Score: {r2:.3f}")
print(f"Coefficients: {lr_model.coef_}")
print(f"Intercept: {lr_model.intercept_:.3f}")
# Feature importance
for feature, coef in zip(feature_names_simple, lr_model.coef_):
    print(f"{feature}: {coef:.3f}")
2. Logistic Regression
Logistic Regression – Binary and Multi-class Classification
Best for: Classification problems with interpretable results and probability estimates
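It models the class probability with the logistic function, P(y = 1 | x) = 1 / (1 + e^−(w·x + b)), which is why predict_proba can return probability estimates alongside hard class labels.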
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale the features (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train the model
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
# Make predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Results:")
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
# Feature importance (coefficients)
feature_importance = pd.DataFrame({
    'feature': wine.feature_names,
    'importance': np.abs(log_reg.coef_[0])  # Taking first class coefficients
}).sort_values('importance', ascending=False)
print("\nTop 5 Most Important Features:")
print(feature_importance.head())
3. Decision Trees
Decision Trees – Interpretable Non-linear Models
Best for: Problems requiring interpretability and handling non-linear relationships
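At each node, the tree greedily picks the split that most reduces impurity; DecisionTreeClassifier defaults to Gini impurity, G = 1 − Σₖ pₖ², where pₖ is the fraction of samples of class k in the node.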
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Create and train the model
dt_model = DecisionTreeClassifier(
max_depth=3, # Limit depth to prevent overfitting
min_samples_split=5,
min_samples_leaf=2,
random_state=42
)
dt_model.fit(X_train, y_train)
# Make predictions
y_pred = dt_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree Results:")
print(f"Accuracy: {accuracy:.3f}")
# Feature importance
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
# Tree depth and leaves info
print(f"\nTree depth: {dt_model.get_depth()}")
print(f"Number of leaves: {dt_model.get_n_leaves()}")
# You can visualize the tree (uncomment to see)
# plt.figure(figsize=(15, 10))
# plot_tree(dt_model, feature_names=iris.feature_names,
# class_names=iris.target_names, filled=True, rounded=True)
# plt.show()
4. Random Forest
Random Forest – Ensemble of Decision Trees
Best for: High accuracy with built-in feature importance and reduced overfitting
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
# Load breast cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Create and train the model
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=10,
min_samples_split=5,
min_samples_leaf=2,
random_state=42,
n_jobs=-1 # Use all available cores
)
rf_model.fit(X_train, y_train)
# Make predictions
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1] # Probability of positive class
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
auc_score = roc_auc_score(y_test, y_pred_proba)
print("Random Forest Results:")
print(f"Accuracy: {accuracy:.3f}")
print(f"AUC Score: {auc_score:.3f}")
# Feature importance
feature_importance = pd.DataFrame({
    'feature': cancer.feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
# Out-of-bag score (built-in cross-validation)
rf_oob = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=42
)
rf_oob.fit(X_train, y_train)
print(f"\nOut-of-bag score: {rf_oob.oob_score_:.3f}")
5. Support Vector Machine (SVM)
Support Vector Machine – Maximum Margin Classifier
Best for: High-dimensional data, text classification, and when you need robust performance
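With the RBF kernel used below, similarity between two samples is measured as K(x, x′) = exp(−γ‖x − x′‖²), so C (regularization strength) and gamma (kernel width) jointly control how flexible the decision boundary can be.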
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# Load digits dataset
digits = load_digits()
X, y = digits.data, digits.target
print(f"Dataset shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale the features (crucial for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train the model
svm_model = SVC(
    kernel='rbf',    # Radial Basis Function kernel
    C=1.0,           # Regularization parameter
    gamma='scale',   # Kernel coefficient
    random_state=42
)
svm_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = svm_model.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("SVM Results:")
print(f"Accuracy: {accuracy:.3f}")
print(f"\nNumber of support vectors: {svm_model.n_support_}")
print(f"Total support vectors: {svm_model.support_vectors_.shape[0]}")
# Test different kernels
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for kernel in kernels:
    svm_temp = SVC(kernel=kernel, random_state=42)
    svm_temp.fit(X_train_scaled, y_train)
    temp_pred = svm_temp.predict(X_test_scaled)
    temp_accuracy = accuracy_score(y_test, temp_pred)
    print(f"{kernel.capitalize()} kernel accuracy: {temp_accuracy:.3f}")
6. K-Nearest Neighbors (KNN)
K-Nearest Neighbors – Instance-based Learning
Best for: Simple baseline models, recommendation systems, and pattern recognition
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale the features (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Find optimal k value
k_values = range(1, 21)
accuracies = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
# Find best k
best_k = k_values[np.argmax(accuracies)]
best_accuracy = max(accuracies)
print("KNN Results:")
print(f"Best k value: {best_k}")
print(f"Best accuracy: {best_accuracy:.3f}")
# Train final model with best k
knn_model = KNeighborsClassifier(n_neighbors=best_k)
knn_model.fit(X_train_scaled, y_train)
y_pred = knn_model.predict(X_test_scaled)
# Get prediction probabilities
y_pred_proba = knn_model.predict_proba(X_test_scaled)
print(f"\nFinal model accuracy: {accuracy_score(y_test, y_pred):.3f}")
# Show k values vs accuracy
print("\nK values vs Accuracy:")
for k, acc in zip(k_values[:10], accuracies[:10]):
    print(f"k={k}: {acc:.3f}")
# Distance-based vs uniform weights
knn_uniform = KNeighborsClassifier(n_neighbors=best_k, weights='uniform')
knn_distance = KNeighborsClassifier(n_neighbors=best_k, weights='distance')
knn_uniform.fit(X_train_scaled, y_train)
knn_distance.fit(X_train_scaled, y_train)
uniform_acc = accuracy_score(y_test, knn_uniform.predict(X_test_scaled))
distance_acc = accuracy_score(y_test, knn_distance.predict(X_test_scaled))
print(f"\nUniform weights accuracy: {uniform_acc:.3f}")
print(f"Distance weights accuracy: {distance_acc:.3f}")
7. Naive Bayes
Naive Bayes – Probabilistic Classifier
Best for: Text classification, spam filtering, and when you need fast training/prediction
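It applies Bayes' rule under the "naive" assumption that features are conditionally independent given the class, P(y | x) ∝ P(y) ∏ᵢ P(xᵢ | y); GaussianNB models each per-class likelihood P(xᵢ | y) as a normal distribution.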
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report
# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Gaussian Naive Bayes (for continuous features)
gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)
y_pred_gnb = gnb_model.predict(X_test)
# Evaluate Gaussian NB
gnb_accuracy = accuracy_score(y_test, y_pred_gnb)
print("Gaussian Naive Bayes Results:")
print(f"Accuracy: {gnb_accuracy:.3f}")
# Get prediction probabilities
y_pred_proba = gnb_model.predict_proba(X_test)
print(f"Prediction probabilities shape: {y_pred_proba.shape}")
# Show class probabilities for first few predictions
print("\nFirst 5 predictions with probabilities:")
for i in range(5):
    true_class = wine.target_names[y_test[i]]
    pred_class = wine.target_names[y_pred_gnb[i]]
    prob_max = np.max(y_pred_proba[i])
    print(f"True: {true_class}, Predicted: {pred_class}, Confidence: {prob_max:.3f}")
# Compare with scaled features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
gnb_scaled = GaussianNB()
gnb_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = gnb_scaled.predict(X_test_scaled)
scaled_accuracy = accuracy_score(y_test, y_pred_scaled)
print(f"\nGaussian NB with scaled features: {scaled_accuracy:.3f}")
print(f"Improvement: {scaled_accuracy - gnb_accuracy:.3f}")
# Feature log probabilities per class
print(f"\nNumber of features: {gnb_model.n_features_in_}")
print(f"Classes: {gnb_model.classes_}")
print(f"Class priors: {gnb_model.class_prior_}")
Unsupervised Learning Algorithms
Unsupervised learning finds hidden patterns in data without labeled examples. Let’s explore clustering, dimensionality reduction, and anomaly detection.
1. K-Means Clustering
K-Means – Centroid-based Clustering
Best for: Customer segmentation, image segmentation, and data compression
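K-Means alternates between assigning each point to its nearest centroid and moving each centroid to its cluster's mean, which locally minimizes the inertia Σᵢ ‖xᵢ − μ_c(i)‖², the within-cluster sum of squared distances reported as inertia_ below.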
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
import matplotlib.pyplot as plt
# Load iris dataset (using it as unlabeled data)
iris = load_iris()
X = iris.data
y_true = iris.target # We’ll use this only for evaluation
# Determine optimal number of clusters using elbow method
inertias = []
silhouette_scores = []
k_range = range(2, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))
# Find optimal k using silhouette score
optimal_k = k_range[np.argmax(silhouette_scores)]
print("K-Means Clustering Results:")
print(f"Optimal number of clusters (silhouette): {optimal_k}")
# Train final model
kmeans_model = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans_model.fit_predict(X)
# Evaluate clustering (comparing with true labels)
ari_score = adjusted_rand_score(y_true, cluster_labels)
sil_score = silhouette_score(X, cluster_labels)
print(f"Adjusted Rand Index: {ari_score:.3f}")
print(f"Silhouette Score: {sil_score:.3f}")
print(f"Inertia (within-cluster sum of squares): {kmeans_model.inertia_:.2f}")
# Cluster centers
print(f"\nCluster centers shape: {kmeans_model.cluster_centers_.shape}")
print("Cluster centers:")
for i, center in enumerate(kmeans_model.cluster_centers_):
    print(f"Cluster {i}: {center}")
# Cluster sizes
unique, counts = np.unique(cluster_labels, return_counts=True)
print(f"\nCluster sizes: {dict(zip(unique, counts))}")
# Show elbow method results
print("\nElbow Method - K vs Inertia:")
for k, inertia in zip(k_range, inertias):
    print(f"k={k}: {inertia:.2f}")
print("\nSilhouette Method - K vs Score:")
for k, score in zip(k_range, silhouette_scores):
    print(f"k={k}: {score:.3f}")
2. Principal Component Analysis (PCA)
PCA – Linear Dimensionality Reduction
Best for: Data visualization, noise reduction, and feature extraction
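PCA projects the data onto the orthogonal directions of maximal variance (the eigenvectors of the covariance matrix); each entry of explained_variance_ratio_ is that component's eigenvalue divided by the sum of all eigenvalues, which is why standardizing features first matters.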
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target
# Standardize the features (crucial for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Calculate cumulative explained variance
cumsum_variance = np.cumsum(pca.explained_variance_ratio_)
print("PCA Results:")
print(f"Original dimensions: {X.shape[1]}")
print(f"First 5 explained variance ratios: {pca.explained_variance_ratio_[:5]}")
# Find number of components for 95% variance
n_components_95 = np.argmax(cumsum_variance >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")
# Apply PCA with optimal components
pca_optimal = PCA(n_components=n_components_95)
X_pca_optimal = pca_optimal.fit_transform(X_scaled)
print(f"Reduced dimensions: {X_pca_optimal.shape[1]}")
print(f"Variance explained by {n_components_95} components: {cumsum_variance[n_components_95-1]:.3f}")
# 2D visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)
print(f"\n2D PCA explained variance: {pca_2d.explained_variance_ratio_}")
print(f"Total 2D variance explained: {sum(pca_2d.explained_variance_ratio_):.3f}")
# Component loadings (feature contributions)
feature_importance = pd.DataFrame({
    'feature': wine.feature_names,
    'PC1': abs(pca_2d.components_[0]),
    'PC2': abs(pca_2d.components_[1])
})
print("\nTop features contributing to PC1:")
print(feature_importance.sort_values('PC1', ascending=False).head())
print("\nTop features contributing to PC2:")
print(feature_importance.sort_values('PC2', ascending=False).head())
# Inverse transform (reconstruction)
X_reconstructed = pca_optimal.inverse_transform(X_pca_optimal)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"\nReconstruction error: {reconstruction_error:.4f}")
3. DBSCAN Clustering
DBSCAN – Density-based Clustering
Best for: Anomaly detection, irregular cluster shapes, and noisy data
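DBSCAN calls a point a core point when at least min_samples points lie within distance eps of it; clusters grow outward from core points, and any point not reachable from a core point is labeled noise (cluster label -1).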
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Create sample data with noise
X_blobs, _ = make_blobs(n_samples=300, centers=4, n_features=2,
random_state=42, cluster_std=0.8)
# Add some noise points
np.random.seed(42)
noise_points = np.random.uniform(-6, 6, size=(20, 2))
X = np.vstack([X_blobs, noise_points])
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
cluster_labels = dbscan.fit_predict(X_scaled)
# Analyze results
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = list(cluster_labels).count(-1)
print("DBSCAN Clustering Results:")
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")
# Cluster sizes (excluding noise)
unique, counts = np.unique(cluster_labels[cluster_labels != -1], return_counts=True)
if len(unique) > 0:
    print(f"Cluster sizes: {dict(zip(unique, counts))}")
# Silhouette score (excluding noise points)
if n_clusters > 1:
    valid_indices = cluster_labels != -1
    if np.sum(valid_indices) > 1 and len(np.unique(cluster_labels[valid_indices])) > 1:
        sil_score = silhouette_score(X_scaled[valid_indices],
                                     cluster_labels[valid_indices])
        print(f"Silhouette Score: {sil_score:.3f}")
# Core samples
n_core_samples = len(dbscan.core_sample_indices_)
print(f"Number of core samples: {n_core_samples}")
# Parameter sensitivity analysis
eps_values = [0.3, 0.5, 0.7, 1.0]
min_samples_values = [3, 5, 10]
print("\nParameter Sensitivity Analysis:")
print("eps\tmin_samples\tn_clusters\tn_noise")
for eps in eps_values:
    for min_samples in min_samples_values:
        db_temp = DBSCAN(eps=eps, min_samples=min_samples)
        labels_temp = db_temp.fit_predict(X_scaled)
        n_clusters_temp = len(set(labels_temp)) - (1 if -1 in labels_temp else 0)
        n_noise_temp = list(labels_temp).count(-1)
        print(f"{eps}\t{min_samples}\t\t{n_clusters_temp}\t{n_noise_temp}")
# Distance to kth nearest neighbor (for eps selection)
from sklearn.neighbors import NearestNeighbors
k = 5  # min_samples - 1
nbrs = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, indices = nbrs.kneighbors(X_scaled)
distances = np.sort(distances[:, k-1], axis=0)
print(f"\nSuggested eps range: {distances[int(len(distances)*0.9)]:.3f} - {distances[int(len(distances)*0.95)]:.3f}")
4. Hierarchical Clustering
Hierarchical Clustering – Tree-based Clustering
Best for: Understanding data hierarchy, small datasets, and when you need a dendrogram
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Load iris dataset
iris = load_iris()
X = iris.data
y_true = iris.target
# Use only first 2 features for visualization
X_vis = X[:, :2]
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_vis_scaled = scaler.fit_transform(X_vis)
# Apply Agglomerative Clustering with different linkages
linkage_methods = ['ward', 'complete', 'average', 'single']
results = {}
for linkage_method in linkage_methods:
    agg_clustering = AgglomerativeClustering(
        n_clusters=3,
        linkage=linkage_method
    )
    cluster_labels = agg_clustering.fit_predict(X_scaled)
    # Evaluate
    ari_score = adjusted_rand_score(y_true, cluster_labels)
    sil_score = silhouette_score(X_scaled, cluster_labels)
    results[linkage_method] = {
        'ari': ari_score,
        'silhouette': sil_score,
        'labels': cluster_labels
    }
print("Hierarchical Clustering Results:")
print("Linkage\t\tARI\tSilhouette")
for method, result in results.items():
    print(f"{method}\t\t{result['ari']:.3f}\t{result['silhouette']:.3f}")
# Find best method
best_method = max(results.keys(), key=lambda x: results[x]['ari'])
print(f"\nBest linkage method: {best_method}")
# Analyze cluster hierarchy with different number of clusters
n_clusters_range = range(2, 8)
ward_results = []
for n_clusters in n_clusters_range:
    agg_ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    labels = agg_ward.fit_predict(X_scaled)
    if n_clusters > 1:
        sil_score = silhouette_score(X_scaled, labels)
        ward_results.append((n_clusters, sil_score))
print("\nWard Linkage - Clusters vs Silhouette Score:")
for n_clust, score in ward_results:
    print(f"{n_clust} clusters: {score:.3f}")
# Create dendrogram for visualization (using scipy)
print("\nCreating dendrogram with first 50 samples...")
linkage_matrix = linkage(X_scaled[:50], method='ward')
# Distance threshold for automatic cluster determination
from scipy.cluster.hierarchy import fcluster
max_dist = 7  # You can adjust this
cluster_labels_thresh = fcluster(linkage_matrix, max_dist, criterion='distance')
n_clusters_auto = len(np.unique(cluster_labels_thresh))
print(f"Automatic clustering with distance threshold {max_dist}: {n_clusters_auto} clusters")
# Connectivity-constrained clustering (for spatial data)
from sklearn.feature_extraction import image
# This is more relevant for image/spatial data
# For demonstration with iris data
connectivity = None  # No spatial constraint for iris
connected_clustering = AgglomerativeClustering(
    n_clusters=3,
    connectivity=connectivity,
    linkage='ward'
)
connected_labels = connected_clustering.fit_predict(X_scaled)
connected_ari = adjusted_rand_score(y_true, connected_labels)
print(f"\nConnectivity-constrained clustering ARI: {connected_ari:.3f}")
Model Selection and Evaluation
Proper model evaluation is crucial for building reliable ML systems. Let’s explore cross-validation, hyperparameter tuning, and various metrics.
Cross-Validation and Model Selection
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'KNN': KNeighborsClassifier()
}
# Perform cross-validation
cv_results = {}
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=cv_folds, scoring='accuracy')
    cv_results[name] = {
        'mean': cv_scores.mean(),
        'std': cv_scores.std(),
        'scores': cv_scores
    }
print("Cross-Validation Results (5-fold):")
print("Model\t\t\tMean±Std\tAll Scores")
for name, results in cv_results.items():
    scores_str = '[' + ', '.join([f'{s:.3f}' for s in results['scores']]) + ']'
    print(f"{name:<20}{results['mean']:.3f}±{results['std']:.3f}\t{scores_str}")
# Hyperparameter tuning with GridSearchCV
print("\n" + "="*60)
print("Hyperparameter Tuning Examples")
print("="*60)
# Random Forest hyperparameter tuning
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    rf_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)
rf_grid.fit(X, y)
print(f"Random Forest - Best parameters: {rf_grid.best_params_}")
print(f"Random Forest - Best CV score: {rf_grid.best_score_:.3f}")
# SVM hyperparameter tuning
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'poly', 'sigmoid']
}
svm_grid = GridSearchCV(
    SVC(random_state=42),
    svm_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)
svm_grid.fit(X, y)
print(f"SVM - Best parameters: {svm_grid.best_params_}")
print(f"SVM - Best CV score: {svm_grid.best_score_:.3f}")
# Compare best models
best_models = {
    'Best Random Forest': rf_grid.best_estimator_,
    'Best SVM': svm_grid.best_estimator_,
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000)
}
print("\nFinal Model Comparison:")
for name, model in best_models.items():
    cv_scores = cross_val_score(model, X, y, cv=cv_folds, scoring='accuracy')
    print(f"{name}: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
Advanced Evaluation Metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report, roc_auc_score,
precision_recall_curve, roc_curve
)
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Load breast cancer dataset for binary classification
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42)  # probability=True for ROC curve
}
evaluation_results = {}
for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)
    # Predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]  # Probability of positive class
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()
    # Specificity (True Negative Rate)
    specificity = tn / (tn + fp)
    evaluation_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc,
        'specificity': specificity,
        'confusion_matrix': cm,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }
# Display results
print("Comprehensive Model Evaluation Results:")
print("="*80)
print(f"{'Model':<20}{'Accuracy':<10}{'Precision':<10}{'Recall':<10}{'F1':<10}{'AUC':<10}{'Specificity':<10}")
print("-"*80)
for name, results in evaluation_results.items():
    print(f"{name:<20}{results['accuracy']:<10.3f}{results['precision']:<10.3f}"
          f"{results['recall']:<10.3f}{results['f1']:<10.3f}{results['auc']:<10.3f}"
          f"{results['specificity']:<10.3f}")
# Detailed classification report for best model
best_model_name = max(evaluation_results.keys(),
                      key=lambda x: evaluation_results[x]['f1'])
print(f"\nDetailed Classification Report for {best_model_name}:")
print(classification_report(y_test, evaluation_results[best_model_name]['y_pred'],
                            target_names=cancer.target_names))
# Confusion matrices
print("\nConfusion Matrices:")
for name, results in evaluation_results.items():
    print(f"\n{name}:")
    print(results['confusion_matrix'])
    cm = results['confusion_matrix']
    print(f"True Negatives: {cm[0,0]}, False Positives: {cm[0,1]}")
    print(f"False Negatives: {cm[1,0]}, True Positives: {cm[1,1]}")
# ROC Curve data preparation
print("\nROC Curve Analysis:")
for name, results in evaluation_results.items():
    fpr, tpr, _ = roc_curve(y_test, results['y_pred_proba'])
    print(f"{name} - AUC: {results['auc']:.3f}")
# Precision-Recall Curve analysis
from sklearn.metrics import average_precision_score
print("\nPrecision-Recall Analysis:")
for name, results in evaluation_results.items():
    precision_curve, recall_curve, _ = precision_recall_curve(y_test, results['y_pred_proba'])
    # Average precision
    avg_precision = average_precision_score(y_test, results['y_pred_proba'])
    print(f"{name} - Average Precision: {avg_precision:.3f}")
Learning Curves and Validation Curves
from sklearn.model_selection import learning_curve, validation_curve
import numpy as np
# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Learning Curve – shows training performance vs dataset size
def plot_learning_curve_data(estimator, X, y, cv=5):
    train_sizes = np.linspace(0.1, 1.0, 10)
    train_sizes_abs, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes,
        scoring='accuracy', n_jobs=-1, random_state=42
    )
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    return train_sizes_abs, train_mean, train_std, val_mean, val_std
# Generate learning curves for Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
train_sizes, train_mean, train_std, val_mean, val_std = plot_learning_curve_data(
rf_model, X_scaled, y
)
print("Learning Curve Analysis - Random Forest:")
print("Dataset Size\tTrain Score\tValidation Score")
for size, tr_mean, tr_std, val_mean_score, val_std_score in zip(
    train_sizes, train_mean, train_std, val_mean, val_std
):
    print(f"{size:<12}{tr_mean:.3f}±{tr_std:.3f}\t{val_mean_score:.3f}±{val_std_score:.3f}")
# Validation Curve - shows performance vs hyperparameter values
# For Random Forest: n_estimators
param_range = [10, 50, 100, 200, 500]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X_scaled, y,
    param_name='n_estimators', param_range=param_range,
    cv=5, scoring='accuracy', n_jobs=-1
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)
print("\nValidation Curve Analysis - n_estimators:")
print("n_estimators\tTrain Score\tValidation Score")
for param, tr_mean, tr_std, val_mean_score, val_std_score in zip(
    param_range, train_mean, train_std, val_mean, val_std
):
    print(f"{param:<12}{tr_mean:.3f}±{tr_std:.3f}\t{val_mean_score:.3f}±{val_std_score:.3f}")
# Optimal n_estimators
optimal_idx = np.argmax(val_mean)
optimal_n_estimators = param_range[optimal_idx]
print(f"Optimal n_estimators: {optimal_n_estimators}")
# For SVM: C parameter
svm_param_range = [0.1, 1, 10, 100, 1000]
svm_train_scores, svm_val_scores = validation_curve(
    SVC(random_state=42), X_scaled, y,
    param_name='C', param_range=svm_param_range,
    cv=5, scoring='accuracy', n_jobs=-1
)
svm_train_mean = np.mean(svm_train_scores, axis=1)
svm_train_std = np.std(svm_train_scores, axis=1)
svm_val_mean = np.mean(svm_val_scores, axis=1)
svm_val_std = np.std(svm_val_scores, axis=1)
print("\nValidation Curve Analysis - SVM C parameter:")
print("C\t\tTrain Score\tValidation Score")
for param, tr_mean, tr_std, val_mean_score, val_std_score in zip(
    svm_param_range, svm_train_mean, svm_train_std, svm_val_mean, svm_val_std
):
    print(f"{param:<12}{tr_mean:.3f}±{tr_std:.3f}\t{val_mean_score:.3f}±{val_std_score:.3f}")
# Bias-Variance Analysis
print("\nBias-Variance Analysis:")
print("High Bias (Underfitting): Training and validation scores are both low and close")
print("High Variance (Overfitting): Large gap between training and validation scores")
print("Good Fit: Both scores are high and close to each other")
# Check for overfitting/underfitting
final_train_score = train_mean[-1]
final_val_score = val_mean[-1]
gap = final_train_score - final_val_score
print("\nRandom Forest Final Analysis:")
print(f"Final training score: {final_train_score:.3f}")
print(f"Final validation score: {final_val_score:.3f}")
print(f"Gap (potential overfitting indicator): {gap:.3f}")
if gap > 0.05:
    print("Model might be overfitting - consider regularization")
elif final_val_score < 0.8:
    print("Model might be underfitting - consider more complex model")
else:
    print("Model appears to have good bias-variance tradeoff")
Data Preprocessing and Feature Engineering
Data preprocessing is often the most crucial step in ML pipelines. Let’s explore scaling, encoding, and feature selection techniques.
Feature Scaling and Normalization
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler,
Normalizer, QuantileTransformer, PowerTransformer
)
import pandas as pd
import numpy as np
# Create sample data with different scales
np.random.seed(42)
data = pd.DataFrame({
    'feature_1': np.random.normal(100, 15, 1000),       # Mean=100, std=15
    'feature_2': np.random.exponential(2, 1000),        # Exponential distribution
    'feature_3': np.random.uniform(0, 1, 1000),         # Uniform [0,1]
    'feature_4': np.random.normal(0, 0.1, 1000) + 1000  # Mean=1000, std=0.1
})
print("Original Data Statistics:")
print(data.describe())
# Different scaling methods
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler(),
    'QuantileTransformer': QuantileTransformer(random_state=42),
    'PowerTransformer': PowerTransformer()  # PowerTransformer takes no random_state parameter
}
scaling_results = {}
for name, scaler in scalers.items():
    scaled_data = scaler.fit_transform(data)
    scaling_results[name] = pd.DataFrame(
        scaled_data,
        columns=data.columns
    )
# Compare scaling results
print("\nScaling Comparison - Feature Statistics After Scaling:")
print("="*80)
for scaler_name, scaled_df in scaling_results.items():
    print(f"\n{scaler_name}:")
    print(f"Mean: {scaled_df.mean().values}")
    print(f"Std: {scaled_df.std().values}")
    print(f"Min: {scaled_df.min().values}")
    print(f"Max: {scaled_df.max().values}")
# Use StandardScaler with real dataset
wine = load_wine()
X, y = wine.data, wine.target
# Compare model performance with and without scaling
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Without scaling
svm_no_scale = SVC(random_state=42)
scores_no_scale = cross_val_score(svm_no_scale, X, y, cv=5)
# With scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
svm_scaled = SVC(random_state=42)
scores_scaled = cross_val_score(svm_scaled, X_scaled, y, cv=5)
print("\nSVM Performance Comparison:")
print(f"Without scaling: {scores_no_scale.mean():.3f} ± {scores_no_scale.std():.3f}")
print(f"With scaling: {scores_scaled.mean():.3f} ± {scores_scaled.std():.3f}")
print(f"Improvement: {scores_scaled.mean() - scores_no_scale.mean():.3f}")
# Feature-wise scaling analysis
print("\nOriginal Feature Ranges:")
for i, feature_name in enumerate(wine.feature_names):
    print(f"{feature_name}: [{X[:, i].min():.2f}, {X[:, i].max():.2f}]")
# Robust vs Standard scaling comparison (with outliers)
# Add some outliers to demonstrate RobustScaler
X_with_outliers = X.copy()
X_with_outliers[0, 0] = X_with_outliers[0, 0] * 10 # Create outlier
standard_scaler = StandardScaler()
robust_scaler = RobustScaler()
X_standard = standard_scaler.fit_transform(X_with_outliers)
X_robust = robust_scaler.fit_transform(X_with_outliers)
print("\nOutlier Impact on Scaling (Feature 0):")
print(f"Original with outlier: {X_with_outliers[0, 0]:.2f}")
print(f"StandardScaler result: {X_standard[0, 0]:.2f}")
print(f"RobustScaler result: {X_robust[0, 0]:.2f}")
print(f"Standard deviation - Standard: {X_standard[:, 0].std():.3f}")
print(f"Standard deviation - Robust: {X_robust[:, 0].std():.3f}")
Categorical Encoding
from sklearn.preprocessing import (
    LabelEncoder, OneHotEncoder, OrdinalEncoder
)
# TargetEncoder (scikit-learn >= 1.3) is imported inside a try block below
import pandas as pd
# Create sample categorical data
np.random.seed(42)
sample_data = pd.DataFrame({
    'color': np.random.choice(['red', 'blue', 'green', 'yellow'], 1000),
    'size': np.random.choice(['small', 'medium', 'large'], 1000),
    'quality': np.random.choice(['poor', 'fair', 'good', 'excellent'], 1000),
    'target': np.random.choice([0, 1], 1000)
})
print("Sample Categorical Data:")
print(sample_data.head(10))
print("\nValue counts for each column:")
for col in ['color', 'size', 'quality']:
    print(f"{col}: {sample_data[col].nunique()} unique values")
    print(f"  {sample_data[col].value_counts().to_dict()}")
# 1. Label Encoding (for ordinal data)
label_encoder = LabelEncoder()
sample_data['size_label'] = label_encoder.fit_transform(sample_data['size'])
print("\nLabel Encoding for 'size':")
size_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(f"Mapping: {size_mapping}")
# 2. Ordinal Encoding (with custom order)
ordinal_encoder = OrdinalEncoder(categories=[['poor', 'fair', 'good', 'excellent']])
sample_data['quality_ordinal'] = ordinal_encoder.fit_transform(sample_data[['quality']])
print("\nOrdinal Encoding for 'quality':")
quality_mapping = dict(zip(['poor', 'fair', 'good', 'excellent'], [0, 1, 2, 3]))
print(f"Custom mapping: {quality_mapping}")
# 3. One-Hot Encoding
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity
color_encoded = onehot_encoder.fit_transform(sample_data[['color']])
# Create column names for one-hot encoded features
color_columns = [f"color_{cat}" for cat in onehot_encoder.categories_[0][1:]]  # Skip first due to drop='first'
color_df = pd.DataFrame(color_encoded, columns=color_columns)
print("\nOne-Hot Encoding for 'color':")
print(f"Original categories: {onehot_encoder.categories_[0]}")
print(f"Encoded columns: {color_columns}")
print(color_df.head())
# 4. Target Encoding (mean encoding)
try:
    from sklearn.preprocessing import TargetEncoder  # only available in scikit-learn >= 1.3
    target_encoder = TargetEncoder(random_state=42)
    sample_data['color_target'] = target_encoder.fit_transform(
        sample_data[['color']], sample_data['target']
    )
    print("\nTarget Encoding for 'color':")
    target_means = sample_data.groupby('color')['target'].mean().sort_values(ascending=False)
    print("Mean target value by color:")
    print(target_means)
except ImportError:
    print("\nTarget Encoding not available in this sklearn version")
    # Alternative: Manual target encoding
    target_means = sample_data.groupby('color')['target'].mean()
    sample_data['color_target_manual'] = sample_data['color'].map(target_means)
    print("Manual Target Encoding:")
    print(target_means)
# Compare encoding methods with a real example
# Using Titanic-like categorical data
titanic_sample = pd.DataFrame({
    'sex': np.random.choice(['male', 'female'], 500),
    'embarked': np.random.choice(['S', 'C', 'Q'], 500),
    'class': np.random.choice(['First', 'Second', 'Third'], 500),
    'survived': np.random.choice([0, 1], 500)
})
print("\nTitanic-like Dataset Encoding Comparison:")
# Method 1: Label Encoding
le_sex = LabelEncoder()
titanic_sample['sex_label'] = le_sex.fit_transform(titanic_sample['sex'])
# Method 2: One-Hot Encoding
ohe = OneHotEncoder(sparse_output=False, drop='first')
embarked_encoded = ohe.fit_transform(titanic_sample[['embarked']])
embarked_cols = [f"embarked_{cat}" for cat in ohe.categories_[0][1:]]
# Method 3: Ordinal Encoding with logical order
class_order = [['Third', 'Second', 'First']]  # Logical order for class
oe_class = OrdinalEncoder(categories=class_order)
titanic_sample['class_ordinal'] = oe_class.fit_transform(titanic_sample[['class']])
print("Encoding results:")
print(f"Sex (Label): {dict(zip(le_sex.classes_, le_sex.transform(le_sex.classes_)))}")
print(f"Embarked (One-Hot): {embarked_cols}")
print(f"Class (Ordinal): {dict(zip(class_order[0], [0, 1, 2]))}")
# Performance comparison
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Prepare different encoding versions
X_label = titanic_sample[['sex_label', 'class_ordinal']].copy()
X_label['embarked_label'] = LabelEncoder().fit_transform(titanic_sample['embarked'])
X_mixed = pd.concat([
    titanic_sample[['sex_label', 'class_ordinal']],
    pd.DataFrame(embarked_encoded, columns=embarked_cols)
], axis=1)
y = titanic_sample['survived']
# Compare performance
rf = RandomForestClassifier(random_state=42)
scores_label = cross_val_score(rf, X_label, y, cv=5)
scores_mixed = cross_val_score(rf, X_mixed, y, cv=5)
print("\nEncoding Performance Comparison:")
print(f"All Label Encoding: {scores_label.mean():.3f} ± {scores_label.std():.3f}")
print(f"Mixed Encoding: {scores_mixed.mean():.3f} ± {scores_mixed.std():.3f}")
print(f"Difference: {scores_mixed.mean() - scores_label.mean():.3f}")
Feature Selection
from sklearn.feature_selection import (
SelectKBest, f_classif, chi2, mutual_info_classif,
RFE, SelectFromModel, VarianceThreshold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
# Load breast cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
print(f"Original dataset shape: {X.shape}")
print(f"Feature names (first 10): {cancer.feature_names[:10]}")
# 1. Variance Threshold - Remove low-variance features
variance_selector = VarianceThreshold(threshold=0.1)
X_variance = variance_selector.fit_transform(X)
print("\nVariance Threshold Selection:")
print(f"Features after variance threshold: {X_variance.shape[1]}")
print(f"Removed features: {X.shape[1] - X_variance.shape[1]}")
# 2. Univariate Feature Selection
# Using f_classif (ANOVA F-test)
k_best_f = SelectKBest(score_func=f_classif, k=10)
X_f_test = k_best_f.fit_transform(X, y)
# Get feature scores and names
f_scores = k_best_f.scores_
f_pvalues = k_best_f.pvalues_
selected_features_f = k_best_f.get_support()
print("\nUnivariate Selection (F-test) - Top 10 features:")
feature_scores_f = list(zip(cancer.feature_names, f_scores, f_pvalues))
feature_scores_f.sort(key=lambda x: x[1], reverse=True)
for i, (name, score, pvalue) in enumerate(feature_scores_f[:10]):
    print(f"{i+1:2d}. {name:<25} Score: {score:8.2f}, p-value: {pvalue:.2e}")
# Using mutual information
k_best_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_mi = k_best_mi.fit_transform(X, y)
mi_scores = k_best_mi.scores_
feature_scores_mi = list(zip(cancer.feature_names, mi_scores))
feature_scores_mi.sort(key=lambda x: x[1], reverse=True)
print("\nUnivariate Selection (Mutual Information) - Top 10 features:")
for i, (name, score) in enumerate(feature_scores_mi[:10]):
    print(f"{i+1:2d}. {name:<25} MI Score: {score:.4f}")
# 3. Recursive Feature Elimination (RFE)
rfe_estimator = RandomForestClassifier(n_estimators=100, random_state=42)
rfe_selector = RFE(estimator=rfe_estimator, n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)
rfe_selected = rfe_selector.get_support()
rfe_ranking = rfe_selector.ranking_
print("\nRecursive Feature Elimination - Top 10 features:")
rfe_features = [(name, rank) for name, rank, selected in
                zip(cancer.feature_names, rfe_ranking, rfe_selected) if selected]
rfe_features.sort(key=lambda x: x[1])
for i, (name, rank) in enumerate(rfe_features):
    print(f"{i+1:2d}. {name:<25} Rank: {rank}")
# 4. Model-based Feature Selection (L1 regularization)
lasso_selector = SelectFromModel(LassoCV(cv=5, random_state=42))
X_lasso = lasso_selector.fit_transform(X, y)
lasso_selected = lasso_selector.get_support()
lasso_coefs = lasso_selector.estimator_.coef_
print("\nL1-based Feature Selection (Lasso):")
print(f"Features selected: {X_lasso.shape[1]}")
lasso_features = [(name, abs(coef)) for name, coef, selected in
                  zip(cancer.feature_names, lasso_coefs, lasso_selected) if selected]
lasso_features.sort(key=lambda x: x[1], reverse=True)
print("Selected features by Lasso:")
for i, (name, coef) in enumerate(lasso_features):
    print(f"{i+1:2d}. {name:<25} |Coefficient|: {coef:.4f}")
# 5. Tree-based Feature Selection
rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
X_rf = rf_selector.fit_transform(X, y)
rf_selected = rf_selector.get_support()
rf_importances = rf_selector.estimator_.feature_importances_
print("\nTree-based Feature Selection (Random Forest):")
print(f"Features selected: {X_rf.shape[1]}")
rf_features = [(name, importance) for name, importance, selected in
               zip(cancer.feature_names, rf_importances, rf_selected) if selected]
rf_features.sort(key=lambda x: x[1], reverse=True)
print("Selected features by Random Forest:")
for i, (name, importance) in enumerate(rf_features):
    print(f"{i+1:2d}. {name:<25} Importance: {importance:.4f}")
# Performance comparison of feature selection methods
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# Standardize features for SVM
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply scaling to selected feature sets
X_f_scaled = scaler.fit_transform(X_f_test)
X_mi_scaled = scaler.fit_transform(X_mi)
X_rfe_scaled = scaler.fit_transform(X_rfe)
X_lasso_scaled = scaler.fit_transform(X_lasso)
X_rf_scaled = scaler.fit_transform(X_rf)
# Evaluate with SVM
svm_model = SVC(random_state=42)
feature_sets = {
    'All features': X_scaled,
    'F-test (10)': X_f_scaled,
    'Mutual Info (10)': X_mi_scaled,
    'RFE (10)': X_rfe_scaled,
    f'Lasso ({X_lasso.shape[1]})': X_lasso_scaled,
    f'Random Forest ({X_rf.shape[1]})': X_rf_scaled
}
print("\nFeature Selection Performance Comparison (SVM):")
print("="*60)
print(f"{'Method':<20}{'Features':<10}{'CV Score':<15}{'Std':<10}")
print("-"*60)
best_method = None
best_score = 0
for method, X_selected in feature_sets.items():
    cv_scores = cross_val_score(svm_model, X_selected, y, cv=5)
    mean_score = cv_scores.mean()
    std_score = cv_scores.std()
    print(f"{method:<20}{X_selected.shape[1]:<10}{mean_score:<15.3f}{std_score:<10.3f}")
    if mean_score > best_score:
        best_score = mean_score
        best_method = method
print(f"\nBest performing method: {best_method} (Score: {best_score:.3f})")
# Feature overlap analysis
print("\nFeature Selection Overlap Analysis:")
methods = {
    'F-test': set(np.array(cancer.feature_names)[k_best_f.get_support()]),
    'Mutual Info': set(np.array(cancer.feature_names)[k_best_mi.get_support()]),
    'RFE': set(np.array(cancer.feature_names)[rfe_selected]),
    'Lasso': set(np.array(cancer.feature_names)[lasso_selected]),
    'Random Forest': set(np.array(cancer.feature_names)[rf_selected])
}
# Find common features across all methods
common_features = set.intersection(*methods.values())
print(f"Features selected by ALL methods ({len(common_features)}): {sorted(common_features)}")
# Pairwise overlap
print("\nPairwise Method Overlap:")
method_names = list(methods.keys())
for i, method1 in enumerate(method_names):
    for method2 in method_names[i+1:]:
        overlap = len(methods[method1] & methods[method2])
        union = len(methods[method1] | methods[method2])
        jaccard = overlap / union if union > 0 else 0
        print(f"{method1} ∩ {method2}: {overlap} features (Jaccard: {jaccard:.3f})")
ML Pipelines and Automation
Pipelines in scikit-learn allow you to chain preprocessing and model steps, ensuring consistent transformations and preventing data leakage.
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer  # SimpleImputer lives in sklearn.impute, not sklearn.preprocessing
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np
# Create a mixed dataset (numerical and categorical)
np.random.seed(42)
n_samples = 1000
# Generate mixed data
data = pd.DataFrame({
    'num_feature_1': np.random.normal(50, 15, n_samples),
    'num_feature_2': np.random.exponential(2, n_samples),
    'num_feature_3': np.random.uniform(0, 100, n_samples),
    'cat_feature_1': np.random.choice(['A', 'B', 'C'], n_samples),
    'cat_feature_2': np.random.choice(['X', 'Y'], n_samples),
    'target': np.random.choice([0, 1], n_samples)
})
# Add some missing values to make it realistic
missing_indices = np.random.choice(n_samples, size=50, replace=False)
data.loc[missing_indices[:25], 'num_feature_1'] = np.nan
data.loc[missing_indices[25:], 'cat_feature_1'] = np.nan
print("Dataset Info:")
print(f"Shape: {data.shape}")
print(f"Missing values:\n{data.isnull().sum()}")
print("\nFirst 5 rows:")
print(data.head())
# Separate features and target
X = data.drop('target', axis=1)
y = data['target']
# Define preprocessing for numerical and categorical features
numerical_features = ['num_feature_1', 'num_feature_2', 'num_feature_3']
categorical_features = ['cat_feature_1', 'cat_feature_2']
# Create preprocessing pipelines
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])
# Combine preprocessing steps
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])
# Create complete ML pipeline
ml_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', SelectKBest(f_classif, k=5)),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Fit the pipeline
ml_pipeline.fit(X_train, y_train)
# Make predictions
y_pred = ml_pipeline.predict(X_test)
y_pred_proba = ml_pipeline.predict_proba(X_test)
# Evaluate the pipeline
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
print("\nPipeline Results:")
print(f"Accuracy: {accuracy:.3f}")
# Cross-validation with pipeline
cv_scores = cross_val_score(ml_pipeline, X_train, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Hyperparameter tuning with pipeline
param_grid = {
    'feature_selection__k': [3, 5, 7, 'all'],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, None],
    'classifier__min_samples_split': [2, 5, 10]
}
# Grid search with pipeline
grid_search = GridSearchCV(
ml_pipeline,
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
print("\nRunning Grid Search...")
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
# Evaluate best model
best_model = grid_search.best_estimator_
best_pred = best_model.predict(X_test)
best_accuracy = accuracy_score(y_test, best_pred)
print(f"Test accuracy with best model: {best_accuracy:.3f}")
# Pipeline introspection
print("\nPipeline Steps:")
for i, (name, step) in enumerate(ml_pipeline.steps):
    print(f"{i+1}. {name}: {type(step).__name__}")
# Access individual components
preprocessor_step = ml_pipeline.named_steps['preprocessor']
selector_step = ml_pipeline.named_steps['feature_selection']
classifier_step = ml_pipeline.named_steps['classifier']
print("\nPreprocessor transformers:")
for name, transformer, features in preprocessor_step.transformers_:
    print(f"- {name}: {type(transformer).__name__} on {features}")
# Feature names after preprocessing
try:
    feature_names = preprocessor_step.get_feature_names_out()
    selected_features = selector_step.get_support()
    final_features = feature_names[selected_features]
    print(f"\nSelected features: {final_features}")
except AttributeError:
    print(f"\nNumber of features after preprocessing: {preprocessor_step.fit_transform(X_train).shape[1]}")
    print(f"Number of selected features: {selector_step.k}")
# Feature importance from the classifier
if hasattr(classifier_step, 'feature_importances_'):
    importance_df = pd.DataFrame({
        'importance': classifier_step.feature_importances_
    })
    print("\nFeature importances (top 5):")
    print(importance_df.sort_values('importance', ascending=False).head())
# Advanced Pipeline with Custom Transformer
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        # Calculate IQR for each feature
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        IQR = Q3 - Q1
        self.lower_bound = Q1 - self.factor * IQR
        self.upper_bound = Q3 + self.factor * IQR
        return self

    def transform(self, X):
        # Cap outliers at the IQR-based bounds
        X_transformed = X.copy()
        for i in range(X.shape[1]):
            X_transformed[:, i] = np.clip(
                X_transformed[:, i],
                self.lower_bound[i],
                self.upper_bound[i]
            )
        return X_transformed

# Enhanced pipeline with custom transformer
enhanced_numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('outlier_remover', OutlierRemover(factor=1.5)),
    ('scaler', StandardScaler())
])
enhanced_preprocessor = ColumnTransformer([
    ('num', enhanced_numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

enhanced_pipeline = Pipeline([
    ('preprocessor', enhanced_preprocessor),
    ('feature_selection', SelectKBest(f_classif, k=5)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Compare pipelines
pipelines = {
    'Basic Pipeline': ml_pipeline,
    'Enhanced Pipeline': enhanced_pipeline
}

print(f"\nPipeline Comparison:")
for name, pipeline in pipelines.items():
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    print(f"{name}: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Saving and Loading Pipelines
import joblib

# Save the best pipeline
joblib.dump(grid_search.best_estimator_, 'best_ml_pipeline.pkl')
print(f"\nPipeline saved to 'best_ml_pipeline.pkl'")
# Load the pipeline and use it exactly like the original object
loaded_pipeline = joblib.load('best_ml_pipeline.pkl')
new_predictions = loaded_pipeline.predict(X_test)
print(f"Loaded pipeline matches original: {(new_predictions == best_pred).all()}")
print(f"\nPipeline Benefits:")
print("✓ Prevents data leakage by applying transformations consistently")
print("✓ Simplifies hyperparameter tuning across all steps")
print("✓ Ensures reproducible preprocessing")
print("✓ Easy to deploy and maintain")
print("✓ Handles complex preprocessing workflows")
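Because the fitted pipeline bundles imputation, encoding, scaling, feature selection, and the classifier, it can score brand-new raw rows directly. Here is a minimal sketch reusing the fitted ml_pipeline from above; the feature values are made up, and the categorical value is borrowed from the training data so the encoder has already seen it:

# A new raw row with the same columns as the training data (illustrative values)
new_row = pd.DataFrame({
    'num_feature_1': [0.5],
    'num_feature_2': [np.nan],  # a missing value, handled by the imputer
    'num_feature_3': [1.2],
    'cat_feature_1': [X_train['cat_feature_1'].mode()[0]],  # category seen in training
    'cat_feature_2': ['Y']
})
print(f"Prediction for a new raw row: {ml_pipeline.predict(new_row)[0]}")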
Best Practices and Performance Tips
Model Development Best Practices
🎯 Always Use Cross-Validation
Never rely on a single train-test split. Use stratified k-fold for classification, and repeated cross-validation when you need more robust estimates.
🔄 Prevent Data Leakage
Apply all transformations inside the cross-validation loop, or use pipelines so this happens automatically. Fit scalers and feature selectors on training data only.
📊 Scale Your Features
Distance- and gradient-based algorithms (SVM, KNN, neural networks, regularized linear models) require feature scaling; tree-based methods are the exception, though they can still benefit. The sketch below ties these three tips together.
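A minimal sketch of all three tips at once, using scikit-learn's built-in breast cancer dataset purely for illustration: the scaler lives inside the pipeline, so it is re-fit on each training fold (no leakage), and RepeatedStratifiedKFold gives the more robust, stratified estimate mentioned above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
safe_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds, repeated 3 times for a more stable estimate
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(safe_model, X_demo, y_demo, cv=rskf)
print(f"Leakage-safe CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")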
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

# Load dataset
digits = load_digits()
X, y = digits.data, digits.target

# 1. Proper Cross-Validation Setup
def evaluate_model_properly(model, X, y, cv_folds=5):
    """
    Evaluate a model with multiple metrics using proper cross-validation.
    """
    # Define multiple scoring metrics
    scoring = {
        'accuracy': 'accuracy',
        'precision_macro': 'precision_macro',
        'recall_macro': 'recall_macro',
        'f1_macro': 'f1_macro'
    }
    # Use stratified k-fold to maintain class distribution
    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)
    # Perform cross-validation
    cv_results = cross_validate(
        model, X, y,
        cv=cv,
        scoring=scoring,
        return_train_score=True,
        n_jobs=-1
    )
    return cv_results
# 2. Compare Multiple Models with Proper Evaluation
models = {
    'Logistic Regression': make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=42, max_iter=1000)
    ),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': make_pipeline(
        StandardScaler(),
        SVC(random_state=42)
    ),
    'KNN': make_pipeline(
        StandardScaler(),
        KNeighborsClassifier()
    )
}

print("Comprehensive Model Comparison:")
print("=" * 60)
print(f"{'Model':<20}{'Accuracy':<10}{'Precision':<10}{'Recall':<10}{'F1':<10}")
print("-" * 60)

model_results = {}
for name, model in models.items():
    results = evaluate_model_properly(model, X, y)
    # Calculate mean and std for each metric
    metrics = {}
    for metric in ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']:
        test_scores = results[f'test_{metric}']
        metrics[metric] = {
            'mean': test_scores.mean(),
            'std': test_scores.std()
        }
    model_results[name] = metrics
    print(f"{name:<20}{metrics['accuracy']['mean']:<10.3f}{metrics['precision_macro']['mean']:<10.3f}"
          f"{metrics['recall_macro']['mean']:<10.3f}{metrics['f1_macro']['mean']:<10.3f}")

# 3. Detecting Overfitting
print(f"\nOverfitting Detection (Train vs Test Scores):")
print("=" * 60)
print(f"{'Model':<20}{'Train Acc':<12}{'Test Acc':<12}{'Gap':<10}")
print("-" * 60)

for name, model in models.items():
    results = evaluate_model_properly(model, X, y)
    train_acc = results['train_accuracy'].mean()
    test_acc = results['test_accuracy'].mean()
    gap = train_acc - test_acc
    status = "⚠️ Overfitting" if gap > 0.05 else "✅ Good"
    print(f"{name:<20}{train_acc:<12.3f}{test_acc:<12.3f}{gap:<10.3f} {status}")

# 4. Hyperparameter Tuning Best Practices
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Random Search vs Grid Search comparison
print(f"\nHyperparameter Tuning Comparison:")

# Define parameter distributions for RandomizedSearchCV
rf_param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 10, 15, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

# Grid search (smaller grid for comparison)
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}

rf_base = RandomForestClassifier(random_state=42)

# Random search
random_search = RandomizedSearchCV(
    rf_base, rf_param_dist, n_iter=50, cv=3,
    random_state=42, n_jobs=-1, verbose=0
)

# Grid search
grid_search = GridSearchCV(
    rf_base, rf_param_grid, cv=3,
    n_jobs=-1, verbose=0
)

# Time the searches
import time

start_time = time.time()
random_search.fit(X, y)
random_time = time.time() - start_time

start_time = time.time()
grid_search.fit(X, y)
grid_time = time.time() - start_time

print(f"Random Search: {random_search.best_score_:.3f} in {random_time:.1f}s")
print(f"Grid Search: {grid_search.best_score_:.3f} in {grid_time:.1f}s")
print(f"Time savings: {((grid_time - random_time) / grid_time * 100):.1f}%")# 5. Feature Engineering Best Practices
print(f"\nFeature Engineering Guidelines:")
print("✓ Handle missing values appropriately")
print("✓ Scale numerical features for distance-based algorithms")
print("✓ Encode categorical variables properly")
print("✓ Create interaction features when domain knowledge suggests")
print("✓ Use domain knowledge for feature creation")
print("✓ Remove highly correlated features")
print("✓ Apply feature selection techniques")# Example: Correlation-based feature removal
correlation_threshold = 0.95
corr_matrix = pd.DataFrame(X).corr().abs()# Find highly correlated pairs
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
if corr_matrix.iloc[i, j] > correlation_threshold:
high_corr_pairs.append((i, j, corr_matrix.iloc[i, j]))
print(f”\nHighly correlated feature pairs (>{correlation_threshold}):”)
for i, j, corr in high_corr_pairs:
print(f”Feature {i} – Feature {j}: {corr:.3f}”)
# 6. Model Interpretability
from sklearn.inspection import permutation_importance

# Get feature importance for Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# Built-in feature importance
feature_importance = rf_model.feature_importances_

# Permutation importance (more reliable than impurity-based importance)
perm_importance = permutation_importance(
    rf_model, X, y, n_repeats=10, random_state=42, n_jobs=-1
)

print(f"\nFeature Importance Comparison (Top 10):")
print(f"{'Rank':<5}{'Built-in':<12}{'Permutation':<12}{'Difference':<12}")
print("-" * 45)

# Sort by permutation importance
importance_indices = perm_importance.importances_mean.argsort()[::-1]
for i, idx in enumerate(importance_indices[:10]):
    builtin_imp = feature_importance[idx]
    perm_imp = perm_importance.importances_mean[idx]
    diff = abs(builtin_imp - perm_imp)
    print(f"{i+1:<5}{builtin_imp:<12.4f}{perm_imp:<12.4f}{diff:<12.4f}")

print(f"\nModel Deployment Checklist:")
print("✓ Save preprocessing steps and model together (use pipelines)")
print("✓ Version your models and data")
print("✓ Monitor model performance in production")
print("✓ Set up alerts for performance degradation")
print("✓ Plan for model retraining")
print("✓ Document model assumptions and limitations")
print("✓ Test model fairness across different groups")
Real-World Applications
Complete End-to-End ML Project: Customer Churn Prediction
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# Simulate customer churn dataset
np.random.seed(42)
n_customers = 5000

# Generate realistic customer data (clip tenure and charges so they stay non-negative)
customer_data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'tenure_months': np.random.normal(24, 12, n_customers).clip(0).astype(int),
    'monthly_charges': np.random.normal(65, 20, n_customers).clip(0),
    'total_charges': np.random.normal(1500, 800, n_customers).clip(0),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'],
                                      n_customers, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'],
                                       n_customers, p=[0.4, 0.2, 0.2, 0.2]),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'],
                                         n_customers, p=[0.4, 0.4, 0.2]),
    'tech_support': np.random.choice(['Yes', 'No'], n_customers, p=[0.3, 0.7]),
    'online_security': np.random.choice(['Yes', 'No'], n_customers, p=[0.35, 0.65]),
    'senior_citizen': np.random.choice([0, 1], n_customers, p=[0.84, 0.16]),
    'dependents': np.random.choice([0, 1], n_customers, p=[0.7, 0.3]),
    'paperless_billing': np.random.choice([0, 1], n_customers, p=[0.4, 0.6])
})
# Create realistic churn patterns
churn_probability = 0.1  # Base churn rate

# Factors that increase churn probability
customer_data['churn_prob'] = churn_probability

# Month-to-month contracts have higher churn
customer_data.loc[customer_data['contract_type'] == 'Month-to-month', 'churn_prob'] *= 3

# High monthly charges increase churn
customer_data.loc[customer_data['monthly_charges'] > 80, 'churn_prob'] *= 2

# Poor service (no tech support, no online security) increases churn
customer_data.loc[
    (customer_data['tech_support'] == 'No') &
    (customer_data['online_security'] == 'No'), 'churn_prob'
] *= 2.5

# Senior citizens have higher churn
customer_data.loc[customer_data['senior_citizen'] == 1, 'churn_prob'] *= 1.5

# Short-tenure customers have higher churn
customer_data.loc[customer_data['tenure_months'] < 12, 'churn_prob'] *= 2

# Generate churn based on the probabilities
customer_data['churn'] = np.random.binomial(1, customer_data['churn_prob'].clip(0, 1))

# Remove the ID and probability columns (not available in real data)
customer_data = customer_data.drop(['customer_id', 'churn_prob'], axis=1)

print("Customer Churn Dataset:")
print(f"Shape: {customer_data.shape}")
print(f"Churn rate: {customer_data['churn'].mean():.1%}")
print(f"\nFeature types:")
print(customer_data.dtypes)
print(f"\nFirst few rows:")
print(customer_data.head())

# Exploratory Data Analysis
print(f"\nChurn Analysis by Key Features:")

# Churn by contract type
churn_by_contract = customer_data.groupby('contract_type')['churn'].agg(['count', 'sum', 'mean'])
churn_by_contract['churn_rate'] = churn_by_contract['mean']
print(f"\nChurn by Contract Type:")
print(churn_by_contract[['count', 'sum', 'churn_rate']].round(3))

# Churn by internet service
churn_by_internet = customer_data.groupby('internet_service')['churn'].agg(['count', 'sum', 'mean'])
print(f"\nChurn by Internet Service:")
print(churn_by_internet[['count', 'sum', 'mean']].round(3))

# Feature Engineering
def create_features(df):
    """Create additional features based on domain knowledge."""
    df_new = df.copy()
    # Monthly charges per tenure (loyalty metric)
    df_new['charges_per_tenure'] = df_new['monthly_charges'] / (df_new['tenure_months'] + 1)
    # Total charges normalized by tenure
    df_new['avg_monthly_spend'] = df_new['total_charges'] / (df_new['tenure_months'] + 1)
    # Service quality score (number of additional services)
    service_cols = ['tech_support', 'online_security']
    df_new['service_quality_score'] = sum(
        (df_new[col] == 'Yes').astype(int) for col in service_cols
    )
    # High-value customer flag
    df_new['high_value'] = (df_new['monthly_charges'] > df_new['monthly_charges'].quantile(0.75)).astype(int)
    # Contract stability (non-month-to-month)
    df_new['stable_contract'] = (df_new['contract_type'] != 'Month-to-month').astype(int)
    return df_new

# Apply feature engineering
customer_data_fe = create_features(customer_data)
print(f"\nFeatures after engineering: {customer_data_fe.shape[1]}")
print(f"New features: {[col for col in customer_data_fe.columns if col not in customer_data.columns]}")
# Prepare data for modeling
X = customer_data_fe.drop('churn', axis=1)
y = customer_data_fe['churn']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=[np.number]).columns.tolist()
print(f"\nCategorical features: {categorical_cols}")
print(f"Numerical features: {numerical_cols}")

# Preprocessing pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)
])
# Model pipelines
models = {
    'Logistic Regression': Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(random_state=42, max_iter=1000))
    ]),
    'Random Forest': Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),
    'Gradient Boosting': Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', GradientBoostingClassifier(random_state=42))
    ])
}

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model evaluation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_results = {}

print(f"\nModel Performance Comparison:")
print("=" * 60)
print(f"{'Model':<20}{'CV Accuracy':<15}{'CV AUC':<15}{'Test AUC':<15}")
print("-"*60)for name, pipeline in models.items():
# Cross-validation
cv_accuracy = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')
cv_auc = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
# Test performance
pipeline.fit(X_train, y_train)
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_pred_proba)
model_results[name] = {
'cv_accuracy': cv_accuracy.mean(),
'cv_auc': cv_auc.mean(),
'test_auc': test_auc,
'model': pipeline
}
print(f"{name:<20}{cv_accuracy.mean():<15.3f}{cv_auc.mean():<15.3f}{test_auc:<15.3f}")# Select best model
best_model_name = max(model_results.keys(), key=lambda x: model_results[x]['cv_auc'])
best_model = model_results[best_model_name]['model']print(f"\nBest Model: {best_model_name}")# Detailed evaluation of best model
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

print(f"\nDetailed Performance Report:")
print(classification_report(y_test, y_pred))

# Business Impact Analysis
print(f"\nBusiness Impact Analysis:")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"Confusion Matrix:")
print(f"True Negatives:  {tn:4d} | False Positives: {fp:4d}")
print(f"False Negatives: {fn:4d} | True Positives:  {tp:4d}")

# Business metrics (assuming average customer value)
avg_customer_value = 1000  # Annual customer value
retention_cost = 100       # Cost to retain a customer

# Calculate potential savings
correctly_identified_churners = tp
false_alarms = fp
missed_churners = fn

potential_revenue_saved = correctly_identified_churners * avg_customer_value
wasted_retention_cost = false_alarms * retention_cost
lost_revenue = missed_churners * avg_customer_value
net_benefit = potential_revenue_saved - wasted_retention_cost

print(f"\nBusiness Impact (Annual):")
print(f"Potential revenue saved: ${potential_revenue_saved:,}")
print(f"Wasted retention costs: ${wasted_retention_cost:,}")
print(f"Lost revenue (missed): ${lost_revenue:,}")
print(f"Net benefit: ${net_benefit:,}")

# Feature importance analysis
if hasattr(best_model.named_steps['classifier'], 'feature_importances_'):
    # Get feature names after preprocessing
    feature_names = (best_model.named_steps['preprocessor']
                     .get_feature_names_out())
    importances = best_model.named_steps['classifier'].feature_importances_
    # Create feature importance dataframe
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False)
    print(f"\nTop 10 Most Important Features:")
    print(feature_importance.head(10))

    # Business recommendations based on feature importance
    print(f"\nBusiness Recommendations:")
    top_features = feature_importance.head(5)['feature'].tolist()
    print("Focus on these key factors to reduce churn:")
    for i, feature in enumerate(top_features, 1):
        print(f"{i}. {feature}")

print(f"\nModel Deployment Considerations:")
print("✓ Monitor model performance monthly")
print("✓ Retrain model quarterly with new data")
print("✓ Set up alerts for prediction confidence drops")
print("✓ A/B test retention strategies")
print("✓ Track business metrics alongside technical metrics")
Conclusion
Scikit-learn is an incredibly powerful and comprehensive machine learning library that has become the industry standard for ML development in Python. Throughout this guide, we’ve explored its vast capabilities, from basic algorithms to advanced pipelines and real-world applications.
What we’ve covered in this comprehensive guide:
- Installation and Setup: How to properly set up scikit-learn with conda environments
- Supervised Learning: Linear/Logistic Regression, Decision Trees, Random Forest, SVM, KNN, and Naive Bayes
- Unsupervised Learning: K-Means, PCA, DBSCAN, and Hierarchical Clustering
- Model Evaluation: Cross-validation, hyperparameter tuning, and comprehensive metrics
- Data Preprocessing: Feature scaling, encoding, and selection techniques
- ML Pipelines: Building robust, maintainable machine learning workflows
- Best Practices: Industry-standard approaches for reliable ML development
- Real-world Applications: Complete end-to-end project with business impact analysis
🚀 Key Takeaways for ML Engineers
Consistency: Scikit-learn’s uniform API makes it easy to experiment with different algorithms quickly.
Reliability: Battle-tested implementations ensure your models work correctly in production.
Scalability: Built-in parallelization and efficient algorithms handle large datasets effectively.
Integration: Seamless compatibility with the entire Python data science ecosystem.
📈 Your ML Journey Next Steps
Practice: Implement the examples in this guide with your own datasets
Experiment: Try different algorithms and compare their performance
Deploy: Put your models into production using frameworks like Flask or FastAPI
Learn More: Explore advanced topics like ensemble methods, neural networks, and AutoML
Remember the ML Engineering Mindset:
- Start simple, then increase complexity as needed
- Always validate your models properly with cross-validation
- Focus on solving real business problems, not just achieving high accuracy
- Monitor your models in production and retrain regularly
- Document your work and make it reproducible
Scikit-learn provides all the tools you need to build world-class machine learning systems. Whether you’re predicting customer churn, detecting fraud, recommending products, or solving any other ML problem, the concepts and techniques in this guide will serve as your foundation for success.
🎯 Master Machine Learning with Scikit-Learn!
The journey to becoming an expert ML engineer starts with solid fundamentals.
Essential Scikit-Learn Terms & Concepts
Term | Description | Example |
---|---|---|
Estimator | Any object that learns from data. Has fit() method | RandomForestClassifier() |
Transformer | Estimator with transform() method for data preprocessing | StandardScaler() |
Predictor | Estimator with predict() method for making predictions | svm.predict(X_test) |
Pipeline | Chain of transformers with final estimator | Pipeline([('scaler', StandardScaler()), ('clf', SVC())]) |
Cross-validation | Technique to assess model performance using multiple train-test splits | cross_val_score(model, X, y, cv=5) |
GridSearchCV | Exhaustive search over specified parameter values | GridSearchCV(model, param_grid, cv=5) |
RandomizedSearchCV | Random search over parameter distributions | RandomizedSearchCV(model, param_dist, n_iter=100) |
Feature Selection | Process of selecting relevant features for modeling | SelectKBest(f_classif, k=10) |
Stratification | Maintaining class distribution in train-test splits | train_test_split(X, y, stratify=y) |
Regularization | Technique to prevent overfitting by adding penalty terms | LogisticRegression(C=0.1) |
Hyperparameters | Configuration settings for algorithms set before training | RandomForestClassifier(n_estimators=100) |
Ensemble Methods | Combining multiple models for better performance | RandomForestClassifier, VotingClassifier |
Feature Engineering | Creating new features from existing data | PolynomialFeatures(degree=2) |
Dimensionality Reduction | Reducing number of features while preserving information | PCA(n_components=2) |
Clustering | Grouping similar data points together | KMeans(n_clusters=3) |
Classification | Predicting discrete class labels | SVC(), LogisticRegression() |
Regression | Predicting continuous numerical values | LinearRegression(), SVR() |
Overfitting | Model performs well on training data but poorly on new data | High training accuracy, low validation accuracy |
Underfitting | Model is too simple to capture underlying patterns | Low training and validation accuracy |
Bias-Variance Tradeoff | Balance between model simplicity and complexity | High bias (underfitting) vs High variance (overfitting) |
ROC-AUC | Area Under Receiver Operating Characteristic curve | roc_auc_score(y_true, y_scores) |
Precision | True positives / (True positives + False positives) | precision_score(y_true, y_pred) |
Recall | True positives / (True positives + False negatives) | recall_score(y_true, y_pred) |
F1-Score | Harmonic mean of precision and recall | f1_score(y_true, y_pred) |
Confusion Matrix | Table showing correct vs predicted classifications | confusion_matrix(y_true, y_pred) |