
Scikit-Learn Interview Questions 2026: 28 Answers with Code

Q: Is sklearn good enough for production ML in 2026, or do I need XGBoost/LightGBM?

sklearn's HistGradientBoostingClassifier is competitive with XGBoost on many tabular datasets and natively supports categorical features and missing values. XGBoost and LightGBM are still preferred in competitions and at companies with existing infrastructure built around them. Expect both to come up in senior DS interviews.


By Aditya Sharma | Published 8 Jun 2026 | Last revised 8 Jun 2026


Scikit-learn is the industry-standard Python ML library, and sklearn fluency is expected in most data scientist and ML engineer interviews. Interviewers go beyond basic model fitting -- they probe pipeline design, leakage prevention, custom estimators, and production patterns. This guide covers 28 scikit-learn interview questions with full answers and code examples.

PapersAdda's take: Candidates report that Pipeline + ColumnTransformer construction and cross-validation leakage questions are the two most common live coding scenarios at product company DS interviews. Building a full preprocessing pipeline from scratch in under 20 minutes is a reliable bar-setting test. Confirm the specific Python environment and libraries expected on the official company careers portal before you prepare.


Which Roles Test Scikit-Learn Deeply?

Role	Sklearn Focus
Data Scientist	Full pipelines, model selection, evaluation
ML Engineer	Custom estimators, production serialization, deployment
Applied Scientist	Algorithm internals, ensemble methods, hyperparameter optimization
Data Analyst (advanced)	Basic classification/regression, metrics, interpretation

EASY: Estimator API and Core Concepts (Questions 1-8)

Q1. What is the sklearn Estimator API? What are fit, transform, predict, and score?

from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin
import numpy as np

# The sklearn API contract:
# - All estimators expose set_params() and get_params() (from BaseEstimator)
# - Transformers: fit(X, y=None), transform(X) -> X_new, fit_transform(X, y)
# - Predictors: fit(X, y), predict(X) -> y_pred
# - Classifiers: score(X, y) -> accuracy; most also offer predict_proba(X) -> probabilities
# - Regressors: score(X, y) -> R^2

# Example: using a simple sklearn estimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()

# CORRECT: fit on train, transform on both
scaler.fit(X_train)             # learns mean/std from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # applies same stats to test

# WRONG: this leaks test data into training statistics
# scaler.fit(X_test)
# scaler.fit_transform(X_test)  # fit_transform is equivalent to fit+transform

# Classifier
clf = LogisticRegression()
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)
y_proba = clf.predict_proba(X_test_s)[:, 1]  # probability of positive class
acc = clf.score(X_test_s, y_test)            # accuracy
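The contract above also mentions get_params() and set_params(), which the snippet does not exercise. The following is a small self-contained sketch on synthetic data (the dataset and the chosen C values are arbitrary) showing how hyperparameters and fitted state are kept apart.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

clf = LogisticRegression(C=1.0, max_iter=1000)
params = clf.get_params()          # dict of constructor arguments
clf.set_params(C=0.1)              # returns the estimator itself, so calls can be chained
clf.fit(X_train, y_train)          # learned state lives in attributes ending in "_"

fresh = clone(clf)                 # same hyperparameters, no fitted state
has_coef_fitted = hasattr(clf, "coef_")
has_coef_clone = hasattr(fresh, "coef_")
```

This separation is what lets GridSearchCV and cross_val_score clone an estimator and refit it on each fold without carrying state across folds.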

Q2. What is a sklearn Pipeline? Why is it important?

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer

# Pipeline: chain steps where each step's output is the next step's input
# All fit operations happen on training data only -- prevents leakage

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Single fit call handles all preprocessing + model training
pipe.fit(X_train, y_train)

# Single predict call handles all preprocessing + inference
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:, 1]

# Access intermediate steps
scaler_params = pipe.named_steps["scaler"].mean_

# set_params: override any step's parameters (for GridSearchCV)
pipe.set_params(clf__C=0.1, clf__max_iter=500)

# Why Pipeline matters:
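# 1. Prevents leakage: predict/transform only apply statistics learned in fit(); nothing is refit on test data,
#    and inside cross-validation the whole pipeline is refit on each training fold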
# 2. Reusability: pickle the whole pipeline including preprocessing
# 3. GridSearchCV compatibility: search over preprocessing + model hyperparams together
# 4. Production readiness: one object handles train + serve identically
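Point 3 can be sketched as follows. This is an illustrative example on synthetic data, not a recommended search space: the grid values are arbitrary, and it relies on the `<step>__<param>` naming convention plus the fact that a whole step can be swapped or replaced with "passthrough".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],  # swap a whole step
    "clf__C": [0.1, 1.0, 10.0],                                   # <step>__<param>
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
n_candidates = len(search.cv_results_["params"])  # 3 scalers * 3 values of C
```

Because the scaler sits inside the pipeline, it is refit on the training part of every fold, so the search itself stays leak-free.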

Q3. What is ColumnTransformer? How do you handle mixed data types?

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
import pandas as pd

# Mixed data: numerical + categorical + ordinal features
num_features = ["age", "salary", "tenure_days"]
cat_features = ["city", "department"]
ord_features = ["education_level"]  # ordered: High School < Bachelor < Master < PhD
ord_categories = [["High School", "Bachelor", "Master", "PhD"]]

# Preprocessing pipelines per feature type
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

ord_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=ord_categories)),
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_features),
    ("cat", cat_pipe, cat_features),
    ("ord", ord_pipe, ord_features),
], remainder="drop")  # or "passthrough" to keep unlisted columns

# Full pipeline with model
from sklearn.ensemble import GradientBoostingClassifier
full_pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", GradientBoostingClassifier(n_estimators=200, random_state=42)),
])

full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)

# Auto-detect types (simpler for quick prototyping)
auto_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include='number')),
    ("cat", OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_exclude='number')),
])

Q4. What is cross-validation in sklearn? What are the different CV strategies?

from sklearn.model_selection import (
    cross_val_score, cross_validate,
    KFold, StratifiedKFold, TimeSeriesSplit,
    RepeatedStratifiedKFold, GroupKFold
)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import numpy as np

model = RandomForestClassifier(n_estimators=100, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42)  # for the regression examples

# KFold: standard, for regression (scoring='r2' needs a regressor and a continuous target)
# X_reg, y_reg: placeholder names for a regression dataset
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(reg, X_reg, y_reg, cv=kf, scoring='r2', n_jobs=-1)

# StratifiedKFold: for classification (preserves class ratio per fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc', n_jobs=-1)

# TimeSeriesSplit: NEVER shuffle time-series data
tscv = TimeSeriesSplit(n_splits=5, gap=7)  # gap = 7 samples skipped between train and test (7 days for daily data)
ts_scores = cross_val_score(reg, X_ts, y_ts, cv=tscv, scoring='neg_mean_absolute_error')

# GroupKFold: prevent data leakage when same user appears in multiple rows
gkf = GroupKFold(n_splits=5)
group_scores = cross_val_score(model, X, y, cv=gkf, groups=user_ids, scoring='roc_auc')

# RepeatedStratifiedKFold: reduces variance of CV estimate
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
robust_scores = cross_val_score(model, X, y, cv=rskf, scoring='roc_auc')

# cross_validate: returns train scores + fit time + score time
cv_results = cross_validate(
    model, X, y, cv=skf,
    scoring={'auc': 'roc_auc', 'f1': 'f1_weighted'},
    return_train_score=True,
    n_jobs=-1,
)
print(f"Test AUC: {cv_results['test_auc'].mean():.4f} +/- {cv_results['test_auc'].std():.4f}")
print(f"Train AUC: {cv_results['train_auc'].mean():.4f}")
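Interviewers often ask what the splitters actually yield. The following sketch prints nothing but collects the indices for a 12-row toy array, so the effect of `gap` and of grouping can be inspected directly; the group ids are hypothetical user ids.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

# TimeSeriesSplit: training indices always precede test indices;
# gap drops that many samples between the end of train and the start of test
gap = 2
tscv = TimeSeriesSplit(n_splits=3, gap=gap)
ts_folds = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

# GroupKFold: all rows of a group land on the same side of the split
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # hypothetical user ids
gkf = GroupKFold(n_splits=4)
group_overlap = [
    set(groups[train]) & set(groups[test])
    for train, test in gkf.split(X, groups=groups)
]
```

Every entry of `group_overlap` is an empty set, which is exactly the guarantee that prevents the same user from appearing in both train and validation.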

Q5. What is the difference between GridSearchCV and RandomizedSearchCV?

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import randint, uniform
import numpy as np

model = GradientBoostingClassifier(random_state=42)

# GridSearchCV: exhaustive search over all combinations
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}
# 3 * 3 * 3 = 27 combinations * 5 folds = 135 fits

grid_search = GridSearchCV(
    model, param_grid,
    cv=5, scoring='roc_auc',
    n_jobs=-1, verbose=1,
    refit=True,  # refit best model on full training set
)
grid_search.fit(X_train, y_train)

# RandomizedSearchCV: sample n_iter combinations from distributions
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.005, 0.2),
    "subsample": uniform(0.6, 0.4),
    "min_samples_leaf": randint(1, 20),
}
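# 50 sampled candidates * 5 folds = 250 fits; the budget is fixed by n_iter, not by the size of the
# search space (a grid with just 3 values for each of these 5 parameters would need 243 * 5 fits)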

random_search = RandomizedSearchCV(
    model, param_dist,
    n_iter=50,
    cv=5, scoring='roc_auc',
    n_jobs=-1, random_state=42,
    refit=True,
)
random_search.fit(X_train, y_train)

print(f"Grid best AUC: {grid_search.best_score_:.4f}")
print(f"Random best AUC: {random_search.best_score_:.4f}")
print(f"Best params: {random_search.best_params_}")

# Access best model directly
best_model = random_search.best_estimator_
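One design choice worth mentioning for parameters such as learning_rate, which span orders of magnitude: sampling uniformly on a linear scale rarely tries small values. A possible alternative is scipy's loguniform distribution; the range below is arbitrary and only illustrates the difference.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) samples from [loc, loc + scale]
lin = uniform(0.001, 0.999).rvs(size=10_000, random_state=0)   # uniform on [0.001, 1.0]
log = loguniform(0.001, 1.0).rvs(size=10_000, random_state=0)  # uniform in log-space

share_small_lin = (lin < 0.01).mean()   # about 1% of draws
share_small_log = (log < 0.01).mean()   # about one third of draws
```

Passing `loguniform(0.001, 1.0)` as the distribution for `learning_rate` in `param_dist` would spread the 50 candidates evenly across each decade.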

Q6. What is feature selection in sklearn? What are the three main approaches?

from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,  # filter
    RFE, RFECV, SequentialFeatureSelector,         # wrapper
    SelectFromModel,                               # embedded
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

# 1. Filter methods: score each feature independently
selector_f = SelectKBest(score_func=f_classif, k=10)
X_filtered = selector_f.fit_transform(X_train, y_train)
selected_features = selector_f.get_support()

selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_mi = selector_mi.fit_transform(X_train, y_train)

# 2. Wrapper methods: use model performance to select features
rfe = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X_train, y_train)

# RFECV: automatically finds optimal number of features via cross-validation
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1, cv=5, scoring='roc_auc', n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f"Optimal features: {rfecv.n_features_}")

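# 3. Embedded methods: feature importance from regularization or tree splitting
# Lasso (L1 regularization drives coefficients to 0)
# Note: Lasso is a regressor, so here it fits the class labels as numeric targets;
# an L1-penalized LogisticRegression is the more usual choice for classification
lasso_selector = SelectFromModel(
    Lasso(alpha=0.01), max_features=20
)
X_lasso = lasso_selector.fit_transform(X_train, y_train)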

# Tree-based feature importance
tree_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=42),
    threshold="median",  # keep features with importance >= median
)
X_tree = tree_selector.fit_transform(X_train, y_train)
print(f"Features selected by tree: {tree_selector.get_support().sum()}")

Q7. How do you evaluate a classification model with sklearn metrics?

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    confusion_matrix, classification_report,
    precision_recall_curve, roc_curve
)
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_proba = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.3])

# Basic metrics
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")  # TP / (TP+FP)
print(f"Recall: {recall_score(y_true, y_pred):.4f}")        # TP / (TP+FN)
print(f"F1: {f1_score(y_true, y_pred):.4f}")                # harmonic mean

# Probability-based metrics (better for imbalanced data)
print(f"ROC-AUC: {roc_auc_score(y_true, y_proba):.4f}")
print(f"PR-AUC: {average_precision_score(y_true, y_proba):.4f}")

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"\nConfusion matrix:\n{cm}")
# [[TN FP]
#  [FN TP]]

# Full report
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# Threshold tuning: find threshold for target recall
precisions, recalls, thresholds = precision_recall_curve(y_true, y_proba)
# Recall falls as the threshold rises, so take the HIGHEST threshold that still
# reaches the target (the lowest threshold trivially has recall 1.0)
target_recall = 0.90
# recalls has one more entry than thresholds; drop the final (recall = 0) point
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
opt_threshold = thresholds[idx]
print(f"Threshold for recall>={target_recall}: {opt_threshold:.3f}")
print(f"Precision at that threshold: {precisions[idx]:.3f}")
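As a sanity check on the formulas in the comments above, the sketch below unpacks the confusion matrix for the same y_true and y_pred arrays and recomputes precision, recall and F1 by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

Here there are 3 true negatives, 1 false positive, 1 false negative and 3 true positives, so all three metrics come out at 0.75.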

Q8. How do you evaluate a regression model?

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, root_mean_squared_error,
    r2_score, mean_absolute_percentage_error, median_absolute_error
)
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(f"MAE: {mean_absolute_error(y_true, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_true, y_pred):.4f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.4f}")
print(f"R^2: {r2_score(y_true, y_pred):.4f}")
print(f"MAPE: {mean_absolute_percentage_error(y_true, y_pred):.4f}")
print(f"MedAE: {median_absolute_error(y_true, y_pred):.4f}")

# When to use each metric:
# MAE: interpretable in original units, robust to outliers
# RMSE: penalizes large errors more, standard for forecasting
# MAPE: relative error, useful when scale varies across targets
# R^2: fraction of variance explained; 1 = perfect, 0 = no better than predicting the mean, negative = worse than the mean
# MedAE: robust metric for heavy-tailed error distributions
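The metrics above are simple enough to recompute with NumPy, which is a common whiteboard request. This sketch uses the same y_true and y_pred, and adds a deliberately bad prediction vector (invented for illustration) to show that R^2 is not bounded below by 0.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

err = y_true - y_pred
mae = np.abs(err).mean()                 # 0.5
mse = (err ** 2).mean()                  # 0.375
rmse = np.sqrt(mse)
r2 = 1 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

# A model worse than always predicting the mean gets a negative R^2
y_bad = np.array([7.0, 7.0, -1.0, -1.0])  # hypothetical poor predictions
r2_bad = r2_score(y_true, y_bad)
```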

MEDIUM: Advanced Pipeline and Model Selection (Questions 9-20)

Q9. How do you build a custom transformer in sklearn?

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np

class CyclicalEncoder(BaseEstimator, TransformerMixin):
    """Encode cyclical features (hour, month, day-of-week) as sin/cos pairs."""

    def __init__(self, period: int = 24):
        self.period = period

    def fit(self, X, y=None):
        return self  # stateless transformer

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            X = X.values
        X = X.ravel()
        return np.column_stack([
            np.sin(2 * np.pi * X / self.period),
            np.cos(2 * np.pi * X / self.period),
        ])

class WinsorizeTransformer(BaseEstimator, TransformerMixin):
    """Clip outliers at given percentiles, learned from training data."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        self.lower_bounds_ = np.nanquantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)

# Use in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("winsorize", WinsorizeTransformer(lower=0.01, upper=0.99)),
    ("scale", StandardScaler()),
])

# Test that it satisfies sklearn API contract
from sklearn.utils.estimator_checks import check_estimator
# check_estimator(WinsorizeTransformer())  # runs ~100 sklearn compatibility tests
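Why encode hours as sin/cos pairs at all? A minimal NumPy-only sketch (the helper name `encode_cyclical` is made up here and mirrors the transform logic of CyclicalEncoder) shows that hour 23 ends up close to hour 0, while hour 12 is as far away as possible.

```python
import numpy as np

def encode_cyclical(values, period):
    """Map values on a cycle of length `period` to (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hours = np.array([0, 12, 23])
enc = encode_cyclical(hours, period=24)

dist_23_to_0 = np.linalg.norm(enc[2] - enc[0])   # neighbours on the clock
dist_12_to_0 = np.linalg.norm(enc[1] - enc[0])   # opposite sides of the clock
```

With a raw integer encoding the distance between 23 and 0 would be 23, the largest possible, which is the artefact the transformer removes.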

Q10. What is a custom sklearn Classifier? Implement one.

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import numpy as np

class NearestMeanClassifier(BaseEstimator, ClassifierMixin):
    """Classify by nearest class mean (simple baseline)."""

    def fit(self, X, y):
        # Validate inputs
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)     # required by sklearn API
        self.n_features_in_ = X.shape[1]    # required by sklearn API

        # Compute mean per class
        self.class_means_ = np.array([
            X[y == c].mean(axis=0) for c in self.classes_
        ])
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        # Distance from each sample to each class mean
        dists = np.linalg.norm(
            X[:, np.newaxis, :] - self.class_means_[np.newaxis, :, :],
            axis=2
        )  # (N, n_classes)
        return self.classes_[dists.argmin(axis=1)]

    def predict_proba(self, X):
        check_is_fitted(self)
        X = check_array(X)
        dists = np.linalg.norm(
            X[:, np.newaxis, :] - self.class_means_[np.newaxis, :, :],
            axis=2
        )
        # Convert distances to pseudo-probabilities via normalized inverse distance
        sim = 1 / (dists + 1e-8)
        return sim / sim.sum(axis=1, keepdims=True)

# Test
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = NearestMeanClassifier()
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

Q11. How does sklearn handle class imbalance?

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 80/20 imbalance

# Method 1: class_weight parameter (most models support this)
lr = LogisticRegression(class_weight='balanced')
rf = RandomForestClassifier(class_weight='balanced')
svc = SVC(class_weight='balanced', probability=True)

# Manual class weights
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
manual_weights = dict(zip(np.unique(y), weights))
lr_manual = LogisticRegression(class_weight=manual_weights)

# Method 2: sample_weight in fit() (for GBM and other models)
# Compute the weights from the same labels that are passed to fit()
sample_weights = compute_sample_weight('balanced', y=y_train)
gbm = GradientBoostingClassifier()
gbm.fit(X_train, y_train, sample_weight=sample_weights)

# Method 3: Threshold tuning (post-hoc, model-agnostic)
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
# Lower threshold to favor recall for minority class
y_pred_low_thresh = (proba >= 0.3).astype(int)

# Method 4: imblearn integration (outside sklearn but compatible)
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", GradientBoostingClassifier()),
])
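It helps to know what 'balanced' actually computes. The sketch below reproduces the heuristic, n_samples / (n_classes * class count), by hand for the 80/20 label vector used above and compares it with compute_class_weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 negatives, 2 positives
classes = np.unique(y)

# 'balanced' heuristic: n_samples / (n_classes * count_of_each_class)
manual = len(y) / (len(classes) * np.bincount(y))
from_sklearn = compute_class_weight("balanced", classes=classes, y=y)
```

The majority class gets weight 10 / (2 * 8) = 0.625 and the minority class 10 / (2 * 2) = 2.5, so each class contributes the same total weight to the loss.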

Q12. What is the difference between Pipeline and FeatureUnion?

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

# Pipeline: sequential -- each step's output feeds the next
sequential_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression()),
])

# FeatureUnion: parallel -- concatenates outputs of multiple transformers
# Use when you want to combine different feature representations

numeric_features = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])

text_features = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("svd", TruncatedSVD(n_components=50)),
])

# FeatureUnion concatenates both feature sets horizontally
combined_features = FeatureUnion([
    ("numeric", numeric_features),
    # ("text", text_features),  # if text column exists
])

# Full pipeline with FeatureUnion
full_pipe = Pipeline([
    ("features", combined_features),
    ("clf", LogisticRegression()),
])

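# Modern alternative: ColumnTransformer (more flexible, preferred over FeatureUnion)
# FeatureUnion feeds the full input to every transformer; ColumnTransformer routes selected columns to each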
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_features, num_col_indices),
    # ("txt", text_features, text_col_indices),
])
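The "parallel, then concatenate" behaviour of FeatureUnion is easiest to see from output shapes. This is a small sketch on random data; the choice of StandardScaler and PCA as branches is only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))

union = FeatureUnion([
    ("scaled", StandardScaler()),      # 6 output columns
    ("pca", PCA(n_components=2)),      # 2 output columns
])
X_union = union.fit_transform(X)       # both branches see the full X; outputs are stacked side by side
```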

Q13. How do you detect and prevent data leakage in sklearn pipelines?

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# LEAKAGE EXAMPLE 1: scaling before split
# WRONG:
scaler = StandardScaler()
X_scaled_wrong = scaler.fit_transform(X)  # uses ALL data including test
X_tr_wrong, X_te_wrong = X_scaled_wrong[:800], X_scaled_wrong[800:]
# Test mean/std has contaminated training normalization

# CORRECT: use Pipeline in cross-validation
pipe = Pipeline([
    ("scaler", StandardScaler()),  # fit_transform on TRAIN fold only
    ("clf", LogisticRegression()),
])
# cross_val_score calls pipe.fit(X_train, y_train) then pipe.score(X_val, y_val)
# scaler.fit() is called only on X_train within each fold
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5), scoring='roc_auc')

# LEAKAGE EXAMPLE 2: target encoding outside Pipeline
# WRONG:
# y_mean = df.groupby('city')['target'].mean()  # uses all targets
# df['city_te'] = df['city'].map(y_mean)         # then split

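# CORRECT: implement target encoding as a custom transformer inside Pipeline
# (expects a single categorical column)
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LeakFreeTargetEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y):
        X = np.asarray(X).ravel()   # also accepts a one-column DataFrame
        y = np.asarray(y)
        self.global_mean_ = y.mean()
        df = pd.DataFrame({"X": X, "y": y})
        stats = df.groupby("X")["y"].agg(["mean", "count"])
        # Smoothed estimate: shrink small categories toward the global mean
        k = 10
        self.mapping_ = ((stats["count"] * stats["mean"] + k * self.global_mean_)
                         / (stats["count"] + k)).to_dict()
        return self

    def transform(self, X):
        X = np.asarray(X).ravel()
        # Categories unseen during fit fall back to the global mean
        return np.array([self.mapping_.get(v, self.global_mean_) for v in X]).reshape(-1, 1)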

# LEAKAGE EXAMPLE 3: feature selection outside CV
# WRONG: select features on all data, then cross-validate
# selector = SelectKBest(k=10).fit(X, y)  # uses all labels
# scores = cross_val_score(clf, selector.transform(X), y, cv=5)

# CORRECT: feature selection inside Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
pipe_correct = Pipeline([
    ("selector", SelectKBest(f_classif, k=10)),  # fitted on train fold only
    ("clf", LogisticRegression()),
])
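Leakage example 3 can be demonstrated numerically. In the sketch below the features are pure noise and the labels are random, so any honest estimate should sit near 50% accuracy; the sample sizes and k are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))       # pure noise features
y = rng.integers(0, 2, size=100)       # labels unrelated to X

# Leaky: the selector has already seen every label, including the validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector is refit on the training part of each fold
pipe = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(pipe, X, y, cv=5).mean()
```

The leaky score looks far better than chance even though there is no signal at all, while the pipeline version stays close to chance. That gap is the leakage.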

Q14. How do you use sklearn for time series forecasting?

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def create_lag_features(series, lags=(1, 7, 14, 28)):
    """Create lag features from a time series."""
    df = pd.DataFrame({"y": series})
    for lag in lags:
        df[f"lag_{lag}"] = df["y"].shift(lag)
    for window in [7, 14]:
        df[f"rolling_mean_{window}"] = df["y"].shift(1).rolling(window).mean()
        df[f"rolling_std_{window}"] = df["y"].shift(1).rolling(window).std()
    df = df.dropna()
    X = df.drop("y", axis=1).values
    y = df["y"].values
    return X, y

# Synthetic time series
np.random.seed(42)
data = np.cumsum(np.random.randn(500)) + np.sin(np.arange(500) * 2 * np.pi / 7) * 10

X, y = create_lag_features(data)

# TimeSeriesSplit: respect temporal ordering
tscv = TimeSeriesSplit(n_splits=5, gap=1)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reg", Ridge(alpha=1.0)),
])

scores = cross_val_score(pipe, X, y, cv=tscv, scoring='neg_mean_absolute_error')
print(f"MAE: {-scores.mean():.4f} +/- {scores.std():.4f}")

# GridSearchCV respecting time series order
from sklearn.model_selection import GridSearchCV
param_grid = {"reg__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
gs = GridSearchCV(pipe, param_grid, cv=tscv, scoring='neg_mean_absolute_error')
gs.fit(X, y)
print(f"Best alpha: {gs.best_params_['reg__alpha']}")

Q15. How do you serialize and deploy a sklearn model?

import joblib
import pickle
import json
from sklearn.pipeline import Pipeline

# Option 1: joblib (recommended for sklearn objects with numpy arrays)
joblib.dump(pipeline, "model.joblib", compress=3)  # compress: 1-9, 3 is good default
loaded_pipeline = joblib.load("model.joblib")
y_pred = loaded_pipeline.predict(X_test)

# Option 2: pickle (standard library, no external dependency)
# joblib and pickle both execute code on load: only load files from sources you trust
with open("model.pkl", "wb") as f:
    pickle.dump(pipeline, f)
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

# Option 3: ONNX export (framework-agnostic, production serving)
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Define input shape
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(pipeline, initial_types=initial_type)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# ONNX inference (no sklearn dependency at serving time)
import numpy as np
import onnxruntime as rt
sess = rt.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
pred = sess.run(None, {input_name: np.asarray(X_test, dtype=np.float32)})[0]

# Version metadata: always save alongside the model
import sklearn
metadata = {
    "sklearn_version": sklearn.__version__,  # record the version actually used for training
    "trained_on": "2026-06-08",
    "n_features": X_train.shape[1],
    "feature_names": list(X_train.columns) if hasattr(X_train, 'columns') else None,
    "target_classes": pipeline.classes_.tolist() if hasattr(pipeline, 'classes_') else None,
}
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f)
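A quick round-trip check is a cheap safeguard before shipping a serialized model. The sketch below trains a small pipeline on synthetic data, writes it to a temporary directory, reloads it, and compares predictions; the dataset and model are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)   # pickle underneath: only load files you trust

same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
same_probas = np.allclose(pipeline.predict_proba(X), restored.predict_proba(X))
```

The same comparison is worth repeating in the serving environment, since a different scikit-learn version there is the usual cause of load failures or silent differences.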

Q16. What is VotingClassifier and StackingClassifier?

from sklearn.ensemble import VotingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

# VotingClassifier: combine predictions by voting or probability averaging
estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svm", SVC(probability=True, kernel='rbf', random_state=42)),
    ("xgb", xgb.XGBClassifier(n_estimators=100, random_state=42)),  # use_label_encoder is obsolete in current XGBoost
]

# Hard voting: majority vote on predicted class
hard_voter = VotingClassifier(estimators=estimators, voting='hard')

# Soft voting: average predicted probabilities (usually better)
soft_voter = VotingClassifier(estimators=estimators, voting='soft')
soft_voter.fit(X_train, y_train)

# StackingClassifier: use base models as feature generators for a meta-learner
from sklearn.model_selection import StratifiedKFold
base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("xgb", xgb.XGBClassifier(n_estimators=100, random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
]

stacker = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(),  # meta-learner
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    stack_method='predict_proba',          # use probabilities as meta-features
    passthrough=True,                       # include original features in meta-training
    n_jobs=-1,
)
stacker.fit(X_train, y_train)
from sklearn.metrics import roc_auc_score
print(f"Stacker AUC: {roc_auc_score(y_test, stacker.predict_proba(X_test)[:, 1]):.4f}")

Q17. How do you handle text data in a sklearn pipeline?

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
import pandas as pd

# Sample text classification
texts = ["great product", "terrible service", "average quality", "excellent fast delivery"]
labels = [1, 0, 1, 1]

# TF-IDF pipeline
tfidf_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=10_000,
        ngram_range=(1, 2),     # unigrams + bigrams
        min_df=2,               # ignore terms in < 2 documents
        max_df=0.95,            # ignore terms in > 95% of documents
        sublinear_tf=True,      # log normalization of term frequency
        strip_accents='unicode',
        analyzer='word',
    )),
    ("clf", LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs')),
])

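# With these 4 toy texts no term appears in 2 documents, so min_df=2 would prune the
# whole vocabulary and fit() would raise a ValueError; fit on a real corpus instead
# tfidf_pipe.fit(texts, labels)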

# Hyperparameter search over both TF-IDF and model parameters
param_grid = {
    "tfidf__max_features": [5000, 10000, 50000],
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__C": [0.1, 1.0, 10.0],
}
gs = GridSearchCV(tfidf_pipe, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
# gs.fit(texts, labels)

# Combined text + numerical features
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=5000, ngram_range=(1, 2)), "review_text"),
    ("num", StandardScaler(), ["word_count", "char_count", "exclamation_count"]),
])

mixed_pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

Q18. How do you implement calibration for probability outputs?

from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

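# Problem: many classifiers output poorly calibrated probabilities
# Naive Bayes tends to push probabilities toward 0 or 1 (overconfident); random forests and SVMs
# tend to be underconfident; boosted trees can become overconfident when they overfit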

X_tr, X_cal, y_tr, y_cal = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train base model
base = GradientBoostingClassifier(n_estimators=200, random_state=42)
base.fit(X_tr, y_tr)

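# Calibrate with Platt scaling (logistic regression on held-out set)
# scikit-learn >= 1.6: wrap the already-fitted model in FrozenEstimator
# (cv='prefit' is deprecated there; on older versions pass base with cv='prefit' instead)
from sklearn.frozen import FrozenEstimator
platt = CalibratedClassifierCV(FrozenEstimator(base), method='sigmoid')
platt.fit(X_cal, y_cal)

# Or isotonic regression (more flexible but needs more calibration data)
isotonic = CalibratedClassifierCV(FrozenEstimator(base), method='isotonic')
isotonic.fit(X_cal, y_cal)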

# CV-based calibration (no separate calibration set needed)
cv_calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=200, random_state=42),
    method='sigmoid', cv=5
)
cv_calibrated.fit(X_train, y_train)

# Evaluate calibration
from sklearn.calibration import calibration_curve
frac_pos, mean_pred = calibration_curve(y_test, base.predict_proba(X_test)[:, 1], n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, platt.predict_proba(X_test)[:, 1], n_bins=10)

# Perfect calibration: mean_pred == frac_pos (diagonal line)
print(f"Pre-calibration mean abs deviation: {np.abs(frac_pos - mean_pred).mean():.4f}")
print(f"Post-calibration mean abs deviation: {np.abs(frac_pos_cal - mean_pred_cal).mean():.4f}")

Q19. What is OneVsRestClassifier and OneVsOneClassifier?

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 3 classes
# load_iris is sorted by class, so X[:120] / X[120:] would leave only class 2 in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# OneVsRest (OvR): N binary classifiers (one per class vs all others)
# N classifiers, each sample classified by argmax of N decision functions
ovr = OneVsRestClassifier(SVC(probability=True, kernel='rbf'))
ovr.fit(X_tr, y_tr)
print(classification_report(y_te, ovr.predict(X_te)))

# OneVsOne (OvO): N*(N-1)/2 binary classifiers (one per class pair)
# For N=3: 3 classifiers. Each predicts pairwise winner; max-votes wins
ovo = OneVsOneClassifier(SVC(kernel='rbf'))
ovo.fit(X_tr, y_tr)

# Comparison:
# OvR: N classifiers, each trained on all samples; every binary problem is
#      imbalanced (one class vs the rest) even when the classes are balanced
# OvO: N*(N-1)/2 classifiers, each trained only on the samples of two classes;
#      suits kernel SVMs, which scale poorly with n_samples (SVC uses OvO internally)
# Many sklearn models (LR, RF, GBM) are natively multiclass -- no wrapper needed

# Multilabel classification (OvR natively)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform([(0, 1), (1, 2), (0, 2)])  # multiple labels per sample
ovr_ml = OneVsRestClassifier(LogisticRegression())
# Toy fit: X needs one row per label set, so take three iris rows
ovr_ml.fit(X[[0, 60, 120]], y_multilabel)
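The estimator counts quoted above (N for OvR, N*(N-1)/2 for OvO) can be checked directly on a fitted wrapper through its estimators_ attribute. A small sketch on the 10-class digits dataset; the dataset and the linear SVC are choices made here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 10 classes
n_classes = len(set(y))

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One fitted binary estimator per class (OvR) or per class pair (OvO)
print(len(ovr.estimators_))  # 10
print(len(ovo.estimators_))  # 45 = 10 * 9 / 2
```

The quadratic growth in OvO is usually affordable because each pairwise model only sees the rows belonging to its two classes.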

Q20. What is sklearn's set_output API? How do you get DataFrames from pipelines?

from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

# sklearn 1.2+: set_output API returns DataFrames
set_config(transform_output="pandas")

scaler = StandardScaler()
X_df = pd.DataFrame({"age": [25, 30, 35], "salary": [50000, 60000, 70000]})
X_scaled = scaler.fit_transform(X_df)
print(type(X_scaled))          # pandas.DataFrame
print(X_scaled.columns.tolist())  # ['age', 'salary'] -- column names preserved!

# ColumnTransformer with set_output
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "salary"]),
    ("cat", OneHotEncoder(sparse_output=False), ["city"]),
])
preprocessor.set_output(transform="pandas")
df = pd.DataFrame({"age": [25], "salary": [50000], "city": ["Delhi"]})
out = preprocessor.fit_transform(df)
print(out)  # DataFrame with named columns including one-hot column names

# Per-estimator override (without global config)
scaler2 = StandardScaler().set_output(transform="pandas")

# Pipeline set_output
from sklearn.decomposition import PCA
pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])
pipe.set_output(transform="pandas")

HARD: Production Patterns (Questions 21-28)

Q21. How do you use sklearn with Optuna for Bayesian hyperparameter optimization?

import optuna
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

def objective(trial):
    params = {
        "model__n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "model__max_depth": trial.suggest_int("max_depth", 2, 8),
        "model__learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "model__subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "model__min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 20),
        "model__max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
    }
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", GradientBoostingClassifier(random_state=42)),
    ])
    pipe.set_params(**params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
    return scores.mean()

# Run optimization
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100, timeout=300)  # 100 trials or 5 minutes

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Retrain best model on full training set
# study.best_params uses the trial names (no "model__" prefix), so it can be
# passed straight to the estimator
best_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42, **study.best_params)),
])
best_pipe.fit(X_train, y_train)

Q22. How do you interpret sklearn models with SHAP?

import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Train a pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)

# For tree models: use TreeExplainer (fast, exact for tree ensembles)
explainer = shap.TreeExplainer(pipe.named_steps["model"])

# Transform X_test through preprocessing first
X_test_scaled = pipe.named_steps["scaler"].transform(X_test)

shap_values = explainer.shap_values(X_test_scaled)

# For binary classification, older shap releases return a list [class_0, class_1];
# newer releases return one array of shape (n_samples, n_features, n_classes).
# Select the positive class in either case
if isinstance(shap_values, list):
    sv_pos = shap_values[1]
else:
    sv_pos = shap_values[:, :, 1]
base_value = np.atleast_1d(explainer.expected_value)[-1]  # positive-class baseline
feature_names = X_train.columns.tolist() if hasattr(X_train, 'columns') else [f"f{i}" for i in range(X_train.shape[1])]

# Summary plot: global feature importance
shap.summary_plot(sv_pos, X_test_scaled, feature_names=feature_names)

# Waterfall plot: explain a single prediction
shap.waterfall_plot(shap.Explanation(
    values=sv_pos[0],
    base_values=base_value,
    data=X_test_scaled[0],
    feature_names=feature_names,
))

# Dependence plot: feature interaction ("salary" must be one of feature_names)
shap.dependence_plot("salary", sv_pos, X_test_scaled, feature_names=feature_names)

# For non-tree models: KernelExplainer (model-agnostic, slower)
# The inputs here are already scaled, so explain the model step directly;
# calling pipe.predict_proba on scaled data would apply the scaler twice
X_train_scaled = pipe.named_steps["scaler"].transform(X_train)
background = shap.sample(X_train_scaled, 100)  # 100 samples as background
kernel_explainer = shap.KernelExplainer(pipe.named_steps["model"].predict_proba, background)
shap_kernel = kernel_explainer.shap_values(X_test_scaled[:20])

Q23. How do you implement a production model monitoring system with sklearn?

import json
import pickle
from datetime import datetime
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score
from typing import Optional

class ModelMonitor:
    def __init__(self, model, reference_X, reference_y, feature_names=None):
        self.model = model
        self.feature_names = feature_names or [f"f{i}" for i in range(reference_X.shape[1])]

        # Baseline statistics from training/validation data (NumPy arrays assumed)
        self.ref_mean = reference_X.mean(axis=0)
        self.ref_std = reference_X.std(axis=0) + 1e-8
        ref_preds = model.predict_proba(reference_X)[:, 1]
        self.ref_pred_mean = float(ref_preds.mean())
        self.ref_auc = roc_auc_score(reference_y, ref_preds)
        self.log = []

    def check_batch(self, X, y_true=None, batch_id: Optional[str] = None):
        report = {
            "batch_id": batch_id or datetime.now().isoformat(),
            "n_samples": len(X),
            "timestamp": datetime.now().isoformat(),
            "feature_drift": {},
            "prediction_drift": None,
            "performance": None,
            "alerts": [],
        }

        # Feature drift: shift of the batch mean, in reference standard deviations
        # (a simple proxy; PSI or a KS test would compare the full distributions)
        for i, feat in enumerate(self.feature_names):
            z_shift = abs(X[:, i].mean() - self.ref_mean[i]) / self.ref_std[i]
            if z_shift > 3:
                report["feature_drift"][feat] = float(z_shift)
                report["alerts"].append(f"DRIFT: {feat} z_shift={z_shift:.2f}")

        # Prediction drift: compare the mean predicted probability with the
        # reference mean rather than with a fixed 0.5
        preds = self.model.predict_proba(X)[:, 1]
        pred_mean = float(preds.mean())
        report["prediction_drift"] = pred_mean
        if abs(pred_mean - self.ref_pred_mean) > 0.1:  # illustrative threshold; tune per model
            report["alerts"].append(
                f"PREDICTION DRIFT: mean proba={pred_mean:.3f} vs baseline {self.ref_pred_mean:.3f}"
            )

        # Performance (if labels available)
        if y_true is not None:
            auc = roc_auc_score(y_true, preds)
            report["performance"] = float(auc)
            if auc < self.ref_auc - 0.05:
                report["alerts"].append(f"PERFORMANCE DROP: AUC {auc:.4f} vs baseline {self.ref_auc:.4f}")

        self.log.append(report)
        return report
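The monitor above flags drift per feature. One statistic commonly used for that purpose is the Population Stability Index (PSI), which compares binned distributions instead of means. Below is a minimal NumPy sketch; the function name, the quantile binning, and the eps smoothing are assumptions made here, and the 0.1 / 0.25 cut-offs mentioned in the comments are a widely quoted rule of thumb, not a standard.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Bin edges come from reference quantiles; only interior edges are used,
    # so the first and last bins are open-ended and catch out-of-range values
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    inner = np.unique(edges[1:-1])

    ref_counts = np.bincount(np.searchsorted(inner, reference, side="right"),
                             minlength=len(inner) + 1)
    cur_counts = np.bincount(np.searchsorted(inner, current, side="right"),
                             minlength=len(inner) + 1)

    # Clip to eps so empty bins do not produce log(0) or division by zero
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
ref = rng.normal(0.0, 1.0, size=5000)
same = rng.normal(0.0, 1.0, size=5000)      # same distribution
shifted = rng.normal(1.0, 1.0, size=5000)   # mean shifted by one std

psi_same = population_stability_index(ref, same)
psi_shifted = population_stability_index(ref, shifted)
# Rule of thumb: < 0.1 stable, 0.1 to 0.25 moderate shift, > 0.25 large shift
print(f"PSI (same distribution): {psi_same:.4f}")
print(f"PSI (shifted mean):      {psi_shifted:.4f}")
```

A function like this could replace the z-shift check inside check_batch, at the cost of keeping the reference column values (or their bin fractions) in the monitor.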

Q24. How do you benchmark multiple sklearn models efficiently?

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import xgboost as xgb
import pandas as pd
import numpy as np
import time

def benchmark_models(X, y, models: dict, cv=5, scoring: dict = None):
    """Benchmark multiple models with consistent CV and report."""
    if scoring is None:
        scoring = {"auc": "roc_auc", "f1": "f1_weighted", "acc": "accuracy"}

    cv_obj = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    results = []

    for name, model in models.items():
        pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
        t0 = time.time()
        cv_res = cross_validate(pipe, X, y, cv=cv_obj, scoring=scoring, n_jobs=-1)
        elapsed = time.time() - t0

        row = {"model": name, "cv_time_s": round(elapsed, 2)}
        for metric in scoring:
            scores = cv_res[f"test_{metric}"]
            row[f"{metric}_mean"] = round(scores.mean(), 4)
            row[f"{metric}_std"] = round(scores.std(), 4)
        results.append(row)

    return pd.DataFrame(results).sort_values("auc_mean", ascending=False)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    # use_label_encoder is deprecated and ignored by recent XGBoost releases, so it is omitted
    "XGBoost": xgb.XGBClassifier(n_estimators=100, random_state=42, verbosity=0),
    "SVM": SVC(probability=True, kernel='rbf'),
}

results_df = benchmark_models(X, y, models)
print(results_df.to_string(index=False))

Q25. Design a full sklearn ML pipeline for a production credit risk model.

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score, average_precision_score
from scipy.stats import randint, uniform
import joblib

# Feature definitions
num_features = ["age", "income", "loan_amount", "credit_utilization", "num_accounts"]
cat_features = ["employment_type", "loan_purpose", "state"]
bin_features = ["has_mortgage", "has_dependents"]

# Preprocessing
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False, max_categories=20)),
])
bin_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
])

preprocessor = ColumnTransformer([
    ("num", num_pipe, num_features),
    ("cat", cat_pipe, cat_features),
    ("bin", bin_pipe, bin_features),
])

# Full pipeline with feature selection + model
base_model = GradientBoostingClassifier(random_state=42)

pipeline = Pipeline([
    ("prep", preprocessor),
    ("selector", SelectFromModel(base_model, threshold="median", prefit=False)),
    ("model", GradientBoostingClassifier(random_state=42)),
])

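# Hyperparameter search
# scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {
    "model__n_estimators": randint(100, 600),
    "model__max_depth": randint(3, 7),
    "model__learning_rate": uniform(0.01, 0.2),   # 0.01 to 0.21
    "model__subsample": uniform(0.6, 0.4),        # 0.6 to 1.0
}

search = RandomizedSearchCV(
    pipeline, param_dist, n_iter=50,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc', n_jobs=-1, random_state=42, refit=True,
)
search.fit(X_train, y_train)

# Calibrate probabilities on a held-out validation set not used by the search
# Note: cv='prefit' is deprecated since scikit-learn 1.6; newer versions wrap the
# fitted model in sklearn.frozen.FrozenEstimator and omit cv
calibrated = CalibratedClassifierCV(search.best_estimator_, method='sigmoid', cv='prefit')
calibrated.fit(X_val, y_val)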

# Evaluate
proba = calibrated.predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, proba):.4f}")
print(f"PR-AUC: {average_precision_score(y_test, proba):.4f}")

# Save full artifact
joblib.dump(calibrated, "credit_risk_model_v1.joblib", compress=3)

Q26. How does sklearn handle multioutput targets?

from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor, ClassifierChain
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.datasets import make_multilabel_classification
import numpy as np

# Multioutput classification: predict multiple binary labels simultaneously
X, y = make_multilabel_classification(n_samples=1000, n_features=20, n_classes=5, random_state=42)
# y shape: (1000, 5) -- 5 binary labels per sample

# MultiOutputClassifier: fits independent classifier per label
mo_clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=42), n_jobs=-1)
mo_clf.fit(X[:800], y[:800])
y_pred = mo_clf.predict(X[800:])
y_proba = mo_clf.predict_proba(X[800:])  # list of (N, 2) arrays, one per label

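# ClassifierChain: model i uses predictions from models 0..i-1 as additional features
# Can capture label dependencies that independent per-label models miss; results depend on chain order
# cv=5 fits later links on cross-validated predictions of earlier labels instead of the true labels
chain = ClassifierChain(RandomForestClassifier(n_estimators=50, random_state=42), order='random', cv=5, random_state=42)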
chain.fit(X[:800], y[:800])

# Multioutput regression: predict multiple continuous targets
X_reg = np.random.randn(500, 10)
y_reg = np.random.randn(500, 3)  # 3 targets

mo_reg = MultiOutputRegressor(Ridge(alpha=1.0), n_jobs=-1)
mo_reg.fit(X_reg[:400], y_reg[:400])
y_pred_reg = mo_reg.predict(X_reg[400:])
print(f"Prediction shape: {y_pred_reg.shape}")  # (100, 3)

Q27. How do you use sklearn's inspection module for model debugging?

from sklearn.inspection import (
    permutation_importance,
    partial_dependence,
    PartialDependenceDisplay,
)
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import numpy as np

model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

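# Permutation importance: model-agnostic and computed on held-out data;
# correlated features can split importance between them
# feature_names is assumed to be a list with one name per column of X_test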
perm_imp = permutation_importance(
    model, X_test, y_test,
    n_repeats=10, random_state=42, n_jobs=-1
)
imp_df = pd.DataFrame({
    "feature": feature_names,
    "importance_mean": perm_imp.importances_mean,
    "importance_std": perm_imp.importances_std,
}).sort_values("importance_mean", ascending=False)

print(imp_df.head(10))

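# Partial dependence: marginal effect of features
# partial_dependence takes one feature, or one pair of interacting features, per call
pdp_result = partial_dependence(
    model, X_test,
    features=[0],              # 1D partial dependence for feature 0
    kind="average",            # "individual" for ICE curves, "both" for both
)
pdp_2d = partial_dependence(model, X_test, features=[0, 1], kind="average")  # 2D interaction grid

# PDP values (the grid points are in pdp_result["grid_values"] on sklearn 1.3+)
print("PDP for feature 0:", pdp_result["average"][0])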

# Individual Conditional Expectation (ICE) plots
pdp_ice = partial_dependence(model, X_test, features=[0], kind="both")

# Detect unexpected interactions: plot 2D PDP
disp = PartialDependenceDisplay.from_estimator(
    model, X_test, features=[(0, 1)],
    feature_names=feature_names,
)

Q28. Design a complete sklearn-based ML pipeline for real-time recommendation serving.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
import joblib

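class RecommendationPipeline:
    """Two-stage recommendation: hybrid item embeddings + nearest-neighbour lookup.

    Note: sklearn's NearestNeighbors performs exact search, not approximate (ANN).
    """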

    def __init__(self, n_components=64, n_neighbors=50):
        self.n_components = n_components
        self.n_neighbors = n_neighbors
        self.item_encoder = None
        self.item_embeddings = None
        self.ann_index = None

    def fit(self, interactions: pd.DataFrame, items: pd.DataFrame):
        """
        interactions: (user_id, item_id, rating) DataFrame
        items: (item_id, features...) DataFrame
        """
        # Build item-feature matrix
        item_features = ColumnTransformer([
            ("cat", OrdinalEncoder(), ["genre", "language"]),
            ("num", StandardScaler(), ["release_year", "avg_rating"]),
        ])
        X_items = item_features.fit_transform(items.drop("item_id", axis=1))
        self.item_encoder = item_features

        # Collaborative filtering via SVD on interaction matrix
        from scipy.sparse import csr_matrix
        n_users = interactions["user_id"].max() + 1
        n_items = interactions["item_id"].max() + 1
        R = csr_matrix(
            (interactions["rating"], (interactions["user_id"], interactions["item_id"])),
            shape=(n_users, n_items)
        )
        svd = TruncatedSVD(n_components=self.n_components, random_state=42)
        self.item_embeddings = svd.fit_transform(R.T)  # (n_items, n_components)

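        # Hybrid: concatenate CF embeddings + content features.
        # Align content rows to item_id so that row i of both blocks describes item i.
        # Items absent from `items` keep all-zero content features; items with no
        # interactions beyond the largest interacted item_id are skipped.
        X_items = np.asarray(X_items, dtype=float)
        content_embs = np.zeros((self.item_embeddings.shape[0], X_items.shape[1]))
        item_ids = items["item_id"].to_numpy()
        known = item_ids < content_embs.shape[0]
        content_embs[item_ids[known]] = X_items[known]
        hybrid_embs = np.concatenate([self.item_embeddings, content_embs], axis=1)

        # Build nearest-neighbour index for retrieval (exact cosine search)
        self.ann_index = NearestNeighbors(n_neighbors=self.n_neighbors, metric='cosine', n_jobs=-1)
        self.ann_index.fit(hybrid_embs)
        self.hybrid_embs = hybrid_embs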
        return self

    def recommend(self, user_history_item_ids: list, k: int = 10) -> list:
        """Return top-k item IDs for a user based on their history."""
        if not user_history_item_ids:
            # Cold start placeholder: returns the first k item IDs.
            # A real system would return the most popular items instead.
            return list(range(k))
        # User embedding = mean of interacted item embeddings
        user_emb = self.hybrid_embs[user_history_item_ids].mean(axis=0, keepdims=True)
        distances, indices = self.ann_index.kneighbors(user_emb)
        # Filter out already-seen items (build the set once, not per candidate)
        seen = set(user_history_item_ids)
        recs = [int(i) for i in indices[0] if i not in seen][:k]
        return recs

# Save/load
def save_pipeline(pipe, path):
    joblib.dump(pipe, path, compress=3)

def load_pipeline(path):
    return joblib.load(path)

FAQ

Q: What sklearn version should I know for 2026 interviews?

A: sklearn 1.4+ is a reasonable baseline to know. The most interview-relevant additions are the set_output(transform="pandas") API (added in 1.2), ColumnTransformer improvements, and HistGradientBoostingClassifier (sklearn's fast histogram-based GBM). Candidate reports in public preparation resources suggest that sklearn 1.2+ features are expected at senior levels.

Q: Is sklearn good enough for production ML in 2026, or do I need XGBoost/LightGBM?

A: sklearn's HistGradientBoostingClassifier is competitive with XGBoost on many tabular datasets and natively supports categorical features and missing values. XGBoost and LightGBM are still preferred in competitions and at companies with existing infrastructure built around them. Both are covered in senior DS interviews.

Q: How important is the Estimator API for interviews?

A: Very important. Interviewers at product companies specifically probe whether you understand why fit must be called only on training data, the risks of fitting on test/validation, and how Pipeline prevents leakage automatically. This is a common elimination question. Confirm the expected depth on the official company careers portal or interview prep guide.
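The leakage point can be demonstrated in a few lines. The sketch below uses pure-noise features and random labels, so any accuracy well above 50% is an artifact. Selecting features on the full dataset before cross-validation leaks the held-out labels; putting the same selector inside a Pipeline refits it on each training fold. The dataset sizes and the choice of SelectKBest are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = rng.integers(0, 2, size=100)   # random labels: no real signal exists
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every label, including those of future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
).mean()

# Correct: the selector is a pipeline step, refit on the training part of each fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"Selection before CV (leaky): {leaky_score:.3f}")
print(f"Selection inside Pipeline:   {honest_score:.3f}")
```

The same reasoning applies to scalers, imputers, and encoders: anything with a fit step belongs inside the Pipeline that is passed to the cross-validation helper.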

Sources and review notes (reviewed 8 Jun 2026)

Article-specific sources

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.

Evidence labels

Official notices, candidate reports, offer documents, and editorial practice questions carry different confidence levels. The visible source list lets you inspect the evidence instead of relying on a blanket verification badge.

Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
