
Machine Learning Interview Questions 2026: 30 Answers with Code

30 machine learning interview questions with full answers, Python code, and comparison tables covering the complete fresher-to-senior ML interview arc for 2026.

By Aditya Sharma. Published 8 Jun 2026.

10 min read. Last revised 8 Jun 2026.


Machine learning roles are the fastest-growing engineering track in 2026. From product-based companies in India to global FAANG hiring, ML interview rounds have expanded beyond theoretical statistics into practical model building, production pipeline design, and real-world tradeoff reasoning. This guide covers 30 questions with full answers, Python code, and comparison tables across the complete difficulty spectrum.

PapersAdda's take: The ML interview in 2026 rewards engineers who can explain what happens inside the black box AND build a working pipeline. Theory without code = red flag. Code without intuition = another red flag. This guide trains both. Candidate-reported feedback from public preparation resources consistently flags that interviewers at product companies follow up any algorithm question with "how would you put this in production?" Candidates report that gradient boosting and model evaluation metrics appear in over 80% of shortlists. Confirm exact interview formats on the official company careers portal before you prepare.

Related articles: AI/ML Interview Questions 2026 | Deep Learning Interview Questions 2026 | Data Science Interview Questions 2026 | Scikit-learn Interview Questions 2026 | Statistics for Data Science 2026 | Data Engineering Interview Questions 2026

Which Companies Ask These Questions?

Topic Cluster	Companies
Supervised Learning Fundamentals	All product companies, all FAANG
Ensemble Methods (RF, XGBoost)	Google, Amazon, Flipkart, PhonePe, Swiggy
Feature Engineering	Uber, LinkedIn, all ML-heavy teams
Model Evaluation and Metrics	All data science roles
ML System Design	Google, Meta, Amazon senior rounds
Clustering and Unsupervised	All data science roles
Pipelines and Production	MLOps roles, Databricks, AWS

EASY: Core Concepts (Questions 1-10)

PapersAdda's note: These are the questions that separate a prepared candidate from an unprepared one in the first 10 minutes. Get them cold.

Q1. What is machine learning? How is it different from traditional programming?

Aspect	Traditional Programming	Machine Learning
Input	Rules + Data	Data + Expected Output
Output	Program output	Rules (learned model)
Maintenance	Update rules manually	Retrain on new data
Works well when	Rules are known and stable	Rules are complex or unknown

Machine learning is the practice of building systems that learn patterns from data rather than being explicitly programmed. The program writes itself from examples.

# Traditional programming: manually coded rule
def is_spam(email):
    if "free money" in email.lower() or "click here" in email.lower():
        return True
    return False

# Machine learning: model learns the rule from data
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(training_emails)
model = LogisticRegression()
model.fit(X_train, labels)   # model discovers the rule from data

Q2. What is the difference between a parameter and a hyperparameter?

Term	Definition	Who Sets It	Examples
Parameter	Learned from training data	Optimizer	Weights, biases in a neural net; coefficients in linear regression
Hyperparameter	Configured before training	You	Learning rate, n_estimators, max_depth, regularization strength

Why it matters in interviews: Hyperparameter tuning is a core skill. Know GridSearchCV, RandomizedSearchCV, and Optuna (Bayesian optimization, 2026 standard).

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import scipy.stats as stats

param_dist = {
    'n_estimators': stats.randint(50, 500),
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': stats.randint(2, 20),
    'max_features': ['sqrt', 'log2', 0.3]
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_distributions=param_dist,
    n_iter=50, cv=5, scoring='f1', n_jobs=-1, random_state=42
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)

Q3. Explain overfitting and underfitting. How do you detect and fix each?

Problem	Definition	Symptom	Fix
Overfitting	Model memorizes training noise	High train accuracy, low test accuracy	More data, regularization, dropout, simpler model
Underfitting	Model too simple for data	Low train AND test accuracy	More capacity, better features, more epochs
Good fit	Model captures true pattern	High train AND test accuracy	Keep

from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring='accuracy',
        train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
    )
    # If val_scores plateau far below train_scores: overfitting
    # If both plateau at low score: underfitting
    return train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)

Regularization is the primary fix for overfitting:

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=10.0)      # L2: shrinks all weights
lasso = Lasso(alpha=0.1)       # L1: zeros irrelevant features (feature selection)
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)  # Both

Q4. What is feature scaling and when is it required?

Algorithm	Needs Scaling?	Why
Linear/Logistic Regression	Yes	Gradient descent converges faster
SVM	Yes	Margin depends on distance
KNN	Yes	Distance metric is scale-sensitive
Decision Tree / Random Forest	No	Split thresholds are scale-invariant
XGBoost / LightGBM	No	Tree splits are invariant
Neural Networks	Yes	Gradient flow is scale-sensitive

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler: (x - mean) / std -- best for normally distributed features
scaler = StandardScaler()

# MinMaxScaler: (x - min) / (max - min) -- when you need values in [0,1]
min_max = MinMaxScaler()

# RobustScaler: (x - median) / IQR -- resistant to outliers
robust = RobustScaler()

# Always fit only on training data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)   # transform only, no fit

Q5. What are the types of cross-validation and when do you use each?

Type	Description	Use When
K-Fold	Split into k folds; rotate	Standard for classification/regression
Stratified K-Fold	Maintain class proportion per fold	Imbalanced classification
Leave-One-Out (LOO)	n-fold CV; each sample is a fold	Very small datasets
Time Series Split	Train on past; validate on future	Any time series; NEVER shuffle
Group K-Fold	Samples from same group never split across folds	Patient data, user-level data

from sklearn.model_selection import (StratifiedKFold, TimeSeriesSplit,
                                      GroupKFold, cross_val_score)
from sklearn.ensemble import GradientBoostingClassifier

# Standard imbalanced classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=skf, scoring='roc_auc')

# Time series -- NEVER shuffle
tscv = TimeSeriesSplit(n_splits=5)
# Use a classification metric with a classifier (neg_mean_squared_error is a regression metric)
ts_scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=tscv, scoring='roc_auc')
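Group K-Fold from the table can be checked the same way. The following is a minimal sketch on synthetic data; the `groups` array of user IDs and the `X_demo`/`y_demo` names are illustrative assumptions. It verifies that no group lands on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic illustration: 12 samples belonging to 4 hypothetical users
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)
groups = np.repeat(["u1", "u2", "u3", "u4"], 3)

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, val_idx in gkf.split(X_demo, y_demo, groups=groups):
    # Groups present in both the training and validation indices (should be none)
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(len(shared))

print(overlaps)  # [0, 0, 0, 0]
```

With plain K-Fold on user-level data, rows from the same user can appear in both train and validation, which inflates the validation score.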

Q6. How do you handle missing values in a dataset?

Strategy	When to Use	Risk
Drop rows	Missing rate < 5%, data is large	Information loss
Mean/Median imputation	Numerical, MCAR assumption	Distorts variance
Mode imputation	Categorical	Distorts distribution
KNN imputation	Small-medium datasets, correlated features	Slow at scale
Model-based (IterativeImputer)	Complex missingness patterns	High compute
Forward fill / Back fill	Time series	Only if temporal relationship holds
Add a missing indicator column	When missingness is informative	Always consider this

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv("data.csv")

# Check missing pattern
print(df.isnull().sum() / len(df) * 100)

# Median imputation (robust to outliers vs mean)
num_imputer = SimpleImputer(strategy='median')

# KNN imputation -- respects feature correlations
knn_imputer = KNNImputer(n_neighbors=5)

# Always add indicator for informative missingness
df['age_missing'] = df['age'].isnull().astype(int)

Q7. What is feature engineering? Give three practical examples.

Example 1: Date decomposition

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['day_of_week']   = df['date'].dt.dayofweek   # 0=Monday
df['is_weekend']    = df['day_of_week'].isin([5,6]).astype(int)
df['hour']          = df['date'].dt.hour
df['is_business_hr']= df['hour'].between(9, 17).astype(int)
df['month']         = df['date'].dt.month

Example 2: Ratio and interaction features

# For loan default prediction
df['debt_to_income']   = df['total_debt'] / (df['annual_income'] + 1)
df['payment_ratio']    = df['monthly_payment'] / (df['monthly_income'] + 1)
df['credit_util_sq']   = df['credit_utilization'] ** 2  # non-linear effect

Example 3: Target encoding for high-cardinality categoricals

# Replace categories with mean target value (careful: do inside CV folds)
import category_encoders as ce
encoder = ce.TargetEncoder(cols=['city', 'product_category'])
X_encoded = encoder.fit_transform(X_train, y_train)
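The "inside CV folds" caveat matters because encoding a row with a mean that includes its own label leaks the target. Below is a minimal out-of-fold sketch using pandas and NumPy only. The helper name `oof_target_encode` and the toy `city`/`y` data are assumptions made for illustration; this is not how `category_encoders` implements it internally.

```python
import numpy as np
import pandas as pd

def oof_target_encode(cat, target, n_splits=5, seed=0):
    """Out-of-fold mean target encoding for one categorical column."""
    cat = pd.Series(cat).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(cat)), n_splits)
    for val_idx in folds:
        train_mask = np.ones(len(cat), dtype=bool)
        train_mask[val_idx] = False
        # Category means computed WITHOUT the rows being encoded
        means = target[train_mask].groupby(cat[train_mask]).mean()
        # Fallback for categories that do not occur in the training part of this fold
        prior = target[train_mask].mean()
        encoded.iloc[val_idx] = cat.iloc[val_idx].map(means).fillna(prior).to_numpy()
    return encoded

# Toy data, illustrative only
city = ["A", "A", "B", "B", "C", "A", "B", "C", "A", "B"]
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
enc = oof_target_encode(city, y, n_splits=5)
print(enc.round(2).tolist())
```

At prediction time the encoder is fit once on the full training set and then applied to unseen data; the out-of-fold step is only needed for the rows the model trains on.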

Q8. What is a confusion matrix and what can you derive from it?

                  Predicted Positive  Predicted Negative
Actual Positive        TP                  FN
Actual Negative        FP                  TN

From a confusion matrix you can compute:

Precision = TP / (TP + FP)
Recall / Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 = 2 * Precision * Recall / (Precision + Recall)
Accuracy = (TP + TN) / Total
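A quick worked example makes the formulas concrete. The counts below are hypothetical, chosen only to show how accuracy can look healthy while precision is mediocre.

```python
# Hypothetical counts for a fraud classifier (illustrative only)
TP, FN, FP, TN = 80, 20, 40, 860

precision   = TP / (TP + FP)                      # 80 / 120
recall      = TP / (TP + FN)                      # 80 / 100
specificity = TN / (TN + FP)                      # 860 / 900
f1          = 2 * precision * recall / (precision + recall)
accuracy    = (TP + TN) / (TP + FN + FP + TN)     # 940 / 1000

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.667 recall=0.800 specificity=0.956 f1=0.727 accuracy=0.940
```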

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import numpy as np

cm = confusion_matrix(y_test, y_pred)
print("TP:", cm[1,1], "TN:", cm[0,0], "FP:", cm[0,1], "FN:", cm[1,0])

# Multiclass
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1', 'Class 2']))

Q9. What is the difference between a parametric and non-parametric model?

Property	Parametric	Non-Parametric
Definition	Fixed number of parameters, regardless of data size	Number of parameters grows with data
Examples	Linear regression, logistic regression, Naive Bayes	KNN, decision trees, kernel SVM
Pros	Fast inference, interpretable, less data needed	Flexible, no distribution assumptions
Cons	Strong assumptions about data distribution	Slow at scale, prone to overfitting
Memory	O(parameters) = constant	O(training data)

Interview insight: KNN is the classic non-parametric model. It stores the entire training set; prediction = majority vote among k nearest neighbors. At FAANG scale, this is impractical (billions of samples). Approximate nearest neighbor (FAISS, ScaNN) makes it tractable.
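To make the storage and compute cost concrete, here is a minimal brute-force KNN sketch in NumPy. The function name `knn_predict` and the toy data are illustrative assumptions. Note that "training" is just keeping `X_tr` and `y_tr` around, and every query computes a distance to every stored point.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Brute-force KNN: majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]        # indices of the k closest points
    votes = y_train[nearest]                      # their labels, shape (n_query, k)
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: two well-separated clusters (illustrative only)
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_tr = np.array([0, 0, 0, 1, 1, 1])

preds = knn_predict(X_tr, y_tr, np.array([[0.1, 0.1], [5.0, 5.1]]), k=3)
print(preds)  # [0 1]
```

The distance matrix is O(n_query x n_train), which is the step that approximate nearest neighbor indexes replace.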

Q10. What is Naive Bayes and when does it work well despite its naive assumption?

P(y|x_1,...,x_n) ∝ P(y) * ∏ P(x_i|y)

The "naive" part is the independence assumption (features are almost never truly independent).

When it works well:

Text classification (word features are not truly independent, but the predicted class is often still correct even when the estimated probabilities are poorly calibrated)
Spam detection
Very small datasets where complex models overfit
When you need a fast baseline
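The posterior formula can be traced by hand on a two-word email. All priors and likelihoods below are made-up numbers for illustration; in practice they are estimated from word counts with smoothing.

```python
import math

# Hypothetical class priors and per-class word likelihoods (illustrative only)
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.02, "meeting": 0.10},
}
words = ["free", "meeting"]   # words observed in the email

# log P(y) + sum_i log P(x_i | y): sums of logs avoid underflow on long documents
log_score = {
    c: math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)
    for c in prior
}

# Normalize the scores into posterior probabilities
m = max(log_score.values())
unnorm = {c: math.exp(s - m) for c, s in log_score.items()}
total = sum(unnorm.values())
posterior = {c: v / total for c, v in unnorm.items()}

print({c: round(p, 3) for c, p in posterior.items()})
# {'spam': 0.667, 'ham': 0.333}
```

Here spam scores 0.4 * 0.30 * 0.02 = 0.0024 against 0.6 * 0.02 * 0.10 = 0.0012 for ham, so the posterior for spam is 2/3.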

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Text classification pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2))),
    ('clf', MultinomialNB(alpha=0.1))   # alpha = additive smoothing (alpha=1 is Laplace)
])
text_clf.fit(X_train_text, y_train)

MEDIUM: Ensemble Methods and Evaluation (Questions 11-22)

Q11. Compare Random Forest, Gradient Boosting, XGBoost, and LightGBM.

Property	Random Forest	Gradient Boosting	XGBoost	LightGBM
Training	Parallel	Sequential	Sequential + regularization	Sequential + leaf-wise growth
Error reduced	Variance	Bias	Bias + variance	Bias + variance
Speed	Fast	Slow	Faster via histogram	Fastest (GOSS + EFB)
Best for	Wide datasets, interpretability	General tabular	Kaggle-winning, structured data	Large datasets, categorical data
Native categoricals	No	No	Yes (enable_categorical=True)	Yes
GPU	No (by default)	No	Yes	Yes

import lightgbm as lgb
import xgboost as xgb

# LightGBM -- preferred in 2026 for large tabular data
lgb_model = lgb.LGBMClassifier(
    n_estimators=1000, learning_rate=0.05, max_depth=-1,
    num_leaves=63, subsample=0.8, subsample_freq=1,  # subsample is ignored unless subsample_freq > 0
    colsample_bytree=0.8,
    min_child_samples=20, reg_alpha=0.1, reg_lambda=0.1,
    n_jobs=-1, random_state=42
)
lgb_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])

# XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=1000, learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    reg_alpha=0.1, reg_lambda=1.0,
    tree_method='hist', device='cuda',
    eval_metric='logloss', early_stopping_rounds=50
)

Q12. How does gradient boosting work step by step?

Gradient boosting builds an additive model by fitting each new tree to the negative gradient (residuals) of the loss function:

Step 0: Initialize F_0(x) = argmin_γ Σ L(y_i, γ)
For m = 1 to M:
  1. Compute pseudo-residuals: r_im = -∂L(y_i, F_{m-1}(x_i)) / ∂F_{m-1}(x_i)
  2. Fit a regression tree T_m to r_im
  3. Update: F_m(x) = F_{m-1}(x) + η * T_m(x)

For MSE loss (written as (y - F)^2 / 2), pseudo-residuals = actual residuals (y - F). For log-loss, where F is the raw log-odds score, pseudo-residuals = y - sigmoid(F).
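The log-loss case is easy to verify numerically. This sketch compares the closed-form pseudo-residual with a central finite-difference estimate of the negative gradient; the sample values of `y` and `F` are arbitrary and only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    """Per-sample binary log-loss, with F the raw log-odds score."""
    p = sigmoid(F)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])
F = np.array([0.3, -1.2, 2.0, 0.7])   # current model scores (arbitrary values)

# Closed form: negative gradient of log-loss with respect to F
analytic = y - sigmoid(F)

# Central finite difference of the same quantity
eps = 1e-6
numeric = -(log_loss(y, F + eps) - log_loss(y, F - eps)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff < 1e-6)  # True
```

Each boosting round fits its tree to these values, which is why a gradient-boosted classifier still uses regression trees internally.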

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Manual gradient boosting for MSE loss
class SimpleGBM:
    def __init__(self, n_estimators=100, lr=0.1, max_depth=3):
        self.trees = []
        self.n_estimators = n_estimators
        self.lr = lr
        self.max_depth = max_depth

    def fit(self, X, y):
        self.F0 = y.mean()
        F = np.full(len(y), self.F0)
        for _ in range(self.n_estimators):
            residuals = y - F          # negative gradient of MSE
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)
            F += self.lr * tree.predict(X)

    def predict(self, X):
        return self.F0 + self.lr * sum(t.predict(X) for t in self.trees)

Q13. What is feature importance in tree models? How is it computed?

Tree models offer several feature importance measures:

Method	How	Pros	Cons
Impurity (Gini/MSE)	Sum of impurity reduction by feature across all splits	Fast, built-in	Biased toward high-cardinality and numerical features
Permutation Importance	Measure accuracy drop when feature is shuffled	Model-agnostic, no cardinality bias	Slow; understates correlated features
SHAP Values	Game theory-based attribution	Consistent, handles interactions	Slow for large ensembles

from sklearn.inspection import permutation_importance
import shap

# Impurity importance (fast, built-in)
importances = model.feature_importances_

# Permutation importance (unbiased)
perm_imp = permutation_importance(model, X_test, y_test,
                                   n_repeats=10, n_jobs=-1)
sorted_idx = perm_imp.importances_mean.argsort()[::-1]

# SHAP (gold standard in 2026)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

Q14. What is ROC-AUC vs PR-AUC? When does each metric mislead you?

ROC-AUC (Receiver Operating Characteristic)

Plots TPR (recall) vs FPR at all thresholds
AUC = probability that model ranks a random positive higher than a random negative
Misleads when: Severe class imbalance. A high AUC can hide poor minority-class performance because FPR looks small when TN >> FP
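The ranking interpretation of AUC can be computed directly by comparing every positive score with every negative score. This is a small NumPy sketch; `pairwise_auc` is an illustrative helper name, and on the same inputs it should agree with `sklearn.metrics.roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]            # every positive vs every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

y_demo = [0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_demo, s_demo)
print(auc)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly (only 0.35 vs 0.4 is wrong), giving 0.75. The pairwise form is quadratic in sample count, so it is for intuition rather than production use.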

PR-AUC (Precision-Recall)

Plots Precision vs Recall at all thresholds
AUC = area under the precision-recall curve
Better for: Fraud detection, rare disease detection, any imbalanced binary problem

from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

y_scores = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, y_scores)
pr_auc  = average_precision_score(y_test, y_scores)

print(f"ROC-AUC: {roc_auc:.4f}")
print(f"PR-AUC:  {pr_auc:.4f}")

# If class imbalance is severe (e.g., 1% positives):
# ROC-AUC = 0.95 might still mean model misses 40% of positives
# PR-AUC tells the truth

Q15. How do you handle imbalanced classes in machine learning?

Strategy	Description	When to Use
Class weights	Penalize minority class misclassification more	First thing to try; no data modification
Oversampling (SMOTE)	Synthesize new minority samples	When dataset is small
Undersampling	Remove majority class samples	When majority class is massive
Threshold tuning	Move decision threshold from 0.5	Always in production
Focal Loss	Penalize easy examples less	Deep learning with severe imbalance
Ensemble with balanced subsampling	BalancedBaggingClassifier	Stable, general purpose

from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Strategy 1: class weighting via sample weights (GradientBoostingClassifier has no class_weight parameter)
model = GradientBoostingClassifier()
sample_weights = compute_sample_weight('balanced', y_train)
model.fit(X_train, y_train, sample_weight=sample_weights)

# Strategy 2: SMOTE oversampling pipeline
smote_pipeline = ImbPipeline([
    ('oversample', SMOTE(k_neighbors=5, random_state=42)),
    ('model', GradientBoostingClassifier())
])

# Strategy 3: Threshold tuning
import numpy as np
from sklearn.metrics import precision_recall_curve
prec, rec, thresh = precision_recall_curve(y_val, model.predict_proba(X_val)[:,1])
# prec and rec have one more entry than thresh; drop the final (precision=1, recall=0) point
f1_scores = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_thresh = thresh[np.argmax(f1_scores)]
y_pred_tuned = (model.predict_proba(X_test)[:,1] >= best_thresh).astype(int)
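Focal loss from the table has no code above, so here is a minimal NumPy sketch of the binary form, -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults gamma=2 and alpha=0.25 follow the values commonly quoted for it; the function names and the two example predictions are illustrative assumptions.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability given to the true class
    return -np.log(p_t)

def binary_focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class balancing weight
    # (1 - p_t)^gamma shrinks the loss of confidently correct (easy) examples
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

y = np.array([1, 1])
p = np.array([0.95, 0.30])                        # an easy positive and a hard positive

ratio = binary_focal_loss(y, p) / binary_cross_entropy(y, p)
print(ratio)  # approximately [0.000625 0.1225]
```

Relative to cross-entropy, the easy example is scaled down by about 1600x and the hard one by about 8x, so gradient updates concentrate on the hard cases.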

Q16. What is a learning curve and how do you use it for model diagnosis?

Pattern	Diagnosis	Fix
Both curves low and flat	High bias / underfitting	More complex model, better features
Large gap, train high, val low	High variance / overfitting	More data, regularization
Val curve still rising	Model benefits from more data	Collect more data
Both curves high and close	Well-fit	Ship it

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

train_sizes = np.linspace(0.1, 1.0, 10)
train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel='rbf', C=10), X, y,
    train_sizes=train_sizes, cv=5,
    scoring='accuracy', n_jobs=-1
)

print("Train:", train_scores.mean(axis=1).round(3))
print("Val:  ", val_scores.mean(axis=1).round(3))

Q17. How does principal component analysis (PCA) work?

Steps:

Standardize features (zero mean, unit variance)
Compute covariance matrix: C = X^T X / (n-1)
Eigen-decompose C: eigenvectors = principal components, eigenvalues = variance explained
Project data onto top-k eigenvectors
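The four steps map one-to-one onto a few lines of NumPy. This is a from-scratch sketch on synthetic data (the `X_demo` array is an assumption for illustration); in practice `sklearn.decomposition.PCA` uses an SVD, which yields the same components up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features, the first two strongly correlated
z = rng.normal(size=(200, 1))
X_demo = np.hstack([z, 2 * z + 0.1 * rng.normal(size=(200, 1)),
                    rng.normal(size=(200, 1))])

# 1. Standardize (ddof=1 so that C below is exactly the correlation matrix)
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)

# 2. Covariance matrix C = X^T X / (n - 1)
C = X_std.T @ X_std / (X_std.shape[0] - 1)

# 3. Eigen-decomposition; eigh is for symmetric matrices and returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                 # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_ratio = eigvals / eigvals.sum()

# 4. Project onto the top-k eigenvectors
k = 2
X_proj = X_std @ eigvecs[:, :k]

print(explained_ratio.round(3), X_proj.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is what "variance explained" refers to.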

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Always scale first
X_scaled = StandardScaler().fit_transform(X)

# Retain 95% of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"PCA components:    {X_pca.shape[1]}")
print(f"Variance explained per component: {pca.explained_variance_ratio_}")
print(f"Cumulative variance: {pca.explained_variance_ratio_.cumsum()}")

Use PCA for: Dimensionality reduction before clustering, visualization (PCA to 2D), removing multicollinearity, speeding up downstream models.

Do NOT use PCA for: Interpretability (components are linear combinations of all features), if features are already low-dimensional.

Q18. What is the difference between classification and regression? When does the boundary blur?

Aspect	Classification	Regression
Output	Discrete class label	Continuous value
Loss functions	Cross-entropy, Hinge loss	MSE, MAE, Huber
Evaluation	Accuracy, F1, AUC-ROC	RMSE, MAE, R²
Examples	Spam detection, fraud	House price, demand forecasting

When the boundary blurs:

Ordinal regression: Target is ordered categories (1-5 star rating). Can model as regression or multi-class classification; ordinal-specific methods do better.
Threshold classification: Take regression output and threshold it (e.g., predict churn probability > 0.6 = churn)
Calibration: Neural networks outputting soft probabilities are doing regression internally

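# Ordinal regression
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class OrdinalClassifier:
    """Frank-Hall ordinal encoding: train k-1 binary classifiers."""
    def __init__(self, base_clf=None):
        self.base_clf = base_clf or LogisticRegression()
        self.clfs_ = []
        self.classes_ = None

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.clfs_ = []   # reset so that refitting does not accumulate classifiers
        for c in self.classes_[:-1]:
            binary_y = (y > c).astype(int)   # binary target: is the label above class c?
            clf = clone(self.base_clf)
            clf.fit(X, binary_y)
            self.clfs_.append(clf)
        return self

    def predict(self, X):
        # gt[:, i] = P(y > classes_[i]), shape (n_samples, k-1)
        gt = np.column_stack([clf.predict_proba(X)[:, 1] for clf in self.clfs_])
        # P(lowest) = 1 - P(y > c_0); P(c_i) = P(y > c_{i-1}) - P(y > c_i); P(highest) = P(y > c_{k-2})
        probs = np.column_stack([1 - gt[:, 0], gt[:, :-1] - gt[:, 1:], gt[:, -1]])
        return self.classes_[probs.argmax(axis=1)]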

Q19. What is a pipeline in scikit-learn and why is it critical for production?

Prevents data leakage: Preprocessing steps (scaling, imputation) are fit only on training data, never on test data
Clean code: One object to fit, predict, persist
Hyperparameter tuning is unified: GridSearchCV can tune pipeline parameters

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

num_features = ['age', 'income', 'tenure']
cat_features = ['city', 'product_type']

# Preprocessing for numerical features
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])

# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05))
])

full_pipeline.fit(X_train, y_train)

Q20. Explain SHAP values and why they are the standard for model explainability in 2026.

Consistent: If a feature's impact increases, its SHAP value increases. Impurity-based importance violates this.
Local and global: Explain individual predictions AND aggregate into global importance.
Model-agnostic: Works for any ML model; specialized implementations (TreeSHAP) are exact and fast for trees.

import shap

# Fast exact SHAP for tree models (TreeSHAP)
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_test)

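# Global: which features matter most?
# shap_values[1] assumes a per-class list (older SHAP releases); if your version
# returns a single array for binary models, pass shap_values directly
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)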

# Local: why did this specific prediction happen?
shap.waterfall_plot(explainer(X_test)[0])

# Force plot: visual explanation for one instance
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test.iloc[0])

Production pattern (Amazon, Flipkart): Attach SHAP explanations to fraud alerts and loan rejection decisions. Regulators in India (RBI) require explainable credit decisions.

Q21. What is A/B testing and how does it relate to ML model evaluation?

Steps:

Define hypothesis (H0: models perform equally; H1: new model is better)
Calculate required sample size (power analysis)
Run experiment until n_min reached
Use hypothesis test (t-test, z-test, or Mann-Whitney U) to determine significance
Never stop early based on current significance (p-hacking)
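The sample-size step can be sketched with the standard library alone. The function below uses one common normal-approximation formula for comparing two proportions with a one-sided test; the function name, the 8% baseline, and the 9% target rate are illustrative assumptions, and dedicated power-analysis tools may give slightly different numbers.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group n to detect p1 -> p2 with a one-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)     # one-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Hypothetical experiment: 8% baseline conversion, hoping to detect a lift to 9%
n_min = sample_size_two_proportions(0.08, 0.09)
print(n_min)
```

Halving the detectable lift roughly quadruples the required sample, because n scales with 1 / (p2 - p1)^2.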

from scipy import stats
import numpy as np

# Conversion rates
control_conversions   = 1200
control_n             = 15000
treatment_conversions = 1350
treatment_n           = 15000

p_control   = control_conversions / control_n
p_treatment = treatment_conversions / treatment_n

# Two-proportion z-test
from statsmodels.stats.proportion import proportions_ztest
count = np.array([treatment_conversions, control_conversions])
nobs  = np.array([treatment_n, control_n])
stat, pvalue = proportions_ztest(count, nobs, alternative='larger')
print(f"p-value: {pvalue:.4f}")
print("Reject H0" if pvalue < 0.05 else "Cannot reject H0")

Q22. What is stacking (stacked generalization) and how does it differ from bagging and boosting?

Ensemble	How	When to Use
Bagging	Average parallel models on bootstrap samples	Reduce variance (RF)
Boosting	Sequential models, each correcting the last	Reduce bias (XGB)
Stacking	Train meta-learner on base model predictions	Combine diverse models, squeeze last few %

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

base_models = [
    ('rf',  RandomForestClassifier(n_estimators=200, n_jobs=-1)),
    ('gb',  GradientBoostingClassifier(n_estimators=200)),
    ('svm', SVC(probability=True, kernel='rbf'))
]

# Meta-learner trained on base model out-of-fold predictions
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(C=0.1),
    cv=5,
    stack_method='predict_proba'
)
stack.fit(X_train, y_train)

HARD: Advanced Topics and System Design (Questions 23-30)

Q23. How do you design an ML pipeline for production? What are the key components?

Production ML Pipeline Architecture (2026):

Data Layer:
  Raw ingestion (Kafka/S3) -> Feature Store (Feast/Tecton) -> Training data snapshots

Training Layer:
  Experiment tracking (MLflow/W&B) -> Hyperparameter tuning (Optuna) -> Model registry

Serving Layer:
  REST API (FastAPI) or batch scoring -> Model versioning -> Canary/shadow deployment

Monitoring Layer:
  Data drift (EvidentlyAI) -> Model performance -> Alerting -> Retraining triggers

import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("churn-prediction-v2")

with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)

    val_roc = roc_auc_score(y_val, model.predict_proba(X_val)[:,1])

    mlflow.log_params(model.get_params())
    mlflow.log_metric("val_roc_auc", val_roc)

    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature)

Q24. What is concept drift and how do you detect and handle it?

Type	Definition	Example
Sudden drift	Abrupt change	New government policy changes fraud patterns
Gradual drift	Slow shift	Consumer behavior shifts over months
Recurring drift	Cyclic change	Seasonal patterns in retail
Feature drift (data drift)	Input distribution P(x) shifts, even if the input-to-target relationship P(y|x) does not	Economic crisis shifts income distributions

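# These import paths follow the older Evidently Report API; newer releases
# reorganized the modules, so check them against your installed version
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset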

# Compare reference (training) vs current (production) data
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=X_train_df, current_data=X_production_df)
report.save_html("drift_report.html")

# Statistical test: KS test for continuous features
from scipy.stats import ks_2samp
for col in numerical_features:
    stat, pval = ks_2samp(X_train[col], X_production[col])
    if pval < 0.05:
        print(f"Drift detected in {col}: p={pval:.4f}")

Response to drift:

Retrain on recent data (rolling window)
Continuously fine-tune on production stream
Use online learning for fast-drift scenarios

Q25. Explain the curse of dimensionality and its practical impact on ML.

Sparsity: Data points are far apart. In d dimensions, to have the same density as 10 points in 1D, you need 10^d points.
Distance concentration: As d grows, the ratio of max-to-min distance among points converges to 1. Distance metrics become uninformative for KNN.
Computational explosion: Feature combinations grow exponentially.

Practical impact:

# Distance concentration demonstration
import numpy as np

np.random.seed(42)
for d in [2, 10, 100, 1000]:
    X = np.random.randn(1000, d)
    x = np.random.randn(d)
    dists = np.linalg.norm(X - x, axis=1)
    ratio = dists.max() / dists.min()
    print(f"d={d:5d}  max/min distance ratio = {ratio:.2f}")
# Exact values depend on the seed: the ratio is large at d=2 and shrinks toward ~1.2 at d=1000
# At d=1000, max and min distances are nearly equal -> nearest neighbors carry little signal for KNN

Mitigations: PCA, autoencoders, feature selection (mutual information, LASSO), domain knowledge to drop irrelevant features.

Q26. What is Bayesian optimization and why is it better than grid search for hyperparameter tuning?

Method	How	Evals Needed	Smart?
Grid search	Try all combinations	O(n^k)	No
Random search	Random samples	O(n)	No
Bayesian (TPE, GP)	Exploit + explore based on history	O(50-100)	Yes

import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score

def objective(trial):
    params = {
        'n_estimators':   trial.suggest_int('n_estimators', 100, 1000),
        'max_depth':      trial.suggest_int('max_depth', 3, 12),
        'learning_rate':  trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample':      trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0)
    }
    # XGBoost 2.x takes early_stopping_rounds in the constructor, not in fit()
    model = xgb.XGBClassifier(**params, eval_metric='logloss', verbosity=0,
                              early_stopping_rounds=30)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:,1])

study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100, n_jobs=4)
print("Best params:", study.best_params)

Q27. How does logistic regression actually work? Derive the update rule.

σ(z) = 1 / (1 + e^{-z})
P(y|x) = σ(w^T x)^y * (1 - σ(w^T x))^(1-y)

Negative log-likelihood (cross-entropy loss):

L(w) = -Σ [y_i * log σ(w^T x_i) + (1-y_i) * log(1 - σ(w^T x_i))]

Gradient:

∂L/∂w = Σ (σ(w^T x_i) - y_i) * x_i = X^T (y_hat - y)

Update rule (gradient descent on the mean loss L/n):

w ← w - η * X^T (y_hat - y) / n

import numpy as np

class LogisticRegressionFromScratch:
    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n, d = X.shape
        self.weights = np.zeros(d)
        self.bias = 0.0
        for _ in range(self.n_iters):
            y_hat = self.sigmoid(X @ self.weights + self.bias)
            dw = X.T @ (y_hat - y) / n
            db = (y_hat - y).mean()
            self.weights -= self.lr * dw
            self.bias    -= self.lr * db

    def predict_proba(self, X):
        return self.sigmoid(X @ self.weights + self.bias)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

Q28. What is the VC dimension and why does it matter for generalization?

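The VC dimension of a hypothesis class is the size of the largest set of points it can shatter, that is, label correctly under every one of the 2^n possible labelings.

Linear classifier in R^d: VC dimension = d+1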
Sine function sin(wx): VC dimension = infinity (can shatter arbitrarily many points by tuning w)

Why it matters: Generalization bound (PAC learning), which holds with high probability and is shown here in simplified form with constants dropped:

Error_test <= Error_train + sqrt(VC_dim * log(n) / n)

Higher VC dimension = more complex hypothesis class = higher generalization gap at fixed n.
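
To get a feel for the numbers, the complexity term of the simplified bound can be evaluated directly. This is a rough sketch: `vc_gap_term` is a hypothetical helper, and because the bound above drops constants and the confidence term, the values are only indicative of how the gap scales.

```python
import math

def vc_gap_term(vc_dim, n):
    """Complexity term sqrt(VC_dim * log(n) / n) from the simplified bound."""
    return math.sqrt(vc_dim * math.log(n) / n)

# VC_dim = d + 1 for linear classifiers, so these match d = 10 and d = 1000
for vc_dim in (11, 1001):
    for n in (1_000, 100_000, 10_000_000):
        gap = vc_gap_term(vc_dim, n)
        print(f"VC_dim={vc_dim:>5}  n={n:>10,}  gap term={gap:.3f}")
```

The term shrinks as n grows and rises with VC dimension. Any value above 1 is vacuous, because an error rate cannot exceed 1.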

Interview insight: Classical VC bounds predict that deep neural networks with billions of parameters should generalize poorly, yet in practice they often generalize well. Commonly cited explanations include implicit regularization from SGD, early stopping, and dropout, which are thought to keep the effective capacity far below the parameter count. Why this works remains an open theoretical problem, and FAANG research interviewers love asking about it.

Q29. Design a recommendation system. What ML components are involved?

Architecture layers:

1. Candidate Generation (fast, recall-focused)
   - Matrix Factorization: user embedding @ item embedding
   - Two-Tower Neural Network (BERT embeddings for items)
   - Approximate Nearest Neighbor: FAISS, ScaNN
   - Goal: get top 1,000 candidates from millions of items

2. Ranking (slow, precision-focused)
   - Features: user history, item metadata, context (time, device)
   - Model: LightGBM or Wide & Deep Neural Network
   - Objective: P(click|user, item, context), P(purchase|...)
   - Goal: rank top 1,000 candidates, return top 10-50

3. Re-ranking (business logic)
   - Diversity: avoid showing all same-category items
   - Freshness boost for new items
   - Business constraints (promoted items, inventory)

# Two-tower model sketch (simplified)
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, user_vocab, item_vocab, embed_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Embedding(user_vocab, embed_dim),
            nn.Flatten(),
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 64)
        )
        self.item_tower = nn.Sequential(
            nn.Embedding(item_vocab, embed_dim),
            nn.Flatten(),
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 64)
        )

    def forward(self, user_ids, item_ids):
        u = self.user_tower(user_ids)
        i = self.item_tower(item_ids)
        return (u * i).sum(dim=-1)  # dot product similarity
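
The candidate generation and re-ranking layers described above can be sketched with plain NumPy. Everything here is an illustrative assumption: the embeddings and category ids are random, brute-force dot products stand in for an approximate nearest neighbor index such as FAISS or ScaNN, and the ranking model in layer 2 is skipped, so the retrieval order is passed straight to re-ranking.

```python
import numpy as np

def generate_candidates(user_vec, item_matrix, k):
    """Layer 1: brute-force top-k items by dot product (stand-in for ANN search)."""
    scores = item_matrix @ user_vec
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k indices
    return top[np.argsort(-scores[top])]        # sorted best-first

def rerank_with_diversity(ranked_items, categories, max_per_category, n_out):
    """Layer 3: walk the ranked list, capping how many items share a category."""
    picked, per_category = [], {}
    for item in ranked_items:
        cat = int(categories[item])
        if per_category.get(cat, 0) < max_per_category:
            picked.append(int(item))
            per_category[cat] = per_category.get(cat, 0) + 1
        if len(picked) == n_out:
            break
    return picked

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
item_emb = rng.normal(size=(5000, 64))      # item embeddings
item_cat = rng.integers(0, 8, size=5000)    # category id per item
user_emb = rng.normal(size=64)              # one user embedding

candidates = generate_candidates(user_emb, item_emb, k=200)
final = rerank_with_diversity(candidates, item_cat, max_per_category=3, n_out=10)
print(final)
```

In a real system the user and item vectors would come from a trained model such as the two-tower network, and a learned ranker would reorder the candidates before the diversity cap is applied.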

Q30. What is multi-armed bandit and when is it better than A/B testing?

Aspect	A/B Test	Multi-Armed Bandit
Traffic allocation	Fixed (50/50)	Dynamic (shift to winner)
Regret	High (bad arm runs full duration)	Lower (bad arm starved quickly)
Statistical rigor	Formal hypothesis testing	Less formal, regret-minimization
Best for	Stable business decisions	Real-time optimization (ads, pricing)

import numpy as np

class EpsilonGreedyBandit:
    """Epsilon-greedy MAB: explore with probability epsilon."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts  = np.zeros(n_arms)
        self.values  = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # explore
        return np.argmax(self.values)               # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean

# Thompson Sampling (often better in practice); assumes Bernoulli rewards in {0, 1}
class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta  = np.ones(n_arms)  # failures + 1

    def select_arm(self):
        samples = np.random.beta(self.alpha, self.beta)
        return np.argmax(samples)

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm]  += 1 - reward
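
The "Regret" row of the table can be illustrated with a small simulation. This is a sketch under stated assumptions: the two conversion rates are hypothetical and deliberately far apart, rewards are Bernoulli, and the helper `thompson_pulls` is written with the standard library only, so it is independent of the classes above.

```python
import random

def thompson_pulls(true_rates, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson Sampling and return the pulls per arm."""
    rng = random.Random(seed)
    alpha = [1.0] * len(true_rates)   # successes + 1
    beta = [1.0] * len(true_rates)    # failures + 1
    pulls = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Sample a plausible rate for each arm from its posterior
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_rates[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

true_rates = [0.10, 0.70]   # hypothetical conversion rates; arm 1 is better
n_rounds = 2000
gap = max(true_rates) - min(true_rates)

pulls = thompson_pulls(true_rates, n_rounds)
bandit_regret = pulls[0] * gap      # expected loss from pulling the weaker arm
ab_regret = (n_rounds / 2) * gap    # fixed 50/50 split for the full duration

print("Thompson pulls per arm:", pulls)
print("expected regret, bandit vs A/B:", round(bandit_regret, 1), round(ab_regret, 1))
```

The bandit starves the weaker arm after a short exploration phase, while the fixed split keeps sending it half the traffic. The tradeoff is the one in the "Statistical rigor" row: the adaptive allocation makes a classical hypothesis test on the final counts harder to interpret.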

Comparison Table: ML Algorithms at a Glance

Algorithm	Type	Data Scale	Interpretable	Handles Non-linearity	2026 Status
Linear Regression	Regression	Large	Yes	No	Baseline
Logistic Regression	Classification	Large	Yes	No	Strong baseline
Decision Tree	Both	Medium	Yes	Yes	Rarely solo
Random Forest	Both	Large	Partial	Yes	Strong
XGBoost	Both	Large	Partial	Yes	Industry standard
LightGBM	Both	Very Large	Partial	Yes	Fastest tabular
SVM	Both	Medium	No	Yes (kernel)	Fading
KNN	Both	Small	Yes	Yes	Prototyping
Naive Bayes	Classification	Large	Yes	No	NLP baseline
Neural Network	Both	Very Large	No	Yes	Complex tasks

FAQ

Q: What is the most important ML algorithm to know for 2026 interviews?

A: Gradient boosting variants (XGBoost, LightGBM) for tabular data; transformers for sequential/unstructured data. Know both categories deeply.

Q: How much math do I need for ML interviews at product companies?

A: Linear algebra (matrix operations), calculus (chain rule, partial derivatives), probability (Bayes theorem, distributions), and statistics (hypothesis testing, confidence intervals). You will not be asked to prove convergence theorems.

Q: What is the difference between a data scientist and ML engineer role?

A: Data scientists focus on analysis, experimentation, and model building. ML engineers focus on deploying models at scale, building feature pipelines, and maintaining production systems. The boundary is blurring in 2026.

Q: Which Python libraries should I know for ML interviews?

A: scikit-learn (must), XGBoost and LightGBM (must), pandas and NumPy (must), SHAP (expected at senior level), Optuna (good to know), MLflow (good to know for MLOps roles).

Q: Is deep learning tested in ML engineer interviews?

A: Depends on the role. For ML engineer and data scientist roles at product companies, deep learning basics (neural nets, backprop, regularization) are fair game. Advanced topics (transformers, fine-tuning, distributed training) are tested for ML researcher and AI/ML engineer roles.

Sources and review notes (reviewed 8 Jun 2026)

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.

Evidence labels

Official notices, candidate reports, offer documents, and editorial practice questions carry different confidence levels. The visible source list lets you inspect the evidence instead of relying on a blanket verification badge.

Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
