est. 2017
Sun, 27 Apr 2026
vol. IX · no. 117
PapersAdda
placement intelligence, since 2017
640+ briefs · 24 campuses · by reservation

Machine Learning Interview Questions 2026: 30 Answers with Code

Updated: 8 Jun 2026
Aditya Sharma
Aditya's Edit

PapersAdda 2026 Placement Cycle

By Aditya Sharma · Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.

What I'd actually study for this

  • 01 Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  • 02 One real project you can defend end-to-end - file paths, design decisions, and what you would change
  • 03 One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  • 04 Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.

Machine learning roles are the fastest-growing engineering track in 2026. From product-based companies in India to global FAANG hiring, ML interview rounds have expanded beyond theoretical statistics into practical model building, production pipeline design, and real-world tradeoff reasoning. This guide covers 30 questions with full answers, Python code, and comparison tables across the complete difficulty spectrum.

PapersAdda's take: The ML interview in 2026 rewards engineers who can explain what happens inside the black box AND build a working pipeline. Theory without code = red flag. Code without intuition = another red flag. This guide trains both. Candidate-reported feedback from public preparation resources consistently flags that interviewers at product companies follow up any algorithm question with "how would you put this in production?" Candidates report that gradient boosting and model evaluation metrics appear in over 80% of shortlists. Confirm exact interview formats on the official company careers portal before you prepare.



Which Companies Ask These Questions?

| Topic Cluster | Companies |
| --- | --- |
| Supervised Learning Fundamentals | All product companies, all FAANG |
| Ensemble Methods (RF, XGBoost) | Google, Amazon, Flipkart, PhonePe, Swiggy |
| Feature Engineering | Uber, LinkedIn, all ML-heavy teams |
| Model Evaluation and Metrics | All data science roles |
| ML System Design | Google, Meta, Amazon senior rounds |
| Clustering and Unsupervised | All data science roles |
| Pipelines and Production | MLOps roles, Databricks, AWS |

EASY: Core Concepts (Questions 1-10)

PapersAdda's note: These are the questions that separate a prepared candidate from an unprepared one in the first 10 minutes. Get them cold.

Q1. What is machine learning? How is it different from traditional programming?

| Aspect | Traditional Programming | Machine Learning |
| --- | --- | --- |
| Input | Rules + Data | Data + Expected Output |
| Output | Program output | Rules (learned model) |
| Maintenance | Update rules manually | Retrain on new data |
| Works well when | Rules are known and stable | Rules are complex or unknown |

Machine learning is the practice of building systems that learn patterns from data rather than being explicitly programmed. The program writes itself from examples.

# Traditional programming: manually coded rule
def is_spam(email):
    if "free money" in email.lower() or "click here" in email.lower():
        return True
    return False

# Machine learning: model learns the rule from data
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(training_emails)
model = LogisticRegression()
model.fit(X_train, labels)   # model discovers the rule from data

Q2. What is the difference between a parameter and a hyperparameter?

| Term | Definition | Who Sets It | Examples |
| --- | --- | --- | --- |
| Parameter | Learned from training data | Optimizer | Weights, biases in a neural net; coefficients in linear regression |
| Hyperparameter | Configured before training | You | Learning rate, n_estimators, max_depth, regularization strength |

Why it matters in interviews: Hyperparameter tuning is a core skill. Know GridSearchCV, RandomizedSearchCV, and Optuna (Bayesian optimization, 2026 standard).

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import scipy.stats as stats

param_dist = {
    'n_estimators': stats.randint(50, 500),
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': stats.randint(2, 20),
    'max_features': ['sqrt', 'log2', 0.3]
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_distributions=param_dist,
    n_iter=50, cv=5, scoring='f1', n_jobs=-1, random_state=42
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)

Q3. Explain overfitting and underfitting. How do you detect and fix each?

| Problem | Definition | Symptom | Fix |
| --- | --- | --- | --- |
| Overfitting | Model memorizes training noise | High train accuracy, low test accuracy | More data, regularization, dropout, simpler model |
| Underfitting | Model too simple for data | Low train AND test accuracy | More capacity, better features, more epochs |
| Good fit | Model captures true pattern | High train AND test accuracy | Keep |

from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring='accuracy',
        train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
    )
    # If val_scores plateau far below train_scores: overfitting
    # If both plateau at low score: underfitting
    return train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)

Regularization is the primary fix for overfitting:

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=10.0)      # L2: shrinks all weights
lasso = Lasso(alpha=0.1)       # L1: zeros irrelevant features (feature selection)
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)  # Both
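
To make the shrinkage effect concrete, here is a minimal dependency-free sketch of one-feature ridge regression without an intercept. The closed form w = Σxy / (Σx² + alpha) is the standard ridge solution for this special case; the toy data and the helper name `ridge_weight_1d` are illustrative assumptions, not part of scikit-learn.

```python
def ridge_weight_1d(xs, ys, alpha):
    """Closed-form ridge solution for a single feature with no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # toy data, roughly y = 2x

w_ols   = ridge_weight_1d(xs, ys, alpha=0.0)      # no penalty: ordinary least squares
w_ridge = ridge_weight_1d(xs, ys, alpha=10.0)     # moderate penalty
w_heavy = ridge_weight_1d(xs, ys, alpha=1000.0)   # heavy penalty pushes the weight toward 0

print(w_ols, w_ridge, w_heavy)
```

The weight shrinks monotonically as alpha grows, which is the bias-for-variance trade that regularization makes.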

Q4. What is feature scaling and when is it required?

| Algorithm | Needs Scaling? | Why |
| --- | --- | --- |
| Linear/Logistic Regression | Yes | Gradient descent converges faster |
| SVM | Yes | Margin depends on distance |
| KNN | Yes | Distance metric is scale-sensitive |
| Decision Tree / Random Forest | No | Split thresholds are scale-invariant |
| XGBoost / LightGBM | No | Tree splits are invariant |
| Neural Networks | Yes | Gradient flow is scale-sensitive |

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler: (x - mean) / std -- best for normally distributed features
scaler = StandardScaler()

# MinMaxScaler: (x - min) / (max - min) -- when you need values in [0,1]
min_max = MinMaxScaler()

# RobustScaler: (x - median) / IQR -- resistant to outliers
robust = RobustScaler()

# Always fit only on training data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)   # transform only, no fit
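
The three formulas in the comments above can be checked by hand. The sketch below applies them to a toy feature with one outlier using only the standard library; the data is made up for illustration, and the quartile method is chosen to mimic linear-interpolation percentiles.

```python
import statistics

values = [10.0, 12.0, 14.0, 16.0, 100.0]   # toy feature with one outlier

# Standard scaling: (x - mean) / std, using the population standard deviation
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standard = [(v - mean) / std for v in values]

# Min-max scaling: (x - min) / (max - min)
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Robust scaling: (x - median) / IQR
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(v - median) / (q3 - q1) for v in values]

print(min_max)
print(robust)
```

The outlier squeezes the four ordinary points into a narrow band near 0 under min-max scaling, while robust scaling keeps them evenly spread, which is why RobustScaler is the usual choice for outlier-heavy features.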

Q5. What are the types of cross-validation and when do you use each?

| Type | Description | Use When |
| --- | --- | --- |
| K-Fold | Split into k folds; rotate | Standard for classification/regression |
| Stratified K-Fold | Maintain class proportion per fold | Imbalanced classification |
| Leave-One-Out (LOO) | n-fold CV; each sample is a fold | Very small datasets |
| Time Series Split | Train on past; validate on future | Any time series; NEVER shuffle |
| Group K-Fold | Samples from same group never split across folds | Patient data, user-level data |

from sklearn.model_selection import (StratifiedKFold, TimeSeriesSplit,
                                      GroupKFold, cross_val_score)
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Imbalanced classification: stratify so every fold keeps the class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=skf, scoring='roc_auc')

# Time series -- NEVER shuffle; each fold trains on the past and validates on the future
# (a regressor is used here because the scoring metric is MSE)
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(GradientBoostingRegressor(), X, y, cv=tscv, scoring='neg_mean_squared_error')
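
To see what the splitters do under the hood, here is a rough index-level sketch of plain K-Fold and of an expanding-window time series split. It is a simplified illustration and does not reproduce scikit-learn's exact fold boundaries; both function names are hypothetical.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for plain K-Fold without shuffling."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

def expanding_window_indices(n_samples, k):
    """Yield (train_idx, val_idx) where validation always comes after training in time."""
    fold = n_samples // (k + 1)
    for i in range(1, k + 1):
        train = list(range(0, i * fold))
        val = list(range(i * fold, (i + 1) * fold))
        yield train, val

kfold = list(kfold_indices(10, 5))
ts = list(expanding_window_indices(12, 3))

for train, val in ts:
    print("train up to", train[-1], "-> validate on", val)
```

In the time series variant every validation index is later than every training index, so the model never sees the future during training.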

Q6. How do you handle missing values in a dataset?

| Strategy | When to Use | Risk |
| --- | --- | --- |
| Drop rows | Missing rate < 5%, data is large | Information loss |
| Mean/Median imputation | Numerical, MCAR assumption | Distorts variance |
| Mode imputation | Categorical | Distorts distribution |
| KNN imputation | Small-medium datasets, correlated features | Slow at scale |
| Model-based (IterativeImputer) | Complex missingness patterns | High compute |
| Forward fill / Back fill | Time series | Only if temporal relationship holds |
| Add a missing indicator column | When missingness is informative | Always consider this |

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv("data.csv")

# Check missing pattern
print(df.isnull().sum() / len(df) * 100)

# Median imputation (robust to outliers vs mean)
num_imputer = SimpleImputer(strategy='median')

# KNN imputation -- respects feature correlations
knn_imputer = KNNImputer(n_neighbors=5)

# Always add indicator for informative missingness
df['age_missing'] = df['age'].isnull().astype(int)

Q7. What is feature engineering? Give three practical examples.

Example 1: Date decomposition

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['day_of_week']   = df['date'].dt.dayofweek   # 0=Monday
df['is_weekend']    = df['day_of_week'].isin([5,6]).astype(int)
df['hour']          = df['date'].dt.hour
df['is_business_hr']= df['hour'].between(9, 17).astype(int)
df['month']         = df['date'].dt.month

Example 2: Ratio and interaction features

# For loan default prediction
df['debt_to_income']   = df['total_debt'] / (df['annual_income'] + 1)
df['payment_ratio']    = df['monthly_payment'] / (df['monthly_income'] + 1)
df['credit_util_sq']   = df['credit_utilization'] ** 2  # non-linear effect

Example 3: Target encoding for high-cardinality categoricals

# Replace categories with mean target value (careful: do inside CV folds)
import category_encoders as ce
encoder = ce.TargetEncoder(cols=['city', 'product_category'])
X_encoded = encoder.fit_transform(X_train, y_train)
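
The warning about doing target encoding inside CV folds deserves a concrete picture. The following is a minimal pure-Python sketch of out-of-fold target encoding, assuming a simple round-robin fold assignment; the function name and toy data are hypothetical and this is not how `category_encoders` implements it.

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=3):
    """Encode each row with its category's target mean computed on the OTHER folds."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if i % n_folds != fold:          # statistics come from rows outside the fold
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in range(n):
            if i % n_folds == fold:          # encode rows inside the fold
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded

cities = ["delhi", "delhi", "pune", "delhi", "pune", "goa"]
y      = [1, 0, 1, 1, 0, 1]
enc = oof_target_encode(cities, y, n_folds=3)
print(enc)
```

Because a row's own label never contributes to its encoding, the encoded feature cannot leak the target; categories unseen outside the fold fall back to the global mean.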

Q8. What is a confusion matrix and what can you derive from it?

                  Predicted Positive  Predicted Negative
Actual Positive        TP                  FN
Actual Negative        FP                  TN

From a confusion matrix you can compute:

  • Precision = TP / (TP + FP)
  • Recall / Sensitivity = TP / (TP + FN)
  • Specificity = TN / (TN + FP)
  • F1 = 2 * Precision * Recall / (Precision + Recall)
  • Accuracy = (TP + TN) / Total

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import numpy as np

cm = confusion_matrix(y_test, y_pred)
print("TP:", cm[1,1], "TN:", cm[0,0], "FP:", cm[0,1], "FN:", cm[1,0])

# Multiclass
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1', 'Class 2']))
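
As a worked check of the formulas above, the sketch below derives every metric from four raw counts. The counts are invented for illustration (an imbalanced problem with 120 positives out of 1,000 samples).

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive the standard metrics from the four confusion-matrix cells."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "accuracy": accuracy}

m = metrics_from_counts(tp=80, fp=20, fn=40, tn=860)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

Accuracy comes out at 0.94 while recall is only about 0.67, which is the usual argument for never reporting accuracy alone on imbalanced data.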

Q9. What is the difference between a parametric and non-parametric model?

| Property | Parametric | Non-Parametric |
| --- | --- | --- |
| Definition | Fixed number of parameters, regardless of data size | Number of parameters grows with data |
| Examples | Linear regression, logistic regression, Naive Bayes | KNN, decision trees, kernel SVM |
| Pros | Fast inference, interpretable, less data needed | Flexible, no distribution assumptions |
| Cons | Strong assumptions about data distribution | Slow at scale, prone to overfitting |
| Memory | O(parameters) = constant | O(training data) |

Interview insight: KNN is the classic non-parametric model. It stores the entire training set; prediction = majority vote among k nearest neighbors. At FAANG scale, this is impractical (billions of samples). Approximate nearest neighbor (FAISS, ScaNN) makes it tractable.
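
A brute-force KNN classifier makes the "stores the entire training set" point tangible. This is a minimal sketch for small data, with a hypothetical class name and toy points; real systems would use an approximate nearest neighbor index instead of sorting all distances.

```python
import math
from collections import Counter

class BruteForceKNN:
    """Keeps every training point; prediction scans all of them."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" is just memorizing the data: memory is O(n_samples)
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict_one(self, x):
        # Distance to every stored point, then majority vote among the k closest
        dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(self.X_, self.y_))
        votes = Counter(label for _, label in dists[: self.k])
        return votes.most_common(1)[0][0]

X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = ["a", "a", "a", "b", "b", "b"]

knn = BruteForceKNN(k=3).fit(X_train, y_train)
pred_near_origin = knn.predict_one((0.5, 0.5))
pred_far = knn.predict_one((5.5, 5.5))
print(pred_near_origin, pred_far)
```

Each prediction costs one distance computation per stored sample, which is exactly what becomes impractical at billions of rows.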


Q10. What is Naive Bayes and when does it work well despite its naive assumption?

P(y|x_1,...,x_n) ∝ P(y) * ∏ P(x_i|y)

The "naive" part is the independence assumption (features are almost never truly independent).

When it works well:

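  • Text classification (word features are clearly not independent, but the classifier only needs the correct class to score highest, and that ranking usually survives the violated assumption)
  • Spam detection
  • Very small datasets where complex models overfit
  • When you need a fast baseline
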
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Text classification pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2))),
    ('clf', MultinomialNB(alpha=0.1))   # alpha = additive smoothing (alpha=1.0 is Laplace)
])
text_clf.fit(X_train_text, y_train)
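
The formula P(y|x) ∝ P(y) * ∏ P(x_i|y) can be traced by hand on a tiny corpus. Below is a rough from-scratch sketch of multinomial Naive Bayes in log space with additive smoothing; the four training sentences are invented, and this is an illustration rather than scikit-learn's implementation.

```python
import math
from collections import Counter

train = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project review meeting notes", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    """log P(y) + sum_i log P(x_i | y), with additive smoothing alpha."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for w in text.split():
        if w in vocab:   # words never seen in training are skipped
            score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return score

def classify(text):
    return max(word_counts, key=lambda label: log_posterior(text, label))

pred_spam = classify("free money")
pred_ham = classify("monday meeting")
print(pred_spam, pred_ham)
```

Working in log space turns the product into a sum and avoids floating-point underflow on long documents.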

MEDIUM: Ensemble Methods and Evaluation (Questions 11-22)

Q11. Compare Random Forest, Gradient Boosting, XGBoost, and LightGBM.

| Property | Random Forest | Gradient Boosting | XGBoost | LightGBM |
| --- | --- | --- | --- | --- |
| Training | Parallel | Sequential | Sequential + regularization | Sequential + leaf-wise growth |
| Error reduced | Variance | Bias | Bias + variance | Bias + variance |
| Speed | Fast | Slow | Faster via histogram | Fastest (GOSS + EFB) |
| Best for | Wide datasets, interpretability | General tabular | Kaggle-winning, structured data | Large datasets, categorical data |
| Native categoricals | No | No | Optional (enable_categorical=True in recent versions) | Yes |
| GPU | No (by default) | No | Yes | Yes |

import lightgbm as lgb
import xgboost as xgb

# LightGBM -- preferred in 2026 for large tabular data
lgb_model = lgb.LGBMClassifier(
    n_estimators=1000, learning_rate=0.05, max_depth=-1,
    num_leaves=63, subsample=0.8, colsample_bytree=0.8,
    min_child_samples=20, reg_alpha=0.1, reg_lambda=0.1,
    n_jobs=-1, random_state=42
)
lgb_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])

# XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=1000, learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    reg_alpha=0.1, reg_lambda=1.0,
    tree_method='hist', device='cuda',
    eval_metric='logloss', early_stopping_rounds=50
)

Q12. How does gradient boosting work step by step?

Gradient boosting builds an additive model by fitting each new tree to the negative gradient (residuals) of the loss function:

Step 0: Initialize F_0(x) = argmin_γ Σ L(y_i, γ)
For m = 1 to M:
  1. Compute pseudo-residuals: r_im = -∂L(y_i, F_{m-1}(x_i)) / ∂F_{m-1}(x_i)
  2. Fit a regression tree T_m to r_im
  3. Update: F_m(x) = F_{m-1}(x) + η * T_m(x)

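For MSE loss (with the conventional 1/2 factor), the pseudo-residuals equal the actual residuals (y - y_hat). For log-loss, the pseudo-residuals are y - sigmoid(F(x)), where F(x) is the current raw score (log-odds), not a probability.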

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Manual gradient boosting for MSE loss
class SimpleGBM:
    def __init__(self, n_estimators=100, lr=0.1, max_depth=3):
        self.trees = []
        self.n_estimators = n_estimators
        self.lr = lr
        self.max_depth = max_depth

    def fit(self, X, y):
        self.F0 = y.mean()
        F = np.full(len(y), self.F0)
        for _ in range(self.n_estimators):
            residuals = y - F          # negative gradient of MSE
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)
            F += self.lr * tree.predict(X)

    def predict(self, X):
        return self.F0 + self.lr * sum(t.predict(X) for t in self.trees)
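
The class above covers MSE only. For the log-loss case, the claim that the negative gradient is y minus the sigmoid of the raw score can be verified numerically. This is a small sanity-check sketch, with arbitrary values for `y` and `f`, not a full classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, f):
    """Binary log-loss as a function of the raw score f (log-odds)."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pseudo_residual(y, f):
    """Negative gradient of log-loss with respect to the raw score f."""
    return y - sigmoid(f)

y, f, eps = 1, 0.3, 1e-6
# Central finite difference of the loss, negated to match the pseudo-residual
numeric = -(log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)
analytic = pseudo_residual(y, f)
print(numeric, analytic)
```

In a log-loss booster, each tree would be fit to these pseudo-residuals instead of `y - F`, and the final raw score would be passed through the sigmoid to get a probability.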

Q13. What is feature importance in tree models? How is it computed?

Tree models offer several feature importance measures:

| Method | How | Pros | Cons |
| --- | --- | --- | --- |
| Impurity (Gini/MSE) | Sum of impurity reduction by feature across all splits | Fast, built-in | Biased toward high-cardinality and numerical features |
| Permutation Importance | Measure accuracy drop when feature is shuffled | Model-agnostic, not biased by cardinality | Slow; can mislead when features are strongly correlated |
| SHAP Values | Game theory-based attribution | Consistent, handles interactions | Slow for large ensembles |

from sklearn.inspection import permutation_importance
import shap

# Impurity importance (fast, built-in)
importances = model.feature_importances_

# Permutation importance (unbiased)
perm_imp = permutation_importance(model, X_test, y_test,
                                   n_repeats=10, n_jobs=-1)
sorted_idx = perm_imp.importances_mean.argsort()[::-1]

# SHAP (gold standard in 2026)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
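
Permutation importance is simple enough to write by hand, which is a common follow-up question. The sketch below assumes a hypothetical already-trained model (a plain function that only looks at feature 0) and synthetic data, so the expected ranking is known in advance.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_manual(model, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)   # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical "trained" model that only uses feature 0
model = lambda row: int(row[0] > 0.5)

data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

imp = permutation_importance_manual(model, X, y)
print(imp)
```

Shuffling the feature the model ignores leaves accuracy untouched, so its importance is exactly zero, while shuffling the feature it depends on costs roughly half the accuracy.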

Q14. What is ROC-AUC vs PR-AUC? When does each metric mislead you?

ROC-AUC (Receiver Operating Characteristic)

  • Plots TPR (recall) vs FPR at all thresholds
  • AUC = probability that model ranks a random positive higher than a random negative
  • Misleads when: Severe class imbalance. A high AUC can hide poor minority-class performance because FPR looks small when TN >> FP

PR-AUC (Precision-Recall)

  • Plots Precision vs Recall at all thresholds
  • AUC = area under the precision-recall curve
  • Better for: Fraud detection, rare disease detection, any imbalanced binary problem

from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

y_scores = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, y_scores)
pr_auc  = average_precision_score(y_test, y_scores)

print(f"ROC-AUC: {roc_auc:.4f}")
print(f"PR-AUC:  {pr_auc:.4f}")

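# If class imbalance is severe (e.g., 1% positives):
# ROC-AUC = 0.95 can still coexist with low precision at any useful recall level,
# because even a small FPR produces many false positives when negatives dominate.
# PR-AUC exposes this, since precision is computed on predicted positives only.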
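
The ranking interpretation of ROC-AUC can be checked directly by counting positive-negative pairs. This is a minimal O(n_pos * n_neg) sketch on four toy scores; the function name is hypothetical and ties count as half a win.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive is scored higher."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_ranking(y_true, scores)
print(auc)   # 3 of the 4 positive-negative pairs are ordered correctly -> 0.75
```

Only the ordering of scores matters here, which is why ROC-AUC is unaffected by any monotonic rescaling of the model output.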

Q15. How do you handle imbalanced classes in machine learning?

| Strategy | Description | When to Use |
| --- | --- | --- |
| Class weights | Penalize minority class misclassification more | First thing to try; no data modification |
| Oversampling (SMOTE) | Synthesize new minority samples | When dataset is small |
| Undersampling | Remove majority class samples | When majority class is massive |
| Threshold tuning | Move decision threshold from 0.5 | Always in production |
| Focal Loss | Penalize easy examples less | Deep learning with severe imbalance |
| Ensemble with balanced subsampling | BalancedBaggingClassifier | Stable, general purpose |

from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

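# Strategy 1: class weighting via sample weights (no resampling needed;
# GradientBoostingClassifier has no class_weight parameter, so weights go into fit)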
model = GradientBoostingClassifier()
sample_weights = compute_sample_weight('balanced', y_train)
model.fit(X_train, y_train, sample_weight=sample_weights)

# Strategy 2: SMOTE oversampling pipeline
smote_pipeline = ImbPipeline([
    ('oversample', SMOTE(k_neighbors=5, random_state=42)),
    ('model', GradientBoostingClassifier())
])

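# Strategy 3: Threshold tuning (pick the threshold that maximizes F1 on validation data)
import numpy as np
from sklearn.metrics import precision_recall_curve
prec, rec, thresh = precision_recall_curve(y_val, model.predict_proba(X_val)[:,1])
# prec and rec have one more element than thresh; drop the final (recall = 0) point
f1_scores = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_thresh = thresh[np.argmax(f1_scores)]
y_pred_tuned = (model.predict_proba(X_test)[:,1] >= best_thresh).astype(int)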

Q16. What is a learning curve and how do you use it for model diagnosis?

| Pattern | Diagnosis | Fix |
| --- | --- | --- |
| Both curves low and flat | High bias / underfitting | More complex model, better features |
| Large gap, train high, val low | High variance / overfitting | More data, regularization |
| Val curve still rising | Model benefits from more data | Collect more data |
| Both curves high and close | Well-fit | Ship it |

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

train_sizes = np.linspace(0.1, 1.0, 10)
train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel='rbf', C=10), X, y,
    train_sizes=train_sizes, cv=5,
    scoring='accuracy', n_jobs=-1
)

print("Train:", train_scores.mean(axis=1).round(3))
print("Val:  ", val_scores.mean(axis=1).round(3))

Q17. How does principal component analysis (PCA) work?

Steps:

  1. Standardize features (zero mean, unit variance)
  2. Compute covariance matrix: C = X^T X / (n-1)
  3. Eigen-decompose C: eigenvectors = principal components, eigenvalues = variance explained
  4. Project data onto top-k eigenvectors

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Always scale first
X_scaled = StandardScaler().fit_transform(X)

# Retain 95% of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"PCA components:    {X_pca.shape[1]}")
print(f"Variance explained per component: {pca.explained_variance_ratio_}")
print(f"Cumulative variance: {pca.explained_variance_ratio_.cumsum()}")

Use PCA for: Dimensionality reduction before clustering, visualization (PCA to 2D), removing multicollinearity, speeding up downstream models.

Do NOT use PCA for: Interpretability (components are linear combinations of all features), if features are already low-dimensional.
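
The four PCA steps can be followed by hand for two features, where the eigen-decomposition has a closed form. This sketch uses only the standard library and invented data; it standardizes with the sample standard deviation so that the covariance matrix is exactly [[1, r], [r, 1]].

```python
import math
import statistics

x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]

def standardize(v):
    mu, sd = statistics.fmean(v), statistics.stdev(v)   # sample std (n - 1)
    return [(a - mu) / sd for a in v]

# Step 1: standardize
z1, z2 = standardize(x1), standardize(x2)
n = len(z1)

# Step 2: covariance of standardized data is [[1, r], [r, 1]]
r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)

# Step 3: eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r,
# with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2)
eigvals = sorted([1 + r, 1 - r], reverse=True)
explained_ratio = [ev / sum(eigvals) for ev in eigvals]

# Step 4: project onto the first principal component
sign = 1.0 if r >= 0 else -1.0
pc1 = [(a + sign * b) / math.sqrt(2) for a, b in zip(z1, z2)]

print(r, explained_ratio, statistics.variance(pc1))
```

With a correlation of 0.8 the first component carries 90% of the variance, and the variance of the projected data equals the top eigenvalue, which is what "eigenvalues = variance explained" means.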


Q18. What is the difference between classification and regression? When does the boundary blur?

| Aspect | Classification | Regression |
| --- | --- | --- |
| Output | Discrete class label | Continuous value |
| Loss functions | Cross-entropy, Hinge loss | MSE, MAE, Huber |
| Evaluation | Accuracy, F1, AUC-ROC | RMSE, MAE, R² |
| Examples | Spam detection, fraud | House price, demand forecasting |

When the boundary blurs:

  • Ordinal regression: Target is ordered categories (1-5 star rating). Can model as regression or multi-class classification; ordinal-specific methods do better.
  • Threshold classification: Take regression output and threshold it (e.g., predict churn probability > 0.6 = churn)
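  • Probability outputs: A classifier that outputs class probabilities is regressing a continuous score internally; the label comes from thresholding that score

# Ordinal regression
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class OrdinalClassifier:
    """Frank-Hall ordinal encoding: train k-1 binary classifiers, one per threshold."""
    def __init__(self, base_clf=None):
        self.base_clf = base_clf or LogisticRegression()
        self.clfs_ = []
        self.classes_ = None

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.clfs_ = []
        for c in self.classes_[:-1]:
            binary_y = (y > c).astype(int)   # "is the label above class c?"
            clf = clone(self.base_clf)
            clf.fit(X, binary_y)
            self.clfs_.append(clf)
        return self

    def predict_proba(self, X):
        # gt[:, i] = P(y > c_i); class probabilities are differences of adjacent columns
        gt = np.column_stack([clf.predict_proba(X)[:, 1] for clf in self.clfs_])
        upper = np.column_stack([np.ones(len(gt)), gt])
        lower = np.column_stack([gt, np.zeros(len(gt))])
        return upper - lower

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]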

Q19. What is a pipeline in scikit-learn and why is it critical for production?

  1. Prevents data leakage: Preprocessing steps (scaling, imputation) are fit only on training data, never on test data
  2. Clean code: One object to fit, predict, persist
  3. Hyperparameter tuning is unified: GridSearchCV can tune pipeline parameters

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

num_features = ['age', 'income', 'tenure']
cat_features = ['city', 'product_type']

# Preprocessing for numerical features
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])

# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05))
])

full_pipeline.fit(X_train, y_train)

Q20. Explain SHAP values and why they are the standard for model explainability in 2026.

  • Consistent: If a feature's impact increases, its SHAP value increases. Impurity-based importance violates this.
  • Local and global: Explain individual predictions AND aggregate into global importance.
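  • Model-agnostic: Works for any ML model; specialized implementations (TreeSHAP) are exact and fast for trees.

import numpy as np
import shap

# Fast exact SHAP for tree models (TreeSHAP)
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_test)

# The return shape depends on the SHAP version: older releases return a list with
# one array per class, newer ones may return a single array. Select the
# positive-class attributions in either case.
if isinstance(shap_values, list):
    sv_pos = shap_values[1]
elif shap_values.ndim == 3:
    sv_pos = shap_values[:, :, 1]
else:
    sv_pos = shap_values
base_value = np.ravel(explainer.expected_value)[-1]

# Global: which features matter most?
shap.summary_plot(sv_pos, X_test, feature_names=feature_names)

# Local: why did this specific prediction happen?
shap.waterfall_plot(explainer(X_test)[0])

# Force plot: visual explanation for one instance
shap.force_plot(base_value, sv_pos[0], X_test.iloc[0])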

Production pattern (Amazon, Flipkart): Attach SHAP explanations to fraud alerts and loan rejection decisions. Regulators in India (RBI) require explainable credit decisions.


Q21. What is A/B testing and how does it relate to ML model evaluation?

Steps:

  1. Define hypothesis (H0: models perform equally; H1: new model is better)
  2. Calculate required sample size (power analysis)
  3. Run experiment until n_min reached
  4. Use hypothesis test (t-test, z-test, or Mann-Whitney U) to determine significance
  5. Never stop early based on current significance (p-hacking)

from scipy import stats
import numpy as np

# Conversion rates
control_conversions   = 1200
control_n             = 15000
treatment_conversions = 1350
treatment_n           = 15000

p_control   = control_conversions / control_n
p_treatment = treatment_conversions / treatment_n

# Two-proportion z-test
from statsmodels.stats.proportion import proportions_ztest
count = np.array([treatment_conversions, control_conversions])
nobs  = np.array([treatment_n, control_n])
stat, pvalue = proportions_ztest(count, nobs, alternative='larger')
print(f"p-value: {pvalue:.4f}")
print("Reject H0" if pvalue < 0.05 else "Cannot reject H0")
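
Step 2 (power analysis) is the part candidates most often skip. Here is a rough sketch of the per-group sample size for a one-sided two-proportion z-test using the usual normal approximation; the function name is hypothetical, and the 8% and 9% rates mirror the conversion counts used above.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group n for a one-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)        # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n_min = sample_size_two_proportions(0.08, 0.09)        # detect 8% -> 9%
n_min_big_effect = sample_size_two_proportions(0.08, 0.10)
print(n_min, n_min_big_effect)
```

Under these assumptions the required size is a little under 10,000 users per group, so the 15,000-per-group experiment above is adequately powered; halving the detectable effect would roughly quadruple the requirement.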

Q22. What is stacking (stacked generalization) and how does it differ from bagging and boosting?

| Ensemble | How | When to Use |
| --- | --- | --- |
| Bagging | Average parallel models on bootstrap samples | Reduce variance (RF) |
| Boosting | Sequential models, each correcting the last | Reduce bias (XGB) |
| Stacking | Train meta-learner on base model predictions | Combine diverse models, squeeze last few % |

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

base_models = [
    ('rf',  RandomForestClassifier(n_estimators=200, n_jobs=-1)),
    ('gb',  GradientBoostingClassifier(n_estimators=200)),
    ('svm', SVC(probability=True, kernel='rbf'))
]

# Meta-learner trained on base model out-of-fold predictions
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(C=0.1),
    cv=5,
    stack_method='predict_proba'
)
stack.fit(X_train, y_train)

HARD: Advanced Topics and System Design (Questions 23-30)

Q23. How do you design an ML pipeline for production? What are the key components?

Production ML Pipeline Architecture (2026):

Data Layer:
  Raw ingestion (Kafka/S3) -> Feature Store (Feast/Tecton) -> Training data snapshots

Training Layer:
  Experiment tracking (MLflow/W&B) -> Hyperparameter tuning (Optuna) -> Model registry

Serving Layer:
  REST API (FastAPI) or batch scoring -> Model versioning -> Canary/shadow deployment

Monitoring Layer:
  Data drift (EvidentlyAI) -> Model performance -> Alerting -> Retraining triggers

import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("churn-prediction-v2")

with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)

    val_roc = roc_auc_score(y_val, model.predict_proba(X_val)[:,1])

    # Log hyperparameters and the validation metric so runs are comparable
    mlflow.log_params(model.get_params())
    mlflow.log_metric("val_roc_auc", val_roc)

    # The signature records input/output schema for serving-time validation
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature)

Q24. What is concept drift and how do you detect and handle it?

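| Type | Definition | Example |
| --- | --- | --- |
| Sudden drift | Abrupt change | New government policy changes fraud patterns |
| Gradual drift | Slow shift | Consumer behavior shifts over months |
| Recurring drift | Cyclic change | Seasonal patterns in retail |
| Feature (covariate) drift | Input distribution P(x) shifts; strictly data drift rather than a change in P(y given x) | Economic crisis shifts income distributions |
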
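# Note: these import paths match older Evidently releases; newer releases may have
# reorganized the package, so check the docs for the version you install.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset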

# Compare reference (training) vs current (production) data
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=X_train_df, current_data=X_production_df)
report.save_html("drift_report.html")

# Statistical test: KS test for continuous features
from scipy.stats import ks_2samp
for col in numerical_features:
    stat, pval = ks_2samp(X_train[col], X_production[col])
    if pval < 0.05:
        print(f"Drift detected in {col}: p={pval:.4f}")

Response to drift:

  1. Retrain on recent data (rolling window)
  2. Continuously fine-tune on production stream
  3. Use online learning for fast-drift scenarios
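
The KS test used in the detection code compares two empirical CDFs. To show what the statistic measures, here is a minimal pure-Python sketch of the two-sample KS statistic only (no p-value); the function name and toy samples are illustrative assumptions.

```python
import bisect

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # e.g. a feature at training time
shifted   = [x + 5 for x in reference]        # moderate shift in production
disjoint  = [x + 20 for x in reference]       # distributions no longer overlap

d_same = ks_statistic(reference, reference)
d_shifted = ks_statistic(reference, shifted)
d_disjoint = ks_statistic(reference, disjoint)
print(d_same, d_shifted, d_disjoint)
```

A statistic near 0 means the distributions match and a value near 1 means they barely overlap; a retraining trigger would typically combine this with a minimum sample size so that tiny shifts on huge samples do not fire alerts.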

Q25. Explain the curse of dimensionality and its practical impact on ML.

  1. Sparsity: Data points are far apart. In d dimensions, to have the same density as 10 points in 1D, you need 10^d points.
  2. Distance concentration: As d grows, the ratio of max-to-min distance among points converges to 1. Distance metrics become uninformative for KNN.
  3. Computational explosion: Feature combinations grow exponentially.

Practical impact:

# Distance concentration demonstration
import numpy as np

np.random.seed(42)
for d in [2, 10, 100, 1000]:
    X = np.random.randn(1000, d)
    x = np.random.randn(d)
    dists = np.linalg.norm(X - x, axis=1)
    ratio = dists.max() / dists.min()
    print(f"d={d:5d}  max/min distance ratio = {ratio:.2f}")
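# Typical result: the ratio is large at d=2 (often 50 or more) and close to 1.1-1.2 at d=1000;
# exact values depend on the random seed.
# At d=1000, max and min distances are nearly equal -> nearest neighbors stop being informative for KNN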

Mitigations: PCA, autoencoders, feature selection (mutual information, LASSO), domain knowledge to drop irrelevant features.


Q26. What is Bayesian optimization and why is it better than grid search for hyperparameter tuning?

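| Method | How | Evals Needed | Smart? |
| --- | --- | --- | --- |
| Grid search | Try all combinations | O(n^k) for k hyperparameters with n values each | No |
| Random search | Random samples | Fixed budget you choose | No |
| Bayesian (TPE, GP) | Exploit + explore based on history | Typically 50-100 trials | Yes |
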
import optuna

def objective(trial):
    params = {
        'n_estimators':   trial.suggest_int('n_estimators', 100, 1000),
        'max_depth':      trial.suggest_int('max_depth', 3, 12),
        'learning_rate':  trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample':      trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0)
    }
    # XGBoost 2.0+ takes early_stopping_rounds in the constructor, not in fit()
    model = xgb.XGBClassifier(**params, eval_metric='logloss', verbosity=0,
                              early_stopping_rounds=30)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:,1])

study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100, n_jobs=4)
print("Best params:", study.best_params)
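
To see why grid search scales poorly on the same five hyperparameters, the sketch below counts the evaluations a modest grid would need. The candidate values are hypothetical, chosen only to span the ranges used in the Optuna example.

```python
import math

# Hypothetical grid: a handful of candidate values per hyperparameter
grid = {
    'n_estimators':     [100, 300, 500, 700, 1000],
    'max_depth':        [3, 6, 9, 12],
    'learning_rate':    [0.01, 0.03, 0.1, 0.3],
    'subsample':        [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Grid search trains one model per combination of values
grid_evals = math.prod(len(values) for values in grid.values())
bayesian_budget = 100   # matches n_trials in the Optuna study above

print(f"Grid search evaluations: {grid_evals}")   # 720
print(f"Bayesian optimization budget: {bayesian_budget}")
```

Even this coarse grid needs 720 model fits, and adding one more hyperparameter multiplies the total again, while the Bayesian budget stays whatever you set it to.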

Q27. How does logistic regression actually work? Derive the update rule.

σ(z) = 1 / (1 + e^{-z})
P(y|x) = σ(w^T x)^y * (1 - σ(w^T x))^(1-y)

Negative log-likelihood (the cross-entropy loss):

L(w) = -Σ [y_i * log σ(w^T x_i) + (1-y_i) * log(1 - σ(w^T x_i))]

Gradient:

∂L/∂w = Σ (σ(w^T x_i) - y_i) * x_i = X^T (y_hat - y)

Update rule (gradient descent):

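w ← w - η * X^T (y_hat - y) / n

Dividing by n averages the gradient over the n samples, so the step size does not grow with the dataset.
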
import numpy as np

class LogisticRegressionFromScratch:
    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n, d = X.shape
        self.weights = np.zeros(d)
        self.bias = 0.0
        for _ in range(self.n_iters):
            y_hat = self.sigmoid(X @ self.weights + self.bias)
            dw = X.T @ (y_hat - y) / n
            db = (y_hat - y).mean()
            self.weights -= self.lr * dw
            self.bias    -= self.lr * db

    def predict_proba(self, X):
        return self.sigmoid(X @ self.weights + self.bias)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
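
A quick way to trust the derived gradient X^T (y_hat - y) is to compare it with a finite-difference estimate of the loss. The sketch below does that on random data; the helper names and the data are made up for the check, and the bias term is omitted to match the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Negative log-likelihood (summed, not averaged)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, X, y):
    """Gradient from the derivation: X^T (y_hat - y)."""
    return X.T @ (sigmoid(X @ w) - y)

def numerical_grad(w, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_grad(w, X, y) - numerical_grad(w, X, y)))
print(f"max |analytic - numerical| = {max_diff:.2e}")
```

If the two gradients disagree by more than numerical noise, the derivation or the implementation has a sign or indexing error. The same check works for any hand-derived gradient.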

Q28. What is the VC dimension and why does it matter for generalization?

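The VC dimension of a hypothesis class is the size of the largest set of points the class can shatter, that is, classify correctly under every one of the 2^m possible labelings of m points.

  • Linear classifier in R^d: VC dimension = d+1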
  • Sine function sin(wx): VC dimension = infinity (can shatter arbitrarily many points by tuning w)
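
The d+1 result for linear classifiers can be checked by brute force in 2D. The sketch below counts how many labelings of a point set a linear classifier with bias can realize, using the perceptron as a separability test. The epoch cap is an assumption: failing to converge within it is treated as "not separable", which holds for these tiny point sets but is not a proof in general.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Perceptron with bias; returns True if it finds a separating line."""
    w = [0.0, 0.0, 0.0]                       # [w1, w2, bias]
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            score = w[0] * x1 + w[1] * x2 + w[2]
            if y * score <= 0:                # misclassified or on the boundary
                w[0] += y * x1
                w[1] += y * x2
                w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def count_separable(points):
    """Number of +1/-1 labelings a linear classifier can realize."""
    return sum(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]

n_three = count_separable(three_points)
n_four = count_separable(four_points)
print(f"3 points: {n_three} of 8 labelings")    # 8 of 8: shattered
print(f"4 points: {n_four} of 16 labelings")    # 14 of 16: the two XOR labelings fail
```

Three points are shattered and four are not, consistent with a VC dimension of d+1 = 3 for d = 2.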

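Why it matters: Generalization bound (PAC learning). In simplified form, with constants and the confidence term omitted, the following holds with high probability:

Error_test <= Error_train + sqrt(VC_dim * log(n) / n)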

Higher VC dimension = more complex hypothesis class = higher generalization gap at fixed n.
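
Plugging numbers into the simplified bound shows both effects. The values below are for trend only, since the formula omits constants; the helper name `gap_term` is made up for this sketch.

```python
import math

def gap_term(vc_dim, n):
    """Complexity term sqrt(VC_dim * log(n) / n) from the simplified bound."""
    return math.sqrt(vc_dim * math.log(n) / n)

# Linear classifier in R^10 has VC dimension 11; in R^100 it is 101
for vc_dim in [11, 101]:
    for n in [1_000, 100_000]:
        print(f"VC_dim={vc_dim:3d}  n={n:7,d}  gap term = {gap_term(vc_dim, n):.3f}")
```

The term shrinks as n grows and rises with VC dimension, which is the trade-off the bound expresses.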

Interview insight: By this bound, deep neural networks with billions of parameters should overfit badly, yet in practice they generalize well. Common explanations are implicit regularization from SGD, early stopping, and dropout, which are thought to keep the effective capacity far below what the parameter count suggests. Why this works is still an open theoretical problem, and FAANG research interviewers love asking about it.


Q29. Design a recommendation system. What ML components are involved?

Architecture layers:

1. Candidate Generation (fast, recall-focused)
   - Matrix Factorization: user embedding @ item embedding
   - Two-Tower Neural Network (BERT embeddings for items)
   - Approximate Nearest Neighbor: FAISS, ScaNN
   - Goal: get top 1,000 candidates from millions of items

2. Ranking (slow, precision-focused)
   - Features: user history, item metadata, context (time, device)
   - Model: LightGBM or Wide & Deep Neural Network
   - Objective: P(click|user, item, context), P(purchase|...)
   - Goal: rank top 1,000 candidates, return top 10-50

3. Re-ranking (business logic)
   - Diversity: avoid showing all same-category items
   - Freshness boost for new items
   - Business constraints (promoted items, inventory)
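
The diversity rule in the re-ranking layer can be sketched as a greedy pass with a per-category cap. The function name, the tuple layout, and the sample items are hypothetical; real systems often use softer penalties rather than a hard cap.

```python
def rerank_with_category_cap(ranked_items, max_per_category, k):
    """Walk the ranked list top-down, skipping items whose category is full."""
    counts = {}
    result = []
    for item_id, category, _score in ranked_items:
        if counts.get(category, 0) >= max_per_category:
            continue                          # category already at its cap
        result.append(item_id)
        counts[category] = counts.get(category, 0) + 1
        if len(result) == k:
            break
    return result

# Output of the ranking stage, already sorted by score (descending)
ranked = [
    ("a", "shoes", 0.9),
    ("b", "shoes", 0.8),
    ("c", "shoes", 0.7),
    ("d", "hats",  0.6),
    ("e", "bags",  0.5),
]

final = rerank_with_category_cap(ranked, max_per_category=2, k=4)
print(final)   # ['a', 'b', 'd', 'e']
```

Item "c" scores higher than "d" and "e" but is dropped because two shoes are already in the list.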
# Two-tower model sketch (simplified)
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, user_vocab, item_vocab, embed_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Embedding(user_vocab, embed_dim),
            nn.Flatten(),
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 64)
        )
        self.item_tower = nn.Sequential(
            nn.Embedding(item_vocab, embed_dim),
            nn.Flatten(),
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 64)
        )

    def forward(self, user_ids, item_ids):
        u = self.user_tower(user_ids)
        i = self.item_tower(item_ids)
        return (u * i).sum(dim=-1)  # dot product similarity
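
At serving time, the item tower's outputs are usually precomputed and candidate generation becomes a nearest-neighbour lookup. The sketch below does this by brute force in NumPy as a stand-in for an ANN index such as FAISS or ScaNN. The random embeddings and the helper name `top_k_items` are assumptions for illustration.

```python
import numpy as np

def top_k_items(user_vec, item_matrix, k):
    """Return indices of the k items with the highest dot-product score."""
    scores = item_matrix @ user_vec
    candidate_idx = np.argpartition(-scores, k - 1)[:k]   # unordered top k
    order = np.argsort(-scores[candidate_idx])            # sort only those k
    return candidate_idx[order]

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(10_000, 64))   # stand-in for item tower outputs
user_embedding = rng.normal(size=64)              # stand-in for user tower output

top_ids = top_k_items(user_embedding, item_embeddings, k=10)
print(top_ids)
```

Brute force scans every item on each request, which is fine for thousands of items. At millions of items, an approximate index trades a little recall for much lower latency.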

Q30. What is multi-armed bandit and when is it better than A/B testing?

| Aspect | A/B Test | Multi-Armed Bandit |
|---|---|---|
| Traffic allocation | Fixed (50/50) | Dynamic (shift to winner) |
| Regret | High (bad arm runs full duration) | Lower (bad arm starved quickly) |
| Statistical rigor | Formal hypothesis testing | Less formal, regret-minimization |
| Best for | Stable business decisions | Real-time optimization (ads, pricing) |

import numpy as np

class EpsilonGreedyBandit:
    """Epsilon-greedy MAB: explore with probability epsilon."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts  = np.zeros(n_arms)
        self.values  = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # explore
        return np.argmax(self.values)               # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean

# Thompson Sampling for Bernoulli (0/1) rewards, one Beta posterior per arm (often better in practice)
class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta  = np.ones(n_arms)  # failures + 1

    def select_arm(self):
        samples = np.random.beta(self.alpha, self.beta)
        return np.argmax(samples)

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm]  += 1 - reward
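
The regret row of the comparison can be illustrated with a small simulation. This sketch re-implements a fixed 50/50 split and epsilon-greedy in one function so it runs on its own. The conversion rates, horizon, and seed are made-up values.

```python
import random

def simulate(policy, true_rates, horizon, epsilon=0.1, seed=0):
    """Return cumulative expected regret for 'ab_test' or 'epsilon_greedy'."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    best_rate = max(true_rates)
    regret = 0.0
    for t in range(horizon):
        if policy == "ab_test":
            arm = t % n_arms                          # fixed, even split
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)               # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        regret += best_rate - true_rates[arm]         # expected loss of this pull
    return regret

true_rates = [0.2, 0.5]   # hypothetical conversion rates; arm 1 is better
ab_regret = simulate("ab_test", true_rates, horizon=5000)
bandit_regret = simulate("epsilon_greedy", true_rates, horizon=5000)
print(f"A/B test regret:       {ab_regret:.1f}")      # 750.0
print(f"Epsilon-greedy regret: {bandit_regret:.1f}")
```

The A/B split keeps sending half the traffic to the weaker arm for the whole run, so its regret grows linearly at a fixed rate. Epsilon-greedy pays mainly for its exploration pulls once it has identified the better arm.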

Comparison Table: ML Algorithms at a Glance

| Algorithm | Type | Scale | Interpretable | Handles Non-linearity | 2026 Status |
|---|---|---|---|---|---|
| Linear Regression | Regression | Large | Yes | No | Baseline |
| Logistic Regression | Classification | Large | Yes | No | Strong baseline |
| Decision Tree | Both | Medium | Yes | Yes | Rarely solo |
| Random Forest | Both | Large | Partial | Yes | Strong |
| XGBoost | Both | Large | Partial | Yes | Industry standard |
| LightGBM | Both | Very Large | Partial | Yes | Fastest tabular |
| SVM | Both | Medium | No | Yes (kernel) | Fading |
| KNN | Both | Small | Yes | Yes | Prototyping |
| Naive Bayes | Classification | Large | Yes | No | NLP baseline |
| Neural Network | Both | Very Large | No | Yes | Complex tasks |

FAQ

Q: What is the most important ML algorithm to know for 2026 interviews? A: Gradient boosting variants (XGBoost, LightGBM) for tabular data; transformers for sequential/unstructured data. Know both categories deeply.

Q: How much math do I need for ML interviews at product companies? A: Linear algebra (matrix operations), calculus (chain rule, partial derivatives), probability (Bayes theorem, distributions), and statistics (hypothesis testing, confidence intervals). You will not be asked to prove convergence theorems.

Q: What is the difference between a data scientist and ML engineer role? A: Data scientists focus on analysis, experimentation, and model building. ML engineers focus on deploying models at scale, building feature pipelines, and maintaining production systems. The boundary is blurring in 2026.

Q: Which Python libraries should I know for ML interviews? A: scikit-learn (must), XGBoost and LightGBM (must), pandas and NumPy (must), SHAP (expected at senior level), Optuna (good to know), MLflow (good to know for MLOps roles).

Q: Is deep learning tested in ML engineer interviews? A: Depends on the role. For ML engineer and data scientist roles at product companies, deep learning basics (neural nets, backprop, regularization) are fair game. Advanced topics (transformers, fine-tuning, distributed training) are tested for ML researcher and AI/ML engineer roles.


Methodology applied to this article (last verified 8 Jun 2026)
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
