Machine Learning Interview Questions 2026: 30 Answers with Code

What changed in 2026 drives
Mass-recruiter offer letters are flatter for the 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
- One real project you can defend end-to-end - file paths, design decisions, and what you would change
- One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Machine learning roles are the fastest-growing engineering track in 2026. From product-based companies in India to global FAANG hiring, ML interview rounds have expanded beyond theoretical statistics into practical model building, production pipeline design, and real-world tradeoff reasoning. This guide covers 30 questions with full answers, Python code, and comparison tables across the complete difficulty spectrum.
PapersAdda's take: The ML interview in 2026 rewards engineers who can explain what happens inside the black box AND build a working pipeline. Theory without code = red flag. Code without intuition = another red flag. This guide trains both. Candidate-reported feedback from public preparation resources consistently flags that interviewers at product companies follow up any algorithm question with "how would you put this in production?" Candidates report that gradient boosting and model evaluation metrics appear in over 80% of shortlists. Confirm exact interview formats on the official company careers portal before you prepare.
Which Companies Ask These Questions?
| Topic Cluster | Companies |
|---|---|
| Supervised Learning Fundamentals | All product companies, all FAANG |
| Ensemble Methods (RF, XGBoost) | Google, Amazon, Flipkart, PhonePe, Swiggy |
| Feature Engineering | Uber, LinkedIn, all ML-heavy teams |
| Model Evaluation and Metrics | All data science roles |
| ML System Design | Google, Meta, Amazon senior rounds |
| Clustering and Unsupervised | All data science roles |
| Pipelines and Production | MLOps roles, Databricks, AWS |
EASY: Core Concepts (Questions 1-10)
PapersAdda's note: These are the questions that separate a prepared candidate from an unprepared one in the first 10 minutes. Get them cold.
Q1. What is machine learning? How is it different from traditional programming?
| Aspect | Traditional Programming | Machine Learning |
|---|---|---|
| Input | Rules + Data | Data + Expected Output |
| Output | Program output | Rules (learned model) |
| Maintenance | Update rules manually | Retrain on new data |
| Works well when | Rules are known and stable | Rules are complex or unknown |
Machine learning is the practice of building systems that learn patterns from data rather than being explicitly programmed. The program writes itself from examples.
# Traditional programming: manually coded rule
def is_spam(email):
    if "free money" in email.lower() or "click here" in email.lower():
        return True
    return False
# Machine learning: model learns the rule from data
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(training_emails)
model = LogisticRegression()
model.fit(X_train, labels) # model discovers the rule from data
Q2. What is the difference between a parameter and a hyperparameter?
| Term | Definition | Who Sets It | Examples |
|---|---|---|---|
| Parameter | Learned from training data | Optimizer | Weights, biases in a neural net; coefficients in linear regression |
| Hyperparameter | Configured before training | You | Learning rate, n_estimators, max_depth, regularization strength |
Why it matters in interviews: Hyperparameter tuning is a core skill. Know GridSearchCV, RandomizedSearchCV, and Optuna (Bayesian optimization, 2026 standard).
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import scipy.stats as stats
param_dist = {
    'n_estimators': stats.randint(50, 500),
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': stats.randint(2, 20),
    'max_features': ['sqrt', 'log2', 0.3]
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_distributions=param_dist,
    n_iter=50, cv=5, scoring='f1', n_jobs=-1, random_state=42
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
Q3. Explain overfitting and underfitting. How do you detect and fix each?
| Problem | Definition | Symptom | Fix |
|---|---|---|---|
| Overfitting | Model memorizes training noise | High train accuracy, low test accuracy | More data, regularization, dropout, simpler model |
| Underfitting | Model too simple for data | Low train AND test accuracy | More capacity, better features, more epochs |
| Good fit | Model captures true pattern | High train AND test accuracy | Keep |
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
def plot_learning_curve(model, X, y):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring='accuracy',
        train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
    )
    # If val_scores plateau far below train_scores: overfitting
    # If both plateau at low score: underfitting
    return train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
Regularization is the primary fix for overfitting:
from sklearn.linear_model import Ridge, Lasso, ElasticNet
ridge = Ridge(alpha=10.0) # L2: shrinks all weights
lasso = Lasso(alpha=0.1) # L1: zeros irrelevant features (feature selection)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5) # Both
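To see why L2 shrinks every weight while L1 zeros the small ones, here is a minimal one-dimensional sketch. It assumes simplified objectives (stated in the comments) rather than scikit-learn's exact scaling, and the helper names `ridge_shrink` and `lasso_shrink` are made up for illustration.

```python
# Assumed one-dimensional objectives, where w_ols is the unregularized weight:
#   ridge: minimize 0.5*(w - w_ols)**2 + 0.5*alpha*w**2  ->  w = w_ols / (1 + alpha)
#   lasso: minimize 0.5*(w - w_ols)**2 + alpha*abs(w)    ->  soft-thresholding

def ridge_shrink(w_ols, alpha):
    # L2 rescales the weight; it never reaches exactly zero
    return w_ols / (1.0 + alpha)

def lasso_shrink(w_ols, alpha):
    # L1 subtracts alpha from the magnitude and clips at zero
    magnitude = max(abs(w_ols) - alpha, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if w_ols > 0 else -magnitude

ols_weights = [3.0, -1.5, 0.2, -0.05]
alpha = 0.5
ridge_weights = [ridge_shrink(w, alpha) for w in ols_weights]
lasso_weights = [lasso_shrink(w, alpha) for w in ols_weights]
print("ridge:", ridge_weights)
print("lasso:", lasso_weights)
```

The two small weights survive under ridge (only scaled down) but become exactly zero under lasso, which is the feature-selection effect mentioned above.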
Q4. What is feature scaling and when is it required?
| Algorithm | Needs Scaling? | Why |
|---|---|---|
| Linear/Logistic Regression | Yes | Gradient descent converges faster |
| SVM | Yes | Margin depends on distance |
| KNN | Yes | Distance metric is scale-sensitive |
| Decision Tree / Random Forest | No | Split thresholds are scale-invariant |
| XGBoost / LightGBM | No | Tree splits are invariant |
| Neural Networks | Yes | Gradient flow is scale-sensitive |
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler: (x - mean) / std -- best for normally distributed features
scaler = StandardScaler()
# MinMaxScaler: (x - min) / (max - min) -- when you need values in [0,1]
min_max = MinMaxScaler()
# RobustScaler: (x - median) / IQR -- resistant to outliers
robust = RobustScaler()
# Always fit only on training data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # transform only, no fit
Q5. What are the types of cross-validation and when do you use each?
| Type | Description | Use When |
|---|---|---|
| K-Fold | Split into k folds; rotate | Standard for classification/regression |
| Stratified K-Fold | Maintain class proportion per fold | Imbalanced classification |
| Leave-One-Out (LOO) | n-fold CV; each sample is a fold | Very small datasets |
| Time Series Split | Train on past; validate on future | Any time series; NEVER shuffle |
| Group K-Fold | Samples from same group never split across folds | Patient data, user-level data |
from sklearn.model_selection import (StratifiedKFold, TimeSeriesSplit,
                                     GroupKFold, cross_val_score)
from sklearn.ensemble import GradientBoostingClassifier
# Standard imbalanced classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=skf, scoring='roc_auc')
# Time series -- NEVER shuffle; rows must already be sorted by time
tscv = TimeSeriesSplit(n_splits=5)
# A classifier needs a classification metric; switch to a regressor with
# scoring='neg_mean_squared_error' when the target is continuous
ts_scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=tscv, scoring='roc_auc')
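The "train on past, validate on future" rule is easier to remember once you see the indices. The following is a rough sketch of an expanding-window splitter in plain Python; `expanding_window_splits` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs where every train index precedes every val index."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        # The training window grows with each split; the validation block follows it
        train_end = n_samples - (n_splits - k + 1) * fold_size
        val_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))

splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, val_idx in splits:
    print("train:", train_idx, "val:", val_idx)
```

Because no validation index ever appears before a training index, the model is never evaluated on data older than what it was trained on.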
Q6. How do you handle missing values in a dataset?
| Strategy | When to Use | Risk |
|---|---|---|
| Drop rows | Missing rate < 5%, data is large | Information loss |
| Mean/Median imputation | Numerical, MCAR assumption | Distorts variance |
| Mode imputation | Categorical | Distorts distribution |
| KNN imputation | Small-medium datasets, correlated features | Slow at scale |
| Model-based (IterativeImputer) | Complex missingness patterns | High compute |
| Forward fill / Back fill | Time series | Only if temporal relationship holds |
| Add a missing indicator column | When missingness is informative | Adds one column per feature; cheap, so always consider it |
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df = pd.read_csv("data.csv")
# Check missing pattern
print(df.isnull().sum() / len(df) * 100)
# Median imputation (robust to outliers vs mean)
num_imputer = SimpleImputer(strategy='median')
# KNN imputation -- respects feature correlations
knn_imputer = KNNImputer(n_neighbors=5)
# Always add indicator for informative missingness
df['age_missing'] = df['age'].isnull().astype(int)
Q7. What is feature engineering? Give three practical examples.
Example 1: Date decomposition
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek # 0=Monday
df['is_weekend'] = df['day_of_week'].isin([5,6]).astype(int)
df['hour'] = df['date'].dt.hour
df['is_business_hr']= df['hour'].between(9, 17).astype(int)
df['month'] = df['date'].dt.month
Example 2: Ratio and interaction features
# For loan default prediction
df['debt_to_income'] = df['total_debt'] / (df['annual_income'] + 1)
df['payment_ratio'] = df['monthly_payment'] / (df['monthly_income'] + 1)
df['credit_util_sq'] = df['credit_utilization'] ** 2 # non-linear effect
Example 3: Target encoding for high-cardinality categoricals
# Replace categories with mean target value (careful: do inside CV folds)
import category_encoders as ce
encoder = ce.TargetEncoder(cols=['city', 'product_category'])
X_encoded = encoder.fit_transform(X_train, y_train)
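The warning about encoding inside CV folds can be made concrete. Below is a minimal sketch of out-of-fold target encoding in plain Python; the function name, the round-robin fold assignment, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=3):
    """Encode each row with target means computed on the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        other_sum, other_n = 0.0, 0
        # Statistics come only from rows outside the current fold
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                other_sum += targets[i]
                other_n += 1
        fallback = other_sum / other_n if other_n else 0.0
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                # Category unseen in the other folds: fall back to their overall mean
                encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded

cities = ["pune", "pune", "delhi", "pune", "delhi", "goa"]
churned = [1, 0, 1, 1, 0, 1]
encoded_cities = oof_target_encode(cities, churned)
print(encoded_cities)
```

No row's encoding ever uses its own label, which is what prevents the target from leaking into the feature.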
Q8. What is a confusion matrix and what can you derive from it?
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
From a confusion matrix you can compute:
- Precision = TP / (TP + FP)
- Recall / Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- F1 = 2 * Precision * Recall / (Precision + Recall)
- Accuracy = (TP + TN) / Total
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import numpy as np
cm = confusion_matrix(y_test, y_pred)
print("TP:", cm[1,1], "TN:", cm[0,0], "FP:", cm[0,1], "FN:", cm[1,0])
# Multiclass
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1', 'Class 2']))
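As a worked example of the formulas above, this short sketch computes every metric from raw counts. The counts are invented for illustration, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Made-up counts: 120 actual positives, 880 actual negatives
metrics = classification_metrics(tp=80, fp=20, fn=40, tn=860)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy looks strong here while recall is much lower, which is the usual reason interviewers ask you to go beyond accuracy.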
Q9. What is the difference between a parametric and non-parametric model?
| Property | Parametric | Non-Parametric |
|---|---|---|
| Definition | Fixed number of parameters, regardless of data size | Number of parameters grows with data |
| Examples | Linear regression, logistic regression, Naive Bayes | KNN, decision trees, kernel SVM |
| Pros | Fast inference, interpretable, less data needed | Flexible, no distribution assumptions |
| Cons | Strong assumptions about data distribution | Slow at scale, prone to overfitting |
| Memory | O(parameters) = constant | O(training data) |
Interview insight: KNN is the classic non-parametric model. It stores the entire training set; prediction = majority vote among k nearest neighbors. At FAANG scale, this is impractical (billions of samples). Approximate nearest neighbor (FAISS, ScaNN) makes it tractable.
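If asked to code KNN on the spot, a brute-force version like the sketch below is usually enough. It assumes Euclidean distance and a simple majority vote; the toy points and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    # Brute force: compute the distance to every stored training point
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]
pred_low = knn_predict(points, labels, (1.5, 1.5))
pred_high = knn_predict(points, labels, (6.5, 6.5))
print(pred_low, pred_high)
```

Every prediction scans the full training set, which is the cost that approximate nearest neighbor libraries are designed to avoid.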
Q10. What is Naive Bayes and when does it work well despite its naive assumption?
P(y|x_1,...,x_n) ∝ P(y) * ∏ P(x_i|y)
The "naive" part is the independence assumption (features are almost never truly independent).
When it works well:
- Text classification (word features are not truly independent, but classification only needs the correct class to score highest, not accurate probabilities)
- Spam detection
- Very small datasets where complex models overfit
- When you need a fast baseline
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# Text classification pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2))),
    ('clf', MultinomialNB(alpha=0.1))  # alpha = additive smoothing (alpha=1.0 is Laplace)
])
text_clf.fit(X_train_text, y_train)
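The formula P(y) * ∏ P(x_i|y) can be implemented directly. The following is a minimal sketch of multinomial Naive Bayes with additive smoothing in plain Python; the function names and the four toy documents are assumptions for illustration, and whitespace tokenization is a deliberate simplification.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the class with the highest log P(y) + sum of log P(x_i|y)."""
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        log_prob = math.log(n_label / n_docs)
        for token in doc.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            # Additive smoothing keeps unseen (word, class) pairs from zeroing the product
            log_prob += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

train_docs = ["free money now", "click here for free prize",
              "meeting at noon", "project report attached here"]
train_labels = ["spam", "spam", "ham", "ham"]
nb_model = train_nb(train_docs, train_labels)
print(predict_nb(nb_model, "free prize money"))
print(predict_nb(nb_model, "meeting report"))
```

Working in log space avoids numerical underflow when a document has many words.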
MEDIUM: Ensemble Methods and Evaluation (Questions 11-22)
Q11. Compare Random Forest, Gradient Boosting, XGBoost, and LightGBM.
| Property | Random Forest | Gradient Boosting | XGBoost | LightGBM |
|---|---|---|---|---|
| Training | Parallel | Sequential | Sequential + regularization | Sequential + leaf-wise growth |
| Error reduced | Variance | Bias | Bias + variance | Bias + variance |
| Speed | Fast | Slow | Faster via histogram | Fastest (GOSS + EFB) |
| Best for | Wide datasets, interpretability | General tabular | Kaggle-winning, structured data | Large datasets, categorical data |
| Native categoricals | No | No | Yes (enable_categorical=True) | Yes |
| GPU | No (by default) | No | Yes | Yes |
import lightgbm as lgb
import xgboost as xgb
# LightGBM -- preferred in 2026 for large tabular data
lgb_model = lgb.LGBMClassifier(
    n_estimators=1000, learning_rate=0.05, max_depth=-1,
    num_leaves=63, subsample=0.8, colsample_bytree=0.8,
    subsample_freq=1,  # row subsampling is only applied when subsample_freq > 0
    min_child_samples=20, reg_alpha=0.1, reg_lambda=0.1,
    n_jobs=-1, random_state=42
)
lgb_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])
# XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=1000, learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    reg_alpha=0.1, reg_lambda=1.0,
    tree_method='hist', device='cuda',
    eval_metric='logloss', early_stopping_rounds=50
)
Q12. How does gradient boosting work step by step?
Gradient boosting builds an additive model by fitting each new tree to the negative gradient (residuals) of the loss function:
Step 0: Initialize F_0(x) = argmin_γ Σ L(y_i, γ)
For m = 1 to M:
1. Compute pseudo-residuals: r_im = -∂L(y_i, F_{m-1}(x_i)) / ∂F_{m-1}(x_i)
2. Fit a regression tree T_m to r_im
3. Update: F_m(x) = F_{m-1}(x) + η * T_m(x)
For MSE loss, pseudo-residuals = actual residuals (y - y_hat). For log-loss, pseudo-residuals = y - sigmoid(F(x)), where F(x) is the raw log-odds score of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Manual gradient boosting for MSE loss
class SimpleGBM:
    def __init__(self, n_estimators=100, lr=0.1, max_depth=3):
        self.trees = []
        self.n_estimators = n_estimators
        self.lr = lr
        self.max_depth = max_depth
    def fit(self, X, y):
        self.F0 = y.mean()
        F = np.full(len(y), self.F0)
        for _ in range(self.n_estimators):
            residuals = y - F  # negative gradient of MSE
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)
            F += self.lr * tree.predict(X)
    def predict(self, X):
        return self.F0 + self.lr * sum(t.predict(X) for t in self.trees)
Q13. What is feature importance in tree models? How is it computed?
Tree models offer several feature importance measures:
| Method | How | Pros | Cons |
|---|---|---|---|
| Impurity (Gini/MSE) | Sum of impurity reduction by feature across all splits | Fast, built-in | Biased toward high-cardinality and numerical features |
| Permutation Importance | Measure accuracy drop when feature is shuffled | Model-agnostic, unbiased | Slow |
| SHAP Values | Game theory-based attribution | Consistent, handles interactions | Slow for large ensembles |
from sklearn.inspection import permutation_importance
import shap
# Impurity importance (fast, built-in)
importances = model.feature_importances_
# Permutation importance (unbiased)
perm_imp = permutation_importance(model, X_test, y_test,
n_repeats=10, n_jobs=-1)
sorted_idx = perm_imp.importances_mean.argsort()[::-1]
# SHAP (gold standard in 2026)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
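Permutation importance is simple enough to write from scratch, which interviewers sometimes ask for. This sketch assumes the model is any callable that maps a feature row to a label; `toy_model` and the random data are hypothetical stand-ins for a fitted model and a held-out set.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_scratch(model, X, y, n_repeats=20, seed=0):
    """Mean accuracy drop when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical fitted model that only looks at feature 0
def toy_model(row):
    return int(row[0] > 0.5)

data_rng = random.Random(42)
X_demo = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_demo = [int(row[0] > 0.5) for row in X_demo]
importances = permutation_importance_scratch(toy_model, X_demo, y_demo)
print(importances)
```

The feature the model ignores gets an importance of exactly zero, while shuffling the feature it relies on produces a large accuracy drop.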
Q14. What is ROC-AUC vs PR-AUC? When does each metric mislead you?
ROC-AUC (Receiver Operating Characteristic)
- Plots TPR (recall) vs FPR at all thresholds
- AUC = probability that model ranks a random positive higher than a random negative
- Misleads when: Severe class imbalance. A high AUC can hide poor minority-class performance because FPR looks small when TN >> FP
PR-AUC (Precision-Recall)
- Plots Precision vs Recall at all thresholds
- AUC = area under the precision-recall curve
- Better for: Fraud detection, rare disease detection, any imbalanced binary problem
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
y_scores = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_scores)
pr_auc = average_precision_score(y_test, y_scores)
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"PR-AUC: {pr_auc:.4f}")
# If class imbalance is severe (e.g., 1% positives):
# ROC-AUC = 0.95 might still mean model misses 40% of positives
# PR-AUC tells the truth
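The ranking interpretation of ROC-AUC can be checked directly by counting pairs. Here is a small sketch (quadratic in the number of samples, so for intuition only); the labels and scores are invented.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive is scored higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

y_true_demo = [0, 0, 1, 1, 0, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc_demo = pairwise_auc(y_true_demo, y_score_demo)
print(f"AUC = {auc_demo:.4f}")
```

Eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. Note that the metric depends only on the ordering of scores, not on their calibration.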
Q15. How do you handle imbalanced classes in machine learning?
| Strategy | Description | When to Use |
|---|---|---|
| Class weights | Penalize minority class misclassification more | First thing to try; no data modification |
| Oversampling (SMOTE) | Synthesize new minority samples | When dataset is small |
| Undersampling | Remove majority class samples | When majority class is massive |
| Threshold tuning | Move decision threshold from 0.5 | Always in production |
| Focal Loss | Penalize easy examples less | Deep learning with severe imbalance |
| Ensemble with balanced subsampling | BalancedBaggingClassifier | Stable, general purpose |
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight
# Strategy 1: class_weight (zero computational cost)
model = GradientBoostingClassifier()
sample_weights = compute_sample_weight('balanced', y_train)
model.fit(X_train, y_train, sample_weight=sample_weights)
# Strategy 2: SMOTE oversampling pipeline
smote_pipeline = ImbPipeline([
    ('oversample', SMOTE(k_neighbors=5, random_state=42)),
    ('model', GradientBoostingClassifier())
])
# Strategy 3: Threshold tuning
import numpy as np
from sklearn.metrics import precision_recall_curve
prec, rec, thresh = precision_recall_curve(y_val, model.predict_proba(X_val)[:,1])
# prec and rec have one more entry than thresh, so drop the final point
f1_scores = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_thresh = thresh[np.argmax(f1_scores)]
y_pred_tuned = (model.predict_proba(X_test)[:,1] >= best_thresh).astype(int)
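The focal loss row in the table above is easier to explain with numbers. This is a minimal sketch of the binary form, FL = -(1 - p_t)^gamma * log(p_t), without the optional class-weighting term; the function name and example probabilities are assumptions for illustration.

```python
import math

def binary_focal_loss(y_true, p_pred, gamma=2.0, eps=1e-12):
    """Focal loss for one example; gamma=0 reduces to ordinary cross-entropy."""
    # p_t is the probability the model assigned to the true class
    p_t = p_pred if y_true == 1 else 1.0 - p_pred
    p_t = min(max(p_t, eps), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_focal = binary_focal_loss(1, 0.95)           # confident and correct
hard_focal = binary_focal_loss(1, 0.30)           # uncertain and wrong-leaning
easy_ce = binary_focal_loss(1, 0.95, gamma=0.0)   # plain cross-entropy for comparison
print(f"easy example: CE={easy_ce:.5f}  focal={easy_focal:.5f}")
print(f"hard example: focal={hard_focal:.5f}")
```

The modulating factor (1 - p_t)^gamma shrinks the loss of the easy example far more than that of the hard one, so training focuses on the examples the model still gets wrong.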
Q16. What is a learning curve and how do you use it for model diagnosis?
| Pattern | Diagnosis | Fix |
|---|---|---|
| Both curves low and flat | High bias / underfitting | More complex model, better features |
| Large gap, train high, val low | High variance / overfitting | More data, regularization |
| Val curve still rising | Model benefits from more data | Collect more data |
| Both curves high and close | Well-fit | Ship it |
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC
train_sizes = np.linspace(0.1, 1.0, 10)
train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel='rbf', C=10), X, y,
    train_sizes=train_sizes, cv=5,
    scoring='accuracy', n_jobs=-1
)
print("Train:", train_scores.mean(axis=1).round(3))
print("Val: ", val_scores.mean(axis=1).round(3))
Q17. How does principal component analysis (PCA) work?
Steps:
- Standardize features (zero mean, unit variance)
- Compute covariance matrix: C = X^T X / (n-1)
- Eigen-decompose C: eigenvectors = principal components, eigenvalues = variance explained
- Project data onto top-k eigenvectors
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Always scale first
X_scaled = StandardScaler().fit_transform(X)
# Retain 95% of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Original features: {X.shape[1]}")
print(f"PCA components: {X_pca.shape[1]}")
print(f"Variance explained per component: {pca.explained_variance_ratio_}")
print(f"Cumulative variance: {pca.explained_variance_ratio_.cumsum()}")
Use PCA for: Dimensionality reduction before clustering, visualization (PCA to 2D), removing multicollinearity, speeding up downstream models.
Do NOT use PCA when: you need interpretability (each component is a linear combination of all features) or the data is already low-dimensional.
Q18. What is the difference between classification and regression? When does the boundary blur?
| Aspect | Classification | Regression |
|---|---|---|
| Output | Discrete class label | Continuous value |
| Loss functions | Cross-entropy, Hinge loss | MSE, MAE, Huber |
| Evaluation | Accuracy, F1, AUC-ROC | RMSE, MAE, R² |
| Examples | Spam detection, fraud | House price, demand forecasting |
When the boundary blurs:
- Ordinal regression: Target is ordered categories (1-5 star rating). Can model as regression or multi-class classification; ordinal-specific methods do better.
- Threshold classification: Take regression output and threshold it (e.g., predict churn probability > 0.6 = churn)
- Calibration: Neural networks outputting soft probabilities are doing regression internally
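# Ordinal regression
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
class OrdinalClassifier:
    """Frank and Hall ordinal approach: train k-1 binary classifiers for P(y > c)."""
    def __init__(self, base_clf=None):
        self.base_clf = base_clf if base_clf is not None else LogisticRegression()
        self.clfs_ = []
        self.classes_ = None
    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.sort(np.unique(y))
        self.clfs_ = []  # reset so repeated fit() calls do not accumulate models
        for c in self.classes_[:-1]:
            binary_y = (y > c).astype(int)
            clf = clone(self.base_clf)
            clf.fit(X, binary_y)
            self.clfs_.append(clf)
        return self
    def predict(self, X):
        # Column i holds P(y > classes_[i]); differences give per-class probabilities
        gt = np.column_stack([clf.predict_proba(X)[:, 1] for clf in self.clfs_])
        probs = np.hstack([1 - gt[:, :1], gt[:, :-1] - gt[:, 1:], gt[:, -1:]])
        return self.classes_[probs.argmax(axis=1)]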
Q19. What is a pipeline in scikit-learn and why is it critical for production?
- Prevents data leakage: Preprocessing steps (scaling, imputation) are fit only on training data, never on test data
- Clean code: One object to fit, predict, persist
- Hyperparameter tuning is unified: GridSearchCV can tune pipeline parameters
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
num_features = ['age', 'income', 'tenure']
cat_features = ['city', 'product_type']
# Preprocessing for numerical features
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Preprocessing for categorical features
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])
# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05))
])
full_pipeline.fit(X_train, y_train)
Q20. Explain SHAP values and why they are the standard for model explainability in 2026.
- Consistent: If a feature's impact increases, its SHAP value increases. Impurity-based importance violates this.
- Local and global: Explain individual predictions AND aggregate into global importance.
- Model-agnostic: Works for any ML model; specialized implementations (TreeSHAP) are exact and fast for trees.
import shap
# Fast exact SHAP for tree models (TreeSHAP)
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_test)
# Note: the [1] indexing below assumes a shap release that returns one array per
# class for a binary classifier; other releases return a single array, so check
# the type and shape of shap_values before indexing
# Global: which features matter most?
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)
# Local: why did this specific prediction happen?
shap.waterfall_plot(explainer(X_test)[0])
# Force plot: visual explanation for one instance
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test.iloc[0])
Production pattern (Amazon, Flipkart): Attach SHAP explanations to fraud alerts and loan rejection decisions. Regulators in India (RBI) require explainable credit decisions.
Q21. What is A/B testing and how does it relate to ML model evaluation?
Steps:
- Define hypothesis (H0: models perform equally; H1: new model is better)
- Calculate required sample size (power analysis)
- Run experiment until n_min reached
- Use hypothesis test (t-test, z-test, or Mann-Whitney U) to determine significance
- Never stop early based on current significance (p-hacking)
from scipy import stats
import numpy as np
# Conversion rates
control_conversions = 1200
control_n = 15000
treatment_conversions = 1350
treatment_n = 15000
p_control = control_conversions / control_n
p_treatment = treatment_conversions / treatment_n
# Two-proportion z-test
from statsmodels.stats.proportion import proportions_ztest
count = np.array([treatment_conversions, control_conversions])
nobs = np.array([treatment_n, control_n])
stat, pvalue = proportions_ztest(count, nobs, alternative='larger')
print(f"p-value: {pvalue:.4f}")
print("Reject H0" if pvalue < 0.05 else "Cannot reject H0")
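The power-analysis step can be sketched with the standard normal-approximation formula for comparing two proportions. This is a rough planning estimate, not a substitute for a proper power library; the function name is hypothetical, and the default alpha and power are conventional choices rather than values from the text.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_control, p_treatment,
                                alpha=0.05, power=0.8, one_sided=True):
    """Approximate users needed per arm to detect p_treatment vs p_control."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control)
                             + p_treatment * (1 - p_treatment))
    ) ** 2
    return math.ceil(numerator / (p_treatment - p_control) ** 2)

# Same rates as the example above: 8% control vs 9% treatment
n_per_arm = sample_size_two_proportions(0.08, 0.09)
print("Users needed per arm:", n_per_arm)
```

Smaller expected lifts need far more traffic, because the required sample size grows with the inverse square of the difference between the two rates.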
Q22. What is stacking (stacked generalization) and how does it differ from bagging and boosting?
| Ensemble | How | When to Use |
|---|---|---|
| Bagging | Average parallel models on bootstrap samples | Reduce variance (RF) |
| Boosting | Sequential models, each correcting the last | Reduce bias (XGB) |
| Stacking | Train meta-learner on base model predictions | Combine diverse models, squeeze last few % |
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
base_models = [
('rf', RandomForestClassifier(n_estimators=200, n_jobs=-1)),
('gb', GradientBoostingClassifier(n_estimators=200)),
('svm', SVC(probability=True, kernel='rbf'))
]
# Meta-learner trained on base model out-of-fold predictions
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(C=0.1),
    cv=5,
    stack_method='predict_proba'
)
stack.fit(X_train, y_train)
HARD: Advanced Topics and System Design (Questions 23-30)
Q23. How do you design an ML pipeline for production? What are the key components?
Production ML Pipeline Architecture (2026):
- Data Layer: Raw ingestion (Kafka/S3) -> Feature Store (Feast/Tecton) -> Training data snapshots
- Training Layer: Experiment tracking (MLflow/W&B) -> Hyperparameter tuning (Optuna) -> Model registry
- Serving Layer: REST API (FastAPI) or batch scoring -> Model versioning -> Canary/shadow deployment
- Monitoring Layer: Data drift (EvidentlyAI) -> Model performance -> Alerting -> Retraining triggers
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
mlflow.set_experiment("churn-prediction-v2")
with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)
    val_roc = roc_auc_score(y_val, model.predict_proba(X_val)[:,1])
    # Log configuration and validation metric so runs can be compared later
    mlflow.log_params(model.get_params())
    mlflow.log_metric("val_roc_auc", val_roc)
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature)
Q24. What is concept drift and how do you detect and handle it?
| Type | Definition | Example |
|---|---|---|
| Sudden drift | Abrupt change | New government policy changes fraud patterns |
| Gradual drift | Slow shift | Consumer behavior shifts over months |
| Recurring drift | Cyclic change | Seasonal patterns in retail |
| Feature drift (covariate shift) | Input distribution shifts, even if the input-to-label relationship does not | Economic crisis shifts income distributions |
# Import paths below follow the older Evidently Report API; newer releases
# reorganized the package, so match them to the installed version
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
# Compare reference (training) vs current (production) data
# TargetDriftPreset typically needs a target or prediction column in both frames
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=X_train_df, current_data=X_production_df)
report.save_html("drift_report.html")
# Statistical test: KS test for continuous features
from scipy.stats import ks_2samp
for col in numerical_features:
    stat, pval = ks_2samp(X_train_df[col], X_production_df[col])
    if pval < 0.05:
        print(f"Drift detected in {col}: p={pval:.4f}")
Response to drift:
- Retrain on recent data (rolling window)
- Continuously fine-tune on production stream
- Use online learning for fast-drift scenarios
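To explain what the KS test used above actually measures, here is a minimal sketch of the two-sample KS statistic: the largest gap between the two empirical CDFs. It computes only the statistic, not the p-value, and the sample values are made up.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap

reference = [1, 2, 3, 4]        # e.g. a feature at training time
shifted = [3, 4, 5, 6]          # the same feature in production, shifted up
ks_same = ks_statistic(reference, reference)
ks_shifted = ks_statistic(reference, shifted)
ks_disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
print(ks_same, ks_shifted, ks_disjoint)
```

A statistic near 0 means the distributions overlap closely and a value near 1 means they barely overlap; the p-value then tells you whether the gap is larger than chance would explain at that sample size.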
Q25. Explain the curse of dimensionality and its practical impact on ML.
- Sparsity: Data points are far apart. In d dimensions, to have the same density as 10 points in 1D, you need 10^d points.
- Distance concentration: As d grows, the ratio of max-to-min distance among points converges to 1. Distance metrics become uninformative for KNN.
- Computational explosion: Feature combinations grow exponentially.
Practical impact:
# Distance concentration demonstration
import numpy as np
np.random.seed(42)
for d in [2, 10, 100, 1000]:
    X = np.random.randn(1000, d)
    x = np.random.randn(d)
    dists = np.linalg.norm(X - x, axis=1)
    ratio = dists.max() / dists.min()
    print(f"d={d:5d} max/min distance ratio = {ratio:.2f}")
# Typical behavior: the ratio is large at d=2 and shrinks toward 1 as d grows
# At d=1000, max and min distances are nearly equal -> nearest neighbors carry little signal
Mitigations: PCA, autoencoders, feature selection (mutual information, LASSO), domain knowledge to drop irrelevant features.
Q26. What is Bayesian optimization and why is it better than grid search for hyperparameter tuning?
| Method | How | Evals Needed | Smart? |
|---|---|---|---|
| Grid search | Try all combinations | n^k for n values of each of k hyperparameters | No |
| Random search | Random samples | Fixed budget you choose | No |
| Bayesian (TPE, GP) | Exploit + explore based on history | Often 50-100 trials | Yes |
import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0)
    }
    # XGBoost 2.x takes early_stopping_rounds in the constructor, not in fit()
    model = xgb.XGBClassifier(**params, eval_metric='logloss', verbosity=0,
                              early_stopping_rounds=30)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:,1])
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100, n_jobs=4)
print("Best params:", study.best_params)
Q27. How does logistic regression actually work? Derive the update rule.
σ(z) = 1 / (1 + e^{-z})
P(y|x) = σ(w^T x)^y * (1 - σ(w^T x))^(1-y)
Negative log-likelihood (cross-entropy loss):
L(w) = -Σ [y_i * log σ(w^T x_i) + (1-y_i) * log(1 - σ(w^T x_i))]
Gradient (using σ'(z) = σ(z) * (1 - σ(z))):
∂L/∂w = Σ (σ(w^T x_i) - y_i) * x_i = X^T (y_hat - y)
Update rule (gradient descent on the mean loss L/n):
w ← w - η * X^T (y_hat - y) / n
import numpy as np

class LogisticRegressionFromScratch:
    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n, d = X.shape
        self.weights = np.zeros(d)
        self.bias = 0.0
        for _ in range(self.n_iters):
            y_hat = self.sigmoid(X @ self.weights + self.bias)
            dw = X.T @ (y_hat - y) / n
            db = (y_hat - y).mean()
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict_proba(self, X):
        return self.sigmoid(X @ self.weights + self.bias)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
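A quick way to convince yourself (and an interviewer) that the derived gradient is right is a finite-difference check. The sketch below compares the analytic gradient X^T (y_hat - y) / n against a numerical estimate on random data; the function names and the synthetic data are assumptions made for this illustration, and the bias term is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_cross_entropy(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, X, y):
    # Gradient of the mean loss: X^T (y_hat - y) / n
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def numeric_grad(w, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (mean_cross_entropy(w + step, X, y)
                   - mean_cross_entropy(w - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.5).astype(float)
w = rng.normal(size=3)

max_diff = np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y)).max()
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

If the two gradients disagree by more than numerical noise, the usual culprits are a sign error or a missing 1/n factor.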
Q28. What is the VC dimension and why does it matter for generalization?
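The VC dimension of a hypothesis class is the size of the largest set of points it can shatter, meaning it can realize every possible binary labeling of those points.
- Linear classifier in R^d: VC dimension = d+1
- Sine classifier sign(sin(wx)): VC dimension = infinity (can shatter arbitrarily many points by tuning w, despite having a single parameter)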
Why it matters: Generalization bound (PAC learning), shown here in simplified form with constants and the confidence term omitted:
Error_test <= Error_train + sqrt(VC_dim * log(n) / n)
Higher VC dimension = more complex hypothesis class = higher generalization gap at fixed n.
Interview insight: By this bound, deep neural networks with billions of parameters should overfit their training data, yet in practice they often generalize well. Common explanations point to implicit regularization from SGD, early stopping, and dropout keeping the effective capacity far below what the parameter count suggests. A complete theoretical account is still an open problem, and FAANG research interviewers love asking about it.
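To get a feel for how the bound scales, the sketch below evaluates the simplified gap term sqrt(VC_dim * log(n) / n) for a linear classifier in R^d, whose VC dimension is d+1. The helper name `vc_gap` is hypothetical, and because constants and the confidence term are left out, the numbers show scaling only, not a usable guarantee.

```python
import math

def vc_gap(vc_dim, n):
    """Simplified gap term sqrt(VC_dim * log(n) / n); constants are omitted."""
    return math.sqrt(vc_dim * math.log(n) / n)

# A linear classifier in R^d has VC dimension d + 1
for d in (10, 1000):
    for n in (1_000, 100_000, 10_000_000):
        print(f"d={d:5d} n={n:>10,d} gap term = {vc_gap(d + 1, n):.3f}")
```

A gap term above 1 means the bound is vacuous at that sample size, which is exactly the regime where large neural networks sit if you plug in their parameter count.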
Q29. Design a recommendation system. What ML components are involved?
Architecture layers:
1. Candidate Generation (fast, recall-focused)
- Matrix Factorization: user embedding @ item embedding
- Two-Tower Neural Network (BERT embeddings for items)
- Approximate Nearest Neighbor: FAISS, ScaNN
- Goal: get top 1,000 candidates from millions of items
2. Ranking (slow, precision-focused)
- Features: user history, item metadata, context (time, device)
- Model: LightGBM or Wide & Deep Neural Network
- Objective: P(click|user, item, context), P(purchase|...)
- Goal: rank top 1,000 candidates, return top 10-50
3. Re-ranking (business logic)
- Diversity: avoid showing all same-category items
- Freshness boost for new items
- Business constraints (promoted items, inventory)
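The diversity step of the re-ranking layer can be sketched as a greedy pass with a per-category cap. This is one simple approach among many; the function name `rerank_with_category_cap`, the item tuples, and the cap of 2 are assumptions for illustration.

```python
from collections import Counter

def rerank_with_category_cap(ranked_items, max_per_category, top_n):
    """Greedy pass over score-sorted items, skipping over-represented categories."""
    seen = Counter()
    result = []
    for item_id, category, score in ranked_items:
        if seen[category] < max_per_category:
            result.append(item_id)
            seen[category] += 1
        if len(result) == top_n:
            break
    return result

# Hypothetical ranker output: (item_id, category, score), sorted by score descending
ranked = [
    ("shoe_1", "shoes", 0.95), ("shoe_2", "shoes", 0.93), ("shoe_3", "shoes", 0.91),
    ("bag_1", "bags", 0.88), ("shoe_4", "shoes", 0.85), ("hat_1", "hats", 0.80),
]
print(rerank_with_category_cap(ranked, max_per_category=2, top_n=4))
# ['shoe_1', 'shoe_2', 'bag_1', 'hat_1']
```

The cap trades a little predicted relevance (shoe_3 scores higher than bag_1) for a more varied slate, which is the kind of tradeoff interviewers expect you to name explicitly.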
# Two-tower model sketch (simplified)
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, user_vocab, item_vocab, embed_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Embedding(user_vocab, embed_dim),
            nn.Flatten(),
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 64)
        )
        self.item_tower = nn.Sequential(
            nn.Embedding(item_vocab, embed_dim),
            nn.Flatten(),
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 64)
        )

    def forward(self, user_ids, item_ids):
        u = self.user_tower(user_ids)
        i = self.item_tower(item_ids)
        return (u * i).sum(dim=-1)  # dot product similarity
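Once the towers are trained, item embeddings can be precomputed and candidate generation becomes a nearest-neighbor lookup by dot product. The sketch below does that lookup exactly with NumPy as a stand-in for an ANN index such as FAISS or ScaNN; the random embeddings and the helper name `top_k_items` are assumptions for illustration, not output of the model above.

```python
import numpy as np

def top_k_items(user_vec, item_matrix, k):
    """Exact top-k by dot product; ANN libraries approximate this at scale."""
    scores = item_matrix @ user_vec
    top = np.argpartition(-scores, k)[:k]        # unordered top-k, linear time
    return top[np.argsort(-scores[top])]         # sort only those k by score

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(50_000, 64)).astype(np.float32)  # stand-in for item tower output
user_embedding = rng.normal(size=64).astype(np.float32)             # stand-in for user tower output

candidates = top_k_items(user_embedding, item_embeddings, k=1000)
print(candidates.shape, candidates[:5])
```

Exact search like this scans every item per request, which is why production systems with millions of items switch to approximate indexes and accept a small recall loss.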
Q30. What is multi-armed bandit and when is it better than A/B testing?
| Aspect | A/B Test | Multi-Armed Bandit |
|---|---|---|
| Traffic allocation | Fixed (50/50) | Dynamic (shift to winner) |
| Regret | High (bad arm runs full duration) | Lower (bad arm starved quickly) |
| Statistical rigor | Formal hypothesis testing | Less formal, regret-minimization |
| Best for | Stable business decisions | Real-time optimization (ads, pricing) |
import numpy as np

class EpsilonGreedyBandit:
    """Epsilon-greedy MAB: explore with probability epsilon."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # explore
        return np.argmax(self.values)              # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean

# Thompson Sampling (better in practice)
class ThompsonSampling:
    """Beta-Bernoulli Thompson Sampling; assumes rewards are 0 or 1."""
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = np.random.beta(self.alpha, self.beta)
        return np.argmax(samples)

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
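The regret row of the comparison table can be checked with a small simulation. The sketch below runs Beta-Bernoulli Thompson Sampling against a fixed even split on three arms using only the standard library; the click-through rates, round count, and function names are assumptions chosen for illustration.

```python
import random

def run_thompson(true_rates, n_rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns pulls per arm and total reward."""
    rng = random.Random(seed)
    alpha = [1.0] * len(true_rates)   # successes + 1
    beta = [1.0] * len(true_rates)    # failures + 1
    pulls = [0] * len(true_rates)
    total_reward = 0
    for _ in range(n_rounds):
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_rates[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward
    return pulls, total_reward

def run_fixed_split(true_rates, n_rounds, seed=0):
    """A/B-style baseline: traffic is split evenly for the whole run."""
    rng = random.Random(seed)
    total_reward = 0
    for t in range(n_rounds):
        arm = t % len(true_rates)
        total_reward += 1 if rng.random() < true_rates[arm] else 0
    return total_reward

rates = [0.05, 0.10, 0.30]            # hypothetical click-through rates
pulls, ts_reward = run_thompson(rates, n_rounds=5000)
ab_reward = run_fixed_split(rates, n_rounds=5000)
print("Thompson pulls per arm:", pulls)
print("Thompson reward:", ts_reward, "| fixed-split reward:", ab_reward)
```

The bandit shifts most traffic to the best arm early, so it collects more reward during the experiment. The cost is that the weaker arms get few samples, which makes a formal significance statement about them harder than with a fixed split.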
Comparison Table: ML Algorithms at a Glance
| Algorithm | Type | Scale | Interpretable | Handles Non-linearity | 2026 Status |
|---|---|---|---|---|---|
| Linear Regression | Regression | Large | Yes | No | Baseline |
| Logistic Regression | Classification | Large | Yes | No | Strong baseline |
| Decision Tree | Both | Medium | Yes | Yes | Rarely solo |
| Random Forest | Both | Large | Partial | Yes | Strong |
| XGBoost | Both | Large | Partial | Yes | Industry standard |
| LightGBM | Both | Very Large | Partial | Yes | Fastest tabular |
| SVM | Both | Medium | No | Yes (kernel) | Fading |
| KNN | Both | Small | Yes | Yes | Prototyping |
| Naive Bayes | Classification | Large | Yes | No | NLP baseline |
| Neural Network | Both | Very Large | No | Yes | Complex tasks |
FAQ
Q: What is the most important ML algorithm to know for 2026 interviews? A: Gradient boosting variants (XGBoost, LightGBM) for tabular data; transformers for sequential/unstructured data. Know both categories deeply.
Q: How much math do I need for ML interviews at product companies? A: Linear algebra (matrix operations), calculus (chain rule, partial derivatives), probability (Bayes theorem, distributions), and statistics (hypothesis testing, confidence intervals). You will not be asked to prove convergence theorems.
Q: What is the difference between a data scientist and ML engineer role? A: Data scientists focus on analysis, experimentation, and model building. ML engineers focus on deploying models at scale, building feature pipelines, and maintaining production systems. The boundary is blurring in 2026.
Q: Which Python libraries should I know for ML interviews? A: scikit-learn (must), XGBoost and LightGBM (must), pandas and NumPy (must), SHAP (expected at senior level), Optuna (good to know), MLflow (good to know for MLOps roles).
Q: Is deep learning tested in ML engineer interviews? A: Depends on the role. For ML engineer and data scientist roles at product companies, deep learning basics (neural nets, backprop, regularization) are fair game. Advanced topics (transformers, fine-tuning, distributed training) are tested for ML researcher and AI/ML engineer roles.
Methodology applied to this article (last verified 8 Jun 2026)
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.