Sun, 27 Apr 2026
vol. IX · no. 117
PapersAdda

Data Science Interview Questions 2026: 30 Answers with Code

Updated: 8 Jun 2026
Aditya Sharma
Aditya's Edit

PapersAdda 2026 Placement Cycle

By Aditya Sharma·Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for the 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.

What I'd actually study for this

  • 01 Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  • 02 One real project you can defend end-to-end - file paths, design decisions, and what you would change
  • 03 One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  • 04 Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.

Data science remains one of the highest-demand technical roles in 2026, with strong hiring across fintech, e-commerce, healthtech, and product companies. Data scientist interviews test a broad stack: statistics, SQL, Python, ML modeling, experiment design, and product thinking. This guide covers 30 data science interview questions with full answers and code, organized from foundations to system design.

PapersAdda's take: Candidates report that statistics and A/B testing questions are the most common elimination filters in data science interviews at product companies like Flipkart, Swiggy, PhonePe, and FAANG India. SQL window function problems appear in over 65% of DS first rounds. Confirm the specific tools and data stack expected on the official company careers portal before you prepare.



Which Companies Hire for Data Science Roles?

| Sector | Companies | DS Focus |
| --- | --- | --- |
| Fintech | Razorpay, PhonePe, Zepto, Groww | Fraud detection, credit risk, product analytics |
| E-commerce | Flipkart, Meesho, Amazon India | Recommendation, pricing, supply chain |
| Ride-hailing/food | Ola, Uber, Swiggy, Zomato | Surge pricing, demand forecasting, ETA |
| FAANG India | Google, Meta, Microsoft | Product analytics, ads, ML platform |
| Healthcare | 1mg, Practo, Aarogya | Clinical ML, drug interaction, patient journey |

EASY: Statistics and Python Foundations (Questions 1-10)

Q1. What is the difference between Type I and Type II errors? How do they relate to significance level and power?

| Error Type | Definition | Cost |
| --- | --- | --- |
| Type I (alpha) | Reject true null hypothesis (false positive) | False alarm |
| Type II (beta) | Fail to reject false null hypothesis (false negative) | Missed detection |

  • Significance level alpha: P(Type I error) -- typically 0.05. Lower alpha = fewer false positives but more false negatives.
  • Power = 1 - beta: probability of detecting a true effect. Target >= 0.80.
  • Trade-off: for a fixed sample size and effect size, decreasing alpha increases beta. Increasing the sample size is the standard way to reduce both.
import math
from statsmodels.stats.power import TTestIndPower

# Power analysis for a two-sample t-test
analysis = TTestIndPower()
# Find required sample size for effect_size=0.3, alpha=0.05, power=0.80
n = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.80, ratio=1.0)
# Round up: the solver returns a fractional n, and you cannot recruit part of a user
print(f"Required n per group: {math.ceil(n)}")  # 176

Q2. Explain p-value. A p-value of 0.03 -- what does it mean?

The p-value is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true.

p = 0.03 means: if there were truly no effect (null is true), there is a 3% chance of getting this result or something more extreme by random chance.

Common misinterpretations (don't say these in interviews):

  • "Probability the null is true" -- WRONG.
  • "Probability the effect is real" -- WRONG.
  • "Effect is practically significant" -- WRONG (large N can yield tiny p for negligible effects).

Practical guidance: p < alpha is a decision rule, not a measure of effect size. Always report effect size (Cohen's d, eta^2) alongside p-value.
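The point about large N producing tiny p-values for negligible effects can be sketched with a one-sample z-test. This is a minimal illustration, assuming a known standard deviation; the function name `two_sided_p_value` and the numbers are made up for the example.

```python
import math

def two_sided_p_value(mean_diff: float, sd: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test (known sd assumed)."""
    z = mean_diff / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Same negligible effect (0.01 standard deviations), increasing sample size
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(mean_diff=0.01, sd=1.0, n=n)
    print(f"n={n:>9,}  p={p:.4g}  effect size d=0.01")
```

The effect size never changes across the three runs; only the p-value does, which is why the two should be reported together.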


Q3. What is the Central Limit Theorem? Why is it important for data science?

The CLT states: given a population with mean mu and finite variance sigma^2, the sampling distribution of the sample mean approaches N(mu, sigma^2/n) as n increases -- regardless of the population's original distribution.

Practical importance:

  • Justifies using z-tests and t-tests even when the population is skewed (at n >= 30 approximately).
  • Underpins confidence interval construction.
  • Explains why bootstrapping works.
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate CLT on exponential population
np.random.seed(42)
pop = np.random.exponential(scale=2, size=100_000)  # right-skewed

sample_means = [np.mean(np.random.choice(pop, 50)) for _ in range(10_000)]

print(f"Population mean: {pop.mean():.3f}")
print(f"Sample mean distribution mean: {np.mean(sample_means):.3f}")
print(f"Sample mean distribution std: {np.std(sample_means):.3f}")
print(f"Expected std (sigma/sqrt(n)): {pop.std()/np.sqrt(50):.3f}")
# Distribution of sample_means is approximately Normal

Q4. What is Bayes' theorem? Give a data science example.

P(A|B) = P(B|A) * P(A) / P(B)

Example -- spam detection:

  • P(spam) = 0.3 (prior: 30% of emails are spam)
  • P("free" | spam) = 0.8 (likelihood: 80% of spam contains "free")
  • P("free" | not spam) = 0.1 (10% of legit emails contain "free")
  • P("free") = 0.8 * 0.3 + 0.1 * 0.7 = 0.31

P(spam | "free") = (0.8 * 0.3) / 0.31 = 0.774

An email containing "free" is 77% likely spam.

In ML: Naive Bayes classifier, Bayesian hyperparameter optimization, posterior inference.
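The spam calculation above can be checked in a few lines. This is a small sketch; the helper name `posterior` and its parameter names are illustrative and not taken from any library.

```python
def posterior(prior: float, p_evidence_given_h: float, p_evidence_given_not_h: float) -> float:
    """P(H | E) via Bayes' theorem, expanding P(E) with the law of total probability."""
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence

# Numbers from the spam example: P(spam)=0.3, P("free"|spam)=0.8, P("free"|not spam)=0.1
p_spam_given_free = posterior(prior=0.3, p_evidence_given_h=0.8, p_evidence_given_not_h=0.1)
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # 0.774
```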


Q5. What is the difference between correlation and causation? Give an example.

Correlation: statistical relationship between two variables. Correlation coefficient r measures linear strength (-1 to +1).

Causation: change in X directly produces change in Y (causal mechanism).

Classic examples of correlation without causation:

  • Ice cream sales and drowning rates are correlated -- both caused by hot weather (confounding variable).
  • Shoe size and reading ability in children -- both caused by age.

Establishing causation in data science:

  • RCT (A/B test): gold standard -- randomize treatment assignment.
  • Difference-in-differences: compare treated vs control group changes over time.
  • Instrumental variables: use a variable that affects X but not Y directly.
  • Propensity score matching: control for confounders in observational data.
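Of the methods listed, difference-in-differences is the easiest to show as arithmetic. The sketch below assumes the parallel-trends condition holds (both groups would have moved alike without treatment); the function name and the weekly order counts are invented for illustration.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post) -> float:
    """DiD estimate: change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Made-up weekly order counts, for illustration only
effect = diff_in_diff(
    treated_pre=[100, 102, 98], treated_post=[115, 117, 113],
    control_pre=[90, 91, 89], control_post=[95, 96, 94],
)
print(f"Estimated treatment effect: {effect:.1f}")  # 10.0
```

The treated group rose by 15 and the control group by 5, so 10 of the 15-unit rise is attributed to the treatment rather than to the shared time trend.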

Q6. How do you handle missing data? What are the different strategies?

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv("data.csv")

# 1. Drop rows/columns (only for MCAR and low % missing)
df_dropped = df.dropna(thresh=int(0.7 * len(df.columns)), axis=0)

# 2. Simple imputation
imputer_mean = SimpleImputer(strategy="mean")   # numerical
imputer_mode = SimpleImputer(strategy="most_frequent")  # categorical

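# 3. KNN imputation (better for non-normal distributions; distance-based, so scale first)
num_df = df.select_dtypes(include='number')
knn_imputer = KNNImputer(n_neighbors=5)
# Keep column names and index; fit_transform returns a bare NumPy array
df_knn = pd.DataFrame(knn_imputer.fit_transform(num_df), columns=num_df.columns, index=num_df.index)

# 4. MICE / Iterative imputation (often the highest quality, at higher compute cost)
mice = IterativeImputer(random_state=42, max_iter=10)
df_mice = pd.DataFrame(mice.fit_transform(num_df), columns=num_df.columns, index=num_df.index)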

# 5. Add missingness indicator (important! tells model WHERE data was missing)
df["age_missing"] = df["age"].isna().astype(int)

MCAR/MAR/MNAR: missing completely at random (safe to drop), missing at random (impute), missing not at random (requires domain modeling).


Q7. What is the difference between variance and standard deviation? When do you use each?

  • Variance = E[(X - mu)^2] -- average squared deviation from mean. Units are squared.
  • Standard deviation = sqrt(variance) -- same units as the original variable.

Use variance when doing mathematical derivations (additive for independent variables: Var(X+Y) = Var(X) + Var(Y)). Use standard deviation for interpreting spread in original units, for z-scores, for confidence intervals.

Coefficient of variation (CV) = std/mean: normalized spread measure for comparing variables with different scales.
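A short sketch of the three quantities using only the standard library. The sample values are illustrative, and `coefficient_of_variation` is a helper defined here, not a library function.

```python
from statistics import mean, pstdev, pvariance

def coefficient_of_variation(values) -> float:
    """Population standard deviation divided by the mean (unitless)."""
    return pstdev(values) / mean(values)

heights_cm = [150, 160, 170, 180]                       # illustrative values
salaries_inr = [400_000, 600_000, 800_000, 1_400_000]   # illustrative values

print(f"Variance: {pvariance(heights_cm)} cm^2")   # squared units
print(f"Std dev:  {pstdev(heights_cm):.2f} cm")    # original units
print(f"CV heights:  {coefficient_of_variation(heights_cm):.3f}")
print(f"CV salaries: {coefficient_of_variation(salaries_inr):.3f}")
```

The raw standard deviations of the two lists cannot be compared because the units and scales differ; the CV values can.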


Q8. Explain outlier detection methods. How do you handle outliers?

import pandas as pd
import numpy as np
from scipy import stats

data = pd.Series([1, 2, 3, 4, 5, 100, 2, 3, 4, 5])

# Method 1: IQR method
Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]

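# Method 2: Z-score (assumes normality)
# Caveat: on small samples the outlier inflates the std (masking). With n=10 the largest
# possible |z| is sqrt(n-1) = 3, so the value 100 scores about 2.997 and is NOT flagged here.
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]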

# Method 3: Isolation Forest (multivariate)
from sklearn.ensemble import IsolationForest
clf = IsolationForest(contamination=0.1, random_state=42)
labels = clf.fit_predict(data.values.reshape(-1, 1))
outliers_if = data[labels == -1]

# Handling strategies:
# - Remove: only if data entry error and you can confirm it
# - Cap (Winsorize): clip at 1st and 99th percentile
# - Log transform: compress the scale
# - Use robust models: tree-based models are largely insensitive to outliers in the features (less so in the target)

Q9. What is the difference between normalization and standardization?

| Method | Formula | Output range | When to use |
| --- | --- | --- | --- |
| Min-Max normalization | (x - min) / (max - min) | [0, 1] | Neural networks, KNN, SVM |
| Z-score standardization | (x - mean) / std | Mean=0, std=1 | Linear models, PCA, logistic regression |
| Robust scaling | (x - median) / IQR | Centered on median | Data with significant outliers |

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Important: fit on train, transform on both train and test
X_test_scaled = scaler.transform(X_test)  # NOT fit_transform on test

Tree-based models (RF, XGBoost): do NOT need scaling. Distance-based models (KNN, SVM, KMeans): always scale.


Q10. What is EDA? Walk me through your EDA process.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def full_eda(df: pd.DataFrame):
    # 1. Shape and types
    print(df.shape, "\n", df.dtypes)

    # 2. Missing values
    missing = df.isnull().sum()
    print("Missing:\n", missing[missing > 0])

    # 3. Descriptive statistics
    print(df.describe(include='all'))

    # 4. Target variable distribution
    target = df.columns[-1]
    print(f"Target balance:\n{df[target].value_counts(normalize=True)}")

    # 5. Numerical distributions
    df.select_dtypes(include='number').hist(bins=30, figsize=(15, 10))

    # 6. Correlation matrix (open a new figure so the heatmap does not draw over the histograms)
    corr = df.select_dtypes(include='number').corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')

    # 7. Categorical cardinality
    for col in df.select_dtypes(include='object'):
        print(f"{col}: {df[col].nunique()} unique values")

    # 8. Outlier check per numerical column
    for col in df.select_dtypes(include='number'):
        Q1, Q3 = df[col].quantile([0.25, 0.75])
        IQR = Q3 - Q1
        n_outliers = ((df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)).sum()
        if n_outliers > 0:
            print(f"{col}: {n_outliers} outliers")

MEDIUM: Feature Engineering and Modeling (Questions 11-22)

Q11. What feature engineering techniques do you apply to datetime columns?

import pandas as pd

df["timestamp"] = pd.to_datetime(df["timestamp"])

# Cyclical encoding (important for hour/month/day -- wrap-around)
import numpy as np
df["hour_sin"] = np.sin(2 * np.pi * df["timestamp"].dt.hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["timestamp"].dt.hour / 24)
df["dow_sin"] = np.sin(2 * np.pi * df["timestamp"].dt.dayofweek / 7)
df["dow_cos"] = np.cos(2 * np.pi * df["timestamp"].dt.dayofweek / 7)

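# Binary flags
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
df["is_business_hours"] = df["timestamp"].dt.hour.between(9, 18)  # inclusive: 09:00-18:59

# Elapsed features (substitute each user's signup timestamp if you have one;
# a fixed reference date only yields a linear time index)
reference_date = pd.Timestamp("2024-01-01")
df["days_since_signup"] = (df["timestamp"] - reference_date).dt.days

# Rolling aggregates (user behavior features)
# Time-based windows such as "7D" need a DatetimeIndex, so roll on a timestamp-indexed view
df = df.sort_values(["user_id", "timestamp"])
rolled = df.set_index("timestamp").groupby("user_id")["order_id"].rolling("7D").count()
df["rolling_7d_orders"] = rolled.to_numpy()  # same row order: user_id, then timestamp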

Q12. What is target encoding? What are its risks and how do you mitigate them?

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

def target_encode_cv(df, col, target, n_splits=5, k=10):
    """Out-of-fold target encoding with smoothing to prevent leakage."""
    df = df.copy()  # avoid mutating the caller's frame
    df["te_" + col] = np.nan
    df["te_" + col + "_smooth"] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for tr_idx, val_idx in kf.split(df):
        train, val_rows = df.iloc[tr_idx], df.index[val_idx]
        # Compute statistics from the TRAINING fold only
        fold_mean = train[target].mean()
        stats = train.groupby(col)[target].agg(["mean", "count"])
        # Smoothing: shrink toward the fold mean for rare categories (k = smoothing factor)
        smooth = (stats["count"] * stats["mean"] + k * fold_mean) / (stats["count"] + k)
        df.loc[val_rows, "te_" + col] = df.loc[val_rows, col].map(stats["mean"])
        # Categories unseen in the training fold fall back to the fold mean
        df.loc[val_rows, "te_" + col + "_smooth"] = (
            df.loc[val_rows, col].map(smooth).fillna(fold_mean)
        )
    return df

Risks: target leakage (seeing test labels during encoding) and overfitting on low-frequency categories. Mitigations: CV encoding + smoothing.


Q13. How do you handle high-cardinality categorical features?

| Method | When to use |
| --- | --- |
| Target encoding with smoothing | Tree models, moderate cardinality |
| Frequency encoding | High cardinality, no target correlation |
| Embeddings (Entity embedding) | Neural networks, very high cardinality |
| Hashing trick | Memory-constrained, very high cardinality |
| Group rare categories | Tail categories with < 1% frequency |

# Frequency encoding
freq_map = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq_map)

# Group tail categories
top_cities = df["city"].value_counts().head(50).index
df["city_grouped"] = df["city"].where(df["city"].isin(top_cities), other="OTHER")

Q14. Compare logistic regression, random forest, and gradient boosting for a classification task.

| Property | Logistic Regression | Random Forest | Gradient Boosting (XGBoost) |
| --- | --- | --- | --- |
| Training speed | Fast | Medium (parallel) | Slower (sequential) |
| Interpretability | High (coefficients) | Medium (feature importance) | Medium |
| Handles non-linearity | No (without features) | Yes | Yes |
| Handles missing values | No (needs imputation) | No | Yes (XGBoost native) |
| Best for | Sparse features, text, baseline | Balanced tabular data | Competitions, complex tabular |
| Regularization | L1/L2 | Implicit via subsampling | L1/L2 + tree regularization |

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

models = {
    "LogReg": LogisticRegression(C=1.0, max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, max_depth=10, n_jobs=-1, random_state=42),
    "GBM": GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.4f}")

Q15. How do you deal with class imbalance in a binary classification problem?

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

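# Option 1: class weighting (no resampling needed)
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight
lr = LogisticRegression(class_weight='balanced')
# GradientBoostingClassifier has no class_weight parameter; pass sample_weight to fit() instead
model = GradientBoostingClassifier()
model.fit(X_train, y_train, sample_weight=compute_sample_weight('balanced', y_train))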

# Option 2: SMOTE oversampling
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Option 3: Pipeline with SMOTE + undersampling
pipeline = ImbPipeline([
    ('over', SMOTE(sampling_strategy=0.3)),
    ('under', RandomUnderSampler(sampling_strategy=0.5)),
    ('model', GradientBoostingClassifier()),
])

# Option 4: Change decision threshold
probs = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # lower threshold increases recall for minority class
preds = (probs >= threshold).astype(int)
print(classification_report(y_test, preds))

Evaluation for imbalanced data: use PR-AUC, F1-score, recall at fixed precision. Not accuracy.
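Why accuracy misleads on imbalanced data is easiest to see with a classifier that never predicts the positive class. The sketch below uses a made-up 1% positive rate and hand-rolled metric helpers, so no library behaviour is assumed.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy_recall_f1(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, recall, f1

# 10 positives among 1000 rows; the "model" always predicts the majority class
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000
acc, rec, f1 = accuracy_recall_f1(y_true, always_negative)
print(f"accuracy={acc:.2f} recall={rec:.2f} f1={f1:.2f}")  # accuracy=0.99 recall=0.00 f1=0.00
```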


Q16. What is feature importance? Compare SHAP vs permutation importance vs model-native importance.

import shap
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance

# Model-native (impurity-based) -- fast but biased toward high-cardinality
feature_imp_native = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

# Permutation importance -- model-agnostic, uses held-out data
perm_imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
feature_imp_perm = pd.Series(perm_imp.importances_mean, index=X_train.columns)

# SHAP (best quality -- game-theoretic attribution)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=X_train.columns)
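
| Method | Bias | Speed | Use case |
| --- | --- | --- | --- |
| Native (impurity) | High-cardinality bias; computed on training data | Fastest | Quick sanity check |
| Permutation | Low on held-out data, but can understate correlated features | Medium | Model-agnostic, production |
| SHAP | Low (consistent attributions), but correlated features share credit | Slow | Explanation, debugging, fairness |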

Q17. What is cross-validation? Describe k-fold, stratified k-fold, and time-series CV.

from sklearn.model_selection import (
    KFold, StratifiedKFold, TimeSeriesSplit, cross_val_score
)
import numpy as np

# Standard K-Fold (regression)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Stratified K-Fold (classification -- preserves class ratio per fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

# Time-Series CV (CRITICAL: never shuffle time series data)
tscv = TimeSeriesSplit(n_splits=5, gap=1)  # gap prevents leakage
ts_scores = cross_val_score(model, X, y, cv=tscv, scoring='neg_mean_squared_error')

print(f"AUC: {auc_scores.mean():.4f} +/- {auc_scores.std():.4f}")

Key rule for time series: always split chronologically. Shuffled CV on time series gives optimistic scores because it lets future data inform past predictions.
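The chronological rule can be sketched without any library. This is a simplified expanding-window splitter in the spirit of `TimeSeriesSplit`, not a reimplementation of it; the function name and the fold-size rule are assumptions made for the example.

```python
def expanding_window_splits(n_samples: int, n_splits: int, gap: int = 0):
    """Yield (train_indices, test_indices) with every test fold strictly after its train fold."""
    test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        test_start = i * test_size
        train_end = test_start - gap  # leave `gap` rows unused just before the test fold
        yield list(range(0, train_end)), list(range(test_start, test_start + test_size))

for train_idx, test_idx in expanding_window_splits(n_samples=12, n_splits=3, gap=1):
    print(f"train={train_idx} test={test_idx}")
```

Every training index is smaller than every test index in the same split, so no future row can inform a prediction about the past.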


Q18. What is dimensionality reduction? Compare PCA vs t-SNE vs UMAP.

| Method | Type | Preserves | Speed | Scalability |
| --- | --- | --- | --- | --- |
| PCA | Linear | Global variance, distances | Fast | High (SVD) |
| t-SNE | Non-linear | Local neighborhood structure | Slow O(N^2) | Low (<50K) |
| UMAP | Non-linear | Local + some global structure | Medium | High (millions) |

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# PCA for preprocessing (retain 95% variance)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Components needed: {pca.n_components_}")

# t-SNE for visualization only (do not use for downstream ML)
tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_jobs=-1)
X_2d = tsne.fit_transform(X_scaled[:5000])  # subsample for speed

# UMAP for both visualization and downstream use
reducer = umap.UMAP(n_components=10, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

Q19. How do you build a churn prediction model? Walk through the full pipeline.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve
import xgboost as xgb

# 1. Feature engineering from raw subscription data
def build_features(df):
    df["days_since_last_login"] = (pd.Timestamp.now() - df["last_login"]).dt.days
    df["avg_sessions_per_week"] = df["total_sessions"] / (df["tenure_days"] / 7 + 1)
    df["support_ticket_rate"] = df["support_tickets"] / (df["tenure_days"] + 1)
    df["plan_downgrade"] = (df["current_plan_value"] < df["signup_plan_value"]).astype(int)
    return df

# 2. Preprocessing pipeline (include the engineered features from step 1)
num_features = [
    "days_since_last_login", "avg_sessions_per_week", "tenure_days",
    "support_ticket_rate", "plan_downgrade",
]
cat_features = ["plan_type", "acquisition_channel", "country"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_features),
])

# 3. Model pipeline
pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", xgb.XGBClassifier(
        n_estimators=500, learning_rate=0.05, max_depth=6,
        scale_pos_weight=10,  # class imbalance: roughly n_negative / n_positive
        eval_metric='logloss', random_state=42,  # use_label_encoder is deprecated and not needed
    )),
])

pipeline.fit(X_train, y_train)
auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"Churn model AUC: {auc:.4f}")

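# 4. Business threshold -- target 80% recall on churners
prec, rec, thresholds = precision_recall_curve(y_test, pipeline.predict_proba(X_test)[:, 1])
# prec/rec have one more entry than thresholds, and recall falls as the threshold rises,
# so take the highest threshold that still keeps recall >= 0.80
target_thresh = thresholds[rec[:-1] >= 0.80][-1]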

Q20. How do you design and analyze an A/B test?

from scipy import stats
import numpy as np
from statsmodels.stats.power import NormalIndPower

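# Step 1: Power analysis (before experiment)
# NormalIndPower expects a standardized effect size (Cohen's h for proportions),
# not the relative lift, so convert the 15% -> 17% change first
from statsmodels.stats.proportion import proportion_effectsize
effect_size = proportion_effectsize(0.17, 0.15)  # 2-point absolute lift on a 15% baseline
power_analysis = NormalIndPower()
n_per_group = power_analysis.solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"Required sample per group: {n_per_group:.0f}")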

# Step 2: Run experiment (collect data for required duration)

# Step 3: Statistical test
control_conversions = 1520  # out of 10000
treatment_conversions = 1690  # out of 10000
n = 10000

p_control = control_conversions / n
p_treatment = treatment_conversions / n

# Two-proportion z-test
count = np.array([treatment_conversions, control_conversions])
nobs = np.array([n, n])
from statsmodels.stats.proportion import proportions_ztest
stat, p_value = proportions_ztest(count, nobs)

print(f"Control rate: {p_control:.4f}")
print(f"Treatment rate: {p_treatment:.4f}")
print(f"Relative lift: {(p_treatment - p_control) / p_control * 100:.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")

Common pitfalls: peeking (stopping early when significant), SRM (sample ratio mismatch), novelty effect, SUTVA violations (network effects in social products).
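Of these pitfalls, SRM is the one that is cheap to check automatically. The sketch below runs a chi-square goodness-of-fit test on the observed split, assuming an intended 50/50 allocation; the function name, the traffic counts, and the 0.001 alert threshold are assumptions for illustration.

```python
import math

def srm_p_value(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = n_control + n_treatment
    exp_control = total * expected_ratio
    exp_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, P(chi2 >= x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

p = srm_p_value(n_control=10_000, n_treatment=10_500)
print(f"SRM p-value: {p:.5f}")
if p < 0.001:  # strict threshold chosen for this sketch
    print("Possible sample ratio mismatch: check assignment and logging before reading results")
```

A failed SRM check usually points at a bug in assignment or logging, so the conversion comparison should wait until the split is explained.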


Q21. What is multicollinearity? How do you detect and address it?

Multicollinearity: high correlation between predictor variables. Inflates coefficient standard errors in linear models, making interpretation unreliable.

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

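# Detection: Variance Inflation Factor (VIF)
def calculate_vif(df):
    # statsmodels does not add an intercept; without one the VIFs are misleading
    X = df.assign(const=1.0)  # constant is appended as the last column
    vif_data = pd.DataFrame()
    vif_data["feature"] = df.columns
    vif_data["VIF"] = [
        variance_inflation_factor(X.values, i)
        for i in range(len(df.columns))  # original features only, constant skipped
    ]
    return vif_data.sort_values("VIF", ascending=False)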

vif_df = calculate_vif(X_train)
print(vif_df)
# VIF > 10: severe multicollinearity
# VIF 5-10: moderate, investigate
# VIF < 5: acceptable

Solutions:

  • Drop one of highly correlated feature pairs.
  • Use PCA to create orthogonal components.
  • Use Ridge regression (L2) which handles multicollinearity gracefully.
  • Tree-based models are inherently robust to multicollinearity.

Q22. Explain gradient boosting. How does XGBoost improve on vanilla gradient boosting?

Gradient boosting builds an ensemble additively: each tree fits the negative gradient (pseudo-residuals) of the loss function from the previous ensemble.

F_m(x) = F_{m-1}(x) + nu * h_m(x)

where h_m is a tree fit to the pseudo-residuals -[dL/dF] evaluated at F = F_{m-1}, and nu is the learning rate.
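The update rule can be traced with a from-scratch sketch. It assumes squared-error loss, where the negative gradient is simply y - F, and uses one-split stumps on a single feature as the weak learner; the toy data and all function names are illustrative, and this is not how XGBoost or scikit-learn implement it.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # last value excluded so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=50, nu=0.1):
    """F_m = F_{m-1} + nu * h_m, with h_m fit to the negative gradient of 0.5 * (y - F)^2."""
    f0 = sum(y) / len(y)  # F_0: constant prediction
    preds = [f0] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient = y - F
        h = fit_stump(x, residuals)
        stumps.append(h)
        preds = [pi + nu * h(xi) for xi, pi in zip(x, preds)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    predict = lambda xi: f0 + nu * sum(h(xi) for h in stumps)
    return predict, mse_history

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]  # toy step-shaped target
predict, mse_history = gradient_boost(x, y)
print(f"MSE after round 1: {mse_history[0]:.4f}, after round 50: {mse_history[-1]:.4f}")
```

Each round removes only a fraction nu of the remaining residual, which is why a small learning rate needs more trees.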

XGBoost improvements over sklearn GBM:

| Feature | sklearn GBM | XGBoost |
| --- | --- | --- |
| Speed | Single-threaded | Parallelized split finding (column blocks) |
| Regularization | None on leaf weights | L1 (alpha) + L2 (lambda) on leaf weights |
| Missing values | Manual imputation needed | Native sparse handling |
| Tree structure | Depth-limited growth | Depth-wise by default; leaf-wise (as in LightGBM) via grow_policy="lossguide" |
| Second-order optimization | First-order gradients only | Hessian (second-order) for splits |

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=5,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,    # L1
    reg_lambda=1.0,   # L2
    early_stopping_rounds=50,
    eval_metric='auc',
    random_state=42,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)

HARD: System Design and Case Studies (Questions 23-30)

Q23. Design a demand forecasting system for an e-commerce platform.

Problem: Forecast daily demand per SKU x warehouse for 90-day horizon
Scale: 500K SKUs, 20 warehouses = 10M time series

Data sources:
  - Historical sales (3 years)
  - Price and promotions calendar
  - Category metadata (brand, subcategory, weight)
  - External: Google Trends, holiday calendar, weather

Feature engineering:
  - Lag features: sales_t-1, t-7, t-14, t-28, t-365
  - Rolling statistics: 7d/28d/90d rolling mean/std/median
  - Calendar: day_of_week, month, is_holiday, days_to_event
  - Price elasticity: log(price / rolling_avg_price)
  - Trend: linear time index, Fourier features for seasonality

Model stack:
  - Global LightGBM (one model, all SKUs -- features carry SKU identity)
  - Fine-tune with per-category models for top-volume SKUs
  - Uncertainty: quantile regression (p10, p50, p90)

Evaluation:
  - WMAPE (weighted mean absolute percentage error) per category
  - Pinball loss at p10/p90 for uncertainty calibration
  - Business metric: in-stock rate, overstock cost

Infrastructure:
  - Training: weekly batch on Databricks (Spark for feature compute)
  - Serving: pre-computed forecasts stored in DynamoDB/BigQuery
  - Monitoring: WMAPE drift alerts, auto-retrain trigger on >5% degradation
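The two evaluation metrics named in the design can be written in a few lines. This is a minimal sketch with made-up daily unit counts for one SKU; the function names are illustrative.

```python
def wmape(actual, forecast) -> float:
    """Sum of absolute errors divided by the sum of absolute actuals."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def pinball_loss(actual, forecast, quantile: float) -> float:
    """Mean quantile loss: under-forecasts are weighted by q, over-forecasts by 1 - q."""
    losses = [
        quantile * (a - f) if a >= f else (1 - quantile) * (f - a)
        for a, f in zip(actual, forecast)
    ]
    return sum(losses) / len(losses)

actual = [100, 200, 0, 50]   # illustrative daily units
p50 = [110, 180, 5, 50]      # median forecast
p90 = [130, 230, 10, 70]     # 90th-percentile forecast
print(f"WMAPE: {wmape(actual, p50):.3f}")
print(f"Pinball@0.9: {pinball_loss(actual, p90, 0.9):.3f}")
```

Unlike plain MAPE, WMAPE stays defined when individual days have zero sales, which matters for slow-moving SKUs.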

Q24. A product metric dropped 20% overnight. How do you diagnose it?

Systematic debugging framework:

  1. Validate the data: is the metric genuinely down or is it a logging/instrumentation issue?

    • Check event volume vs previous same-day-of-week.
    • Check for any dashboarding or SQL query changes.
    • Compare raw event counts vs derived metric.
  2. Scope the drop: which segment is affected?

    • Slice by platform (iOS/Android/Web).
    • Slice by geography, user cohort, acquisition channel.
    • Slice by feature area (checkout, homepage, search).
  3. Check for external events:

    • Any deployments in the past 24 hours?
    • Any third-party API outages?
    • Any marketing campaigns ending?
    • Any seasonal patterns (Monday always lower)?
  4. Correlate with other metrics:

    • Did session volume drop (traffic problem) or conversion rate drop (funnel problem)?
    • Did error rates increase?
    • Did A/B test traffic allocation change?
  5. Root cause and fix:

    • Logging bug: fix and backfill.
    • Feature regression: rollback and investigate.
    • Real user behavior change: investigate user feedback, reviews.
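Step 2 of the framework (scoping the drop) can be sketched as a per-segment comparison. The platform counts below are made up, and `rank_segments_by_drop` is a hypothetical helper; in practice the two dictionaries would come from a grouped query over the event table.

```python
def rank_segments_by_drop(before: dict, after: dict):
    """Return (segment, change, share_of_total_change) rows, most negative change first."""
    total_change = sum(after.values()) - sum(before.values())
    rows = []
    for segment in before:
        change = after.get(segment, 0) - before[segment]
        share = change / total_change if total_change else 0.0
        rows.append((segment, change, share))
    return sorted(rows, key=lambda row: row[1])

# Made-up daily order counts per platform: total falls from 10,000 to 8,000 (a 20% drop)
before = {"ios": 4000, "android": 5000, "web": 1000}
after = {"ios": 3950, "android": 3100, "web": 950}
for segment, change, share in rank_segments_by_drop(before, after):
    print(f"{segment:<8} change={change:+6d} share_of_drop={share:.0%}")
```

When one segment carries almost all of the change, as Android does here, the next step is to check deployments and logging for that segment specifically.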

Q25. How do you build a real-time fraud detection model?

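import pandas as pd

# Feature engineering for transaction fraud.
# Assumptions: user_history has a sorted DatetimeIndex, tx carries a "timestamp",
# and haversine_distance is a helper defined elsewhere.
def compute_transaction_features(tx, user_history):
    # Anchor the windows on the current transaction time
    # (DataFrame.last() is deprecated and anchors on the last history row instead)
    last_1h = user_history[user_history.index >= tx["timestamp"] - pd.Timedelta(hours=1)]
    last_24h = user_history[user_history.index >= tx["timestamp"] - pd.Timedelta(hours=24)]
    return {
        # Velocity features (key for fraud)
        "txn_count_1h": last_1h.shape[0],
        "txn_amount_1h": last_1h["amount"].sum(),
        "unique_merchants_24h": last_24h["merchant_id"].nunique(),

        # Deviation features
        "amount_vs_avg_ratio": tx["amount"] / (user_history["amount"].mean() + 1),
        "hour_vs_usual": abs(tx["hour"] - user_history["hour"].mode()[0]),

        # Device / location features
        "new_device": 1 if tx["device_id"] not in user_history["device_id"].values else 0,
        "new_location": 1 if tx["country"] not in user_history["country"].values else 0,
        "distance_from_last_txn_km": haversine_distance(tx, user_history.iloc[-1]),
    }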

Production architecture:

  • Real-time feature store: Redis for velocity counters (INCR + TTL).
  • Model serving: LightGBM loaded in FastAPI, P99 latency <10ms.
  • Threshold: tune for 0.1% FPR (fraud blocking must not block legit users).
  • Human review queue: scores in [0.4, 0.7] range.
  • Feedback loop: reviewed transactions feed back to training data weekly.
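Two pieces of this architecture can be sketched in plain Python: a sliding-window velocity counter and score routing. The in-memory deque below stands in for the Redis counters and is not a Redis client; it assumes events arrive in timestamp order. The review band [0.4, 0.7] comes from the list above, while approving below it and blocking above it are assumptions.

```python
from collections import defaultdict, deque

class VelocityCounter:
    """In-memory stand-in for per-user sliding-window transaction counters."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps still inside the window

    def record(self, user_id: str, timestamp: float) -> int:
        """Add an event and return the user's count in the window ending at `timestamp`."""
        events = self._events[user_id]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:  # evict events that fell out of the window
            events.popleft()
        return len(events)

def route_by_score(score: float, review_low: float = 0.4, review_high: float = 0.7) -> str:
    """Scores inside [review_low, review_high] go to human review."""
    if score < review_low:
        return "approve"
    if score <= review_high:
        return "review"
    return "block"

counter = VelocityCounter(window_seconds=3600)
for ts in (0, 600, 1200, 4000):
    print(f"t={ts:>4}s  txn_count_1h={counter.record('user_1', ts)}")
print(route_by_score(0.55))  # review
```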

Q26. What is the difference between parametric and non-parametric statistical tests?

| Property | Parametric | Non-parametric |
| --- | --- | --- |
| Assumption | Normal distribution, equal variance | No distributional assumption |
| Examples | t-test, ANOVA, Pearson correlation | Mann-Whitney U, Kruskal-Wallis, Spearman |
| Power | Higher (when assumptions hold) | Lower but robust |
| Sample size | Small n is fine only if the assumptions hold | Safer for small or non-normal samples |

from scipy import stats

# Parametric: independent samples t-test
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

# Non-parametric equivalent: Mann-Whitney U
u_stat, p_val_mw = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# When to use non-parametric:
# - Small samples (n < 30)
# - Ordinal data (ratings 1-5)
# - Heavy tails / outliers
# - Confirmed non-normal distribution

# Test for normality
stat, p_normal = stats.shapiro(group_a)  # H0: data is normal
print(f"Shapiro p={p_normal:.4f}, normal={'yes' if p_normal > 0.05 else 'no'}")

Q27. How do you monitor a deployed ML model for data drift?

import numpy as np
from evidently.report import Report  # import paths vary across Evidently releases
from evidently.metric_preset import DataDriftPreset, ClassificationPreset

# Reference data (training distribution)
reference_df = train_data.copy()
# Current data (last 7 days of production)
current_df = production_data.copy()

# Statistical drift report
report = Report(metrics=[
    DataDriftPreset(),         # PSI / KS test per feature
    ClassificationPreset(),    # Precision, recall, F1 vs baseline
])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("/tmp/drift_report.html")

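# Custom PSI for numerical features
def psi(expected, actual, n_bins=10):
    # Bin edges from the reference quantiles; drop duplicate edges caused by ties
    bins = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    # Clip production values into the reference range so out-of-range values
    # land in the outer bins instead of being silently dropped
    actual = np.clip(actual, bins[0], bins[-1])
    exp_pct = np.histogram(expected, bins=bins)[0] / len(expected)
    act_pct = np.histogram(actual, bins=bins)[0] / len(actual)
    # Avoid log(0) and division by zero for empty bins
    exp_pct = np.clip(exp_pct, 1e-8, None)
    act_pct = np.clip(act_pct, 1e-8, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))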

# PSI thresholds: <0.1 stable, 0.1-0.2 slight drift, >0.2 significant drift
for col in num_features:
    score = psi(reference_df[col].dropna(), current_df[col].dropna())
    if score > 0.2:
        print(f"ALERT: {col} PSI={score:.3f} -- significant drift")
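
PSI is often paired with a two-sample Kolmogorov-Smirnov test. The sketch below uses `scipy.stats.ks_2samp` on synthetic data. The helper name `ks_drift` and both thresholds (`alpha`, `min_stat`) are assumptions for illustration, not standard values. The extra check on the KS statistic is there because, with large production samples, the p-value alone flags shifts too small to matter.

```python
import numpy as np
from scipy import stats


def ks_drift(reference, current, alpha=0.01, min_stat=0.1):
    """Flag drift when the KS test rejects equality AND the gap is material.

    alpha and min_stat are illustrative assumptions, not standard thresholds.
    """
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Drop NaNs before testing
    reference = reference[~np.isnan(reference)]
    current = current[~np.isnan(current)]
    result = stats.ks_2samp(reference, current)
    drifted = (result.pvalue < alpha) and (result.statistic > min_stat)
    return {
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "drifted": bool(drifted),
    }


rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, size=2000)    # synthetic reference sample
stable_feature = rng.normal(0.0, 1.0, size=2000)   # same distribution
shifted_feature = rng.normal(1.0, 1.0, size=2000)  # mean shifted by one standard deviation

print(ks_drift(train_feature, stable_feature))
print(ks_drift(train_feature, shifted_feature))
```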

Q28. Design a recommendation system for an OTT platform with 50M users and 10K titles.

Problem: Rank 10K titles per user at page load time (<100ms)

Stage 1 -- Retrieval (10K -> 500 candidates):
  - Collaborative filtering: Matrix factorization (ALS via Spark)
  - Content-based: Title embeddings (title, genre, cast) via MLP
  - Trending: Popularity signals (last 7d views) per genre x region
  - Recency: Last 5 watched genres as query vector

Stage 2 -- Ranking (500 -> 20 final):
  - Two-tower model: user tower + item tower -> dot product score
  - Features: user history, device context, time of day, completion rate
  - Fine-grained: XGBoost re-ranker with engagement signals
  - Business rules: new releases boost, licensed content expiry boost

Stage 3 -- Post-processing:
  - Diversity: MMR (Maximal Marginal Relevance) to avoid filter bubbles
  - Freshness: penalize already-seen titles
  - Experimentation: holdout buckets for A/B testing new models
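
The MMR diversity step in Stage 3 can be sketched as a greedy re-ranker. This is a minimal illustration under assumptions: `mmr_rerank` is a hypothetical helper, item similarity is taken to be cosine similarity over item embeddings, and the `lambda_` weight is arbitrary. A production version would run over the ~20-50 top-ranked candidates, not the full catalogue.

```python
import numpy as np


def mmr_rerank(relevance, item_vectors, k=5, lambda_=0.7):
    """Greedy Maximal Marginal Relevance selection.

    relevance: 1-D array of ranker scores; item_vectors: 2-D array of item embeddings.
    lambda_ trades relevance (1.0) against diversity (0.0); 0.7 is an arbitrary choice.
    """
    relevance = np.asarray(relevance, dtype=float)
    vectors = np.asarray(item_vectors, dtype=float)
    # Cosine similarity between all candidate pairs
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T

    selected = [int(np.argmax(relevance))]  # start from the most relevant item
    remaining = set(range(len(relevance))) - set(selected)
    while remaining and len(selected) < k:
        candidates = sorted(remaining)
        # Redundancy = similarity to the closest already-selected item
        redundancy = similarity[np.ix_(candidates, selected)].max(axis=1)
        scores = lambda_ * relevance[candidates] - (1 - lambda_) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy catalogue: items 0 and 1 are near-duplicates, item 2 is a different genre
scores = [0.90, 0.85, 0.80, 0.30]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(mmr_rerank(scores, embeddings, k=3, lambda_=0.5))  # → [0, 2, 1]
```

With `lambda_=1.0` the function reduces to plain relevance ordering, which is a handy sanity check when tuning the trade-off.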

Serving:
  - Precomputed candidate sets refreshed hourly (Spark batch)
  - Online ranking: ONNX two-tower model served on Triton, <30ms P99
  - Feature store: Redis (user history, last-session features)
  - Fallback: popularity-based ranking for cold-start users

Evaluation:
  - Offline: NDCG@20, Hit Rate@20 on held-out interactions
  - Online: watch-start rate, completion rate, session depth
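
The offline metrics named above can be computed per user as follows. This is a minimal sketch assuming binary relevance (a held-out title was either watched or not). The function names are illustrative, and a real evaluation would average these values over all users in the held-out set.

```python
import numpy as np


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(position + 1)
    return float(np.sum(rel / discounts))


def ndcg_at_k(ranked_items, relevant_items, k=20):
    """Binary-relevance NDCG@k for one user."""
    relevant = set(relevant_items)
    gains = [1.0 if item in relevant else 0.0 for item in ranked_items[:k]]
    # Ideal ranking puts every relevant item at the top
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def hit_rate_at_k(ranked_items, relevant_items, k=20):
    """1.0 if any held-out item appears in the top k, else 0.0."""
    return 1.0 if set(ranked_items[:k]) & set(relevant_items) else 0.0


recommended = ["t7", "t3", "t9", "t1"]  # hypothetical title ids, best first
held_out = ["t3"]                       # what the user actually watched
print(ndcg_at_k(recommended, held_out, k=3))
print(hit_rate_at_k(recommended, held_out, k=3))
```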

FAQ

Q: What Python libraries should I master for data science interviews? A: pandas (data manipulation), NumPy (numerical operations), scikit-learn (ML), XGBoost/LightGBM (boosting), matplotlib/seaborn (visualization), scipy.stats (statistical tests). Candidate reports in public preparation resources suggest these six cover the vast majority of hands-on coding rounds.

Q: How important is SQL in data science interviews vs Python? A: Both are tested and both can be eliminators. Many companies start with a SQL screen. Window functions, CTEs, joins, and aggregations are the core of DS SQL rounds. Confirm the specific interview structure on the official company careers portal.

Q: Should I memorize ML algorithm math? A: Derivation-level math is asked for research scientist roles and at FAANG. For most DS roles, you need conceptual clarity on bias-variance, loss functions, gradient descent, and evaluation metrics -- not full proofs. Tree-based model internals (Gini impurity, information gain, split criteria) are frequently asked at all levels.
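
For the tree internals mentioned in the last answer, the two split criteria fit in a few lines. This is a minimal sketch for revision purposes. The function names are illustrative, and libraries such as scikit-learn compute the same quantities internally with more optimised code.

```python
import numpy as np


def gini_impurity(labels):
    """1 - sum(p_c^2) over the class proportions p_c."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def entropy(labels):
    """-sum(p_c * log2(p_c)) over the class proportions p_c."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]
print(gini_impurity(parent))                      # → 0.5
print(information_gain(parent, [0, 0], [1, 1]))   # → 1.0
```

A perfect split of a balanced binary node recovers the full 1 bit of parent entropy, which is the largest gain available for two classes.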

Methodology applied to this article (last verified 8 Jun 2026)
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
