Data Science Interview Questions 2026: 30 Answers with Code

30 data science interview questions with full code answers covering statistics, EDA, feature engineering, ML models, A/B testing, and end-to-end data science case studies for 2026.

By Aditya Sharma. Published 8 Jun 2026; last revised 8 Jun 2026. 8 min read.

Data science remains one of the highest-demand technical roles in 2026, with strong hiring across fintech, e-commerce, healthtech, and product companies. Data scientist interviews test a broad stack: statistics, SQL, Python, ML modeling, experiment design, and product thinking. This guide covers 30 data science interview questions with full answers and code, organized from foundations to system design.

PapersAdda's take: Candidates report that statistics and A/B testing questions are the most common elimination filters in data science interviews at product companies like Flipkart, Swiggy, PhonePe, and FAANG India. SQL window function problems appear in over 65% of DS first rounds. Confirm the specific tools and data stack expected on the official company careers portal before you prepare.

Which Companies Hire for Data Science Roles?

Sector	Companies	DS Focus
Fintech	Razorpay, PhonePe, Groww	Fraud detection, credit risk, product analytics
E-commerce	Flipkart, Meesho, Zepto, Amazon India	Recommendation, pricing, supply chain
Ride-hailing/food	Ola, Uber, Swiggy, Zomato	Surge pricing, demand forecasting, ETA
FAANG India	Google, Meta, Microsoft	Product analytics, ads, ML platform
Healthcare	1mg, Practo, Aarogya	Clinical ML, drug interaction, patient journey

EASY: Statistics and Python Foundations (Questions 1-10)

Q1. What is the difference between Type I and Type II errors? How do they relate to significance level and power?

Error Type	Definition	Cost
Type I (alpha)	Reject true null hypothesis (false positive)	False alarm
Type II (beta)	Fail to reject false null hypothesis (false negative)	Missed detection

Significance level alpha: P(Type I error) -- typically 0.05. Lower alpha = fewer false positives but more false negatives.
Power = 1 - beta: probability of detecting a true effect. Target >= 0.80.
Trade-off: decreasing alpha increases beta (for fixed sample size). Only increasing sample size reduces both.

import math
from statsmodels.stats.power import TTestIndPower

# Power analysis for a two-sample t-test
analysis = TTestIndPower()
# Find required sample size for effect_size=0.3 (Cohen's d), alpha=0.05, power=0.80
n = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.80, ratio=1.0)
# Round up: solve_power returns a fractional n, and rounding down would lose power
print(f"Required n per group: {math.ceil(n)}")  # 176

Q2. Explain p-value. A p-value of 0.03 -- what does it mean?

The p-value is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true.

p = 0.03 means: if there were truly no effect (null is true), there is a 3% chance of getting this result or something more extreme by random chance.

Common misinterpretations (don't say these in interviews):

"Probability the null is true" -- WRONG.
"Probability the effect is real" -- WRONG.
"Effect is practically significant" -- WRONG (large N can yield tiny p for negligible effects).

Practical guidance: p < alpha is a decision rule, not a measure of effect size. Always report effect size (Cohen's d, eta^2) alongside p-value.
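To make the "large N, tiny p" point concrete, here is a minimal stdlib sketch. It assumes a two-sample z-test with a known common standard deviation, and the numbers are hypothetical, chosen only for illustration.

```python
from math import erfc, sqrt


def two_sample_z_pvalue(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = abs(mean_diff) / standard_error
    return erfc(z / sqrt(2))  # two-sided tail area of the standard normal


# Hypothetical effect: a 0.2-unit difference on a scale with SD 10 (Cohen's d = 0.02).
mean_diff, sd = 0.2, 10.0
cohens_d = mean_diff / sd

p_small_n = two_sample_z_pvalue(mean_diff, sd, n_per_group=1_000)
p_large_n = two_sample_z_pvalue(mean_diff, sd, n_per_group=100_000)

print(f"Cohen's d = {cohens_d:.2f}")
print(f"n = 1,000 per group:   p = {p_small_n:.3f}")
print(f"n = 100,000 per group: p = {p_large_n:.2e}")
```

The effect size is identical in both runs; only the sample size changes, and with it the p-value. That is why the effect size should be reported next to p.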

Q3. What is the Central Limit Theorem? Why is it important for data science?

The CLT states: given a population with mean mu and finite variance sigma^2, the sampling distribution of the sample mean approaches N(mu, sigma^2/n) as n increases -- regardless of the population's original distribution.

Practical importance:

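Justifies using z-tests and t-tests on means even when the population is skewed (n >= 30 is a common rule of thumb; heavily skewed or heavy-tailed data needs more).
Underpins confidence interval construction.
Explains why bootstrapped distributions of sample means look approximately normal.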

import numpy as np
import matplotlib.pyplot as plt

# Demonstrate CLT on exponential population
np.random.seed(42)
pop = np.random.exponential(scale=2, size=100_000)  # right-skewed

sample_means = [np.mean(np.random.choice(pop, 50)) for _ in range(10_000)]

print(f"Population mean: {pop.mean():.3f}")
print(f"Sample mean distribution mean: {np.mean(sample_means):.3f}")
print(f"Sample mean distribution std: {np.std(sample_means):.3f}")
print(f"Expected std (sigma/sqrt(n)): {pop.std()/np.sqrt(50):.3f}")
# Distribution of sample_means is approximately Normal

Q4. What is Bayes' theorem? Give a data science example.

P(A|B) = P(B|A) * P(A) / P(B)

Example -- spam detection:

P(spam) = 0.3 (prior: 30% of emails are spam)
P("free" | spam) = 0.8 (likelihood: 80% of spam contains "free")
P("free" | not spam) = 0.1 (10% of legit emails contain "free")
P("free") = 0.8 * 0.3 + 0.1 * 0.7 = 0.31

P(spam | "free") = (0.8 * 0.3) / 0.31 = 0.774

An email containing "free" is 77% likely spam.

In ML: Naive Bayes classifier, Bayesian hyperparameter optimization, posterior inference.
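The spam calculation above can be checked in a few lines. This is a small sketch; the function name and the second, rarer-prior scenario are illustrative assumptions rather than part of the original example.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """P(A|B) via Bayes' theorem, expanding P(B) with the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Numbers from the spam example: P(spam)=0.3, P("free"|spam)=0.8, P("free"|not spam)=0.1
p_spam_given_free = posterior(prior=0.3, likelihood=0.8, false_positive_rate=0.1)
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # 0.774

# Hypothetical variation: same evidence, but only 1% of emails are spam
p_rare = posterior(prior=0.01, likelihood=0.8, false_positive_rate=0.1)
print(f"With a 1% prior: {p_rare:.3f}")  # 0.075
```

The second call shows why the prior matters: the same word is far weaker evidence when spam is rare, which is the base-rate effect interviewers often probe.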

Q5. What is the difference between correlation and causation? Give an example.

Correlation: statistical relationship between two variables. Correlation coefficient r measures linear strength (-1 to +1).

Causation: change in X directly produces change in Y (causal mechanism).

Classic examples of correlation without causation:

Ice cream sales and drowning rates are correlated -- both caused by hot weather (confounding variable).
Shoe size and reading ability in children -- both caused by age.

Establishing causation in data science:

RCT (A/B test): gold standard -- randomize treatment assignment.
Difference-in-differences: compare treated vs control group changes over time.
Instrumental variables: use a variable that affects X but not Y directly.
Propensity score matching: control for confounders in observational data.

Q6. How do you handle missing data? What are the different strategies?

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv("data.csv")

# 1. Drop rows/columns (only for MCAR and low % missing)
df_dropped = df.dropna(thresh=int(0.7 * len(df.columns)), axis=0)

# 2. Simple imputation
imputer_mean = SimpleImputer(strategy="mean")   # numerical
imputer_mode = SimpleImputer(strategy="most_frequent")  # categorical

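# 3. KNN imputation (fills from similar rows; distance-based, so scale features first)
num_df = df.select_dtypes(include='number')
knn_imputer = KNNImputer(n_neighbors=5)
# Keep column names and index; fit_transform returns a bare NumPy array
df_knn = pd.DataFrame(knn_imputer.fit_transform(num_df), columns=num_df.columns, index=num_df.index)

# 4. MICE / Iterative imputation (models each feature from the others; usually the slowest option)
mice = IterativeImputer(random_state=42, max_iter=10)
df_mice = pd.DataFrame(mice.fit_transform(num_df), columns=num_df.columns, index=num_df.index)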

# 5. Add missingness indicator (important! tells model WHERE data was missing)
df["age_missing"] = df["age"].isna().astype(int)

MCAR/MAR/MNAR: missing completely at random (safe to drop), missing at random (impute), missing not at random (requires domain modeling).

Q7. What is the difference between variance and standard deviation? When do you use each?

Variance = E[(X - mu)^2] -- average squared deviation from mean. Units are squared.
Standard deviation = sqrt(variance) -- same units as the original variable.

Use variance when doing mathematical derivations (additive for independent variables: Var(X+Y) = Var(X) + Var(Y)). Use standard deviation for interpreting spread in original units, for z-scores, for confidence intervals.

Coefficient of variation (CV) = std/mean: normalized spread measure for comparing variables with different scales.
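A short sketch of why CV is the tool for comparing spread across scales. The data are hypothetical; the point is that standard deviation changes with units while CV does not.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (meaningful for positive, ratio-scale data)."""
    return stdev(values) / mean(values)


# Hypothetical samples measured on very different scales
order_values_inr = [450, 520, 610, 480, 700]
delivery_minutes = [28, 35, 31, 40, 26]
delivery_seconds = [m * 60 for m in delivery_minutes]  # same data, different unit

print(f"std (minutes): {stdev(delivery_minutes):.2f}")
print(f"std (seconds): {stdev(delivery_seconds):.2f}")  # 60x larger
print(f"CV  (minutes): {coefficient_of_variation(delivery_minutes):.3f}")
print(f"CV  (seconds): {coefficient_of_variation(delivery_seconds):.3f}")  # unchanged
print(f"CV  (order value): {coefficient_of_variation(order_values_inr):.3f}")
```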

Q8. Explain outlier detection methods. How do you handle outliers?

import pandas as pd
import numpy as np
from scipy import stats

data = pd.Series([1, 2, 3, 4, 5, 100, 2, 3, 4, 5])

# Method 1: IQR method
Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]

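# Method 2: Z-score (assumes normality)
# Caveat: the outlier inflates the mean and std, so on this 10-point sample
# the z-score of 100 is just under 3 and the usual cutoff flags nothing.
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]  # empty here; prefer IQR or median/MAD on small samples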

# Method 3: Isolation Forest (multivariate)
from sklearn.ensemble import IsolationForest
clf = IsolationForest(contamination=0.1, random_state=42)
labels = clf.fit_predict(data.values.reshape(-1, 1))
outliers_if = data[labels == -1]

# Handling strategies:
# - Remove: only if data entry error and you can confirm it
# - Cap (Winsorize): clip at 1st and 99th percentile
# - Log transform: compress the scale
# - Use robust models: tree-based models are inherently outlier-robust

Q9. What is the difference between normalization and standardization?

Method	Formula	Output range	When to use
Min-Max normalization	(x - min) / (max - min)	[0, 1]	Neural networks, KNN, SVM
Z-score standardization	(x - mean) / std	Mean=0, std=1	Linear models, PCA, logistic regression
Robust scaling	(x - median) / IQR	Centered on median	Data with significant outliers

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Important: fit on train, transform on both train and test
X_test_scaled = scaler.transform(X_test)  # NOT fit_transform on test

Tree-based models (RF, XGBoost): do NOT need scaling. Distance-based models (KNN, SVM, KMeans): always scale.
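The three formulas in the table can be compared directly on a small sample. This is a stdlib sketch with hypothetical values; it mirrors the formulas, not scikit-learn's implementation (the population standard deviation and inclusive quartiles are chosen here to match what StandardScaler and RobustScaler compute by default).

```python
from statistics import mean, median, pstdev, quantiles


def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    mu, sigma = mean(xs), pstdev(xs)  # population std (ddof=0)
    return [(x - mu) / sigma for x in xs]


def robust(xs):
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")  # linear-interpolation quartiles
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


values = [10, 12, 14, 16, 1000]  # hypothetical feature with one extreme value

print("min-max:", [round(v, 3) for v in min_max(values)])
print("z-score:", [round(v, 3) for v in z_score(values)])
print("robust: ", [round(v, 3) for v in robust(values)])
```

With one extreme value, min-max squeezes the four ordinary points into a sliver near 0, while robust scaling keeps them spread out and leaves only the outlier far away.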

Q10. What is EDA? Walk me through your EDA process.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def full_eda(df: pd.DataFrame):
    # 1. Shape and types
    print(df.shape, "\n", df.dtypes)

    # 2. Missing values
    missing = df.isnull().sum()
    print("Missing:\n", missing[missing > 0])

    # 3. Descriptive statistics
    print(df.describe(include='all'))

    # 4. Target variable distribution
    target = df.columns[-1]
    print(f"Target balance:\n{df[target].value_counts(normalize=True)}")

    # 5. Numerical distributions
    df.select_dtypes(include='number').hist(bins=30, figsize=(15, 10))

    # 6. Correlation matrix (new figure, so the heatmap does not draw over a histogram panel)
    corr = df.select_dtypes(include='number').corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
    plt.show()

    # 7. Categorical cardinality
    for col in df.select_dtypes(include='object'):
        print(f"{col}: {df[col].nunique()} unique values")

    # 8. Outlier check per numerical column
    for col in df.select_dtypes(include='number'):
        Q1, Q3 = df[col].quantile([0.25, 0.75])
        IQR = Q3 - Q1
        n_outliers = ((df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)).sum()
        if n_outliers > 0:
            print(f"{col}: {n_outliers} outliers")

MEDIUM: Feature Engineering and Modeling (Questions 11-22)

Q11. What feature engineering techniques do you apply to datetime columns?

import pandas as pd

df["timestamp"] = pd.to_datetime(df["timestamp"])

# Cyclical encoding (important for hour/month/day -- wrap-around)
import numpy as np
df["hour_sin"] = np.sin(2 * np.pi * df["timestamp"].dt.hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["timestamp"].dt.hour / 24)
df["dow_sin"] = np.sin(2 * np.pi * df["timestamp"].dt.dayofweek / 7)
df["dow_cos"] = np.cos(2 * np.pi * df["timestamp"].dt.dayofweek / 7)

# Binary flags
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
df["is_business_hours"] = df["timestamp"].dt.hour.between(9, 18)

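# Elapsed features (reference_date stands in for a per-user signup date, if one is available)
reference_date = pd.Timestamp("2024-01-01")
df["days_since_signup"] = (df["timestamp"] - reference_date).dt.days

# Rolling aggregates (user behavior features)
# Time-based windows need a datetime index, so roll on the timestamp within each user.
# Sorting by user then time makes the grouped result line up with df row by row.
df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)
df["rolling_7d_orders"] = (
    df.assign(n=1)
      .set_index("timestamp")
      .groupby("user_id")["n"]
      .rolling("7D")
      .sum()          # orders in the trailing 7 days, including the current one
      .to_numpy()
)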

Q12. What is target encoding? What are its risks and how do you mitigate them?

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

def target_encode_cv(df, col, target, n_splits=5, k=10):
    """Out-of-fold target encoding (raw and smoothed) to prevent leakage."""
    df = df.copy()
    raw_col, smooth_col = "te_" + col, "te_" + col + "_smooth"
    df[raw_col] = np.nan
    df[smooth_col] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for tr_idx, val_idx in kf.split(df):
        train, val_cats = df.iloc[tr_idx], df.iloc[val_idx][col]
        # Compute statistics from the TRAINING fold only
        global_mean = train[target].mean()
        stats = train.groupby(col)[target].agg(["mean", "count"])
        # Smoothing: shrink toward the global mean for rare categories (k = smoothing factor)
        smooth = (stats["count"] * stats["mean"] + k * global_mean) / (stats["count"] + k)
        val_rows = df.index[val_idx]
        df.loc[val_rows, raw_col] = val_cats.map(stats["mean"]).to_numpy()
        # Categories unseen in the training fold fall back to the global mean
        df.loc[val_rows, smooth_col] = val_cats.map(smooth).fillna(global_mean).to_numpy()
    return df

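Risks: target leakage (a row's own label, or validation/test labels, leaking into its encoded feature) and overfitting on low-frequency categories. Mitigations: out-of-fold (CV) encoding plus smoothing, with the final mapping fitted on training data only.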

Q13. How do you handle high-cardinality categorical features?

Method	When to use
Target encoding with smoothing	Tree models, moderate cardinality
Frequency encoding	High cardinality, no target correlation
Embeddings (Entity embedding)	Neural networks, very high cardinality
Hashing trick	Memory-constrained, very high cardinality
Group rare categories	Tail categories with < 1% frequency

# Frequency encoding
freq_map = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq_map)

# Group tail categories
top_cities = df["city"].value_counts().head(50).index
df["city_grouped"] = df["city"].where(df["city"].isin(top_cities), other="OTHER")

Q14. Compare logistic regression, random forest, and gradient boosting for a classification task.

Property	Logistic Regression	Random Forest	Gradient Boosting (XGBoost)
Training speed	Fast	Medium (parallel)	Slower (sequential)
Interpretability	High (coefficients)	Medium (feature importance)	Medium
Handles non-linearity	No (without features)	Yes	Yes
Handles missing values	No (needs imputation)	No	Yes (XGBoost native)
Best for	Sparse features, text, baseline	Balanced tabular data	Competitions, complex tabular
Regularization	L1/L2	Implicit via subsampling	L1/L2 + tree regularization

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

models = {
    "LogReg": LogisticRegression(C=1.0, max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, max_depth=10, n_jobs=-1, random_state=42),
    "GBM": GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.4f}")

Q15. How do you deal with class imbalance in a binary classification problem?

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

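# Option 1: class weights (cheapest fix; no resampling needed)
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight
lr = LogisticRegression(class_weight='balanced', max_iter=1000)
# GradientBoostingClassifier has no class_weight parameter; pass sample_weight to fit instead
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train, sample_weight=compute_sample_weight('balanced', y_train))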

# Option 2: SMOTE oversampling
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Option 3: Pipeline with SMOTE + undersampling
pipeline = ImbPipeline([
    ('over', SMOTE(sampling_strategy=0.3)),
    ('under', RandomUnderSampler(sampling_strategy=0.5)),
    ('model', GradientBoostingClassifier()),
])

# Option 4: Change decision threshold
probs = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # lower threshold increases recall for minority class
preds = (probs >= threshold).astype(int)
print(classification_report(y_test, preds))

Evaluation for imbalanced data: use PR-AUC, F1-score, recall at fixed precision. Not accuracy.
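A quick illustration of why accuracy misleads on imbalanced data. The labels and predictions below are made up for the example; the metric definitions are the standard ones.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def report(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 1,000 transactions, 10 of them fraud (1% positive class)
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000                          # predicts "not fraud" for everything
useful_model = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970  # 8 frauds caught, 20 false alarms

print("always negative:", report(y_true, always_negative))
print("useful model:   ", report(y_true, useful_model))
```

The do-nothing classifier scores 99% accuracy with zero recall, while the model that actually catches fraud scores lower on accuracy. Recall, precision, and F1 expose the difference.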

Q16. What is feature importance? Compare SHAP vs permutation importance vs model-native importance.

import shap
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance

# Model-native (impurity-based) -- fast but biased toward high-cardinality
feature_imp_native = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

# Permutation importance -- model-agnostic, uses held-out data
perm_imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
feature_imp_perm = pd.Series(perm_imp.importances_mean, index=X_train.columns)

# SHAP (best quality -- game-theoretic attribution)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=X_train.columns)

Method	Bias	Speed	Use case
Native (impurity)	High-cardinality bias	Fastest	Quick sanity check
Permutation	Understates correlated features	Medium	Model-agnostic, production
SHAP	Low (consistent; credit is shared across correlated features)	Slow	Explanation, debugging, fairness

Q17. What is cross-validation? Describe k-fold, stratified k-fold, and time-series CV.

from sklearn.model_selection import (
    KFold, StratifiedKFold, TimeSeriesSplit, cross_val_score
)
import numpy as np

# Standard K-Fold (regression)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Stratified K-Fold (classification -- preserves class ratio per fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

# Time-Series CV (CRITICAL: never shuffle time series data)
tscv = TimeSeriesSplit(n_splits=5, gap=1)  # gap prevents leakage
ts_scores = cross_val_score(model, X, y, cv=tscv, scoring='neg_mean_squared_error')

print(f"AUC: {auc_scores.mean():.4f} +/- {auc_scores.std():.4f}")

Key rule for time series: always split chronologically. Shuffled CV on time series gives optimistic scores because it lets future data inform past predictions.
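The chronological rule can be sketched without any library. The function below is a simplified stand-in written for illustration (its name is hypothetical, and it is not scikit-learn's code), although it follows the same expanding-window idea as TimeSeriesSplit.

```python
def expanding_window_splits(n_samples, n_splits, gap=0):
    """Yield (train_indices, test_indices); every training index precedes every test index."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train_end = test_start - gap  # leave `gap` rows unused between train and test
        yield list(range(train_end)), list(range(test_start, test_start + test_size))


# 12 time-ordered rows, 3 folds, 1-row gap
for train_idx, test_idx in expanding_window_splits(12, n_splits=3, gap=1):
    print("train:", train_idx, "test:", test_idx)
```

Each fold trains on a longer prefix of history and tests on the block that follows it, so no future row can inform a past prediction. The gap is useful when features such as lags or rolling means would otherwise straddle the boundary.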

Q18. What is dimensionality reduction? Compare PCA vs t-SNE vs UMAP.

Method	Type	Preserves	Speed	Scalability
PCA	Linear	Global variance, distances	Fast	High (SVD)
t-SNE	Non-linear	Local neighborhood structure	Slow O(N^2)	Low (<50K)
UMAP	Non-linear	Local + some global structure	Medium	High (millions)

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# PCA for preprocessing (retain 95% variance)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Components needed: {pca.n_components_}")

# t-SNE for visualization only (do not use for downstream ML)
tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_jobs=-1)
X_2d = tsne.fit_transform(X_scaled[:5000])  # subsample for speed

# UMAP for both visualization and downstream use
reducer = umap.UMAP(n_components=10, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

Q19. How do you build a churn prediction model? Walk through the full pipeline.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve
import xgboost as xgb

# 1. Feature engineering from raw subscription data
def build_features(df):
    df["days_since_last_login"] = (pd.Timestamp.now() - df["last_login"]).dt.days
    df["avg_sessions_per_week"] = df["total_sessions"] / (df["tenure_days"] / 7 + 1)
    df["support_ticket_rate"] = df["support_tickets"] / (df["tenure_days"] + 1)
    df["plan_downgrade"] = (df["current_plan_value"] < df["signup_plan_value"]).astype(int)
    return df

# 2. Preprocessing pipeline
num_features = ["days_since_last_login", "avg_sessions_per_week", "tenure_days",
                "support_ticket_rate", "plan_downgrade"]  # include the engineered features
cat_features = ["plan_type", "acquisition_channel", "country"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_features),
])

# 3. Model pipeline
pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", xgb.XGBClassifier(
        n_estimators=500, learning_rate=0.05, max_depth=6,
        scale_pos_weight=10,  # class imbalance: roughly n_negative / n_positive
        eval_metric='logloss', random_state=42,  # use_label_encoder is deprecated and no longer needed
    )),
])

pipeline.fit(X_train, y_train)
auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"Churn model AUC: {auc:.4f}")

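# 4. Business threshold -- target 80% recall on churners
prec, rec, thresholds = precision_recall_curve(y_test, pipeline.predict_proba(X_test)[:, 1])
# prec and rec have one more entry than thresholds, and recall falls as the threshold rises,
# so take the highest threshold that still reaches 80% recall
target_thresh = thresholds[rec[:-1] >= 0.80][-1]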

Q20. How do you design and analyze an A/B test?

from scipy import stats
import numpy as np
from statsmodels.stats.power import NormalIndPower

# Step 1: Power analysis (before experiment)
from statsmodels.stats.proportion import proportion_effectsize
# NormalIndPower expects a standardized effect (Cohen's h), not a relative lift
effect_size = proportion_effectsize(0.17, 0.15)  # 15% baseline, 2-point absolute lift
power_analysis = NormalIndPower()
n_per_group = power_analysis.solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"Required sample per group: {n_per_group:.0f}")

# Step 2: Run experiment (collect data for required duration)

# Step 3: Statistical test
control_conversions = 1520  # out of 10000
treatment_conversions = 1690  # out of 10000
n = 10000

p_control = control_conversions / n
p_treatment = treatment_conversions / n

# Two-proportion z-test
count = np.array([treatment_conversions, control_conversions])
nobs = np.array([n, n])
from statsmodels.stats.proportion import proportions_ztest
stat, p_value = proportions_ztest(count, nobs)

print(f"Control rate: {p_control:.4f}")
print(f"Treatment rate: {p_treatment:.4f}")
print(f"Relative lift: {(p_treatment - p_control) / p_control * 100:.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")

Common pitfalls: peeking (stopping early when significant), SRM (sample ratio mismatch), novelty effect, SUTVA violations (network effects in social products).
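Of these pitfalls, SRM is the easiest to check mechanically. Below is a stdlib sketch of a chi-square goodness-of-fit test on the assignment counts; the function name is hypothetical, and the strict p < 0.001 alert threshold is a common convention assumed here, not something the experiment above specifies.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2))
    return erfc(sqrt(chi2 / 2))


for control, treatment in [(10_000, 10_000), (10_000, 10_100), (10_000, 10_500)]:
    p = srm_p_value(control, treatment)
    flag = "SRM suspected" if p < 0.001 else "ok"
    print(f"{control} vs {treatment}: p = {p:.4f} ({flag})")
```

If the check fires, the conversion comparison should not be trusted until the assignment or logging problem is found, because the groups are no longer comparable.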

Q21. What is multicollinearity? How do you detect and address it?

Multicollinearity: high correlation between predictor variables. Inflates coefficient standard errors in linear models, making interpretation unreliable.

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

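# Detection: Variance Inflation Factor (VIF)
from statsmodels.tools.tools import add_constant

def calculate_vif(df):
    # statsmodels expects an intercept column in the design matrix;
    # without one the VIFs are computed on uncentered data and come out inflated
    X = add_constant(df, has_constant="add")
    vif_data = pd.DataFrame()
    vif_data["feature"] = df.columns
    vif_data["VIF"] = [
        variance_inflation_factor(X.values, i)
        for i in range(1, X.shape[1])  # skip column 0, the constant
    ]
    return vif_data.sort_values("VIF", ascending=False)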

vif_df = calculate_vif(X_train)
print(vif_df)
# VIF > 10: severe multicollinearity
# VIF 5-10: moderate, investigate
# VIF < 5: acceptable

Solutions:

Drop one of highly correlated feature pairs.
Use PCA to create orthogonal components.
Use Ridge regression (L2) which handles multicollinearity gracefully.
Tree-based models are robust to multicollinearity for prediction, though feature importance gets split across the correlated features.

Q22. Explain gradient boosting. How does XGBoost improve on vanilla gradient boosting?

Gradient boosting builds an ensemble additively: each tree fits the negative gradient (pseudo-residuals) of the loss function from the previous ensemble.

F_m(x) = F_{m-1}(x) + nu * h_m(x)

where h_m is a tree fit to the pseudo-residuals -[dL/dF]_{F=F_{m-1}}.
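The update rule can be traced in a toy implementation. This is a minimal sketch assuming squared-error loss L = 0.5 * (y - F)^2, for which the negative gradient is simply y - F, with one-split stumps on a single feature. The function names and data are illustrative, and nothing here reflects how sklearn or XGBoost are implemented.

```python
from statistics import fmean


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on the right
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, nu=0.1):
    """Return F_M built by repeatedly fitting stumps to the pseudo-residuals."""
    f0 = fmean(ys)  # F_0: the best constant prediction under squared error
    stumps, preds = [], [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]      # negative gradient
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + nu * h(x) for p, x in zip(preds, xs)]  # F_m = F_{m-1} + nu * h_m
    return lambda x: f0 + nu * sum(h(x) for h in stumps)


def mse(model, xs, ys):
    return fmean((y - model(x)) ** 2 for x, y in zip(xs, ys))


xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]  # hypothetical non-linear target
baseline_mse = mse(lambda x: fmean(ys), xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
print(f"constant model MSE: {baseline_mse:.1f}")
print(f"boosted model MSE:  {boosted_mse:.1f}")
```

The learning rate nu trades speed for stability: smaller values need more rounds but make each tree's contribution less likely to overfit.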

XGBoost improvements over sklearn GBM:

Feature	sklearn GBM	XGBoost
Speed	Single-threaded	Parallelized column block
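Regularization	No explicit L1/L2 penalty (shrinkage and subsampling only)	L1 (alpha) + L2 (lambda) on leaf weights
Missing values	Manual imputation needed	Native sparse handling
Tree structure	Depth-wise growth	Depth-wise by default; leaf-wise via grow_policy='lossguide' (LightGBM's default strategy)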
Second-order optimization	First-order gradients only	Hessian (second-order) for splits

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=5,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,    # L1
    reg_lambda=1.0,   # L2
    early_stopping_rounds=50,
    eval_metric='auc',
    random_state=42,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)

HARD: System Design and Case Studies (Questions 23-30)

Q23. Design a demand forecasting system for an e-commerce platform.

Problem: Forecast daily demand per SKU x warehouse for 90-day horizon
Scale: 500K SKUs, 20 warehouses = 10M time series

Data sources:
  - Historical sales (3 years)
  - Price and promotions calendar
  - Category metadata (brand, subcategory, weight)
  - External: Google Trends, holiday calendar, weather

Feature engineering:
  - Lag features: sales_t-1, t-7, t-14, t-28, t-365
  - Rolling statistics: 7d/28d/90d rolling mean/std/median
  - Calendar: day_of_week, month, is_holiday, days_to_event
  - Price elasticity: log(price / rolling_avg_price)
  - Trend: linear time index, Fourier features for seasonality

Model stack:
  - Global LightGBM (one model, all SKUs -- features carry SKU identity)
  - Fine-tune with per-category models for top-volume SKUs
  - Uncertainty: quantile regression (p10, p50, p90)

Evaluation:
  - WMAPE (weighted mean absolute percentage error) per category
  - Pinball loss at p10/p90 for uncertainty calibration
  - Business metric: in-stock rate, overstock cost

Infrastructure:
  - Training: weekly batch on Databricks (Spark for feature compute)
  - Serving: pre-computed forecasts stored in DynamoDB/BigQuery
  - Monitoring: WMAPE drift alerts, auto-retrain trigger on >5% degradation
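The two evaluation metrics named above are short enough to write out. This is a sketch with made-up numbers; the function names are chosen here for illustration.

```python
def wmape(actual, forecast):
    """Weighted MAPE: total absolute error divided by total actual volume."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)


def pinball_loss(actual, forecast, q):
    """Mean quantile (pinball) loss for a forecast of the q-th quantile."""
    total = 0.0
    for a, f in zip(actual, forecast):
        diff = a - f
        # Under-forecasts are weighted by q, over-forecasts by (1 - q)
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(actual)


# Hypothetical daily demand for four SKUs, including a zero-sales day
actual = [100, 0, 50, 250]
p50_forecast = [90, 5, 60, 200]

print(f"WMAPE: {wmape(actual, p50_forecast):.4f}")                       # 0.1875
print(f"Pinball @ q=0.5: {pinball_loss(actual, p50_forecast, 0.5):.3f}")  # 9.375
print(f"Pinball @ q=0.9: {pinball_loss(actual, p50_forecast, 0.9):.3f}")  # 13.875
```

WMAPE stays defined when individual SKUs have zero sales, which plain MAPE does not. The pinball loss at q=0.9 penalizes the same forecast more heavily because under-forecasting is what a p90 estimate is supposed to avoid.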

Q24. A product metric dropped 20% overnight. How do you diagnose it?

Systematic debugging framework:

1. Validate the data: is the metric genuinely down or is it a logging/instrumentation issue?
   - Check event volume vs previous same-day-of-week.
   - Check for any dashboarding or SQL query changes.
   - Compare raw event counts vs derived metric.
2. Scope the drop: which segment is affected?
   - Slice by platform (iOS/Android/Web).
   - Slice by geography, user cohort, acquisition channel.
   - Slice by feature area (checkout, homepage, search).
3. Check for external events:
   - Any deployments in the past 24 hours?
   - Any third-party API outages?
   - Any marketing campaigns ending?
   - Any seasonal patterns (Monday always lower)?
4. Correlate with other metrics:
   - Did session volume drop (traffic problem) or conversion rate drop (funnel problem)?
   - Did error rates increase?
   - Did A/B test traffic allocation change?
5. Root cause and fix:
   - Logging bug: fix and backfill.
   - Feature regression: rollback and investigate.
   - Real user behavior change: investigate user feedback, reviews.
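The scoping and traffic-versus-funnel steps can be combined into one decomposition. The sketch below assumes the metric is orders = sessions x conversion rate and uses invented per-platform numbers; it splits each segment's change into a traffic effect and a conversion effect.

```python
# Hypothetical (sessions, conversion_rate) per platform
yesterday = {"iOS": (40_000, 0.050), "Android": (50_000, 0.040), "Web": (10_000, 0.030)}
today = {"iOS": (40_200, 0.050), "Android": (49_500, 0.020), "Web": (10_100, 0.030)}


def diagnose(before, after):
    """Return per-segment (name, delta, traffic_effect, conversion_effect), worst first."""
    rows = []
    for name, (s0, c0) in before.items():
        s1, c1 = after[name]
        delta = s1 * c1 - s0 * c0
        traffic_effect = (s1 - s0) * c0     # change explained by session volume
        conversion_effect = s1 * (c1 - c0)  # change explained by the funnel
        rows.append((name, delta, traffic_effect, conversion_effect))
    return sorted(rows, key=lambda row: row[1])


for name, delta, traffic, conversion in diagnose(yesterday, today):
    print(f"{name:8s} delta={delta:8.1f}  traffic={traffic:7.1f}  conversion={conversion:8.1f}")
```

In this invented case almost the entire drop sits in Android's conversion effect, which points toward a platform-specific funnel regression (for example, a recent app release) rather than a traffic problem.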

Q25. How do you build a real-time fraud detection model?

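import pandas as pd

# Feature engineering for transaction fraud.
# Assumptions: user_history is a non-empty DataFrame of the user's earlier transactions
# with a datetime "timestamp" column, sorted by time; tx carries the same fields;
# haversine_distance is a helper defined elsewhere.
def compute_transaction_features(tx, user_history):
    # Windows are measured back from the current transaction, not from the last stored row
    age = tx["timestamp"] - user_history["timestamp"]
    last_1h = user_history[age <= pd.Timedelta(hours=1)]
    last_24h = user_history[age <= pd.Timedelta(hours=24)]
    hour_gap = abs(tx["hour"] - user_history["hour"].mode()[0])
    return {
        # Velocity features (key for fraud)
        "txn_count_1h": len(last_1h),
        "txn_amount_1h": last_1h["amount"].sum(),
        "unique_merchants_24h": last_24h["merchant_id"].nunique(),

        # Deviation features
        "amount_vs_avg_ratio": tx["amount"] / (user_history["amount"].mean() + 1),
        "hour_vs_usual": min(hour_gap, 24 - hour_gap),  # hours wrap around midnight

        # Device / location features
        "new_device": 1 if tx["device_id"] not in user_history["device_id"].values else 0,
        "new_location": 1 if tx["country"] not in user_history["country"].values else 0,
        "distance_from_last_txn_km": haversine_distance(tx, user_history.iloc[-1]),
    }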

Production architecture:

Real-time feature store: Redis for velocity counters (INCR + TTL).
Model serving: LightGBM loaded in FastAPI, P99 latency <10ms.
Threshold: tune for 0.1% FPR (fraud blocking must not block legit users).
Human review queue: scores in [0.4, 0.7] range.
Feedback loop: reviewed transactions feed back to training data weekly.
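The velocity counters in the architecture above can be prototyped in-process before wiring up a real store. The class below is a hypothetical stand-in written for illustration, not Redis and not a Redis client: it keeps a sliding window of event timestamps per key and assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class VelocityCounter:
    """Counts events per key inside a trailing time window (in seconds)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = {}  # key -> deque of event timestamps, oldest first

    def record(self, key, timestamp):
        """Record an event and return the count in the window ending at `timestamp`."""
        events = self._events.setdefault(key, deque())
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out; the event just added always stays
        while events[0] <= cutoff:
            events.popleft()
        return len(events)


counter = VelocityCounter(window_seconds=3600)
for user, ts in [("u1", 0), ("u1", 1800), ("u1", 3500), ("u1", 3700), ("u2", 3700)]:
    print(user, ts, "-> txn_count_1h =", counter.record(user, ts))
```

A production version would need expiry of idle keys and shared state across serving replicas, which is what the Redis INCR + TTL approach provides.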

Q26. What is the difference between parametric and non-parametric statistical tests?

Property	Parametric	Non-parametric
Assumption	Normal distribution, equal variance	No distributional assumption
Examples	t-test, ANOVA, Pearson correlation	Mann-Whitney U, Kruskal-Wallis, Spearman
Power	Higher (when assumptions hold)	Lower but robust
Sample size	Reliable at small n only if normality holds; robust at large n via the CLT	Safer for small or clearly non-normal samples

from scipy import stats

# Parametric: independent samples t-test
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

# Non-parametric equivalent: Mann-Whitney U
u_stat, p_val_mw = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# When to use non-parametric:
# - Small samples (n < 30)
# - Ordinal data (ratings 1-5)
# - Heavy tails / outliers
# - Confirmed non-normal distribution

# Test for normality
stat, p_normal = stats.shapiro(group_a)  # H0: data is normal
print(f"Shapiro p={p_normal:.4f}, normal={'yes' if p_normal > 0.05 else 'no'}")

Q27. How do you monitor a deployed ML model for data drift?

# Import paths match Evidently's Report API from the 0.4.x releases; later versions may differ
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, ClassificationPreset

# Reference data (training distribution)
reference_df = train_data.copy()
# Current data (last 7 days of production)
current_df = production_data.copy()

# Statistical drift report
report = Report(metrics=[
    DataDriftPreset(),         # PSI / KS test per feature
    ClassificationPreset(),    # Precision, recall, F1 vs baseline
])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("/tmp/drift_report.html")

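import numpy as np

# Custom PSI for numerical features
def psi(expected, actual, n_bins=10):
    # Bin edges come from reference quantiles; np.unique drops duplicate edges caused by ties
    bins = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    # Clip so production values outside the reference range land in the outer bins
    actual = np.clip(actual, bins[0], bins[-1])
    exp_pct = np.histogram(expected, bins=bins)[0] / len(expected)
    act_pct = np.histogram(actual, bins=bins)[0] / len(actual)
    # Floor at a tiny value to avoid log(0) and division by zero in empty bins
    exp_pct = np.clip(exp_pct, 1e-8, None)
    act_pct = np.clip(act_pct, 1e-8, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))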

# PSI thresholds: <0.1 stable, 0.1-0.2 slight drift, >0.2 significant drift
for col in num_features:
    score = psi(reference_df[col].dropna(), current_df[col].dropna())
    if score > 0.2:
        print(f"ALERT: {col} PSI={score:.3f} -- significant drift")

Q28. Design a recommendation system for an OTT platform with 50M users and 10K titles.

Problem: Rank 10K titles per user at page load time (<100ms)

Stage 1 -- Retrieval (10K -> 500 candidates):
  - Collaborative filtering: Matrix factorization (ALS via Spark)
  - Content-based: Title embeddings (title, genre, cast) via MLP
  - Trending: Popularity signals (last 7d views) per genre x region
  - Recency: Last 5 watched genres as query vector

Stage 2 -- Ranking (500 -> 20 final):
  - Two-tower model: user tower + item tower -> dot product score
  - Features: user history, device context, time of day, completion rate
  - Fine-grained: XGBoost re-ranker with engagement signals
  - Business rules: new releases boost, licensed content expiry boost
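
The dot-product scoring step of the two-tower ranker can be sketched in NumPy. Random vectors stand in for the tower outputs here; the 64-dimensional embeddings, the function name `top_k_by_dot_product`, and the candidate count are assumptions for illustration, not a trained model.

```python
import numpy as np

def top_k_by_dot_product(user_vec, item_matrix, k=20):
    """Score every candidate by dot product and return the top-k indices, best first."""
    scores = item_matrix @ user_vec
    k = min(k, scores.shape[0])
    # argpartition finds the top-k in linear time; only those k are then sorted
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])], scores

rng = np.random.default_rng(0)
user_vec = rng.normal(size=64)            # stand-in for the user tower output
candidates = rng.normal(size=(500, 64))   # stand-in for 500 item tower outputs

top_idx, scores = top_k_by_dot_product(user_vec, candidates, k=20)
print(top_idx[:5], scores[top_idx[:5]])
```

Item embeddings can be precomputed offline, so only the user tower and one matrix-vector product need to run at request time.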

Stage 3 -- Post-processing:
  - Diversity: MMR (Maximal Marginal Relevance) to avoid filter bubbles
  - Freshness: penalize already-seen titles
  - Experimentation: holdout buckets for A/B testing new models
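
The MMR diversity step can be sketched as a greedy re-rank: each pick maximizes relevance minus a penalty for similarity to items already chosen. The function name `mmr_rerank`, the `lambda_` weight, the cosine similarity, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def mmr_rerank(relevance, item_vecs, k=20, lambda_=0.7):
    """Greedy Maximal Marginal Relevance over a candidate set."""
    relevance = np.asarray(relevance, dtype=float)
    item_vecs = np.asarray(item_vecs, dtype=float)
    # Cosine similarity between all candidate pairs
    normed = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    sim = normed @ normed.T

    selected = [int(np.argmax(relevance))]          # start with the most relevant item
    remaining = set(range(len(relevance))) - set(selected)
    while remaining and len(selected) < k:
        rem = np.array(sorted(remaining))
        # Similarity of each remaining item to its closest already-selected item
        max_sim = sim[np.ix_(rem, selected)].max(axis=1)
        mmr = lambda_ * relevance[rem] - (1 - lambda_) * max_sim
        best = int(rem[np.argmax(mmr)])
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: item 1 is a near-duplicate of item 0, item 2 is unrelated
relevance = [0.9, 0.85, 0.8, 0.3]
item_vecs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]

diverse = mmr_rerank(relevance, item_vecs, k=3, lambda_=0.5)
pure_relevance = mmr_rerank(relevance, item_vecs, k=3, lambda_=1.0)
print(diverse, pure_relevance)
```

With `lambda_=1.0` the ordering reduces to pure relevance; lowering it pushes the near-duplicate title down in favor of the dissimilar one.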

Serving:
  - Precomputed candidate sets refreshed hourly (Spark batch)
  - Online ranking: ONNX two-tower model served with Triton, <30ms P99
  - Feature store: Redis (user history, last-session features)
  - Fallback: popularity-based ranking for cold-start users

Evaluation:
  - Offline: NDCG@20, Hit Rate@20 on held-out interactions
  - Online: watch-start rate, completion rate, session depth
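
The two offline metrics can be computed per user as in the sketch below, which assumes binary relevance (a held-out title is either relevant or not). The function names and the toy title IDs are illustrative.

```python
import numpy as np

def hit_rate_at_k(ranked_items, relevant_items, k=20):
    """1.0 if any held-out relevant item appears in the top k, else 0.0."""
    return float(any(item in relevant_items for item in ranked_items[:k]))

def ndcg_at_k(ranked_items, relevant_items, k=20):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    gains = np.array([1.0 if item in relevant_items else 0.0
                      for item in ranked_items[:k]])
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float(np.sum(gains * discounts))
    # Ideal ranking puts every relevant item (up to k) at the top
    ideal_hits = min(len(relevant_items), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["t3", "t7", "t1", "t9"]   # model's ranking for one user
held_out = {"t1", "t4"}             # titles the user actually watched later

hit = hit_rate_at_k(ranked, held_out, k=3)
ndcg = ndcg_at_k(ranked, held_out, k=3)
perfect = ndcg_at_k(["t1", "t4", "t3"], held_out, k=3)
print(f"Hit Rate@3={hit:.1f}, NDCG@3={ndcg:.4f}, perfect NDCG@3={perfect:.1f}")
```

Averaging both values over all evaluation users gives the reported NDCG@20 and Hit Rate@20 when `k=20`.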

FAQ

Q: What Python libraries should I master for data science interviews?

A: pandas (data manipulation), NumPy (numerical operations), scikit-learn (ML), XGBoost/LightGBM (boosting), matplotlib/seaborn (visualization), and scipy.stats (statistical tests). Candidate reports in public preparation resources suggest these six cover the vast majority of hands-on coding rounds.

Q: How important is SQL in data science interviews vs Python?

A: Both are tested and both can be eliminators. Many companies start with a SQL screen. Window functions, CTEs, joins, and aggregations are the core of DS SQL rounds. Confirm the specific interview structure on the official company careers portal.

Q: Should I memorize ML algorithm math?

A: Derivation-level math is asked for research scientist roles and at FAANG companies. For most DS roles, you need conceptual clarity on bias-variance, loss functions, gradient descent, and evaluation metrics -- not full proofs. Tree-based model internals (Gini impurity, information gain, split criteria) are frequently asked at all levels.

Sources and review notes (reviewed 8 Jun 2026)

Article-specific sources

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.

Evidence labels

Official notices, candidate reports, offer documents, and editorial practice questions carry different confidence levels. The visible source list lets you inspect the evidence instead of relying on a blanket verification badge.

Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.


