PapersAdda

AI/ML Interview Questions 2026 — Top 50 Questions with Answers

34 min read
Interview Questions
Last Updated: 30 Mar 2026

AI/ML engineering is the highest-paid engineering discipline in 2026, with median compensation exceeding $200K at top companies. But the interview bar has risen to match. Google DeepMind, Amazon AWS AI, Microsoft Azure AI, and Meta AI no longer test textbook theory — they expect you to reason about large-scale model training, transformer internals, evaluation methodology, and real-world deployment tradeoffs. This guide compiles 50 real questions drawn from 200+ interviews at Google, Amazon, Microsoft, and Meta, with detailed answers, working Python code, and the exact reasoning interviewers want to hear.

You've spent years building your ML skills. The interview is 45 minutes. Let's make sure those 45 minutes go perfectly.

Related articles: Generative AI Interview Questions 2026 | System Design Interview Questions 2026 | Prompt Engineering Interview Questions 2026 | Data Engineering Interview Questions 2026


Which Companies Ask These Questions?

| Topic Cluster | Companies |
| --- | --- |
| Supervised/Unsupervised Learning | Google, Amazon, Microsoft, Meta, Apple |
| Neural Networks & Backpropagation | DeepMind, OpenAI, Nvidia, Hugging Face |
| Transformers & Attention | Google Brain, Meta AI, Cohere, Mistral |
| Loss Functions & Regularization | All FAANG, Databricks, Snowflake AI |
| Model Evaluation & Metrics | All top-tier ML teams |
| Deployment & MLOps | Amazon SageMaker, Azure ML, Vertex AI |

EASY — Foundational Concepts (Questions 1-15)

Don't skip these even if you're experienced. Interviewers at Google and Amazon use foundational questions as warmups — but a shaky answer here colors the entire interview.

Q1. What is the difference between supervised, unsupervised, and reinforcement learning?

| Type | Definition | Label Required | Example |
| --- | --- | --- | --- |
| Supervised | Learn a mapping from inputs to outputs | Yes | Email spam classification |
| Unsupervised | Find structure in unlabeled data | No | Customer segmentation |
| Semi-supervised | Few labeled + many unlabeled samples | Partial | Self-training classifiers |
| Reinforcement | Agent learns via reward/penalty | No (reward signal) | AlphaGo, robotics |
| Self-supervised | Labels derived from the data itself | No | BERT masked language modeling |

In 2026, most frontier models are self-supervised pre-trained and then fine-tuned with supervised or RL signals — a hybrid paradigm.


Q2. Explain bias-variance tradeoff with a concrete example.

  • Bias: Error from wrong assumptions. High bias → underfitting (e.g., linear model for non-linear data).
  • Variance: Sensitivity to training data fluctuations. High variance → overfitting (e.g., 100-depth decision tree on 100 samples).
  • Tradeoff: Reducing bias (more complex model) typically increases variance. Optimal model minimizes total error = Bias² + Variance + Irreducible Noise.
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import numpy as np

X, y = make_regression(n_samples=100, noise=20, random_state=42)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

for depth in [1, 3, 5, 10, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_err = np.mean((model.predict(X_train) - y_train)**2)
    test_err  = np.mean((model.predict(X_test) - y_test)**2)
    print(f"Depth={str(depth):>4}  Train MSE={train_err:.1f}  Test MSE={test_err:.1f}")
# depth=1 (underfitting): high test error
# depth=None (overfitting): train≈0 but high test error
# depth=5 (sweet spot): balanced

Q3. What are precision, recall, F1, and AUC-ROC? When do you use each?

Precision = TP / (TP + FP)   # Of all predicted positives, how many were correct?
Recall    = TP / (TP + FN)   # Of all actual positives, how many did we catch?
F1        = 2 * P * R / (P + R)  # Harmonic mean — use when classes are imbalanced
| Metric | Use When |
| --- | --- |
| Precision | False positives are costly (spam filter — don't block legit email) |
| Recall | False negatives are costly (cancer detection — don't miss a case) |
| F1 | Imbalanced classes, need balance of P and R |
| AUC-ROC | Ranking quality; threshold-independent; class-balanced problems |
| AUC-PR | Better for highly imbalanced datasets (rare event detection) |
from sklearn.metrics import classification_report, roc_auc_score
# classification_report gives precision, recall, F1 per class
print(classification_report(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_scores))
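To make the formulas concrete, here they are computed by hand from hypothetical confusion-matrix counts (the numbers are made up purely for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 80, 20, 10, 890

precision = TP / (TP + FP)          # 80/100 = 0.80
recall    = TP / (TP + FN)          # 80/90  ≈ 0.889
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that TN never enters precision, recall, or F1 — which is exactly why they are preferred over accuracy for imbalanced data.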

Q4. Explain L1 vs L2 regularization.

| Property | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty | λ Σ \|wᵢ\| | λ Σ wᵢ² |
| Effect on weights | Sparse (zeros out irrelevant features) | Small but non-zero |
| Use case | Feature selection, high-dimensional data | When all features matter |
| Solution | No closed form | Has closed-form solution |
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso(alpha=0.1)   # L1 — produces sparse coefficients
ridge = Ridge(alpha=1.0)   # L2 — shrinks all coefficients uniformly

ElasticNet combines both penalties: α · L1 + (1 − α) · L2. It is implemented in glmnet and in scikit-learn's ElasticNet, and is widely used in practice.


Q5. What is gradient descent and its variants?

| Variant | Update Frequency | Pros | Cons |
| --- | --- | --- | --- |
| Batch GD | Full dataset per step | Stable convergence | Slow for large data |
| SGD | 1 sample per step | Fast updates | Noisy, unstable |
| Mini-batch GD | k samples per step | Best of both | Requires tuning batch size |
| Momentum | Mini-batch + velocity | Faster, less oscillation | Extra hyperparameter |
| Adam | Adaptive learning rates per parameter | Gold standard in 2026 | Can overfit; higher memory |
| AdamW | Adam + decoupled weight decay | Best for transformers | Same memory cost |
import torch.optim as optim
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
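A minimal sketch of the difference between plain gradient descent and momentum on a 1-D quadratic (the learning rate and momentum values here are illustrative, not tuned):

```python
def grad(w):                      # gradient of f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

# Plain gradient descent: step directly along the negative gradient
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)

# Momentum: accumulate a velocity term that damps oscillation
w_m, v = 0.0, 0.0
for _ in range(100):
    v = 0.9 * v + grad(w_m)
    w_m -= 0.1 * v

print(w, w_m)   # both approach the minimum at w = 3
```

Adam and AdamW extend this idea by also tracking a per-parameter second moment to scale each step adaptively.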

Q6. What is cross-validation? Why is k-fold preferred over train/test split?

k-fold CV partitions the data into k folds; each fold serves once as validation while the other k−1 train the model, and the k scores are averaged. Advantages over a single split:

  • Reduces variance of evaluation estimate
  • Uses all data for both training and validation
  • Stratified k-fold maintains class balance per fold
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(), X, y, cv=skf, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

Insider tip (Google/Amazon): In 2026 interviews, they often ask about time-series CV where you cannot shuffle data — use TimeSeriesSplit. Getting this wrong is an instant red flag for any data science role.


Q7. What is the curse of dimensionality?

  1. Data becomes sparse — distance metrics lose meaning
  2. Volume of space grows exponentially — need exponentially more samples
  3. Nearest neighbors become equidistant

Mitigation: PCA, t-SNE, UMAP for dimensionality reduction. Feature selection. Regularization.

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # retain 95% variance
X_reduced = pca.fit_transform(X)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
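Distance concentration can be seen empirically (a toy experiment, not a formal proof): as dimensionality grows, the spread of distances shrinks relative to their mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=500):
    """(max - min) distance to a reference point, relative to the mean distance."""
    X = rng.random((n, dim))
    d = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from one point
    return (d.max() - d.min()) / d.mean()

# High dimensions: all points become roughly equidistant
print(distance_spread(2), distance_spread(1000))
```

This is exactly why nearest-neighbor methods degrade in high dimensions: "nearest" and "farthest" become nearly indistinguishable.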

Q8. Explain decision trees and the concept of information gain.

Information Gain = Entropy(parent) - weighted_avg(Entropy(children))
Entropy(S) = -Σ p_i * log2(p_i)
Gini Impurity = 1 - Σ p_i²
from sklearn.tree import DecisionTreeClassifier, export_text
dt = DecisionTreeClassifier(criterion='gini', max_depth=4)
dt.fit(X_train, y_train)
print(export_text(dt, feature_names=feature_names))

CART uses Gini; ID3/C4.5 use entropy. In practice, Gini is faster (no log computation) and used by default in scikit-learn.
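The entropy and Gini formulas above can be sketched directly (a minimal illustration, not scikit-learn's implementation):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0·log(0) is taken as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0, 0.5 — maximally impure
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # 0.0, 0.0 — pure node

# Information gain for a split that yields two pure, equal-size children
gain = entropy([0.5, 0.5]) - 0.5 * entropy([1.0]) - 0.5 * entropy([1.0])
print(gain)  # 1.0
```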


Q9. What is a random forest? How does it reduce variance?

  1. Bootstrap sampling (bagging): Each tree trains on a random sample with replacement
  2. Feature randomness: Each split considers only a random subset of features (typically √n_features)

This decorrelates the trees so their errors don't all occur in the same direction, reducing variance while keeping bias low.

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                             n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_  # impurity-based (Gini) importance

Q10. What is gradient boosting and how does XGBoost improve it?

  • Gradient Boosting: Sequentially trains trees, each fitting the residuals (negative gradient of the loss) of the previous model
  • XGBoost improvements: Regularization (L1/L2 on tree weights), second-order gradients (Newton method), column subsampling, cache-aware computation, GPU support
import xgboost as xgb
model = xgb.XGBClassifier(
    n_estimators=500, max_depth=6,
    learning_rate=0.05, subsample=0.8,
    colsample_bytree=0.8, reg_alpha=0.1,
    tree_method='hist',  # fast histogram method
    device='cuda'        # GPU in 2026
)

LightGBM vs XGBoost in 2026: LightGBM uses leaf-wise growth (faster, better accuracy); XGBoost uses level-wise (more regularized). Both support GPU.


Q11. What is the difference between generative and discriminative models?

| Model Type | What It Learns | Examples |
| --- | --- | --- |
| Discriminative | P(y\|x) — decision boundary | Logistic regression, SVM, standard neural networks |
| Generative | P(x,y) or P(x) — data distribution | Naive Bayes, VAE, GAN, diffusion models |

Generative models can generate new data; discriminative models can only classify.


Q12. Explain the vanishing gradient problem.

Gradients shrink exponentially as they are backpropagated through many layers when each layer multiplies them by factors smaller than 1 (e.g., sigmoid/tanh derivatives), so early layers learn extremely slowly.

Solutions:

  • ReLU activation (no saturation for positive values)
  • Batch Normalization (normalizes activations per mini-batch)
  • Residual connections (skip connections — gradients flow directly)
  • Gradient clipping (for RNNs)
  • Careful initialization (He/Xavier)
import torch.nn as nn
# He initialization for ReLU networks
conv = nn.Conv2d(64, 128, 3)
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')

Q13. What is batch normalization and why does it help?

x_hat = (x - mean) / sqrt(var + ε)
y = γ * x_hat + β

Benefits: Reduces internal covariate shift, allows higher learning rates, acts as mild regularizer, makes networks less sensitive to initialization.

2026 note: In transformers, Layer Normalization (normalizes across features for each sample, not across the batch) is preferred because it works with variable-length sequences and small batch sizes.
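The distinction can be sketched in NumPy — same formula, different normalization axis (toy data; the learned γ/β scale-and-shift is omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(2.0, 3.0, size=(8, 4))  # (batch, features)
eps = 1e-5

# BatchNorm: statistics per feature, computed across the batch (axis=0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: statistics per sample, computed across features (axis=1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0).round(6))  # each feature ≈ 0 across the batch
print(ln.mean(axis=1).round(6))  # each sample ≈ 0 across its features
```

Because LayerNorm never touches the batch axis, it behaves identically at batch size 1 and with variable-length sequences — the property transformers rely on.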


Q14. What is dropout and when should you use it?

Dropout randomly zeroes each unit's activation with probability p during training, preventing co-adaptation of units and acting as an ensemble-like regularizer. At inference no units are dropped; inverted dropout rescales activations during training so no test-time scaling is needed.

import torch.nn as nn
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # 30% dropout
    nn.Linear(256, 10)
)

Use dropout in fully-connected layers of large networks. Avoid in shallow models or with BatchNorm (they interact poorly). In transformers, dropout is applied to attention weights and residual streams.


Q15. What is transfer learning? When is it effective?

Transfer learning reuses a model pre-trained on a large source task as the starting point for a related target task. It is effective when:

  • Target domain has limited labeled data
  • Source and target domains are related
  • Pre-trained model captures useful low-level features (edges, n-grams, etc.)

Strategies:

  1. Feature extraction: Freeze pre-trained layers, train only the head
  2. Fine-tuning: Unfreeze some/all layers, train with low LR
  3. Domain adaptation: When source/target have different distributions
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                       num_labels=3)
# Only fine-tune the classifier head initially
for name, param in model.named_parameters():
    if 'classifier' not in name:
        param.requires_grad = False

MEDIUM — Neural Networks & Deep Learning (Questions 16-32)

This section separates ML engineers from data analysts. If you're targeting L4+ roles at Google or E4+ at Meta, you must be able to implement these concepts from scratch.

Q16. Explain the transformer architecture in detail.

Encoder:

  1. Input embedding + positional encoding
  2. Multi-head self-attention (queries, keys, values from same sequence)
  3. Add & Norm (residual connection + layer norm)
  4. Feed-forward network (two linear layers + ReLU/GELU)
  5. Add & Norm

Decoder (for seq2seq):

  • Masked self-attention (can't see future tokens)
  • Cross-attention (queries from decoder, keys/values from encoder)
  • Feed-forward + Add & Norm
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
MultiHead(Q,K,V) = Concat(head_1, ..., head_h) * W_O
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

Q17. Why does attention use sqrt(d_k) scaling?

Without scaling, dot products grow with dimension: if the components q_i, k_i ~ N(0,1), then q·k has variance d_k (std = √d_k). Large-magnitude scores push softmax into its saturated region, where gradients vanish. Dividing by √d_k renormalizes the scores to unit variance.
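The variance argument can be checked numerically (a toy simulation with random Gaussian vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))

dots = (q * k).sum(axis=1)            # 10,000 query-key dot products
print(dots.var())                     # ≈ d_k = 512
print((dots / np.sqrt(d_k)).var())    # ≈ 1 after scaling
```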


Q18. What is positional encoding in transformers and why is it needed?

Sinusoidal (original):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Learned positional embeddings (used in BERT, GPT): Trainable embedding lookup per position. Works well for fixed-length contexts.

RoPE (Rotary Position Embedding) — dominant in 2026 (LLaMA, Mistral, Gemma): Encodes relative position by rotating Q/K vectors. Extrapolates better to longer contexts.

ALiBi: Adds a linear bias to attention scores based on distance. Simple and effective for length generalization.
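A minimal NumPy sketch of the sinusoidal formula above, interleaving sin/cos per dimension pair:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sin
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cos
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)        # (128, 64)
print(pe[0, :4])       # position 0: [sin(0), cos(0), ...] = [0, 1, 0, 1]
```

Each position gets a unique, bounded fingerprint, and relative offsets correspond to fixed linear transforms of these vectors — which is why the original paper chose this scheme.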


Q19. Explain convolutional neural networks (CNNs) — how do they achieve translation invariance?

  • Convolution: Slide a learned filter across the input, computing dot products → detect local patterns
  • Parameter sharing: Same filter weights applied everywhere → detects the same feature regardless of location
  • Pooling: Max/average pool reduces spatial dimensions → slight translation invariance
import torch.nn as nn
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 64 3x3 filters
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2),                               # halve spatial dims
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d((1,1)),                   # global average pool
    nn.Flatten(),
    nn.Linear(128, 10)
)

Receptive field grows with depth. ResNet, EfficientNet, ConvNeXt are top choices in 2026 for pure CV tasks.


Q20. What is the dying ReLU problem and how do you solve it?

A ReLU unit "dies" when its pre-activation becomes negative for every input: its output and gradient are then zero everywhere, so the unit stops updating permanently. Large learning rates and poor initialization make this more likely.

Solutions:

| Activation | Formula | Property |
| --- | --- | --- |
| Leaky ReLU | max(αx, x), α=0.01 | Small gradient for x<0 |
| ELU | x if x>0, else α(eˣ − 1) | Smooth, negative saturation |
| GELU | x·Φ(x) | Used in BERT, GPT — smooth |
| SwiGLU | x·σ(βx) × linear gate | Used in LLaMA, Mistral in 2026 |

Q21. How does backpropagation work? Walk through a simple example.

import torch

# Simple 2-layer network manual backprop illustration
x = torch.tensor([2.0])
w1 = torch.tensor([0.5], requires_grad=True)
w2 = torch.tensor([0.3], requires_grad=True)
b  = torch.tensor([0.1], requires_grad=True)

# Forward pass
h = torch.relu(w1 * x + b)   # hidden layer
y_pred = w2 * h               # output
loss = (y_pred - 1.0) ** 2    # MSE vs target=1

# Backward pass
loss.backward()
print(w1.grad, w2.grad, b.grad)  # ∂L/∂w1, ∂L/∂w2, ∂L/∂b

Chain rule: ∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂h * ∂h/∂w1

Key interview insight (Google asks this): The backward pass computes gradients in O(forward pass time) using the stored activations — it doesn't recompute intermediate values. Gradient checkpointing trades recomputation for memory by discarding intermediate activations.
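The same gradients can be verified by applying the chain rule by hand with the numbers from the snippet above (pure Python, no autograd):

```python
x, w1, w2, b = 2.0, 0.5, 0.3, 0.1
target = 1.0

# Forward pass
h = max(w1 * x + b, 0.0)          # ReLU(1.1) = 1.1
y_pred = w2 * h                   # 0.33
dL_dy = 2 * (y_pred - target)     # dL/dy_pred = -1.34

# Backward pass via the chain rule
dL_dw2 = dL_dy * h                # -1.474
dL_dh  = dL_dy * w2               # -0.402
relu_on = 1.0 if w1 * x + b > 0 else 0.0   # ReLU gradient gate
dL_dw1 = dL_dh * relu_on * x      # -0.804
dL_db  = dL_dh * relu_on          # -0.402

print(dL_dw1, dL_dw2, dL_db)      # matches w1.grad, w2.grad, b.grad
```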


Q22. What are activation functions and how do you choose one?

| Activation | Range | Used In | Notes |
| --- | --- | --- | --- |
| Sigmoid | (0,1) | Binary output | Vanishing gradient |
| Tanh | (−1,1) | RNN gates | Better than sigmoid |
| ReLU | [0,∞) | Most DNNs | Fast, dying ReLU risk |
| GELU | ≈ReLU | Transformers | Smooth, better accuracy |
| SwiGLU | — | LLMs (LLaMA 3) | Gated, computationally more expensive |
| Softmax | (0,1), sums to 1 | Multiclass output | Numerically stable with log |

Rule of thumb (2026): Hidden layers in transformers → GELU or SwiGLU. Hidden layers in CNNs → ReLU or its variants. Output layer → Sigmoid (binary), Softmax (multiclass), Linear (regression).


Q23. Explain support vector machines (SVMs).

Key concepts:

  • Hard margin: No misclassification allowed (only for linearly separable data)
  • Soft margin: Allows some misclassification with penalty C (C controls bias-variance)
  • Kernel trick: Maps data to higher-dimensional space implicitly via kernel function K(x,x') = φ(x)·φ(x') without computing φ explicitly
from sklearn.svm import SVC
# RBF kernel — works well for non-linear boundaries
svm = SVC(kernel='rbf', C=10, gamma='scale', probability=True)
svm.fit(X_train, y_train)

Kernels: Linear (text), RBF (general purpose), Polynomial, Sigmoid. In 2026, SVMs are rarely used at scale but appear in interviews as a test of mathematical understanding.


Q24. What is the EM algorithm? Give a practical example.

  • E-step: Compute the expected value of the latent variables given current parameters
  • M-step: Update parameters to maximize the expected log-likelihood

Example — Gaussian Mixture Models:

from sklearn.mixture import GaussianMixture
import numpy as np

# GMM uses EM under the hood
gmm = GaussianMixture(n_components=3, covariance_type='full', max_iter=200)
gmm.fit(X)

# E-step: assign soft cluster probabilities
responsibilities = gmm.predict_proba(X)
# M-step: update means, covariances, mixing weights
print(gmm.means_, gmm.covariances_)

EM is guaranteed to increase log-likelihood at each step but only converges to a local maximum. Running multiple restarts with different initializations helps.


Q25. What are autoencoders and variational autoencoders (VAEs)?

  • Autoencoder: Encoder compresses input to latent vector z; Decoder reconstructs input. Loss = reconstruction error. Used for compression, denoising, anomaly detection.
  • VAE: Encoder outputs a distribution N(μ, σ²) over z, not a point. Sample z from this distribution, decode. Loss = reconstruction error + KL divergence from prior N(0,I).
import torch, torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.enc = nn.Linear(input_dim, 128)
        self.mu  = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128),
                                  nn.ReLU(),
                                  nn.Linear(128, input_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    recon = nn.functional.mse_loss(recon_x, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp())
    return recon + kl

Q26. What is attention mechanism in NLP before transformers (Bahdanau attention)?

e_ij = score(s_{i-1}, h_j)   # alignment score
α_ij = softmax(e_ij)          # attention weight
c_i  = Σ_j α_ij * h_j        # context vector

This allows the decoder to "look at" any part of the input at each step, solving the information bottleneck of fixed-size context vectors. Transformers generalize this by making attention the primary computation, not just a supplement to RNNs.


Q27. How does LSTM solve the vanishing gradient problem of vanilla RNNs?

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)  # forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)  # input gate
g_t = tanh(W_g · [h_{t-1}, x_t] + b_g)  # candidate
C_t = f_t * C_{t-1} + i_t * g_t       # cell update
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)  # output gate
h_t = o_t * tanh(C_t)

The cell state pathway has additive updates (not multiplicative), so gradients flow back through the forget gate path without vanishing over long sequences. GRUs simplify LSTMs to 2 gates with similar performance.
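A single LSTM step implementing the equations above (a toy sketch with random weights; the packed-gate weight layout mirrors common implementations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM step; W packs the four gate weight matrices row-wise."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = len(h_prev)
    f = sigmoid(z[0*H:1*H])           # forget gate
    i = sigmoid(z[1*H:2*H])           # input gate
    g = np.tanh(z[2*H:3*H])           # candidate
    o = sigmoid(z[3*H:4*H])           # output gate
    C = f * C_prev + i * g            # additive cell update — the key to gradient flow
    h = o * np.tanh(C)
    return h, C

rng = np.random.default_rng(0)
H, D = 4, 3
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, C = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, C.shape)   # (4,) (4,)
```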


Q28. What is k-means clustering? What are its limitations?

from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
km.fit(X)

Algorithm: Initialize k centroids → assign each point to nearest centroid → recompute centroids → repeat until convergence.

Limitations:

  • Must specify k in advance (use elbow method or silhouette score)
  • Assumes spherical, equally-sized clusters
  • Sensitive to outliers
  • Finds local minima (random init — use k-means++ to mitigate)
  • Does not work well for non-convex clusters (use DBSCAN instead)

DBSCAN is preferred in 2026 for anomaly detection: density-based, auto-determines k, marks outliers.


Q29. What is the difference between bagging and boosting?

| Property | Bagging | Boosting |
| --- | --- | --- |
| Trees trained | In parallel | Sequentially |
| Each tree's data | Bootstrap sample | Weighted — focus on errors |
| Error reduced | Variance | Bias |
| Output | Average/vote | Weighted sum |
| Examples | Random Forest | XGBoost, LightGBM, CatBoost |
| Overfitting risk | Low | Higher (needs early stopping) |

Q30. Explain the concept of attention heads and multi-head attention.

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.h = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        B, T, D = Q.shape
        Q = self.W_q(Q).view(B, T, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(B, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(B, -1, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention for all heads
        scores = (Q @ K.transpose(-2,-1)) / (self.d_k**0.5)
        if mask is not None: scores = scores.masked_fill(mask==0, -1e9)
        attn = scores.softmax(-1) @ V
        out = attn.transpose(1,2).contiguous().view(B, T, D)
        return self.W_o(out)

Different heads learn to attend to different relationship types: syntactic, semantic, positional, coreference.


Q31. What is the difference between cross-entropy loss and MSE? When do you use each?

| Loss | Formula | Use Case |
| --- | --- | --- |
| MSE | Σ(y − ŷ)² / n | Regression; assumes Gaussian noise |
| MAE | Σ\|y − ŷ\| / n | Regression; robust to outliers |
| Cross-Entropy | −Σ y·log(ŷ) | Classification; assumes Bernoulli/Categorical |
| KL Divergence | Σ p·log(p/q) | Distribution matching (VAEs, KD) |
| Focal Loss | −(1−ŷ)^γ · y·log(ŷ) | Imbalanced object detection |
| Contrastive | max(0, margin − d) | Metric learning, Siamese networks |

Why cross-entropy for classification? It arises naturally from maximum likelihood estimation under a categorical distribution. MSE applied to classification outputs leads to saturation at wrong predictions (sigmoid's flat tails).
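The saturation point can be made concrete: for a confidently wrong sigmoid prediction, the gradient of MSE with respect to the logit collapses, while the cross-entropy gradient does not (toy numbers):

```python
import numpy as np

z, y = -10.0, 1.0                 # logit says "negative", true label is positive
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z) ≈ 4.5e-5

grad_ce  = p - y                       # d(BCE)/dz — stays ≈ -1
grad_mse = 2 * (p - y) * p * (1 - p)   # d((p-y)^2)/dz — collapses toward 0

print(grad_ce, grad_mse)
```

The p·(1−p) factor in the MSE gradient is sigmoid's flat tail: the more wrong the model is, the smaller the learning signal — exactly the failure mode cross-entropy avoids.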


Q32. What is knowledge distillation?

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, T=4.0, alpha=0.7):
    # Soft targets from teacher (high temperature softens distribution)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (T**2)
    # Hard label loss
    ce_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss

Used in practice by DistilBERT and TinyBERT (compressed BERT variants) and throughout Hugging Face model-compression pipelines in 2026.


HARD — Advanced Topics (Questions 33-50)

The questions in this section are asked at Google L5+, Meta E5+, and ML research roles. If you can answer these confidently, you're interviewing at the staff/principal level. Don't skip them — this is where top-of-band offers are decided.

Q33. Explain Flash Attention and why it matters for training LLMs.

Flash Attention (Dao et al., 2022, updated v3 in 2024) uses tiling to compute attention in blocks that fit in SRAM, avoiding writing the full attention matrix to HBM (high-bandwidth memory):

  • IO complexity: O(n²/M) HBM reads/writes instead of O(n²), where M = SRAM size
  • Speed: 2-4x faster training in practice; 10x memory reduction
  • Numerically exact: Uses online softmax normalization trick

In 2026, FlashAttention-3 with hardware-aware parallelism is the standard for all serious LLM training. It's built into PyTorch 2.x via scaled_dot_product_attention.

import torch
# Uses FlashAttention automatically when available
out = torch.nn.functional.scaled_dot_product_attention(Q, K, V, is_causal=True)
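The "online softmax" trick FlashAttention relies on can be sketched in NumPy: process the scores block by block, carrying a running max and a rescaled running sum — the result matches full softmax exactly:

```python
import numpy as np

def online_softmax(scores, block=4):
    """Streaming softmax over blocks — the normalization trick behind FlashAttention."""
    m = -np.inf                # running max
    s = 0.0                    # running sum of exp(x - m)
    for i in range(0, len(scores), block):
        blk = scores[i:i + block]
        m_new = max(m, blk.max())
        s = s * np.exp(m - m_new) + np.exp(blk - m_new).sum()  # rescale old sum
        m = m_new
    return np.exp(scores - m) / s

x = np.random.default_rng(0).standard_normal(17)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(online_softmax(x), ref))   # True — numerically exact
```

This is why FlashAttention never needs the full n×n attention matrix in memory: each block's contribution can be folded into the running statistics and discarded.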

Q34. What is the difference between encoder-only, decoder-only, and encoder-decoder transformer models?

| Architecture | Self-Attn | Cross-Attn | Masked? | Examples | Best For |
| --- | --- | --- | --- | --- | --- |
| Encoder-only | Bidirectional | No | No | BERT, RoBERTa | Classification, NER, embeddings |
| Decoder-only | Causal (unidirectional) | No | Yes | GPT-4, LLaMA, Mistral | Text generation, LLMs |
| Encoder-Decoder | Bidirectional enc + causal dec | Yes | Enc: No, Dec: Yes | T5, BART, Whisper | Translation, summarization |

2026 trend: Most frontier models are decoder-only (GPT-4o, Gemini 2.0, LLaMA 3). Encoder-only models (BERT) still dominate embedding/retrieval tasks.


Q35. How does RLHF (Reinforcement Learning from Human Feedback) work?

  1. Supervised Fine-Tuning (SFT): Fine-tune base model on high-quality human demonstrations
  2. Reward Model Training: Humans rank pairs of model outputs → train a reward model RM(prompt, response) → score
  3. RL Optimization (PPO): Use the RM as a reward signal, optimize with PPO while keeping the policy close to SFT baseline (KL penalty)
Objective = E[RM(response)] - β * KL[π_θ || π_SFT]

DPO (Direct Preference Optimization) — dominant in 2026: Eliminates the separate RM by directly optimizing the policy from preference data:

L_DPO = -E[log σ(β * (log π_θ(y_w|x) - log π_ref(y_w|x))
                  - β * (log π_θ(y_l|x) - log π_ref(y_l|x)))]

y_w = preferred response, y_l = rejected response. Simpler, more stable than PPO.
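The DPO loss above translates almost line-for-line into code (a toy sketch for a single preference pair; the log-probabilities are made-up values):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, matching the formula above."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigma(margin)

# Policy identical to reference -> margin 0 -> loss = -log(0.5) ≈ 0.693
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy now prefers y_w more than the reference does -> lower loss
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

β plays the same role as the KL coefficient in PPO-based RLHF: it controls how far the policy is allowed to drift from the reference model.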


Q36. What is LoRA and why is it the standard for LLM fine-tuning in 2026?

W' = W + ΔW = W + B·A

where W ∈ R^(d×k), B ∈ R^(d×r), A ∈ R^(r×k), and r << min(d, k).

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3-8b')
config = LoraConfig(
    r=16,             # rank — smaller = fewer params, less capacity
    lora_alpha=32,    # scaling factor
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
    lora_dropout=0.05,
    task_type='CAUSAL_LM'
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # ~1% of total params

QLoRA extends LoRA with 4-bit quantized base model (NF4 format) + double quantization — allows fine-tuning 70B models on a single 48GB GPU.


Q37. What are common techniques to reduce LLM inference latency?

| Technique | Description | Speedup |
| --- | --- | --- |
| KV Cache | Cache key/value tensors across generation steps | Essential — effectively mandatory |
| Speculative Decoding | Small draft model proposes tokens; large model verifies in parallel | 2-3x |
| Quantization (INT8/INT4) | Reduce weight precision | 2-4x memory, ~1.5x speed |
| Tensor Parallelism | Split model across GPUs column/row-wise | Linear in GPU count |
| Continuous Batching | Don't wait for all sequences to finish; add new requests mid-flight | 10-20x throughput |
| Flash Attention | Memory-efficient attention | 2-4x for long contexts |
| Grouped Query Attention | Share K/V heads across multiple Q heads (MHA→GQA) | 2x KV cache reduction |

Q38. Explain the concept of model calibration and Platt scaling.

Reliability diagrams plot mean predicted probability vs observed frequency per bin.

Calibration techniques:

  1. Platt scaling: Train a logistic regression on logits as post-processing
  2. Temperature scaling: Scale logits by learned temperature T: p = softmax(logits/T). T>1 softens distribution (better calibrated, usually)
  3. Isotonic regression: Non-parametric, more flexible
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Calibrate any classifier with Platt scaling
calibrated = CalibratedClassifierCV(base_classifier, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

# Plot calibration curve
prob_true, prob_pred = calibration_curve(y_test, calibrated.predict_proba(X_test)[:,1], n_bins=10)

Q39. What is the Mixture of Experts (MoE) architecture?

output = Σ_{i in top-k} gate_i(x) * Expert_i(x)

Benefits:

  • Massive parameter count without proportional compute cost (sparse activation)
  • Mixtral 8x7B activates only 2 of 8 experts per token → ~13B active parameters out of 47B total
  • GPT-4 (rumored), Gemini 1.5/2.0, Grok all use MoE

Challenges:

  • Load balancing: auxiliary loss to ensure experts are used equally
  • Communication overhead in distributed training
  • Router collapse: all tokens route to same few experts

Interview insight (Meta/Google ask this): Explain the difference between soft MoE (all experts contribute with learned weights) and sparse MoE (top-k hard selection).
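Sparse top-k routing can be sketched in a few lines (toy router logits; real MoE layers add the auxiliary load-balancing loss discussed above):

```python
import numpy as np

def topk_gate(logits, k=2):
    """Sparse MoE routing: keep the top-k expert logits, softmax over only those."""
    idx = np.argsort(logits)[-k:]           # indices of the k largest logits
    gates = np.zeros_like(logits)
    e = np.exp(logits[idx] - logits[idx].max())
    gates[idx] = e / e.sum()
    return gates

router_logits = np.array([0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.2])  # 8 experts
g = topk_gate(router_logits, k=2)
print(np.count_nonzero(g), g.sum())   # 2 experts active, weights sum to 1
```

A soft MoE would instead softmax over all logits so every expert contributes — which is exactly the soft vs sparse distinction interviewers probe.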


Q40. How do you handle class imbalance in production ML systems?

# Strategy 1: Resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Strategy 2: Class weights
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
model = LogisticRegression(class_weight='balanced')

# Strategy 3: Focal loss (most effective in 2026 for deep learning)
def focal_loss(preds, targets, gamma=2.0, alpha=0.25):
    ce = F.cross_entropy(preds, targets, reduction='none')
    pt = torch.exp(-ce)
    return (alpha * (1-pt)**gamma * ce).mean()

# Strategy 4: Threshold tuning at inference
# Don't use 0.5; tune threshold on validation set for target F1/precision
from sklearn.metrics import precision_recall_curve
prec, rec, thresholds = precision_recall_curve(y_val, y_scores)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
optimal_thresh = thresholds[np.argmax(f1[:-1])]  # prec/rec have one more entry than thresholds

Production-level answer (Google expects): Also mention monitoring class drift over time, stratified sampling in train/val/test splits, and using AUC-PR (not AUC-ROC) as primary metric for imbalanced problems.


Q41. Explain distributed training strategies for large models.

| Strategy | What's Distributed | Suited For |
| --- | --- | --- |
| Data Parallelism (DP) | Data split across GPUs; full model on each | Small-medium models |
| Tensor Parallelism (TP) | Matrices split column/row-wise | Large FFN/attention layers |
| Pipeline Parallelism (PP) | Model layers split across GPUs in stages | Very deep models |
| ZeRO (Stage 1/2/3) | Optimizer states / gradients / parameters sharded | Single-node multi-GPU |
| FSDP (PyTorch) | Full model sharded; parameters gathered on demand | Standard in 2026 |
| 3D Parallelism | DP + TP + PP combined | 100B+ models (GPT-4 scale) |
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

Q42. What are evaluation strategies for generative models (beyond accuracy/F1)?

| Metric | Task | Description |
| --- | --- | --- |
| BLEU | Translation | n-gram overlap with reference |
| ROUGE-L | Summarization | Longest common subsequence |
| BERTScore | Any | Cosine similarity of BERT embeddings |
| Perplexity | LM quality | e^(−(1/N) Σ log p(token)) |
| METEOR | Translation | Alignment-based, handles synonyms |
| Pass@k | Code generation | Probability a correct solution is among k samples |
| HumanEval | Code | Functional correctness on 164 Python problems |
| MT-Bench | Chat LLMs | GPT-4-judged multi-turn conversation quality |
| AlpacaEval | Instruction following | Win rate vs reference model |

2026 trend: LLM-as-a-judge (using GPT-4/Claude as evaluator) is the emerging standard for open-ended generation quality.
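Pass@k is usually computed with the unbiased estimator 1 − C(n−c, k)/C(n, k), where n is the number of samples drawn and c the number that pass (a minimal sketch):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0      # every size-k sample must contain a correct solution
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=4, k=1))   # 0.2 — matches the raw success rate
print(pass_at_k(n=20, c=4, k=5))   # higher: 5 tries, only one must pass
```

The naive alternative — running k samples and checking directly — has much higher variance, which is why the combinatorial estimator is standard.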


Q43. What is catastrophic forgetting and how do you address it in continual learning?

Solutions:

  1. Elastic Weight Consolidation (EWC): Add regularization term that penalizes changing weights important for previous tasks (measured by Fisher Information)
  2. Replay/Rehearsal: Store samples from previous tasks, mix them with new task data
  3. Progressive Neural Networks: Freeze old columns, add new columns for new tasks
  4. LoRA: Fine-tune only low-rank adapters, preserving pre-trained weights exactly
# EWC regularization term
def ewc_loss(model, fisher_dict, optpar_dict, ewc_lambda=1000):
    loss = 0
    for name, param in model.named_parameters():
        fisher = fisher_dict[name]
        optpar = optpar_dict[name]
        loss += (fisher * (optpar - param).pow(2)).sum()
    return ewc_lambda * loss
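Replay/rehearsal (solution 2) is often the strongest baseline in practice. A minimal sketch using reservoir sampling so the buffer stays a uniform sample of everything seen — the class name and capacity are illustrative, not from any particular library:

```python
import random

class ReplayBuffer:
    """Bounded rehearsal buffer; mix its samples into new-task batches."""
    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.data = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        # Reservoir sampling: every example seen so far has equal
        # probability capacity/seen of being in the buffer.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        return self.rng.sample(self.data, min(k, len(self.data)))

buf = ReplayBuffer(capacity=100)
for i in range(10_000):
    buf.add(i)
# Buffer stays at capacity while remaining a uniform sample of all 10,000
```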

Q44. How does label smoothing work and when do you use it?

y_smooth = (1 - ε) * y_hard + ε / K

where y_hard is the one-hot target, K is the number of classes, and ε is the smoothing factor.

Why it works: Prevents the model from becoming overconfident (logits don't grow unboundedly). Improves calibration. Acts as regularization.

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # PyTorch 1.10+

Used in: Image classification (ViT, EfficientNet training), NMT, LLM fine-tuning. Typically ε=0.1.
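The formula is worth being able to write out by hand. A quick NumPy check with an illustrative one-hot target:

```python
import numpy as np

def smooth_labels(y_hard, eps=0.1):
    """y_smooth = (1 - eps) * y_hard + eps / K, K = number of classes."""
    k = y_hard.shape[-1]
    return (1 - eps) * y_hard + eps / k

y = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot, K = 4
smooth_labels(y)  # → [0.025, 0.025, 0.925, 0.025]; still sums to 1
```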


Q45. What are the differences between GPT-style and BERT-style pre-training objectives?

| Aspect | BERT (MLM) | GPT (CLM) |
|---|---|---|
| Objective | Mask 15% of tokens, predict them | Predict next token (left-to-right) |
| Attention | Bidirectional | Causal (unidirectional) |
| Strength | Rich contextual embeddings | Natural generation |
| Downstream | Classification, QA, NER | Text generation, agents |
| 2026 usage | Embeddings, retrieval | LLM chat, code, reasoning |

BERT's MLM is actually a cloze task — corrupted input, full context visible. GPT's CLM is the standard LM objective. SpanBERT, RoBERTa, DeBERTa all build on BERT. GPT-4, LLaMA, Mistral, Gemini all use CLM.
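Mechanically, the difference between the two objectives comes down to the attention mask. A minimal NumPy illustration (4-token sequence for concreteness):

```python
import numpy as np

seq_len = 4
# GPT-style causal mask: position i may attend only to positions <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
# BERT-style bidirectional mask: every position attends to every position
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

# causal:                    bidirectional:
# [[1 0 0 0]                 [[1 1 1 1]
#  [1 1 0 0]                  [1 1 1 1]
#  [1 1 1 0]                  [1 1 1 1]
#  [1 1 1 1]]                 [1 1 1 1]]
```

In a transformer, masked-out positions get -inf added to their attention scores before the softmax, zeroing their contribution.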


Q46. Explain how Retrieval-Augmented Generation (RAG) works at a systems level.

RAG augments LLM generation with relevant documents retrieved at inference time:

1. Document chunking: split docs into 200-500 token chunks
2. Embedding: encode chunks with embedding model (e.g., text-embedding-3-large)
3. Index: store embeddings in vector DB (Pinecone, Qdrant, Weaviate, pgvector)
4. Query time:
   a. Embed query → retrieve top-k similar chunks via ANN search
   b. Rerank chunks (cross-encoder or reciprocal rank fusion)
   c. Stuff chunks into LLM context + query
   d. LLM generates answer grounded in retrieved context

Advanced RAG patterns in 2026:

  • HyDE: Generate a hypothetical answer, embed it, use it to retrieve
  • Hybrid search: Dense (embedding) + sparse (BM25) retrieval, fuse results
  • Agentic RAG: LLM decides when/what to retrieve iteratively
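The core retrieval step (4a) reduces to nearest-neighbor search over embeddings. A minimal in-memory sketch — random vectors stand in for real embeddings, and a brute-force np.argsort stands in for a production ANN index:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per chunk
    return np.argsort(-sims)[:k]       # highest similarity first

rng = np.random.default_rng(0)
chunks = rng.normal(size=(5, 8))            # 5 chunk embeddings, dim 8
query = chunks[3] + 0.01 * rng.normal(size=8)  # query very close to chunk 3
top_k_chunks(query, chunks)  # chunk 3 ranks first
```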

Q47. What is model pruning and quantization? How do they differ?

| Technique | What It Does | Typical Reduction |
|---|---|---|
| Pruning | Remove weights/neurons/heads below a magnitude threshold | 50-90% sparsity, minimal accuracy loss |
| Quantization | Reduce numerical precision (FP32 → INT8/INT4/FP8) | 2-8x memory, inference speedup |
| Knowledge Distillation | Train a smaller student model to match a teacher | Flexible size reduction |
# Quantization with bitsandbytes (standard for LLM inference in 2026)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',       # NF4 distribution for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3-70b',
                                               quantization_config=bnb_config)

GPTQ and AWQ are post-training quantization methods standard in 2026 for deployment. FP8 training is becoming standard on H100s.
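For the pruning side, unstructured magnitude pruning is simple enough to sketch directly. A NumPy illustration with toy weights (the function name is illustrative, not a library API):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.array([[0.5, -0.1],
              [0.02, -0.8]])
pruned, mask = magnitude_prune(w, sparsity=0.5)
# Keeps 0.5 and -0.8; zeroes the two smallest-magnitude weights
```

In PyTorch the same idea is available via torch.nn.utils.prune; in practice the mask is kept so pruned weights stay zero during subsequent fine-tuning.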


Q48. How do you detect and mitigate data leakage in ML pipelines?

Common sources:

  • Scaling/normalization fit on full dataset before splitting
  • Feature engineering using future data (e.g., rolling averages including future)
  • Target leakage: feature correlated with target due to data collection process
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# WRONG: Scaler fit on full data (leaks test statistics to train)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # DON'T DO THIS
X_train_scaled = X_scaled[:800]

# CORRECT: Scaler only sees training data
pipeline = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)  # scaler only fit on X_train

Q49. What is the difference between online learning and offline (batch) learning?

| Aspect | Offline (Batch) Learning | Online Learning |
|---|---|---|
| Training frequency | Once on the full dataset | Continuously on incoming data |
| Model updates | Retrain from scratch or fine-tune periodically | Update on each sample/mini-batch |
| Memory | Requires the full dataset | O(1) memory |
| Adapts to drift? | Only after retraining | Yes, continuously |
| Examples | Most deep learning | Vowpal Wabbit, river, production recommenders |

Concept drift is the main motivation for online learning in production. Monitor for it with distribution-shift tests (Kolmogorov–Smirnov test, Population Stability Index) and trigger retraining when drift exceeds a threshold.
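PSI is easy to implement and a common monitoring follow-up question. A sketch, assuming decile bins computed from the reference (training) distribution; the common rule of thumb is PSI < 0.1 stable, > 0.25 significant drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live distribution."""
    # Bin edges = quantiles of the reference distribution (deciles by default)
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_pct = np.bincount(np.digitize(expected, inner_edges), minlength=bins) / len(expected) + 1e-6
    a_pct = np.bincount(np.digitize(actual, inner_edges), minlength=bins) / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 10_000)
no_drift = psi(train_feature, rng.normal(0, 1, 10_000))    # near 0
drifted = psi(train_feature, rng.normal(0.5, 1, 10_000))   # flags drift
```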


Q50. Design a fraud detection system using ML. Walk through your approach end-to-end.

1. Problem framing:

  • Binary classification: transaction is fraud or not
  • Severe class imbalance (0.1% fraud rate)
  • Latency constraint: decision in <100ms
  • Data: transaction features, user history, device fingerprint, location

2. Feature engineering:

  • Statistical features: amount vs user's 30-day average, velocity (N transactions in last 1h)
  • Behavioral: device ID seen before? geo distance from last transaction?
  • Graph features: shared device/IP with known fraudsters
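A velocity feature like "N transactions in the last hour" can be computed very simply — the timestamps and window here are illustrative (in production this would read from the feature store):

```python
from datetime import datetime, timedelta

def txn_velocity(timestamps, now, window=timedelta(hours=1)):
    """Count of transactions in the trailing window — a classic fraud signal."""
    return sum(1 for t in timestamps if now - window <= t <= now)

now = datetime(2026, 1, 1, 12, 0)
history = [now - timedelta(minutes=m) for m in (5, 20, 50, 200)]
txn_velocity(history, now)  # → 3 (the 200-minute-old one falls outside)
```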

3. Model choices:

  • Tier 1 (fast): LightGBM with tuned threshold — <10ms inference
  • Tier 2 (deep): Temporal graph neural network for ring fraud — run async
  • Ensemble both with calibrated probabilities

4. Handling imbalance:

  • Focal loss or class_weight='balanced'
  • Threshold at 0.3 (high recall) with a manual review queue for scores 0.3–0.8
  • Train on full data but stratify splits
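Focal loss for the binary case can be sketched in a few lines (γ=2 and α=0.25 are the common defaults from the original focal loss paper; the values below are illustrative):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples so the rare
    fraud cases dominate the gradient."""
    pt = p if y == 1 else 1 - p          # probability of the true class
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * math.log(pt)

easy_negative = focal_loss(0.01, 0)  # confident, correct: tiny loss
hard_positive = focal_loss(0.10, 1)  # missed fraud: large loss
```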

5. Production pipeline:

Transaction → Feature Store (Redis/Feast) → LightGBM → Score
                                              ↘ score > 0.8: block
                                              ↘ score 0.3–0.8: queue for manual review

6. Monitoring: Track fraud rate, model score distribution, feature drift daily. Retrain weekly with labeled confirmed fraud cases.


Frequently Asked Questions (FAQ)

Q: How is AI/ML interviewing different in 2026 vs 2023? A: Night and day difference. 2026 interviews heavily emphasize LLM internals (transformers, attention, fine-tuning), MLOps (deployment, monitoring, drift), and the practical tradeoffs of modern foundation models. Pure theoretical statistics questions are fading; system-level ML design is the new differentiator between mid and senior offers.

Q: Should I learn PyTorch or TensorFlow? A: PyTorch. Full stop. It dominates research and is standard at Google (JAX/PyTorch), Meta (PyTorch), Microsoft (PyTorch), and most startups. TensorFlow 2.x and JAX are also used, but if you only learn one, make it PyTorch.

Q: What is the most important paper to read before FAANG ML interviews? A: "Attention Is All You Need" (Transformer), "BERT", "GPT-3", "LoRA", "Flash Attention", and "LLaMA 3" technical reports cover most interview topics in 2026.

Q: How do companies evaluate ML coding in interviews? A: Expect: implement backprop from scratch, write a k-means from scratch, implement cross-entropy loss, debug a training loop with a subtle bug (e.g., gradient not zeroed, wrong loss reduction). Practice on LeetCode ML section and Kaggle notebooks.

Q: What ML system design problems are most commonly asked? A: Design a recommendation system (Netflix/YouTube), design a search ranking system (Google), design a fraud detection system, design a real-time ML feature store.

Q: How important is statistics for ML interviews in 2026? A: Still important for data science roles (A/B testing, hypothesis testing, Bayesian inference). For pure ML engineering roles, it's less emphasized but p-values, confidence intervals, and experimental design will come up.

Q: What math should I review before ML interviews? A: Linear algebra (matrix operations, eigendecomposition, SVD), probability (Bayes, distributions, MLE), calculus (chain rule, partial derivatives), and basic information theory (entropy, KL divergence).

Q: What is the fastest way to practice ML for interviews? A: Here's the proven 4-week sprint: Week 1 — Implement core algorithms from scratch (numpy only: linear regression, logistic regression, k-means, decision tree). Week 2 — Study Hugging Face/PyTorch codebases and implement attention from scratch. Week 3 — Do 2-3 Kaggle competitions to understand real data messiness. Week 4 — Practice ML system design at ml-systems-designs.github.io. This sequence has helped hundreds of engineers crack FAANG ML rounds.

