PapersAdda

AI/ML Interview Questions 2026 — Top 50 Questions with Answers

34 min read
Interview Questions
Last Updated: 30 Mar 2026

AI/ML engineering is the highest-paid engineering discipline in 2026, with median compensation exceeding $200K at top companies. But the interview bar has risen to match. Google DeepMind, Amazon AWS AI, Microsoft Azure AI, and Meta AI no longer test textbook theory — they expect you to reason about large-scale model training, transformer internals, evaluation methodology, and real-world deployment tradeoffs. This guide compiles 50 real questions drawn from 200+ interviews at Google, Amazon, Microsoft, and Meta, with detailed answers, working Python code, and the exact reasoning interviewers want to hear.

You've spent years building your ML skills. The interview is 45 minutes. Let's make sure those 45 minutes go perfectly.

Related articles: Generative AI Interview Questions 2026 | System Design Interview Questions 2026 | Prompt Engineering Interview Questions 2026 | Data Engineering Interview Questions 2026


Which Companies Ask These Questions?

| Topic Cluster | Companies |
| --- | --- |
| Supervised/Unsupervised Learning | Google, Amazon, Microsoft, Meta, Apple |
| Neural Networks & Backpropagation | DeepMind, OpenAI, Nvidia, Hugging Face |
| Transformers & Attention | Google Brain, Meta AI, Cohere, Mistral |
| Loss Functions & Regularization | All FAANG, Databricks, Snowflake AI |
| Model Evaluation & Metrics | All top-tier ML teams |
| Deployment & MLOps | Amazon SageMaker, Azure ML, Vertex AI |

EASY — Foundational Concepts (Questions 1-15)

Don't skip these even if you're experienced. Interviewers at Google and Amazon use foundational questions as warmups — but a shaky answer here colors the entire interview.

Q1. What is the difference between supervised, unsupervised, and reinforcement learning?

| Type | Definition | Label Required | Example |
| --- | --- | --- | --- |
| Supervised | Learn a mapping from inputs to outputs | Yes | Email spam classification |
| Unsupervised | Find structure in unlabeled data | No | Customer segmentation |
| Semi-supervised | Few labeled + many unlabeled samples | Partial | Self-training classifiers |
| Reinforcement | Agent learns via reward/penalty | No (reward signal) | AlphaGo, robotics |
| Self-supervised | Labels derived from the data itself | No | BERT masked language modeling |

In 2026, most frontier models are self-supervised pre-trained and then fine-tuned with supervised or RL signals — a hybrid paradigm.


Q2. Explain bias-variance tradeoff with a concrete example.

  • Bias: Error from wrong assumptions. High bias → underfitting (e.g., linear model for non-linear data).
  • Variance: Sensitivity to training data fluctuations. High variance → overfitting (e.g., 100-depth decision tree on 100 samples).
  • Tradeoff: Reducing bias (more complex model) typically increases variance. Optimal model minimizes total error = Bias² + Variance + Irreducible Noise.
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import numpy as np

X, y = make_regression(n_samples=100, noise=20, random_state=42)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

for depth in [1, 3, 5, 10, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_err = np.mean((model.predict(X_train) - y_train)**2)
    test_err  = np.mean((model.predict(X_test) - y_test)**2)
    print(f"Depth={str(depth):>4}  Train MSE={train_err:.1f}  Test MSE={test_err:.1f}")
# depth=1 (underfitting): high test error
# depth=None (overfitting): train≈0 but high test error
# depth=5 (sweet spot): balanced

Q3. What are precision, recall, F1, and AUC-ROC? When do you use each?

Precision = TP / (TP + FP)   # Of all predicted positives, how many were correct?
Recall    = TP / (TP + FN)   # Of all actual positives, how many did we catch?
F1        = 2 * P * R / (P + R)  # Harmonic mean — use when classes are imbalanced
| Metric | Use When |
| --- | --- |
| Precision | False positives are costly (spam filter — don't block legit email) |
| Recall | False negatives are costly (cancer detection — don't miss a case) |
| F1 | Imbalanced classes, need balance of P and R |
| AUC-ROC | Ranking quality; threshold-independent; class-balanced problems |
| AUC-PR | Better for highly imbalanced datasets (rare event detection) |
from sklearn.metrics import classification_report, roc_auc_score
# classification_report gives precision, recall, F1 per class
print(classification_report(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_scores))
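To make the formulas concrete, here they are computed by hand from hypothetical confusion-matrix counts (the numbers are made up purely for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 80, 20, 10, 890

precision = TP / (TP + FP)          # 80/100 = 0.80
recall    = TP / (TP + FN)          # 80/90  ≈ 0.889
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that TN never enters precision, recall, or F1 — which is exactly why they are preferred over accuracy for imbalanced data.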

Q4. Explain L1 vs L2 regularization.

| Property | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty | λ Σ \|wᵢ\| | λ Σ wᵢ² |
| Effect on weights | Sparse (zeros out irrelevant features) | Small but non-zero |
| Use case | Feature selection, high-dimensional data | When all features matter |
| Solution | No closed form | Has closed-form solution |
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso(alpha=0.1)   # L1 — produces sparse coefficients
ridge = Ridge(alpha=1.0)   # L2 — shrinks all coefficients uniformly

ElasticNet combines both penalties: α · L1 + (1 − α) · L2. It is implemented in glmnet and in scikit-learn's ElasticNet, and is widely used in practice.


Q5. What is gradient descent and its variants?

| Variant | Update Frequency | Pros | Cons |
| --- | --- | --- | --- |
| Batch GD | Full dataset per step | Stable convergence | Slow for large data |
| SGD | 1 sample per step | Fast updates | Noisy, unstable |
| Mini-batch GD | k samples per step | Best of both | Requires tuning batch size |
| Momentum | Mini-batch + velocity | Faster, less oscillation | Extra hyperparameter |
| Adam | Adaptive learning rates per parameter | Gold standard in 2026 | Can overfit; higher memory |
| AdamW | Adam + decoupled weight decay | Best for transformers | Same memory cost |
import torch.optim as optim
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
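A minimal sketch of the difference between plain gradient descent and momentum on a 1-D quadratic (the learning rate and momentum values here are illustrative, not tuned):

```python
def grad(w):                      # gradient of f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

# Plain gradient descent: step directly along the negative gradient
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)

# Momentum: accumulate a velocity term that damps oscillation
w_m, v = 0.0, 0.0
for _ in range(100):
    v = 0.9 * v + grad(w_m)
    w_m -= 0.1 * v

print(w, w_m)   # both approach the minimum at w = 3
```

Adam and AdamW extend this idea by also tracking a per-parameter second moment to scale each step adaptively.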

Q6. What is cross-validation? Why is k-fold preferred over train/test split?

k-fold CV partitions the data into k folds; each fold serves once as validation while the other k−1 train the model, and the k scores are averaged. Advantages over a single split:

  • Reduces variance of evaluation estimate
  • Uses all data for both training and validation
  • Stratified k-fold maintains class balance per fold
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(), X, y, cv=skf, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

Insider tip (Google/Amazon): In 2026 interviews, they often ask about time-series CV where you cannot shuffle data — use TimeSeriesSplit. Getting this wrong is an instant red flag for any data science role.


Q7. What is the curse of dimensionality?

  1. Data becomes sparse — distance metrics lose meaning
  2. Volume of space grows exponentially — need exponentially more samples
  3. Nearest neighbors become equidistant

Mitigation: PCA, t-SNE, UMAP for dimensionality reduction. Feature selection. Regularization.

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # retain 95% variance
X_reduced = pca.fit_transform(X)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
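Distance concentration can be seen empirically (a toy experiment, not a formal proof): as dimensionality grows, the spread of distances shrinks relative to their mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=500):
    """(max - min) distance to a reference point, relative to the mean distance."""
    X = rng.random((n, dim))
    d = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from one point
    return (d.max() - d.min()) / d.mean()

# High dimensions: all points become roughly equidistant
print(distance_spread(2), distance_spread(1000))
```

This is exactly why nearest-neighbor methods degrade in high dimensions: "nearest" and "farthest" become nearly indistinguishable.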

Q8. Explain decision trees and the concept of information gain.

Information Gain = Entropy(parent) - weighted_avg(Entropy(children))
Entropy(S) = -Σ p_i * log2(p_i)
Gini Impurity = 1 - Σ p_i²
from sklearn.tree import DecisionTreeClassifier, export_text
dt = DecisionTreeClassifier(criterion='gini', max_depth=4)
dt.fit(X_train, y_train)
print(export_text(dt, feature_names=feature_names))

CART uses Gini; ID3/C4.5 use entropy. In practice, Gini is faster (no log computation) and used by default in scikit-learn.
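The entropy and Gini formulas above can be sketched directly (a minimal illustration, not scikit-learn's implementation):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0·log(0) is taken as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0, 0.5 — maximally impure
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # 0.0, 0.0 — pure node

# Information gain for a split that yields two pure, equal-size children
gain = entropy([0.5, 0.5]) - 0.5 * entropy([1.0]) - 0.5 * entropy([1.0])
print(gain)  # 1.0
```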


Q9. What is a random forest? How does it reduce variance?

  1. Bootstrap sampling (bagging): Each tree trains on a random sample with replacement
  2. Feature randomness: Each split considers only a random subset of features (typically √n_features)

This decorrelates the trees so their errors don't all occur in the same direction, reducing variance while keeping bias low.

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                             n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_  # impurity-based (Gini) importance

Q10. What is gradient boosting and how does XGBoost improve it?

  • Gradient Boosting: Sequentially trains trees, each fitting the residuals (negative gradient of the loss) of the previous model
  • XGBoost improvements: Regularization (L1/L2 on tree weights), second-order gradients (Newton method), column subsampling, cache-aware computation, GPU support
import xgboost as xgb
model = xgb.XGBClassifier(
    n_estimators=500, max_depth=6,
    learning_rate=0.05, subsample=0.8,
    colsample_bytree=0.8, reg_alpha=0.1,
    tree_method='hist',  # fast histogram method
    device='cuda'        # GPU in 2026
)

LightGBM vs XGBoost in 2026: LightGBM uses leaf-wise growth (faster, better accuracy); XGBoost uses level-wise (more regularized). Both support GPU.


Q11. What is the difference between generative and discriminative models?

| Model Type | What It Learns | Examples |
| --- | --- | --- |
| Discriminative | P(y\|x) — decision boundary | Logistic regression, SVM, standard neural networks |
| Generative | P(x,y) or P(x) — data distribution | Naive Bayes, VAE, GAN, diffusion models |

Generative models can generate new data; discriminative models can only classify.


Q12. Explain the vanishing gradient problem.

Gradients shrink exponentially as they are backpropagated through many layers when each layer multiplies them by factors smaller than 1 (e.g., sigmoid/tanh derivatives), so early layers learn extremely slowly.

Solutions:

  • ReLU activation (no saturation for positive values)
  • Batch Normalization (normalizes activations per mini-batch)
  • Residual connections (skip connections — gradients flow directly)
  • Gradient clipping (for RNNs)
  • Careful initialization (He/Xavier)
import torch.nn as nn
# He initialization for ReLU networks
conv = nn.Conv2d(64, 128, 3)
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')

Q13. What is batch normalization and why does it help?

x_hat = (x - mean) / sqrt(var + ε)
y = γ * x_hat + β

Benefits: Reduces internal covariate shift, allows higher learning rates, acts as mild regularizer, makes networks less sensitive to initialization.

2026 note: In transformers, Layer Normalization (normalizes across features for each sample, not across the batch) is preferred because it works with variable-length sequences and small batch sizes.
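The distinction can be sketched in NumPy — same formula, different normalization axis (toy data; the learned γ/β scale-and-shift is omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(2.0, 3.0, size=(8, 4))  # (batch, features)
eps = 1e-5

# BatchNorm: statistics per feature, computed across the batch (axis=0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: statistics per sample, computed across features (axis=1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0).round(6))  # each feature ≈ 0 across the batch
print(ln.mean(axis=1).round(6))  # each sample ≈ 0 across its features
```

Because LayerNorm never touches the batch axis, it behaves identically at batch size 1 and with variable-length sequences — the property transformers rely on.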


Q14. What is dropout and when should you use it?

Dropout randomly zeroes each unit's activation with probability p during training, preventing co-adaptation of units and acting as an ensemble-like regularizer. At inference no units are dropped; inverted dropout rescales activations during training so no test-time scaling is needed.

import torch.nn as nn
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # 30% dropout
    nn.Linear(256, 10)
)

Use dropout in fully-connected layers of large networks. Avoid in shallow models or with BatchNorm (they interact poorly). In transformers, dropout is applied to attention weights and residual streams.


Q15. What is transfer learning? When is it effective?

Transfer learning reuses a model pre-trained on a large source task as the starting point for a related target task. It is effective when:

  • Target domain has limited labeled data
  • Source and target domains are related
  • Pre-trained model captures useful low-level features (edges, n-grams, etc.)

Strategies:

  1. Feature extraction: Freeze pre-trained layers, train only the head
  2. Fine-tuning: Unfreeze some/all layers, train with low LR
  3. Domain adaptation: When source/target have different distributions
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                       num_labels=3)
# Only fine-tune the classifier head initially
for name, param in model.named_parameters():
    if 'classifier' not in name:
        param.requires_grad = False

MEDIUM — Neural Networks & Deep Learning (Questions 16-32)

This section separates ML engineers from data analysts. If you're targeting L4+ roles at Google or E4+ at Meta, you must be able to implement these concepts from scratch.

Q16. Explain the transformer architecture in detail.

Encoder:

  1. Input embedding + positional encoding
  2. Multi-head self-attention (queries, keys, values from same sequence)
  3. Add & Norm (residual connection + layer norm)
  4. Feed-forward network (two linear layers + ReLU/GELU)
  5. Add & Norm

Decoder (for seq2seq):

  • Masked self-attention (can't see future tokens)
  • Cross-attention (queries from decoder, keys/values from encoder)
  • Feed-forward + Add & Norm
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
MultiHead(Q,K,V) = Concat(head_1, ..., head_h) * W_O
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

Q17. Why does attention use sqrt(d_k) scaling?

Without scaling, dot products grow with dimension: if the components q_i, k_i ~ N(0,1), then q·k has variance d_k (std = √d_k). Large-magnitude scores push softmax into its saturated region, where gradients vanish. Dividing by √d_k renormalizes the scores to unit variance.
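The variance argument can be checked numerically (a toy simulation with random Gaussian vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))

dots = (q * k).sum(axis=1)            # 10,000 query-key dot products
print(dots.var())                     # ≈ d_k = 512
print((dots / np.sqrt(d_k)).var())    # ≈ 1 after scaling
```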


Q18. What is positional encoding in transformers and why is it needed?

Sinusoidal (original):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Learned positional embeddings (used in BERT, GPT): Trainable embedding lookup per position. Works well for fixed-length contexts.

RoPE (Rotary Position Embedding) — dominant in 2026 (LLaMA, Mistral, Gemma): Encodes relative position by rotating Q/K vectors. Extrapolates better to longer contexts.

ALiBi: Adds a linear bias to attention scores based on distance. Simple and effective for length generalization.
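A minimal NumPy sketch of the sinusoidal formula above, interleaving sin/cos per dimension pair:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sin
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cos
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)        # (128, 64)
print(pe[0, :4])       # position 0: [sin(0), cos(0), ...] = [0, 1, 0, 1]
```

Each position gets a unique, bounded fingerprint, and relative offsets correspond to fixed linear transforms of these vectors — which is why the original paper chose this scheme.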


Q19. Explain convolutional neural networks (CNNs) — how do they achieve translation invariance?

  • Convolution: Slide a learned filter across the input, computing dot products → detect local patterns
  • Parameter sharing: Same filter weights applied everywhere → detects the same feature regardless of location
  • Pooling: Max/average pool reduces spatial dimensions → slight translation invariance
import torch.nn as nn
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 64 3x3 filters
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2),                               # halve spatial dims
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d((1,1)),                   # global average pool
    nn.Flatten(),
    nn.Linear(128, 10)
)

Receptive field grows with depth. ResNet, EfficientNet, ConvNeXt are top choices in 2026 for pure CV tasks.


Q20. What is the dying ReLU problem and how do you solve it?

A ReLU unit "dies" when its pre-activation becomes negative for every input: its output and gradient are then zero everywhere, so the unit stops updating permanently. Large learning rates and poor initialization make this more likely.

Solutions:

| Activation | Formula | Property |
| --- | --- | --- |
| Leaky ReLU | max(αx, x), α=0.01 | Small gradient for x<0 |
| ELU | x if x>0, else α(eˣ − 1) | Smooth, negative saturation |
| GELU | x·Φ(x) | Used in BERT, GPT — smooth |
| SwiGLU | x·σ(βx) × linear gate | Used in LLaMA, Mistral in 2026 |

Q21. How does backpropagation work? Walk through a simple example.

import torch

# Simple 2-layer network manual backprop illustration
x = torch.tensor([2.0])
w1 = torch.tensor([0.5], requires_grad=True)
w2 = torch.tensor([0.3], requires_grad=True)
b  = torch.tensor([0.1], requires_grad=True)

# Forward pass
h = torch.relu(w1 * x + b)   # hidden layer
y_pred = w2 * h               # output
loss = (y_pred - 1.0) ** 2    # MSE vs target=1

# Backward pass
loss.backward()
print(w1.grad, w2.grad, b.grad)  # ∂L/∂w1, ∂L/∂w2, ∂L/∂b

Chain rule: ∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂h * ∂h/∂w1

Key interview insight (Google asks this): The backward pass computes gradients in O(forward pass time) using the stored activations — it doesn't recompute intermediate values. Gradient checkpointing trades recomputation for memory by discarding intermediate activations.
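The same gradients can be verified by applying the chain rule by hand with the numbers from the snippet above (pure Python, no autograd):

```python
x, w1, w2, b = 2.0, 0.5, 0.3, 0.1
target = 1.0

# Forward pass
h = max(w1 * x + b, 0.0)          # ReLU(1.1) = 1.1
y_pred = w2 * h                   # 0.33
dL_dy = 2 * (y_pred - target)     # dL/dy_pred = -1.34

# Backward pass via the chain rule
dL_dw2 = dL_dy * h                # -1.474
dL_dh  = dL_dy * w2               # -0.402
relu_on = 1.0 if w1 * x + b > 0 else 0.0   # ReLU gradient gate
dL_dw1 = dL_dh * relu_on * x      # -0.804
dL_db  = dL_dh * relu_on          # -0.402

print(dL_dw1, dL_dw2, dL_db)      # matches w1.grad, w2.grad, b.grad
```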


Q22. What are activation functions and how do you choose one?

| Activation | Range | Used In | Notes |
| --- | --- | --- | --- |
| Sigmoid | (0,1) | Binary output | Vanishing gradient |
| Tanh | (−1,1) | RNN gates | Better than sigmoid |
| ReLU | [0,∞) | Most DNNs | Fast, dying ReLU risk |
| GELU | ≈ReLU | Transformers | Smooth, better accuracy |
| SwiGLU | — | LLMs (LLaMA 3) | Gated, computationally more expensive |
| Softmax | (0,1), sums to 1 | Multiclass output | Numerically stable with log |

Rule of thumb (2026): Hidden layers in transformers → GELU or SwiGLU. Hidden layers in CNNs → ReLU or its variants. Output layer → Sigmoid (binary), Softmax (multiclass), Linear (regression).


Q23. Explain support vector machines (SVMs).

Key concepts:

  • Hard margin: No misclassification allowed (only for linearly separable data)
  • Soft margin: Allows some misclassification with penalty C (C controls bias-variance)
  • Kernel trick: Maps data to higher-dimensional space implicitly via kernel function K(x,x') = φ(x)·φ(x') without computing φ explicitly
from sklearn.svm import SVC
# RBF kernel — works well for non-linear boundaries
svm = SVC(kernel='rbf', C=10, gamma='scale', probability=True)
svm.fit(X_train, y_train)

Kernels: Linear (text), RBF (general purpose), Polynomial, Sigmoid. In 2026, SVMs are rarely used at scale but appear in interviews as a test of mathematical understanding.


Q24. What is the EM algorithm? Give a practical example.

  • E-step: Compute the expected value of the latent variables given current parameters
  • M-step: Update parameters to maximize the expected log-likelihood

Example — Gaussian Mixture Models:

from sklearn.mixture import GaussianMixture
import numpy as np

# GMM uses EM under the hood
gmm = GaussianMixture(n_components=3, covariance_type='full', max_iter=200)
gmm.fit(X)

# E-step: assign soft cluster probabilities
responsibilities = gmm.predict_proba(X)
# M-step: update means, covariances, mixing weights
print(gmm.means_, gmm.covariances_)

EM is guaranteed to increase log-likelihood at each step but only converges to a local maximum. Running multiple restarts with different initializations helps.


Q25. What are autoencoders and variational autoencoders (VAEs)?

  • Autoencoder: Encoder compresses input to latent vector z; Decoder reconstructs input. Loss = reconstruction error. Used for compression, denoising, anomaly detection.
  • VAE: Encoder outputs a distribution N(μ, σ²) over z, not a point. Sample z from this distribution, decode. Loss = reconstruction error + KL divergence from prior N(0,I).
import torch, torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.enc = nn.Linear(input_dim, 128)
        self.mu  = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128),
                                  nn.ReLU(),
                                  nn.Linear(128, input_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    recon = nn.functional.mse_loss(recon_x, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp())
    return recon + kl

Q26. What is attention mechanism in NLP before transformers (Bahdanau attention)?

e_ij = score(s_{i-1}, h_j)   # alignment score
α_ij = softmax(e_ij)          # attention weight
c_i  = Σ_j α_ij * h_j        # context vector

This allows the decoder to "look at" any part of the input at each step, solving the information bottleneck of fixed-size context vectors. Transformers generalize this by making attention the primary computation, not just a supplement to RNNs.


Q27. How does LSTM solve the vanishing gradient problem of vanilla RNNs?

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)  # forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)  # input gate
g_t = tanh(W_g · [h_{t-1}, x_t] + b_g)  # candidate
C_t = f_t * C_{t-1} + i_t * g_t       # cell update
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)  # output gate
h_t = o_t * tanh(C_t)

The cell state pathway has additive updates (not multiplicative), so gradients flow back through the forget gate path without vanishing over long sequences. GRUs simplify LSTMs to 2 gates with similar performance.
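A single LSTM step implementing the equations above (a toy sketch with random weights; the packed-gate weight layout mirrors common implementations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM step; W packs the four gate weight matrices row-wise."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = len(h_prev)
    f = sigmoid(z[0*H:1*H])           # forget gate
    i = sigmoid(z[1*H:2*H])           # input gate
    g = np.tanh(z[2*H:3*H])           # candidate
    o = sigmoid(z[3*H:4*H])           # output gate
    C = f * C_prev + i * g            # additive cell update — the key to gradient flow
    h = o * np.tanh(C)
    return h, C

rng = np.random.default_rng(0)
H, D = 4, 3
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, C = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, C.shape)   # (4,) (4,)
```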


Q28. What is k-means clustering? What are its limitations?

from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
km.fit(X)

Algorithm: Initialize k centroids → assign each point to nearest centroid → recompute centroids → repeat until convergence.

Limitations:

  • Must specify k in advance (use elbow method or silhouette score)
  • Assumes spherical, equally-sized clusters
  • Sensitive to outliers
  • Finds local minima (random init — use k-means++ to mitigate)
  • Does not work well for non-convex clusters (use DBSCAN instead)

DBSCAN is preferred in 2026 for anomaly detection: density-based, auto-determines k, marks outliers.


Q29. What is the difference between bagging and boosting?

| Property | Bagging | Boosting |
| --- | --- | --- |
| Trees trained | In parallel | Sequentially |
| Each tree's data | Bootstrap sample | Weighted — focus on errors |
| Error reduced | Variance | Bias |
| Output | Average/vote | Weighted sum |
| Examples | Random Forest | XGBoost, LightGBM, CatBoost |
| Overfitting risk | Low | Higher (needs early stopping) |

Q30. Explain the concept of attention heads and multi-head attention.

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.h = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        B, T, D = Q.shape
        Q = self.W_q(Q).view(B, T, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(B, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(B, -1, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention for all heads
        scores = (Q @ K.transpose(-2,-1)) / (self.d_k**0.5)
        if mask is not None: scores = scores.masked_fill(mask==0, -1e9)
        attn = scores.softmax(-1) @ V
        out = attn.transpose(1,2).contiguous().view(B, T, D)
        return self.W_o(out)

Different heads learn to attend to different relationship types: syntactic, semantic, positional, coreference.


Q31. What is the difference between cross-entropy loss and MSE? When do you use each?

| Loss | Formula | Use Case |
| --- | --- | --- |
| MSE | Σ(y − ŷ)² / n | Regression; assumes Gaussian noise |
| MAE | Σ\|y − ŷ\| / n | Regression; robust to outliers |
| Cross-Entropy | −Σ y·log(ŷ) | Classification; assumes Bernoulli/Categorical |
| KL Divergence | Σ p·log(p/q) | Distribution matching (VAEs, KD) |
| Focal Loss | −(1−ŷ)^γ · y·log(ŷ) | Imbalanced object detection |
| Contrastive | max(0, margin − d) | Metric learning, Siamese networks |

Why cross-entropy for classification? It arises naturally from maximum likelihood estimation under a categorical distribution. MSE applied to classification outputs leads to saturation at wrong predictions (sigmoid's flat tails).
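The saturation point can be made concrete: for a confidently wrong sigmoid prediction, the gradient of MSE with respect to the logit collapses, while the cross-entropy gradient does not (toy numbers):

```python
import numpy as np

z, y = -10.0, 1.0                 # logit says "negative", true label is positive
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z) ≈ 4.5e-5

grad_ce  = p - y                       # d(BCE)/dz — stays ≈ -1
grad_mse = 2 * (p - y) * p * (1 - p)   # d((p-y)^2)/dz — collapses toward 0

print(grad_ce, grad_mse)
```

The p·(1−p) factor in the MSE gradient is sigmoid's flat tail: the more wrong the model is, the smaller the learning signal — exactly the failure mode cross-entropy avoids.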


Q32. What is knowledge distillation?

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, T=4.0, alpha=0.7):
    # Soft targets from teacher (high temperature softens distribution)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (T**2)
    # Hard label loss
    ce_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss

Used in practice by DistilBERT and TinyBERT (compressed BERT variants) and throughout Hugging Face model-compression pipelines in 2026.


HARD — Advanced Topics (Questions 33-50)

The questions in this section are asked at Google L5+, Meta E5+, and ML research roles. If you can answer these confidently, you're interviewing at the staff/principal level. Don't skip them — this is where top-of-band offers are decided.

Q33. Explain Flash Attention and why it matters for training LLMs.

Flash Attention (Dao et al., 2022, updated v3 in 2024) uses tiling to compute attention in blocks that fit in SRAM, avoiding writing the full attention matrix to HBM (high-bandwidth memory):

  • IO complexity: O(n²/M) HBM reads/writes instead of O(n²), where M = SRAM size
  • Speed: 2-4x faster training in practice; 10x memory reduction
  • Numerically exact: Uses online softmax normalization trick

In 2026, FlashAttention-3 with hardware-aware parallelism is the standard for all serious LLM training. It's built into PyTorch 2.x via scaled_dot_product_attention.

import torch
# Uses FlashAttention automatically when available
out = torch.nn.functional.scaled_dot_product_attention(Q, K, V, is_causal=True)
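The "online softmax" trick FlashAttention relies on can be sketched in NumPy: process the scores block by block, carrying a running max and a rescaled running sum — the result matches full softmax exactly:

```python
import numpy as np

def online_softmax(scores, block=4):
    """Streaming softmax over blocks — the normalization trick behind FlashAttention."""
    m = -np.inf                # running max
    s = 0.0                    # running sum of exp(x - m)
    for i in range(0, len(scores), block):
        blk = scores[i:i + block]
        m_new = max(m, blk.max())
        s = s * np.exp(m - m_new) + np.exp(blk - m_new).sum()  # rescale old sum
        m = m_new
    return np.exp(scores - m) / s

x = np.random.default_rng(0).standard_normal(17)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(online_softmax(x), ref))   # True — numerically exact
```

This is why FlashAttention never needs the full n×n attention matrix in memory: each block's contribution can be folded into the running statistics and discarded.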

Q34. What is the difference between encoder-only, decoder-only, and encoder-decoder transformer models?

| Architecture | Self-Attn | Cross-Attn | Masked? | Examples | Best For |
| --- | --- | --- | --- | --- | --- |
| Encoder-only | Bidirectional | No | No | BERT, RoBERTa | Classification, NER, embeddings |
| Decoder-only | Causal (unidirectional) | No | Yes | GPT-4, LLaMA, Mistral | Text generation, LLMs |
| Encoder-Decoder | Bidirectional enc + causal dec | Yes | Enc: No, Dec: Yes | T5, BART, Whisper | Translation, summarization |

2026 trend: Most frontier models are decoder-only (GPT-4o, Gemini 2.0, LLaMA 3). Encoder-only models (BERT) still dominate embedding/retrieval tasks.


Q35. How does RLHF (Reinforcement Learning from Human Feedback) work?

  1. Supervised Fine-Tuning (SFT): Fine-tune base model on high-quality human demonstrations
  2. Reward Model Training: Humans rank pairs of model outputs → train a reward model RM(prompt, response) → score
  3. RL Optimization (PPO): Use the RM as a reward signal, optimize with PPO while keeping the policy close to SFT baseline (KL penalty)
Objective = E[RM(response)] - β * KL[π_θ || π_SFT]

DPO (Direct Preference Optimization) — dominant in 2026: Eliminates the separate RM by directly optimizing the policy from preference data:

L_DPO = -E[log σ(β * (log π_θ(y_w|x) - log π_ref(y_w|x))
                  - β * (log π_θ(y_l|x) - log π_ref(y_l|x)))]

y_w = preferred response, y_l = rejected response. Simpler, more stable than PPO.
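The DPO loss above translates almost line-for-line into code (a toy sketch for a single preference pair; the log-probabilities are made-up values):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, matching the formula above."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigma(margin)

# Policy identical to reference -> margin 0 -> loss = -log(0.5) ≈ 0.693
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy now prefers y_w more than the reference does -> lower loss
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

β plays the same role as the KL coefficient in PPO-based RLHF: it controls how far the policy is allowed to drift from the reference model.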


Q36. What is LoRA and why is it the standard for LLM fine-tuning in 2026?

W' = W + ΔW = W + B·A

where W ∈ R^(d×k), B ∈ R^(d×r), A ∈ R^(r×k), and r << min(d, k).

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3-8b')
config = LoraConfig(
    r=16,             # rank — smaller = fewer params, less capacity
    lora_alpha=32,    # scaling factor
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
    lora_dropout=0.05,
    task_type='CAUSAL_LM'
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # ~1% of total params

QLoRA extends LoRA with 4-bit quantized base model (NF4 format) + double quantization — allows fine-tuning 70B models on a single 48GB GPU.


Q37. What are common techniques to reduce LLM inference latency?

| Technique | Description | Speedup |
| --- | --- | --- |
| KV Cache | Cache key/value tensors across generation steps | Essential — effectively mandatory |
| Speculative Decoding | Small draft model proposes tokens; large model verifies in parallel | 2-3x |
| Quantization (INT8/INT4) | Reduce weight precision | 2-4x memory, ~1.5x speed |
| Tensor Parallelism | Split model across GPUs column/row-wise | Linear in GPU count |
| Continuous Batching | Don't wait for all sequences to finish; add new requests mid-flight | 10-20x throughput |
| Flash Attention | Memory-efficient attention | 2-4x for long contexts |
| Grouped Query Attention | Share K/V heads across multiple Q heads (MHA→GQA) | 2x KV cache reduction |

Q38. Explain the concept of model calibration and Platt scaling.

Reliability diagrams plot mean predicted probability vs observed frequency per bin.

Calibration techniques:

  1. Platt scaling: Train a logistic regression on logits as post-processing
  2. Temperature scaling: Scale logits by learned temperature T: p = softmax(logits/T). T>1 softens distribution (better calibrated, usually)
  3. Isotonic regression: Non-parametric, more flexible
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Calibrate any classifier with Platt scaling
calibrated = CalibratedClassifierCV(base_classifier, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

# Plot calibration curve
prob_true, prob_pred = calibration_curve(y_test, calibrated.predict_proba(X_test)[:,1], n_bins=10)

Q39. What is the Mixture of Experts (MoE) architecture?

output = Σ_{i in top-k} gate_i(x) * Expert_i(x)

Benefits:

  • Massive parameter count without proportional compute cost (sparse activation)
  • Mixtral 8x7B activates only 2 of 8 experts per token → ~13B active parameters out of 47B total
  • GPT-4 (rumored), Gemini 1.5/2.0, Grok all use MoE

Challenges:

  • Load balancing: auxiliary loss to ensure experts are used equally
  • Communication overhead in distributed training
  • Router collapse: all tokens route to same few experts

Interview insight (Meta/Google ask this): Explain the difference between soft MoE (all experts contribute with learned weights) and sparse MoE (top-k hard selection).
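Sparse top-k routing can be sketched in a few lines (toy router logits; real MoE layers add the auxiliary load-balancing loss discussed above):

```python
import numpy as np

def topk_gate(logits, k=2):
    """Sparse MoE routing: keep the top-k expert logits, softmax over only those."""
    idx = np.argsort(logits)[-k:]           # indices of the k largest logits
    gates = np.zeros_like(logits)
    e = np.exp(logits[idx] - logits[idx].max())
    gates[idx] = e / e.sum()
    return gates

router_logits = np.array([0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.2])  # 8 experts
g = topk_gate(router_logits, k=2)
print(np.count_nonzero(g), g.sum())   # 2 experts active, weights sum to 1
```

A soft MoE would instead softmax over all logits so every expert contributes — which is exactly the soft vs sparse distinction interviewers probe.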


Q40. How do you handle class imbalance in production ML systems?

# Strategy 1: Resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Strategy 2: Class weights
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
model = LogisticRegression(class_weight='balanced')

# Strategy 3: Focal loss (most effective in 2026 for deep learning)
def focal_loss(preds, targets, gamma=2.0, alpha=0.25):
    ce = F.cross_entropy(preds, targets, reduction='none')
    pt = torch.exp(-ce)
    return (alpha * (1-pt)**gamma * ce).mean()

# Strategy 4: Threshold tuning at inference
# Don't use 0.5; tune threshold on validation set for target F1/precision
from sklearn.metrics import precision_recall_curve
prec, rec, thresholds = precision_recall_curve(y_val, y_scores)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
optimal_thresh = thresholds[np.argmax(f1[:-1])]  # prec/rec have one more entry than thresholds

Production-level answer (Google expects): Also mention monitoring class drift over time, stratified sampling in train/val/test splits, and using AUC-PR (not AUC-ROC) as primary metric for imbalanced problems.


Q41. Explain distributed training strategies for large models.

| Strategy | What's Distributed | Suited For |
| --- | --- | --- |
| Data Parallelism (DP) | Data split across GPUs; full model on each | Small-medium models |
| Tensor Parallelism (TP) | Matrices split column/row-wise | Large FFN/attention layers |
| Pipeline Parallelism (PP) | Model layers split across GPUs in stages | Very deep models |
| ZeRO (Stage 1/2/3) | Optimizer states / gradients / parameters sharded | Single-node multi-GPU |
| FSDP (PyTorch) | Full model sharded; parameters gathered on demand | Standard in 2026 |
| 3D Parallelism | DP + TP + PP combined | 100B+ models (GPT-4 scale) |
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

Q42. What are evaluation strategies for generative models (beyond accuracy/F1)?

| Metric | Task | Description |
| --- | --- | --- |
| BLEU | Translation | n-gram overlap with reference |
| ROUGE-L | Summarization | Longest common subsequence |
| BERTScore | Any | Cosine similarity of BERT embeddings |
| Perplexity | LM quality | e^(−(1/N) Σ log p(token)) |
| METEOR | Translation | Alignment-based, handles synonyms |
| Pass@k | Code generation | Probability a correct solution is among k samples |
| HumanEval | Code | Functional correctness on 164 Python problems |
| MT-Bench | Chat LLMs | GPT-4-judged multi-turn conversation quality |
| AlpacaEval | Instruction following | Win rate vs reference model |

2026 trend: LLM-as-a-judge (using GPT-4/Claude as evaluator) is the emerging standard for open-ended generation quality.
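Pass@k is usually computed with the unbiased estimator 1 − C(n−c, k)/C(n, k), where n is the number of samples drawn and c the number that pass (a minimal sketch):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0      # every size-k sample must contain a correct solution
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=4, k=1))   # 0.2 — matches the raw success rate
print(pass_at_k(n=20, c=4, k=5))   # higher: 5 tries, only one must pass
```

The naive alternative — running k samples and checking directly — has much higher variance, which is why the combinatorial estimator is standard.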


Q43. What is catastrophic forgetting and how do you address it in continual learning?

Solutions:

  1. Elastic Weight Consolidation (EWC): Add regularization term that penalizes changing weights important for previous tasks (measured by Fisher Information)
  2. Replay/Rehearsal: Store samples from previous tasks, mix them with new task data
  3. Progressive Neural Networks: Freeze old columns, add new columns for new tasks
  4. LoRA: Fine-tune only low-rank adapters, preserving pre-trained weights exactly
# EWC regularization term
def ewc_loss(model, fisher_dict, optpar_dict, ewc_lambda=1000):
    loss = 0
    for name, param in model.named_parameters():
        fisher = fisher_dict[name]
        optpar = optpar_dict[name]
        loss += (fisher * (optpar - param).pow(2)).sum()
    return ewc_lambda * loss
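Replay/rehearsal (solution 2) is often the strongest baseline in practice. A minimal sketch using reservoir sampling so the buffer stays a uniform sample of everything seen — the class name and capacity are illustrative, not from any particular library:

```python
import random

class ReplayBuffer:
    """Bounded rehearsal buffer; mix its samples into new-task batches."""
    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.data = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        # Reservoir sampling: every example seen so far has equal
        # probability capacity/seen of being in the buffer.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        return self.rng.sample(self.data, min(k, len(self.data)))

buf = ReplayBuffer(capacity=100)
for i in range(10_000):
    buf.add(i)
# Buffer stays at capacity while remaining a uniform sample of all 10,000
```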

Q44. How does label smoothing work and when do you use it?

y_smooth = (1 - ε) * y_hard + ε / K

where y_hard is the one-hot target, K is the number of classes, and ε is the smoothing factor.

Why it works: Prevents the model from becoming overconfident (logits don't grow unboundedly). Improves calibration. Acts as regularization.

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # PyTorch 1.10+

Used in: Image classification (ViT, EfficientNet training), NMT, LLM fine-tuning. Typically ε=0.1.
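The formula is worth being able to write out by hand. A quick NumPy check with an illustrative one-hot target:

```python
import numpy as np

def smooth_labels(y_hard, eps=0.1):
    """y_smooth = (1 - eps) * y_hard + eps / K, K = number of classes."""
    k = y_hard.shape[-1]
    return (1 - eps) * y_hard + eps / k

y = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot, K = 4
smooth_labels(y)  # → [0.025, 0.025, 0.925, 0.025]; still sums to 1
```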


Q45. What are the differences between GPT-style and BERT-style pre-training objectives?

| Aspect | BERT (MLM) | GPT (CLM) |
|---|---|---|
| Objective | Mask 15% of tokens, predict them | Predict next token (left-to-right) |
| Attention | Bidirectional | Causal (unidirectional) |
| Strength | Rich contextual embeddings | Natural generation |
| Downstream | Classification, QA, NER | Text generation, agents |
| 2026 usage | Embeddings, retrieval | LLM chat, code, reasoning |

BERT's MLM is actually a cloze task — corrupted input, full context visible. GPT's CLM is the standard LM objective. SpanBERT, RoBERTa, DeBERTa all build on BERT. GPT-4, LLaMA, Mistral, Gemini all use CLM.
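Mechanically, the difference between the two objectives comes down to the attention mask. A minimal NumPy illustration (4-token sequence for concreteness):

```python
import numpy as np

seq_len = 4
# GPT-style causal mask: position i may attend only to positions <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
# BERT-style bidirectional mask: every position attends to every position
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

# causal:                    bidirectional:
# [[1 0 0 0]                 [[1 1 1 1]
#  [1 1 0 0]                  [1 1 1 1]
#  [1 1 1 0]                  [1 1 1 1]
#  [1 1 1 1]]                 [1 1 1 1]]
```

In a transformer, masked-out positions get -inf added to their attention scores before the softmax, zeroing their contribution.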


Q46. Explain how Retrieval-Augmented Generation (RAG) works at a systems level.

RAG augments LLM generation with relevant documents retrieved at inference time:

1. Document chunking: split docs into 200-500 token chunks
2. Embedding: encode chunks with embedding model (e.g., text-embedding-3-large)
3. Index: store embeddings in vector DB (Pinecone, Qdrant, Weaviate, pgvector)
4. Query time:
   a. Embed query → retrieve top-k similar chunks via ANN search
   b. Rerank chunks (cross-encoder or reciprocal rank fusion)
   c. Stuff chunks into LLM context + query
   d. LLM generates answer grounded in retrieved context

Advanced RAG patterns in 2026:

  • HyDE: Generate a hypothetical answer, embed it, use it to retrieve
  • Hybrid search: Dense (embedding) + sparse (BM25) retrieval, fuse results
  • Agentic RAG: LLM decides when/what to retrieve iteratively
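The core retrieval step (4a) reduces to nearest-neighbor search over embeddings. A minimal in-memory sketch — random vectors stand in for real embeddings, and a brute-force np.argsort stands in for a production ANN index:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per chunk
    return np.argsort(-sims)[:k]       # highest similarity first

rng = np.random.default_rng(0)
chunks = rng.normal(size=(5, 8))            # 5 chunk embeddings, dim 8
query = chunks[3] + 0.01 * rng.normal(size=8)  # query very close to chunk 3
top_k_chunks(query, chunks)  # chunk 3 ranks first
```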

Q47. What is model pruning and quantization? How do they differ?

| Technique | What It Does | Typical Reduction |
|---|---|---|
| Pruning | Remove weights/neurons/heads below a magnitude threshold | 50-90% sparsity, minimal accuracy loss |
| Quantization | Reduce numerical precision (FP32 → INT8/INT4/FP8) | 2-8x memory, inference speedup |
| Knowledge Distillation | Train a smaller student model to match a teacher | Flexible size reduction |
# Quantization with bitsandbytes (standard for LLM inference in 2026)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',       # NF4 distribution for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3-70b',
                                               quantization_config=bnb_config)

GPTQ and AWQ are post-training quantization methods standard in 2026 for deployment. FP8 training is becoming standard on H100s.
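For the pruning side, unstructured magnitude pruning is simple enough to sketch directly. A NumPy illustration with toy weights (the function name is illustrative, not a library API):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.array([[0.5, -0.1],
              [0.02, -0.8]])
pruned, mask = magnitude_prune(w, sparsity=0.5)
# Keeps 0.5 and -0.8; zeroes the two smallest-magnitude weights
```

In PyTorch the same idea is available via torch.nn.utils.prune; in practice the mask is kept so pruned weights stay zero during subsequent fine-tuning.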


Q48. How do you detect and mitigate data leakage in ML pipelines?

Common sources:

  • Scaling/normalization fit on full dataset before splitting
  • Feature engineering using future data (e.g., rolling averages including future)
  • Target leakage: feature correlated with target due to data collection process
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# WRONG: Scaler fit on full data (leaks test statistics to train)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # DON'T DO THIS
X_train_scaled = X_scaled[:800]

# CORRECT: Scaler only sees training data
pipeline = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)  # scaler only fit on X_train

Q49. What is the difference between online learning and offline (batch) learning?

| Aspect | Offline (Batch) Learning | Online Learning |
|---|---|---|
| Training frequency | Once on the full dataset | Continuously on incoming data |
| Model updates | Retrain from scratch or fine-tune periodically | Update on each sample/mini-batch |
| Memory | Requires the full dataset | O(1) memory |
| Adapts to drift? | Only after retraining | Yes, continuously |
| Examples | Most deep learning | Vowpal Wabbit, river, production recommenders |

Concept drift is the main motivation for online learning in production. Monitor for it with distribution-shift tests (Kolmogorov–Smirnov test, Population Stability Index) and trigger retraining when drift exceeds a threshold.
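PSI is easy to implement and a common monitoring follow-up question. A sketch, assuming decile bins computed from the reference (training) distribution; the common rule of thumb is PSI < 0.1 stable, > 0.25 significant drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live distribution."""
    # Bin edges = quantiles of the reference distribution (deciles by default)
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_pct = np.bincount(np.digitize(expected, inner_edges), minlength=bins) / len(expected) + 1e-6
    a_pct = np.bincount(np.digitize(actual, inner_edges), minlength=bins) / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 10_000)
no_drift = psi(train_feature, rng.normal(0, 1, 10_000))    # near 0
drifted = psi(train_feature, rng.normal(0.5, 1, 10_000))   # flags drift
```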


Q50. Design a fraud detection system using ML. Walk through your approach end-to-end.

1. Problem framing:

  • Binary classification: transaction is fraud or not
  • Severe class imbalance (0.1% fraud rate)
  • Latency constraint: decision in <100ms
  • Data: transaction features, user history, device fingerprint, location

2. Feature engineering:

  • Statistical features: amount vs user's 30-day average, velocity (N transactions in last 1h)
  • Behavioral: device ID seen before? geo distance from last transaction?
  • Graph features: shared device/IP with known fraudsters
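A velocity feature like "N transactions in the last hour" can be computed very simply — the timestamps and window here are illustrative (in production this would read from the feature store):

```python
from datetime import datetime, timedelta

def txn_velocity(timestamps, now, window=timedelta(hours=1)):
    """Count of transactions in the trailing window — a classic fraud signal."""
    return sum(1 for t in timestamps if now - window <= t <= now)

now = datetime(2026, 1, 1, 12, 0)
history = [now - timedelta(minutes=m) for m in (5, 20, 50, 200)]
txn_velocity(history, now)  # → 3 (the 200-minute-old one falls outside)
```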

3. Model choices:

  • Tier 1 (fast): LightGBM with tuned threshold — <10ms inference
  • Tier 2 (deep): Temporal graph neural network for ring fraud — run async
  • Ensemble both with calibrated probabilities

4. Handling imbalance:

  • Focal loss or class_weight='balanced'
  • Threshold at 0.3 (high recall) with a manual review queue for scores 0.3–0.8
  • Train on full data but stratify splits
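Focal loss for the binary case can be sketched in a few lines (γ=2 and α=0.25 are the common defaults from the original focal loss paper; the values below are illustrative):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples so the rare
    fraud cases dominate the gradient."""
    pt = p if y == 1 else 1 - p          # probability of the true class
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * math.log(pt)

easy_negative = focal_loss(0.01, 0)  # confident, correct: tiny loss
hard_positive = focal_loss(0.10, 1)  # missed fraud: large loss
```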

5. Production pipeline:

Transaction → Feature Store (Redis/Feast) → LightGBM → Score
                                              ↘ score > 0.8: block
                                              ↘ score 0.3–0.8: queue for manual review

6. Monitoring: Track fraud rate, model score distribution, feature drift daily. Retrain weekly with labeled confirmed fraud cases.


Frequently Asked Questions (FAQ)

Q: How is AI/ML interviewing different in 2026 vs 2023? A: Night and day difference. 2026 interviews heavily emphasize LLM internals (transformers, attention, fine-tuning), MLOps (deployment, monitoring, drift), and the practical tradeoffs of modern foundation models. Pure theoretical statistics questions are fading; system-level ML design is the new differentiator between mid and senior offers.

Q: Should I learn PyTorch or TensorFlow? A: PyTorch. Full stop. It dominates research and is standard at Google (JAX/PyTorch), Meta (PyTorch), Microsoft (PyTorch), and most startups. TensorFlow 2.x and JAX are also used, but if you only learn one, make it PyTorch.

Q: What is the most important paper to read before FAANG ML interviews? A: "Attention Is All You Need" (Transformer), "BERT", "GPT-3", "LoRA", "Flash Attention", and "LLaMA 3" technical reports cover most interview topics in 2026.

Q: How do companies evaluate ML coding in interviews? A: Expect: implement backprop from scratch, write a k-means from scratch, implement cross-entropy loss, debug a training loop with a subtle bug (e.g., gradient not zeroed, wrong loss reduction). Practice on LeetCode ML section and Kaggle notebooks.

Q: What ML system design problems are most commonly asked? A: Design a recommendation system (Netflix/YouTube), design a search ranking system (Google), design a fraud detection system, design a real-time ML feature store.

Q: How important is statistics for ML interviews in 2026? A: Still important for data science roles (A/B testing, hypothesis testing, Bayesian inference). For pure ML engineering roles, it's less emphasized but p-values, confidence intervals, and experimental design will come up.

Q: What math should I review before ML interviews? A: Linear algebra (matrix operations, eigendecomposition, SVD), probability (Bayes, distributions, MLE), calculus (chain rule, partial derivatives), and basic information theory (entropy, KL divergence).

Q: What is the fastest way to practice ML for interviews? A: Here's the proven 4-week sprint: Week 1 — Implement core algorithms from scratch (numpy only: linear regression, logistic regression, k-means, decision tree). Week 2 — Study Hugging Face/PyTorch codebases and implement attention from scratch. Week 3 — Do 2-3 Kaggle competitions to understand real data messiness. Week 4 — Practice ML system design at ml-systems-designs.github.io. This sequence has helped hundreds of engineers crack FAANG ML rounds.

