Deep Learning Interview Questions 2026: 30 Answers with Code

What changed in 2026 drives
Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- Two solid coding-round answers (one medium-hard DSA problem each, with edge-case discussion) beat five half-baked ones
- One real project you can defend end-to-end: file paths, design decisions, and what you would change
- One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Deep learning is no longer a niche specialization. In 2026, every ML interview at a product company includes at least five deep learning questions. Understanding how neural networks actually work, including the math behind backpropagation, the engineering behind transformers, and the tradeoffs in production deployment, separates candidates who clear rounds from those who don't. This guide covers 30 essential deep learning questions with complete PyTorch code.
PapersAdda's take: Memorizing that "batch normalization normalizes activations" is not enough. You need to know what goes wrong without it, why layer norm is preferred in transformers, and how to debug a training loop that is not converging. That depth is what this guide delivers. Candidates report that FAANG deep learning rounds always include a "debug this training loop" segment. According to candidate accounts from public preparation resources, transformer internals (attention math, positional encoding) appear in nearly every senior ML round at Google and Meta. Confirm the specific interview format on the official careers portal of your target company.
Related articles: AI/ML Interview Questions 2026 | Machine Learning Interview Questions 2026 | NLP Interview Questions 2026 | Computer Vision Interview Questions 2026 | PyTorch Interview Questions 2026 | MLOps Interview Questions 2026
Which Companies Ask These Questions?
| Topic Cluster | Companies |
|---|---|
| Backpropagation and Gradients | Google, Meta, Microsoft, Nvidia |
| CNN Architecture | Google DeepMind, Meta AI, Samsung |
| RNN and LSTM | Amazon Alexa, Microsoft Azure AI |
| Transformers and Attention | All frontier AI labs, OpenAI, Cohere |
| Training Tricks | Every company with an ML team |
| Model Compression | Mobile-first teams, Qualcomm, Apple |
| Distributed Training | Databricks, Nvidia, hyperscalers |
EASY: Fundamentals (Questions 1-10)
Q1. What is a neural network? Explain forward propagation.
Layer output: a^{l} = f(W^{l} * a^{l-1} + b^{l})
Forward propagation: Pass input through each layer in sequence, compute activations, and produce a final output.
import torch
import torch.nn as nn
class FeedForwardNet(nn.Module):
def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
super().__init__()
layers = []
prev_dim = input_dim
for dim in hidden_dims:
layers.extend([
nn.Linear(prev_dim, dim),
nn.LayerNorm(dim),
nn.GELU(),
nn.Dropout(dropout)
])
prev_dim = dim
layers.append(nn.Linear(prev_dim, output_dim))
self.net = nn.Sequential(*layers)
def forward(self, x):
return self.net(x)
model = FeedForwardNet(784, [512, 256, 128], 10)
x = torch.randn(32, 784) # batch of 32, 784 features
logits = model(x) # forward pass
print(logits.shape) # [32, 10]
Q2. Explain backpropagation with the chain rule.
Simple example:
import torch
# Computation graph: x -> a=Wx+b -> h=relu(a) -> y_hat=Wh -> loss=(y_hat-y)^2
x = torch.tensor([1.0, 2.0], requires_grad=False)
W1 = torch.tensor([[0.5, 0.3], [0.2, 0.4]], requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)
W2 = torch.tensor([[0.6, 0.7]], requires_grad=True)
y = torch.tensor([1.0])
a = W1 @ x + b1 # linear
h = torch.relu(a) # activation
y_hat = W2 @ h # output
loss = (y_hat - y).pow(2) # MSE loss
loss.backward() # backprop: PyTorch auto-differentiates
print("dL/dW1:", W1.grad) # chain rule: dL/dy_hat * dy_hat/dh * dh/da * da/dW1
print("dL/dW2:", W2.grad)
Chain rule spelled out:
dL/dW1 = (dL/dy_hat) * (dy_hat/dh) * (dh/da) * (da/dW1)
= 2*(y_hat-y) * W2 * relu'(a) * x
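To make the chain rule concrete, the same gradients can be worked out by hand. The sketch below reuses the numbers from the example above in plain Python (no autograd) and checks one entry against a finite-difference estimate. The helper name `forward` and the step size `eps` are illustrative choices, not part of the original example.

```python
# Hand-derived gradients for the two-layer example, in plain Python.
x = [1.0, 2.0]
W1 = [[0.5, 0.3], [0.2, 0.4]]
b1 = [0.0, 0.0]
W2 = [0.6, 0.7]
y = 1.0

def forward(W1, W2):
    """Return pre-activations a, activations h, prediction y_hat, and loss."""
    a = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, ai) for ai in a]                      # ReLU
    y_hat = sum(w * hi for w, hi in zip(W2, h))
    return a, h, y_hat, (y_hat - y) ** 2

a, h, y_hat, loss = forward(W1, W2)

# Chain rule, one factor at a time
dL_dyhat = 2.0 * (y_hat - y)                            # dL/dy_hat
dL_dW2 = [dL_dyhat * hi for hi in h]                    # dy_hat/dW2 = h
dL_dh = [dL_dyhat * w for w in W2]                      # dy_hat/dh = W2
dL_da = [g * (1.0 if ai > 0 else 0.0) for g, ai in zip(dL_dh, a)]  # relu'(a)
dL_dW1 = [[g * xi for xi in x] for g in dL_da]          # da/dW1 = x (outer product)

# Finite-difference check on a single weight, W1[0][1]
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
numeric = (forward(W1_plus, W2)[3] - loss) / eps

print("dL/dW1:", dL_dW1)
print("dL/dW2:", dL_dW2)
print("finite-difference dL/dW1[0][1]:", numeric)
```

The analytic and numeric values for `W1[0][1]` should agree to several decimal places. The same check is a quick way to debug a custom backward pass.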
Q3. What are activation functions? Compare ReLU, GELU, and SwiGLU.
| Activation | Formula | Range | Used In | Key Property |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^-x) | (0,1) | Output layer | Vanishing gradients |
| Tanh | (e^x-e^-x)/(e^x+e^-x) | (-1,1) | RNN gates | Better than sigmoid |
| ReLU | max(0,x) | [0,inf) | CNNs, MLPs | Fast, dying ReLU risk |
| Leaky ReLU | max(0.01x, x) | (-inf,inf) | CNNs | Fixes dying ReLU |
| GELU | x*Phi(x) | smooth | BERT, GPT | Smooth, better than ReLU |
| SwiGLU | x*sigmoid(beta*x) * linear gate | smooth | LLaMA, Mistral | Gated, best in LLMs |
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(100)
relu = F.relu(x)
gelu = F.gelu(x)
silu = F.silu(x) # SiLU, also called Swish: x * sigmoid(x)
# SwiGLU: requires splitting the projection in two (gate mechanism)
class SwiGLU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim * 2) # doubled projection
    def forward(self, x):
        x_proj, gate = self.proj(x).chunk(2, dim=-1)
        return x_proj * F.silu(gate)
Q4. What is weight initialization and why does it matter?
| Initialization | Best For | Formula |
|---|---|---|
| Xavier / Glorot | Sigmoid, Tanh | W ~ Uniform(-sqrt(6/(n_in+n_out)), ...) |
| He / Kaiming | ReLU and variants | W ~ Normal(0, sqrt(2/n_in)) |
| Zero initialization | Never for weights | Symmetry breaking fails |
| Small random | Deprecated for deep nets | Too slow convergence |
import torch.nn as nn
conv = nn.Conv2d(64, 128, 3)
linear = nn.Linear(512, 256)
# He initialization for ReLU
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
nn.init.zeros_(conv.bias)
# Xavier for transformers
nn.init.xavier_uniform_(linear.weight)
nn.init.zeros_(linear.bias)
# PyTorch 2.x: most modules auto-initialize correctly
# But in custom models, always initialize explicitly
Q5. What is batch normalization? How is it different from layer normalization?
Batch Normalization (BatchNorm):
- Normalizes across the batch dimension for each feature
- Statistics: mean and variance computed per feature across all samples in the batch
- Requires large batch size to work well; requires separate treatment at inference (running statistics)
- Used in: CNNs, MLPs
Layer Normalization (LayerNorm):
- Normalizes across the feature dimension for each sample independently
- Statistics: mean and variance computed per sample across all features
- Works with batch size 1; identical behavior at train and inference
- Used in: Transformers, RNNs, any variable-length or small-batch scenario
import torch
import torch.nn as nn
batch_norm = nn.BatchNorm2d(64) # for CNN: normalizes across N,H,W per channel
layer_norm = nn.LayerNorm(768) # for transformer: normalizes across 768 features per token
rms_norm = nn.RMSNorm(768) # variant: no mean subtraction; used in LLaMA
# What BatchNorm does:
# x_hat = (x - mean_per_channel) / std_per_channel
# out = gamma * x_hat + beta (learnable scale/shift)
# At inference: use running_mean and running_var (computed during training via EMA)
# At training: use batch statistics
batch_norm.eval() # switches from batch stats to running stats
Q6. What is dropout and how does it prevent overfitting?
- Prevents co-adaptation: neurons can't rely on specific other neurons always being present
- Acts like training an ensemble of 2^n thinned networks
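- At inference: all neurons are active. The original formulation scales outputs by (1-p) at inference to match the expected training output; PyTorch's nn.Dropout uses inverted dropout instead, scaling surviving activations by 1/(1-p) during training so inference needs no rescaling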
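A minimal sketch of the inverted-dropout variant (the one `nn.Dropout` implements) in plain Python shows why the expected activation is preserved. The function name `inverted_dropout` and the fixed random seed are illustrative assumptions.

```python
import random

def inverted_dropout(values, p, training, rng):
    """Zero each value with probability p; scale survivors by 1/(1-p) in training."""
    if not training or p == 0.0:
        return list(values)                  # inference: identity, no rescaling
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
acts = [1.0] * 100_000

train_out = inverted_dropout(acts, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(acts, p=0.5, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)
print("mean activation in training:", train_mean)   # close to 1.0
print("distinct training values:", sorted(set(train_out)))
```

Each surviving activation is doubled when p=0.5, so the mean stays near the input mean and the layers that follow see the same scale in training and inference.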
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
def __init__(self, dim, dropout=0.1):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, dim * 4),
nn.GELU(),
nn.Dropout(dropout), # dropout after activation
nn.Linear(dim * 4, dim),
nn.Dropout(dropout) # dropout before residual add
)
self.norm = nn.LayerNorm(dim)
def forward(self, x):
return x + self.net(self.norm(x)) # pre-norm residual
# Concrete effect
model = nn.Linear(100, 50)
dropout = nn.Dropout(p=0.5)
x = torch.randn(32, 100)
dropout.train() # dropout active: ~50% of activations zeroed, survivors scaled by 1/(1-p) = 2
out_train = dropout(model(x))
dropout.eval() # dropout becomes the identity
out_eval = dropout(model(x)) # same values as model(x)
Q7. How do you prevent exploding gradients?
| Technique | Description | Where Used |
|---|---|---|
| Gradient clipping | Rescale gradients if norm exceeds threshold | RNNs, LSTMs, LLM training |
| Weight initialization | He/Xavier init prevents early-stage explosion | All networks |
| Batch/Layer normalization | Keeps activations in stable range | CNNs, transformers |
| Residual connections | Gradients flow directly through skip path | ResNet, transformers |
| Smaller learning rate | Reduces step size | General |
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.LSTM(512, 512, num_layers=4, batch_first=True)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Training loop with gradient clipping
for batch in dataloader:
optimizer.zero_grad()
out, _ = model(batch['x'])
loss = criterion(out, batch['y'])
loss.backward()
# Clip gradient norm to 1.0 (standard for RNN/LLM training)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
Q8. What is the difference between SGD, Adam, and AdamW? Which do you use in 2026?
| Optimizer | Update Rule | Memory | Best For |
|---|---|---|---|
| SGD | w -= lr * grad | O(params) | CNNs with carefully tuned LR + momentum |
| SGD + Momentum | Adds velocity term | O(params) | Better convergence than vanilla SGD |
| Adam | Adaptive per-param LR from 1st and 2nd moments | O(3*params) | Default for most tasks |
| AdamW | Adam with decoupled weight decay | O(3*params) | Transformers, LLM fine-tuning (standard in 2026) |
| Sophia | Approximate second-order (diagonal Hessian estimate) | O(3*params) | Emerging for LLM pre-training |
import torch.optim as optim
# AdamW with cosine LR schedule -- standard for transformers in 2026
optimizer = optim.AdamW(
model.parameters(),
lr=3e-4,
betas=(0.9, 0.999), # momentum coefficients
eps=1e-8,
weight_decay=0.01 # decoupled weight decay (applied directly to the weights, not added to the loss as an L2 term)
)
# Cosine annealing with warmup
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=500,
num_training_steps=10000
)
Q9. What is learning rate scheduling? What schedules do you use for different architectures?
| Schedule | Description | Best For |
|---|---|---|
| Step decay | LR *= gamma every N steps | CNNs, simple MLPs |
| Cosine annealing | LR follows cosine curve from LR_max to LR_min | Transformers, general |
| Warmup + cosine | Linear warmup then cosine decay | LLM training (standard) |
| Cyclic LR | Oscillate LR between bounds | Finding optimal LR |
| One-cycle policy | LR climbs then falls; one cycle total | Fast training (fast.ai) |
| ReduceLROnPlateau | Reduce LR when val metric stops improving | When you don't know train steps |
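The "warmup + cosine" row can be written as a single function of the step count. This is a minimal sketch in plain Python. The name `warmup_cosine_lr` and the convention that warmup reaches `max_lr` on the last warmup step are assumptions, and library schedulers may differ by one step at the boundaries.

```python
import math

def warmup_cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)            # clamp if training runs past total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 249, 499, 500, 5250, 10000):
    print(s, warmup_cosine_lr(s, max_lr=3e-4, warmup_steps=500, total_steps=10000))
```

The learning rate peaks at the end of warmup, passes through half of `max_lr` at the midpoint of the decay phase, and reaches `min_lr` at `total_steps`.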
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
# Cosine annealing (most common for CNNs)
scheduler_cosine = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
# One-cycle (best for fast training)
scheduler_onecycle = OneCycleLR(
optimizer, max_lr=0.1,
steps_per_epoch=len(dataloader),
epochs=50,
pct_start=0.3 # warmup takes 30% of steps
)
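# In training loop (attach one scheduler per optimizer; both are shown here only for comparison)
for epoch in range(n_epochs):
    for batch in dataloader:
        # ... train step ...
        scheduler_onecycle.step() # OneCycleLR: step after every batch
    # scheduler_cosine.step() # CosineAnnealingLR with T_max in epochs: step once per epoch instead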
Q10. What is gradient checkpointing and when is it used?
- Memory savings: Reduces activation memory from O(n_layers) to O(sqrt(n_layers))
- Compute cost: ~33% extra FLOPs (one extra forward pass per layer)
- When to use: When training large models and memory is the bottleneck (common when fine-tuning LLMs on a single GPU)
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
class TransformerLayer(nn.Module):
def __init__(self, dim, heads):
super().__init__()
self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
self.ffn = nn.Sequential(
nn.LayerNorm(dim),
nn.Linear(dim, dim * 4),
nn.GELU(),
nn.Linear(dim * 4, dim)
)
def forward(self, x):
# gradient_checkpoint: recompute this during backward instead of storing
attn_out = checkpoint(self.attn, x, x, x, use_reentrant=False)[0]
return x + self.ffn(attn_out)
# For HuggingFace models: one line
model.gradient_checkpointing_enable()
MEDIUM: Architectures (Questions 11-22)
Q11. Explain the ResNet architecture and why residual connections matter.
Residual block:
output = F(x, {W_i}) + x
Instead of learning the mapping H(x), the network learns the residual F(x) = H(x) - x.
import torch.nn as nn
class ResidualBlock(nn.Module):
def __init__(self, in_channels, out_channels, stride=1):
super().__init__()
self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
stride=stride, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(out_channels)
self.downsample = None
if stride != 1 or in_channels != out_channels:
self.downsample = nn.Sequential(
nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
nn.BatchNorm2d(out_channels)
)
def forward(self, x):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
if self.downsample:
identity = self.downsample(x)
return self.relu(out + identity) # skip connection
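Why it works: Gradients can flow directly through the skip connection path, bypassing any layer-specific transformation. Because a block can approximate the identity by driving F(x) toward zero, adding blocks should not, in principle, make the network harder to fit than its shallower counterpart.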
Q12. What is self-attention and how does it compute relationships between tokens?
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
For each token:
- Compute a Query (what am I looking for?), Key (what do I contain?), Value (what do I return?)
- Score with all other tokens via dot product
- Softmax to get attention weights (sum to 1)
- Weighted sum of Values
import torch
import torch.nn.functional as F
def self_attention(x, W_q, W_k, W_v, mask=None):
"""x: [B, T, D], W_*: [D, d_k]"""
Q = x @ W_q # [B, T, d_k]
K = x @ W_k # [B, T, d_k]
V = x @ W_v # [B, T, d_v]
d_k = Q.shape[-1]
scores = Q @ K.transpose(-2, -1) / d_k**0.5 # [B, T, T]
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1) # [B, T, T]
return weights @ V # [B, T, d_v]
Causal (autoregressive) attention: Mask the upper triangle to prevent each token from attending to future tokens. Used in GPT-style models.
Q13. What is the difference between RNNs, LSTMs, and Transformers for sequence modeling?
| Aspect | RNN | LSTM | Transformer |
|---|---|---|---|
| Long-range dependencies | Poor (vanishing grad) | Better (cell state) | Excellent (direct attention) |
| Parallelizable (training) | No (sequential) | No (sequential) | Yes (full parallelism) |
| Memory vs. sequence length n | O(1) state per step | O(1) state per step | O(n^2) attention matrix (O(n) with FlashAttention) |
| 2026 status | Deprecated for NLP | Legacy use (edge devices) | Dominant for all NLP/LLM |
| Context window | Effectively ~100 tokens | ~500 tokens | Up to 1M tokens (with FlashAttn) |
import torch.nn as nn
# LSTM (still used in some production systems for low-latency inference)
lstm = nn.LSTM(input_size=256, hidden_size=512,
num_layers=2, batch_first=True,
dropout=0.1, bidirectional=True)
# Transformer encoder layer (modern approach)
encoder_layer = nn.TransformerEncoderLayer(
d_model=512, nhead=8,
dim_feedforward=2048,
dropout=0.1,
activation='gelu',
batch_first=True,
norm_first=True # pre-norm (better convergence in 2026)
)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
Q14. How does a convolutional neural network process an image? Explain stride, padding, and receptive field.
- Convolution: Slide filter over image, compute dot product at each position
- Stride: How many pixels to skip between filter positions. Stride=2 halves spatial dimensions
- Padding: Adds zeros around the border. `same` padding keeps the spatial size; `valid` padding shrinks it
- Receptive field: The region of the input that influences one output neuron. Grows with depth
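Both quantities can be computed with a few lines of arithmetic. The sketch below assumes square inputs and kernels, no dilation, and the usual output-size formula floor((H + 2*padding - kernel) / stride) + 1. The function names are illustrative.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def receptive_field(layers):
    """Receptive field of one output unit.

    layers: list of (kernel, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(conv_out_size(32, kernel=3, stride=1, padding=1))   # "same" padding
print(conv_out_size(32, kernel=3, stride=2, padding=0))   # "valid" padding, stride 2
print(receptive_field([(3, 1), (3, 2), (3, 2)]))          # three stacked 3x3 convs
```

Strided layers multiply the `jump`, which is why receptive field grows faster in networks that downsample early.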
import torch
import torch.nn as nn
# Compute output size: floor((H + 2*padding - kernel) / stride) + 1
# Input: 32x32, kernel=3, padding=1, stride=1 -> output: 32x32 (same)
# Input: 32x32, kernel=3, padding=0, stride=2 -> output: 15x15
model = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, padding=1, stride=1), # 3->32, 32x32
nn.BatchNorm2d(32), nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=3, padding=1, stride=2), # 32->64, 16x16
nn.BatchNorm2d(64), nn.ReLU(),
nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2),# 64->128, 8x8
nn.BatchNorm2d(128), nn.ReLU(),
nn.AdaptiveAvgPool2d((1,1)), # global average pooling -> 1x1
nn.Flatten(),
nn.Linear(128, 10)
)
x = torch.randn(4, 3, 32, 32)
print(model(x).shape) # [4, 10]
Q15. What is transfer learning for deep learning? When do you fine-tune vs use as feature extractor?
| Strategy | Description | Use When |
|---|---|---|
| Feature extraction | Freeze all pre-trained layers; train only head | Target dataset very small (<1K samples); similar domain |
| Fine-tune top layers | Freeze bottom layers; fine-tune top N layers + head | Medium dataset; similar domain |
| Full fine-tuning | Unfreeze all layers; train with small LR | Large dataset; or different domain |
| LoRA/QLoRA | Freeze base, add low-rank adapters | LLM fine-tuning (standard in 2026) |
import torch
import torchvision.models as models
import torch.nn as nn
# Feature extraction
backbone = models.efficientnet_b0(weights='IMAGENET1K_V1')
for param in backbone.parameters():
param.requires_grad = False
# Replace head
n_features = backbone.classifier[1].in_features
backbone.classifier = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(n_features, 5) # 5-class problem
)
# Only classifier parameters have requires_grad=True
# Fine-tune last 2 blocks + head
for name, param in backbone.named_parameters():
if 'features.7' in name or 'features.8' in name or 'classifier' in name:
param.requires_grad = True
optimizer = torch.optim.AdamW([
{'params': backbone.features.parameters(), 'lr': 1e-5}, # low LR for backbone
{'params': backbone.classifier.parameters(), 'lr': 1e-3} # high LR for head
])
Q16. How does an autoencoder work? What is a VAE?
Autoencoder: Encoder compresses input to latent vector z; Decoder reconstructs input from z. Trained to minimize reconstruction error. Forces the network to learn a compressed representation.
Variational Autoencoder (VAE): Encoder outputs a distribution N(mu, sigma^2) over z, not a point. z is sampled from this distribution. Loss = reconstruction error + KL divergence from prior N(0,I). Enables generation of new data by sampling from the prior.
import torch
import torch.nn as nn
import torch.nn.functional as F
class VAE(nn.Module):
def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
super().__init__()
# Encoder
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc_mu = nn.Linear(hidden_dim, latent_dim)
self.fc_var = nn.Linear(hidden_dim, latent_dim)
# Decoder
self.fc3 = nn.Linear(latent_dim, hidden_dim)
self.fc4 = nn.Linear(hidden_dim, input_dim)
def encode(self, x):
h = F.relu(self.fc1(x))
return self.fc_mu(h), self.fc_var(h)
def reparameterize(self, mu, logvar):
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
return mu + eps * std # differentiable sampling
def decode(self, z):
return torch.sigmoid(self.fc4(F.relu(self.fc3(z))))
def forward(self, x):
mu, logvar = self.encode(x.view(-1, 784))
z = self.reparameterize(mu, logvar)
recon = self.decode(z)
return recon, mu, logvar
def vae_loss(recon_x, x, mu, logvar, beta=1.0):
bce = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return bce + beta * kl
Q17. What is a GAN? How does training work and what are the common failure modes?
- Generator G: Takes random noise z, outputs fake samples
- Discriminator D: Takes a sample (real or fake), outputs P(real)
Objective (minimax):
min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]
# Training loop sketch
for real_batch in dataloader:
# Train Discriminator
D_optimizer.zero_grad()
real_output = discriminator(real_batch)
d_real_loss = F.binary_cross_entropy(real_output, torch.ones_like(real_output))
z = torch.randn(batch_size, latent_dim)
fake = generator(z).detach()
fake_output = discriminator(fake)
d_fake_loss = F.binary_cross_entropy(fake_output, torch.zeros_like(fake_output))
(d_real_loss + d_fake_loss).backward()
D_optimizer.step()
# Train Generator
G_optimizer.zero_grad()
z = torch.randn(batch_size, latent_dim)
fake_output = discriminator(generator(z))
g_loss = F.binary_cross_entropy(fake_output, torch.ones_like(fake_output))
g_loss.backward()
G_optimizer.step()
Failure modes:
| Mode | Symptom | Fix |
|---|---|---|
| Mode collapse | Generator produces only one or few samples | Minibatch discrimination, unrolled GANs |
| Training instability | Loss oscillates wildly | Gradient penalty (WGAN-GP), spectral norm |
| Discriminator wins too fast | Generator receives zero gradients | Balance update frequency |
2026 status: Diffusion models have largely replaced GANs for image generation (Stable Diffusion, DALL-E, Midjourney). GANs still appear in video and real-time generation.
Q18. What is the transformer attention complexity? How does FlashAttention solve the memory problem?
Standard attention complexity:
- Time: O(n^2 * d)
- Memory: O(n^2) for the attention matrix
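For n=4096 in FP16, one attention matrix takes 4096^2 * 2 bytes = 32 MiB per head (the head dimension d=64 does not affect this size). With 32 heads, that is 1 GiB per layer for a single sequence. In a 96-layer model, keeping these matrices for the backward pass would take roughly 96 GiB per sequence, more than most single GPUs hold.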
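The arithmetic is easy to reproduce and to rerun for other sequence lengths. This sketch counts only the n x n score matrices, assuming 2-byte elements and one sequence per batch. The function name is illustrative.

```python
def attention_matrix_bytes(seq_len, n_heads=1, n_layers=1, bytes_per_elem=2, batch=1):
    """Bytes needed to materialize the full attention score matrices."""
    return batch * n_layers * n_heads * seq_len * seq_len * bytes_per_elem

MiB, GiB = 1024 ** 2, 1024 ** 3

per_head = attention_matrix_bytes(4096) / MiB
per_layer = attention_matrix_bytes(4096, n_heads=32) / GiB
full_model = attention_matrix_bytes(4096, n_heads=32, n_layers=96) / GiB

print(f"per head:  {per_head} MiB")
print(f"per layer: {per_layer} GiB")
print(f"96 layers: {full_model} GiB")
```

Doubling the sequence length quadruples every figure, which is the practical meaning of O(n^2) memory.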
FlashAttention (Dao et al. 2022):
- Avoids materializing the full n x n attention matrix in HBM (GPU main memory)
- Uses tiling: processes blocks of Q, K, V that fit in SRAM (fast on-chip memory)
- Uses online softmax computation: maintains running max and sum to compute softmax exactly
- Result: Same mathematical output, O(n) memory instead of O(n^2)
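The online-softmax idea can be illustrated for a single query in plain Python. This is a sketch of the numerical trick only, not of the GPU kernel: scores and values arrive in blocks, and a running max, running normalizer, and running weighted sum are rescaled whenever the max changes. The function names and the block size are illustrative.

```python
import math

def online_softmax_weighted_sum(scores, values, block_size=4):
    """Compute sum_i softmax(scores)_i * values_i one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0        # softmax normalizer, relative to running_max
    acc = 0.0                # weighted sum of values, relative to running_max
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        scale = math.exp(running_max - new_max)   # 0.0 on the first block
        running_sum *= scale                      # rescale old statistics
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum

def reference(scores, values):
    """Standard softmax over the full score vector."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [0.5, -1.2, 3.0, 0.1, 2.2, -0.7, 1.5, 0.0, 4.1]
values = [float(i) for i in range(len(scores))]
blocked = online_softmax_weighted_sum(scores, values)
full = reference(scores, values)
print(blocked, full)
```

The blocked result matches the full computation up to floating-point rounding, which is why FlashAttention is exact rather than an approximation.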
import torch
# PyTorch 2.x can dispatch scaled_dot_product_attention to a FlashAttention kernel
from torch.nn.attention import SDPBackend, sdpa_kernel # PyTorch 2.3+; replaces the deprecated torch.backends.cuda.sdp_kernel
Q = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda') # [B, heads, T, d_k]
K = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')
V = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')
# FlashAttention is eligible on CUDA with float16/bfloat16 inputs.
# Restricting the backend makes the call raise an error instead of silently falling back.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(Q, K, V, is_causal=True)
Q19. What is mixed precision training? How do bfloat16 and float16 differ?
| Format | Bits | Range | Precision | Best For |
|---|---|---|---|---|
| FP32 | 32 | Large | High | Reference, optimizer states |
| FP16 | 16 | 65504 max | Medium | Inference, some training |
| BF16 | 16 | Same as FP32 | Lower mantissa | LLM training (A100, H100) |
BF16 vs FP16: BF16 has the same exponent range as FP32 (8 bits) so overflow/underflow is rare. FP16 has a narrow range (5-bit exponent), requiring loss scaling.
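The range and precision gap follows directly from the bit layouts (FP16: 5 exponent bits, 10 mantissa bits; BF16: 8 and 7; FP32: 8 and 23). The sketch below derives the largest finite value and the spacing between 1.0 and the next representable number from those counts. The function names are illustrative.

```python
def max_finite(exp_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias     # the all-ones exponent is reserved for inf/NaN
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** max_exp

def machine_epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits

fp16_max = max_finite(5, 10)
bf16_max = max_finite(8, 7)
fp32_max = max_finite(8, 23)

print("FP16 max:", fp16_max, "eps:", machine_epsilon(10))
print("BF16 max:", bf16_max, "eps:", machine_epsilon(7))
print("FP32 max:", fp32_max, "eps:", machine_epsilon(23))
```

BF16 reaches almost the same maximum as FP32 but resolves only about 2 to 3 decimal digits, while FP16 resolves more digits and overflows above 65504. That trade-off is why FP16 needs loss scaling and BF16 usually does not.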
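import torch
model = model.to('cuda')
scaler = torch.amp.GradScaler('cuda') # only needed for FP16; not needed for BF16 (torch.amp.GradScaler: PyTorch 2.3+)
# Option A: FP16 with loss scaling
for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        output = model(batch['x'])
        loss = criterion(output, batch['y'])
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer) # unscale first so clipping sees the true gradient norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
# Option B: BF16 (cleaner, no loss scaling)
for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        output = model(batch['x'])
        loss = criterion(output, batch['y'])
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()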
Q20. What is LoRA? How does it reduce parameters for fine-tuning?
W' = W + delta_W = W + B * A
where W in R^(d x k), B in R^(d x r), A in R^(r x k), r << min(d, k)
Parameter savings: A 4096 x 4096 weight matrix has 16.7M parameters. With rank 16, B and A together have 4096*16 + 16*4096 = 131K parameters, a reduction of 128x.
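The count is easy to check for any layer shape and rank. A minimal sketch, with an illustrative function name:

```python
def lora_param_counts(d, k, r):
    """Parameters in a full d x k weight vs. its rank-r LoRA factors B (d x r) and A (r x k)."""
    full = d * k
    lora = d * r + r * k
    return full, lora, full / lora

full, lora, ratio = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {ratio:.0f}x")

# The reduction shrinks as the rank grows
for r in (4, 16, 64):
    print(r, lora_param_counts(4096, 4096, r)[2])
```

For a square d x d matrix the reduction factor simplifies to d / (2r), so halving the rank doubles the savings.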
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B',
                                             torch_dtype=torch.bfloat16,
                                             device_map='auto')
lora_config = LoraConfig(
    r=16, # rank
    lora_alpha=32, # scaling: the update B*A is multiplied by lora_alpha / r
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~41M || all params: ~8B || trainable%: ~0.5%
Q21. Explain the concept of knowledge distillation with a PyTorch implementation.
import torch.nn.functional as F
import torch
def distillation_loss(student_logits, teacher_logits, true_labels,
T=4.0, alpha=0.7):
"""
T: temperature (higher = softer distribution, more info transfer)
alpha: weight for soft target loss (1-alpha for hard label loss)
"""
# Soft target loss (KL divergence)
soft_targets = F.softmax(teacher_logits / T, dim=-1)
soft_student = F.log_softmax(student_logits / T, dim=-1)
kd_loss = F.kl_div(soft_student, soft_targets, reduction='batchmean') * (T ** 2)
# Hard label loss
ce_loss = F.cross_entropy(student_logits, true_labels)
return alpha * kd_loss + (1 - alpha) * ce_loss
# Training loop with teacher-student
teacher.eval()
for batch_x, batch_y in dataloader:
with torch.no_grad():
teacher_logits = teacher(batch_x) # teacher inference, no grad
student_logits = student(batch_x)
loss = distillation_loss(student_logits, teacher_logits, batch_y, T=4.0, alpha=0.7)
loss.backward()
optimizer.step()
optimizer.zero_grad()
Q22. What is multi-task learning and when does it improve performance?
Benefits:
- Prevents overfitting via auxiliary tasks
- Fewer total parameters than N separate models
- Task A can provide useful gradient signal for task B
import torch.nn as nn
import torch.nn.functional as F
class MultiTaskModel(nn.Module):
def __init__(self, d_model=512, n_classes_task1=3, n_classes_task2=5):
super().__init__()
# Shared backbone
self.backbone = nn.Sequential(
nn.Linear(100, d_model),
nn.LayerNorm(d_model),
nn.GELU(),
nn.Linear(d_model, d_model)
)
# Task-specific heads
self.head_task1 = nn.Linear(d_model, n_classes_task1)
self.head_task2 = nn.Linear(d_model, n_classes_task2)
def forward(self, x):
shared = self.backbone(x)
return self.head_task1(shared), self.head_task2(shared)
def mtl_loss(logits1, logits2, y1, y2, lambda1=1.0, lambda2=0.5):
loss1 = F.cross_entropy(logits1, y1)
loss2 = F.cross_entropy(logits2, y2)
return lambda1 * loss1 + lambda2 * loss2
Works best when tasks are related (e.g., named entity recognition + part-of-speech tagging, or classification + an auxiliary self-supervised task). Divergent tasks can hurt each other (negative transfer).
HARD: Advanced Topics (Questions 23-30)
Q23. What is quantization? Compare PTQ and QAT.
| Method | Full Name | How | Quality | Speed |
|---|---|---|---|---|
| PTQ | Post-Training Quantization | Quantize after training; calibrate on small dataset | Lower | Fastest |
| QAT | Quantization-Aware Training | Simulate quantization during training; fine-tune | Higher | Slower but better |
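| GPTQ | Post-training quantization for GPT-style models | Layer-wise weight quantization that minimizes output reconstruction error using approximate second-order information | High for LLMs | Standard in 2026 |
| AWQ | Activation-aware Weight Quantization | Scale weights based on salient activations | Often on par with or better than GPTQ | Standard in 2026 |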
# PyTorch dynamic quantization (simplest PTQ)
import torch
import torch.nn as nn
import torch.quantization
model_int8 = torch.quantization.quantize_dynamic(
model,
{nn.Linear}, # quantize only linear layers
dtype=torch.qint8
)
# bitsandbytes 4-bit loading (LLM standard in 2026)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained('model_name',
quantization_config=bnb_config)
Q24. What is structured vs unstructured pruning?
| Type | What is removed | Speed on Hardware | Quality loss |
|---|---|---|---|
| Unstructured | Individual weights (sparse matrix) | Minimal without sparse hardware | Minimal |
| Structured | Entire neurons, heads, or layers | Immediate (dense matrix shrinks) | Moderate |
| Magnitude pruning | Smallest-magnitude weights | Depends on type | Low |
| Lottery Ticket Hypothesis | Retrain the sparse subnetwork from its original initialization | Experimental | Low |
import torch.nn as nn
import torch.nn.utils.prune as prune
# Unstructured L1 magnitude pruning (30% of weights zeroed)
prune.l1_unstructured(model.fc1, name='weight', amount=0.3)
# Structured pruning: remove entire rows (neurons) of a linear layer
prune.ln_structured(model.fc1, name='weight', amount=0.3, n=2, dim=0)
# Make pruning permanent (remove mask; recompute weight)
prune.remove(model.fc1, 'weight')
# Global pruning: prune 20% of all weights across the whole model
parameters_to_prune = [(layer, 'weight') for layer in model.modules()
if isinstance(layer, nn.Linear)]
prune.global_unstructured(parameters_to_prune,
pruning_method=prune.L1Unstructured,
amount=0.2)
Q25. Explain the training stability tricks for large language models.
| Trick | Description | Purpose |
|---|---|---|
| Pre-LN (pre-norm) | LayerNorm before attention/FFN, not after | More stable gradients vs post-LN |
| QK-Norm | Normalize Q and K before dot-product | Prevents logit growth, entropy collapse |
| Gradient clipping | Clip norm to 1.0 | Prevents single bad batch from destroying training |
| Weight tying | Share embedding and output projection weights | Reduces parameters, improves language modeling |
| Z-loss | Penalize large logit magnitudes | Prevents softmax saturation |
| Warmup LR | Linear ramp for first 1-5% of steps | Prevents early instability |
# Z-loss (prevents entropy collapse in MoE routing and output softmax)
def z_loss(logits, z_loss_coef=0.001):
log_z = torch.logsumexp(logits, dim=-1)
return z_loss_coef * log_z.pow(2).mean()
# QK-Norm implementation
class AttentionWithQKNorm(nn.Module):
def __init__(self, d_model, n_heads):
super().__init__()
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.q_norm = nn.RMSNorm(self.d_k)
self.k_norm = nn.RMSNorm(self.d_k)
self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
def forward(self, x, mask=None):
B, T, D = x.shape
# Apply per-head normalization to Q and K
# (simplified sketch; full impl would reshape, norm, reshape back)
return self.attn(x, x, x, attn_mask=mask)[0]
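For reference, here is one way the per-head normalization could actually be wired in. It is a sketch under stated assumptions, not the layout of any particular model: the class name `QKNormAttention`, the fused `qkv` projection, and the hand-written RMS normalization (used instead of `nn.RMSNorm` to avoid depending on a recent PyTorch version) are all illustrative choices. It relies on `F.scaled_dot_product_attention`, available in PyTorch 2.0+.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head self-attention with RMS normalization applied to Q and K per head."""

    def __init__(self, d_model, n_heads, eps=1e-6):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.eps = eps
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Learnable per-dimension gains for the normalized Q and K
        self.q_scale = nn.Parameter(torch.ones(self.d_k))
        self.k_scale = nn.Parameter(torch.ones(self.d_k))

    def _rms_norm(self, t, scale):
        # Divide by the root-mean-square over the head dimension
        inv_rms = t.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return t * inv_rms * scale

    def forward(self, x, is_causal=False):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # [B, T, D] -> [B, heads, T, d_k]
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Normalize Q and K so the logits Q.K^T stay bounded as weights grow
        q = self._rms_norm(q, self.q_scale)
        k = self._rms_norm(k, self.k_scale)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
        out = out.transpose(1, 2).reshape(B, T, D)   # merge heads
        return self.out(out)

torch.manual_seed(0)
layer = QKNormAttention(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)
y = layer(x, is_causal=True)
y_large = layer(x * 1e4, is_causal=True)   # large inputs, logits remain bounded
print(y.shape, torch.isfinite(y_large).all().item())
```

Because Q and K are normalized before the dot product, the attention logits no longer scale with the magnitude of the activations, which is the property that guards against logit growth.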
Q26. What is RLHF and DPO? How are they used to align LLMs?
RLHF (Reinforcement Learning from Human Feedback):
- Supervised fine-tuning (SFT) on curated demonstrations
- Train reward model on human preference pairs (chosen > rejected)
- PPO optimization: maximize reward while staying close to SFT policy (KL penalty)
DPO (Direct Preference Optimization): Eliminates the reward model. Directly optimizes on preference pairs:
L_DPO = -E[log sigma(beta * (log pi_theta(y_w|x) - log pi_ref(y_w|x))
- beta * (log pi_theta(y_l|x) - log pi_ref(y_l|x)))]
# DPO loss implementation
import torch.nn.functional as F

def dpo_loss(policy_logprobs_chosen, policy_logprobs_rejected,
             ref_logprobs_chosen, ref_logprobs_rejected, beta=0.1):
    """
    policy_*: log probabilities from model being trained
    ref_*: log probabilities from reference (SFT) model
    beta: KL penalty coefficient
    """
    # Log probability ratios (policy vs reference)
    pi_logratios = policy_logprobs_chosen - policy_logprobs_rejected
    ref_logratios = ref_logprobs_chosen - ref_logprobs_rejected
    # DPO objective
    logits = beta * (pi_logratios - ref_logratios)
    loss = -F.logsigmoid(logits).mean()
    return loss
Why DPO in 2026: DPO is simpler (no RL loop, no separate reward model), tends to be more stable to train than PPO, and is widely reported to reach comparable alignment quality. The TRL library from Hugging Face provides a ready-made DPOTrainer.
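The `*_logprobs_*` inputs to a DPO loss are usually sequence-level values: the sum of per-token log-probabilities over the response tokens. A minimal sketch of that step follows, assuming the caller has already aligned logits with targets and built a response mask; `sequence_logprobs` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F


def sequence_logprobs(logits: torch.Tensor, labels: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities over response tokens.

    logits: [B, T, V] model outputs, where logits[:, t] predicts labels[:, t]
            (the input/target shift is assumed to be done by the caller)
    labels: [B, T] target token ids
    mask:   [B, T] 1.0 for response tokens, 0.0 for prompt and padding
    """
    logp = F.log_softmax(logits, dim=-1)
    # Pick out the log-probability assigned to each target token
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # [B, T]
    return (token_logp * mask).sum(dim=-1)  # [B]


# Toy example with random logits (demonstrates shapes only)
torch.manual_seed(0)
logits = torch.randn(2, 4, 10)
labels = torch.randint(0, 10, (2, 4))
mask = torch.tensor([[0.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]])
seq_lp = sequence_logprobs(logits, labels, mask)  # shape [2]
```

Run this once with the policy and once with the frozen reference model (under `torch.no_grad()`), for both the chosen and the rejected response, to get the four tensors the loss expects.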
Q27. How does speculative decoding accelerate LLM inference?
Speculative decoding uses a small fast "draft" model to propose k tokens, then verifies all k in a single pass of the large model:
1. Draft model proposes tokens t_1, ..., t_k (k serial small-model passes)
2. Large model verifies all k tokens in ONE forward pass (k tokens in parallel)
3. Accept tokens where large model agrees; reject and resample from first disagreement
4. Speedup: ~2-3x for typical draft acceptance rates of 70-90%
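# Simplified speculative decoding loop (greedy verification, batch size 1)
def speculative_decode(draft_model, target_model, prompt, max_new=100, k=5):
    input_ids = prompt  # [1, T]
    n_prompt = prompt.shape[1]
    while input_ids.shape[1] - n_prompt < max_new:
        # Draft k tokens greedily (k serial small-model passes)
        ids = input_ids
        for _ in range(k):
            with torch.no_grad():
                draft_logits = draft_model(ids).logits[:, -1]
            draft_tok = draft_logits.argmax(-1)
            ids = torch.cat([ids, draft_tok.unsqueeze(1)], dim=1)
        draft_tokens = ids[:, -k:]  # [1, k]
        # Verify with target model in ONE forward pass
        with torch.no_grad():
            # The last k+1 positions predict the k drafted tokens plus one bonus token
            target_logits = target_model(ids).logits[:, -k-1:]
        target_tokens = target_logits.argmax(-1)  # [1, k+1]
        # Accept the longest prefix where the target agrees with the draft
        match = (target_tokens[:, :k] == draft_tokens)[0].long()
        n_accept = int(match.cumprod(0).sum())
        # Keep accepted tokens, then the target's own token at the first
        # disagreement (or the bonus token if all k were accepted)
        input_ids = torch.cat(
            [input_ids, draft_tokens[:, :n_accept],
             target_tokens[:, n_accept:n_accept + 1]], dim=1)
    # With sampling instead of greedy decoding, each drafted token is accepted
    # with probability min(1, p_target / p_draft) and resampled on rejection
    return input_ids[:, n_prompt:n_prompt + max_new]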
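The link between acceptance rate and speedup can be estimated with a short calculation. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification; `expected_tokens_per_pass` is an illustrative helper name.

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Expected tokens produced per target-model forward pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability accept_rate; the target always contributes one token of
    its own (the correction or the bonus token).
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


for a in (0.7, 0.8, 0.9):
    tokens = expected_tokens_per_pass(a, k=5)
    print(f"accept_rate={a}: {tokens:.2f} tokens per target pass")
```

This counts tokens per large-model pass, not wall-clock speedup. The k draft passes also cost time, which is why measured speedups land below these numbers.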
Q28. Explain distributed training: data parallelism, tensor parallelism, and FSDP.
| Strategy | What is sharded | Scale | Overhead |
|---|---|---|---|
| Data Parallelism (DDP) | Data; full model on each GPU | Linear with GPUs | Gradient all-reduce |
| Tensor Parallelism (TP) | Weight matrices split column/row-wise | LLM attention/FFN | Complex, needs tensor-aware code |
| Pipeline Parallelism (PP) | Model layers split in stages | Very deep models | Micro-batching required |
| ZeRO Stage 3 | Optimizer states + gradients + params | Large models | Higher communication |
| FSDP (PyTorch) | Full model sharded; gather on demand | Large models | Native PyTorch, standard in 2026 |
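import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    CPUOffload, FullyShardedDataParallel as FSDP, ShardingStrategy
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Initialize distributed backend (launch with torchrun so LOCAL_RANK is set)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = MyLargeModel().to(local_rank)
# Wrap each transformer block as its own FSDP unit. MyTransformerBlock is a
# placeholder for the block class your model actually uses.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={MyTransformerBlock}
)
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3 equivalent
    cpu_offload=CPUOffload(offload_params=True),    # offload to CPU RAM
    auto_wrap_policy=wrap_policy,
    device_id=local_rank
)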
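A back-of-the-envelope memory estimate shows why sharding matters. This sketch assumes mixed-precision Adam with the common accounting of 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance) and ignores activations and buffers; `model_state_gb` and the 7B/8-GPU figures are illustrative.

```python
def model_state_gb(n_params: float, n_gpus: int, shard: bool) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam state.

    Assumption: 2 bytes fp16 weights + 2 bytes fp16 gradients
    + 12 bytes fp32 optimizer state (master weights, momentum, variance).
    Activations and temporary buffers are not counted.
    """
    bytes_per_param = 2 + 2 + 12
    total_bytes = n_params * bytes_per_param
    # DDP replicates everything on each GPU; full sharding splits it evenly
    per_gpu_bytes = total_bytes / n_gpus if shard else total_bytes
    return per_gpu_bytes / 1e9


n_params = 7e9  # a 7B-parameter model
ddp_gb = model_state_gb(n_params, n_gpus=8, shard=False)
fsdp_gb = model_state_gb(n_params, n_gpus=8, shard=True)
print(f"DDP: {ddp_gb:.0f} GB per GPU, full sharding: {fsdp_gb:.0f} GB per GPU")
```

Under these assumptions the replicated setup does not fit on an 80 GB GPU while the fully sharded one does, which is the core trade-off the table summarizes: less memory per GPU in exchange for more communication.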
Q29. What is contrastive learning? Explain CLIP and SimCLR.
SimCLR (self-supervised for vision):
- Two augmented views of the same image = positive pair
- Views from different images = negative pairs
- NT-Xent (normalized temperature-scaled cross-entropy) loss
CLIP (OpenAI, multimodal):
- Positive pair: (image, its text caption)
- Negative pairs: (image, all other captions in batch)
- Train image encoder + text encoder together on 400M image-text pairs
- Creates aligned image-text embedding space
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """
    z1, z2: [N, D] normalized embeddings of two augmented views
    """
    N = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)  # [2N, D]
    # Cosine similarity matrix
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=-1) / temperature
    # Mask out diagonal (self-similarity)
    mask = torch.eye(2*N, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))
    # Labels: for sample i, positive is at i+N (and vice versa)
    labels = torch.cat([torch.arange(N, 2*N), torch.arange(0, N)]).to(z.device)
    return F.cross_entropy(sim, labels)
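The CLIP objective can be sketched in the same style: a symmetric cross-entropy over the image-text similarity matrix. This is a simplified version that assumes a fixed temperature (CLIP learns it during training) and that row i of each tensor comes from the same image-caption pair; `clip_loss` is an illustrative name.

```python
import torch
import torch.nn.functional as F


def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N image-text pairs.

    image_emb, text_emb: [N, D]; row i of each comes from the same pair.
    temperature is fixed here for simplicity (an assumption of this sketch).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = cosine similarity of image i and caption j, scaled
    logits = image_emb @ text_emb.t() / temperature  # [N, N]
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> matching image
    return (loss_i2t + loss_t2i) / 2


# Toy check: perfectly aligned embeddings score better than misaligned ones
emb = torch.eye(4)
aligned = clip_loss(emb, emb)
shuffled = clip_loss(emb, emb.roll(1, 0))
```

Unlike NT-Xent, the positives here sit on the diagonal of an image-by-text matrix, so no self-similarity masking is needed.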
Q30. How do you debug a deep learning model that is not training?
1. Check data pipeline first
   - Visualize a batch: print shapes, check label distribution, spot-check samples
   - Verify normalization: mean~0, std~1 after preprocessing
2. Check that loss decreases on a single batch
   - If loss does NOT decrease on 1 batch: bug in forward pass or loss
   - If loss decreases on 1 batch but NOT across epochs: suspect the data pipeline (shuffling, label alignment, augmentation) before the model
3. Check gradients
   - Any NaN or inf? -> usually exploding gradients or an invalid op such as log(0); clip gradients or reduce LR
   - All zero? -> dying ReLUs, wrong loss, disconnected graph
   - Too small? -> vanishing gradients; use residual connections, LayerNorm, better init
4. Check learning rate
   - Too high: loss oscillates or diverges
   - Too low: loss decreases but very slowly
   - Use the LR range test (Leslie Smith) to pick a starting value
5. Check label correctness
   - Common bug: labels are 1-indexed while the loss expects 0-indexed class ids
   - Common bug: label tensor dtype should be torch.long for CrossEntropyLoss with class-index targets
# Gradient checking utility (run after loss.backward())
for name, param in model.named_parameters():
    if param.grad is None:
        continue
    # Check NaN first: comparisons against a NaN norm are always False
    if torch.isnan(param.grad).any():
        print(f"NaN grad: {name}")
        continue
    grad_norm = param.grad.norm().item()
    if grad_norm == 0:
        print(f"ZERO grad: {name}")
    elif grad_norm > 100:
        print(f"LARGE grad: {name} = {grad_norm:.2f}")
# Overfit one batch test
x_one, y_one = next(iter(dataloader))
for step in range(200):
    optimizer.zero_grad()
    pred = model(x_one)
    loss = criterion(pred, y_one)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"Step {step}: loss={loss.item():.4f}")
# If loss does not reach near-zero after 200 steps: architecture or init bug
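The label checks in step 5 are easy to automate. Below is a minimal sketch, assuming class-index targets for `CrossEntropyLoss` with no `ignore_index` in use; `check_labels` is a hypothetical helper, not a PyTorch function.

```python
import torch


def check_labels(labels: torch.Tensor, num_classes: int) -> list[str]:
    """Return human-readable problems found in a batch of class-index labels."""
    problems = []
    if labels.dtype != torch.long:
        problems.append(f"dtype is {labels.dtype}, expected torch.long")
    if labels.numel() > 0:
        lo, hi = int(labels.min()), int(labels.max())
        # Valid class ids are 0 .. num_classes - 1
        if lo < 0 or hi >= num_classes:
            problems.append(
                f"values span [{lo}, {hi}], expected [0, {num_classes - 1}]"
            )
    return problems


# A 1-indexed label batch for a 3-class problem is flagged by the range check
print(check_labels(torch.tensor([1, 2, 3]), num_classes=3))
# Float labels are flagged by the dtype check
print(check_labels(torch.tensor([0.0, 1.0]), num_classes=2))
```

Calling this on the first batch from the dataloader, before any training step, catches both of the common label bugs listed above.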
Comparison Table: Architecture Choices in 2026
| Task | Architecture | Framework | Notes |
|---|---|---|---|
| Image classification | EfficientNetV2, ViT-B | PyTorch | Pretrained on ImageNet-21K |
| Object detection | YOLOv9, RT-DETR | PyTorch | RT-DETR for production |
| NLP classification | BERT, DeBERTa-v3 | HuggingFace | Fine-tune on domain data |
| Text generation | LLaMA-3, Mistral | HuggingFace | QLoRA fine-tuning |
| Speech | Whisper | HuggingFace | OpenAI Whisper-large-v3 |
| Multimodal | CLIP, LLaVA | HuggingFace | Vision-language tasks |
| Tabular | LightGBM, XGBoost | Native | Gradient-boosted trees often still beat NNs on tabular data |
| Time series | PatchTST, N-HiTS | PyTorch | Transformer-based TS models |
FAQ
Q: PyTorch vs TensorFlow in 2026 interviews? A: PyTorch is the answer. Most published research code uses PyTorch. TF 2.x is still used at some Google-adjacent teams and integrates with Keras, but if asked to pick one, pick PyTorch.
Q: How do you debug NaN loss during training?
A: In order: check for zero-division in your loss, enable torch.autograd.detect_anomaly() during debugging, check gradient norms, reduce learning rate, check input normalization, add gradient clipping.
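Q: What is the difference between an epoch and a step? A: One step = one optimizer update, normally one forward+backward pass on one mini-batch (several passes if you use gradient accumulation). One epoch = one pass over the entire dataset. Steps per epoch = ceil(dataset_size / batch_size), or floor if the last partial batch is dropped.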
Q: When should I use a pre-trained model vs train from scratch? A: Always start with a pre-trained model when one exists for your domain. Training from scratch is warranted only when: the domain is highly specialized (medical, satellite, proprietary signal), you have massive data (>10M labeled samples), or the architecture is custom.
Methodology applied to this article (last verified 8 Jun 2026)
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.