
Deep Learning Interview Questions 2026: 30 Answers with Code

Q: How do you debug NaN loss during training?

In order: check for zero-division in your loss, enable torch.autograd.detect_anomaly() during debugging, check gradient norms, reduce learning rate, check input normalization, add gradient clipping.

30 deep learning interview questions with PyTorch code covering neural networks, CNNs, RNNs, transformers, training tricks, and production deployment for 2026 interviews.

By Aditya Sharma | Published 8 Jun 2026 | 2 sources listed

10 min read last revised 8 Jun 2026


Deep learning is no longer a niche specialization. In 2026, every ML interview at a product company includes at least five deep learning questions. Understanding how neural networks actually work, including the math behind backpropagation, the engineering behind transformers, and the tradeoffs in production deployment, separates candidates who clear rounds from those who don't. This guide covers 30 essential deep learning questions with complete PyTorch code.

PapersAdda's take: Memorizing that "batch normalization normalizes activations" is not enough. You need to know what goes wrong without it, why layer norm is preferred in transformers, and how to debug a training loop that is not converging. That depth is what this guide delivers. Candidates report that FAANG deep learning rounds always include a "debug this training loop" segment. According to candidate accounts from public preparation resources, transformer internals (attention math, positional encoding) appear in nearly every senior ML round at Google and Meta. Confirm the specific interview format on the official careers portal of your target company.


Which Companies Ask These Questions?

Topic Cluster	Companies
Backpropagation and Gradients	Google, Meta, Microsoft, Nvidia
CNN Architecture	Google DeepMind, Meta AI, Samsung
RNN and LSTM	Amazon Alexa, Microsoft Azure AI
Transformers and Attention	All frontier AI labs, OpenAI, Cohere
Training Tricks	Every company with an ML team
Model Compression	Mobile-first teams, Qualcomm, Apple
Distributed Training	Databricks, Nvidia, hyperscalers

EASY: Fundamentals (Questions 1-10)

Q1. What is a neural network? Explain forward propagation.

Layer output: a^{l} = f(W^{l} * a^{l-1} + b^{l})

Forward propagation: Pass input through each layer in sequence, compute activations, and produce a final output.

import torch
import torch.nn as nn

class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, dim),
                nn.LayerNorm(dim),
                nn.GELU(),
                nn.Dropout(dropout)
            ])
            prev_dim = dim
        layers.append(nn.Linear(prev_dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = FeedForwardNet(784, [512, 256, 128], 10)
x = torch.randn(32, 784)          # batch of 32, 784 features
logits = model(x)                  # forward pass
print(logits.shape)                # [32, 10]

Q2. Explain backpropagation with the chain rule.

Simple example:

import torch

# Computation graph: x -> a=Wx+b -> h=relu(a) -> y_hat=Wh -> loss=(y_hat-y)^2
x = torch.tensor([1.0, 2.0], requires_grad=False)
W1 = torch.tensor([[0.5, 0.3], [0.2, 0.4]], requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)
W2 = torch.tensor([[0.6, 0.7]], requires_grad=True)
y  = torch.tensor([1.0])

a = W1 @ x + b1          # linear
h = torch.relu(a)         # activation
y_hat = W2 @ h            # output
loss = (y_hat - y).pow(2) # MSE loss

loss.backward()            # backprop: PyTorch auto-differentiates
print("dL/dW1:", W1.grad)  # chain rule: dL/dy_hat * dy_hat/dh * dh/da * da/dW1
print("dL/dW2:", W2.grad)

Chain rule spelled out:

dL/dW1 = (dL/dy_hat) * (dy_hat/dh) * (dh/da) * (da/dW1)
        = (2*(y_hat-y) * W2^T .* relu'(a)) * x^T    (.* is elementwise; the result is an outer product with the shape of W1)
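To make the chain rule concrete, the sketch below recomputes the gradients by hand, one factor at a time, and compares them with what autograd produces. It reuses the values from the example above; the names `dL_dyhat`, `dL_dh`, `dL_da`, `manual_dW1`, and `manual_dW2` are illustrative and not part of the original snippet.

```python
import torch

x = torch.tensor([1.0, 2.0])
W1 = torch.tensor([[0.5, 0.3], [0.2, 0.4]], requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)
W2 = torch.tensor([[0.6, 0.7]], requires_grad=True)
y = torch.tensor([1.0])

# Forward pass (same graph as above)
a = W1 @ x + b1
h = torch.relu(a)
y_hat = W2 @ h
loss = (y_hat - y).pow(2).sum()
loss.backward()

# Manual chain rule, one factor at a time
with torch.no_grad():
    dL_dyhat = 2 * (y_hat - y)             # shape [1]
    dL_dh = W2.t() @ dL_dyhat              # shape [2]
    dL_da = dL_dh * (a > 0).float()        # relu'(a) is 1 where a > 0, else 0
    manual_dW1 = torch.outer(dL_da, x)     # shape [2, 2], matches W1
    manual_dW2 = torch.outer(dL_dyhat, h)  # shape [1, 2], matches W2

print(torch.allclose(manual_dW1, W1.grad), torch.allclose(manual_dW2, W2.grad))
```

In an interview, walking through the shapes at each step (vector, vector, outer product) is usually more convincing than reciting the formula.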

Q3. What are activation functions? Compare ReLU, GELU, and SwiGLU.

Activation	Formula	Range	Used In	Key Property
Sigmoid	1/(1+e^-x)	(0,1)	Output layer	Vanishing gradients
Tanh	(e^x-e^-x)/(e^x+e^-x)	(-1,1)	RNN gates	Better than sigmoid
ReLU	max(0,x)	[0,inf)	CNNs, MLPs	Fast, dying ReLU risk
Leaky ReLU	max(0.01x, x)	(-inf,inf)	CNNs	Fixes dying ReLU
GELU	x*Phi(x)	smooth	BERT, GPT	Smooth, better than ReLU
SwiGLU	x*sigmoid(beta*x) * linear gate	smooth	LLaMA, Mistral	Gated, best in LLMs

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(100)

relu   = F.relu(x)
gelu   = F.gelu(x)
silu   = F.silu(x)       # SiLU, also known as Swish: x*sigmoid(x)

# SwiGLU: requires splitting the projection in two (gate mechanism)
class SwiGLU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim * 2)  # doubled projection

    def forward(self, x):
        x_proj, gate = self.proj(x).chunk(2, dim=-1)
        return x_proj * F.silu(gate)

Q4. What is weight initialization and why does it matter?

Initialization	Best For	Formula
Xavier / Glorot	Sigmoid, Tanh	W ~ Uniform(-sqrt(6/(n_in+n_out)), ...)
He / Kaiming	ReLU and variants	W ~ Normal(0, std=sqrt(2/n_in))
Zero initialization	Never for weights	Symmetry breaking fails
Small random	Deprecated for deep nets	Too slow convergence

import torch.nn as nn

conv = nn.Conv2d(64, 128, 3)
linear = nn.Linear(512, 256)

# He initialization for ReLU
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
nn.init.zeros_(conv.bias)

# Xavier for transformers
nn.init.xavier_uniform_(linear.weight)
nn.init.zeros_(linear.bias)

# PyTorch modules ship with default initializers (nn.Linear and nn.Conv2d use a Kaiming-uniform variant)
# In custom models, initialize explicitly when the defaults do not match your activation

Q5. What is batch normalization? How is it different from layer normalization?

Batch Normalization (BatchNorm):

Normalizes across the batch dimension for each feature
Statistics: mean and variance computed per feature across all samples in the batch
Requires large batch size to work well; requires separate treatment at inference (running statistics)
Used in: CNNs, MLPs

Layer Normalization (LayerNorm):

Normalizes across the feature dimension for each sample independently
Statistics: mean and variance computed per sample across all features
Works with batch size 1; identical behavior at train and inference
Used in: Transformers, RNNs, any variable-length or small-batch scenario

import torch
import torch.nn as nn

batch_norm = nn.BatchNorm2d(64)        # for CNN: normalizes across N,H,W per channel
layer_norm = nn.LayerNorm(768)         # for transformer: normalizes across 768 features per token
rms_norm   = nn.RMSNorm(768)           # variant: no mean subtraction; used in LLaMA

# What BatchNorm does:
# x_hat = (x - mean_per_channel) / std_per_channel
# out = gamma * x_hat + beta   (learnable scale/shift)

# At inference: use running_mean and running_var (computed during training via EMA)
# At training: use batch statistics
batch_norm.eval()   # switches from batch stats to running stats
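The difference between the two norms comes down to which axes the statistics are computed over. The following sketch reproduces both by hand and checks them against the PyTorch modules; the tensor shapes and variable names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# BatchNorm2d: one mean/variance per channel, computed over N, H, W
x_img = torch.randn(8, 4, 5, 5)              # [N, C, H, W]
bn = nn.BatchNorm2d(4)
bn.train()                                   # use batch statistics
bn_out = bn(x_img)
mean_c = x_img.mean(dim=(0, 2, 3), keepdim=True)
var_c = x_img.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
bn_manual = (x_img - mean_c) / torch.sqrt(var_c + bn.eps)

# LayerNorm: one mean/variance per token, computed over the feature axis
x_tok = torch.randn(2, 3, 16)                # [B, T, D]
ln = nn.LayerNorm(16)
ln_out = ln(x_tok)
mean_t = x_tok.mean(dim=-1, keepdim=True)
var_t = x_tok.var(dim=-1, unbiased=False, keepdim=True)
ln_manual = (x_tok - mean_t) / torch.sqrt(var_t + ln.eps)

print(torch.allclose(bn_out, bn_manual, atol=1e-5),
      torch.allclose(ln_out, ln_manual, atol=1e-5))
```

The comparison holds because gamma and beta start at 1 and 0; after training they would add a learned scale and shift on top of the normalized values.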

Q6. What is dropout and how does it prevent overfitting?

Prevents co-adaptation: neurons can't rely on specific other neurons always being present
Acts like training an ensemble of 2^n thinned networks
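At inference: all neurons active. In the original formulation, outputs are scaled by (1-p) to match the expected training output; PyTorch instead uses inverted dropout, scaling the surviving activations by 1/(1-p) during training so that inference needs no rescaling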

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),       # dropout after activation
            nn.Linear(dim * 4, dim),
            nn.Dropout(dropout)        # dropout before residual add
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.net(self.norm(x))  # pre-norm residual

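# Concrete effect: train()/eval() must be called on a module that contains the dropout layer
model = nn.Sequential(nn.Linear(100, 50), nn.Dropout(p=0.5))
x = torch.randn(32, 100)

model.train()   # dropout active: about 50% of outputs zeroed, survivors scaled by 1/(1-p) = 2
out_train = model(x)

model.eval()    # dropout disabled: acts as identity
out_eval = model(x)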
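A quick way to see how `nn.Dropout` keeps the expected activation unchanged is to feed it a tensor of ones. This is a minimal check; the tensor size and seed are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()
y_train = drop(x)   # each element is either 0 or 1/(1-p) = 2.0

drop.eval()
y_eval = drop(x)    # identity: no masking, no scaling

print(y_train.unique(), y_train.mean(), torch.equal(y_eval, x))
```

The training-mode mean stays close to 1.0 because the surviving half of the elements is doubled, which is why no correction is needed at inference.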

Q7. How do you prevent exploding gradients?

Technique	Description	Where Used
Gradient clipping	Rescale gradients if norm exceeds threshold	RNNs, LSTMs, LLM training
Weight initialization	He/Xavier init prevents early-stage explosion	All networks
Batch/Layer normalization	Keeps activations in stable range	CNNs, transformers
Residual connections	Gradients flow directly through skip path	ResNet, transformers
Smaller learning rate	Reduces step size	General

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.LSTM(512, 512, num_layers=4, batch_first=True)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training loop with gradient clipping (dataloader and criterion are assumed to be defined)
for batch in dataloader:
    optimizer.zero_grad()
    out, _ = model(batch['x'])
    loss = criterion(out, batch['y'])
    loss.backward()

    # Clip gradient norm to 1.0 (standard for RNN/LLM training)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
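To see what `clip_grad_norm_` actually does, the sketch below builds a deliberately large gradient and measures the global norm before and after clipping. The helper `global_grad_norm` is a hypothetical name introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 10)
loss = (layer(torch.randn(4, 10)) * 100).pow(2).sum()   # deliberately large loss
loss.backward()

def global_grad_norm(params):
    # L2 norm over all gradients, treated as one long vector
    return torch.norm(torch.stack([p.grad.norm(2) for p in params]), 2)

params = list(layer.parameters())
norm_before = global_grad_norm(params)
returned = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # returns the pre-clip norm
norm_after = global_grad_norm(params)

print(norm_before.item(), norm_after.item())
```

All gradients are rescaled by the same factor, so the update direction is preserved and only its length changes. Logging the returned pre-clip norm is a cheap way to spot instability early.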

Q8. What is the difference between SGD, Adam, and AdamW? Which do you use in 2026?

Optimizer	Update Rule	Memory	Best For
SGD	w -= lr * grad	O(params)	CNNs with carefully tuned LR + momentum
SGD + Momentum	Adds velocity term	O(params)	Better convergence than vanilla SGD
Adam	Adaptive per-param LR from 1st and 2nd moments	O(3*params)	Default for most tasks
AdamW	Adam with decoupled weight decay	O(3*params)	Transformers, LLM fine-tuning (standard in 2026)
Sophia	Clipped update using a diagonal Hessian estimate	O(3*params)	Emerging for LLM pre-training

import torch.optim as optim

# AdamW with cosine LR schedule -- standard for transformers in 2026
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),     # momentum coefficients
    eps=1e-8,
    weight_decay=0.01        # decoupled weight decay (applied to the weights directly, not through the gradient)
)

# Cosine annealing with warmup
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=10000
)

Q9. What is learning rate scheduling? What schedules do you use for different architectures?

Schedule	Description	Best For
Step decay	LR *= gamma every N steps	CNNs, simple MLPs
Cosine annealing	LR follows cosine curve from LR_max to LR_min	Transformers, general
Warmup + cosine	Linear warmup then cosine decay	LLM training (standard)
Cyclic LR	Oscillate LR between bounds	Finding optimal LR
One-cycle policy	LR climbs then falls; one cycle total	Fast training (fast.ai)
ReduceLROnPlateau	Reduce LR when val metric stops improving	When you don't know train steps

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine annealing (most common for CNNs)
scheduler_cosine = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle (best for fast training)
scheduler_onecycle = OneCycleLR(
    optimizer, max_lr=0.1,
    steps_per_epoch=len(dataloader),
    epochs=50,
    pct_start=0.3       # warmup takes 30% of steps
)

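# In training loop: attach only ONE scheduler to a given optimizer; both are shown for contrast
for epoch in range(n_epochs):
    for batch in dataloader:
        # ... train step (forward, backward, optimizer.step()) ...
        scheduler_onecycle.step()   # OneCycleLR: step once per batch
    # scheduler_cosine.step()       # CosineAnnealingLR with T_max in epochs: step once per epoch instead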

Q10. What is gradient checkpointing and when is it used?

Memory savings: Reduces activation memory from O(n_layers) to O(sqrt(n_layers))
Compute cost: ~33% extra FLOPs (one extra forward pass per layer)
When to use: When training large models and activation memory is the bottleneck (common when fine-tuning LLMs on a single GPU)

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerLayer(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn  = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim)
        )

    def forward(self, x):
        # gradient_checkpoint: recompute this during backward instead of storing
        attn_out = checkpoint(self.attn, x, x, x, use_reentrant=False)[0]
        return x + self.ffn(attn_out)

# For HuggingFace models: one line
model.gradient_checkpointing_enable()

MEDIUM: Architectures (Questions 11-22)

Q11. Explain the ResNet architecture and why residual connections matter.

Residual block:

output = F(x, {W_i}) + x

Instead of learning the mapping H(x), the network learns the residual F(x) = H(x) - x.

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                                stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                                padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)

        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        return self.relu(out + identity)   # skip connection

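Why it works: Gradients can flow directly through the skip connection path, bypassing any layer-specific transformation. A block can also fall back to the identity mapping by driving F(x) toward zero, so adding blocks should not, in principle, make the network worse than its shallower counterpart.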

Q12. What is self-attention and how does it compute relationships between tokens?

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

For each token:

Compute a Query (what am I looking for?), Key (what do I contain?), Value (what do I return?)
Score with all other tokens via dot product
Softmax to get attention weights (sum to 1)
Weighted sum of Values

import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v, mask=None):
    """x: [B, T, D], W_*: [D, d_k]"""
    Q = x @ W_q   # [B, T, d_k]
    K = x @ W_k   # [B, T, d_k]
    V = x @ W_v   # [B, T, d_v]

    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # [B, T, T]

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    weights = F.softmax(scores, dim=-1)   # [B, T, T]
    return weights @ V                    # [B, T, d_v]

Causal (autoregressive) attention: Mask the upper triangle to prevent each token from attending to future tokens. Used in GPT-style models.
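A causal mask can be sketched with `torch.tril`, using the same convention as `self_attention` above (positions where the mask is 0 are blocked). The sizes below are arbitrary, and the result is compared against PyTorch's built-in `is_causal=True` path.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 4, 8
q = torch.randn(1, T, d)
k = torch.randn(1, T, d)
v = torch.randn(1, T, d)

causal_mask = torch.tril(torch.ones(T, T))             # 1 = may attend, 0 = blocked
scores = q @ k.transpose(-2, -1) / d**0.5              # [1, T, T]
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)                    # upper triangle becomes exactly 0
out = weights @ v

reference = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(weights[0])
print(torch.allclose(out, reference, atol=1e-5))
```

Row i of `weights` only has non-zero entries in columns 0..i, so token i never sees tokens that come after it.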

Q13. What is the difference between RNNs, LSTMs, and Transformers for sequence modeling?

Aspect	RNN	LSTM	Transformer
Long-range dependencies	Poor (vanishing grad)	Better (cell state)	Excellent (direct attention)
Parallelizable (training)	No (sequential)	No (sequential)	Yes (full parallelism)
Memory vs sequence length n	O(1) per step	O(1) per step	O(n^2) (attention matrix; O(n) with FlashAttention)
2026 status	Deprecated for NLP	Legacy use (edge devices)	Dominant for all NLP/LLM
Context window	Effectively ~100 tokens	~500 tokens	Up to 1M tokens (with FlashAttn)

import torch.nn as nn

# LSTM (still used in some production systems for low-latency inference)
lstm = nn.LSTM(input_size=256, hidden_size=512,
               num_layers=2, batch_first=True,
               dropout=0.1, bidirectional=True)

# Transformer encoder layer (modern approach)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8,
    dim_feedforward=2048,
    dropout=0.1,
    activation='gelu',
    batch_first=True,
    norm_first=True   # pre-norm (more stable training for deep stacks)
)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

Q14. How does a convolutional neural network process an image? Explain stride, padding, and receptive field.

Convolution: Slide filter over image, compute dot product at each position
Stride: How many pixels to skip between filter positions. Stride=2 halves spatial dimensions
Padding: Adds zeros around border. same padding keeps spatial size; valid padding shrinks it
Receptive field: The region of the input that influences one output neuron. Grows with depth

import torch
import torch.nn as nn

# Compute output size: floor((H + 2*padding - kernel) / stride) + 1
# Input: 32x32, kernel=3, padding=1, stride=1 -> output: 32x32 (same)
# Input: 32x32, kernel=3, padding=0, stride=2 -> output: 15x15

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1, stride=1),  # 3->32, 32x32
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1, stride=2), # 32->64, 16x16
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2),# 64->128, 8x8
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1,1)),  # global average pooling -> 1x1
    nn.Flatten(),
    nn.Linear(128, 10)
)

x = torch.randn(4, 3, 32, 32)
print(model(x).shape)  # [4, 10]
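The output-size formula and the growth of the receptive field can both be checked with a few lines of plain Python. The helper names `conv_out_size` and `receptive_field` are illustrative, and the receptive-field recurrence assumes a simple chain of convolutions without dilation.

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution: floor((H + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def receptive_field(layers):
    """layers: list of (kernel, stride) tuples, first layer first."""
    rf, jump = 1, 1                  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump    # each layer widens the field by (k - 1) * current jump
        jump *= stride
    return rf

print(conv_out_size(32, kernel=3, padding=1, stride=1))   # 32
print(conv_out_size(32, kernel=3, padding=0, stride=2))   # 15
print(receptive_field([(3, 1), (3, 2), (3, 2)]))          # 9
```

For the three-convolution stack above, each unit before global pooling sees a 9x9 patch of the input, which shows why strided layers grow the receptive field faster than stacking stride-1 layers alone.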

Q15. What is transfer learning for deep learning? When do you fine-tune vs use as feature extractor?

Strategy	Description	Use When
Feature extraction	Freeze all pre-trained layers; train only head	Target dataset very small (<1K samples); similar domain
Fine-tune top layers	Freeze bottom layers; fine-tune top N layers + head	Medium dataset; similar domain
Full fine-tuning	Unfreeze all layers; train with small LR	Large dataset; or different domain
LoRA/QLoRA	Freeze base, add low-rank adapters	LLM fine-tuning (standard in 2026)

import torch
import torchvision.models as models
import torch.nn as nn

# Feature extraction
backbone = models.efficientnet_b0(weights='IMAGENET1K_V1')
for param in backbone.parameters():
    param.requires_grad = False

# Replace head
n_features = backbone.classifier[1].in_features
backbone.classifier = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(n_features, 5)   # 5-class problem
)
# Only classifier parameters have requires_grad=True

# Fine-tune last 2 blocks + head
for name, param in backbone.named_parameters():
    if 'features.7' in name or 'features.8' in name or 'classifier' in name:
        param.requires_grad = True

optimizer = torch.optim.AdamW([
    {'params': backbone.features.parameters(), 'lr': 1e-5},  # low LR for backbone
    {'params': backbone.classifier.parameters(), 'lr': 1e-3} # high LR for head
])

Q16. How does an autoencoder work? What is a VAE?

Autoencoder: Encoder compresses input to latent vector z; Decoder reconstructs input from z. Trained to minimize reconstruction error. Forces the network to learn a compressed representation.

Variational Autoencoder (VAE): Encoder outputs a distribution N(mu, sigma^2) over z, not a point. z is sampled from this distribution. Loss = reconstruction error + KL divergence from prior N(0,I). Enables generation of new data by sampling from the prior.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder
        self.fc1    = nn.Linear(input_dim, hidden_dim)
        self.fc_mu  = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_var(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std     # differentiable sampling

    def decode(self, z):
        return torch.sigmoid(self.fc4(F.relu(self.fc3(z))))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        return recon, mu, logvar

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    bce  = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    kl   = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + beta * kl
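The KL term in `vae_loss` is the closed-form divergence between a diagonal Gaussian N(mu, sigma^2) and the standard normal prior. As a sanity check, the sketch below compares that expression with `torch.distributions.kl_divergence`; the batch and latent sizes are arbitrary.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(4, 20)
logvar = torch.randn(4, 20)

# Closed form used in vae_loss: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
closed_form = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# Reference value from torch.distributions
q = Normal(mu, torch.exp(0.5 * logvar))                  # encoder posterior
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))    # prior N(0, I)
reference = kl_divergence(q, p).sum()

print(closed_form.item(), reference.item())
```

Having the encoder predict `logvar` rather than the variance itself keeps sigma positive without a constraint and tends to be more stable numerically.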

Q17. What is a GAN? How does training work and what are the common failure modes?

Generator G: Takes random noise z, outputs fake samples
Discriminator D: Takes a sample (real or fake), outputs P(real)

Objective (minimax):

min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]

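import torch
import torch.nn.functional as F

# Training loop sketch: generator, discriminator, D_optimizer, G_optimizer, batch_size and latent_dim
# are assumed to be defined, and the discriminator is assumed to end in a sigmoid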
for real_batch in dataloader:
    # Train Discriminator
    D_optimizer.zero_grad()
    real_output = discriminator(real_batch)
    d_real_loss = F.binary_cross_entropy(real_output, torch.ones_like(real_output))

    z = torch.randn(batch_size, latent_dim)
    fake = generator(z).detach()
    fake_output = discriminator(fake)
    d_fake_loss = F.binary_cross_entropy(fake_output, torch.zeros_like(fake_output))
    (d_real_loss + d_fake_loss).backward()
    D_optimizer.step()

    # Train Generator (non-saturating loss: maximize log D(G(z)) rather than minimize log(1 - D(G(z))))
    G_optimizer.zero_grad()
    z = torch.randn(batch_size, latent_dim)
    fake_output = discriminator(generator(z))
    g_loss = F.binary_cross_entropy(fake_output, torch.ones_like(fake_output))
    g_loss.backward()
    G_optimizer.step()

Failure modes:

Mode	Symptom	Fix
Mode collapse	Generator produces only one or few samples	Minibatch discrimination, unrolled GANs
Training instability	Loss oscillates wildly	Gradient penalty (WGAN-GP), spectral norm
Discriminator wins too fast	Generator receives zero gradients	Balance update frequency

2026 status: Diffusion models have largely replaced GANs for image generation (Stable Diffusion, DALL-E, Midjourney). GANs still appear in video and real-time generation.

Q18. What is the transformer attention complexity? How does FlashAttention solve the memory problem?

Standard attention complexity:

Time: O(n^2 * d)
Memory: O(n^2) for the attention matrix

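For n=4096, d=64 in FP16, one head's attention matrix is 4096^2 * 2 bytes = 32 MiB. With 32 heads, that is 1 GiB per layer for a single sequence. Training keeps these matrices for the backward pass, so a 96-layer model would need roughly 96 GiB for attention scores alone, more than an 80 GB GPU can hold.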
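The arithmetic is easy to reproduce and to rerun for other sequence lengths. The helper `attention_matrix_bytes` below is a hypothetical name, and it assumes 2 bytes per element (FP16 or BF16) and one sequence.

```python
def attention_matrix_bytes(seq_len, n_heads=1, bytes_per_elem=2):
    """Size of the materialized n x n attention scores for one sequence."""
    return seq_len * seq_len * n_heads * bytes_per_elem

MIB = 1024 ** 2
per_head = attention_matrix_bytes(4096) / MIB                 # one head
per_layer = attention_matrix_bytes(4096, n_heads=32) / MIB    # 32 heads
doubled = attention_matrix_bytes(8192, n_heads=32) / MIB      # 2x the sequence length

print(per_head, per_layer, doubled)
```

Doubling the sequence length quadruples the memory, which is the quadratic growth that motivates avoiding the full matrix.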

FlashAttention (Dao et al. 2022):

Avoids materializing the full n x n attention matrix in HBM (GPU main memory)
Uses tiling: processes blocks of Q, K, V that fit in SRAM (fast on-chip memory)
Uses online softmax computation: maintains running max and sum to compute softmax exactly
Result: Same mathematical output, O(n) memory instead of O(n^2)
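The online softmax idea can be illustrated in plain PyTorch. The sketch below is not FlashAttention itself (which is a fused GPU kernel that also tiles over queries); it only shows, under simplified assumptions, how a running max and running sum allow K and V to be processed block by block while still producing the exact softmax result. The function name `tiled_attention` and the block size are illustrative.

```python
import torch

def tiled_attention(q, k, v, block=64):
    """q: [T_q, d], k and v: [T_k, d]. Never builds the full [T_q, T_k] score matrix."""
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float('-inf'))   # running row-wise max
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    acc = torch.zeros(q.shape[0], v.shape[-1])       # running unnormalized output
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = (q @ k_blk.t()) * scale                  # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)            # rescale earlier partial results
        p = torch.exp(s - m_new)
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l

torch.manual_seed(0)
q, k, v = torch.randn(128, 32), torch.randn(256, 32), torch.randn(256, 32)
reference = torch.softmax((q @ k.t()) / 32 ** 0.5, dim=-1) @ v
tiled = tiled_attention(q, k, v, block=64)
print(torch.allclose(tiled, reference, atol=1e-4))
```

Only one tile of scores exists at any time, so the extra memory scales with the block size rather than with the full sequence length.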

import torch

# PyTorch 2.x can dispatch scaled_dot_product_attention to a FlashAttention kernel
Q = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')   # [B, heads, T, d_k]
K = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')
V = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')

# Restrict dispatch to the FlashAttention backend (needs CUDA + float16/bfloat16).
# torch.nn.attention.sdpa_kernel replaces the deprecated torch.backends.cuda.sdp_kernel
from torch.nn.attention import SDPBackend, sdpa_kernel
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(Q, K, V, is_causal=True)

Q19. What is mixed precision training? How do bfloat16 and float16 differ?

Format	Bits	Range	Precision	Best For
FP32	32	Large	High	Reference, optimizer states
FP16	16	65504 max	Medium	Inference, some training
BF16	16	Same as FP32	Lower (7-bit mantissa)	LLM training (A100, H100)

BF16 vs FP16: BF16 has the same exponent range as FP32 (8 bits) so overflow/underflow is rare. FP16 has a narrow range (5-bit exponent), requiring loss scaling.
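The range and precision trade-off can be inspected directly with `torch.finfo`. The value 70000.0 below is an arbitrary number just above the FP16 maximum, chosen for illustration.

```python
import torch

fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)
print("fp16 max:", fp16.max, "eps:", fp16.eps)
print("bf16 max:", bf16.max, "eps:", bf16.eps)

big = torch.tensor(70000.0)
fp16_overflow = big.to(torch.float16)   # exceeds the FP16 range, becomes inf
bf16_ok = big.to(torch.bfloat16)        # stays finite, but is rounded coarsely
print(fp16_overflow, bf16_ok)
```

FP16 has the smaller `eps` (finer precision) but overflows early, while BF16 trades precision for range. That is why FP16 training needs loss scaling and BF16 training usually does not.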

import torch
from torch.amp import autocast, GradScaler

model = model.to('cuda')
scaler = GradScaler('cuda')   # only needed for FP16; not needed for BF16

# Option A: FP16 with loss scaling
for batch in dataloader:
    optimizer.zero_grad()
    with autocast('cuda', dtype=torch.float16):
        output = model(batch['x'])
        loss = criterion(output, batch['y'])
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)   # unscale first so clipping sees the true gradient norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

# Option B: BF16 (cleaner, no loss scaling)
for batch in dataloader:
    optimizer.zero_grad()
    with autocast('cuda', dtype=torch.bfloat16):
        output = model(batch['x'])
        loss = criterion(output, batch['y'])
    loss.backward()
    optimizer.step()

Q20. What is LoRA? How does it reduce parameters for fine-tuning?

W' = W + delta_W = W + B * A
where W in R^(d x k), B in R^(d x r), A in R^(r x k), r << min(d, k)

Parameter savings: A 4096 x 4096 weight matrix has 16.7M parameters. With rank 16, B and A together have 4096*16 + 16*4096 = 131K parameters, a reduction of 128x.
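A minimal LoRA layer makes the parameter count concrete. This is a sketch under simplifying assumptions (no dropout, no bias, no weight merging), not the `peft` implementation; the class name `LoRALinear` and the initialization scale for A are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False            # frozen pretrained W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zero init: delta_W = 0 at the start
        self.scale = alpha / r

    def forward(self, x):
        # W x + (alpha / r) * B A x, without ever forming the full d_out x d_in update
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

layer = LoRALinear(4096, 4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen, frozen // trainable)
```

Because B starts at zero, the adapted layer reproduces the frozen layer exactly before training, so fine-tuning starts from the pretrained behavior.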

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3-8b',
                                              torch_dtype=torch.bfloat16,
                                              device_map='auto')

lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling: delta_W is multiplied by lora_alpha / r
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~41M || all params: ~8B || trainable%: ~0.5%

Q21. Explain the concept of knowledge distillation with a PyTorch implementation.

import torch.nn.functional as F
import torch

def distillation_loss(student_logits, teacher_logits, true_labels,
                       T=4.0, alpha=0.7):
    """
    T: temperature (higher = softer distribution, more info transfer)
    alpha: weight for soft target loss (1-alpha for hard label loss)
    """
    # Soft target loss (KL divergence)
    soft_targets  = F.softmax(teacher_logits / T, dim=-1)
    soft_student  = F.log_softmax(student_logits / T, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_targets, reduction='batchmean') * (T ** 2)

    # Hard label loss
    ce_loss = F.cross_entropy(student_logits, true_labels)

    return alpha * kd_loss + (1 - alpha) * ce_loss


# Training loop with teacher-student
teacher.eval()
for batch_x, batch_y in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(batch_x)   # teacher inference, no grad

    student_logits = student(batch_x)
    loss = distillation_loss(student_logits, teacher_logits, batch_y, T=4.0, alpha=0.7)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Q22. What is multi-task learning and when does it improve performance?

Benefits:

Prevents overfitting via auxiliary tasks
Fewer total parameters than N separate models
Task A can provide useful gradient signal for task B

class MultiTaskModel(nn.Module):
    def __init__(self, d_model=512, n_classes_task1=3, n_classes_task2=5):
        super().__init__()
        # Shared backbone
        self.backbone = nn.Sequential(
            nn.Linear(100, d_model),
            nn.LayerNorm(d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model)
        )
        # Task-specific heads
        self.head_task1 = nn.Linear(d_model, n_classes_task1)
        self.head_task2 = nn.Linear(d_model, n_classes_task2)

    def forward(self, x):
        shared = self.backbone(x)
        return self.head_task1(shared), self.head_task2(shared)

def mtl_loss(logits1, logits2, y1, y2, lambda1=1.0, lambda2=0.5):
    loss1 = F.cross_entropy(logits1, y1)
    loss2 = F.cross_entropy(logits2, y2)
    return lambda1 * loss1 + lambda2 * loss2

Works best when tasks are related (e.g., named entity recognition + part-of-speech tagging, or classification + auxiliary self-supervised task). Divergent tasks hurt each other.

HARD: Advanced Topics (Questions 23-30)

Q23. What is quantization? Compare PTQ and QAT.

Method	Full Name	How	Quality	Speed
PTQ	Post-Training Quantization	Quantize after training; calibrate on small dataset	Lower	Fastest
QAT	Quantization-Aware Training	Simulate quantization during training; fine-tune	Higher	Slower but better
GPTQ	Post-training quantization for GPT-style LLMs	Minimize weight reconstruction error per layer	High for LLMs	Standard in 2026
AWQ	Activation-aware Weight Quant	Scale weights based on salient activations	Better than GPTQ	Standard in 2026

# PyTorch dynamic quantization (simplest PTQ)
import torch
import torch.nn as nn
import torch.quantization

model_int8 = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},           # quantize only linear layers
    dtype=torch.qint8
)

# bitsandbytes 4-bit loading (LLM standard in 2026)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained('model_name',
                                              quantization_config=bnb_config)
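Underneath these library calls, the core operation is mapping floats to a small integer grid and back. The sketch below shows symmetric per-tensor INT8 quantization, a deliberately simple scheme chosen for illustration; GPTQ, AWQ, and NF4 use more elaborate grouping and scaling. The function names are hypothetical.

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = (w - w_hat).abs().max()

print(q.dtype, q.element_size(), "byte per weight")
print("max abs error:", max_err.item(), "half step:", (scale / 2).item())
```

The worst-case rounding error is half a quantization step. A single outlier weight inflates the scale for the whole tensor, which is one reason practical methods quantize per channel or per group.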

Q24. What is structured vs unstructured pruning?

Type	What is removed	Speed on Hardware	Quality loss
Unstructured	Individual weights (sparse matrix)	Minimal without sparse hardware	Minimal
Structured	Entire neurons, heads, or layers	Immediate (dense matrix shrinks)	Moderate
Magnitude pruning	Smallest-magnitude weights	Depends on type	Low
Lottery Ticket Hypothesis	Retrain sparse subnetwork from scratch	Experimental	Low

import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured L1 magnitude pruning (30% of weights zeroed)
prune.l1_unstructured(model.fc1, name='weight', amount=0.3)

# Structured pruning: remove entire rows (neurons) of a linear layer
prune.ln_structured(model.fc1, name='weight', amount=0.3, n=2, dim=0)

# Make pruning permanent (remove mask; recompute weight)
prune.remove(model.fc1, 'weight')

# Global pruning: prune 20% of all weights across the whole model
parameters_to_prune = [(layer, 'weight') for layer in model.modules()
                        if isinstance(layer, nn.Linear)]
prune.global_unstructured(parameters_to_prune,
                           pruning_method=prune.L1Unstructured,
                           amount=0.2)

Q25. Explain the training stability tricks for large language models.

Trick	Description	Purpose
Pre-LN (pre-norm)	LayerNorm before attention/FFN, not after	More stable gradients vs post-LN
QK-Norm	Normalize Q and K before dot-product	Prevents logit growth, entropy collapse
Gradient clipping	Clip norm to 1.0	Prevents single bad batch from destroying training
Weight tying	Share embedding and output projection weights	Reduces parameters, improves language modeling
Z-loss	Penalize large logit magnitudes	Prevents softmax saturation
Warmup LR	Linear ramp for first 1-5% of steps	Prevents early instability

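import torch
import torch.nn as nn
import torch.nn.functional as F

# Z-loss (prevents entropy collapse in MoE routing and output softmax)
def z_loss(logits, z_loss_coef=0.001):
    log_z = torch.logsumexp(logits, dim=-1)
    return z_loss_coef * log_z.pow(2).mean()

# QK-Norm implementation (nn.RMSNorm requires a recent PyTorch 2.x release)
class AttentionWithQKNorm(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.d_k)
        self.k_norm = nn.RMSNorm(self.d_k)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        # Project once, then split into heads: each of q, k, v is [B, n_heads, T, d_k]
        q, k, v = self.qkv(x).view(B, T, 3, self.n_heads, self.d_k).permute(2, 0, 3, 1, 4)
        # Apply per-head normalization to Q and K before the dot product
        q, k = self.q_norm(q), self.k_norm(k)
        # For a boolean attn_mask, True marks positions that may be attended to
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(out.transpose(1, 2).reshape(B, T, D))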

Q26. What is RLHF and DPO? How are they used to align LLMs?

RLHF (Reinforcement Learning from Human Feedback):

1. Supervised fine-tuning (SFT) on curated demonstrations
2. Train a reward model on human preference pairs (chosen > rejected)
3. PPO optimization: maximize reward while staying close to the SFT policy (KL penalty)
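
The reward-model step is commonly trained with a pairwise (Bradley-Terry style) loss that pushes the score of the chosen response above the rejected one. The sketch below assumes the reward model already returns one scalar per response; `reward_model_loss` and the toy reward values are hypothetical names and numbers used for illustration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), batch mean."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical scalar rewards for a batch of 3 preference pairs
r_chosen = torch.tensor([2.0, 0.5, 1.0])
r_rejected = torch.tensor([0.0, 0.5, 3.0])
loss = reward_model_loss(r_chosen, r_rejected)
print(f"pairwise loss: {loss.item():.4f}")
```

When the two rewards are equal the per-pair loss is log(2); it approaches zero as the chosen reward pulls ahead and grows when the ranking is wrong, as in the third pair.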

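DPO (Direct Preference Optimization): Eliminates the separate reward model and the RL loop. It optimizes the policy directly on preference pairs, where y_w is the chosen response and y_l is the rejected one: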

L_DPO = -E[log sigma(beta * (log pi_theta(y_w|x) - log pi_ref(y_w|x))
                      - beta * (log pi_theta(y_l|x) - log pi_ref(y_l|x)))]

# DPO loss implementation
import torch.nn.functional as F

def dpo_loss(policy_logprobs_chosen, policy_logprobs_rejected,
             ref_logprobs_chosen, ref_logprobs_rejected, beta=0.1):
    """
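    policy_*: per-sequence (summed over tokens) log probabilities from the model being trained
    ref_*:    the same quantities from the frozen reference (SFT) model
    beta:     strength of the implicit KL constraint toward the reference
    """
    # Chosen-vs-rejected log-ratios under the policy and under the reference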
    pi_logratios = policy_logprobs_chosen - policy_logprobs_rejected
    ref_logratios = ref_logprobs_chosen - ref_logprobs_rejected

    # DPO objective
    logits = beta * (pi_logratios - ref_logratios)
    loss = -F.logsigmoid(logits).mean()
    return loss

Why DPO in 2026: DPO is simpler (no RL loop, no separate reward model), is generally more stable to train than PPO, and can reach comparable alignment quality. HuggingFace's TRL library provides a DPOTrainer.
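
The inputs to the DPO loss are per-sequence log probabilities, which have to be built from token-level logits. The sketch below shows one way to do that and checks a handy sanity value. It assumes the logits are already shifted so that `logits[:, t]` scores `labels[:, t]`; the helper name `sequence_logprob` and the toy tensors are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """
    logits: [B, T, V] next-token logits, aligned so logits[:, t] scores labels[:, t]
    labels: [B, T] token ids
    mask:   [B, T] 1.0 for response tokens, 0.0 for prompt/padding
    Returns [B]: summed log-probability of the response tokens.
    """
    token_logp = F.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(dim=-1)

# Toy batch: row 0 = chosen (3 response tokens), row 1 = rejected (2 response tokens).
# Uniform logits over V=4 tokens give log(1/4) per token.
logits = torch.zeros(2, 3, 4)
labels = torch.tensor([[0, 2, 1], [3, 1, 0]])
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
policy_lp = sequence_logprob(logits, labels, mask)
ref_lp = policy_lp.clone()   # at step 0 the policy is a copy of the reference

# With identical policy and reference, the DPO logit is 0 and the loss is log(2)
beta = 0.1
dpo_logit = beta * ((policy_lp[0] - policy_lp[1]) - (ref_lp[0] - ref_lp[1]))
loss_at_init = -F.logsigmoid(dpo_logit)
print(policy_lp, loss_at_init.item(), math.log(2))
```

A first-step DPO loss that is far from log(2) usually points to a mismatch between the policy and reference log probabilities, for example different masking or a reference model left in training mode.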

Q27. How does speculative decoding accelerate LLM inference?

Speculative decoding uses a small fast "draft" model to propose k tokens, then verifies all k in a single pass of the large model:

1. Draft model proposes tokens t_1, ..., t_k (k serial small-model passes)
2. Large model verifies all k tokens in ONE forward pass (k tokens in parallel)
3. Accept the longest prefix of draft tokens the large model agrees with; at the first disagreement, discard the remaining drafts and resample that position from the large model
4. Speedup: ~2-3x for typical draft acceptance rates of 70-90%

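# Simplified speculative decoding loop (greedy verification, batch size 1)
import torch

@torch.no_grad()
def speculative_decode(draft_model, target_model, prompt, max_new=100, k=5):
    input_ids = prompt                     # [1, T]
    prompt_len = prompt.shape[1]

    while input_ids.shape[1] - prompt_len < max_new:
        # Draft k tokens greedily (k serial small-model passes)
        ids = input_ids
        for _ in range(k):
            draft_logits = draft_model(ids).logits[:, -1]
            draft_tok = draft_logits.argmax(-1)
            ids = torch.cat([ids, draft_tok.unsqueeze(1)], dim=1)
        draft_tokens = ids[:, -k:]         # [1, k]

        # Verify with target model in ONE forward pass.
        # The last k+1 positions score the k drafts plus one extra token.
        target_logits = target_model(ids).logits[:, -k-1:]
        target_tokens = target_logits.argmax(-1)           # [1, k+1]

        # Accept the longest prefix on which the target agrees with the draft
        agree = (target_tokens[:, :k] == draft_tokens)[0].long()
        n_accept = int(agree.cumprod(0).sum())
        # Append accepted drafts plus one target token: the correction at the
        # first disagreement, or a bonus token if all k drafts were accepted
        input_ids = torch.cat([input_ids, draft_tokens[:, :n_accept],
                               target_tokens[:, n_accept:n_accept + 1]], dim=1)
        # With sampling instead of greedy decoding, the accept/reject step
        # compares draft and target token probabilities (rejection sampling)

    return input_ids[:, prompt_len:prompt_len + max_new]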
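
The quoted 2-3x speedup can be sanity-checked with a back-of-the-envelope model. The sketch below assumes each draft token is accepted independently with probability `alpha`, and that one draft pass costs a fixed fraction `draft_cost` of one target pass. Both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target forward pass with k drafted tokens.

    Geometric sum 1 + alpha + ... + alpha^k: the accepted prefix plus the one
    token the target model always contributes.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    """Tokens per pass divided by the relative cost of k draft passes + 1 target pass."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)

for alpha in (0.7, 0.8, 0.9):
    tokens = expected_tokens_per_pass(alpha, k=5)
    speedup = estimated_speedup(alpha, k=5, draft_cost=0.05)
    print(f"alpha={alpha}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x speedup")
```

With an 80% acceptance rate and k=5 this gives about 3.7 tokens per target pass, and roughly a 2.95x speedup if a draft pass costs 5% of a target pass. Real systems see less because acceptance is not independent across positions and verification is not free.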

Q28. Explain distributed training: data parallelism, tensor parallelism, and FSDP.

Strategy	What is sharded	Typical use	Overhead / notes
Data Parallelism (DDP)	Data only; full model replica on each GPU	Model fits on one GPU; throughput scales near-linearly with GPUs	Gradient all-reduce every step
Tensor Parallelism (TP)	Weight matrices split column/row-wise	LLM attention/FFN layers	Complex; needs tensor-aware code
Pipeline Parallelism (PP)	Model layers split into stages	Very deep models	Micro-batching required
ZeRO Stage 3	Optimizer states + gradients + params	Large models	Higher communication
FSDP (PyTorch)	Full model sharded; params gathered on demand	Large models	Native PyTorch, standard in 2026

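import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    CPUOffload, FullyShardedDataParallel as FSDP, ShardingStrategy
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Initialize distributed backend (launch with torchrun so LOCAL_RANK is set)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# MyLargeModel and MyTransformerBlock are placeholders for your own classes.
# The policy wraps each transformer block as its own FSDP unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},
)

model = MyLargeModel().to(local_rank)
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,    # ZeRO-3 equivalent
    cpu_offload=CPUOffload(offload_params=True),       # offload to CPU RAM (saves GPU memory, slows steps)
    auto_wrap_policy=wrap_policy,
    device_id=local_rank
)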
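
A quick memory estimate shows why full sharding matters. The sketch below assumes mixed-precision Adam at roughly 16 bytes of model state per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights, momentum, and variance), and ignores activations, buffers, and fragmentation. The function name and the 7B / 8-GPU figures are illustrative choices.

```python
def model_state_gb(n_params, n_gpus=1, sharded=False, bytes_per_param=16):
    """Approximate per-GPU memory (GB) for weights + gradients + Adam states.

    sharded=False models DDP (every GPU holds everything);
    sharded=True models FSDP FULL_SHARD / ZeRO-3 (state split evenly across GPUs).
    """
    total_bytes = n_params * bytes_per_param
    if sharded:
        total_bytes /= n_gpus
    return total_bytes / 1e9

n_params = 7e9   # a 7B-parameter model
ddp_gb = model_state_gb(n_params, n_gpus=8, sharded=False)
fsdp_gb = model_state_gb(n_params, n_gpus=8, sharded=True)
print(f"DDP:  {ddp_gb:.0f} GB per GPU")
print(f"FSDP: {fsdp_gb:.0f} GB per GPU")
```

Under these assumptions DDP needs 112 GB of model state on every GPU, which exceeds an 80 GB card, while full sharding across 8 GPUs brings it down to 14 GB each.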

Q29. What is contrastive learning? Explain CLIP and SimCLR.

SimCLR (self-supervised for vision):

Two augmented views of the same image = positive pair
Views from different images = negative pairs
NT-Xent (normalized temperature-scaled cross-entropy) loss

CLIP (OpenAI, multimodal):

Positive pair: (image, its text caption)
Negative pairs: (image, all other captions in batch)
Train image encoder + text encoder together on 400M image-text pairs
Creates aligned image-text embedding space

import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """
    z1, z2: [N, D] normalized embeddings of two augmented views
    """
    N = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)   # [2N, D]
    # Cosine similarity matrix
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=-1) / temperature
    # Mask out diagonal (self-similarity)
    mask = torch.eye(2*N, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))
    # Labels: for sample i, positive is at i+N (and vice versa)
    labels = torch.cat([torch.arange(N, 2*N), torch.arange(0, N)]).to(z.device)
    return F.cross_entropy(sim, labels)
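
The CLIP objective described above can be sketched in the same style: a symmetric cross-entropy over the image-to-text similarity matrix, where the matching caption sits on the diagonal. This is a simplified illustration. CLIP learns its temperature during training, whereas the fixed value here is an assumption, and `clip_loss` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """
    image_emb, text_emb: [N, D]; row i of each comes from the same (image, caption) pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # [N, N] cosine similarities; entry (i, j) compares image i with caption j
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # pick the right caption per image
    loss_t2i = F.cross_entropy(logits.t(), targets)   # pick the right image per caption
    return (loss_i2t + loss_t2i) / 2

# Toy check with 4 orthogonal "embeddings"
emb = torch.eye(4)
aligned = clip_loss(emb, emb)                 # every pair matches
misaligned = clip_loss(emb, emb.roll(1, 0))   # captions shifted by one row
print(aligned.item(), misaligned.item())
```

Unlike NT-Xent, there is no diagonal mask here: the two modalities come from different encoders, so the diagonal holds the positives rather than trivial self-similarities.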

Q30. How do you debug a deep learning model that is not training?

1. Check data pipeline first
   - Visualize a batch: print shapes, check label distribution, spot-check samples
   - Verify normalization: mean~0, std~1 after preprocessing

2. Check that loss decreases on a single batch
   - If loss does NOT decrease on 1 batch: bug in forward pass or loss
   - If loss decreases on 1 batch but NOT across epochs: data pipeline issue

3. Check gradients
   - Any NaN? -> exploding gradients or an invalid op such as log(0) or 0/0; clip, reduce LR, add an epsilon
   - All zero? -> dying ReLUs, wrong loss, disconnected graph
   - Too small? -> vanishing gradients; add residual connections, LayerNorm, better init

4. Check learning rate
   - Too high: loss oscillates or diverges
   - Too low: loss decreases but very slowly
   - Use an LR range test (Leslie Smith) to pick the peak LR, then a 1cycle schedule

5. Check label correctness
   - Common bug: labels are 1-indexed when the loss expects 0-indexed class ids
   - Common bug: label tensor dtype should be torch.long for CrossEntropyLoss

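# Gradient checking utility (run right after loss.backward())
for name, param in model.named_parameters():
    if param.grad is None:
        # Trainable parameter that never received a gradient
        if param.requires_grad:
            print(f"NO grad (disconnected from loss?): {name}")
        continue
    grad_norm = param.grad.norm().item()
    # Check NaN/Inf first: comparisons against NaN are always False
    if not torch.isfinite(param.grad).all():
        print(f"NaN/Inf grad: {name}")
    elif grad_norm == 0:
        print(f"ZERO grad: {name}")
    elif grad_norm > 100:
        print(f"LARGE grad: {name} = {grad_norm:.2f}")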

# Overfit one batch test
x_one, y_one = next(iter(dataloader))
for step in range(200):
    optimizer.zero_grad()
    pred = model(x_one)
    loss = criterion(pred, y_one)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"Step {step}: loss={loss.item():.4f}")
# If loss does not approach zero after 200 steps: suspect the forward pass, loss function, or optimizer setup
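
Steps 1 and 5 of the checklist (data pipeline and label correctness) can be automated with a small batch checker. The sketch below is one possible version: the helper name `check_batch` and the thresholds for "looks normalized" are arbitrary assumptions to adjust for your data.

```python
import torch

def check_batch(x, y, num_classes):
    """Return a list of human-readable problems found in one (inputs, labels) batch."""
    problems = []
    if not torch.isfinite(x).all():
        problems.append("inputs contain NaN or Inf")
    mean, std = x.float().mean().item(), x.float().std().item()
    # Loose bounds around mean~0, std~1; tune for your preprocessing
    if abs(mean) > 1.0 or not (0.1 < std < 10.0):
        problems.append(f"inputs look unnormalized: mean={mean:.2f}, std={std:.2f}")
    if y.dtype != torch.long:
        problems.append(f"labels have dtype {y.dtype}, expected torch.long")
    if y.min() < 0 or y.max() >= num_classes:
        problems.append("labels fall outside [0, num_classes)")
    return problems

torch.manual_seed(0)
x = torch.randn(64, 3, 8, 8)                   # normalized dummy images
good_labels = torch.arange(64) % 10            # class ids 0..9
print(check_batch(x, good_labels, num_classes=10))
# Raw pixel scale plus 1-indexed labels should trigger two warnings
print(check_batch(x * 255, good_labels + 1, num_classes=10))
```

Running this on the first batch of every new dataset costs a few milliseconds and catches the two label bugs listed above before any training time is spent.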

Comparison Table: Architecture Choices in 2026

Task	Architecture	Framework	Notes
Image classification	EfficientNetV2, ViT-B	PyTorch	Pretrained on ImageNet-21K
Object detection	YOLOv9, RT-DETR	PyTorch	RT-DETR for production
NLP classification	BERT, DeBERTa-v3	HuggingFace	Fine-tune on domain data
Text generation	LLaMA-3, Mistral	HuggingFace	QLoRA fine-tuning
Speech	Whisper	HuggingFace	OpenAI Whisper-large-v3
Multimodal	CLIP, LLaVA	HuggingFace	Vision-language tasks
Tabular	LightGBM, XGBoost	Native	Still beats NNs on tabular
Time series	PatchTST, N-HiTS	PyTorch	Transformer-based TS models

FAQ

Q: PyTorch vs TensorFlow in 2026 interviews?

A: PyTorch is the safe answer. Most published research code uses PyTorch. TF 2.x is still used at some Google-adjacent teams and has Keras integration, but if asked to pick one, pick PyTorch.

Q: How do you debug NaN loss during training?

A: In order: check for zero-division in your loss, enable torch.autograd.detect_anomaly() during debugging, check gradient norms, reduce learning rate, check input normalization, add gradient clipping.
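
As a small illustration of the first two steps, the sketch below triggers a NaN gradient with a norm computed on an all-zero tensor, lets `torch.autograd.detect_anomaly()` report it, and then applies an epsilon fix. The function names `unsafe_norm` and `safe_norm` are hypothetical, and anomaly detection slows training noticeably, so it belongs in debugging runs only.

```python
import torch

def unsafe_norm(x):
    # At x = 0 the backward pass multiplies inf (from sqrt) by 0 (from pow): NaN
    return x.pow(2).sum().sqrt()

def safe_norm(x, eps=1e-12):
    # A small epsilon inside the sqrt keeps the gradient finite at x = 0
    return (x.pow(2).sum() + eps).sqrt()

x = torch.zeros(3, requires_grad=True)
caught = None
try:
    with torch.autograd.detect_anomaly():
        unsafe_norm(x).backward()
except RuntimeError as err:
    caught = str(err)   # names the backward function that produced NaN
print("anomaly:", caught)

x_safe = torch.zeros(3, requires_grad=True)
safe_norm(x_safe).backward()
print("safe grad:", x_safe.grad)
```

Note that the forward loss here is a perfectly valid 0.0; only the gradient is NaN, which is why inspecting the loss value alone can miss the problem.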

Q: What is the difference between an epoch and a step?

A: One step = one optimizer update on one mini-batch (forward pass, backward pass, weight update). One epoch = one pass over the entire dataset. Steps per epoch = ceil(dataset_size / batch_size), or floor if the last partial batch is dropped.
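
A quick way to check the arithmetic is to compare it against the length of a `DataLoader`. The dataset size, batch size, and epoch count below are arbitrary illustrative values.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(1000, 4))   # 1000 dummy samples
batch_size = 32

loader = DataLoader(dataset, batch_size=batch_size)                        # keeps the last partial batch
loader_drop = DataLoader(dataset, batch_size=batch_size, drop_last=True)   # drops it

steps_per_epoch = len(loader)
steps_per_epoch_drop = len(loader_drop)
total_steps = steps_per_epoch * 3               # 3 epochs

print(steps_per_epoch, math.ceil(len(dataset) / batch_size))
print(steps_per_epoch_drop, len(dataset) // batch_size)
print(total_steps)
```

The total step count is what learning rate schedulers such as warmup or cosine decay need, so getting the rounding right matters when configuring them.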

Q: When should I use a pre-trained model vs train from scratch?

A: Always start with a pre-trained model when one exists for your domain. Training from scratch is warranted only when: the domain is highly specialized (medical, satellite, proprietary signal), you have massive data (>10M labeled samples), or the architecture is custom.

Sources and review notes (reviewed 8 Jun 2026)

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.
