est. 2017
Sun, 27 Apr 2026
vol. IX · no. 117
PapersAdda
placement intelligence, since 2017
640+ briefs · 24 campuses · by reservation

Deep Learning Interview Questions 2026: 30 Answers with Code

30 min read
Interview Questions
Updated: 8 Jun 2026
Aditya Sharma
Aditya's Edit

PapersAdda 2026 Placement Cycle

By Aditya Sharma·Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for the 2026 batch: the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry; it is the entire interview.

What I'd actually study for this

  • 01 Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  • 02 One real project you can defend end-to-end: file paths, design decisions, and what you would change
  • 03 One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  • 04 Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.

Deep learning is no longer a niche specialization. In 2026, every ML interview at a product company includes at least five deep learning questions. Understanding how neural networks actually work, including the math behind backpropagation, the engineering behind transformers, and the tradeoffs in production deployment, separates candidates who clear rounds from those who don't. This guide covers 30 essential deep learning questions with complete PyTorch code.

PapersAdda's take: Memorizing that "batch normalization normalizes activations" is not enough. You need to know what goes wrong without it, why layer norm is preferred in transformers, and how to debug a training loop that is not converging. That depth is what this guide delivers. Candidates report that FAANG deep learning rounds often include a "debug this training loop" segment. According to candidate accounts from public preparation resources, transformer internals (attention math, positional encoding) appear in nearly every senior ML round at Google and Meta. Confirm the specific interview format on the official careers portal of your target company.



Which Companies Ask These Questions?

Topic Cluster | Companies
Backpropagation and Gradients | Google, Meta, Microsoft, Nvidia
CNN Architecture | Google DeepMind, Meta AI, Samsung
RNN and LSTM | Amazon Alexa, Microsoft Azure AI
Transformers and Attention | All frontier AI labs, OpenAI, Cohere
Training Tricks | Every company with an ML team
Model Compression | Mobile-first teams, Qualcomm, Apple
Distributed Training | Databricks, Nvidia, hyperscalers

EASY: Fundamentals (Questions 1-10)

Q1. What is a neural network? Explain forward propagation.

Layer output: a^{l} = f(W^{l} * a^{l-1} + b^{l})

Forward propagation: Pass input through each layer in sequence, compute activations, and produce a final output.

import torch
import torch.nn as nn

class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, dim),
                nn.LayerNorm(dim),
                nn.GELU(),
                nn.Dropout(dropout)
            ])
            prev_dim = dim
        layers.append(nn.Linear(prev_dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = FeedForwardNet(784, [512, 256, 128], 10)
x = torch.randn(32, 784)          # batch of 32, 784 features
logits = model(x)                  # forward pass
print(logits.shape)                # [32, 10]

Q2. Explain backpropagation with the chain rule.

Simple example:

import torch

# Computation graph: x -> a=Wx+b -> h=relu(a) -> y_hat=Wh -> loss=(y_hat-y)^2
x = torch.tensor([1.0, 2.0], requires_grad=False)
W1 = torch.tensor([[0.5, 0.3], [0.2, 0.4]], requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)
W2 = torch.tensor([[0.6, 0.7]], requires_grad=True)
y  = torch.tensor([1.0])

a = W1 @ x + b1          # linear
h = torch.relu(a)         # activation
y_hat = W2 @ h            # output
loss = (y_hat - y).pow(2) # MSE loss

loss.backward()            # backprop: PyTorch auto-differentiates
print("dL/dW1:", W1.grad)  # chain rule: dL/dy_hat * dy_hat/dh * dh/da * da/dW1
print("dL/dW2:", W2.grad)

Chain rule spelled out:

dL/dW1 = (dL/dy_hat) * (dy_hat/dh) * (dh/da) * (da/dW1)
        = 2*(y_hat-y) * W2 * relu'(a) * x

Q3. What are activation functions? Compare ReLU, GELU, and SwiGLU.

Activation | Formula | Range | Used In | Key Property
Sigmoid | 1/(1+e^-x) | (0,1) | Output layer | Vanishing gradients
Tanh | (e^x-e^-x)/(e^x+e^-x) | (-1,1) | RNN gates | Better than sigmoid
ReLU | max(0,x) | [0,inf) | CNNs, MLPs | Fast, dying ReLU risk
Leaky ReLU | max(0.01x, x) | (-inf,inf) | CNNs | Fixes dying ReLU
GELU | x*Phi(x) | approx. [-0.17,inf), smooth | BERT, GPT | Smooth, better than ReLU
SwiGLU | Swish(xW) * (xV), with Swish(u) = u*sigmoid(beta*u) | (-inf,inf), smooth | LLaMA, Mistral | Gated; widely used in recent LLMs

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(100)

relu   = F.relu(x)
gelu   = F.gelu(x)
silu   = F.silu(x)       # SiLU, also called Swish: x*sigmoid(x)

# SwiGLU: requires splitting the projection in two (gate mechanism)
class SwiGLU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim * 2)  # doubled projection

    def forward(self, x):
        x_proj, gate = self.proj(x).chunk(2, dim=-1)
        return x_proj * F.silu(gate)

Q4. What is weight initialization and why does it matter?

Initialization | Best For | Formula / Note
Xavier / Glorot | Sigmoid, Tanh | W ~ Uniform(-a, a), a = sqrt(6/(n_in+n_out))
He / Kaiming | ReLU and variants | W ~ Normal(0, std = sqrt(2/n_in))
Zero initialization | Never for weights | Symmetry breaking fails
Small random | Deprecated for deep nets | Too slow convergence

import torch.nn as nn

conv = nn.Conv2d(64, 128, 3)
linear = nn.Linear(512, 256)

# He initialization for ReLU
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
nn.init.zeros_(conv.bias)

# Xavier for transformers
nn.init.xavier_uniform_(linear.weight)
nn.init.zeros_(linear.bias)

# PyTorch 2.x: most modules auto-initialize correctly
# But in custom models, always initialize explicitly

Q5. What is batch normalization? How is it different from layer normalization?

Batch Normalization (BatchNorm):

  • Normalizes across the batch dimension for each feature
  • Statistics: mean and variance computed per feature across all samples in the batch
  • Requires large batch size to work well; requires separate treatment at inference (running statistics)
  • Used in: CNNs, MLPs

Layer Normalization (LayerNorm):

  • Normalizes across the feature dimension for each sample independently
  • Statistics: mean and variance computed per sample across all features
  • Works with batch size 1; identical behavior at train and inference
  • Used in: Transformers, RNNs, any variable-length or small-batch scenario
import torch
import torch.nn as nn

batch_norm = nn.BatchNorm2d(64)        # for CNN: normalizes across N,H,W per channel
layer_norm = nn.LayerNorm(768)         # for transformer: normalizes across 768 features per token
rms_norm   = nn.RMSNorm(768)           # variant: no mean subtraction; used in LLaMA (needs PyTorch 2.4+)

# What BatchNorm does:
# x_hat = (x - mean_per_channel) / std_per_channel
# out = gamma * x_hat + beta   (learnable scale/shift)

# At inference: use running_mean and running_var (computed during training via EMA)
# At training: use batch statistics
batch_norm.eval()   # switches from batch stats to running stats
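
The difference in normalization axis can be checked directly. The sketch below, a minimal illustration on a random `[batch, features]` tensor, recomputes both statistics by hand and compares them with the PyTorch modules (affine parameters are disabled so only the normalization is tested).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16)   # [batch, features]

# BatchNorm: one mean/variance per feature, computed across the batch (dim 0)
bn = nn.BatchNorm1d(16, affine=False)
bn.train()
bn_manual = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + bn.eps)

# LayerNorm: one mean/variance per sample, computed across features (dim 1)
ln = nn.LayerNorm(16, elementwise_affine=False)
ln_manual = (x - x.mean(dim=1, keepdim=True)) / torch.sqrt(
    x.var(dim=1, unbiased=False, keepdim=True) + ln.eps)

bn_out = bn(x)
ln_out = ln(x)

# LayerNorm does not depend on the other samples in the batch
ln_single = ln(x[:1])
```

Because `ln_single` matches the first row of `ln_out`, LayerNorm behaves the same at batch size 1, which is the property transformers rely on.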

Q6. What is dropout and how does it prevent overfitting?

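  1. Prevents co-adaptation: neurons can't rely on specific other neurons always being present
  2. Acts like training an ensemble of 2^n thinned networks
  3. At inference: all neurons are active. Classic dropout scales outputs by (1-p) at inference; PyTorch uses inverted dropout, which scales the kept activations by 1/(1-p) during training so inference needs no rescaling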
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),       # dropout after activation
            nn.Linear(dim * 4, dim),
            nn.Dropout(dropout)        # dropout before residual add
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.net(self.norm(x))  # pre-norm residual

# Concrete effect
model = nn.Linear(100, 50)
dropout = nn.Dropout(p=0.5)
x = torch.randn(32, 100)

dropout.train()   # dropout active: ~50% of outputs zeroed, the rest scaled by 1/(1-p) = 2
out_train = dropout(model(x))

dropout.eval()    # dropout disabled: acts as the identity, no rescaling needed
out_eval = dropout(model(x))
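
For intuition, inverted dropout can be written in a few lines. This is a sketch of the idea, not PyTorch's internal implementation; `inverted_dropout` is a hypothetical helper name.

```python
import torch

def inverted_dropout(x, p, training):
    """Zero each element with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return x                                   # inference: identity
    keep = (torch.rand_like(x) >= p).to(x.dtype)   # 1 = keep, 0 = drop
    return x * keep / (1.0 - p)                    # rescale so the expected value is unchanged

torch.manual_seed(0)
x = torch.ones(100_000)
train_out = inverted_dropout(x, p=0.5, training=True)
eval_out = inverted_dropout(x, p=0.5, training=False)

print(train_out.mean())   # close to 1.0: expectation is preserved
```

The rescaling during training is what lets inference skip any correction factor.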

Q7. How do you prevent exploding gradients?

Technique | Description | Where Used
Gradient clipping | Rescale gradients if norm exceeds threshold | RNNs, LSTMs, LLM training
Weight initialization | He/Xavier init prevents early-stage explosion | All networks
Batch/Layer normalization | Keeps activations in stable range | CNNs, transformers
Residual connections | Gradients flow directly through skip path | ResNet, transformers
Smaller learning rate | Reduces step size | General

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.LSTM(512, 512, num_layers=4, batch_first=True)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training loop with gradient clipping
for batch in dataloader:
    optimizer.zero_grad()
    out, _ = model(batch['x'])
    loss = criterion(out, batch['y'])
    loss.backward()

    # Clip gradient norm to 1.0 (standard for RNN/LLM training)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
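
What norm clipping does can be sketched by hand: compute one global L2 norm over all gradients, then rescale every gradient by the same factor when that norm exceeds the threshold. `clip_by_global_norm` is a hypothetical helper written for illustration; in practice use `torch.nn.utils.clip_grad_norm_`.

```python
import torch
import torch.nn as nn

def clip_by_global_norm(parameters, max_norm, eps=1e-6):
    """Rescale all gradients in place so their joint L2 norm is at most max_norm."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = torch.clamp(max_norm / (total_norm + eps), max=1.0)  # never scale up
    for g in grads:
        g.mul_(scale)
    return total_norm   # norm before clipping, useful for logging

torch.manual_seed(0)
layer = nn.Linear(4, 3)
loss = (layer(torch.randn(8, 4)) * 100).pow(2).mean()   # deliberately large loss
loss.backward()

norm_before = clip_by_global_norm(list(layer.parameters()), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length changes.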

Q8. What is the difference between SGD, Adam, and AdamW? Which do you use in 2026?

Optimizer | Update Rule | Memory | Best For
SGD | w -= lr * grad | O(params) | CNNs with carefully tuned LR + momentum
SGD + Momentum | Adds velocity term | O(2*params) | Better convergence than vanilla SGD
Adam | Adaptive per-param LR from 1st and 2nd moments | O(3*params) | Default for most tasks
AdamW | Adam with decoupled weight decay | O(3*params) | Transformers, LLM fine-tuning (standard in 2026)
Sophia | Second-order (diagonal Hessian estimate) | O(3*params) | Emerging for LLM pre-training

import torch.optim as optim

# AdamW with cosine LR schedule -- standard for transformers in 2026
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),     # momentum coefficients
    eps=1e-8,
    weight_decay=0.01        # decoupled weight decay (shrinks weights directly, not an L2 term in the gradient)
)

# Cosine annealing with warmup
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=10000
)
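
To see what "decoupled" means, the sketch below writes one AdamW update by hand and compares three steps against `torch.optim.AdamW` on a toy quadratic loss. `adamw_step` is an illustrative helper, written under the assumption that the standard bias-corrected update is used.

```python
import torch

def adamw_step(w, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    b1, b2 = betas
    w = w * (1 - lr * weight_decay)        # decay applied to the weights, not added to grad
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad * grad    # second moment
    m_hat = m / (1 - b1 ** step)           # bias correction
    v_hat = v / (1 - b2 ** step)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)
    return w, m, v

torch.manual_seed(0)
w0 = torch.randn(5)
w_ref = w0.clone().requires_grad_(True)
opt = torch.optim.AdamW([w_ref], lr=1e-3, betas=(0.9, 0.999),
                        eps=1e-8, weight_decay=0.01)

w, m, v = w0.clone(), torch.zeros(5), torch.zeros(5)
for step in range(1, 4):
    opt.zero_grad()
    (w_ref ** 2).sum().backward()
    opt.step()
    # analytic gradient of sum(w^2) is 2w, evaluated at the manual weights
    w, m, v = adamw_step(w, 2 * w, m, v, step)
```

In Adam with an L2 penalty, the decay term would instead be added to `grad` and then rescaled by the adaptive denominator, which is the coupling AdamW removes.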

Q9. What is learning rate scheduling? What schedules do you use for different architectures?

Schedule | Description | Best For
Step decay | LR *= gamma every N steps | CNNs, simple MLPs
Cosine annealing | LR follows cosine curve from LR_max to LR_min | Transformers, general
Warmup + cosine | Linear warmup then cosine decay | LLM training (standard)
Cyclic LR | Oscillate LR between bounds | Finding optimal LR
One-cycle policy | LR climbs then falls; one cycle total | Fast training (fast.ai)
ReduceLROnPlateau | Reduce LR when val metric stops improving | When you don't know train steps

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine annealing (most common for CNNs)
scheduler_cosine = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle (best for fast training)
scheduler_onecycle = OneCycleLR(
    optimizer, max_lr=0.1,
    steps_per_epoch=len(dataloader),
    epochs=50,
    pct_start=0.3       # warmup takes 30% of steps
)

# In training loop (illustrative: in practice attach only one scheduler to a given optimizer)
for epoch in range(n_epochs):
    for batch in dataloader:
        # ... train step ...
        scheduler_onecycle.step()   # step each batch for OneCycleLR
    scheduler_cosine.step()         # step each epoch for CosineAnnealingLR
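
The warmup + cosine schedule from the table can also be written as a plain function of the step index, which is handy when asked to derive it on a whiteboard. This is a minimal sketch; `warmup_cosine_lr` and the chosen learning rates are assumptions for illustration.

```python
import math

def warmup_cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [warmup_cosine_lr(s, max_lr=3e-4, min_lr=3e-5,
                             warmup_steps=500, total_steps=10_000)
            for s in range(10_000)]
```

The same function can be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the base learning rate instead of the absolute value.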

Q10. What is gradient checkpointing and when is it used?

  • Memory savings: Reduces activation memory from O(n_layers) to O(sqrt(n_layers))
  • Compute cost: ~33% extra FLOPs (one extra forward pass per layer)
  • When to use: When training large models and memory is the bottleneck (always for LLM fine-tuning on a single GPU)
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerLayer(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn  = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim)
        )

    def forward(self, x):
        # gradient_checkpoint: recompute this during backward instead of storing
        attn_out = checkpoint(self.attn, x, x, x, use_reentrant=False)[0]
        return x + self.ffn(attn_out)

# For HuggingFace models: one line
model.gradient_checkpointing_enable()

MEDIUM: Architectures (Questions 11-22)

Q11. Explain the ResNet architecture and why residual connections matter.

Residual block:

output = F(x, {W_i}) + x

Instead of learning the mapping H(x), the network learns the residual F(x) = H(x) - x.

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                                stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                                padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)

        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        return self.relu(out + identity)   # skip connection

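Why it works: Gradients can flow directly through the skip connection, bypassing the layer-specific transformation. If the extra layers are not useful, the block can fall back to the identity by driving F(x) toward zero, so a deeper network should in principle train no worse than its shallower counterpart.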


Q12. What is self-attention and how does it compute relationships between tokens?

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

For each token:

  1. Compute a Query (what am I looking for?), Key (what do I contain?), Value (what do I return?)
  2. Score against every token in the sequence (including itself) via dot product
  3. Softmax to get attention weights (sum to 1)
  4. Weighted sum of Values
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v, mask=None):
    """x: [B, T, D], W_*: [D, d_k]"""
    Q = x @ W_q   # [B, T, d_k]
    K = x @ W_k   # [B, T, d_k]
    V = x @ W_v   # [B, T, d_v]

    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # [B, T, T]

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    weights = F.softmax(scores, dim=-1)   # [B, T, T]
    return weights @ V                    # [B, T, d_v]

Causal (autoregressive) attention: Mask the upper triangle to prevent each token from attending to future tokens. Used in GPT-style models.
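
A causal mask is just a lower-triangular matrix. The sketch below builds one with `torch.tril`, applies it by hand, and compares the result with PyTorch's built-in `scaled_dot_product_attention(..., is_causal=True)`. Shapes and variable names are chosen for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_k = 2, 5, 8
Q, K, V = (torch.randn(B, T, d_k) for _ in range(3))

# Lower-triangular mask: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(T, T))

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)          # masked positions get weight 0
out_manual = weights @ V

# Built-in equivalent; a singleton head dimension is added for the [B, heads, T, d_k] layout
out_builtin = F.scaled_dot_product_attention(
    Q.unsqueeze(1), K.unsqueeze(1), V.unsqueeze(1), is_causal=True
).squeeze(1)
```

The first token can only attend to itself, so its attention weight on position 0 is exactly 1.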


Q13. What is the difference between RNNs, LSTMs, and Transformers for sequence modeling?

Aspect | RNN | LSTM | Transformer
Long-range dependencies | Poor (vanishing gradients) | Better (cell state) | Excellent (direct attention)
Parallelizable (training) | No (sequential) | No (sequential) | Yes (full parallelism)
Memory in sequence length n | O(1) per step | O(1) per step | O(n^2) for the attention matrix (O(n) with FlashAttention)
2026 status | Deprecated for NLP | Legacy use (edge devices) | Dominant for all NLP/LLM
Context window | Effectively ~100 tokens | ~500 tokens | Up to 1M tokens (with FlashAttn)

import torch.nn as nn

# LSTM (still used in some production systems for low-latency inference)
lstm = nn.LSTM(input_size=256, hidden_size=512,
               num_layers=2, batch_first=True,
               dropout=0.1, bidirectional=True)

# Transformer encoder layer (modern approach)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8,
    dim_feedforward=2048,
    dropout=0.1,
    activation='gelu',
    batch_first=True,
    norm_first=True   # pre-norm (better convergence in 2026)
)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

Q14. How does a convolutional neural network process an image? Explain stride, padding, and receptive field.

  • Convolution: Slide filter over image, compute dot product at each position
  • Stride: How many pixels to skip between filter positions. Stride=2 halves spatial dimensions
  • Padding: Adds zeros around border. same padding keeps spatial size; valid padding shrinks it
  • Receptive field: The region of the input that influences one output neuron. Grows with depth
import torch
import torch.nn as nn

# Compute output size: floor((H + 2*padding - kernel) / stride) + 1
# Input: 32x32, kernel=3, padding=1, stride=1 -> output: 32x32 (same)
# Input: 32x32, kernel=3, padding=0, stride=2 -> output: 15x15

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1, stride=1),  # 3->32, 32x32
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1, stride=2), # 32->64, 16x16
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2),# 64->128, 8x8
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1,1)),  # global average pooling -> 1x1
    nn.Flatten(),
    nn.Linear(128, 10)
)

x = torch.randn(4, 3, 32, 32)
print(model(x).shape)  # [4, 10]
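
The output-size formula and the growth of the receptive field can be checked with a few lines of plain Python. The helper names below are hypothetical; the receptive-field recurrence assumes square kernels and no dilation.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """floor((size + 2*padding - kernel) / stride) + 1"""
    return (size + 2 * padding - kernel) // stride + 1

def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# The three conv layers of the model above: (kernel, stride, padding)
size = 32
for kernel, stride, padding in [(3, 1, 1), (3, 2, 1), (3, 2, 1)]:
    size = conv_output_size(size, kernel, stride, padding)

rf = receptive_field([(3, 1), (3, 2), (3, 2)])
print(size, rf)   # spatial size after the last conv, receptive field in input pixels
```

Strided layers multiply `jump`, which is why the receptive field grows faster after each downsampling step.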

Q15. What is transfer learning for deep learning? When do you fine-tune vs use as feature extractor?

Strategy | Description | Use When
Feature extraction | Freeze all pre-trained layers; train only head | Target dataset very small (<1K samples); similar domain
Fine-tune top layers | Freeze bottom layers; fine-tune top N layers + head | Medium dataset; similar domain
Full fine-tuning | Unfreeze all layers; train with small LR | Large dataset; or different domain
LoRA/QLoRA | Freeze base, add low-rank adapters | LLM fine-tuning (standard in 2026)

import torch
import torch.nn as nn
import torchvision.models as models

# Feature extraction
backbone = models.efficientnet_b0(weights='IMAGENET1K_V1')
for param in backbone.parameters():
    param.requires_grad = False

# Replace head
n_features = backbone.classifier[1].in_features
backbone.classifier = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(n_features, 5)   # 5-class problem
)
# Only classifier parameters have requires_grad=True

# Fine-tune last 2 blocks + head
for name, param in backbone.named_parameters():
    if 'features.7' in name or 'features.8' in name or 'classifier' in name:
        param.requires_grad = True

optimizer = torch.optim.AdamW([
    {'params': backbone.features.parameters(), 'lr': 1e-5},  # low LR for backbone
    {'params': backbone.classifier.parameters(), 'lr': 1e-3} # high LR for head
])

Q16. How does an autoencoder work? What is a VAE?

Autoencoder: Encoder compresses input to latent vector z; Decoder reconstructs input from z. Trained to minimize reconstruction error. Forces the network to learn a compressed representation.

Variational Autoencoder (VAE): Encoder outputs a distribution N(mu, sigma^2) over z, not a point. z is sampled from this distribution. Loss = reconstruction error + KL divergence from prior N(0,I). Enables generation of new data by sampling from the prior.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder
        self.fc1    = nn.Linear(input_dim, hidden_dim)
        self.fc_mu  = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_var(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std     # differentiable sampling

    def decode(self, z):
        return torch.sigmoid(self.fc4(F.relu(self.fc3(z))))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        return recon, mu, logvar

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    bce  = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    kl   = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + beta * kl
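
The KL term in `vae_loss` is the closed-form KL divergence between a diagonal Gaussian and the standard normal prior. As a sanity check, the sketch below compares it with `torch.distributions.kl_divergence` on random values of `mu` and `logvar`.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(4, 20)
logvar = torch.randn(4, 20)

# Closed form used in vae_loss: KL(N(mu, sigma^2) || N(0, I)), summed over all dimensions
kl_closed_form = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# Reference value from torch.distributions
q = Normal(mu, torch.exp(0.5 * logvar))                    # std = exp(logvar / 2)
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
kl_reference = kl_divergence(q, p).sum()
```

Predicting `logvar` rather than the variance keeps the encoder output unconstrained while guaranteeing a positive standard deviation.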

Q17. What is a GAN? How does training work and what are the common failure modes?

  • Generator G: Takes random noise z, outputs fake samples
  • Discriminator D: Takes a sample (real or fake), outputs P(real)

Objective (minimax):

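min_G max_D V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))]

# Training loop sketch (generator, discriminator, both optimizers, batch_size and latent_dim are defined elsewhere)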
for real_batch in dataloader:
    # Train Discriminator
    D_optimizer.zero_grad()
    real_output = discriminator(real_batch)
    d_real_loss = F.binary_cross_entropy(real_output, torch.ones_like(real_output))

    z = torch.randn(batch_size, latent_dim)
    fake = generator(z).detach()
    fake_output = discriminator(fake)
    d_fake_loss = F.binary_cross_entropy(fake_output, torch.zeros_like(fake_output))
    (d_real_loss + d_fake_loss).backward()
    D_optimizer.step()

    # Train Generator
    G_optimizer.zero_grad()
    z = torch.randn(batch_size, latent_dim)
    fake_output = discriminator(generator(z))
    g_loss = F.binary_cross_entropy(fake_output, torch.ones_like(fake_output))
    g_loss.backward()
    G_optimizer.step()

Failure modes:

Mode | Symptom | Fix
Mode collapse | Generator produces only one or few samples | Minibatch discrimination, unrolled GANs
Training instability | Loss oscillates wildly | Gradient penalty (WGAN-GP), spectral norm
Discriminator wins too fast | Generator receives zero gradients | Balance update frequency

2026 status: Diffusion models have largely replaced GANs for image generation (Stable Diffusion, DALL-E, Midjourney). GANs still appear in video and real-time generation.


Q18. What is the transformer attention complexity? How does FlashAttention solve the memory problem?

Standard attention complexity:

  • Time: O(n^2 * d)
  • Memory: O(n^2) for the attention matrix

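For n=4096 in FP16, the attention matrix is 4096^2 * 2 bytes = 32 MiB per head, independent of the head dimension d=64. With 32 heads, that is 1 GiB per layer for a single sequence. For a 96-layer model, keeping these matrices for the backward pass would need roughly 96 GiB per sequence, which is infeasible on a single GPU.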

FlashAttention (Dao et al. 2022):

  • Avoids materializing the full n x n attention matrix in HBM (GPU main memory)
  • Uses tiling: processes blocks of Q, K, V that fit in SRAM (fast on-chip memory)
  • Uses online softmax computation: maintains running max and sum to compute softmax exactly
  • Result: Same mathematical output, O(n) memory instead of O(n^2)
import torch

# PyTorch 2.x can dispatch scaled_dot_product_attention to a FlashAttention kernel
from torch.nn.attention import SDPBackend, sdpa_kernel   # PyTorch 2.3+

Q = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')   # [B, heads, T, d_k]
K = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')
V = torch.randn(2, 8, 512, 64, dtype=torch.float16, device='cuda')

# Restrict SDPA to the FlashAttention backend (needs CUDA + float16/bfloat16 inputs);
# without the context manager, PyTorch picks a backend automatically
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(Q, K, V, is_causal=True)
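
The online softmax trick can be illustrated on the CPU for a single query vector. The sketch below processes keys and values in blocks, keeping only a running max, a running sum, and an output accumulator. It shows the idea only, not the fused GPU kernel; `online_softmax_attention` and `block_size` are illustrative names.

```python
import torch

def online_softmax_attention(q, K, V, block_size=4):
    """q: [d], K: [n, d], V: [n, d_v]. Never materializes all n scores at once."""
    d = q.shape[-1]
    running_max = torch.tensor(float('-inf'))
    running_sum = torch.tensor(0.0)
    acc = torch.zeros(V.shape[-1])
    for start in range(0, K.shape[0], block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = k_blk @ q / d ** 0.5
        new_max = torch.maximum(running_max, scores.max())
        correction = torch.exp(running_max - new_max)   # rescale earlier partial results
        p = torch.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum()
        acc = acc * correction + p @ v_blk
        running_max = new_max
    return acc / running_sum

torch.manual_seed(0)
q, K, V = torch.randn(16), torch.randn(37, 16), torch.randn(37, 8)
out_online = online_softmax_attention(q, K, V)
out_reference = torch.softmax(K @ q / 16 ** 0.5, dim=0) @ V
```

The result is exact rather than approximate, which is why FlashAttention can be swapped in without changing model outputs beyond floating-point rounding.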

Q19. What is mixed precision training? How do bfloat16 and float16 differ?

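Format | Bits | Range | Precision | Best For
FP32 | 32 | Large (8-bit exponent) | High (23 mantissa bits) | Reference, optimizer states
FP16 | 16 | 65504 max (5-bit exponent) | Medium (10 mantissa bits) | Inference, some training
BF16 | 16 | Same as FP32 (8-bit exponent) | Lower (7 mantissa bits) | LLM training (A100, H100)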

BF16 vs FP16: BF16 has the same exponent range as FP32 (8 bits) so overflow/underflow is rare. FP16 has a narrow range (5-bit exponent), requiring loss scaling.

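import torch

model = model.to('cuda')
# Loss scaling is needed for FP16 only (torch.cuda.amp.GradScaler on older releases)
scaler = torch.amp.GradScaler('cuda')

for batch in dataloader:
    optimizer.zero_grad()

    # Option A: FP16 with loss scaling
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        output = model(batch['x'])
        loss = criterion(output, batch['y'])
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)   # unscale first so the clip threshold applies to true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

# Option B: BF16 (cleaner, no loss scaling); replace the loop body above with:
#     with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
#         output = model(batch['x'])
#         loss = criterion(output, batch['y'])
#     loss.backward()
#     torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#     optimizer.step()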
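
The range versus precision tradeoff is easy to demonstrate on the CPU. In this small sketch, a value above the FP16 maximum overflows to infinity but stays finite in BF16, while a small increment near 1.0 survives in FP16 and is rounded away in BF16.

```python
import torch

# Range: 70000 exceeds the FP16 maximum of 65504
big = torch.tensor(70000.0)
fp16_big = big.to(torch.float16)      # overflows to inf
bf16_big = big.to(torch.bfloat16)     # finite, thanks to the 8-bit exponent

# Precision: 1 + 2^-9 needs more than 7 mantissa bits
small_step = torch.tensor(1.0 + 2 ** -9)
fp16_step = small_step.to(torch.float16)    # representable with 10 mantissa bits
bf16_step = small_step.to(torch.bfloat16)   # rounds to 1.0 with 7 mantissa bits

print(fp16_big, bf16_big, fp16_step, bf16_step)
```

This is why FP16 training needs loss scaling to keep gradients inside its narrow range, while BF16 trades mantissa bits for that range.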

Q20. What is LoRA? How does it reduce parameters for fine-tuning?

W' = W + delta_W = W + B * A
where W in R^(d x k), B in R^(d x r), A in R^(r x k), r << min(d, k)

Parameter savings: A 4096 x 4096 weight matrix has 16.7M parameters. With rank 16, B and A together have 4096*16 + 16*4096 = 131,072 (about 131K) parameters, a reduction of 128x.
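
The arithmetic above can be reproduced with a hand-written LoRA layer. This is a minimal sketch of the idea, not the `peft` implementation; `LoRALinear` and its initialization scale are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W' = W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # freeze W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01) # [r, k], small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))       # [d, r], zero init so delta_W = 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

base = nn.Linear(4096, 4096, bias=False)
lora = LoRALinear(base, r=16, alpha=32)

trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in lora.parameters() if not p.requires_grad)
print(trainable, frozen, frozen // trainable)
```

Initializing B to zero means the adapted layer starts out identical to the pre-trained one, so fine-tuning begins from the base model's behavior.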

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# Gated repository; confirm the exact model ID on the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B',
                                              torch_dtype=torch.bfloat16,
                                              device_map='auto')

lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling: delta_W is multiplied by lora_alpha / r
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~41M || all params: ~8B || trainable%: ~0.5%

Q21. Explain the concept of knowledge distillation with a PyTorch implementation.

import torch.nn.functional as F
import torch

def distillation_loss(student_logits, teacher_logits, true_labels,
                       T=4.0, alpha=0.7):
    """
    T: temperature (higher = softer distribution, more info transfer)
    alpha: weight for soft target loss (1-alpha for hard label loss)
    """
    # Soft target loss (KL divergence)
    soft_targets  = F.softmax(teacher_logits / T, dim=-1)
    soft_student  = F.log_softmax(student_logits / T, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_targets, reduction='batchmean') * (T ** 2)

    # Hard label loss
    ce_loss = F.cross_entropy(student_logits, true_labels)

    return alpha * kd_loss + (1 - alpha) * ce_loss


# Training loop with teacher-student
teacher.eval()
for batch_x, batch_y in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(batch_x)   # teacher inference, no grad

    student_logits = student(batch_x)
    loss = distillation_loss(student_logits, teacher_logits, batch_y, T=4.0, alpha=0.7)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Q22. What is multi-task learning and when does it improve performance?

Benefits:

  • Prevents overfitting via auxiliary tasks
  • Fewer total parameters than N separate models
  • Task A can provide useful gradient signal for task B

import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    def __init__(self, d_model=512, n_classes_task1=3, n_classes_task2=5):
        super().__init__()
        # Shared backbone
        self.backbone = nn.Sequential(
            nn.Linear(100, d_model),
            nn.LayerNorm(d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model)
        )
        # Task-specific heads
        self.head_task1 = nn.Linear(d_model, n_classes_task1)
        self.head_task2 = nn.Linear(d_model, n_classes_task2)

    def forward(self, x):
        shared = self.backbone(x)
        return self.head_task1(shared), self.head_task2(shared)

def mtl_loss(logits1, logits2, y1, y2, lambda1=1.0, lambda2=0.5):
    loss1 = F.cross_entropy(logits1, y1)
    loss2 = F.cross_entropy(logits2, y2)
    return lambda1 * loss1 + lambda2 * loss2

Works best when tasks are related (e.g., named entity recognition + part-of-speech tagging, or classification + auxiliary self-supervised task). Divergent tasks hurt each other.


HARD: Advanced Topics (Questions 23-30)

Q23. What is quantization? Compare PTQ and QAT.

Method | Full Name | How | Quality | Speed
PTQ | Post-Training Quantization | Quantize after training; calibrate on small dataset | Lower | Fastest
QAT | Quantization-Aware Training | Simulate quantization during training; fine-tune | Higher | Slower but better
GPTQ | Post-training quantization for GPT-style LLMs | Minimize weight reconstruction error per layer | High for LLMs | Standard in 2026
AWQ | Activation-aware Weight Quantization | Scale weights based on salient activations | Often comparable to or better than GPTQ | Standard in 2026

# PyTorch dynamic quantization (simplest PTQ)
import torch
import torch.nn as nn
import torch.quantization

model_int8 = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},           # quantize only linear layers
    dtype=torch.qint8
)

# bitsandbytes 4-bit loading (LLM standard in 2026)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained('model_name',
                                              quantization_config=bnb_config)
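
Underneath these libraries, the core of weight quantization is a scale factor and a rounding step. The sketch below shows symmetric per-tensor int8 quantization by hand; the helper names are hypothetical, and production schemes (per-channel scales, NF4, GPTQ, AWQ) are more elaborate.

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_err = (w - w_hat).abs().max()
print(q.dtype, max_err)   # rounding error is bounded by half a quantization step
```

A single outlier weight inflates `scale` for the whole tensor, which is the motivation for per-channel scales and activation-aware methods such as AWQ.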

Q24. What is structured vs unstructured pruning?

Type | What is removed | Speed on Hardware | Quality loss
Unstructured | Individual weights (sparse matrix) | Minimal without sparse hardware | Minimal
Structured | Entire neurons, heads, or layers | Immediate (dense matrix shrinks) | Moderate
Magnitude pruning | Smallest-magnitude weights | Depends on type | Low
Lottery Ticket Hypothesis | Retrain sparse subnetwork from scratch | Experimental | Low

import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured L1 magnitude pruning (30% of weights zeroed)
prune.l1_unstructured(model.fc1, name='weight', amount=0.3)

# Structured pruning: remove entire rows (neurons) of a linear layer
prune.ln_structured(model.fc1, name='weight', amount=0.3, n=2, dim=0)

# Make pruning permanent (remove mask; recompute weight)
prune.remove(model.fc1, 'weight')

# Global pruning: prune 20% of all weights across the whole model
parameters_to_prune = [(layer, 'weight') for layer in model.modules()
                        if isinstance(layer, nn.Linear)]
prune.global_unstructured(parameters_to_prune,
                           pruning_method=prune.L1Unstructured,
                           amount=0.2)

Q25. Explain the training stability tricks for large language models.

Trick | Description | Purpose
Pre-LN (pre-norm) | LayerNorm before attention/FFN, not after | More stable gradients vs post-LN
QK-Norm | Normalize Q and K before dot-product | Prevents logit growth, entropy collapse
Gradient clipping | Clip norm to 1.0 | Prevents single bad batch from destroying training
Weight tying | Share embedding and output projection weights | Reduces parameters, improves language modeling
Z-loss | Penalize large logit magnitudes | Prevents softmax saturation
Warmup LR | Linear ramp for first 1-5% of steps | Prevents early instability

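import torch
import torch.nn as nn
import torch.nn.functional as F

# Z-loss (prevents entropy collapse in MoE routing and output softmax)
def z_loss(logits, z_loss_coef=0.001):
    log_z = torch.logsumexp(logits, dim=-1)
    return z_loss_coef * log_z.pow(2).mean()

# QK-Norm implementation: RMSNorm applied per head to Q and K before the dot product
class AttentionWithQKNorm(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.q_norm = nn.RMSNorm(self.d_k)   # nn.RMSNorm needs PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(self.d_k)

    def forward(self, x, mask=None):
        # mask: boolean, True = may attend (the opposite of nn.MultiheadAttention's attn_mask)
        B, T, D = x.shape
        # Project once, then split into heads: each of q, k, v is [B, n_heads, T, d_k]
        qkv = self.qkv(x).view(B, T, 3, self.n_heads, self.d_k)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Normalize Q and K per head so attention logits cannot grow unbounded
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        # Merge heads back to [B, T, D] and apply the output projection
        return self.out(y.transpose(1, 2).reshape(B, T, D))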
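
The two simplest tricks in the table, linear warmup and gradient clipping, can be sketched in a few lines of plain Python. This is an illustration under assumptions: `warmup_lr` and `clip_by_global_norm` are hypothetical helper names, the 2% warmup fraction and peak LR are example values, and the schedule stays constant after warmup even though real runs usually decay from there.

```python
import math


def warmup_lr(step, total_steps, peak_lr, warmup_frac=0.02):
    """Linear ramp up to peak_lr over the first warmup_frac of training."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # constant afterwards (assumption; a decay schedule is more common)


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


lrs = [warmup_lr(s, total_steps=1000, peak_lr=3e-4) for s in (0, 9, 19, 500)]
clipped, norm_before = clip_by_global_norm([3.0, 4.0])
print(lrs, clipped, norm_before)
```

Clipping by the global norm (rather than per parameter) preserves the direction of the update and only shrinks its length, which is the behaviour interviewers usually expect you to describe.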

Q26. What is RLHF and DPO? How are they used to align LLMs?

RLHF (Reinforcement Learning from Human Feedback):

  1. Supervised fine-tuning (SFT) on curated demonstrations
  2. Train reward model on human preference pairs (chosen > rejected)
  3. PPO optimization: maximize reward while staying close to SFT policy (KL penalty)

DPO (Direct Preference Optimization): Eliminates the reward model. Directly optimizes on preference pairs:

L_DPO = -E[log sigma(beta * (log pi_theta(y_w|x) - log pi_ref(y_w|x))
                      - beta * (log pi_theta(y_l|x) - log pi_ref(y_l|x)))]
# DPO loss implementation
import torch.nn.functional as F

def dpo_loss(policy_logprobs_chosen, policy_logprobs_rejected,
             ref_logprobs_chosen, ref_logprobs_rejected, beta=0.1):
    """
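    policy_*: log-probabilities of each full response (summed over its tokens) under the model being trained
    ref_*:    the same quantities under the frozen reference (SFT) model
    beta:     strength of the implicit KL penalty toward the reference
    """
    # Chosen-vs-rejected log-ratios, first under the policy, then under the reference
    pi_logratios = policy_logprobs_chosen - policy_logprobs_rejected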
    ref_logratios = ref_logprobs_chosen - ref_logprobs_rejected

    # DPO objective
    logits = beta * (pi_logratios - ref_logratios)
    loss = -F.logsigmoid(logits).mean()
    return loss

Why DPO in 2026: DPO is simpler (no RL loop, no separate reward model), is usually more stable to train, and often reaches alignment quality comparable to PPO-based RLHF. The TRL library from Hugging Face ships a ready-made DPOTrainer.
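
A worked numeric example can make the DPO objective easier to explain at a whiteboard. The sketch below evaluates the loss for a single preference pair in plain Python; `dpo_loss_single` is a hypothetical helper and the log-probabilities are made-up values, chosen only to show how the loss moves relative to log(2).

```python
import math


def dpo_loss_single(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Policy identical to the reference: margin is 0, loss is log(2)
loss_at_init = dpo_loss_single(-11.0, -11.0, -11.0, -11.0)
# Policy raised the chosen response and lowered the rejected one: loss drops
loss_improved = dpo_loss_single(-10.0, -12.0, -11.0, -11.0)
# Policy prefers the rejected response: loss rises above log(2)
loss_worse = dpo_loss_single(-12.0, -10.0, -11.0, -11.0)
print(loss_at_init, loss_improved, loss_worse)
```

Because only the difference between policy and reference log-ratios enters the loss, a policy that copies the reference always starts at log(2), regardless of how likely either response is.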


Q27. How does speculative decoding accelerate LLM inference?

Speculative decoding uses a small fast "draft" model to propose k tokens, then verifies all k in a single pass of the large model:

1. Draft model proposes tokens t_1, ..., t_k (k serial small-model passes)
2. Large model verifies all k tokens in ONE forward pass (k tokens in parallel)
3. Accept the longest prefix of draft tokens the large model agrees with; at the first disagreement, discard the remaining drafts and take (or resample) that token from the large model
4. Speedup: ~2-3x for typical draft acceptance rates of 70-90%
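# Simplified speculative decoding loop (batch size 1, greedy draft, greedy verification)
import torch

def speculative_decode(draft_model, target_model, prompt, max_new=100, k=5):
    input_ids = prompt                      # [1, T] token ids
    n_prompt = prompt.shape[1]

    while input_ids.shape[1] - n_prompt < max_new:
        # Draft k tokens greedily (k serial passes of the small model)
        ids = input_ids
        for _ in range(k):
            with torch.no_grad():
                draft_logits = draft_model(ids).logits[:, -1]
            draft_tok = draft_logits.argmax(-1)
            ids = torch.cat([ids, draft_tok.unsqueeze(1)], dim=1)
        draft_tokens = ids[:, -k:]          # [1, k]

        # Verify with target model in ONE forward pass
        with torch.no_grad():
            target_logits = target_model(ids).logits[:, -k-1:]   # k+1 positions
        target_tokens = target_logits.argmax(-1)                 # [1, k+1]

        # Accept the longest prefix on which draft and target agree
        agree = (draft_tokens == target_tokens[:, :k])[0].long()
        n_accept = int(agree.cumprod(0).sum())
        # Append accepted drafts plus one target token: the correction at the
        # first disagreement, or a bonus token if all k drafts were accepted
        new = torch.cat([draft_tokens[:, :n_accept],
                         target_tokens[:, n_accept:n_accept + 1]], dim=1)
        input_ids = torch.cat([input_ids, new], dim=1)

    # With sampling instead of greedy decoding, each draft token is accepted with
    # probability min(1, p_target / p_draft) and rejections are resampled
    return input_ids[:, n_prompt:n_prompt + max_new]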
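
The 2-3x figure quoted above can be sanity-checked with a back-of-the-envelope model. The sketch below assumes each draft token is accepted independently with the same probability, and that one draft pass costs a fixed fraction of a target pass; both are simplifying assumptions, and `draft_cost=0.1` is an illustrative value, not a measurement.

```python
def expected_tokens_per_target_pass(accept_rate, k):
    """Expected tokens emitted per target pass: accepted drafts plus one target token."""
    if accept_rate == 1.0:
        return k + 1
    # Geometric series 1 + a + a^2 + ... + a^k
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate, k, draft_cost):
    """Tokens per iteration divided by its cost (k draft passes + 1 target pass)."""
    return expected_tokens_per_target_pass(accept_rate, k) / (k * draft_cost + 1)


tokens = expected_tokens_per_target_pass(0.8, k=5)
speedup = estimated_speedup(0.8, k=5, draft_cost=0.1)
print(tokens, speedup)
```

Under these assumptions an 80% acceptance rate with k=5 yields roughly 3.7 tokens per target pass and a speedup of about 2.5x, which is why raising k stops paying off once the draft cost term dominates.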

Q28. Explain distributed training: data parallelism, tensor parallelism, and FSDP.

| Strategy | What is sharded | Scale | Overhead |
|---|---|---|---|
| Data Parallelism (DDP) | Data; full model on each GPU | Linear with GPUs | Gradient all-reduce |
| Tensor Parallelism (TP) | Weight matrices split column/row-wise | LLM attention/FFN | Complex, needs tensor-aware code |
| Pipeline Parallelism (PP) | Model layers split in stages | Very deep models | Micro-batching required |
| ZeRO Stage 3 | Optimizer states + gradients + params | Large models | Higher communication |
| FSDP (PyTorch) | Full model sharded; gather on demand | Large models | Native PyTorch, standard in 2026 |
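import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    CPUOffload, FullyShardedDataParallel as FSDP, ShardingStrategy
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Initialize distributed backend (launch with torchrun, which sets LOCAL_RANK)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap each transformer block as its own FSDP unit.
# MyLargeModel and MyTransformerBlock are placeholders for your own classes.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},
)

model = MyLargeModel().to(local_rank)
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,    # ZeRO-3 equivalent
    cpu_offload=CPUOffload(offload_params=True),       # offload to CPU RAM
    auto_wrap_policy=wrap_policy,
    device_id=local_rank
)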
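
A quick memory estimate helps explain why sharding matters. The sketch below uses the common mixed-precision Adam accounting of 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for the fp32 master weights, momentum, and variance) and ignores activations and buffers. The 7B-parameter model and 8 GPUs are illustrative numbers, and `training_state_gb` is a hypothetical helper.

```python
def training_state_gb(n_params, n_gpus, sharded):
    """Per-GPU memory (GB) for weights, gradients, and Adam state; activations excluded."""
    # bf16 weights + bf16 grads + fp32 master copy, momentum, variance
    bytes_per_param = 2 + 2 + 12
    total_bytes = n_params * bytes_per_param
    # Full sharding splits all three kinds of state evenly across GPUs
    per_gpu = total_bytes / n_gpus if sharded else total_bytes
    return per_gpu / 1e9


ddp_gb = training_state_gb(7e9, n_gpus=8, sharded=False)   # full replica on every GPU
fsdp_gb = training_state_gb(7e9, n_gpus=8, sharded=True)   # FULL_SHARD / ZeRO-3 style
print(ddp_gb, fsdp_gb)
```

Under this accounting, plain DDP needs about 112 GB of training state per GPU for a 7B model, which does not fit on an 80 GB card, while full sharding across 8 GPUs brings it to about 14 GB before activations.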

Q29. What is contrastive learning? Explain CLIP and SimCLR.

SimCLR (self-supervised for vision):

  • Two augmented views of the same image = positive pair
  • Views from different images = negative pairs
  • NT-Xent (normalized temperature-scaled cross-entropy) loss

CLIP (OpenAI, multimodal):

  • Positive pair: (image, its text caption)
  • Negative pairs: (image, all other captions in batch)
  • Train image encoder + text encoder together on 400M image-text pairs
  • Creates aligned image-text embedding space
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """
    z1, z2: [N, D] normalized embeddings of two augmented views
    """
    N = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)   # [2N, D]
    # Cosine similarity matrix
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=-1) / temperature
    # Mask out diagonal (self-similarity)
    mask = torch.eye(2*N, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))
    # Labels: for sample i, positive is at i+N (and vice versa)
    labels = torch.cat([torch.arange(N, 2*N), torch.arange(0, N)]).to(z.device)
    return F.cross_entropy(sim, labels)
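
CLIP's objective differs from NT-Xent mainly in that the two "views" come from different encoders, and the loss is averaged over both directions (image to text and text to image). Here is a plain-Python sketch over a tiny similarity matrix; `clip_loss` and `cross_entropy_row` are hypothetical helpers, the similarity values are made up, and the temperature of 0.07 is used only as an example value.

```python
import math


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against an integer target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def clip_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over an N x N image-text similarity matrix."""
    n = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    # Image -> text: row i should pick caption i
    i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text -> image: column j should pick image j
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = sum(cross_entropy_row(cols[j], j) for j in range(n)) / n
    return (i2t + t2i) / 2


aligned = [[1.0, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 1.0]]
uninformative = [[0.5] * 3 for _ in range(3)]
loss_aligned = clip_loss(aligned)
loss_uniform = clip_loss(uninformative)
print(loss_aligned, loss_uniform)
```

When every pair looks equally similar, the loss equals log(N) for a batch of N pairs, which is a handy baseline to quote when asked what an untrained contrastive model's loss should be.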

Q30. How do you debug a deep learning model that is not training?

1. Check data pipeline first
   - Visualize a batch: print shapes, check label distribution, spot-check samples
   - Verify normalization: mean~0, std~1 after preprocessing

2. Check that loss decreases on a single batch
   - If loss does NOT decrease on 1 batch: bug in forward pass or loss
   - If loss decreases on 1 batch but NOT across epochs: data pipeline issue

3. Check gradients
   - Any NaN? -> usually exploding gradients or an invalid op (log(0), divide by zero); clip or reduce LR
   - All zero? -> dying ReLUs, wrong loss, disconnected graph
   - Too small? -> vanishing gradients; use residual connections, LayerNorm, better init

4. Check learning rate
   - Too high: loss oscillates or diverges
   - Too low: loss decreases but very slowly
   - Use an LR range test (Leslie Smith) to pick the peak LR, then a schedule such as 1cycle

5. Check label correctness
   - Common bug: labels are 1-indexed while the loss expects 0-indexed class ids
   - Common bug: label tensor dtype should be torch.long for CrossEntropyLoss
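# Gradient checking utility (run after loss.backward())
for name, param in model.named_parameters():
    if param.grad is None:
        # Frozen parameter, or one that is not connected to the loss
        print(f"NO grad: {name}")
        continue
    grad_norm = param.grad.norm().item()
    # Check NaN first: a NaN norm fails every numeric comparison below
    if torch.isnan(param.grad).any():
        print(f"NaN grad: {name}")
    elif grad_norm == 0:
        print(f"ZERO grad: {name}")
    elif grad_norm > 100:
        print(f"LARGE grad: {name} = {grad_norm:.2f}")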

# Overfit one batch test
x_one, y_one = next(iter(dataloader))
for step in range(200):
    optimizer.zero_grad()
    pred = model(x_one)
    loss = criterion(pred, y_one)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"Step {step}: loss={loss.item():.4f}")
# If loss does not reach near-zero after 200 steps: architecture or init bug
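
The LR range test from step 4 is easy to describe in code: train briefly while increasing the learning rate exponentially, record the loss at each value, and pick a rate from the region where the loss falls fastest. The sketch below shows only the schedule and one selection heuristic; `lr_range_schedule` and `pick_lr` are hypothetical helpers, and the loss values are invented for illustration rather than taken from a real run.

```python
def lr_range_schedule(lr_min, lr_max, n_steps):
    """Exponentially spaced learning rates from lr_min to lr_max, one per step."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (n_steps - 1)) for i in range(n_steps)]


def pick_lr(lrs, losses):
    """Heuristic (an assumption): take the LR where loss fell fastest between steps."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    best = max(range(len(drops)), key=lambda i: drops[i])
    return lrs[best + 1]


lrs = lr_range_schedule(1e-6, 1.0, n_steps=7)
# Hypothetical losses recorded at each LR: flat, then falling, then diverging
losses = [2.30, 2.29, 2.20, 1.60, 1.20, 1.50, 9.00]
suggested = pick_lr(lrs, losses)
print(lrs, suggested)
```

In practice the sweep uses hundreds of steps and a smoothed loss curve; the point to make in an interview is that the usable range ends well before the loss starts to diverge.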

Comparison Table: Architecture Choices in 2026

| Task | Architecture | Framework | Notes |
|---|---|---|---|
| Image classification | EfficientNetV2, ViT-B | PyTorch | Pretrained on ImageNet-21K |
| Object detection | YOLOv9, RT-DETR | PyTorch | RT-DETR for production |
| NLP classification | BERT, DeBERTa-v3 | HuggingFace | Fine-tune on domain data |
| Text generation | LLaMA-3, Mistral | HuggingFace | QLoRA fine-tuning |
| Speech | Whisper | HuggingFace | OpenAI Whisper-large-v3 |
| Multimodal | CLIP, LLaVA | HuggingFace | Vision-language tasks |
| Tabular | LightGBM, XGBoost | Native | Often still beats NNs on tabular data |
| Time series | PatchTST, N-HiTS | PyTorch | PatchTST is Transformer-based; N-HiTS is MLP-based |

FAQ

Q: PyTorch vs TensorFlow in 2026 interviews? A: PyTorch is the answer. Most published research uses PyTorch. TF 2.x is still used at some Google-adjacent teams and integrates Keras, but if asked to pick one, pick PyTorch.

Q: How do you debug NaN loss during training? A: In order: check for zero-division in your loss, enable torch.autograd.detect_anomaly() during debugging, check gradient norms, reduce learning rate, check input normalization, add gradient clipping.

Q: What is the difference between an epoch and a step? A: One step = one forward+backward pass and optimizer update on one mini-batch. One epoch = one pass over the entire dataset. Steps per epoch = ceil(dataset_size / batch_size), or floor if the last partial batch is dropped.
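
The steps-per-epoch arithmetic is worth being able to do on the spot, since warmup lengths and LR schedules are defined in steps. A small sketch follows; `steps_per_epoch` is a hypothetical helper, and its `drop_last` flag mirrors the idea of skipping a final partial batch. The dataset and batch sizes are example values.

```python
import math


def steps_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of optimizer steps needed to pass over the dataset once."""
    if drop_last:
        return dataset_size // batch_size      # final partial batch is skipped
    return math.ceil(dataset_size / batch_size)


per_epoch = steps_per_epoch(50_000, 128)        # 390 full batches + 1 partial batch
total_steps = per_epoch * 10                    # steps in a 10-epoch run
print(per_epoch, total_steps)
```

With gradient accumulation, divide by the accumulation factor as well, because an optimizer step then spans several mini-batches.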

Q: When should I use a pre-trained model vs train from scratch? A: Always start with a pre-trained model when one exists for your domain. Training from scratch is warranted only when: the domain is highly specialized (medical, satellite, proprietary signal), you have massive data (>10M labeled samples), or the architecture is custom.



Methodology applied to this article (last verified 8 Jun 2026)
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
