PyTorch Interview Questions 2026: 28 Answers with Code

What changed in 2026 drives
Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- Two solid coding-round answers (one medium-hard DSA problem each, with edge-case discussion) beat five half-baked ones
- One real project you can defend end-to-end: file paths, design decisions, and what you would change
- One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
PyTorch is the dominant deep learning framework in 2026. It is widely used across top AI labs (Meta AI, OpenAI, Anthropic, DeepMind, Mistral), although Google and DeepMind research leans primarily on JAX, and it is the default on many Microsoft and Amazon ML teams. If you are interviewing for any ML engineering, ML research, or deep learning role, PyTorch proficiency is non-negotiable. This guide covers 28 PyTorch interview questions with complete code.
PapersAdda's take: PyTorch interviews test three things: Do you understand the tensor and autograd system? Can you write a clean nn.Module? Can you debug a training loop? The code in this guide is write-from-memory interview code, not tutorial snippets. Candidates report that the autograd computation graph and custom training loop questions appear in virtually every PyTorch-focused interview round. According to candidate accounts from public preparation resources, distributed training (DDP, FSDP) questions are increasingly common at senior ML engineer levels. Confirm the exact interview format and required skills on the official company careers portal.
Which Companies Ask PyTorch Questions?
| Company | PyTorch Usage |
|---|---|
| Meta / Facebook AI | Created PyTorch; uses internally for all research and production |
| Microsoft Azure AI | Standard for ML research, Azure ML services |
| Amazon AWS | SageMaker, Bedrock; PyTorch is default framework |
| Google DeepMind | Research is primarily JAX-based; PyTorch is used alongside it |
| OpenAI, Anthropic | Training frontier models on PyTorch |
| Indian unicorns (Zomato, Swiggy, Meesho, PhonePe) | ML teams use PyTorch |
EASY: Tensors and Core Concepts (Questions 1-10)
Q1. What is a PyTorch tensor? How is it different from a NumPy array?
| Property | PyTorch Tensor | NumPy Array |
|---|---|---|
| GPU support | Yes (.to('cuda')) | No (CPU only) |
| Autograd | Yes (tracks gradients) | No |
| Memory sharing with NumPy | Yes (same memory if CPU) | Yes |
| Broadcasting | Yes | Yes |
| Sparse support | Yes | Limited |
| Mixed precision | Yes (float16, bfloat16) | Limited |
import torch
import numpy as np
# Creation
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
x_gpu = x.to('cuda') # move to GPU
x_np = x.numpy() # share memory with NumPy (CPU only)
# Zero-copy bridge
arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr) # shares memory
arr[0] = 99
print(t) # tensor([99., 2., 3.]) -- same memory
# Shapes and operations
a = torch.randn(3, 4) # shape [3, 4]
print(a.shape, a.dtype, a.device)
# Reshape vs view
b = a.view(4, 3) # view: shares storage, contiguous only
c = a.reshape(12) # reshape: may copy if not contiguous
d = a.T.contiguous() # make contiguous after transpose
# Indexing
print(a[:, 0]) # first column [3]
print(a[a > 0]) # boolean indexing
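The view/reshape distinction above is easiest to see on a non-contiguous tensor. The following is a minimal sketch, assuming a CPU tensor and illustrative variable names (`view_failed`, `copied_shares_storage`, `view_shares_storage` are not from the original): `view` refuses a transposed tensor, while `reshape` quietly falls back to a copy.

```python
import torch

a = torch.arange(12.0).reshape(3, 4)
t = a.T  # transpose: same storage as a, but non-contiguous strides

# view needs strides compatible with the existing storage; a transposed tensor is not
try:
    t.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape succeeds by copying into fresh memory
r = t.reshape(12)
copied_shares_storage = r.data_ptr() == a.data_ptr()

# view on a contiguous tensor is zero-copy
v = a.view(4, 3)
view_shares_storage = v.data_ptr() == a.data_ptr()

print(view_failed, copied_shares_storage, view_shares_storage)
```

In an interview, the practical takeaway is that `reshape` is the safe default, and `view` is worth using when you want a guarantee that no copy happens.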
Q2. What is autograd? How does requires_grad work?
import torch
# requires_grad=True: track this tensor for gradient computation
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 4.0], requires_grad=True)
# Forward pass: every operation is recorded
z = x ** 2 + 3 * y + x * y
loss = z.sum()
# Backward pass: compute gradients via chain rule
loss.backward()
print("dL/dx:", x.grad) # 2x + y = [2*2 + 1, 2*3 + 4] = [5, 10]
print("dL/dy:", y.grad) # 3 + x = [3 + 2, 3 + 3] = [5, 6]
# Stop tracking gradients (for inference or non-trainable parts)
with torch.no_grad():
    output = x * 2 # no graph built; output.requires_grad is False
# Detach a tensor from the computation graph
detached = z.detach() # no grad_fn, shares data with z
Q3. What is the difference between .backward() and torch.autograd.grad()?
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 3).sum() # y = x1^3 + x2^3 + x3^3
# Method 1: loss.backward() -> accumulates into x.grad
y.backward()
print("x.grad:", x.grad) # [3*1^2, 3*2^2, 3*3^2] = [3, 12, 27]
# Method 2: autograd.grad() -> returns gradient tensors explicitly
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y2 = (x2 ** 3).sum()
grads = torch.autograd.grad(y2, x2)
print("autograd.grad:", grads[0]) # same result
# autograd.grad is useful when you need:
# 1. Gradients w.r.t. intermediate tensors (not model params)
# 2. Higher-order gradients (grad of grad)
# 3. Partial gradients (only some inputs)
# Higher-order: Hessian-vector product
x3 = torch.tensor([1.0, 2.0], requires_grad=True)
y3 = (x3 ** 2).sum()
grads3 = torch.autograd.grad(y3, x3, create_graph=True)[0] # keep graph for 2nd order
hessian_v = torch.autograd.grad(grads3.sum(), x3)[0]
print("2nd order:", hessian_v) # [2, 2] (Hessian of sum of x^2 is diag(2))
Q4. What is nn.Module? What are the key methods you must implement?
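nn.Module is the base class for every layer and model in PyTorch. It provides:
- Parameter tracking (all nn.Parameter attributes and sub-modules are registered automatically)
- A training flag (affects BatchNorm, Dropout)
- to(device) and to(dtype) for moving all params
- state_dict() / load_state_dict() for saving/loading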
Must implement: __init__ (define layers), forward (forward computation).
import torch
import torch.nn as nn
class FeedForward(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int,
                 dropout: float = 0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim)
        )
        self._init_weights()
    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)
model = FeedForward(784, 256, 10)
print(model)
print(f"Params: {sum(p.numel() for p in model.parameters()):,}")
# Move to GPU: all parameters moved
model.to('cuda')
# Separate trainable vs frozen parameters
frozen_params = [p for p in model.parameters() if not p.requires_grad]
trainable_params = [p for p in model.parameters() if p.requires_grad]
Q5. How do you write a training loop in PyTorch?
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
def train_epoch(model, dataloader, optimizer, criterion, device, scaler=None):
    model.train()
    total_loss = 0.0
    for batch_x, batch_y in dataloader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad() # ALWAYS zero before forward pass
        if scaler:
            # Mixed precision (AMP)
            with torch.autocast(device_type='cuda', dtype=torch.float16):
                output = model(batch_x)
                loss = criterion(output, batch_y)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
        else:
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)
def evaluate(model, dataloader, criterion, device):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            _, predicted = output.max(1)
            correct += (predicted == batch_y).sum().item()
            total += len(batch_y)
    return correct / total
# Full training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FeedForward(784, 512, 10).to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
# FP16 loss scaling only on CUDA; torch.amp.GradScaler replaces the deprecated torch.cuda.amp.GradScaler
scaler = torch.amp.GradScaler('cuda') if device.type == 'cuda' else None
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
for epoch in range(30):
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device, scaler)
    val_acc = evaluate(model, val_loader, criterion, device)
    scheduler.step()
    print(f"Epoch {epoch+1}: loss={train_loss:.4f}, val_acc={val_acc:.4f}")
Q6. What is the PyTorch Dataset and DataLoader? How do you write a custom dataset?
from torch.utils.data import Dataset, DataLoader
import torch
import pandas as pd
import numpy as np
from pathlib import Path
from PIL import Image
import torchvision.transforms as T
class ImageClassificationDataset(Dataset):
    def __init__(self, csv_path: str, img_dir: str, transform=None):
        self.df = pd.read_csv(csv_path) # columns: filename, label
        self.img_dir = Path(img_dir)
        self.transform = transform
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(self.img_dir / row['filename']).convert('RGB')
        label = torch.tensor(row['label'], dtype=torch.long)
        if self.transform:
            img = self.transform(img)
        return img, label
# Usage
train_transform = T.Compose([
    T.RandomResizedCrop(224), T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
val_transform = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
train_ds = ImageClassificationDataset('train.csv', 'images/', train_transform)
val_ds = ImageClassificationDataset('val.csv', 'images/', val_transform)
# DataLoader: handles batching, shuffling, parallel loading
train_loader = DataLoader(
    train_ds, batch_size=64, shuffle=True,
    num_workers=4, # parallel data loading (tune to available CPU cores)
    pin_memory=True, # faster CPU->GPU transfer (if CUDA)
    persistent_workers=True, # keep workers alive across epochs
    drop_last=True # drop last incomplete batch
)
Q7. What is gradient accumulation and why is it used?
import torch
import torch.nn as nn
ACCUMULATION_STEPS = 8 # effective batch = actual_batch * 8
optimizer.zero_grad()
for step, (batch_x, batch_y) in enumerate(dataloader):
    batch_x = batch_x.to(device)
    batch_y = batch_y.to(device)
    # Scale loss by accumulation steps (so effective gradient magnitude is same)
    loss = criterion(model(batch_x), batch_y) / ACCUMULATION_STEPS
    loss.backward() # accumulate gradients; DON'T zero here
    if (step + 1) % ACCUMULATION_STEPS == 0:
        # All gradients accumulated; now clip and step
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad() # zero AFTER step
# With HuggingFace Trainer: use gradient_accumulation_steps=8 in TrainingArguments
# With PyTorch native: use above pattern
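Why divide the loss by the number of accumulation steps? A small check makes the reasoning concrete. This is a sketch under simple assumptions (a toy `nn.Linear`, mean-reduced MSE loss, equal-sized micro-batches; all names are illustrative): the accumulated gradient should match the gradient of one full-batch pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()  # mean reduction, as in most training loops
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Reference: one backward pass over the full batch of 8
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 2, each loss divided by 4
accum_steps = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (criterion(model(xb), yb) / accum_steps).backward()
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print("gradients match:", grads_match)
```

The equivalence holds for layers without batch-level statistics. With BatchNorm, each micro-batch computes its own statistics, so accumulation only approximates the large-batch result.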
Q8. How does PyTorch handle memory management on GPU? What causes OOM errors?
| Cause | Fix |
|---|---|
| Batch too large | Reduce batch size; use gradient accumulation |
| Storing activations unnecessarily | Use with torch.no_grad(): for inference |
| Tensor accumulation in Python list | Call .detach() or .item() on scalar tensors |
| Gradient checkpointing not used | Enable torch.utils.checkpoint.checkpoint() |
| Memory fragmentation | Call torch.cuda.empty_cache() (limited help) |
| Dead tensors still referenced | Delete tensors and call gc.collect() |
import torch
import gc
# OOM debug toolkit
def gpu_memory_stats():
    print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
    print(f"Max alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
gpu_memory_stats()
# Common mistake: accumulating loss tensors
losses = []
for batch_x, batch_y in dataloader:
    loss = criterion(model(batch_x), batch_y)
    # losses.append(loss) # BAD: keeps each batch's computation graph alive
    losses.append(loss.item()) # GOOD: Python float, no graph
# Proper inference (no gradients)
model.eval()
with torch.no_grad(): # no activations saved for backward, so memory use drops substantially
    outputs = model(X_test)
# Release cache
gc.collect() # drop unreachable Python objects first
torch.cuda.empty_cache() # release cached but unused blocks back to the GPU driver
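The table above lists gradient checkpointing as an OOM fix but the code does not show it. A minimal sketch follows, assuming a small stand-in `block` (hypothetical) and the non-reentrant mode of `torch.utils.checkpoint.checkpoint`. Checkpointing discards the block's intermediate activations in the forward pass and recomputes them during backward, trading compute for memory while producing the same gradients.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Stand-in for an expensive sub-network (illustrative only)
block = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
x = torch.randn(4, 16, requires_grad=True)

# Plain forward/backward: intermediate activations are kept for backward
out_ref = block(x)
out_ref.sum().backward()
grad_ref = x.grad.clone()

# Checkpointed forward/backward: activations are recomputed in backward
x.grad = None
block.zero_grad()
out_ckpt = checkpoint(block, x, use_reentrant=False)
out_ckpt.sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(out_ref, out_ckpt), torch.allclose(grad_ref, grad_ckpt))
```

In a real model the usual pattern is to wrap each transformer or residual block this way, since the savings scale with the number of wrapped blocks.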
Q9. What is torch.nn.functional vs nn.Module? When do you use each?
| Form | State (weights) | When to Use |
|---|---|---|
| nn.Module (e.g., nn.Linear) | Yes: stores weights as nn.Parameters | Modules with learnable parameters |
| nn.functional (e.g., F.linear) | No: pure function | Activations, loss functions, custom operations |
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(32, 10) # example input batch
# As Module (has weights)
linear = nn.Linear(10, 5)
out = linear(x) # stores W and b as parameters
# As functional (no weights; use separately defined parameters)
W = nn.Parameter(torch.randn(5, 10))
b = nn.Parameter(torch.zeros(5))
out = F.linear(x, W, b) # identical computation
# Functional preferred for stateless ops (typically called inside forward):
#   F.relu(x)                                    # activation (no params)
#   F.dropout(x, p=0.3, training=self.training)  # training-mode-aware dropout
#   F.cross_entropy(logits, labels)              # loss
#   F.softmax(x, dim=-1)
# Module preferred for stateful layers (assigned inside __init__):
#   self.conv = nn.Conv2d(3, 64, 3)        # has learnable weights
#   self.bn = nn.BatchNorm2d(64)           # has gamma, beta, running stats
#   self.embed = nn.Embedding(10000, 128)  # embedding table
# The hybrid pattern (most common in practice)
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
    def forward(self, x):
        return F.relu(self.bn(self.conv(x)), inplace=True)
Q10. What is torch.no_grad() vs torch.inference_mode()? What is detach()?
| Method | Autograd disabled | Output usable in a later autograd graph | Speed |
|---|---|---|---|
| torch.no_grad() | Yes | Yes (as a constant with requires_grad=False) | Moderate |
| torch.inference_mode() | Yes + stronger (also skips view and version-counter tracking) | No (inference tensors cannot be saved for backward) | Faster |
| .detach() | Yes for that tensor | Yes (tensor-level; shares storage with the original) | No overhead |
import torch
model = torch.nn.Linear(100, 100) # stand-in model for the examples below
x = torch.randn(100, 100, requires_grad=True)
# no_grad: context manager; preferred for eval loops
with torch.no_grad():
    y = model(x) # no grad computation
# inference_mode: stronger; preferred for pure inference
with torch.inference_mode():
    y = model(x) # output is an inference tensor; it cannot feed a grad computation later
# detach: cuts a tensor from the graph
y = model(x)
y_detached = y.detach() # y_detached.grad_fn is None
# Use for: discriminator in GAN (detach generator output)
# Use for: target network in RL (detach target model outputs)
# Timing comparison (inference_mode is usually as fast or slightly faster; measure on your own model)
import timeit
with torch.no_grad():
    t_no_grad = timeit.timeit(lambda: model(x), number=100)
with torch.inference_mode():
    t_inf = timeit.timeit(lambda: model(x), number=100)
print(f"no_grad: {t_no_grad:.2f}s inference_mode: {t_inf:.2f}s")
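The behavioural difference between the two context managers shows up when their outputs are reused. The sketch below (tensor names are illustrative, CPU only) shows that a `no_grad` result can act as a constant in a later graph, while a tensor created under `inference_mode` is rejected as soon as autograd needs to save it for backward.

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

with torch.no_grad():
    a = x * 2  # ordinary tensor, requires_grad=False
with torch.inference_mode():
    b = x * 2  # inference tensor

# A no_grad result can feed a later graph as a constant
(a * w).sum().backward()
no_grad_ok = w.grad is not None

# An inference tensor cannot be saved for backward
try:
    (b * w).sum().backward()
    inference_failed = False
except RuntimeError:
    inference_failed = True

print(a.is_inference(), b.is_inference(), no_grad_ok, inference_failed)
```

If an inference tensor does need to enter a training graph later, calling `.clone()` on it outside the `inference_mode` block yields a normal tensor.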
MEDIUM: Model Building and Training (Questions 11-20)
Q11. How do you implement a ResNet block from scratch?
import torch
import torch.nn as nn
import torch.nn.functional as F
class ResBlock(nn.Module):
    expansion = 1 # for BasicBlock; BottleneckBlock uses 4
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Skip connection: project if shape changes
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x)) # residual connection
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super().__init__()
        self.in_channels = 64
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1)
        )
        self.layer1 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.fc = nn.Linear(512, num_classes)
    def _make_layer(self, block, out_ch, n_blocks, stride):
        layers = [block(self.in_channels, out_ch, stride)]
        self.in_channels = out_ch
        for _ in range(1, n_blocks):
            layers.append(block(out_ch, out_ch))
        return nn.Sequential(*layers)
    def forward(self, x):
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        return self.fc(self.avgpool(x).flatten(1))
resnet18 = ResNet(ResBlock, [2, 2, 2, 2], num_classes=1000)
Q12. How do you implement a custom loss function in PyTorch?
import torch
import torch.nn as nn
import torch.nn.functional as F
# Option 1: Simple function
def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for class imbalance (binary targets in {0, 1})."""
    targets = targets.float()
    probs = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets,
                                            reduction='none')
    p_t = probs * targets + (1 - probs) * (1 - targets)
    # alpha weights positives, (1 - alpha) weights negatives
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    fl = alpha_t * (1 - p_t) ** gamma * ce
    return fl.mean()
# Option 2: nn.Module (better: maintains state, can be part of model)
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, epsilon: float = 0.1):
        super().__init__()
        self.epsilon = epsilon
    def forward(self, logits, targets):
        n_classes = logits.size(-1)
        log_probs = F.log_softmax(logits, dim=-1)
        # Smooth targets: (1-eps)*one_hot + eps/n_classes
        with torch.no_grad():
            smooth = log_probs.new_full(log_probs.shape, self.epsilon / n_classes)
            smooth.scatter_(-1, targets.unsqueeze(-1), 1 - self.epsilon + self.epsilon / n_classes)
        return -(smooth * log_probs).sum(dim=-1).mean()
# Option 3: Custom autograd function (for non-standard gradients)
class StraightThroughEstimator(torch.autograd.Function):
    """STE: quantize in forward, pass gradient straight through in backward."""
    @staticmethod
    def forward(ctx, x):
        return x.round() # quantize to nearest integer
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output # gradient flows straight through
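A quick way to sanity-check a hand-written loss is to compare it with a built-in that implements the same formula. The sketch below restates the label-smoothing computation as a plain function (`smoothed_ce` is an illustrative name, not from the original) and compares it with `F.cross_entropy(..., label_smoothing=...)`, which uses the same (1-eps)*one_hot + eps/n_classes target distribution.

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, epsilon=0.1):
    """Label-smoothed cross entropy, written out by hand."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Every class gets eps/n_classes; the true class gets the remaining mass
    smooth = torch.full_like(log_probs, epsilon / n_classes)
    smooth.scatter_(-1, targets.unsqueeze(-1), 1 - epsilon + epsilon / n_classes)
    return -(smooth * log_probs).sum(dim=-1).mean()

torch.manual_seed(0)
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))

manual = smoothed_ce(logits, targets)
builtin = F.cross_entropy(logits, targets, label_smoothing=0.1)
print(torch.allclose(manual, builtin, atol=1e-6))
```

Mentioning the built-in in an interview, after writing the manual version, shows you know both the math and the library.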
Q13. What is the difference between nn.Sequential, nn.ModuleList, and nn.ModuleDict?
import torch
import torch.nn as nn
x = torch.randn(8, 64) # example input batch
# nn.Sequential: ordered chain; supports indexing; forward feeds each module's output into the next
seq = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 10)
)
out = seq(x) # applies each module in order
# nn.ModuleList: list of modules; NOT a chain (no auto-forward)
class MultiHeadModel(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # Registers all Linear layers as sub-modules of the parent module
        self.heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(n_heads)])
    def forward(self, x):
        return torch.stack([head(x) for head in self.heads], dim=1)
# nn.ModuleDict: dict of named modules; useful for conditional routing
class MoELayer(nn.Module):
    def __init__(self, experts_dict):
        super().__init__()
        self.experts = nn.ModuleDict(experts_dict) # all params registered
    def forward(self, x, expert_name):
        return self.experts[expert_name](x)
# WRONG: plain Python list -- modules NOT registered, won't be saved
class BrokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(64, 64) for _ in range(3)] # NOT registered!
# Fix: self.layers = nn.ModuleList([nn.Linear(64,64) for _ in range(3)])
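The registration pitfall above is easy to demonstrate by counting parameters. In this sketch (class names `PlainList` and `Registered` are illustrative), the plain-list model reports zero parameters, so an optimizer built from `model.parameters()` would train nothing and `state_dict()` would save nothing.

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the Linear layers are invisible to nn.Module
        self.layers = [nn.Linear(64, 64) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList: each Linear is registered as a sub-module
        self.layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(3)])

def n_params(module):
    return sum(p.numel() for p in module.parameters())

plain_count = n_params(PlainList())
registered_count = n_params(Registered())
print(plain_count, registered_count)  # 0 vs 3 * (64*64 + 64) = 12480
```

The same check is a useful first step when a model silently fails to learn.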
Q14. How do you implement mixed precision training in PyTorch?
import torch
import torch.nn as nn
model = MyModel().to('cuda')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# manages FP16 loss scaling (torch.amp replaces the deprecated torch.cuda.amp API)
scaler = torch.amp.GradScaler('cuda')
for epoch in range(n_epochs):
    for batch_x, batch_y in dataloader:
        batch_x = batch_x.to('cuda')
        batch_y = batch_y.to('cuda')
        optimizer.zero_grad()
        # autocast: ops run in float16 where safe, float32 otherwise
        with torch.autocast(device_type='cuda', dtype=torch.float16):
            output = model(batch_x)
            loss = criterion(output, batch_y)
        # Scale loss to prevent underflow in FP16 gradients
        scaler.scale(loss).backward()
        # Unscale gradients before clipping
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Step optimizer (unscales gradients internally if not already)
        scaler.step(optimizer)
        # Update the loss scale for next iteration
        scaler.update()
# BFloat16 (preferred on A100/H100: no loss scaling needed)
optimizer.zero_grad()
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(batch_x)
    loss = criterion(output, batch_y)
loss.backward() # no scaler needed for bfloat16
optimizer.step()
Q15. How do you implement a Transformer encoder from scratch in PyTorch?
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x: torch.Tensor, mask=None):
        B, T, D = x.shape
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scale = math.sqrt(self.d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = self.dropout(F.softmax(scores, dim=-1))
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_o(out)
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int,
                 dropout: float = 0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)
    def forward(self, x, mask=None):
        # Pre-norm architecture (more stable than post-norm)
        x = x + self.drop(self.attn(self.norm1(x), mask=mask))
        x = x + self.drop(self.ff(self.norm2(x)))
        return x
class TransformerEncoder(nn.Module):
    def __init__(self, n_layers, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask=mask)
        return self.norm(x)
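Interviewers often follow up the from-scratch attention with "how would you do this in production?". One reasonable answer, sketched here under the assumption of PyTorch 2.0 or newer, is `F.scaled_dot_product_attention`, which computes the same softmax(QK^T / sqrt(d_k))V and can dispatch to fused kernels. The shapes and names below are illustrative.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, d_k = 2, 4, 5, 8  # batch, heads, sequence length, head dim
Q, K, V = (torch.randn(B, H, T, d_k) for _ in range(3))

# Manual scaled dot-product attention (no mask, no dropout)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
manual = F.softmax(scores, dim=-1) @ V

# Built-in equivalent; may use a fused kernel depending on device and dtype
fused = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, fused, atol=1e-5))
```

The built-in also accepts `attn_mask`, `dropout_p`, and `is_causal`, so it can replace the score, mask, softmax, and dropout lines of a hand-written attention module.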
Q16. How do you save and load models in PyTorch?
import torch
# Recommended: save only the state_dict (portable; load it into a model built with the same architecture)
torch.save(model.state_dict(), 'model_weights.pth')
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu'))
# Full model save (pickles the whole module; tied to the class definition)
torch.save(model, 'full_model.pth')
# PyTorch 2.6+ defaults torch.load to weights_only=True, which rejects pickled modules
model_loaded = torch.load('full_model.pth', weights_only=False) # only for files you trust
# Checkpoint (training state: model + optimizer + epoch)
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': best_val_loss,
    'config': {'d_model': 512, 'n_heads': 8}
}
torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pth')
# Resume training
ckpt = torch.load('checkpoint_epoch_10.pth', map_location=device)
model.load_state_dict(ckpt['model_state_dict'])
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
scheduler.load_state_dict(ckpt['scheduler_state_dict'])
start_epoch = ckpt['epoch'] + 1
# Common mistake: map_location
# If model was saved on GPU but you're loading on CPU:
ckpt = torch.load('model.pth', map_location=torch.device('cpu'))
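A state_dict round trip can be checked without touching the filesystem. This sketch uses an in-memory `io.BytesIO` buffer and a toy `nn.Linear` (both are illustrative choices, not part of the original) to show that a restored model reproduces the original outputs, and that loading into a differently shaped model fails loudly.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)

# Save the state_dict to an in-memory buffer instead of a file
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Load into a fresh model with the same architecture
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, map_location='cpu', weights_only=True))

x = torch.randn(3, 4)
same_output = torch.equal(model(x), restored(x))

# A shape mismatch raises instead of silently loading wrong weights
try:
    nn.Linear(4, 3).load_state_dict(model.state_dict())
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(same_output, mismatch_raised)
```

Passing `weights_only=True` explicitly keeps the behaviour consistent across PyTorch versions and avoids unpickling arbitrary objects.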
Q17. What is DataParallel vs DistributedDataParallel? Which should you use?
| Feature | DataParallel (DP) | DistributedDataParallel (DDP) |
|---|---|---|
| Multi-GPU support | Single-machine only | Multi-machine + multi-GPU |
| GIL bottleneck | Yes (Python GIL limits parallelism) | No (one process per GPU) |
| Memory efficiency | Worse (gradient sync on one GPU) | Better (gradients averaged in-place) |
| Speed | Sub-linear (often well below 4x on 4 GPUs) | Near-linear scaling |
| Complexity | Simple | Needs init_process_group |
| 2026 recommendation | Use only for quick debugging | Production standard |
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
def cleanup():
    dist.destroy_process_group()
def train_ddp(rank, world_size, model_class, dataset, n_epochs=10):
    setup(rank, world_size)
    model = model_class().to(rank)
    model = DDP(model, device_ids=[rank], output_device=rank)
    criterion = torch.nn.CrossEntropyLoss()
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank
    )
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler,
                            num_workers=4, pin_memory=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for epoch in range(n_epochs):
        sampler.set_epoch(epoch) # shuffle differently each epoch
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(rank), batch_y.to(rank)
            loss = criterion(model(batch_x), batch_y)
            optimizer.zero_grad()
            loss.backward() # DDP auto-synchronizes gradients
            optimizer.step()
    cleanup()
# Launch with torch.multiprocessing.spawn, which passes rank as the first argument:
# torch.multiprocessing.spawn(train_ddp, args=(world_size, model_class, dataset), nprocs=world_size)
# With torchrun (torchrun --nproc_per_node=4 train.py), read RANK, LOCAL_RANK and WORLD_SIZE
# from os.environ instead; torchrun also sets MASTER_ADDR and MASTER_PORT.
Q18. How does PyTorch handle sparse tensors? When are they useful?
import torch
# Sparse COO tensor (row indices, col indices, values)
indices = torch.tensor([[0, 1, 2], [1, 0, 2]]) # shape [2, nnz]
values = torch.tensor([3.0, 4.0, 5.0])
sparse_tensor = torch.sparse_coo_tensor(
    indices, values, size=(3, 3), dtype=torch.float32
)
# Convert to dense for visualization
print(sparse_tensor.to_dense())
# Sparse gradients for very large embedding tables
# A dense nn.Embedding weight for 100K words x 128 dims: 100K * 128 * 4 bytes = 51 MB (acceptable)
# For a 1M+ vocab, nn.Embedding(..., sparse=True) produces sparse gradients,
# so each step only touches the rows that were actually looked up
# Sparse COO adjacency matrix for graph neural networks
row = torch.tensor([0, 1, 2, 1])
col = torch.tensor([1, 0, 3, 2])
edge_index = torch.stack([row, col])
edge_weight = torch.ones(4)
adj = torch.sparse_coo_tensor(edge_index, edge_weight, (4, 4))
# Matrix multiply: message passing in GNNs
node_features = torch.randn(4, 16)
messages = torch.sparse.mm(adj, node_features)
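To confirm that the sparse product really is the message-passing sum, it can be compared with the dense equivalent. This is a small sketch reusing the same four-edge graph; `sparse_out` and `dense_out` are illustrative names.

```python
import torch

torch.manual_seed(0)
# Same toy graph as above: edges 0->1, 1->0, 2->3, 1->2 stored as (row, col)
edge_index = torch.tensor([[0, 1, 2, 1],
                           [1, 0, 3, 2]])
adj = torch.sparse_coo_tensor(edge_index, torch.ones(4), (4, 4))
node_features = torch.randn(4, 16)

sparse_out = torch.sparse.mm(adj, node_features)  # touches only stored entries
dense_out = adj.to_dense() @ node_features        # full 4x4 matmul

print(torch.allclose(sparse_out, dense_out))
# Row 3 of the adjacency matrix has no entries, so node 3 aggregates nothing
print(sparse_out[3].abs().sum().item())
```

For a graph this small the dense version is just as fast. The sparse form pays off when the adjacency matrix has millions of rows and only a few non-zeros per row.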
Q19. How do you profile a PyTorch model?
import torch
from torch.profiler import profile, record_function, ProfilerActivity
model = MyModel().to('cuda')
x = torch.randn(64, 3, 224, 224).to('cuda')
# PyTorch Profiler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        with torch.inference_mode():
            out = model(x)
# View top operations by CUDA time
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))
# Export for Chrome tracing visualization
prof.export_chrome_trace('trace.json')
# Memory profiling
print(prof.key_averages().table(sort_by='self_cpu_memory_usage', row_limit=10))
# Quick FLOPs count (thop library)
from thop import profile as thop_profile
flops, params = thop_profile(model, inputs=(torch.randn(1, 3, 224, 224).cuda(),))
print(f"FLOPs: {flops/1e9:.2f} G Params: {params/1e6:.2f} M")
Q20. What is TorchScript and how do you export a model for production?
import torch
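# Note: TorchScript is in maintenance mode in PyTorch 2.x; torch.export and torch.compile are the newer paths
# torch.jit.trace: fastest; works if control flow doesn't change with input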
model.eval()
example_input = torch.randn(1, 3, 224, 224).cuda()
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')
# torch.jit.script: handles Python control flow (if/for based on tensors)
@torch.jit.script
def relu_if(x: torch.Tensor, threshold: float) -> torch.Tensor:
    if x.max() > threshold:
        return torch.relu(x)
    return x
class ScriptableModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(100, 10)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All operations must be scriptable (no Python-only ops)
        return self.fc(x)
scripted = torch.jit.script(ScriptableModel())
scripted.save('scripted_model.pt')
# ONNX export (most portable, TensorRT-compatible)
torch.onnx.export(
    model, example_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
    opset_version=17
)
HARD: Advanced PyTorch (Questions 21-28)
Q21. How do you implement FSDP (Fully Sharded Data Parallel) in PyTorch?
import os
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, CPUOffload
from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
# Initialize distributed
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
# Auto-wrap policy: wrap each TransformerLayer individually
auto_wrap = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerEncoderLayer}
)
model = LargeTransformerModel()
model = FSDP(
    model,
    auto_wrap_policy=auto_wrap,
    sharding_strategy=ShardingStrategy.FULL_SHARD, # ZeRO-3 equivalent
    cpu_offload=CPUOffload(offload_params=True), # offload params to RAM
    mixed_precision=None,
    device_id=local_rank
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# No GradScaler here: mixed_precision=None keeps training in FP32.
# For FP16 under FSDP, use torch.distributed.fsdp.sharded_grad_scaler.ShardedGradScaler.
# Saving with FSDP (must use FSDP-aware save; every rank calls state_dict())
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT):
    state = model.state_dict()
if dist.get_rank() == 0: # global rank 0, so only one process writes on multi-node runs
    torch.save(state, 'fsdp_checkpoint.pth')
Q22. What is custom autograd in PyTorch? When do you need it?
Write a custom torch.autograd.Function when:
- The operation is non-differentiable but you have a surrogate gradient (STE, REINFORCE)
- The default backward is numerically unstable
- You want a custom gradient for efficiency
import torch
class ClampSTE(torch.autograd.Function):
"""Clamp + Straight-Through Estimator gradient."""
@staticmethod
def forward(ctx, x, min_val, max_val):
ctx.save_for_backward(x)
ctx.min_val = min_val
ctx.max_val = max_val
return x.clamp(min_val, max_val)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
# Pass gradient through wherever x is in [min, max]; zero otherwise
mask = (x >= ctx.min_val) & (x <= ctx.max_val)
return grad_output * mask.float(), None, None
# Usage
x = torch.randn(10, requires_grad=True)
y = ClampSTE.apply(x, -1.0, 1.0)
y.sum().backward()
print(x.grad) # non-zero only where -1 < x < 1
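# Numerically stable log-softmax (log-sum-exp trick)
class StableLogSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Subtract the row max before exponentiating so exp() cannot overflow;
        # log(softmax(x)) computed naively can overflow or hit log(0)
        shifted = x - x.max(dim=-1, keepdim=True).values
        log_probs = shifted - shifted.exp().sum(dim=-1, keepdim=True).log()
        ctx.save_for_backward(log_probs)
        return log_probs
    @staticmethod
    def backward(ctx, grad):
        log_probs, = ctx.saved_tensors
        # grad_in = grad - softmax * sum(grad), with softmax = exp(log_probs)
        return grad - log_probs.exp() * grad.sum(dim=-1, keepdim=True)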
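The backward rule used for log-softmax, grad_in = grad_out - softmax(x) * sum(grad_out), can be sanity-checked against finite differences without any framework. The following is a minimal pure-Python sketch; the helper names and test values are made up for illustration.

```python
import math


def log_softmax(xs):
    """Stable log-softmax for a list of floats (shift by the max first)."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def analytic_backward(xs, grad_out):
    """grad_in = grad_out - softmax(xs) * sum(grad_out)."""
    probs = [math.exp(lp) for lp in log_softmax(xs)]
    total = sum(grad_out)
    return [g - p * total for g, p in zip(grad_out, probs)]


def numeric_backward(xs, grad_out, eps=1e-6):
    """Central differences on the scalar sum(grad_out * log_softmax(xs))."""
    def scalar(values):
        return sum(g * y for g, y in zip(grad_out, log_softmax(values)))

    grads = []
    for i in range(len(xs)):
        hi = xs[:i] + [xs[i] + eps] + xs[i + 1:]
        lo = xs[:i] + [xs[i] - eps] + xs[i + 1:]
        grads.append((scalar(hi) - scalar(lo)) / (2 * eps))
    return grads


xs = [2.0, -1.0, 0.5]
grad_out = [0.3, -0.7, 1.1]  # arbitrary upstream gradient
analytic = analytic_backward(xs, grad_out)
numeric = numeric_backward(xs, grad_out)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

With PyTorch available, torch.autograd.gradcheck runs the same kind of comparison directly on a custom Function, using double-precision inputs.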
Q23. How do you implement a learning rate warmup + cosine decay scheduler?
import torch.optim as optim
import math
class WarmupCosineScheduler(optim.lr_scheduler.LRScheduler):
    def __init__(self, optimizer, warmup_steps, total_steps, min_lr=1e-6, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.min_lr = min_lr  # absolute LR floor, not a fraction of base_lr
        super().__init__(optimizer, last_epoch)
    def get_lr(self):
        step = self.last_epoch
        if step < self.warmup_steps:
            # Linear warmup: LR goes from 0 to base_lr
            factor = step / max(1, self.warmup_steps)
            return [base_lr * factor for base_lr in self.base_lrs]
        # Cosine decay: LR goes from base_lr to min_lr
        progress = (step - self.warmup_steps) / max(1, self.total_steps - self.warmup_steps)
        progress = min(1.0, progress)  # hold at min_lr once total_steps is passed
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return [self.min_lr + (base_lr - self.min_lr) * cosine for base_lr in self.base_lrs]
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = WarmupCosineScheduler(optimizer, warmup_steps=500, total_steps=10000)
# Call scheduler.step() after every optimizer.step() (per step, not per epoch)
# Alternatively, use HuggingFace transformers scheduler (decays to 0, not to min_lr)
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10000
)
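A framework-free version of the warmup + cosine shape makes the endpoints easy to check by hand. This sketch treats min_lr as an absolute learning-rate floor (an assumption about the intended behaviour) and reuses the numbers from the snippet above; the function name is illustrative.

```python
import math


def warmup_cosine_lr(step, base_lr=3e-4, min_lr=1e-6, warmup_steps=500, total_steps=10000):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # stay at min_lr after total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


for step in (0, 250, 500, 5250, 10000):
    print(f"step {step:>5}: lr = {warmup_cosine_lr(step):.3e}")
```

The learning rate starts at 0, reaches base_lr exactly when warmup ends, passes the midpoint between base_lr and min_lr halfway through the decay, and lands on min_lr at total_steps.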
Q24. How do you use PyTorch for Graph Neural Networks (GNNs)?
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, global_mean_pool
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # DataLoader lives in torch_geometric.loader in PyG 2.x
# Basic GCN layer (manual message passing)
class GCNLayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.linear = nn.Linear(in_channels, out_channels)
    def forward(self, x, edge_index):
        """
        x: [N, in_channels] node features
        edge_index: [2, E] edge list (sparse adjacency)
        """
        # Aggregate: sum over neighbors via scatter_add
        row, col = edge_index  # row = source nodes, col = target nodes
        out = torch.zeros_like(x)
        out.scatter_add_(0, col.unsqueeze(-1).expand_as(x[row]), x[row])
        # Normalize by in-degree (mean aggregation; GCNConv instead adds
        # self-loops and uses symmetric degree normalization)
        deg = torch.zeros(x.size(0), device=x.device, dtype=x.dtype)
        deg.scatter_add_(0, col, torch.ones(col.size(0), device=x.device, dtype=x.dtype))
        deg = deg.clamp(min=1).unsqueeze(-1)
        return F.relu(self.linear(out / deg))
# Using PyG (PyTorch Geometric) -- standard library for GNNs
class GraphClassifier(nn.Module):
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GATConv(hidden_channels, hidden_channels, heads=4, concat=False)
        self.fc = nn.Linear(hidden_channels, num_classes)
    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.3, training=self.training)
        x = F.relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        x = global_mean_pool(x, batch)  # aggregate nodes per graph
        return self.fc(x)
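The aggregation step in the manual layer above is easier to follow on a three-node graph worked by hand. The sketch below is a pure-Python stand-in for the scatter-based code, assuming the same convention (first row of edge_index holds sources, second row holds targets); the function name and the toy graph are made up for illustration.

```python
def mean_aggregate(x, edge_index):
    """Average incoming neighbor features for every node.

    x: list of N feature lists; edge_index: (sources, targets), each of length E.
    Nodes with no incoming edge keep an all-zero vector (degree clamped to 1).
    """
    sources, targets = edge_index
    n_nodes, n_feats = len(x), len(x[0])
    summed = [[0.0] * n_feats for _ in range(n_nodes)]
    degree = [0] * n_nodes
    for src, dst in zip(sources, targets):
        for k in range(n_feats):
            summed[dst][k] += x[src][k]
        degree[dst] += 1
    return [[v / max(1, degree[i]) for v in summed[i]] for i in range(n_nodes)]


x = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
edge_index = ([0, 1, 2], [2, 2, 0])  # edges 0->2, 1->2, 2->0
out = mean_aggregate(x, edge_index)
print(out)  # [[4.0, 4.0], [0.0, 0.0], [0.5, 1.0]]
```

Node 2 receives the average of nodes 0 and 1, node 0 receives node 2's features unchanged, and node 1 has no incoming edges, so it stays at zero before the linear layer.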
Q25. What is the PyTorch compile() API? How does torch.compile work?
import time
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
model = MyModel().cuda()
# Compile: first call takes longer (compilation), subsequent calls are faster
compiled_model = torch.compile(model, mode='reduce-overhead')
# Modes:
# 'default': good general-purpose optimization
# 'reduce-overhead': minimize CUDA launch overhead (good for small batch)
# 'max-autotune': autotune Triton kernels (slowest compile, fastest runtime)
# 'max-autotune-no-cudagraphs': max-autotune without CUDA graphs
# Typical speedup: roughly 1.5-3x for transformer models; measure on your own workload
x = torch.randn(32, 3, 224, 224, device='cuda')
compiled_model(x); torch.cuda.synchronize()  # warm-up: the first call pays the compile cost
t = time.time(); compiled_model(x); torch.cuda.synchronize()
print(f"Compiled (after warm-up): {time.time()-t:.3f}s")
# torch.compile is compatible with DDP and FSDP
# Wrap with DDP first, then compile, so the compiler can account for DDP's gradient buckets
ddp_model = torch.compile(DDP(model))  # needs an initialized process group
# torch.compile with dynamic shapes
compiled_dynamic = torch.compile(model, dynamic=True)
# Reduces recompilation when sequence lengths vary
Q26. How do you implement gradient checkpointing in a custom model?
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint, checkpoint_sequential
class CheckpointedTransformer(nn.Module):
    def __init__(self, n_layers, d_model, n_heads, d_ff):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ])
    def forward(self, x, mask=None):
        for layer in self.layers:
            if self.training:
                # Recompute layer during backward instead of storing activations
                # use_reentrant=False: new API, more flexible
                x = checkpoint(layer, x, mask, use_reentrant=False)
            else:
                x = layer(x, mask)
        return x
# Memory vs speed tradeoff
# With gradient checkpointing: only each checkpointed block's input is stored,
# so activation memory drops sharply (the exact saving depends on the model)
# Cost: ~33% extra compute (one extra forward pass; backward costs about 2x forward)
# Rule of thumb: if GPU OOM with n_layers, checkpoint every other layer
# checkpoint_sequential: splits an nn.Sequential into segments and checkpoints each
class SequentialModel(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        self.layers = nn.Sequential(*[
            TransformerEncoderLayer(512, 8, 2048) for _ in range(n_layers)
        ])
    def forward(self, x):
        if self.training:
            # Two layers per segment; max(1, ...) guards against n_layers == 1
            segments = max(1, len(self.layers) // 2)
            return checkpoint_sequential(self.layers, segments, x, use_reentrant=False)
        return self.layers(x)
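The memory side of the tradeoff can be illustrated with a toy cost model. The sketch below assumes every layer keeps a fixed number of activation "units" for backward, that a checkpointed segment keeps only its input (1 unit), and that one segment at a time is recomputed in full during backward. The unit counts are invented for illustration and are not measurements.

```python
import math


def peak_activation_units(n_layers, segment_len=None, units_per_layer=10):
    """Peak stored activations under a simple cost model.

    segment_len=None means no checkpointing: every layer keeps all its units.
    Otherwise: one saved input per segment, plus the full activations of the
    single segment currently being recomputed during backward.
    """
    if segment_len is None:
        return n_layers * units_per_layer
    n_segments = math.ceil(n_layers / segment_len)
    return n_segments * 1 + segment_len * units_per_layer


n_layers = 24
baseline = peak_activation_units(n_layers)
for seg in (1, 2, 4, 8):
    peak = peak_activation_units(n_layers, seg)
    print(f"segment_len={seg}: {peak} units ({peak / baseline:.0%} of baseline {baseline})")
```

In this model short segments give the largest savings, and very long segments give most of the memory back. The roughly 33% compute overhead follows from the common approximation that backward costs about twice the forward pass, so recomputing one forward turns 3 units of work into 4.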
Q27. How do you implement a custom optimizer in PyTorch?
import torch
from torch.optim import Optimizer
class Lion(Optimizer):
    """Lion optimizer (Chen et al., Google, 2023): sign SGD with momentum."""
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay)
        super().__init__(params, defaults)
    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr = group['lr']
            beta1, beta2 = group['betas']
            wd = group['weight_decay']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state['exp_avg'] = torch.zeros_like(p)
                exp_avg = state['exp_avg']
                # Weight decay (decoupled), applied to the pre-update weights
                p.mul_(1 - lr * wd)
                # Update: sign of interpolated momentum
                update = exp_avg * beta1 + grad * (1 - beta1)
                p.add_(torch.sign(update), alpha=-lr)
                # Update momentum
                exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
        return loss
# Lion typically uses a 3-10x smaller LR than AdamW
# and a 3-10x larger weight_decay (so lr * weight_decay stays comparable)
optimizer = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.01)
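Tracing the update on a single scalar weight shows what "sign SGD with momentum" means in practice. The sketch below is a plain-Python illustration of the same update rule with weight decay left out; the gradient values and the larger learning rate are made up so the effect is visible.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def lion_step(param, grad, exp_avg, lr=1e-4, beta1=0.9, beta2=0.99):
    """One Lion update for a single scalar weight (no weight decay)."""
    update = sign(beta1 * exp_avg + (1 - beta1) * grad)  # only the sign is used
    param = param - lr * update
    exp_avg = beta2 * exp_avg + (1 - beta2) * grad        # momentum tracked with beta2
    return param, exp_avg


p, m = 1.0, 0.0
for grad in (0.5, 50.0, -0.001):
    p, m = lion_step(p, grad, m, lr=0.01)
    print(f"grad={grad:>7}: param={p:.4f} exp_avg={m:.5f}")
```

Every step moves the weight by exactly lr, whether the gradient is 0.5 or 50, and the third step still moves in the same direction because the accumulated momentum outweighs the small negative gradient. That uniform step size is the usual argument for running Lion with a smaller learning rate than AdamW.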
Q28. What are common PyTorch debugging patterns?
import torch
import torch.nn as nn
# 1. NaN/Inf detection
torch.autograd.set_detect_anomaly(True)  # enables full stack trace on NaN
# Warning: slows training significantly; disable in production
# 2. Check gradient flow
def check_gradients(model, threshold=1e-6):
    no_grad = []
    nan_grad = []
    for name, param in model.named_parameters():
        if param.grad is None:
            no_grad.append(name)
        elif torch.isnan(param.grad).any():
            nan_grad.append(name)
        elif param.grad.abs().max() < threshold:
            print(f"Very small grad: {name} max={param.grad.abs().max():.2e}")
    if no_grad: print("No gradient:", no_grad[:3])
    if nan_grad: print("NaN gradient:", nan_grad)
# 3. Overfit one batch (golden test)
x_one, y_one = next(iter(train_loader))
x_one, y_one = x_one.to(device), y_one.to(device)
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(x_one), y_one)
    loss.backward()
    optimizer.step()
    if step % 50 == 0: print(f"Step {step}: {loss.item():.6f}")
# If loss doesn't go near zero: architecture or loss bug
# 4. Shape debugging
class ShapeLogger(nn.Module):
    def __init__(self, name, wrapped):
        super().__init__()
        self.name = name
        self.module = wrapped
    def forward(self, x):
        out = self.module(x)
        print(f"{self.name}: in={tuple(x.shape)} out={tuple(out.shape)}")
        return out
# 5. Reproducibility
def set_seed(seed=42):
    import random
    import numpy as np
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark=True is faster for fixed input sizes but can pick different
    # kernels from run to run, so keep it False when you need reproducibility
    torch.backends.cudnn.benchmark = False
PyTorch Ecosystem in 2026
| Tool | Purpose | Use |
|---|---|---|
| PyTorch Lightning | High-level training API | Clean research code |
| HuggingFace Transformers | Pre-trained models + Trainer | NLP/Vision/Speech |
| PEFT (LoRA, QLoRA) | Parameter-efficient fine-tuning | LLM fine-tuning |
| TRL (SFT, DPO, PPO) | LLM alignment training | RLHF workflows |
| timm | 700+ vision models | Image classification |
| PyTorch Geometric | Graph neural networks | GNN research |
| Triton | GPU kernel programming | Custom GPU kernels |
| vLLM | Fast LLM inference | Production LLM serving |
| torchao | Quantization and sparsity | Model compression |
FAQ
Q: What is the difference between .cuda() and .to('cuda')?
A: .to('cuda') is the modern API; it is more flexible (supports device strings like 'cuda:1', device objects, dtypes). .cuda() is older but still works. Prefer .to(device) where device = torch.device('cuda' if ... else 'cpu').
Q: Why does torch.nn.CrossEntropyLoss expect logits, not probabilities?
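A: It applies log_softmax followed by NLL loss internally, using the log-sum-exp trick for numerical stability. If you apply softmax yourself first, the loss applies softmax a second time and returns the wrong value, and computing log(softmax(x)) by hand can underflow to log(0). Always pass raw logits.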
Q: What is the difference between model.train() and model.eval()?
A: model.train() sets self.training = True on all modules; model.eval() sets it to False. This affects BatchNorm (train: batch stats; eval: running stats) and Dropout (train: active; eval: disabled). Neither call touches gradient tracking; use torch.no_grad() or torch.inference_mode() for that.
Q: When should I use torch.compile vs TorchScript?
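A: Use torch.compile for research and training acceleration (easier, more powerful). TorchScript targets mobile deployment or C++ inference where you need Python-free execution, but it is no longer under active development, so check torch.export before starting new deployment work.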
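On the CrossEntropyLoss question above, the stability argument is easy to reproduce without PyTorch. The sketch below compares a naive softmax-then-log cross-entropy with a log-sum-exp version in plain Python; the function names and logit values are illustrative only.

```python
import math


def naive_cross_entropy(logits, target):
    """Softmax first, then log: breaks down for large logits."""
    exps = [math.exp(z) for z in logits]  # math.exp overflows for large z
    return -math.log(exps[target] / sum(exps))


def stable_cross_entropy(logits, target):
    """Log-sum-exp form: subtract the max logit before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]


small = [2.0, 1.0, 0.1]
print(naive_cross_entropy(small, 0), stable_cross_entropy(small, 0))

large = [1000.0, 990.0, 0.0]
try:
    naive_cross_entropy(large, 0)
except OverflowError as err:
    print("naive version failed:", err)
print(stable_cross_entropy(large, 0))
```

Both versions agree on moderate logits, but only the log-sum-exp form survives logits around 1000. Working directly from logits lets the loss use the stable form, which is why it expects logits rather than probabilities.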
Methodology applied to this article · last verified 8 Jun 2026
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.