Sun, 27 Apr 2026
vol. IX · no. 117
PapersAdda
placement intelligence, since 2017

PyTorch Interview Questions 2026: 28 Answers with Code

27 min read
Interview Questions
Updated: 8 Jun 2026
Aditya Sharma
Aditya's Edit

PapersAdda 2026 Placement Cycle

By Aditya Sharma·Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for the 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.

What I'd actually study for this

  • 01 Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  • 02 One real project you can defend end-to-end - file paths, design decisions, and what you would change
  • 03 One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  • 04 Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.

PyTorch is the dominant framework for deep learning in 2026. It was created at Meta, OpenAI has publicly standardized on it, and it is widely used by ML teams at Microsoft and Amazon as well as across academic research. Google and DeepMind lean more heavily on JAX, but PyTorch fluency is still expected in most interviews. If you are interviewing for any ML engineering, ML research, or deep learning role, PyTorch proficiency is non-negotiable. This guide covers 28 PyTorch interview questions with complete code.

PapersAdda's take: PyTorch interviews test three things: Do you understand the tensor and autograd system? Can you write a clean nn.Module? Can you debug a training loop? The code in this guide is write-from-memory interview code, not tutorial snippets. Candidates report that the autograd computation graph and custom training loop questions appear in virtually every PyTorch-focused interview round. According to candidate accounts from public preparation resources, distributed training (DDP, FSDP) questions are increasingly common at senior ML engineer levels. Confirm the exact interview format and required skills on the official company careers portal.

Related articles: Deep Learning Interview Questions 2026 | TensorFlow Interview Questions 2026 | MLOps Interview Questions 2026 | Machine Learning Interview Questions 2026 | Computer Vision Interview Questions 2026


Which Companies Ask PyTorch Questions?

| Company | PyTorch Usage |
| --- | --- |
| Meta / Facebook AI | Created PyTorch; uses it extensively for research and production |
| Microsoft Azure AI | Widely used for ML research and Azure ML services |
| Amazon AWS | SageMaker, Bedrock; PyTorch is a first-class supported framework |
| Google DeepMind | Research is primarily JAX-based; PyTorch knowledge is still useful |
| OpenAI, Anthropic | Frontier-model training; OpenAI has publicly standardized on PyTorch |
| Indian unicorns (Zomato, Swiggy, Meesho, PhonePe) | ML teams use PyTorch |

EASY: Tensors and Core Concepts (Questions 1-10)

Q1. What is a PyTorch tensor? How is it different from a NumPy array?

| Property | PyTorch Tensor | NumPy Array |
| --- | --- | --- |
| GPU support | Yes (.to('cuda')) | No (CPU only) |
| Autograd | Yes (tracks gradients) | No |
| Memory sharing with NumPy | Yes (same memory if CPU) | Yes |
| Broadcasting | Yes | Yes |
| Sparse support | Yes | Limited |
| Mixed precision | Yes (float16, bfloat16) | Limited |

import torch
import numpy as np

# Creation
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
x_gpu = x.to('cuda')                            # move to GPU
x_np = x.numpy()                                # share memory with NumPy (CPU only)

# Zero-copy bridge
arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)                        # shares memory
arr[0] = 99
print(t)   # tensor([99., 2., 3.])  -- same memory

# Shapes and operations
a = torch.randn(3, 4)     # shape [3, 4]
print(a.shape, a.dtype, a.device)

# Reshape vs view
b = a.view(4, 3)           # view: shares storage, contiguous only
c = a.reshape(12)          # reshape: may copy if not contiguous
d = a.T.contiguous()       # make contiguous after transpose

# Indexing
print(a[:, 0])             # first column [3]
print(a[a > 0])            # boolean indexing
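
The view-versus-reshape distinction is easiest to see on a transposed tensor. The sketch below is illustrative only (the variable names are not from the original answer) and runs on CPU: `view()` refuses a non-contiguous layout, while `reshape()` quietly falls back to a copy.

```python
import torch

a = torch.arange(12.0).reshape(3, 4)   # contiguous, shape [3, 4]
at = a.T                                # transposed view, shape [4, 3], non-contiguous

print(a.is_contiguous(), at.is_contiguous())

# view() needs compatible strides, so it fails on the transposed tensor
try:
    at.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() falls back to a copy when a view is impossible
flat = at.reshape(12)
shares_storage = flat.data_ptr() == a.data_ptr()
print(view_failed, shares_storage)
```

In an interview, the follow-up is usually "when does reshape copy?"; checking `data_ptr()` as above is one quick way to find out.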

Q2. What is autograd? How does requires_grad work?

import torch

# requires_grad=True: track this tensor for gradient computation
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 4.0], requires_grad=True)

# Forward pass: every operation is recorded
z = x ** 2 + 3 * y + x * y
loss = z.sum()

# Backward pass: compute gradients via chain rule
loss.backward()

print("dL/dx:", x.grad)   # 2x + y = [2*2 + 1, 2*3 + 4] = [5, 10]
print("dL/dy:", y.grad)   # [3 + x, 3 + x]  = [5, 6]

# Stop tracking gradients (for inference or non-trainable parts)
with torch.no_grad():
    w = x * 2            # no graph built: w.requires_grad is False

# Detach a tensor from the computation graph
detached = z.detach()    # no grad_fn, shares data with z
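
A quick way to sanity-check autograd results by hand is a finite-difference estimate. The sketch below uses plain Python (no PyTorch) on the same function, sum(x^2 + 3y + x*y), with x = [2, 3] and y = [1, 4]; the helper names are illustrative. Analytically, dL/dx = 2x + y = [5, 10] and dL/dy = 3 + x = [5, 6].

```python
def loss_fn(x, y):
    # Same scalar function as the autograd example: sum(x^2 + 3y + x*y)
    return sum(xi ** 2 + 3 * yi + xi * yi for xi, yi in zip(x, y))

def numerical_grad(f, values, eps=1e-6):
    # Central difference: (f(v + eps) - f(v - eps)) / (2 * eps) per coordinate
    grads = []
    for i in range(len(values)):
        up = list(values)
        up[i] += eps
        down = list(values)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

x = [2.0, 3.0]
y = [1.0, 4.0]
dx = numerical_grad(lambda v: loss_fn(v, y), x)
dy = numerical_grad(lambda v: loss_fn(x, v), y)
print([round(g, 4) for g in dx])   # [5.0, 10.0]
print([round(g, 4) for g in dy])   # [5.0, 6.0]
```

PyTorch ships the same idea as `torch.autograd.gradcheck`, which is worth mentioning when asked how you would test a custom backward.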

Q3. What is the difference between .backward() and torch.autograd.grad()?

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 3).sum()   # y = x1^3 + x2^3 + x3^3

# Method 1: loss.backward() -> accumulates into x.grad
y.backward()
print("x.grad:", x.grad)   # [3*1^2, 3*2^2, 3*3^2] = [3, 12, 27]

# Method 2: autograd.grad() -> returns gradient tensors explicitly
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y2 = (x2 ** 3).sum()
grads = torch.autograd.grad(y2, x2)
print("autograd.grad:", grads[0])   # same result

# autograd.grad is useful when you need:
# 1. Gradients w.r.t. intermediate tensors (not model params)
# 2. Higher-order gradients (grad of grad)
# 3. Partial gradients (only some inputs)

# Higher-order: Hessian-vector product
x3 = torch.tensor([1.0, 2.0], requires_grad=True)
y3 = (x3 ** 2).sum()
grads3 = torch.autograd.grad(y3, x3, create_graph=True)[0]  # keep graph for 2nd order
hessian_v = torch.autograd.grad(grads3.sum(), x3)[0]
print("2nd order:", hessian_v)  # [2, 2] (Hessian of sum of x^2 is diag(2))
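
The example above multiplies the Hessian by a vector of ones. A slightly more general sketch, with an arbitrary direction `v` chosen purely for illustration, shows the usual double-backward recipe: differentiate the dot product of the gradient with `v`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, -1.0])       # arbitrary direction (illustrative)

f = (x ** 3).sum()
(grad,) = torch.autograd.grad(f, x, create_graph=True)   # 3x^2, still differentiable
(hvp,) = torch.autograd.grad(grad @ v, x)                # d/dx (grad . v) = H v

# H = diag(6x), so H v = [6*1*1, 6*2*0.5, 6*3*(-1)] = [6, 6, -18]
print(hvp)
```

The full Hessian is never materialized here, which is why this pattern scales to large models.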

Q4. What is nn.Module? What are the key methods you must implement?

nn.Module is the base class for every layer and model in PyTorch. Subclassing it gives you:

  • Parameter tracking (all nn.Parameter and sub-modules registered automatically)
  • training flag (affects BatchNorm, Dropout)
  • to(device) and to(dtype) for moving all params
  • state_dict() / load_state_dict() for saving/loading

Must implement: __init__ (define layers), forward (forward computation).

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int,
                  dropout: float = 0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim)
        )
        self._init_weights()

    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

model = FeedForward(784, 256, 10)
print(model)
print(f"Params: {sum(p.numel() for p in model.parameters()):,}")

# Move to GPU: all parameters moved
model.to('cuda')

# Separate trainable vs frozen parameters
frozen_params     = [p for p in model.parameters() if not p.requires_grad]
trainable_params  = [p for p in model.parameters() if p.requires_grad]
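
The trainable/frozen split only becomes interesting once something is actually frozen. A minimal sketch, assuming a toy two-layer model (not the FeedForward class above) where the first layer plays the role of a frozen backbone:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 16),   # "backbone" layer, frozen below
    nn.ReLU(),
    nn.Linear(16, 2),   # "head" layer, left trainable
)

# Freeze the first layer: autograd stops computing gradients for it
for p in model[0].parameters():
    p.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)          # 34 144
print(list(model.state_dict()))   # ['0.weight', '0.bias', '2.weight', '2.bias']
```

Frozen parameters still appear in `state_dict()`, so checkpoints stay complete; passing only the trainable ones to the optimizer avoids allocating optimizer state for weights that never change.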

Q5. How do you write a training loop in PyTorch?

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

def train_epoch(model, dataloader, optimizer, criterion, device, scaler=None):
    model.train()
    total_loss = 0.0

    for batch_x, batch_y in dataloader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        optimizer.zero_grad()   # ALWAYS zero before forward pass

        if scaler:
            # Mixed precision (AMP)
            with torch.autocast(device_type='cuda', dtype=torch.float16):
                output = model(batch_x)
                loss   = criterion(output, batch_y)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
        else:
            output = model(batch_x)
            loss   = criterion(output, batch_y)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)

def evaluate(model, dataloader, criterion, device):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            _, predicted = output.max(1)
            correct += (predicted == batch_y).sum().item()
            total += len(batch_y)
    return correct / total

# Full training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model  = FeedForward(784, 512, 10).to(device)
optimizer  = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion  = nn.CrossEntropyLoss()
scaler     = torch.amp.GradScaler('cuda') if device.type == 'cuda' else None   # FP16 loss scaling (GPU only)
scheduler  = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device, scaler)
    val_acc    = evaluate(model, val_loader, criterion, device)
    scheduler.step()
    print(f"Epoch {epoch+1}: loss={train_loss:.4f}, val_acc={val_acc:.4f}")
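
Interviewers sometimes ask what the scheduler in this setup actually does to the learning rate. The sketch below evaluates the closed-form cosine annealing schedule in plain Python, assuming the same base rate (1e-3) and T_max (30) as the setup above and a minimum rate of 0; the function name is illustrative.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, t_max=30, eta_min=0.0):
    # eta_t = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max))
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 15, 30):
    print(epoch, f"{cosine_lr(epoch):.6f}")
```

The rate starts at the base value, reaches half of it at the midpoint, and decays to the minimum at T_max.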

Q6. What is the PyTorch Dataset and DataLoader? How do you write a custom dataset?

from torch.utils.data import Dataset, DataLoader
import torch
import pandas as pd
import numpy as np
from pathlib import Path
from PIL import Image
import torchvision.transforms as T

class ImageClassificationDataset(Dataset):
    def __init__(self, csv_path: str, img_dir: str, transform=None):
        self.df        = pd.read_csv(csv_path)    # columns: filename, label
        self.img_dir   = Path(img_dir)
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row   = self.df.iloc[idx]
        img   = Image.open(self.img_dir / row['filename']).convert('RGB')
        label = torch.tensor(row['label'], dtype=torch.long)

        if self.transform:
            img = self.transform(img)

        return img, label

# Usage
train_transform = T.Compose([
    T.RandomResizedCrop(224), T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
val_transform = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

train_ds = ImageClassificationDataset('train.csv', 'images/', train_transform)
val_ds   = ImageClassificationDataset('val.csv',   'images/', val_transform)

# DataLoader: handles batching, shuffling, parallel loading
train_loader = DataLoader(
    train_ds, batch_size=64, shuffle=True,
    num_workers=4,           # parallel data loading (set to CPU cores)
    pin_memory=True,         # faster CPU->GPU transfer (if CUDA)
    persistent_workers=True, # keep workers alive across epochs
    drop_last=True           # drop last incomplete batch
)

Q7. What is gradient accumulation and why is it used?

import torch
import torch.nn as nn   # needed for nn.utils.clip_grad_norm_ below

ACCUMULATION_STEPS = 8   # effective batch = actual_batch * 8

optimizer.zero_grad()

for step, (batch_x, batch_y) in enumerate(dataloader):
    batch_x = batch_x.to(device)
    batch_y = batch_y.to(device)

    # Scale loss by accumulation steps (so effective gradient magnitude is same)
    loss = criterion(model(batch_x), batch_y) / ACCUMULATION_STEPS
    loss.backward()  # accumulate gradients; DON'T zero here

    if (step + 1) % ACCUMULATION_STEPS == 0:
        # All gradients accumulated; now clip and step
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()   # zero AFTER step

# With HuggingFace Trainer: use gradient_accumulation_steps=8 in TrainingArguments
# With PyTorch native: use above pattern
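
Why divide the loss by the number of accumulation steps? The check below is a small sketch on a toy linear model (all names and sizes are illustrative): with a mean-reduced loss and equal-sized micro-batches, the accumulated gradient matches the gradient of one large batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()            # mean reduction
X, Y = torch.randn(32, 4), torch.randn(32, 1)

# Reference: one backward pass over the full batch of 32
model.zero_grad()
criterion(model(X), Y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 8 micro-batches of 4, each loss divided by 8
steps = 8
model.zero_grad()
for xb, yb in zip(X.chunk(steps), Y.chunk(steps)):
    (criterion(model(xb), yb) / steps).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))   # True
```

The equivalence is exact only for layers without batch statistics; BatchNorm still sees the small micro-batch, which is a common follow-up question.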

Q8. How does PyTorch handle memory management on GPU? What causes OOM errors?

| Cause | Fix |
| --- | --- |
| Batch too large | Reduce batch size; use gradient accumulation |
| Storing activations unnecessarily | Use with torch.no_grad(): for inference |
| Tensor accumulation in Python list | Call .detach() or .item() on scalar tensors |
| Gradient checkpointing not used | Enable torch.utils.checkpoint.checkpoint() |
| Memory fragmentation | Call torch.cuda.empty_cache() (limited help) |
| Dead tensors still referenced | Delete tensors and call gc.collect() |

import torch
import gc

# OOM debug toolkit
def gpu_memory_stats():
    print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
    print(f"Reserved:  {torch.cuda.memory_reserved()/1e9:.2f} GB")
    print(f"Max alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

gpu_memory_stats()

# Common mistake: accumulating loss tensors
losses = []
for batch_x, batch_y in dataloader:
    loss = criterion(model(batch_x), batch_y)
    # BAD:  losses.append(loss)  keeps every batch's computation graph alive
    losses.append(loss.item())   # GOOD: Python float, no graph

# Proper inference (no gradients)
model.eval()
with torch.no_grad():        # no activations are saved for backward, so peak memory drops
    outputs = model(X_test)

# Release cache: drop dead Python references first, then free cached blocks
gc.collect()
torch.cuda.empty_cache()   # returns unused cached blocks to the driver; live tensors are unaffected
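
Before debugging an OOM, it helps to estimate how much memory the model state alone requires. The back-of-envelope sketch below assumes FP32 weights and gradients plus two FP32 Adam moment buffers (16 bytes per parameter in total) and ignores activations entirely; the function name and the example model sizes are illustrative.

```python
def training_state_gb(n_params, weight_bytes=4, grad_bytes=4, optim_bytes=8):
    """Weights + gradients + Adam moments in GB, excluding activations."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

for n in (125e6, 1.3e9, 7e9):
    print(f"{n / 1e9:.3f}B params -> {training_state_gb(n):.1f} GB")
```

If this number alone exceeds the GPU's memory, reducing the batch size will not help; sharding (FSDP), offloading, or a smaller optimizer state is needed instead.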

Q9. What is torch.nn.functional vs nn.Module? When do you use each?

| Form | State (weights) | When to Use |
| --- | --- | --- |
| nn.Module (e.g., nn.Linear) | Yes: stores weights as nn.Parameters | Modules with learnable parameters |
| nn.functional (e.g., F.linear) | No: pure function | Activations, loss functions, custom operations |

import torch.nn as nn
import torch.nn.functional as F

# As Module (has weights)
linear = nn.Linear(10, 5)
out = linear(x)   # stores W and b as parameters

# As functional (no weights; use separately defined parameters)
W = nn.Parameter(torch.randn(5, 10))
b = nn.Parameter(torch.zeros(5))
out = F.linear(x, W, b)   # identical computation

# Functional preferred for:
F.relu(x)                   # activation (no params)
F.dropout(x, p=0.3, training=self.training)   # training-mode-aware dropout
F.cross_entropy(logits, labels)               # loss
F.softmax(x, dim=-1)

# Module preferred for:
self.conv   = nn.Conv2d(3, 64, 3)    # has learnable weights
self.bn     = nn.BatchNorm2d(64)     # has gamma, beta, running stats
self.embed  = nn.Embedding(10000, 128)  # embedding table

# The hybrid pattern (most common in practice)
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn   = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)), inplace=True)

Q10. What is torch.no_grad() vs torch.inference_mode()? What is detach()?

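| Method | Autograd disabled | Output usable in autograd later | Speed |
| --- | --- | --- | --- |
| torch.no_grad() | Yes | Yes (ordinary tensors with requires_grad=False) | Moderate |
| torch.inference_mode() | Yes, plus no view or version-counter tracking | No (inference tensors cannot be saved for backward) | Usually slightly faster |
| .detach() | Yes, for that tensor only | Yes (as a constant; shares data) | No overhead |
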
import torch

x = torch.randn(100, 100, requires_grad=True)

# no_grad: context manager; preferred for eval loops
with torch.no_grad():
    y = model(x)   # no grad computation

# inference_mode: stronger; preferred for pure inference (2026 standard)
with torch.inference_mode():
    y = model(x)   # fastest; cannot be used as input to grad computation later

# detach: cuts a tensor from the graph
y = model(x)
y_detached = y.detach()   # y_detached.grad_fn is None
# Use for: discriminator in GAN (detach generator output)
# Use for: target network in RL (detach target model outputs)

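# Timing comparison (inference_mode is usually marginally faster; the gap is often small.
# On GPU, call torch.cuda.synchronize() before reading the clock for a fair measurement)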
import timeit
with torch.no_grad():
    t_no_grad = timeit.timeit(lambda: model(x), number=100)
with torch.inference_mode():
    t_inf     = timeit.timeit(lambda: model(x), number=100)
print(f"no_grad: {t_no_grad:.2f}s  inference_mode: {t_inf:.2f}s")
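
The practical difference between the two context managers shows up when an output is reused later. In this CPU-only sketch (tensor names are illustrative), a tensor produced under `no_grad` can still feed a later autograd computation, while one produced under `inference_mode` cannot.

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

with torch.no_grad():
    y_ng = x * 2
with torch.inference_mode():
    y_inf = x * 2

print(y_ng.is_inference(), y_inf.is_inference())   # False True

# A no_grad output can feed a later autograd computation as a constant
(w * y_ng).sum().backward()

# An inference tensor cannot be saved for backward
try:
    (w * y_inf).sum().backward()
    inference_blocked = False
except RuntimeError:
    inference_blocked = True
print(inference_blocked)   # True
```

If an inference tensor is needed in training code after all, `y_inf.clone()` outside the context gives back a normal tensor.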

MEDIUM: Model Building and Training (Questions 11-20)

Q11. How do you implement a ResNet block from scratch?

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    expansion = 1  # for BasicBlock; BottleneckBlock uses 4

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                                stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                                padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)

        # Skip connection: project if shape changes
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # residual connection

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super().__init__()
        self.in_channels = 64
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1)
        )
        self.layer1 = self._make_layer(block, 64,  layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.fc      = nn.Linear(512, num_classes)

    def _make_layer(self, block, out_ch, n_blocks, stride):
        layers = [block(self.in_channels, out_ch, stride)]
        self.in_channels = out_ch
        for _ in range(1, n_blocks):
            layers.append(block(out_ch, out_ch))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        return self.fc(self.avgpool(x).flatten(1))

resnet18 = ResNet(ResBlock, [2, 2, 2, 2], num_classes=1000)
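
A common follow-up is to trace the feature-map size through the network. The sketch below applies the standard convolution output formula, floor((n + 2p - k) / s) + 1, to a 224x224 input using the kernel, stride, and padding values from the code above; the helper name is illustrative.

```python
def conv_out(size, kernel, stride, padding):
    # Output size of a conv or pooling layer along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)   # stem conv -> 112
size = conv_out(size, kernel=3, stride=2, padding=1)   # max pool  -> 56
sizes = [size]
for stride in (1, 2, 2, 2):                            # first 3x3 conv of layer1..layer4
    size = conv_out(size, kernel=3, stride=stride, padding=1)
    sizes.append(size)
print(sizes)   # [56, 56, 28, 14, 7]
```

The final 7x7 map is what `AdaptiveAvgPool2d((1, 1))` collapses to a single 512-dimensional vector per image.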

Q12. How do you implement a custom loss function in PyTorch?

import torch
import torch.nn as nn
import torch.nn.functional as F

# Option 1: Simple function
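def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss for class imbalance (targets are 0/1)."""
    targets = targets.float()
    probs = torch.sigmoid(logits)
    ce    = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t     = probs * targets + (1 - probs) * (1 - targets)   # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # alpha for positives, 1 - alpha for negatives
    fl    = alpha_t * (1 - p_t) ** gamma * ce
    return fl.mean()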

# Option 2: nn.Module (better: maintains state, can be part of model)
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, epsilon: float = 0.1):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, logits, targets):
        n_classes = logits.size(-1)
        log_probs = F.log_softmax(logits, dim=-1)

        # Smooth targets: (1-eps)*one_hot + eps/n_classes
        with torch.no_grad():
            smooth = log_probs.new_full(log_probs.shape, self.epsilon / n_classes)
            smooth.scatter_(-1, targets.unsqueeze(-1), 1 - self.epsilon + self.epsilon / n_classes)

        return -(smooth * log_probs).sum(dim=-1).mean()

# Option 3: Custom autograd function (for non-standard gradients)
class StraightThroughEstimator(torch.autograd.Function):
    """STE: quantize in forward, pass gradient straight through in backward."""
    @staticmethod
    def forward(ctx, x):
        return x.round()   # quantize to nearest integer

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # gradient flows straight through
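
One way to validate a hand-written loss is to compare it with a built-in equivalent. The sketch below re-implements the label-smoothing formula as a standalone function (the name `smoothed_ce` and the random inputs are illustrative) and checks it against `F.cross_entropy` with its `label_smoothing` argument.

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, epsilon=0.1):
    # Target distribution: (1 - eps) * one_hot + eps / n_classes
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, epsilon / n_classes)
    smooth.scatter_(-1, targets.unsqueeze(-1), 1 - epsilon + epsilon / n_classes)
    return -(smooth * log_probs).sum(dim=-1).mean()

torch.manual_seed(0)
logits = torch.randn(16, 5)
targets = torch.randint(0, 5, (16,))

manual = smoothed_ce(logits, targets)
builtin = F.cross_entropy(logits, targets, label_smoothing=0.1)
print(torch.allclose(manual, builtin, atol=1e-6))   # True
```

In production code the built-in argument is usually the better choice; writing it by hand is mainly an interview exercise.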

Q13. What is the difference between nn.Sequential, nn.ModuleList, and nn.ModuleDict?

import torch
import torch.nn as nn

# nn.Sequential: ordered chain; supports indexing, and forward feeds one input through each module in turn
seq = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 10)
)
out = seq(x)   # applies each module in order

# nn.ModuleList: list of modules; NOT a chain (no auto-forward)
class MultiHeadModel(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # Registers all Linear layers as parameters of the parent module
        self.heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(n_heads)])

    def forward(self, x):
        return torch.stack([head(x) for head in self.heads], dim=1)

# nn.ModuleDict: dict of named modules; useful for conditional routing
class MoELayer(nn.Module):
    def __init__(self, experts_dict):
        super().__init__()
        self.experts = nn.ModuleDict(experts_dict)  # all params registered

    def forward(self, x, expert_name):
        return self.experts[expert_name](x)

# WRONG: plain Python list -- modules NOT registered, won't be saved
class BrokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(64, 64) for _ in range(3)]   # NOT registered!
        # Fix: self.layers = nn.ModuleList([nn.Linear(64,64) for _ in range(3)])
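
The registration bug is easy to demonstrate by counting parameters. In this sketch the two class names are illustrative; the only difference between them is the container type.

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(3)]   # invisible to nn.Module

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

n_plain = sum(p.numel() for p in PlainList().parameters())
n_registered = sum(p.numel() for p in Registered().parameters())
print(n_plain, n_registered)   # 0 60
```

With zero registered parameters, the optimizer has nothing to update, `.to(device)` leaves the layers behind, and `state_dict()` saves nothing.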

Q14. How do you implement mixed precision training in PyTorch?

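import torch
import torch.nn as nn   # needed for nn.utils.clip_grad_norm_ below
# Note: recent PyTorch releases deprecate torch.cuda.amp in favour of
# torch.autocast(device_type='cuda', ...) and torch.amp.GradScaler('cuda')
from torch.cuda.amp import autocast, GradScaler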

model = MyModel().to('cuda')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler()   # manages FP16 loss scaling

for epoch in range(n_epochs):
    for batch_x, batch_y in dataloader:
        batch_x = batch_x.to('cuda')
        batch_y = batch_y.to('cuda')

        optimizer.zero_grad()

        # autocast: ops run in float16 where safe, float32 otherwise
        with autocast(dtype=torch.float16):
            output = model(batch_x)
            loss   = criterion(output, batch_y)

        # Scale loss to prevent underflow in FP16 gradients
        scaler.scale(loss).backward()

        # Unscale gradients before clipping
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Step optimizer (unscales gradients internally if not already)
        scaler.step(optimizer)

        # Update the loss scale for next iteration
        scaler.update()

# BFloat16 (preferred on A100/H100: no loss scaling needed)
with autocast(dtype=torch.bfloat16):
    output = model(batch_x)
    loss = criterion(output, batch_y)
loss.backward()   # no scaler needed for bfloat16
optimizer.step()
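
Why does FP16 need loss scaling while BF16 does not? The sketch below casts a small, illustrative gradient value to each dtype on CPU. The scale of 2**16 is an assumption chosen to match a typical initial loss scale.

```python
import torch

tiny_grad = 1e-8                      # illustrative gradient magnitude
scale = 65536.0                       # 2**16, a typical initial loss scale

unscaled = torch.tensor(tiny_grad, dtype=torch.float16)
scaled = torch.tensor(tiny_grad * scale, dtype=torch.float16)
as_bf16 = torch.tensor(tiny_grad, dtype=torch.bfloat16)

print(unscaled.item())                # 0.0: underflows in FP16
print(scaled.item() > 0)              # True: survives once scaled
print(as_bf16.item() > 0)             # True: BF16 keeps FP32's exponent range
print(scaled.float().item() / scale)  # roughly 1e-8 after unscaling in FP32
```

The scaler multiplies the loss so that small gradients land inside FP16's representable range, then divides them back out in FP32 before the optimizer step.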

Q15. How do you implement a Transformer encoder from scratch in PyTorch?

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k    = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        B, T, D = x.shape
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scale = math.sqrt(self.d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        weights = self.dropout(F.softmax(scores, dim=-1))
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_o(out)

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int,
                  dropout: float = 0.1):
        super().__init__()
        self.attn  = MultiHeadAttention(d_model, n_heads, dropout)
        self.ff    = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop  = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm architecture (more stable than post-norm)
        x = x + self.drop(self.attn(self.norm1(x), mask=mask))
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

class TransformerEncoder(nn.Module):
    def __init__(self, n_layers, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask=mask)
        return self.norm(x)
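
The `mask` argument above expects 1 where attention is allowed and 0 where it is blocked. A minimal sketch of a causal mask, using uniform scores and an illustrative sequence length of 4, shows what the `masked_fill` plus softmax combination produces.

```python
import torch
import torch.nn.functional as F

T = 4
mask = torch.tril(torch.ones(T, T))          # 1 = may attend, 0 = blocked
scores = torch.zeros(1, 1, T, T)             # uniform scores, shape [B, heads, T, T]
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

# Row i spreads attention evenly over positions 0..i and gives 0 to the future
print(weights[0, 0])
```

Because the mask has shape [T, T], it broadcasts across the batch and head dimensions; padding masks are usually shaped [B, 1, 1, T] instead.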

Q16. How do you save and load models in PyTorch?

import torch

# Recommended: save only state_dict (portable; rebuild the same architecture before loading)
torch.save(model.state_dict(), 'model_weights.pth')
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu', weights_only=True))

# Full model save (pickles the whole object; tied to the class definition and file layout)
torch.save(model, 'full_model.pth')
model_loaded = torch.load('full_model.pth', weights_only=False)   # only for files you trust

# Checkpoint (training state: model + optimizer + epoch)
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': best_val_loss,
    'config': {'d_model': 512, 'n_heads': 8}
}
torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pth')

# Resume training
ckpt = torch.load('checkpoint_epoch_10.pth', map_location=device)
model.load_state_dict(ckpt['model_state_dict'])
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
scheduler.load_state_dict(ckpt['scheduler_state_dict'])
start_epoch = ckpt['epoch'] + 1

# Common mistake: map_location
# If model was saved on GPU but you're loading on CPU:
ckpt = torch.load('model.pth', map_location=torch.device('cpu'))
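
A state_dict round trip can be tested without touching the disk. This sketch uses an in-memory buffer as a stand-in for a file and a toy `nn.Linear` as the model; both are illustrative choices.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
buffer = io.BytesIO()                 # stands in for a file on disk
torch.save(model.state_dict(), buffer)
buffer.seek(0)

restored = nn.Linear(4, 2)            # the same architecture must be rebuilt first
state = torch.load(buffer, map_location='cpu', weights_only=True)
result = restored.load_state_dict(state, strict=True)

print(result)                                        # <All keys matched successfully>
print(torch.equal(model.weight, restored.weight))    # True
```

With `strict=True`, a missing or unexpected key raises immediately, which is usually what you want when resuming training.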

Q17. What is DataParallel vs DistributedDataParallel? Which should you use?

| Feature | DataParallel (DP) | DistributedDataParallel (DDP) |
| --- | --- | --- |
| Multi-GPU support | Single-machine only | Multi-machine + multi-GPU |
| GIL bottleneck | Yes (one process, one thread per GPU) | No (one process per GPU) |
| Memory efficiency | Worse (outputs and gradients gathered on one GPU) | Better (gradients all-reduced across ranks) |
| Speed | Sub-linear; often well short of 4x on 4 GPUs | Near-linear scaling in typical setups |
| Complexity | Simple | Needs init_process_group |
| 2026 recommendation | Use only for quick debugging | Production standard |

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size, model_class, dataset):
    setup(rank, world_size)

    model = model_class().to(rank)
    model = DDP(model, device_ids=[rank], output_device=rank)

    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank
    )
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler,
                             num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for epoch in range(n_epochs):
        sampler.set_epoch(epoch)   # shuffle differently each epoch
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(rank), batch_y.to(rank)
            loss = criterion(model(batch_x), batch_y)
            optimizer.zero_grad()
            loss.backward()        # DDP auto-synchronizes gradients
            optimizer.step()

    cleanup()

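# Launch: train_ddp takes rank explicitly, so start one process per GPU with
#   torch.multiprocessing.spawn(train_ddp, args=(world_size, model_class, dataset), nprocs=world_size)
# With torchrun (torchrun --nproc_per_node=4 train.py), read the rank from
# os.environ['LOCAL_RANK'] instead and let torchrun set MASTER_ADDR / MASTER_PORT.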
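
The role of `DistributedSampler` can be inspected without starting any processes, because passing `num_replicas` and `rank` explicitly avoids the need for an initialized process group. The sketch below uses a 12-element list as a stand-in dataset and disables shuffling so the split is easy to read.

```python
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(12))     # any object with __len__ works for this check
world_size = 4

# Build one sampler per simulated rank and collect the indices each would see
shards = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=r, shuffle=False))
    for r in range(world_size)
]
for r, shard in enumerate(shards):
    print(r, shard)
```

Each rank receives a disjoint, strided slice of the indices. When the dataset size is not divisible by the world size, the sampler pads by repeating samples unless `drop_last=True` is set.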

Q18. How does PyTorch handle sparse tensors? When are they useful?

import torch

# Sparse COO tensor (row indices, col indices, values)
indices = torch.tensor([[0, 1, 2], [1, 0, 2]])   # shape [2, nnz]
values  = torch.tensor([3.0, 4.0, 5.0])
sparse_tensor = torch.sparse_coo_tensor(
    indices, values, size=(3, 3), dtype=torch.float32
)

# Convert to dense for visualization
print(sparse_tensor.to_dense())

# Where sparsity helps: very high-dimensional inputs or weights that are mostly zero
# Store only the non-zero entries instead of a full n x m matrix
# Example: word embeddings in NLP (vocabulary = 100K words, embed_dim = 128)
# Dense table: 100K * 128 * 4 bytes = 51.2 MB (acceptable)
# For a 1M vocab, nn.Embedding(..., sparse=True) keeps the table dense but makes
# its gradients sparse, so each step only touches the rows that were looked up

# SparseTensor for graph neural networks (adjacency matrix)
row = torch.tensor([0, 1, 2, 1])
col = torch.tensor([1, 0, 3, 2])
edge_index = torch.stack([row, col])
edge_weight = torch.ones(4)
adj = torch.sparse_coo_tensor(edge_index, edge_weight, (4, 4))
# Matrix multiply: message passing in GNNs
node_features = torch.randn(4, 16)
messages = torch.sparse.mm(adj, node_features)
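
A sparse matmul should give the same answer as its dense counterpart while storing far fewer values. This sketch reuses the 4-node adjacency pattern from above with random, illustrative node features.

```python
import torch

torch.manual_seed(0)
indices = torch.tensor([[0, 1, 2, 1], [1, 0, 3, 2]])
adj = torch.sparse_coo_tensor(indices, torch.ones(4), (4, 4)).coalesce()
features = torch.randn(4, 16)

sparse_out = torch.sparse.mm(adj, features)
dense_out = adj.to_dense() @ features

print(torch.allclose(sparse_out, dense_out))             # True
print(adj.values().numel(), adj.to_dense().numel())      # 4 16
```

On a graph this small the dense version is faster; the sparse layout pays off once the adjacency matrix has millions of rows and only a few non-zeros per row.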

Q19. How do you profile a PyTorch model?

import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = MyModel().to('cuda')
x = torch.randn(64, 3, 224, 224).to('cuda')

# PyTorch Profiler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        with torch.inference_mode():
            out = model(x)

# View top operations by CUDA time
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))

# Export for Chrome tracing visualization
prof.export_chrome_trace('trace.json')

# Memory profiling
print(prof.key_averages().table(sort_by='self_cpu_memory_usage', row_limit=10))

# Quick FLOPs count (thop library)
from thop import profile as thop_profile
flops, params = thop_profile(model, inputs=(torch.randn(1, 3, 224, 224).cuda(),))
print(f"FLOPs: {flops/1e9:.2f} G  Params: {params/1e6:.2f} M")

Q20. What is TorchScript and how do you export a model for production?

import torch

# torch.jit.trace: fastest; works if control flow doesn't change with input
model.eval()
example_input = torch.randn(1, 3, 224, 224).cuda()
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# torch.jit.script: handles Python control flow (if/for based on tensors)
@torch.jit.script
def relu_if(x: torch.Tensor, threshold: float) -> torch.Tensor:
    if x.max() > threshold:
        return torch.relu(x)
    return x

class ScriptableModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(100, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All operations must be scriptable (no Python-only ops)
        return self.fc(x)

scripted = torch.jit.script(ScriptableModel())
scripted.save('scripted_model.pt')

# ONNX export (most portable, TensorRT-compatible)
torch.onnx.export(
    model, example_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
    opset_version=17
)
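
Exporting is only half the story: interviewers often ask how the saved file is consumed. The sketch below traces a small stand-in model, saves it to a temporary directory, reloads it with `torch.jit.load`, and checks that the outputs match. The `nn.Sequential` model and the temp path are assumptions for illustration; it runs on CPU.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for a real model
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2)).eval()
example_input = torch.randn(1, 8)

traced = torch.jit.trace(model, example_input)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_traced.pt")
    traced.save(path)
    # torch.jit.load needs only the file, not the Python class definition
    loaded = torch.jit.load(path, map_location="cpu")

loaded.eval()
with torch.no_grad():
    outputs_match = torch.allclose(model(example_input), loaded(example_input))
print(outputs_match)
```

Comparing the reloaded module against the eager model is a cheap regression test to run before shipping any exported artifact.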

HARD: Advanced PyTorch (Questions 21-28)

Q21. How do you implement FSDP (Fully Sharded Data Parallel) in PyTorch?

import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import (
    CPUOffload, FullStateDictConfig, ShardingStrategy, StateDictType
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.nn import TransformerEncoderLayer

# Initialize distributed (launch with torchrun, which sets LOCAL_RANK)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Auto-wrap policy: wrap each TransformerEncoderLayer individually
auto_wrap = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerEncoderLayer}
)

model = LargeTransformerModel()   # placeholder for your own model class
model = FSDP(
    model,
    auto_wrap_policy=auto_wrap,
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # ZeRO-3 equivalent
    cpu_offload=CPUOffload(offload_params=True),      # offload params to RAM
    mixed_precision=None,
    device_id=local_rank
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# No GradScaler is needed while mixed_precision=None. For fp16 training with FSDP,
# use torch.distributed.fsdp.sharded_grad_scaler.ShardedGradScaler instead of
# torch.cuda.amp.GradScaler.

# Saving with FSDP (must use FSDP-aware save)
# state_dict() is a collective call: every rank must enter it
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    state = model.state_dict()
    if dist.get_rank() == 0:   # global rank 0; local_rank is 0 on every node
        torch.save(state, 'fsdp_checkpoint.pth')
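
A follow-up question is usually "how much memory does full sharding actually save?". The back-of-the-envelope sketch below makes the arithmetic explicit. It assumes fp32 parameters and gradients plus AdamW's two fp32 moment buffers (16 bytes per parameter in total) and ignores activations, buffers and fragmentation; the 7B parameter count and 8-GPU world size are illustrative numbers, and `model_state_gb` is a hypothetical helper.

```python
def model_state_gb(n_params: float, world_size: int = 1) -> float:
    """Estimate per-GPU memory (GB) for parameters, gradients and AdamW state.

    Assumes fp32 everywhere: 4 bytes (param) + 4 (grad) + 8 (two Adam moments).
    With full sharding, each rank holds 1/world_size of all three.
    Activations, buffers and fragmentation are ignored.
    """
    bytes_per_param = 4 + 4 + 8
    return n_params * bytes_per_param / world_size / 1e9

unsharded = model_state_gb(7e9)               # every rank holds everything (DDP-style)
sharded = model_state_gb(7e9, world_size=8)   # FULL_SHARD across 8 GPUs

print(f"unsharded: {unsharded:.0f} GB per GPU, sharded: {sharded:.0f} GB per GPU")
```

Under these assumptions the model state drops from about 112 GB to about 14 GB per GPU, which is the difference between not fitting on any single accelerator and fitting comfortably.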

Q22. What is custom autograd in PyTorch? When do you need it?

Write a custom torch.autograd.Function when:

  1. The operation is non-differentiable but you have a surrogate gradient (STE, REINFORCE)
  2. The default backward is numerically unstable
  3. You want a custom gradient for efficiency

import torch

class ClampSTE(torch.autograd.Function):
    """Clamp + Straight-Through Estimator gradient."""
    @staticmethod
    def forward(ctx, x, min_val, max_val):
        ctx.save_for_backward(x)
        ctx.min_val = min_val
        ctx.max_val = max_val
        return x.clamp(min_val, max_val)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # Pass gradient through wherever x is in [min, max]; zero otherwise
        mask = (x >= ctx.min_val) & (x <= ctx.max_val)
        return grad_output * mask.float(), None, None

# Usage
x = torch.randn(10, requires_grad=True)
y = ClampSTE.apply(x, -1.0, 1.0)
y.sum().backward()
print(x.grad)   # non-zero only where -1 < x < 1

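# Numerically stable log-softmax (via log-sum-exp)
class StableLogSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # x - logsumexp(x) avoids log(0) when a softmax entry underflows
        log_probs = x - torch.logsumexp(x, dim=-1, keepdim=True)
        ctx.save_for_backward(log_probs.exp())   # softmax, needed in backward
        return log_probs

    @staticmethod
    def backward(ctx, grad):
        softmax, = ctx.saved_tensors
        # Jacobian-vector product of log_softmax: grad - softmax * sum(grad)
        return grad - softmax * grad.sum(dim=-1, keepdim=True)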
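
A hand-written backward is easy to get subtly wrong, so it is worth checking against finite differences with `torch.autograd.gradcheck`. The sketch below does this for a hypothetical `LogSoftmaxFn` written in the log-sum-exp form; the class name, input shape and tolerances are assumptions chosen for illustration.

```python
import torch

class LogSoftmaxFn(torch.autograd.Function):
    """Hypothetical custom log-softmax with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        log_probs = x - torch.logsumexp(x, dim=-1, keepdim=True)
        ctx.save_for_backward(log_probs.exp())
        return log_probs

    @staticmethod
    def backward(ctx, grad):
        softmax, = ctx.saved_tensors
        return grad - softmax * grad.sum(dim=-1, keepdim=True)

torch.manual_seed(0)
# gradcheck compares against finite differences, so it needs float64 inputs
x = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
backward_ok = torch.autograd.gradcheck(LogSoftmaxFn.apply, (x,), eps=1e-6, atol=1e-4)
print(backward_ok)
```

`gradcheck` raises an error describing the mismatch when the analytical and numerical gradients disagree, which makes it a convenient unit test for any custom Function.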

Q23. How do you implement a learning rate warmup + cosine decay scheduler?

import torch.optim as optim
import math

class WarmupCosineScheduler(optim.lr_scheduler.LRScheduler):
    def __init__(self, optimizer, warmup_steps, total_steps, min_lr=1e-6, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.total_steps  = total_steps
        self.min_lr       = min_lr
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        step = self.last_epoch
        if step < self.warmup_steps:
            # Linear warmup: LR goes from 0 to base_lr
            factor = step / max(1, self.warmup_steps)
            return [base_lr * factor for base_lr in self.base_lrs]
        # Cosine decay: LR goes from base_lr to min_lr (an absolute LR, not a ratio)
        progress = (step - self.warmup_steps) / max(1, self.total_steps - self.warmup_steps)
        progress = min(1.0, progress)   # hold at min_lr after total_steps
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return [self.min_lr + (base_lr - self.min_lr) * cosine for base_lr in self.base_lrs]

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = WarmupCosineScheduler(optimizer, warmup_steps=500, total_steps=10000)

# Alternatively, use HuggingFace transformers scheduler
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10000
)
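
To sanity-check a schedule before a long run, it helps to evaluate the formula on a few steps without any optimizer at all. The pure-Python sketch below assumes `min_lr` is an absolute learning-rate floor (not a fraction of the base LR) and reuses the example values from above; `warmup_cosine_lr` is a hypothetical helper.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=1e-6):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)          # stay at min_lr past total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

for step in (0, 250, 500, 5250, 10000):
    lr = warmup_cosine_lr(step, base_lr=3e-4, warmup_steps=500, total_steps=10000)
    print(f"step {step:>5}: lr = {lr:.3e}")
```

The rate should start at zero, hit the base LR exactly at the end of warmup, pass roughly the midpoint halfway through the decay, and end at `min_lr`.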

Q24. How do you use PyTorch for Graph Neural Networks (GNNs)?

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, global_mean_pool
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader   # moved out of torch_geometric.data in PyG 2.x

# Basic GCN layer (manual message passing)
class GCNLayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.linear = nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        """
        x: [N, in_channels] node features
        edge_index: [2, E] edge list (sparse adjacency)
        """
        # Aggregate: sum over neighbors via scatter_add
        row, col = edge_index
        out = torch.zeros_like(x)
        out.scatter_add_(0, col.unsqueeze(-1).expand_as(x[row]), x[row])
        # Normalize by degree
        deg = torch.zeros(x.size(0), device=x.device)
        deg.scatter_add_(0, col, torch.ones(col.size(0), device=x.device))
        deg = deg.clamp(min=1).unsqueeze(-1)
        return F.relu(self.linear(out / deg))

# Using PyG (PyTorch Geometric) -- standard library for GNNs
class GraphClassifier(nn.Module):
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GATConv(hidden_channels, hidden_channels, heads=4, concat=False)
        self.fc = nn.Linear(hidden_channels, num_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.3, training=self.training)
        x = F.relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        x = global_mean_pool(x, batch)   # aggregate nodes per graph
        return self.fc(x)
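
The aggregation step is the part candidates most often get wrong on a whiteboard, so here is the same idea on a graph small enough to check by hand. This sketch uses plain PyTorch (no PyG) with `index_add_` and `torch.bincount` in place of `scatter_add_`; the 3-node graph and its feature values are made up for illustration.

```python
import torch

# 3 nodes with 1-d features; edges point from `row` (source) to `col` (target)
x = torch.tensor([[1.0], [2.0], [4.0]])
row = torch.tensor([0, 2, 1])
col = torch.tensor([1, 1, 0])

# Sum the source features into each target node
summed = torch.zeros_like(x)
summed.index_add_(0, col, x[row])

# In-degree per node; clamp so isolated nodes divide by 1 instead of 0
deg = torch.bincount(col, minlength=x.size(0)).clamp(min=1).unsqueeze(-1)

mean_agg = summed / deg
print(mean_agg.squeeze(-1).tolist())   # [2.0, 2.5, 0.0]
```

Node 0 receives only node 1's feature (2.0), node 1 averages nodes 0 and 2 ((1 + 4) / 2 = 2.5), and node 2 has no incoming edges so it stays at 0.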

Q25. What is the PyTorch compile() API? How does torch.compile work?

import torch

model = MyModel().cuda()

# Compile: first call takes longer (compilation), subsequent calls are faster
compiled_model = torch.compile(model, mode='reduce-overhead')
# Modes:
# 'default': good general-purpose optimization
# 'reduce-overhead': minimize CUDA launch overhead (good for small batch)
# 'max-autotune': autotune Triton kernels (slowest compile, fastest runtime)
# 'max-autotune-no-cudagraphs': max-autotune without CUDA graphs

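# Speedups of roughly 1.5-3x are often quoted for transformer models, but the
# gain depends on model, batch size and hardware; measure your own workload
x = torch.randn(32, 3, 224, 224, device='cuda')
import time
compiled_model(x); torch.cuda.synchronize()   # warm-up call: includes compilation
t = time.perf_counter(); compiled_model(x); torch.cuda.synchronize()
print(f"Compiled (after warm-up): {time.perf_counter()-t:.3f}s")

# torch.compile is compatible with DDP and FSDP: wrap first, then compile
# (assumes torch.distributed is already initialized)
from torch.nn.parallel import DistributedDataParallel as DDP
ddp_model = torch.compile(DDP(model))

# torch.compile with dynamic shapes
compiled_dynamic = torch.compile(model, dynamic=True)
# Reduces recompilation when input shapes (e.g. sequence length) vary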
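
Because the first call to a compiled model includes compilation time, any fair comparison has to exclude warm-up calls and, on GPU, synchronize before reading the clock. The helper below is a minimal sketch of that pattern using only the standard library; `time_call` and its `sync` parameter are hypothetical names, and the stand-in workload exists only so the example runs anywhere.

```python
import time

def time_call(fn, *args, warmup=3, iters=10, sync=None):
    """Average seconds per call of `fn(*args)`, excluding warm-up calls.

    `sync` is an optional zero-argument callable run before each clock read;
    on GPU pass torch.cuda.synchronize so queued kernels are included.
    """
    for _ in range(warmup):        # first calls may trigger compilation
        fn(*args)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters

# Stand-in workload so the sketch runs anywhere
calls = []
avg = time_call(lambda n: calls.append(sum(range(n))), 10_000, warmup=2, iters=5)
print(f"{avg * 1e6:.1f} us per call over {len(calls)} total calls")
```

With a real model you would call something like `time_call(compiled_model, x, sync=torch.cuda.synchronize)` and the same for the eager model, then compare the two averages.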

Q26. How do you implement gradient checkpointing in a custom model?

import torch
import torch.nn as nn
from torch.nn import TransformerEncoderLayer
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

class CheckpointedTransformer(nn.Module):
    def __init__(self, n_layers, d_model, n_heads, d_ff):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            if self.training:
                # Recompute layer during backward instead of storing activations
                # use_reentrant=False: new API, more flexible
                x = checkpoint(layer, x, mask, use_reentrant=False)
            else:
                x = layer(x, mask)
        return x

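# Memory vs speed tradeoff
# With checkpointing, only each checkpointed layer's input is kept; the activations
# inside the layer are recomputed in backward, so activation memory drops sharply
# (the exact saving depends on the architecture)
# Cost: roughly one extra forward pass, i.e. about 33% more compute per step
# Rule of thumb: if the GPU runs out of memory, start by checkpointing every other layer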

# checkpoint_sequential: checkpoints every k layers
class SequentialModel(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        self.layers = nn.Sequential(*[
            TransformerEncoderLayer(512, 8, 2048) for _ in range(n_layers)
        ])

    def forward(self, x):
        if self.training:
            # Checkpoint in segments of ~2 layers (at least one segment)
            segments = max(1, len(self.layers) // 2)
            return checkpoint_sequential(self.layers, segments, x,
                                         use_reentrant=False)
        return self.layers(x)
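
A common follow-up is whether checkpointing changes the result. For deterministic layers it should not: only the memory/compute schedule differs. The sketch below checks this on CPU with a small hypothetical block standing in for a transformer layer; `input_grad` is an illustrative helper.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Hypothetical small block standing in for a transformer layer
block = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 16))
x = torch.randn(4, 16)

def input_grad(use_checkpoint: bool) -> torch.Tensor:
    """Gradient of sum(block(inp)) w.r.t. the input, with or without checkpointing."""
    inp = x.clone().requires_grad_(True)
    if use_checkpoint:
        out = checkpoint(block, inp, use_reentrant=False)
    else:
        out = block(inp)
    out.sum().backward()
    return inp.grad

block.zero_grad()
plain = input_grad(use_checkpoint=False)
block.zero_grad()
ckpt = input_grad(use_checkpoint=True)
grads_match = torch.allclose(plain, ckpt)
print(grads_match)
```

Layers with randomness (such as Dropout) rely on PyTorch restoring the RNG state during recomputation, which `checkpoint` does by default.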

Q27. How do you implement a custom optimizer in PyTorch?

import torch
from torch.optim import Optimizer

class Lion(Optimizer):
    """Lion optimizer (Chen et al., Google, 2023): sign SGD with momentum."""
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            beta1, beta2 = group['betas']
            wd = group['weight_decay']

            for p in group['params']:
                if p.grad is None:
                    continue

                grad = p.grad
                state = self.state[p]

                if len(state) == 0:
                    state['exp_avg'] = torch.zeros_like(p)

                exp_avg = state['exp_avg']

                # Weight decay (decoupled), applied to the pre-update weights
                p.mul_(1 - lr * wd)

                # Update: sign of interpolated momentum
                update = exp_avg * beta1 + grad * (1 - beta1)
                p.add_(torch.sign(update), alpha=-lr)

                # Update momentum
                exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)

        return loss

# Lion typically uses a 3-10x smaller LR than AdamW
# and a 3-10x larger weight_decay (so lr * weight_decay stays comparable)
optimizer = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.01)
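
The reason Lion wants a smaller learning rate becomes obvious once you trace a single update by hand: the sign operation means every coordinate moves by exactly `lr`, regardless of gradient magnitude. The scalar sketch below illustrates this in pure Python; `lion_step` and `sign` are hypothetical helpers that mirror the update rule for one parameter, not a replacement for the optimizer class.

```python
def sign(v: float) -> float:
    return (v > 0) - (v < 0)

def lion_step(param, grad, exp_avg, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
    """One Lion update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    param = param * (1 - lr * weight_decay)            # decoupled weight decay
    update = beta1 * exp_avg + (1 - beta1) * grad      # interpolate momentum and gradient
    param = param - lr * sign(update)                  # step size is always lr
    exp_avg = beta2 * exp_avg + (1 - beta2) * grad     # momentum tracked with beta2
    return param, exp_avg

p_small, m_small = lion_step(param=1.0, grad=0.001, exp_avg=0.0)
p_large, m_large = lion_step(param=1.0, grad=1000.0, exp_avg=0.0)
print(p_small, p_large)
```

Both parameters land on the same value even though the gradients differ by six orders of magnitude; only the momentum buffers differ. That uniform step size is why the LR acts more aggressively than in Adam.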

Q28. What are common PyTorch debugging patterns?

import torch
import torch.nn as nn

# 1. NaN/Inf detection
torch.autograd.set_detect_anomaly(True)   # enables full stack trace on NaN
# Warning: slows training significantly; disable in production

# 2. Check gradient flow
def check_gradients(model, threshold=1e-6):
    no_grad  = []
    nan_grad = []
    for name, param in model.named_parameters():
        if param.grad is None:
            no_grad.append(name)
        elif torch.isnan(param.grad).any():
            nan_grad.append(name)
        elif param.grad.abs().max() < threshold:
            print(f"Very small grad: {name} max={param.grad.abs().max():.2e}")
    if no_grad:  print("No gradient:", no_grad[:3])
    if nan_grad: print("NaN gradient:", nan_grad)

# 3. Overfit one batch (golden test)
x_one, y_one = next(iter(train_loader))
x_one, y_one = x_one.to(device), y_one.to(device)
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(x_one), y_one)
    loss.backward()
    optimizer.step()
    if step % 50 == 0: print(f"Step {step}: {loss.item():.6f}")
# If loss doesn't go near zero: architecture or loss bug

# 4. Shape debugging
class ShapeLogger(nn.Module):
    def __init__(self, name, wrapped):
        super().__init__()
        self.name = name
        self.module = wrapped

    def forward(self, x):
        out = self.module(x)
        print(f"{self.name}: in={tuple(x.shape)} out={tuple(out.shape)}")
        return out

# 5. Reproducibility
def set_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    import numpy as np, random
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # set True for speed if input size is fixed
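
Wrapping modules in a logger changes the model's structure (and its state_dict keys), so for shape debugging it can be less intrusive to attach forward hooks instead. A minimal sketch, assuming a small hypothetical `nn.Sequential` model; `make_hook`, `shapes` and `handles` are illustrative names.

```python
import torch
import torch.nn as nn

# Hypothetical model used only for illustration
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
shapes = []   # (layer name, input shape, output shape)

def make_hook(name):
    def hook(module, inputs, output):
        shapes.append((name, tuple(inputs[0].shape), tuple(output.shape)))
    return hook

# Attach a hook to each direct child; keep the handles so they can be removed
handles = [child.register_forward_hook(make_hook(name))
           for name, child in model.named_children()]

model(torch.randn(2, 8))

for handle in handles:
    handle.remove()   # detach the hooks once debugging is done

for entry in shapes:
    print(entry)
```

Removing the handles afterwards leaves the model exactly as it was, which matters if the same object goes on to training or export.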

PyTorch Ecosystem in 2026

| Tool | Purpose | Typical use |
|---|---|---|
| PyTorch Lightning | High-level training API | Clean research code |
| HuggingFace Transformers | Pre-trained models + Trainer | NLP/Vision/Speech |
| PEFT (LoRA, QLoRA) | Parameter-efficient fine-tuning | LLM fine-tuning |
| TRL (SFT, DPO, PPO) | LLM alignment training | RLHF workflows |
| timm | 700+ vision models | Image classification |
| PyTorch Geometric | Graph neural networks | GNN research |
| Triton | GPU kernel programming | Custom CUDA ops |
| vLLM | Fast LLM inference | Production LLM serving |
| torchao | Quantization and sparsity | Model compression |

FAQ

Q: What is the difference between .cuda() and .to('cuda')? A: .to('cuda') is the modern API; it is more flexible (supports device strings like 'cuda:1', device objects, dtypes). .cuda() is older but still works. Prefer .to(device) where device = torch.device('cuda' if ... else 'cpu').

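Q: Why does torch.nn.CrossEntropyLoss expect logits, not probabilities? A: It applies log_softmax followed by NLL loss internally, in a numerically stable way. Applying softmax yourself and then taking the log is less stable (log of very small numbers), and feeding already-softmaxed probabilities into CrossEntropyLoss applies softmax a second time and gives the wrong loss. Always pass raw logits.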
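
The stability argument is easy to demonstrate with two extreme logits. The sketch below is pure Python and only illustrates the idea; it is not PyTorch's internal implementation, and `naive_log_softmax` and `stable_log_softmax` are hypothetical helpers.

```python
import math

def naive_log_softmax(logits):
    """softmax first, then log: underflows to log(0) for very negative logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [math.log(p) if p > 0.0 else float("-inf") for p in probs]

def stable_log_softmax(logits):
    """log-sum-exp form: x - max - log(sum(exp(x - max)))."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(v - m) for v in logits))
    return [v - m - log_sum for v in logits]

logits = [1000.0, 0.0]
naive = naive_log_softmax(logits)
stable = stable_log_softmax(logits)
print(naive, stable)
```

The naive route loses the second class entirely (its probability underflows to 0, so the log is negative infinity), while the log-sum-exp route returns the finite value -1000. An infinite loss term is exactly what turns into NaN gradients during training.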

Q: What is the difference between model.train() and model.eval()? A: model.train() sets self.training = True on all modules; model.eval() sets it to False. This affects BatchNorm (train: batch stats; eval: running stats) and Dropout (train: active; eval: disabled).
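
The effect of the two modes is easiest to see on a bare Dropout layer. A minimal CPU sketch; the tensor size, seed and `p=0.5` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # each element is zeroed or scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)    # identity in eval mode

print(sorted(train_out.unique().tolist()), torch.equal(eval_out, x))
```

Forgetting `model.eval()` before validation is one of the most common reasons for noisy or pessimistic validation metrics.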

Q: When should I use torch.compile vs TorchScript? A: Use torch.compile for research and training acceleration: it is easier to apply and usually needs no changes to the model code. Use TorchScript when you need Python-free execution, such as mobile deployment or C++ inference.



Methodology applied to this article (last verified 8 Jun 2026)
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
