
PyTorch Interview Questions 2026: 28 Answers with Code

Q: What is the difference between .cuda() and .to('cuda')?

.to('cuda') is the modern API; it is more flexible (supports device strings like 'cuda:1', device objects, dtypes). .cuda() is older but still works. Prefer .to(device) where device = torch.device('cuda' if ... else 'cpu').
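A minimal sketch of the device-agnostic pattern described above. The tensor shape and variable names are illustrative assumptions, and the snippet falls back to CPU when no GPU is present.

```python
import torch

# Pick the device once and reuse it everywhere.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(4, 3)

# .to() accepts a device, a dtype, or both in a single call.
x_dev = x.to(device)
x_half = x.to(device, dtype=torch.float16)

print(x_dev.device, x_half.dtype)
```

Because the device is chosen in one place, the same script runs unchanged on a laptop and on a GPU machine.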

Q: Why does torch.nn.CrossEntropyLoss expect logits, not probabilities?

It applies log_softmax followed by NLL loss internally, in a numerically stable way. Applying softmax yourself and then taking the log is less stable (log of very small numbers), and passing probabilities into the loss applies softmax a second time. Always pass raw logits.
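The equivalence can be checked directly. This is a small sketch with made-up logits and targets: it compares nn.CrossEntropyLoss against an explicit log_softmax + nll_loss, and shows that feeding probabilities gives a different (wrong) value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)             # raw, unnormalized scores (illustrative)
targets = torch.tensor([0, 2, 1, 4])   # class indices

# Built-in loss on raw logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# The same computation spelled out
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Mistake: passing probabilities means softmax is applied twice
loss_wrong = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), targets)

print(loss_ce.item(), loss_manual.item(), loss_wrong.item())
```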

Q: What is the difference between model.train() and model.eval()?

model.train() sets self.training = True on all modules; model.eval() sets it to False. This affects BatchNorm (train: batch stats; eval: running stats) and Dropout (train: active; eval: disabled).
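A short sketch of the Dropout half of that behavior, using a toy two-layer model (an assumption for illustration). Note that model.eval() only flips the training flag; it does not disable gradient tracking, which is why torch.no_grad() is still used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.randn(2, 8)

model.train()                       # training=True on every submodule
train_a, train_b = model(x), model(x)   # each call samples a new dropout mask

model.eval()                        # training=False on every submodule
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)  # dropout acts as identity

print(torch.equal(train_a, train_b), torch.equal(eval_a, eval_b))
```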

Q: When should I use torch.compile vs TorchScript?

Use torch.compile for research and training acceleration (it is easier to adopt and more powerful). Use TorchScript for mobile deployment or C++ inference where you need Python-free execution.

28 PyTorch interview questions with complete code answers covering tensors, autograd, nn.Module, DataLoader, custom modules, distributed training, and deployment for 2026 interviews.

By Aditya Sharma | Published 8 Jun 2026 | Last revised 8 Jun 2026


PyTorch is the framework of choice for deep learning in 2026. It is widely used across top AI labs (Meta AI, OpenAI, Anthropic, DeepMind, Mistral), sits alongside JAX in Google research, and is a common default on Microsoft and Amazon ML teams. If you are interviewing for any ML engineering, ML research, or deep learning role, PyTorch proficiency is effectively non-negotiable. This guide covers 28 PyTorch interview questions with complete code.

PapersAdda's take: PyTorch interviews test three things: Do you understand the tensor and autograd system? Can you write a clean nn.Module? Can you debug a training loop? The code in this guide is write-from-memory interview code, not tutorial snippets. Candidates report that the autograd computation graph and custom training loop questions appear in virtually every PyTorch-focused interview round. According to candidate accounts from public preparation resources, distributed training (DDP, FSDP) questions are increasingly common at senior ML engineer levels. Confirm the exact interview format and required skills on the official company careers portal.


Which Companies Ask PyTorch Questions?

Company	PyTorch Usage
Meta / Facebook AI	Created PyTorch; uses internally for all research and production
Microsoft Azure AI	Standard for ML research, Azure ML services
Amazon AWS	SageMaker, Bedrock; PyTorch is default framework
Google DeepMind	Research uses PyTorch alongside JAX
OpenAI, Anthropic	Training frontier models on PyTorch
Indian unicorns (Zomato, Swiggy, Meesho, PhonePe)	ML teams use PyTorch

EASY: Tensors and Core Concepts (Questions 1-10)

Q1. What is a PyTorch tensor? How is it different from a NumPy array?

Property	PyTorch Tensor	NumPy Array
GPU support	Yes (`.to('cuda')`)	No (CPU only)
Autograd	Yes (tracks gradients)	No
Memory sharing with NumPy	Yes (same memory if CPU)	Yes
Broadcasting	Yes	Yes
Sparse support	Yes	Limited
Mixed precision	Yes (float16, bfloat16)	Limited

import torch
import numpy as np

# Creation
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
x_gpu = x.to('cuda')                            # move to GPU
x_np = x.numpy()                                # share memory with NumPy (CPU only)

# Zero-copy bridge
arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)                        # shares memory
arr[0] = 99
print(t)   # tensor([99., 2., 3.])  -- same memory

# Shapes and operations
a = torch.randn(3, 4)     # shape [3, 4]
print(a.shape, a.dtype, a.device)

# Reshape vs view
b = a.view(4, 3)           # view: shares storage, contiguous only
c = a.reshape(12)          # reshape: may copy if not contiguous
d = a.T.contiguous()       # make contiguous after transpose

# Indexing
print(a[:, 0])             # first column [3]
print(a[a > 0])            # boolean indexing
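The view/reshape distinction above is easiest to see on a transposed tensor. This is a small sketch with an illustrative 3x4 tensor: view succeeds only when the memory layout allows it, while reshape silently falls back to a copy.

```python
import torch

a = torch.arange(12).reshape(3, 4)

# Contiguous tensor: view returns an alias over the same storage.
v = a.view(4, 3)
shares = v.data_ptr() == a.data_ptr()

# Transpose changes strides only, so the result is non-contiguous.
t = a.T
try:
    t.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape works on non-contiguous input, but here it has to copy.
r = t.reshape(12)
r[0] = 100
copied = a[0, 0].item() == 0    # original is untouched, so r is a copy

print(shares, view_failed, copied)
```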

Q2. What is autograd? How does requires_grad work?

import torch

# requires_grad=True: track this tensor for gradient computation
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 4.0], requires_grad=True)

# Forward pass: every operation is recorded
z = x ** 2 + 3 * y + x * y
loss = z.sum()

# Backward pass: compute gradients via chain rule
loss.backward()

print("dL/dx:", x.grad)   # 2x + y = [5, 10]
print("dL/dy:", y.grad)   # 3 + x  = [5, 6]

# Stop tracking gradients (for inference or non-trainable parts)
with torch.no_grad():
    w = x * 2             # no graph built; w.requires_grad is False

# Detach a tensor from the computation graph
detached = z.detach()     # no grad_fn, shares data with z

Q3. What is the difference between `.backward()` and `torch.autograd.grad()`?

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 3).sum()   # y = x1^3 + x2^3 + x3^3

# Method 1: loss.backward() -> accumulates into x.grad
y.backward()
print("x.grad:", x.grad)   # [3*1^2, 3*2^2, 3*3^2] = [3, 12, 27]

# Method 2: autograd.grad() -> returns gradient tensors explicitly
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y2 = (x2 ** 3).sum()
grads = torch.autograd.grad(y2, x2)
print("autograd.grad:", grads[0])   # same result

# autograd.grad is useful when you need:
# 1. Gradients w.r.t. intermediate tensors (not model params)
# 2. Higher-order gradients (grad of grad)
# 3. Partial gradients (only some inputs)

# Higher-order: Hessian-vector product
x3 = torch.tensor([1.0, 2.0], requires_grad=True)
y3 = (x3 ** 2).sum()
grads3 = torch.autograd.grad(y3, x3, create_graph=True)[0]  # keep graph for 2nd order
hessian_v = torch.autograd.grad(grads3.sum(), x3)[0]
print("2nd order:", hessian_v)  # [2, 2] (Hessian of sum of x^2 is diag(2))

Q4. What is nn.Module? What are the key methods you must implement?

nn.Module is the base class for all neural network layers and models. It provides:

Parameter tracking (all nn.Parameter and sub-modules registered automatically)
training flag (affects BatchNorm, Dropout)
to(device) and to(dtype) for moving all params
state_dict() / load_state_dict() for saving/loading

Must implement: __init__ (define layers), forward (forward computation).

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int,
                  dropout: float = 0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim)
        )
        self._init_weights()

    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

model = FeedForward(784, 256, 10)
print(model)
print(f"Params: {sum(p.numel() for p in model.parameters()):,}")

# Move to GPU: all parameters moved
model.to('cuda')

# Separate trainable vs frozen parameters
frozen_params     = [p for p in model.parameters() if not p.requires_grad]
trainable_params  = [p for p in model.parameters() if p.requires_grad]

Q5. How do you write a training loop in PyTorch?

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

def train_epoch(model, dataloader, optimizer, criterion, device, scaler=None):
    model.train()
    total_loss = 0.0

    for batch_x, batch_y in dataloader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        optimizer.zero_grad()   # ALWAYS zero before forward pass

        if scaler:
            # Mixed precision (AMP)
            with torch.autocast(device_type='cuda', dtype=torch.float16):
                output = model(batch_x)
                loss   = criterion(output, batch_y)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
        else:
            output = model(batch_x)
            loss   = criterion(output, batch_y)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)

def evaluate(model, dataloader, criterion, device):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            _, predicted = output.max(1)
            correct += (predicted == batch_y).sum().item()
            total += len(batch_y)
    return correct / total

# Full training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model  = FeedForward(784, 512, 10).to(device)
optimizer  = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion  = nn.CrossEntropyLoss()
scaler     = torch.amp.GradScaler('cuda') if device.type == 'cuda' else None   # FP16 loss scaling (GPU only; torch.cuda.amp.GradScaler is deprecated)
scheduler  = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device, scaler)
    val_acc    = evaluate(model, val_loader, criterion, device)
    scheduler.step()
    print(f"Epoch {epoch+1}: loss={train_loss:.4f}, val_acc={val_acc:.4f}")

Q6. What is the PyTorch Dataset and DataLoader? How do you write a custom dataset?

from torch.utils.data import Dataset, DataLoader
import torch
import pandas as pd
import numpy as np
from pathlib import Path
from PIL import Image
import torchvision.transforms as T

class ImageClassificationDataset(Dataset):
    def __init__(self, csv_path: str, img_dir: str, transform=None):
        self.df        = pd.read_csv(csv_path)    # columns: filename, label
        self.img_dir   = Path(img_dir)
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row   = self.df.iloc[idx]
        img   = Image.open(self.img_dir / row['filename']).convert('RGB')
        label = torch.tensor(row['label'], dtype=torch.long)

        if self.transform:
            img = self.transform(img)

        return img, label

# Usage
train_transform = T.Compose([
    T.RandomResizedCrop(224), T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
val_transform = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

train_ds = ImageClassificationDataset('train.csv', 'images/', train_transform)
val_ds   = ImageClassificationDataset('val.csv',   'images/', val_transform)

# DataLoader: handles batching, shuffling, parallel loading
train_loader = DataLoader(
    train_ds, batch_size=64, shuffle=True,
    num_workers=4,           # parallel data loading (set to CPU cores)
    pin_memory=True,         # faster CPU->GPU transfer (if CUDA)
    persistent_workers=True, # keep workers alive across epochs
    drop_last=True           # drop last incomplete batch
)

Q7. What is gradient accumulation and why is it used?

import torch
import torch.nn as nn

ACCUMULATION_STEPS = 8   # effective batch = actual_batch * 8

optimizer.zero_grad()

for step, (batch_x, batch_y) in enumerate(dataloader):
    batch_x = batch_x.to(device)
    batch_y = batch_y.to(device)

    # Scale loss by accumulation steps (so effective gradient magnitude is same)
    loss = criterion(model(batch_x), batch_y) / ACCUMULATION_STEPS
    loss.backward()  # accumulate gradients; DON'T zero here

    if (step + 1) % ACCUMULATION_STEPS == 0:
        # All gradients accumulated; now clip and step
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()   # zero AFTER step

# With HuggingFace Trainer: use gradient_accumulation_steps=8 in TrainingArguments
# With PyTorch native: use above pattern
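Why the loss is divided by the number of accumulation steps: with a mean-reduced loss and equal-sized micro-batches, the scaled, accumulated gradient matches the gradient of one large batch. The check below is a sketch on a toy linear model with random data (both are assumptions for illustration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()                       # mean reduction
x, y = torch.randn(32, 10), torch.randn(32, 1)

# Reference: one backward pass over the full batch of 32
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 micro-batches of 8, scaling each loss by 1/4
steps = 4
model.zero_grad()
for xb, yb in zip(x.chunk(steps), y.chunk(steps)):
    (criterion(model(xb), yb) / steps).backward()   # grads add up in .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))
```

The match is exact only for layers without batch-dependent statistics; BatchNorm still sees the small micro-batch, so its behavior differs from a true large batch.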

Q8. How does PyTorch handle memory management on GPU? What causes OOM errors?

Cause	Fix
Batch too large	Reduce batch size; use gradient accumulation
Storing activations unnecessarily	Use `with torch.no_grad():` for inference
Tensor accumulation in Python list	Call `.detach()` or `.item()` on scalar tensors
Gradient checkpointing not used	Enable `torch.utils.checkpoint.checkpoint()`
Memory fragmentation	Call `torch.cuda.empty_cache()` (limited help)
Dead tensors still referenced	Delete tensors and call `gc.collect()`

import torch
import gc

# OOM debug toolkit
def gpu_memory_stats():
    print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
    print(f"Reserved:  {torch.cuda.memory_reserved()/1e9:.2f} GB")
    print(f"Max alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

gpu_memory_stats()

# Common mistake: accumulating loss tensors
losses = []
for batch_x, batch_y in dataloader:
    loss = criterion(model(batch_x), batch_y)
    # BAD:  losses.append(loss)   keeps every batch's computation graph alive
    losses.append(loss.item())    # GOOD: Python float, no graph retained

# Proper inference (no gradients)
model.eval()
with torch.no_grad():        # no activations are saved for backward, which lowers peak memory
    outputs = model(X_test)

# Release cache: drop dead Python references first, then free cached blocks
gc.collect()
torch.cuda.empty_cache()   # returns unused cached memory to the GPU driver; live tensors are unaffected
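The table mentions torch.utils.checkpoint.checkpoint() without showing it. A minimal sketch, assuming a small MLP block as the checkpointed segment: activations inside the block are recomputed during backward instead of being stored, and the resulting gradients match the regular path.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(16, 64, requires_grad=True)

# Regular forward/backward: intermediate activations are kept in memory
block(x).sum().backward()
ref_grad = x.grad.clone()

# Checkpointed forward/backward: activations are recomputed in backward
x.grad = None
block.zero_grad()
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
ckpt_grad = x.grad.clone()

print(torch.allclose(ref_grad, ckpt_grad))
```

The trade-off is extra compute (one additional forward pass through the block) in exchange for lower peak memory, so it pays off mainly on deep models.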

Q9. What is torch.nn.functional vs nn.Module? When do you use each?

Form	State (weights)	When to Use
`nn.Module` (e.g., `nn.Linear`)	Yes: stores weights as nn.Parameters	Modules with learnable parameters
`nn.functional` (e.g., `F.linear`)	No: pure function	Activations, loss functions, custom operations

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 10)   # example input

# As Module (has weights)
linear = nn.Linear(10, 5)
out = linear(x)   # stores W and b as parameters

# As functional (no weights; use separately defined parameters)
W = nn.Parameter(torch.randn(5, 10))
b = nn.Parameter(torch.zeros(5))
out = F.linear(x, W, b)   # identical computation

# Functional preferred for:
F.relu(x)                   # activation (no params)
F.dropout(x, p=0.3, training=self.training)   # training-mode-aware dropout
F.cross_entropy(logits, labels)               # loss
F.softmax(x, dim=-1)

# Module preferred for:
self.conv   = nn.Conv2d(3, 64, 3)    # has learnable weights
self.bn     = nn.BatchNorm2d(64)     # has gamma, beta, running stats
self.embed  = nn.Embedding(10000, 128)  # embedding table

# The hybrid pattern (most common in practice)
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn   = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)), inplace=True)

Q10. What is torch.no_grad() vs torch.inference_mode()? What is detach()?

Method	Autograd disabled	Outputs usable in autograd later	Overhead
`torch.no_grad()`	Yes	Yes (ordinary tensors with requires_grad=False)	Low
`torch.inference_mode()`	Yes, and also skips view and version-counter tracking	No (inference tensors cannot be saved for backward)	Lowest
`.detach()`	Yes, for that tensor only	Yes (new tensor sharing the same storage)	Negligible

import torch
import torch.nn as nn

model = nn.Linear(100, 100)   # stand-in model so the snippet runs
x = torch.randn(100, 100, requires_grad=True)

# no_grad: context manager; preferred for eval loops
with torch.no_grad():
    y = model(x)   # no grad computation

# inference_mode: stronger; preferred for pure inference (2026 standard)
with torch.inference_mode():
    y = model(x)   # fastest; cannot be used as input to grad computation later

# detach: cuts a tensor from the graph
y = model(x)
y_detached = y.detach()   # y_detached.grad_fn is None
# Use for: discriminator in GAN (detach generator output)
# Use for: target network in RL (detach target model outputs)

# Timing comparison (inference_mode is usually slightly faster; the gap depends on model and hardware)
import timeit
with torch.no_grad():
    t_no_grad = timeit.timeit(lambda: model(x), number=100)
with torch.inference_mode():
    t_inf     = timeit.timeit(lambda: model(x), number=100)
print(f"no_grad: {t_no_grad:.2f}s  inference_mode: {t_inf:.2f}s")
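The practical difference between the two context managers shows up when an output is reused in a later gradient computation. A small sketch with toy tensors (illustrative names): the no_grad output works, the inference_mode output raises, and cloning it outside the context yields a normal tensor.

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

with torch.no_grad():
    y_ng = x * 2
with torch.inference_mode():
    y_inf = x * 2

# no_grad output is an ordinary tensor and can feed a later autograd graph
(y_ng * w).sum().backward()

# inference_mode output cannot be saved for backward
try:
    (y_inf * w).sum().backward()
    raised = False
except RuntimeError:
    raised = True

# Workaround: clone outside inference mode to get a normal tensor
w.grad = None
(y_inf.clone() * w).sum().backward()

print(y_ng.is_inference(), y_inf.is_inference(), raised)
```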

MEDIUM: Model Building and Training (Questions 11-20)

Q11. How do you implement a ResNet block from scratch?

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    expansion = 1  # for BasicBlock; BottleneckBlock uses 4

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                                stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                                padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)

        # Skip connection: project if shape changes
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # residual connection

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super().__init__()
        self.in_channels = 64
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1)
        )
        self.layer1 = self._make_layer(block, 64,  layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.fc      = nn.Linear(512, num_classes)

    def _make_layer(self, block, out_ch, n_blocks, stride):
        layers = [block(self.in_channels, out_ch, stride)]
        self.in_channels = out_ch
        for _ in range(1, n_blocks):
            layers.append(block(out_ch, out_ch))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        return self.fc(self.avgpool(x).flatten(1))

resnet18 = ResNet(ResBlock, [2, 2, 2, 2], num_classes=1000)

Q12. How do you implement a custom loss function in PyTorch?

import torch
import torch.nn as nn
import torch.nn.functional as F

# Option 1: Simple function
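def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss for class imbalance (targets are 0/1)."""
    targets = targets.float()
    probs = torch.sigmoid(logits)
    ce    = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t     = probs * targets + (1 - probs) * (1 - targets)   # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # alpha for positives, 1 - alpha for negatives
    fl    = alpha_t * (1 - p_t) ** gamma * ce
    return fl.mean()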

# Option 2: nn.Module (better: maintains state, can be part of model)
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, epsilon: float = 0.1):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, logits, targets):
        n_classes = logits.size(-1)
        log_probs = F.log_softmax(logits, dim=-1)

        # Smooth targets: (1-eps)*one_hot + eps/n_classes
        with torch.no_grad():
            smooth = log_probs.new_full(log_probs.shape, self.epsilon / n_classes)
            smooth.scatter_(-1, targets.unsqueeze(-1), 1 - self.epsilon + self.epsilon / n_classes)

        return -(smooth * log_probs).sum(dim=-1).mean()

# Option 3: Custom autograd function (for non-standard gradients)
class StraightThroughEstimator(torch.autograd.Function):
    """STE: quantize in forward, pass gradient straight through in backward."""
    @staticmethod
    def forward(ctx, x):
        return x.round()   # quantize to nearest integer

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # gradient flows straight through
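A usage sketch for the straight-through idea in Option 3. The class is redefined here under the hypothetical name RoundSTE so the snippet is self-contained; it contrasts the pass-through gradient with the zero gradient that plain round() produces.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in forward; pass the incoming gradient through unchanged in backward."""
    @staticmethod
    def forward(ctx, x):
        return x.round()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)

# Custom Functions are invoked through .apply, never by calling forward directly
y = RoundSTE.apply(x)
y.sum().backward()
ste_grad = x.grad.clone()      # gradient of the sum flows straight through

x.grad = None
x.round().sum().backward()
plain_grad = x.grad.clone()    # built-in round has zero gradient

print(y.detach(), ste_grad, plain_grad)
```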

Q13. What is the difference between nn.Sequential, nn.ModuleList, and nn.ModuleDict?

import torch
import torch.nn as nn

x = torch.randn(8, 64)   # example input

# nn.Sequential: ordered chain; supports indexing, and forward() feeds each module's output into the next
seq = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 10)
)
out = seq(x)   # applies each module in order

# nn.ModuleList: list of modules; NOT a chain (no auto-forward)
class MultiHeadModel(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # Registers every Linear layer as a submodule, so its parameters are visible to the parent
        self.heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(n_heads)])

    def forward(self, x):
        return torch.stack([head(x) for head in self.heads], dim=1)

# nn.ModuleDict: dict of named modules; useful for conditional routing
class MoELayer(nn.Module):
    def __init__(self, experts_dict):
        super().__init__()
        self.experts = nn.ModuleDict(experts_dict)  # all params registered

    def forward(self, x, expert_name):
        return self.experts[expert_name](x)

# WRONG: plain Python list -- modules NOT registered, won't be saved
class BrokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(64, 64) for _ in range(3)]   # NOT registered!
        # Fix: self.layers = nn.ModuleList([nn.Linear(64,64) for _ in range(3)])
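The registration problem is easy to verify by counting parameters. In this sketch the class names PlainList and Registered are hypothetical stand-ins for the broken and fixed variants above.

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the layers exist but the parent never sees them
        self.layers = [nn.Linear(64, 64) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList: each layer is registered as a submodule
        self.layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(3)])

n_plain = sum(p.numel() for p in PlainList().parameters())
n_reg = sum(p.numel() for p in Registered().parameters())
keys_plain = len(PlainList().state_dict())
keys_reg = len(Registered().state_dict())

print(n_plain, n_reg, keys_plain, keys_reg)
```

An optimizer built from PlainList().parameters() would receive nothing to train, and .to(device) would leave the hidden layers on the CPU.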

Q14. How do you implement mixed precision training in PyTorch?

import torch
import torch.nn as nn
from torch.amp import autocast, GradScaler   # torch.cuda.amp is deprecated in recent releases

# MyModel, criterion, dataloader and n_epochs are assumed to be defined elsewhere
model = MyModel().to('cuda')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler('cuda')   # manages FP16 loss scaling

for epoch in range(n_epochs):
    for batch_x, batch_y in dataloader:
        batch_x = batch_x.to('cuda')
        batch_y = batch_y.to('cuda')

        optimizer.zero_grad()

        # autocast: ops run in float16 where safe, float32 otherwise
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(batch_x)
            loss   = criterion(output, batch_y)

        # Scale loss to prevent underflow in FP16 gradients
        scaler.scale(loss).backward()

        # Unscale gradients before clipping
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Step optimizer (unscales gradients internally if not already)
        scaler.step(optimizer)

        # Update the loss scale for next iteration
        scaler.update()

# BFloat16 (preferred on A100/H100: no loss scaling needed)
optimizer.zero_grad()
with autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(batch_x)
    loss = criterion(output, batch_y)
loss.backward()   # no scaler needed for bfloat16
optimizer.step()

Q15. How do you implement a Transformer encoder from scratch in PyTorch?

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k    = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None):
        B, T, D = x.shape
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scale = math.sqrt(self.d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        weights = self.dropout(F.softmax(scores, dim=-1))
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_o(out)

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int,
                  dropout: float = 0.1):
        super().__init__()
        self.attn  = MultiHeadAttention(d_model, n_heads, dropout)
        self.ff    = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop  = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm architecture (more stable than post-norm)
        x = x + self.drop(self.attn(self.norm1(x), mask=mask))
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

class TransformerEncoder(nn.Module):
    def __init__(self, n_layers, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask=mask)
        return self.norm(x)

Q16. How do you save and load models in PyTorch?

import torch

# Recommended: save only state_dict (portable; rebuild the same architecture before loading)
torch.save(model.state_dict(), 'model_weights.pth')
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu', weights_only=True))

# Full model save (pickles the whole object; tied to the class definition)
torch.save(model, 'full_model.pth')
model_loaded = torch.load('full_model.pth', weights_only=False)   # unpickles arbitrary code: trusted files only

# Checkpoint (training state: model + optimizer + epoch)
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': best_val_loss,
    'config': {'d_model': 512, 'n_heads': 8}
}
torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pth')

# Resume training
ckpt = torch.load('checkpoint_epoch_10.pth', map_location=device)
model.load_state_dict(ckpt['model_state_dict'])
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
scheduler.load_state_dict(ckpt['scheduler_state_dict'])
start_epoch = ckpt['epoch'] + 1

# Common mistake: map_location
# If model was saved on GPU but you're loading on CPU:
ckpt = torch.load('model.pth', map_location=torch.device('cpu'))

Q17. What is DataParallel vs DistributedDataParallel? Which should you use?

Feature	DataParallel (DP)	DistributedDataParallel (DDP)
Multi-GPU support	Single-machine only	Multi-machine + multi-GPU
GIL bottleneck	Yes (Python GIL limits parallelism)	No (one process per GPU)
Memory efficiency	Worse (gradient sync on one GPU)	Better (gradients averaged in-place)
Speed	2-3x on 4 GPUs	Near-linear scaling
Complexity	Simple	Needs init_process_group
2026 recommendation	Use only for quick debugging	Production standard

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size, model_class, dataset, n_epochs=10):
    setup(rank, world_size)

    model = model_class().to(rank)
    model = DDP(model, device_ids=[rank], output_device=rank)

    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank
    )
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler,
                             num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(n_epochs):
        sampler.set_epoch(epoch)   # shuffle differently each epoch
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(rank), batch_y.to(rank)
            loss = criterion(model(batch_x), batch_y)
            optimizer.zero_grad()
            loss.backward()        # DDP auto-synchronizes gradients
            optimizer.step()

    cleanup()

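# Launch one process per GPU; mp.spawn passes the rank as the first argument:
# torch.multiprocessing.spawn(train_ddp, args=(world_size, model_class, dataset), nprocs=world_size)
# With torchrun (torchrun --nproc_per_node=4 train.py), read the rank from os.environ['LOCAL_RANK']
# and the world size from os.environ['WORLD_SIZE'], and drop the MASTER_ADDR/MASTER_PORT lines.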

Q18. How does PyTorch handle sparse tensors? When are they useful?

import torch

# Sparse COO tensor (row indices, col indices, values)
indices = torch.tensor([[0, 1, 2], [1, 0, 2]])   # shape [2, nnz]
values  = torch.tensor([3.0, 4.0, 5.0])
sparse_tensor = torch.sparse_coo_tensor(
    indices, values, size=(3, 3), dtype=torch.float32
)

# Convert to dense for visualization
print(sparse_tensor.to_dense())

# Sparse linear layer (useful for very high-dimensional inputs)
# Instead of storing a full n x m matrix, store only non-zero weights
# Example: word embeddings in NLP (vocabulary = 100K words, embed_dim = 128)
# Standard: 100K * 128 * 4 bytes = 51MB (acceptable)
# But for 1M vocab: sparse embedding lookup is memory-efficient

# SparseTensor for graph neural networks (adjacency matrix)
row = torch.tensor([0, 1, 2, 1])
col = torch.tensor([1, 0, 3, 2])
edge_index = torch.stack([row, col])
edge_weight = torch.ones(4)
adj = torch.sparse_coo_tensor(edge_index, edge_weight, (4, 4))
# Matrix multiply: message passing in GNNs
node_features = torch.randn(4, 16)
messages = torch.sparse.mm(adj, node_features)

Q19. How do you profile a PyTorch model?

import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = MyModel().to('cuda')
x = torch.randn(64, 3, 224, 224).to('cuda')

# PyTorch Profiler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        with torch.inference_mode():
            out = model(x)

# View top operations by CUDA time
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))

# Export for Chrome tracing visualization
prof.export_chrome_trace('trace.json')

# Memory profiling
print(prof.key_averages().table(sort_by='self_cpu_memory_usage', row_limit=10))

# Quick FLOPs count (thop library)
from thop import profile as thop_profile
flops, params = thop_profile(model, inputs=(torch.randn(1, 3, 224, 224).cuda(),))
print(f"FLOPs: {flops/1e9:.2f} G  Params: {params/1e6:.2f} M")

Q20. What is TorchScript and how do you export a model for production?

import torch

# torch.jit.trace: fastest; works if control flow doesn't change with input
model.eval()
example_input = torch.randn(1, 3, 224, 224).cuda()
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# torch.jit.script: handles Python control flow (if/for based on tensors)
@torch.jit.script
def relu_if(x: torch.Tensor, threshold: float) -> torch.Tensor:
    if x.max() > threshold:
        return torch.relu(x)
    return x

class ScriptableModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(100, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All operations must be scriptable (no Python-only ops)
        return self.fc(x)

scripted = torch.jit.script(ScriptableModel())
scripted.save('scripted_model.pt')

# ONNX export (most portable, TensorRT-compatible)
torch.onnx.export(
    model, example_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
    opset_version=17
)
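
The difference between tracing and scripting is easiest to see on a function with a data-dependent branch. The sketch below is illustrative only: `gate`, `pos`, and `neg` are hypothetical names, and it runs on CPU.

```python
import warnings
import torch

def gate(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: the path taken depends on the values in x
    if x.sum() > 0:
        return x + 10
    return x - 10

pos = torch.ones(3)
neg = -torch.ones(3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # hide the TracerWarning about the branch
    # Tracing records only the ops executed for the example input (x + 10 here)
    traced = torch.jit.trace(gate, pos)
    # Scripting compiles the Python source, so the if/else is preserved
    scripted = torch.jit.script(gate)

traced_neg = traced(neg)       # replays x + 10 even though the sum is negative
scripted_neg = scripted(neg)   # takes the second branch: x - 10
print(traced_neg.tolist(), scripted_neg.tolist())
```

The traced function silently returns the wrong branch for `neg`, which is why tracing is only safe when control flow does not depend on the input.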

HARD: Advanced PyTorch (Questions 21-28)

Q21. How do you implement FSDP (Fully Sharded Data Parallel) in PyTorch?

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, CPUOffload
from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
import functools

# Initialize distributed
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Auto-wrap policy: wrap each TransformerLayer individually
auto_wrap = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerEncoderLayer}
)

model = LargeTransformerModel()
model = FSDP(
    model,
    auto_wrap_policy=auto_wrap,
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # ZeRO-3 equivalent
    cpu_offload=CPUOffload(offload_params=True),      # offload params to RAM
    mixed_precision=None,
    device_id=local_rank
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
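# No GradScaler is needed here because mixed_precision=None (full fp32).
# For fp16 with FSDP, use torch.distributed.fsdp.sharded_grad_scaler.ShardedGradScaler.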

# Saving with FSDP (must use FSDP-aware save)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT):
    state = model.state_dict()
    if dist.get_rank() == 0:   # global rank 0 only; LOCAL_RANK is 0 on every node
        torch.save(state, 'fsdp_checkpoint.pth')

Q22. What is custom autograd in PyTorch? When do you need it?

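Custom autograd means subclassing torch.autograd.Function and writing both forward and backward yourself. You need it when one of the following holds:

The operation is non-differentiable but you have a surrogate gradient (STE, REINFORCE)
The default backward is numerically unstable
You want a custom gradient for efficiency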

import torch

class ClampSTE(torch.autograd.Function):
    """Clamp + Straight-Through Estimator gradient."""
    @staticmethod
    def forward(ctx, x, min_val, max_val):
        ctx.save_for_backward(x)
        ctx.min_val = min_val
        ctx.max_val = max_val
        return x.clamp(min_val, max_val)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # Pass gradient through wherever x is in [min, max]; zero otherwise
        mask = (x >= ctx.min_val) & (x <= ctx.max_val)
        return grad_output * mask.float(), None, None

# Usage
x = torch.randn(10, requires_grad=True)
y = ClampSTE.apply(x, -1.0, 1.0)
y.sum().backward()
print(x.grad)   # non-zero only where -1 <= x <= 1

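# Numerically stable log-softmax with a hand-written backward
class StableLogSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Subtract the row max before exponentiating so exp() cannot overflow,
        # and never take log() of a probability that may have underflowed to 0
        shifted = x - x.max(dim=-1, keepdim=True).values
        log_probs = shifted - shifted.exp().sum(dim=-1, keepdim=True).log()
        ctx.save_for_backward(log_probs)
        return log_probs

    @staticmethod
    def backward(ctx, grad):
        log_probs, = ctx.saved_tensors
        # d(log_softmax)/dx applied to grad: grad - softmax * sum(grad)
        return grad - log_probs.exp() * grad.sum(dim=-1, keepdim=True)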
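
Whenever you hand-write a backward, verify it. A minimal sketch, assuming a hypothetical `MySigmoid` Function (not from the answer above): `torch.autograd.gradcheck` compares the analytic gradient with finite differences, and a second check compares against the built-in op.

```python
import torch

class MySigmoid(torch.autograd.Function):
    """Hypothetical example: sigmoid with a manual backward."""
    @staticmethod
    def forward(ctx, x):
        y = torch.sigmoid(x)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        # d(sigmoid)/dx = y * (1 - y)
        return grad_output * y * (1 - y)

torch.manual_seed(0)
# gradcheck uses finite differences, so it needs float64 inputs
x = torch.randn(4, 5, dtype=torch.float64, requires_grad=True)
gradcheck_ok = torch.autograd.gradcheck(MySigmoid.apply, (x,))

# Cross-check against the built-in op with a non-trivial upstream gradient
w = torch.randn(4, 5, dtype=torch.float64)
(g_custom,) = torch.autograd.grad((MySigmoid.apply(x) * w).sum(), x)
(g_builtin,) = torch.autograd.grad((torch.sigmoid(x) * w).sum(), x)
grads_match = torch.allclose(g_custom, g_builtin)
print(gradcheck_ok, grads_match)
```

Running both checks catches most sign and shape mistakes in a custom backward before they show up as a training run that quietly fails to converge.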

Q23. How do you implement a learning rate warmup + cosine decay scheduler?

import torch.optim as optim
import math

class WarmupCosineScheduler(optim.lr_scheduler.LRScheduler):
    def __init__(self, optimizer, warmup_steps, total_steps, min_lr=1e-6, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.total_steps  = total_steps
        self.min_lr       = min_lr
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        step = self.last_epoch
        if step < self.warmup_steps:
            # Linear warmup: LR goes from 0 to base_lr
            factor = step / max(1, self.warmup_steps)
            return [base_lr * factor for base_lr in self.base_lrs]
        # Cosine decay: LR goes from base_lr to min_lr (min_lr is an absolute LR, not a ratio)
        progress = (step - self.warmup_steps) / max(1, self.total_steps - self.warmup_steps)
        progress = min(1.0, progress)   # hold at min_lr if training runs past total_steps
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return [self.min_lr + (base_lr - self.min_lr) * cosine for base_lr in self.base_lrs]

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = WarmupCosineScheduler(optimizer, warmup_steps=500, total_steps=10000)

# Alternatively, use HuggingFace transformers scheduler
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10000
)
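
To reason about the schedule in an interview, it helps to evaluate it at a few steps by hand. The sketch below is a plain-Python version of the same formula, assuming `min_lr` is an absolute learning-rate floor; the function name `warmup_cosine_lr` is hypothetical.

```python
import math

def warmup_cosine_lr(step, base_lr=3e-4, min_lr=1e-6, warmup_steps=500, total_steps=10000):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1 at start of decay, 0 at the end
    return min_lr + (base_lr - min_lr) * cosine

# Start, mid-warmup, peak, halfway through the decay, final step
for s in (0, 250, 500, 5250, 10000):
    print(s, f"{warmup_cosine_lr(s):.3e}")
```

The learning rate starts at 0, reaches the base rate of 3e-4 at step 500, sits halfway between base and floor at step 5250, and ends at the 1e-6 floor.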

Q24. How do you use PyTorch for Graph Neural Networks (GNNs)?

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, global_mean_pool
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader   # DataLoader lives in torch_geometric.loader in PyG 2.x

# Basic GCN layer (manual message passing)
class GCNLayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.linear = nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        """
        x: [N, in_channels] node features
        edge_index: [2, E] edge list (sparse adjacency)
        """
        # Aggregate: sum over neighbors via scatter_add
        row, col = edge_index
        out = torch.zeros_like(x)
        out.scatter_add_(0, col.unsqueeze(-1).expand_as(x[row]), x[row])
        # Normalize by degree
        deg = torch.zeros(x.size(0), device=x.device)
        deg.scatter_add_(0, col, torch.ones(col.size(0), device=x.device))
        deg = deg.clamp(min=1).unsqueeze(-1)
        return F.relu(self.linear(out / deg))

# Using PyG (PyTorch Geometric) -- standard library for GNNs
class GraphClassifier(nn.Module):
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GATConv(hidden_channels, hidden_channels, heads=4, concat=False)
        self.fc = nn.Linear(hidden_channels, num_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.3, training=self.training)
        x = F.relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        x = global_mean_pool(x, batch)   # aggregate nodes per graph
        return self.fc(x)
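
The scatter_add aggregation in the manual GCN layer is worth tracing on a tiny graph. This sketch uses plain PyTorch (no PyG) and a hypothetical 3-node graph with one feature per node, treating the first row of `edge_index` as sources and the second as targets.

```python
import torch

# 3 nodes, 1 feature each; directed edges source -> target
x = torch.tensor([[1.0], [2.0], [4.0]])
edge_index = torch.tensor([[0, 1, 2],    # sources (row)
                           [2, 2, 0]])   # targets (col)
row, col = edge_index

# Sum each source's features into its target node
summed = torch.zeros_like(x)
summed.scatter_add_(0, col.unsqueeze(-1).expand_as(x[row]), x[row])

# In-degree per node, clamped so isolated nodes do not divide by zero
deg = torch.zeros(x.size(0))
deg.scatter_add_(0, col, torch.ones(col.size(0)))
mean_agg = summed / deg.clamp(min=1).unsqueeze(-1)

print(summed.squeeze(-1).tolist())    # [4.0, 0.0, 3.0]
print(mean_agg.squeeze(-1).tolist())  # [4.0, 0.0, 1.5]
```

Node 2 receives messages from nodes 0 and 1 (1 + 2 = 3, mean 1.5), node 0 receives one message from node 2, and node 1 has no incoming edges, so it stays at zero.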

Q25. What is the PyTorch compile() API? How does torch.compile work?

import torch

model = MyModel().cuda()

# Compile: first call takes longer (compilation), subsequent calls are faster
compiled_model = torch.compile(model, mode='reduce-overhead')
# Modes:
# 'default': good general-purpose optimization
# 'reduce-overhead': minimize CUDA launch overhead (good for small batch)
# 'max-autotune': autotune Triton kernels (slowest compile, fastest runtime)
# 'max-autotune-no-cudagraphs': max-autotune without CUDA graphs

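# Speedup depends on model, batch size, and GPU; 1.5-3x is often quoted for transformers
x = torch.randn(32, 3, 224, 224, device='cuda')
import time
compiled_model(x); torch.cuda.synchronize()   # warm-up call: triggers compilation, do not time it
t = time.time(); compiled_model(x); torch.cuda.synchronize()
print(f"Compiled: {time.time()-t:.3f}s")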

# torch.compile is compatible with DDP and FSDP; wrap with DDP first, then compile
from torch.nn.parallel import DistributedDataParallel as DDP
model = torch.compile(DDP(model))

# torch.compile with dynamic shapes
compiled_dynamic = torch.compile(model, dynamic=True)
# Reduces recompilation when input shapes (for example sequence length) vary between calls

Q26. How do you implement gradient checkpointing in a custom model?

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

class CheckpointedTransformer(nn.Module):
    def __init__(self, n_layers, d_model, n_heads, d_ff):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            if self.training:
                # Recompute layer during backward instead of storing activations
                # use_reentrant=False: new API, more flexible
                x = checkpoint(layer, x, mask, use_reentrant=False)
            else:
                x = layer(x, mask)
        return x

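# Memory vs speed tradeoff
# Checkpointing a layer keeps only its input; activations inside the layer are
# recomputed during backward, so the saving grows with the size of each layer
# Cost: about one extra forward pass, roughly 33% more compute if backward costs 2x forward
# Rule of thumb: if you hit GPU OOM, start by checkpointing every other layer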

# checkpoint_sequential: checkpoints every k layers
class SequentialModel(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        self.layers = nn.Sequential(*[
            TransformerEncoderLayer(512, 8, 2048) for _ in range(n_layers)
        ])

    def forward(self, x):
        if self.training:
            # Checkpoint every 2 layers
            return checkpoint_sequential(self.layers, segments=len(self.layers)//2,
                                          input=x, use_reentrant=False)
        return self.layers(x)
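
Checkpointing should change memory use and speed, never the gradients. A minimal CPU sketch that checks this, assuming a small hypothetical `block` as a stand-in for a transformer layer (no dropout, so both runs are deterministic):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Hypothetical stand-in for a transformer layer
block = nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 8))
x = torch.randn(4, 8)

def grads(use_checkpoint):
    block.zero_grad()
    inp = x.clone().requires_grad_(True)
    if use_checkpoint:
        # Intermediate activations are dropped after forward and recomputed in backward
        out = checkpoint(block, inp, use_reentrant=False)
    else:
        out = block(inp)
    out.sum().backward()
    return inp.grad.clone(), [p.grad.clone() for p in block.parameters()]

plain_in, plain_params = grads(use_checkpoint=False)
ckpt_in, ckpt_params = grads(use_checkpoint=True)

input_match = torch.allclose(plain_in, ckpt_in)
param_match = all(torch.allclose(a, b) for a, b in zip(plain_params, ckpt_params))
print(input_match, param_match)
```

If the wrapped layer contains dropout, checkpoint preserves the RNG state by default so the recomputed forward matches the original one.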

Q27. How do you implement a custom optimizer in PyTorch?

import torch
from torch.optim import Optimizer

class Lion(Optimizer):
    """Lion optimizer (Chen et al., Google, 2023): sign SGD with momentum."""
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            beta1, beta2 = group['betas']
            wd = group['weight_decay']

            for p in group['params']:
                if p.grad is None:
                    continue

                grad = p.grad
                state = self.state[p]

                if len(state) == 0:
                    state['exp_avg'] = torch.zeros_like(p)

                exp_avg = state['exp_avg']

                # Decoupled weight decay, applied to the pre-update weights
                p.mul_(1 - lr * wd)

                # Update: sign of interpolated momentum
                update = exp_avg * beta1 + grad * (1 - beta1)
                p.add_(torch.sign(update), alpha=-lr)

                # Update momentum
                exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)

        return loss

# Lion typically uses a 3-10x smaller LR than AdamW
# and a 3-10x larger weight_decay, so lr * weight_decay stays comparable
optimizer = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.01)
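
A single Lion step on a scalar shows why its learning rate is tuned differently from Adam's: the step size is always exactly `lr`, whatever the gradient magnitude. This is a plain-Python sketch; `lion_step` and `sign` are hypothetical helpers, and the learning rate of 0.1 is chosen only to make the arithmetic easy to read.

```python
def sign(v):
    # -1, 0, or +1
    return (v > 0) - (v < 0)

def lion_step(p, grad, m, lr=0.1, beta1=0.9, beta2=0.99, wd=0.0):
    """One scalar Lion step; returns the new parameter and new momentum."""
    direction = sign(beta1 * m + (1 - beta1) * grad)   # only the sign is used
    p = p * (1 - lr * wd) - lr * direction              # decoupled decay + fixed-size step
    m = beta2 * m + (1 - beta2) * grad                  # momentum tracks the raw gradient
    return p, m

p1, m1 = lion_step(p=1.0, grad=0.5, m=0.0)
p2, m2 = lion_step(p=1.0, grad=500.0, m=0.0)
print(p1, p2)   # both 0.9: the step size does not depend on gradient magnitude
print(m1, m2)   # the momentum does scale with the gradient
```

Because every coordinate moves by the full `lr` each step, the update norm is larger than Adam's at the same learning rate, which is why smaller learning rates are used with Lion.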

Q28. What are common PyTorch debugging patterns?

import torch
import torch.nn as nn

# 1. NaN/Inf detection
torch.autograd.set_detect_anomaly(True)   # enables full stack trace on NaN
# Warning: slows training significantly; disable in production

# 2. Check gradient flow
def check_gradients(model, threshold=1e-6):
    no_grad  = []
    nan_grad = []
    for name, param in model.named_parameters():
        if param.grad is None:
            no_grad.append(name)
        elif torch.isnan(param.grad).any():
            nan_grad.append(name)
        elif param.grad.abs().max() < threshold:
            print(f"Very small grad: {name} max={param.grad.abs().max():.2e}")
    if no_grad:  print("No gradient:", no_grad[:3])
    if nan_grad: print("NaN gradient:", nan_grad)

# 3. Overfit one batch (golden test)
x_one, y_one = next(iter(train_loader))
x_one, y_one = x_one.to(device), y_one.to(device)
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(x_one), y_one)
    loss.backward()
    optimizer.step()
    if step % 50 == 0: print(f"Step {step}: {loss.item():.6f}")
# If loss doesn't go near zero: architecture or loss bug

# 4. Shape debugging
class ShapeLogger(nn.Module):
    def __init__(self, name, wrapped):
        super().__init__()
        self.name = name
        self.module = wrapped

    def forward(self, x):
        out = self.module(x)
        print(f"{self.name}: in={tuple(x.shape)} out={tuple(out.shape)}")
        return out

# 5. Reproducibility
def set_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    import numpy as np, random
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # set True for speed if input size is fixed
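
For shape debugging, forward hooks are an alternative to the wrapper module in pattern 4 that leaves the model definition untouched. A minimal sketch, assuming a small hypothetical `model`; `register_forward_hook` returns a handle that should be removed once debugging is done.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 4 * 4, 10),
)

shapes = []   # (layer name, input shape, output shape)

def make_hook(name):
    def hook(module, inputs, output):
        shapes.append((name, tuple(inputs[0].shape), tuple(output.shape)))
    return hook

# Register on leaf modules only, and keep the handles so the hooks can be removed
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if len(list(m.children())) == 0]

with torch.no_grad():
    model(torch.randn(2, 3, 4, 4))

for h in handles:
    h.remove()

for name, in_shape, out_shape in shapes:
    print(f"{name}: in={in_shape} out={out_shape}")
```

The recorded shapes show where a dimension changes, for example the Flatten layer turning (2, 8, 4, 4) into (2, 128) before the Linear layer.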

PyTorch Ecosystem in 2026

Tool	Purpose	Typical use
PyTorch Lightning	High-level training API	Clean research code
HuggingFace Transformers	Pre-trained models + Trainer	NLP/Vision/Speech
PEFT (LoRA, QLoRA)	Parameter-efficient fine-tuning	LLM fine-tuning
TRL (SFT, DPO, PPO)	LLM alignment training	RLHF workflows
timm	700+ vision models	Image classification
PyTorch Geometric	Graph neural networks	GNN research
Triton	GPU kernel programming	Custom GPU kernels
vLLM	Fast LLM inference	Production LLM serving
torchao	Quantization and sparsity	Model compression

FAQ

Q: What is the difference between .cuda() and .to('cuda')?

A: .to('cuda') is the modern API; it is more flexible (supports device strings like 'cuda:1', device objects, dtypes). .cuda() is older but still works. Prefer .to(device) where device = torch.device('cuda' if ... else 'cpu').

Q: Why does torch.nn.CrossEntropyLoss expect logits, not probabilities?

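A: It applies log_softmax followed by negative log-likelihood internally, in a numerically stable fused form. Computing softmax yourself and then taking the log is less stable, because the log of a probability that has underflowed to 0 is -inf. Passing probabilities is also incorrect, since the loss would apply softmax a second time. Always pass raw logits.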
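
The instability is easy to reproduce. This sketch uses deliberately extreme, hypothetical logits so that the softmax probability of the target class underflows to zero in float32.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0]])   # extreme but finite logits
target = torch.tensor([1])               # the true class is the unlikely one

# Naive route: softmax underflows to exactly 0 for class 1, so log gives -inf
probs = torch.softmax(logits, dim=-1)
naive_loss = F.nll_loss(torch.log(probs), target)

# Fused route: log-softmax is computed with the max subtracted first
stable_loss = F.cross_entropy(logits, target)

print(naive_loss.item(), stable_loss.item())   # inf 1000.0
```

The naive loss is infinite and its gradient is unusable, while the fused loss is the finite value 1000.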

Q: What is the difference between model.train() and model.eval()?

A: model.train() sets self.training = True on all modules; model.eval() sets it to False. This affects BatchNorm (train: batch stats; eval: running stats) and Dropout (train: active; eval: disabled).
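
Both effects can be observed directly on standalone layers. A minimal sketch with illustrative sizes and seed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)    # about half the entries zeroed, survivors scaled by 1/(1-p) = 2
drop.eval()
eval_out = drop(x)     # identity

bn = nn.BatchNorm1d(4)
batch = torch.randn(32, 4) * 3 + 5
bn.train()
bn(batch)              # normalises with batch stats and updates the running averages
mean_after_train = bn.running_mean.clone()
bn.eval()
bn(batch)              # normalises with the stored running stats; buffers stay fixed
mean_after_eval = bn.running_mean.clone()

print(sorted(train_out.unique().tolist()), eval_out.equal(x))
print(torch.equal(mean_after_train, mean_after_eval))
```

Forgetting model.eval() at inference time leaves Dropout active and lets BatchNorm keep updating its running statistics, which usually shows up as noisy or degraded validation metrics.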

Q: When should I use torch.compile vs TorchScript?

A: Use torch.compile to speed up training and research code from Python; it needs fewer code changes and optimizes more aggressively. Use TorchScript when you need Python-free execution, such as C++ inference or mobile deployment.

Sources and review notes (reviewed 8 Jun 2026)

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.

Evidence labels

Official notices, candidate reports, offer documents, and editorial practice questions carry different confidence levels. The visible source list lets you inspect the evidence instead of relying on a blanket verification badge.

Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
