Sun, 27 Apr 2026
vol. IX · no. 117
PapersAdda
placement intelligence, since 2017

Computer Vision Interview Questions 2026: 28 Answers with Code

27 min read
Interview Questions
Updated: 8 Jun 2026
Aditya Sharma
Aditya's Edit

PapersAdda 2026 Placement Cycle

By Aditya Sharma·Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for the 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats into real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.

What I'd actually study for this

  • 01 Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  • 02 One real project you can defend end-to-end - file paths, design decisions, and what you would change
  • 03 One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  • 04 Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.

Computer vision powers autonomous vehicles, medical imaging, retail automation, and every modern camera application. In 2026, CV engineers are expected to know both classical image processing and modern deep learning architectures. This guide covers 28 computer vision interview questions with full answers, PyTorch code, and comparison tables.

PapersAdda's take: CV interviews at product companies ask you to implement convolutions, explain detection architectures, and reason about tradeoffs. At vision-focused companies (Ola Electric, Nuro, Waymo, Samsung), expect to implement augmentation pipelines and discuss evaluation metrics deeply. Candidates report that mAP calculation and NMS implementation are the most commonly tested CV topics in on-site rounds. According to candidate accounts from public preparation resources, object detection system design comes up in roughly half of senior CV interviews. Confirm the exact interview structure on the official careers portal of your target company.



Which Companies Ask These Questions?

| Topic | Companies |
|---|---|
| CNN Architecture | Google, Meta, Samsung, Qualcomm |
| Object Detection | Ola, Waymo, Tesla, Nuro, DoorDash |
| Image Segmentation | Medical imaging startups, autonomous driving |
| Vision Transformers | Google Brain, Meta AI, OpenAI |
| CV Data Pipelines | All companies with CV teams |
| Production CV Systems | Any company deploying camera-based AI |

EASY: Image Processing and CNNs (Questions 1-10)

Q1. What is a convolution operation in a neural network?

import torch
import torch.nn as nn
import torch.nn.functional as F

# Manual 2D convolution (educational)
def conv2d_manual(image, kernel):
    """
    image:  [H, W]
    kernel: [kH, kW]
    """
    H, W = image.shape
    kH, kW = kernel.shape
    oH, oW = H - kH + 1, W - kW + 1
    output = torch.zeros(oH, oW)
    for i in range(oH):
        for j in range(oW):
            output[i,j] = (image[i:i+kH, j:j+kW] * kernel).sum()
    return output

# PyTorch nn.Conv2d
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different learned filters
    kernel_size=3,
    stride=1,
    padding=1,        # same padding
    bias=False        # often set to False before BatchNorm
)

x = torch.randn(1, 3, 224, 224)
out = conv(x)
print(out.shape)   # [1, 64, 224, 224]

# Parameter count: 3 * 64 * 3 * 3 = 1,728 params
print("Params:", sum(p.numel() for p in conv.parameters()))

Q2. What is the difference between valid and same padding?

| Padding | Formula | Output Size | Use Case |
|---|---|---|---|
| Valid (no padding) | H_out = floor((H - k) / s) + 1 | Smaller than input | When you intentionally shrink |
| Same (zero padding) | H_out = ceil(H / s) | Same as input (stride=1) | Preserve spatial dimensions |
| Reflect padding | Mirror image at borders | Same as input | Fewer border artifacts |

# Same padding: output size = input size (stride=1)
conv_same = nn.Conv2d(3, 64, kernel_size=3, padding=1)         # padding = kernel_size//2
x = torch.randn(1, 3, 32, 32)
print(conv_same(x).shape)   # [1, 64, 32, 32]

# Valid padding: output shrinks
conv_valid = nn.Conv2d(3, 64, kernel_size=3, padding=0)
print(conv_valid(x).shape)  # [1, 64, 30, 30]

# Strided convolution: reduces spatial resolution
conv_stride2 = nn.Conv2d(3, 64, kernel_size=3, padding=1, stride=2)
print(conv_stride2(x).shape) # [1, 64, 16, 16]
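
The shapes printed above follow from the general output-size formula, floor((H + 2p - k) / s) + 1. A minimal sketch, assuming square kernels, symmetric padding, and no dilation; the helper name `conv_out_size` is ours for illustration and is not a PyTorch function:

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output size of a convolution along one spatial dimension (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# The three cases from the snippet above, for a 32x32 input and a 3x3 kernel
print(conv_out_size(32, 3, stride=1, padding=1))  # same padding
print(conv_out_size(32, 3, stride=1, padding=0))  # valid padding
print(conv_out_size(32, 3, stride=2, padding=1))  # strided convolution
```

Working this formula by hand is a common whiteboard check, so it helps to be able to reproduce the 32, 30, and 16 without running PyTorch.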

Q3. What is pooling? Compare max pooling and average pooling.

| Type | Operation | Property | Use Case |
|---|---|---|---|
| Max Pool | Take maximum in window | Translation invariance, keeps strongest feature | Hidden layers |
| Avg Pool | Average in window | Smoother, preserves all info | Final spatial reduction |
| Global Avg Pool | Average entire feature map to 1x1 | Spatial information to 1D; no FC layers | Final classification layer |
| Adaptive Pool | Output size specified, window computed | Handles variable input sizes | Final layers |

max_pool  = nn.MaxPool2d(kernel_size=2, stride=2)     # halves spatial dims
avg_pool  = nn.AvgPool2d(kernel_size=2, stride=2)
gap       = nn.AdaptiveAvgPool2d((1,1))               # global average pool

x = torch.randn(4, 64, 14, 14)
print(max_pool(x).shape)   # [4, 64, 7, 7]
print(avg_pool(x).shape)   # [4, 64, 7, 7]
print(gap(x).shape)        # [4, 64, 1, 1]
print(gap(x).flatten(1).shape)  # [4, 64] -- then Linear(64, num_classes)

Q4. What is data augmentation for images? What are the standard augmentations?

import torchvision.transforms.v2 as T

# Standard augmentation pipeline for classification (ImageNet-style)
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),     # crop and resize
    T.RandomHorizontalFlip(p=0.5),                   # mirror
    T.ColorJitter(brightness=0.4, contrast=0.4,
                   saturation=0.4, hue=0.1),          # color jitter
    T.RandomGrayscale(p=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],          # ImageNet statistics
                 std=[0.229, 0.224, 0.225])
])

# Advanced augmentation for robust models
advanced_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.TrivialAugmentWide(),      # AutoAugment variant, easy to use
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.1)       # CutOut: random rectangles zeroed out
])

# MixUp and CutMix (state-of-the-art for classification in 2026)
cutmix = T.CutMix(num_classes=1000)
mixup  = T.MixUp(num_classes=1000)

Augmentations for object detection (must preserve bounding boxes):

import albumentations as A

detection_transform = A.Compose([
    A.RandomResizedCrop(height=640, width=640, scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, p=0.5),
    A.Blur(blur_limit=3, p=0.1),
    A.GaussNoise(var_limit=(10, 50), p=0.1),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))
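
To see why detection augmentations must transform the boxes together with the pixels, here is a minimal sketch of a horizontal flip applied to box coordinates. The two helper functions are hypothetical illustrations, not part of Albumentations, which does this bookkeeping internally when `bbox_params` is set:

```python
def hflip_yolo_box(box):
    """Flip a normalized YOLO box (cx, cy, w, h) horizontally.

    Only the x-center changes; width and height are unaffected.
    """
    cx, cy, w, h = box
    return (1.0 - cx, cy, w, h)


def hflip_xyxy_box(box, image_width):
    """Flip a pixel-space box (x1, y1, x2, y2) horizontally.

    The left and right edges swap roles, so x1 is computed from the old x2.
    """
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


print(hflip_yolo_box((0.25, 0.5, 0.2, 0.4)))
print(hflip_xyxy_box((10, 20, 110, 220), image_width=640))
```

Flipping the image without flipping the labels silently corrupts training data, which is why interviewers often ask how a given augmentation affects the annotations.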

Q5. What are the major CNN architectures? How did they evolve from AlexNet to 2026?

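| Architecture | Year | Innovation | Top-1 ImageNet |
|---|---|---|---|
| AlexNet | 2012 | First deep CNN; ReLU, dropout, GPU | 57.1% |
| VGG-16 | 2014 | Deep uniform 3x3 convolutions | 71.5% |
| GoogLeNet | 2014 | Inception modules; 1x1 convolutions | 69.8% |
| ResNet-50 | 2015 | Residual connections; family scales to 152 layers | 76.1% |
| DenseNet | 2017 | Dense connections; feature reuse | 77.2% |
| EfficientNet-B7 | 2019 | Compound scaling; NAS | 84.3% |
| Vision Transformer (ViT-L) | 2020 | Patch-based self-attention | 87.8% |
| ConvNeXt-L | 2022 | Modernized ResNet with transformer tricks | 87.8% |
| EfficientNetV2-XL | 2022 | Progressive learning, Fused-MBConv | 87.3% |

Accuracy figures are approximate and not directly comparable: the ViT, ConvNeXt, and EfficientNetV2 entries rely on pretraining on datasets larger than ImageNet-1k.
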
import torchvision.models as models

# Modern choices for 2026
efficientnet = models.efficientnet_v2_l(weights='IMAGENET1K_V1')
convnext = models.convnext_large(weights='IMAGENET1K_V1')
vit = models.vit_h_14(weights='IMAGENET1K_SWAG_E2E_V1')  # ViT-H/14, SWAG weights

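# For transfer learning: swap in a new classification head
def make_classification_model(backbone_name, num_classes, freeze_backbone=False):
    if backbone_name == 'efficientnet_v2_m':
        model = models.efficientnet_v2_m(weights='IMAGENET1K_V1')
        if freeze_backbone:
            # Train only the new head: freeze all pretrained weights first
            for p in model.parameters():
                p.requires_grad = False
        in_features = model.classifier[1].in_features
        model.classifier = nn.Sequential(
            nn.Dropout(0.3), nn.Linear(in_features, num_classes)
        )
    else:
        # Previously an unknown name fell through to an undefined `model`
        raise ValueError(f"Unsupported backbone: {backbone_name}")
    return model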

Q6. What is batch normalization in CNNs? What happens at inference?

Training:

x_hat = (x - mean_batch) / sqrt(var_batch + eps)
y = gamma * x_hat + beta

Computed over (N, H, W) for each channel C. Running statistics updated via EMA:

running_mean = 0.9 * running_mean + 0.1 * batch_mean

Inference: Use running_mean and running_var (accumulated during training). Call model.eval() to switch modes.
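
The difference between the two modes can be sketched in plain Python for a single channel. This is a simplified illustration, assuming gamma = 1 and beta = 0; the function names are ours. One known simplification: PyTorch updates `running_var` with the unbiased batch variance, while this sketch uses the biased one throughout.

```python
import math


def batchnorm_train_step(values, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Normalize one channel with batch statistics and update the running statistics."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased variance
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    # Exponential moving average, matching the update rule quoted above
    new_mean = (1 - momentum) * running_mean + momentum * mean
    new_var = (1 - momentum) * running_var + momentum * var
    return normalized, new_mean, new_var


def batchnorm_eval(values, running_mean, running_var, eps=1e-5):
    """Inference: use the stored running statistics, never the current batch."""
    return [(v - running_mean) / math.sqrt(running_var + eps) for v in values]


normalized, r_mean, r_var = batchnorm_train_step([1.0, 2.0, 3.0, 4.0], 0.0, 1.0)
print(r_mean, r_var)
print(batchnorm_eval([2.5], r_mean, r_var))
```

Note that the eval output for a given input does not depend on what else is in the batch, which is exactly why forgetting `model.eval()` produces batch-size-dependent predictions.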

# Common mistake: forgetting model.eval() at inference
model.eval()   # MUST do this; switches BatchNorm to use running stats

# Debug: check if model is in train vs eval mode
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        print(f"{name}: training={module.training}, "
              f"running_mean={module.running_mean.mean():.3f}")

# BatchNorm vs InstanceNorm vs GroupNorm vs LayerNorm in CV
# BatchNorm: standard for CNNs with large batches
# InstanceNorm: style transfer, each sample normalized independently
# GroupNorm: small batches (detection, segmentation); stable with batch_size=2
# LayerNorm: Vision Transformers
bn = nn.BatchNorm2d(64)
gn = nn.GroupNorm(num_groups=8, num_channels=64)  # 8 groups of 8 channels

Q7. What is depthwise separable convolution? Why does MobileNet use it?

Depthwise separable convolution splits into:

  1. Depthwise: 1 filter per input channel (C_in filters, each k x k)
  2. Pointwise: 1x1 convolution to mix channels (C_in -> C_out)

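Cost (multiply-adds): (k^2 * C_in + C_in * C_out) * H * W, versus k^2 * C_in * C_out * H * W for a standard convolution.

Savings: the ratio of the two costs is 1/C_out + 1/k^2, so the reduction approaches k^2 (about 9x for 3x3 convolutions) as C_out grows; in practice it is typically 8-9x.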
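
The arithmetic behind the savings can be checked with a short sketch. The function names and the 56x56 feature-map size are illustrative assumptions; the channel counts match the 32 -> 64 example used in the parameter comparison for this question:

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Multiply-adds for a standard k x k convolution (stride 1, same padding)."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_cost(k, c_in, c_out, h, w):
    """Multiply-adds for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


std = standard_conv_cost(3, 32, 64, 56, 56)
sep = depthwise_separable_cost(3, 32, 64, 56, 56)
print(f"{std / sep:.2f}x fewer multiply-adds")
```

With only 64 output channels the reduction is a little under 8x; it moves closer to 9x as the output channel count increases.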

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise  = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                     padding=1, stride=stride,
                                     groups=in_ch,  # groups=in_ch = depthwise
                                     bias=False)
        self.pointwise  = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn_dw      = nn.BatchNorm2d(in_ch)
        self.bn_pw      = nn.BatchNorm2d(out_ch)
        self.relu       = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn_dw(self.depthwise(x)))
        return self.relu(self.bn_pw(self.pointwise(x)))

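# Parameter count comparison (bias=False so only conv weights are compared)
standard = nn.Conv2d(32, 64, 3, padding=1, bias=False)  # 32*64*9 = 18,432
dw_sep   = DepthwiseSeparableConv(32, 64)   # conv weights: 32*9 + 32*64 = 2,336 (+192 BatchNorm params)
print(f"Standard: {sum(p.numel() for p in standard.parameters())}")  # 18432
print(f"DW-Sep:   {sum(p.numel() for p in dw_sep.parameters())}")    # 2528, including BatchNorm scale/shift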

Q8. What is image segmentation? Compare semantic, instance, and panoptic segmentation.

| Type | Output | Distinguishes Instances? | Examples |
|---|---|---|---|
| Semantic | Class per pixel (no instance IDs) | No | Lane segmentation |
| Instance | Separate mask per object instance | Yes | Individual person masks |
| Panoptic | Semantic + instance unified | Yes | All pixels labeled |

from transformers import pipeline

# Semantic segmentation
seg_pipe = pipeline('image-segmentation',
                     model='facebook/mask2former-swin-large-ade-semantic')

# Instance segmentation
inst_pipe = pipeline('image-segmentation',
                      model='facebook/mask2former-swin-large-coco-instance')

# Using torchvision for segmentation inference
import torchvision
from torchvision.models.segmentation import deeplabv3_resnet101

# Semantic segmentation (DeepLab v3)
model = deeplabv3_resnet101(weights='DEFAULT')
model.eval()

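import torchvision.transforms.functional as TF
from PIL import Image

img = Image.open("image.jpg").convert("RGB")
img_t = TF.to_tensor(img)
# The pretrained weights expect ImageNet-normalized input
img_t = TF.normalize(img_t, mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225]).unsqueeze(0)
with torch.no_grad():
    output = model(img_t)['out']    # [1, 21, H, W] - 20 Pascal VOC categories + background
seg_mask = output.argmax(dim=1)     # [1, H, W]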

Q9. What are anchor boxes in object detection?

Why anchors: Direct regression to bounding box coordinates is unstable. Predicting offsets from anchors that already approximately match the object size is a much simpler task.
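
A minimal sketch of the offset parameterization this refers to, using the (tx, ty, tw, th) encoding popularized by the R-CNN family. Boxes are (cx, cy, w, h); the function names and example numbers are ours for illustration:

```python
import math


def encode_box(anchor, gt):
    """Offsets that map an anchor onto a ground-truth box (the regression target)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return (
        (gx - ax) / aw,        # center shift, in units of anchor width
        (gy - ay) / ah,        # center shift, in units of anchor height
        math.log(gw / aw),     # log-scale width change
        math.log(gh / ah),     # log-scale height change
    )


def decode_box(anchor, offsets):
    """Apply predicted offsets to an anchor to recover a box."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))


anchor = (104.0, 104.0, 64.0, 64.0)
gt = (110.0, 100.0, 80.0, 50.0)
offsets = encode_box(anchor, gt)
print(offsets)
print(decode_box(anchor, offsets))
```

When the anchor roughly matches the object, all four targets are small numbers near zero, which is what makes the regression easier to learn than raw pixel coordinates.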

import torch
import torchvision

# Generating anchor boxes (simplified)
def generate_anchors(feature_map_size, scales, ratios, stride=16):
    """
    Returns anchor boxes as [cx, cy, w, h] for each cell in the feature map.
    """
    anchors = []
    H, W = feature_map_size
    for row in range(H):
        for col in range(W):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale * (ratio ** 0.5)
                    h = scale / (ratio ** 0.5)
                    anchors.append([cx, cy, w, h])
    return torch.tensor(anchors)

anchors = generate_anchors((14, 14), scales=[32, 64, 128], ratios=[0.5, 1.0, 2.0])
print(f"Total anchors: {len(anchors)}")  # 14*14*9 = 1764

# Modern models (FCOS, DETR) are anchor-free, using points or queries instead

Q10. What is IoU (Intersection over Union) and why is it the standard metric for detection?

IoU = area(Predicted ∩ Ground Truth) / area(Predicted ∪ Ground Truth)

An IoU of 1.0 means perfect overlap; an IoU of 0 means no overlap. The standard convention counts a detection as a True Positive when IoU >= 0.5.

import torch

def box_iou(box1, box2):
    """
    box1, box2: [x1, y1, x2, y2] format
    """
    # Intersection
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    # Union
    area1 = (box1[2]-box1[0]) * (box1[3]-box1[1])
    area2 = (box2[2]-box2[0]) * (box2[3]-box2[1])
    union = area1 + area2 - inter

    return inter / (union + 1e-6)

# Vectorized with torchvision
from torchvision.ops import box_iou as tv_box_iou
pred_boxes = torch.tensor([[50, 50, 200, 200]], dtype=torch.float)
gt_boxes   = torch.tensor([[70, 70, 210, 210]], dtype=torch.float)
print(tv_box_iou(pred_boxes, gt_boxes))   # [[0.67]] approx (16900 / 25200)

# mAP@0.5 and mAP@0.5:0.95 are standard detection benchmarks (COCO)

MEDIUM: Object Detection and Segmentation (Questions 11-20)

Q11. How does YOLO work? Explain the key design choices.

YOLO v1-v5 evolution: grid-cell regression (v1) -> batch norm + anchor boxes (v2) -> multi-scale, FPN-style predictions (v3) -> CSP blocks and mosaic augmentation (v4/v5).

YOLOv8 (2023, standard in 2026):

  • Anchor-free detection head (predicts center point + width/height)
  • C2f modules (a faster cross-stage partial bottleneck built from 2 convolutions)
  • Decoupled head: separate heads for classification and localization
  • Task-aligned label assignment, with BCE for classification and CIoU + DFL (Distribution Focal Loss) for box regression

from ultralytics import YOLO

# Inference (fastest)
model = YOLO('yolov8n.pt')   # n=nano, s=small, m=medium, l=large, x=extra
results = model('image.jpg', conf=0.25, iou=0.45, classes=[0,1,2])
for r in results:
    for box in r.boxes:
        cls = int(box.cls)
        conf = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()

# Fine-tuning on custom dataset
model = YOLO('yolov8m.pt')
results = model.train(
    data='custom_dataset.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    device='0',
    patience=20,           # early stopping
    augment=True,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    degrees=10.0,
    translate=0.1,
    scale=0.5,
    mosaic=1.0             # mosaic augmentation (YOLOv4 signature)
)

Q12. How does Faster R-CNN work? Compare with single-stage detectors.

Faster R-CNN (two-stage):

  1. Backbone (ResNet) extracts feature maps
  2. Region Proposal Network (RPN) proposes ~2,000 candidate regions
  3. RoI Pooling: aligns each proposal to fixed size
  4. Classification + box regression heads refine each proposal

Comparison:

| Property | Two-Stage (Faster R-CNN) | Single-Stage (YOLO, SSD) |
|---|---|---|
| Accuracy | Higher, especially on small objects | Slightly lower |
| Speed | Slower (~5 FPS on a GPU in the original paper) | Fast (30+ FPS) |
| Complexity | Higher (two stages) | Simpler |
| 2026 usage | Medical imaging, aerial imagery | Real-time applications |

import torchvision
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn_v2, FasterRCNN,
    fcos_resnet50_fpn   # anchor-free, often better
)

# Faster R-CNN for high-accuracy tasks
model = fasterrcnn_resnet50_fpn_v2(weights='DEFAULT')
model.eval()

# For custom classes
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = fasterrcnn_resnet50_fpn_v2(weights='DEFAULT')
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=5+1)
# Fine-tune with standard DetectionDataset + DataLoader

Q13. What is Feature Pyramid Network (FPN) and why is it used for detection?

  1. Bottom-up pass: standard forward pass through backbone
  2. Top-down pass: upsample high-level features and add to lower-level features via lateral connections
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions to unify channel dims
        self.lateral = nn.ModuleList([
            nn.Conv2d(c, out_channels, 1) for c in in_channels_list
        ])
        # 3x3 smoothing convolutions
        self.smooth = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels_list
        ])

    def forward(self, features):
        # features: [C2, C3, C4, C5] from backbone
        laterals = [l(f) for l, f in zip(self.lateral, features)]
        # Top-down path
        for i in range(len(laterals)-2, -1, -1):
            laterals[i] += F.interpolate(laterals[i+1], size=laterals[i].shape[-2:],
                                          mode='nearest')
        return [self.smooth[i](laterals[i]) for i in range(len(laterals))]

Q14. What is non-maximum suppression (NMS)? Why is it needed?

  1. Sort all boxes by confidence score (descending)
  2. Take the highest-confidence box
  3. Remove all other boxes with IoU > threshold (usually 0.45)
  4. Repeat with remaining boxes
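
The four steps above can be sketched in plain Python. This is a reference version for interviews, assuming boxes in [x1, y1, x2, y2] format; the names `iou_xyxy` and `nms_reference` are ours, and production code should use the vectorized `torchvision.ops.nms`:

```python
def iou_xyxy(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_reference(boxes, scores, iou_threshold=0.45):
    """Greedy NMS. Returns indices of kept boxes, highest score first."""
    # Step 1: sort by confidence, descending
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: take the highest-confidence remaining box
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop every remaining box that overlaps it too much
        order = [i for i in order if iou_xyxy(boxes[best], boxes[i]) <= iou_threshold]
        # Step 4: the loop repeats with whatever is left
    return keep


boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 200, 200]]
scores = [0.9, 0.85, 0.7]
print(nms_reference(boxes, scores, iou_threshold=0.5))
```

This version is O(n^2) in the worst case, which is fine for explaining the algorithm but is the reason real pipelines cap the number of candidate boxes before NMS.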
import torch
from torchvision.ops import nms, batched_nms

# Single class NMS
boxes  = torch.tensor([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 200, 200]], dtype=torch.float)
scores = torch.tensor([0.9, 0.85, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print("Kept indices:", keep)  # [0, 2] - second box removed (IoU > 0.5 with first)

# Multi-class NMS (suppress within same class only)
classes = torch.tensor([0, 0, 1])
keep = batched_nms(boxes, scores, classes, iou_threshold=0.5)

# Soft-NMS: decay scores instead of hard removal
# Avoids missing overlapping objects (crowd scenes)
# from torchvision.ops import soft_nms (not in stable torchvision; use custom)

Q15. What is Mask R-CNN? How does it extend Faster R-CNN for instance segmentation?

Key addition: RoI Align (fixes the quantization misalignment in RoI Pooling):

  • Uses bilinear interpolation to sample exactly at float coordinates
  • Critical for mask quality
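
A minimal sketch of the bilinear sampling idea behind RoI Align, on a tiny feature map stored as a list of rows. The function name is ours, and the sketch assumes coordinates lie inside the map; the real operator additionally averages several such samples per output bin:

```python
def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at float coordinates (y, x) without rounding."""
    h, w = len(feature), len(feature[0])
    y0, x0 = int(y), int(x)                          # top-left neighbour (floor for y, x >= 0)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # bottom-right neighbour, clamped
    dy, dx = y - y0, x - x0                          # fractional offsets in [0, 1)
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feature = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(feature, 0.5, 0.5))  # 1.5, a blend of all four cells
```

RoI Pooling would instead snap (0.5, 0.5) to a single integer cell, returning one of the four values and shifting the sampled location by up to half a feature-map cell, which is many pixels in the input image.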
import torchvision
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load with pretrained weights
model = maskrcnn_resnet50_fpn_v2(weights='DEFAULT')

# Adapt for custom dataset
num_classes = 3  # example value: background + 2 object classes
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
model.roi_heads.mask_predictor = MaskRCNNPredictor(
    in_features_mask, hidden_layer, num_classes
)

# Inference: returns dict with 'boxes', 'labels', 'scores', 'masks'
import torch

img_tensor = torch.rand(3, 480, 640)  # placeholder; use a real image scaled to [0, 1]
model.eval()
with torch.no_grad():
    predictions = model([img_tensor])

for box, label, score, mask in zip(
    predictions[0]['boxes'], predictions[0]['labels'],
    predictions[0]['scores'], predictions[0]['masks']
):
    if score > 0.5:
        binary_mask = (mask[0] > 0.5).cpu().numpy()  # threshold the soft mask

Q16. What is DETR (Detection Transformer)? How does it work without anchors or NMS?

  1. CNN backbone extracts feature maps
  2. Transformer encoder processes feature map as sequence
  3. 100 learnable object queries attend to encoder output (cross-attention)
  4. Each query either predicts an object (class + box) or "no object"
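  5. Hungarian matching (training only): each ground-truth object is matched one-to-one with a prediction, so the model learns to emit a single box per object and NMS is not needed
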
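
The matching step can be illustrated with a brute-force search over assignments. This is a sketch for tiny inputs only, with a made-up cost matrix; real implementations typically solve the same problem with `scipy.optimize.linear_sum_assignment`, and the function name here is ours:

```python
from itertools import permutations


def match_predictions(cost):
    """Find the minimum-cost one-to-one assignment of ground truths to predictions.

    cost[i][j] is the cost of matching prediction i to ground-truth j.
    Assumes there are at least as many predictions as ground truths.
    Returns ({gt_index: pred_index}, total_cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_preds = float("inf"), None
    # Try every way of choosing a distinct prediction for each ground truth
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best_total, best_preds = total, preds
    return {g: p for g, p in enumerate(best_preds)}, best_total


# 3 predictions (rows) vs 2 ground-truth objects (columns); values are illustrative
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
]
assignment, total = match_predictions(cost)
print(assignment)
```

Prediction 2 is left unmatched here, so during training it would be supervised toward the "no object" class, which is how the model learns to suppress duplicates on its own.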
from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image

processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')

image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Post-process outputs (handles confidence threshold + rescaling)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]

for score, label, box in zip(results['scores'], results['labels'], results['boxes']):
    print(f"Detected {model.config.id2label[label.item()]} "
          f"({score:.3f}) at {[round(i,1) for i in box.tolist()]}")

Q17. What is the Vision Transformer (ViT)? How does it process images?

  1. Split 224x224 image into 16x16 patches: 196 patches
  2. Linear projection of each patch to d_model-dimensional embedding
  3. Add [CLS] token and positional embeddings
  4. Pass through transformer encoder (L layers of multi-head self-attention + FFN)
  5. Use [CLS] embedding for classification
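
The token arithmetic in steps 1-3 is worth being able to reproduce quickly. A small sketch, assuming square images and non-overlapping patches; the helper names are ours, and 768 is the ViT-Base embedding width:

```python
def vit_token_count(img_size=224, patch_size=16, use_cls_token=True):
    """Sequence length seen by the transformer encoder."""
    patches_per_side = img_size // patch_size
    return patches_per_side ** 2 + (1 if use_cls_token else 0)


def patch_projection_params(patch_size=16, in_channels=3, embed_dim=768):
    """Weights plus biases of the linear patch projection."""
    return patch_size * patch_size * in_channels * embed_dim + embed_dim


print(vit_token_count(224, 16))             # 196 patches + 1 [CLS] token
print(vit_token_count(384, 16))             # higher resolution means a longer sequence
print(patch_projection_params(16, 3, 768))
```

Because self-attention cost grows quadratically with sequence length, moving from 224 to 384 pixels roughly triples the token count and makes attention close to nine times more expensive.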
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # Patch projection via Conv2d with kernel=patch_size, stride=patch_size
        self.proj = nn.Conv2d(in_channels, embed_dim,
                               kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.n_patches, embed_dim))
        nn.init.normal_(self.pos_embed, std=0.02)
        nn.init.normal_(self.cls_token, std=0.02)

    def forward(self, x):
        B = x.shape[0]
        patches = self.proj(x).flatten(2).transpose(1, 2)  # [B, 196, 768]
        cls = self.cls_token.expand(B, -1, -1)              # [B, 1, 768]
        x = torch.cat([cls, patches], dim=1)                # [B, 197, 768]
        return x + self.pos_embed

# Using timm (preferred in 2026)
import timm
vit = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

Q18. What is Segment Anything Model (SAM)? How does it work?

Architecture:

  1. Image Encoder: ViT-H/16 MAE-pretrained backbone extracts image embeddings (offline, once per image)
  2. Prompt Encoder: Encodes sparse prompts (points, boxes) and dense prompts (masks)
  3. Mask Decoder: Lightweight transformer that attends over image features and prompt tokens, outputs 3 candidate masks (ambiguous cases)
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry['vit_h'](checkpoint='sam_vit_h_4b8939.pth')
sam.to('cuda')
predictor = SamPredictor(sam)

import cv2
image = cv2.imread('image.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)    # encode image once

# Segment from a point prompt (SamPredictor expects NumPy arrays, not lists)
import numpy as np

masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),   # (x, y) coordinates
    point_labels=np.array([1]),            # 1=foreground, 0=background
    multimask_output=True
)
print(f"Best mask score: {scores.max():.3f}")
print(f"Mask shape: {masks[0].shape}")   # (H, W) boolean

# SAM 2 (2024): extends to video; tracks objects across frames

Q19. What is image classification on edge devices? How do you optimize for mobile?

| Technique | Method | Speedup |
|---|---|---|
| Quantization (INT8) | Convert weights from FP32 to INT8 | 2-4x |
| Pruning | Remove unimportant weights/channels | 2-10x (structured) |
| Architecture choice | MobileNetV3, EfficientNet-Lite, ShuffleNetV2 | Built-in |
| Knowledge distillation | Train small model from large teacher | Task-dependent |
| TorchScript / ONNX | Export for optimized runtime | Platform-dependent |

import torch
import torchvision.models as models

# MobileNetV3-Small: best accuracy/latency for mobile
model = models.mobilenet_v3_small(weights='IMAGENET1K_V1')
model.eval()

# ONNX export for edge deployment
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, 'mobilenet_v3s.onnx',
                   input_names=['input'], output_names=['output'],
                   dynamic_axes={'input': {0: 'batch'}},
                   opset_version=17)

# TorchScript for mobile (avoids Python interpreter)
scripted = torch.jit.script(model)
scripted.save('mobilenet_v3s.pt')

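# Quantization (INT8): eager-mode post-training static quantization, in outline.
# Note: eager mode also needs QuantStub/DeQuantStub around the network (and usually
# fused Conv+BN+ReLU modules). The plain torchvision model has neither, so use a
# quantization-ready variant from torchvision.models.quantization for a working run.
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(model, inplace=True)
# calibrate with representative data ...
model_int8 = torch.quantization.convert(model)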
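
What "convert weights from FP32 to INT8" means numerically can be shown with a small affine-quantization sketch. This is an illustration of the general scheme, not PyTorch's internal implementation; the function names and the example range [0, 2.55] are ours:

```python
def quantize_params(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero point for asymmetric (affine) INT8 quantization."""
    # The representable range must include zero so that 0.0 maps to an exact integer
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an integer in [qmin, qmax], clamping out-of-range values."""
    return max(qmin, min(qmax, int(round(x / scale)) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer back to the float it approximately represents."""
    return (q - zero_point) * scale


scale, zp = quantize_params(0.0, 2.55)   # e.g. a post-ReLU activation range
q = quantize(1.0, scale, zp)
print(scale, zp, q, dequantize(q, scale, zp))
```

The round-trip error is at most half a quantization step, so the calibration pass matters: a range that is too wide wastes resolution, and one that is too narrow clips large activations.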

Q20. How do you compute mean Average Precision (mAP) for object detection?

COCO mAP@0.5:0.95: Average over IoU thresholds from 0.5 to 0.95 in 0.05 steps. More demanding than VOC mAP@0.5.

For each class:
  1. Sort all detections by confidence
  2. At each threshold, assign TP (IoU >= threshold) or FP
  3. Compute precision and recall at each point
  4. AP = area under precision-recall curve

mAP = mean(AP across all classes)
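
The per-class steps above can be sketched for one class at one IoU threshold. This sketch assumes the TP/FP decision has already been made for each detection (each ground-truth box matched at most once) and uses the all-point area under the precision envelope; COCO's official evaluator samples 101 recall points instead, so values can differ slightly. The function name is ours:

```python
def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs
    num_gt:     number of ground-truth boxes for this class
    """
    # Step 1: sort by confidence, descending
    detections = sorted(detections, key=lambda d: d[0], reverse=True)

    # Steps 2-3: running precision and recall after each detection
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Make precision non-increasing in recall (the usual "envelope")
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Step 4: area under the precision-recall curve
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


dets = [(0.9, True), (0.8, False), (0.7, True)]
print(f"AP = {average_precision(dets, num_gt=2):.3f}")
```

mAP is then the mean of this value over classes, and for the COCO metric also over the ten IoU thresholds.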
import torch
from torchmetrics.detection import MeanAveragePrecision

# Default iou_thresholds=None uses the COCO range 0.50:0.05:0.95,
# so 'map' below really is mAP@0.5:0.95
metric = MeanAveragePrecision(iou_type='bbox')

preds = [{
    'boxes': torch.tensor([[10., 10., 100., 100.], [200., 200., 300., 300.]]),
    'scores': torch.tensor([0.9, 0.75]),
    'labels': torch.tensor([0, 1])
}]
targets = [{
    'boxes': torch.tensor([[15., 15., 105., 105.], [210., 210., 310., 310.]]),
    'labels': torch.tensor([0, 1])
}]

metric.update(preds, targets)
results = metric.compute()
print(f"mAP@0.5:      {results['map_50']:.3f}")
print(f"mAP@0.75:     {results['map_75']:.3f}")
print(f"mAP@0.5:0.95: {results['map']:.3f}")

HARD: Advanced CV (Questions 21-28)

Q21. What is self-supervised learning for vision? Explain DINO and MAE.

| Method | How | What it learns |
|---|---|---|
| DINO (Meta, 2021) | Student-teacher with momentum; self-distillation across augmented views | Semantic features without labels |
| MAE (Meta, 2022) | Mask 75% of patches; reconstruct | Dense pixel-level understanding |
| SimCLR | Contrastive learning; augmentation invariance | Semantic similarity |
| MoCo v3 | Momentum contrastive with transformer | Stable contrastive features |

DINO is especially powerful because it naturally learns semantic segmentation structure (patches of the same object get similar representations) without any supervision:

import torch
from torchvision import transforms

# Load DINOv2 (Meta, 2023) - best self-supervised ViT in 2026
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2.eval().cuda()

# Get patch features (patch size 14 -> 16x16 = 256 patches for a 224x224 input)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

from PIL import Image
image = Image.open("image.jpg").convert("RGB")   # any RGB image

with torch.no_grad():
    img_t = transform(image).unsqueeze(0).cuda()
    features = dinov2.forward_features(img_t)
    patch_features = features['x_norm_patchtokens']   # [1, 256, 1024]
    cls_feature    = features['x_norm_clstoken']      # [1, 1024]
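
The "student-teacher with momentum" entry in the table refers to an exponential moving average: the teacher is never trained by gradient descent, it just trails the student. A minimal sketch on flat lists of weights; the function name is ours, and the default momentum of 0.996 is an assumed starting value in the range such methods typically use:

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


teacher = [1.0, 0.0]
student = [0.0, 1.0]
for _ in range(3):
    # In real training the student would also change between steps
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

A high momentum makes the teacher a slowly varying average of past students, which gives stable targets and is one of the ingredients that prevents the two networks from collapsing to a trivial solution.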

Q22. What is optical flow? How is it used in video understanding?

import cv2
import numpy as np

# Lucas-Kanade sparse optical flow (track specific points)
cap = cv2.VideoCapture('video.mp4')
ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)

# Detect corners to track
corners = cv2.goodFeaturesToTrack(old_gray, maxCorners=100,
                                    qualityLevel=0.3, minDistance=7)

while True:
    ret, frame = cap.read()
    if not ret: break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Calculate optical flow
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(
        old_gray, gray, corners, None,
        winSize=(15,15), maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03)
    )
    good_new = new_pts[status==1]
    good_old = corners[status==1]
    old_gray = gray.copy()
    corners = good_new.reshape(-1, 1, 2)

# Deep optical flow: RAFT (state-of-the-art)
# torchvision.models.optical_flow.raft_large()

Q23. How do you handle class imbalance in object detection datasets?

| Problem | Example | Solution |
|---|---|---|
| Rare class, few examples | Pedestrians vs cars in parking lot | Oversampling rare class images |
| Easy negatives dominate | Background vs object | Focal loss; hard negative mining |
| Scale imbalance | Tiny objects vs large | FPN; multi-scale training; mosaic augmentation |
| Long-tail classes | LVIS dataset (1203 categories) | Federated Loss; repeat factor sampling |

# Repeat Factor Sampling (LVIS standard)
# Images with rare categories are sampled more frequently
import numpy as np

def compute_repeat_factors(dataset, repeat_threshold=0.001):
    """
    Compute how many times each image should be repeated.
    Images with rare categories get higher repeat factors.
    Assumes dataset.images is a list of dicts, each with a 'category_ids' list.
    """
    # Category frequency = fraction of images containing the category,
    # so each category is counted at most once per image
    category_freq = {}    # category_id -> frequency
    for img in dataset.images:
        for cat_id in set(img['category_ids']):
            category_freq[cat_id] = category_freq.get(cat_id, 0) + 1

    total_images = len(dataset.images)
    for cat_id in category_freq:
        category_freq[cat_id] /= total_images

    image_repeat_factors = []
    for img in dataset.images:
        cat_ids = img['category_ids']
        # Image-level factor = factor of its rarest category; never below 1.0
        rf = max([max(1.0, (repeat_threshold / category_freq[c]) ** 0.5)
                   for c in cat_ids] or [1.0])
        image_repeat_factors.append(rf)
    return image_repeat_factors
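
The focal loss listed in the table for the "easy negatives dominate" case can be illustrated in a few lines. This is a scalar sketch for a single example, with `p_true` being the predicted probability of the correct class; the defaults gamma = 2 and alpha = 0.25 are the commonly used RetinaNet settings, and the function names are ours:

```python
import math


def cross_entropy(p_true):
    """Standard cross-entropy for the probability assigned to the correct class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Cross-entropy down-weighted by (1 - p)^gamma, so easy examples contribute little."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


for p in (0.9, 0.5, 0.1):
    print(f"p={p}: CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
```

An easy example at p = 0.9 has its loss scaled by (0.1)^2 = 0.01, while a hard one at p = 0.1 keeps 81% of its weight, so thousands of easy background anchors no longer drown out the few hard ones.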

Q24. What is diffusion-based image generation? How does Stable Diffusion work?

Stable Diffusion architecture:

  1. Text encoder: CLIP encodes text prompt to conditioning vector
  2. VAE: Operates in compressed latent space (8x spatial compression) for efficiency
  3. U-Net denoiser: Predicts noise at each step, conditioned on text via cross-attention
  4. Scheduler: Controls denoising trajectory (DDPM, DDIM, DPM-Solver)
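
Two pieces of arithmetic behind this architecture are easy to sketch: the latent tensor shape implied by 8x compression, and the classifier-free guidance (CFG) rule that combines the conditioned and unconditioned noise predictions. The function names are ours, and the 4 latent channels are the value used by the Stable Diffusion 1.x/2.x VAE:

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent the U-Net denoises for a given image size."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale=7.5):
    """Push the prediction away from the unconditional one, toward the text-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


print(latent_shape(768, 512))                       # latent for a 768x512 image
print(classifier_free_guidance([0.0], [1.0], 7.5))  # scale > 1 extrapolates past the conditional
```

The U-Net therefore works on 64 times fewer spatial positions than the pixel image, which is the main reason latent diffusion is affordable on a single GPU.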
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'stabilityai/stable-diffusion-2-1',
    torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to('cuda')

# Enable memory optimization
pipe.enable_attention_slicing()

image = pipe(
    prompt="a photorealistic portrait of an AI engineer, cinematic lighting",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=25,     # DPM-Solver is fast: 25 steps sufficient
    guidance_scale=7.5,         # CFG scale: 7-9 is typical
    height=768, width=512
).images[0]
image.save('generated.png')
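
Two numbers in the call above are easier to reason about with a little arithmetic: the 8x latent compression and `guidance_scale`. The sketch below is plain Python with hypothetical helper names (`latent_shape`, `cfg_combine`). It assumes the 4-channel latent used by Stable Diffusion and shows the classifier-free guidance combination on toy lists rather than real tensors.

```python
def latent_shape(height, width, channels=4, factor=8):
    """Shape of the latent tensor the U-Net denoises for a given output image size."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the VAE factor")
    return (channels, height // factor, width // factor)

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move the unconditional noise estimate toward the conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# The 768x512 request is denoised as a much smaller latent
print(latent_shape(768, 512))

# Toy noise estimates: the guided value moves past eps_cond where the two disagree
print(cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5))
```

With `guidance_scale=1.0` the result equals the conditioned estimate. Larger values extrapolate further along the difference between the two estimates, which is why very high scales tend to over-saturate images.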

Q25. Design a production pipeline for real-time license plate recognition.

System design for ANPR (Automatic Number Plate Recognition):

Input: Camera feed at 30fps

Stage 1: Vehicle Detection (< 5ms)
  Model: YOLOv8n (nano) - 3.2ms on GPU
  Task: Detect vehicles in frame; extract bounding boxes
  Optimization: Skip frames at low traffic; only run on regions of motion

Stage 2: Plate Detection (< 3ms)
  Model: YOLOv8s fine-tuned on license plate dataset
  Task: Within vehicle crop, detect plate location
  Dataset: Kaggle license plate + internal dataset

Stage 3: OCR (< 10ms)
  Model: TrOCR (transformer-based OCR) or PaddleOCR
  Task: Read characters on plate
  Post-process: Regex validation (plate format), correction

Stage 4: Post-processing
  Deduplication: Track plate across frames; confirm after 3 consistent reads
  Database lookup: Redis cache for banned/flagged plates (sub-millisecond)
  Alert: Webhook to parking/security system

Total pipeline: < 20ms per frame = 50+ FPS headroom
import re

from ultralytics import YOLO
from paddleocr import PaddleOCR

vehicle_detector = YOLO('yolov8n.pt')          # nano model, matching Stage 1 above
plate_detector   = YOLO('plate_detector.pt')   # your fine-tuned plate weights
# PaddleOCR 2.x-style arguments and result format are assumed here
ocr = PaddleOCR(lang='en', use_angle_cls=True, use_gpu=True)

PLATE_PATTERN = re.compile(r'^[A-Z]{2}\d{2}[A-Z]{2}\d{4}$')  # Indian plate

def process_frame(frame):
    vehicles = vehicle_detector(frame, classes=[2, 3, 5, 7])[0]  # car, motorcycle, bus, truck
    plates = []
    for box in vehicles.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        crop = frame[y1:y2, x1:x2]
        plate_det = plate_detector(crop)[0]
        if len(plate_det.boxes) == 0:
            continue
        # Use the first detected plate; coordinates are relative to the vehicle crop
        px1, py1, px2, py2 = map(int, plate_det.boxes[0].xyxy[0])
        plate_crop = crop[py1:py2, px1:px2]
        result = ocr.ocr(plate_crop)
        if not result or result[0] is None:   # OCR found no text
            continue
        # Normalize before validating so the stored plate matches the pattern
        text = ''.join(r[1][0] for r in result[0]).replace(' ', '').upper()
        if PLATE_PATTERN.match(text):
            plates.append(text)
    return plates
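
Stage 4 confirms a plate only after 3 consistent reads, which the per-frame function does not cover. One way this could look is sketched below in plain Python. The `PlateConfirmer` class and the idea of a `track_id` supplied by an upstream tracker are assumptions for illustration, not part of the pipeline above.

```python
from collections import defaultdict, deque

class PlateConfirmer:
    """Confirm a plate once the last `required` OCR reads for a track agree."""

    def __init__(self, required=3):
        self.required = required
        # Keep only the most recent `required` reads per tracked vehicle
        self.history = defaultdict(lambda: deque(maxlen=required))
        self.confirmed = {}   # track_id -> confirmed plate text

    def update(self, track_id, text):
        """Record one read; return the plate the first time it is confirmed, else None."""
        reads = self.history[track_id]
        reads.append(text)
        agreed = len(reads) == self.required and len(set(reads)) == 1
        if agreed and self.confirmed.get(track_id) != text:
            self.confirmed[track_id] = text
            return text
        return None

confirmer = PlateConfirmer()
# One misread ("8" instead of "3") delays confirmation until three clean reads in a row
reads = ["MH12AB1234", "MH12AB1284", "MH12AB1234", "MH12AB1234", "MH12AB1234"]
results = [confirmer.update(7, r) for r in reads]
print(results)
```

Requiring consecutive agreement trades a few frames of latency for fewer false alerts. A majority vote over a longer window would tolerate occasional misreads better but confirms more slowly.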

Q26. What is knowledge distillation for vision models? How do you compress ResNet-50 to ResNet-18?

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

teacher = models.resnet50(weights='IMAGENET1K_V2')
student = models.resnet18(weights=None)   # train student from scratch

# ResNet-18 already ends in a 1000-way head; re-create it explicitly so student
# logits line up with the teacher's 1000 ImageNet classes
student.fc = nn.Linear(student.fc.in_features, 1000)

teacher.eval()
for param in teacher.parameters():
    param.requires_grad = False

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean'
    ) * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                              momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)

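# dataloader: assumed to be an ImageNet-style DataLoader yielding (images, labels)
for epoch in range(90):
    student.train()
    for batch_x, batch_y in dataloader:
        with torch.no_grad():
            teacher_out = teacher(batch_x)   # soft targets; no gradient through the teacher
        student_out = student(batch_x)
        loss = distill_loss(student_out, teacher_out, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()   # cosine schedule advances once per epoch (T_max=90)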
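
Knowledge distillation trains a small student to match the softened output distribution of a large teacher, not just the hard labels. The temperature `T` in `distill_loss` controls that softening. The sketch below is plain Python with made-up logits and a hypothetical helper name. It shows how a higher temperature exposes the teacher's relative preferences among the wrong classes, which is the extra signal the student learns from.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up teacher logits for three classes: cat, dog, truck
teacher_logits = [8.0, 4.0, 1.0]

hard = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)

print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])
```

At `T=1` nearly all probability sits on the top class. At `T=4` the "dog" and "truck" probabilities become large enough to carry information about class similarity. The `T ** 2` factor in the loss compensates for the smaller gradients that softened targets produce.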

Q27. What is domain adaptation in computer vision?

| Method | How | Label Required (Target) |
|---|---|---|
| Fine-tuning | Train on small labeled target set | Yes (few) |
| Domain-adversarial (DANN) | Gradient reversal makes features domain-invariant | No |
| Self-training | Pseudo-label confident predictions; retrain | No |
| Style transfer pre-processing | Make source look like target | No |
| Test-time adaptation (TTA) | Adapt model on unlabeled test images | No |

import torch
import torch.nn as nn

class GradientReversalLayer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None   # reverse gradient

class DomainAdversarialNN(nn.Module):
    def __init__(self, backbone, feature_dim, num_classes, num_domains=2):
        super().__init__()
        self.backbone = backbone
        self.classifier    = nn.Linear(feature_dim, num_classes)
        self.domain_head   = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, num_domains)
        )

    def forward(self, x, alpha=1.0):
        features   = self.backbone(x)
        class_out  = self.classifier(features)
        rev_features = GradientReversalLayer.apply(features, alpha)
        domain_out = self.domain_head(rev_features)
        return class_out, domain_out
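
The `alpha` argument of `forward` sets how strongly the reversed domain gradient pushes on the backbone. A common choice, following the schedule described in the original DANN paper, is to ramp it from 0 towards 1 so the domain head's noisy early gradients do not disrupt feature learning. The function name `dann_alpha` and the step-based progress measure below are assumptions for illustration.

```python
import math

def dann_alpha(step, total_steps, gamma=10.0):
    """Ramp the gradient-reversal strength from 0 towards 1 over training."""
    p = step / total_steps                       # training progress in [0, 1]
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

total_steps = 1000
for step in (0, 100, 250, 500, 1000):
    print(step, round(dann_alpha(step, total_steps), 4))
```

In a training loop this value would be passed as `model(x, alpha=dann_alpha(step, total_steps))`. The class loss is computed on labeled source batches, and the domain loss on source and target batches together.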

Q28. Design a visual search engine (search by image, return similar products).

Visual Search Architecture:

Offline (indexing):
  1. Product catalog: 10M product images
  2. Feature extraction: EfficientNet-B4 backbone, strip head, extract 1792-dim embedding
  3. L2 normalization of embeddings
  4. Build FAISS IVF-PQ index: under 1GB of 64-byte PQ codes for 10M embeddings (vs ~72GB for a flat float32 index)
  5. Store: embedding -> product_id mapping

Online (query):
  1. User uploads image (or captures on mobile)
  2. Preprocessing: center crop 380x380, normalize (ImageNet stats)
  3. Feature extraction: same backbone, ~50ms on GPU (batched), ~200ms on CPU
  4. ANN search: FAISS returns top-100 nearest neighbors in < 5ms
  5. Re-rank: fine-grained similarity with more expensive cross-attention model
  6. Return top-10 with metadata (price, category, URL)
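import faiss
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models

# Build index
# EfficientNet-B4 yields the 1792-dim pooled features and 380x380 input used in this design
# (torchvision's EfficientNetV2-M produces 1280-dim features instead)
backbone = models.efficientnet_b4(weights='IMAGENET1K_V1')
backbone.classifier = nn.Identity()   # remove classification head
backbone.eval().cuda()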

def extract_features(images, batch_size=64):
    all_features = []
    for i in range(0, len(images), batch_size):
        batch = torch.stack(images[i:i+batch_size]).cuda()
        with torch.no_grad():
            feat = backbone(batch).cpu().numpy()
        all_features.append(feat)
    features = np.concatenate(all_features)
    faiss.normalize_L2(features)   # L2 normalize for cosine similarity
    return features

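# IVF-PQ index: fast ANN for 10M+ vectors
# catalog_features: float32 array of shape (num_products, d) from extract_features()
d = 1792  # embedding dim (must be divisible by M)
# On unit-norm vectors, L2 ranking matches cosine ranking
quantizer = faiss.IndexFlatL2(d)
# Positional args (the Python bindings do not take keywords): nlist=4096, M=64, nbits=8
index = faiss.IndexIVFPQ(quantizer, d, 4096, 64, 8)
index.train(catalog_features)
index.add(catalog_features)
faiss.write_index(index, 'visual_search.index')

# Query
def search(query_image, top_k=10):
    # query_image: preprocessed tensor, same transforms as the catalog images
    feat = extract_features([query_image])
    index.nprobe = 64   # check 64 of 4096 clusters (tradeoff: accuracy vs speed)
    distances, indices = index.search(feat, top_k)
    # FAISS pads with -1 when fewer than top_k neighbors are found
    return [product_catalog[i] for i in indices[0] if i != -1]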
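
The index sizes in the outline are order-of-magnitude figures. A back-of-envelope check for the settings used in the code (10M vectors, 1792 dimensions, M=64 sub-quantizers at 8 bits) can be sketched as follows. The helper names are hypothetical. The estimate assumes float32 storage and one 64-bit id per vector, and ignores the coarse centroids and PQ codebooks, which add only tens of megabytes.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_float=4):
    """Raw float32 storage for an exact (flat) index."""
    return num_vectors * dim * bytes_per_float

def ivfpq_code_bytes(num_vectors, M, nbits=8, id_bytes=8):
    """PQ codes plus one 64-bit id per vector; ignores centroids and codebooks."""
    code_bytes = M * nbits // 8
    return num_vectors * (code_bytes + id_bytes)

n, d, M = 10_000_000, 1792, 64
print(f"flat:   {flat_index_bytes(n, d) / 1e9:.1f} GB")
print(f"IVF-PQ: {ivfpq_code_bytes(n, M) / 1e9:.2f} GB")
```

The roughly hundredfold reduction is what lets the whole catalog sit in RAM on a single machine. The cost is approximate distances, which is why the online path re-ranks the top-100 candidates with a more precise model.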

Computer Vision Tools at a Glance

| Use Case | Tool / Library | Notes |
|---|---|---|
| Data loading and augmentation | torchvision, albumentations | albumentations is faster and offers more options |
| Classification (pretrained) | timm (PyTorch Image Models) | 700+ models, best collection |
| Detection (production) | Ultralytics YOLOv8 | Fastest iteration for custom datasets |
| Detection (research) | MMDetection, Detectron2 | More architectures |
| Segmentation | Mask2Former, SAM | HuggingFace for Mask2Former |
| Keypoint detection | MediaPipe, ViTPose | MediaPipe for real-time |
| OCR | PaddleOCR, TrOCR | PaddleOCR for production speed |
| Feature extraction | DINOv2, CLIP | Best self-supervised features |
| Deployment | ONNX + TensorRT, TorchScript | TensorRT for NVIDIA production |

FAQ

Q: YOLO vs Faster R-CNN: which should I use? A: YOLOv8 for most real-time applications (30+ FPS requirement). Faster R-CNN for medical imaging, aerial/satellite imagery, or when small object accuracy matters more than speed.

Q: Is OpenCV still needed in 2026? A: Yes, for preprocessing, classical image operations (blur, resize, color conversion), reading video streams, and operations not covered by torchvision. It's not being replaced.

Q: How much data do I need to fine-tune a YOLO model? A: A practical minimum is 200-500 annotated images per class. With data augmentation, this can produce a useful model. Quality matters more than quantity.

Q: What is the difference between semantic segmentation and instance segmentation? A: Semantic segmentation labels each pixel with a class (all cats = same color). Instance segmentation additionally separates individual objects of the same class (cat1 vs cat2 have different IDs).



Methodology applied to this article (last verified 8 Jun 2026)
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
