
Computer Vision Interview Questions 2026: 28 Answers with Code

28 computer vision interview questions with PyTorch code covering CNNs, object detection, segmentation, ViTs, image augmentation, and production CV system design for 2026.

By Aditya Sharma | Published 8 Jun 2026 | 2 sources listed

8 min read last revised 8 Jun 2026


Computer vision powers autonomous vehicles, medical imaging, retail automation, and every modern camera application. In 2026, CV engineers are expected to know both classical image processing and modern deep learning architectures. This guide covers 28 computer vision interview questions with full answers, PyTorch code, and comparison tables.

PapersAdda's take: CV interviews at product companies ask you to implement convolutions, explain detection architectures, and reason about tradeoffs. At vision-focused companies (Ola Electric, Nuro, Waymo, Samsung), expect to implement augmentation pipelines and discuss evaluation metrics deeply. Candidates report that mAP calculation and NMS implementation are the most commonly tested CV topics in on-site rounds. According to candidate accounts from public preparation resources, object detection system design comes up in roughly half of senior CV interviews. Confirm the exact interview structure on the official careers portal of your target company.

Related articles: Deep Learning Interview Questions 2026 | AI/ML Interview Questions 2026 | PyTorch Interview Questions 2026 | Machine Learning Interview Questions 2026 | MLOps Interview Questions 2026

Which Companies Ask These Questions?

Topic	Companies
CNN Architecture	Google, Meta, Samsung, Qualcomm
Object Detection	Ola, Waymo, Tesla, Nuro, DoorDash
Image Segmentation	Medical imaging startups, autonomous driving
Vision Transformers	Google Brain, Meta AI, OpenAI
CV Data Pipelines	All companies with CV teams
Production CV Systems	Any company deploying camera-based AI

EASY: Image Processing and CNNs (Questions 1-10)

Q1. What is a convolution operation in a neural network?

import torch
import torch.nn as nn
import torch.nn.functional as F

# Manual 2D convolution (educational)
def conv2d_manual(image, kernel):
    """
    image:  [H, W]
    kernel: [kH, kW]
    """
    H, W = image.shape
    kH, kW = kernel.shape
    oH, oW = H - kH + 1, W - kW + 1
    output = torch.zeros(oH, oW)
    for i in range(oH):
        for j in range(oW):
            output[i,j] = (image[i:i+kH, j:j+kW] * kernel).sum()
    return output

# PyTorch nn.Conv2d
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different learned filters
    kernel_size=3,
    stride=1,
    padding=1,        # same padding
    bias=False        # often set to False before BatchNorm
)

x = torch.randn(1, 3, 224, 224)
out = conv(x)
print(out.shape)   # [1, 64, 224, 224]

# Parameter count: 3 * 64 * 3 * 3 = 1,728 params
print("Params:", sum(p.numel() for p in conv.parameters()))

Q2. What is the difference between valid and same padding?

Padding	Formula	Output Size	Use Case
Valid (no padding)	H_out = floor((H - k) / s) + 1	Smaller than input	When you intentionally shrink
Same (zero padding)	H_out = ceil(H / s)	Same as input (stride=1)	Preserve spatial dimensions
Reflect padding	Mirror image at borders	Same as input	Less border artifacts

# Same padding: output size = input size (stride=1)
conv_same = nn.Conv2d(3, 64, kernel_size=3, padding=1)         # padding = kernel_size//2
x = torch.randn(1, 3, 32, 32)
print(conv_same(x).shape)   # [1, 64, 32, 32]

# Valid padding: output shrinks
conv_valid = nn.Conv2d(3, 64, kernel_size=3, padding=0)
print(conv_valid(x).shape)  # [1, 64, 30, 30]

# Strided convolution: reduces spatial resolution
conv_stride2 = nn.Conv2d(3, 64, kernel_size=3, padding=1, stride=2)
print(conv_stride2(x).shape) # [1, 64, 16, 16]
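
The three shapes above all follow from one formula, H_out = floor((H + 2p - k) / s) + 1. A minimal sketch in plain Python for checking shapes by hand; the helper name conv_out_size is ours, not a PyTorch function:

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output size of a convolution along one spatial dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

print(conv_out_size(32, 3, padding=1))            # 32 (same padding, stride 1)
print(conv_out_size(32, 3, padding=0))            # 30 (valid padding)
print(conv_out_size(32, 3, padding=1, stride=2))  # 16 (strided)
```

Interviewers often ask for this arithmetic on a whiteboard, so it helps to be able to reproduce it without running a model.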

Q3. What is pooling? Compare max pooling and average pooling.

Type	Operation	Property	Use Case
Max Pool	Take maximum in window	Translation invariance, keeps strongest feature	Hidden layers
Avg Pool	Average in window	Smoother, preserves all info	Final spatial reduction
Global Avg Pool	Average entire feature map to 1x1	Collapses each channel to a single value; replaces large FC layers	Final classification layer
Adaptive Pool	Output size specified, window computed	Handles variable input sizes	Final layers

max_pool  = nn.MaxPool2d(kernel_size=2, stride=2)     # halves spatial dims
avg_pool  = nn.AvgPool2d(kernel_size=2, stride=2)
gap       = nn.AdaptiveAvgPool2d((1,1))               # global average pool

x = torch.randn(4, 64, 14, 14)
print(max_pool(x).shape)   # [4, 64, 7, 7]
print(avg_pool(x).shape)   # [4, 64, 7, 7]
print(gap(x).shape)        # [4, 64, 1, 1]
print(gap(x).flatten(1).shape)  # [4, 64] -- then Linear(64, num_classes)

Q4. What is data augmentation for images? What are the standard augmentations?

import torchvision.transforms.v2 as T

# Standard augmentation pipeline for classification (ImageNet-style)
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),     # crop and resize
    T.RandomHorizontalFlip(p=0.5),                   # mirror
    T.ColorJitter(brightness=0.4, contrast=0.4,
                   saturation=0.4, hue=0.1),          # color jitter
    T.RandomGrayscale(p=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],          # ImageNet statistics
                 std=[0.229, 0.224, 0.225])
])

# Advanced augmentation for robust models
advanced_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.TrivialAugmentWide(),      # AutoAugment variant, easy to use
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.1)       # CutOut: random rectangles zeroed out
])

# MixUp and CutMix (widely used for classification); both mix samples within a batch,
# so apply them to (images, labels) after the DataLoader has collated a batch
cutmix = T.CutMix(num_classes=1000)
mixup  = T.MixUp(num_classes=1000)

Augmentations for object detection (must preserve bounding boxes):

import albumentations as A

detection_transform = A.Compose([
    A.RandomResizedCrop(height=640, width=640, scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, p=0.5),
    A.Blur(blur_limit=3, p=0.1),
    A.GaussNoise(var_limit=(10, 50), p=0.1),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

Q5. What are the major CNN architectures? How did they evolve from AlexNet to 2026?

Architecture	Year	Innovation	Top-1 ImageNet
AlexNet	2012	First deep CNN; ReLU, dropout, GPU	57.1%
VGG-16	2014	Deep uniform 3x3 convolutions	71.5%
GoogLeNet	2014	Inception modules; 1x1 convolutions	69.8%
ResNet-50	2015	Residual connections; enabled networks up to 152 layers	76.1%
DenseNet	2017	Dense connections; feature reuse	77.2%
EfficientNet-B7	2019	Compound scaling; NAS	84.3%
Vision Transformer (ViT-L)	2020	Patch-based self-attention	87.8%
ConvNeXt-L	2022	Modernized ResNet with transformer tricks	87.8%
EfficientNetV2-XL	2022	Progressive learning, Fused-MBConv	87.3%

import torchvision.models as models

# Modern choices for 2026
efficientnet = models.efficientnet_v2_l(weights='IMAGENET1K_V1')
convnext = models.convnext_large(weights='IMAGENET1K_V1')
vit = models.vit_h_14(weights='IMAGENET1K_SWAG_E2E_V1')  # ViT-H/14, SWAG weights

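# For transfer learning (fine-tuning just the head)
def make_classification_model(backbone_name, num_classes):
    if backbone_name == 'efficientnet_v2_m':
        model = models.efficientnet_v2_m(weights='IMAGENET1K_V1')
        for param in model.parameters():
            param.requires_grad = False       # freeze the pretrained backbone
        in_features = model.classifier[1].in_features
        model.classifier = nn.Sequential(      # new head is trainable by default
            nn.Dropout(0.3), nn.Linear(in_features, num_classes)
        )
    else:
        raise ValueError(f"Unsupported backbone: {backbone_name}")
    return model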

Q6. What is batch normalization in CNNs? What happens at inference?

Training:

x_hat = (x - mean_batch) / sqrt(var_batch + eps)
y = gamma * x_hat + beta

Computed over (N, H, W) for each channel C. Running statistics updated via EMA:

running_mean = 0.9 * running_mean + 0.1 * batch_mean

Inference: Use running_mean and running_var (accumulated during training). Call model.eval() to switch modes.
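
To make the train/inference difference concrete, here is a minimal sketch for a single channel in plain Python. The function names are ours; the momentum of 0.1 matches the EMA formula above, and the use of the unbiased variance for the running estimate follows PyTorch's documented behavior.

```python
import math

def batchnorm_train_step(values, running_mean, running_var,
                         gamma=1.0, beta=0.0, momentum=0.1, eps=1e-5):
    """Normalize one channel with batch statistics and update the running stats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n        # biased, used to normalize
    out = [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]
    unbiased_var = var * n / (n - 1) if n > 1 else var    # used for the running estimate
    new_mean = (1 - momentum) * running_mean + momentum * mean
    new_var = (1 - momentum) * running_var + momentum * unbiased_var
    return out, new_mean, new_var

def batchnorm_eval(values, running_mean, running_var,
                   gamma=1.0, beta=0.0, eps=1e-5):
    """Inference: fixed running statistics, no dependence on the batch."""
    return [gamma * (v - running_mean) / math.sqrt(running_var + eps) + beta
            for v in values]

channel = [1.0, 2.0, 3.0, 4.0]   # all N*H*W activations of one channel
out, rm, rv = batchnorm_train_step(channel, running_mean=0.0, running_var=1.0)
print(rm, rv)                    # running stats move 10% toward the batch stats
print(batchnorm_eval([2.5], rm, rv))
```

Because the eval path ignores the current batch, a single image gives the same output regardless of what else is in the batch, which is why forgetting model.eval() causes unstable predictions.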

# Common mistake: forgetting model.eval() at inference
model.eval()   # MUST do this; switches BatchNorm to use running stats

# Debug: check if model is in train vs eval mode
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        print(f"{name}: training={module.training}, "
              f"running_mean={module.running_mean.mean():.3f}")

# BatchNorm vs InstanceNorm vs GroupNorm vs LayerNorm in CV
# BatchNorm: standard for CNNs with large batches
# InstanceNorm: style transfer, each sample normalized independently
# GroupNorm: small batches (detection, segmentation); stable with batch_size=2
# LayerNorm: Vision Transformers
bn = nn.BatchNorm2d(64)
gn = nn.GroupNorm(num_groups=8, num_channels=64)  # 8 groups of 8 channels

Q7. What is depthwise separable convolution? Why does MobileNet use it?

Depthwise separable convolution splits into:

Depthwise: 1 filter per input channel (C_in filters, each k x k)
Pointwise: 1x1 convolution to mix channels (C_in -> C_out)

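Standard convolution cost: k^2 * C_in * C_out * H * W

Depthwise separable cost: (k^2 * C_in + C_in * C_out) * H * W

Savings: the cost ratio is 1/C_out + 1/k^2, so the reduction approaches k^2 (roughly 8-9x for 3x3 convolutions) when C_out is large.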
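
A quick arithmetic check of the savings, as a sketch in plain Python. It assumes stride 1 with same padding (so the output is H x W) and counts multiply-accumulates only; the function names are illustrative.

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates for a standard k x k convolution."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_macs(c_in, c_out, k, h, w):
    """Depthwise (k x k per channel) plus pointwise (1x1) multiply-accumulates."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

std = standard_conv_macs(32, 64, 3, 56, 56)
sep = depthwise_separable_macs(32, 64, 3, 56, 56)
print(f"speedup: {std / sep:.2f}x")              # about 7.9x for C_out=64
print(f"ratio:   {sep / std:.4f}")               # equals 1/C_out + 1/k^2
print(f"formula: {1 / 64 + 1 / 9:.4f}")
```

With only 64 output channels the saving is a little under 9x; it gets closer to 9x as C_out grows.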

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise  = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                     padding=1, stride=stride,
                                     groups=in_ch,  # groups=in_ch = depthwise
                                     bias=False)
        self.pointwise  = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn_dw      = nn.BatchNorm2d(in_ch)
        self.bn_pw      = nn.BatchNorm2d(out_ch)
        self.relu       = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn_dw(self.depthwise(x)))
        return self.relu(self.bn_pw(self.pointwise(x)))

# Parameter count comparison (bias disabled so only conv weights are compared)
standard = nn.Conv2d(32, 64, 3, padding=1, bias=False)  # 32*64*9 = 18,432
dw_sep   = DepthwiseSeparableConv(32, 64)                # 32*9 + 32*64 = 2,336 conv weights
print(f"Standard: {sum(p.numel() for p in standard.parameters())}")  # 18432
print(f"DW-Sep:   {sum(p.numel() for p in dw_sep.parameters())}")    # 2528 (2,336 + 192 BatchNorm params)

Q8. What is image segmentation? Compare semantic, instance, and panoptic segmentation.

Type	Output	Distinguishes Instances?	Examples
Semantic	Class per pixel (no instance IDs)	No	Lane segmentation
Instance	Separate mask per object instance	Yes	Individual person masks
Panoptic	Semantic + instance unified	Yes	All pixels labeled

from transformers import pipeline

# Semantic segmentation
seg_pipe = pipeline('image-segmentation',
                     model='facebook/mask2former-swin-large-ade-semantic')

# Instance segmentation
inst_pipe = pipeline('image-segmentation',
                      model='facebook/mask2former-swin-large-coco-instance')

# Using torchvision for segmentation inference
import torchvision
from torchvision.models.segmentation import deeplabv3_resnet101

# Semantic segmentation (DeepLab v3)
model = deeplabv3_resnet101(weights='DEFAULT')
model.eval()

import torchvision.transforms.functional as TF
from PIL import Image

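img = Image.open("image.jpg").convert("RGB")
img_t = TF.to_tensor(img)
# The pretrained weights expect ImageNet-normalized input
img_t = TF.normalize(img_t, mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225]).unsqueeze(0)
with torch.no_grad():
    output = model(img_t)['out']    # [1, 21, H, W] - 21 classes (Pascal VOC categories incl. background)
seg_mask = output.argmax(dim=1)     # [1, H, W]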

Q9. What are anchor boxes in object detection?

Why anchors: Direct regression to bounding box coordinates is unstable. Predicting offsets from anchors that already approximately match the object size is a much simpler task.

import torch
import torchvision

# Generating anchor boxes (simplified)
def generate_anchors(feature_map_size, scales, ratios, stride=16):
    """
    Returns anchor boxes as [cx, cy, w, h] for each cell in the feature map.
    """
    anchors = []
    H, W = feature_map_size
    for row in range(H):
        for col in range(W):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale * (ratio ** 0.5)
                    h = scale / (ratio ** 0.5)
                    anchors.append([cx, cy, w, h])
    return torch.tensor(anchors)

anchors = generate_anchors((14, 14), scales=[32, 64, 128], ratios=[0.5, 1.0, 2.0])
print(f"Total anchors: {len(anchors)}")  # 14*14*9 = 1764

# Modern models (FCOS, DETR) are anchor-free, using points or queries instead
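
The "offsets from anchors" idea can be sketched as an encode/decode pair. This uses the box parameterization popularized by the R-CNN family (center shifts scaled by anchor size, log-scale width and height); the function names are ours and the boxes use the same [cx, cy, w, h] layout as generate_anchors above.

```python
import math

def encode_box(anchor, box):
    """Regression targets (tx, ty, tw, th) the network learns to predict."""
    ax, ay, aw, ah = anchor
    bx, by, bw, bh = box
    return [(bx - ax) / aw, (by - ay) / ah,
            math.log(bw / aw), math.log(bh / ah)]

def decode_box(anchor, deltas):
    """Turn predicted offsets back into a [cx, cy, w, h] box."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = deltas
    return [ax + tx * aw, ay + ty * ah,
            aw * math.exp(tw), ah * math.exp(th)]

anchor = [104.0, 104.0, 64.0, 64.0]
gt_box = [110.0, 100.0, 80.0, 50.0]
deltas = encode_box(anchor, gt_box)
print([round(d, 3) for d in deltas])                       # small, well-scaled targets
print([round(v, 3) for v in decode_box(anchor, deltas)])   # recovers gt_box
```

The targets stay near zero when the anchor already roughly matches the object, which is what makes the regression easier than predicting raw pixel coordinates.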

Q10. What is IoU (Intersection over Union) and why is it the standard metric for detection?

IoU = area(Predicted ∩ Ground Truth) / area(Predicted ∪ Ground Truth)

IoU of 1.0 = perfect overlap. IoU = 0 = no overlap. Standard threshold: IoU >= 0.5 for a True Positive.

import torch

def box_iou(box1, box2):
    """
    box1, box2: [x1, y1, x2, y2] format
    """
    # Intersection
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    # Union
    area1 = (box1[2]-box1[0]) * (box1[3]-box1[1])
    area2 = (box2[2]-box2[0]) * (box2[3]-box2[1])
    union = area1 + area2 - inter

    return inter / (union + 1e-6)

# Vectorized with torchvision
from torchvision.ops import box_iou as tv_box_iou
pred_boxes = torch.tensor([[50, 50, 200, 200]], dtype=torch.float)
gt_boxes   = torch.tensor([[70, 70, 210, 210]], dtype=torch.float)
print(tv_box_iou(pred_boxes, gt_boxes))   # [[0.67]] approx (16900 / 25200)

# mAP@0.5 and mAP@0.5:0.95 are standard detection benchmarks (COCO)

MEDIUM: Object Detection and Segmentation (Questions 11-20)

Q11. How does YOLO work? Explain the key design choices.

YOLO v1-v5 evolution: Grid cell regression -> anchor boxes -> feature pyramid networks -> batch norm + CSP blocks.

YOLOv8 (2023, standard in 2026):

Anchor-free detection head (predicts center point + width/height)
C2f modules (CSP bottleneck with 2 convolutions, a faster variant of the earlier C3 block)
Decoupled head: separate heads for classification and localization
Task-aligned label assignment, with BCE classification loss and CIoU + DFL (Distribution Focal Loss) for box regression

from ultralytics import YOLO

# Inference (fastest)
model = YOLO('yolov8n.pt')   # n=nano, s=small, m=medium, l=large, x=extra
results = model('image.jpg', conf=0.25, iou=0.45, classes=[0,1,2])
for r in results:
    for box in r.boxes:
        cls = int(box.cls)
        conf = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()

# Fine-tuning on custom dataset
model = YOLO('yolov8m.pt')
results = model.train(
    data='custom_dataset.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    device='0',
    patience=20,           # early stopping
    augment=True,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    degrees=10.0,
    translate=0.1,
    scale=0.5,
    mosaic=1.0             # mosaic augmentation (YOLOv4 signature)
)

Q12. How does Faster R-CNN work? Compare with single-stage detectors.

Faster R-CNN (two-stage):

Backbone (ResNet) extracts feature maps
Region Proposal Network (RPN) proposes ~2,000 candidate regions
RoI Pooling: aligns each proposal to fixed size
Classification + box regression heads refine each proposal

Comparison:

Property	Two-Stage (Faster R-CNN)	Single-Stage (YOLO, SSD)
Accuracy	Higher, especially small objects	Slightly lower
Speed	Slower (~5 FPS on a GPU in the original paper)	Fast (30+ FPS)
Complexity	Higher (two passes)	Simpler
2026 usage	Medical imaging, aerial imagery	Real-time applications

import torchvision
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn_v2, FasterRCNN,
    fcos_resnet50_fpn   # anchor-free, often better
)

# Faster R-CNN for high-accuracy tasks
model = fasterrcnn_resnet50_fpn_v2(weights='DEFAULT')
model.eval()

# For custom classes
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = fasterrcnn_resnet50_fpn_v2(weights='DEFAULT')
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=5+1)
# Fine-tune with a custom Dataset that returns (image, target dict) pairs and a DataLoader

Q13. What is Feature Pyramid Network (FPN) and why is it used for detection?

Bottom-up pass: standard forward pass through backbone
Top-down pass: upsample high-level features and add to lower-level features via lateral connections

import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions to unify channel dims
        self.lateral = nn.ModuleList([
            nn.Conv2d(c, out_channels, 1) for c in in_channels_list
        ])
        # 3x3 smoothing convolutions
        self.smooth = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels_list
        ])

    def forward(self, features):
        # features: [C2, C3, C4, C5] from backbone
        laterals = [l(f) for l, f in zip(self.lateral, features)]
        # Top-down path
        for i in range(len(laterals)-2, -1, -1):
            laterals[i] += F.interpolate(laterals[i+1], size=laterals[i].shape[-2:],
                                          mode='nearest')
        return [self.smooth[i](laterals[i]) for i in range(len(laterals))]

Q14. What is non-maximum suppression (NMS)? Why is it needed?

Sort all boxes by confidence score (descending)
Take the highest-confidence box
Remove all other boxes with IoU > threshold (usually 0.45)
Repeat with remaining boxes
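
Since candidates are often asked to write NMS by hand, here is a minimal sketch of the four steps above in plain Python. It is an O(n^2) illustration for a single class, not a replacement for torchvision.ops.nms; the function names are ours.

```python
def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_greedy(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                      # highest remaining score
        keep.append(best)
        # drop everything that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 200, 200]]
scores = [0.9, 0.85, 0.7]
print(nms_greedy(boxes, scores))   # [0, 2]
```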

import torch
from torchvision.ops import nms, batched_nms

# Single class NMS
boxes  = torch.tensor([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 200, 200]], dtype=torch.float)
scores = torch.tensor([0.9, 0.85, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print("Kept indices:", keep)  # [0, 2] - second box removed (IoU > 0.5 with first)

# Multi-class NMS (suppress within same class only)
classes = torch.tensor([0, 0, 1])
keep = batched_nms(boxes, scores, classes, iou_threshold=0.5)

# Soft-NMS: decay scores instead of hard removal
# Avoids missing overlapping objects (crowd scenes)
# from torchvision.ops import soft_nms (not in stable torchvision; use custom)
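
A custom Soft-NMS can be short. The sketch below uses the Gaussian decay variant, where a neighbor's score is multiplied by exp(-IoU^2 / sigma) instead of being removed. The defaults (sigma=0.5, score_threshold=0.001) are common choices rather than values taken from this article, and the function names are ours.

```python
import math

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, decayed_score) pairs in the order boxes were selected."""
    scores = list(scores)                        # do not mutate the caller's list
    remaining = list(range(len(boxes)))
    selected = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        selected.append((best, scores[best]))
        for i in remaining:                      # decay overlapping neighbors
            scores[i] *= math.exp(-(iou(boxes[best], boxes[i]) ** 2) / sigma)
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return selected

boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 200, 200]]
for idx, score in soft_nms(boxes, [0.9, 0.85, 0.7]):
    print(idx, round(score, 3))   # box 1 survives, but with a much lower score
```

In a crowd scene the heavily overlapping second box may be a real second person, so keeping it with a reduced score lets a downstream confidence threshold decide.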

Q15. What is Mask R-CNN? How does it extend Faster R-CNN for instance segmentation?

Key addition: RoI Align (fixes the quantization misalignment in RoI Pooling):

Uses bilinear interpolation to sample exactly at float coordinates
Critical for mask quality
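
The bilinear sampling step can be illustrated on a tiny feature map. This is a sketch of sampling one point only; real RoI Align averages several such samples per output bin. The function name is ours.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at float coordinates (y, x)."""
    h, w = len(feature), len(feature[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)   # clamp at the border
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(feature, 0.5, 0.5))   # 1.5: blends all four neighbors
print(feature[int(0.5)][int(0.5)])          # 0.0: what quantizing to a cell would read
```

The gap between 1.5 and 0.0 is the quantization misalignment that RoI Pooling introduces and RoI Align removes.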

import torch
import torchvision
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load with pretrained weights
model = maskrcnn_resnet50_fpn_v2(weights='DEFAULT')

# Adapt for custom dataset
num_classes = 3   # hypothetical: 2 object classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
model.roi_heads.mask_predictor = MaskRCNNPredictor(
    in_features_mask, hidden_layer, num_classes
)

# Inference: returns dict with 'boxes', 'labels', 'scores', 'masks'
# img_tensor: float image tensor of shape [3, H, W] with values in [0, 1]
model.eval()
with torch.no_grad():
    predictions = model([img_tensor])

for box, label, score, mask in zip(
    predictions[0]['boxes'], predictions[0]['labels'],
    predictions[0]['scores'], predictions[0]['masks']
):
    if score > 0.5:
        binary_mask = (mask[0] > 0.5).numpy()

Q16. What is DETR (Detection Transformer)? How does it work without anchors or NMS?

CNN backbone extracts feature maps
Transformer encoder processes feature map as sequence
100 learnable object queries attend to encoder output (cross-attention)
Each query either predicts an object (class + box) or "no object"
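Hungarian matching: during training, match predictions to ground truth one-to-one; each object is assigned exactly one query, so duplicate predictions are penalized and NMS is not needed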
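
The one-to-one matching step can be sketched with a brute-force search over assignments. This is an illustration only: real implementations use scipy.optimize.linear_sum_assignment, and DETR's matching cost also includes a GIoU term that is left out here. The dictionary layout and function names are hypothetical.

```python
from itertools import permutations

def match_cost(pred, gt, l1_weight=5.0):
    """Simplified cost: negative class probability plus weighted L1 box distance."""
    class_cost = -pred["probs"][gt["label"]]
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    return class_cost + l1_weight * box_cost

def hungarian_bruteforce(preds, gts):
    """Return {query_index: gt_index} minimizing total cost (tiny inputs only)."""
    best_cost, best = float("inf"), ()
    for query_ids in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[q], gts[g]) for g, q in enumerate(query_ids))
        if cost < best_cost:
            best_cost, best = cost, query_ids
    return {q: g for g, q in enumerate(best)}

preds = [  # three object queries, boxes as normalized [cx, cy, w, h]
    {"probs": [0.1, 0.8], "box": [0.5, 0.5, 0.2, 0.2]},
    {"probs": [0.7, 0.2], "box": [0.2, 0.2, 0.1, 0.1]},
    {"probs": [0.3, 0.3], "box": [0.9, 0.9, 0.1, 0.1]},
]
gts = [
    {"label": 0, "box": [0.2, 0.2, 0.1, 0.1]},
    {"label": 1, "box": [0.5, 0.5, 0.2, 0.2]},
]
matches = hungarian_bruteforce(preds, gts)
print(matches)   # queries 1 and 0 are matched; query 2 is trained to say "no object"
```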

from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image

processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')

image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Post-process outputs (handles confidence threshold + rescaling)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]

for score, label, box in zip(results['scores'], results['labels'], results['boxes']):
    print(f"Detected {model.config.id2label[label.item()]} "
          f"({score:.3f}) at {[round(i,1) for i in box.tolist()]}")

Q17. What is the Vision Transformer (ViT)? How does it process images?

Split the 224x224 image into a 14x14 grid of 16x16-pixel patches: 196 patches
Linear projection of each patch to d_model-dimensional embedding
Add [CLS] token and positional embeddings
Pass through transformer encoder (L layers of multi-head self-attention + FFN)
Use [CLS] embedding for classification
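
The token arithmetic behind these steps is worth being able to do on the spot. A small sketch in plain Python; the helper name vit_token_shapes is ours.

```python
def vit_token_shapes(img_size=224, patch_size=16, channels=3):
    """Sequence geometry for a ViT with square images and patches."""
    grid = img_size // patch_size
    n_patches = grid * grid
    return {
        "grid": grid,                                       # patches per side
        "n_patches": n_patches,
        "seq_len": n_patches + 1,                           # plus the [CLS] token
        "patch_dim": patch_size * patch_size * channels,    # flattened pixels per patch
    }

print(vit_token_shapes())
# {'grid': 14, 'n_patches': 196, 'seq_len': 197, 'patch_dim': 768}
print(vit_token_shapes(patch_size=14)["n_patches"])   # 256 for a /14 model
```

Self-attention cost grows with the square of seq_len, which is why halving the patch size makes a ViT far more expensive.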

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # Patch projection via Conv2d with kernel=patch_size, stride=patch_size
        self.proj = nn.Conv2d(in_channels, embed_dim,
                               kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.n_patches, embed_dim))
        nn.init.normal_(self.pos_embed, std=0.02)
        nn.init.normal_(self.cls_token, std=0.02)

    def forward(self, x):
        B = x.shape[0]
        patches = self.proj(x).flatten(2).transpose(1, 2)  # [B, 196, 768]
        cls = self.cls_token.expand(B, -1, -1)              # [B, 1, 768]
        x = torch.cat([cls, patches], dim=1)                # [B, 197, 768]
        return x + self.pos_embed

# Using timm (preferred in 2026)
import timm
vit = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

Q18. What is Segment Anything Model (SAM)? How does it work?

Architecture:

Image Encoder: ViT-H/16 MAE-pretrained backbone extracts image embeddings (offline, once per image)
Prompt Encoder: Encodes sparse prompts (points, boxes) and dense prompts (masks)
Mask Decoder: Lightweight transformer that attends over image features and prompt tokens; outputs 3 candidate masks with quality scores so that an ambiguous prompt (part, sub-part, or whole object) can still be resolved

from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry['vit_h'](checkpoint='sam_vit_h_4b8939.pth')
sam.to('cuda')
predictor = SamPredictor(sam)

import cv2
import numpy as np
image = cv2.imread('image.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)    # encode image once

# Segment from a point prompt (SamPredictor expects NumPy arrays)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),    # (x, y) coordinates
    point_labels=np.array([1]),              # 1=foreground, 0=background
    multimask_output=True
)
print(f"Best mask score: {scores.max():.3f}")
print(f"Mask shape: {masks[0].shape}")   # (H, W) boolean

# SAM 2 (2024): extends to video; tracks objects across frames

Q19. What is image classification on edge devices? How do you optimize for mobile?

Technique	Method	Speedup
Quantization (INT8)	Convert weights from FP32 to INT8	2-4x
Pruning	Remove unimportant weights/channels	2-10x (structured)
Architecture choice	MobileNetV3, EfficientNet-Lite, ShuffleNetV2	Built-in
Knowledge distillation	Train small model from large teacher	Task-dependent
TorchScript / ONNX	Export for optimized runtime	Platform-dependent
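
To show what the INT8 row means numerically, here is a minimal sketch of per-tensor affine quantization in plain Python. It illustrates the arithmetic only and is not how any particular runtime implements it; the function names are ours.

```python
def quantize_int8(values):
    """Map floats to int8 with a scale and zero point covering [min, max]."""
    lo = min(min(values), 0.0)            # the range must contain 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0        # avoid a zero scale for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(x - zero_point) * scale for x in q]

weights = [-0.5, 0.0, 0.25, 1.5]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
print(q, round(scale, 5), zp)
print([round(r, 4) for r in restored])    # close to the originals, not identical
```

Each value now takes 1 byte instead of 4, and the reconstruction error is bounded by about one quantization step, which is the accuracy cost traded for speed and memory.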

import torch
import torchvision.models as models

# MobileNetV3-Small: a strong accuracy/latency trade-off for mobile
model = models.mobilenet_v3_small(weights='IMAGENET1K_V1')
model.eval()

# ONNX export for edge deployment
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, 'mobilenet_v3s.onnx',
                   input_names=['input'], output_names=['output'],
                   dynamic_axes={'input': {0: 'batch'}},
                   opset_version=17)

# TorchScript for mobile (avoids Python interpreter)
scripted = torch.jit.script(model)
scripted.save('mobilenet_v3s.pt')

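# Quantization (INT8): eager-mode post-training static quantization.
# Caveat: eager mode needs QuantStub/DeQuantStub wrappers and fused modules,
# which the plain torchvision model lacks; the quantizable variants in
# torchvision.models.quantization are the usual starting point.
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(model, inplace=True)
# calibrate with representative data ...
model_int8 = torch.quantization.convert(model)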

Q20. How do you compute mean Average Precision (mAP) for object detection?

COCO mAP@0.5:0.95: Average over IoU thresholds from 0.5 to 0.95 in 0.05 steps. More demanding than VOC mAP@0.5.

For each class:
  1. Sort all detections by confidence
  2. Going down the list, mark a detection TP if it matches a not-yet-matched ground truth with IoU >= threshold; otherwise mark it FP
  3. Compute precision and recall at each point
  4. AP = area under precision-recall curve

mAP = mean(AP across all classes)
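
The per-class AP step can be sketched in a few lines, assuming detections have already been labeled TP or FP at one IoU threshold. This version integrates the full interpolated precision-recall curve (the all-point style used by later Pascal VOC editions); COCO instead samples precision at 101 recall points, so its numbers differ slightly. The function name is ours.

```python
def average_precision(detections, num_gt):
    """detections: list of (confidence, is_true_positive); num_gt: ground-truth count."""
    if num_gt == 0:
        return 0.0
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # interpolation: precision at recall r is the best precision at any recall >= r
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p          # area of one recall step
        prev_recall = r
    return ap

dets = [(0.9, True), (0.8, False), (0.7, True)]
print(round(average_precision(dets, num_gt=2), 4))   # 0.8333
```

mAP is then the mean of this value over classes and, for the COCO metric, over the ten IoU thresholds as well.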

import torch
from torchmetrics.detection import MeanAveragePrecision

# iou_thresholds=None uses the COCO default: 0.50 to 0.95 in 0.05 steps,
# so 'map' below really is mAP@0.5:0.95
metric = MeanAveragePrecision(iou_type='bbox', iou_thresholds=None)

preds = [{
    'boxes': torch.tensor([[10., 10., 100., 100.], [200., 200., 300., 300.]]),
    'scores': torch.tensor([0.9, 0.75]),
    'labels': torch.tensor([0, 1])
}]
targets = [{
    'boxes': torch.tensor([[15., 15., 105., 105.], [210., 210., 310., 310.]]),
    'labels': torch.tensor([0, 1])
}]

metric.update(preds, targets)
results = metric.compute()
print(f"mAP@0.5:      {results['map_50']:.3f}")
print(f"mAP@0.75:     {results['map_75']:.3f}")
print(f"mAP@0.5:0.95: {results['map']:.3f}")

HARD: Advanced CV (Questions 21-28)

Q21. What is self-supervised learning for vision? Explain DINO and MAE.

Method	How	What it learns
DINO (Meta, 2021)	Student-teacher with a momentum (EMA) teacher; self-distillation across multi-crop views	Semantic features without labels
MAE (Meta, 2022)	Mask 75% of patches; reconstruct	Dense pixel-level understanding
SimCLR	Contrastive learning; augmentation invariance	Semantic similarity
MoCo v3	Momentum contrastive with transformer	Stable contrastive features

DINO is especially powerful because it naturally learns semantic segmentation structure (patches of the same object get similar representations) without any supervision:

import torch
from torchvision import transforms

# Load DINOv2 (Meta, 2023) - best self-supervised ViT in 2026
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2.eval().cuda()

# Get patch features (patch size 14, so a 16x16 grid = 256 patches for 224x224 input)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

with torch.no_grad():
    img_t = transform(image).unsqueeze(0).cuda()
    features = dinov2.forward_features(img_t)
    patch_features = features['x_norm_patchtokens']   # [1, 256, 1024]
    cls_feature    = features['x_norm_clstoken']      # [1, 1024]

Q22. What is optical flow? How is it used in video understanding?

import cv2
import numpy as np

# Lucas-Kanade sparse optical flow (track specific points)
cap = cv2.VideoCapture('video.mp4')
ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)

# Detect corners to track
corners = cv2.goodFeaturesToTrack(old_gray, maxCorners=100,
                                    qualityLevel=0.3, minDistance=7)

while True:
    ret, frame = cap.read()
    if not ret: break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Calculate optical flow
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(
        old_gray, gray, corners, None,
        winSize=(15,15), maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03)
    )
    good_new = new_pts[status==1]
    good_old = corners[status==1]
    old_gray = gray.copy()
    corners = good_new.reshape(-1, 1, 2)

cap.release()   # free the video handle once the stream ends

# Deep optical flow: RAFT (state-of-the-art)
# torchvision.models.optical_flow.raft_large()

Q23. How do you handle class imbalance in object detection datasets?

Problem	Example	Solution
Rare class few examples	Pedestrians vs cars in parking lot	Oversampling rare class images
Easy negatives dominate	Background vs object	Focal loss; hard negative mining
Scale imbalance	Tiny objects vs large	FPN; multi-scale training; mosaic augmentation
Long-tail classes	LVIS dataset (1203 categories)	Federated Loss; repeat factor sampling

# Repeat Factor Sampling (LVIS standard)
# Images with rare categories are sampled more frequently
import numpy as np

def compute_repeat_factors(dataset, repeat_threshold=0.001):
    """
    Compute how many times each image should be repeated.
    Images with rare categories get higher repeat factors.
    """
    category_freq = {}    # category_id -> frequency
    for ann in dataset.annotations:
        for cat_id in ann['category_ids']:
            category_freq[cat_id] = category_freq.get(cat_id, 0) + 1

    total_images = len(dataset)
    for cat_id in category_freq:
        category_freq[cat_id] /= total_images

    image_repeat_factors = []
    for img in dataset.images:
        cat_ids = img['category_ids']
        rf = max([max(1.0, (repeat_threshold / category_freq[c]) ** 0.5)
                   for c in cat_ids] or [1.0])
        image_repeat_factors.append(rf)
    return image_repeat_factors
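
For the "easy negatives dominate" row of the table, focal loss is the usual answer. A minimal sketch of the binary form for a single prediction, in plain Python; gamma=2.0 and alpha=0.25 are the commonly quoted defaults, and the function name is ours.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss. p: predicted probability of the positive class; y: 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)                  # keep log() finite
    p_t = p if y == 1 else 1.0 - p                   # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_negative = focal_loss(0.01, 0)   # background predicted confidently
hard_positive = focal_loss(0.10, 1)   # object the model almost missed
print(f"easy negative: {easy_negative:.2e}")
print(f"hard positive: {hard_positive:.4f}")
```

The (1 - p_t)^gamma factor shrinks the loss of well-classified examples by orders of magnitude, so thousands of easy background anchors no longer drown out the few hard objects.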

Q24. What is diffusion-based image generation? How does Stable Diffusion work?

Stable Diffusion architecture:

Text encoder: CLIP encodes text prompt to conditioning vector
VAE: Operates in compressed latent space (8x spatial compression) for efficiency
U-Net denoiser: Predicts noise at each step, conditioned on text via cross-attention
Scheduler: Controls denoising trajectory (DDPM, DDIM, DPM-Solver)
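
Two pieces of arithmetic make this architecture concrete: the latent size produced by the 8x VAE compression, and the classifier-free guidance (CFG) combination applied to the U-Net's noise predictions at each step. The sketch below is plain Python on lists; it assumes 4 latent channels (as in Stable Diffusion 1.x and 2.x), and the function names are ours.

```python
def latent_shape(height, width, vae_factor=8, latent_channels=4):
    """Shape (C, H, W) of the latent the U-Net actually denoises."""
    return (latent_channels, height // vae_factor, width // vae_factor)

def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale=7.5):
    """Push the prediction away from the unconditional one, toward the prompt."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

print(latent_shape(768, 512))                          # (4, 96, 64)
print(classifier_free_guidance([0.0, 1.0], [1.0, 1.0]))  # [7.5, 1.0]
```

Denoising a 4x96x64 latent instead of a 3x768x512 image is where most of the efficiency comes from; a guidance scale of 1.0 reduces to the plain conditional prediction.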

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'stabilityai/stable-diffusion-2-1',
    torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to('cuda')

# Enable memory optimization
pipe.enable_attention_slicing()

image = pipe(
    prompt="a photorealistic portrait of an AI engineer, cinematic lighting",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=25,     # DPM-Solver is fast: 25 steps sufficient
    guidance_scale=7.5,         # CFG scale: 7-9 is typical
    height=768, width=512
).images[0]
image.save('generated.png')

Q25. Design a production pipeline for real-time license plate recognition.

System design for ANPR (Automatic Number Plate Recognition):

Input: Camera feed at 30fps

Stage 1: Vehicle Detection (< 5ms)
  Model: YOLOv8n (nano) - 3.2ms on GPU
  Task: Detect vehicles in frame; extract bounding boxes
  Optimization: Skip frames at low traffic; only run on regions of motion

Stage 2: Plate Detection (< 3ms)
  Model: YOLOv8s fine-tuned on license plate dataset
  Task: Within vehicle crop, detect plate location
  Dataset: Kaggle license plate + internal dataset

Stage 3: OCR (< 10ms)
  Model: TrOCR (transformer-based OCR) or PaddleOCR
  Task: Read characters on plate
  Post-process: Regex validation (plate format), correction

Stage 4: Post-processing
  Deduplication: Track plate across frames; confirm after 3 consistent reads
  Database lookup: Redis cache for banned/flagged plates (sub-millisecond)
  Alert: Webhook to parking/security system

Total pipeline: < 20ms per frame = 50+ FPS headroom
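
The Stage 4 deduplication rule (confirm after 3 consistent reads) could look like the sketch below. It assumes an upstream tracker supplies a stable track_id per vehicle, which is not part of the design above; the class and method names are hypothetical.

```python
from collections import defaultdict

class PlateConfirmer:
    """Emit a plate once it has been read identically N frames in a row."""

    def __init__(self, required_reads=3):
        self.required_reads = required_reads
        self.last_text = {}               # track_id -> most recent OCR text
        self.streak = defaultdict(int)    # track_id -> consecutive identical reads
        self.confirmed = set()            # (track_id, text) pairs already emitted

    def update(self, track_id, text):
        """Return the text the first time it is confirmed, otherwise None."""
        if self.last_text.get(track_id) == text:
            self.streak[track_id] += 1
        else:                             # a different read resets the streak
            self.last_text[track_id] = text
            self.streak[track_id] = 1
        key = (track_id, text)
        if self.streak[track_id] >= self.required_reads and key not in self.confirmed:
            self.confirmed.add(key)
            return text
        return None

confirmer = PlateConfirmer()
reads = ["KA01AB1234", "KA01AB1234", "KA01A81234", "KA01AB1234",
         "KA01AB1234", "KA01AB1234", "KA01AB1234"]
events = [confirmer.update(track_id=7, text=t) for t in reads]
print(events)   # only the sixth read triggers a confirmation
```

Requiring consecutive agreement filters out single-frame OCR errors (here the B misread as 8) and keeps the downstream alert from firing more than once per vehicle.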

from ultralytics import YOLO
from paddleocr import PaddleOCR

vehicle_detector = YOLO('yolov8n.pt')   # nano model, as specified in Stage 1
plate_detector   = YOLO('plate_detector.pt')
ocr = PaddleOCR(lang='en', use_angle_cls=True, use_gpu=True)

import re
PLATE_PATTERN = re.compile(r'^[A-Z]{2}\d{2}[A-Z]{2}\d{4}$')  # Indian plate

def process_frame(frame):
    vehicles = vehicle_detector(frame, classes=[2, 3, 5, 7])[0]  # car, motorcycle, bus, truck
    plates = []
    for box in vehicles.boxes:
        crop = frame[int(box.xyxy[0][1]):int(box.xyxy[0][3]),
                      int(box.xyxy[0][0]):int(box.xyxy[0][2])]
        plate_det = plate_detector(crop)[0]
        if len(plate_det.boxes):
            plate_crop = crop[...]   # crop plate from vehicle
            result = ocr.ocr(plate_crop)
            text = ''.join([r[1][0] for r in result[0]])
            if PLATE_PATTERN.match(text.replace(' ', '').upper()):
                plates.append(text)
    return plates
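
The per-frame code above does not cover Stage 4's "confirm after 3 consistent reads". A minimal sketch of that step follows, assuming "consistent" means the same plate string read in consecutive frames and keying on the plate text rather than a tracker ID. `PlateConfirmer` is a hypothetical helper, not part of Ultralytics or PaddleOCR.

```python
class PlateConfirmer:
    """Report a plate once it has been read in `min_reads` consecutive frames."""

    def __init__(self, min_reads=3):
        self.min_reads = min_reads
        self.streaks = {}        # plate text -> consecutive frames seen
        self.confirmed = set()   # plates already reported (never expired in this sketch)

    def update(self, plates_in_frame):
        """Feed the reads from one frame; return plates confirmed by this frame."""
        seen = set(plates_in_frame)
        # A plate missing from this frame loses its streak
        self.streaks = {p: n for p, n in self.streaks.items() if p in seen}
        newly_confirmed = []
        for plate in sorted(seen):
            self.streaks[plate] = self.streaks.get(plate, 0) + 1
            if self.streaks[plate] >= self.min_reads and plate not in self.confirmed:
                self.confirmed.add(plate)
                newly_confirmed.append(plate)
        return newly_confirmed
```

In use, the output of `process_frame(frame)` would be passed to `update()` once per frame, and only the returned plates would trigger the database lookup and webhook. A production system would also expire old entries and tolerate an occasional dropped frame.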

Q26. What is knowledge distillation for vision models? How do you compress ResNet-50 to ResNet-18?

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

teacher = models.resnet50(weights='IMAGENET1K_V2')
student = models.resnet18(weights=None)   # train student from scratch

# ResNet-18 already ends in a 1000-way head that matches the teacher;
# re-creating it only matters if you change the number of classes
student.fc = nn.Linear(student.fc.in_features, 1000)

teacher.eval()
for param in teacher.parameters():
    param.requires_grad = False

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean'
    ) * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

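optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                              momentum=0.9, weight_decay=1e-4)
num_epochs = 90
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

# `dataloader` is assumed to yield (images, labels) batches from the training set
for epoch in range(num_epochs):
    student.train()
    for batch_x, batch_y in dataloader:
        with torch.no_grad():
            teacher_out = teacher(batch_x)   # soft targets; no gradient needed
        student_out = student(batch_x)
        loss = distill_loss(student_out, teacher_out, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()   # advance the cosine schedule once per epoch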
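
To see what the temperature `T` in `distill_loss` does, the sketch below recomputes the soft-target term for a single example in plain Python (no PyTorch, so it runs anywhere). The logits are made-up values for illustration. The helper `kd_term` mirrors the KL term above, including the `T ** 2` scaling.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T gives a flatter distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                               # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_term(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * T ** 2

teacher_logits = [8.0, 4.0, 1.0]     # made-up logits for three classes
hard = softmax(teacher_logits, T=1.0)   # nearly one-hot
soft = softmax(teacher_logits, T=4.0)   # runner-up classes become visible
```

At `T=1` almost all probability mass sits on the top class. At `T=4` the second and third classes receive noticeable mass, and that is the "dark knowledge" the student learns from. Multiplying by `T ** 2` keeps the gradient magnitude of the soft term comparable to the cross-entropy term as `T` changes.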

Q27. What is domain adaptation in computer vision?

Method	How	Label Required (Target)
Fine-tuning	Train on small labeled target set	Yes (few)
Domain-adversarial (DANN)	Gradient reversal makes features domain-invariant	No
Self-training	Pseudo-label confident predictions; retrain	No
Style transfer pre-processing	Make source look like target	No
Test-time adaptation (TTA)	Adapt model on unlabeled test images	No
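
The self-training row can be made concrete with a small sketch: keep only target-domain predictions whose confidence clears a threshold, then treat them as labels for the next training round. The function name and the 0.9 threshold are illustrative assumptions.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, predicted_class) for confident predictions only."""
    selected = []
    for idx, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((idx, probs.index(confidence)))
    return selected

# Softmax outputs of the source-trained model on three unlabeled target images
target_probs = [
    [0.95, 0.05],   # confident -> pseudo-label class 0
    [0.60, 0.40],   # uncertain -> skipped this round
    [0.08, 0.92],   # confident -> pseudo-label class 1
]
pseudo = select_pseudo_labels(target_probs)
```

The threshold trades label noise against coverage: a higher value yields fewer but cleaner pseudo-labels.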

import torch
import torch.nn as nn

class GradientReversalLayer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None   # reverse gradient

class DomainAdversarialNN(nn.Module):
    def __init__(self, backbone, feature_dim, num_classes, num_domains=2):
        super().__init__()
        self.backbone = backbone
        self.classifier    = nn.Linear(feature_dim, num_classes)
        self.domain_head   = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, num_domains)
        )

    def forward(self, x, alpha=1.0):
        features   = self.backbone(x)
        class_out  = self.classifier(features)
        rev_features = GradientReversalLayer.apply(features, alpha)
        domain_out = self.domain_head(rev_features)
        return class_out, domain_out
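
The `alpha` argument is usually ramped up during training rather than fixed at 1.0, so the domain head does not disrupt the features before the classifier has learned anything. The sketch below follows the schedule from the original DANN paper (Ganin and Lempitsky), with `gamma=10`. The function name `dann_alpha` is an assumption for illustration.

```python
import math

def dann_alpha(step, total_steps, gamma=10.0):
    """Ramp alpha from 0 towards 1 as training progresses (p = step / total_steps)."""
    p = min(max(step / total_steps, 0.0), 1.0)   # clamp progress to [0, 1]
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

# Example values over a 1000-step run
schedule = [dann_alpha(s, 1000) for s in (0, 250, 500, 1000)]
```

Each training step would then call `model(x, alpha=dann_alpha(step, total_steps))` and sum the classification loss on labeled source images with the domain loss on source and target images.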

Q28. Design a visual search engine (search by image, return similar products).

Visual Search Architecture:

Offline (indexing):
  1. Product catalog: 10M product images
  2. Feature extraction: EfficientNetV2-M backbone, strip head, extract 1280-dim embedding
  3. L2 normalization of embeddings
  4. Build FAISS IVF-PQ index: ~0.75GB for 10M embeddings with 64-byte PQ codes (vs ~51GB for a flat float32 index)
  5. Store: embedding -> product_id mapping

Online (query):
  1. User uploads image (or captures on mobile)
  2. Preprocessing: center crop 380x380, normalize (ImageNet stats)
  3. Feature extraction: same backbone, ~50ms on GPU (batched), ~200ms on CPU
  4. ANN search: FAISS returns top-100 nearest neighbors in < 5ms
  5. Re-rank: fine-grained similarity with more expensive cross-attention model
  6. Return top-10 with metadata (price, category, URL)

import faiss
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models

# Build index
backbone = models.efficientnet_v2_m(weights='IMAGENET1K_V1')
backbone.classifier = nn.Identity()   # remove classification head
backbone.eval().cuda()

def extract_features(images, batch_size=64):
    # `images` is a list of preprocessed (3, H, W) tensors
    all_features = []
    for i in range(0, len(images), batch_size):
        batch = torch.stack(images[i:i+batch_size]).cuda()
        with torch.no_grad():
            feat = backbone(batch).cpu().numpy()
        all_features.append(feat)
    features = np.ascontiguousarray(np.concatenate(all_features), dtype=np.float32)
    faiss.normalize_L2(features)   # unit vectors: L2 ranking matches cosine ranking
    return features

# `catalog_features` is assumed to be the (N, d) output of extract_features()
# over the catalog, and `product_catalog[i]` the product for row i

# IVF-PQ index: fast ANN for 10M+ vectors
d = catalog_features.shape[1]   # embedding dim
nlist, M, nbits = 4096, 64, 8   # clusters, PQ sub-vectors, bits per code
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)   # positional arguments
index.train(catalog_features)
index.add(catalog_features)
faiss.write_index(index, 'visual_search.index')

# Query
def search(query_image, top_k=10):
    feat = extract_features([query_image])
    index.nprobe = 64   # check 64 of 4096 clusters (tradeoff: accuracy vs speed)
    distances, indices = index.search(feat, top_k)
    return [product_catalog[i] for i in indices[0] if i != -1]   # -1 marks an empty slot
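
The index-size comparison in step 4 comes down to simple arithmetic: a flat float32 index stores 4 bytes per dimension, while IVF-PQ stores `M` one-byte codes plus an 8-byte ID per vector. A rough sketch of that calculation for the IVF-PQ parameters used above is shown here. The value `d = 1280` is an assumption (the pooled feature width of torchvision's `efficientnet_v2_m`), and the estimate ignores allocator overhead.

```python
def flat_index_bytes(n, d):
    """Uncompressed float32 vectors: 4 bytes per dimension."""
    return n * d * 4

def ivfpq_index_bytes(n, d, nlist, M, nbits=8):
    """Approximate IVF-PQ size: PQ codes + int64 IDs + coarse centroids + codebooks."""
    code_bytes = M * nbits // 8        # bytes per compressed vector
    per_vector = code_bytes + 8        # plus one int64 ID per vector
    centroids = nlist * d * 4          # float32 coarse centroids
    codebooks = (2 ** nbits) * d * 4   # M sub-codebooks of 2^nbits x (d / M) floats
    return n * per_vector + centroids + codebooks

n, d = 10_000_000, 1280   # d is an assumption, see the note above
flat_gb = flat_index_bytes(n, d) / 1e9                         # about 51 GB
pq_gb = ivfpq_index_bytes(n, d, nlist=4096, M=64) / 1e9        # about 0.74 GB
```

Almost all of the compressed size is the per-vector payload (64-byte code plus 8-byte ID). The centroids and codebooks add only a few tens of megabytes.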

Computer Vision Tools at a Glance

Use Case	Tool / Library	Notes
Data loading and augmentation	torchvision, albumentations	albumentations is faster and offers more transforms
Classification (pretrained)	timm (PyTorch Image Models)	700+ models, best collection
Detection (production)	Ultralytics YOLOv8	Fastest iteration for custom datasets
Detection (research)	MMDetection, Detectron2	More architectures
Segmentation	Mask2Former, SAM	HuggingFace for Mask2Former
Keypoint detection	MediaPipe, ViTPose	MediaPipe for real-time
OCR	PaddleOCR, TrOCR	PaddleOCR for production speed
Feature extraction	DINOv2, CLIP	Strong general-purpose features (DINOv2 is self-supervised; CLIP is image-text pretrained)
Deployment	ONNX + TensorRT, TorchScript	TensorRT for NVIDIA production

FAQ

Q: YOLO vs Faster R-CNN: which should I use?

A: Use YOLOv8 for most real-time applications (30+ FPS requirement). Consider Faster R-CNN for medical imaging, aerial/satellite imagery, or other cases where small-object accuracy matters more than speed.

Q: Is OpenCV still needed in 2026?

A: Yes. It is still the standard tool for preprocessing, classical image operations (blur, resize, color conversion), reading video streams, and operations that torchvision does not cover. It's not being replaced.

Q: How much data do I need to fine-tune a YOLO model?

A: A practical minimum is 200-500 annotated images per class. With data augmentation, this can produce a useful model. Quality matters more than quantity.

Q: What is the difference between semantic segmentation and instance segmentation?

A: Semantic segmentation labels each pixel with a class (all cats = same color). Instance segmentation additionally separates individual objects of the same class (cat1 vs cat2 have different IDs).
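
A toy illustration of the difference: the sketch below starts from a semantic mask (every cat pixel is 1) and derives per-object IDs by connected-component labeling. This only works when objects of the same class do not touch. Real instance segmentation models predict instances directly and can separate overlapping objects. The function `label_instances` is a hypothetical helper.

```python
from collections import deque

def label_instances(semantic_mask, class_id):
    """Give each 4-connected blob of `class_id` pixels its own instance ID (1, 2, ...)."""
    h, w = len(semantic_mask), len(semantic_mask[0])
    instance_mask = [[0] * w for _ in range(h)]
    next_id = 0
    for r in range(h):
        for c in range(w):
            if semantic_mask[r][c] != class_id or instance_mask[r][c]:
                continue
            next_id += 1
            instance_mask[r][c] = next_id
            queue = deque([(r, c)])
            while queue:                       # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and semantic_mask[ny][nx] == class_id
                            and not instance_mask[ny][nx]):
                        instance_mask[ny][nx] = next_id
                        queue.append((ny, nx))
    return instance_mask, next_id

CAT = 1
semantic = [            # semantic mask: both cats share class 1
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
instances, count = label_instances(semantic, CAT)
```

The semantic mask cannot tell the two cats apart, while the instance mask assigns the left blob ID 1 and the right blob ID 2.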

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.
