Computer Vision Interview Questions 2026: 28 Answers with Code

What changed in 2026 drives
Mass-recruiter offer letters are flatter for the 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats into real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the pay differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
- One real project you can defend end-to-end - file paths, design decisions, and what you would change
- One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Computer vision powers autonomous vehicles, medical imaging, retail automation, and every modern camera application. In 2026, CV engineers are expected to know both classical image processing and modern deep learning architectures. This guide covers 28 computer vision interview questions with full answers, PyTorch code, and comparison tables.
PapersAdda's take: CV interviews at product companies ask you to implement convolutions, explain detection architectures, and reason about tradeoffs. At vision-focused companies (Ola Electric, Nuro, Waymo, Samsung), expect to implement augmentation pipelines and discuss evaluation metrics deeply. Candidates report that mAP calculation and NMS implementation are the most commonly tested CV topics in on-site rounds. According to candidate accounts from public preparation resources, object detection system design comes up in roughly half of senior CV interviews. Confirm the exact interview structure on the official careers portal of your target company.
Which Companies Ask These Questions?
| Topic | Companies |
|---|---|
| CNN Architecture | Google, Meta, Samsung, Qualcomm |
| Object Detection | Ola, Waymo, Tesla, Nuro, DoorDash |
| Image Segmentation | Medical imaging startups, autonomous driving |
| Vision Transformers | Google Brain, Meta AI, OpenAI |
| CV Data Pipelines | All companies with CV teams |
| Production CV Systems | Any company deploying camera-based AI |
EASY: Image Processing and CNNs (Questions 1-10)
Q1. What is a convolution operation in a neural network?
import torch
import torch.nn as nn
import torch.nn.functional as F
# Manual 2D convolution (educational)
def conv2d_manual(image, kernel):
    """
    image: [H, W]
    kernel: [kH, kW]
    """
    H, W = image.shape
    kH, kW = kernel.shape
    oH, oW = H - kH + 1, W - kW + 1
    output = torch.zeros(oH, oW)
    for i in range(oH):
        for j in range(oW):
            output[i,j] = (image[i:i+kH, j:j+kW] * kernel).sum()
    return output
# PyTorch nn.Conv2d
conv = nn.Conv2d(
in_channels=3, # RGB input
out_channels=64, # 64 different learned filters
kernel_size=3,
stride=1,
padding=1, # same padding
bias=False # often set to False before BatchNorm
)
x = torch.randn(1, 3, 224, 224)
out = conv(x)
print(out.shape) # [1, 64, 224, 224]
# Parameter count: 3 * 64 * 3 * 3 = 1,728 params
print("Params:", sum(p.numel() for p in conv.parameters()))
Q2. What is the difference between valid and same padding?
| Padding | Formula | Output Size | Use Case |
|---|---|---|---|
| Valid (no padding) | H_out = floor((H - k) / s) + 1 | Smaller than input | When you intentionally shrink |
| Same (zero padding) | H_out = ceil(H / s) | Same as input (stride=1) | Preserve spatial dimensions |
| Reflect padding | Mirror image at borders | Same as input | Fewer border artifacts |
# Same padding: output size = input size (stride=1)
conv_same = nn.Conv2d(3, 64, kernel_size=3, padding=1) # padding = kernel_size//2
x = torch.randn(1, 3, 32, 32)
print(conv_same(x).shape) # [1, 64, 32, 32]
# Valid padding: output shrinks
conv_valid = nn.Conv2d(3, 64, kernel_size=3, padding=0)
print(conv_valid(x).shape) # [1, 64, 30, 30]
# Strided convolution: reduces spatial resolution
conv_stride2 = nn.Conv2d(3, 64, kernel_size=3, padding=1, stride=2)
print(conv_stride2(x).shape) # [1, 64, 16, 16]
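The three shapes printed above all follow from one formula. As a minimal sketch in plain Python (the helper name `conv_output_size` is hypothetical, not a PyTorch function), the output size along one spatial dimension can be computed like this:

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output size of a convolution along one spatial dimension."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

same = conv_output_size(32, kernel=3, stride=1, padding=1)     # same padding
valid = conv_output_size(32, kernel=3, stride=1, padding=0)    # valid padding
strided = conv_output_size(32, kernel=3, stride=2, padding=1)  # strided conv
print(same, valid, strided)
```

The integer division is what makes odd input sizes round down when the stride is larger than 1.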
Q3. What is pooling? Compare max pooling and average pooling.
| Type | Operation | Property | Use Case |
|---|---|---|---|
| Max Pool | Take maximum in window | Translation invariance, keeps strongest feature | Hidden layers |
| Avg Pool | Average in window | Smoother, preserves all info | Final spatial reduction |
| Global Avg Pool | Average entire feature map to 1x1 | Spatial information to 1D; no FC layers | Final classification layer |
| Adaptive Pool | Output size specified, window computed | Handles variable input sizes | Final layers |
max_pool = nn.MaxPool2d(kernel_size=2, stride=2) # halves spatial dims
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
gap = nn.AdaptiveAvgPool2d((1,1)) # global average pool
x = torch.randn(4, 64, 14, 14)
print(max_pool(x).shape) # [4, 64, 7, 7]
print(avg_pool(x).shape) # [4, 64, 7, 7]
print(gap(x).shape) # [4, 64, 1, 1]
print(gap(x).flatten(1).shape) # [4, 64] -- then Linear(64, num_classes)
Q4. What is data augmentation for images? What are the standard augmentations?
import torchvision.transforms.v2 as T
# Standard augmentation pipeline for classification (ImageNet-style)
train_transform = T.Compose([
T.RandomResizedCrop(224, scale=(0.8, 1.0)), # crop and resize
T.RandomHorizontalFlip(p=0.5), # mirror
T.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.1), # color jitter
T.RandomGrayscale(p=0.1),
T.ToTensor(),
T.Normalize(mean=[0.485, 0.456, 0.406], # ImageNet statistics
std=[0.229, 0.224, 0.225])
])
# Advanced augmentation for robust models
advanced_transform = T.Compose([
T.RandomResizedCrop(224),
T.RandomHorizontalFlip(),
T.TrivialAugmentWide(), # AutoAugment variant, easy to use
T.ToTensor(),
T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
T.RandomErasing(p=0.1) # CutOut: random rectangles zeroed out
])
# MixUp and CutMix: batch-level augmentations, applied to (images, labels) after the DataLoader collates a batch
cutmix = T.CutMix(num_classes=1000)
mixup = T.MixUp(num_classes=1000)
Augmentations for object detection (must preserve bounding boxes):
import albumentations as A
detection_transform = A.Compose([
A.RandomResizedCrop(height=640, width=640, scale=(0.5, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, p=0.5),
A.Blur(blur_limit=3, p=0.1),
A.GaussNoise(var_limit=(10, 50), p=0.1),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))
Q5. What are the major CNN architectures? How did they evolve from AlexNet to 2026?
| Architecture | Year | Innovation | Top-1 ImageNet |
|---|---|---|---|
| AlexNet | 2012 | First deep CNN; ReLU, dropout, GPU | 57.1% |
| VGG-16 | 2014 | Deep uniform 3x3 convolutions | 71.5% |
| GoogLeNet | 2014 | Inception modules; 1x1 convolutions | 74.8% |
| ResNet-50 | 2015 | Residual connections; the family scales to 152 layers | 76.1% |
| DenseNet | 2017 | Dense connections; feature reuse | 77.2% |
| EfficientNet-B7 | 2019 | Compound scaling; NAS | 84.3% |
| Vision Transformer (ViT-L) | 2020 | Patch-based self-attention | 87.8% |
| ConvNeXt-L | 2022 | Modernized ResNet with transformer tricks | 87.8% |
| EfficientNetV2-XL | 2021 | Progressive learning, Fused-MBConv | 87.3% |
import torchvision.models as models
# Modern choices for 2026
efficientnet = models.efficientnet_v2_l(weights='IMAGENET1K_V1')
convnext = models.convnext_large(weights='IMAGENET1K_V1')
vit = models.vit_h_14(weights='IMAGENET1K_SWAG_E2E_V1') # ViT-H/14, SWAG weights
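# For transfer learning: swap in a new classification head
# (freeze the backbone parameters separately if only the head should train)
def make_classification_model(backbone_name, num_classes):
    if backbone_name == 'efficientnet_v2_m':
        model = models.efficientnet_v2_m(weights='IMAGENET1K_V1')
        in_features = model.classifier[1].in_features
        model.classifier = nn.Sequential(
            nn.Dropout(0.3), nn.Linear(in_features, num_classes)
        )
    else:
        # Without this branch an unknown name would hit an undefined `model`
        raise ValueError(f"Unsupported backbone: {backbone_name}")
    return model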
Q6. What is batch normalization in CNNs? What happens at inference?
Training:
x_hat = (x - mean_batch) / sqrt(var_batch + eps)
y = gamma * x_hat + beta
Computed over (N, H, W) for each channel C. Running statistics updated via EMA:
running_mean = 0.9 * running_mean + 0.1 * batch_mean
Inference: Use running_mean and running_var (accumulated during training). Call model.eval() to switch modes.
# Common mistake: forgetting model.eval() at inference
model.eval() # MUST do this; switches BatchNorm to use running stats
# Debug: check if model is in train vs eval mode
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        print(f"{name}: training={module.training}, "
              f"running_mean={module.running_mean.mean():.3f}")
# BatchNorm vs InstanceNorm vs GroupNorm vs LayerNorm in CV
# BatchNorm: standard for CNNs with large batches
# InstanceNorm: style transfer, each sample normalized independently
# GroupNorm: small batches (detection, segmentation); stable with batch_size=2
# LayerNorm: Vision Transformers
bn = nn.BatchNorm2d(64)
gn = nn.GroupNorm(num_groups=8, num_channels=64) # 8 groups of 8 channels
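To make the train/inference difference concrete, here is a toy single-channel batch norm in plain Python. It is an illustrative sketch, not PyTorch's implementation: the class name `TinyBatchNorm` is hypothetical, and the choice of the unbiased variance for the running estimate follows what the PyTorch documentation describes.

```python
class TinyBatchNorm:
    """Batch norm for one channel over a flat list of activations (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.training = True

    def __call__(self, values):
        if self.training:
            n = len(values)
            mean = sum(values) / n
            var = sum((v - mean) ** 2 for v in values) / n  # biased, used to normalize
            unbiased_var = var * n / (n - 1) if n > 1 else var
            m = self.momentum
            # EMA update, same form as the running_mean formula above
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased_var
        else:
            # Inference: ignore the current batch, use accumulated statistics
            mean, var = self.running_mean, self.running_var
        scale = (var + self.eps) ** 0.5
        return [self.gamma * (v - mean) / scale + self.beta for v in values]

bn = TinyBatchNorm()
train_out = bn([1.0, 2.0, 3.0, 4.0])  # normalized with batch statistics
bn.training = False                   # the switch model.eval() flips for BatchNorm layers
eval_out = bn([2.5])                  # normalized with running statistics
print(bn.running_mean, bn.running_var, eval_out)
```

After a single training step the running mean has only moved 10% of the way toward the batch mean, which is why the same input value normalizes differently in the two modes.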
Q7. What is depthwise separable convolution? Why does MobileNet use it?
Depthwise separable convolution splits into:
- Depthwise: 1 filter per input channel (C_in filters, each k x k)
- Pointwise: 1x1 convolution to mix channels (C_in -> C_out)
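Standard convolution cost: k^2 * C_in * C_out * H * W
Depthwise separable cost: (k^2 * C_in + C_in * C_out) * H * W
Savings: the cost ratio is 1/C_out + 1/k^2, so roughly 8-9x fewer operations for 3x3 convolutions with typical channel counts.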
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, stride=stride,
                                   groups=in_ch, # groups=in_ch = depthwise
                                   bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn_dw = nn.BatchNorm2d(in_ch)
        self.bn_pw = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU6(inplace=True)
    def forward(self, x):
        x = self.relu(self.bn_dw(self.depthwise(x)))
        return self.relu(self.bn_pw(self.pointwise(x)))
# Parameter count comparison (bias=False so only convolution weights are counted)
standard = nn.Conv2d(32, 64, 3, padding=1, bias=False) # 32*64*9 = 18,432
dw_sep = DepthwiseSeparableConv(32, 64) # 32*9 + 32*64 = 2,336 conv weights + 192 BatchNorm params
print(f"Standard: {sum(p.numel() for p in standard.parameters())}") # 18432
print(f"DW-Sep: {sum(p.numel() for p in dw_sep.parameters())}") # 2528
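The savings figure can be checked with plain arithmetic. This sketch (helper names are hypothetical) counts convolution weights only, ignoring biases and BatchNorm, for the same 32 -> 64 channel, 3x3 case:

```python
def standard_conv_weights(c_in, c_out, k):
    # Every output channel has a k x k filter for every input channel
    return k * k * c_in * c_out

def depthwise_separable_weights(c_in, c_out, k):
    # Depthwise: one k x k filter per input channel; pointwise: 1x1 channel mixing
    return k * k * c_in + c_in * c_out

std = standard_conv_weights(32, 64, 3)
sep = depthwise_separable_weights(32, 64, 3)
ratio = std / sep
predicted = 1.0 / (1.0 / 64 + 1.0 / 9)  # closed form: 1 / (1/C_out + 1/k^2)
print(std, sep, round(ratio, 2), round(predicted, 2))
```

The ratio approaches k^2 = 9 only as C_out grows; with 64 output channels it is a little under 8.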
Q8. What is image segmentation? Compare semantic, instance, and panoptic segmentation.
| Type | Output | Distinguishes Instances? | Examples |
|---|---|---|---|
| Semantic | Class per pixel (no instance IDs) | No | Lane segmentation |
| Instance | Separate mask per object instance | Yes | Individual person masks |
| Panoptic | Semantic + instance unified | Yes | All pixels labeled |
from transformers import pipeline
# Semantic segmentation
seg_pipe = pipeline('image-segmentation',
model='facebook/mask2former-swin-large-ade-semantic')
# Instance segmentation
inst_pipe = pipeline('image-segmentation',
model='facebook/mask2former-swin-large-coco-instance')
# Using torchvision for segmentation inference
import torchvision
from torchvision.models.segmentation import deeplabv3_resnet101
# Semantic segmentation (DeepLab v3)
model = deeplabv3_resnet101(weights='DEFAULT')
model.eval()
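import torchvision.transforms.functional as TF
from PIL import Image
img = Image.open("image.jpg").convert("RGB")
img_t = TF.to_tensor(img)
# The pretrained weights expect ImageNet-normalized input
img_t = TF.normalize(img_t, mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225]).unsqueeze(0)
with torch.no_grad():
    output = model(img_t)['out'] # [1, 21, H, W] - 20 Pascal VOC categories + background
seg_mask = output.argmax(dim=1) # [1, H, W]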
Q9. What are anchor boxes in object detection?
Why anchors: Direct regression to bounding box coordinates is unstable. Predicting offsets from anchors that already approximately match the object size is a much simpler task.
import torch
import torchvision
# Generating anchor boxes (simplified)
def generate_anchors(feature_map_size, scales, ratios, stride=16):
    """
    Returns anchor boxes as [cx, cy, w, h] for each cell in the feature map.
    """
    anchors = []
    H, W = feature_map_size
    for row in range(H):
        for col in range(W):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale * (ratio ** 0.5)
                    h = scale / (ratio ** 0.5)
                    anchors.append([cx, cy, w, h])
    return torch.tensor(anchors)
anchors = generate_anchors((14, 14), scales=[32, 64, 128], ratios=[0.5, 1.0, 2.0])
print(f"Total anchors: {len(anchors)}") # 14*14*9 = 1764
# Modern models (FCOS, DETR) are anchor-free, using points or queries instead
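The "predict offsets from anchors" idea can be sketched in a few lines. The parameterization below is the one commonly associated with the R-CNN family (center shifts scaled by anchor size, log-space width and height); the function names and the example boxes are illustrative assumptions.

```python
import math

def encode_box(anchor, box):
    """Regression targets (tx, ty, tw, th) mapping an anchor onto a box; both are [cx, cy, w, h]."""
    ax, ay, aw, ah = anchor
    bx, by, bw, bh = box
    return [(bx - ax) / aw, (by - ay) / ah, math.log(bw / aw), math.log(bh / ah)]

def decode_box(anchor, offsets):
    """Inverse of encode_box: apply predicted offsets to an anchor."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return [ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th)]

anchor = [104.0, 104.0, 64.0, 64.0]  # hypothetical anchor
gt_box = [110.0, 100.0, 80.0, 56.0]  # hypothetical ground-truth box
offsets = encode_box(anchor, gt_box)
recovered = decode_box(anchor, offsets)
print([round(t, 4) for t in offsets])
```

When the anchor already roughly matches the object, all four targets are small numbers near zero, which is what makes the regression easier than predicting raw coordinates.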
Q10. What is IoU (Intersection over Union) and why is it the standard metric for detection?
IoU = area(Predicted ∩ Ground Truth) / area(Predicted ∪ Ground Truth)
IoU of 1.0 = perfect overlap. IoU = 0 = no overlap. Standard threshold: IoU >= 0.5 for a True Positive.
import torch
def box_iou(box1, box2):
    """
    box1, box2: [x1, y1, x2, y2] format
    """
    # Intersection
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    # Union
    area1 = (box1[2]-box1[0]) * (box1[3]-box1[1])
    area2 = (box2[2]-box2[0]) * (box2[3]-box2[1])
    union = area1 + area2 - inter
    return inter / (union + 1e-6)
# Vectorized with torchvision
from torchvision.ops import box_iou as tv_box_iou
pred_boxes = torch.tensor([[50, 50, 200, 200]], dtype=torch.float)
gt_boxes = torch.tensor([[70, 70, 210, 210]], dtype=torch.float)
print(tv_box_iou(pred_boxes, gt_boxes)) # [[0.6706]]: intersection 16,900 / union 25,200
# mAP@0.5 and mAP@0.5:0.95 are standard detection benchmarks (COCO)
MEDIUM: Object Detection and Segmentation (Questions 11-20)
Q11. How does YOLO work? Explain the key design choices.
YOLO v1-v5 evolution: Grid cell regression -> anchor boxes -> feature pyramid networks -> batch norm + CSP blocks.
YOLOv8 (2023, standard in 2026):
- Anchor-free detection head (predicts center point + width/height)
- C2f modules (CSP bottleneck with two convolutions, a faster variant of the earlier C3 block)
- Decoupled head: separate heads for classification and localization
- Task-aligned label assignment, with BCE classification loss and CIoU + DFL (Distribution Focal Loss) for box regression
from ultralytics import YOLO
# Inference (fastest)
model = YOLO('yolov8n.pt') # n=nano, s=small, m=medium, l=large, x=extra
results = model('image.jpg', conf=0.25, iou=0.45, classes=[0,1,2])
for r in results:
    for box in r.boxes:
        cls = int(box.cls)
        conf = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
# Fine-tuning on custom dataset
model = YOLO('yolov8m.pt')
results = model.train(
data='custom_dataset.yaml',
epochs=100,
imgsz=640,
batch=16,
device='0',
patience=20, # early stopping
augment=True,
hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
degrees=10.0,
translate=0.1,
scale=0.5,
mosaic=1.0 # mosaic augmentation (YOLOv4 signature)
)
Q12. How does Faster R-CNN work? Compare with single-stage detectors.
Faster R-CNN (two-stage):
- Backbone (ResNet) extracts feature maps
- Region Proposal Network (RPN) proposes ~2,000 candidate regions
- RoI Pooling: aligns each proposal to fixed size
- Classification + box regression heads refine each proposal
Comparison:
| Property | Two-Stage (Faster R-CNN) | Single-Stage (YOLO, SSD) |
|---|---|---|
| Accuracy | Higher, especially small objects | Slightly lower |
| Speed | Slower (~5 FPS on a GPU in the original paper) | Fast (30+ FPS) |
| Complexity | Higher (two passes) | Simpler |
| 2026 usage | Medical imaging, aerial imagery | Real-time applications |
import torchvision
from torchvision.models.detection import (
fasterrcnn_resnet50_fpn_v2, FasterRCNN,
fcos_resnet50_fpn # anchor-free, often better
)
# Faster R-CNN for high-accuracy tasks
model = fasterrcnn_resnet50_fpn_v2(weights='DEFAULT')
model.eval()
# For custom classes
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
model = fasterrcnn_resnet50_fpn_v2(weights='DEFAULT')
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=5+1)
# Fine-tune with standard DetectionDataset + DataLoader
Q13. What is Feature Pyramid Network (FPN) and why is it used for detection?
- Bottom-up pass: standard forward pass through backbone
- Top-down pass: upsample high-level features and add to lower-level features via lateral connections
import torch.nn as nn
import torch.nn.functional as F
class SimpleFPN(nn.Module):
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions to unify channel dims
        self.lateral = nn.ModuleList([
            nn.Conv2d(c, out_channels, 1) for c in in_channels_list
        ])
        # 3x3 smoothing convolutions
        self.smooth = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels_list
        ])
    def forward(self, features):
        # features: [C2, C3, C4, C5] from backbone
        laterals = [l(f) for l, f in zip(self.lateral, features)]
        # Top-down path
        for i in range(len(laterals)-2, -1, -1):
            laterals[i] += F.interpolate(laterals[i+1], size=laterals[i].shape[-2:],
                                         mode='nearest')
        return [self.smooth[i](laterals[i]) for i in range(len(laterals))]
Q14. What is non-maximum suppression (NMS)? Why is it needed?
- Sort all boxes by confidence score (descending)
- Take the highest-confidence box
- Remove all other boxes with IoU > threshold (usually 0.45)
- Repeat with remaining boxes
import torch
from torchvision.ops import nms, batched_nms
# Single class NMS
boxes = torch.tensor([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 200, 200]], dtype=torch.float)
scores = torch.tensor([0.9, 0.85, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print("Kept indices:", keep) # [0, 2] - second box removed (IoU > 0.5 with first)
# Multi-class NMS (suppress within same class only)
classes = torch.tensor([0, 0, 1])
keep = batched_nms(boxes, scores, classes, iou_threshold=0.5)
# Soft-NMS: decay scores instead of hard removal
# Avoids missing overlapping objects (crowd scenes)
# from torchvision.ops import soft_nms (not in stable torchvision; use custom)
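Since interviewers often ask for NMS without library help, here is a from-scratch sketch of the four steps listed above in plain Python. The names `iou_xyxy` and `nms_python` are hypothetical; the boxes are the same three used in the torchvision example.

```python
def iou_xyxy(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_python(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: returns indices of kept boxes, highest score first."""
    # Step 1: sort indices by confidence, descending
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: take the highest-confidence remaining box
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop every remaining box that overlaps it too much
        order = [i for i in order if iou_xyxy(boxes[best], boxes[i]) <= iou_threshold]
        # Step 4: loop repeats on what is left
    return keep

boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 200, 200]]
scores = [0.9, 0.85, 0.7]
keep = nms_python(boxes, scores, iou_threshold=0.5)
print("Kept indices:", keep)
```

This version is O(n^2) in the worst case, which is usually acceptable to state in an interview; production code relies on vectorized or GPU implementations.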
Q15. What is Mask R-CNN? How does it extend Faster R-CNN for instance segmentation?
Key addition: RoI Align (fixes the quantization misalignment in RoI Pooling):
- Uses bilinear interpolation to sample exactly at float coordinates
- Critical for mask quality
import torchvision
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
# Load with pretrained weights
model = maskrcnn_resnet50_fpn_v2(weights='DEFAULT')
# Adapt for custom dataset
num_classes = 3 # placeholder: your object classes + 1 for background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
model.roi_heads.mask_predictor = MaskRCNNPredictor(
    in_features_mask, hidden_layer, num_classes
)
# Inference: returns a list of dicts with 'boxes', 'labels', 'scores', 'masks'
# img_tensor: float tensor [3, H, W] scaled to [0, 1], prepared elsewhere
import torch
model.eval()
with torch.no_grad():
    predictions = model([img_tensor])
for box, label, score, mask in zip(
    predictions[0]['boxes'], predictions[0]['labels'],
    predictions[0]['scores'], predictions[0]['masks']
):
    if score > 0.5:
        binary_mask = (mask[0] > 0.5).cpu().numpy()
Q16. What is DETR (Detection Transformer)? How does it work without anchors or NMS?
- CNN backbone extracts feature maps
- Transformer encoder processes feature map as sequence
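- Transformer decoder: 100 learnable object queries attend to the encoder output via cross-attention
- Each query either predicts an object (class + box) or "no object"
- Hungarian matching: during training, predictions are matched one-to-one to ground truth at minimum total cost, so duplicates are penalized and no NMS is needed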
from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image
processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')
image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Post-process outputs (handles confidence threshold + rescaling)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for score, label, box in zip(results['scores'], results['labels'], results['boxes']):
    print(f"Detected {model.config.id2label[label.item()]} "
          f"({score:.3f}) at {[round(i,1) for i in box.tolist()]}")
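The one-to-one matching step is easier to explain with a tiny example. The sketch below is not the Hungarian algorithm itself: it brute-forces every assignment with `itertools.permutations`, which gives the same optimal answer for toy sizes. The cost matrix values and the function name are made up for illustration; in DETR the cost combines class probability and box losses.

```python
from itertools import permutations

def match_queries_to_targets(cost):
    """cost[q][g]: cost of assigning query q to ground-truth object g.
    Returns ({query: gt}, [unmatched queries]) minimizing total cost."""
    n_queries, n_targets = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Try every way of picking a distinct query for each ground-truth object
    for queries in permutations(range(n_queries), n_targets):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assignment = total, queries
    matched = {q: g for g, q in enumerate(best_assignment)}
    unmatched = [q for q in range(n_queries) if q not in matched]
    return matched, unmatched

# 3 object queries, 2 ground-truth objects (hypothetical costs)
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
matched, unmatched = match_queries_to_targets(cost)
print(matched, unmatched)
```

Unmatched queries are trained to predict "no object", which is what removes duplicate detections without NMS. Real implementations typically use an O(n^3) solver rather than enumeration.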
Q17. What is the Vision Transformer (ViT)? How does it process images?
- Split 224x224 image into 16x16 patches: 196 patches
- Linear projection of each patch to d_model-dimensional embedding
- Add [CLS] token and positional embeddings
- Pass through transformer encoder (L layers of multi-head self-attention + FFN)
- Use [CLS] embedding for classification
import torch
import torch.nn as nn
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # Patch projection via Conv2d with kernel=patch_size, stride=patch_size
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.n_patches, embed_dim))
        nn.init.normal_(self.pos_embed, std=0.02)
        nn.init.normal_(self.cls_token, std=0.02)
    def forward(self, x):
        B = x.shape[0]
        patches = self.proj(x).flatten(2).transpose(1, 2) # [B, 196, 768]
        cls = self.cls_token.expand(B, -1, -1) # [B, 1, 768]
        x = torch.cat([cls, patches], dim=1) # [B, 197, 768]
        return x + self.pos_embed
# Using timm (preferred in 2026)
import timm
vit = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
Q18. What is Segment Anything Model (SAM)? How does it work?
Architecture:
- Image Encoder: ViT-H/16 MAE-pretrained backbone extracts image embeddings (offline, once per image)
- Prompt Encoder: Encodes sparse prompts (points, boxes) and dense prompts (masks)
- Mask Decoder: Lightweight transformer that attends over image features and prompt tokens; it outputs 3 candidate masks so that an ambiguous prompt (part vs whole object) can still be resolved
from segment_anything import SamPredictor, sam_model_registry
sam = sam_model_registry['vit_h'](checkpoint='sam_vit_h_4b8939.pth')
sam.to('cuda')
predictor = SamPredictor(sam)
import cv2
import numpy as np
image = cv2.imread('image.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image) # encode image once
# Segment from a point prompt (SamPredictor expects NumPy arrays, not lists)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]), # (x, y) coordinates
    point_labels=np.array([1]), # 1=foreground, 0=background
    multimask_output=True
)
print(f"Best mask score: {scores.max():.3f}")
print(f"Mask shape: {masks[0].shape}") # (H, W) boolean
# SAM 2 (2024): extends to video; tracks objects across frames
Q19. What is image classification on edge devices? How do you optimize for mobile?
| Technique | Method | Speedup |
|---|---|---|
| Quantization (INT8) | Convert weights from FP32 to INT8 | 2-4x |
| Pruning | Remove unimportant weights/channels | 2-10x (structured) |
| Architecture choice | MobileNetV3, EfficientNet-Lite, ShuffleNetV2 | Built-in |
| Knowledge distillation | Train small model from large teacher | Task-dependent |
| TorchScript / ONNX | Export for optimized runtime | Platform-dependent |
import torch
import torchvision.models as models
# MobileNetV3-Small: a common accuracy/latency trade-off for mobile
model = models.mobilenet_v3_small(weights='IMAGENET1K_V1')
model.eval()
# ONNX export for edge deployment
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, 'mobilenet_v3s.onnx',
input_names=['input'], output_names=['output'],
dynamic_axes={'input': {0: 'batch'}},
opset_version=17)
# TorchScript for mobile (avoids Python interpreter)
scripted = torch.jit.script(model)
scripted.save('mobilenet_v3s.pt')
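# Quantization (INT8): eager-mode post-training static quantization
# Note: eager mode needs QuantStub/DeQuantStub wrappers and fused modules, which the
# plain torchvision model lacks; torchvision.models.quantization has quantization-ready variants
torch.backends.quantized.engine = 'qnnpack'
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(model, inplace=True)
# calibrate with representative data ...
model_int8 = torch.quantization.convert(model)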
Q20. How do you compute mean Average Precision (mAP) for object detection?
COCO mAP@0.5:0.95: Average over IoU thresholds from 0.5 to 0.95 in 0.05 steps. More demanding than VOC mAP@0.5.
For each class:
1. Sort all detections by confidence
2. For a given IoU threshold, mark each detection TP (IoU >= threshold with a not-yet-matched ground-truth box) or FP
3. Compute precision and recall at each point
4. AP = area under precision-recall curve
mAP = mean(AP across all classes)
import torch
from torchmetrics.detection import MeanAveragePrecision
# Default iou_thresholds (None) uses the COCO range 0.50:0.95 in 0.05 steps,
# so 'map' really is mAP@0.5:0.95
metric = MeanAveragePrecision(iou_type='bbox')
preds = [{
    'boxes': torch.tensor([[10., 10., 100., 100.], [200., 200., 300., 300.]]),
    'scores': torch.tensor([0.9, 0.75]),
    'labels': torch.tensor([0, 1])
}]
targets = [{
    'boxes': torch.tensor([[15., 15., 105., 105.], [210., 210., 310., 310.]]),
    'labels': torch.tensor([0, 1])
}]
metric.update(preds, targets)
results = metric.compute()
print(f"mAP@0.5: {results['map_50']:.3f}")
print(f"mAP@0.75: {results['map_75']:.3f}")
print(f"mAP@0.5:0.95: {results['map']:.3f}")
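For a whiteboard round it helps to be able to write the per-class AP steps by hand. The sketch below is a simplified, VOC-style version for a single image and a single class at one IoU threshold, using all-point interpolation; COCO evaluation differs in details (101 recall points, crowd handling, averaging over thresholds). All names and boxes are illustrative.

```python
def iou_xyxy(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, gt_boxes, iou_threshold=0.5):
    """detections: list of (score, box). Returns AP for one class."""
    if not gt_boxes:
        return 0.0
    # 1. Sort detections by confidence
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(gt_boxes)
    tp_flags = []
    # 2. Assign TP/FP: each ground-truth box can be claimed only once
    for _, box in detections:
        overlaps = [iou_xyxy(box, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda i: overlaps[i])
        is_tp = overlaps[best] >= iou_threshold and not matched[best]
        if is_tp:
            matched[best] = True
        tp_flags.append(is_tp)
    # 3. Precision and recall after each detection
    precisions, recalls, tp = [], [], 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        tp += is_tp
        precisions.append(tp / rank)
        recalls.append(tp / len(gt_boxes))
    # Interpolate: precision at recall r is the max precision at any recall >= r
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # 4. Area under the interpolated precision-recall curve
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

gts = [[10, 10, 50, 50], [100, 100, 200, 200]]
dets = [(0.9, [10, 10, 50, 50]), (0.8, [300, 300, 350, 350]), (0.7, [100, 100, 200, 200])]
ap = average_precision(dets, gts)
print(f"AP@0.5 = {ap:.4f}")
```

mAP is then the mean of this value across classes, and for the COCO metric also across IoU thresholds.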
HARD: Advanced CV (Questions 21-28)
Q21. What is self-supervised learning for vision? Explain DINO and MAE.
| Method | How | What it learns |
|---|---|---|
| DINO (Meta, 2021) | Student-teacher with momentum; patch-level self-distillation | Semantic features without labels |
| MAE (Meta, 2022) | Mask 75% of patches; reconstruct | Dense pixel-level understanding |
| SimCLR | Contrastive learning; augmentation invariance | Semantic similarity |
| MoCo v3 | Momentum contrastive with transformer | Stable contrastive features |
DINO is especially powerful because it naturally learns semantic segmentation structure (patches of the same object get similar representations) without any supervision:
import torch
from torchvision import transforms
# Load DINOv2 (Meta, 2023) - best self-supervised ViT in 2026
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2.eval().cuda()
# Get patch features (a 16x16 grid of 14x14-pixel patches for 224x224 input = 256 tokens)
transform = transforms.Compose([
transforms.Resize(224),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
# image: a PIL.Image in RGB mode, loaded elsewhere
with torch.no_grad():
    img_t = transform(image).unsqueeze(0).cuda()
    features = dinov2.forward_features(img_t)
patch_features = features['x_norm_patchtokens'] # [1, 256, 1024]
cls_feature = features['x_norm_clstoken'] # [1, 1024]
Q22. What is optical flow? How is it used in video understanding?
import cv2
import numpy as np
# Lucas-Kanade sparse optical flow (track specific points)
cap = cv2.VideoCapture('video.mp4')
ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
# Detect corners to track
corners = cv2.goodFeaturesToTrack(old_gray, maxCorners=100,
qualityLevel=0.3, minDistance=7)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Calculate optical flow
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(
        old_gray, gray, corners, None,
        winSize=(15,15), maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03)
    )
    if new_pts is None:
        break # tracking failed
    good_new = new_pts[status==1]
    good_old = corners[status==1]
    if len(good_new) == 0:
        break # every tracked point was lost
    old_gray = gray.copy()
    corners = good_new.reshape(-1, 1, 2)
cap.release()
# Deep optical flow: RAFT (state-of-the-art)
# torchvision.models.optical_flow.raft_large()
Q23. How do you handle class imbalance in object detection datasets?
| Problem | Example | Solution |
|---|---|---|
| Rare class few examples | Pedestrians vs cars in parking lot | Oversampling rare class images |
| Easy negatives dominate | Background vs object | Focal loss; hard negative mining |
| Scale imbalance | Tiny objects vs large | FPN; multi-scale training; mosaic augmentation |
| Long-tail classes | LVIS dataset (1203 categories) | Federated Loss; repeat factor sampling |
# Repeat Factor Sampling (LVIS standard)
# Images with rare categories are sampled more frequently
import numpy as np
def compute_repeat_factors(dataset, repeat_threshold=0.001):
    """
    Compute how many times each image should be repeated.
    Images with rare categories get higher repeat factors.
    Assumes dataset.images is a list of dicts, each with a 'category_ids' list.
    """
    category_freq = {} # category_id -> fraction of images containing it
    for img in dataset.images:
        # set(): count each category once per image
        for cat_id in set(img['category_ids']):
            category_freq[cat_id] = category_freq.get(cat_id, 0) + 1
    total_images = len(dataset.images)
    for cat_id in category_freq:
        category_freq[cat_id] /= total_images
    image_repeat_factors = []
    for img in dataset.images:
        cat_ids = img['category_ids']
        # Image-level factor: the largest category-level factor it contains
        rf = max([max(1.0, (repeat_threshold / category_freq[c]) ** 0.5)
                  for c in cat_ids] or [1.0])
        image_repeat_factors.append(rf)
    return image_repeat_factors
Q24. What is diffusion-based image generation? How does Stable Diffusion work?
Stable Diffusion architecture:
- Text encoder: CLIP encodes text prompt to conditioning vector
- VAE: Operates in compressed latent space (8x spatial compression) for efficiency
- U-Net denoiser: Predicts noise at each step, conditioned on text via cross-attention
- Scheduler: Controls denoising trajectory (DDPM, DDIM, DPM-Solver)
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
pipe = StableDiffusionPipeline.from_pretrained(
'stabilityai/stable-diffusion-2-1',
torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to('cuda')
# Enable memory optimization
pipe.enable_attention_slicing()
image = pipe(
prompt="a photorealistic portrait of an AI engineer, cinematic lighting",
negative_prompt="blurry, low quality, cartoon",
num_inference_steps=25, # DPM-Solver is fast: 25 steps sufficient
guidance_scale=7.5, # CFG scale: 7-9 is typical
height=768, width=512
).images[0]
image.save('generated.png')
Q25. Design a production pipeline for real-time license plate recognition.
System design for ANPR (Automatic Number Plate Recognition):
Input: Camera feed at 30fps
Stage 1: Vehicle Detection (< 5ms)
Model: YOLOv8n (nano) - 3.2ms on GPU
Task: Detect vehicles in frame; extract bounding boxes
Optimization: Skip frames at low traffic; only run on regions of motion
Stage 2: Plate Detection (< 3ms)
Model: YOLOv8s fine-tuned on license plate dataset
Task: Within vehicle crop, detect plate location
Dataset: Kaggle license plate + internal dataset
Stage 3: OCR (< 10ms)
Model: TrOCR (transformer-based OCR) or PaddleOCR
Task: Read characters on plate
Post-process: Regex validation (plate format), correction
Stage 4: Post-processing
Deduplication: Track plate across frames; confirm after 3 consistent reads
Database lookup: Redis cache for banned/flagged plates (sub-millisecond)
Alert: Webhook to parking/security system
Total pipeline: < 20ms per frame = 50+ FPS headroom
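import re

from paddleocr import PaddleOCR
from ultralytics import YOLO

vehicle_detector = YOLO('yolov8n.pt')       # nano model, as specified for Stage 1
plate_detector = YOLO('plate_detector.pt')  # fine-tuned plate weights
# Arguments and result format below follow the PaddleOCR 2.x API
ocr = PaddleOCR(lang='en', use_angle_cls=True, use_gpu=True)

PLATE_PATTERN = re.compile(r'^[A-Z]{2}\d{2}[A-Z]{2}\d{4}$')  # Indian plate

def process_frame(frame):
    # COCO class ids: 2=car, 3=motorcycle, 5=bus, 7=truck
    vehicles = vehicle_detector(frame, classes=[2, 3, 5, 7])[0]
    plates = []
    for box in vehicles.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = frame[y1:y2, x1:x2]
        if crop.size == 0:
            continue
        plate_det = plate_detector(crop)[0]
        if len(plate_det.boxes) == 0:
            continue
        # Crop the plate from the vehicle crop (assumption: keep the highest-confidence box)
        best = int(plate_det.boxes.conf.argmax())
        px1, py1, px2, py2 = map(int, plate_det.boxes.xyxy[best].tolist())
        plate_crop = crop[py1:py2, px1:px2]
        result = ocr.ocr(plate_crop)
        if not result or not result[0]:  # no text found on this crop
            continue
        text = ''.join(line[1][0] for line in result[0])
        text = text.replace(' ', '').upper()
        if PLATE_PATTERN.match(text):
            plates.append(text)  # store the normalized plate string
    return plates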
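Stage 4's "confirm after 3 consistent reads" rule can be sketched as a small per-track counter. This is a minimal sketch under two assumptions: "consistent" is interpreted as consecutive identical reads, and `track_id` is a hypothetical identifier supplied by whatever tracker follows the plate across frames. `PlateConfirmer` is an illustrative name, not part of any library.

```python
from collections import defaultdict

class PlateConfirmer:
    """Confirm a plate for a track only after `min_reads` consecutive identical reads."""

    def __init__(self, min_reads=3):
        self.min_reads = min_reads
        self._last = {}                  # track_id -> most recent text
        self._streak = defaultdict(int)  # track_id -> consecutive identical reads
        self._confirmed = {}             # track_id -> text already reported

    def update(self, track_id, text):
        """Feed one OCR read; return the plate the first time it is confirmed, else None."""
        if self._last.get(track_id) == text:
            self._streak[track_id] += 1
        else:
            # A different read resets the streak for this track
            self._last[track_id] = text
            self._streak[track_id] = 1
        if (self._streak[track_id] >= self.min_reads
                and self._confirmed.get(track_id) != text):
            self._confirmed[track_id] = text
            return text
        return None

confirmer = PlateConfirmer()
reads = ['MH12AB1234', 'MH12AB1284', 'MH12AB1234', 'MH12AB1234', 'MH12AB1234']
events = [confirmer.update(track_id=7, text=r) for r in reads]
```

Only the final read produces an event here, because the single misread in second position resets the streak. Reporting each confirmed plate once per track also keeps the downstream webhook from firing on every frame.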
Q26. What is knowledge distillation for vision models? How do you compress ResNet-50 to ResNet-18?
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

teacher = models.resnet50(weights='IMAGENET1K_V2')
student = models.resnet18(weights=None)  # train student from scratch

# Student head must match the teacher's 1000 ImageNet classes
# (already true for torchvision's resnet18; change both heads for a custom label set)
student.fc = nn.Linear(student.fc.in_features, 1000)

# Freeze the teacher: it only provides soft targets
teacher.eval()
for param in teacher.parameters():
    param.requires_grad = False

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean'
    ) * (T ** 2)
    # Hard-label term: standard cross-entropy
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

num_epochs = 90
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

student.train()
for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:  # dataloader: your labeled training DataLoader
        with torch.no_grad():
            teacher_out = teacher(batch_x)
        student_out = student(batch_x)
        loss = distill_loss(student_out, teacher_out, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # one cosine step per epoch
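The loss above asks the student to match the teacher's temperature-softened distribution as well as the hard labels. To see why the temperature matters, here is a dependency-free sketch of the same soft-target term for a single example. The logits are made-up numbers for illustration, and `kd_term` is a hypothetical helper that mirrors the `kd` line of `distill_loss` without the batch averaging.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_term(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * T ** 2

teacher_logits = [8.0, 4.0, 1.0]  # illustrative values only
student_logits = [5.0, 5.0, 1.0]

sharp = softmax(teacher_logits, T=1.0)  # almost one-hot
soft = softmax(teacher_logits, T=4.0)   # runner-up classes become visible
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
print(round(kd_term(student_logits, teacher_logits), 4))
```

At T=1 nearly all probability mass sits on the top class, so the student learns little beyond the hard label. Raising T exposes how the teacher ranks the remaining classes, which is the extra signal distillation relies on.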
Q27. What is domain adaptation in computer vision?
| Method | How | Label Required (Target) |
|---|---|---|
| Fine-tuning | Train on small labeled target set | Yes (few) |
| Domain-adversarial (DANN) | Gradient reversal makes features domain-invariant | No |
| Self-training | Pseudo-label confident predictions; retrain | No |
| Style transfer pre-processing | Make source look like target | No |
| Test-time adaptation (TTA) | Adapt model on unlabeled test images | No |
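The self-training row can be made concrete with a small selection step: keep only the unlabeled target samples the current model is confident about, then retrain on them as if they were labeled. This is a minimal sketch. The 0.9 threshold is an assumption to tune per dataset, and `select_pseudo_labels` is a hypothetical helper name.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, predicted_class) pairs whose top probability >= threshold.

    probabilities: one list of class probabilities per unlabeled target sample.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected

# Softmax outputs of a source-trained model on three target-domain images (made up)
target_probs = [
    [0.95, 0.05],  # confident: class 0
    [0.60, 0.40],  # uncertain: skipped
    [0.08, 0.92],  # confident: class 1
]
pseudo = select_pseudo_labels(target_probs)
```

A threshold that is too low feeds the model its own mistakes, which is the main failure mode of self-training, so it is usually safer to start strict and relax it over successive rounds.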
import torch
import torch.nn as nn

class GradientReversalLayer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None  # reverse gradient

class DomainAdversarialNN(nn.Module):
    def __init__(self, backbone, feature_dim, num_classes, num_domains=2):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Linear(feature_dim, num_classes)
        self.domain_head = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, num_domains)
        )

    def forward(self, x, alpha=1.0):
        features = self.backbone(x)
        class_out = self.classifier(features)
        rev_features = GradientReversalLayer.apply(features, alpha)
        domain_out = self.domain_head(rev_features)
        return class_out, domain_out
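The `alpha` argument of `forward` controls how strongly the reversed gradient pushes the backbone toward domain-invariant features. The code above leaves its value to the caller. One common choice, following the schedule described in the original DANN paper, ramps it from 0 to 1 over training so the domain head does not destabilize the features early on. The sketch below assumes that schedule; `grl_alpha` is a hypothetical helper name.

```python
import math

def grl_alpha(step, total_steps, gamma=10.0):
    """Gradient-reversal strength: 2 / (1 + exp(-gamma * p)) - 1, with p in [0, 1]."""
    p = min(max(step / total_steps, 0.0), 1.0)  # training progress, clamped
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

total_steps = 1000
schedule = [grl_alpha(s, total_steps) for s in (0, 100, 250, 500, 1000)]
print([round(a, 3) for a in schedule])
```

In a training loop, the value would be passed as `model(x, alpha=grl_alpha(step, total_steps))`, with the class loss computed on labeled source batches and the domain loss on both source and target batches.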
Q28. Design a visual search engine (search by image, return similar products).
Visual Search Architecture:
Offline (indexing):
1. Product catalog: 10M product images
2. Feature extraction: EfficientNetV2-M backbone, strip head, extract 1280-dim embedding
3. L2 normalization of embeddings
4. Build FAISS IVF-PQ index: under 1GB of codes and IDs for 10M embeddings at 64 bytes per code (vs ~51GB for a flat float32 index)
5. Store: embedding -> product_id mapping
Online (query):
1. User uploads image (or captures on mobile)
2. Preprocessing: resize and center crop to 480x480 (torchvision's preset for this backbone), normalize (ImageNet stats)
3. Feature extraction: same backbone, ~50ms on GPU (batched), ~200ms on CPU
4. ANN search: FAISS returns top-100 nearest neighbors in < 5ms
5. Re-rank: fine-grained similarity with more expensive cross-attention model
6. Return top-10 with metadata (price, category, URL)
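The index-size claim in the offline steps is simple arithmetic, sketched below so it can be re-run for other catalog sizes or embedding widths. The estimate assumes float32 embeddings, 8-byte vector IDs, and 1280-dim vectors (the width torchvision's `efficientnet_v2_m` produces once the classifier is removed). It ignores FAISS's internal bookkeeping, so treat the result as a lower bound rather than a measurement.

```python
def flat_index_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw storage for an exact (flat) index: every vector kept in full."""
    return n_vectors * dim * bytes_per_float

def ivfpq_index_bytes(n_vectors, dim, nlist, m, nbits=8, id_bytes=8, bytes_per_float=4):
    """Approximate storage for an IVF-PQ index."""
    code_bytes = m * nbits // 8                       # PQ code stored per vector
    per_vector = code_bytes + id_bytes                # code plus vector ID
    centroids = nlist * dim * bytes_per_float         # coarse quantizer centroids
    codebooks = (2 ** nbits) * dim * bytes_per_float  # m codebooks of dim/m floats each
    return n_vectors * per_vector + centroids + codebooks

n, dim = 10_000_000, 1280  # assumed catalog size and embedding width
flat = flat_index_bytes(n, dim)
ivfpq = ivfpq_index_bytes(n, dim, nlist=4096, m=64)
print(f"flat:   {flat / 1e9:.1f} GB")
print(f"IVF-PQ: {ivfpq / 1e9:.2f} GB")
```

Almost all of the IVF-PQ footprint is the per-vector code and ID, so memory scales with `m` and the catalog size rather than with the embedding dimension. That is why product quantization makes a 10M-item index fit comfortably in RAM.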
import faiss
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models

# Build index
backbone = models.efficientnet_v2_m(weights='IMAGENET1K_V1')
backbone.classifier = nn.Identity()  # remove classification head
backbone.eval().cuda()

def extract_features(images, batch_size=64):
    # images: list of preprocessed image tensors, each of shape (3, H, W)
    all_features = []
    for i in range(0, len(images), batch_size):
        batch = torch.stack(images[i:i+batch_size]).cuda()
        with torch.no_grad():
            feat = backbone(batch).cpu().numpy()
        all_features.append(feat)
    features = np.ascontiguousarray(np.concatenate(all_features), dtype=np.float32)
    faiss.normalize_L2(features)  # L2 normalize so inner product equals cosine similarity
    return features

# IVF-PQ index: fast ANN for 10M+ vectors
d = 1280  # embedding dim of efficientnet_v2_m after removing the classifier
quantizer = faiss.IndexFlatIP(d)
# FAISS constructors take positional arguments:
# nlist=4096 clusters, M=64 sub-quantizers, 8 bits per code
index = faiss.IndexIVFPQ(quantizer, d, 4096, 64, 8, faiss.METRIC_INNER_PRODUCT)
# catalog_features: (N, d) float32 array from extract_features over the catalog
index.train(catalog_features)
index.add(catalog_features)
faiss.write_index(index, 'visual_search.index')

# Query
def search(query_image, top_k=10):
    feat = extract_features([query_image])
    index.nprobe = 64  # check 64 of 4096 clusters (tradeoff: accuracy vs speed)
    distances, indices = index.search(feat, top_k)
    # FAISS pads with -1 when fewer than top_k results are found
    return [product_catalog[i] for i in indices[0] if i != -1]
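The `nprobe` comment mentions an accuracy versus speed tradeoff. A usual way to quantify the accuracy side is recall@k against exact search on a held-out set of queries. The sketch below is a minimal, library-free version. It assumes you already have the ANN result IDs and the exact (brute-force) result IDs as lists, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(ann_ids, exact_ids, k):
    """Average fraction of the exact top-k neighbours that the ANN search also returned.

    ann_ids, exact_ids: one list of result IDs per query, best match first.
    """
    total = 0.0
    for approx, exact in zip(ann_ids, exact_ids):
        truth = set(exact[:k])
        total += len(truth.intersection(approx[:k])) / len(truth)
    return total / len(ann_ids)

# Two queries with made-up IDs: the second ANN result misses one true neighbour
ann_results = [[11, 42, 7], [3, 8, 99]]
exact_results = [[42, 11, 7], [3, 8, 15]]
score = recall_at_k(ann_results, exact_results, k=3)
```

Sweeping `nprobe` (for example 8, 16, 32, 64) and plotting recall@k against query latency shows where extra probes stop paying for themselves.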
Computer Vision Tools at a Glance
| Use Case | Tool / Library | Notes |
|---|---|---|
| Data loading and augmentation | torchvision, albumentations | albumentations is faster and offers more transforms |
| Classification (pretrained) | timm (PyTorch Image Models) | 700+ models, the broadest pretrained collection |
| Detection (production) | Ultralytics YOLOv8 | Fastest iteration for custom datasets |
| Detection (research) | MMDetection, Detectron2 | More architectures |
| Segmentation | Mask2Former, SAM | HuggingFace for Mask2Former |
| Keypoint detection | MediaPipe, ViTPose | MediaPipe for real-time |
| OCR | PaddleOCR, TrOCR | PaddleOCR for production speed |
| Feature extraction | DINOv2, CLIP | Best self-supervised features |
| Deployment | ONNX + TensorRT, TorchScript | TensorRT for NVIDIA production |
FAQ
Q: YOLO vs Faster R-CNN: which should I use? A: YOLOv8 for most real-time applications (30+ FPS requirement). Faster R-CNN for medical imaging, aerial/satellite imagery, or when small object accuracy matters more than speed.
Q: Is OpenCV still needed in 2026? A: Yes. It remains the standard tool for preprocessing, classical operations (blur, resize, color conversion), reading video streams, and anything torchvision does not cover. Deep learning libraries complement it rather than replace it.
Q: How much data do I need to fine-tune a YOLO model? A: A practical minimum is 200-500 annotated images per class. With data augmentation, this can produce a useful model. Quality matters more than quantity.
Q: What is the difference between semantic segmentation and instance segmentation? A: Semantic segmentation labels each pixel with a class (all cats = same color). Instance segmentation additionally separates individual objects of the same class (cat1 vs cat2 have different IDs).
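The distinction in the last answer is easy to see on a toy mask. In this illustrative sketch, with made-up values and a hypothetical `to_semantic` helper, an instance map gives each object its own ID, and collapsing those IDs to class names yields the semantic map.

```python
# Instance map: 0 = background, each positive id is one object (toy 3x6 example)
instance_map = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]
instance_class = {1: 'cat', 2: 'cat'}  # both objects are cats

def to_semantic(instance_map, instance_class, background='background'):
    """Collapse instance ids into class labels, discarding object identity."""
    return [[instance_class.get(i, background) for i in row] for row in instance_map]

semantic_map = to_semantic(instance_map, instance_class)
num_instances = len({i for row in instance_map for i in row if i != 0})
num_classes = len({c for row in semantic_map for c in row if c != 'background'})
print(num_instances, num_classes)
```

The conversion only works in one direction: the semantic map cannot tell you there were two cats, which is why counting or tracking objects needs instance segmentation.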
Methodology applied to this article (last verified 8 Jun 2026)
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.