DevOps Interview Questions 2026 — Top 50 with Expert Answers
Elite DevOps teams deploy to production multiple times per day with a change failure rate under 5%. That's the bar companies are hiring for in 2026. DevOps has evolved from a cultural philosophy into a concrete set of engineering practices — and companies expect you to command CI/CD pipelines, infrastructure as code, observability, incident response, and reliability engineering at an expert level. This guide covers 50 real questions asked at product companies like Razorpay, Swiggy, Flipkart, and global FAANG firms, organized by difficulty with the exact answers that get offers.
DevOps/SRE is one of the fastest-growing career paths in India, with senior roles commanding Rs 45-90 LPA at top product companies. The skills gap is real — master these 50 questions and you're ahead of 90% of candidates.
Related: AWS Interview Questions 2026 | Kubernetes Interview Questions 2026 | Docker Interview Questions 2026 | System Design Interview Questions 2026
Beginner-Level DevOps Questions (Q1-Q15)
Even if you're a senior engineer, don't skip these. Interviewers at Razorpay and Flipkart use beginner questions to test whether you truly understand the "why" behind DevOps — not just the tools.
Q1. What is DevOps? How does it differ from traditional IT operations?
Traditional IT vs. DevOps:
| Aspect | Traditional IT | DevOps |
|---|---|---|
| Dev/Ops relationship | Siloed teams with handoffs | Shared ownership, shared goals |
| Deployment frequency | Quarterly or monthly releases | Multiple times per day |
| Deployment method | Manual, scripted | Automated CI/CD pipelines |
| Infrastructure | Pet servers (named, cared for) | Cattle (identical, replaceable) |
| Failure response | Blame, RCA blame game | Blameless post-mortems, SRE principles |
| Feedback loop | Months | Minutes to hours |
| Rollback | Manual, risky | Automated, safe |
DORA Metrics (DevOps Research and Assessment) measure DevOps performance:
- Deployment frequency: How often you deploy to production
- Lead time for changes: Code commit to production
- Change failure rate: % of deployments causing incidents
- Time to restore service: MTTR after incident
Elite performers (2024 DORA report): Deploy on-demand (multiple times/day), lead time <1 day, CFR <5%, MTTR <1 hour.
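As a quick sanity check, the elite-tier thresholds above can be encoded directly. The classifier below is illustrative (the tier names and function are not part of the DORA report):

```python
from dataclasses import dataclass

@dataclass
class DoraMetrics:
    deploys_per_day: float        # deployment frequency
    lead_time_hours: float        # commit -> production
    change_failure_rate: float    # fraction of deploys causing incidents
    mttr_hours: float             # time to restore service

def is_elite(m: DoraMetrics) -> bool:
    """True if the team meets the elite-tier thresholds quoted above."""
    return (
        m.deploys_per_day >= 1            # on-demand, multiple deploys/day
        and m.lead_time_hours < 24        # lead time under 1 day
        and m.change_failure_rate < 0.05  # CFR under 5%
        and m.mttr_hours < 1              # restore in under 1 hour
    )

team = DoraMetrics(deploys_per_day=4, lead_time_hours=6,
                   change_failure_rate=0.02, mttr_hours=0.5)
print(is_elite(team))  # True
```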
Q2. What is CI/CD? Explain each stage.
Continuous Integration (CI): Every code commit triggers automated build and test — developers merge frequently, preventing integration hell.
Continuous Delivery (CD): Every passing build is automatically deployed to staging. Human approval gates production deployment.
Continuous Deployment: Every passing build is automatically deployed to production — no human gates.
Full CI/CD pipeline stages:
Developer commits code
│
▼
1. Source Control (Git) — PR created, branch policies enforced
│
▼
2. CI Trigger — webhook fires pipeline
│
▼
3. Build
├── Compile / install dependencies
├── Run unit tests
├── Static code analysis (SonarQube, ESLint)
└── Security scan (SAST — Semgrep, CodeQL)
│
▼
4. Test
├── Integration tests
├── Contract tests (Pact)
└── Vulnerability scan (Trivy on Docker image)
│
▼
5. Artifact
├── Build Docker image
└── Push to registry (ECR, GCR)
│
▼
6. Deploy to Staging
└── Smoke tests / synthetic monitoring
│
▼
7. [Approval gate] — automated or manual
│
▼
8. Deploy to Production
├── Blue/green or canary deployment
└── Post-deploy health checks
│
▼
9. Monitor
└── Alert if error rate spikes → auto-rollback or PagerDuty alert
Q3. What is Infrastructure as Code (IaC)? Why is it important?
Infrastructure as Code (IaC) is the practice of defining infrastructure (networks, servers, databases, load balancers) in version-controlled, machine-readable files and applying them with tooling, instead of configuring resources by hand in a console.
Benefits:
| Benefit | Explanation |
|---|---|
| Reproducibility | Same code creates identical environments (no "works on my machine") |
| Version control | Infrastructure changes tracked in Git — who changed what, when, why |
| Peer review | Infrastructure changes reviewed via pull requests |
| Automation | Environments created in minutes, not weeks |
| Drift detection | Know when actual state diverges from desired state |
| Disaster recovery | Re-create entire environment from code |
| Cost control | Spin down dev environments on weekends (schedule destroy) |
Popular IaC tools:
| Tool | Type | Best for |
|---|---|---|
| Terraform | Declarative, multi-cloud | General purpose, most popular |
| AWS CloudFormation | Declarative, AWS-only | Native AWS integration |
| AWS CDK | Programmatic IaC (TypeScript, Python) | Developers who prefer real languages |
| Pulumi | Programmatic IaC (any language) | Multi-cloud with programming constructs |
| Ansible | Imperative, configuration management | OS config, application deployment |
| Packer | Image builder | AMI, GCP image creation |
Q4. Explain Terraform workflow — init, plan, apply, destroy.
# 1. terraform init
# Downloads provider plugins, initializes backend (remote state)
terraform init
# 2. terraform plan
# Shows what will be created/modified/destroyed (dry run)
# ALWAYS review this before apply
terraform plan -out=tfplan
# 3. terraform apply
# Applies the planned changes
terraform apply tfplan
# Or interactively:
terraform apply # Shows plan, prompts for "yes"
# 4. terraform destroy
# Destroys all resources in the state
terraform destroy # Prompts for "yes"
terraform destroy -target=aws_instance.web # Destroy specific resource
State file (terraform.tfstate):
Terraform tracks the mapping between your config and real infrastructure in a state file. In teams, state is stored remotely (S3 + DynamoDB lock for AWS):
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
DynamoDB prevents concurrent applies (state locking). Never commit state files to Git — they contain sensitive data.
Asked at Flipkart, Razorpay, PhonePe infrastructure interviews
Q5. What is the difference between Ansible, Terraform, and Chef/Puppet?
| Tool | Category | Approach | State | Language | Best For |
|---|---|---|---|---|---|
| Terraform | IaC (provisioning) | Declarative | Yes (tfstate) | HCL | Cloud resource provisioning |
| Ansible | Configuration management | Imperative (playbooks) | Stateless | YAML | OS config, app deployment, ad-hoc tasks |
| Chef | Configuration management | Imperative (recipes) | Server (Chef Server) | Ruby | Traditional CM, complex configs |
| Puppet | Configuration management | Declarative | Server (PuppetDB) | Puppet DSL | Enterprise config management |
| Pulumi | IaC (provisioning) | Programmatic | Yes (backend) | Any language | Complex IaC requiring programming logic |
Common pattern: Terraform provisions the servers (EC2, RDS, VPC); Ansible configures them (install packages, deploy app, set up monitoring). Terraform handles "what infrastructure exists"; Ansible handles "what's installed on the servers."
Q6. What is GitHub Actions? Write a simple CI workflow.
# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.11', '3.12']
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Cache pip
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements*.txt') }}
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Lint
        run: ruff check . --output-format=github
      - name: Test
        run: pytest tests/ --cov=app --cov-report=xml -v
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: SAST scan
        uses: returntocorp/semgrep-action@v1
Q7. What is a Jenkins pipeline? What is the difference between Declarative and Scripted pipeline?
Declarative pipeline (recommended — structured, less Groovy knowledge needed):
// Jenkinsfile
pipeline {
    agent any
    environment {
        DOCKER_REGISTRY = 'myregistry.com'
        IMAGE_NAME = 'myapp'
    }
    stages {
        stage('Checkout') {
            steps {
                git branch: 'main', url: 'https://github.com/myorg/myapp.git'
            }
        }
        stage('Build & Test') {
            steps {
                sh 'mvn clean test'
            }
            post {
                always {
                    junit 'target/surefire-reports/*.xml'
                }
            }
        }
        stage('Docker Build') {
            steps {
                script {
                    docker.build("${DOCKER_REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                }
            }
        }
        stage('Deploy to Staging') {
            when {
                branch 'main'
            }
            steps {
                sh './deploy.sh staging'
            }
        }
    }
    post {
        failure {
            slackSend(color: 'danger', message: "Build FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}")
        }
    }
}
Scripted pipeline: Pure Groovy inside node {} blocks. More flexible but more complex, harder to read. Use Declarative unless you need complex Groovy logic.
Q8. What is the difference between Git merge, rebase, and cherry-pick?
| Operation | What it does | Creates merge commit? | Rewrites history? |
|---|---|---|---|
| merge | Combines branches, preserves history | Yes (unless fast-forward) | No |
| rebase | Replays commits onto another branch tip | No | Yes — new commit SHAs |
| cherry-pick | Applies a specific commit to current branch | No | Yes — new commit SHA |
| squash merge | Combines all branch commits into one | One new commit | Yes |
# Merge feature into main (preserves feature history)
git checkout main && git merge feature/payment
# Rebase feature onto main (cleaner linear history)
git checkout feature/payment && git rebase main
# Cherry-pick a hotfix to multiple release branches
git cherry-pick abc1234
# Interactive rebase — squash last 3 commits
git rebase -i HEAD~3
Golden rule: Never rebase shared/public branches (main, develop). Only rebase local feature branches before merging.
Q9. What is the purpose of a staging environment? What makes a good staging setup?
Characteristics of a good staging environment:
- Production parity: Same infrastructure configuration (instance sizes can differ, but architecture must match)
- Real data: Anonymized production data dump — tests realistic data volumes, not 10 rows
- Isolated: No shared services with production (separate DB, separate queues)
- Continuously deployed: Every merge to main auto-deploys to staging
- Monitored: Same monitoring stack as production (so you catch monitoring gaps)
- External service stubs: Payment gateways (Razorpay test mode), SMS providers (mock)
Common staging failures: Using smaller instance types → misses memory/CPU issues. Shared database with prod → staging deploy breaks prod. Fake/sparse data → doesn't test realistic query performance.
Q10. What is Prometheus? How does it collect metrics?
Prometheus is an open-source monitoring and alerting system that scrapes metrics from HTTP endpoints on a schedule and stores them in its own time-series database, queried with PromQL.
Architecture:
Applications (/metrics endpoint in Prometheus format)
↑ scrape (HTTP GET /metrics)
Prometheus Server
├── TSDB (time-series database, local disk)
├── Rules evaluation (recording + alerting rules)
└── Alertmanager → PagerDuty, Slack, OpsGenie
↑ query (PromQL)
Grafana
Metric types:
- Counter: Only goes up (HTTP requests total, errors total)
- Gauge: Can go up or down (current memory usage, queue depth)
- Histogram: Distribution of observations (request latency buckets, response sizes)
- Summary: Similar to histogram, but calculates quantiles client-side
# Python Flask app exposing Prometheus metrics
from flask import Flask
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['endpoint'])

@app.route('/api/orders')
def get_orders():
    with REQUEST_LATENCY.labels(endpoint='/api/orders').time():
        result = db.query_orders()
    REQUEST_COUNT.labels(method='GET', endpoint='/api/orders', status=200).inc()
    return result

start_http_server(8000)  # exposes /metrics on port 8000 for Prometheus to scrape
Q11. What is Grafana? How does it integrate with Prometheus?
Prometheus + Grafana integration:
- Add Prometheus as a data source in Grafana (URL: http://prometheus:9090)
- Use PromQL queries in Grafana panels:
# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
Good dashboards follow RED method:
- Rate: Requests per second
- Errors: Error rate (4xx/5xx)
- Duration: Latency distribution (p50, p90, p99)
Grafana's Explore feature allows ad-hoc metric queries during incident investigation without modifying dashboards. Grafana OnCall integrates alert routing directly.
Q12. What is an SLI, SLO, and SLA?
| Term | Full Form | Definition | Owner | Example |
|---|---|---|---|---|
| SLI | Service Level Indicator | A metric that measures service behavior | Engineering | "99.5% of requests complete in <200ms" |
| SLO | Service Level Objective | A target value or range for an SLI | Engineering + Product | "SLI must be ≥ 99.5% over 30 days" |
| SLA | Service Level Agreement | A business contract with consequences | Legal + Business | "99.9% uptime, or 10% credit for each 0.1% below" |
Relationship: SLOs are internal targets, deliberately stricter than the SLA so there's a buffer before contractual penalties kick in (e.g., an internal 99.95% SLO backing a 99.9% SLA); actual performance should meet or exceed the SLO. SLIs are the measurements that tell you whether you're meeting both.
Error budget: 100% minus SLO. For 99.9% SLO: 0.1% error budget = 43.8 minutes/month of allowed downtime. If you're within budget, you can deploy new features. If you've exceeded budget, all deployments freeze until next period.
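The error-budget arithmetic is worth being able to do on a whiteboard; a one-function sketch (assuming an average month of ~30.44 days, which is where the 43.8-minute figure comes from):

```python
def error_budget_minutes(slo: float, days: float = 30.44) -> float:
    """Allowed downtime per window for a given availability SLO
    (average-length month by default)."""
    return (1 - slo) * days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.8 minutes/month for 99.9%
print(round(error_budget_minutes(0.9999), 1))  # 4.4 minutes/month for 99.99%
```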
Good SLIs focus on what users experience:
- Availability: % of successful requests
- Latency: % of requests below threshold
- Throughput: Operations per second
- Error rate: % of failed requests
Critical concept at SRE interviews (Google, Amazon, Flipkart SRE)
Q13. What is the difference between monitoring, observability, and alerting?
| Concept | Definition | Tools |
|---|---|---|
| Monitoring | Collecting and displaying predefined metrics | Prometheus, CloudWatch, Datadog |
| Observability | Ability to understand system internal state from external outputs. 3 pillars: Metrics, Logs, Traces | Prometheus + Loki + Tempo (Grafana OSS) |
| Alerting | Notifying humans when metrics breach thresholds | Alertmanager, PagerDuty, OpsGenie |
Monitoring vs. Observability: Monitoring asks "Is this thing healthy?" (yes/no). Observability asks "WHY is this broken?" You need distributed tracing and structured logs to debug non-obvious failures across microservices.
The three pillars:
- Metrics: Numeric aggregations over time (fast, cheap, limited dimensionality)
- Logs: Timestamped text records (rich context, expensive to store and query at scale)
- Traces: End-to-end request path across services (shows bottlenecks, latency contributions)
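For the logs pillar, structured JSON written to stdout is what makes logs cheap to aggregate and query downstream; a minimal sketch using only the Python standard library (field names like trace_id are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # extra fields (e.g. a trace id) are attached via logging's `extra=`
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler(sys.stdout)  # factor 11: logs go to stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created", extra={"trace_id": "abc123"})
# emits one JSON line with ts, level, logger, msg, and trace_id fields
```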
OpenTelemetry (OTel) is the emerging standard for instrumentation — one SDK for all three signals, vendor-neutral.
Q14. What is Incident Management? Describe a good incident response process.
Incident severity levels:
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| P0/SEV1 | Total outage, all users affected | Immediate, 24/7 | Payment processing down |
| P1/SEV2 | Major feature broken, large user impact | <15 minutes | Login failure for 50% users |
| P2/SEV3 | Significant degradation, some users affected | <1 hour | Checkout slow (p99 >5s) |
| P3/SEV4 | Minor issue, small user impact | <4 hours | Minor UI bug |
Incident response process:
- Detect: Automated alert fires (Alertmanager → PagerDuty → on-call engineer)
- Acknowledge: On-call acknowledges within SLA (prevents escalation)
- Assemble: Incident commander paged for SEV1/2; coordinates responders
- Investigate: Identify blast radius — what's broken, how many users affected
- Mitigate: Minimize impact first (rollback, feature flag disable, capacity increase)
- Resolve: Permanent fix (may come later; mitigation is sufficient to close incident)
- Review: Blameless post-mortem within 48 hours for SEV1/2
Incident channels: Dedicated Slack channel per incident (#incident-2026-03-30-payment), Zoom bridge for coordination, PagerDuty status page updates.
Q15. What is a blameless post-mortem?
A blameless post-mortem is a written analysis of an incident that focuses on systemic and process causes rather than individual fault.
Why blameless? If engineers fear punishment for mistakes, they hide problems, don't take risks, and don't report near-misses. Google's SRE book established blameless culture as foundational to reliability.
Post-mortem structure:
- Summary: 2-3 sentence incident description
- Impact: Duration, affected users/services, business impact (revenue, support tickets)
- Timeline: Chronological events (when detected, key decisions, mitigation, resolution)
- Root cause: 5 Whys analysis — the actual technical cause
- Contributing factors: System fragility, process gaps, monitoring blind spots
- Action items: Concrete improvements with owners and due dates
- Lessons learned: What went well, what went poorly
Action items must be specific:
- Bad: "Improve monitoring"
- Good: "Add PagerDuty alert when payment service error rate exceeds 1% for 5 minutes — owner: @alice, due: 2026-04-15"
Intermediate-Level DevOps Questions (Q16-Q35)
This is where Razorpay, Flipkart, and Swiggy interviews get serious. These questions test whether you've actually operated production systems or just read about them.
Q16. Write a complete Terraform module for an AWS VPC.
# modules/vpc/main.tf
variable "environment" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "azs" {
  type = list(string)
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "${var.environment}-igw" }
}

resource "aws_subnet" "public" {
  count                   = length(var.azs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = var.azs[count.index]
  map_public_ip_on_launch = true
  tags = { Name = "${var.environment}-public-${var.azs[count.index]}" }
}

resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
  availability_zone = var.azs[count.index]
  tags = { Name = "${var.environment}-private-${var.azs[count.index]}" }
}

resource "aws_eip" "nat" {
  count  = length(var.azs)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(var.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]
  tags          = { Name = "${var.environment}-nat-${var.azs[count.index]}" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  tags = { Name = "${var.environment}-public-rt" }
}

resource "aws_route_table_association" "public" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
Q17. What is Terraform state? How do you handle state in a team?
Problems with local state in teams:
- Multiple people run apply simultaneously → state corruption
- State file on one person's laptop → team blocked if they're unavailable
- State in Git → security risk (state contains resource attributes, including secrets, in plaintext)
Remote state with locking:
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/app/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-locks"            # State locking
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ap-south-1:..." # Encrypt state
  }
}
State management commands:
# Import existing resource into state
terraform import aws_instance.web i-1234567890abcdef0
# Move resource in state (refactoring)
terraform state mv aws_instance.web module.compute.aws_instance.web
# Remove resource from state (without destroying)
terraform state rm aws_s3_bucket.old_bucket
# View current state
terraform state list
terraform state show aws_instance.web
Workspace per environment:
terraform workspace new staging
terraform workspace select production
terraform apply -var-file=production.tfvars
Deep-dive question at senior DevOps and platform engineer interviews
Q18. How do you implement Terraform for multiple environments (dev/staging/prod)?
Pattern 1 — Workspaces (simple, not recommended for large teams):
terraform workspace new dev && terraform apply
terraform workspace new prod && terraform apply
Problem: all workspaces share the same code with separate state files, so expressing per-environment configuration differences is awkward, and it's easy to run apply against the wrong workspace.
Pattern 2 — Directory per environment (recommended):
infrastructure/
├── modules/
│ ├── vpc/
│ ├── eks/
│ └── rds/
├── environments/
│ ├── dev/
│ │ ├── main.tf # uses modules, dev-specific values
│ │ ├── vars.tf
│ │ └── backend.tf
│ ├── staging/
│ └── production/
Pattern 3 — Terragrunt (DRY across environments):
# terragrunt.hcl in each environment directory
terraform {
  source = "../../modules//eks"
}

inputs = {
  cluster_version = "1.29"
  node_count      = local.env == "production" ? 5 : 2
}

# Remote state automatically configured per environment
remote_state {
  backend = "s3"
  config = {
    bucket = "tf-state-${local.env}"
    key    = "${path_relative_to_include()}/terraform.tfstate"
  }
}
Terragrunt handles the DRY (Don't Repeat Yourself) problem — one module definition, environment-specific overrides without copying Terraform code.
Q19. What is GitOps? How does it improve deployment reliability?
GitOps principles:
- Declarative: All config described in Git (not "run this script")
- Versioned and immutable: Git history is the audit log
- Pulled automatically: A GitOps agent (Argo CD, Flux) continuously reconciles cluster to Git state
- Continuously reconciled: Drift detected and auto-corrected
Reliability benefits:
- Rollback = git revert — no "how do I undo that kubectl command"
- Audit trail: every change has a commit, PR, and reviewer
- No "configuration drift" — agent reverts manual changes
- Disaster recovery: Re-sync from Git rebuilds entire cluster state
GitOps flow:
Developer writes code
→ PR to application repo
→ CI builds Docker image, pushes to registry
→ CI updates image tag in config repo (separate repo or same)
→ PR to config repo with new image tag
→ PR review + approval
→ Merge to main
→ Argo CD detects change → syncs cluster
→ Deployment rolls out
Q20. What is blue-green deployment vs. canary deployment vs. rolling deployment?
| Strategy | Description | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Blue-Green | Two identical environments; switch traffic | Instant (flip DNS/LB) | 2x (both envs running) | Low (instant switch) |
| Canary | Gradually shift traffic % to new version | Fast (shift back to 0%) | Low (small canary) | Low (limited blast radius) |
| Rolling | Replace instances one at a time | Slow (new rollout) | None extra | Medium (mixed versions) |
| Recreate | Kill all old, start all new | Fast (redeploy old) | None extra | High (downtime) |
Canary decision criteria: Monitor error rate, latency, custom business metrics on the canary. If all good → increase weight. If bad → rollback automatically.
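The canary decision loop above can be sketched as a simple comparison of canary vs. baseline metrics. The thresholds and function below are illustrative; tools like Argo Rollouts run equivalent logic against Prometheus queries:

```python
def canary_verdict(baseline_error_rate: float, canary_error_rate: float,
                   baseline_p99_ms: float, canary_p99_ms: float,
                   max_error_delta: float = 0.01,
                   max_latency_ratio: float = 1.5) -> str:
    """Return 'promote' if the canary looks healthy relative to the
    baseline version, else 'rollback'."""
    if canary_error_rate > baseline_error_rate + max_error_delta:
        return "rollback"  # error rate regressed beyond tolerance
    if canary_p99_ms > baseline_p99_ms * max_latency_ratio:
        return "rollback"  # latency regressed beyond tolerance
    return "promote"       # healthy -> safe to increase traffic weight

print(canary_verdict(0.002, 0.003, 180, 200))  # promote
print(canary_verdict(0.002, 0.050, 180, 200))  # rollback
```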
Tools:
- AWS CodeDeploy: Blue-green for Lambda + ECS
- Argo Rollouts: Canary + blue-green for K8s with Prometheus-based auto-rollback
- Istio/Flagger: Traffic shifting with service mesh
- Feature flags (LaunchDarkly, Unleash): Canary at the application layer, not infra layer
Asked at Flipkart, Swiggy, Zomato deployment strategy questions
Q21. How do you write Prometheus alerting rules?
# prometheus-rules.yml
groups:
- name: application-alerts
  interval: 1m
  rules:
  # Alert if error rate > 1% for 5 minutes
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
      sum(rate(http_requests_total[5m])) by (service)
      > 0.01
    for: 5m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "High error rate for {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
      runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  # Alert if p99 latency > 2 seconds
  - alert: HighLatency
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
      ) > 2
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "P99 latency > 2s for {{ $labels.service }}"
  # Node memory pressure
  - alert: NodeMemoryHighUsage
    expr: |
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
    for: 10m
    labels:
      severity: warning
Alertmanager routing:
# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-slack'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
  - match:
      team: platform
    receiver: 'platform-slack'
Q22. What is the ELK Stack vs. the Loki stack for logging?
| Feature | ELK Stack | Loki Stack |
|---|---|---|
| Components | Elasticsearch + Logstash + Kibana | Loki + Promtail/FluentBit + Grafana |
| Storage model | Indexed full-text search | Index only labels, store log lines compressed |
| Query language | Lucene/KQL | LogQL (similar to PromQL) |
| Resource usage | High (Elasticsearch is heavy) | Low (Loki is lightweight) |
| Cost | Higher storage + compute | Much lower (10x cheaper at scale) |
| Full-text search | Excellent | Limited (labels-based) |
| Grafana integration | Yes (but separate) | Native (both Grafana projects) |
| Best for | Complex search, compliance | Cloud-native, cost-sensitive |
LogQL example (Loki):
# Show all error logs from the payment service in last 1 hour
{namespace="production", app="payment-service"} |= "ERROR"
# Parse JSON logs and filter
{app="api"} | json | status_code >= 500
# Rate of error logs
rate({app="api"} |= "ERROR" [5m])
For most Kubernetes deployments in 2026, Loki is the preferred choice due to cost efficiency and native Grafana integration.
Q23. What is chaos engineering? How do you implement it?
The process:
- Define a "steady state" (baseline metrics — error rate, latency, throughput)
- Hypothesize that steady state continues during the experiment
- Inject failures: kill nodes, increase latency, inject CPU pressure, drop packets
- Observe: does the system maintain steady state?
- Fix weaknesses discovered
Tools:
| Tool | What you can inject |
|---|---|
| AWS Fault Injection Service (FIS) | EC2 stop/terminate, AZ outage, CPU/memory stress, API throttling |
| Chaos Monkey (Netflix) | Random EC2 termination |
| LitmusChaos (CNCF) | K8s pod kill, network latency, disk IO, DNS chaos |
| Chaos Toolkit | Multi-platform, extensible |
| Gremlin | Commercial, comprehensive blast radius control |
AWS FIS example:
{
  "targets": {
    "eks-nodes": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"eks:nodegroup-name": "production-workers"},
      "selectionMode": "PERCENT(33)"
    }
  },
  "actions": {
    "terminate-eks-nodes": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": {"Instances": "eks-nodes"},
      "parameters": {}
    }
  },
  "stopConditions": [
    {"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:...PaymentErrorAlarm"}
  ]
}
The stopConditions are critical — if your alarm fires, the experiment auto-stops to minimize damage.
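The same stop-condition idea applies to any chaos harness: poll the steady-state metric during the experiment and abort the moment it's breached. A minimal sketch (the injector and metric functions are illustrative stand-ins for a real fault injector and metrics API):

```python
import time

def run_chaos_experiment(inject_fault, revert_fault, get_error_rate,
                         threshold: float = 0.01, checks: int = 5,
                         interval_s: float = 0.0) -> bool:
    """Inject a fault, watch the steady-state metric, abort on breach.
    Returns True if steady state held for the whole experiment."""
    inject_fault()
    try:
        for _ in range(checks):
            if get_error_rate() > threshold:
                return False       # steady state broken -> stop early
            time.sleep(interval_s)
        return True
    finally:
        revert_fault()             # always clean up the injected fault

# Simulated run: the error rate spikes once the fault is injected
state = {"faulted": False}
ok = run_chaos_experiment(
    inject_fault=lambda: state.update(faulted=True),
    revert_fault=lambda: state.update(faulted=False),
    get_error_rate=lambda: 0.05 if state["faulted"] else 0.001,
)
print(ok, state["faulted"])  # False False  (aborted early, fault reverted)
```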
Asked at Google SRE, Amazon SRE, Flipkart platform interviews
Q24. What is Helm in the context of CI/CD? How do you deploy with Helm in a pipeline?
# GitHub Actions deployment job using Helm
deploy-production:
  runs-on: ubuntu-latest
  needs: [test, build]
  environment: production  # Requires manual approval in GitHub
  steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
        aws-region: ap-south-1
    - name: Update kubeconfig for EKS
      run: aws eks update-kubeconfig --name production-cluster --region ap-south-1
    - name: Helm deploy
      run: |
        helm upgrade --install myapp ./helm/myapp \
          --namespace production \
          --create-namespace \
          --values helm/myapp/values.yaml \
          --values helm/myapp/values-production.yaml \
          --set image.tag=${{ github.sha }} \
          --set deployment.replicas=5 \
          --atomic \
          --timeout 10m \
          --history-max 5
    - name: Verify deployment
      run: |
        kubectl rollout status deployment/myapp -n production --timeout=5m
        kubectl get pods -n production -l app=myapp
--atomic: If upgrade fails (health checks, readiness), automatically roll back to previous release.
--history-max 5: Keep only 5 Helm release history entries.
Q25. What is ArgoCD? How does it implement GitOps?
Argo CD is a declarative GitOps continuous delivery tool for Kubernetes: it runs inside the cluster, watches one or more Git repos, and continuously reconciles cluster state to match the manifests in Git.
Core concepts:
- Application: Maps a Git repo + path to a K8s cluster + namespace
- App of Apps: One Application that deploys all other Applications (cluster bootstrapping)
- Sync: Process of making cluster state match Git state
- Health: Argo CD checks resource health (Deployment fully rolled out, Service has endpoints)
# Application CRD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    targetRevision: HEAD
    path: services/payment-service
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Delete resources removed from Git
      selfHeal: true  # Revert manual changes
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
Argo CD's UI provides a visual representation of every deployed resource with health status, sync status, and diff view showing what changed.
Q26. How do you implement secrets management in CI/CD pipelines?
GitHub Actions secrets:
# Secrets stored in GitHub's encrypted store, injected as env vars
steps:
  - name: Deploy
    env:
      DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }}
      API_KEY: ${{ secrets.PAYMENT_API_KEY }}
    run: ./deploy.sh
Best practices:
- Environment-scoped secrets: Separate secrets for dev/staging/prod. Production secrets require environment approval.
- OIDC for cloud credentials: Use GitHub OIDC → AWS/GCP role assumption. No stored cloud credentials at all.
- HashiCorp Vault for runtime secrets: CI pipeline retrieves runtime secrets from Vault using a short-lived token.
- Rotate regularly: Automate secret rotation (AWS Secrets Manager auto-rotation).
- Audit access: Log every secret access — who retrieved what, when.
# OIDC-based secret fetching (no stored credentials)
- name: Configure AWS via OIDC
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789:role/github-deploy-role
    aws-region: ap-south-1

# Now fetch secrets from AWS Secrets Manager
- name: Fetch secrets
  run: |
    DB_SECRET=$(aws secretsmanager get-secret-value \
      --secret-id production/myapp/database \
      --query SecretString --output text)
    echo "::add-mask::$DB_SECRET"                 # mask the value in workflow logs
    echo "DB_SECRET=$DB_SECRET" >> "$GITHUB_ENV"  # GITHUB_ENV expects name=value lines
Q27. What is the difference between horizontal and vertical scaling?
| Feature | Horizontal Scaling (Scale Out) | Vertical Scaling (Scale Up) |
|---|---|---|
| Method | Add more instances | Increase instance size (CPU/RAM) |
| Limit | Theoretically unlimited | Limited by largest available instance |
| Cost | Linear per instance | Exponential (large instances cost more per unit) |
| Complexity | Application must be stateless/distributed | Simpler (same app, bigger machine) |
| Downtime | None (add instances live) | Often requires restart |
| Best for | Stateless web/app servers | Stateful databases, monoliths |
| Example | EC2 ASG: 5 t3.medium → 10 t3.medium | t3.medium → t3.xlarge |
Kubernetes HPA = horizontal scaling. VPA = vertical scaling (with pod restart).
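The HPA's core scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); in code:

```python
from math import ceil

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """The core HPA formula: scale replica count proportionally to how far
    the observed metric is from its target."""
    return ceil(current_replicas * current_metric / target_metric)

# 5 pods at 90% CPU against a 60% target -> scale out
print(hpa_desired_replicas(5, current_metric=90, target_metric=60))  # 8
# 5 pods at 30% CPU against a 60% target -> scale in
print(hpa_desired_replicas(5, current_metric=30, target_metric=60))  # 3
```

The real controller also applies a tolerance band (10% by default) and the configured min/max replica bounds before acting on this number.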
Cloud-native applications are designed for horizontal scaling:
- Stateless (session in Redis, not in-process memory)
- Shared-nothing architecture
- Configuration from environment (12-factor app)
- Health endpoints for load balancer integration
Q28. Explain the 12-Factor App methodology.
| Factor | Principle | Example |
|---|---|---|
| 1. Codebase | One codebase, many deploys | Git monorepo or separate repos per service |
| 2. Dependencies | Explicitly declared, isolated | requirements.txt, package.json, go.mod |
| 3. Config | Store config in environment | DATABASE_URL env var, not hardcoded |
| 4. Backing services | Treat as attached resources | DB, Redis, S3 accessed via URL from env |
| 5. Build/Release/Run | Strictly separate stages | Docker image (build) + env vars (release) + container (run) |
| 6. Processes | Execute as stateless processes | No sticky sessions; session state in Redis |
| 7. Port binding | Export services via port | App binds to $PORT |
| 8. Concurrency | Scale out via processes | Multiple workers, HPA |
| 9. Disposability | Fast startup, graceful shutdown | Kubernetes preStop hook, SIGTERM handling |
| 10. Dev/prod parity | Keep environments similar | Same Docker image, same configs |
| 11. Logs | Treat as event streams | Write to stdout, infrastructure aggregates |
| 12. Admin processes | Run as one-off processes | kubectl exec, AWS ECS exec |
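Factors 3 and 7 in practice are a few lines of code. A minimal sketch, where the DATABASE_URL and PORT keys match the table and the defaults are purely illustrative:

```python
import os

def load_config(env) -> dict:
    """Factor 3: config read from the environment with explicit defaults.
    DATABASE_URL and PORT key names match the table above; defaults are illustrative."""
    return {
        "database_url": env.get("DATABASE_URL", "postgresql://localhost:5432/dev"),
        "port": int(env.get("PORT", "8000")),  # Factor 7: bind to $PORT
    }

cfg = load_config(os.environ)  # real deployments inject values via the environment
```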
Q29. What is a service mesh? Explain the Istio architecture.
Istio architecture:
Control Plane (istiod)
├── Pilot — distributes service discovery, routing rules to proxies
├── Citadel — certificate authority for mTLS
├── Galley — validates config
└── Mixer (deprecated) — formerly handled telemetry
Data Plane
└── Envoy sidecar proxies (injected into every pod)
├── Intercept all inbound/outbound traffic
├── Enforce mTLS
├── Collect telemetry (metrics, traces)
└── Apply routing rules (retries, timeouts, circuit breaking)
Key Istio resources:
# VirtualService — traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - route:
        - destination:
            host: payment-service
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,connect-failure
      timeout: 10s
      fault:
        delay:
          percentage:
            value: 10  # 10% of requests delayed (chaos testing)
          fixedDelay: 5s
Istio is powerful but complex — adds ~5ms latency per hop and significant memory overhead (Envoy). Linkerd is a lighter alternative using Rust proxies.
Q30. How do you implement distributed tracing with OpenTelemetry?
# Python FastAPI app with OTel tracing
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()

# Configure tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    with tracer.start_as_current_span("get-order") as span:
        span.set_attribute("order.id", order_id)
        order = await db.get_order(order_id)  # db is application-specific
        if not order:
            span.set_status(StatusCode.ERROR, "Order not found")
        return order
OTel Collector receives traces from all services, samples them, and exports to Jaeger/Tempo:
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}  # Sample 5% of successful requests
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317  # Jaeger accepts OTLP natively
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/jaeger]
Q31. What is the difference between push and pull monitoring models?
| Model | Description | Examples | Trade-offs |
|---|---|---|---|
| Pull (scrape) | Monitoring system fetches metrics from targets | Prometheus | Target must expose HTTP endpoint; scalable; firewall-friendly if Prometheus is inside network |
| Push | Targets send metrics to collector | Graphite, InfluxDB, CloudWatch, Datadog | Works behind NAT; useful for short-lived jobs; collector can be overwhelmed |
Prometheus Pushgateway: Bridge for short-lived jobs (batch jobs, cron) that exit before Prometheus scrapes them. Job pushes metrics to Pushgateway; Prometheus scrapes Pushgateway.
# Push metrics from a batch job to Pushgateway
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/backup-job/instance/server1
# TYPE backup_duration_seconds gauge
backup_duration_seconds 420
# TYPE backup_files_processed counter
backup_files_processed 1523847
EOF
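The same push can be issued from Python with only the standard library. A sketch mirroring the curl call above (the Pushgateway address is illustrative, and the real prometheus_client library offers push_to_gateway for this):

```python
import urllib.request

def format_gauge(name: str, value: float) -> str:
    """Prometheus text exposition format for a single gauge."""
    return f"# TYPE {name} gauge\n{name} {value}\n"

def push_metric(gateway: str, job: str, name: str, value: float) -> None:
    """POST the metric to the Pushgateway, same URL shape as the curl call."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=format_gauge(name, value).encode(),
        method="POST",
    )
    urllib.request.urlopen(req)

# push_metric("pushgateway:9091", "backup-job", "backup_duration_seconds", 420)
```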
Q32. How do you implement feature flags? What are the benefits?
Types:
- Release flags: Enable new feature for % of users (canary at app layer)
- Experiment flags: A/B testing (50% see UI variant A, 50% see B)
- Ops flags: Kill switches for problematic features (disable heavy query, enable maintenance mode)
- Permission flags: Enable features per user tier (free vs. paid)
# Using Unleash (open-source feature flags)
from unleash_client import UnleashClient

client = UnleashClient(
    url="https://unleash.example.com/api",
    app_name="payment-service",
    custom_headers={"Authorization": "*:development.abc123"},
)
client.initialize_client()  # start polling for flag definitions

@app.post("/checkout")
async def checkout(request: CheckoutRequest):
    if client.is_enabled("new-payment-flow", {"userId": request.user_id}):
        return await new_payment_processor(request)
    else:
        return await legacy_payment_processor(request)
Benefits for DevOps:
- Decouple deployment from release — deploy code, enable flag later
- Instant rollback without redeployment (disable flag)
- Dark launches — ship code to production disabled, enable for testing
- Gradual rollouts — 1% → 10% → 100%
- Trunk-based development — merge incomplete features behind flags
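Gradual rollouts rely on deterministic bucketing so a user's flag state is sticky as the percentage grows. A minimal sketch of the idea (not Unleash's actual algorithm):

```python
import hashlib

def in_rollout(user_id: str, flag: str, percentage: float) -> bool:
    """Deterministic bucketing: the same user always lands in the same
    bucket for a given flag, so a 10% rollout stays on when it grows to 50%."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000  # bucket in 0..9999
    return bucket < percentage * 100  # percentage is 0..100

print(in_rollout("user-42", "new-payment-flow", 100.0))  # True
```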
Q33. What is Packer? How does it fit into a DevOps workflow?
Why build custom AMIs:
- Faster EC2 launch (no `apt install` on startup — everything already baked in)
- Immutable infrastructure pattern — never patch running instances, replace with a new AMI
- Tested, hardened images (CIS benchmarks applied during build)
- Consistent configuration across environments
# packer.pkr.hcl
source "amazon-ebs" "ubuntu" {
  region        = "ap-south-1"
  source_ami    = "ami-0f5ee92e2d63afc18" # Ubuntu 22.04 LTS
  instance_type = "t3.medium"
  ssh_username  = "ubuntu"
  ami_name      = "myapp-base-{{timestamp}}"
}

build {
  sources = ["source.amazon-ebs.ubuntu"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
      "sudo systemctl enable nginx"
    ]
  }

  provisioner "ansible" {
    playbook_file = "playbooks/harden.yml"
  }

  post-processor "manifest" {
    output = "manifest.json" # Save AMI ID for Terraform
  }
}
CI pipeline: Packer builds AMI → runs tests → publishes AMI ID → Terraform references latest AMI → instances launch immediately with all software pre-installed.
Q34. What is the difference between MTTR, MTBF, and MTTD?
| Metric | Full Name | Measures | Goal |
|---|---|---|---|
| MTTR | Mean Time to Recovery | Average time to restore service after failure | Minimize (faster recovery) |
| MTBF | Mean Time Between Failures | Average time between incidents | Maximize (more reliable) |
| MTTD | Mean Time to Detect | Average time from failure to detection | Minimize (better monitoring) |
| MTTF | Mean Time to Failure | Average time until a component fails | Maximize |
How to improve MTTR:
- Better runbooks (clear, tested procedures)
- Auto-remediation (Lambda triggered by CloudWatch alarm)
- Feature flags (disable problematic feature instantly)
- Rollback automation (Argo CD sync to previous revision on alert)
- PagerDuty escalation policies (right person paged immediately)
- Chaos engineering (practice incident response regularly)
How to improve MTTD:
- Comprehensive alerting (SLI-based alerts, not just infrastructure)
- Synthetic monitoring (actively probe from outside)
- Real user monitoring (RUM — detect user-impacting issues before alerts)
- Anomaly detection (ML-based: Amazon DevOps Guru, Datadog)
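These metrics fall straight out of incident timestamps. A minimal sketch using hypothetical incident data (failure, detection, and recovery times):

```python
from datetime import datetime

# Hypothetical incident log: (failed_at, detected_at, recovered_at)
incidents = [
    (datetime(2026, 1, 3, 10, 0), datetime(2026, 1, 3, 10, 4),
     datetime(2026, 1, 3, 10, 34)),
    (datetime(2026, 1, 17, 2, 0), datetime(2026, 1, 17, 2, 10),
     datetime(2026, 1, 17, 3, 0)),
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([det - fail for fail, det, _ in incidents])
mttr = mean_minutes([rec - fail for fail, _, rec in incidents])
# MTBF: gap between one recovery and the next failure
mtbf = mean_minutes([nxt[0] - cur[2] for cur, nxt in zip(incidents, incidents[1:])])
print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min, MTBF {mtbf / 60:.0f} h")
```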
Q35. What is Vault by HashiCorp? How does it manage secrets?
Core concepts:
- Secret Engines: Backends that store/generate secrets (KV, AWS IAM, database, PKI)
- Auth Methods: How clients authenticate (Kubernetes ServiceAccount, AWS IAM, GitHub)
- Policies: ACL rules controlling who can access which secrets
- Leases: Time-bound access — dynamic secrets expire and are revoked
Dynamic secrets (killer feature):
# Vault generates a short-lived AWS access key on demand
vault read aws/creds/my-role
# Key: AKIAIOSFODNN7EXAMPLE
# Secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Expires in: 1 hour
# After expiry: Vault automatically revokes it from AWS
In Kubernetes:
# Vault Agent sidecar annotations — inject secrets as files
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "payment-service"
  vault.hashicorp.com/agent-inject-secret-db: "secret/data/production/database"
  vault.hashicorp.com/agent-inject-template-db: |
    {{- with secret "secret/data/production/database" -}}
    DATABASE_URL=postgresql://{{ .Data.data.username }}:{{ .Data.data.password }}@db:5432/app
    {{- end }}
Advanced-Level DevOps Questions (Q36-Q50)
The Advanced section is where Rs 45+ LPA offers are won. These questions are asked for senior SRE and staff-level DevOps roles. If you can answer these confidently, you're in the top 5% of candidates.
Q36. How do you implement zero-trust networking in a DevOps context?
Implementation pillars:
- Service-to-service mTLS: Istio/Linkerd automatically issues certificates, enforces mutual authentication — even internal services verify each other
- Short-lived credentials: No long-term passwords; dynamic secrets from Vault, IAM roles
- Workload identity: SPIFFE/SPIRE assigns cryptographic identities to workloads (pods, VMs)
- Network micro-segmentation: K8s NetworkPolicies — explicit allow lists between services
- Device trust: BeyondCorp-style — employee machines verified before VPN
- Continuous authorization: Re-verify on every request, not just login
- Comprehensive audit logging: Every request logged with who, what, when, from where
# Istio PeerAuthentication — enforce mTLS cluster-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # Cluster-wide
spec:
  mtls:
    mode: STRICT  # All service-to-service traffic must use mTLS
Q37. Design a complete CI/CD pipeline for a microservices application.
Architecture:
Code changes
│
▼
GitHub (PR opened)
│
▼
GitHub Actions CI job:
├── Lint + unit tests
├── Build Docker image (BuildKit + cache)
├── Security scan (Trivy CRITICAL/HIGH block)
├── SAST scan (Semgrep/CodeQL)
├── Integration tests (docker-compose)
├── Push to ECR (only on PR merge to main)
└── Update image tag in config repo (GitOps)
│
▼
Config repo PR (automated)
│
▼
Platform team reviews (auto-approve for non-prod)
│
▼
Merge to config repo main branch
│
▼
Argo CD detects change → syncs staging cluster
│
▼
Staging deployment + smoke tests
│
▼
Manual approval (production)
│
▼
Argo CD syncs production cluster
├── Canary: 5% → 25% → 50% → 100% (Argo Rollouts)
└── Auto-rollback if error rate alarm fires
│
▼
PagerDuty alert if post-deploy health check fails
Key design decisions:
- OIDC for all cloud credentials (no stored keys)
- Separate application repo from config repo (GitOps)
- Security scanning gates are non-negotiable
- Canary with Prometheus-based analysis for production
Q38. How do you implement observability for a distributed system at scale?
Cardinality management (critical at scale): High-cardinality labels (user_id, request_id in Prometheus) cause memory exhaustion. Rules:
- Prometheus labels: Only low-cardinality (service, endpoint, status_code, region)
- High-cardinality data: Traces (Jaeger/Tempo), logs (Loki) — not metrics
Sampling strategy:
At 100M requests/day, storing every trace is prohibitively expensive. Two approaches:
- Head-based sampling: decision made when the request starts (e.g., keep a fixed 1% of all traces). Cheap, but the sampler cannot know in advance which traces will error or be slow.
- Tail-based sampling (OTel Collector tail_sampling processor): decision made after seeing the full trace — keep 100% of error traces, 100% of traces > 500ms latency, and 1% of the remaining successful traces. More accurate, but more resource-intensive at the collector.
Exemplars: Link metrics to traces — when a high-latency spike appears in Prometheus, exemplars provide the trace ID that caused it:
# Histogram with exemplar
REQUEST_LATENCY.observe(duration, exemplar={"traceID": current_trace_id})
SLO-based alerting (multi-window, multi-burn-rate):
# Alert fires if you're burning through 30-day error budget too fast
# Fast burn: 14.4x budget burn rate for 1h (critical)
# Slow burn: 3x budget burn rate for 6h (warning)
- alert: ErrorBudgetBurnTooFast
  expr: |
    (
      sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x the 0.1% error budget of a 99.9% SLO
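The 14.4x and 3x multipliers are not magic: at a constant burn rate, the whole window's error budget is consumed in window / burn_rate. A sketch of the arithmetic:

```python
def budget_exhaustion_hours(burn_rate: float, window_days: int = 30) -> float:
    """At a constant burn-rate multiple, the error budget for the whole
    window is consumed in window / burn_rate."""
    return window_days * 24 / burn_rate

# Alert threshold in the expression above: burn_rate * (1 - SLO)
fast_threshold = 14.4 * (1 - 0.999)  # ~0.0144 error ratio

print(budget_exhaustion_hours(14.4))  # 50.0 (hours: budget gone in ~2 days, page someone)
print(budget_exhaustion_hours(3.0))   # 240.0 (10 days: time to act, not to page)
```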
Q39. What is FinOps? How do DevOps engineers contribute to cloud cost optimization?
DevOps engineer's role in FinOps:
- Rightsizing: Use AWS Compute Optimizer recommendations — downsize over-provisioned instances
- Auto-scaling: Scale down outside business hours (scheduled scaling for stateless apps)
- Spot/Preemptible instances: 70-90% cheaper for fault-tolerant workloads (CI runners, batch, ML training)
- Resource tagging: Every resource tagged (environment, team, service, cost-center) for chargeback
- Delete zombie resources: Unused EIPs, forgotten Load Balancers, orphaned EBS volumes
- Savings Plans/Reserved Instances: Commit to base capacity for 30-66% savings
- Architecture optimization: Lambda > always-on EC2 for bursty workloads; Graviton > x86 for ~20% cost reduction
- S3 lifecycle policies: Auto-move old data to Glacier ($0.004/GB vs $0.023/GB)
Tooling:
- AWS Cost Explorer + Budgets: Alerts on spend anomalies
- Infracost: Show cost diff in Terraform PRs
- OpenCost / Kubecost: K8s cost visibility (cost per namespace, deployment, team)
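The lifecycle-policy numbers above translate directly into savings. A sketch using the per-GB prices quoted in the list (hypothetical 50 TB bucket, 80% cold data):

```python
# Per-GB monthly prices quoted above: S3 Standard vs S3 Glacier
S3_STANDARD, GLACIER = 0.023, 0.004

def monthly_savings(gb: float, cold_fraction: float) -> float:
    """Savings from lifecycle-transitioning the cold fraction to Glacier."""
    return gb * cold_fraction * (S3_STANDARD - GLACIER)

print(f"${monthly_savings(50_000, 0.8):,.0f}/month")  # $760/month
```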
Q40. Explain platform engineering vs. DevOps. What is an Internal Developer Platform (IDP)?
DevOps (early 2010s): Individual dev teams own their own CI/CD and infrastructure. "You build it, you run it."
Platform Engineering (2020s): A dedicated team builds an Internal Developer Platform — golden paths and self-service tools that abstract infrastructure complexity from application developers.
Internal Developer Platform (IDP) components:
| Component | Purpose | Example Tools |
|---|---|---|
| Self-service portal | Developers create environments/services via UI | Backstage (Spotify) |
| Golden path templates | Pre-approved service templates with best practices | Cookiecutter, Backstage Software Templates |
| CI/CD abstractions | Developers don't write raw GitHub Actions | Dagger, shared GitHub Actions libraries |
| Secret management | One-click secret rotation, dev access | Vault UI, External Secrets |
| Observability | Auto-configured monitoring per service | Grafana + auto-dashboards |
| Environment provisioning | Create dev environment in minutes | Crossplane, Argo CD |
Why it matters: At scale (500+ engineers), having each team own all of DevOps creates inconsistency, security gaps, and toil. Platform Engineering creates leverage — one platform team enables hundreds of product engineers.
Increasingly asked at senior/lead level interviews (2026)
Q41. What is Crossplane? How does it extend Kubernetes for infrastructure management?
# Provision an RDS PostgreSQL instance using Crossplane
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-postgres
spec:
  forProvider:
    region: ap-south-1
    dbInstanceClass: db.t3.medium
    masterUsername: admin
    engine: postgres
    engineVersion: "15.3"
    multiAZ: true
    skipFinalSnapshotBeforeDeletion: false
  writeConnectionSecretToRef:
    namespace: production
    name: postgres-credentials  # Crossplane stores connection details as a K8s Secret
Crossplane vs. Terraform:
- Crossplane is Kubernetes-native — leverages existing K8s RBAC, GitOps tools (Argo CD), and tooling
- Terraform is more mature, larger ecosystem, easier to use outside K8s
- Crossplane is better for organizations fully committed to Kubernetes and GitOps
Q42. What is Site Reliability Engineering (SRE)? How does it differ from DevOps?
| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Cultural movement | Google's implementation of DevOps principles |
| Focus | Collaboration, CI/CD, automation | Reliability, scalability, SLOs |
| Team structure | Embedded in product teams | Dedicated SRE teams (or embedded) |
| Key metrics | Deployment frequency, lead time | SLOs, error budget, MTTR, MTBF |
| Tooling | CI/CD, IaC, monitoring | All DevOps + capacity planning, load testing |
| Book | "The Phoenix Project" | "Site Reliability Engineering" (Google) |
SRE unique concepts:
- Toil: Manual, repetitive, automatable work that scales with service load. SREs target <50% toil time.
- Error budgets: Formalize the reliability vs. velocity trade-off. If error budget is depleted, new features freeze.
- Eliminating toil: Every manual task is a candidate for automation. If you do it twice, automate it.
- Postmortems: Blameless, written, shared across organization.
- Production readiness reviews (PRR): Checklist before launching new services (alerts configured? runbook exists? load tested?)
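The error-budget concept above is simple arithmetic. A sketch converting an SLO into allowed downtime per window:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as downtime: (1 - SLO) over the window."""
    return (1 - slo) * window_days * 24 * 60

print(f"{allowed_downtime_minutes(0.999):.1f} min")   # 43.2 min per 30 days
print(f"{allowed_downtime_minutes(0.9999):.2f} min")  # 4.32 min per 30 days
```

This is why "one more nine" is expensive: going from 99.9% to 99.99% cuts the budget tenfold.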
Critical distinction for Google, Amazon, Flipkart SRE positions
Q43. How do you design for reliability in a multi-region deployment?
Active-Active (both regions serve traffic):
                Users globally
                      │
        Route 53 latency-based routing
            /                    \
ap-south-1 (Mumbai)        ap-southeast-1 (Singapore)
├── EKS cluster            ├── EKS cluster
├── Aurora Global DB       ├── Aurora Read Replica
│   (primary writer)       │   (promotes to writer on failover)
└── ElastiCache            └── ElastiCache
Active-Passive (one region handles traffic, other on standby):
- Simpler, lower cost
- RTO: minutes (failover time)
- RPO: seconds (replication lag)
Key patterns for multi-region:
- Data consistency: Use CRDTs for eventually consistent data; avoid two-phase commit across regions
- Circuit breakers: Don't let a failing region cascade to healthy region
- Chaos engineering: Regularly simulate region failures (AWS FIS)
- DNS failover: Route 53 health checks auto-reroute on region failure
- Deployment: Deploy to one region, verify, then the second (sequenced deploys)
RTO/RPO targets:
- Tier 1 (payments, login): RTO <5 min, RPO ~0 (synchronous replication)
- Tier 2 (recommendations): RTO <30 min, RPO <5 min
- Tier 3 (analytics): RTO <4 hours, RPO <1 hour
Q44. What is supply chain security in DevOps? How do you implement SLSA?
SLSA (Supply-chain Levels for Software Artifacts) — a framework for supply chain integrity:
| Level | Requirements |
|---|---|
| SLSA 1 | Provenance exists (build logs available) |
| SLSA 2 | Provenance signed and hosted by build service |
| SLSA 3 | Source verified, build isolated, hardened build environment |
| SLSA 4 | Two-person review, hermetic reproducible builds |
Implementation:
# GitHub Actions with SLSA provenance (slsa-github-generator)
- name: Build Docker image
  id: build
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: myapp:${{ github.sha }}

# Note: the generator is a reusable workflow, so it is invoked from a
# separate job with a job-level `uses:`, not as a step:
#
# provenance:
#   needs: build
#   uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v1
#   with:
#     image: myapp
#     digest: ${{ needs.build.outputs.digest }}
Dependency security:
- Dependabot: Auto-PRs for dependency updates
- Renovate: More configurable alternative
- Snyk: Deep vulnerability scanning including transitive dependencies
- SBOM (CycloneDX/SPDX): Know every component in your software
Q45. How do you handle database migrations in CI/CD?
Safe migration patterns:
1. Expand-Contract (for zero downtime):
   - Phase 1 (Expand): Add the new column/table, keep the old one. Both old and new app code can run.
   - Phase 2 (Migrate): Backfill the new column; run the new app code.
   - Phase 3 (Contract): Remove the old column only after the old code is fully retired.
2. Blue-Green with schema sync:
   - The new schema must be backward compatible with both blue (old) and green (new) versions
   - Never drop columns in the same deploy that adds new ones
3. Tools:
   - Flyway/Liquibase: Version-controlled SQL migrations, run in the pipeline before app deploy
   - Atlas: Modern schema-as-code for multiple databases
   - Django/Rails migrations: Framework-native
# Kubernetes Job running migrations before deployment
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-{{ .Values.image.tag }}
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: wait-for-db
          image: busybox
          command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["python", "manage.py", "migrate", "--noinput"]
Helm pre-install/pre-upgrade hooks ensure migrations run and succeed before the new app version is deployed.
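The Expand-Contract phases can be walked through end to end. A sketch using SQLite and a hypothetical users table (splitting full_name into first_name/last_name):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users (full_name) VALUES ('Asha Rao')")

# Phase 1 (Expand): add new columns; old code keeps working, ignoring them
db.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
db.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# Phase 2 (Migrate): backfill while new code dual-writes both shapes
db.execute("""
    UPDATE users SET
      first_name = substr(full_name, 1, instr(full_name, ' ') - 1),
      last_name  = substr(full_name, instr(full_name, ' ') + 1)
""")

# Phase 3 (Contract): only after no deployed code reads full_name
# db.execute("ALTER TABLE users DROP COLUMN full_name")

print(db.execute("SELECT first_name, last_name FROM users").fetchone())
# ('Asha', 'Rao')
```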
Q46. What is Dagger? How does it improve CI/CD portability?
Problem it solves: CI configuration is fragmented across YAML files for each platform. Testing locally requires pushing to CI. Different behavior in CI vs. local.
# Dagger pipeline in Python — runs the same way locally and in GitHub Actions
import anyio
import dagger

async def build_and_test():
    async with dagger.Connection() as client:
        # Get source code
        source = client.host().directory(".", exclude=[".git", "node_modules"])

        # Build container
        node = (
            client.container()
            .from_("node:20-alpine")
            .with_directory("/src", source)
            .with_workdir("/src")
            .with_exec(["npm", "ci"])
        )

        # Run tests (reading stdout forces execution)
        test = await node.with_exec(["npm", "test"]).stdout()
        print(f"Tests: {test}")

        # Build and publish the image
        image = await node.with_exec(["npm", "run", "build"]).publish(
            "myregistry.com/myapp:latest"
        )
        print(f"Published: {image}")

anyio.run(build_and_test)
Run locally with python pipeline.py — same execution in CI via dagger run python pipeline.py.
Q47. How do you implement policy as code with OPA/Gatekeeper?
# Rego for a Gatekeeper ConstraintTemplate — enforce resource limits
package k8srequiredlimits

violation[{"msg": msg}] {
    container := input.review.object.spec.containers[_]
    not container.resources.limits.memory
    msg := sprintf("Container '%s' must have memory limits", [container.name])
}

violation[{"msg": msg}] {
    container := input.review.object.spec.containers[_]
    not container.resources.limits.cpu
    msg := sprintf("Container '%s' must have CPU limits", [container.name])
}
# Constraint — enforce in the production and staging namespaces
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLimits
metadata:
  name: require-resource-limits
spec:
  enforcementAction: deny  # or 'warn' for audit mode
  match:
    namespaces: ["production", "staging"]
OPA in CI/CD (non-K8s):
# Evaluate Terraform plan against policies before apply
terraform plan -out tfplan.binary
terraform show -json tfplan.binary > plan-output.json
# OPA policy check (--fail exits non-zero when the policy is not satisfied)
opa eval -d policies/ -i plan-output.json "data.terraform.allow" --fail
Conftest uses OPA policies to validate any YAML/JSON/Terraform/Dockerfile.
Q48. What is AIOps? How is AI being integrated into DevOps in 2026?
Current practical applications (2026):
| Use Case | Tool | How it works |
|---|---|---|
| Anomaly detection | Grafana ML, Datadog Watchdog | Baseline metrics, flag unusual deviations |
| Alert correlation | PagerDuty AIOps | Group related alerts, reduce alert noise |
| Root cause analysis | Amazon DevOps Guru | Identify unusual resource behavior patterns |
| Log analysis | Elastic ML, Grafana Loki ML | Cluster log patterns, detect new error types |
| Predictive scaling | AWS Auto Scaling predictive mode | Scale before traffic hits, not after |
| Incident resolution | Slack AI + runbook lookup | Suggest runbook steps from incident description |
| Code review | GitHub Copilot, CodeRabbit | Review PRs for security issues, performance |
Practical integration pattern:
# GitHub Actions with AI-assisted PR review
- name: CodeRabbit Review
  uses: coderabbitai/ai-pr-reviewer@latest
  with:
    openai_api_key: ${{ secrets.OPENAI_API_KEY }}
  # Reviews for: logic errors, security issues, performance, test coverage
AI doesn't replace DevOps engineers in 2026 — it handles toil (log analysis, alert grouping) so engineers focus on system design and reliability.
Q49. How do you implement disaster recovery? Explain RTO and RPO in practice.
DR Strategies (from cheapest to most expensive):
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Low | Restore from S3 backups |
| Pilot Light | 10-30 min | Minutes | Low-Medium | Core services always running, scale up on disaster |
| Warm Standby | Minutes | Seconds | Medium | Scaled-down version running in secondary region |
| Multi-Site Active-Active | Near zero | Near zero | High | Full production capacity in both regions |
Pilot Light example:
# Primary Region (ap-south-1) — full production
module "primary" {
  source         = "./modules/app-stack"  # hypothetical shared module
  instance_count = 10
  rds_instance   = "db.r6g.xlarge"
}

# DR Region (ap-southeast-1) — pilot light
module "dr" {
  source         = "./modules/app-stack"
  instance_count = 0                # ASG min=0, max=10
  rds_instance   = "db.t3.medium"   # Smaller RDS receiving replication
}
DR runbook steps (must be tested quarterly):
- Verify data replication is up-to-date
- Scale up DR region ASG
- Promote RDS read replica to primary
- Update Route 53 health check to point to DR region
- Verify application is serving traffic
- Communicate status to stakeholders
Test your DR plan: Untested DR plans fail when you need them most. Chaos engineering + game days simulate disasters on a schedule.
Q50. Describe your approach to building a developer platform from scratch at a 200-person engineering organization.
Phase 1 — Assess (Week 1-2):
- Interview 20+ engineers: biggest pain points, manual toils, blocked deployments
- Audit current state: How many unique CI/CD setups? How long do deployments take? DORA metrics baseline
- Identify top 3 pain points (usually: slow/flaky CI, inconsistent environments, opaque deployments)
Phase 2 — Golden Path (Month 1-3):
- Standardize on one CI/CD platform (GitHub Actions)
- Shared GitHub Actions library: reusable build/test/deploy workflows
- Opinionated service template (Cookiecutter + Backstage): generates a new service with CI/CD, monitoring, and security scanning pre-configured
- Target: new service from idea to first deployment < 1 day
Phase 3 — Self-Service (Month 3-6):
- Backstage portal: service catalog, create environments, view deployment status
- One-click staging environment provisioning
- Automated secret management onboarding
- Auto-configured Grafana dashboards per service
Phase 4 — Reliability (Month 6-12):
- SLO framework and tooling
- Chaos engineering program
- Production readiness checklist
- On-call tooling (PagerDuty + runbooks)
Measure success with DORA metrics:
- Deployment frequency: 1 deploy/week/team → 5/day/team
- Lead time: 3 days → 2 hours
- Change failure rate: 20% → 5%
- MTTR: 2 hours → 15 minutes
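Two of these DORA metrics can be computed directly from a deploy log. A minimal sketch with hypothetical deploy records:

```python
from datetime import date

# Hypothetical deploy log: (deploy date, change failed?)
deploys = [
    (date(2026, 1, 5), False), (date(2026, 1, 6), True),
    (date(2026, 1, 7), False), (date(2026, 1, 9), False),
    (date(2026, 1, 12), False),
]

days = (deploys[-1][0] - deploys[0][0]).days or 1
frequency = len(deploys) / days                            # deployment frequency
cfr = sum(failed for _, failed in deploys) / len(deploys)  # change failure rate
print(f"{frequency:.2f} deploys/day, CFR {cfr:.0%}")       # 0.71 deploys/day, CFR 20%
```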
FAQ Section — Straight Answers to Your DevOps Career Questions
Q: What's the difference between DevOps Engineer, SRE, and Platform Engineer? This confuses almost everyone, so here's the clear breakdown: DevOps Engineer focuses on CI/CD, automation, build/deploy tooling. SRE focuses on production reliability, SLOs, incident response, and on-call. Platform Engineer builds internal tools and golden paths for other engineers. Roles overlap significantly — job title often depends on company culture. Pro tip: Read the job description carefully. A "DevOps Engineer" role at a bank is very different from one at a startup.
Q: Is Kubernetes knowledge required for DevOps roles in 2026? Yes — it's effectively non-negotiable for product companies. Kubernetes is the production container orchestration standard. At minimum, understand deployments, services, ConfigMaps, RBAC, and basic troubleshooting. EKS/GKE-specific knowledge is a bonus. Check out our Kubernetes Interview Questions 2026 for a deep dive.
Q: Terraform vs. Pulumi in 2026 — which is gaining? Terraform remains dominant in market share. Pulumi is gaining traction at organizations with strong software engineering cultures (programmatic IaC using real languages vs. HCL). The OpenTofu fork (open-source Terraform) is growing after HashiCorp's BSL license change.
Q: What monitoring stack should I learn? The open-source stack: Prometheus + Grafana + Loki + Tempo + OpenTelemetry. This covers metrics, logs, and traces. Datadog is the dominant commercial alternative — learn if you're going to enterprise companies.
Q: Is Jenkins still relevant in 2026? Jenkins remains widely deployed in enterprises and has a huge plugin ecosystem. GitHub Actions, GitLab CI, and CircleCI are preferred at modern product companies. If you're applying to enterprise/bank/large IT — know Jenkins. Startups: GitHub Actions.
Q: What certifications should a DevOps engineer get? Tier 1: AWS Solutions Architect Associate + CKA (Certified Kubernetes Administrator). Tier 2: HashiCorp Terraform Associate, Prometheus Certified Associate. Tier 3: CKS (Security), AWS DevOps Professional, Google Cloud Professional DevOps Engineer.
Q: How important is Python/scripting for DevOps? Extremely important. Bash for one-liners and simple scripts. Python for anything more complex (AWS boto3, custom tooling, Ansible modules). Go is increasingly common for writing K8s operators and CLI tools. Know at least Python + Bash.
Q: What is the DevOps salary range in India in 2026? Here are the verified numbers from real offers: Junior DevOps (0-2 yrs): Rs 8-15 LPA. Mid-level (3-5 yrs, K8s + AWS + CI/CD): Rs 18-40 LPA. Senior/SRE (7+ yrs, system design, architecture): Rs 45-90 LPA. Principal/Staff: Rs 80 LPA-1.5 Cr at top product companies. The jump from mid-level to senior is massive — and it's driven by system design and incident management skills, not just tool knowledge.
Build the complete DevOps & infrastructure interview toolkit:
- AWS Interview Questions 2026 — Master the #1 cloud platform
- Kubernetes Interview Questions 2026 — Container orchestration deep dive
- Docker Interview Questions 2026 — Container fundamentals
- System Design Interview Questions 2026 — Design scalable distributed systems
- Microservices Interview Questions 2026 — Distributed application architecture
- Data Engineering Interview Questions 2026 — Data pipelines and infrastructure