DevOps Interview Questions 2026 — Top 50 with Expert Answers
Elite DevOps teams deploy to production multiple times per day with a change failure rate under 5%. That's the bar companies are hiring for in 2026. DevOps has evolved from a cultural philosophy into a concrete set of engineering practices — and companies expect you to command CI/CD pipelines, infrastructure as code, observability, incident response, and reliability engineering at an expert level. This guide covers 50 real questions asked at product companies like Razorpay, Swiggy, Flipkart, and global FAANG firms, organized by difficulty with the exact answers that get offers.
DevOps/SRE is one of the fastest-growing career paths in India, with senior roles commanding Rs 45-90 LPA at top product companies. The skills gap is real — master these 50 questions and you're ahead of 90% of candidates.
Related: AWS Interview Questions 2026 | Kubernetes Interview Questions 2026 | Docker Interview Questions 2026 | System Design Interview Questions 2026
Beginner-Level DevOps Questions (Q1-Q15)
Even if you're a senior engineer, don't skip these. Interviewers at Razorpay and Flipkart use beginner questions to test whether you truly understand the "why" behind DevOps — not just the tools.
Q1. What is DevOps? How does it differ from traditional IT operations?
Traditional IT vs. DevOps:
| Aspect | Traditional IT | DevOps |
|---|---|---|
| Dev/Ops relationship | Siloed teams with handoffs | Shared ownership, shared goals |
| Deployment frequency | Quarterly or monthly releases | Multiple times per day |
| Deployment method | Manual, scripted | Automated CI/CD pipelines |
| Infrastructure | Pet servers (named, cared for) | Cattle (identical, replaceable) |
| Failure response | Blame, RCA blame game | Blameless post-mortems, SRE principles |
| Feedback loop | Months | Minutes to hours |
| Rollback | Manual, risky | Automated, safe |
DORA Metrics (DevOps Research and Assessment) measure DevOps performance:
- Deployment frequency: How often you deploy to production
- Lead time for changes: Code commit to production
- Change failure rate: % of deployments causing incidents
- Time to restore service: MTTR after incident
Elite performers (2024 DORA report): Deploy on-demand (multiple times/day), lead time <1 day, CFR <5%, MTTR <1 hour.
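As a quick sanity check, the elite-tier thresholds above can be encoded directly. The classifier below is illustrative (the tier names and function are not part of the DORA report):

```python
from dataclasses import dataclass

@dataclass
class DoraMetrics:
    deploys_per_day: float        # deployment frequency
    lead_time_hours: float        # commit -> production
    change_failure_rate: float    # fraction of deploys causing incidents
    mttr_hours: float             # time to restore service

def is_elite(m: DoraMetrics) -> bool:
    """True if the team meets the elite-tier thresholds quoted above."""
    return (
        m.deploys_per_day >= 1            # on-demand, multiple deploys/day
        and m.lead_time_hours < 24        # lead time under 1 day
        and m.change_failure_rate < 0.05  # CFR under 5%
        and m.mttr_hours < 1              # restore in under 1 hour
    )

team = DoraMetrics(deploys_per_day=4, lead_time_hours=6,
                   change_failure_rate=0.02, mttr_hours=0.5)
print(is_elite(team))  # True
```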
Q2. What is CI/CD? Explain each stage.
Continuous Integration (CI): Every code commit triggers automated build and test — developers merge frequently, preventing integration hell.
Continuous Delivery (CD): Every passing build is automatically deployed to staging. Human approval gates production deployment.
Continuous Deployment: Every passing build is automatically deployed to production — no human gates.
Full CI/CD pipeline stages:
Developer commits code
│
▼
1. Source Control (Git) — PR created, branch policies enforced
│
▼
2. CI Trigger — webhook fires pipeline
│
▼
3. Build
├── Compile / install dependencies
├── Run unit tests
├── Static code analysis (SonarQube, ESLint)
└── Security scan (SAST — Semgrep, CodeQL)
│
▼
4. Test
├── Integration tests
├── Contract tests (Pact)
└── Vulnerability scan (Trivy on Docker image)
│
▼
5. Artifact
├── Build Docker image
└── Push to registry (ECR, GCR)
│
▼
6. Deploy to Staging
└── Smoke tests / synthetic monitoring
│
▼
7. [Approval gate] — automated or manual
│
▼
8. Deploy to Production
├── Blue/green or canary deployment
└── Post-deploy health checks
│
▼
9. Monitor
└── Alert if error rate spikes → auto-rollback or PagerDuty alert
Q3. What is Infrastructure as Code (IaC)? Why is it important?
Infrastructure as Code (IaC) is the practice of defining infrastructure (networks, servers, databases, load balancers) in version-controlled, machine-readable files and applying them with tooling, instead of configuring resources by hand in a console.
Benefits:
| Benefit | Explanation |
|---|---|
| Reproducibility | Same code creates identical environments (no "works on my machine") |
| Version control | Infrastructure changes tracked in Git — who changed what, when, why |
| Peer review | Infrastructure changes reviewed via pull requests |
| Automation | Environments created in minutes, not weeks |
| Drift detection | Know when actual state diverges from desired state |
| Disaster recovery | Re-create entire environment from code |
| Cost control | Spin down dev environments on weekends (schedule destroy) |
Popular IaC tools:
| Tool | Type | Best for |
|---|---|---|
| Terraform | Declarative, multi-cloud | General purpose, most popular |
| AWS CloudFormation | Declarative, AWS-only | Native AWS integration |
| AWS CDK | Programmatic IaC (TypeScript, Python) | Developers who prefer real languages |
| Pulumi | Programmatic IaC (any language) | Multi-cloud with programming constructs |
| Ansible | Imperative, configuration management | OS config, application deployment |
| Packer | Image builder | AMI, GCP image creation |
Q4. Explain Terraform workflow — init, plan, apply, destroy.
# 1. terraform init
# Downloads provider plugins, initializes backend (remote state)
terraform init
# 2. terraform plan
# Shows what will be created/modified/destroyed (dry run)
# ALWAYS review this before apply
terraform plan -out=tfplan
# 3. terraform apply
# Applies the planned changes
terraform apply tfplan
# Or interactively:
terraform apply # Shows plan, prompts for "yes"
# 4. terraform destroy
# Destroys all resources in the state
terraform destroy # Prompts for "yes"
terraform destroy -target=aws_instance.web # Destroy specific resource
State file (terraform.tfstate):
Terraform tracks the mapping between your config and real infrastructure in a state file. In teams, state is stored remotely (S3 + DynamoDB lock for AWS):
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
DynamoDB prevents concurrent applies (state locking). Never commit state files to Git — they contain sensitive data.
Asked at Flipkart, Razorpay, PhonePe infrastructure interviews
Q5. What is the difference between Ansible, Terraform, and Chef/Puppet?
| Tool | Category | Approach | State | Language | Best For |
|---|---|---|---|---|---|
| Terraform | IaC (provisioning) | Declarative | Yes (tfstate) | HCL | Cloud resource provisioning |
| Ansible | Configuration management | Imperative (playbooks) | Stateless | YAML | OS config, app deployment, ad-hoc tasks |
| Chef | Configuration management | Imperative (recipes) | Server (Chef Server) | Ruby | Traditional CM, complex configs |
| Puppet | Configuration management | Declarative | Server (PuppetDB) | Puppet DSL | Enterprise config management |
| Pulumi | IaC (provisioning) | Programmatic | Yes (backend) | Any language | Complex IaC requiring programming logic |
Common pattern: Terraform provisions the servers (EC2, RDS, VPC); Ansible configures them (install packages, deploy app, set up monitoring). Terraform handles "what infrastructure exists"; Ansible handles "what's installed on the servers."
Q6. What is GitHub Actions? Write a simple CI workflow.
# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.11', '3.12']
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Cache pip
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements*.txt') }}
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Lint
        run: ruff check . --output-format=github
      - name: Test
        run: pytest tests/ --cov=app --cov-report=xml -v
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: SAST scan
        uses: returntocorp/semgrep-action@v1
Q7. What is a Jenkins pipeline? What is the difference between Declarative and Scripted pipeline?
Declarative pipeline (recommended — structured, less Groovy knowledge needed):
// Jenkinsfile
pipeline {
    agent any
    environment {
        DOCKER_REGISTRY = 'myregistry.com'
        IMAGE_NAME = 'myapp'
    }
    stages {
        stage('Checkout') {
            steps {
                git branch: 'main', url: 'https://github.com/myorg/myapp.git'
            }
        }
        stage('Build & Test') {
            steps {
                sh 'mvn clean test'
            }
            post {
                always {
                    junit 'target/surefire-reports/*.xml'
                }
            }
        }
        stage('Docker Build') {
            steps {
                script {
                    docker.build("${DOCKER_REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                }
            }
        }
        stage('Deploy to Staging') {
            when {
                branch 'main'
            }
            steps {
                sh './deploy.sh staging'
            }
        }
    }
    post {
        failure {
            slackSend(color: 'danger', message: "Build FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}")
        }
    }
}
Scripted pipeline: Pure Groovy inside node {} blocks. More flexible but more complex, harder to read. Use Declarative unless you need complex Groovy logic.
Q8. What is the difference between Git merge, rebase, and cherry-pick?
| Operation | What it does | Creates merge commit? | Rewrites history? |
|---|---|---|---|
| merge | Combines branches, preserves history | Yes (unless fast-forward) | No |
| rebase | Replays commits onto another branch tip | No | Yes — new commit SHAs |
| cherry-pick | Applies a specific commit to current branch | No | Yes — new commit SHA |
| squash merge | Combines all branch commits into one | One new commit | Yes |
# Merge feature into main (preserves feature history)
git checkout main && git merge feature/payment
# Rebase feature onto main (cleaner linear history)
git checkout feature/payment && git rebase main
# Cherry-pick a hotfix to multiple release branches
git cherry-pick abc1234
# Interactive rebase — squash last 3 commits
git rebase -i HEAD~3
Golden rule: Never rebase shared/public branches (main, develop). Only rebase local feature branches before merging.
Q9. What is the purpose of a staging environment? What makes a good staging setup?
Characteristics of a good staging environment:
- Production parity: Same infrastructure configuration (instance sizes can differ, but architecture must match)
- Real data: Anonymized production data dump — tests realistic data volumes, not 10 rows
- Isolated: No shared services with production (separate DB, separate queues)
- Continuously deployed: Every merge to main auto-deploys to staging
- Monitored: Same monitoring stack as production (so you catch monitoring gaps)
- External service stubs: Payment gateways (Razorpay test mode), SMS providers (mock)
Common staging failures: Using smaller instance types → misses memory/CPU issues. Shared database with prod → staging deploy breaks prod. Fake/sparse data → doesn't test realistic query performance.
Q10. What is Prometheus? How does it collect metrics?
Prometheus is an open-source monitoring and alerting system that scrapes metrics from HTTP endpoints on a schedule and stores them in its own time-series database, queried with PromQL.
Architecture:
Applications (/metrics endpoint in Prometheus format)
↑ scrape (HTTP GET /metrics)
Prometheus Server
├── TSDB (time-series database, local disk)
├── Rules evaluation (recording + alerting rules)
└── Alertmanager → PagerDuty, Slack, OpsGenie
↑ query (PromQL)
Grafana
Metric types:
- Counter: Only goes up (HTTP requests total, errors total)
- Gauge: Can go up or down (current memory usage, queue depth)
- Histogram: Distribution of observations (request latency buckets, response sizes)
- Summary: Similar to histogram, but calculates quantiles client-side
# Python Flask app exposing Prometheus metrics
from flask import Flask
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['endpoint'])

@app.route('/api/orders')
def get_orders():
    with REQUEST_LATENCY.labels(endpoint='/api/orders').time():
        result = db.query_orders()
    REQUEST_COUNT.labels(method='GET', endpoint='/api/orders', status=200).inc()
    return result

start_http_server(8000)  # exposes /metrics on port 8000 for Prometheus to scrape
Q11. What is Grafana? How does it integrate with Prometheus?
Prometheus + Grafana integration:
- Add Prometheus as a data source in Grafana (URL: http://prometheus:9090)
- Use PromQL queries in Grafana panels:
# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
Good dashboards follow RED method:
- Rate: Requests per second
- Errors: Error rate (4xx/5xx)
- Duration: Latency distribution (p50, p90, p99)
Grafana's Explore feature allows ad-hoc metric queries during incident investigation without modifying dashboards. Grafana OnCall integrates alert routing directly.
Q12. What is an SLI, SLO, and SLA?
| Term | Full Form | Definition | Owner | Example |
|---|---|---|---|---|
| SLI | Service Level Indicator | A metric that measures service behavior | Engineering | "99.5% of requests complete in <200ms" |
| SLO | Service Level Objective | A target value or range for an SLI | Engineering + Product | "SLI must be ≥ 99.5% over 30 days" |
| SLA | Service Level Agreement | A business contract with consequences | Legal + Business | "99.9% uptime, or 10% credit for each 0.1% below" |
Relationship: SLOs are internal targets, deliberately stricter than the SLA so there's a buffer before contractual penalties kick in (e.g., an internal 99.95% SLO backing a 99.9% SLA); actual performance should meet or exceed the SLO. SLIs are the measurements that tell you whether you're meeting both.
Error budget: 100% minus SLO. For 99.9% SLO: 0.1% error budget = 43.8 minutes/month of allowed downtime. If you're within budget, you can deploy new features. If you've exceeded budget, all deployments freeze until next period.
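The error-budget arithmetic is worth being able to do on a whiteboard; a one-function sketch (assuming an average month of ~30.44 days, which is where the 43.8-minute figure comes from):

```python
def error_budget_minutes(slo: float, days: float = 30.44) -> float:
    """Allowed downtime per window for a given availability SLO
    (average-length month by default)."""
    return (1 - slo) * days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.8 minutes/month for 99.9%
print(round(error_budget_minutes(0.9999), 1))  # 4.4 minutes/month for 99.99%
```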
Good SLIs focus on what users experience:
- Availability: % of successful requests
- Latency: % of requests below threshold
- Throughput: Operations per second
- Error rate: % of failed requests
Critical concept at SRE interviews (Google, Amazon, Flipkart SRE)
Q13. What is the difference between monitoring, observability, and alerting?
| Concept | Definition | Tools |
|---|---|---|
| Monitoring | Collecting and displaying predefined metrics | Prometheus, CloudWatch, Datadog |
| Observability | Ability to understand system internal state from external outputs. 3 pillars: Metrics, Logs, Traces | Prometheus + Loki + Tempo (Grafana OSS) |
| Alerting | Notifying humans when metrics breach thresholds | Alertmanager, PagerDuty, OpsGenie |
Monitoring vs. Observability: Monitoring asks "Is this thing healthy?" (yes/no). Observability asks "WHY is this broken?" You need distributed tracing and structured logs to debug non-obvious failures across microservices.
The three pillars:
- Metrics: Numeric aggregations over time (fast, cheap, limited dimensionality)
- Logs: Timestamped text records (rich context, expensive to store and query at scale)
- Traces: End-to-end request path across services (shows bottlenecks, latency contributions)
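For the logs pillar, structured JSON written to stdout is what makes logs cheap to aggregate and query downstream; a minimal sketch using only the Python standard library (field names like trace_id are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # extra fields (e.g. a trace id) are attached via logging's `extra=`
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler(sys.stdout)  # factor 11: logs go to stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created", extra={"trace_id": "abc123"})
# emits one JSON line with ts, level, logger, msg, and trace_id fields
```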
OpenTelemetry (OTel) is the emerging standard for instrumentation — one SDK for all three signals, vendor-neutral.
Q14. What is Incident Management? Describe a good incident response process.
Incident severity levels:
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| P0/SEV1 | Total outage, all users affected | Immediate, 24/7 | Payment processing down |
| P1/SEV2 | Major feature broken, large user impact | <15 minutes | Login failure for 50% users |
| P2/SEV3 | Significant degradation, some users affected | <1 hour | Checkout slow (p99 >5s) |
| P3/SEV4 | Minor issue, small user impact | <4 hours | Minor UI bug |
Incident response process:
- Detect: Automated alert fires (Alertmanager → PagerDuty → on-call engineer)
- Acknowledge: On-call acknowledges within SLA (prevents escalation)
- Assemble: Incident commander paged for SEV1/2; coordinates responders
- Investigate: Identify blast radius — what's broken, how many users affected
- Mitigate: Minimize impact first (rollback, feature flag disable, capacity increase)
- Resolve: Permanent fix (may come later; mitigation is sufficient to close incident)
- Review: Blameless post-mortem within 48 hours for SEV1/2
Incident channels: Dedicated Slack channel per incident (#incident-2026-03-30-payment), Zoom bridge for coordination, PagerDuty status page updates.
Q15. What is a blameless post-mortem?
A blameless post-mortem is a written analysis of an incident that focuses on systemic and process causes rather than individual fault.
Why blameless? If engineers fear punishment for mistakes, they hide problems, don't take risks, and don't report near-misses. Google's SRE book established blameless culture as foundational to reliability.
Post-mortem structure:
- Summary: 2-3 sentence incident description
- Impact: Duration, affected users/services, business impact (revenue, support tickets)
- Timeline: Chronological events (when detected, key decisions, mitigation, resolution)
- Root cause: 5 Whys analysis — the actual technical cause
- Contributing factors: System fragility, process gaps, monitoring blind spots
- Action items: Concrete improvements with owners and due dates
- Lessons learned: What went well, what went poorly
Action items must be specific:
- Bad: "Improve monitoring"
- Good: "Add PagerDuty alert when payment service error rate exceeds 1% for 5 minutes — owner: @alice, due: 2026-04-15"
Intermediate-Level DevOps Questions (Q16-Q35)
This is where Razorpay, Flipkart, and Swiggy interviews get serious. These questions test whether you've actually operated production systems or just read about them.
Q16. Write a complete Terraform module for an AWS VPC.
# modules/vpc/main.tf
variable "environment" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "azs" {
  type = list(string)
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "${var.environment}-igw" }
}

resource "aws_subnet" "public" {
  count                   = length(var.azs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = var.azs[count.index]
  map_public_ip_on_launch = true
  tags = { Name = "${var.environment}-public-${var.azs[count.index]}" }
}

resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
  availability_zone = var.azs[count.index]
  tags = { Name = "${var.environment}-private-${var.azs[count.index]}" }
}

resource "aws_eip" "nat" {
  count  = length(var.azs)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(var.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]
  tags          = { Name = "${var.environment}-nat-${var.azs[count.index]}" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  tags = { Name = "${var.environment}-public-rt" }
}

resource "aws_route_table_association" "public" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
Q17. What is Terraform state? How do you handle state in a team?
Problems with local state in teams:
- Multiple people run apply simultaneously → state corruption
- State file on one person's laptop → team blocked if they're unavailable
- State in Git → security risk (state contains resource attributes, including secrets, in plaintext)
Remote state with locking:
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/app/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-locks"            # State locking
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ap-south-1:..." # Encrypt state
  }
}
State management commands:
# Import existing resource into state
terraform import aws_instance.web i-1234567890abcdef0
# Move resource in state (refactoring)
terraform state mv aws_instance.web module.compute.aws_instance.web
# Remove resource from state (without destroying)
terraform state rm aws_s3_bucket.old_bucket
# View current state
terraform state list
terraform state show aws_instance.web
Workspace per environment:
terraform workspace new staging
terraform workspace select production
terraform apply -var-file=production.tfvars
Deep-dive question at senior DevOps and platform engineer interviews
Q18. How do you implement Terraform for multiple environments (dev/staging/prod)?
Pattern 1 — Workspaces (simple, not recommended for large teams):
terraform workspace new dev && terraform apply
terraform workspace new prod && terraform apply
Problem: all workspaces share the same code with separate state files, so expressing per-environment configuration differences is awkward, and it's easy to run apply against the wrong workspace.
Pattern 2 — Directory per environment (recommended):
infrastructure/
├── modules/
│ ├── vpc/
│ ├── eks/
│ └── rds/
├── environments/
│ ├── dev/
│ │ ├── main.tf # uses modules, dev-specific values
│ │ ├── vars.tf
│ │ └── backend.tf
│ ├── staging/
│ └── production/
Pattern 3 — Terragrunt (DRY across environments):
# terragrunt.hcl in each environment directory
terraform {
  source = "../../modules//eks"
}

inputs = {
  cluster_version = "1.29"
  node_count      = local.env == "production" ? 5 : 2
}

# Remote state automatically configured per environment
remote_state {
  backend = "s3"
  config = {
    bucket = "tf-state-${local.env}"
    key    = "${path_relative_to_include()}/terraform.tfstate"
  }
}
Terragrunt handles the DRY (Don't Repeat Yourself) problem — one module definition, environment-specific overrides without copying Terraform code.
Q19. What is GitOps? How does it improve deployment reliability?
GitOps principles:
- Declarative: All config described in Git (not "run this script")
- Versioned and immutable: Git history is the audit log
- Pulled automatically: A GitOps agent (Argo CD, Flux) continuously reconciles cluster to Git state
- Continuously reconciled: Drift detected and auto-corrected
Reliability benefits:
- Rollback = git revert — no "how do I undo that kubectl command"
- Audit trail: every change has a commit, PR, and reviewer
- No "configuration drift" — agent reverts manual changes
- Disaster recovery: Re-sync from Git rebuilds entire cluster state
GitOps flow:
Developer writes code
→ PR to application repo
→ CI builds Docker image, pushes to registry
→ CI updates image tag in config repo (separate repo or same)
→ PR to config repo with new image tag
→ PR review + approval
→ Merge to main
→ Argo CD detects change → syncs cluster
→ Deployment rolls out
Q20. What is blue-green deployment vs. canary deployment vs. rolling deployment?
| Strategy | Description | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Blue-Green | Two identical environments; switch traffic | Instant (flip DNS/LB) | 2x (both envs running) | Low (instant switch) |
| Canary | Gradually shift traffic % to new version | Fast (shift back to 0%) | Low (small canary) | Low (limited blast radius) |
| Rolling | Replace instances one at a time | Slow (new rollout) | None extra | Medium (mixed versions) |
| Recreate | Kill all old, start all new | Fast (redeploy old) | None extra | High (downtime) |
Canary decision criteria: Monitor error rate, latency, custom business metrics on the canary. If all good → increase weight. If bad → rollback automatically.
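The canary decision loop above can be sketched as a simple comparison of canary vs. baseline metrics. The thresholds and function below are illustrative; tools like Argo Rollouts run equivalent logic against Prometheus queries:

```python
def canary_verdict(baseline_error_rate: float, canary_error_rate: float,
                   baseline_p99_ms: float, canary_p99_ms: float,
                   max_error_delta: float = 0.01,
                   max_latency_ratio: float = 1.5) -> str:
    """Return 'promote' if the canary looks healthy relative to the
    baseline version, else 'rollback'."""
    if canary_error_rate > baseline_error_rate + max_error_delta:
        return "rollback"  # error rate regressed beyond tolerance
    if canary_p99_ms > baseline_p99_ms * max_latency_ratio:
        return "rollback"  # latency regressed beyond tolerance
    return "promote"       # healthy -> safe to increase traffic weight

print(canary_verdict(0.002, 0.003, 180, 200))  # promote
print(canary_verdict(0.002, 0.050, 180, 200))  # rollback
```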
Tools:
- AWS CodeDeploy: Blue-green for Lambda + ECS
- Argo Rollouts: Canary + blue-green for K8s with Prometheus-based auto-rollback
- Istio/Flagger: Traffic shifting with service mesh
- Feature flags (LaunchDarkly, Unleash): Canary at the application layer, not infra layer
Asked at Flipkart, Swiggy, Zomato deployment strategy questions
Q21. How do you write Prometheus alerting rules?
# prometheus-rules.yml
groups:
- name: application-alerts
  interval: 1m
  rules:
  # Alert if error rate > 1% for 5 minutes
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
      sum(rate(http_requests_total[5m])) by (service)
      > 0.01
    for: 5m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "High error rate for {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
      runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  # Alert if p99 latency > 2 seconds
  - alert: HighLatency
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
      ) > 2
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "P99 latency > 2s for {{ $labels.service }}"
  # Node memory pressure
  - alert: NodeMemoryHighUsage
    expr: |
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
    for: 10m
    labels:
      severity: warning
Alertmanager routing:
# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-slack'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
  - match:
      team: platform
    receiver: 'platform-slack'
Q22. What is the ELK Stack vs. the Loki stack for logging?
| Feature | ELK Stack | Loki Stack |
|---|---|---|
| Components | Elasticsearch + Logstash + Kibana | Loki + Promtail/FluentBit + Grafana |
| Storage model | Indexed full-text search | Index only labels, store log lines compressed |
| Query language | Lucene/KQL | LogQL (similar to PromQL) |
| Resource usage | High (Elasticsearch is heavy) | Low (Loki is lightweight) |
| Cost | Higher storage + compute | Much lower (10x cheaper at scale) |
| Full-text search | Excellent | Limited (labels-based) |
| Grafana integration | Yes (but separate) | Native (both Grafana projects) |
| Best for | Complex search, compliance | Cloud-native, cost-sensitive |
LogQL example (Loki):
# Show all error logs from the payment service in last 1 hour
{namespace="production", app="payment-service"} |= "ERROR"
# Parse JSON logs and filter
{app="api"} | json | status_code >= 500
# Rate of error logs
rate({app="api"} |= "ERROR" [5m])
For most Kubernetes deployments in 2026, Loki is the preferred choice due to cost efficiency and native Grafana integration.
Q23. What is chaos engineering? How do you implement it?
The process:
- Define a "steady state" (baseline metrics — error rate, latency, throughput)
- Hypothesize that steady state continues during the experiment
- Inject failures: kill nodes, increase latency, inject CPU pressure, drop packets
- Observe: does the system maintain steady state?
- Fix weaknesses discovered
Tools:
| Tool | What you can inject |
|---|---|
| AWS Fault Injection Service (FIS) | EC2 stop/terminate, AZ outage, CPU/memory stress, API throttling |
| Chaos Monkey (Netflix) | Random EC2 termination |
| LitmusChaos (CNCF) | K8s pod kill, network latency, disk IO, DNS chaos |
| Chaos Toolkit | Multi-platform, extensible |
| Gremlin | Commercial, comprehensive blast radius control |
AWS FIS example:
{
  "targets": {
    "eks-nodes": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"eks:nodegroup-name": "production-workers"},
      "selectionMode": "PERCENT(33)"
    }
  },
  "actions": {
    "terminate-eks-nodes": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": {"Instances": "eks-nodes"},
      "parameters": {}
    }
  },
  "stopConditions": [
    {"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:...PaymentErrorAlarm"}
  ]
}
The stopConditions are critical — if your alarm fires, the experiment auto-stops to minimize damage.
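The same stop-condition idea applies to any chaos harness: poll the steady-state metric during the experiment and abort the moment it's breached. A minimal sketch (the injector and metric functions are illustrative stand-ins for a real fault injector and metrics API):

```python
import time

def run_chaos_experiment(inject_fault, revert_fault, get_error_rate,
                         threshold: float = 0.01, checks: int = 5,
                         interval_s: float = 0.0) -> bool:
    """Inject a fault, watch the steady-state metric, abort on breach.
    Returns True if steady state held for the whole experiment."""
    inject_fault()
    try:
        for _ in range(checks):
            if get_error_rate() > threshold:
                return False       # steady state broken -> stop early
            time.sleep(interval_s)
        return True
    finally:
        revert_fault()             # always clean up the injected fault

# Simulated run: the error rate spikes once the fault is injected
state = {"faulted": False}
ok = run_chaos_experiment(
    inject_fault=lambda: state.update(faulted=True),
    revert_fault=lambda: state.update(faulted=False),
    get_error_rate=lambda: 0.05 if state["faulted"] else 0.001,
)
print(ok, state["faulted"])  # False False  (aborted early, fault reverted)
```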
Asked at Google SRE, Amazon SRE, Flipkart platform interviews
Q24. What is Helm in the context of CI/CD? How do you deploy with Helm in a pipeline?
# GitHub Actions deployment job using Helm
deploy-production:
  runs-on: ubuntu-latest
  needs: [test, build]
  environment: production  # Requires manual approval in GitHub
  steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
        aws-region: ap-south-1
    - name: Update kubeconfig for EKS
      run: aws eks update-kubeconfig --name production-cluster --region ap-south-1
    - name: Helm deploy
      run: |
        helm upgrade --install myapp ./helm/myapp \
          --namespace production \
          --create-namespace \
          --values helm/myapp/values.yaml \
          --values helm/myapp/values-production.yaml \
          --set image.tag=${{ github.sha }} \
          --set deployment.replicas=5 \
          --atomic \
          --timeout 10m \
          --history-max 5
    - name: Verify deployment
      run: |
        kubectl rollout status deployment/myapp -n production --timeout=5m
        kubectl get pods -n production -l app=myapp
--atomic: If upgrade fails (health checks, readiness), automatically roll back to previous release.
--history-max 5: Keep only 5 Helm release history entries.
Q25. What is ArgoCD? How does it implement GitOps?
Argo CD is a declarative GitOps continuous delivery tool for Kubernetes: it runs inside the cluster, watches one or more Git repos, and continuously reconciles cluster state to match the manifests in Git.
Core concepts:
- Application: Maps a Git repo + path to a K8s cluster + namespace
- App of Apps: One Application that deploys all other Applications (cluster bootstrapping)
- Sync: Process of making cluster state match Git state
- Health: Argo CD checks resource health (Deployment fully rolled out, Service has endpoints)
# Application CRD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    targetRevision: HEAD
    path: services/payment-service
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Delete resources removed from Git
      selfHeal: true  # Revert manual changes
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
Argo CD's UI provides a visual representation of every deployed resource with health status, sync status, and diff view showing what changed.
Q26. How do you implement secrets management in CI/CD pipelines?
GitHub Actions secrets:
# Secrets stored in GitHub's encrypted store, injected as env vars
steps:
  - name: Deploy
    env:
      DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }}
      API_KEY: ${{ secrets.PAYMENT_API_KEY }}
    run: ./deploy.sh
Best practices:
- Environment-scoped secrets: Separate secrets for dev/staging/prod. Production secrets require environment approval.
- OIDC for cloud credentials: Use GitHub OIDC → AWS/GCP role assumption. No stored cloud credentials at all.
- HashiCorp Vault for runtime secrets: CI pipeline retrieves runtime secrets from Vault using a short-lived token.
- Rotate regularly: Automate secret rotation (AWS Secrets Manager auto-rotation).
- Audit access: Log every secret access — who retrieved what, when.
# OIDC-based secret fetching (no stored credentials)
- name: Configure AWS via OIDC
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789:role/github-deploy-role
    aws-region: ap-south-1

# Now fetch secrets from AWS Secrets Manager
- name: Fetch secrets
  run: |
    DB_SECRET=$(aws secretsmanager get-secret-value \
      --secret-id production/myapp/database \
      --query SecretString --output text)
    echo "::add-mask::$DB_SECRET"                 # mask the value in workflow logs
    echo "DB_SECRET=$DB_SECRET" >> "$GITHUB_ENV"  # GITHUB_ENV expects name=value lines
Q27. What is the difference between horizontal and vertical scaling?
| Feature | Horizontal Scaling (Scale Out) | Vertical Scaling (Scale Up) |
|---|---|---|
| Method | Add more instances | Increase instance size (CPU/RAM) |
| Limit | Theoretically unlimited | Limited by largest available instance |
| Cost | Linear per instance | Exponential (large instances cost more per unit) |
| Complexity | Application must be stateless/distributed | Simpler (same app, bigger machine) |
| Downtime | None (add instances live) | Often requires restart |
| Best for | Stateless web/app servers | Stateful databases, monoliths |
| Example | EC2 ASG: 5 t3.medium → 10 t3.medium | t3.medium → t3.xlarge |
Kubernetes HPA = horizontal scaling. VPA = vertical scaling (with pod restart).
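The HPA's core scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); in code:

```python
from math import ceil

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """The core HPA formula: scale replica count proportionally to how far
    the observed metric is from its target."""
    return ceil(current_replicas * current_metric / target_metric)

# 5 pods at 90% CPU against a 60% target -> scale out
print(hpa_desired_replicas(5, current_metric=90, target_metric=60))  # 8
# 5 pods at 30% CPU against a 60% target -> scale in
print(hpa_desired_replicas(5, current_metric=30, target_metric=60))  # 3
```

The real controller also applies a tolerance band (10% by default) and the configured min/max replica bounds before acting on this number.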
Cloud-native applications are designed for horizontal scaling:
- Stateless (session in Redis, not in-process memory)
- Shared-nothing architecture
- Configuration from environment (12-factor app)
- Health endpoints for load balancer integration
Q28. Explain the 12-Factor App methodology.
| Factor | Principle | Example |
|---|---|---|
| 1. Codebase | One codebase, many deploys | Git monorepo or separate repos per service |
| 2. Dependencies | Explicitly declared, isolated | requirements.txt, package.json, go.mod |
| 3. Config | Store config in environment | DATABASE_URL env var, not hardcoded |
| 4. Backing services | Treat as attached resources | DB, Redis, S3 accessed via URL from env |
| 5. Build/Release/Run | Strictly separate stages | Docker image (build) + env vars (release) + container (run) |
| 6. Processes | Execute as stateless processes | No sticky sessions; session state in Redis |
| 7. Port binding | Export services via port | App binds to $PORT |
| 8. Concurrency | Scale out via processes | Multiple workers, HPA |
| 9. Disposability | Fast startup, graceful shutdown | Kubernetes preStop hook, SIGTERM handling |
| 10. Dev/prod parity | Keep environments similar | Same Docker image, same configs |
| 11. Logs | Treat as event streams | Write to stdout, infrastructure aggregates |
| 12. Admin processes | Run as one-off processes | kubectl exec, AWS ECS exec |
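Factors 3 and 7 in practice are a few lines of code. A minimal sketch, where the DATABASE_URL and PORT keys match the table and the defaults are purely illustrative:

```python
import os

def load_config(env) -> dict:
    """Factor 3: config read from the environment with explicit defaults.
    DATABASE_URL and PORT key names match the table above; defaults are illustrative."""
    return {
        "database_url": env.get("DATABASE_URL", "postgresql://localhost:5432/dev"),
        "port": int(env.get("PORT", "8000")),  # Factor 7: bind to $PORT
    }

cfg = load_config(os.environ)  # real deployments inject values via the environment
```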
Q29. What is a service mesh? Explain the Istio architecture.
Istio architecture:
Control Plane (istiod)
├── Pilot — distributes service discovery, routing rules to proxies
├── Citadel — certificate authority for mTLS
├── Galley — validates config
└── Mixer (deprecated) — formerly handled telemetry
Data Plane
└── Envoy sidecar proxies (injected into every pod)
├── Intercept all inbound/outbound traffic
├── Enforce mTLS
├── Collect telemetry (metrics, traces)
└── Apply routing rules (retries, timeouts, circuit breaking)
Key Istio resources:
# VirtualService — traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - route:
        - destination:
            host: payment-service
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,connect-failure
      timeout: 10s
      fault:
        delay:
          percentage:
            value: 10  # 10% of requests delayed (chaos testing)
          fixedDelay: 5s
Istio is powerful but complex — adds ~5ms latency per hop and significant memory overhead (Envoy). Linkerd is a lighter alternative using Rust proxies.
Q30. How do you implement distributed tracing with OpenTelemetry?
# Python FastAPI app with OTel tracing
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()

# Configure tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    with tracer.start_as_current_span("get-order") as span:
        span.set_attribute("order.id", order_id)
        order = await db.get_order(order_id)  # db is application-specific
        if not order:
            span.set_status(StatusCode.ERROR, "Order not found")
        return order
OTel Collector receives traces from all services, samples them, and exports to Jaeger/Tempo:
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}  # Sample 5% of successful requests
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317  # Jaeger accepts OTLP natively
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/jaeger]
Q31. What is the difference between push and pull monitoring models?
| Model | Description | Examples | Trade-offs |
|---|---|---|---|
| Pull (scrape) | Monitoring system fetches metrics from targets | Prometheus | Target must expose HTTP endpoint; scalable; firewall-friendly if Prometheus is inside network |
| Push | Targets send metrics to collector | Graphite, InfluxDB, CloudWatch, Datadog | Works behind NAT; useful for short-lived jobs; collector can be overwhelmed |
Prometheus Pushgateway: Bridge for short-lived jobs (batch jobs, cron) that exit before Prometheus scrapes them. Job pushes metrics to Pushgateway; Prometheus scrapes Pushgateway.
# Push metrics from a batch job to Pushgateway
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/backup-job/instance/server1
# TYPE backup_duration_seconds gauge
backup_duration_seconds 420
# TYPE backup_files_processed counter
backup_files_processed 1523847
EOF
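The same push can be issued from Python with only the standard library. A sketch mirroring the curl call above (the Pushgateway address is illustrative, and the real prometheus_client library offers push_to_gateway for this):

```python
import urllib.request

def format_gauge(name: str, value: float) -> str:
    """Prometheus text exposition format for a single gauge."""
    return f"# TYPE {name} gauge\n{name} {value}\n"

def push_metric(gateway: str, job: str, name: str, value: float) -> None:
    """POST the metric to the Pushgateway, same URL shape as the curl call."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=format_gauge(name, value).encode(),
        method="POST",
    )
    urllib.request.urlopen(req)

# push_metric("pushgateway:9091", "backup-job", "backup_duration_seconds", 420)
```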
Q32. How do you implement feature flags? What are the benefits?
Types:
- Release flags: Enable new feature for % of users (canary at app layer)
- Experiment flags: A/B testing (50% see UI variant A, 50% see B)
- Ops flags: Kill switches for problematic features (disable heavy query, enable maintenance mode)
- Permission flags: Enable features per user tier (free vs. paid)
# Using Unleash (open-source feature flags)
from unleash_client import UnleashClient

client = UnleashClient(
    url="https://unleash.example.com/api",
    app_name="payment-service",
    custom_headers={"Authorization": "*:development.abc123"},
)
client.initialize_client()  # start polling for flag definitions

@app.post("/checkout")
async def checkout(request: CheckoutRequest):
    if client.is_enabled("new-payment-flow", {"userId": request.user_id}):
        return await new_payment_processor(request)
    else:
        return await legacy_payment_processor(request)
Benefits for DevOps:
- Decouple deployment from release — deploy code, enable flag later
- Instant rollback without redeployment (disable flag)
- Dark launches — ship code to production disabled, enable for testing
- Gradual rollouts — 1% → 10% → 100%
- Trunk-based development — merge incomplete features behind flags
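Gradual rollouts rely on deterministic bucketing so a user's flag state is sticky as the percentage grows. A minimal sketch of the idea (not Unleash's actual algorithm):

```python
import hashlib

def in_rollout(user_id: str, flag: str, percentage: float) -> bool:
    """Deterministic bucketing: the same user always lands in the same
    bucket for a given flag, so a 10% rollout stays on when it grows to 50%."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000  # bucket in 0..9999
    return bucket < percentage * 100  # percentage is 0..100

print(in_rollout("user-42", "new-payment-flow", 100.0))  # True
```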
Q33. What is Packer? How does it fit into a DevOps workflow?
Why build custom AMIs:
- Faster EC2 launch (no `apt install` on startup — everything already baked in)
- Immutable infrastructure pattern — never patch running instances, replace with a new AMI
- Tested, hardened images (CIS benchmarks applied during build)
- Consistent configuration across environments
# packer.pkr.hcl
source "amazon-ebs" "ubuntu" {
  region        = "ap-south-1"
  source_ami    = "ami-0f5ee92e2d63afc18" # Ubuntu 22.04 LTS
  instance_type = "t3.medium"
  ssh_username  = "ubuntu"
  ami_name      = "myapp-base-{{timestamp}}"
}

build {
  sources = ["source.amazon-ebs.ubuntu"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
      "sudo systemctl enable nginx"
    ]
  }

  provisioner "ansible" {
    playbook_file = "playbooks/harden.yml"
  }

  post-processor "manifest" {
    output = "manifest.json" # Save AMI ID for Terraform
  }
}
CI pipeline: Packer builds AMI → runs tests → publishes AMI ID → Terraform references latest AMI → instances launch immediately with all software pre-installed.
Q34. What is the difference between MTTR, MTBF, and MTTD?
| Metric | Full Name | Measures | Goal |
|---|---|---|---|
| MTTR | Mean Time to Recovery | Average time to restore service after failure | Minimize (faster recovery) |
| MTBF | Mean Time Between Failures | Average time between incidents | Maximize (more reliable) |
| MTTD | Mean Time to Detect | Average time from failure to detection | Minimize (better monitoring) |
| MTTF | Mean Time to Failure | Average time until a component fails | Maximize |
How to improve MTTR:
- Better runbooks (clear, tested procedures)
- Auto-remediation (Lambda triggered by CloudWatch alarm)
- Feature flags (disable problematic feature instantly)
- Rollback automation (Argo CD sync to previous revision on alert)
- PagerDuty escalation policies (right person paged immediately)
- Chaos engineering (practice incident response regularly)
How to improve MTTD:
- Comprehensive alerting (SLI-based alerts, not just infrastructure)
- Synthetic monitoring (actively probe from outside)
- Real user monitoring (RUM — detect user-impacting issues before alerts)
- Anomaly detection (ML-based: Amazon DevOps Guru, Datadog)
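These metrics fall straight out of incident timestamps. A minimal sketch using hypothetical incident data (failure, detection, and recovery times):

```python
from datetime import datetime

# Hypothetical incident log: (failed_at, detected_at, recovered_at)
incidents = [
    (datetime(2026, 1, 3, 10, 0), datetime(2026, 1, 3, 10, 4),
     datetime(2026, 1, 3, 10, 34)),
    (datetime(2026, 1, 17, 2, 0), datetime(2026, 1, 17, 2, 10),
     datetime(2026, 1, 17, 3, 0)),
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([det - fail for fail, det, _ in incidents])
mttr = mean_minutes([rec - fail for fail, _, rec in incidents])
# MTBF: gap between one recovery and the next failure
mtbf = mean_minutes([nxt[0] - cur[2] for cur, nxt in zip(incidents, incidents[1:])])
print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min, MTBF {mtbf / 60:.0f} h")
```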
Q35. What is Vault by HashiCorp? How does it manage secrets?
Core concepts:
- Secret Engines: Backends that store/generate secrets (KV, AWS IAM, database, PKI)
- Auth Methods: How clients authenticate (Kubernetes ServiceAccount, AWS IAM, GitHub)
- Policies: ACL rules controlling who can access which secrets
- Leases: Time-bound access — dynamic secrets expire and are revoked
Dynamic secrets (killer feature):
# Vault generates a short-lived AWS access key on demand
vault read aws/creds/my-role
# Key: AKIAIOSFODNN7EXAMPLE
# Secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Expires in: 1 hour
# After expiry: Vault automatically revokes it from AWS
In Kubernetes:
# Vault Agent sidecar annotations — inject secrets as files
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "payment-service"
  vault.hashicorp.com/agent-inject-secret-db: "secret/data/production/database"
  vault.hashicorp.com/agent-inject-template-db: |
    {{- with secret "secret/data/production/database" -}}
    DATABASE_URL=postgresql://{{ .Data.data.username }}:{{ .Data.data.password }}@db:5432/app
    {{- end }}
Advanced-Level DevOps Questions (Q36-Q50)
The Advanced section is where Rs 45+ LPA offers are won. These questions are asked for senior SRE and staff-level DevOps roles. If you can answer these confidently, you're in the top 5% of candidates.
Q36. How do you implement zero-trust networking in a DevOps context?
Implementation pillars:
- Service-to-service mTLS: Istio/Linkerd automatically issues certificates, enforces mutual authentication — even internal services verify each other
- Short-lived credentials: No long-term passwords; dynamic secrets from Vault, IAM roles
- Workload identity: SPIFFE/SPIRE assigns cryptographic identities to workloads (pods, VMs)
- Network micro-segmentation: K8s NetworkPolicies — explicit allow lists between services
- Device trust: BeyondCorp-style — employee machines verified before VPN
- Continuous authorization: Re-verify on every request, not just login
- Comprehensive audit logging: Every request logged with who, what, when, from where
# Istio PeerAuthentication — enforce mTLS cluster-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # Cluster-wide
spec:
  mtls:
    mode: STRICT  # All service-to-service traffic must use mTLS
Q37. Design a complete CI/CD pipeline for a microservices application.
Architecture:
Code changes
│
▼
GitHub (PR opened)
│
▼
GitHub Actions CI job:
├── Lint + unit tests
├── Build Docker image (BuildKit + cache)
├── Security scan (Trivy CRITICAL/HIGH block)
├── SAST scan (Semgrep/CodeQL)
├── Integration tests (docker-compose)
├── Push to ECR (only on PR merge to main)
└── Update image tag in config repo (GitOps)
│
▼
Config repo PR (automated)
│
▼
Platform team reviews (auto-approve for non-prod)
│
▼
Merge to config repo main branch
│
▼
Argo CD detects change → syncs staging cluster
│
▼
Staging deployment + smoke tests
│
▼
Manual approval (production)
│
▼
Argo CD syncs production cluster
├── Canary: 5% → 25% → 50% → 100% (Argo Rollouts)
└── Auto-rollback if error rate alarm fires
│
▼
PagerDuty alert if post-deploy health check fails
Key design decisions:
- OIDC for all cloud credentials (no stored keys)
- Separate application repo from config repo (GitOps)
- Security scanning gates are non-negotiable
- Canary with Prometheus-based analysis for production
Q38. How do you implement observability for a distributed system at scale?
Cardinality management (critical at scale): High-cardinality labels (user_id, request_id in Prometheus) cause memory exhaustion. Rules:
- Prometheus labels: Only low-cardinality (service, endpoint, status_code, region)
- High-cardinality data: Traces (Jaeger/Tempo), logs (Loki) — not metrics
Sampling strategy:
At 100M requests/day, storing every trace is prohibitively expensive. Two approaches:
- Head-based sampling: decision made when the request starts (e.g., keep a fixed 1% of all traces). Cheap, but the sampler cannot know in advance which traces will error or be slow.
- Tail-based sampling (OTel Collector tail_sampling processor): decision made after seeing the full trace — keep 100% of error traces, 100% of traces > 500ms latency, and 1% of the remaining successful traces. More accurate, but more resource-intensive at the collector.
Exemplars: Link metrics to traces — when a high-latency spike appears in Prometheus, exemplars provide the trace ID that caused it:
# Histogram with exemplar
REQUEST_LATENCY.observe(duration, exemplar={"traceID": current_trace_id})
SLO-based alerting (multi-window, multi-burn-rate):
# Alert fires if you're burning through 30-day error budget too fast
# Fast burn: 14.4x budget burn rate for 1h (critical)
# Slow burn: 3x budget burn rate for 6h (warning)
- alert: ErrorBudgetBurnTooFast
  expr: |
    (
      sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x the 0.1% error budget of a 99.9% SLO
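The 14.4x and 3x multipliers are not magic: at a constant burn rate, the whole window's error budget is consumed in window / burn_rate. A sketch of the arithmetic:

```python
def budget_exhaustion_hours(burn_rate: float, window_days: int = 30) -> float:
    """At a constant burn-rate multiple, the error budget for the whole
    window is consumed in window / burn_rate."""
    return window_days * 24 / burn_rate

# Alert threshold in the expression above: burn_rate * (1 - SLO)
fast_threshold = 14.4 * (1 - 0.999)  # ~0.0144 error ratio

print(budget_exhaustion_hours(14.4))  # 50.0 (hours: budget gone in ~2 days, page someone)
print(budget_exhaustion_hours(3.0))   # 240.0 (10 days: time to act, not to page)
```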
Q39. What is FinOps? How do DevOps engineers contribute to cloud cost optimization?
DevOps engineer's role in FinOps:
- Rightsizing: Use AWS Compute Optimizer recommendations — downsize over-provisioned instances
- Auto-scaling: Scale down outside business hours (scheduled scaling for stateless apps)
- Spot/Preemptible instances: 70-90% cheaper for fault-tolerant workloads (CI runners, batch, ML training)
- Resource tagging: Every resource tagged (environment, team, service, cost-center) for chargeback
- Delete zombie resources: Unused EIPs, forgotten Load Balancers, orphaned EBS volumes
- Savings Plans/Reserved Instances: Commit to base capacity for 30-66% savings
- Architecture optimization: Lambda > always-on EC2 for bursty workloads; Graviton > x86 for ~20% cost reduction
- S3 lifecycle policies: Auto-move old data to Glacier ($0.004/GB vs $0.023/GB)
Tooling:
- AWS Cost Explorer + Budgets: Alerts on spend anomalies
- Infracost: Show cost diff in Terraform PRs
- OpenCost / Kubecost: K8s cost visibility (cost per namespace, deployment, team)
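The lifecycle-policy numbers above translate directly into savings. A sketch using the per-GB prices quoted in the list (hypothetical 50 TB bucket, 80% cold data):

```python
# Per-GB monthly prices quoted above: S3 Standard vs S3 Glacier
S3_STANDARD, GLACIER = 0.023, 0.004

def monthly_savings(gb: float, cold_fraction: float) -> float:
    """Savings from lifecycle-transitioning the cold fraction to Glacier."""
    return gb * cold_fraction * (S3_STANDARD - GLACIER)

print(f"${monthly_savings(50_000, 0.8):,.0f}/month")  # $760/month
```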
Q40. Explain platform engineering vs. DevOps. What is an Internal Developer Platform (IDP)?
DevOps (early 2010s): Individual dev teams own their own CI/CD and infrastructure. "You build it, you run it."
Platform Engineering (2020s): A dedicated team builds an Internal Developer Platform — golden paths and self-service tools that abstract infrastructure complexity from application developers.
Internal Developer Platform (IDP) components:
| Component | Purpose | Example Tools |
|---|---|---|
| Self-service portal | Developers create environments/services via UI | Backstage (Spotify) |
| Golden path templates | Pre-approved service templates with best practices | Cookiecutter, Backstage Software Templates |
| CI/CD abstractions | Developers don't write raw GitHub Actions | Dagger, shared GitHub Actions libraries |
| Secret management | One-click secret rotation, dev access | Vault UI, External Secrets |
| Observability | Auto-configured monitoring per service | Grafana + auto-dashboards |
| Environment provisioning | Create dev environment in minutes | Crossplane, Argo CD |
Why it matters: At scale (500+ engineers), having each team own all of DevOps creates inconsistency, security gaps, and toil. Platform Engineering creates leverage — one platform team enables hundreds of product engineers.
Increasingly asked at senior/lead level interviews (2026)
Q41. What is Crossplane? How does it extend Kubernetes for infrastructure management?
# Provision an RDS PostgreSQL instance using Crossplane
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-postgres
spec:
  forProvider:
    region: ap-south-1
    dbInstanceClass: db.t3.medium
    masterUsername: admin
    engine: postgres
    engineVersion: "15.3"
    multiAZ: true
    skipFinalSnapshotBeforeDeletion: false
  writeConnectionSecretToRef:
    namespace: production
    name: postgres-credentials  # Crossplane stores connection details as a K8s Secret
Crossplane vs. Terraform:
- Crossplane is Kubernetes-native — leverages existing K8s RBAC, GitOps tools (Argo CD), and tooling
- Terraform is more mature, larger ecosystem, easier to use outside K8s
- Crossplane is better for organizations fully committed to Kubernetes and GitOps
Q42. What is Site Reliability Engineering (SRE)? How does it differ from DevOps?
| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Cultural movement | Google's implementation of DevOps principles |
| Focus | Collaboration, CI/CD, automation | Reliability, scalability, SLOs |
| Team structure | Embedded in product teams | Dedicated SRE teams (or embedded) |
| Key metrics | Deployment frequency, lead time | SLOs, error budget, MTTR, MTBF |
| Tooling | CI/CD, IaC, monitoring | All DevOps + capacity planning, load testing |
| Book | "The Phoenix Project" | "Site Reliability Engineering" (Google) |
SRE unique concepts:
- Toil: Manual, repetitive, automatable work that scales with service load. SREs target <50% toil time.
- Error budgets: Formalize the reliability vs. velocity trade-off. If error budget is depleted, new features freeze.
- Eliminating toil: Every manual task is a candidate for automation. If you do it twice, automate it.
- Postmortems: Blameless, written, shared across organization.
- Production readiness reviews (PRR): Checklist before launching new services (alerts configured? runbook exists? load tested?)
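The error-budget concept above is simple arithmetic. A sketch converting an SLO into allowed downtime per window:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as downtime: (1 - SLO) over the window."""
    return (1 - slo) * window_days * 24 * 60

print(f"{allowed_downtime_minutes(0.999):.1f} min")   # 43.2 min per 30 days
print(f"{allowed_downtime_minutes(0.9999):.2f} min")  # 4.32 min per 30 days
```

This is why "one more nine" is expensive: going from 99.9% to 99.99% cuts the budget tenfold.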
Critical distinction for Google, Amazon, Flipkart SRE positions
Q43. How do you design for reliability in a multi-region deployment?
Active-Active (both regions serve traffic):
                Users globally
                      │
        Route 53 latency-based routing
            /                    \
ap-south-1 (Mumbai)        ap-southeast-1 (Singapore)
├── EKS cluster            ├── EKS cluster
├── Aurora Global DB       ├── Aurora Read Replica
│   (primary writer)       │   (promotes to writer on failover)
└── ElastiCache            └── ElastiCache
Active-Passive (one region handles traffic, other on standby):
- Simpler, lower cost
- RTO: minutes (failover time)
- RPO: seconds (replication lag)
Key patterns for multi-region:
- Data consistency: Use CRDTs for eventually consistent data; avoid two-phase commit across regions
- Circuit breakers: Don't let a failing region cascade to healthy region
- Chaos engineering: Regularly simulate region failures (AWS FIS)
- DNS failover: Route 53 health checks auto-reroute on region failure
- Deployment: Deploy to one region, verify, then the second (sequenced deploys)
RTO/RPO targets:
- Tier 1 (payments, login): RTO <5 min, RPO ~0 (synchronous replication)
- Tier 2 (recommendations): RTO <30 min, RPO <5 min
- Tier 3 (analytics): RTO <4 hours, RPO <1 hour
Q44. What is supply chain security in DevOps? How do you implement SLSA?
SLSA (Supply-chain Levels for Software Artifacts) — a framework for supply chain integrity:
| Level | Requirements |
|---|---|
| SLSA 1 | Provenance exists (build logs available) |
| SLSA 2 | Provenance signed and hosted by build service |
| SLSA 3 | Source verified, build isolated, hardened build environment |
| SLSA 4 | Two-person review, hermetic reproducible builds |
Implementation:
# GitHub Actions with SLSA provenance (slsa-github-generator)
- name: Build Docker image
  id: build
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: myapp:${{ github.sha }}

# Note: the generator is a reusable workflow, so it is invoked from a
# separate job with a job-level `uses:`, not as a step:
#
# provenance:
#   needs: build
#   uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v1
#   with:
#     image: myapp
#     digest: ${{ needs.build.outputs.digest }}
Dependency security:
- Dependabot: Auto-PRs for dependency updates
- Renovate: More configurable alternative
- Snyk: Deep vulnerability scanning including transitive dependencies
- SBOM (CycloneDX/SPDX): Know every component in your software
Q45. How do you handle database migrations in CI/CD?
Safe migration patterns:
1. Expand-Contract (for zero downtime):
   - Phase 1 (Expand): Add the new column/table, keep the old one. Both old and new app code can run.
   - Phase 2 (Migrate): Backfill the new column; run the new app code.
   - Phase 3 (Contract): Remove the old column only after the old code is fully retired.
2. Blue-Green with schema sync:
   - The new schema must be backward compatible with both blue (old) and green (new) versions
   - Never drop columns in the same deploy that adds new ones
3. Tools:
   - Flyway/Liquibase: Version-controlled SQL migrations, run in the pipeline before app deploy
   - Atlas: Modern schema-as-code for multiple databases
   - Django/Rails migrations: Framework-native
# Kubernetes Job running migrations before deployment
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-{{ .Values.image.tag }}
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: wait-for-db
          image: busybox
          command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["python", "manage.py", "migrate", "--noinput"]
Helm pre-install/pre-upgrade hooks ensure migrations run and succeed before the new app version is deployed.
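The Expand-Contract phases can be walked through end to end. A sketch using SQLite and a hypothetical users table (splitting full_name into first_name/last_name):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users (full_name) VALUES ('Asha Rao')")

# Phase 1 (Expand): add new columns; old code keeps working, ignoring them
db.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
db.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# Phase 2 (Migrate): backfill while new code dual-writes both shapes
db.execute("""
    UPDATE users SET
      first_name = substr(full_name, 1, instr(full_name, ' ') - 1),
      last_name  = substr(full_name, instr(full_name, ' ') + 1)
""")

# Phase 3 (Contract): only after no deployed code reads full_name
# db.execute("ALTER TABLE users DROP COLUMN full_name")

print(db.execute("SELECT first_name, last_name FROM users").fetchone())
# ('Asha', 'Rao')
```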
Q46. What is Dagger? How does it improve CI/CD portability?
Problem it solves: CI configuration is fragmented across YAML files for each platform. Testing locally requires pushing to CI. Different behavior in CI vs. local.
# Dagger pipeline in Python — runs the same way locally and in GitHub Actions
import anyio
import dagger

async def build_and_test():
    async with dagger.Connection() as client:
        # Get source code
        source = client.host().directory(".", exclude=[".git", "node_modules"])

        # Build container
        node = (
            client.container()
            .from_("node:20-alpine")
            .with_directory("/src", source)
            .with_workdir("/src")
            .with_exec(["npm", "ci"])
        )

        # Run tests (reading stdout forces execution)
        test = await node.with_exec(["npm", "test"]).stdout()
        print(f"Tests: {test}")

        # Build and publish the image
        image = await node.with_exec(["npm", "run", "build"]).publish(
            "myregistry.com/myapp:latest"
        )
        print(f"Published: {image}")

anyio.run(build_and_test)
Run locally with python pipeline.py — same execution in CI via dagger run python pipeline.py.
Q47. How do you implement policy as code with OPA/Gatekeeper?
# Rego for a Gatekeeper ConstraintTemplate — enforce resource limits
package k8srequiredlimits

violation[{"msg": msg}] {
    container := input.review.object.spec.containers[_]
    not container.resources.limits.memory
    msg := sprintf("Container '%s' must have memory limits", [container.name])
}

violation[{"msg": msg}] {
    container := input.review.object.spec.containers[_]
    not container.resources.limits.cpu
    msg := sprintf("Container '%s' must have CPU limits", [container.name])
}
# Constraint — enforce in the production and staging namespaces
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLimits
metadata:
  name: require-resource-limits
spec:
  enforcementAction: deny  # or 'warn' for audit mode
  match:
    namespaces: ["production", "staging"]
OPA in CI/CD (non-K8s):
# Evaluate Terraform plan against policies before apply
terraform plan -out tfplan.binary
terraform show -json tfplan.binary > plan-output.json
# OPA policy check (--fail exits non-zero when the policy is not satisfied)
opa eval -d policies/ -i plan-output.json "data.terraform.allow" --fail
Conftest uses OPA policies to validate any YAML/JSON/Terraform/Dockerfile.
Q48. What is AIOps? How is AI being integrated into DevOps in 2026?
Current practical applications (2026):
| Use Case | Tool | How it works |
|---|---|---|
| Anomaly detection | Grafana ML, Datadog Watchdog | Baseline metrics, flag unusual deviations |
| Alert correlation | PagerDuty AIOps | Group related alerts, reduce alert noise |
| Root cause analysis | Amazon DevOps Guru | Identify unusual resource behavior patterns |
| Log analysis | Elastic ML, Grafana Loki ML | Cluster log patterns, detect new error types |
| Predictive scaling | AWS Auto Scaling predictive mode | Scale before traffic hits, not after |
| Incident resolution | Slack AI + runbook lookup | Suggest runbook steps from incident description |
| Code review | GitHub Copilot, CodeRabbit | Review PRs for security issues, performance |
Practical integration pattern:
# GitHub Actions with AI-assisted PR review
- name: CodeRabbit Review
  uses: coderabbitai/ai-pr-reviewer@latest
  with:
    openai_api_key: ${{ secrets.OPENAI_API_KEY }}
  # Reviews for: logic errors, security issues, performance, test coverage
AI doesn't replace DevOps engineers in 2026 — it handles toil (log analysis, alert grouping) so engineers focus on system design and reliability.
Q49. How do you implement disaster recovery? Explain RTO and RPO in practice.
DR Strategies (from cheapest to most expensive):
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Low | Restore from S3 backups |
| Pilot Light | 10-30 min | Minutes | Low-Medium | Core services always running, scale up on disaster |
| Warm Standby | Minutes | Seconds | Medium | Scaled-down version running in secondary region |
| Multi-Site Active-Active | Near zero | Near zero | High | Full production capacity in both regions |
Pilot Light example:
# Primary Region (ap-south-1) — full production
module "primary" {
  source         = "./modules/app-stack"  # hypothetical shared module
  instance_count = 10
  rds_instance   = "db.r6g.xlarge"
}

# DR Region (ap-southeast-1) — pilot light
module "dr" {
  source         = "./modules/app-stack"
  instance_count = 0                # ASG min=0, max=10
  rds_instance   = "db.t3.medium"   # Smaller RDS receiving replication
}
DR runbook steps (must be tested quarterly):
- Verify data replication is up-to-date
- Scale up DR region ASG
- Promote RDS read replica to primary
- Update Route 53 health check to point to DR region
- Verify application is serving traffic
- Communicate status to stakeholders
Test your DR plan: Untested DR plans fail when you need them most. Chaos engineering + game days simulate disasters on a schedule.
Q50. Describe your approach to building a developer platform from scratch at a 200-person engineering organization.
Phase 1 — Assess (Week 1-2):
- Interview 20+ engineers: biggest pain points, manual toils, blocked deployments
- Audit current state: How many unique CI/CD setups? How long do deployments take? DORA metrics baseline
- Identify top 3 pain points (usually: slow/flaky CI, inconsistent environments, opaque deployments)
Phase 2 — Golden Path (Month 1-3):
- Standardize on one CI/CD platform (GitHub Actions)
- Shared GitHub Actions library: reusable build/test/deploy workflows
- Opinionated service template (Cookiecutter + Backstage): generates a new service with CI/CD, monitoring, and security scanning pre-configured
- Target: new service from idea to first deployment < 1 day
Phase 3 — Self-Service (Month 3-6):
- Backstage portal: service catalog, create environments, view deployment status
- One-click staging environment provisioning
- Automated secret management onboarding
- Auto-configured Grafana dashboards per service
Phase 4 — Reliability (Month 6-12):
- SLO framework and tooling
- Chaos engineering program
- Production readiness checklist
- On-call tooling (PagerDuty + runbooks)
Measure success with DORA metrics:
- Deployment frequency: 1 deploy/week/team → 5/day/team
- Lead time: 3 days → 2 hours
- Change failure rate: 20% → 5%
- MTTR: 2 hours → 15 minutes
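Two of these DORA metrics can be computed directly from a deploy log. A minimal sketch with hypothetical deploy records:

```python
from datetime import date

# Hypothetical deploy log: (deploy date, change failed?)
deploys = [
    (date(2026, 1, 5), False), (date(2026, 1, 6), True),
    (date(2026, 1, 7), False), (date(2026, 1, 9), False),
    (date(2026, 1, 12), False),
]

days = (deploys[-1][0] - deploys[0][0]).days or 1
frequency = len(deploys) / days                            # deployment frequency
cfr = sum(failed for _, failed in deploys) / len(deploys)  # change failure rate
print(f"{frequency:.2f} deploys/day, CFR {cfr:.0%}")       # 0.71 deploys/day, CFR 20%
```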
FAQ Section — Straight Answers to Your DevOps Career Questions
Q: What's the difference between DevOps Engineer, SRE, and Platform Engineer? This confuses almost everyone, so here's the clear breakdown: DevOps Engineer focuses on CI/CD, automation, build/deploy tooling. SRE focuses on production reliability, SLOs, incident response, and on-call. Platform Engineer builds internal tools and golden paths for other engineers. Roles overlap significantly — job title often depends on company culture. Pro tip: Read the job description carefully. A "DevOps Engineer" role at a bank is very different from one at a startup.
Q: Is Kubernetes knowledge required for DevOps roles in 2026? Yes — it's effectively non-negotiable for product companies. Kubernetes is the production container orchestration standard. At minimum, understand deployments, services, ConfigMaps, RBAC, and basic troubleshooting. EKS/GKE-specific knowledge is a bonus. Check out our Kubernetes Interview Questions 2026 for a deep dive.
Q: Terraform vs. Pulumi in 2026 — which is gaining? Terraform remains dominant in market share. Pulumi is gaining traction at organizations with strong software engineering cultures (programmatic IaC using real languages vs. HCL). The OpenTofu fork (open-source Terraform) is growing after HashiCorp's BSL license change.
Q: What monitoring stack should I learn? The open-source stack: Prometheus + Grafana + Loki + Tempo + OpenTelemetry. This covers metrics, logs, and traces. Datadog is the dominant commercial alternative — learn if you're going to enterprise companies.
Q: Is Jenkins still relevant in 2026? Jenkins remains widely deployed in enterprises and has a huge plugin ecosystem. GitHub Actions, GitLab CI, and CircleCI are preferred at modern product companies. If you're applying to enterprise/bank/large IT — know Jenkins. Startups: GitHub Actions.
Q: What certifications should a DevOps engineer get? Tier 1: AWS Solutions Architect Associate + CKA (Certified Kubernetes Administrator). Tier 2: HashiCorp Terraform Associate, Prometheus Certified Associate. Tier 3: CKS (Security), AWS DevOps Professional, Google Cloud Professional DevOps Engineer.
Q: How important is Python/scripting for DevOps? Extremely important. Bash for one-liners and simple scripts. Python for anything more complex (AWS boto3, custom tooling, Ansible modules). Go is increasingly common for writing K8s operators and CLI tools. Know at least Python + Bash.
Q: What is the DevOps salary range in India in 2026? Here are the verified numbers from real offers: Junior DevOps (0-2 yrs): Rs 8-15 LPA. Mid-level (3-5 yrs, K8s + AWS + CI/CD): Rs 18-40 LPA. Senior/SRE (7+ yrs, system design, architecture): Rs 45-90 LPA. Principal/Staff: Rs 80 LPA-1.5 Cr at top product companies. The jump from mid-level to senior is massive — and it's driven by system design and incident management skills, not just tool knowledge.
Build the complete DevOps & infrastructure interview toolkit:
- AWS Interview Questions 2026 — Master the #1 cloud platform
- Kubernetes Interview Questions 2026 — Container orchestration deep dive
- Docker Interview Questions 2026 — Container fundamentals
- System Design Interview Questions 2026 — Design scalable distributed systems
- Microservices Interview Questions 2026 — Distributed application architecture
- Data Engineering Interview Questions 2026 — Data pipelines and infrastructure