PapersAdda

DevOps Interview Questions 2026 — Top 50 with Expert Answers

Last Updated: 30 Mar 2026

Elite DevOps teams deploy to production multiple times per day with a change failure rate under 5%. That's the bar companies are hiring for in 2026. DevOps has evolved from a cultural philosophy into a concrete set of engineering practices — and companies expect you to command CI/CD pipelines, infrastructure as code, observability, incident response, and reliability engineering at an expert level. This guide covers 50 real questions asked at product companies like Razorpay, Swiggy, Flipkart, and global FAANG firms, organized by difficulty with the exact answers that get offers.

DevOps/SRE is one of the fastest-growing career paths in India, with senior roles commanding Rs 45-90 LPA at top product companies. The skills gap is real — master these 50 questions and you're ahead of 90% of candidates.

Related: AWS Interview Questions 2026 | Kubernetes Interview Questions 2026 | Docker Interview Questions 2026 | System Design Interview Questions 2026


Beginner-Level DevOps Questions (Q1-Q15)

Even if you're a senior engineer, don't skip these. Interviewers at Razorpay and Flipkart use beginner questions to test whether you truly understand the "why" behind DevOps — not just the tools.

Q1. What is DevOps? How does it differ from traditional IT operations?

DevOps is a culture and set of engineering practices that unifies software development (Dev) and IT operations (Ops) — emphasizing automation, shared ownership, and fast feedback loops so teams ship changes quickly and reliably.

Traditional IT vs. DevOps:

| Aspect | Traditional IT | DevOps |
|---|---|---|
| Dev/Ops relationship | Siloed teams with handoffs | Shared ownership, shared goals |
| Deployment frequency | Quarterly or monthly releases | Multiple times per day |
| Deployment method | Manual, scripted | Automated CI/CD pipelines |
| Infrastructure | Pet servers (named, cared for) | Cattle (identical, replaceable) |
| Failure response | Blame-driven RCAs | Blameless post-mortems, SRE principles |
| Feedback loop | Months | Minutes to hours |
| Rollback | Manual, risky | Automated, safe |

DORA Metrics (DevOps Research and Assessment) measure DevOps performance:

  1. Deployment frequency: How often you deploy to production
  2. Lead time for changes: Code commit to production
  3. Change failure rate: % of deployments causing incidents
  4. Time to restore service: MTTR after incident

Elite performers (2024 DORA report): Deploy on-demand (multiple times/day), lead time <1 day, CFR <5%, MTTR <1 hour.
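The four metrics above can be computed directly from deployment records; a minimal Python sketch (the record shape and the values are hypothetical, not from any DORA tooling):

```python
from datetime import datetime, timedelta

# Hypothetical records: (deployed_at, commit_at, caused_incident, restore_minutes)
deploys = [
    (datetime(2026, 3, 1, 10), datetime(2026, 2, 28, 16), False, 0),
    (datetime(2026, 3, 1, 15), datetime(2026, 3, 1, 11), True, 42),
    (datetime(2026, 3, 2, 9),  datetime(2026, 3, 1, 18), False, 0),
    (datetime(2026, 3, 2, 14), datetime(2026, 3, 2, 10), False, 0),
]

days = (deploys[-1][0] - deploys[0][0]).days + 1
deployment_frequency = len(deploys) / days                      # deploys per day
lead_times = [d - c for d, c, _, _ in deploys]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)  # commit -> production
change_failure_rate = sum(1 for *_, f, _ in deploys if f) / len(deploys)
restores = [m for *_, f, m in deploys if f]
mttr_minutes = sum(restores) / len(restores) if restores else 0.0

print(deployment_frequency)   # 2.0 deploys/day
print(change_failure_rate)    # 0.25
print(mttr_minutes)           # 42.0
```

With this sample window, the team deploys twice a day with a 25% change failure rate — elite frequency, but CFR well above the <5% elite bar.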


Q2. What is CI/CD? Explain each stage.

Continuous Integration (CI): Every code commit triggers automated build and test — developers merge frequently, preventing integration hell.

Continuous Delivery (CD): Every passing build is automatically deployed to staging. Human approval gates production deployment.

Continuous Deployment: Every passing build is automatically deployed to production — no human gates.

Full CI/CD pipeline stages:

Developer commits code
    │
    ▼
1. Source Control (Git) — PR created, branch policies enforced
    │
    ▼
2. CI Trigger — webhook fires pipeline
    │
    ▼
3. Build
   ├── Compile / install dependencies
   ├── Run unit tests
   ├── Static code analysis (SonarQube, ESLint)
   └── Security scan (SAST — Semgrep, CodeQL)
    │
    ▼
4. Test
   ├── Integration tests
   ├── Contract tests (Pact)
   └── Vulnerability scan (Trivy on Docker image)
    │
    ▼
5. Artifact
   ├── Build Docker image
   └── Push to registry (ECR, GCR)
    │
    ▼
6. Deploy to Staging
   └── Smoke tests / synthetic monitoring
    │
    ▼
7. [Approval gate] — automated or manual
    │
    ▼
8. Deploy to Production
   ├── Blue/green or canary deployment
   └── Post-deploy health checks
    │
    ▼
9. Monitor
   └── Alert if error rate spikes → auto-rollback or PagerDuty alert

Q3. What is Infrastructure as Code (IaC)? Why is it important?

Infrastructure as Code (IaC) defines infrastructure (servers, networks, databases) in versioned, machine-readable configuration files that are applied automatically — instead of manual console changes or ad-hoc scripts.

Benefits:

| Benefit | Explanation |
|---|---|
| Reproducibility | Same code creates identical environments (no "works on my machine") |
| Version control | Infrastructure changes tracked in Git — who changed what, when, why |
| Peer review | Infrastructure changes reviewed via pull requests |
| Automation | Environments created in minutes, not weeks |
| Drift detection | Know when actual state diverges from desired state |
| Disaster recovery | Re-create entire environment from code |
| Cost control | Spin down dev environments on weekends (schedule destroy) |

Popular IaC tools:

| Tool | Type | Best for |
|---|---|---|
| Terraform | Declarative, multi-cloud | General purpose, most popular |
| AWS CloudFormation | Declarative, AWS-only | Native AWS integration |
| AWS CDK | Programmatic IaC (TypeScript, Python) | Developers who prefer real languages |
| Pulumi | Programmatic IaC (any language) | Multi-cloud with programming constructs |
| Ansible | Imperative, configuration management | OS config, application deployment |
| Packer | Image builder | AMI, GCP image creation |

Q4. Explain Terraform workflow — init, plan, apply, destroy.

# 1. terraform init
# Downloads provider plugins, initializes backend (remote state)
terraform init

# 2. terraform plan
# Shows what will be created/modified/destroyed (dry run)
# ALWAYS review this before apply
terraform plan -out=tfplan

# 3. terraform apply
# Applies the planned changes
terraform apply tfplan
# Or interactively:
terraform apply  # Shows plan, prompts for "yes"

# 4. terraform destroy
# Destroys all resources in the state
terraform destroy  # Prompts for "yes"
terraform destroy -target=aws_instance.web  # Destroy specific resource

State file (terraform.tfstate): Terraform tracks the mapping between your config and real infrastructure in a state file. In teams, state is stored remotely (S3 + DynamoDB lock for AWS):

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

DynamoDB prevents concurrent applies (state locking). Never commit state files to Git — they contain sensitive data.

Asked at Flipkart, Razorpay, PhonePe infrastructure interviews


Q5. What is the difference between Ansible, Terraform, and Chef/Puppet?

| Tool | Category | Approach | State | Language | Best For |
|---|---|---|---|---|---|
| Terraform | IaC (provisioning) | Declarative | Yes (tfstate) | HCL | Cloud resource provisioning |
| Ansible | Configuration management | Imperative (playbooks) | Stateless | YAML | OS config, app deployment, ad-hoc tasks |
| Chef | Configuration management | Imperative (recipes) | Server (Chef Server) | Ruby | Traditional CM, complex configs |
| Puppet | Configuration management | Declarative | Server (PuppetDB) | Puppet DSL | Enterprise config management |
| Pulumi | IaC (provisioning) | Programmatic | Yes (backend) | Any language | Complex IaC requiring programming logic |

Common pattern: Terraform provisions the servers (EC2, RDS, VPC); Ansible configures them (install packages, deploy app, set up monitoring). Terraform handles "what infrastructure exists"; Ansible handles "what's installed on the servers."


Q6. What is GitHub Actions? Write a simple CI workflow.

# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.11', '3.12']

    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}

    - name: Cache pip
      uses: actions/cache@v3
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('requirements*.txt') }}

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install -r requirements-dev.txt

    - name: Lint
      run: ruff check . --output-format=github

    - name: Test
      run: pytest tests/ --cov=app --cov-report=xml -v

    - name: Upload coverage
      uses: codecov/codecov-action@v4
      with:
        token: ${{ secrets.CODECOV_TOKEN }}

  security:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: SAST scan
      uses: returntocorp/semgrep-action@v1

Q7. What is a Jenkins pipeline? What is the difference between Declarative and Scripted pipeline?

Declarative pipeline (recommended — structured, less Groovy knowledge needed):

// Jenkinsfile
pipeline {
    agent any

    environment {
        DOCKER_REGISTRY = 'myregistry.com'
        IMAGE_NAME = 'myapp'
    }

    stages {
        stage('Checkout') {
            steps {
                git branch: 'main', url: 'https://github.com/myorg/myapp.git'
            }
        }

        stage('Build & Test') {
            steps {
                sh 'mvn clean test'
            }
            post {
                always {
                    junit 'target/surefire-reports/*.xml'
                }
            }
        }

        stage('Docker Build') {
            steps {
                script {
                    docker.build("${DOCKER_REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                }
            }
        }

        stage('Deploy to Staging') {
            when {
                branch 'main'
            }
            steps {
                sh './deploy.sh staging'
            }
        }
    }

    post {
        failure {
            slackSend(color: 'danger', message: "Build FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}")
        }
    }
}

Scripted pipeline: Pure Groovy inside node {} blocks. More flexible but more complex, harder to read. Use Declarative unless you need complex Groovy logic.


Q8. What is the difference between Git merge, rebase, and cherry-pick?

| Operation | What it does | Creates merge commit? | Rewrites history? |
|---|---|---|---|
| merge | Combines branches, preserves history | Yes (unless fast-forward) | No |
| rebase | Replays commits onto another branch tip | No | Yes — new commit SHAs |
| cherry-pick | Applies a specific commit to current branch | No | Yes — new commit SHA |
| squash merge | Combines all branch commits into one | One new commit | Yes |

# Merge feature into main (preserves feature history)
git checkout main && git merge feature/payment

# Rebase feature onto main (cleaner linear history)
git checkout feature/payment && git rebase main

# Cherry-pick a hotfix to multiple release branches
git cherry-pick abc1234

# Interactive rebase — squash last 3 commits
git rebase -i HEAD~3

Golden rule: Never rebase shared/public branches (main, develop). Only rebase local feature branches before merging.


Q9. What is the purpose of a staging environment? What makes a good staging setup?

Characteristics of a good staging environment:

  1. Production parity: Same infrastructure configuration (instance sizes can differ, but architecture must match)
  2. Real data: Anonymized production data dump — tests realistic data volumes, not 10 rows
  3. Isolated: No shared services with production (separate DB, separate queues)
  4. Continuously deployed: Every merge to main auto-deploys to staging
  5. Monitored: Same monitoring stack as production (so you catch monitoring gaps)
  6. External service stubs: Payment gateways (Razorpay test mode), SMS providers (mock)

Common staging failures: Using smaller instance types → misses memory/CPU issues. Shared database with prod → staging deploy breaks prod. Fake/sparse data → doesn't test realistic query performance.


Q10. What is Prometheus? How does it collect metrics?

Architecture:

Applications (/metrics endpoint in Prometheus format)
    ↑ scrape (HTTP GET /metrics)
Prometheus Server
    ├── TSDB (time-series database, local disk)
    ├── Rules evaluation (recording + alerting rules)
    └── Alertmanager → PagerDuty, Slack, OpsGenie
    ↑ query (PromQL)
Grafana

Metric types:

  • Counter: Only goes up (HTTP requests total, errors total)
  • Gauge: Can go up or down (current memory usage, queue depth)
  • Histogram: Distribution of observations (request latency buckets, response sizes)
  • Summary: Similar to histogram, but calculates quantiles client-side

# Python (Flask) app exposing Prometheus metrics
from flask import Flask
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)
start_http_server(8000)  # serves /metrics on port 8000 for Prometheus to scrape

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['endpoint'])

@app.route('/api/orders')
def get_orders():
    with REQUEST_LATENCY.labels(endpoint='/api/orders').time():
        result = db.query_orders()  # db = the application's data-access layer
        REQUEST_COUNT.labels(method='GET', endpoint='/api/orders', status=200).inc()
        return result

Q11. What is Grafana? How does it integrate with Prometheus?

Prometheus + Grafana integration:

  1. Add Prometheus as a data source in Grafana (URL: http://prometheus:9090)
  2. Use PromQL queries in Grafana panels:
# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

Good dashboards follow RED method:

  • Rate: Requests per second
  • Errors: Error rate (4xx/5xx)
  • Duration: Latency distribution (p50, p90, p99)

Grafana's Explore feature allows ad-hoc metric queries during incident investigation without modifying dashboards. Grafana OnCall integrates alert routing directly.


Q12. What is an SLI, SLO, and SLA?

| Term | Full Form | Definition | Owner | Example |
|---|---|---|---|---|
| SLI | Service Level Indicator | A metric that measures service behavior | Engineering | "99.5% of requests complete in <200ms" |
| SLO | Service Level Objective | A target value or range for an SLI | Engineering + Product | "SLI must be ≥ 99.5% over 30 days" |
| SLA | Service Level Agreement | A business contract with consequences | Legal + Business | "99.9% uptime, or 10% credit for each 0.1% below" |

Relationship: SLA ≥ SLO ≥ actual performance. SLOs are internal targets (usually more strict than SLAs to leave buffer). SLIs are the measurements.

Error budget: 100% minus SLO. For 99.9% SLO: 0.1% error budget = 43.8 minutes/month of allowed downtime. If you're within budget, you can deploy new features. If you've exceeded budget, all deployments freeze until next period.
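The error-budget arithmetic can be sketched in one function; the only assumption is the window length (an average calendar month of ~43,830 minutes gives the 43.8 figure, a strict 30-day window gives 43.2):

```python
# Error budget = (1 - SLO) × window length in minutes.
def error_budget_minutes(slo: float, window_minutes: float = 43_830) -> float:
    """Allowed downtime for a given SLO over the window."""
    return (1 - slo) * window_minutes

print(round(error_budget_minutes(0.999), 1))                # ~43.8 min/avg month for 99.9%
print(round(error_budget_minutes(0.999, 30 * 24 * 60), 1))  # 43.2 min for a strict 30-day window
print(round(error_budget_minutes(0.99), 1))                 # ~438.3 min for a 99% SLO
```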

Good SLIs focus on what users experience:

  • Availability: % of successful requests
  • Latency: % of requests below threshold
  • Throughput: Operations per second
  • Error rate: % of failed requests

Critical concept at SRE interviews (Google, Amazon, Flipkart SRE)


Q13. What is the difference between monitoring, observability, and alerting?

| Concept | Definition | Tools |
|---|---|---|
| Monitoring | Collecting and displaying predefined metrics | Prometheus, CloudWatch, Datadog |
| Observability | Ability to understand system internal state from external outputs. 3 pillars: metrics, logs, traces | Prometheus + Loki + Tempo (Grafana OSS) |
| Alerting | Notifying humans when metrics breach thresholds | Alertmanager, PagerDuty, OpsGenie |

Monitoring vs. Observability: Monitoring asks "Is this thing healthy?" (yes/no). Observability asks "WHY is this broken?" You need distributed tracing and structured logs to debug non-obvious failures across microservices.

The three pillars:

  1. Metrics: Numeric aggregations over time (fast, cheap, limited dimensionality)
  2. Logs: Timestamped text records (rich context, expensive to store and query at scale)
  3. Traces: End-to-end request path across services (shows bottlenecks, latency contributions)
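For the logs pillar, a common first step is emitting one JSON object per line to stdout so the aggregator (Loki, Elasticsearch) can parse fields without regex. A minimal standard-library sketch — the `trace_id`/`user_id` field names are illustrative:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # fields attached via logger.info(..., extra={...})
            **{k: v for k, v in record.__dict__.items() if k in ("trace_id", "user_id")},
        })

handler = logging.StreamHandler(sys.stdout)  # 12-factor: logs go to stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created", extra={"trace_id": "abc123", "user_id": 42})
```

Carrying the same `trace_id` in both logs and traces is what lets you pivot between the pillars during an incident.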

OpenTelemetry (OTel) is the emerging standard for instrumentation — one SDK for all three signals, vendor-neutral.


Q14. What is Incident Management? Describe a good incident response process.

Incident severity levels:

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| P0/SEV1 | Total outage, all users affected | Immediate, 24/7 | Payment processing down |
| P1/SEV2 | Major feature broken, large user impact | <15 minutes | Login failure for 50% of users |
| P2/SEV3 | Significant degradation, some users affected | <1 hour | Checkout slow (p99 >5s) |
| P3/SEV4 | Minor issue, small user impact | <4 hours | Minor UI bug |

Incident response process:

  1. Detect: Automated alert fires (Alertmanager → PagerDuty → on-call engineer)
  2. Acknowledge: On-call acknowledges within SLA (prevents escalation)
  3. Assemble: Incident commander paged for SEV1/2; coordinates responders
  4. Investigate: Identify blast radius — what's broken, how many users affected
  5. Mitigate: Minimize impact first (rollback, feature flag disable, capacity increase)
  6. Resolve: Permanent fix (may come later; mitigation is sufficient to close incident)
  7. Review: Blameless post-mortem within 48 hours for SEV1/2

Incident channels: Dedicated Slack channel per incident (#incident-2026-03-30-payment), Zoom bridge for coordination, PagerDuty status page updates.


Q15. What is a blameless post-mortem?

Why blameless? If engineers fear punishment for mistakes, they hide problems, don't take risks, and don't report near-misses. Google's SRE book established blameless culture as foundational to reliability.

Post-mortem structure:

  1. Summary: 2-3 sentence incident description
  2. Impact: Duration, affected users/services, business impact (revenue, support tickets)
  3. Timeline: Chronological events (when detected, key decisions, mitigation, resolution)
  4. Root cause: 5 Whys analysis — the actual technical cause
  5. Contributing factors: System fragility, process gaps, monitoring blind spots
  6. Action items: Concrete improvements with owners and due dates
  7. Lessons learned: What went well, what went poorly

Action items must be specific:

  • Bad: "Improve monitoring"
  • Good: "Add PagerDuty alert when payment service error rate exceeds 1% for 5 minutes — owner: @alice, due: 2026-04-15"

Intermediate-Level DevOps Questions (Q16-Q35)

This is where Razorpay, Flipkart, and Swiggy interviews get serious. These questions test whether you've actually operated production systems or just read about them.

Q16. Write a complete Terraform module for an AWS VPC.

# modules/vpc/main.tf
variable "environment" { type = string }
variable "azs"         { type = list(string) }

variable "vpc_cidr" {
  # HCL does not allow ';' separators — multi-argument blocks need one argument per line
  type    = string
  default = "10.0.0.0/16"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "${var.environment}-igw" }
}

resource "aws_subnet" "public" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone = var.azs[count.index]
  map_public_ip_on_launch = true
  tags = { Name = "${var.environment}-public-${var.azs[count.index]}" }
}

resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
  availability_zone = var.azs[count.index]
  tags = { Name = "${var.environment}-private-${var.azs[count.index]}" }
}

resource "aws_eip" "nat" {
  count  = length(var.azs)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(var.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]
  tags = { Name = "${var.environment}-nat-${var.azs[count.index]}" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  tags = { Name = "${var.environment}-public-rt" }
}

resource "aws_route_table_association" "public" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# Private subnets egress through the per-AZ NAT gateways
resource "aws_route_table" "private" {
  count  = length(var.azs)
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
  tags = { Name = "${var.environment}-private-rt-${var.azs[count.index]}" }
}

resource "aws_route_table_association" "private" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

output "vpc_id"             { value = aws_vpc.main.id }
output "public_subnet_ids"  { value = aws_subnet.public[*].id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }

Q17. What is Terraform state? How do you handle state in a team?

Problems with local state in teams:

  1. Multiple people run apply simultaneously → state corruption
  2. State file on one person's laptop → team blocked if they're unavailable
  3. State in Git → security risk (state contains resource attributes including secrets in plaintext)

Remote state with locking:

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/app/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-locks"    # State locking
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ap-south-1:..."  # Encrypt state
  }
}

State management commands:

# Import existing resource into state
terraform import aws_instance.web i-1234567890abcdef0

# Move resource in state (refactoring)
terraform state mv aws_instance.web module.compute.aws_instance.web

# Remove resource from state (without destroying)
terraform state rm aws_s3_bucket.old_bucket

# View current state
terraform state list
terraform state show aws_instance.web

Workspace per environment:

terraform workspace new staging
terraform workspace select production
terraform apply -var-file=production.tfvars

Deep-dive question at senior DevOps and platform engineer interviews


Q18. How do you implement Terraform for multiple environments (dev/staging/prod)?

Pattern 1 — Workspaces (simple, not recommended for large teams):

terraform workspace new dev && terraform apply
terraform workspace new prod && terraform apply

Problem: Workspaces give each environment its own state, but all environments share one set of code — there is no clean place for environment-specific configuration.

Pattern 2 — Directory per environment (recommended):

infrastructure/
├── modules/
│   ├── vpc/
│   ├── eks/
│   └── rds/
├── environments/
│   ├── dev/
│   │   ├── main.tf   # uses modules, dev-specific values
│   │   ├── vars.tf
│   │   └── backend.tf
│   ├── staging/
│   └── production/

Pattern 3 — Terragrunt (DRY across environments):

# terragrunt.hcl in each environment directory
terraform {
  source = "../../modules//eks"
}

inputs = {
  cluster_version = "1.29"
  node_count      = local.env == "production" ? 5 : 2
}

# Remote state automatically configured per environment
remote_state {
  backend = "s3"
  config = {
    bucket = "tf-state-${local.env}"
    key    = "${path_relative_to_include()}/terraform.tfstate"
  }
}

Terragrunt handles the DRY (Don't Repeat Yourself) problem — one module definition, environment-specific overrides without copying Terraform code.


Q19. What is GitOps? How does it improve deployment reliability?

GitOps principles:

  1. Declarative: All config described in Git (not "run this script")
  2. Versioned and immutable: Git history is the audit log
  3. Pulled automatically: A GitOps agent (Argo CD, Flux) continuously reconciles cluster to Git state
  4. Continuously reconciled: Drift detected and auto-corrected

Reliability benefits:

  • Rollback = git revert — no "how do I undo that kubectl command"
  • Audit trail: Every change has a commit, PR, reviewer
  • No "configuration drift" — agent reverts manual changes
  • Disaster recovery: Re-sync from Git rebuilds entire cluster state

GitOps flow:

Developer writes code
    → PR to application repo
    → CI builds Docker image, pushes to registry
    → CI updates image tag in config repo (separate repo or same)
    → PR to config repo with new image tag
    → PR review + approval
    → Merge to main
    → Argo CD detects change → syncs cluster
    → Deployment rolls out

Q20. What is blue-green deployment vs. canary deployment vs. rolling deployment?

| Strategy | Description | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Blue-Green | Two identical environments; switch traffic | Instant (flip DNS/LB) | 2x (both envs running) | Low (instant switch) |
| Canary | Gradually shift traffic % to new version | Fast (shift back to 0%) | Low (small canary) | Low (limited blast radius) |
| Rolling | Replace instances one at a time | Slow (new rollout) | None extra | Medium (mixed versions) |
| Recreate | Kill all old, start all new | Fast (redeploy old) | None extra | High (downtime) |

Canary decision criteria: Monitor error rate, latency, custom business metrics on the canary. If all good → increase weight. If bad → rollback automatically.
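The canary loop can be sketched as a decision function plus a weight ladder; the thresholds and traffic steps below are illustrative defaults, not values from Argo Rollouts or Flagger:

```python
WEIGHTS = [5, 25, 50, 100]  # canary traffic % steps

def canary_decision(error_rate: float, p99_latency_s: float,
                    max_error_rate: float = 0.01, max_p99_s: float = 2.0) -> str:
    """Compare canary metrics against thresholds: promote or roll back."""
    if error_rate > max_error_rate or p99_latency_s > max_p99_s:
        return "rollback"   # shift traffic back to 0%; stable version takes 100%
    return "promote"        # safe to move to the next traffic step

def next_weight(current: int, healthy: bool) -> int:
    """Next canary traffic weight, or 0 on rollback."""
    if not healthy:
        return 0
    idx = WEIGHTS.index(current)
    return WEIGHTS[min(idx + 1, len(WEIGHTS) - 1)]

print(canary_decision(0.002, 0.9))    # promote
print(canary_decision(0.05, 0.9))     # rollback
print(next_weight(25, healthy=True))  # 50
```

In practice the metrics would come from Prometheus queries over the canary pods, and the weight change would be applied by the ingress or service mesh.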

Tools:

  • AWS CodeDeploy: Blue-green for Lambda + ECS
  • Argo Rollouts: Canary + blue-green for K8s with Prometheus-based auto-rollback
  • Istio/Flagger: Traffic shifting with service mesh
  • Feature flags (LaunchDarkly, Unleash): Canary at the application layer, not infra layer

Asked at Flipkart, Swiggy, Zomato deployment strategy questions


Q21. How do you write Prometheus alerting rules?

# prometheus-rules.yml
groups:
- name: application-alerts
  interval: 1m
  rules:

  # Alert if error rate > 1% for 5 minutes
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
      sum(rate(http_requests_total[5m])) by (service)
      > 0.01
    for: 5m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "High error rate for {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
      runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

  # Alert if p99 latency > 2 seconds
  - alert: HighLatency
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
      ) > 2
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "P99 latency > 2s for {{ $labels.service }}"

  # Node memory pressure
  - alert: NodeMemoryHighUsage
    expr: |
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
    for: 10m
    labels:
      severity: warning

Alertmanager routing:

# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-slack'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
  - match:
      team: platform
    receiver: 'platform-slack'

Q22. What is the ELK Stack vs. the Loki stack for logging?

| Feature | ELK Stack | Loki Stack |
|---|---|---|
| Components | Elasticsearch + Logstash + Kibana | Loki + Promtail/FluentBit + Grafana |
| Storage model | Indexed full-text search | Index only labels, store log lines compressed |
| Query language | Lucene/KQL | LogQL (similar to PromQL) |
| Resource usage | High (Elasticsearch is heavy) | Low (Loki is lightweight) |
| Cost | Higher storage + compute | Much lower (10x cheaper at scale) |
| Full-text search | Excellent | Limited (labels-based) |
| Grafana integration | Yes (but separate) | Native (both Grafana projects) |
| Best for | Complex search, compliance | Cloud-native, cost-sensitive |

LogQL example (Loki):

# Show all error logs from the payment service in last 1 hour
{namespace="production", app="payment-service"} |= "ERROR"

# Parse JSON logs and filter
{app="api"} | json | status_code >= 500

# Rate of error logs
rate({app="api"} |= "ERROR" [5m])

For most Kubernetes deployments in 2026, Loki is the preferred choice due to cost efficiency and native Grafana integration.


Q23. What is chaos engineering? How do you implement it?

The process:

  1. Define a "steady state" (baseline metrics — error rate, latency, throughput)
  2. Hypothesize that steady state continues during the experiment
  3. Inject failures: kill nodes, increase latency, inject CPU pressure, drop packets
  4. Observe: does the system maintain steady state?
  5. Fix weaknesses discovered

Tools:

| Tool | What you can inject |
|---|---|
| AWS Fault Injection Service (FIS) | EC2 stop/terminate, AZ outage, CPU/memory stress, API throttling |
| Chaos Monkey (Netflix) | Random EC2 termination |
| LitmusChaos (CNCF) | K8s pod kill, network latency, disk IO, DNS chaos |
| Chaos Toolkit | Multi-platform, extensible |
| Gremlin | Commercial, comprehensive blast radius control |

AWS FIS example:

{
  "targets": {
    "eks-nodes": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"eks:nodegroup-name": "production-workers"},
      "selectionMode": "PERCENT(33)"
    }
  },
  "actions": {
    "terminate-eks-nodes": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": {"Instances": "eks-nodes"},
      "parameters": {}
    }
  },
  "stopConditions": [
    {"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:...PaymentErrorAlarm"}
  ]
}

The stopConditions are critical — if your alarm fires, the experiment auto-stops to minimize damage.

Asked at Google SRE, Amazon SRE, Flipkart platform interviews


Q24. What is Helm in the context of CI/CD? How do you deploy with Helm in a pipeline?

# GitHub Actions deployment job using Helm
deploy-production:
  runs-on: ubuntu-latest
  needs: [test, build]
  environment: production  # Requires manual approval in GitHub

  steps:
  - name: Checkout
    uses: actions/checkout@v4

  - name: Configure AWS credentials
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
      aws-region: ap-south-1

  - name: Update kubeconfig for EKS
    run: aws eks update-kubeconfig --name production-cluster --region ap-south-1

  - name: Helm deploy
    run: |
      helm upgrade --install myapp ./helm/myapp \
        --namespace production \
        --create-namespace \
        --values helm/myapp/values.yaml \
        --values helm/myapp/values-production.yaml \
        --set image.tag=${{ github.sha }} \
        --set deployment.replicas=5 \
        --atomic \
        --timeout 10m \
        --history-max 5

  - name: Verify deployment
    run: |
      kubectl rollout status deployment/myapp -n production --timeout=5m
      kubectl get pods -n production -l app=myapp

--atomic: If upgrade fails (health checks, readiness), automatically roll back to previous release. --history-max 5: Keep only 5 Helm release history entries.


Q25. What is ArgoCD? How does it implement GitOps?

Core concepts:

  • Application: Maps a Git repo + path to a K8s cluster + namespace
  • App of Apps: One Application that deploys all other Applications (cluster bootstrapping)
  • Sync: Process of making cluster state match Git state
  • Health: Argo CD checks resource health (Deployment fully rolled out, Service has endpoints)

# Application CRD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    targetRevision: HEAD
    path: services/payment-service
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # Delete resources removed from Git
      selfHeal: true   # Revert manual changes
    syncOptions:
    - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Argo CD's UI provides a visual representation of every deployed resource with health status, sync status, and diff view showing what changed.


Q26. How do you implement secrets management in CI/CD pipelines?

GitHub Actions secrets:

# Secrets stored in GitHub's encrypted store, injected as env vars
steps:
- name: Deploy
  env:
    DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }}
    API_KEY: ${{ secrets.PAYMENT_API_KEY }}
  run: ./deploy.sh

Best practices:

  1. Environment-scoped secrets: Separate secrets for dev/staging/prod. Production secrets require environment approval.
  2. OIDC for cloud credentials: Use GitHub OIDC → AWS/GCP role assumption. No stored cloud credentials at all.
  3. HashiCorp Vault for runtime secrets: CI pipeline retrieves runtime secrets from Vault using a short-lived token.
  4. Rotate regularly: Automate secret rotation (AWS Secrets Manager auto-rotation).
  5. Audit access: Log every secret access — who retrieved what, when.

# OIDC-based secret fetching (no stored credentials)
- name: Configure AWS via OIDC
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789:role/github-deploy-role
    aws-region: ap-south-1

# Now fetch secrets from AWS Secrets Manager
- name: Fetch secrets
  run: |
    aws secretsmanager get-secret-value \
      --secret-id production/myapp/database \
      --query SecretString --output text >> $GITHUB_ENV

Q27. What is the difference between horizontal and vertical scaling?

| Feature | Horizontal Scaling (Scale Out) | Vertical Scaling (Scale Up) |
| --- | --- | --- |
| Method | Add more instances | Increase instance size (CPU/RAM) |
| Limit | Theoretically unlimited | Limited by largest available instance |
| Cost | Linear per instance | Exponential (large instances cost more per unit) |
| Complexity | Application must be stateless/distributed | Simpler (same app, bigger machine) |
| Downtime | None (add instances live) | Often requires restart |
| Best for | Stateless web/app servers | Stateful databases, monoliths |
| Example | EC2 ASG: 5 t3.medium → 10 t3.medium | t3.medium → t3.xlarge |

Kubernetes HPA = horizontal scaling. VPA = vertical scaling (with pod restart).

Cloud-native applications are designed for horizontal scaling:

  • Stateless (session in Redis, not in-process memory)
  • Shared-nothing architecture
  • Configuration from environment (12-factor app)
  • Health endpoints for load balancer integration
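The HPA mentioned above makes its scale-out decision with a single documented formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch of the arithmetic (illustrative, not the controller's actual code):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """Kubernetes HPA scaling formula:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 5 pods averaging 800m CPU against a 500m target → scale out to 8
print(desired_replicas(5, 800, 500))   # → 8
# 10 pods averaging 200m against a 500m target → scale in to 4
print(desired_replicas(10, 200, 500))  # → 4
```

The same formula explains why HPA needs accurate resource requests — the "current metric" is usually utilization relative to the pod's requested CPU.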

Q28. Explain the 12-Factor App methodology.

| Factor | Principle | Example |
| --- | --- | --- |
| 1. Codebase | One codebase, many deploys | Git monorepo or separate repos per service |
| 2. Dependencies | Explicitly declared, isolated | requirements.txt, package.json, go.mod |
| 3. Config | Store config in environment | DATABASE_URL env var, not hardcoded |
| 4. Backing services | Treat as attached resources | DB, Redis, S3 accessed via URL from env |
| 5. Build/Release/Run | Strictly separate stages | Docker image (build) + env vars (release) + container (run) |
| 6. Processes | Execute as stateless processes | No sticky sessions; session state in Redis |
| 7. Port binding | Export services via port | App binds to $PORT |
| 8. Concurrency | Scale out via processes | Multiple workers, HPA |
| 9. Disposability | Fast startup, graceful shutdown | Kubernetes preStop hook, SIGTERM handling |
| 10. Dev/prod parity | Keep environments similar | Same Docker image, same configs |
| 11. Logs | Treat as event streams | Write to stdout, infrastructure aggregates |
| 12. Admin processes | Run as one-off processes | kubectl exec, AWS ECS exec |

Q29. What is a service mesh? Explain the Istio architecture.

Istio architecture:

Control Plane (istiod — a single binary since Istio 1.5)
├── Pilot — distributes service discovery and routing rules to proxies
├── Citadel — certificate authority for mTLS
├── Galley — validates configuration
└── Mixer — deprecated; telemetry is now collected by the Envoy proxies

Data Plane
└── Envoy sidecar proxies (injected into every pod)
    ├── Intercept all inbound/outbound traffic
    ├── Enforce mTLS
    ├── Collect telemetry (metrics, traces)
    └── Apply routing rules (retries, timeouts, circuit breaking)

Key Istio resources:

# VirtualService — traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure
    timeout: 10s
    fault:
      delay:
        percentage:
          value: 10   # 10% of requests delayed (chaos testing)
        fixedDelay: 5s

Istio is powerful but complex — adds ~5ms latency per hop and significant memory overhead (Envoy). Linkerd is a lighter alternative using Rust proxies.


Q30. How do you implement distributed tracing with OpenTelemetry?

# Python FastAPI app with OTel tracing
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()

# Configure tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    with tracer.start_as_current_span("get-order") as span:
        span.set_attribute("order.id", order_id)
        order = await db.get_order(order_id)
        if not order:
            span.set_status(StatusCode.ERROR, "Order not found")
        return order

OTel Collector receives traces from all services, samples them, and exports to Jaeger/Tempo:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  tail_sampling:
    policies:
    - name: keep-error-traces
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: sample-successes
      type: probabilistic
      probabilistic: {sampling_percentage: 5}  # Sample 5% of remaining traces

exporters:
  jaeger:
    endpoint: "jaeger:14250"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [jaeger]

Q31. What is the difference between push and pull monitoring models?

| Model | Description | Examples | Trade-offs |
| --- | --- | --- | --- |
| Pull (scrape) | Monitoring system fetches metrics from targets | Prometheus | Target must expose HTTP endpoint; scalable; firewall-friendly if Prometheus is inside network |
| Push | Targets send metrics to collector | Graphite, InfluxDB, CloudWatch, Datadog | Works behind NAT; useful for short-lived jobs; collector can be overwhelmed |

Prometheus Pushgateway: Bridge for short-lived jobs (batch jobs, cron) that exit before Prometheus scrapes them. Job pushes metrics to Pushgateway; Prometheus scrapes Pushgateway.

# Push metrics from a batch job to Pushgateway
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/backup-job/instance/server1
# TYPE backup_duration_seconds gauge
backup_duration_seconds 420
# TYPE backup_files_processed counter
backup_files_processed 1523847
EOF

Q32. How do you implement feature flags? What are the benefits?

Types:

  • Release flags: Enable new feature for % of users (canary at app layer)
  • Experiment flags: A/B testing (50% see UI variant A, 50% see B)
  • Ops flags: Kill switches for problematic features (disable heavy query, enable maintenance mode)
  • Permission flags: Enable features per user tier (free vs. paid)

# Using Unleash (open-source feature flags)
from unleash_client import UnleashClient

client = UnleashClient(
    url="https://unleash.example.com/api",
    app_name="payment-service",
    custom_headers={"Authorization": "*:development.abc123"}
)
client.initialize_client()  # start background polling before evaluating flags

@app.post("/checkout")
async def checkout(request: CheckoutRequest):
    if client.is_enabled("new-payment-flow", {"userId": request.user_id}):
        return await new_payment_processor(request)
    else:
        return await legacy_payment_processor(request)

Benefits for DevOps:

  1. Decouple deployment from release — deploy code, enable flag later
  2. Instant rollback without redeployment (disable flag)
  3. Dark launches — ship code to production disabled, enable for testing
  4. Gradual rollouts — 1% → 10% → 100%
  5. Trunk-based development — merge incomplete features behind flags
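Gradual rollouts need a deterministic bucketing function so the same user sees the same variant on every request. A minimal sketch of the idea (hypothetical — real SDKs such as Unleash use their own hashing and stickiness configuration):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare against
    the rollout percentage. The same user always gets the same decision."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# Stable across calls — safe to evaluate on every request
assert is_enabled("new-payment-flow", "user-42", 100) is True
assert is_enabled("new-payment-flow", "user-42", 0) is False
```

Raising `rollout_pct` from 1 → 10 → 100 only *adds* users to the enabled set; nobody flips back and forth between variants mid-rollout.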

Q33. What is Packer? How does it fit into a DevOps workflow?

Why build custom AMIs:

  • Faster EC2 launch (no apt install on startup — already baked in)
  • Immutable infrastructure pattern — never patch running instances, replace with new AMI
  • Tested, hardened images (CIS benchmarks applied during build)
  • Consistent configuration across environments

# packer.pkr.hcl
source "amazon-ebs" "ubuntu" {
  region        = "ap-south-1"
  source_ami    = "ami-0f5ee92e2d63afc18"  # Ubuntu 22.04 LTS
  instance_type = "t3.medium"
  ssh_username  = "ubuntu"
  ami_name      = "myapp-base-{{timestamp}}"
}

build {
  sources = ["source.amazon-ebs.ubuntu"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
      "sudo systemctl enable nginx"
    ]
  }

  provisioner "ansible" {
    playbook_file = "playbooks/harden.yml"
  }

  post-processor "manifest" {
    output = "manifest.json"  # Save AMI ID for Terraform
  }
}

CI pipeline: Packer builds AMI → runs tests → publishes AMI ID → Terraform references latest AMI → instances launch immediately with all software pre-installed.


Q34. What is the difference between MTTR, MTBF, and MTTD?

| Metric | Full Name | Measures | Goal |
| --- | --- | --- | --- |
| MTTR | Mean Time to Recovery | Average time to restore service after failure | Minimize (faster recovery) |
| MTBF | Mean Time Between Failures | Average time between incidents | Maximize (more reliable) |
| MTTD | Mean Time to Detect | Average time from failure to detection | Minimize (better monitoring) |
| MTTF | Mean Time to Failure | Average time until a component fails | Maximize |

How to improve MTTR:

  1. Better runbooks (clear, tested procedures)
  2. Auto-remediation (Lambda triggered by CloudWatch alarm)
  3. Feature flags (disable problematic feature instantly)
  4. Rollback automation (Argo CD sync to previous revision on alert)
  5. PagerDuty escalation policies (right person paged immediately)
  6. Chaos engineering (practice incident response regularly)

How to improve MTTD:

  1. Comprehensive alerting (SLI-based alerts, not just infrastructure)
  2. Synthetic monitoring (actively probe from outside)
  3. Real user monitoring (RUM — detect user-impacting issues before alerts)
  4. Anomaly detection (ML-based: Amazon DevOps Guru, Datadog)
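All four metrics are simple averages over incident timestamps. A sketch with hypothetical incident records (the dates and durations are made up for illustration):

```python
from datetime import datetime

# Hypothetical incident log: (failure_start, detected_at, resolved_at)
incidents = [
    (datetime(2026, 1, 1, 10, 0), datetime(2026, 1, 1, 10, 5), datetime(2026, 1, 1, 10, 35)),
    (datetime(2026, 1, 15, 2, 0), datetime(2026, 1, 15, 2, 15), datetime(2026, 1, 15, 3, 5)),
]

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([det - start for start, det, _ in incidents])   # failure → detection
mttr = mean_minutes([res - start for start, _, res in incidents])   # failure → recovery
mtbf = mean_minutes([b[0] - a[0] for a, b in zip(incidents, incidents[1:])])

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min, MTBF: {mtbf / 60 / 24:.1f} days")
```

Note the dependency: improving MTTD directly improves MTTR, since the recovery clock starts at failure, not at detection.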

Q35. What is Vault by HashiCorp? How does it manage secrets?

Core concepts:

  • Secret Engines: Backends that store/generate secrets (KV, AWS IAM, database, PKI)
  • Auth Methods: How clients authenticate (Kubernetes ServiceAccount, AWS IAM, GitHub)
  • Policies: ACL rules controlling who can access which secrets
  • Leases: Time-bound access — dynamic secrets expire and are revoked

Dynamic secrets (killer feature):

# Vault generates a short-lived AWS access key on demand
vault read aws/creds/my-role
# Key: AKIAIOSFODNN7EXAMPLE
# Secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Expires in: 1 hour
# After expiry: Vault automatically revokes it from AWS

In Kubernetes:

# Vault Agent sidecar annotation — injects secrets as files
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "payment-service"
  vault.hashicorp.com/agent-inject-secret-db: "secret/data/production/database"
  vault.hashicorp.com/agent-inject-template-db: |
    {{- with secret "secret/data/production/database" -}}
    DATABASE_URL=postgresql://{{ .Data.data.username }}:{{ .Data.data.password }}@db:5432/app
    {{- end }}

Advanced-Level DevOps Questions (Q36-Q50)

The Advanced section is where Rs 45+ LPA offers are won. These questions are asked for senior SRE and staff-level DevOps roles. If you can answer these confidently, you're in the top 5% of candidates.

Q36. How do you implement zero-trust networking in a DevOps context?

Implementation pillars:

  1. Service-to-service mTLS: Istio/Linkerd automatically issues certificates, enforces mutual authentication — even internal services verify each other
  2. Short-lived credentials: No long-term passwords; dynamic secrets from Vault, IAM roles
  3. Workload identity: SPIFFE/SPIRE assigns cryptographic identities to workloads (pods, VMs)
  4. Network micro-segmentation: K8s NetworkPolicies — explicit allow lists between services
  5. Device trust: BeyondCorp-style — employee machines verified before VPN
  6. Continuous authorization: Re-verify on every request, not just login
  7. Comprehensive audit logging: Every request logged with who, what, when, from where

# Istio PeerAuthentication — enforce mTLS cluster-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # Cluster-wide
spec:
  mtls:
    mode: STRICT  # All service-to-service traffic must use mTLS

Q37. Design a complete CI/CD pipeline for a microservices application.

Architecture:

Code changes
    │
    ▼
GitHub (PR opened)
    │
    ▼
GitHub Actions CI job:
├── Lint + unit tests
├── Build Docker image (BuildKit + cache)
├── Security scan (Trivy CRITICAL/HIGH block)
├── SAST scan (Semgrep/CodeQL)
├── Integration tests (docker-compose)
├── Push to ECR (only on PR merge to main)
└── Update image tag in config repo (GitOps)
    │
    ▼
Config repo PR (automated)
    │
    ▼
Platform team reviews (auto-approve for non-prod)
    │
    ▼
Merge to config repo main branch
    │
    ▼
Argo CD detects change → syncs staging cluster
    │
    ▼
Staging deployment + smoke tests
    │
    ▼
Manual approval (production)
    │
    ▼
Argo CD syncs production cluster
    ├── Canary: 5% → 25% → 50% → 100% (Argo Rollouts)
    └── Auto-rollback if error rate alarm fires
    │
    ▼
PagerDuty alert if post-deploy health check fails

Key design decisions:

  • OIDC for all cloud credentials (no stored keys)
  • Separate application repo from config repo (GitOps)
  • Security scanning gates are non-negotiable
  • Canary with Prometheus-based analysis for production

Q38. How do you implement observability for a distributed system at scale?

Cardinality management (critical at scale): High-cardinality labels (user_id, request_id in Prometheus) cause memory exhaustion. Rules:

  • Prometheus labels: Only low-cardinality (service, endpoint, status_code, region)
  • High-cardinality data: Traces (Jaeger/Tempo), logs (Loki) — not metrics

Sampling strategy:

All traces: 100M requests/day → too expensive to store all
Head-based sampling (decision made at request start — cheap, but blind):
  - Keep a fixed 1% of all traces; interesting error or slow traces may be dropped
Tail-based sampling (OTel Collector tail_sampling processor):
  - Decision made after seeing the full trace:
    100% of error traces, 100% of traces > 500ms latency, 1% of the rest
  - Catches every interesting trace, but more resource-intensive at the collector

Exemplars: Link metrics to traces — when a high-latency spike appears in Prometheus, exemplars provide the trace ID that caused it:

# Histogram with exemplar
REQUEST_LATENCY.observe(duration, exemplar={"traceID": current_trace_id})

SLO-based alerting (multi-window, multi-burn-rate):

# Alert fires if you're burning through 30-day error budget too fast
# Fast burn: 14.4x budget burn rate for 1h (critical)
# Slow burn: 3x budget burn rate for 6h (warning)
- alert: ErrorBudgetBurnTooFast
  expr: |
    (
      sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x the 0.1% SLO error rate
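Where 14.4 comes from: it is the burn rate that consumes 2% of a 30-day error budget within 1 hour (the fast-burn convention from the Google SRE Workbook). A sketch of the arithmetic:

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        slo_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget
    within `window_hours` of an `slo_days`-day SLO period."""
    return budget_fraction * (slo_days * 24) / window_hours

SLO = 0.999                                # 99.9% availability target
fast_burn = burn_rate_threshold(0.02, 1)   # 2% of budget in 1 hour
alert_error_rate = fast_burn * (1 - SLO)   # observed error rate that should page
print(fast_burn, round(alert_error_rate, 4))  # → 14.4 0.0144
```

Multiplying the burn rate by the budgeted error rate (1 − SLO) gives the raw error-rate threshold that appears in the alert expression.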

Q39. What is FinOps? How do DevOps engineers contribute to cloud cost optimization?

DevOps engineer's role in FinOps:

  1. Rightsizing: Use AWS Compute Optimizer recommendations — downsize over-provisioned instances
  2. Auto-scaling: Scale down outside business hours (scheduled scaling for stateless apps)
  3. Spot/Preemptible instances: 70-90% cheaper for fault-tolerant workloads (CI runners, batch, ML training)
  4. Resource tagging: Every resource tagged (environment, team, service, cost-center) for chargeback
  5. Delete zombie resources: Unused EIPs, forgotten Load Balancers, orphaned EBS volumes
  6. Savings Plans/Reserved Instances: Commit to base capacity for 30-66% savings
  7. Architecture optimization: Lambda > always-on EC2 for bursty workloads; Graviton > x86 for ~20% cost reduction
  8. S3 lifecycle policies: Auto-move old data to Glacier ($0.004/GB vs $0.023/GB)
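The S3 lifecycle savings in point 8 are easy to quantify. A sketch using the per-GB prices quoted above (list storage prices only — real bills add request and retrieval fees, which matter for Glacier):

```python
# Monthly S3 storage cost for 10 TB of data at the quoted per-GB prices
GB = 10 * 1024
s3_standard = GB * 0.023   # $/GB-month, S3 Standard
glacier     = GB * 0.004   # $/GB-month, Glacier

print(f"Standard: ${s3_standard:,.0f}/mo, Glacier: ${glacier:,.0f}/mo, "
      f"savings: {1 - glacier / s3_standard:.0%}")
# → Standard: $236/mo, Glacier: $41/mo, savings: 83%
```

This is the kind of back-of-envelope math interviewers expect when they ask how you would cut a cloud bill.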

Tooling:

  • AWS Cost Explorer + Budgets: Alerts on spend anomalies
  • Infracost: Show cost diff in Terraform PRs
  • OpenCost / Kubecost: K8s cost visibility (cost per namespace, deployment, team)

Q40. Explain platform engineering vs. DevOps. What is an Internal Developer Platform (IDP)?

DevOps (early 2010s): Individual dev teams own their own CI/CD and infrastructure. "You build it, you run it."

Platform Engineering (2020s): A dedicated team builds an Internal Developer Platform — golden paths and self-service tools that abstract infrastructure complexity from application developers.

Internal Developer Platform (IDP) components:

| Component | Purpose | Example Tools |
| --- | --- | --- |
| Self-service portal | Developers create environments/services via UI | Backstage (Spotify) |
| Golden path templates | Pre-approved service templates with best practices | Cookiecutter, Backstage Software Templates |
| CI/CD abstractions | Developers don't write raw GitHub Actions | Dagger, shared GitHub Actions libraries |
| Secret management | One-click secret rotation, dev access | Vault UI, External Secrets |
| Observability | Auto-configured monitoring per service | Grafana + auto-dashboards |
| Environment provisioning | Create dev environment in minutes | Crossplane, Argo CD |

Why it matters: At scale (500+ engineers), having each team own all of DevOps creates inconsistency, security gaps, and toil. Platform Engineering creates leverage — one platform team enables hundreds of product engineers.

Increasingly asked at senior/lead level interviews (2026)


Q41. What is Crossplane? How does it extend Kubernetes for infrastructure management?

# Provision an RDS PostgreSQL instance using Crossplane
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-postgres
spec:
  forProvider:
    region: ap-south-1
    dbInstanceClass: db.t3.medium
    masterUsername: admin
    engine: postgres
    engineVersion: "15.3"
    multiAZ: true
    skipFinalSnapshotBeforeDeletion: false
  writeConnectionSecretsToRef:
    namespace: production
    name: postgres-credentials  # Crossplane stores connection details as K8s Secret

Crossplane vs. Terraform:

  • Crossplane is Kubernetes-native — leverages existing K8s RBAC, GitOps tools (Argo CD), and tooling
  • Terraform is more mature, larger ecosystem, easier to use outside K8s
  • Crossplane is better for organizations fully committed to Kubernetes and GitOps

Q42. What is Site Reliability Engineering (SRE)? How does it differ from DevOps?

| Aspect | DevOps | SRE |
| --- | --- | --- |
| Origin | Cultural movement | Google's implementation of DevOps principles |
| Focus | Collaboration, CI/CD, automation | Reliability, scalability, SLOs |
| Team structure | Embedded in product teams | Dedicated SRE teams (or embedded) |
| Key metrics | Deployment frequency, lead time | SLOs, error budget, MTTR, MTBF |
| Tooling | CI/CD, IaC, monitoring | All DevOps + capacity planning, load testing |
| Book | "The Phoenix Project" | "Site Reliability Engineering" (Google) |

SRE unique concepts:

  • Toil: Manual, repetitive, automatable work that scales with service load. SREs target <50% toil time.
  • Error budgets: Formalize the reliability vs. velocity trade-off. If error budget is depleted, new features freeze.
  • Eliminating toil: Every manual task is a candidate for automation. If you do it twice, automate it.
  • Postmortems: Blameless, written, shared across organization.
  • Production readiness reviews (PRR): Checklist before launching new services (alerts configured? runbook exists? load tested?)
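Error budgets translate directly into allowed downtime per period — useful mental math in SRE interviews. A quick sketch:

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits per period."""
    return days * 24 * 60 * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO → {allowed_downtime_minutes(slo):.1f} min/30 days")
# → 99.00% SLO → 432.0 min, 99.90% SLO → 43.2 min, 99.99% SLO → 4.3 min
```

Each extra "nine" cuts the budget by 10× — which is why teams freeze feature work when the remaining budget is nearly spent.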

Critical distinction for Google, Amazon, Flipkart SRE positions


Q43. How do you design for reliability in a multi-region deployment?

Active-Active (both regions serve traffic):

Users globally
    │
Route 53 latency-based routing
   /                    \
ap-south-1 (Mumbai)    ap-southeast-1 (Singapore)
├── EKS cluster         ├── EKS cluster
├── Aurora Global DB    └── Aurora Read Replica
└── ElastiCache             (promotes to writer on failover)

Active-Passive (one region handles traffic, other on standby):

  • Simpler, lower cost
  • RTO: minutes (failover time)
  • RPO: seconds (replication lag)

Key patterns for multi-region:

  1. Data consistency: Use CRDTs for eventually consistent data; avoid two-phase commit across regions
  2. Circuit breakers: Don't let a failing region cascade to healthy region
  3. Chaos engineering: Regularly simulate region failures (AWS FIS)
  4. DNS failover: Route 53 health checks auto-reroute on region failure
  5. Deployment: Deploy to one region, verify, then the second (sequenced deploys)

RTO/RPO targets:

  • Tier 1 (payments, login): RTO <5 min, RPO ~0 (synchronous replication)
  • Tier 2 (recommendations): RTO <30 min, RPO <5 min
  • Tier 3 (analytics): RTO <4 hours, RPO <1 hour

Q44. What is supply chain security in DevOps? How do you implement SLSA?

SLSA (Supply-chain Levels for Software Artifacts) — a framework for supply chain integrity:

| Level | Requirements |
| --- | --- |
| SLSA 1 | Provenance exists (build logs available) |
| SLSA 2 | Provenance signed and hosted by build service |
| SLSA 3 | Source verified, build isolated, hardened build environment |
| SLSA 4 | Two-person review, hermetic reproducible builds |

Implementation:

# GitHub Actions with SLSA provenance — slsa-github-generator is a reusable
# workflow, so it is invoked at the job level, not as a step
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      digest: ${{ steps.build.outputs.digest }}
    steps:
    - name: Build Docker image
      id: build
      uses: docker/build-push-action@v5
      with:
        push: true
        tags: myapp:${{ github.sha }}

  provenance:
    needs: build
    permissions:
      actions: read
      id-token: write
      packages: write
    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v1
    with:
      image: myapp
      digest: ${{ needs.build.outputs.digest }}

Dependency security:

  • Dependabot: Auto-PRs for dependency updates
  • Renovate: More configurable alternative
  • Snyk: Deep vulnerability scanning including transitive dependencies
  • SBOM (CycloneDX/SPDX): Know every component in your software

Q45. How do you handle database migrations in CI/CD?

Safe migration patterns:

  1. Expand-Contract (for zero-downtime):

    • Phase 1 (Expand): Add new column/table, keep old one. Both old and new app code can run.
    • Phase 2 (Migrate): Backfill new column; run new app code.
    • Phase 3 (Contract): Remove old column after old code is fully deployed.
  2. Blue-Green with schema sync:

    • New schema must be backward compatible with both blue (old) and green (new) versions
    • Never drop columns in the same deploy that adds new ones
  3. Tools:

    • Flyway/Liquibase: Version-controlled SQL migrations, run in pipeline before app deploy
    • Atlas: Modern schema-as-code for multiple databases
    • Django/Rails migrations: Framework-native

# Kubernetes job running migrations before deployment
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-{{ .Values.image.tag }}
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      initContainers:
      - name: wait-for-db
        image: busybox
        command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']
      containers:
      - name: migrate
        image: myapp:{{ .Values.image.tag }}
        command: ["python", "manage.py", "migrate", "--noinput"]
      restartPolicy: Never
  backoffLimit: 3

Helm pre-install/pre-upgrade hooks ensure migrations run and succeed before the new app version is deployed.


Q46. What is Dagger? How does it improve CI/CD portability?

Problem it solves: CI configuration is fragmented across YAML files for each platform. Testing locally requires pushing to CI. Different behavior in CI vs. local.

# dagger pipeline in Python — runs same way locally and in GitHub Actions
import anyio
import dagger

async def build_and_test():
    async with dagger.Connection() as client:
        # Get source code
        source = client.host().directory(".", exclude=[".git", "node_modules"])

        # Build container
        node = (
            client.container()
            .from_("node:20-alpine")
            .with_directory("/src", source)
            .with_workdir("/src")
            .with_exec(["npm", "ci"])
        )

        # Run tests (returns container with test results)
        test = await node.with_exec(["npm", "test"]).stdout()
        print(f"Tests: {test}")

        # Build Docker image
        image = await node.with_exec(["npm", "run", "build"]).publish(
            "myregistry.com/myapp:latest"
        )
        print(f"Published: {image}")

anyio.run(build_and_test)

Run locally with python pipeline.py — same execution in CI via dagger run python pipeline.py.


Q47. How do you implement policy as code with OPA/Gatekeeper?

# Gatekeeper ConstraintTemplate — enforce resource limits
package k8srequiredlimits

violation[{"msg": msg}] {
    container := input.review.object.spec.containers[_]
    not container.resources.limits.memory
    msg := sprintf("Container '%s' must have memory limits", [container.name])
}

violation[{"msg": msg}] {
    container := input.review.object.spec.containers[_]
    not container.resources.limits.cpu
    msg := sprintf("Container '%s' must have CPU limits", [container.name])
}

# Constraint — enforce in the production and staging namespaces
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLimits
metadata:
  name: require-resource-limits
spec:
  enforcementAction: deny  # or 'warn' for audit mode
  match:
    namespaces: ["production", "staging"]

OPA in CI/CD (non-K8s):

# Evaluate Terraform plan against policies before apply
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan-output.json

# OPA policy check
opa eval -d policies/ -i plan-output.json "data.terraform.allow" --fail

Conftest uses OPA policies to validate any YAML/JSON/Terraform/Dockerfile.


Q48. What is AIOps? How is AI being integrated into DevOps in 2026?

Current practical applications (2026):

| Use Case | Tool | How it works |
| --- | --- | --- |
| Anomaly detection | Grafana ML, Datadog Watchdog | Baseline metrics, flag unusual deviations |
| Alert correlation | PagerDuty AIOps | Group related alerts, reduce alert noise |
| Root cause analysis | Amazon DevOps Guru | Identify unusual resource behavior patterns |
| Log analysis | Elastic ML, Grafana Loki ML | Cluster log patterns, detect new error types |
| Predictive scaling | AWS Auto Scaling predictive mode | Scale before traffic hits, not after |
| Incident resolution | Slack AI + runbook lookup | Suggest runbook steps from incident description |
| Code review | GitHub Copilot, CodeRabbit | Review PRs for security issues, performance |

Practical integration pattern:

# GitHub Actions with AI-assisted PR review
- name: CodeRabbit Review
  uses: coderabbitai/ai-pr-reviewer@latest
  with:
    openai_api_key: ${{ secrets.OPENAI_API_KEY }}
    # Reviews for: logic errors, security issues, performance, test coverage

AI doesn't replace DevOps engineers in 2026 — it handles toil (log analysis, alert grouping) so engineers focus on system design and reliability.


Q49. How do you implement disaster recovery? Explain RTO and RPO in practice.

DR Strategies (from cheapest to most expensive):

| Strategy | RTO | RPO | Cost | Description |
| --- | --- | --- | --- | --- |
| Backup & Restore | Hours | Hours | Low | Restore from S3 backups |
| Pilot Light | 10-30 min | Minutes | Low-Medium | Core services always running, scale up on disaster |
| Warm Standby | Minutes | Seconds | Medium | Scaled-down version running in secondary region |
| Multi-Site Active-Active | Near zero | Near zero | High | Full production capacity in both regions |

Pilot Light example:

# Primary Region (ap-south-1) — full production
module "primary" {
  instance_count = 10
  rds_instance   = "db.r6g.xlarge"
}

# DR Region (ap-southeast-1) — pilot light
module "dr" {
  instance_count = 0            # ASG min=0, max=10
  rds_instance   = "db.t3.medium"  # Smaller RDS receiving replication
}

DR runbook steps (must be tested quarterly):

  1. Verify data replication is up-to-date
  2. Scale up DR region ASG
  3. Promote RDS read replica to primary
  4. Update Route 53 health check to point to DR region
  5. Verify application is serving traffic
  6. Communicate status to stakeholders

Test your DR plan: Untested DR plans fail when you need them most. Chaos engineering + game days simulate disasters on a schedule.


Q50. Describe your approach to building a developer platform from scratch at a 200-person engineering organization.

Phase 1 — Assess (Week 1-2):

  • Interview 20+ engineers: biggest pain points, sources of manual toil, blocked deployments
  • Audit current state: How many unique CI/CD setups? How long do deployments take? DORA metrics baseline
  • Identify top 3 pain points (usually: slow/flaky CI, inconsistent environments, opaque deployments)

Phase 2 — Golden Path (Month 1-3):

  • Standardize on one CI/CD platform (GitHub Actions)
  • Shared GitHub Actions library: reusable build/test/deploy workflows
  • Opinionated service template (Cookiecutter + Backstage): generates a new service with CI/CD, monitoring, and security scanning pre-configured
  • Target: new service from idea to first deployment < 1 day

Phase 3 — Self-Service (Month 3-6):

  • Backstage portal: service catalog, create environments, view deployment status
  • One-click staging environment provisioning
  • Automated secret management onboarding
  • Auto-configured Grafana dashboards per service

Phase 4 — Reliability (Month 6-12):

  • SLO framework and tooling
  • Chaos engineering program
  • Production readiness checklist
  • On-call tooling (PagerDuty + runbooks)

Measure success with DORA metrics:

  • Deployment frequency: 1 deploy/week/team → 5/day/team
  • Lead time: 3 days → 2 hours
  • Change failure rate: 20% → 5%
  • MTTR: 2 hours → 15 minutes

FAQ Section — Straight Answers to Your DevOps Career Questions

Q: What's the difference between DevOps Engineer, SRE, and Platform Engineer? This confuses almost everyone, so here's the clear breakdown: DevOps Engineer focuses on CI/CD, automation, build/deploy tooling. SRE focuses on production reliability, SLOs, incident response, and on-call. Platform Engineer builds internal tools and golden paths for other engineers. Roles overlap significantly — job title often depends on company culture. Pro tip: Read the job description carefully. A "DevOps Engineer" role at a bank is very different from one at a startup.

Q: Is Kubernetes knowledge required for DevOps roles in 2026? Yes — it's effectively non-negotiable for product companies. Kubernetes is the production container orchestration standard. At minimum, understand deployments, services, ConfigMaps, RBAC, and basic troubleshooting. EKS/GKE-specific knowledge is a bonus. Check out our Kubernetes Interview Questions 2026 for a deep dive.

Q: Terraform vs. Pulumi in 2026 — which is gaining ground? Terraform remains dominant in market share. Pulumi is gaining traction at organizations with strong software engineering cultures (programmatic IaC in general-purpose languages vs. HCL). The OpenTofu fork (open-source Terraform) is growing after HashiCorp's BSL license change.

Q: What monitoring stack should I learn? The open-source stack: Prometheus + Grafana + Loki + Tempo + OpenTelemetry. This covers metrics, logs, and traces. Datadog is the dominant commercial alternative — learn if you're going to enterprise companies.

Q: Is Jenkins still relevant in 2026? Jenkins remains widely deployed in enterprises and has a huge plugin ecosystem. GitHub Actions, GitLab CI, and CircleCI are preferred at modern product companies. If you're applying to enterprise/bank/large IT — know Jenkins. Startups: GitHub Actions.

Q: What certifications should a DevOps engineer get? Tier 1: AWS Solutions Architect Associate + CKA (Certified Kubernetes Administrator). Tier 2: HashiCorp Terraform Associate, Prometheus Certified Associate. Tier 3: CKS (Security), AWS DevOps Professional, Google Cloud Professional DevOps Engineer.

Q: How important is Python/scripting for DevOps? Extremely important. Bash for one-liners and simple scripts. Python for anything more complex (AWS boto3, custom tooling, Ansible modules). Go is increasingly common for writing K8s operators and CLI tools. Know at least Python + Bash.

Q: What is the DevOps salary range in India in 2026? Here are the verified numbers from real offers: Junior DevOps (0-2 yrs): Rs 8-15 LPA. Mid-level (3-5 yrs, K8s + AWS + CI/CD): Rs 18-40 LPA. Senior/SRE (7+ yrs, system design, architecture): Rs 45-90 LPA. Principal/Staff: Rs 80 LPA-1.5 Cr at top product companies. The jump from mid-level to senior is massive — and it's driven by system design and incident management skills, not just tool knowledge.

