
DevOps Engineer Interview Questions 2026: CI/CD, Containers, IaC & SRE

Q: What is the difference between Docker Compose and Kubernetes?

Docker Compose is a tool for defining and running multi-container applications on a single host. It is used for local development and simple single-server deployments. Kubernetes is a container orchestration system for production workloads: multi-node scheduling, auto-scaling, self-healing, rolling updates, service discovery, and storage orchestration across a cluster. Use Docker Compose for local dev; use Kubernetes for production. Tools like Kompose can convert Compose files to Kubernetes manifests.

Q: What is toil in SRE, and how do you reduce it?

Toil is manual, repetitive, tactical operational work that scales with service growth and provides no lasting value. Examples: manually restarting failed services, manually approving routine deployments, manually responding to false-positive alerts. SRE teams aim to keep toil below 50% of work time; the rest goes to engineering work (automation, reliability improvements). Reduce toil by: automating repetitive tasks, eliminating flaky alerts, implementing self-healing (Kubernetes restarts failed pods), and building tooling that removes human steps from standard workflows.

Q: How do you implement zero-downtime deployments in Kubernetes?

Key requirements: readiness probes (traffic is sent only to ready pods), a PodDisruptionBudget (limits how many pods can be unavailable during a node drain), a rolling update strategy (maxUnavailable: 0, maxSurge: 1), a preStop hook plus terminationGracePeriodSeconds (lets in-flight requests complete), and graceful shutdown in application code (handle SIGTERM, drain connections before exit). Candidates report that missing readiness probes are the most common cause of deployment-related downtime in Kubernetes. Confirm current Kubernetes best practices in the official Kubernetes documentation.
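
The application-side part of this answer (handle SIGTERM, stop taking new work, finish what is in flight) can be sketched in a few lines. This is a minimal illustration, not a production server: the `serve` loop and its list of requests are hypothetical stand-ins for a real request handler.

```python
import signal
import threading

# Set once the process has been asked to stop (Kubernetes sends SIGTERM first).
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; the serving loop drains and exits."""
    shutting_down.set()


def serve(requests):
    """Hypothetical serving loop: stop taking new requests after SIGTERM."""
    handled = []
    for request in requests:
        if shutting_down.is_set():
            break  # stop accepting new work; already-handled requests are done
        handled.append(request)  # stand-in for processing one request
    return handled


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    print(serve(["req-1", "req-2"]))
```

The process must finish draining within terminationGracePeriodSeconds; after that the kubelet sends SIGKILL.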


By Aditya Sharma | Published 8 Jun 2026 | Last revised 8 Jun 2026

Candidates report that DevOps Engineer interviews in 2026 emphasize Kubernetes operations, Terraform IaC, CI/CD pipeline design, observability, and incident management. Confirm exact tool versions and practices expected on the official company careers portal before your interview.

DevOps Engineering covers the full software delivery lifecycle: CI/CD, containerization, orchestration, infrastructure automation, monitoring, and reliability engineering. This guide covers core concepts and real-world scenarios.

CI/CD Fundamentals

Q1. What is CI/CD, and what are the stages of a typical pipeline?

Continuous Integration (CI): Developers merge code frequently (multiple times/day), triggering automated build and test to detect integration issues early.

Continuous Delivery (CD): Every passing CI build is a releasable artifact. Deployment to production requires a manual approval gate.

Continuous Deployment: Every passing CI build deploys automatically to production. No manual gate.

Typical pipeline stages:

# GitHub Actions example pipeline
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'
    - name: Install dependencies
      run: pip install -r requirements.txt
    - name: Lint
      run: flake8 src/ tests/
    - name: Unit tests
      run: pytest tests/unit/ --cov=src --cov-report=xml
    - name: Integration tests
      run: pytest tests/integration/ -m integration
    - name: Security scan
      uses: aquasecurity/trivy-action@master
      with:
        scan-type: 'fs'
        severity: 'HIGH,CRITICAL'

  build:
    needs: test
    if: github.event_name == 'push'   # do not build or push images for pull requests
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Build Docker image
      run: docker build -t myapp:${{ github.sha }} .
    - name: Push to registry
      run: |
        echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login -u "${{ secrets.REGISTRY_USER }}" --password-stdin
        docker push myapp:${{ github.sha }}
        docker tag myapp:${{ github.sha }} myapp:latest
        docker push myapp:latest

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    steps:
    - name: Deploy to staging
      run: |
        kubectl set image deployment/myapp \
            myapp=myapp:${{ github.sha }} \
            --namespace=staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # requires approval
    if: github.ref == 'refs/heads/main'
    steps:
    - name: Deploy to production
      run: |
        kubectl set image deployment/myapp \
            myapp=myapp:${{ github.sha }} \
            --namespace=production

Key pipeline principles:

Fail fast: run fast tests (unit) before slow tests (integration).
Artifact immutability: same Docker image promoted staging -> production (never rebuild).
Every merge to main should be deployable.
Pipeline as code: version-controlled alongside application code.

Q2. What are deployment strategies, and what are the trade-offs?

Rolling Update (default Kubernetes):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%         # extra pods during update (max total = 125%)
      maxUnavailable: 25%   # pods that can be unavailable (min available = 75%)

Gradual replacement of old pods with new pods.
Zero-downtime if health checks are correct.
Risk: both versions running simultaneously during update (API backward compatibility required).
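
To make the maxSurge/maxUnavailable percentages concrete, the sketch below converts them to pod counts for a given replica count. It relies on the documented rounding rule (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down); the function name is made up for this example.

```python
import math


def rolling_update_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int):
    """Return (max_total_pods, min_available_pods) during a rolling update.

    Kubernetes rounds a percentage maxSurge up and a percentage
    maxUnavailable down when converting them to absolute pod counts.
    """
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas + surge, replicas - unavailable


print(rolling_update_bounds(10, 25, 25))  # (13, 8)
```

With 10 replicas and the default 25%/25%, at most 13 pods exist at once and at least 8 stay available.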

Blue/Green:

# Blue service (current production)
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # <-- switch to green to deploy

---
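# Blue deployment (keep running until verified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue   # illustrative name
spec:
  selector:
    matchLabels:
      app: myapp
      version: blue
  # template omitted for brevity; its pod labels must match the selector above

---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green  # illustrative name
spec:
  selector:
    matchLabels:
      app: myapp
      version: green
  # template omitted for brevity; its pod labels must match the selector above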

Instant cutover: update Service selector.
Instant rollback: revert Service selector.
Requires 2x capacity.

Canary:

# 90% production, 10% canary
# Production: 9 replicas
# Canary: 1 replica
# Traffic split by replica ratio (or Istio/Argo Rollouts for precise %)

Use Argo Rollouts for percentage-based canary with automatic rollback:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10        # 10% canary
      - pause: {duration: 5m}
      - analysis:            # auto-rollback if error rate > threshold
          templates:
          - templateName: error-rate
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100

Recreate:

Stop all old pods, start all new pods.
Downtime during restart.
Use only when: breaking changes prevent both versions running simultaneously.

Docker

Q3. What are Docker layers, and how do you write efficient Dockerfiles?

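Each RUN, COPY, and ADD instruction creates a new layer. Layers are cached: when an instruction or its inputs change, that layer and every layer after it are rebuilt, while the layers before it are reused from cache.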
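
The cache rule can be modelled with a short sketch. This is a simplified mental model, not Docker's actual implementation: each step is represented as an (instruction, input fingerprint) pair, and the fingerprints below are made-up placeholders.

```python
def layers_to_rebuild(old_steps, new_steps):
    """Return the instructions in new_steps that must be rebuilt.

    Each step is (instruction, input_fingerprint). A layer is reusable only
    if it and every layer before it are unchanged, so the first mismatch
    invalidates everything from that point on.
    """
    for i, step in enumerate(new_steps):
        if i >= len(old_steps) or old_steps[i] != step:
            return [instruction for instruction, _ in new_steps[i:]]
    return []


# Code copied before dependencies: a code change rebuilds the pip layer too.
slow_old = [("COPY . .", "code-v1"), ("RUN pip install", "reqs-v1")]
slow_new = [("COPY . .", "code-v2"), ("RUN pip install", "reqs-v1")]
print(layers_to_rebuild(slow_old, slow_new))

# Dependencies copied first: the same code change rebuilds only the last layer.
fast_old = [("COPY requirements.txt", "reqs-v1"), ("RUN pip install", "reqs-v1"), ("COPY src/", "code-v1")]
fast_new = [("COPY requirements.txt", "reqs-v1"), ("RUN pip install", "reqs-v1"), ("COPY src/", "code-v2")]
print(layers_to_rebuild(fast_old, fast_new))
```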

Inefficient Dockerfile:

FROM python:3.11
WORKDIR /app
# BAD: copies everything, so any code change invalidates this layer
COPY . .
# Runs AFTER the code copy, so dependencies are reinstalled on every code change
RUN pip install -r requirements.txt

Efficient Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Copy dependencies first (cached unless requirements.txt changes)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code last (changes frequently -- invalidates only this layer)
COPY src/ src/
COPY config/ config/

# Non-root user (security best practice)
RUN useradd --create-home appuser
USER appuser

# EXPOSE only documents the port; it does not publish it (use -p at run time or a Service)
EXPOSE 8080

# Use exec form (no shell wrapping, proper signal handling)
CMD ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]

Multi-stage build (reduce image size):

# Stage 1: Build
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
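RUN npm run build
# Drop devDependencies so only production modules are copied to the runtime stage
RUN npm prune --omit=dev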

# Stage 2: Runtime (only production artifacts)
FROM node:18-alpine AS runtime
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
# Final image: ~150 MB instead of ~1.2 GB

Security best practices:

# Pin base image digest (not just tag -- tags can be overwritten)
FROM python:3.11-slim@sha256:abc123...

# No SUID binaries in user-facing code
# Scan with: trivy image myapp:latest
# or: docker scout cves myapp:latest

Q4. How do Docker networking modes work?

# Bridge network (default): containers on same host communicate via virtual bridge
# Container gets private IP (172.17.0.x)
docker run --network bridge myapp

# Host network: container shares host's network namespace (no isolation)
# Use for: extreme performance, monitoring agents needing host network visibility
docker run --network host myapp

# Custom bridge network (better than default bridge -- DNS by container name)
docker network create myapp-net
docker run --network myapp-net --name api myapi:latest
docker run --network myapp-net --name db postgres:15
# 'api' container can reach 'db' container by name: postgres://db:5432/

# No network (fully isolated, no external access)
docker run --network none mybatch

# Overlay network (multi-host, Docker Swarm)
docker network create --driver overlay my-overlay

# macvlan: container gets MAC address, appears as physical device on network
# Use for: legacy apps requiring direct network access, specific MAC requirements

Kubernetes

Q5. Explain the Kubernetes architecture and key components.

Control Plane (master):
  kube-apiserver    -- REST API, single entry point for all management
  etcd              -- distributed key-value store, cluster state/config
  kube-scheduler    -- assigns Pods to Nodes (based on resources, affinity, taints)
  kube-controller-manager -- runs controllers: ReplicaSet, Deployment, Namespace, etc.
  cloud-controller-manager -- cloud-specific controllers (load balancer provisioning)

Worker Nodes:
  kubelet           -- node agent, ensures Pods from PodSpec are running
  kube-proxy        -- network rules, iptables/IPVS for Service routing
  Container Runtime -- containerd, CRI-O (runs containers)
  
Add-ons:
  CoreDNS           -- cluster DNS (service discovery)
  CNI plugin        -- networking (Calico, Flannel, Cilium)
  CSI driver        -- storage (AWS EBS, GCE PD, Ceph)

Request flow (kubectl apply -> Pod running):

1. kubectl apply -> kube-apiserver (validates + stores in etcd)
2. Deployment controller creates ReplicaSet
3. ReplicaSet controller creates Pod objects (no node assigned)
4. kube-scheduler watches for unscheduled Pods -> assigns Node
5. kubelet on target Node watches for Pods assigned to it
6. kubelet calls container runtime -> pulls image -> starts container
7. kubelet reports status back to apiserver -> etcd updated

Q6. What are Kubernetes resource types, and when do you use each?

Resource	Use case
Pod	Atomic unit, rarely created directly
Deployment	Stateless apps, rolling updates
StatefulSet	Stateful apps needing stable identity (databases, Kafka)
DaemonSet	One pod per node (log collectors, monitoring agents)
Job	One-time batch task (completes and exits)
CronJob	Scheduled batch tasks (like Unix cron)
Service	Stable network endpoint for Pods
Ingress	HTTP(S) routing, TLS termination, virtual hosts
ConfigMap	Non-sensitive configuration data
Secret	Sensitive data (passwords, tokens, certs)
PersistentVolumeClaim	Request for persistent storage
HorizontalPodAutoscaler	Scale replicas based on metrics
NetworkPolicy	Pod-level firewall rules

StatefulSet example (Kafka):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless  # headless Service for DNS (kafka-0.kafka-headless)
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka           # must match spec.selector.matchLabels
    spec:
      containers:
      - name: kafka
        image: confluentinc/cp-kafka:7.6.0
        env:
        - name: KAFKA_BROKER_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
        volumeMounts:
        - name: data
          mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 100Gi

Q7. How do Kubernetes probes work, and why are they important?

spec:
  containers:
  - name: app
    image: myapp:v1

    # Liveness probe: restart container if it fails
    # Use for: deadlocked processes, infinite loop bugs
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10   # wait before first probe
      periodSeconds: 10          # probe interval
      timeoutSeconds: 5
      failureThreshold: 3        # 3 consecutive failures -> restart

    # Readiness probe: remove from Service endpoints if it fails
    # Use for: slow startup, DB connection not ready, downstream dependency down
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3        # 3 failures -> remove from LB, no more traffic

    # Startup probe: disable liveness until app finishes starting
    # Use for: slow-starting apps (Java Spring Boot, ML model loading)
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      failureThreshold: 30       # 30 * 10s = 300s max startup time
      periodSeconds: 10

Health endpoint implementation:

from flask import Flask, jsonify
import threading

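app = Flask(__name__)
is_ready = threading.Event()   # set by startup code once initialization finishes
# `db` is assumed to be an application-level database handle exposing is_connected()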

@app.route('/health/live')
def liveness():
    # Lightweight -- just verify process is alive
    return jsonify({'status': 'alive'}), 200

@app.route('/health/ready')
def readiness():
    # Check dependencies: DB connected, cache connected, model loaded
    if not db.is_connected():
        return jsonify({'status': 'not ready', 'reason': 'db disconnected'}), 503
    if not is_ready.is_set():
        return jsonify({'status': 'not ready', 'reason': 'initializing'}), 503
    return jsonify({'status': 'ready'}), 200

Q8. How does Kubernetes handle resource requests, limits, and QoS classes?

spec:
  containers:
  - name: app
    resources:
      requests:          # Guaranteed minimum -- used for scheduling decisions
        cpu: "500m"      # 0.5 vCPU
        memory: "512Mi"
      limits:            # Maximum -- container killed/throttled if exceeded
        cpu: "2000m"     # 2 vCPUs
        memory: "1Gi"

QoS Classes (affects eviction priority during memory pressure):

Class	Condition	Eviction priority
Guaranteed	Every container sets CPU and memory requests equal to limits	Last evicted
Burstable	At least one container sets a request or limit, but the Pod does not meet Guaranteed	Middle
BestEffort	No requests or limits set	First evicted
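
The classification rules can be written as a small function. This is a sketch of the rules as documented by Kubernetes, not the kubelet's code; it assumes quantities are given as already-normalized strings (it would treat "500m" and "0.5" as different values).

```python
def qos_class(containers):
    """Classify a Pod from a list of container resource dicts.

    Each container looks like {"requests": {...}, "limits": {...}},
    keyed by "cpu" and "memory".
    """
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"

    def is_guaranteed(container):
        limits = container.get("limits", {})
        # Kubernetes defaults a missing request to the corresponding limit.
        requests = {**limits, **container.get("requests", {})}
        return all(
            resource in limits and requests[resource] == limits[resource]
            for resource in ("cpu", "memory")
        )

    return "Guaranteed" if all(is_guaranteed(c) for c in containers) else "Burstable"


example = {"requests": {"cpu": "500m", "memory": "512Mi"},
           "limits": {"cpu": "2000m", "memory": "1Gi"}}
print(qos_class([example]))  # Burstable
```

The manifest shown for Q8 is therefore Burstable; setting requests equal to limits for both resources would make it Guaranteed.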

CPU throttling vs memory OOM:

CPU: container is throttled (slowed), not killed. Visible in cAdvisor as the container_cpu_cfs_throttled_seconds_total metric.
Memory: container is killed; the Pod status shows reason OOMKilled (exit code 137). Set limits with 20-30% headroom above observed peak usage.

Vertical Pod Autoscaler (VPA):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"    # Auto-update requests; Off = recommendations only

Infrastructure as Code

Q9. How does Terraform work, and what is its state management model?

Terraform uses a declarative model: you define desired infrastructure state; Terraform calculates and applies the diff.
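
The "calculate the diff" idea can be illustrated with a toy planner. This is a deliberately simplified model and not how Terraform is implemented (the real tool also refreshes state from the provider and orders changes with a dependency graph); the resource names and attributes are hypothetical.

```python
def plan(desired: dict, current: dict) -> dict:
    """Compute create/update/delete actions from desired config and recorded state."""
    return {
        "create": sorted(k for k in desired if k not in current),
        "update": sorted(k for k in desired if k in current and desired[k] != current[k]),
        "delete": sorted(k for k in current if k not in desired),
    }


desired = {"vpc": {"cidr": "10.0.0.0/16"}, "eks": {"version": "1.29"}}
current = {"vpc": {"cidr": "10.0.0.0/16"}, "eks": {"version": "1.28"}, "old_bucket": {}}
print(plan(desired, current))
```

Here the unchanged VPC produces no action, the cluster is updated, and the resource that no longer appears in configuration is scheduled for deletion.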

Core workflow:

terraform init      # download providers, initialize backend
terraform plan      # compute diff (desired vs current state)
terraform apply     # apply changes (with confirmation)
terraform destroy   # tear down all managed resources

State file:

terraform.tfstate: JSON representation of all managed resources.
Maps Terraform config to real infrastructure IDs.
Never edit manually.
Store in remote backend (S3, GCS, Terraform Cloud) -- never local for teams.

# Remote backend (S3 + DynamoDB for locking)
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/webapp/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

Example: provision AWS EKS cluster with Terraform:

# variables.tf
variable "cluster_name" {
  default = "my-eks-cluster"
}

variable "region" {
  default = "us-east-1"
}

# main.tf
provider "aws" {
  region = var.region
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "${var.cluster_name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.region}a", "${var.region}b", "${var.region}c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false  # one per AZ for HA
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.0.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m7g.xlarge"]   # Graviton (arm64) instances
      ami_type       = "AL2_ARM_64"     # AMI type must match the arm64 architecture
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }
  }
}

output "cluster_endpoint" {
  value = module.eks.cluster_endpoint
}

Terraform best practices:

- Modules: encapsulate reusable infrastructure components
- Workspaces or separate state per environment (dev/staging/prod)
- Remote state locking: prevent concurrent applies
- terraform plan in CI: post plan output as PR comment
- Sentinel / OPA policies: enforce compliance before apply
- Import: import existing resources into state without destroying them
- Lifecycle rules: prevent_destroy for production databases

Q10. What is Ansible, and how does it differ from Terraform?

Aspect	Terraform	Ansible
Primary use	Infrastructure provisioning	Configuration management + application deployment
Approach	Declarative (desired state)	Procedural (ordered tasks in playbooks)
State	Managed (tfstate file)	Stateless (re-runs are idempotent by task design)
Agentless	Yes (calls provider APIs from wherever Terraform runs)	Yes (SSH/WinRM to target hosts)
Cloud IaC	Excellent	Limited (modules exist but Terraform is better)
OS config	Limited	Excellent (packages, files, services, users)

Ansible playbook example (configure web servers):

---
- name: Configure web servers
  hosts: webservers
  become: yes  # sudo

  vars:
    nginx_port: 80
    app_user: appuser

  tasks:
  - name: Install Nginx
    apt:
      name: nginx
      state: present
      update_cache: yes

  - name: Copy Nginx config
    template:
      src: templates/nginx.conf.j2
      dest: /etc/nginx/nginx.conf
      mode: '0644'
    notify: Restart Nginx

  - name: Create application user
    user:
      name: "{{ app_user }}"
      state: present
      shell: /bin/bash
      create_home: yes

  - name: Deploy application
    git:
      repo: https://github.com/myorg/myapp.git
      dest: /opt/myapp
      version: "{{ app_version | default('main') }}"
    notify: Restart app service

  - name: Ensure app service is running
    systemd:
      name: myapp
      state: started
      enabled: yes

  handlers:
  - name: Restart Nginx
    systemd:
      name: nginx
      state: restarted

  - name: Restart app service
    systemd:
      name: myapp
      state: restarted

Monitoring and Observability

Q11. What are the three pillars of observability?

1. Metrics (What is happening?) Numerical time-series measurements. Prometheus is the de facto standard.

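from flask import Flask
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)
# `db` is assumed to be an application-level data access object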

# Counters: monotonically increasing
http_requests = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])

# Histograms: latency distributions (automatically creates _count, _sum, _bucket)
request_duration = Histogram('http_request_duration_seconds', 'Request latency',
                              ['method', 'endpoint'],
                              buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5])

# Gauges: current value (can go up/down)
active_connections = Gauge('active_connections', 'Current active connections')

@app.route('/api/users')
def get_users():
    with request_duration.labels('GET', '/api/users').time():
        result = db.query_users()
        http_requests.labels('GET', '/api/users', '200').inc()
        return result

2. Logs (What happened?) Structured event records. ELK Stack (Elasticsearch + Logstash + Kibana) or Loki + Grafana.

import structlog

log = structlog.get_logger()

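def process_order(order):
    # `order`, `payment_service`, and `PaymentError` are application objects assumed to exist
    order_id = order.id
    log.info("order.processing", order_id=order_id, user_id=order.user_id)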
    try:
        result = payment_service.charge(order)
        log.info("order.completed", order_id=order_id, amount=result.amount)
    except PaymentError as e:
        log.error("order.failed", order_id=order_id, error=str(e), exc_info=True)
        raise

3. Traces (How did it happen?) Distributed traces connecting spans across services. OpenTelemetry is the standard.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
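# OTLP exporter: the opentelemetry-exporter-jaeger package is deprecated, and
# Jaeger ingests OTLP directly (gRPC on port 4317 by default)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317", insecure=True)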
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_order(order_id: str):
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        validate_inventory(order_id)      # creates child span
        process_payment(order_id)         # creates child span
        send_confirmation(order_id)       # creates child span

Q12. How do you set up Prometheus + Grafana for Kubernetes monitoring?

# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager + node exporters)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --values monitoring-values.yaml

# monitoring-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 50Gi

alertmanager:
  config:
    receivers:
    - name: slack
      slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/...'
        send_resolved: true
    route:
      receiver: slack
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      providers:
      - orgId: 1
        folder: 'Kubernetes'
        type: file
        options:
          path: /var/lib/grafana/dashboards/kubernetes
  dashboards:
    kubernetes:
      k8s-cluster:
        gnetId: 6417        # community dashboard from Grafana.com
        revision: 1
        datasource: Prometheus

Key Prometheus alert rules:

groups:
- name: kubernetes
  rules:
  - alert: PodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3   # more than 3 restarts in 15 minutes
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.pod }} is crash looping"

  - alert: HighCPUUsage
    expr: |
      sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
      / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod)
      > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} CPU usage above 90% of limit"

  - alert: PersistentVolumeFillingUp
    expr: |
      kubelet_volume_stats_available_bytes
      / kubelet_volume_stats_capacity_bytes < 0.15
    for: 5m
    labels:
      severity: warning

SRE Practices

Q13. What are SLIs, SLOs, and SLAs? How do error budgets work?

SLI (Service Level Indicator): A quantitative measure of a service aspect. The metric you measure.

Availability SLI = successful_requests / total_requests
Latency SLI = % requests served in < 200ms
Error rate SLI = error_requests / total_requests
Throughput SLI = requests per second

SLO (Service Level Objective): Target value for an SLI. Your internal commitment.

Availability SLO: 99.9% (allows 43.8 minutes downtime/month)
Latency SLO: 95% of requests served in < 200ms
Error rate SLO: < 0.1% error rate
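
The downtime figure follows directly from the SLO percentage. A short sketch of the arithmetic, assuming an average month of 30.44 days (which reproduces the 43.8-minute figure; a 30-day month gives 43.2):

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 30.44) -> float:
    """Minutes of full downtime an availability SLO permits over the period."""
    return (100.0 - slo_percent) / 100.0 * period_days * 24 * 60


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min/month")
```

Each extra nine cuts the allowance by a factor of ten, which is why the jump from 99.9% to 99.99% is usually an architectural change rather than a tuning exercise.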

SLA (Service Level Agreement): External contract with customers, with financial penalties for breach. Typically set looser than the SLO, leaving a buffer between the internal target and the external commitment.

SLO: 99.9% availability
SLA: 99.5% availability (with service credits if breached)

Error budget:

Monthly error budget = 100% - SLO = 0.1% of requests (for 99.9% SLO)
For 1M requests/month: error budget = 1,000 failed requests

Error budget consumed = actual errors / budget
If 500 errors this month: 50% budget consumed, 50% remaining

When error budget is exhausted:
  - Feature releases frozen until budget replenishes
  - Focus on reliability work only
  - Incident review and reliability improvement

When error budget has plenty remaining:
  - Deploy freely, take risks, ship features fast
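
The budget arithmetic and the resulting release decision can be sketched as follows. The function name and the two action strings are illustrative; a real policy would be agreed between the product and SRE teams.

```python
def error_budget_status(total_requests: int, failed_requests: int, slo: float = 0.999):
    """Return (budget_requests, fraction_consumed, action) for a request-based SLO."""
    budget = round(total_requests * (1 - slo))  # failures the SLO tolerates
    consumed = failed_requests / budget if budget > 0 else float("inf")
    action = "freeze feature releases" if consumed >= 1 else "keep shipping"
    return budget, consumed, action


print(error_budget_status(1_000_000, 500))  # (1000, 0.5, 'keep shipping')
```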

Error budget consumption (Prometheus query):

# Fraction of the 30-day error budget consumed (1.0 = budget exhausted)
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
)
/
0.001  # error budget fraction (1 - 0.999 SLO)

Q14. How do you conduct a blameless post-mortem?

Post-mortem structure:

## Incident: [Title]
**Date:** 2026-06-08
**Duration:** 47 minutes (14:23 - 15:10 UTC)
**Severity:** SEV-2 (partial service degradation)
**Author:** [Name]
**Status:** Action items in progress

## Impact
- 23% of API requests returned 503 errors
- Approximately 8,400 users affected
- No data loss

## Timeline (UTC)
14:23 -- PagerDuty alert: error rate > 5% (SLO breach)
14:27 -- On-call engineer acknowledges, begins investigation
14:31 -- Identified spike in database connection errors in logs
14:38 -- Checked recent deployments: config change deployed at 14:20
14:44 -- Identified: DB connection pool size reduced from 50 to 5 in deploy
14:47 -- Revert in progress
14:55 -- Revert deployed, error rate dropping
15:10 -- Error rate back to baseline, incident closed

## Root Cause
Connection pool size was accidentally set to 5 (from 50) in the deployment
at 14:20 due to a merge conflict resolution error in the config file.
The config change was not covered by automated tests.

## What Went Well
- Monitoring caught the issue quickly (alert fired 3 minutes after the deploy; acknowledged 4 minutes later)
- Runbook for DB connection issues was helpful
- Rollback procedure was well-documented and fast

## What Went Poorly
- No automated test for connection pool configuration
- Config change review did not catch the merge conflict artifact
- No staging environment test of the specific config path

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add automated test for DB connection pool config | Backend team | 2026-06-15 |
| Add connection pool size to config validation schema | DevOps | 2026-06-12 |
| Alert on connection pool exhaustion (not just errors) | SRE | 2026-06-10 |

Blameless principles:

Focus on systemic causes, not individual errors.
Share post-mortems widely -- learning opportunity for entire org.
Action items must have owners and due dates.
Measure: track action item completion rate, repeat incident rate.

Git and Version Control

Q15. What is GitOps, and how does it work with ArgoCD?

GitOps principle: Git is the single source of truth for both application code AND infrastructure state. Any desired change to production goes through a Git commit + PR review.

ArgoCD (pull-based GitOps for Kubernetes):

Developer commits K8s manifests to Git repo
     |
ArgoCD polls repo (or webhook triggers)
     |
ArgoCD compares desired state (Git) vs actual state (Kubernetes)
     |
ArgoCD applies diff to cluster (automated or manual approval)

# ArgoCD Application definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-production
  namespace: argocd
spec:
  project: production

  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: main
    path: apps/myapp/production

  destination:
    server: https://kubernetes.default.svc
    namespace: production

  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes to cluster
    syncOptions:
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground

    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m
        factor: 2

Benefits over push-based CI/CD:

Cluster state always matches Git (no configuration drift).
Full audit trail: every change is a Git commit with author + message.
Rollback = git revert (no special tooling).
Multi-cluster management from single ArgoCD instance.
Separation: CI builds artifacts, GitOps deploys them (decoupled).

Q16. What is a Git branching strategy, and which is best for DevOps?

GitFlow (complex, suitable for versioned releases):

main        -- production code, tagged releases
develop     -- integration branch
feature/*   -- feature development (from develop)
release/*   -- release preparation (from develop)
hotfix/*    -- urgent production fixes (from main)

GitHub Flow (simple, CI/CD-optimized):

main        -- always deployable
feature/*   -- short-lived feature branches (from main)
             -- PR -> review -> merge to main -> auto-deploy

Trunk-Based Development (fastest, requires feature flags):

main (trunk)  -- only long-lived branch; CI/CD deploys every commit
feature flags -- control which features are enabled per user
              -- merge incomplete features behind a disabled flag
              -- enable gradually (canary -> 100%)

DevOps recommendation: Trunk-based development + feature flags for high-velocity teams. GitHub Flow for most teams. GitFlow only if you ship versioned software with independent release cycles.
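
Gradual enablement behind a feature flag is usually done by bucketing users deterministically. The sketch below shows one common approach (hashing the flag name and user ID into a 0-99 bucket); it is an illustration, not the API of any particular feature-flag product, and the flag and user names are made up.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0-99 and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


users = [f"user-{n}" for n in range(1000)]
enabled_at_10 = {u for u in users if is_enabled("new-checkout", u, 10)}
enabled_at_50 = {u for u in users if is_enabled("new-checkout", u, 50)}
print(len(enabled_at_10), len(enabled_at_50))
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% keeps seeing it as the rollout widens to 50% and 100%.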

Real-World Scenarios

Q17. How would you set up a complete CI/CD pipeline for a microservices application on Kubernetes?

Requirements: 5 microservices, independent deployments, automated testing, production safety, GitOps.

Toolchain:

Source control: GitHub
CI: GitHub Actions (test, build, push image)
Image registry: Docker Hub or ECR or GCR
CD: ArgoCD (GitOps pull-based)
K8s: EKS or GKE
Config management: Helm charts in separate manifests repo
Secrets: External Secrets Operator (pulls from AWS Secrets Manager / Vault)
Monitoring: Prometheus + Grafana

Pipeline per microservice:

PR opened:
  -> GitHub Actions: lint, unit tests, integration tests
  -> Post test results as PR check (block merge if failed)

PR merged to main:
  -> GitHub Actions: build Docker image, tag with commit SHA
  -> Push to ECR
  -> Update image tag in k8s-manifests repo (automated PR or direct commit)
  
k8s-manifests repo updated:
  -> ArgoCD detects change
  -> Deploys to staging automatically
  -> Smoke tests run
  -> Manual approval gate for production
  -> ArgoCD deploys to production
  -> Monitors deployment health (Argo Rollouts canary)
  -> Auto-rollback if error rate spikes

Helm chart structure:

charts/
  myservice/
    Chart.yaml
    values.yaml               -- defaults
    values-staging.yaml       -- staging overrides
    values-production.yaml    -- production overrides
    templates/
      deployment.yaml
      service.yaml
      ingress.yaml
      hpa.yaml
      pdb.yaml                -- PodDisruptionBudget
      serviceaccount.yaml
      configmap.yaml

Q18. How do you handle secrets in Kubernetes securely?

The problem with native Kubernetes Secrets:

# Kubernetes Secrets are base64-encoded (NOT encrypted at rest by default)
kubectl get secret myapp-secret -o yaml
# apiVersion: v1
# data:
#   db_password: cGFzc3dvcmQxMjM=  <-- just base64, trivially decoded
#   api_key: c2VjcmV0a2V5MTIz

echo "cGFzc3dvcmQxMjM=" | base64 -d  # -> password123

Approach 1: etcd encryption at rest (minimum baseline)

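# EncryptionConfiguration file, referenced by the API server's
# --encryption-provider-config flag
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
    - secrets
    providers:
    # The first provider encrypts new writes. The Kubernetes docs prefer a KMS
    # provider over aescbc where one is available.
    - aescbc:
        keys:
        - name: key1
          secret: <base64-encoded-32-byte-key>
    # identity (no encryption) stays last so existing plaintext Secrets remain readable
    - identity: {}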
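
The `<base64-encoded-32-byte-key>` placeholder needs 32 random bytes, base64-encoded. One way to produce such a value, sketched in Python with the standard library; the function name `generate_encryption_key` is illustrative, and the generated key should be guarded as carefully as any other root secret:

```python
import base64
import secrets


def generate_encryption_key(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Paste the printed value into the `secret:` field of the provider key.
    print(generate_encryption_key())
```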

Approach 2: External Secrets Operator (recommended for production)

# ExternalSecret: pulls secret from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-db-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore

  target:
    name: myapp-db-credentials  # creates a native K8s Secret
    creationPolicy: Owner

  data:
  - secretKey: DB_PASSWORD
    remoteRef:
      key: myapp/production/database
      property: password
  - secretKey: DB_USER
    remoteRef:
      key: myapp/production/database
      property: username

Approach 3: HashiCorp Vault + Vault Agent Injector

# Pod annotations: Vault Agent renders the secret to a file at startup
# (paths assume a KV v2 engine mounted at "secret/", hence the "data/" segment
#  and the .Data.data lookup in the template)
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "myapp"
  vault.hashicorp.com/agent-inject-secret-db: "secret/data/myapp/database"
  vault.hashicorp.com/agent-inject-template-db: |
    {{- with secret "secret/data/myapp/database" -}}
    DB_PASSWORD={{ .Data.data.password }}
    {{- end }}
# Rendered file is available at /vault/secrets/db inside the container
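
With a template like the one above, the application finds a `KEY=VALUE` file at `/vault/secrets/db`. A minimal Python sketch of reading it, assuming one pair per line as rendered by that template; `load_secret_file` is a hypothetical helper, not part of any Vault client library:

```python
from pathlib import Path


def load_secret_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from a rendered secret file into a dict."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = line.partition("=")
        if not separator:
            # Do not echo the line itself: it may contain secret material.
            raise ValueError("malformed line: expected KEY=VALUE")
        values[key.strip()] = value
    return values
```

An application would call `load_secret_file("/vault/secrets/db")["DB_PASSWORD"]` at startup. Reading from the file instead of an environment variable keeps the value out of process listings and crash dumps that capture the environment.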

FAQ

Q: What is the difference between Docker Compose and Kubernetes?

Docker Compose is a tool for defining and running multi-container applications on a single host. It is used for local development and simple single-server deployments. Kubernetes is a container orchestration system for production workloads: multi-node scheduling, auto-scaling, self-healing, rolling updates, service discovery, and storage orchestration across a cluster. Use Docker Compose for local dev; use Kubernetes for production. Tools like Kompose can convert Compose files to Kubernetes manifests.

Q: What is toil in SRE, and how do you reduce it?

Toil is manual, repetitive, tactical operational work that scales with service growth and provides no lasting value. Examples: manually restarting failed services, manually approving routine deployments, manually responding to false-positive alerts. SRE teams aim to keep toil below 50% of work time; the rest goes to engineering work (automation, reliability improvements). Reduce toil by: automating repetitive tasks, eliminating flaky alerts, implementing self-healing (Kubernetes restarts failed pods), and building tooling that removes human steps from standard workflows.

Q: How do you implement zero-downtime deployments in Kubernetes?

Key requirements: readiness probes (traffic is sent only to ready pods), a PodDisruptionBudget (limits how many pods can be unavailable during voluntary disruptions such as a node drain), a rolling update strategy (maxUnavailable: 0, maxSurge: 1), a preStop hook plus terminationGracePeriodSeconds (lets in-flight requests complete), and graceful shutdown in application code (handle SIGTERM, drain connections before exit). Candidates report that missing readiness probes are the most common cause of deployment-related downtime in Kubernetes. Confirm current best practices against the official Kubernetes documentation.
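
The "handle SIGTERM, drain connections before exit" requirement can be sketched in Python as follows. This is an illustration under assumptions: `GracefulShutdown` and `run_worker` are hypothetical names, and the work loop stands in for whatever request handling the real service does.

```python
import signal


class GracefulShutdown:
    """Records that SIGTERM arrived so the main loop can stop cleanly."""

    def __init__(self) -> None:
        self.requested = False

    def install(self) -> None:
        # Must be called from the main thread. SIGTERM is the signal a
        # container receives when its pod starts terminating.
        signal.signal(signal.SIGTERM, self.handle_signal)

    def handle_signal(self, signum, frame) -> None:
        # Only set a flag here; the real cleanup happens in the main loop.
        self.requested = True


def run_worker(shutdown: GracefulShutdown, work_items, process) -> list:
    """Process items until shutdown is requested; never abandon one mid-flight."""
    completed = []
    for item in work_items:
        if shutdown.requested:
            break  # stop taking new work; anything already started has finished
        completed.append(process(item))
    return completed
```

A service would call `install()` once at startup and pass the same object to its main loop. The drain has to finish within terminationGracePeriodSeconds, because the container is killed with SIGKILL once that period expires.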

Sources and review notes (reviewed 8 Jun 2026)

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.


