DevOps Engineer Interview Questions 2026: CI/CD, Containers, IaC & SRE

What changed in 2026 drives
Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
- One real project you can defend end-to-end - file paths, design decisions, and what you would change
- One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Candidates report that DevOps Engineer interviews in 2026 emphasize Kubernetes operations, Terraform IaC, CI/CD pipeline design, observability, and incident management. Confirm exact tool versions and practices expected on the official company careers portal before your interview.
DevOps Engineering covers the full software delivery lifecycle: CI/CD, containerization, orchestration, infrastructure automation, monitoring, and reliability engineering. This guide covers core concepts and real-world scenarios.
CI/CD Fundamentals
Q1. What is CI/CD, and what are the stages of a typical pipeline?
Continuous Integration (CI): Developers merge code frequently (multiple times/day), triggering automated build and test to detect integration issues early.
Continuous Delivery (CD): Every passing CI build is a releasable artifact. Deployment to production requires a manual approval gate.
Continuous Deployment: Every passing CI build deploys automatically to production. No manual gate.
Typical pipeline stages:
# GitHub Actions example pipeline
name: CI/CD Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Lint
        run: flake8 src/ tests/
      - name: Unit tests
        run: pytest tests/unit/ --cov=src --cov-report=xml
      - name: Integration tests
        run: pytest tests/integration/ -m integration
      - name: Security scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'HIGH,CRITICAL'
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Push to registry
        run: |
          echo ${{ secrets.REGISTRY_PASSWORD }} | docker login -u ${{ secrets.REGISTRY_USER }} --password-stdin
          docker push myapp:${{ github.sha }}
          docker tag myapp:${{ github.sha }} myapp:latest
          docker push myapp:latest
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Deploy to staging
        run: |
          kubectl set image deployment/myapp \
            myapp=myapp:${{ github.sha }} \
            --namespace=staging
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production # requires approval
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=myapp:${{ github.sha }} \
            --namespace=production
Key pipeline principles:
- Fail fast: run fast tests (unit) before slow tests (integration).
- Artifact immutability: same Docker image promoted staging -> production (never rebuild).
- Every merge to main should be deployable.
- Pipeline as code: version-controlled alongside application code.
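The fail-fast principle can be illustrated with a small stage runner. This is a minimal sketch, not how any particular CI system is implemented; the stage names and the `run_pipeline` helper are hypothetical.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure (fail fast).

    Returns the names of the stages that actually ran.
    """
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            break  # later (usually slower) stages are skipped
    return executed


# Hypothetical stages: cheap checks first, expensive ones last.
stages = [
    ("lint", lambda: True),
    ("unit", lambda: False),        # simulated failure
    ("integration", lambda: True),  # never reached
]
print(run_pipeline(stages))  # ['lint', 'unit']
```

Ordering the cheap stages first means a broken commit is rejected before the slow integration suite consumes runner time.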
Q2. What are deployment strategies, and what are the trade-offs?
Rolling Update (default Kubernetes):
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra pods during update (max total = 125%)
      maxUnavailable: 25%  # pods that can be unavailable (min available = 75%)
- Gradual replacement of old pods with new pods.
- Zero-downtime if health checks are correct.
- Risk: both versions running simultaneously during update (API backward compatibility required).
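To see what the percentages mean for a concrete replica count, the sketch below converts them to pod counts. It relies on the documented Kubernetes rounding rules (maxSurge rounds up, maxUnavailable rounds down); the function name is hypothetical.

```python
import math


def rolling_update_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = math.ceil(replicas * max_surge_pct / 100)               # rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounds down
    return replicas + surge, replicas - unavailable


print(rolling_update_bounds(10, 25, 25))  # (13, 8)
```

With 10 replicas and 25%/25%, the rollout may run up to 13 pods at once and must keep at least 8 available.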
Blue/Green:
# Blue service (current production)
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # <-- switch to green to deploy
---
# Blue deployment (keep running until verified)
# Abbreviated: metadata and pod template omitted for brevity
apiVersion: apps/v1
kind: Deployment
spec:
  selector:
    matchLabels:
      app: myapp
      version: blue
---
# Green deployment (new version)
# Abbreviated: metadata and pod template omitted for brevity
apiVersion: apps/v1
kind: Deployment
spec:
  selector:
    matchLabels:
      app: myapp
      version: green
- Instant cutover: update Service selector.
- Instant rollback: revert Service selector.
- Requires 2x capacity.
Canary:
# 90% production, 10% canary
# Production: 9 replicas
# Canary: 1 replica
# Traffic split by replica ratio (or Istio/Argo Rollouts for precise %)
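A quick calculation shows why replica-ratio canaries are coarse. This sketch assumes the Service spreads traffic evenly across ready pods, which is only approximately true in practice; `canary_share` is a hypothetical helper.

```python
def canary_share(stable_replicas: int, canary_replicas: int) -> float:
    """Approximate fraction of traffic reaching the canary pods."""
    total = stable_replicas + canary_replicas
    return canary_replicas / total


print(canary_share(9, 1))  # 0.1
print(canary_share(2, 1))  # one of three pods: about a third of traffic
```

With only a few replicas the smallest possible canary step is large, which is the case for a traffic-shaping layer that sets weights independently of pod counts.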
Use Argo Rollouts for percentage-based canary with automatic rollback:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10  # 10% canary
        - pause: {duration: 5m}
        - analysis:      # auto-rollback if error rate > threshold
            templates:
              - templateName: error-rate
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
Recreate:
- Stop all old pods, start all new pods.
- Downtime during restart.
- Use only when: breaking changes prevent both versions running simultaneously.
Docker
Q3. What are Docker layers, and how do you write efficient Dockerfiles?
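Each RUN, COPY, and ADD instruction creates a new layer. Layers are cached: when an instruction or its inputs change, that layer and every layer after it are rebuilt, while the layers before it are reused from cache.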
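The effect of instruction order on cache reuse can be modelled in a few lines. This is a simplified sketch (real builders hash instruction text and file contents); the instruction lists and `layers_to_rebuild` are illustrative only.

```python
from typing import List


def layers_to_rebuild(instructions: List[str], first_changed: int) -> List[str]:
    """Everything from the first changed instruction onward is rebuilt."""
    return instructions[first_changed:]


code_first = [
    "FROM python:3.11",
    "WORKDIR /app",
    "COPY . .",
    "RUN pip install -r requirements.txt",
]
deps_first = [
    "FROM python:3.11-slim",
    "WORKDIR /app",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY src/ src/",
]

# A source-code edit invalidates "COPY . ." (index 2) in the first layout
# but only "COPY src/ src/" (index 4) in the second.
print(layers_to_rebuild(code_first, 2))  # includes the pip install layer
print(layers_to_rebuild(deps_first, 4))  # only the code copy
```

Copying the dependency manifest before the source code keeps the expensive install layer cached across ordinary code changes.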
Inefficient Dockerfile:
FROM python:3.11
WORKDIR /app
# BAD: copies everything, including frequently changing source code
COPY . .
# Runs AFTER the code copy, so any source change invalidates the pip layer
RUN pip install -r requirements.txt
Efficient Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Copy dependencies first (cached unless requirements.txt changes)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code last (changes frequently -- invalidates only this layer)
COPY src/ src/
COPY config/ config/
# Non-root user (security best practice)
RUN useradd --create-home appuser
USER appuser
# EXPOSE only documents the port; publishing it still requires -p/--publish at run time
EXPOSE 8080
# Use exec form (no shell wrapping, proper signal handling)
CMD ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]
Multi-stage build (reduce image size):
# Stage 1: Build
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Stage 2: Runtime (only production artifacts)
FROM node:18-alpine AS runtime
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
# Final image: ~150 MB instead of ~1.2 GB
Security best practices:
# Pin base image digest (not just tag -- tags can be overwritten)
FROM python:3.11-slim@sha256:abc123...
# No SUID binaries in user-facing code
# Scan with: trivy image myapp:latest
# or: docker scout cves myapp:latest
Q4. How do Docker networking modes work?
# Bridge network (default): containers on same host communicate via virtual bridge
# Container gets private IP (172.17.0.x)
docker run --network bridge myapp
# Host network: container shares host's network namespace (no isolation)
# Use for: extreme performance, monitoring agents needing host network visibility
docker run --network host myapp
# Custom bridge network (better than default bridge -- DNS by container name)
docker network create myapp-net
docker run --network myapp-net --name api myapi:latest
docker run --network myapp-net --name db postgres:15
# 'api' container can reach 'db' container by name: postgres://db:5432/
# No network (fully isolated, no external access)
docker run --network none mybatch
# Overlay network (multi-host, Docker Swarm)
docker network create --driver overlay my-overlay
# macvlan: container gets MAC address, appears as physical device on network
# Use for: legacy apps requiring direct network access, specific MAC requirements
Kubernetes
Q5. Explain the Kubernetes architecture and key components.
Control Plane (master):
kube-apiserver -- REST API, single entry point for all management
etcd -- distributed key-value store, cluster state/config
kube-scheduler -- assigns Pods to Nodes (based on resources, affinity, taints)
kube-controller-manager -- runs controllers: ReplicaSet, Deployment, Namespace, etc.
cloud-controller-manager -- cloud-specific controllers (load balancer provisioning)
Worker Nodes:
kubelet -- node agent, ensures Pods from PodSpec are running
kube-proxy -- network rules, iptables/IPVS for Service routing
Container Runtime -- containerd, CRI-O (runs containers)
Add-ons:
CoreDNS -- cluster DNS (service discovery)
CNI plugin -- networking (Calico, Flannel, Cilium)
CSI driver -- storage (AWS EBS, GCE PD, Ceph)
Request flow (kubectl apply -> Pod running):
1. kubectl apply -> kube-apiserver (validates + stores in etcd)
2. Deployment controller creates ReplicaSet
3. ReplicaSet controller creates Pod objects (no node assigned)
4. kube-scheduler watches for unscheduled Pods -> assigns Node
5. kubelet on target Node watches for Pods assigned to it
6. kubelet calls container runtime -> pulls image -> starts container
7. kubelet reports status back to apiserver -> etcd updated
Q6. What are Kubernetes resource types, and when do you use each?
| Resource | Use case |
|---|---|
| Pod | Atomic unit, rarely created directly |
| Deployment | Stateless apps, rolling updates |
| StatefulSet | Stateful apps needing stable identity (databases, Kafka) |
| DaemonSet | One pod per node (log collectors, monitoring agents) |
| Job | One-time batch task (completes and exits) |
| CronJob | Scheduled batch tasks (like Unix cron) |
| Service | Stable network endpoint for Pods |
| Ingress | HTTP(S) routing, TLS termination, virtual hosts |
| ConfigMap | Non-sensitive configuration data |
| Secret | Sensitive data (passwords, tokens, certs) |
| PersistentVolumeClaim | Request for persistent storage |
| HorizontalPodAutoscaler | Scale replicas based on metrics |
| NetworkPolicy | Pod-level firewall rules |
StatefulSet example (Kafka):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless  # headless Service for DNS (kafka-0.kafka-headless)
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.6.0
          env:
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 100Gi
Q7. How do Kubernetes probes work, and why are they important?
spec:
  containers:
    - name: app
      image: myapp:v1
      # Liveness probe: restart container if it fails
      # Use for: deadlocked processes, infinite loop bugs
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 10  # wait before first probe
        periodSeconds: 10        # probe interval
        timeoutSeconds: 5
        failureThreshold: 3      # 3 consecutive failures -> restart
      # Readiness probe: remove from Service endpoints if it fails
      # Use for: slow startup, DB connection not ready, downstream dependency down
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3      # 3 failures -> remove from LB, no more traffic
      # Startup probe: disable liveness until app finishes starting
      # Use for: slow-starting apps (Java Spring Boot, ML model loading)
      startupProbe:
        httpGet:
          path: /health/startup
          port: 8080
        failureThreshold: 30     # 30 * 10s = 300s max startup time
        periodSeconds: 10
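The probe settings translate into reaction times with simple arithmetic. The helper below is a rough estimate that mirrors the `30 * 10s` calculation in the startup probe comment; it ignores `timeoutSeconds` and scheduling jitter, and the function name is hypothetical.

```python
def worst_case_seconds(period_seconds: int, failure_threshold: int,
                       initial_delay_seconds: int = 0) -> int:
    """Approximate time until a probe gives up on a container that never passes."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Startup probe from the manifest: the app gets about 300s to start.
print(worst_case_seconds(period_seconds=10, failure_threshold=30))  # 300

# Liveness probe: a hung container is restarted after roughly 40s.
print(worst_case_seconds(period_seconds=10, failure_threshold=3,
                         initial_delay_seconds=10))  # 40
```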
Health endpoint implementation:
from flask import Flask, jsonify
import threading

app = Flask(__name__)
is_ready = threading.Event()  # set once startup work has finished

@app.route('/health/live')
def liveness():
    # Lightweight -- just verify process is alive
    return jsonify({'status': 'alive'}), 200

@app.route('/health/ready')
def readiness():
    # Check dependencies: DB connected, cache connected, model loaded
    # `db` is the application's database client, defined elsewhere (not shown)
    if not db.is_connected():
        return jsonify({'status': 'not ready', 'reason': 'db disconnected'}), 503
    if not is_ready.is_set():
        return jsonify({'status': 'not ready', 'reason': 'initializing'}), 503
    return jsonify({'status': 'ready'}), 200
Q8. How does Kubernetes handle resource requests, limits, and QoS classes?
spec:
  containers:
    - name: app
      resources:
        requests:          # Guaranteed minimum -- used for scheduling decisions
          cpu: "500m"      # 0.5 vCPU
          memory: "512Mi"
        limits:            # Maximum -- container killed/throttled if exceeded
          cpu: "2000m"     # 2 vCPUs
          memory: "1Gi"
QoS Classes (affects eviction priority during memory pressure):
| Class | Condition | Eviction priority |
|---|---|---|
| Guaranteed | CPU and memory requests == limits for every container | Last evicted |
| Burstable | At least one request or limit is set, but the Guaranteed condition is not met | Middle |
| BestEffort | No requests or limits set on any container | First evicted |
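The classification rules in the table can be expressed as a short function. This is a simplified sketch: it compares quantity strings literally (so "500m" and "0.5" would not be treated as equal) and ignores init containers; `qos_class` is a hypothetical helper, not a Kubernetes API.

```python
from typing import Dict, List

# Each container is {"requests": {...}, "limits": {...}}; both keys are optional.
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


burstable = [{"requests": {"cpu": "500m", "memory": "512Mi"},
              "limits": {"cpu": "2000m", "memory": "1Gi"}}]
print(qos_class(burstable))  # Burstable
```

The example manifest above (requests below limits) therefore lands in the Burstable class.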
CPU throttling vs memory OOM:
- CPU: the container is throttled (slowed), not killed. Throttling is visible in the cAdvisor metric container_cpu_cfs_throttled_seconds_total.
- Memory: the container is killed and reported as OOMKilled (exit code 137). Set limits conservatively with 20-30% headroom.
Vertical Pod Autoscaler (VPA):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"  # Auto-update requests; Off = recommendations only
Infrastructure as Code
Q9. How does Terraform work, and what is its state management model?
Terraform uses a declarative model: you define desired infrastructure state; Terraform calculates and applies the diff.
Core workflow:
terraform init # download providers, initialize backend
terraform plan # compute diff (desired vs current state)
terraform apply # apply changes (with confirmation)
terraform destroy # tear down all managed resources
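Conceptually, the plan step is a diff between desired and current state. The sketch below is a toy model under that assumption: real Terraform also refreshes state from the provider, orders changes by dependency, and may replace rather than update a resource. The resource addresses and the `plan` function are illustrative.

```python
from typing import Dict, List


def plan(desired: Dict[str, dict], current: Dict[str, dict]) -> Dict[str, List[str]]:
    """Classify resources into create / update / destroy by comparing two states."""
    create = sorted(k for k in desired if k not in current)
    destroy = sorted(k for k in current if k not in desired)
    update = sorted(k for k in desired if k in current and desired[k] != current[k])
    return {"create": create, "update": update, "destroy": destroy}


desired = {
    "aws_instance.web": {"instance_type": "t3.small"},
    "aws_s3_bucket.logs": {},
}
current = {
    "aws_instance.web": {"instance_type": "t3.micro"},
    "aws_instance.old": {},
}
print(plan(desired, current))
```

Running the same diff twice against an unchanged world yields an empty plan, which is what makes the declarative model safe to re-apply.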
State file:
- terraform.tfstate: JSON representation of all managed resources.
- Maps Terraform config to real infrastructure IDs.
- Never edit manually.
- Store in remote backend (S3, GCS, Terraform Cloud) -- never local for teams.
# Remote backend (S3 + DynamoDB for locking)
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/webapp/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
Example: provision AWS EKS cluster with Terraform:
# variables.tf
variable "cluster_name" {
  default = "my-eks-cluster"
}

variable "region" {
  default = "us-east-1"
}

# main.tf
provider "aws" {
  region = var.region
}

module "vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "5.0.0"
  name               = "${var.cluster_name}-vpc"
  cidr               = "10.0.0.0/16"
  azs                = ["${var.region}a", "${var.region}b", "${var.region}c"]
  private_subnets    = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets     = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = false # one per AZ for HA
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "20.0.0"
  cluster_name    = var.cluster_name
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m7g.xlarge"]
      ami_type       = "AL2_ARM_64" # m7g is Graviton (arm64), so the AMI type must match
      min_size       = 2
      max_size       = 10
      desired_size   = 3
    }
  }
}

output "cluster_endpoint" {
  value = module.eks.cluster_endpoint
}
Terraform best practices:
- Modules: encapsulate reusable infrastructure components
- Workspaces or separate state per environment (dev/staging/prod)
- Remote state locking: prevent concurrent applies
- terraform plan in CI: post plan output as PR comment
- Sentinel / OPA policies: enforce compliance before apply
- Import: import existing resources into state without destroying them
- Lifecycle rules: prevent_destroy for production databases
Q10. What is Ansible, and how does it differ from Terraform?
| Aspect | Terraform | Ansible |
|---|---|---|
| Primary use | Infrastructure provisioning | Configuration management + application deployment |
| Approach | Declarative (desired state) | Procedural (ordered tasks in playbooks) |
| State | Managed (tfstate file) | Stateless (re-runs are idempotent by task design) |
| Agentless | Yes (runs from a workstation or CI runner and calls provider APIs) | Yes (SSH/WinRM to target hosts) |
| Cloud IaC | Excellent | Limited (modules exist but Terraform is better) |
| OS config | Limited | Excellent (packages, files, services, users) |
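The "idempotent by task design" entry in the table means each task checks the current state before acting and reports whether it changed anything. A minimal sketch of that pattern, loosely modelled on what a line-in-file style task does; `ensure_line` and the file name are hypothetical.

```python
import os
import tempfile


def ensure_line(path: str, line: str) -> bool:
    """Make sure `line` is present in the file. Return True only if the file changed."""
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if line in content.splitlines():
        return False  # already in the desired state: do nothing
    prefix = "" if content == "" or content.endswith("\n") else "\n"
    with open(path, "a") as f:
        f.write(prefix + line + "\n")
    return True


conf = os.path.join(tempfile.mkdtemp(), "app.conf")
print(ensure_line(conf, "port=80"))  # True  (changed)
print(ensure_line(conf, "port=80"))  # False (second run is a no-op)
```

The changed/unchanged return value plays the role of Ansible's "changed" status, which is also what decides whether a handler gets notified.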
Ansible playbook example (configure web servers):
---
- name: Configure web servers
  hosts: webservers
  become: yes  # sudo
  vars:
    nginx_port: 80
    app_user: appuser
  tasks:
    - name: Install Nginx
      apt:
        name: nginx
        state: present
        update_cache: yes
    - name: Copy Nginx config
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: Restart Nginx
    - name: Create application user
      user:
        name: "{{ app_user }}"
        state: present
        shell: /bin/bash
        create_home: yes
    - name: Deploy application
      git:
        repo: https://github.com/myorg/myapp.git
        dest: /opt/myapp
        version: "{{ app_version | default('main') }}"
      notify: Restart app service
    - name: Ensure app service is running
      systemd:
        name: myapp
        state: started
        enabled: yes
  handlers:
    - name: Restart Nginx
      systemd:
        name: nginx
        state: restarted
    - name: Restart app service
      systemd:
        name: myapp
        state: restarted
Monitoring and Observability
Q11. What are the three pillars of observability?
1. Metrics (What is happening?) Numerical time-series measurements. Prometheus is the de facto standard.
from flask import Flask
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)
start_http_server(8000)  # serve /metrics; the port number is an assumption

# Counters: monotonically increasing
http_requests = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])
# Histograms: latency distributions (automatically creates _count, _sum, _bucket)
request_duration = Histogram('http_request_duration_seconds', 'Request latency',
                             ['method', 'endpoint'],
                             buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5])
# Gauges: current value (can go up/down)
active_connections = Gauge('active_connections', 'Current active connections')

@app.route('/api/users')
def get_users():
    # `db` is the application's data-access layer, defined elsewhere (not shown)
    with request_duration.labels('GET', '/api/users').time():
        result = db.query_users()
    http_requests.labels('GET', '/api/users', '200').inc()
    return result
2. Logs (What happened?) Structured event records. ELK Stack (Elasticsearch + Logstash + Kibana) or Loki + Grafana.
import structlog

log = structlog.get_logger()

# `order`, `payment_service`, and `PaymentError` come from the application (not shown)
def process_order(order):
    log.info("order.processing", order_id=order.id, user_id=order.user_id)
    try:
        result = payment_service.charge(order)
        log.info("order.completed", order_id=order.id, amount=result.amount)
    except PaymentError as e:
        log.error("order.failed", order_id=order.id, error=str(e), exc_info=True)
        raise
3. Traces (How did it happen?) Distributed traces connecting spans across services. OpenTelemetry is the standard.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# The Jaeger Thrift exporter is deprecated in the OpenTelemetry Python SDK;
# Jaeger accepts OTLP directly, so the OTLP exporter is used here instead.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint='http://jaeger:4317', insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def handle_order(order_id: str):
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        validate_inventory(order_id)  # creates child span
        process_payment(order_id)     # creates child span
        send_confirmation(order_id)   # creates child span
Q12. How do you set up Prometheus + Grafana for Kubernetes monitoring?
# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager + node exporters)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--values monitoring-values.yaml
# monitoring-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 50Gi
alertmanager:
  config:
    receivers:
      - name: slack
        slack_configs:
          - channel: '#alerts'
            api_url: 'https://hooks.slack.com/services/...'
            send_resolved: true
    route:
      receiver: slack
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      providers:
        - orgId: 1
          folder: 'Kubernetes'
          type: file
          options:
            path: /var/lib/grafana/dashboards/kubernetes
  dashboards:
    kubernetes:
      k8s-cluster:
        gnetId: 6417  # community dashboard from Grafana.com
        revision: 1
        datasource: Prometheus
Key Prometheus alert rules:
groups:
  - name: kubernetes
    rules:
      - alert: PodCrashLooping
        # rate() is per second, so "> 0.1" would need ~30 restarts in 5 minutes;
        # counting restarts over a window is easier to reason about
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: HighCPUUsage
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
          / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod)
          > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} CPU usage above 90% of limit"
      - alert: PersistentVolumeFillingUp
        expr: |
          kubelet_volume_stats_available_bytes
          / kubelet_volume_stats_capacity_bytes < 0.15
        for: 5m
        labels:
          severity: warning
SRE Practices
Q13. What are SLIs, SLOs, and SLAs? How do error budgets work?
SLI (Service Level Indicator): A quantitative measure of a service aspect. The metric you measure.
Availability SLI = successful_requests / total_requests
Latency SLI = % requests served in < 200ms
Error rate SLI = error_requests / total_requests
Throughput SLI = requests per second
SLO (Service Level Objective): Target value for an SLI. Your internal commitment.
Availability SLO: 99.9% (allows 43.8 minutes downtime/month)
Latency SLO: 95% of requests served in < 200ms
Error rate SLO: < 0.1% error rate
SLA (Service Level Agreement): External contract with customers, with financial penalties for breach. Typically set looser than the SLO, which leaves a buffer between the internal target and the external commitment.
SLO: 99.9% availability
SLA: 99.5% availability (with service credits if breached)
Error budget:
Monthly error budget = 100% - SLO = 0.1% of requests (for 99.9% SLO)
For 1M requests/month: error budget = 1,000 failed requests
Error budget consumed = actual errors / budget
If 500 errors this month: 50% budget consumed, 50% remaining
When error budget is exhausted:
- Feature releases frozen until budget replenishes
- Focus on reliability work only
- Incident review and reliability improvement
When error budget has plenty remaining:
- Deploy freely, take risks, ship features fast
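The arithmetic above is easy to reproduce. This is a small sketch with hypothetical helper names; the downtime figure assumes an average month of 365.25 / 12 days, which is how 99.9% works out to about 43.8 minutes.

```python
def allowed_failures(slo: float, total_requests: int) -> int:
    """Number of failed requests the error budget permits."""
    return round((1 - slo) * total_requests)


def budget_consumed(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget already used (1.0 = exhausted)."""
    return failed_requests / allowed_failures(slo, total_requests)


def downtime_minutes(slo: float, days: float = 365.25 / 12) -> float:
    """Allowed downtime per period, in minutes, rounded to one decimal."""
    return round((1 - slo) * days * 24 * 60, 1)


print(allowed_failures(0.999, 1_000_000))      # 1000
print(budget_consumed(0.999, 1_000_000, 500))  # 0.5
print(downtime_minutes(0.999))                 # 43.8
```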
Error budget policy (Prometheus query):
# Monthly error budget consumption
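# Parentheses matter: division binds tighter than subtraction in PromQL,
# so the error ratio must be grouped before dividing by the budget.
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
)
/
0.001  # error budget fraction (1 - 0.999 SLO)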
Q14. How do you conduct a blameless post-mortem?
Post-mortem structure:
## Incident: [Title]
**Date:** 2026-06-08
**Duration:** 47 minutes (14:23 - 15:10 UTC)
**Severity:** SEV-2 (partial service degradation)
**Author:** [Name]
**Status:** Action items in progress
## Impact
- 23% of API requests returned 503 errors
- Approximately 8,400 users affected
- No data loss
## Timeline (UTC)
14:23 -- PagerDuty alert: error rate > 5% (SLO breach)
14:27 -- On-call engineer acknowledges, begins investigation
14:31 -- Identified spike in database connection errors in logs
14:38 -- Checked recent deployments: config change deployed at 14:20
14:44 -- Identified: DB connection pool size reduced from 50 to 5 in deploy
14:47 -- Revert in progress
14:55 -- Revert deployed, error rate dropping
15:10 -- Error rate back to baseline, incident closed
## Root Cause
Connection pool size was accidentally set to 5 (from 50) in the deployment
at 14:20 due to a merge conflict resolution error in the config file.
The config change was not covered by automated tests.
## What Went Well
- Monitoring caught the issue quickly (alert fired 3 minutes after the deploy)
- Runbook for DB connection issues was helpful
- Rollback procedure was well-documented and fast
## What Went Poorly
- No automated test for connection pool configuration
- Config change review did not catch the merge conflict artifact
- No staging environment test of the specific config path
## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add automated test for DB connection pool config | Backend team | 2026-06-15 |
| Add connection pool size to config validation schema | DevOps | 2026-06-12 |
| Alert on connection pool exhaustion (not just errors) | SRE | 2026-06-10 |
Blameless principles:
- Focus on systemic causes, not individual errors.
- Share post-mortems widely -- learning opportunity for entire org.
- Action items must have owners and due dates.
- Measure: track action item completion rate, repeat incident rate.
Git and Version Control
Q15. What is GitOps, and how does it work with ArgoCD?
GitOps principle: Git is the single source of truth for both application code AND infrastructure state. Any desired change to production goes through a Git commit + PR review.
ArgoCD (pull-based GitOps for Kubernetes):
Developer commits K8s manifests to Git repo
|
ArgoCD polls repo (or webhook triggers)
|
ArgoCD compares desired state (Git) vs actual state (Kubernetes)
|
ArgoCD applies diff to cluster (automated or manual approval)
# ArgoCD Application definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-production
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: main
    path: apps/myapp/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual changes to cluster
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m
        factor: 2
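The retry settings describe an exponential backoff. The sketch below works out the delays they imply, assuming the delay is multiplied by `factor` after each failed attempt and capped at `maxDuration`; the function name is hypothetical.

```python
from typing import List


def backoff_schedule(limit: int, duration_s: int, factor: int, max_duration_s: int) -> List[int]:
    """Delay in seconds before each retry attempt."""
    return [min(duration_s * factor ** attempt, max_duration_s)
            for attempt in range(limit)]


# limit: 5, duration: 5s, factor: 2, maxDuration: 3m (180s)
print(backoff_schedule(5, 5, 2, 180))  # [5, 10, 20, 40, 80]
```

With these values the cap is never reached; it would start to apply from the seventh retry onward if the limit were raised.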
Benefits over push-based CI/CD:
- Cluster state always matches Git (no configuration drift).
- Full audit trail: every change is a Git commit with author + message.
- Rollback = git revert (no special tooling).
- Multi-cluster management from single ArgoCD instance.
- Separation: CI builds artifacts, GitOps deploys them (decoupled).
Q16. What is a Git branching strategy, and which is best for DevOps?
GitFlow (complex, suitable for versioned releases):
main -- production code, tagged releases
develop -- integration branch
feature/* -- feature development (from develop)
release/* -- release preparation (from develop)
hotfix/* -- urgent production fixes (from main)
GitHub Flow (simple, CI/CD-optimized):
main -- always deployable
feature/* -- short-lived feature branches (from main)
-- PR -> review -> merge to main -> auto-deploy
Trunk-Based Development (fastest, requires feature flags):
main (trunk) -- only branch; CI/CD deploys every commit
feature flags -- control which features are enabled per user
-- merge incomplete features behind disabled flag
-- enable gradually (canary -> 100%)
DevOps recommendation: Trunk-based development + feature flags for high-velocity teams. GitHub Flow for most teams. GitFlow only if you ship versioned software with independent release cycles.
Real-World Scenarios
Q17. How would you set up a complete CI/CD pipeline for a microservices application on Kubernetes?
Requirements: 5 microservices, independent deployments, automated testing, production safety, GitOps.
Toolchain:
Source control: GitHub
CI: GitHub Actions (test, build, push image)
Image registry: Docker Hub or ECR or GCR
CD: ArgoCD (GitOps pull-based)
K8s: EKS or GKE
Config management: Helm charts in separate manifests repo
Secrets: External Secrets Operator (pulls from AWS Secrets Manager / Vault)
Monitoring: Prometheus + Grafana
Pipeline per microservice:
PR opened:
-> GitHub Actions: lint, unit tests, integration tests
-> Post test results as PR check (block merge if failed)
PR merged to main:
-> GitHub Actions: build Docker image, tag with commit SHA
-> Push to ECR
-> Update image tag in k8s-manifests repo (automated PR or direct commit)
k8s-manifests repo updated:
-> ArgoCD detects change
-> Deploys to staging automatically
-> Smoke tests run
-> Manual approval gate for production
-> ArgoCD deploys to production
-> Monitors deployment health (Argo Rollouts canary)
-> Auto-rollback if error rate spikes
Helm chart structure:
charts/
  myservice/
    Chart.yaml
    values.yaml             -- defaults
    values-staging.yaml     -- staging overrides
    values-production.yaml  -- production overrides
    templates/
      deployment.yaml
      service.yaml
      ingress.yaml
      hpa.yaml
      pdb.yaml              -- PodDisruptionBudget
      serviceaccount.yaml
      configmap.yaml
Q18. How do you handle secrets in Kubernetes securely?
The problem with native Kubernetes Secrets:
# Kubernetes Secrets are base64-encoded (NOT encrypted at rest by default)
kubectl get secret myapp-secret -o yaml
# apiVersion: v1
# data:
# db_password: cGFzc3dvcmQxMjM= <-- just base64, trivially decoded
# api_key: c2VjcmV0a2V5MTIz
echo "cGFzc3dvcmQxMjM=" | base64 -d # -> password123
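The same point can be made programmatically: anyone who can read the Secret object can recover the values with a standard-library call. A minimal sketch using the two encoded values from the output above:

```python
import base64

# The `data` section of the Secret, as returned by the API server.
secret_data = {
    "db_password": "cGFzc3dvcmQxMjM=",
    "api_key": "c2VjcmV0a2V5MTIz",
}

# base64 is an encoding, not encryption: no key is needed to reverse it.
decoded = {name: base64.b64decode(value).decode("utf-8")
           for name, value in secret_data.items()}
print(decoded)  # {'db_password': 'password123', 'api_key': 'secretkey123'}
```

This is why RBAC on Secret reads and encryption at rest matter even though the values look opaque.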
Approach 1: etcd encryption at rest (minimum baseline)
# EncryptionConfiguration on API server
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}
Approach 2: External Secrets Operator (recommended for production)
# ExternalSecret: pulls secret from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-db-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: myapp-db-credentials  # creates a native K8s Secret
    creationPolicy: Owner
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: myapp/production/database
        property: password
    - secretKey: DB_USER
      remoteRef:
        key: myapp/production/database
        property: username
Approach 3: HashiCorp Vault + Vault Agent Injector
# Pod annotation: Vault injects secret as file at startup
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "myapp"
  # KV v2 engines are read through a data/ path segment, which is why the
  # template reads .Data.data; for KV v1 drop "data/" and use .Data instead
  vault.hashicorp.com/agent-inject-secret-db: "secret/data/myapp/database"
  vault.hashicorp.com/agent-inject-template-db: |
    {{- with secret "secret/data/myapp/database" -}}
    DB_PASSWORD={{ .Data.data.password }}
    {{- end }}
# Secret mounted at /vault/secrets/db inside container
FAQ
Q: What is the difference between Docker Compose and Kubernetes? Docker Compose is a tool for defining and running multi-container applications on a single host. It is used for local development and simple single-server deployments. Kubernetes is a container orchestration system for production workloads: multi-node scheduling, auto-scaling, self-healing, rolling updates, service discovery, and storage orchestration across a cluster. Use Docker Compose for local dev; use Kubernetes for production. Tools like Kompose can convert Compose files to Kubernetes manifests.
Q: What is toil in SRE, and how do you reduce it? Toil is manual, repetitive, tactical operational work that scales with service growth and provides no lasting value. Examples: manually restarting failed services, manually approving routine deployments, manually responding to false-positive alerts. SRE teams aim to keep toil below 50% of work time; the rest goes to engineering work (automation, reliability improvements). Reduce toil by: automating repetitive tasks, eliminating flaky alerts, implementing self-healing (Kubernetes restarts failed pods), and building tooling that removes human steps from standard workflows.
Q: How do you implement zero-downtime deployments in Kubernetes? Key requirements: readiness probes (traffic only sent to ready pods), PodDisruptionBudget (limits pods unavailable during node drain), rolling update strategy (maxUnavailable: 0, maxSurge: 1), preStop hook + terminationGracePeriodSeconds (let in-flight requests complete), and proper graceful shutdown in application code (handle SIGTERM, drain connections before exit). Candidates report that missing readiness probes is the most common cause of deployment-related downtime in Kubernetes. Confirm current Kubernetes best practices on the official Kubernetes documentation.
Methodology applied to this article (last verified 8 Jun 2026)
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.