Microservices Interview Questions 2026 — Top 40 with Expert Answers
Senior backend engineers with microservices expertise earn ₹30-90 LPA at product companies. Staff/Principal architects at Flipkart, Swiggy, and Razorpay command ₹80 LPA-1.5 Cr. The difference between a ₹15 LPA "backend developer" and a ₹50 LPA "distributed systems engineer" is whether you can answer these questions with production depth — not just textbook definitions.
Every company running at scale — Flipkart, Swiggy, Razorpay, Amazon, Netflix, PhonePe, CRED — expects you to design, build, and debug microservices architectures. This guide covers 40 battle-tested questions compiled from real interviews at these exact companies, from fundamentals to the system design scenarios that decide Senior and Staff offers.
Related: Kubernetes Interview Questions 2026 | Docker Interview Questions 2026 | Golang Interview Questions 2026
Beginner-Level Microservices Questions (Q1–Q12)
Q1. What are microservices? How do they differ from a monolith?
Monolith: Single deployable unit containing all application features. The entire application — user management, orders, payments, notifications — is one codebase, one build, one deployment.
Microservices: Independent, small services each owning a specific business capability. Each service has its own codebase, deployment pipeline, and database.
Comparison:
| Aspect | Monolith | Microservices |
|---|---|---|
| Deployment | Entire app per release | Independent per service |
| Scaling | Scale entire app | Scale individual bottlenecks |
| Technology | Single stack | Polyglot (right tool per job) |
| Team ownership | Shared codebase | Service ownership per team |
| Failure blast radius | Full app goes down | Isolated to one service |
| Development speed | Fast initially, slows with size | Slower initially, scales better |
| Testing | Simpler (one process) | Complex (distributed) |
| Operational complexity | Low | High |
| Data consistency | Easy (one DB transaction) | Hard (distributed transactions) |
When NOT to use microservices: Small teams (<10 engineers), early-stage startups, low traffic, simple domains. A well-designed monolith is often the right choice until scale demands otherwise. Start with a modular monolith and extract services when a specific domain needs independent scaling.
Every senior backend interview starts here
Q2. What is service discovery? How does it work?
Service discovery is how services find each other's network locations when instances constantly appear and disappear (autoscaling, deployments, crashes) — a registry replaces hardcoded hosts. Two patterns:
Client-side discovery:
- Service A queries a Service Registry (Eureka, Consul, etcd) for Service B's instances
- Service A applies load balancing logic and calls one instance directly
- Example: Netflix Ribbon + Eureka
- Advantage: No proxy overhead
- Disadvantage: Every client needs a discovery library (see the sketch below)
Server-side discovery:
- Service A sends request to a load balancer/proxy
- Proxy queries the Service Registry and routes to an available instance
- Example: AWS ALB + ECS, Kubernetes Services + kube-proxy, Nginx + Consul-template
- Advantage: Language-agnostic
- Disadvantage: Extra network hop, proxy is a potential bottleneck
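A minimal client-side discovery sketch against Consul's HTTP health API (the agent address and service name are illustrative; passing=true filters to healthy instances):
import random
import requests
def discover(service_name: str) -> str:
    # Ask the local Consul agent for healthy instances of the service
    resp = requests.get(
        f"http://localhost:8500/v1/health/service/{service_name}",
        params={"passing": "true"},
        timeout=1.0,
    )
    resp.raise_for_status()
    instances = resp.json()
    # Client-side load balancing: pick a random healthy instance
    svc = random.choice(instances)["Service"]
    return f"http://{svc['Address']}:{svc['Port']}"
payment_url = discover("payment-service")  # e.g. http://10.0.3.17:8080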
Kubernetes service discovery:
K8s uses its own service registry (etcd) + CoreDNS. Services are discoverable at <service-name>.<namespace>.svc.cluster.local. Kubernetes kube-proxy programs iptables/IPVS rules for traffic routing. This is server-side discovery built into the platform.
Asked at Flipkart, Swiggy, Amazon SDE-2/3
Q3. What is an API Gateway? What are its responsibilities?
An API Gateway is the single entry point that sits between clients and backend services. Responsibilities:
| Function | Description |
|---|---|
| Routing | Route requests to appropriate backend service |
| Authentication | Validate JWT tokens, API keys before forwarding |
| Rate limiting | Throttle per client/endpoint |
| SSL termination | Handle HTTPS at gateway, services use HTTP internally |
| Request transformation | Transform between client format and service format |
| Response aggregation | Call multiple services, combine into one response (BFF pattern) |
| Caching | Cache responses for read-heavy endpoints |
| Observability | Centralized logging, tracing, metrics |
| Circuit breaking | Fail fast when backend is unhealthy |
Popular API Gateways:
| Gateway | Type | Best For |
|---|---|---|
| Kong | Open-source, plugin-based | General purpose, high customization |
| AWS API Gateway | Managed, serverless | AWS-native, Lambda integration |
| Nginx/NGINX Plus | Reverse proxy + gateway | Performance, simple routing |
| Envoy | L7 proxy | Service mesh foundation (Istio, AWS App Mesh) |
| Traefik | Cloud-native, K8s-native | Kubernetes deployments |
| Apigee | Enterprise, Google Cloud | API management, developer portal |
Architecture:
Mobile App    ──┐                  ┌──→ User Service
Web App       ──┤                  ├──→ Order Service
Third-party   ──┼──→ API Gateway ──┼──→ Payment Service
API consumers ──┘                  └──→ Notification Service
Q4. What is the circuit breaker pattern? How does it work?
A circuit breaker wraps calls to a dependency and trips once failures cross a threshold, rejecting further calls immediately instead of letting threads pile up behind a dying service. States:
CLOSED (normal operation)
│ failures exceed threshold (e.g., 5 failures in 10 seconds)
▼
OPEN (circuit tripped — reject all requests immediately)
│ after reset timeout (e.g., 30 seconds)
▼
HALF-OPEN (let one test request through)
│ success → back to CLOSED
│ failure → back to OPEN
Why it matters: Without circuit breakers, a slow/failed Payment Service causes all Order Service threads to block waiting for timeout → thread pool exhaustion → Order Service also fails → cascading failure across the system.
Implementation (Python with pybreaker):
import pybreaker
import requests
# Create circuit breaker: open after 5 failures, reset after 60s
payment_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60)
@payment_breaker
def call_payment_service(order_id: str, amount: float):
response = requests.post(
"http://payment-service/charge",
json={"order_id": order_id, "amount": amount},
timeout=2.0
)
response.raise_for_status()
return response.json()
# In order service
try:
result = call_payment_service(order_id, amount)
except pybreaker.CircuitBreakerError:
# Circuit is OPEN — fail fast, return cached/degraded response
return {"status": "payment_deferred", "message": "Payment system temporarily unavailable"}
Resilience4j (Java), Polly (.NET), and Istio (infrastructure level) also implement circuit breakers.
Critical pattern at Razorpay, Flipkart, and any payment/ordering system interview
Q5. What is the Saga pattern? When do you use it?
A saga breaks a distributed transaction into a sequence of local transactions, one per service; failures are undone by compensating transactions instead of a global rollback. Two implementations:
1. Choreography-based Saga: Each service publishes events; other services react and execute their local transaction.
OrderService → publishes: OrderCreated
↓ (async)
PaymentService → listens: OrderCreated → charges card → publishes: PaymentProcessed
↓ (async)
InventoryService → listens: PaymentProcessed → reserves stock → publishes: StockReserved
↓ (async)
ShippingService → listens: StockReserved → creates shipment → publishes: ShipmentCreated
Compensating transactions on failure:
PaymentService fails → publishes: PaymentFailed
OrderService listens → cancels order → publishes: OrderCancelled
2. Orchestration-based Saga: A central Saga Orchestrator (often AWS Step Functions, Temporal, or Conductor) coordinates the workflow.
SagaOrchestrator:
1. Call PaymentService.charge() → success
2. Call InventoryService.reserve() → fails
3. Call PaymentService.refund() [compensating transaction]
4. Update order status to failed
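A stripped-down orchestrator illustrating the compensation logic (the service clients are hypothetical; real orchestrators like Temporal add durable state, retries, and timeouts on top of this idea):
def run_order_saga(order: dict) -> None:
    # Each step pairs a forward action with its compensating transaction
    steps = [
        (payment_service.charge,    payment_service.refund),
        (inventory_service.reserve, inventory_service.release),
        (shipping_service.create,   shipping_service.cancel),
    ]
    completed = []
    try:
        for action, compensate in steps:
            action(order)
            completed.append(compensate)
    except Exception:
        # Undo finished steps in reverse order, then surface the failure
        for compensate in reversed(completed):
            compensate(order)
        raise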
Choreography vs. Orchestration:
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose (event-driven) | Tighter (orchestrator knows all steps) |
| Visibility | Hard to see overall flow | Clear (orchestrator owns flow) |
| Error handling | Complex (distributed) | Centralized in orchestrator |
| Best for | Simple, few services | Complex workflows, many services |
Most frequently discussed pattern at system design interviews for e-commerce and fintech
Q6. What is database-per-service pattern? What are the trade-offs?
Each service owns its database outright; other services may reach that data only through the owner's API or events. Benefits:
- Services are truly independently deployable (schema changes don't break other services)
- Polyglot persistence (user service → PostgreSQL, product catalog → MongoDB, sessions → Redis)
- Independent scaling (scale the database with the service that needs it)
- Fault isolation (payment DB failure doesn't affect user DB)
Drawbacks:
- No cross-service joins (you can't JOIN users u ON o.user_id = u.id across services)
- Data consistency is eventual, not immediate (no multi-DB transactions)
- Duplicated data across services (denormalization)
- Reporting/analytics requires aggregating from multiple sources
Solutions for cross-service data needs:
- API composition: Query multiple services, join in memory (client or API Gateway)
- CQRS read models: Maintain a denormalized read database updated via events
- Event-driven data replication: Services publish events → other services maintain local copies
Orders Service                     Users Service
      │                                  │
      └─ publishes OrderCreated          └─ publishes UserUpdated
                  │                            │
                  └─────────────┬──────────────┘
                                │
                   Read Model (Elasticsearch)
            Combines order + user data for search
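A sketch of the replication consumer feeding such a read model, using confluent-kafka (the topic name, event shape, and local_db helper are assumptions):
import json
from confluent_kafka import Consumer
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orders-user-replica",
})
consumer.subscribe(["user-events"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event["type"] == "UserUpdated":
        # Keep only the fields this context cares about — a local, eventually
        # consistent copy, so no cross-service call is needed at query time
        local_db.upsert("users_replica", {"id": event["user_id"], "email": event["email"]})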
Asked at every microservices architecture interview
Q7. What is an event-driven architecture? How is it different from request-response?
Request-Response (synchronous):
- Service A calls Service B directly, waits for response
- Tight coupling (A must know B's endpoint)
- If B is slow, A is slow
- If B is down, A fails
Event-Driven (asynchronous):
- Service A publishes an event to a message broker
- Service B subscribes and processes at its own pace
- A doesn't know about B (loose coupling)
- A continues processing regardless of B's state
Comparison:
| Aspect | Request-Response | Event-Driven |
|---|---|---|
| Coupling | Tight (caller knows callee) | Loose (producer doesn't know consumers) |
| Availability | Dependent on downstream | Independent |
| Latency | Synchronous wait | Asynchronous (no wait) |
| Complexity | Simpler to reason about | More complex (eventual consistency) |
| Observability | Easy (call chain visible) | Harder (event flows across brokers) |
| Use case | Real-time user-facing reads | Background processing, notifications, integration |
Message brokers:
- Apache Kafka: High-throughput, ordered, durable, replay-able. Best for event streaming, audit logs.
- RabbitMQ: Feature-rich queuing, routing (AMQP), dead-letter exchanges. Best for task queues.
- AWS SQS/SNS: Managed, serverless, deep AWS integration.
- Apache Pulsar: Multi-tenancy, geo-replication, Kafka alternative.
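A minimal producer sketch with confluent-kafka showing the loose coupling — the producer names a topic, never a consumer (topic and event shape are illustrative):
import json
from confluent_kafka import Producer
producer = Producer({"bootstrap.servers": "kafka:9092"})
def publish_order_created(order: dict) -> None:
    producer.produce(
        "order-events",
        key=order["order_id"],  # same key → same partition → per-order ordering
        value=json.dumps({"type": "OrderCreated", **order}),
    )
    producer.flush()  # block until delivered — fine for a sketch; batch in production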
Q8. What is eventual consistency? How do you handle it in microservices?
Eventual consistency: after a write, all services and replicas converge to the new value, but not instantly — reads may briefly see stale data. Example: User changes their email. User Service updates immediately. Order Service still has the old email for 500ms (replication lag).
CAP Theorem: In a distributed system, you can have at most 2 of: Consistency, Availability, Partition tolerance. Since network partitions happen in any distributed system, you choose CP (consistent but may be unavailable during partition — e.g., Zookeeper) or AP (available but may show stale data — e.g., Cassandra, DynamoDB).
Patterns for handling eventual consistency:
- Idempotency: Handle duplicate events gracefully. Process an event with the same ID twice = same result.
- Optimistic locking: Include version numbers in updates; reject stale updates (sketch after this list).
- Read-your-writes consistency: After a write, always read from the same service (not a replica) for that user's next request.
- Compensating actions: If you detect inconsistency, correct it (Saga compensating transactions).
- UI design: Show "processing..." for async operations instead of immediate final state.
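A minimal optimistic-locking sketch (the db helper returning an affected-row count is an assumption):
class ConflictError(Exception):
    pass
def update_email(user_id: str, new_email: str, expected_version: int) -> None:
    # The UPDATE matches zero rows if someone else bumped the version first
    rows_affected = db.execute(
        "UPDATE users SET email = %s, version = version + 1 "
        "WHERE id = %s AND version = %s",
        (new_email, user_id, expected_version),
    )
    if rows_affected == 0:
        raise ConflictError("stale version — re-read and retry")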
Q9. What is the difference between REST, gRPC, and GraphQL for microservices communication?
| Feature | REST | gRPC | GraphQL |
|---|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 (always) | HTTP/1.1 or HTTP/2 |
| Format | JSON/XML | Protocol Buffers (binary) | JSON |
| Performance | Good | Excellent (binary, multiplexed) | Good |
| Type safety | Optional (OpenAPI) | Strong (protobuf IDL) | Strong (schema) |
| Streaming | Limited (SSE, WebSocket) | Native (bidirectional) | Subscriptions |
| Browser support | Excellent | Limited (needs grpc-web) | Excellent |
| Learning curve | Low | Medium | Medium |
| Best for | Public APIs, external clients | Internal service-to-service | Flexible client queries |
gRPC example:
// payment.proto
service PaymentService {
rpc ProcessPayment(PaymentRequest) returns (PaymentResponse);
rpc StreamTransactions(UserRequest) returns (stream Transaction); // Server streaming
}
message PaymentRequest {
string order_id = 1;
double amount = 2;
string currency = 3;
}
Common architecture: External clients → REST/GraphQL API (via API Gateway). Internal service-to-service → gRPC (performance, strong typing).
Asked at Razorpay, Flipkart, Zerodha architecture discussions
Q10. What is the bulkhead pattern? Why is it important?
Named after a ship's watertight compartments: the bulkhead pattern partitions resources (threads, connections, memory) per dependency so one flooding compartment can't sink the whole service. Implementation — Thread pool isolation:
// Without bulkhead: one slow service consumes all 200 threads
// Result: entire application unresponsive
// With bulkhead: dedicated thread pools per service
@HystrixCommand(
commandKey = "PaymentService",
threadPoolKey = "PaymentServicePool",
threadPoolProperties = {
@HystrixProperty(name="coreSize", value="10"), // Max 10 threads for payment
@HystrixProperty(name="maxQueueSize", value="5") // Queue 5 more
}
)
public PaymentResult processPayment(Order order) { ... }
// RecommendationService gets its own pool — payment slowness doesn't affect it
@HystrixCommand(threadPoolKey = "RecommendationPool")
public List<Product> getRecommendations(String userId) { ... }
(Hystrix is in maintenance mode; Resilience4j's Bulkhead module is its modern JVM successor.)
Implementation — Connection pool isolation: Separate database connection pools per downstream service. If one pool is exhausted (slow DB), other services still have their connections.
In Kubernetes: Resource limits + requests per deployment act as bulkheads — one misbehaving service can't consume all cluster CPU/memory.
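In Python services, the same thread-pool idea reduces to a semaphore per dependency — a minimal asyncio sketch (endpoints and pool sizes are illustrative):
import asyncio
import httpx
# One semaphore per downstream dependency: payment slowness can exhaust
# only its own 10 slots, never the recommendation capacity
payment_bulkhead = asyncio.Semaphore(10)
recommendation_bulkhead = asyncio.Semaphore(50)
async def charge(client: httpx.AsyncClient, order: dict) -> dict:
    async with payment_bulkhead:  # waits here instead of draining a shared pool
        resp = await client.post("http://payment-service/charge", json=order, timeout=2.0)
        return resp.json()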
Q11. What is the strangler fig pattern for migrating from monolith to microservices?
Route traffic through a facade and peel functionality off the monolith slice by slice, the way a strangler fig gradually envelops its host tree. Migration steps:
- Deploy a routing layer (API Gateway or reverse proxy) in front of the monolith
- Identify a bounded context to extract (e.g., User Service)
- Build the new microservice independently
- Configure the router to send user-related traffic to the new service
- Monolith still handles remaining domains
- Repeat until monolith is replaced (or retired)
Phase 1:          Phase 2:                    Phase 3:
Monolith          API Gateway                 API Gateway
├── Users         ├── /users ──→ UserSvc      ├── /users ──→ UserSvc
├── Orders        └── /*     ──→ Monolith     ├── /orders ──→ OrderSvc
└── Payments                                  └── /payments ──→ Monolith
                                                  (payments still migrating)
Key considerations:
- Keep the monolith working the entire time — zero downtime migration
- Start with the domain that causes the most pain (slowest deployments, most merge conflicts)
- Don't try to extract tightly coupled domains first
- Watch for dual writes: during the transition, the monolith and the new service may both need to write the same data — plan how to keep them in sync
Q12. What is a bounded context in Domain-Driven Design (DDD)?
A bounded context is an explicit boundary within which a domain model and its ubiquitous language stay consistent; the same term can legitimately mean different things in different contexts. Example — E-commerce:
Bounded Context: Order Management
├── Order (has items, status, shipping address)
├── OrderItem (product, quantity, price at purchase time)
└── Customer (just: id, name, email — what order management cares about)
Bounded Context: Product Catalog
├── Product (detailed specs, images, variants, inventory)
├── Category
└── Pricing rules
Bounded Context: Payment
├── Transaction
├── PaymentMethod
└── Customer (just: id, billing address — what payment cares about)
Note: "Customer" exists in multiple bounded contexts with different attributes. That's correct — each context only models what it needs, using its own language.
Microservice ↔ Bounded Context: Ideally, one microservice per bounded context. Splitting a bounded context across services creates tight coupling. Combining multiple bounded contexts in one service loses independence.
Context mapping: Define how bounded contexts relate — Shared Kernel, Customer/Supplier, Conformist, Anti-Corruption Layer (ACL).
DDD is the foundation of good microservices decomposition — asked at principal/architect-level interviews
Solid on Q1-Q12? You've cleared the screening bar at most companies. The intermediate section below covers the patterns that ₹25 LPA+ backend engineers are expected to know cold — saga, CQRS, event sourcing, and distributed tracing.
Essential Intermediate Microservices Questions (Q13–Q28)
Q13. What is event sourcing? How does it differ from traditional CRUD?
Traditional CRUD: Store current state. When something changes, overwrite. History is lost.
-- Current state only
UPDATE orders SET status = 'shipped', updated_at = NOW() WHERE id = 123;
-- What was the previous status? When did it change? Unknown.
Event Sourcing: Store all events that led to the current state. Current state is derived by replaying events.
# Event store (append-only)
events = [
{"type": "OrderCreated", "order_id": 123, "items": [...], "ts": "2026-03-30T10:00"},
{"type": "PaymentProcessed", "order_id": 123, "amount": 599, "ts": "2026-03-30T10:01"},
{"type": "ItemShipped", "order_id": 123, "tracking": "EX123", "ts": "2026-03-30T11:30"},
]
# Derive current state by replaying events
def get_order_state(order_id):
order = {}
for event in get_events(order_id):
if event["type"] == "OrderCreated":
order = {"id": order_id, "status": "created", **event}
elif event["type"] == "PaymentProcessed":
order["status"] = "paid"
elif event["type"] == "ItemShipped":
order["status"] = "shipped"
order["tracking"] = event["tracking"]
return order
Benefits:
- Complete audit trail (compliance, debugging)
- Time travel — reconstruct state at any point in time
- Events are the source of truth for other services (publish to Kafka)
- Can replay events to rebuild read models or populate new services
Drawbacks:
- Query complexity (you can't SELECT * FROM orders WHERE status = 'paid' — you need a read model)
- Event schema evolution is tricky
- Storage grows indefinitely (snapshots mitigate this)
Q14. What is CQRS (Command Query Responsibility Segregation)?
CQRS splits the write path (commands) from the read path (queries), so each side gets its own model and store:
Command Side (writes) Query Side (reads)
│ │
Client → API → Command Handler Client → API → Query Handler
│ │
└─→ Write Store ──→ Event ──→ Read Store
(PostgreSQL) (Kafka) (Elasticsearch / Redis)
normalized async denormalized, optimized
ACID update for specific queries
Why separate reads and writes?
- Write models are optimized for correctness (normalized, ACID)
- Read models are optimized for query patterns (denormalized, pre-computed)
- Scale independently: most apps read 10x more than they write
- Different databases per side: PostgreSQL for writes, Elasticsearch for full-text search reads
CQRS without event sourcing:
Order created (write DB: PostgreSQL)
│
▼ (async, Debezium CDC or Kafka)
Order search index updated (read DB: Elasticsearch)
│
▼
User searches for "my orders" — queries Elasticsearch (fast, full-text capable)
CQRS + Event Sourcing is the full pattern — events are the commands, event store is the write side, projections build the read models.
System design question at Flipkart, Amazon, and Swiggy (order search systems)
Q15. Design a notification service for a large-scale application.
Architecture:
Order Service ──→ Kafka (topic: order-events)
Payment Service ──→ Kafka (topic: payment-events)
Delivery Service ──→ Kafka (topic: delivery-events)
│
▼
Notification Orchestrator
├── Subscribe to relevant topics
├── Determine notification type (email, SMS, push)
├── Check user preferences (do-not-disturb, channels)
├── Deduplicate (prevent duplicate notifications)
└── Route to appropriate channel worker
│
┌───────────┼───────────┐
▼ ▼ ▼
Email Worker SMS Worker Push Worker
(Amazon SES) (Twilio/MSG91) (FCM/APNs)
│ │ │
└───────────┴───────────┘
│
Notification Log DB
(track sent, failed, retries)
Key design decisions:
- Idempotency: Use notification ID (hash of event + user + type) to prevent duplicates — see the sketch after this list
- Priority queues: Separate Kafka topics/consumers for critical (OTP, payment) vs. marketing notifications
- User preferences: Cache in Redis — check before sending
- Retry with backoff: Exponential backoff for failed sends; dead-letter queue after N retries
- Rate limiting: Don't flood users — max 3 marketing emails per day
- Unsubscribe handling: Immediate propagation to prevent sending after unsubscribe
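The deduplication check from the first bullet can be a single atomic Redis operation — a sketch (key naming and TTL are illustrative; redis is a client instance as in the other snippets):
import hashlib
def notification_key(event_id: str, user_id: str, channel: str) -> str:
    # Deterministic ID: same event + user + channel always hashes the same
    digest = hashlib.sha256(f"{event_id}:{user_id}:{channel}".encode()).hexdigest()
    return f"notif:{digest}"
def claim_notification(event_id: str, user_id: str, channel: str) -> bool:
    # SET NX is atomic: only the first worker to claim this key may send
    return bool(redis.set(notification_key(event_id, user_id, channel),
                          "sent", nx=True, ex=86400))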
Q16. What is the outbox pattern? How does it solve the dual-write problem?
The dual-write problem: saving to your database and publishing to Kafka are two systems with no shared transaction — if one succeeds and the other fails, they silently diverge.
# BAD — dual-write, not atomic
def create_order(order_data):
order = db.save_order(order_data) # DB update succeeds
kafka.publish("order-created", order) # What if this fails? DB has order, no event
return order
Outbox Pattern solution: Write the event to an outbox table in the SAME database transaction. A separate process (transactional outbox publisher) reads from the outbox and publishes to Kafka.
# GOOD — single database transaction
def create_order(order_data):
with db.transaction():
order = db.save_order(order_data)
# Same transaction — either both succeed or both fail
db.save_to_outbox({
"event_type": "OrderCreated",
"payload": order.to_dict(),
"status": "pending"
})
return order
# Separate outbox publisher process (or Debezium CDC)
def outbox_publisher():
while True:
events = db.get_pending_outbox_events(limit=100)
for event in events:
kafka.publish(event.topic, event.payload)
db.mark_outbox_event_published(event.id)
time.sleep(0.1)
Better implementation — Debezium (Change Data Capture): Debezium reads PostgreSQL/MySQL WAL (write-ahead log) and publishes changes to Kafka automatically. No polling process needed.
Asked at Razorpay, Flipkart, Zerodha — any system with event-driven microservices
Q17. What is the sidecar pattern? Give real examples.
A sidecar is a helper container that runs alongside the main application container in the same pod, sharing its network namespace and lifecycle:
Pod (Kubernetes)
├── Main Container (application)
└── Sidecar Container (e.g., Envoy proxy)
- Same network namespace
- Same lifecycle
- Shared volumes
Real sidecar examples:
| Sidecar | Purpose | Example |
|---|---|---|
| Istio Envoy Proxy | mTLS, traffic management, telemetry | All traffic in/out of pod goes through Envoy |
| Log shipper | Collect and forward logs | Fluent Bit reads app log file, ships to OpenSearch |
| Secret injector | Inject secrets at startup | Vault Agent writes secrets to shared volume |
| Config sync | Keep config file up-to-date | Consul Template watches Consul, regenerates nginx config |
| Network proxy | Ambassador pattern for external calls | Proxy to legacy system that needs auth transformation |
| Metrics exporter | Expose metrics for services that can't | JMX Exporter for JVM metrics → Prometheus |
Init Container (runs before app, not alongside): Used for setup tasks — wait for DB to be ready, run migrations, download config files.
Q18. Explain the API Composition pattern vs. CQRS for cross-service queries.
API Composition: The API Gateway or a dedicated Composer service calls multiple services, joins the data in memory, and returns a combined response.
# API Gateway composer
async def get_order_details(order_id: str) -> OrderDetails:
    # Fetch the order first — the other calls need its fields
    order = await order_service.get_order(order_id)
    # The remaining services can be called concurrently
    user, payment, items = await asyncio.gather(
        user_service.get_user(order.user_id),
        payment_service.get_payment(order_id),
        inventory_service.get_items(order.item_ids),
    )
    # Compose response
    return OrderDetails(order=order, user=user, payment=payment, items=items)
Pros: Simple, always consistent (reads from authoritative sources). Cons: Higher latency (network calls per service), availability depends on all services, can't do complex joins/filtering.
CQRS Read Model: Maintain a pre-joined, denormalized view in a separate store.
OrderCreated event → update order_summary table/index
PaymentProcessed event → update payment_status in order_summary
ItemShipped event → update shipping_status in order_summary
Query: SELECT * FROM order_summary WHERE user_id = 123
-- All data pre-joined, single fast query, no service calls
Pros: Fast queries, no runtime service calls, can handle complex filtering/sorting. Cons: Eventual consistency (slight lag), complexity of maintaining projections.
When to use which:
- API Composition: Real-time consistency required, few services, low query volume
- CQRS Read Model: High query volume, complex filtering, acceptable eventual consistency
Q19. How do you implement retry logic with exponential backoff in microservices?
Naive retry (BAD — thundering herd):
for attempt in range(3):
try:
return call_payment_service()
except Exception:
time.sleep(1) # All failed requests retry at exactly the same time
Exponential backoff with jitter (CORRECT):
import random
import time
def call_with_retry(func, max_attempts=4, base_delay=0.5, max_delay=32):
for attempt in range(max_attempts):
try:
return func()
except (ConnectionError, TimeoutError) as e:
if attempt == max_attempts - 1:
raise # Last attempt — re-raise
# Exponential backoff: 0.5s, 1s, 2s, 4s...
delay = min(base_delay * (2 ** attempt), max_delay)
# Add jitter: randomize within [delay/2, delay*1.5]
jitter = delay * (0.5 + random.random())
time.sleep(jitter)
except (ValueError, KeyError) as e:
raise # Don't retry on non-transient errors
Tenacity library (Python):
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
@retry(
stop=stop_after_attempt(4),
wait=wait_exponential(multiplier=0.5, min=0.5, max=32),
retry=retry_if_exception_type((ConnectionError, TimeoutError))
)
def call_payment_service():
return requests.post("http://payment-service/charge", timeout=2.0)
Key principle: Only retry idempotent operations. Never auto-retry payment charges — you may double-charge. For non-idempotent operations, use idempotency keys.
Q20. What is an idempotency key? How do you implement idempotent APIs?
An idempotency key is a unique, client-chosen identifier sent with a request so the server can recognize retries and return the original result instead of executing twice.
Client implementation:
import requests
import time
def process_payment(order_id: str, amount: float, retries: int = 3):
idempotency_key = f"payment-{order_id}" # Deterministic key per order
for attempt in range(retries):
response = requests.post(
"http://payment-service/charge",
headers={"X-Idempotency-Key": idempotency_key},
json={"amount": amount}
)
if response.status_code in (200, 201):
return response.json()
elif response.status_code == 409: # Conflict = already processed
return response.json() # Return cached result
time.sleep(2 ** attempt)
Server implementation:
@app.post("/charge")
async def charge(request: ChargeRequest, idempotency_key: str = Header(...)):
# Check if we've seen this key before
cached = await redis.get(f"idempotency:{idempotency_key}")
if cached:
return json.loads(cached) # Return previous result
# Process payment
result = await payment_processor.charge(request.amount)
# Cache result (TTL = 24 hours)
await redis.setex(f"idempotency:{idempotency_key}", 86400, json.dumps(result))
return result
Razorpay, Stripe, and Paytm all implement idempotency keys for payment APIs for exactly this reason.
Q21. What is the two-phase commit (2PC) and why is it problematic in microservices?
2PC: a coordinator drives all participants through a voting phase and a commit phase so a distributed transaction commits atomically everywhere or nowhere. Phase 1 (Prepare):
- Coordinator asks all participants: "Can you commit?"
- Each participant locks resources, writes to WAL, responds "YES" or "NO"
Phase 2 (Commit/Abort):
- If all said YES → Coordinator sends COMMIT to all
- If any said NO → Coordinator sends ABORT to all
Why 2PC is problematic in microservices:
- Blocking protocol: Resources are locked during both phases. If coordinator crashes after Phase 1, participants are stuck with locks indefinitely.
- Single point of failure: Coordinator failure blocks the entire transaction.
- Network partitions: If a participant receives the Phase 2 COMMIT but another participant doesn't, you have inconsistency.
- Low throughput: Long-held locks → contention → poor performance at scale.
- Not cloud-native: Cloud services (AWS RDS, DynamoDB) don't participate in external 2PC.
Alternative: Saga pattern with compensating transactions — eventual consistency instead of ACID, but resilient and scalable.
Q22. How do you implement distributed tracing across microservices?
How trace context propagates:
User Request
│ Trace-ID: abc123, Span-ID: 0001
▼
API Gateway
│ (adds Span-ID: 0001 → parent-span)
│ new Span-ID: 0002
▼
Order Service
│ new Span-ID: 0003
▼
Payment Service
│ (trace headers forwarded automatically by OpenTelemetry instrumentation)
new Span-ID: 0004
▼
Database
OpenTelemetry context propagation:
# Service uses OTel — trace context auto-propagates via HTTP headers (W3C traceparent)
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
tracer = trace.get_tracer(__name__)
# Outgoing HTTP call — inject trace context into headers
headers = {}
inject(headers) # Adds traceparent header
response = requests.post("http://payment-service/charge",
headers=headers,
json=payload)
# Incoming request — extract trace context
context = extract(request.headers)
with tracer.start_as_current_span("process-order", context=context):
process_order(request.data)
Trace analysis in Jaeger:
- See full call tree with each service's contribution
- Identify which service caused the P99 spike
- Find N+1 query patterns (many short DB calls instead of one batch)
- Correlate errors across services
Q23. What is the Backends for Frontends (BFF) pattern?
Mobile App ──→ Mobile BFF ──→ User Service
(compact JSON, Order Service
limited fields, Payment Service
offline support)
Web App ──→ Web BFF ──→ User Service
(rich data, Order Service
full features) Analytics Service
Third-party ──→ Public API ──→ User Service
(versioned, Order Service
rate-limited)
Why BFF instead of one universal API Gateway?
- Mobile apps need lightweight responses (battery, bandwidth)
- Web apps need richer data (full user profile, analytics)
- Different authentication mechanisms per client
- Different rate limits
- Client-specific aggregation logic doesn't pollute the universal gateway
Who implements the BFF? Typically the frontend team — they own the BFF along with the client. This gives frontend teams control over their data fetching without negotiating with a platform team.
Q24. How do you handle service-to-service authentication in microservices?
Option 1 — JWT (JSON Web Tokens):
# Auth service issues order-service a short-lived token, signed with the auth service's private key
import jwt  # PyJWT
from datetime import datetime, timedelta
token = jwt.encode({
"sub": "order-service",
"iss": "auth-service",
"aud": "payment-service",
"iat": datetime.utcnow(),
"exp": datetime.utcnow() + timedelta(minutes=5)
}, private_key, algorithm="RS256")
# Service B verifies JWT using auth-service's public key
decoded = jwt.decode(token, public_key, algorithms=["RS256"],
audience="payment-service")
Option 2 — mTLS (mutual TLS) via Istio: Both services present certificates. Istio's control plane issues and rotates certs automatically — no application code changes needed. Best for Kubernetes deployments.
Option 3 — API Keys (simple, less secure): Pre-shared keys stored in secrets manager. Simpler but no identity verification, hard to rotate.
Option 4 — SPIFFE/SPIRE:
Platform-agnostic workload identity. Every workload gets a SPIFFE ID (spiffe://trust-domain/service/payment) and a short-lived X.509 cert. Works across clouds, VMs, and containers.
Best practice in 2026: Istio mTLS for K8s (automatic, zero code), SPIFFE/SPIRE for hybrid environments.
Q25. What is Kafka and when should you use it over RabbitMQ?
| Feature | Apache Kafka | RabbitMQ |
|---|---|---|
| Architecture | Distributed log (partitioned, replicated) | Message broker (push-based queuing) |
| Message retention | Configurable (days/weeks/forever) | Until consumed (by default) |
| Replay messages | Yes (consumer offsets) | No (consumed = gone) |
| Throughput | Millions of messages/second | Hundreds of thousands/second |
| Consumer model | Pull (consumer controls pace) | Push (broker delivers) |
| Ordering | Guaranteed within partition | Not guaranteed across queues |
| Use case | Event streaming, audit log, data pipeline | Task queue, RPC, work distribution |
Use Kafka when:
- You need to replay events (rebuild a new service's database from history — sketch below)
- Multiple independent consumers need the same events
- High throughput (millions/second)
- Event log that's the source of truth (event sourcing)
- Real-time data pipelines (Kafka Streams, Flink)
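Replay is the capability RabbitMQ can't match — a confluent-kafka sketch that rewinds one partition to offset 0 to rebuild a read model (topic, group, and the projection function are assumptions):
from confluent_kafka import Consumer, TopicPartition
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "rebuild-order-read-model",
    "enable.auto.commit": False,
})
# Pin partition 0 to offset 0 and re-consume the entire retained history
consumer.assign([TopicPartition("order-events", 0, 0)])
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break  # caught up — a production job would compare against end offsets
    if not msg.error():
        apply_to_read_model(msg.value())  # hypothetical projection function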
Use RabbitMQ when:
- Task queue with routing logic (exchange types: direct, topic, fanout, headers)
- Complex retry/DLQ patterns
- Request-reply patterns
- Smaller scale, lower operational overhead
Q26. How do you handle versioning of microservice APIs?
Strategies:
- URI versioning (most common):
GET /api/v1/users/123
GET /api/v2/users/123 # Breaking change in v2
- Header versioning:
GET /api/users/123
Accept: application/vnd.myapp.v2+json
- Query parameter versioning:
GET /api/users/123?version=2
Consumer-Driven Contract Testing (Pact): Before breaking changes, verify no consumer depends on the old behavior:
# Payment service defines what it expects from Order service (pact)
pact = Consumer('PaymentService').has_pact_with(Provider('OrderService'))
pact.given('Order 123 exists').upon_receiving('a request for order').with_request(
method='GET', path='/orders/123'
).will_respond_with(200, body={"id": "123", "amount": 599}) # Contract
# Order service must fulfill this contract — tested in isolation
Event versioning (Kafka):
- Add fields (backward compatible): consumers can ignore new fields
- Remove/rename fields (breaking): use schema registry (Confluent Schema Registry with Avro/Protobuf) with backward/forward compatibility enforcement
Q27. What is the choreography vs. orchestration debate in microservices?
Choreography: Each service reacts to events and publishes its own events. No central coordinator.
OrderService publishes: OrderPlaced
→ PaymentService listens, charges card, publishes: PaymentCharged
→ InventoryService listens, reserves items, publishes: InventoryReserved
→ ShippingService listens, creates shipment, publishes: ShipmentCreated
Orchestration: A workflow orchestrator (Step Functions, Temporal, Conductor) explicitly calls each service in sequence.
OrderOrchestrator:
1. Call PaymentService.charge() → wait for response
2. Call InventoryService.reserve() → wait
3. Call ShippingService.create() → wait
4. Publish OrderFulfilled event
Trade-off comparison:
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Very loose | Orchestrator knows all services |
| Visibility | Hard to see end-to-end flow | Workflow is explicit in orchestrator |
| Error handling | Each service must handle its own failures | Centralized error handling, rollback logic |
| Debugging | Trace events across multiple topics | Debug in one place (orchestrator logs) |
| Testability | Hard (need full event pipeline) | Easier (mock service calls) |
| Evolution | Adding a step = new consumer (no code change) | Adding a step = modify orchestrator |
Recommendation (2026 best practice): Use orchestration for business-critical workflows (order processing, payment flows) — visibility and error handling outweigh the coupling. Use choreography for non-critical fan-out (send email notification, update recommendation engine).
Debated at principal engineer / tech lead interviews
Q28. How do you design a rate limiter for an API gateway?
Algorithms:
| Algorithm | Description | Burst Handling | Use Case |
|---|---|---|---|
| Fixed Window Counter | Count requests in fixed window (e.g., 1-minute slots) | Allows 2x limit at window boundary | Simple, common |
| Sliding Window Log | Track exact timestamps of last N requests | Precise | Accurate but memory-intensive |
| Sliding Window Counter | Combine fixed windows with interpolation | Smooth | Good balance |
| Token Bucket | Bucket fills at rate R, requests consume tokens | Yes (burst up to bucket size) | API rate limiting |
| Leaky Bucket | Requests processed at fixed rate (queue excess) | Smooths bursts | Output rate limiting |
Sliding window log implementation with Redis (sorted sets):
import redis
import time
r = redis.Redis()
def is_rate_limited(user_id: str, max_requests: int = 100, window_seconds: int = 60) -> bool:
key = f"rate_limit:{user_id}"
current_time = time.time()
window_start = current_time - window_seconds
pipe = r.pipeline()
# Remove old entries outside window
pipe.zremrangebyscore(key, 0, window_start)
# Count current requests in window
pipe.zcard(key)
# Add current request
pipe.zadd(key, {str(current_time): current_time})
# Set expiry on key
pipe.expire(key, window_seconds)
results = pipe.execute()
request_count = results[1]
return request_count >= max_requests
Distributed rate limiting: Redis with atomic Lua scripts ensures correctness across multiple API Gateway instances. Alternatively, Nginx Plus and Kong have built-in rate limiting plugins.
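For contrast with the sliding window above, a true token bucket is a handful of lines in process — a sketch (per-instance only; distributed enforcement still needs Redis):
import time
class TokenBucket:
    """Refills at `rate` tokens/second up to `capacity`; each request costs one token."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()
    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False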
If you can nail Q1-Q28, you're already in the top 15% of backend candidates. The advanced section is where Staff Engineer and Architect offers are decided — real system design scenarios from Razorpay, Flipkart, and Amazon interviews.
Advanced Microservices Questions — The Architect Round (Q29–Q40)
Q29. Design an e-commerce order management system using microservices.
Architecture:
Client (Web/Mobile)
│
API Gateway (Kong)
├── /auth → Auth Service (JWT issuance/validation)
├── /users → User Service (PostgreSQL)
├── /products → Product Catalog Service (PostgreSQL + Elasticsearch)
├── /cart → Cart Service (Redis — ephemeral)
├── /orders → Order Service (PostgreSQL + Kafka publisher)
└── /payments → Payment Service (PostgreSQL + Razorpay integration)
Kafka Topics:
- order-events: OrderCreated, OrderUpdated, OrderCancelled
- payment-events: PaymentProcessed, PaymentFailed, RefundInitiated
- inventory-events: StockReserved, StockReleased, LowStockAlert
Async Services (Kafka consumers):
- Inventory Service: listens to order-events, reserves/releases stock
- Notification Service: listens to all events, sends emails/SMS/push
- Analytics Service: listens to all events, updates dashboards
- Search Indexer: listens to order-events, updates Elasticsearch
Read Models (CQRS):
- Order History: Elasticsearch (user's past orders, full-text search)
- Recommendation Engine: Feature store (order patterns)
Order creation flow (Saga):
- Order Service saves order (PENDING), publishes OrderCreated
- Payment Service: processes payment, publishes PaymentProcessed or PaymentFailed
- Inventory Service: reserves stock, publishes StockReserved or InsufficientStock
- Shipping Service: creates shipment, publishes ShipmentCreated
- Notification Service: sends order confirmation email
Compensating transactions on failure:
- PaymentFailed → OrderService cancels order → InventoryService releases any reserved stock
Q30. How do you implement distributed caching in microservices?
Caching patterns:
- Cache-Aside (Lazy Loading):
def get_user(user_id: str) -> User:
cached = redis.get(f"user:{user_id}")
if cached:
return User.from_json(cached)
user = db.query_user(user_id) # Cache miss — hit DB
redis.setex(f"user:{user_id}", 300, user.to_json()) # Cache 5 minutes
return user
- Write-Through: Write to cache and DB simultaneously. Cache always has latest.
- Write-Behind (Write-Back): Write to cache first, async to DB. Risk of data loss on cache failure.
- Read-Through: Cache handles all reads, fetches from DB on miss (transparent to application).
Cache invalidation strategies:
- TTL (Time-To-Live): Simple, eventual consistency guaranteed
- Event-driven invalidation: When user updated, publish event → cache invalidation consumer deletes key
- Write-through invalidation: On write, immediately update or delete cache entry
Problems to watch:
- Cache stampede: Many cache misses at once → all hit DB simultaneously. Solution: probabilistic early expiration or a mutex lock on miss (sketch after this list).
- Cache poisoning: Malicious data in cache. Always validate data from cache.
- Hot keys: One cache key gets millions of requests → single Redis node bottleneck. Solution: local in-process cache (Caffeine/Guava for JVM) for extremely hot keys.
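A sketch of the mutex-on-miss fix for stampedes (the bare redis and db helpers follow the cache-aside snippet above):
import json
import time
def get_user_cached(user_id: str) -> dict:
    key = f"user:{user_id}"
    while True:
        cached = redis.get(key)
        if cached:
            return json.loads(cached)
        # SET NX: exactly one caller wins the rebuild lock; the rest wait and re-read
        if redis.set(f"lock:{key}", "1", nx=True, ex=5):
            try:
                user = db.query_user(user_id)            # single DB hit per miss
                redis.setex(key, 300, json.dumps(user))
                return user
            finally:
                redis.delete(f"lock:{key}")
        time.sleep(0.05)  # brief backoff before re-checking the cache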
Q31. How do you test microservices effectively?
Testing pyramid for microservices:
/\
/ \
/ E2E\ (Few — very expensive, slow, fragile)
/──────\
/ Integ \ (Some — verify service interactions)
/──────────\
/ Contract \ (Many — verify API contracts between services)
/──────────────\
/ Unit Tests \ (Most — fast, isolated, plentiful)
/──────────────────\
Unit tests: Test business logic in isolation (mock all external dependencies).
Contract tests (Pact): Consumer defines expected API behavior; provider verifies it fulfills the contract without a real integration test environment.
Integration tests: Test service against real dependencies (real DB in Docker, real Redis). Testcontainers is excellent for this:
from sqlalchemy import create_engine
from testcontainers.postgres import PostgresContainer
def test_order_creation():
with PostgresContainer("postgres:16") as pg:
db = create_engine(pg.get_connection_url())
setup_schema(db)
order_service = OrderService(db=db, kafka=MockKafka())
order = order_service.create_order(user_id="u1", items=[...])
assert order.status == "pending"
End-to-end tests: Deploy all services (docker-compose or kind cluster), run scenario tests. Run these on a schedule (not every PR) — too slow and flaky for CI gates.
Consumer-Driven Contract Testing workflow:
- Consumer team writes Pact contract (what they expect)
- Pact pushed to Pact Broker
- Provider CI downloads and verifies against actual implementation
- If provider fails contract → prevent deployment
Q32. How do you handle schema evolution in Kafka/event-driven systems?
Confluent Schema Registry with Avro/Protobuf:
# Register schema
from confluent_kafka.schema_registry import SchemaRegistryClient
sr_client = SchemaRegistryClient({"url": "http://schema-registry:8081"})
# Define Avro schema
order_schema = {
"type": "record",
"name": "OrderCreated",
"fields": [
{"name": "order_id", "type": "string"},
{"name": "user_id", "type": "string"},
{"name": "amount", "type": "double"},
{"name": "currency", "type": "string", "default": "INR"} # New field with default
]
}
# Schema registry enforces compatibility before registration
Compatibility modes:
- BACKWARD: New schema can read old messages (add optional fields with defaults — upgrade consumers before producers)
- FORWARD: Old schema can read new messages (remove optional fields — producers can be updated before consumers)
- FULL: Both backward and forward compatible
- NONE: No compatibility checking (use with care)
Rules for safe evolution:
- ALWAYS add fields as optional with defaults (backward compatible)
- NEVER remove required fields
- NEVER change field types (int → string is a breaking change)
- NEVER rename fields (use aliases if needed)
Q33. Design a real-time bidding (RTB) system for online advertising using microservices.
Architecture:
Impression Request (ad slot on a website)
│ <100ms deadline
│
Request Router (Nginx)
│
Bid Request Enrichment Service
├── User profile lookup (Redis — <1ms)
├── Context parsing (URL, device, geo)
└── Audience segments (feature store)
│
Auction Service
├── Fan-out bid requests to DSPs in parallel (<50ms)
├── Collect bids until the deadline (late responses are discarded)
└── Second-price auction (winner pays second-highest + $0.01)
│
Ad Serving Service
├── Fetch winning creative from CDN
└── Return ad markup
│
Impression Tracker (async — fire and forget)
│
Kafka: impression-events, click-events, conversion-events
│
Analytics pipeline (Flink + S3 + Athena)
Latency budget: 100ms total. Typical breakdown: 20ms routing/enrichment, 50ms auction, 30ms ad serving.
Key technical challenges:
- Redis clusters for sub-millisecond user profile lookups
- Consistent hashing for request routing (same user → same cache node)
- Goroutines/async for parallel bid collection (see the sketch after this list)
- Circuit breakers on every DSP call (slow DSP = skip, not block)
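A sketch of the timeout-bounded fan-out (the DSP clients are hypothetical async stubs; 50ms is the auction's share of the latency budget above):
import asyncio
async def collect_bids(bid_request: dict, dsps: list, timeout: float = 0.05) -> list:
    # Fan out to every DSP; anything not back within the deadline is dropped
    tasks = [asyncio.create_task(dsp.bid(bid_request)) for dsp in dsps]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # slow DSP: skip it, never block the auction
    bids = [t.result() for t in done if t.exception() is None and t.result()]
    return sorted(bids, key=lambda b: b["price"], reverse=True)  # highest first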
Q34. What is the ambassador pattern? Give a real use case.
An ambassador is a sidecar that proxies the application's outbound traffic to one external dependency, transparently adding pooling, retries, or protocol translation. Real use case — legacy database with connection limits:
App Pods: 100 replicas × 10 connections each = 1,000 direct connections to the DB
↓ — this kills most databases
Problem: PostgreSQL degrades beyond roughly 500 connections
Solution with Ambassador (PgBouncer sidecar):
App Pod → local PgBouncer → DB
Each pod's app opens its 10 connections to its local PgBouncer, which multiplexes them onto a small server pool (say, 2 per pod)
Result: the DB sees 100 replicas × 2 pooled connections = 200 connections instead of 1,000
Other ambassador use cases:
- Protocol translation: App speaks REST, legacy backend requires SOAP — ambassador translates
- Retry/circuit-breaking: Ambassador (Envoy) handles retries so app code doesn't need to
- mTLS proxy: App speaks plain HTTP; ambassador adds mTLS for service-to-service security
- StatsD → Prometheus: App emits StatsD metrics; ambassador converts to Prometheus format
Q35. How do you implement health checks in microservices? What patterns exist?
Health check types:
- Shallow (ping) health check: Is the process alive? Returns 200 immediately.
@app.get("/health/ping")
async def ping():
return {"status": "ok"}
- Deep health check: Verify dependencies are accessible (DB, Redis, external services):
@app.get("/health/ready")
async def readiness_check():
checks = {}
# Database check
try:
await db.execute("SELECT 1")
checks["database"] = "ok"
except Exception as e:
checks["database"] = f"failed: {str(e)}"
# Redis check
try:
await redis.ping()
checks["redis"] = "ok"
except Exception as e:
checks["redis"] = f"failed: {str(e)}"
# External service check
try:
response = await httpx.get("http://payment-service/health/ping", timeout=1.0)
checks["payment_service"] = "ok" if response.status_code == 200 else "degraded"
except Exception:
checks["payment_service"] = "unavailable"
overall_status = "healthy" if all(v == "ok" for v in checks.values()) else "degraded"
http_status = 200 if overall_status == "healthy" else 503
return JSONResponse({"status": overall_status, "checks": checks}, status_code=http_status)
- Liveness vs. Readiness (Kubernetes):
  - Liveness (/health/live): Is the process healthy? (Restart if fails) — simple checks only
  - Readiness (/health/ready): Is it ready to serve traffic? (Remove from LB if fails) — dependency checks
Important: Deep health check on liveness probe → if your DB goes down, all pods restart (worsens the situation). Only use deep checks for readiness.
Q36. Design a distributed rate limiter for a payment API serving 1M requests/minute.
Architecture:
API Gateway (multiple instances)
│
Redis Cluster (6 nodes, 3 masters + 3 replicas)
├── Key: rate_limit:{user_id}:{window}
├── Algorithm: sliding window with sorted sets
└── Lua script for atomic check-and-increment
│
Fallback: Local in-memory counter (if Redis unavailable)
└── Accept 10% of normal limit locally (graceful degradation)
Redis Lua script (atomic — prevents race conditions):
-- Atomic sliding window rate limiter
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local window_start = now - window
redis.call('ZREMRANGEBYSCORE', key, 0, window_start)
local count = redis.call('ZCARD', key)
if count >= limit then
return 0 -- Rate limited
end
redis.call('ZADD', key, now, now)
redis.call('EXPIRE', key, math.ceil(window/1000))
return 1 -- Allowed
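Invoking the script from Python with redis-py (SLIDING_WINDOW_LUA is assumed to hold the script above; register_script returns an EVALSHA-backed callable):
import time
import redis
r = redis.Redis()
limiter = r.register_script(SLIDING_WINDOW_LUA)  # the Lua script above
def allow_request(user_id: str, limit: int = 1000, window_ms: int = 60_000) -> bool:
    now_ms = int(time.time() * 1000)
    result = limiter(keys=[f"rate_limit:{user_id}"], args=[now_ms, window_ms, limit])
    return result == 1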
Tiered rate limits:
- Free tier: 100 req/min
- Business tier: 1,000 req/min
- Enterprise: 10,000 req/min
Rate limit headers (de-facto convention; the 429 status itself comes from RFC 6585):
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 450
X-RateLimit-Reset: 1711756800
Retry-After: 30 # Only present when rate limited
Q37. How do you achieve zero-downtime database migrations in a microservices environment?
The challenge: Service instances run multiple versions during rolling deployment. Schema changes must be backward compatible with both old and new service code simultaneously.
Expand-Contract pattern in practice (example: replacing a legacy phone column with a wider phone_number column):
Step 1 — Expand (add new column, keep old):
-- Migration: add the new column as nullable (backward compatible)
ALTER TABLE users ADD COLUMN phone_number VARCHAR(15);
Deploy this migration before new code. Old code ignores the new column. New code writes to both the old column (phone) and the new column (phone_number).
Step 2 — Backfill:
# Background job — doesn't block serving
import time
def backfill_phone_numbers(batch_size: int = 1000):
    while True:
        # Updated rows leave the NULL set, so no OFFSET needed — always take the next batch
        rows = db.query(
            "SELECT id, phone FROM users WHERE phone_number IS NULL LIMIT %s",
            batch_size,
        )
        if not rows:
            break
        for row in rows:
            db.execute(
                "UPDATE users SET phone_number = %s WHERE id = %s",
                row.phone, row.id,  # copy from the legacy column
            )
        time.sleep(0.1)  # Throttle so the backfill doesn't starve production queries
Step 3 — Deploy new code only reading new column:
Verify the new column has complete data. Switch reads from phone to phone_number.
Step 4 — Contract (remove old column):
-- Safe to drop only after ALL old service versions are gone from production
ALTER TABLE users DROP COLUMN phone;
Never combine Step 1 (migration) and Step 4 (drop) in the same deployment.
Q38. What is the Hexagonal Architecture (Ports and Adapters)?
Business logic sits at the center; every technology concern — inbound or outbound — is an adapter plugged into a port (an interface):
HTTP Adapter   Kafka Adapter   CLI Adapter     (driving/inbound side)
      └──────────────┼──────────────┘
                     ▼
         Primary Ports (use cases)
                     │
           Core Business Logic
     (Domain + Application Layer)
                     │
        Secondary Ports (outbound)
             ┌───────┴───────┐
             ▼               ▼
        DB Adapter      Email Adapter
       (PostgreSQL)      (SendGrid)
Primary ports: Define what the application can do (interfaces that drive the application)
Secondary ports: Define what the application needs (interfaces it calls)
Adapters: Implementations of port interfaces (PostgreSQL adapter implements UserRepository port)
Benefits for microservices:
- Swap databases without touching business logic (test with in-memory DB)
- Same business logic can be exposed as REST API AND message consumer AND CLI
- Extremely testable — business logic tests use fake adapters, no real DB needed
# Domain layer (no framework dependencies)
class OrderService:
def __init__(self, order_repo: OrderRepository, payment_gateway: PaymentGateway):
self.order_repo = order_repo # Port — any adapter works
self.payment_gateway = payment_gateway
def create_order(self, user_id: str, items: list) -> Order:
# Pure business logic — no HTTP, no DB specifics
order = Order(user_id=user_id, items=items)
charged = self.payment_gateway.charge(order.total())
if charged:
order.confirm()
self.order_repo.save(order)
return order
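What the testability claim looks like in practice — a sketch with hand-rolled fakes (assumes the Order domain class used by the snippet above exists):
# In-memory fakes implementing the two ports — no DB, no HTTP
class InMemoryOrderRepo:
    def __init__(self):
        self.saved = []
    def save(self, order):
        self.saved.append(order)
class AlwaysApprovesGateway:
    def charge(self, amount):
        return True
def test_successful_charge_saves_order():
    repo = InMemoryOrderRepo()
    service = OrderService(order_repo=repo, payment_gateway=AlwaysApprovesGateway())
    service.create_order(user_id="u1", items=[{"sku": "A", "price": 100}])
    assert len(repo.saved) == 1  # business logic exercised without infrastructure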
Q39. How do you handle large file uploads in a microservices architecture?
Naive approach (broken at scale): Client uploads file to API gateway → service stores in memory → writes to S3. Problem: memory exhaustion, timeouts for large files, API gateway can't handle large bodies.
Correct pattern — Pre-signed S3 URLs:
1. Client requests upload URL
Client → Upload Service → "I want to upload 500MB video"
↓
Upload Service → AWS S3: generate pre-signed PUT URL (valid 30 min)
↓
Upload Service → Client: {upload_url: "https://s3.../...", file_id: "abc"}
2. Client uploads directly to S3 (bypasses your servers entirely)
Client → S3 (direct, signed URL)
3. S3 triggers event on completion
S3 → EventBridge → SQS → Processing Service
↓
Processing Service: validate, thumbnail, transcode
4. Client polls for processing status
Client → Upload Service: "Is file abc ready?"
↓
Upload Service: queries processing status DB
Benefits:
- Files never touch your application servers
- Scales infinitely (S3 handles uploads directly)
- Reduces bandwidth costs (your servers don't proxy)
- Works for files of any size (multi-part upload via S3 for >100MB)
Multipart upload for very large files:
# Break a 5GB file into 100MB parts, upload in parallel, then finalize
import boto3
from concurrent.futures import ThreadPoolExecutor
s3 = boto3.client('s3')
mpu = s3.create_multipart_upload(Bucket='my-bucket', Key='large-file.zip')
def upload_part(part_number, part_data):
    resp = s3.upload_part(
        Body=part_data, Bucket='my-bucket', Key='large-file.zip',
        PartNumber=part_number, UploadId=mpu['UploadId'])
    return {'PartNumber': part_number, 'ETag': resp['ETag']}
# read_parts (not shown) yields (part_number, bytes) chunks of ~100MB
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda p: upload_part(*p), read_parts('large-file.zip')))
s3.complete_multipart_upload(
    Bucket='my-bucket', Key='large-file.zip',
    UploadId=mpu['UploadId'], MultipartUpload={'Parts': parts})
Q40. Design a payment processing microservice that handles failures and maintains consistency.
Architecture:
Client
│
Payment API Service (RESTful, idempotent)
├── POST /payments (idempotency-key required)
├── GET /payments/{id} (status check)
└── POST /payments/{id}/refund
│
├── Idempotency cache (Redis — 24h TTL)
├── Write to payments DB (PostgreSQL)
└── Publish to outbox table (same transaction)
Outbox Processor (Debezium CDC → Kafka)
│
Kafka: payment-commands topic
│
Payment Processor Service (Kafka consumer)
├── Reads payment command
├── Calls Razorpay/Stripe via HTTP (with retry + circuit breaker)
├── Receives webhook confirmation (async)
└── Updates payment status in DB
│
Kafka: payment-events topic (PaymentProcessed, PaymentFailed)
│
Order Service ← listens (confirm order)
Notification Service ← listens (send receipt)
Analytics Service ← listens (update revenue dashboard)
Consistency guarantees:
- Idempotency key: Duplicate payment requests return same result
- Outbox pattern: Payment event published atomically with DB write
- Saga with compensation: If payment succeeds but order confirm fails → auto-refund via compensating transaction
- Effectively-once processing: the Kafka consumer is idempotent and commits offsets only after the DB update succeeds
Failure handling:
- Razorpay API timeout → retry 3x with exponential backoff
- Circuit breaker: After 10 failures in 60s, open circuit (return error immediately, don't call Razorpay)
- Dead-letter queue: After all retries exhausted → move to DLQ → alert team → manual review
- Webhook verification: Validate Razorpay webhook HMAC signature before processing
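The verification itself is a few lines of standard-library HMAC — a generic sketch (header names and exact signing details vary by provider; consult Razorpay's docs for the precise scheme):
import hashlib
import hmac
def verify_webhook_signature(raw_body: bytes, received_sig: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison defends against timing attacks
    return hmac.compare_digest(expected, received_sig)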
This is a real system design question asked at Razorpay, PhonePe, and Paytm interviews
FAQ — Honest Answers to Your Microservices Career Questions
Q: When should I start breaking a monolith into microservices? When you have clear pain points: deployment bottlenecks (one slow team blocks everyone), different scaling requirements per component, multiple teams fighting over the same codebase, or specific domains needing different technology. The threshold is roughly 50+ engineers or when deployment frequency drops because of coordination overhead.
Q: What is the difference between microservices and SOA (Service-Oriented Architecture)? SOA (2000s) used heavyweight protocols (SOAP, WS-*, XML), a central Enterprise Service Bus (ESB), and was typically implemented within one organization's IT department. Microservices use lightweight protocols (REST, gRPC, messaging), decentralized communication (no ESB), smaller service scope, and independent deployment. Microservices are essentially SOA done right.
Q: What is the minimum viable microservice size? A service should be small enough that a team of 2-4 engineers can understand the entire codebase, but large enough to be deployed and scaled independently. The "two-pizza team" rule: if a team can't be fed by two pizzas, the service is too large. Resist nano-services that make every function a separate deployment — the operational overhead isn't worth it.
Q: How do you handle the "distributed monolith" antipattern? A distributed monolith has microservices that must be deployed together (tight coupling, shared database, synchronous call chains where A calls B calls C calls D and all must be up). Fix: introduce async messaging between services, separate databases, identify bounded contexts, allow services to degrade gracefully when dependencies are unavailable.
Q: What monitoring is essential for microservices? The RED method (Rate, Error, Duration) per service. Distributed tracing with Jaeger/Tempo for debugging latency. Service dependency graph to understand call paths. Alert on SLO burn rates, not individual metrics. The four golden signals: latency, traffic, errors, saturation.
Q: Should every microservice have its own CI/CD pipeline? Yes — independent deployment is the core value proposition. Each service should have: its own repository (or module in a monorepo), its own pipeline, its own deployment lifecycle. A change to the payment service should never require coordinating with the user service deployment.
Q: What's the hardest part of microservices in practice? Data. Cross-service data access, eventual consistency, distributed transactions, and keeping read models up to date are significantly harder than equivalent monolith operations. Most teams underestimate this. The network is also unreliable in ways that in-process calls aren't — every service call needs timeouts, retries, and circuit breakers.
Q: What salary can I expect for microservices/distributed systems expertise in India? Backend Engineer (3-5 yrs, microservices): ₹20–45 LPA. Senior Backend/Architect (7+ yrs): ₹50–90 LPA. Staff/Principal with distributed systems depth: ₹80 LPA–1.5 Cr at product companies. FAANG (Amazon, Google): ₹60 LPA–2 Cr including RSUs.
You now have the same distributed systems knowledge that ₹50 LPA+ architects carry. Pair this with hands-on implementation — build a saga pattern, implement circuit breakers, deploy on Kubernetes — and you'll walk into interviews with unshakeable confidence.
Related Articles:
- Kubernetes Interview Questions 2026 — where your microservices actually run
- Docker Interview Questions 2026 — containerizing each service
- Golang Interview Questions 2026 — Go is the top choice for microservices at Indian unicorns
- React Interview Questions 2026 — the frontend consuming your APIs
- System Design Interview Questions 2026