System Design: Notification Service 2026 [Push, Email, SMS Architecture]

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Last Updated: June 2026
Why Notification Service is a Frequently Asked Design
Candidates report notification service in roughly 10-15% of backend system design rounds, particularly at consumer product companies. Based on public preparation resources and candidate-reported interview threads, it tests asynchronous processing, multi-channel routing, reliability, and user preference management.
Step 1: Requirements
Functional requirements:
- Send notifications via push (iOS/Android), email, and SMS
- Support transactional (OTP, payment receipt) and marketing (promotions) notifications
- Template-based notifications with personalization
- User notification preferences (opt-in/opt-out per channel and category)
- Scheduled notifications (send at specific time or relative to event)
- Delivery tracking: sent, delivered, opened
Non-functional requirements:
- Scale: 10 million notifications per day (116/sec average, 1000+/sec peak)
- Transactional notifications: delivery under 5 seconds
- Marketing notifications: delivery within 30 minutes
- At-least-once delivery (deduplication prevents duplicates)
- No notification loss
Step 2: Capacity Estimation
Daily notifications: 10M
    - Transactional (OTP, payment): ~500K/day = ~6/sec
    - Social (likes, comments): ~3M/day = ~35/sec
    - Marketing (promotions): ~6.5M/day = ~75/sec
Peak (evening hours ~8-10pm):
    3x average = ~350/sec sustained; the 1000+/sec figure in Step 1 leaves headroom for short bursts above this
Storage per notification:
    Metadata: ~500 bytes (recipient, template, status, timestamp)
    10M * 500B = 5GB/day notification log
    Retain 90 days: ~450GB total
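The arithmetic above can be reproduced in a few lines. This is only a sanity-check sketch using the figures from this step; it assumes decimal units (1 GB = 10^9 bytes) and a flat 3x peak multiplier.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes per category, taken from the estimate above
daily_by_category = {
    "transactional": 500_000,
    "social": 3_000_000,
    "marketing": 6_500_000,
}


def per_second(daily_count: int) -> float:
    """Average events per second for a given daily volume."""
    return daily_count / SECONDS_PER_DAY


total_daily = sum(daily_by_category.values())   # 10,000,000
avg_rate = per_second(total_daily)               # about 116/sec
peak_rate = 3 * avg_rate                         # about 350/sec

BYTES_PER_NOTIFICATION = 500
daily_log_gb = total_daily * BYTES_PER_NOTIFICATION / 1e9   # GB per day
retained_gb = daily_log_gb * 90                              # GB over 90 days

if __name__ == "__main__":
    for name, count in daily_by_category.items():
        print(f"{name}: {per_second(count):.1f}/sec")
    print(f"average {avg_rate:.0f}/sec, peak {peak_rate:.0f}/sec")
    print(f"log: {daily_log_gb:.1f} GB/day, {retained_gb:.0f} GB over 90 days")
```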
Step 3: High-Level Architecture
[Trigger Sources]
    - User events (payment, signup, message)
    - Scheduled jobs (marketing campaigns)
    - Admin panel (broadcast notifications)
        |
        v
[Notification API Service]
    - Validate request
    - Check user preferences
    - Enrich with template
    - Assign notification_id (idempotency key)
        |
        v
[Priority Kafka Topics]
    - notifications-critical (OTP, payment, security)
    - notifications-social (likes, comments, follows)
    - notifications-marketing (promotions, newsletters)
        |
        v
[Channel Workers]
    - Push Worker -> APNs / FCM
    - Email Worker -> SendGrid / AWS SES
    - SMS Worker -> Twilio / Vonage
        |
        v
[Delivery Tracker]
    - Store sent/delivered/failed status
    - Retry failed notifications
Step 4: Core Data Model
-- PostgreSQL

-- Notification templates
CREATE TABLE notification_templates (
    template_id  UUID PRIMARY KEY,
    name         VARCHAR(100) UNIQUE,
    channel      VARCHAR(20),   -- push, email, sms
    category     VARCHAR(50),   -- transactional, marketing, social
    subject_tpl  TEXT,          -- email subject template
    body_tpl     TEXT,          -- body with {{variable}} placeholders
    created_at   TIMESTAMP DEFAULT NOW()
);

-- Notification log
CREATE TABLE notifications (
    notification_id  UUID PRIMARY KEY,
    user_id          UUID NOT NULL,
    template_id      UUID REFERENCES notification_templates,
    channel          VARCHAR(20),
    status           VARCHAR(20),   -- queued, sent, delivered, failed, opened
    payload          JSONB,         -- template variables
    scheduled_for    TIMESTAMP,
    sent_at          TIMESTAMP,
    delivered_at     TIMESTAMP,
    created_at       TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_notifications_user ON notifications(user_id, created_at DESC);
CREATE INDEX idx_notifications_status ON notifications(status, scheduled_for);

-- User preferences
CREATE TABLE user_notification_preferences (
    user_id     UUID,
    category    VARCHAR(50),
    channel     VARCHAR(20),
    is_enabled  BOOLEAN DEFAULT TRUE,
    PRIMARY KEY (user_id, category, channel)
);

-- Device tokens
CREATE TABLE user_devices (
    device_id   UUID PRIMARY KEY,
    user_id     UUID,
    platform    VARCHAR(10),   -- ios, android, web
    push_token  TEXT,
    is_active   BOOLEAN DEFAULT TRUE,
    last_seen   TIMESTAMP
);
Step 5: Notification Service API
import uuid


class NotificationService:
    """
    Core orchestration: validate, check prefs, template, route to queue.
    """

    # Category -> Kafka topic, matching the three topics in Step 3
    TOPIC_BY_CATEGORY = {
        'transactional': 'notifications-critical',
        'social': 'notifications-social',
        'marketing': 'notifications-marketing',
    }

    def __init__(self, user_service, template_service, kafka_producer, preference_service):
        self.users = user_service
        self.templates = template_service
        self.kafka = kafka_producer
        self.preferences = preference_service

    def send(self, notification_request):
        """
        notification_request: {
            user_id, template_id, category, variables, channels,
            scheduled_for (optional),
            notification_id (optional; assumed here to be the caller's idempotency key)
        }
        """
        # Check user preferences
        user_prefs = self.preferences.get(
            notification_request['user_id'],
            notification_request['category']
        )
        allowed_channels = [
            ch for ch in notification_request['channels']
            if user_prefs.is_channel_enabled(ch)
        ]
        if not allowed_channels:
            return {'status': 'suppressed', 'reason': 'user_preferences'}

        # Resolve template
        template = self.templates.get(notification_request['template_id'])
        rendered = template.render(notification_request['variables'])

        # Reuse the caller's idempotency key when supplied, so a retried
        # trigger does not create a second notification; otherwise generate one
        notification_id = notification_request.get('notification_id') or str(uuid.uuid4())

        # Route by category so marketing traffic never shares a topic with OTPs
        topic = self.TOPIC_BY_CATEGORY[notification_request['category']]

        # Enqueue one message per allowed channel
        for channel in allowed_channels:
            self.kafka.produce(topic, {
                'notification_id': notification_id,
                'user_id': notification_request['user_id'],
                'channel': channel,
                'content': rendered,
                'scheduled_for': notification_request.get('scheduled_for')
            })
        return {'status': 'queued', 'notification_id': notification_id}
Step 6: Channel Sender Abstractions
from abc import ABC, abstractmethod


class NotificationSender(ABC):
    @abstractmethod
    def send(self, recipient, content):
        pass


class PushSender(NotificationSender):
    """Handles APNs (iOS) and FCM (Android)."""

    def send(self, recipient, content):
        platform = recipient['platform']
        token = recipient['push_token']
        if platform == 'ios':
            return self._send_apns(token, content)
        else:
            return self._send_fcm(token, content)

    def _send_apns(self, token, content):
        # Use HTTP/2 APNs API
        # Retry on 500, fail permanently on 410 (invalid token)
        pass

    def _send_fcm(self, token, content):
        # Use Firebase Admin SDK
        # Handle UNREGISTERED tokens -> deactivate device
        pass


class EmailSender(NotificationSender):
    def send(self, recipient, content):
        # SendGrid or AWS SES
        # Return message_id for tracking
        pass


class SMSSender(NotificationSender):
    def send(self, recipient, content):
        # Twilio or Vonage
        # Return SID for delivery tracking
        pass
Step 7: Reliability and Deduplication
class IdempotentDeliveryWorker:
    """
    Prevents duplicate sends using a Redis dedup key per (notification, channel).
    """
    DEDUP_TTL = 86400     # 24 hours: how long a completed send is remembered
    IN_FLIGHT_TTL = 300   # assumed upper bound (5 min) on a single send attempt

    def __init__(self, redis_client, sender):
        self.redis = redis_client
        self.sender = sender

    def process(self, notification):
        dedup_key = f"notif_sent:{notification['notification_id']}:{notification['channel']}"

        # Atomically claim the key. SET with NX is the only check needed; a
        # separate EXISTS call beforehand would be redundant and racy.
        # The claim gets a short TTL so that a worker crash mid-send does not
        # suppress the reprocessed event for the full 24 hours.
        if not self.redis.set(dedup_key, 'in_flight', ex=self.IN_FLIGHT_TTL, nx=True):
            return {'status': 'duplicate', 'skipped': True}

        # Attempt delivery
        try:
            result = self.sender.send(
                notification['recipient'],
                notification['content']
            )
        except Exception:
            # Release the claim so the retry is attempted
            self.redis.delete(dedup_key)
            raise

        # Delivery succeeded: remember it for the full dedup window
        self.redis.set(dedup_key, 'sent', ex=self.DEDUP_TTL)
        return {'status': 'sent', 'provider_id': result}
Retry strategy:
    Attempt 1: immediate
    Attempt 2: 1 minute after attempt 1
    Attempt 3: 5 minutes after attempt 2
    Attempt 4: 30 minutes after attempt 3
    Max attempts: 4, i.e. 3 retries (total 36 min window)
    After the final attempt fails: move to dead letter queue, alert ops team
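The schedule above can be expressed as a small lookup. This is a minimal sketch, assuming each delay is measured from the previous attempt; the function name and the convention that `None` means "send to the dead letter queue" are illustrative choices, not part of the original design.

```python
from typing import Optional

# Delay in seconds before each attempt, relative to the previous attempt:
# immediate, 1 minute, 5 minutes, 30 minutes
RETRY_DELAYS_SECONDS = [0, 60, 300, 1800]


def next_retry_delay(attempts_made: int) -> Optional[int]:
    """Return the delay before the next attempt, or None when exhausted.

    attempts_made is the number of delivery attempts already tried.
    None means: stop retrying and move the event to the dead letter queue.
    """
    if attempts_made < len(RETRY_DELAYS_SECONDS):
        return RETRY_DELAYS_SECONDS[attempts_made]
    return None


# Total time from the first attempt to the last one, in minutes
total_window_minutes = sum(RETRY_DELAYS_SECONDS) / 60
```

A worker would call `next_retry_delay` after each failure and either re-enqueue the event with that delay or dead-letter it when the result is `None`.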
Step 8: Handling Provider-Specific Failures
| Failure type | Action |
|---|---|
| Invalid push token (410/UNREGISTERED) | Mark device inactive, do not retry |
| Provider rate limit (429) | Backpressure: slow down Kafka consumer |
| Provider timeout | Retry with exponential backoff |
| Provider outage | Route to fallback provider |
| User unsubscribed (via provider) | Update preference DB |
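One way to keep this table enforceable in code is a single lookup that every channel worker consults. The sketch below is illustrative: the normalized failure codes are hypothetical names (each sender would have to translate its provider's own status codes into them), and defaulting unknown failures to a retry is an assumption.

```python
from enum import Enum


class Action(Enum):
    DEACTIVATE_DEVICE = "deactivate_device"          # do not retry
    SLOW_DOWN = "slow_down"                          # backpressure on the consumer
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    USE_FALLBACK_PROVIDER = "use_fallback_provider"
    UPDATE_PREFERENCES = "update_preferences"


# Hypothetical normalized failure codes, one per row of the table above
FAILURE_ACTIONS = {
    "invalid_token": Action.DEACTIVATE_DEVICE,
    "rate_limited": Action.SLOW_DOWN,
    "timeout": Action.RETRY_WITH_BACKOFF,
    "provider_outage": Action.USE_FALLBACK_PROVIDER,
    "unsubscribed": Action.UPDATE_PREFERENCES,
}


def action_for(failure_code: str) -> Action:
    """Look up the action for a failure; unknown failures are retried."""
    return FAILURE_ACTIONS.get(failure_code, Action.RETRY_WITH_BACKOFF)
```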
Step 9: Scheduling and Batching
Scheduled notifications:
    - Store with scheduled_for timestamp
    - Scheduler worker: every minute, query notifications WHERE status='queued' AND scheduled_for <= NOW()
    - Batch size: 1000 per scheduler tick
    - Push to Kafka for actual delivery
Marketing batch sends:
    - Campaign creates 1M notification records
    - Batch writer: INSERT 1000 records per DB transaction
    - Scheduler picks up in 1000-record batches
    - Rate limited to respect provider quotas:
        Email: ~14/sec (SendGrid free tier: 100/day -> use Pro)
        Push: ~1000/sec per FCM project
        SMS: ~1/sec Twilio trial, ~1000/sec Pro
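Respecting a provider quota is commonly done with a token bucket in front of the sender. The following is a minimal single-process sketch, not the article's implementation: the class name and the injectable `clock` parameter are assumptions, and a fleet of workers would need a shared limiter (for example one backed by Redis) rather than this in-memory version.

```python
import time


class TokenBucket:
    """Allows up to `rate` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start with a full bucket
        self.last_refill = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller must wait."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A marketing email worker would create `TokenBucket(rate=14, capacity=14)` and only call the provider when `try_acquire()` returns `True`, pausing its Kafka consumption otherwise.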
Step 10: Monitoring and Alerting
Key metrics to track:
- Notification queue depth (Kafka consumer lag)
- Delivery success rate per channel
- P50/P99 delivery latency
- Provider error rates
- Invalid token rate (rising = stale device DB)
Alerts:
- Queue depth > 10K for critical topic
- Success rate < 95% for any channel
- Provider response P99 > 5s
The User Preference Problem
User notification preferences are deceptively complex. A user can opt out of marketing emails but still want transactional emails. A user can enable push notifications for direct messages but not for group chat. In production, preferences are stored per (user, category, channel) triplet. The notification service must check these before enqueuing.
The performance concern: checking preferences on every notification adds a database query. The standard mitigation is to cache user preferences in Redis with a TTL of roughly 15 minutes. Stale preferences are acceptable: a user who just turned off marketing emails might receive one more marketing notification in the worst case. This is a deliberate tradeoff to keep the hot path fast.
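The caching tradeoff can be sketched as a read-through cache with a TTL. To stay self-contained, this example uses an in-process dictionary as a stand-in for Redis; `CachedPreferences`, `load_from_db`, and the `clock` parameter are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Tuple

PrefKey = Tuple[str, str]        # (user_id, category)
ChannelPrefs = Dict[str, bool]   # channel -> is_enabled


class CachedPreferences:
    """Read-through cache with a TTL; stale reads are possible by design."""

    def __init__(self, load_from_db: Callable[[str, str], ChannelPrefs],
                 ttl_seconds: float = 15 * 60, clock=time.monotonic):
        self.load_from_db = load_from_db
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache: Dict[PrefKey, Tuple[float, ChannelPrefs]] = {}

    def get(self, user_id: str, category: str) -> ChannelPrefs:
        key = (user_id, category)
        entry = self._cache.get(key)
        if entry is not None and self.clock() < entry[0]:
            return entry[1]   # cache hit, possibly up to ttl_seconds stale
        prefs = self.load_from_db(user_id, category)
        self._cache[key] = (self.clock() + self.ttl_seconds, prefs)
        return prefs

    def is_channel_enabled(self, user_id: str, category: str, channel: str) -> bool:
        # Missing rows default to enabled, matching is_enabled DEFAULT TRUE
        return self.get(user_id, category).get(channel, True)
```

If the one-stale-notification window is not acceptable for a given category, the preference update path can also delete the cached entry when the user changes a setting.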
Why Kafka Instead of a Simple Database Queue
A database-backed queue (polling on a notifications table) breaks down at high throughput because every dequeue requires a SELECT plus an UPDATE to mark the row as processing, creating lock contention. Kafka's log-based approach avoids this: producers only append to the log, and each consumer group tracks its own offset instead of mutating rows. Within a consumer group, each partition is assigned to exactly one worker at a time, so workers do not contend with each other; if a worker dies, the group rebalances and another worker resumes that partition from the last committed offset.
The priority separation (critical, social, and marketing topics) is also cleaner in Kafka. Critical topic workers have more instances and shorter poll intervals than the social and marketing workers. This ensures OTP and payment notifications are not queued behind a marketing batch of 1 million emails.
Template-Based Notifications vs Code-Based
Hard-coding notification copy in code creates a deployment dependency every time marketing wants to change a subject line. Template-based systems store message templates in the database and let non-engineers edit them via an admin panel without a code deployment. The tradeoff is that template rendering (variable substitution) adds a small CPU overhead per notification. At 116 notifications/second, this is negligible.
The failure mode to handle: a template with a missing variable should fail loudly (log error, skip notification) rather than silently sending a notification with a visible placeholder like "Hello {{first_name}}".
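A renderer that fails loudly might look like the sketch below. It assumes the `{{variable}}` placeholder syntax from the schema comment in Step 4; the exception class and function name are hypothetical.

```python
import re

# Matches {{name}} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


class MissingTemplateVariable(Exception):
    """Raised when a template references a variable that was not supplied."""


def render(body_tpl: str, variables: dict) -> str:
    """Substitute {{name}} placeholders; refuse to render if any is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(body_tpl)
                      if name not in variables})
    if missing:
        raise MissingTemplateVariable(f"missing variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), body_tpl)
```

The caller catches `MissingTemplateVariable`, logs it with the template and notification identifiers, and skips the send instead of delivering a message with a raw placeholder.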
Deep Dive: Fairness and the Marketing Flood Problem
The hardest operational problem in a notification service is preventing a large marketing campaign from starving time-sensitive transactional notifications. Imagine marketing schedules 1 million promotional emails at 9 PM. If all notifications share one queue, an OTP that a user needs to log in right now sits behind 1 million promos.
The priority-topic separation solves the queue-jumping problem, but there is a subtler issue: even within the critical topic, a single noisy user or service can dominate. The mitigation is per-tenant or per-category rate fairness at the producer side, plus weighted worker allocation at the consumer side.
Worker allocation by priority:
    critical topic: 20 consumer instances, poll every 100ms
    social topic: 8 consumer instances, poll every 500ms
    marketing topic: 4 consumer instances, rate-capped to provider quota
Provider quota awareness:
    Marketing workers throttle to the email provider's allowed
    send rate (e.g., 14/sec) so a 1M campaign drains over ~20 hours
    WITHOUT ever touching the critical workers' capacity.
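The drain-time figure is simple arithmetic, and it is worth being able to run it in both directions during an interview. The helper names below are illustrative only.

```python
def drain_hours(campaign_size: int, sends_per_second: float) -> float:
    """Hours needed to send a campaign at a fixed provider rate."""
    return campaign_size / sends_per_second / 3600


def required_rate(campaign_size: int, deadline_seconds: float) -> float:
    """Sends per second needed to finish a campaign within a deadline."""
    return campaign_size / deadline_seconds


if __name__ == "__main__":
    # 1M emails at 14/sec
    print(f"{drain_hours(1_000_000, 14):.1f} hours")
    # 1M emails within a 30-minute window
    print(f"{required_rate(1_000_000, 30 * 60):.0f}/sec")
```

At 14/sec the campaign takes roughly 19.8 hours. Meeting the 30-minute marketing target from Step 1 for a campaign of this size would need about 556 sends per second, so that target is only realistic on a provider plan with a much higher quota.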
The key sentence to say in the interview: critical and marketing traffic must be physically isolated onto different topics with different worker pools, so the marketing flood can never consume the capacity reserved for OTPs and payment receipts. Sharing a queue and relying on priority ordering within it is the wrong answer because a single oversized batch still blocks the head of the line.
Failure Handling Across the Pipeline
Notification API service is down:
    Trigger sources (payment, signup) retry the enqueue call with
    the same idempotency key. Because notification_id is generated
    from the source event, a retry does not create a duplicate.
Kafka is unavailable:
    The API service buffers to a local durable spool and replays
    when Kafka recovers. Transactional triggers that cannot tolerate
    any delay fall back to a synchronous send for the critical channel.
Channel worker crashes mid-batch:
    Kafka offset is committed only after successful delivery (or
    dead-lettering). On restart, the worker reprocesses uncommitted
    events. The Redis dedup key ensures a reprocessed event that was
    already delivered is skipped.
Downstream provider (FCM/Twilio) outage:
    Circuit breaker opens after consecutive failures, events route
    to a fallback provider or park in a retry queue with backoff.
    Critical channels have a second provider configured; marketing
    channels simply wait out the outage.
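The circuit-breaker-plus-fallback behaviour for provider outages can be sketched as follows. This is a simplified single-process version under stated assumptions: the threshold and timeout defaults are arbitrary, `send_with_fallback` is a hypothetical helper, and the senders are assumed to expose the `send(recipient, content)` method from Step 6.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then allows a
    trial call once `reset_timeout` seconds have passed."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: allow a trial request only after the cool-down has elapsed
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()


def send_with_fallback(primary, fallback, breaker, recipient, content):
    """Try the primary sender unless its breaker is open; else use fallback."""
    if breaker.allow_request():
        try:
            result = primary.send(recipient, content)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback.send(recipient, content)
```

For marketing channels, which have no second provider in this design, the `fallback` argument could be an object whose `send` parks the event in the retry queue instead of calling another vendor.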
Follow-up Questions Interviewers Ask
How do you guarantee exactly-once delivery? You cannot, across an unreliable network and third-party providers. The system guarantees at-least-once delivery plus deduplication: the notification_id plus channel forms a Redis dedup key, so a redelivered event is skipped. In the common case the user perceives exactly-once even though the pipeline is at-least-once; a rare duplicate is still possible, for example when a provider accepts a send but its response times out and the worker retries.
How do you handle a device token that has gone stale? When a provider returns 410 Gone (APNs) or UNREGISTERED (FCM), mark the device row inactive and stop sending to it. A separate cleanup job prunes long-inactive devices. Continuing to send to dead tokens wastes quota and can hurt sender reputation.
How do you support a "quiet hours" or do-not-disturb window? Store quiet-hours preferences per user and timezone. The scheduler holds non-critical notifications until the window closes, then releases them. Critical notifications (OTP, security alerts) bypass quiet hours because they are time-sensitive and user-initiated.
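The quiet-hours check itself is small; the part that is easy to get wrong is a window that wraps past midnight. The sketch below assumes preferences are stored as a local start time, a local end time, and a timezone. It takes any `tzinfo` object, so production code could pass `zoneinfo.ZoneInfo("Asia/Kolkata")` while the example uses a fixed offset. Function names are illustrative.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo,
                   start: time, end: time) -> bool:
    """True if the user's local time falls inside [start, end).

    now_utc must be timezone-aware. Handles windows that wrap past
    midnight, e.g. 22:00 to 07:00.
    """
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end


def should_hold(category: str, now_utc: datetime, user_tz: tzinfo,
                start: time, end: time) -> bool:
    """Critical (transactional) notifications bypass quiet hours."""
    if category == "transactional":
        return False
    return in_quiet_hours(now_utc, user_tz, start, end)


if __name__ == "__main__":
    ist = timezone(timedelta(hours=5, minutes=30))   # fixed-offset stand-in
    now = datetime(2026, 6, 8, 18, 0, tzinfo=timezone.utc)   # 23:30 local
    print(should_hold("marketing", now, ist, time(22, 0), time(7, 0)))
    print(should_hold("transactional", now, ist, time(22, 0), time(7, 0)))
```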
How do you prevent sending the same user 50 notifications in a minute? Apply a per-user notification rate cap and a coalescing rule: if multiple social events arrive in a short window, batch them into one summary notification ("5 people liked your post") rather than five separate pushes. This is both a UX and a cost decision.
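The coalescing rule can be sketched as a grouping step applied to the events collected during a short window. The event shape (`user_id`, `event_type`, `target`, `actor`) and the verb table are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def coalesce(events: List[dict]) -> List[Tuple[str, str]]:
    """Collapse events sharing (user_id, event_type, target) into one message.

    Each event is assumed to look like:
    {"user_id": "u1", "event_type": "like", "target": "your post", "actor": "Ravi"}
    Returns a list of (user_id, message) pairs, one per group.
    """
    groups: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for event in events:
        key = (event["user_id"], event["event_type"], event["target"])
        groups[key].append(event["actor"])

    verbs = {"like": "liked", "comment": "commented on"}
    messages = []
    for (user_id, event_type, target), actors in groups.items():
        verb = verbs.get(event_type, event_type)
        if len(actors) == 1:
            messages.append((user_id, f"{actors[0]} {verb} {target}"))
        else:
            messages.append((user_id, f"{len(actors)} people {verb} {target}"))
    return messages
```

Five like events on the same post for one user collapse into a single `"5 people liked your post"` message, while a lone comment keeps the actor's name.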
How do you scale to 10x the current volume? Notification workers are stateless and partition-parallel, so add Kafka partitions and worker instances horizontally. The likely bottleneck is the downstream providers' rate limits, not your own infrastructure, so multi-provider routing and quota-aware throttling matter more than raw worker count.
Methodology applied to this article · last verified 8 Jun 2026
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.