System Design: Notification Service 2026 [Push, Email, SMS Architecture]

By Aditya Sharma. Published 8 Jun 2026. 2 sources listed.

Last Updated: June 2026

Why Notification Service is a Frequently Asked Design

Candidates report being asked to design a notification service in roughly 10-15% of backend system design rounds, particularly at consumer product companies. That estimate comes from public preparation resources and candidate-reported interview threads. The question tests asynchronous processing, multi-channel routing, reliability, and user preference management.

Step 1: Requirements

Functional requirements:

Send notifications via push (iOS/Android), email, and SMS
Support transactional (OTP, payment receipt) and marketing (promotions) notifications
Template-based notifications with personalization
User notification preferences (opt-in/opt-out per channel and category)
Scheduled notifications (send at specific time or relative to event)
Delivery tracking: sent, delivered, opened

Non-functional requirements:

Scale: 10 million notifications per day (~116/sec average, ~350/sec evening peak; provision for bursts of 1000+/sec)
Transactional notifications: delivery under 5 seconds
Marketing notifications: delivery within 30 minutes
At-least-once delivery (deduplication prevents duplicates)
No notification loss

Step 2: Capacity Estimation

Daily notifications: 10M
  - Transactional (OTP, payment): ~500K/day = ~6/sec
  - Social (likes, comments): ~3M/day = ~35/sec
  - Marketing (promotions): ~6.5M/day = ~75/sec

Peak (evening hours ~8-10pm):
  3x average = ~350/sec

Storage per notification:
  Metadata: ~500 bytes (recipient, template, status, timestamp)
  10M * 500B = 5GB/day notification log
  Retain 90 days: ~450GB total
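The estimates above can be reproduced with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are illustrative, and the 3x peak factor and 500-byte record size are the assumptions already stated in this step.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(daily_count: int) -> float:
    """Average events per second for a given daily volume."""
    return daily_count / SECONDS_PER_DAY

daily_total = 10_000_000
breakdown = {
    "transactional": 500_000,
    "social": 3_000_000,
    "marketing": 6_500_000,
}

average_rate = per_second(daily_total)   # about 116/sec
peak_rate = 3 * average_rate             # about 347/sec, rounded to ~350 above
bytes_per_day = daily_total * 500        # 5 GB/day at 500 bytes per record
retained_bytes = bytes_per_day * 90      # 450 GB for 90 days

for name, count in breakdown.items():
    print(f"{name}: {per_second(count):.1f}/sec")
print(f"average {average_rate:.0f}/sec, peak {peak_rate:.0f}/sec")
print(f"log: {bytes_per_day / 1e9:.0f} GB/day, {retained_bytes / 1e9:.0f} GB retained")
```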

Step 3: High-Level Architecture

[Trigger Sources]
  - User events (payment, signup, message)
  - Scheduled jobs (marketing campaigns)
  - Admin panel (broadcast notifications)

         |
         v
[Notification API Service]
  - Validate request
  - Check user preferences
  - Enrich with template
  - Assign notification_id (idempotency key)
         |
         v
[Priority Kafka Topics]
  - notifications-critical  (OTP, payment, security)
  - notifications-social    (likes, comments, follows)
  - notifications-marketing (promotions, newsletters)

         |
         v
[Channel Workers]
  - Push Worker  -> APNs / FCM
  - Email Worker -> SendGrid / AWS SES
  - SMS Worker   -> Twilio / Vonage

         |
         v
[Delivery Tracker]
  - Store sent/delivered/failed status
  - Retry failed notifications

Step 4: Core Data Model

-- PostgreSQL

-- Notification templates
CREATE TABLE notification_templates (
    template_id     UUID PRIMARY KEY,
    name            VARCHAR(100) UNIQUE,
    channel         VARCHAR(20),    -- push, email, sms
    category        VARCHAR(50),    -- transactional, marketing, social
    subject_tpl     TEXT,           -- email subject template
    body_tpl        TEXT,           -- body with {{variable}} placeholders
    created_at      TIMESTAMP DEFAULT NOW()
);

-- Notification log
CREATE TABLE notifications (
    notification_id UUID PRIMARY KEY,
    user_id         UUID NOT NULL,
    template_id     UUID REFERENCES notification_templates,
    channel         VARCHAR(20),
    status          VARCHAR(20),    -- queued, sent, delivered, failed, opened
    payload         JSONB,          -- template variables
    scheduled_for   TIMESTAMP,
    sent_at         TIMESTAMP,
    delivered_at    TIMESTAMP,
    created_at      TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_notifications_user ON notifications(user_id, created_at DESC);
CREATE INDEX idx_notifications_status ON notifications(status, scheduled_for);

-- User preferences
CREATE TABLE user_notification_preferences (
    user_id         UUID,
    category        VARCHAR(50),
    channel         VARCHAR(20),
    is_enabled      BOOLEAN DEFAULT TRUE,
    PRIMARY KEY (user_id, category, channel)
);

-- Device tokens
CREATE TABLE user_devices (
    device_id       UUID PRIMARY KEY,
    user_id         UUID,
    platform        VARCHAR(10),    -- ios, android, web
    push_token      TEXT,
    is_active       BOOLEAN DEFAULT TRUE,
    last_seen       TIMESTAMP
);

Step 5: Notification Service API

import uuid

# Category -> Kafka topic, matching the three priority topics in Step 3
TOPIC_BY_CATEGORY = {
    'transactional': 'notifications-critical',
    'social': 'notifications-social',
    'marketing': 'notifications-marketing',
}

class NotificationService:
    """
    Core orchestration: validate, check prefs, template, route to queue.
    """
    def __init__(self, user_service, template_service, kafka_producer, preference_service):
        self.users = user_service
        self.templates = template_service
        self.kafka = kafka_producer
        self.preferences = preference_service

    def send(self, notification_request):
        """
        notification_request: {
          user_id, template_id, category, variables, channels,
          scheduled_for (optional)
        }
        """
        category = notification_request['category']

        # Check user preferences
        user_prefs = self.preferences.get(
            notification_request['user_id'],
            category
        )

        allowed_channels = [
            ch for ch in notification_request['channels']
            if user_prefs.is_channel_enabled(ch)
        ]

        if not allowed_channels:
            return {'status': 'suppressed', 'reason': 'user_preferences'}

        # Resolve template
        template = self.templates.get(notification_request['template_id'])
        rendered = template.render(notification_request['variables'])

        # Assign the notification ID, which doubles as the idempotency key.
        # A random UUID only deduplicates redeliveries of this one request;
        # if callers retry send(), derive the ID from the source event instead
        # (see "Failure Handling Across the Pipeline").
        notification_id = str(uuid.uuid4())

        # Pick the priority topic; an unknown category raises KeyError
        topic = TOPIC_BY_CATEGORY[category]

        # Enqueue one message per allowed channel
        for channel in allowed_channels:
            self.kafka.produce(topic, {
                'notification_id': notification_id,
                'user_id': notification_request['user_id'],
                'channel': channel,
                'content': rendered,
                'scheduled_for': notification_request.get('scheduled_for')
            })

        return {'status': 'queued', 'notification_id': notification_id}

Step 6: Channel Sender Abstractions

from abc import ABC, abstractmethod

class NotificationSender(ABC):
    @abstractmethod
    def send(self, recipient, content):
        pass

class PushSender(NotificationSender):
    """Handles APNs (iOS) and FCM (Android)."""

    def send(self, recipient, content):
        platform = recipient['platform']
        token = recipient['push_token']

        if platform == 'ios':
            return self._send_apns(token, content)
        else:
            return self._send_fcm(token, content)

    def _send_apns(self, token, content):
        # Use HTTP/2 APNs API
        # Retry on 500, fail permanently on 410 (invalid token)
        pass

    def _send_fcm(self, token, content):
        # Use Firebase Admin SDK
        # Handle UNREGISTERED tokens -> deactivate device
        pass

class EmailSender(NotificationSender):
    def send(self, recipient, content):
        # SendGrid or AWS SES
        # Return message_id for tracking
        pass

class SMSSender(NotificationSender):
    def send(self, recipient, content):
        # Twilio or Vonage
        # Return SID for delivery tracking
        pass
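One way the orchestration layer might pick a sender is a simple registry keyed by channel name. The sketch below is an assumption about how that wiring could look, not part of the design above: ChannelRouter and RecordingSender are hypothetical names, and RecordingSender is a stand-in that records calls instead of contacting a provider.

```python
from abc import ABC, abstractmethod

class NotificationSender(ABC):
    @abstractmethod
    def send(self, recipient, content):
        ...

class RecordingSender(NotificationSender):
    """Hypothetical stand-in that records calls instead of hitting a provider."""
    def __init__(self, name):
        self.name = name
        self.sent = []

    def send(self, recipient, content):
        self.sent.append((recipient, content))
        return f"{self.name}-{len(self.sent)}"  # fake provider message ID

class ChannelRouter:
    """Maps a channel name to the sender that handles it."""
    def __init__(self, senders):
        self._senders = dict(senders)

    def dispatch(self, channel, recipient, content):
        sender = self._senders.get(channel)
        if sender is None:
            raise ValueError(f"no sender registered for channel {channel!r}")
        return sender.send(recipient, content)

router = ChannelRouter({
    "push": RecordingSender("push"),
    "email": RecordingSender("email"),
    "sms": RecordingSender("sms"),
})
provider_id = router.dispatch("email", {"email": "user@example.com"}, "Your receipt")
```

Adding a channel then means registering one more sender, with no change to the routing code.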

Step 7: Reliability and Deduplication

class IdempotentDeliveryWorker:
    """
    Prevents duplicate sends using a Redis dedup key per (notification, channel).
    """
    DEDUP_TTL = 86400      # keep "sent" markers for 24 hours
    IN_FLIGHT_TTL = 300    # claim expires if the worker dies mid-send

    def __init__(self, redis_client, sender):
        self.redis = redis_client
        self.sender = sender

    def process(self, notification):
        dedup_key = f"notif_sent:{notification['notification_id']}:{notification['channel']}"

        # Atomically claim the key. SET with NX is both the check and the mark,
        # so a separate EXISTS call is redundant (and racy on its own).
        if not self.redis.set(dedup_key, 'in_flight', ex=self.IN_FLIGHT_TTL, nx=True):
            return {'status': 'duplicate', 'skipped': True}

        # Attempt delivery
        try:
            result = self.sender.send(
                notification['recipient'],
                notification['content']
            )
        except Exception:
            # Release the claim so the retry is not treated as a duplicate
            self.redis.delete(dedup_key)
            raise

        # Delivered: keep the marker for the full dedup window
        self.redis.set(dedup_key, 'sent', ex=self.DEDUP_TTL)
        return {'status': 'sent', 'provider_id': result}

Retry strategy (each delay is measured from the previous attempt):

Attempt 1: immediate
Attempt 2: 1 minute later
Attempt 3: 5 minutes later
Attempt 4: 30 minutes later
Max attempts: 4, i.e. 3 retries (1 + 5 + 30 = 36 min window)
After the final attempt fails: move to dead letter queue, alert ops team
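The schedule above can be expressed as a small lookup. This is a minimal sketch with hypothetical names (RETRY_DELAYS_MINUTES, next_action); it only decides what happens next and leaves the actual delayed re-enqueue to the worker.

```python
# Delay in minutes before attempts 1..4, measured from the previous attempt
RETRY_DELAYS_MINUTES = [0, 1, 5, 30]

def next_action(attempts_made: int):
    """Return ('send', delay_minutes) or ('dead_letter', None)."""
    if attempts_made < len(RETRY_DELAYS_MINUTES):
        return ("send", RETRY_DELAYS_MINUTES[attempts_made])
    # All attempts used: hand off to the dead letter queue
    return ("dead_letter", None)

total_window_minutes = sum(RETRY_DELAYS_MINUTES)  # 36
```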

Step 8: Handling Provider-Specific Failures

Failure type	Action
Invalid push token (410/UNREGISTERED)	Mark device inactive, do not retry
Provider rate limit (429)	Backpressure: slow down Kafka consumer
Provider timeout	Retry with exponential backoff
Provider outage	Route to fallback provider
User unsubscribed (via provider)	Update preference DB
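The table translates naturally into a lookup that the channel workers consult after a failed send. The failure-type strings below are hypothetical labels chosen for this sketch; a real worker would first map each provider's status codes and error names onto them.

```python
from enum import Enum

class Action(Enum):
    DEACTIVATE_DEVICE = "mark device inactive, do not retry"
    BACKPRESSURE = "slow down Kafka consumer"
    RETRY_WITH_BACKOFF = "retry with exponential backoff"
    FAILOVER = "route to fallback provider"
    UPDATE_PREFERENCES = "update preference DB"

# Hypothetical failure labels -> action from the table above
ACTION_BY_FAILURE = {
    "invalid_token": Action.DEACTIVATE_DEVICE,
    "rate_limited": Action.BACKPRESSURE,
    "timeout": Action.RETRY_WITH_BACKOFF,
    "outage": Action.FAILOVER,
    "unsubscribed": Action.UPDATE_PREFERENCES,
}

def action_for(failure_type: str) -> Action:
    """Look up the handling policy; unknown failures are surfaced, not guessed."""
    try:
        return ACTION_BY_FAILURE[failure_type]
    except KeyError:
        raise ValueError(f"unclassified failure: {failure_type!r}") from None
```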

Step 9: Scheduling and Batching

Scheduled notifications:
  - Store with scheduled_for timestamp
  - Scheduler worker: every minute, query notifications WHERE
    status='queued' AND scheduled_for <= NOW()
  - Batch size: 1000 per scheduler tick
  - Push to Kafka for actual delivery

Marketing batch sends:
  - Campaign creates 1M notification records
  - Batch writer: INSERT 1000 records per DB transaction
  - Scheduler picks up in 1000-record batches
  - Rate limited to respect provider quotas
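A scheduler tick might look like the sketch below. It uses an in-memory SQLite table purely so the example runs on its own; the design above uses PostgreSQL. The 'dispatched' status is a hypothetical addition so that the next tick does not pick the same rows again, and publishing to Kafka is left to the caller.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def claim_due_batch(conn, now, batch_size=1000):
    """Select up to batch_size due notifications and mark them as handed off."""
    rows = conn.execute(
        "SELECT notification_id FROM notifications "
        "WHERE status = 'queued' AND scheduled_for <= ? "
        "ORDER BY scheduled_for LIMIT ?",
        (now.isoformat(), batch_size),
    ).fetchall()
    ids = [row[0] for row in rows]
    # 'dispatched' is a hypothetical status so the next tick skips these rows
    conn.executemany(
        "UPDATE notifications SET status = 'dispatched' WHERE notification_id = ?",
        [(notification_id,) for notification_id in ids],
    )
    conn.commit()
    return ids  # the caller would publish these to Kafka

# Demo: two notifications are due, one is scheduled for later
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications ("
    "notification_id TEXT PRIMARY KEY, status TEXT, scheduled_for TEXT)"
)
now = datetime(2026, 6, 8, 21, 0, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO notifications VALUES (?, 'queued', ?)",
    [
        ("n1", (now - timedelta(minutes=5)).isoformat()),
        ("n2", now.isoformat()),
        ("n3", (now + timedelta(minutes=5)).isoformat()),
    ],
)
first_tick = claim_due_batch(conn, now)
second_tick = claim_due_batch(conn, now)
```

With several scheduler instances running against PostgreSQL, the select would typically add FOR UPDATE SKIP LOCKED inside a transaction so that two instances do not claim the same rows.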

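Assumed provider throughput (illustrative; real quotas depend on plan and account, so confirm against each provider's current documentation):

Email: ~14/sec on a paid plan (SendGrid free tier: 100/day, so use a paid tier)
Push: ~1000/sec per FCM project
SMS: ~1/sec on a Twilio trial account, up to ~1000/sec assumed for a paid high-throughput setup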
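Respecting a provider quota is commonly done with a token bucket in front of the sender. The following is a minimal single-process sketch under the assumed 14/sec email rate; the TokenBucket class is illustrative, and a multi-worker deployment would need a shared limiter (for example, one backed by Redis).

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""
    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n=1):
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Demo with a controllable clock so the behaviour is deterministic
fake_now = [0.0]
email_bucket = TokenBucket(rate_per_sec=14, capacity=14, clock=lambda: fake_now[0])
burst = sum(email_bucket.try_acquire() for _ in range(20))        # 14 allowed
fake_now[0] = 0.5                                                 # half a second later
after_half_second = sum(email_bucket.try_acquire() for _ in range(10))  # 7 allowed
```

A worker that fails to acquire a token would pause consumption rather than drop the message.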

Step 10: Monitoring and Alerting

Key metrics to track:
  - Notification queue depth (Kafka consumer lag)
  - Delivery success rate per channel
  - P50/P99 delivery latency
  - Provider error rates
  - Invalid token rate (rising = stale device DB)

Alerts:
  - Queue depth > 10K for critical topic
  - Success rate < 95% for any channel
  - Provider response P99 > 5s
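The alert thresholds can be written as a pure function over a metrics snapshot, which makes them easy to unit test. The metric names in this sketch are hypothetical; in practice these rules would usually live in the monitoring system's own rule language.

```python
def evaluate_alerts(metrics):
    """Return the list of alert messages triggered by a metrics snapshot."""
    alerts = []
    if metrics["critical_consumer_lag"] > 10_000:
        alerts.append("critical topic queue depth above 10K")
    for channel, rate in metrics["success_rate"].items():
        if rate < 0.95:
            alerts.append(f"{channel} success rate below 95%")
    if metrics["provider_p99_seconds"] > 5.0:
        alerts.append("provider response P99 above 5s")
    return alerts

snapshot = {
    "critical_consumer_lag": 12_500,
    "success_rate": {"push": 0.99, "email": 0.93, "sms": 0.97},
    "provider_p99_seconds": 2.1,
}
triggered = evaluate_alerts(snapshot)
```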

The User Preference Problem

User notification preferences are deceptively complex. A user can opt out of marketing emails but still want transactional emails. A user can enable push notifications for direct messages but not for group chat. In production, preferences are stored per (user, category, channel) triplet. The notification service must check these before enqueuing.

The performance concern: checking preferences on every notification adds a database query. The standard mitigation is to cache user preferences in Redis with a TTL of roughly 15 minutes. Stale preferences are acceptable: a user who just turned off marketing emails might receive one more marketing notification in the worst case. This is a deliberate tradeoff to keep the hot path fast.
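The caching tradeoff can be illustrated with a small read-through cache. This sketch keeps entries in a process-local dict as a stand-in for Redis; PreferenceCache, the loader callback, and the clock parameter are all hypothetical names introduced for the example.

```python
import time

class PreferenceCache:
    """Read-through cache with a fixed TTL (15 minutes by default)."""
    def __init__(self, loader, ttl_seconds=900, clock=time.monotonic):
        self._loader = loader      # called on a miss, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # (user_id, category) -> (expires_at, value)

    def get(self, user_id, category):
        key = (user_id, category)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh hit: no database query
        value = self._loader(user_id, category)
        self._entries[key] = (now + self._ttl, value)
        return value

# Demo: count how often the "database" is actually queried
fake_now = [0.0]
db_queries = []

def load_from_db(user_id, category):
    db_queries.append((user_id, category))
    return {"email": True, "push": False}

cache = PreferenceCache(load_from_db, clock=lambda: fake_now[0])
cache.get("u1", "marketing")
cache.get("u1", "marketing")   # served from cache
fake_now[0] = 901.0            # TTL has elapsed
cache.get("u1", "marketing")   # reloaded
```

If staleness matters more for some categories, the write path can also delete the cached entry when a user changes a preference, which narrows the stale window without removing the cache.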

Why Kafka Instead of a Simple Database Queue

A database-backed queue (polling on a notifications table) breaks down at high throughput because every dequeue requires a SELECT plus an UPDATE to mark the row as processing, creating lock contention. Kafka's log-based approach avoids this: writes are append-only, reads are sequential, and each consumer group tracks its own offsets. Within a consumer group each partition is owned by one worker at a time; if that worker fails, the partition is reassigned to another instance, which resumes from the last committed offset.

The priority separation (critical, social, and marketing topics) is also cleaner in Kafka. Critical topic workers have more instances and shorter poll intervals than the social and marketing workers. This ensures OTP and payment notifications are not queued behind a marketing batch of 1 million emails.

Template-Based Notifications vs Code-Based

Hard-coding notification copy in code creates a deployment dependency every time marketing wants to change a subject line. Template-based systems store message templates in the database and let non-engineers edit them via an admin panel without a code deployment. The tradeoff is that template rendering (variable substitution) adds a small CPU overhead per notification. At 116 notifications/second, this is negligible.

The failure mode to handle: a template with a missing variable should fail loudly (log error, skip notification) rather than silently sending a notification with a visible placeholder like "Hello {{first_name}}".
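A strict renderer for the {{variable}} placeholder style might look like the sketch below. It is a minimal illustration, not a full template engine; render_strict and MissingTemplateVariable are hypothetical names.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

class MissingTemplateVariable(ValueError):
    """Raised when a template references a variable that was not supplied."""

def render_strict(template: str, variables: dict) -> str:
    referenced = {match.group(1) for match in PLACEHOLDER.finditer(template)}
    missing = sorted(referenced - set(variables))
    if missing:
        # Fail loudly: the caller logs the error and skips the notification
        raise MissingTemplateVariable(f"missing variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: str(variables[match.group(1)]), template)

greeting = render_strict(
    "Hello {{first_name}}, your order {{ order_id }} has shipped.",
    {"first_name": "Asha", "order_id": 1042},
)
```

Checking for missing variables before substitution means a half-rendered message is never produced.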

Deep Dive: Fairness and the Marketing Flood Problem

The hardest operational problem in a notification service is preventing a large marketing campaign from starving time-sensitive transactional notifications. Imagine marketing schedules 1 million promotional emails at 9 PM. If all notifications share one queue, an OTP that a user needs to log in right now sits behind 1 million promos.

The priority-topic separation solves the queue-jumping problem, but there is a subtler issue: even within the critical topic, a single noisy user or service can dominate. The mitigation is per-tenant or per-category rate fairness at the producer side, plus weighted worker allocation at the consumer side.

Worker allocation by priority:
  critical topic:  20 consumer instances, poll every 100ms
  social topic:    8 consumer instances, poll every 500ms
  marketing topic: 4 consumer instances, rate-capped to provider quota

Provider quota awareness:
  Marketing workers throttle to the email provider's allowed
  send rate (e.g., 14/sec) so a 1M campaign drains over ~20 hours
  WITHOUT ever touching the critical workers' capacity.
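The drain time quoted above follows directly from the quota. The sketch below reproduces it and also computes the send rate a campaign of that size would need to meet the 30-minute marketing target from Step 1; the function names are illustrative.

```python
def drain_hours(total_messages: int, rate_per_sec: float) -> float:
    """Hours needed to send total_messages at a fixed rate."""
    return total_messages / rate_per_sec / 3600

def required_rate(total_messages: int, deadline_seconds: int) -> float:
    """Messages per second needed to finish within the deadline."""
    return total_messages / deadline_seconds

campaign_size = 1_000_000
hours_at_quota = drain_hours(campaign_size, 14)            # about 19.8 hours
rate_for_30_min = required_rate(campaign_size, 30 * 60)    # about 556/sec
```

The gap between the two numbers suggests the 30-minute target is realistic for everyday marketing volume, while a 1M-recipient campaign needs either a much higher email quota or a relaxed deadline.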

The key sentence to say in the interview: critical and marketing traffic must be physically isolated onto different topics with different worker pools, so the marketing flood can never consume the capacity reserved for OTPs and payment receipts. Sharing a queue and relying on priority ordering within it is the wrong answer because a single oversized batch still blocks the head of the line.

Failure Handling Across the Pipeline

Notification API service is down:
  Trigger sources (payment, signup) retry the enqueue call with
  the same idempotency key. Because notification_id is generated
  from the source event, a retry does not create a duplicate.

Kafka is unavailable:
  The API service buffers to a local durable spool and replays
  when Kafka recovers. Transactional triggers that cannot tolerate
  any delay fall back to a synchronous send for the critical channel.

Channel worker crashes mid-batch:
  Kafka offset is committed only after successful delivery (or
  dead-lettering). On restart, the worker reprocesses uncommitted
  events. The Redis dedup key ensures a reprocessed event that was
  already delivered is skipped.

Downstream provider (FCM/Twilio) outage:
  Circuit breaker opens after consecutive failures, events route
  to a fallback provider or park in a retry queue with backoff.
  Critical channels have a second provider configured; marketing
  channels simply wait out the outage.
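The circuit-breaker behaviour for a provider outage can be sketched as follows. This is a simplified single-process version with hypothetical names and thresholds; primary and fallback are any callables that take a recipient and content.

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has passed
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

def send_with_fallback(primary, fallback, breaker, recipient, content):
    """Try the primary provider unless its breaker is open, else use the fallback."""
    if breaker.allow():
        try:
            result = primary(recipient, content)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return fallback(recipient, content)
```

Falling back after a timeout can produce a duplicate if the primary actually delivered, which is one reason the dedup key and at-least-once framing still matter here.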

Follow-up Questions Interviewers Ask

How do you guarantee exactly-once delivery? You cannot, across an unreliable network and third-party providers. The system guarantees at-least-once delivery plus deduplication: the notification_id plus channel forms a Redis dedup key, so a redelivered event is skipped. The user perceives exactly-once even though the pipeline is at-least-once.

How do you handle a device token that has gone stale? When a provider returns 410 Gone (APNs) or UNREGISTERED (FCM), mark the device row inactive and stop sending to it. A separate cleanup job prunes long-inactive devices. Continuing to send to dead tokens wastes quota and can hurt sender reputation.

How do you support a "quiet hours" or do-not-disturb window? Store quiet-hours preferences per user and timezone. The scheduler holds non-critical notifications until the window closes, then releases them. Critical notifications (OTP, security alerts) bypass quiet hours because they are time-sensitive and user-initiated.
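A quiet-hours check has to handle windows that wrap past midnight. The sketch below assumes the user's timezone is available as a tzinfo object; a fixed UTC+5:30 offset is used so the example runs anywhere, whereas production code would more likely load an IANA zone such as zoneinfo.ZoneInfo("Asia/Kolkata"). The function names and the "transactional" bypass rule are illustrative.

```python
from datetime import datetime, time, timedelta, timezone

def in_quiet_hours(now_utc, user_tz, start, end):
    """True if the user's local time falls inside [start, end)."""
    local_time = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local_time < end
    # Window wraps past midnight, e.g. 22:00 to 07:00
    return local_time >= start or local_time < end

def should_hold(category, now_utc, user_tz, start, end):
    """Critical (transactional) notifications bypass quiet hours."""
    if category == "transactional":
        return False
    return in_quiet_hours(now_utc, user_tz, start, end)

IST = timezone(timedelta(hours=5, minutes=30))  # fixed offset for the demo
quiet_start, quiet_end = time(22, 0), time(7, 0)
late_evening = datetime(2026, 6, 8, 17, 30, tzinfo=timezone.utc)  # 23:00 IST
midday = datetime(2026, 6, 8, 6, 0, tzinfo=timezone.utc)          # 11:30 IST
```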

How do you prevent sending the same user 50 notifications in a minute? Apply a per-user notification rate cap and a coalescing rule: if multiple social events arrive in a short window, batch them into one summary notification ("5 people liked your post") rather than five separate pushes. This is both a UX and a cost decision.
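Coalescing can be sketched as grouping events into fixed time buckets per user and post. The event fields and the coalesce_likes name are assumptions made for this example; a fixed bucket is the simplest choice and means two events on either side of a bucket boundary are not merged.

```python
from collections import defaultdict

def coalesce_likes(events, window_seconds=60):
    """events: iterable of dicts with user_id, post_id, actor, ts (epoch seconds)."""
    buckets = defaultdict(list)
    for event in events:
        key = (event["user_id"], event["post_id"], event["ts"] // window_seconds)
        buckets[key].append(event["actor"])

    summaries = []
    for (user_id, post_id, _bucket), actors in buckets.items():
        if len(actors) == 1:
            text = f"{actors[0]} liked your post"
        else:
            text = f"{len(actors)} people liked your post"
        summaries.append({"user_id": user_id, "post_id": post_id, "text": text})
    return summaries

events = [
    {"user_id": "u1", "post_id": "p1", "actor": name, "ts": 1000 + i}
    for i, name in enumerate(["Ravi", "Meera", "Arjun", "Sana", "Dev"])
]
events.append({"user_id": "u2", "post_id": "p9", "actor": "Ravi", "ts": 1003})
summaries = coalesce_likes(events)
```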

How do you scale to 10x the current volume? Notification workers are stateless and partition-parallel, so add Kafka partitions and worker instances horizontally. The likely bottleneck is the downstream providers' rate limits, not your own infrastructure, so multi-provider routing and quota-aware throttling matter more than raw worker count.

FAQs

How does a notification service handle different channels (push, email, SMS)?

The core pattern is channel abstraction. A NotificationSender interface is implemented by PushSender (APNs/FCM), EmailSender (SendGrid/SES), and SMSSender (Twilio). The orchestration layer selects the right sender based on user preferences and notification type, routing each notification to the appropriate channel.

How do you prevent duplicate notifications?

Use an idempotency key (typically notification_id, combined with the channel) to deduplicate at the delivery layer. Before sending, the worker atomically claims the key in Redis (SET with NX and a TTL, 24 hours in Step 7); if the key already exists, the send is skipped. This prevents duplicate sends during retries without requiring a persistent database lookup on every notification.
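For the key to survive retries from the trigger source, it has to be derived from the source event rather than generated randomly. One way to do that, sketched here with a hypothetical namespace and field names, is a name-based UUID (uuid5), which always yields the same ID for the same inputs.

```python
import uuid

# Hypothetical namespace for this service; any fixed UUID works
NOTIFICATION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/notifications")

def notification_id_for(source_event_id: str, template_id: str, user_id: str) -> str:
    """Deterministic ID: the same event, template, and user always map to the same ID."""
    name = f"{source_event_id}:{template_id}:{user_id}"
    return str(uuid.uuid5(NOTIFICATION_NAMESPACE, name))

first = notification_id_for("payment-7781", "receipt-v2", "user-42")
retry = notification_id_for("payment-7781", "receipt-v2", "user-42")
other = notification_id_for("payment-7782", "receipt-v2", "user-42")
```

Because a retried trigger produces the same ID, the dedup key built from it matches and the duplicate is skipped.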

How do you handle notification delivery at 10 million notifications per day?

10M/day is roughly 116 notifications/second on average. At that rate even a single Kafka topic with 10 partitions could carry the raw throughput, so the reason to split traffic into separate priority topics (critical, social, marketing) is isolation rather than capacity: critical notifications (OTP, payment) must not be delayed by marketing batch sends. Per-channel provider rate limits (the push, email, and SMS quotas assumed in Step 9) must also be respected.


Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.
