
System Design: Notification Service 2026 [Push, Email, SMS Architecture]

Updated: 8 Jun 2026
Aditya Sharma
Aditya's Edit

PapersAdda 2026 Placement Cycle

By Aditya Sharma·Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for the 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.

What I'd actually study for this

  • 01 Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  • 02 One real project you can defend end-to-end - file paths, design decisions, and what you would change
  • 03 One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  • 04 Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.



Why Notification Service is a Frequently Asked Design

Candidates report a notification service question in roughly 10-15% of backend system design rounds, particularly at consumer product companies. That figure is an estimate drawn from public preparation resources and candidate-reported interview threads. The question tests asynchronous processing, multi-channel routing, reliability, and user preference management.


Step 1: Requirements

Functional requirements:

  • Send notifications via push (iOS/Android), email, and SMS
  • Support transactional (OTP, payment receipt) and marketing (promotions) notifications
  • Template-based notifications with personalization
  • User notification preferences (opt-in/opt-out per channel and category)
  • Scheduled notifications (send at specific time or relative to event)
  • Delivery tracking: sent, delivered, opened

Non-functional requirements:

  • Scale: 10 million notifications per day (116/sec average, 1000+/sec peak)
  • Transactional notifications: delivery under 5 seconds
  • Marketing notifications: delivery within 30 minutes
  • At-least-once delivery (deduplication prevents duplicates)
  • No notification loss

Step 2: Capacity Estimation

Daily notifications: 10M
  - Transactional (OTP, payment): ~500K/day = ~6/sec
  - Social (likes, comments): ~3M/day = ~35/sec
  - Marketing (promotions): ~6.5M/day = ~75/sec

Peak (evening hours ~8-10pm):
  3x average = ~350/sec sustained
  Campaign launches burst higher; size workers for the 1000+/sec peak in the requirements

Storage per notification:
  Metadata: ~500 bytes (recipient, template, status, timestamp)
  10M * 500B = 5GB/day notification log
  Retain 90 days: ~450GB total
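
The arithmetic above is easy to sanity-check in a few lines. This is only a quick calculation sketch using the figures from this step; the constant and function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(daily_count: int) -> float:
    """Average events per second for a given daily volume."""
    return daily_count / SECONDS_PER_DAY

# Daily volumes by category, taken from the estimate above
daily = {
    "transactional": 500_000,
    "social": 3_000_000,
    "marketing": 6_500_000,
}

total_daily = sum(daily.values())        # 10,000,000
average_rate = per_second(total_daily)   # roughly 116/sec
peak_rate = 3 * average_rate             # roughly 350/sec

BYTES_PER_NOTIFICATION = 500
daily_log_gb = total_daily * BYTES_PER_NOTIFICATION / 1e9  # GB written per day
retained_gb = daily_log_gb * 90                            # GB held for 90 days

if __name__ == "__main__":
    for name, count in daily.items():
        print(f"{name}: {per_second(count):.1f}/sec")
    print(f"average {average_rate:.1f}/sec, peak {peak_rate:.1f}/sec")
    print(f"log {daily_log_gb:.1f} GB/day, {retained_gb:.0f} GB retained")
```

Stating the per-category split out loud in the interview matters more than the exact numbers: it shows that the critical traffic is a small fraction of the total.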

Step 3: High-Level Architecture

[Trigger Sources]
  - User events (payment, signup, message)
  - Scheduled jobs (marketing campaigns)
  - Admin panel (broadcast notifications)

         |
         v
[Notification API Service]
  - Validate request
  - Check user preferences
  - Enrich with template
  - Assign notification_id (idempotency key)
         |
         v
[Priority Kafka Topics]
  - notifications-critical  (OTP, payment, security)
  - notifications-social    (likes, comments, follows)
  - notifications-marketing (promotions, newsletters)

         |
         v
[Channel Workers]
  - Push Worker  -> APNs / FCM
  - Email Worker -> SendGrid / AWS SES
  - SMS Worker   -> Twilio / Vonage

         |
         v
[Delivery Tracker]
  - Store sent/delivered/failed status
  - Retry failed notifications

Step 4: Core Data Model

-- PostgreSQL

-- Notification templates
CREATE TABLE notification_templates (
    template_id     UUID PRIMARY KEY,
    name            VARCHAR(100) UNIQUE,
    channel         VARCHAR(20),    -- push, email, sms
    category        VARCHAR(50),    -- transactional, marketing, social
    subject_tpl     TEXT,           -- email subject template
    body_tpl        TEXT,           -- body with {{variable}} placeholders
    created_at      TIMESTAMP DEFAULT NOW()
);

-- Notification log
CREATE TABLE notifications (
    notification_id UUID PRIMARY KEY,
    user_id         UUID NOT NULL,
    template_id     UUID REFERENCES notification_templates,
    channel         VARCHAR(20),
    status          VARCHAR(20),    -- queued, sent, delivered, failed, opened
    payload         JSONB,          -- template variables
    scheduled_for   TIMESTAMP,
    sent_at         TIMESTAMP,
    delivered_at    TIMESTAMP,
    created_at      TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_notifications_user ON notifications(user_id, created_at DESC);
CREATE INDEX idx_notifications_status ON notifications(status, scheduled_for);

-- User preferences
CREATE TABLE user_notification_preferences (
    user_id         UUID,
    category        VARCHAR(50),
    channel         VARCHAR(20),
    is_enabled      BOOLEAN DEFAULT TRUE,
    PRIMARY KEY (user_id, category, channel)
);

-- Device tokens
CREATE TABLE user_devices (
    device_id       UUID PRIMARY KEY,
    user_id         UUID,
    platform        VARCHAR(10),    -- ios, android, web
    push_token      TEXT,
    is_active       BOOLEAN DEFAULT TRUE,
    last_seen       TIMESTAMP
);

Step 5: Notification Service API

import uuid

# Map each notification category to its Kafka topic (topic names from Step 3)
TOPIC_BY_CATEGORY = {
    'transactional': 'notifications-critical',
    'social': 'notifications-social',
    'marketing': 'notifications-marketing',
}

class NotificationService:
    """
    Core orchestration: validate, check prefs, template, route to queue.
    """
    def __init__(self, user_service, template_service, kafka_producer, preference_service):
        self.users = user_service
        self.templates = template_service
        self.kafka = kafka_producer
        self.preferences = preference_service

    def send(self, notification_request):
        """
        notification_request: {
          user_id, template_id, category, variables, channels, scheduled_for
        }
        """
        # Check user preferences for this category
        user_prefs = self.preferences.get(
            notification_request['user_id'],
            notification_request['category']
        )

        # Keep only the channels the user has not opted out of
        allowed_channels = [
            ch for ch in notification_request['channels']
            if user_prefs.is_channel_enabled(ch)
        ]

        if not allowed_channels:
            return {'status': 'suppressed', 'reason': 'user_preferences'}

        # Resolve template and substitute variables
        template = self.templates.get(notification_request['template_id'])
        rendered = template.render(notification_request['variables'])

        # Assign notification_id (idempotency key). Random here; derive it
        # from the source event instead if callers may retry the same request.
        notification_id = str(uuid.uuid4())

        # Route by category so each priority class has its own topic
        topic = TOPIC_BY_CATEGORY[notification_request['category']]

        # Enqueue one message per allowed channel
        for channel in allowed_channels:
            self.kafka.produce(topic, {
                'notification_id': notification_id,
                'user_id': notification_request['user_id'],
                'channel': channel,
                'content': rendered,
                'scheduled_for': notification_request.get('scheduled_for')
            })

        return {'status': 'queued', 'notification_id': notification_id}

Step 6: Channel Sender Abstractions

from abc import ABC, abstractmethod

class NotificationSender(ABC):
    @abstractmethod
    def send(self, recipient, content):
        pass

class PushSender(NotificationSender):
    """Handles APNs (iOS) and FCM (Android)."""

    def send(self, recipient, content):
        platform = recipient['platform']
        token = recipient['push_token']

        if platform == 'ios':
            return self._send_apns(token, content)
        else:
            return self._send_fcm(token, content)

    def _send_apns(self, token, content):
        # Use HTTP/2 APNs API
        # Retry on 500, fail permanently on 410 (invalid token)
        pass

    def _send_fcm(self, token, content):
        # Use Firebase Admin SDK
        # Handle UNREGISTERED tokens -> deactivate device
        pass

class EmailSender(NotificationSender):
    def send(self, recipient, content):
        # SendGrid or AWS SES
        # Return message_id for tracking
        pass

class SMSSender(NotificationSender):
    def send(self, recipient, content):
        # Twilio or Vonage
        # Return SID for delivery tracking
        pass

Step 7: Reliability and Deduplication

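class IdempotentDeliveryWorker:
    """
    Prevents duplicate sends using a Redis dedup key per (notification, channel).
    """
    def __init__(self, redis_client, sender):
        self.redis = redis_client
        self.sender = sender
        self.DEDUP_TTL = 86400     # 24 hours
        # Assumed value: short claim TTL so that a worker crashing mid-send
        # does not block retries for the full 24 hours
        self.IN_FLIGHT_TTL = 300   # 5 minutes

    def process(self, notification):
        dedup_key = f"notif_sent:{notification['notification_id']}:{notification['channel']}"

        # Atomically claim the notification. SET with NX is both the check and
        # the mark, so a separate EXISTS call is unnecessary (and would be racy).
        if not self.redis.set(dedup_key, 'in_flight', ex=self.IN_FLIGHT_TTL, nx=True):
            return {'status': 'duplicate', 'skipped': True}

        # Attempt delivery. Assumes the consumer has already resolved
        # notification['recipient'] (device token, address, or number) from user_id.
        try:
            result = self.sender.send(
                notification['recipient'],
                notification['content']
            )
        except Exception:
            # Delete dedup key so retry is attempted
            self.redis.delete(dedup_key)
            raise

        # Delivery succeeded: keep the key for the full dedup window
        self.redis.set(dedup_key, 'sent', ex=self.DEDUP_TTL)
        return {'status': 'sent', 'provider_id': result}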

Retry strategy:

Attempt 1: immediate
Attempt 2: 1 minute later
Attempt 3: 5 minutes later
Attempt 4: 30 minutes later
Max attempts: 4, i.e. 3 retries (total 36 min window)
After max retries: move to dead letter queue, alert ops team
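
The schedule above can be expressed as a small lookup. This is a minimal sketch, assuming each delay is measured from the previous attempt; `RETRY_DELAYS` and `next_delay` are illustrative names, not part of any library.

```python
# Delay before each attempt, in seconds, measured from the previous attempt.
RETRY_DELAYS = [0, 60, 300, 1800]  # immediate, 1 min, 5 min, 30 min

def next_delay(attempts_made: int):
    """Seconds to wait before the next attempt, or None when the
    notification should go to the dead letter queue instead."""
    if attempts_made >= len(RETRY_DELAYS):
        return None
    return RETRY_DELAYS[attempts_made]

# Time between the first attempt and the last one
total_window_minutes = sum(RETRY_DELAYS) // 60

if __name__ == "__main__":
    for made in range(5):
        print(made, next_delay(made))
    print("window (min):", total_window_minutes)
```

A worker would call `next_delay` with the number of attempts already made and either re-enqueue the event with that delay or dead-letter it when the result is `None`.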

Step 8: Handling Provider-Specific Failures

Failure type                          | Action
Invalid push token (410/UNREGISTERED) | Mark device inactive, do not retry
Provider rate limit (429)             | Backpressure: slow down Kafka consumer
Provider timeout                      | Retry with exponential backoff
Provider outage                       | Route to fallback provider
User unsubscribed (via provider)      | Update preference DB
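
One way to keep these rules in a single place is a classifier that maps a provider response to an action. The sketch below is an assumption-heavy illustration: the `UNSUBSCRIBED` error code, the `outage_threshold` parameter, and the function signature are hypothetical, and real providers report these conditions in their own formats.

```python
from enum import Enum

class Action(Enum):
    DEACTIVATE_DEVICE = "deactivate_device"    # invalid token, never retry
    THROTTLE = "throttle"                      # slow down the consumer
    RETRY = "retry"                            # retry with backoff
    FAILOVER = "failover"                      # route to fallback provider
    UPDATE_PREFERENCES = "update_preferences"  # provider-side unsubscribe

def classify_failure(status_code=None, error_code=None, timed_out=False,
                     consecutive_failures=0, outage_threshold=5):
    """Map one failed provider call to an action from the failure table."""
    if status_code == 410 or error_code == "UNREGISTERED":
        return Action.DEACTIVATE_DEVICE
    if status_code == 429:
        return Action.THROTTLE
    if error_code == "UNSUBSCRIBED":  # hypothetical provider error code
        return Action.UPDATE_PREFERENCES
    if consecutive_failures >= outage_threshold:
        return Action.FAILOVER        # treat repeated failures as an outage
    if timed_out or (status_code is not None and status_code >= 500):
        return Action.RETRY
    raise ValueError("unclassified provider failure")
```

Raising on anything unrecognised is deliberate here: an unknown 4xx response should be looked at by a human rather than retried blindly.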

Step 9: Scheduling and Batching

Scheduled notifications:
  - Store with scheduled_for timestamp
  - Scheduler worker: every minute, query notifications WHERE
    status='queued' AND scheduled_for <= NOW()
  - Batch size: 1000 per scheduler tick
  - Push to Kafka for actual delivery
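
A scheduler tick can be sketched as one query plus a status update in the same transaction. The example below uses the standard-library `sqlite3` module only so that it runs anywhere; the design in this article uses PostgreSQL. The `enqueued` status is a hypothetical intermediate state added for illustration (the schema in Step 4 lists only queued, sent, delivered, failed, opened).

```python
import sqlite3

BATCH_SIZE = 1000  # notifications claimed per scheduler tick

def fetch_due_batch(conn, now_iso, batch_size=BATCH_SIZE):
    """Claim up to batch_size due notifications and return their IDs.

    Timestamps are ISO-8601 strings in one fixed format, so plain string
    comparison orders them correctly.
    """
    with conn:  # one transaction: select the batch, then mark it claimed
        rows = conn.execute(
            "SELECT notification_id FROM notifications "
            "WHERE status = 'queued' AND scheduled_for <= ? "
            "ORDER BY scheduled_for LIMIT ?",
            (now_iso, batch_size),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE notifications SET status = 'enqueued' "
            "WHERE notification_id = ?",
            [(notification_id,) for notification_id in ids],
        )
    return ids  # the caller pushes these to Kafka for delivery
```

If more than one scheduler instance runs against PostgreSQL, the usual approach is to add `FOR UPDATE SKIP LOCKED` to the SELECT so that two instances do not claim the same rows.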

Marketing batch sends:
  - Campaign creates 1M notification records
  - Batch writer: INSERT 1000 records per DB transaction
  - Scheduler picks up in 1000-record batches
  - Rate limited to respect provider quotas

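Approximate provider send rates assumed for throttling (plan-dependent; confirm against current provider limits):
  Email: ~14/sec (SendGrid free tier: 100/day -> use Pro)
  Push: ~1000/sec per FCM project
  SMS: ~1/sec Twilio trial, ~1000/sec Pro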
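
Respecting a provider quota is commonly done with a token bucket in front of the sender. The sketch below is one possible in-process version, not the article's implementation; the class name and the injectable `clock` parameter are assumptions made so the behaviour can be tested without sleeping.

```python
import time

class TokenBucket:
    """Allows at most rate_per_sec sends on average, with bursts up to capacity."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity   # start full
        self.last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False when the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A marketing email worker would call `try_acquire()` before each send and pause its Kafka consumption when it returns `False`. With several worker processes the bucket has to live in shared storage (Redis, for example) rather than in process memory.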

Step 10: Monitoring and Alerting

Key metrics to track:
  - Notification queue depth (Kafka consumer lag)
  - Delivery success rate per channel
  - P50/P99 delivery latency
  - Provider error rates
  - Invalid token rate (rising = stale device DB)

Alerts:
  - Queue depth > 10K for critical topic
  - Success rate < 95% for any channel
  - Provider response P99 > 5s
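
The alert thresholds above translate directly into a rule check. This is a minimal sketch: the `Metrics` fields and alert names are hypothetical, and in practice these rules would normally live in the monitoring system rather than in application code.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    critical_queue_depth: int        # consumer lag on the critical topic
    success_rate_by_channel: dict    # channel -> fraction in [0, 1]
    provider_p99_seconds: float      # P99 provider response time

def evaluate_alerts(m: Metrics) -> list:
    """Return the names of the alerts that should fire for these metrics."""
    alerts = []
    if m.critical_queue_depth > 10_000:
        alerts.append("critical_queue_depth")
    for channel, rate in sorted(m.success_rate_by_channel.items()):
        if rate < 0.95:
            alerts.append(f"low_success_rate:{channel}")
    if m.provider_p99_seconds > 5.0:
        alerts.append("provider_p99_latency")
    return alerts
```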

The User Preference Problem

User notification preferences are deceptively complex. A user can opt out of marketing emails but still want transactional emails. A user can enable push notifications for direct messages but not for group chat. In production, preferences are stored per (user, category, channel) triplet. The notification service must check these before enqueuing.

The performance concern: checking preferences on every notification adds a database query. The standard mitigation is to cache user preferences in Redis with a TTL of roughly 15 minutes. Stale preferences are acceptable: a user who just turned off marketing emails might receive one more marketing notification in the worst case. This is a deliberate tradeoff to keep the hot path fast.
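
The caching tradeoff described here is the cache-aside pattern with a TTL. The sketch below keeps the cache in process memory purely so that it is self-contained; the article's design puts it in Redis. `PreferenceCache`, `loader`, and the `clock` parameter are illustrative names.

```python
import time

class PreferenceCache:
    """Cache-aside lookup: serve from cache until the TTL expires, then reload."""

    def __init__(self, loader, ttl_seconds=15 * 60, clock=time.monotonic):
        self._loader = loader      # function(user_id) -> preferences, e.g. a DB query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # user_id -> (expires_at, preferences)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]                      # fresh hit, no database query
        prefs = self._loader(user_id)            # miss or expired: reload
        self._entries[user_id] = (now + self._ttl, prefs)
        return prefs

    def invalidate(self, user_id):
        """Drop the cached entry, e.g. right after the user edits preferences."""
        self._entries.pop(user_id, None)
```

Calling `invalidate` from the preference-update path narrows the staleness window for the common case, while the TTL remains the safety net for updates that bypass it.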

Why Kafka Instead of a Simple Database Queue

A database-backed queue (polling on a notifications table) breaks down at high throughput because every dequeue requires a SELECT plus an UPDATE to mark the row as processing, creating lock contention. Kafka's log-based approach avoids this: producers only append, consumers read sequentially, and each consumer group tracks its own committed offset per partition. Within a consumer group, each partition is assigned to one worker at a time; if that worker fails, the partition is reassigned to another worker in the group, which resumes from the last committed offset.

The priority separation (critical, social, and marketing topics) is also cleaner in Kafka. Critical-topic workers have more instances and shorter poll intervals than the workers on the lower-priority topics. This ensures OTP and payment notifications are not queued behind a marketing batch of 1 million emails.

Template-Based Notifications vs Code-Based

Hard-coding notification copy in code creates a deployment dependency every time marketing wants to change a subject line. Template-based systems store message templates in the database and let non-engineers edit them via an admin panel without a code deployment. The tradeoff is that template rendering (variable substitution) adds a small CPU overhead per notification. At 116 notifications/second, this is negligible.

The failure mode to handle: a template with a missing variable should fail loudly (log error, skip notification) rather than silently sending a notification with a visible placeholder like "Hello {{first_name}}".
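
A strict renderer makes that failure mode explicit. The sketch below assumes the simple `{{variable}}` placeholder syntax from the schema comment in Step 4; `render_strict` and `MissingTemplateVariable` are hypothetical names, and a real system might use a template library with an equivalent strict mode.

```python
import re

# Matches {{name}} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

class MissingTemplateVariable(Exception):
    """Raised when a template references a variable that was not supplied."""

def render_strict(template: str, variables: dict) -> str:
    """Substitute every placeholder, or raise if any variable is missing."""
    missing = sorted(
        {name for name in PLACEHOLDER.findall(template) if name not in variables}
    )
    if missing:
        raise MissingTemplateVariable(", ".join(missing))
    return PLACEHOLDER.sub(lambda match: str(variables[match.group(1)]), template)
```

The caller catches `MissingTemplateVariable`, logs the template and the missing names, and skips the send instead of delivering a half-rendered message.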


Deep Dive: Fairness and the Marketing Flood Problem

The hardest operational problem in a notification service is preventing a large marketing campaign from starving time-sensitive transactional notifications. Imagine marketing schedules 1 million promotional emails at 9 PM. If all notifications share one queue, an OTP that a user needs to log in right now sits behind 1 million promos.

The priority-topic separation solves the queue-jumping problem, but there is a subtler issue: even within the critical topic, a single noisy user or service can dominate. The mitigation is per-tenant or per-category rate fairness at the producer side, plus weighted worker allocation at the consumer side.

Worker allocation by priority:
  critical topic:  20 consumer instances, poll every 100ms
  social topic:    8 consumer instances, poll every 500ms
  marketing topic: 4 consumer instances, rate-capped to provider quota

Provider quota awareness:
  Marketing workers throttle to the email provider's allowed
  send rate (e.g., 14/sec) so a 1M campaign drains over ~20 hours
  WITHOUT ever touching the critical workers' capacity.

The key sentence to say in the interview: critical and marketing traffic must be physically isolated onto different topics with different worker pools, so the marketing flood can never consume the capacity reserved for OTPs and payment receipts. Sharing a queue and relying on priority ordering within it is the wrong answer because a single oversized batch still blocks the head of the line.


Failure Handling Across the Pipeline

Notification API service is down:
  Trigger sources (payment, signup) retry the enqueue call with
  the same idempotency key. Because notification_id is generated
  from the source event, a retry does not create a duplicate.

Kafka is unavailable:
  The API service buffers to a local durable spool and replays
  when Kafka recovers. Transactional triggers that cannot tolerate
  any delay fall back to a synchronous send for the critical channel.

Channel worker crashes mid-batch:
  Kafka offset is committed only after successful delivery (or
  dead-lettering). On restart, the worker reprocesses uncommitted
  events. The Redis dedup key ensures a reprocessed event that was
  already delivered is skipped.

Downstream provider (FCM/Twilio) outage:
  Circuit breaker opens after consecutive failures, events route
  to a fallback provider or park in a retry queue with backoff.
  Critical channels have a second provider configured; marketing
  channels simply wait out the outage.
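
The first failure case depends on the notification ID being derived from the source event rather than generated randomly. One way to do that, sketched here under assumptions, is a name-based UUID (version 5) from the standard library; the namespace string, the function name, and the choice of inputs are all illustrative.

```python
import uuid

# Hypothetical namespace for this service. Any fixed UUID works,
# as long as it never changes once IDs have been issued.
NOTIFICATION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "notifications.example.com")

def notification_id_for(source_event_id: str, template_id: str) -> str:
    """Deterministic ID: the same event and template always yield the same UUID."""
    name = f"{source_event_id}:{template_id}"
    return str(uuid.uuid5(NOTIFICATION_NAMESPACE, name))
```

Because a retried trigger produces the same ID, the dedup key built from notification ID and channel recognises the retry as a duplicate.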

Follow-up Questions Interviewers Ask

How do you guarantee exactly-once delivery? You cannot, across an unreliable network and third-party providers. The system guarantees at-least-once delivery plus deduplication: the notification_id plus channel forms a Redis dedup key, so a redelivered event is skipped. The user perceives exactly-once even though the pipeline is at-least-once.

How do you handle a device token that has gone stale? When a provider returns 410 Gone (APNs) or UNREGISTERED (FCM), mark the device row inactive and stop sending to it. A separate cleanup job prunes long-inactive devices. Continuing to send to dead tokens wastes quota and can hurt sender reputation.

How do you support a "quiet hours" or do-not-disturb window? Store quiet-hours preferences per user and timezone. The scheduler holds non-critical notifications until the window closes, then releases them. Critical notifications (OTP, security alerts) bypass quiet hours because they are time-sensitive and often user-initiated.
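
The quiet-hours check has one subtlety worth showing: the window usually wraps past midnight. The sketch below is an illustration under assumptions; it uses a fixed UTC offset for the user's timezone to stay dependency-free (a real system would store a named zone and use `zoneinfo` to handle daylight saving), and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

def in_quiet_hours(now_utc: datetime, utc_offset: timedelta,
                   start: time, end: time) -> bool:
    """True if the user's local time falls inside [start, end)."""
    local = now_utc.astimezone(timezone(utc_offset)).time()
    if start <= end:
        return start <= local < end         # window within one day
    return local >= start or local < end    # window wraps past midnight

def should_hold(category: str, now_utc: datetime, utc_offset: timedelta,
                start: time, end: time) -> bool:
    """Hold non-critical notifications during quiet hours."""
    if category == "transactional":
        return False                        # OTPs and security alerts bypass
    return in_quiet_hours(now_utc, utc_offset, start, end)
```

For example, with quiet hours of 22:00 to 07:00 and an offset of +05:30, a marketing notification arriving at 17:30 UTC (23:00 local) is held, while an OTP at the same moment is sent.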

How do you prevent sending the same user 50 notifications in a minute? Apply a per-user notification rate cap and a coalescing rule: if multiple social events arrive in a short window, batch them into one summary notification ("5 people liked your post") rather than five separate pushes. This is both a UX and a cost decision.
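
The coalescing rule can be sketched as grouping events per user, event type, and time bucket. This is a simplified illustration: the event fields, the `VERBS` table, and the fixed (non-sliding) window are assumptions, and a production version would also group by the post being acted on.

```python
from collections import defaultdict

VERBS = {"like": "liked", "comment": "commented on"}

def coalesce(events, window_seconds=60):
    """Collapse events for the same user and type within one fixed time
    bucket into a single summary notification.

    events: dicts with user_id, type, actor, timestamp (seconds).
    """
    groups = defaultdict(list)
    for event in events:
        bucket = int(event["timestamp"] // window_seconds)
        groups[(event["user_id"], event["type"], bucket)].append(event["actor"])

    summaries = []
    for (user_id, event_type, _bucket), actors in groups.items():
        verb = VERBS[event_type]
        if len(actors) == 1:
            text = f"{actors[0]} {verb} your post"
        else:
            text = f"{len(actors)} people {verb} your post"
        summaries.append({"user_id": user_id, "text": text})
    return summaries
```

The per-user rate cap is a separate check applied after coalescing, so that a burst of unrelated events is still limited.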

How do you scale to 10x the current volume? Notification workers are stateless and partition-parallel, so add Kafka partitions and worker instances horizontally. The likely bottleneck is the downstream providers' rate limits, not your own infrastructure, so multi-provider routing and quota-aware throttling matter more than raw worker count.


Methodology applied to this article · last verified 8 Jun 2026
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
