System Design: Rate Limiter 2026 [Full Design with Code]

What changed in 2026 drives
Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- 01Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
- 02One real project you can defend end-to-end - file paths, design decisions, and what you would change
- 03One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- 04Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Last Updated: June 2026
Why Rate Limiter is a Canonical System Design Problem
Candidates report rate limiter as appearing in roughly 20% of FAANG system design rounds and as a near-universal design question at API platform companies (Stripe, Twilio, Cloudflare). Based on public preparation resources and candidate-reported interview threads, it tests distributed state management, tradeoffs between accuracy and performance, and knowledge of Redis primitives.
Step 1: Requirements Clarification
Functional requirements:
- Limit number of requests per user/IP/API key within a time window
- Support multiple rate limit tiers (free: 1000/min, pro: 10K/min, enterprise: custom)
- Requests exceeding the limit get 429 Too Many Requests with Retry-After header
- Rate limits should be configurable without redeployment
Non-functional requirements:
- Low latency: rate limit check must add less than 5ms to request latency
- High availability: rate limiter must not be a single point of failure
- Eventual consistency acceptable (minor over-counting under failure is OK)
- Scale: 10 million requests per minute across 100+ microservices
Out of scope:
- DDoS protection (separate layer, different tooling)
- Application-level throttling (this is infrastructure-level)
Step 2: Capacity Estimation
Traffic: 10M requests/minute = 166,667 RPS
Rate limit checks per request: 1 Redis operation
Redis throughput: roughly 100K-1M ops/sec per node
Redis nodes needed: 2-3 nodes with replication
Storage per user entry:
Key: "rate:{user_id}:{window}" = ~50 bytes
Value: counter (8 bytes) + TTL
Total: ~60 bytes per active user per window
For 1M active users: 60MB per window
With 2 windows (sliding): ~120MB (fits in single Redis instance)
Step 3: Core Algorithms
Algorithm 1: Fixed Window Counter
Simplest approach. Count requests in fixed time windows (per minute, per hour).
import redis
import time
class FixedWindowRateLimiter:
"""
Simple, low memory. Problem: burst at window boundary.
A user can make 200% of limit in 2 seconds straddling window boundary.
Time: O(1) Space: O(users * windows)
"""
def __init__(self, redis_client, limit, window_seconds):
self.redis = redis_client
self.limit = limit
self.window = window_seconds
def is_allowed(self, user_id):
window_id = int(time.time()) // self.window
key = f"rate:{user_id}:{window_id}"
count = self.redis.incr(key)
if count == 1:
self.redis.expire(key, self.window * 2)
return count <= self.limit
Problem: A user makes 1000 requests at 0:59:59 (limit) and 1000 at 1:00:01. Both windows see 1000, but the user made 2000 in a 2-second period.
Algorithm 2: Sliding Window Log
Store timestamp of every request. On each new request, count how many are within the last window.
class SlidingWindowLogRateLimiter:
"""
Accurate but high memory: O(requests) per user.
Time: O(log n) for sorted set operations Space: O(limit * users)
"""
def __init__(self, redis_client, limit, window_seconds):
self.redis = redis_client
self.limit = limit
self.window = window_seconds
def is_allowed(self, user_id):
now = time.time()
window_start = now - self.window
key = f"rate_log:{user_id}"
pipe = self.redis.pipeline()
pipe.zremrangebyscore(key, 0, window_start) # remove old entries
pipe.zadd(key, {str(now): now}) # add current request
pipe.zcard(key) # count remaining
pipe.expire(key, self.window)
_, _, count, _ = pipe.execute()
return count <= self.limit
Algorithm 3: Sliding Window Counter (Recommended)
Approximate sliding window using two fixed windows and linear interpolation.
class SlidingWindowCounterRateLimiter:
"""
Best production balance: approximate accuracy, O(1) per user.
Uses: current window count + weighted previous window count.
Approximation: current_count + prev_count * (1 - elapsed_fraction)
Time: O(1) Space: O(users)
"""
def __init__(self, redis_client, limit, window_seconds):
self.redis = redis_client
self.limit = limit
self.window = window_seconds
def is_allowed(self, user_id):
now = time.time()
current_window = int(now) // self.window
prev_window = current_window - 1
elapsed_in_window = now - (current_window * self.window)
curr_key = f"rate:{user_id}:{current_window}"
prev_key = f"rate:{user_id}:{prev_window}"
pipe = self.redis.pipeline()
pipe.get(curr_key)
pipe.get(prev_key)
curr_count, prev_count = pipe.execute()
curr_count = int(curr_count or 0)
prev_count = int(prev_count or 0)
# Weighted approximation
elapsed_fraction = elapsed_in_window / self.window
approx_count = curr_count + prev_count * (1 - elapsed_fraction)
if approx_count >= self.limit:
return False
pipe2 = self.redis.pipeline()
pipe2.incr(curr_key)
pipe2.expire(curr_key, self.window * 2)
pipe2.execute()
return True
Algorithm 4: Token Bucket
Fill a bucket with tokens at a fixed rate. Each request consumes one token. Allows bursting.
class TokenBucketRateLimiter:
"""
Supports bursting. More complex Redis implementation.
Time: O(1) Space: O(users)
"""
def __init__(self, redis_client, capacity, refill_rate):
self.redis = redis_client
self.capacity = capacity # max tokens
self.refill_rate = refill_rate # tokens per second
def is_allowed(self, user_id):
key = f"token_bucket:{user_id}"
now = time.time()
# Lua script for atomicity
lua_script = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill based on time elapsed
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
if tokens >= 1 then
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 1
else
return 0
end
"""
result = self.redis.eval(lua_script, 1, key, self.capacity, self.refill_rate, now)
return result == 1
Algorithm 5: Leaky Bucket
The leaky bucket models requests as water entering a bucket that leaks at a constant rate. Unlike the token bucket (which allows bursts), the leaky bucket enforces a smooth, constant output rate. Requests that arrive when the bucket is full are dropped.
class LeakyBucketRateLimiter:
"""
Smooths traffic to a constant outflow rate. No bursts allowed.
Implemented as a queue draining at a fixed rate.
Time: O(1) Space: O(users)
"""
def __init__(self, redis_client, capacity, leak_rate):
self.redis = redis_client
self.capacity = capacity # max queued requests
self.leak_rate = leak_rate # requests drained per second
def is_allowed(self, user_id):
key = f"leaky:{user_id}"
now = time.time()
lua_script = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local leak_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'level', 'last_leak')
local level = tonumber(bucket[1]) or 0
local last_leak = tonumber(bucket[2]) or now
-- Leak out water since last check
local leaked = (now - last_leak) * leak_rate
level = math.max(0, level - leaked)
if level + 1 <= capacity then
level = level + 1
redis.call('HMSET', key, 'level', level, 'last_leak', now)
redis.call('EXPIRE', key, 3600)
return 1
else
redis.call('HMSET', key, 'level', level, 'last_leak', now)
return 0
end
"""
result = self.redis.eval(lua_script, 1, key, self.capacity, self.leak_rate, now)
return result == 1
The leaky bucket is the right choice when a downstream service requires a strictly constant request rate (for example, a payment processor that mandates no more than N transactions per second). The token bucket suits the opposite case: an API that wants to tolerate short bursts as long as the long-run average stays within limit.
Step 3b: Why Atomicity Matters Here
The single most common bug in a distributed rate limiter is a read-then-write race. Consider the naive sliding-window-counter flow: read the count, compare to the limit, then increment. Between the read and the increment, another request on a different server can read the same count and also pass the check. Under high concurrency, both requests are admitted even though only one slot was free, so the effective limit is breached.
Server A: GET count -> 999 (limit 1000), check passes
Server B: GET count -> 999 (limit 1000), check passes <-- race
Server A: INCR -> 1000
Server B: INCR -> 1001 <-- over the limit
Two correct fixes. First, push the entire check-and-increment into a single Lua script that Redis executes atomically (Redis runs Lua single-threaded, so no interleaving is possible). The token bucket and leaky bucket above already do this. Second, for the simple counter, use INCR first and then compare: INCR is atomic and returns the post-increment value, so each request gets a unique slot number and you reject any request whose returned value exceeds the limit. The order matters: increment-then-check is race-free, check-then-increment is not.
Step 4: High-Level Architecture
Client Request
|
v
[API Gateway / Load Balancer]
|
|-----> [Rate Limiter Middleware]
| |
| v
| [Redis Cluster]
| (rate counters, TTL-based)
|
| if allowed:
v
[Backend Services]
Header response:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 734
X-RateLimit-Reset: 1686234060
Retry-After: 45 (on 429)
Step 5: API Design
Rate Limit Configuration API:
POST /rate-limits
{
"tier": "pro",
"resource": "/api/v1/payments",
"window_seconds": 60,
"limit": 10000
}
Rate Limit Header in every response:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 734
X-RateLimit-Reset: 1686234060
On rate limit exceeded:
HTTP/1.1 429 Too Many Requests
Retry-After: 45
Content-Type: application/json
{"error": "rate_limit_exceeded", "message": "Retry after 45 seconds"}
Step 6: Distributed Deployment
Single Redis: works for < 100K RPS. Single point of failure.
Redis Cluster (sharded): partition user keys across nodes.
- Key: rate:{user_id} -> shard by hash(user_id)
- Each shard handles subset of users
- Failure of one shard affects only that partition
Redis Sentinel: master-replica with automatic failover.
- Write to master, read replicas for read-heavy workloads
- Acceptable: minor over-counting during failover
Consistent hashing for user assignment:
- Adds/removes Redis nodes without full re-sharding
- ~k/n keys reassigned when adding node (k = total keys, n = node count)
Step 7: Tradeoffs
| Algorithm | Accuracy | Memory | Latency | Burst handling |
|---|---|---|---|---|
| Fixed Window | Low (boundary burst) | O(1) | O(1) | No |
| Sliding Window Log | Exact | O(limit * users) | O(log n) | No |
| Sliding Window Counter | ~99% accurate | O(1) | O(1) | No |
| Token Bucket | Exact | O(1) | O(1) | Yes |
| Leaky Bucket | Exact | O(1) | O(1) | No (smooths) |
Production recommendation: Sliding Window Counter for most APIs. Token Bucket when burst allowance is a product requirement.
Common Follow-up Questions
Q: How do you handle Redis failure? Use circuit breaker: if Redis is down, allow requests with a fallback local in-memory counter. Accept minor over-counting vs a complete outage.
Q: How do you rate limit across multiple data centers? Each DC maintains its own counters. Accept eventual consistency: a user in DC1 might temporarily exceed global limit by the amount in DC2. True global consistency requires cross-DC synchronization which adds latency. Most products accept this tradeoff.
Q: What if the same user hits different servers? All servers connect to the same Redis cluster. The rate limiter state is centralized in Redis, not in the application server.
Q: How do you keep the rate-limit check under 5ms? Co-locate the Redis cluster in the same availability zone as the API gateway so the network round trip stays under 1ms. Use a single pipelined Redis call (or one Lua script) per check rather than multiple sequential calls. Keep connection pools warm so no request pays connection-setup cost. The dominant cost is the network hop, so minimizing round trips is the whole game.
Q: How do you support different limits for different endpoints and tiers? Store the limit configuration in a config store (Redis hash or a config service) keyed by (tier, resource). The rate limiter reads the applicable limit at check time and caches it locally with a short TTL. Because limits change rarely and reads dominate, a 30-second local cache of the config eliminates a config lookup on the hot path while still picking up changes quickly.
Q: How do you prevent a thundering herd when many users reset at the same window boundary? Fixed windows reset all users simultaneously, which can spike traffic at the boundary. The sliding window counter avoids this because each user's weighting decays continuously rather than snapping to zero at a shared boundary. If you must use fixed windows, add a small random jitter to each user's window start so resets spread out over time.
Q: Where do you return the Retry-After value from? For a fixed or sliding window, Retry-After is the seconds remaining until the current window expires. For a token bucket, it is the time needed to refill one token: 1 divided by the refill rate. Returning an accurate Retry-After lets well-behaved clients back off precisely instead of hammering the endpoint.
Related Articles
Methodology applied to this articlelast verified 8 Jun 2026
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Explore this topic cluster
More resources in Uncategorized
Use the category hub to browse similar questions, exam patterns, salary guides, and preparation resources related to this topic.
Paid contributor programme
Sat this this year? Share your story, earn ₹500.
First-person experience reports help future candidates prep smarter. We pay verified contributors ₹500 via UPI per accepted story - with byline.
Submit your story →Ready to practice?
Take a free timed mock test
Put what you learned into practice. Our mock tests match the 2026 pattern with timer, navigator, reveal, and score breakdown. No signup.
Start Free Mock Test →More from PapersAdda
Microsoft Interview Pattern Bank 2026: LRU Cache, OneDrive & AA Round
Accenture Gen AI Placement Papers 2026, Full Guide
Accenture vs Cognizant Fresher: Full Comparison 2026
Amazon Interview Process 2026: Full Loop + Bar Raiser