Every API needs rate limiting. The business case is obvious: prevent abuse, protect downstream services, enforce fair usage. The implementation sounds simple: count requests, reject the ones over the limit. A junior engineer could write the first version in an afternoon.
Then you put it in production. Traffic spikes at window boundaries. Distributed nodes disagree on counts. The Redis cluster goes down and you have to decide between blocking all traffic and letting everything through. Users complain that they hit limits they don't understand. Your mobile app retries on 429 and makes the problem worse. The "simple" rate limiter turns into a system with more edge cases than the feature it was protecting.
I've built rate limiting into APIs at several companies. Each time I've underestimated it. Here's what I've learned.
The Fixed Window Problem
The first rate limiter most people reach for is the fixed window counter. Divide time into windows, say, one minute each. Count requests per client per window. If the count exceeds the limit, reject the request. Reset the counter at the start of each new window.
It's dead simple to implement with Redis:
String key = "ratelimit:" + userId + ":" + (timestamp / 60);
long count = redis.incr(key);
redis.expire(key, 120); // 2 windows for safety
if (count > limit) {
    throw new RateLimitExceededException();
}
This works fine for most traffic. The problem is the boundary spike.
Imagine your limit is 100 requests per minute. A client sends 100 requests in the last second of minute one. Counter resets. They send 100 more requests in the first second of minute two. You've just served 200 requests in two seconds to a client you thought you were limiting to 100 per minute. The fixed window is correct at the window level but wrong at the actual time scale that matters.
For many use cases, this is acceptable. If you're protecting a backend service from sustained overload, a brief 2x burst at window boundaries won't matter. But if you're enforcing pricing tiers or preventing abuse that happens in short bursts, fixed windows can be gamed trivially. Any client that knows your window boundaries can double their effective rate limit with minimal effort.
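To make the boundary spike concrete, here is a minimal in-memory sketch (no Redis) that replays the scenario above. The class name `FixedWindowDemo` and the `burst` helper are hypothetical, made up for this illustration, and timestamps are passed in explicitly so the result doesn't depend on the wall clock.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory fixed window counter for a single client (illustration only). */
final class FixedWindowDemo {
    private final int limit;
    private final long windowMillis;
    // Old windows are never cleaned up here; in Redis, EXPIRE does that job.
    private final Map<Long, Integer> counts = new HashMap<>();

    FixedWindowDemo(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Increments the counter for the window containing nowMillis, then checks it. */
    boolean allow(long nowMillis) {
        long window = nowMillis / windowMillis;
        int count = counts.merge(window, 1, Integer::sum);
        return count <= limit;
    }

    /** Spreads `requests` calls over one second from startMillis; returns how many passed. */
    static int burst(FixedWindowDemo limiter, long startMillis, int requests) {
        int allowed = 0;
        for (int i = 0; i < requests; i++) {
            long t = startMillis + (i * 1000L) / requests;
            if (limiter.allow(t)) {
                allowed++;
            }
        }
        return allowed;
    }
}
```

With a limit of 100 per 60-second window, a burst starting at t=59s and another starting at t=60s are both fully allowed: 200 requests inside two seconds. A third burst starting at t=61s is rejected entirely, because the second window's counter is already exhausted.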
Sliding Window Log
The theoretically correct approach is the sliding window log. Instead of counting in buckets, you store a timestamp for every request. To check the rate, you look at how many requests happened in the last minute (or whatever your window is) from the current moment.
long now = System.currentTimeMillis();
long windowStart = now - 60_000;
redis.zremrangeByScore(key, 0, windowStart); // remove old entries
long count = redis.zcard(key);
if (count >= limit) {
    throw new RateLimitExceededException();
}
redis.zadd(key, now, UUID.randomUUID().toString());
redis.expire(key, 90);
This is accurate. There's no boundary spike because the window moves with every request. You're always looking at exactly the last 60 seconds.
The cost is memory. If your limit is 1000 requests per minute and you have 100,000 active clients, you're storing up to 100 million timestamp entries. For high-volume APIs, this is prohibitive. You're paying memory proportional to the number of requests allowed, per client, per window. The more generous your limits, the more memory you consume.
Sliding window log is great for low-volume APIs where accuracy matters: OAuth token endpoints, password reset flows, payment initiations. For high-traffic APIs, you need something leaner.
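The memory cost is easier to see in a small in-memory version of the log. This is a sketch, not the Redis implementation: `SlidingWindowLog` and `storedEntries` are hypothetical names, it tracks a single client, and timestamps are passed in so the behavior is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory sliding window log for one client (illustration only). */
final class SlidingWindowLog {
    private final int limit;
    private final long windowMillis;
    // One entry per accepted request: this is where the memory goes.
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowLog(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean allow(long nowMillis) {
        long windowStart = nowMillis - windowMillis;
        // Drop entries that have fallen out of the trailing window.
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= windowStart) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= limit) {
            return false;
        }
        timestamps.addLast(nowMillis);
        return true;
    }

    /** Number of timestamps currently held for this client. */
    synchronized int storedEntries() {
        return timestamps.size();
    }
}
```

Replaying the boundary scenario against this version, the second burst is rejected because the first 100 timestamps are still inside the trailing 60 seconds. The price is visible in `storedEntries()`: a client at its limit holds one entry per allowed request.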
Sliding Window Counter
The sliding window counter is the practical middle ground. It uses two fixed window buckets, current and previous, and estimates the sliding window count using a weighted average.
The idea: if you're 30% of the way through the current minute window, approximate the sliding window count as 70% of the previous window's count plus 100% of the current window's count. You're estimating what a true sliding window would show, using only two counters per client.
long now = System.currentTimeMillis();
long currentWindow = now / 60_000;
long previousWindow = currentWindow - 1;
double percentIntoCurrentWindow = (now % 60_000) / 60_000.0;
long currentCount = getCount(userId, currentWindow);
long previousCount = getCount(userId, previousWindow);
double estimate = previousCount * (1 - percentIntoCurrentWindow) + currentCount;
if (estimate >= limit) {
    throw new RateLimitExceededException();
}
The error rate on this estimate is small enough to be acceptable for most use cases. Cloudflare published an analysis of this approach in which only about 0.003% of requests were wrongly allowed or wrongly rate limited compared to a perfect sliding window. That's close enough.
This gives you near-correct rate limiting at fixed-window memory cost. Two counters per client instead of one timestamp per request. It's what I use for most API rate limiting.
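Here is a worked version of the weighted estimate as a self-contained, in-memory sketch. The class `SlidingWindowCounter` and its method names are hypothetical, it tracks one client, and it also increments the current bucket when a request is allowed, which a real limiter needs to do somewhere.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory sliding window counter for one client (illustration only). */
final class SlidingWindowCounter {
    private final long limit;
    private final long windowMillis;
    private final Map<Long, Long> counts = new HashMap<>();

    SlidingWindowCounter(long limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Weighted estimate of requests in the trailing window ending at nowMillis. */
    synchronized double estimate(long nowMillis) {
        long current = nowMillis / windowMillis;
        double fractionIntoCurrent = (nowMillis % windowMillis) / (double) windowMillis;
        long currentCount = counts.getOrDefault(current, 0L);
        long previousCount = counts.getOrDefault(current - 1, 0L);
        return previousCount * (1.0 - fractionIntoCurrent) + currentCount;
    }

    synchronized boolean allow(long nowMillis) {
        if (estimate(nowMillis) >= limit) {
            return false;
        }
        long current = nowMillis / windowMillis;
        counts.merge(current, 1L, Long::sum);
        // Only the current and previous buckets are ever needed.
        counts.keySet().removeIf(w -> w < current - 1);
        return true;
    }
}
```

With 80 requests in the previous window and 20 in the current one, the estimate 30% of the way into the current window is 80 × 0.7 + 20, roughly 76, so a limit of 100 still has headroom. Right at the start of the current window the previous bucket counts in full, which is why only 20 more requests fit there.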
Token Bucket
Token bucket is the other algorithm you'll hear about, and it's conceptually different from window-based approaches. Instead of counting requests in a time window, you maintain a bucket of tokens. Each request consumes a token. Tokens are replenished at a fixed rate. If the bucket is empty, the request is rejected.
The bucket has a maximum capacity, which controls burst behavior. If a client has been idle, their bucket fills up and they can make a burst of requests up to the bucket size. Then they're rate-limited to the refill rate until they become idle again.
This is how most users intuitively think rate limiting should work. "I've been patient, I should be able to make more requests now." Token bucket rewards clients that space out their requests. Fixed window treats someone who makes 100 requests in the last second of a window the same as someone who made one request every 600 milliseconds throughout.
Token bucket is good when you have clients with bursty but legitimate traffic patterns. A data export job that wakes up and sends a batch of requests, then goes quiet for an hour, shouldn't be penalized as harshly as a client that hammers your API continuously at maximum rate. The burst capacity in token bucket allows for this.
The implementation is a bit trickier to get right in a distributed system because you need to track the token count and the last refill time atomically. Without atomicity, two concurrent requests can both see a non-empty bucket and both consume a token that doesn't exist.
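A minimal single-process sketch of the idea, assuming one bucket per client and timestamps passed in by the caller. The class name `TokenBucket` and its parameters are illustrative, and `synchronized` stands in for the atomicity you would need a Lua script or similar for in a shared store.

```java
/** Single-client token bucket; refill is computed lazily on each call. */
final class TokenBucket {
    private final double capacity;
    private final double refillPerMilli;
    private double tokens;
    private long lastRefillMillis;

    TokenBucket(double capacity, double refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMilli = refillPerSecond / 1000.0;
        this.tokens = capacity;          // start full, so an idle client can burst
        this.lastRefillMillis = nowMillis;
    }

    /** Refill and consume happen under one lock so two callers can't spend the same token. */
    synchronized boolean tryConsume(long nowMillis) {
        long elapsed = Math.max(0, nowMillis - lastRefillMillis);
        tokens = Math.min(capacity, tokens + elapsed * refillPerMilli);
        lastRefillMillis = Math.max(lastRefillMillis, nowMillis);
        if (tokens < 1.0) {
            return false;
        }
        tokens -= 1.0;
        return true;
    }
}
```

With a capacity of 10 and a refill rate of one token per second, a client can burst 10 requests, is then held to roughly one request per second, and gets the full burst back after sitting idle long enough for the bucket to refill.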
The Real Problem: Distribution
Single-instance rate limiting is simple. Any of the algorithms above work fine in a single process with in-memory state. The complexity explodes when you have multiple instances of your service, which is almost always.
If you have four instances of your API and each one maintains its own rate limit counters, a client can make four times their intended limit by distributing requests across all four instances. Your rate limiter isn't limiting anything; it's just making accounting errors.
The standard fix is centralized state; Redis is the canonical choice. All instances share the same counters. But now you've added a network hop to every request. A Redis call that takes 1ms on average adds 1ms of latency to every single request, and at 10,000 requests per second it also means 10,000 Redis round trips per second that the cluster has to absorb.
You also have a consistency problem. INCR on its own is atomic, so the fixed window snippet I showed earlier can't lose an increment, but any pattern that reads a count and then writes in a separate command is not atomic. In the sliding window log, two concurrent requests can both read a count of 99 from ZCARD, both decide they're under the limit of 100, both ZADD, and you end up at 101. Even the fixed window version issues INCR and EXPIRE as two separate commands. Redis Lua scripts solve this by making the whole sequence atomic on the Redis side, but you need to be deliberate about using them.
-- Atomic check-and-increment in Lua
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if current > tonumber(ARGV[2]) then
    return 0
end
return 1
Even with atomic operations, you have the problem of Redis latency under load. When your Redis cluster is the bottleneck, every API request waits. Rate limiting, a protection mechanism, has become a performance bottleneck.
One approach is local counting with periodic synchronization. Each instance keeps its own counter and periodically syncs with the central store. This reduces Redis calls but introduces a time lag: an instance might think a client has 50 remaining requests when the actual global count is 95. You're trading accuracy for performance. For most APIs, a small accuracy window is acceptable. For financial APIs or security-sensitive endpoints, it isn't.
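A rough sketch of local counting with batched synchronization, to show where the lag comes from. Everything here is an assumption for illustration: the `CentralCounter` interface is a hypothetical stand-in for the shared store, `BatchedLimiter` flushes every `batchSize` requests rather than on a timer, and window resets are left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the shared store (for example, a Redis INCRBY). */
interface CentralCounter {
    /** Adds delta to the shared count for key and returns the new global total. */
    long addAndGet(String key, long delta);
}

/** Counts locally and only talks to the central store once per batch. */
final class BatchedLimiter {
    private final CentralCounter central;
    private final long limit;
    private final int batchSize;
    private final Map<String, Long> pending = new HashMap<>();
    private final Map<String, Long> lastKnownGlobal = new HashMap<>();

    BatchedLimiter(CentralCounter central, long limit, int batchSize) {
        this.central = central;
        this.limit = limit;
        this.batchSize = batchSize;
    }

    synchronized boolean allow(String key) {
        long local = pending.getOrDefault(key, 0L);
        long known = lastKnownGlobal.getOrDefault(key, 0L);
        // Decision uses a possibly stale global count plus unsynced local requests.
        if (known + local >= limit) {
            return false;
        }
        local++;
        if (local >= batchSize) {
            // One round trip per batch instead of one per request.
            lastKnownGlobal.put(key, central.addAndGet(key, local));
            local = 0;
        }
        pending.put(key, local);
        return true;
    }
}
```

The overshoot is bounded by what each instance can accept before its next sync. With a limit of 20, a batch size of 10, and two instances, an interleaving where one instance is sitting on 9 unsynced requests lets 30 requests through before both instances agree the client is over the limit.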
Another approach is sticky routing: route all requests from a given client to the same instance. Now each instance can do accurate local rate limiting for its clients. The problem is that sticky routing creates uneven load distribution. If your top 10 clients by traffic all happen to hash to the same two instances, those instances are overloaded while others sit idle. You've solved the rate limiting accuracy problem and created a load balancing problem.
There's no perfect answer. The tradeoffs are real and the right choice depends on how much accuracy you need versus how much latency you can afford.
What Happens When Redis Is Down
This is the question most rate limiting implementations don't answer clearly, and it matters more than engineers expect.
You have two choices when your rate limit store is unavailable: fail open (allow all requests through) or fail closed (reject all requests). Each is wrong in a different way.
Fail open means an attacker who knows your architecture can knock out your Redis cluster and then flood your API without rate limiting. For any API that's a meaningful abuse target, that kill switch is a real threat.
Fail closed means a Redis outage takes down your entire API. A cache layer problem becomes a full service outage. If Redis is flaky and intermittently unreachable, your API becomes intermittently unavailable. Users see 429 responses during a Redis blip even if they haven't hit any actual limits.
The pragmatic answer is fail open with alerting and compensating controls. If Redis is unavailable, log aggressively, alert immediately, and let requests through. Simultaneously, have circuit breakers on your downstream services so that a flood of requests can't cascade into database overload. Rate limiting is one layer of defense; it shouldn't be the only one.
Some teams use local fallback counters when the central store is unavailable. Each instance does its best with local state. This gives you partial rate limiting during a Redis outage, not perfect, but better than nothing and better than a full service outage.
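One way to structure fail open with a local fallback, sketched under assumptions: `RateLimitStore` is a hypothetical interface over the central store that throws when the store is unreachable, the fallback is any per-instance check, and the failure counter is a placeholder for real logging and alerting.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

/** Hypothetical wrapper around the central store; throws if it is unreachable. */
interface RateLimitStore {
    boolean tryAcquire(String clientKey) throws Exception;
}

/** Uses the central store when it is healthy, a local decision when it is not. */
final class FailOpenLimiter {
    private final RateLimitStore central;
    private final Predicate<String> localFallback;
    private final AtomicLong storeFailures = new AtomicLong();

    FailOpenLimiter(RateLimitStore central, Predicate<String> localFallback) {
        this.central = central;
        this.localFallback = localFallback;
    }

    boolean allow(String clientKey) {
        try {
            return central.tryAcquire(clientKey);
        } catch (Exception e) {
            // Placeholder for "log aggressively, alert immediately".
            storeFailures.incrementAndGet();
            // Pure fail open is `key -> true`; a local counter gives partial limiting.
            return localFallback.test(clientKey);
        }
    }

    long storeFailures() {
        return storeFailures.get();
    }
}
```

Keeping the fallback as a parameter makes the failure policy an explicit choice per endpoint: a lenient local counter for most traffic, `key -> true` where availability matters most, or `key -> false` for the few endpoints that should fail closed.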
Different Axes of Limiting
Most discussions of rate limiting treat it as a single dimension: requests per time unit per client. In practice, you often need multiple overlapping limits.
Per-API-key limits enforce pricing tiers. A free tier gets 100 requests per minute. A paid tier gets 10,000. These are coarse limits tied to account standing.
Then there are per-endpoint limits, which protect specific resources. Your search endpoint might be expensive and limited to 10 requests per second per client, while your lightweight status endpoint allows 1000. Global per-key limits don't protect expensive endpoints from targeted abuse.
IP-based limits catch credential stuffing and scraping that rotates through API keys but comes from a small set of IPs. This is separate from per-key limiting. A legitimate client with one key might make requests from multiple IPs (a large corporate network, or clients in multiple regions), so you can't just replace key-based limiting with IP-based limiting.
Global limits on total traffic protect your infrastructure from DDoS regardless of who's sending it. If your total incoming request rate exceeds what your backend can handle, you need to shed load even from legitimate clients.
Managing multiple dimensions of rate limiting multiplies your state management complexity. You're not tracking one counter per client; you're tracking counters per client per endpoint, per IP, and globally. Redis handles this with namespaced keys, but your logic for checking and combining these limits needs to be explicit. "Which limit applies here?" becomes a non-trivial question when a request could be blocked by four different limits for four different reasons.
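A small sketch of what "explicit" can look like: build one namespaced key per dimension, check them in a fixed order, and report which limit blocked the request. The key layout and the names `LayeredLimits`, `keysFor`, and `firstBlocked` are assumptions for illustration, and the `isOverLimit` predicate stands in for whatever counter lookup you actually use. The sketch only checks; it does not increment anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

final class LayeredLimits {
    /** One namespaced counter key per dimension, in the order they are checked. */
    static Map<String, String> keysFor(String apiKey, String endpoint, String ip) {
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("global", "ratelimit:global");
        keys.put("ip", "ratelimit:ip:" + ip);
        keys.put("key", "ratelimit:key:" + apiKey);
        keys.put("endpoint", "ratelimit:key:" + apiKey + ":endpoint:" + endpoint);
        return keys;
    }

    /** Returns the name of the first exhausted limit, or empty if the request may proceed. */
    static Optional<String> firstBlocked(Map<String, String> keys,
                                         Predicate<String> isOverLimit) {
        for (Map.Entry<String, String> entry : keys.entrySet()) {
            if (isOverLimit.test(entry.getValue())) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

Returning the name of the blocking limit, not just a boolean, pays off later: it tells you which headers to send back and makes "why was this request rejected?" answerable from logs.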
Communicating Limits to Clients
Rate limiting that clients can't observe is just mysterious failure. The standard is to include headers in every response, not just rate-limited ones:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 743
X-RateLimit-Reset: 1712345678
X-RateLimit-Reset deserves attention. It can be an absolute timestamp (epoch seconds when the window resets) or a relative value (seconds until reset). Absolute timestamps are unambiguous regardless of clock skew between client and server. Relative values are easier for clients to use directly. Pick one and be consistent.
When you do return a 429, include a Retry-After header. Without it, well-intentioned clients will retry immediately, which makes your rate limiting problem worse. A client that respects Retry-After and backs off is a cooperative client. Make it easy to be cooperative.
The worst thing a rate-limited client can do is retry immediately in a tight loop. The second worst thing is to not retry at all and surface an error to the user. The behavior you want is backoff with jitter that honors Retry-After. Design your rate limit responses to guide clients toward that: wait this long, then try again.
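A sketch of building these headers on the server side, assuming the absolute-timestamp convention for X-RateLimit-Reset and a Retry-After expressed in seconds. The helper `RateLimitHeaders.build` and its parameters are hypothetical; how you attach the map to a response depends on your framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RateLimitHeaders {
    /**
     * Headers for one response. `rejected` is true when the request is being
     * answered with a 429, which is the only case that gets Retry-After.
     */
    static Map<String, String> build(long limit, long used, long resetEpochSeconds,
                                     long nowEpochSeconds, boolean rejected) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-RateLimit-Limit", Long.toString(limit));
        headers.put("X-RateLimit-Remaining", Long.toString(Math.max(0, limit - used)));
        headers.put("X-RateLimit-Reset", Long.toString(resetEpochSeconds));
        if (rejected) {
            // Never tell a client to retry in zero seconds.
            long waitSeconds = Math.max(1, resetEpochSeconds - nowEpochSeconds);
            headers.put("Retry-After", Long.toString(waitSeconds));
        }
        return headers;
    }
}
```

For a client that has used 257 of 1000 requests, this produces the three X-RateLimit-* headers with 743 remaining. For a rejected request 28 seconds before the reset, it adds `Retry-After: 28`.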
What I Actually Do
For a new API, I start with sliding window counter in Redis for per-key limits, with a Lua script for atomicity. I add per-endpoint limits for any endpoint that costs at least 3x the average to serve. I implement X-RateLimit-* headers from day one because retrofitting them later is annoying and clients come to rely on the absence of headers as a signal.
I fail open on Redis unavailability with aggressive alerting and a circuit breaker on downstream services as a backstop.
Rate limit configuration lives as data, not code. Limits sit in a config file or database, not hardcoded in middleware. When a client needs a higher limit, I update a config entry, not a deployment. When I need to tighten limits during an incident, I can do it without a code change.
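As a sketch of limits-as-data, the lookup below reads per-tier limits from properties text. The `limit.<tier>` key format and the `LimitConfig` class are assumptions for illustration; in practice the text would come from a file or database row that can be reloaded without a deploy.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Per-tier limits loaded from data rather than hardcoded in middleware. */
final class LimitConfig {
    private final Properties props = new Properties();
    private final int defaultLimit;

    LimitConfig(String propertiesText, int defaultLimit) {
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.defaultLimit = defaultLimit;
    }

    /** Requests per minute for the tier, or the default if the tier isn't configured. */
    int limitFor(String tier) {
        String raw = props.getProperty("limit." + tier);
        return raw == null ? defaultLimit : Integer.parseInt(raw.trim());
    }
}
```

Raising a client's limit then means changing one line of data, such as `limit.paid=10000`, and tightening limits during an incident is the same kind of change.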
And I test the rate limiter explicitly. Not just that it blocks at the limit, but that it handles the window boundary correctly, that concurrent requests don't exceed limits, that it fails open gracefully, and that the headers are accurate. Rate limiters have a way of working perfectly in dev and misbehaving in production at the worst possible moment.
Rate limiting sounds like a solved problem you can bolt on at the end. It isn't. The algorithm choice, the consistency model, the failure behavior, the client communication: each of these has real consequences for the reliability and security of your API. Treat it like the infrastructure it is, not the afterthought it looks like.