
The Thundering Herd Problem

Cache stampedes, retry storms, reconnect floods: three failure modes with the same root cause. Synchronized behavior under load amplifies failures every time.

The cache went down for thirty seconds during a routine deployment. It came back up fine. Then the database fell over.

What happened was simple in hindsight. We had cached several thousand keys, all set with the same TTL. When the cache restarted, every key was gone. Every request that came in during those thirty seconds, and every request that arrived right after the restart, missed and went to the database. Ten thousand requests hit the database simultaneously. The database could not keep up. Response times climbed. The application started timing out before it could populate the cache. Cache stayed empty. Database stayed overwhelmed. We had traded a thirty-second blip for a fifteen-minute outage.

This is the thundering herd problem. Not a single bug. A class of failure where synchronized behavior under stress turns a small problem into a large one. I have seen it in three different forms, and each time the fix was the same: break the synchronization.

Cache Stampede

The classic version is what happened to us. Many cached entries expire at the same time, either because a restart wiped them all or because you set identical TTLs when you loaded them. One request comes in, misses the cache, goes to the database. Then another. Then a hundred more. They all miss. They all go to the database. The database gets a spike it was not designed to handle.

The first fix people reach for is a lock. Only one request fetches from the database. The rest wait. Sounds right until you think about what happens when the request holding the lock is slow, or fails, or the lock expires before the fetch completes. Now you have a queue of waiting requests and a missing cache entry. Locking adds complexity without fully solving the problem.

A better fix is TTL jitter. When you cache something, do not use a fixed TTL. Add a random offset. Instead of caching everything for exactly 300 seconds, cache it for 270 to 330 seconds. Expirations spread across a window instead of concentrating at a single moment. No single instant triggers a mass miss.
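
A minimal sketch of that idea in Python, using the 300-second TTL and the 270 to 330 window from the paragraph above. The function name jittered_ttl and the cache.set call in the comment are hypothetical, not taken from any particular cache client.

```python
import random

BASE_TTL_SECONDS = 300  # the fixed TTL from the example above
JITTER_SECONDS = 30     # +/- 30 seconds gives the 270-330 window


def jittered_ttl(base=BASE_TTL_SECONDS, jitter=JITTER_SECONDS, rng=random):
    """Return a TTL drawn uniformly from [base - jitter, base + jitter]."""
    return rng.uniform(base - jitter, base + jitter)


# Hypothetical usage with any cache client that accepts a per-key TTL:
#   cache.set(key, value, ttl=jittered_ttl())
```

The same call works in a bulk loader: compute a fresh TTL per key inside the loop rather than once outside it.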

There is also probabilistic early expiration, sometimes called XFetch. Before a key expires, requests start probabilistically deciding to refresh it early. The closer the key is to expiring, the more likely a request is to refresh it. Under normal load, the key gets refreshed before it ever expires, so there is no mass miss. It is harder to implement, but it nearly eliminates the stampede.
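
Here is a rough in-memory sketch of probabilistic early expiration, assuming the usual XFetch formulation: refresh when now - delta * beta * ln(rand) reaches the expiry time, where delta is how long the last recompute took. The class name XFetchCache, the beta default, and the injectable clock and rng parameters are my own choices for illustration. A real deployment would store delta and the expiry alongside the value in the shared cache.

```python
import math
import random
import time


class XFetchCache:
    """In-memory sketch of probabilistic early expiration."""

    def __init__(self, beta=1.0, clock=time.monotonic, rng=random.random):
        self._store = {}   # key -> (value, delta, expiry)
        self._beta = beta  # values above 1.0 refresh earlier, below 1.0 later
        self._clock = clock
        self._rng = rng

    def get(self, key, recompute, ttl):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None:
            value, delta, expiry = entry
            # 1 - rng() lies in (0, 1], so log() never sees zero.
            # The result is a non-negative head start that grows with delta.
            early = -delta * self._beta * math.log(1.0 - self._rng())
            if now + early < expiry:
                return value
        # Missing, expired, or this request volunteered to refresh early.
        start = self._clock()
        value = recompute()
        delta = self._clock() - start  # how long the recompute took
        self._store[key] = (value, delta, self._clock() + ttl)
        return value
```

Because the head start scales with delta, slow-to-compute keys start refreshing earlier than cheap ones, which is the behavior you want.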

The root cause in our case was that we loaded the cache at startup with a script that set every key to the same TTL. A one-line change to add jitter would have prevented the whole incident.

Retry Storm

Retry logic is supposed to make systems more resilient. If a request fails, try again. This is correct. The problem is naive retry logic, where every client retries at the same fixed interval.

Imagine a service goes down for twenty seconds. A hundred clients are calling it. They all get errors at the same moment. They all wait two seconds and retry, then wait two more, in lockstep. When the service comes back up, the next retry arrives from all hundred clients at once. The service is still recovering from whatever caused the outage, and now it gets hit with a synchronized wave of traffic. It falls over again.

This is a retry storm. You added retry logic to improve resilience and ended up making outages worse. The clients are behaving correctly individually. Collectively they are synchronized in a way that amplifies the problem.

Exponential backoff solves the synchronization partially. Instead of retrying after a fixed two seconds, clients retry after 2 seconds, then 4, then 8, doubling each time up to some cap. This spreads retries over a longer window, which helps.
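
The schedule is easy to compute. A small sketch, assuming a 60-second cap (the cap value and the function name backoff_delays are mine, not from any library):

```python
def backoff_delays(base=2.0, cap=60.0, attempts=6):
    """Return the wait before each retry: base, 2*base, 4*base, ... capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


print(backoff_delays())  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```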

But exponential backoff alone is not enough, because many clients will have started their retry sequences at roughly the same time. Even with doubling, they stay synchronized. They will all hit the 4-second mark at roughly the same moment.

Jitter is what breaks the synchronization. Instead of waiting exactly 4 seconds, wait a random amount between 0 and 4 seconds, anywhere from zero up to the full backoff duration. The randomness scatters the retries across the window so the recovering service sees a steady trickle instead of a synchronized wave.
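
A minimal retry wrapper along those lines might look like this. It is a sketch, not a drop-in library: the name retry_with_full_jitter, the defaults, and the injectable sleep and rng parameters are assumptions for illustration. Real code should catch only the errors that are actually retryable.

```python
import random
import time


def retry_with_full_jitter(operation, max_attempts=5, base=2.0, cap=60.0,
                           sleep=time.sleep, rng=random.uniform):
    """Call operation(); on failure wait uniform(0, backoff) and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # sketch only: narrow this to retryable errors
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            backoff = min(cap, base * 2 ** attempt)  # 2, 4, 8, ... capped
            sleep(rng(0, backoff))                   # anywhere in [0, backoff]
```

Two clients that fail at the same instant now pick different wait times on every attempt, so they drift apart instead of marching together.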

Full jitter, where you choose randomly between zero and the computed backoff duration, is the simplest strategy and a good default. The AWS Architecture Blog published a detailed analysis of full, equal, and decorrelated jitter years ago. In that analysis, full jitter did less total work than equal jitter and was roughly on par with decorrelated jitter, while backoff with no jitter was the clear loser. The conclusion is simple: if you are not adding jitter, your retry logic is incomplete.
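
For reference, the three strategies differ only in how the sleep is computed. The sketch below follows the formulas as I understand them from that AWS post; the function names and the rng parameter are mine.

```python
import random


def full_jitter(base, cap, attempt, rng=random.uniform):
    """Uniform between 0 and the capped exponential backoff."""
    return rng(0, min(cap, base * 2 ** attempt))


def equal_jitter(base, cap, attempt, rng=random.uniform):
    """Half the backoff is fixed, the other half is random."""
    half = min(cap, base * 2 ** attempt) / 2
    return half + rng(0, half)


def decorrelated_jitter(base, cap, previous_sleep, rng=random.uniform):
    """Next sleep depends on the previous sleep, not on the attempt count."""
    return min(cap, rng(base, previous_sleep * 3))
```

Equal jitter never waits less than half the backoff, which is why its retries stay more bunched than full jitter's.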

Reconnect Flood

The third variant happens at the connection layer. Your database restarts, either due to a failure or a planned maintenance window. Every application server has a connection pool. Every connection in every pool breaks simultaneously. Every pool immediately tries to reconnect.

The database is still booting. It can accept a limited number of connections during startup. It gets hit with thousands of reconnect attempts at once. The connection handshake overhead is real. The database is overwhelmed before it finishes initializing. It either crashes again or becomes so slow that it might as well have.

Connection pools like HikariCP have reconnect backoff built in, but it is often not configured. The default behavior in many pools is to try aggressively and fast. Under normal conditions this is fine. Under the conditions where your database just restarted, it is the wrong behavior.

The fix is the same: backoff with jitter at the connection pool level. HikariCP's connectionTimeout and initializationFailTimeout settings give you some control, but the more important thing is making sure your pools are not all configured to reconnect in tight lockstep. If all your application servers have identical pool configuration and they all lost their connections at the same moment, they will all attempt reconnection on the same schedule.

One underused approach is staggered restarts. If you are restarting a database in a planned window, bring the application servers back one at a time on a delay, so their connection pools are not all reconnecting simultaneously. This is basic operational hygiene that prevents the flood before it starts.
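
A staggered schedule can be as simple as one slot per server plus a little jitter. A sketch, assuming a 10-second spacing and hypothetical host names; the function staggered_start_offsets is illustrative and not part of any deployment tool.

```python
import random


def staggered_start_offsets(hosts, spacing_seconds=10.0, jitter_seconds=2.0,
                            rng=random.uniform):
    """Map each host to a start offset: one slot per host plus some jitter."""
    return {
        host: index * spacing_seconds + rng(0, jitter_seconds)
        for index, host in enumerate(hosts)
    }


# Hypothetical host names; each server waits its offset before reconnecting.
offsets = staggered_start_offsets(["app-1", "app-2", "app-3"])
```

The spacing should be long enough for one server's pool to finish connecting before the next one starts.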

The Pattern

Cache stampede, retry storm, reconnect flood. Three different layers of the stack, three different mechanisms, one root cause. Clients that are synchronized in their behavior create spikes that overwhelm the thing they are all trying to reach.

The fix is always the same. Introduce randomness to break the synchronization. TTL jitter breaks the cache stampede. Backoff jitter breaks the retry storm. Staggered reconnect breaks the connection flood. In each case, the individual client behavior is not wrong. The problem is that identical behavior from many clients simultaneously becomes destructive at scale.

When you are designing retry logic, or setting cache TTLs, or configuring connection pools, the question to ask is: what happens if a thousand of these do this at the same time? If the answer is bad, add jitter. It is a one-line change with outsized impact, and it is one of the few situations where the fix is genuinely simple once you understand the problem.
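
You can ask that question with a quick simulation before production asks it for you. A rough sketch, assuming 1,000 cache entries and the 300-second TTL from earlier; the helper peak_per_second and the fixed seed are mine.

```python
import random
from collections import Counter


def peak_per_second(event_times):
    """Largest number of events that land in the same one-second bucket."""
    return max(Counter(int(t) for t in event_times).values())


rng = random.Random(42)  # seeded so the run is repeatable
clients = 1000

fixed = [300.0 for _ in range(clients)]                         # identical TTLs
jittered = [rng.uniform(270.0, 330.0) for _ in range(clients)]  # +/- 30 seconds

print("fixed TTL peak:   ", peak_per_second(fixed))     # all 1000 in one second
print("jittered TTL peak:", peak_per_second(jittered))  # a small fraction of that
```

The total load is the same in both runs. Only the peak changes, and the peak is what knocks the database over.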

We added TTL jitter after the incident. We also reviewed our retry logic and found two services using fixed intervals with no jitter. Both got fixed the same week. In the year since, we have not had a cascade failure from synchronized behavior. That thirty-second cache blip taught us more about distributed systems than months of reading about them.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
