The 2am spike that was actually eviction
The page came in at 2am: p99 latency on the product API had tripled. Nothing had deployed, traffic was normal, the database looked bored. The one odd signal was our Redis cache hit rate, which had fallen off a cliff from its usual 95 percent to something like 60. We had a cache, and it had quietly stopped caching.
The misses were not ordinary cold-cache misses. They were evictions. Redis had reached its memory limit and was throwing out keys to make room for new ones, and because we never told it which keys to protect, it was evicting things we were about to ask for again. Every evicted key became a cache miss, every miss became a database query, and the database we thought was bored was carrying load it had not seen in months. The cache was full, and full meant broken.
Redis is state in finite RAM
The mental model that got us there is the common one: Redis is the fast thing you put in front of the slow thing. Set a key, get a key, watch the latency drop. It feels less like a database and more like a speed dial you turn up when something is slow.
That model leaves out the only number that matters. Redis keeps everything in RAM, and RAM is finite and small next to disk. Every key you set occupies bytes that do not come back until the key is deleted or expires. A Postgres table can grow past memory, spill to disk, and keep working at a crawl. Redis has no such mercy. When it fills, it does not get slow. It starts making decisions about what to throw away, and you do not get a vote unless you asked for one in advance.
The keys nobody gave a TTL
Unbounded growth is almost never one big mistake. It is a thousand small SET calls that forgot the EX. A session blob written on login and never cleaned up. The per-user feature payload cached "for now" with no expiry. A computed result keyed by a request id that is unique every time, so the keyspace grows by one on every call, forever.
None of these hurt on the day you ship them, because the keyspace is small and the headroom is large. They hurt three months later, at a traffic peak, when the slow accumulation finally touches the ceiling. By then nobody connects the incident to the SET without an expiry that someone wrote in the spring, because the cause and the symptom are a quarter apart.
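One way to make the forgotten EX impossible is to route every cache write through a helper that refuses a key without a lifetime. This is a minimal sketch, assuming a redis-py style client whose set method accepts an ex argument in seconds; the cache_set name and the SupportsSet protocol are hypothetical, not part of any library.

```python
from typing import Optional, Protocol


class SupportsSet(Protocol):
    """Anything with a redis-py style set(name, value, ex=...) method."""

    def set(self, name: str, value: bytes, ex: Optional[int] = None): ...


def cache_set(client: SupportsSet, key: str, value: bytes, ttl_seconds: int) -> None:
    """Write a cache entry, refusing to create a key that can never expire."""
    if ttl_seconds is None or ttl_seconds <= 0:
        raise ValueError(f"refusing to cache {key!r} without a positive TTL")
    # EX sets the expiry in seconds, so the key dies on its own
    # even if nothing ever deletes it.
    client.set(key, value, ex=ttl_seconds)
```

With a real client this would be called as cache_set(client, "session:42", blob, ttl_seconds=3600). The point of the wrapper is that the TTL is a required argument, so leaving it out is a code-review-visible choice rather than a silent default.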
The eviction policy you did not choose
Redis does not crash the instant it fills, and that grace is its own kind of trap. Out of the box maxmemory is often unset, which means Redis grows until the operating system kills it, and an OOM kill is the worst way to learn your working set. So you set a limit. The moment you do, you have to answer a question most teams never answer on purpose: what should it throw away when it is full?
That answer is the eviction policy, and the options matter more than they look. noeviction rejects writes with errors once memory is full, which turns your cache into a wall. allkeys-lru drops the least recently used key from the whole keyspace (approximately, since Redis samples keys rather than tracking exact order). allkeys-lfu prefers the least frequently used, which usually fits a cache better because a burst of one-off reads cannot push out keys that are requested constantly. The volatile- family only considers keys that carry a TTL, and that is the sharp one: if most of your keys have no expiry, a volatile-lru policy has almost nothing it is allowed to evict, so it behaves like noeviction and fails writes while gigabytes of expiry-less keys sit untouched.
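The volatile- trap can be checked before it bites. The sketch below is a rough illustration, not a tool I am claiming Redis ships: it takes TTL command results for a sample of keys and reports what share a volatile- policy would be allowed to evict. The evictable_fraction name is hypothetical. It relies on the TTL command returning -1 for a key with no expiry and -2 for a key that does not exist.

```python
from typing import Sequence


def evictable_fraction(ttls: Sequence[int]) -> float:
    """Share of sampled keys that a volatile-* policy may evict.

    `ttls` holds results of the Redis TTL command for a sample of keys:
    -1 means the key exists with no expiry, -2 means the key is gone,
    and any value >= 0 is the remaining lifetime in seconds.
    """
    # Keys that vanished between sampling and the TTL call tell us nothing.
    existing = [t for t in ttls if t != -2]
    if not existing:
        return 0.0
    with_expiry = sum(1 for t in existing if t >= 0)
    return with_expiry / len(existing)


# Four live keys, only one of which carries a TTL.
sample = [-1, -1, -1, 120, -2]
print(evictable_fraction(sample))  # 0.25
```

With redis-py, a sample like this could be collected by calling ttl on keys yielded by scan_iter. A result near zero means a volatile- policy will act like noeviction on that instance.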
The day eviction took the key you needed
Once eviction is running, it is making product decisions on your behalf at the worst possible time. Under pressure it removes whatever the policy points at, and if your hot keys and your cold keys share one pool, some of what it drops is load-bearing. An evicted hot key is a guaranteed miss the next millisecond.
That is also how a memory ceiling turns into a thundering herd. A mass eviction during a spike means a wave of simultaneous misses, every one of them recomputing the same expensive thing against the same database at the same instant, which is the exact stampede I have written about before. The cache that was supposed to absorb the load becomes the trigger that focuses it.
TTLs are a capacity decision, not just a freshness one
Most people reach for a TTL to keep data fresh, to stop a cached value from outliving its truth. That is a real reason, and it hides the more important one. A TTL is how a key dies, and a key that cannot die is a slow memory leak with good intentions. Expiry is the mechanism that keeps your working set bounded, so it actually fits in the RAM you bought.
The math is not hard, which is why skipping it stings. Estimate the size of a typical entry, multiply by how many you expect alive at once, and that is your working set. It has to fit inside maxmemory with real headroom left over, because Redis itself needs room and fragmentation eats more than you expect. If the working set does not fit, no eviction policy saves you. You are just choosing which users get a cache and which do not.
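The arithmetic above fits in a few lines. This is a back-of-the-envelope sketch with hypothetical function names, and the 25 percent headroom default is an assumption of mine for illustration, not a Redis recommendation; pick a figure that matches the overhead and fragmentation you actually observe.

```python
def working_set_bytes(avg_entry_bytes: int, live_entries: int) -> int:
    """Estimated bytes held by all entries expected to be alive at once."""
    return avg_entry_bytes * live_entries


def fits(working_set: int, maxmemory: int, headroom: float = 0.25) -> bool:
    """True if the working set fits while leaving `headroom` of maxmemory free.

    The headroom default is an illustrative assumption.
    """
    budget = maxmemory * (1.0 - headroom)
    return working_set <= budget


GIB = 1024 ** 3
ws = working_set_bytes(avg_entry_bytes=2048, live_entries=1_500_000)
print(ws, fits(ws, maxmemory=4 * GIB))
```

With 2 KiB entries and 1.5 million live keys the estimate is 3,072,000,000 bytes, which fits under a 4 GiB maxmemory with 25 percent headroom (a budget of about 3.22 GB). At 2 million live keys the same instance no longer fits, and that is the point where an eviction policy starts choosing for you.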
The numbers that predict the cliff
This failure mode is loud in the metrics long before it is loud in production, as long as you watch the right ones. Your fuel gauge is the ratio of used_memory to the configured maxmemory. The evicted_keys counter is cumulative since the server started, so on a healthy cache it sits at zero or at least stays flat, and any sustained climb in it is Redis telling you it is over budget right now. A hit rate that falls under steady traffic (keyspace_misses growing faster than keyspace_hits), plus a mem_fragmentation_ratio drifting well above one, round out the picture.
An alert on evicted_keys moving off zero would have caught our 2am incident in the afternoon, while someone was awake and the fix was a config change instead of a recovery. Eviction is not something you want to discover from your p99 graph.
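As a sketch of what such a check could look like, the function below reads the fields named above from a dict shaped like the output of the INFO command (redis-py returns one from its info method). The cache_warnings name and both thresholds are hypothetical choices for illustration. It takes the previous evicted_keys reading as an argument because the counter is cumulative.

```python
from typing import List, Mapping


def cache_warnings(
    info: Mapping[str, float],
    prev_evicted: int,
    fill_limit: float = 0.85,   # assumed threshold, tune to taste
    frag_limit: float = 1.5,    # assumed threshold, tune to taste
) -> List[str]:
    """Return human-readable warnings derived from Redis INFO fields."""
    warnings: List[str] = []

    maxmemory = info.get("maxmemory", 0)
    if maxmemory == 0:
        # A maxmemory of 0 means no limit is configured.
        warnings.append("maxmemory is unset: no ceiling configured")
    elif info["used_memory"] / maxmemory >= fill_limit:
        warnings.append("used_memory is near maxmemory")

    # evicted_keys only ever grows, so compare with the last reading.
    if info["evicted_keys"] > prev_evicted:
        warnings.append("evictions since last check")

    if info.get("mem_fragmentation_ratio", 1.0) > frag_limit:
        warnings.append("fragmentation ratio is high")

    return warnings
```

In practice the same comparison usually lives in the monitoring system as a rate-of-change alert rather than in application code; the function is only meant to make the logic concrete.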
What I actually do
I treat Redis like the stateful datastore it is, not the speed dial it pretends to be. I set maxmemory explicitly and pick an eviction policy on purpose, almost always allkeys-lfu for a pure cache. Every key gets a TTL, no exceptions, because a key with no death is a future incident. I size the instance to the working set with headroom, and I alert on evicted keys so the ceiling announces itself early.
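For the first two items, here is a minimal sketch of setting the limit and the policy at runtime, assuming a redis-py style client that exposes config_set. The apply_cache_limits helper is a hypothetical name. CONFIG SET changes only the running server, so the same values also belong in redis.conf or whatever provisions the instance, and some managed services restrict CONFIG entirely.

```python
VALID_POLICIES = {
    "noeviction",
    "allkeys-lru", "allkeys-lfu", "allkeys-random",
    "volatile-lru", "volatile-lfu", "volatile-random", "volatile-ttl",
}


def apply_cache_limits(client, maxmemory_bytes: int, policy: str = "allkeys-lfu") -> None:
    """Set an explicit memory ceiling and eviction policy on a running server."""
    if maxmemory_bytes <= 0:
        # 0 would mean "no limit", which is exactly what we are avoiding.
        raise ValueError("maxmemory must be a positive byte count")
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown eviction policy: {policy!r}")
    client.config_set("maxmemory", maxmemory_bytes)
    client.config_set("maxmemory-policy", policy)
```

Validating the policy name locally is a small design choice: a typo fails in the deploy script instead of surfacing as a server error at an awkward moment.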
None of this is exotic. It is the same capacity planning you would do for any database that holds your state, which is exactly what Redis is. The trouble only starts when you forget that the fast thing in front of the slow thing is still a thing, with a size, a limit, and a plan you either made or skipped.