
Your Disaster Recovery Plan Is Fiction Until You Run It

A DR plan you've never run is a hypothesis, not a plan. What only breaks on the real restore, why RTO is fiction until you measure it, and the game-day fix.

The restore that took six hours

The primary database went down hard at 9:40 on a Tuesday morning. That part was fine. We had a disaster recovery plan, a runbook with numbered steps, and a documented recovery time of thirty minutes. Somebody pulled up the wiki page, calm and a little smug, and started working through it. We were back at 3:50 that afternoon.

The thirty-minute plan met reality one line at a time. Our most recent backup was four hours old, because a cron job had been failing silently for a week and nobody watched it. The restore ran, and then the replica we spun up failed to reach the secrets manager, because its IAM role had drifted from the primary's months ago. Once it was finally up, traffic kept hitting the dead host for another forty minutes, because a DNS record carried a TTL nobody remembered setting. Every step in the runbook was technically correct. The plan, taken as a whole, was fiction.

A plan you have never run is a hypothesis

That is the lesson I keep relearning, so I will state it as flatly as I can. A disaster recovery plan you have never executed end to end is not a plan. It is a hypothesis about how your system will behave under conditions you have never actually created. You wrote down what you believe will happen. You have no evidence that it will.

The writing feels like the work, which is the trap. A document with clear steps and confident numbers looks finished, and it gets filed away as a solved problem. But that document describes a process nobody has performed, and a process nobody has performed is exactly as reliable as any other untested code path in production, which is to say not at all.

What only breaks on the real restore

The failures that take you down during a recovery are almost never the ones you planned for. The database dying is the easy part, the part the runbook is actually about. What gets you is the long tail of things that quietly went stale since the last time anyone looked.

The backup completes every night and reports success, and nobody has ever restored from it, so no one knows it has been missing a table since a schema change in March. The standby exists and looks healthy, but it was sized for a calm afternoon and falls over the moment real production traffic lands on it. Secrets, IAM roles, and network rules on the recovery path have drifted out of sync with the primary, one small change at a time. The restore script itself has rotted and still calls a flag the tool removed two versions ago. None of these show up on a dashboard, because each one is only exercised during the event you are hoping never happens.
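Drift of this kind is cheap to detect once both sides are written down as data. What follows is a minimal sketch, assuming you can export the settings of the primary and of the recovery path as flat dictionaries. The keys, the values, and the `find_drift` helper are all hypothetical, and how you collect the snapshots depends entirely on your platform.

```python
MISSING = "<missing>"  # marker for a setting that exists on only one side


def find_drift(primary, recovery):
    """Return {setting: (primary_value, recovery_value)} for every difference."""
    drift = {}
    for key in sorted(set(primary) | set(recovery)):
        p = primary.get(key, MISSING)
        r = recovery.get(key, MISSING)
        if p != r:
            drift[key] = (p, r)
    return drift


# Hypothetical snapshots; in practice these would be exported from your
# cloud provider or configuration management, not typed by hand.
primary = {
    "iam_role_policies": ["db-read", "secrets-read"],
    "db_port_open": True,
    "restore_tool_version": "2.4",
}
recovery = {
    "iam_role_policies": ["db-read"],
    "db_port_open": True,
}

for setting, (p, r) in find_drift(primary, recovery).items():
    print(f"DRIFT {setting}: primary={p!r} recovery={r!r}")
```

A check like this only covers the settings you thought to export, so it narrows the surprise list without replacing the drill.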

RTO and RPO are numbers you wrote, not numbers you measured

Every DR plan has two numbers in it. The recovery time objective is how fast you will be back, and the recovery point objective is how much data you are willing to lose. Both tend to get written the way you would write a wish, picked because they sound acceptable to whoever asked, not because anyone clocked a real restore and read the time off a stopwatch.
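The RPO half can be checked between drills, because the data you would lose right now is bounded by the age of the newest backup that actually succeeded. This is a rough sketch under assumptions: the timestamps, the one-hour written RPO, and both function names are invented for illustration, and a real check would read completion times from wherever your backup job records them.

```python
from datetime import datetime, timedelta, timezone


def effective_rpo(successful_backups, now):
    """Age of the newest successful backup: the data you would lose right now."""
    if not successful_backups:
        raise ValueError("no successful backup on record")
    return now - max(successful_backups)


def rpo_breached(successful_backups, now, written_rpo):
    """True when the real exposure is larger than the number in the plan."""
    return effective_rpo(successful_backups, now) > written_rpo


# Hypothetical values for illustration only.
now = datetime(2024, 1, 9, 9, 40, tzinfo=timezone.utc)
backups = [
    datetime(2024, 1, 9, 1, 40, tzinfo=timezone.utc),
    datetime(2024, 1, 9, 5, 40, tzinfo=timezone.utc),
]
written_rpo = timedelta(hours=1)

exposure = effective_rpo(backups, now)
print(f"exposure={exposure}, breached={rpo_breached(backups, now, written_rpo)}")
```

Run on a schedule and wired to an alert, a check like this turns a silently failing backup job into a loud one.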

An RTO you have never measured is marketing. The only way to know your real recovery time is to recover, with the clock running, and watch where the minutes actually go. Almost always they go somewhere the plan did not mention: waiting on a backup to download, hunting for a credential, arguing about whether it is safe to cut over yet. The measured number is usually a multiple of the written one, and the only way to shrink the gap is to know it exists.
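Watching where the minutes go is easier if the drill records them for you. Here is a small sketch of a per-step stopwatch. The `DrillClock` class and the step names are hypothetical, and the `time.sleep` calls stand in for real recovery work. The thirty-minute figure is the written RTO from the story above.

```python
import time
from contextlib import contextmanager

WRITTEN_RTO_SECONDS = 30 * 60  # the number in the plan


class DrillClock:
    """Records how long each named recovery step takes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.steps = []  # list of (step name, seconds)

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the step even if it raised, since failed steps cost time too.
            self.steps.append((name, self._clock() - start))

    def total_seconds(self):
        return sum(seconds for _, seconds in self.steps)

    def slowest(self):
        return max(self.steps, key=lambda item: item[1])


drill = DrillClock()
with drill.step("download backup"):
    time.sleep(0.02)  # stand-in for the real download
with drill.step("restore database"):
    time.sleep(0.01)  # stand-in for the real restore

name, seconds = drill.slowest()
print(f"slowest step: {name} ({seconds:.2f}s)")
print(f"measured {drill.total_seconds():.2f}s against a written RTO of {WRITTEN_RTO_SECONDS}s")
```

The per-step breakdown matters more than the total, because it tells you which part of the runbook to fix first.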

The game day is the only honest test

The fix is uncomfortable and simple. You schedule the disaster. On a quiet Tuesday afternoon, with the whole team watching and a rollback ready, you kill the primary on purpose and recover for real, exactly as if it had happened by surprise. Then you write down everything that broke, every minute that went somewhere unexpected, and every step in the runbook that turned out to be wrong.

The first game day is always a little humiliating, and that is the point. It converts a comfortable belief into an uncomfortable list of real defects, while the stakes are low and everyone is calm and caffeinated instead of panicking at 3am. You fix the list, and you run it again next quarter, because everything you just fixed will quietly drift back out of true the moment you stop looking.

The objections, answered

Every team that does not test its DR has a reason, and the reasons are always the same three. One is that a real failover is too risky to run on purpose, but that fear is itself the finding: if you are scared to fail over on a calm afternoon with a rollback ready, you have no business believing it will work during an actual outage. Another is that there is no time, which ignores that there is always time for the unplanned six-hour version later, so the only real choice is the scheduled one-hour version now. The last is that you already have backups, which is true and beside the point. A backup you have never restored is a file, not a safety net.

What I actually do

I put a DR drill on the calendar like any other recurring commitment, once a quarter, and I treat skipping it the way I would treat skipping a security patch. Between drills, I restore from backup into a clean throwaway environment and confirm the data is whole and the application boots against it. The RTO in our docs is a number we measured during the last drill, not one we hoped for, and it gets corrected when the real one moves.
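The "confirm the data is whole" step can start as something very small. The sketch below uses an in-memory SQLite database as a stand-in for the restored copy. The table names, the minimum row counts, and the `verify_restore` helper are assumptions made for illustration, and the same idea would need the catalog queries of whatever database you actually run.

```python
import sqlite3


def verify_restore(conn, expected_min_rows):
    """Return a list of problems; an empty list means the restore passed."""
    problems = []
    present = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    for table, min_rows in expected_min_rows.items():
        if table not in present:
            problems.append(f"missing table: {table}")
            continue
        # Table names come from our own manifest, never from user input.
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        if count < min_rows:
            problems.append(f"{table}: {count} rows, expected at least {min_rows}")
    return problems


# Stand-in for a freshly restored backup that silently lost a table.
restored = sqlite3.connect(":memory:")
restored.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
restored.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)

problems = verify_restore(restored, {"users": 1, "invoices": 1})
for problem in problems:
    print("RESTORE CHECK FAILED:", problem)
```

Table presence and row counts are a floor, not proof of integrity, which is why the application still has to boot against the restored copy.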

The runbook is code, and like all code it rots, so it only earns trust by being executed. A disaster recovery plan is worth exactly as much as the last time you ran it from top to bottom. If that has never happened, you do not have a plan. You have a story you are telling yourself about a day you hope never comes.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
