The Deploy That Took Down Friday

Friday deploys have a reputation for a reason. Here's why they go wrong, what guardrails actually help, and when it's okay to ship on a Friday anyway.

It was 4:47 PM on a Friday. I know the exact time because I checked the deploy log about forty times that night. A colleague had merged a PR that had been in review for two days. Clean code. Tests passing. Two approvals. The kind of PR you merge without thinking twice.

The deploy went out at 4:47. By 5:15, half the team was already offline for the weekend. By 5:32, our error rate dashboard looked like a hockey stick. By 5:45, I was on a call with two other engineers trying to figure out why our payment processing was returning 400 errors for about 30% of requests.

The fix took eleven minutes once we found it. A config value that referenced an environment variable which existed in staging but hadn't been added to production yet. The kind of thing that tests don't catch because tests don't run against production config. The rollback took another four minutes. But finding the problem took almost two hours, because the people who knew that part of the codebase best were already at dinner, at the gym, or had their phones on silent.

Total time from deploy to resolution: two hours and forty-one minutes. On a Monday morning, it would have been twenty minutes.
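A missing environment variable of this kind can be caught at startup rather than on the first real request. The following is a minimal sketch, not what our service actually ran; the variable names are hypothetical placeholders.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical variable names, for illustration only.
REQUIRED_ENV_VARS = ["PAYMENT_API_KEY", "PAYMENT_API_URL"]


def missing_env_vars(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def check_config_or_fail(
    required: Iterable[str] = REQUIRED_ENV_VARS,
    env: Mapping[str, str] = os.environ,
) -> None:
    """Raise at startup instead of failing later on a live request."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
```

Calling `check_config_or_fail()` as the first step of application startup makes the new version fail to boot, which a deploy pipeline can usually detect before it routes traffic.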

Why Fridays Are Different

The code doesn't know what day it is. The CI pipeline doesn't care. The servers don't take weekends off. So why does Friday have a reputation?

It's not about the technology. It's about the humans.

Reduced availability. People check out on Friday afternoons. Mentally, physically, or both. The person who wrote the code might be reachable. Whoever understands the infrastructure it touches might not be. And the DBA who knows why that particular query behaves differently under load is already on their second drink. When something goes wrong, your response time is directly proportional to how quickly you can get the right people in a room. On Friday at 5 PM, that room is empty.

Compressed recovery window. If you deploy on Monday and something breaks slowly (a gradual memory leak, a subtle data inconsistency, a performance regression that only shows up under sustained load), you have four full business days to notice and fix it. If you deploy on Friday, you have maybe two hours. Then the thing sits in production all weekend, potentially getting worse, with nobody watching.

Fatigue and rush. There's a specific Friday energy where people want to get things done before the week ends. PRs that have been open all week finally get merged. That feature that was supposed to ship this sprint gets pushed out at the last minute. Corners don't get cut intentionally, but attention gets thinner. Review is a little less thorough. Testing is a little more cursory. And “let me just check one more thing” becomes “it's probably fine.”

None of these are technical problems. They're all human problems. And human problems are the hardest kind to solve with tooling.

The Incidents I've Seen

The payment processing one was mine. But I've collected Friday deploy stories from every team I've worked on. They all follow the same pattern.

The migration that locked a table. A database migration added an index to a table with 50 million rows. In development, the table had 10,000 rows and the migration ran in under a second. In production, it locked the table for six minutes. Every write to that table queued up behind the lock. The service backed up, the connection pool was exhausted, and the whole API went down. Nobody was around to run the emergency DROP INDEX because it was 6 PM on a Friday.

The feature flag that wasn't off. A new feature was deployed behind a feature flag. The flag was supposed to be off in production. It was off. But the code checked the flag value from a cache, and the cache hadn't been invalidated after the deploy. For about 20% of users who hit a specific cache partition, the feature was on. The feature wasn't ready. It made API calls to a service that hadn't been deployed yet. Those calls failed silently and the users saw broken UI. The bug report came in Saturday morning from a user on Twitter.

The dependency update that changed behavior. A minor version bump of a JSON parsing library changed how it handled null values in nested objects. Instead of returning null, it threw a NullPointerException three layers deep in the serialization stack. The tests passed because the test fixtures didn't have null values in that specific nesting pattern. Production data did. The error didn't surface immediately because it only affected one specific API endpoint that was used by a mobile feature rolled out to 10% of users. That 10% had a broken app all weekend.

Every one of these would have been caught and fixed quickly on a Tuesday. Same code. Same bug. The difference was entirely about when people were available to respond.

The “Never Deploy on Friday” Rule

A lot of teams adopt a blanket rule: no deploys on Friday. Some extend it to after 2 PM on Thursday. I've worked at places where the CI pipeline literally blocked production deploys after a certain time.
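A time-based gate like this is only a few lines, which is part of its appeal. Here is a rough sketch of the check a pipeline might run before a production deploy. The Thursday 2 PM cutoff comes from the rule described above; blocking the weekend as well is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, time

THURSDAY, FRIDAY = 3, 4  # datetime.weekday() counts Monday as 0


def deploy_blocked(now: datetime, thursday_cutoff: time = time(14, 0)) -> bool:
    """Return True if a production deploy should be blocked at local time `now`."""
    day = now.weekday()
    if day == THURSDAY:
        return now.time() >= thursday_cutoff
    # Assumption: Saturday and Sunday are frozen along with Friday.
    return day >= FRIDAY
```

A CI step could call `deploy_blocked(datetime.now())` and exit with a non-zero status when it returns `True`.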

I understand the appeal. It's simple. It's easy to enforce. It eliminates a category of risk. And for some teams, it's the right call.

But I also think it's a band-aid over deeper problems.

If your deploys are so risky that you can't ship on one-fifth of your workdays, that's telling you something about your deployment process, not about Fridays. If a single bad deploy can take down production for hours and you don't have fast rollback, that's a problem every day of the week. If you can't detect issues quickly because your monitoring is weak, Monday deploys are risky too; you just have more time to stumble onto the problem.

The “no Friday deploys” rule treats the symptom. The disease is that deploys are scary.

Making Deploys Less Scary

The teams I've seen handle this well don't ban Friday deploys. They make all deploys safer. Friday included.

Fast rollback

This is the single most important thing. If you can roll back a deploy in under two minutes, the damage from any bad deploy shrinks dramatically. It doesn't matter if it's Friday at 5 PM. You don't need to diagnose the problem. You don't need to find the person who wrote the code. You just roll back.

The way you get fast rollback depends on your infrastructure. If you're deploying containers, it's keeping the previous image tagged and ready. If you're on a platform like Firebase Hosting or Vercel, it's usually built in. If you're doing blue-green deployments, it's switching the router back. Whatever your setup, practice it. Time it. Make sure everyone on the team knows how to do it, not just the person who set up the pipeline.

I've seen teams with elaborate deployment pipelines who had never actually tested their rollback process. The first time they needed it, it didn't work. Don't be that team.
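Whatever the platform, the core of fast rollback is always knowing which version came before the current one, and having timed the switch at least once. The sketch below is a toy model of that idea, not a real deployment tool. `ReleaseHistory`, `timed_rollback`, and the `switch_traffic` callback are hypothetical names standing in for your image tags, router, or platform API.

```python
import time
from typing import Callable, List, Tuple


class ReleaseHistory:
    """Tracks deployed versions so the previous one is always known."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def deploy(self, version: str) -> None:
        self._versions.append(version)

    def rollback(self) -> str:
        """Drop the current version and return the one to switch back to."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        return self._versions[-1]


def timed_rollback(
    history: ReleaseHistory, switch_traffic: Callable[[str], None]
) -> Tuple[str, float]:
    """Roll back, repoint traffic, and report how long the whole thing took."""
    start = time.monotonic()
    target = history.rollback()
    switch_traffic(target)  # hypothetical hook: retag an image, flip a router, etc.
    return target, time.monotonic() - start
```

Running something like `timed_rollback` as a scheduled drill gives you a measured number to compare against the two-minute target, instead of a guess.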

Canary deploys

Instead of sending 100% of traffic to the new version immediately, send 5%. Watch the error rates. Watch the latency. Watch the logs. If everything looks good after ten minutes, ramp to 25%, then 50%, then 100%. If something looks wrong at 5%, you've only affected 5% of your users, and you can pull back instantly.

This requires some infrastructure work. You need a load balancer or service mesh that can split traffic by version. But it's worth the investment. A canary deploy would have caught our payment processing bug when it affected 1.5% of transactions instead of 30%.
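The ramp and the arithmetic behind that 1.5% figure can be sketched in a few lines. This assumes a hypothetical `error_rate_at` callback that shifts traffic to the given percentage, waits out the observation window, and returns the observed error rate. The real traffic splitting lives in your load balancer or mesh.

```python
from typing import Callable, Sequence

RAMP_STEPS = (5, 25, 50, 100)  # percent of traffic, as in the ramp described above


def run_canary(
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
    steps: Sequence[int] = RAMP_STEPS,
) -> int:
    """Walk the ramp; return the final traffic percent (0 means pulled back)."""
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return 0  # stop ramping and send all traffic to the old version
    return steps[-1]


def affected_share(canary_percent: float, failure_rate: float) -> float:
    """Share of all requests that fail while only the canary slice is exposed."""
    return canary_percent / 100 * failure_rate
```

With a 5% canary and a bug that fails 30% of the requests it sees, `affected_share(5, 0.30)` works out to 0.05 × 0.30 = 0.015, or 1.5% of all transactions.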

Monitoring that pages you

If your monitoring only shows dashboards that someone needs to be looking at, it's not monitoring. It's decoration. Real monitoring pages someone when error rates spike. It sends a Slack alert when latency exceeds a threshold. It wakes someone up at 3 AM if the database connection pool is exhausted.

The key metrics to alert on after any deploy: error rate (absolute and relative to baseline), p99 latency, and the success rate of business-critical transactions like payments, signups, or whatever your application's core action is. If any of these deviate from baseline within 30 minutes of a deploy, that deploy is the first suspect.
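As a rough sketch, the rule "deviation from baseline within 30 minutes of a deploy" might look like this. The thresholds are placeholder assumptions, not recommendations, and the `Snapshot` type is hypothetical. A real setup would express this in your alerting system's own rule language.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float      # fraction of requests failing, 0.0 to 1.0
    p99_latency_ms: float


def deploy_is_suspect(
    baseline: Snapshot,
    current: Snapshot,
    minutes_since_deploy: float,
    max_abs_error_rate: float = 0.02,    # placeholder threshold
    max_relative_increase: float = 2.0,  # placeholder: 2x baseline
    window_minutes: float = 30,
) -> bool:
    """Return True if metrics deviated from baseline shortly after a deploy."""
    if minutes_since_deploy > window_minutes:
        return False
    absolute_spike = current.error_rate > max_abs_error_rate
    # Skip the relative check when the baseline is zero, so a single stray
    # error does not count as an infinite increase.
    relative_spike = (
        baseline.error_rate > 0
        and current.error_rate > baseline.error_rate * max_relative_increase
    )
    latency_spike = (
        current.p99_latency_ms > baseline.p99_latency_ms * max_relative_increase
    )
    return absolute_spike or relative_spike or latency_spike
```

Checking both the absolute and the relative error rate matters: a jump from 0.5% to 1.2% stays under a 2% absolute threshold but is more than double the baseline.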

Deploy metadata in your logs

This is small but it matters. Tag every log line with the current deploy version or commit hash. When you're looking at an error spike, you want to immediately see which deploy introduced it. Without this, you're correlating timestamps between your deploy log and your error log, which works but wastes precious minutes during an incident.

In Spring Boot, I add the git commit hash to the application context at startup and include it in the MDC for every request. Five minutes of setup saves hours over the life of the project.
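The same idea carries over to other stacks. The sketch below uses Python's standard `logging` module rather than Spring Boot and the MDC, so it illustrates the technique, not the setup described above. `GIT_COMMIT` is a hypothetical environment variable that your build or deploy step would need to set.

```python
import logging
import os


class DeployVersionFilter(logging.Filter):
    """Attach the running deploy's version to every log record."""

    def __init__(self, version: str) -> None:
        super().__init__()
        self.version = version

    def filter(self, record: logging.LogRecord) -> bool:
        record.deploy_version = self.version
        return True  # never drop records, only annotate them


def configure_logging(version: str) -> logging.Handler:
    """Build a handler whose output includes the deploy version."""
    handler = logging.StreamHandler()
    handler.addFilter(DeployVersionFilter(version))
    handler.setFormatter(
        logging.Formatter(
            "%(asctime)s %(levelname)s [%(deploy_version)s] %(message)s"
        )
    )
    return handler


# GIT_COMMIT is an assumed variable name; fall back to "unknown" if it is unset.
logger = logging.getLogger("app")
logger.addHandler(configure_logging(os.environ.get("GIT_COMMIT", "unknown")))
```

Every line emitted through `logger` then carries the version in brackets, so an error spike can be grouped by deploy without cross-referencing timestamps.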

Automated smoke tests post-deploy

After every deploy, run a small suite of integration tests against production. Not your full test suite. A handful of tests that exercise the critical paths: can a user log in, can they load the main page, can they perform the core action of your application. If any of these fail, auto-rollback or page immediately.

These tests should run in under a minute. They're not exhaustive. They're a tripwire. They exist to catch the class of bug that happens when everything works in staging but something is different in production, which is exactly the class of bug that Friday deploys turn into weekend incidents.
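A tripwire like this needs very little code. Here is a minimal sketch, assuming each check is a hypothetical callable that exercises one critical path against production and returns `True` on success. What you do with the failures (auto-rollback or page) is left to the caller.

```python
import time
from typing import Callable, Dict, List


def run_smoke_tests(
    checks: Dict[str, Callable[[], bool]],
    budget_seconds: float = 60.0,
) -> List[str]:
    """Run each check; return the names of failures (empty list means healthy)."""
    failures: List[str] = []
    deadline = time.monotonic() + budget_seconds
    for name, check in checks.items():
        if time.monotonic() > deadline:
            # A suite that blows its budget is itself a signal worth reporting.
            failures.append(name + " (not run: time budget exceeded)")
            continue
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    return failures
```

A deploy script might pass in checks such as `{"login": check_login, "core_action": check_payment}` (hypothetical functions) and trigger a rollback whenever the returned list is non-empty.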

When to Actually Worry About Friday

Even with all these safeguards, there are deploys I'd still avoid on Friday afternoon.

Database migrations on large tables. These are risky by nature, slow, and hard to roll back. If a migration locks a table or corrupts data, rollback isn't “deploy the old version.” It's “restore from backup and replay writes.” That's a multi-hour operation you don't want to start at 6 PM.

Infrastructure changes. Changing the database version, rotating certificates, modifying network rules, updating Kubernetes node pools. Anything in this bucket affects everything, is hard to test in staging, and fails in a mode that's usually “nothing works at all.”

Anything you haven't deployed to staging first. Obvious, but I've seen it happen. Staging is broken for some unrelated reason, someone says “it worked locally, let's just push to prod.” On a Friday. Don't do this.

Major feature launches. If you're turning on a new feature for all users, do it on a Monday or Tuesday. Not because the deploy is riskier, but because you want maximum availability to respond to the unexpected. Even if the code is perfect, user behavior might surprise you. Support volume might spike. A third-party dependency might not handle the load. You want your full team awake and available for the first 48 hours.

For everything else, a regular code deploy with a well-tested PR, feature flags, canary rollout, and fast rollback? Ship it. Friday is fine.

The Culture Problem

There's a deeper issue here that I think a lot of teams avoid talking about. The Friday deploy fear is often a symptom of a culture where deploys are events, not routine.

If you deploy once a week, each deploy is a big deal. It's a batch of changes accumulated over five days. More changes means more risk. More risk means more anxiety. More anxiety means fewer deploys. Fewer deploys means bigger batches. It's a vicious cycle that ends with monthly releases and deployment ceremonies that involve a checklist, a war room, and someone's weekend.

The teams that deploy without fear, any day, any time, do it because they deploy constantly. Ten times a day. Twenty times a day. Each deploy is one or two small changes. The blast radius of any single deploy is tiny. Rollback is routine, not an emergency procedure. Deploying is as unremarkable as committing code.

Getting there requires investment. CI/CD pipelines that are fast and reliable. Feature flags that let you decouple deploy from release. Monitoring that tells you immediately when something is wrong. A rollback process that's one command or one button. Canary infrastructure. Automated smoke tests.

It's not free. But the alternative is a team that's afraid to ship code, and a team that's afraid to ship code ships less, ships later, and ships bigger batches when they finally do, which makes every deploy scarier, which makes the fear worse.

My Rules Now

After living through enough Friday incidents, here's where I've landed.

I don't ban Friday deploys. But I'm more deliberate about what ships on Friday.

Small, well-reviewed PRs with feature flags? Ship any day, including Friday.

Database migrations? Monday or Tuesday morning, when the full team is available and we have the whole week to monitor.

Infrastructure changes? Same. Early in the week, with a rollback plan documented before we start.

Big feature launches? Monday or Tuesday, with the team aligned on who's watching what metrics.

Hotfixes? Ship immediately, regardless of the day. A known bug in production is worse than the risk of a Friday deploy.

The point isn't to follow a rigid rule. It's to think about risk for each specific change. A one-line copy change and a database schema migration have wildly different risk profiles. Treating them the same, either by deploying both fearlessly or by blocking both on Fridays, misses the point.
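The rules above amount to a per-change risk lookup, not a calendar rule. As an illustration only, they could be written down like this. The categories mirror the list above, reading "early in the week" as Monday or Tuesday is a simplification, and the names are hypothetical.

```python
from enum import Enum

MONDAY, TUESDAY = 0, 1  # datetime.weekday() numbering


class Change(Enum):
    SMALL_FLAGGED_PR = "small, well-reviewed PR behind a feature flag"
    DB_MIGRATION = "database migration"
    INFRASTRUCTURE = "infrastructure change"
    FEATURE_LAUNCH = "big feature launch"
    HOTFIX = "hotfix for a known production bug"


# Changes that wait for the start of the week, per the rules above.
EARLY_WEEK_ONLY = {
    Change.DB_MIGRATION,
    Change.INFRASTRUCTURE,
    Change.FEATURE_LAUNCH,
}


def ok_to_ship(change: Change, weekday: int) -> bool:
    """Return True if this kind of change may ship on the given weekday."""
    if change in EARLY_WEEK_ONLY:
        return weekday in (MONDAY, TUESDAY)
    return True  # small flagged PRs and hotfixes ship any day
```

The table is not a gate to enforce. It just puts the decision on the type of change, with the day of the week as only one input.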

Deploy the things that are safe to deploy. Hold the things that aren't. And invest in making more things safe to deploy, so that eventually the day of the week doesn't matter at all.

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
