Your Replica Is Lying To You

Read replicas trade staleness for throughput. Replication lag, read-your-writes, and the staleness window nobody tracks: these are where things actually break.

The feature was simple. A user creates a draft, gets redirected to the draft's detail page, edits it, saves. It worked in dev. It worked in QA. It worked in production for the first hour. Then traffic picked up. The redirect started returning 404. The draft existed in the database, but the page that fetched it from the replica reported it as missing. By the time the user hit refresh, the replica had caught up and the draft appeared. The bug was intermittent, low-volume, and exactly the wrong shape to debug under pressure.

Read replicas are sold as a solution to scaling. They sometimes are. They are also the source of an entire class of bugs that do not exist in single-database setups, and most teams adding their first replica do not understand what they are signing up for. Replicas are a contract. The contract is that you will read slightly stale data sometimes, in exchange for more read throughput. If your application cannot tolerate that, the replica is not solving your problem. It is creating a new one.

What Replication Actually Does

The default in PostgreSQL is asynchronous streaming replication. When the primary commits a transaction, it writes the change to its WAL (write-ahead log). The replica streams the WAL over the network and applies it. There is a delay between the commit on the primary and the apply on the replica. Under steady load, that delay is usually milliseconds. Under heavy writes, long-running transactions, or network hiccups, it can stretch to seconds or minutes.

The replica is not behind because it is broken. It is behind because asynchronous replication is, definitionally, asynchronous. The primary did not wait for the replica to acknowledge before responding to the client. That tradeoff is the entire point. Synchronous replication exists too, but it costs every write a round-trip to the replica before the primary can return success. Most teams that turn it on turn it off again.

The Read-Your-Writes Trap

The most common bug a replica introduces is read-your-writes failure. The pattern looks like this:

  1. User submits a form. Write goes to primary.
  2. App returns success and redirects.
  3. Next page loads. Read goes to replica.
  4. Replica has not applied the write yet. Read returns nothing or stale data.
  5. User sees an empty page or thinks the save failed.

This is the case the original draft bug fell into. The user's request hit one of three application instances. To take pressure off the primary, the application had been configured to send all reads to the replica pool. A 200ms gap between commit and replica apply was small enough that the bug only showed up under load, and only on requests that needed to read what the same user had just written.
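The five steps can be reproduced with a toy model. This is a minimal sketch, not PostgreSQL: the Primary and Replica classes, the in-memory log standing in for the WAL, and the draft key are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy primary: applies writes immediately and appends them to a log."""
    rows: dict = field(default_factory=dict)
    wal: list = field(default_factory=list)

    def write(self, key, value):
        self.rows[key] = value
        self.wal.append((key, value))


@dataclass
class Replica:
    """Toy replica: only sees what it has replayed from the primary's log."""
    rows: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries replayed so far

    def replay(self, primary):
        for key, value in primary.wal[self.applied:]:
            self.rows[key] = value
        self.applied = len(primary.wal)


primary, replica = Primary(), Replica()

primary.write("draft:42", "My first draft")  # step 1: the write hits the primary
before = replica.rows.get("draft:42")        # step 3: the redirect reads the replica
replica.replay(primary)                      # some time later the replica catches up
after = replica.rows.get("draft:42")         # the user's refresh finds the draft

print(before, after)
```

Nothing in the model is broken. The first read simply arrives before the replay does, which is the whole bug.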

Retrying is not the fix. It hides the bug behind a slower experience and still fails when the lag is genuinely large. The fix is to send reads that need to be consistent with recent writes to the primary. The draft detail page had a read-your-writes requirement. The other 95 percent of requests did not. Routing matters.

Lag Is a Distribution, Not a Number

The dashboard shows replication lag as a single number. The number is usually small. The number that matters is the tail.

Lag is not constant. It spikes when:

  • A long-running query on the replica conflicts with incoming changes, so replay pauses until the query finishes or is cancelled (bounded by max_standby_streaming_delay).
  • The primary takes a write burst. Even if the replica's network and disk can keep up, there is a queue.
  • Maintenance work runs (VACUUM FULL, large UPDATE, schema migrations).
  • The replica's hardware is slower than the primary, even slightly.
  • Network congestion or transient packet loss.

If your monitoring graphs replication lag with a 1-minute average, you will see a flat low number 99 percent of the time. The bug shows up in the 1 percent. P99 lag is the metric that matters. Your bug rate is roughly the fraction of time lag exceeds the gap between a write and its follow-up read, multiplied by the rate of lag-sensitive reads. Most teams do not measure P99 lag because the dashboard shows the average and the average looks fine.
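A rough illustration of how far the average and the tail can diverge. The samples, the redirect delay, and the read rate below are invented numbers, and the percentile uses the simple nearest-rank method rather than whatever your monitoring system implements.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical lag samples in seconds: 980 quiet readings and 20 taken during a spike.
lag_samples = [0.02] * 980 + [3.0] * 20

mean_lag = statistics.mean(lag_samples)  # what an averaged dashboard would plot
p50_lag = percentile(lag_samples, 50)
p99_lag = percentile(lag_samples, 99)

# Back-of-envelope bug rate under two assumed inputs.
redirect_delay = 0.1          # assumed gap between a write and its follow-up read
sensitive_reads_per_sec = 50  # assumed rate of reads that must see a recent write
stale_fraction = sum(lag > redirect_delay for lag in lag_samples) / len(lag_samples)
stale_reads_per_sec = stale_fraction * sensitive_reads_per_sec

print(mean_lag, p50_lag, p99_lag, stale_reads_per_sec)
```

With these made-up samples the mean stays under 100 milliseconds while the P99 sits at three seconds, which is the gap the averaged graph hides.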

What Replicas Are Actually Good For

Replicas earn their keep on workloads that tolerate staleness by design:

  • Analytics, reporting, dashboards. A query that aggregates yesterday's data does not care about a 5-second lag.
  • Background jobs. Anything that already runs on a schedule is, by definition, processing data that was written before the job started.
  • Search indexes, caches, downstream pipelines. Eventually consistent by design.
  • Geographic read distribution. Putting a replica in a region close to users to reduce read latency, where staleness is fine.
  • Failover safety. Even if no read traffic ever hits the replica, having a hot standby that can take over the primary's role is worth the cost.

If most of your read traffic is in this category, a replica is a straight win. Reads that need fresh data go to the primary. The bulk go to the replica. CPU and IOPS on the primary get freed up for writes.

What They Don't Solve

Write capacity stays on the primary. Every write still goes there. If your primary cannot keep up with writes, adding replicas does not help. It might even hurt: replicas add WAL streaming load to the primary's network and disk.

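Strong consistency is off the table with asynchronous replicas. They give you eventual consistency with a staleness window you can monitor but not eliminate. If your application logic requires that a read after a write returns the write, you cannot route that read to an asynchronous replica. Short of synchronous replication that waits for the replica to apply each commit (synchronous_commit = remote_apply), which brings back the write-latency cost, no amount of replica configuration changes that.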

Slow queries stay slow. A query that takes 30 seconds on the primary takes 30 seconds on the replica. Adding replicas to handle a slow query workload is sometimes the right move (if you can shed the load from the primary), but the underlying problem is the query.

Patterns That Work

Three patterns cover most of the practical use cases.

Route reads-after-writes to primary. When a user just wrote, route their next few reads to the primary. Spring's AbstractRoutingDataSource, AWS RDS Proxy with cluster endpoints, or a custom interceptor can do this. The marker can be session-based (set a flag on first write, read primary for the next N seconds) or per-request (a header that signals "need fresh").
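A minimal sketch of the session-based marker, assuming an in-memory map and a five-second window. StickyPrimaryRouter and its method names are hypothetical, not part of Spring or any AWS API, and the returned strings stand in for whatever data source handles your application uses.

```python
import time


class StickyPrimaryRouter:
    """Pins a session's reads to the primary for a short window after it writes."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # session_id -> time of that session's latest write

    def record_write(self, session_id):
        self._last_write[session_id] = self.clock()

    def target_for_read(self, session_id, need_fresh=False):
        """Return "primary" or "replica" for this read."""
        if need_fresh:  # per-request override, for example set from a header
            return "primary"
        last = self._last_write.get(session_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"


# Usage with a fake clock so the window can be stepped through by hand.
now = [100.0]
router = StickyPrimaryRouter(pin_seconds=5.0, clock=lambda: now[0])

router.record_write("alice")
just_wrote = router.target_for_read("alice")              # inside the window
other_user = router.target_for_read("bob")                # never wrote
now[0] += 6.0
after_window = router.target_for_read("alice")            # window has expired
forced = router.target_for_read("bob", need_fresh=True)   # explicit override
```

Two caveats on the sketch. With several application instances the marker has to travel with the session (a cookie or a shared store), because a per-process map only knows about writes that instance handled. The map also never evicts old entries, which a real version would need to do.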

Use the primary for endpoints that must be consistent by design. That means anything the user would notice as broken if it returned stale data: "my profile," "my recent orders," admin dashboards. Send to the replica anything that can show stale data: search results, public listings, analytics. The split is by endpoint, not by query.
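The endpoint split can be as plain as a lookup table. The paths below are hypothetical, and defaulting unknown endpoints to the primary is one possible choice rather than a rule: it makes an unclassified endpoint cost primary capacity instead of returning stale data.

```python
# Hypothetical endpoint paths. The classification comes from product judgment
# about which pages look broken when they show stale data.
ENDPOINT_TARGETS = {
    "/me/profile": "primary",
    "/me/orders": "primary",
    "/admin/dashboard": "primary",
    "/search": "replica",
    "/listings": "replica",
    "/analytics": "replica",
}


def datasource_for(endpoint):
    """Pick a data source per endpoint; unclassified endpoints fall back to the primary."""
    return ENDPOINT_TARGETS.get(endpoint, "primary")


profile_target = datasource_for("/me/profile")
search_target = datasource_for("/search")
unknown_target = datasource_for("/brand-new-endpoint")
```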

Track lag explicitly. pg_stat_replication exposes replay_lag per replica. Alert when P99 over the last five minutes exceeds a threshold (start with 1 second, tune from there). Treat sustained high lag as an incident, not a metric.
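One way the alert could be wired up, as a sketch. The query text reads the replay_lag column from pg_stat_replication; the LagWindow class, the one-sample-per-second polling, and the nearest-rank P99 are assumptions made for illustration, and the database connection that would feed it is left out.

```python
import math
from collections import deque

# Query one might poll on the primary. replay_lag is an interval, so it is
# converted to seconds. It can be NULL on an idle, caught-up replica; how to
# treat that (skip or count as zero) is left to the caller.
REPLAY_LAG_SQL = """
SELECT application_name, EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
FROM pg_stat_replication
"""


class LagWindow:
    """Keeps recent lag samples and reports whether their P99 breaches a threshold."""

    def __init__(self, window_seconds=300.0, threshold_seconds=1.0):
        self.window_seconds = window_seconds
        self.threshold_seconds = threshold_seconds
        self._samples = deque()  # (timestamp, lag_seconds), oldest first

    def add(self, timestamp, lag_seconds):
        self._samples.append((timestamp, lag_seconds))
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] <= timestamp - self.window_seconds:
            self._samples.popleft()

    def p99(self):
        lags = sorted(lag for _, lag in self._samples)
        if not lags:
            return 0.0
        rank = max(1, math.ceil(99 * len(lags) / 100))
        return lags[rank - 1]

    def should_alert(self):
        return self.p99() > self.threshold_seconds


# Usage with synthetic samples, one per second.
window = LagWindow()
for t in range(300):          # five quiet minutes
    window.add(float(t), 0.02)
quiet = window.should_alert()

for t in range(300, 310):     # ten seconds of 4-second lag
    window.add(float(t), 4.0)
spiking = window.should_alert()
```

In the synthetic run, ten bad seconds out of three hundred are enough to push the P99 over the threshold, while a five-minute average of the same samples would stay well under one second.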

Postgres Specifics

The relevant configuration knobs in PostgreSQL:

  • pg_stat_replication: shows the state of every connected replica. replay_lag is the most useful column.
  • synchronous_commit and synchronous_standby_names: turn on synchronous replication for specific replicas. Costs latency on every write. Use sparingly.
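  • Replication slots: prevent the primary from discarding WAL the replica still needs. Without slots (or a WAL archive to restore from), a replica that falls too far behind cannot catch up and has to be rebuilt from a base backup. The cost is that a slot for a dead or stalled replica keeps retaining WAL on the primary, so slot disk usage needs monitoring too.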
  • hot_standby_feedback: lets the replica tell the primary which transactions are still visible to its queries, preventing the primary from cleaning up rows the replica still needs. Trades vacuum efficiency on the primary for fewer query cancellations on the replica.

None of these are knobs you should tune by reading a blog post. They are knobs you tune by understanding the specific failure mode you are seeing in production. The starting position is: async replication, no synchronous standbys, replication slots configured, hot_standby_feedback off, alerting on P99 replay lag.

Replicas are a tool. The price of using them is that you have to think about staleness on every read path. Teams that get value out of replicas did this thinking up front. Teams that get burned treated them as a transparent scaling layer, and then spent six months chasing intermittent bugs that only show up when something else goes wrong.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
