← Back to Blog

EXPLAIN ANALYZE Lies to You

PostgreSQL's EXPLAIN ANALYZE tells you the plan looks fast. It doesn't tell you the timing is cached, the estimates are stale, or the cost isn't milliseconds.

The query passed every check I ran. EXPLAIN ANALYZE showed 2ms. The indexes were there. The plan looked clean. I merged the PR, deployed to production, and went home.

The alerts started forty minutes later. The endpoint was timing out. P99 latency had climbed to eight seconds. I pulled up the slow query log and saw the same query I had just signed off on.

I ran EXPLAIN ANALYZE again, right there in production. Two milliseconds. I stared at the output for a while, completely confused. The query was fast. Except it was also eight seconds. Both of those things were true at the same time.

That was my introduction to the ways PostgreSQL's EXPLAIN ANALYZE misleads you. The tool is not broken, and it is not lying on purpose. But it measures something subtly different from what you think it measures, and that gap is enough to send you home thinking your query is fine when it is not.

Lie One: Row Estimates Are Fiction

The planner does not know how many rows your query will return. It guesses. It uses statistics collected by ANALYZE to estimate row counts at each step of the plan, and it picks the execution strategy based on those estimates.

When the estimates are close, the plan is usually good. When they are wrong, the plan can be catastrophically wrong. Here is what that looks like in the output:

Nested Loop  (cost=0.43..1842.50 rows=52 width=128) (actual rows=487231 loops=1)
  ->  Index Scan on users ...
  ->  Index Scan on orders ...

The planner estimated 52 rows. The actual result was 487,231. That single discrepancy changed everything. A nested loop join is efficient when the inner side is small. When the inner side returns half a million rows, it becomes one of the slowest things your database can do. The planner chose it because it thought the result would be tiny. It was wrong.

This happens because statistics go stale. PostgreSQL's autovacuum process runs ANALYZE periodically to refresh them, but periodically is not constantly. Take a table that grew from 10,000 rows to 50 million in the past month: the statistics the planner is using might still reflect the old distribution. It estimates 52 rows because that is what the data looked like when ANALYZE last ran.

The fix is simple: run ANALYZE table_name manually after a big batch load, a deletion sweep, or a schema change. But more importantly, learn to read the actual vs estimated divergence in your EXPLAIN ANALYZE output. Any time you see estimated rows that are off by more than a factor of ten, treat the plan as suspect regardless of what the timing shows.

Lie Two: The Timing Is Cached

The 2ms I saw in my EXPLAIN ANALYZE output was real. It just was not measuring what I thought it was measuring.

When you run EXPLAIN ANALYZE, the data your query touches gets loaded into PostgreSQL's shared_buffers, or the OS page cache, or both. By the time you run the query again to read the plan, the data is already warm. The IO that matters in production, the cold disk reads that real users hit, is invisible in the timing you see.

In my case, the query had never been run on production before. The table had 200 million rows and the relevant data was scattered across hundreds of data pages on disk. The first real user to hit the endpoint triggered all of those disk reads. Eight seconds. I ran EXPLAIN ANALYZE right after, and the data was already in cache. Two milliseconds.

The tool to catch this is EXPLAIN (ANALYZE, BUFFERS). The BUFFERS option adds a line to your output showing how many pages came from the buffer cache versus disk:

Seq Scan on orders  (cost=0.00..38421.00 rows=2000000 width=64)
                    (actual time=0.042..4823.112 rows=2000000 loops=1)
  Buffers: shared hit=1240 read=37181

That read=37181 means 37,181 data pages were fetched from disk. At roughly 8KB per page, that is about 290MB of disk IO. The 1,240 hits came from cache. A query that is mostly reads is a query that will be much slower for users who arrive before the cache is warm, which is every query on a cold connection or after a server restart.

If your BUFFERS output shows mostly hits and fast timing, your query genuinely is fast. If it shows heavy reads, the timing you see in EXPLAIN ANALYZE is optimistic. How optimistic depends on your storage hardware, your cache hit rate under real load, and how often the same data is accessed by other queries.

Lie Three: Cost Is Not Time

The numbers in the cost field of EXPLAIN output are not milliseconds. They are not microseconds. They are an abstract unit the planner uses to compare plans against each other.

A cost of 100 is better than a cost of 200, in the planner's opinion. But a plan with cost 100 can easily be slower in wall-clock time than a plan with cost 200, because the cost model is based on assumptions that may not match your hardware or your workload.

The two most important assumptions are seq_page_cost and random_page_cost. By default, PostgreSQL assumes sequential IO costs 1 unit and random IO costs 4 units. This made sense for spinning disks where random access is genuinely slower. On SSDs, the difference is much smaller, sometimes negligible. If your database runs on SSDs and you have never tuned these values, the planner is overweighting the cost of random IO. It will choose sequential scans over index scans more often than it should, because it thinks jumping around on disk is four times as expensive as reading linearly.

This is why a plan can look ideal in EXPLAIN output but still be the wrong choice. The planner picked it because it won the cost comparison based on parameters that do not reflect your actual hardware. Wall-clock numbers in EXPLAIN ANALYZE might even confirm this, since you are running it on warm cache anyway.

Tuning random_page_cost down to 1.1 or 1.5 on SSD storage is one of the most impactful PostgreSQL configuration changes you can make, and it is almost never done by default. Once you change it, the planner starts preferring index scans in cases where it previously avoided them, and you will often see query times drop without changing a single line of application code.

What to Actually Look At

EXPLAIN ANALYZE is still the right tool. You just need to know what you are looking for.

Always run it with BUFFERS: EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT). This gives you the full picture of cache hits versus disk reads, which is the context you need to interpret the timing.

Look for estimated vs actual row divergence at every node in the plan. A factor of ten is worth investigating. A factor of a thousand is a problem. When the estimates are wildly off, the plan shape itself might be wrong, and no amount of index tuning will fix a bad join strategy.

Be skeptical of timing on a warm database. For a query that might be running cold in production, look at the read count in BUFFERS as your real signal instead of the wall-clock time.

Compare plans across environments carefully. A plan that uses an index scan on your dev machine with 50,000 rows might become a sequential scan in production with 50 million rows, because the planner decides a full scan is cheaper past a certain selectivity threshold. The plan you verified on staging is not necessarily the plan running in production.

EXPLAIN ANALYZE is a flashlight in a dark room. It shows you what is directly in the beam. The warm cache problem, the stale statistics problem, the hardware assumption problem -- none of those are in the beam. You have to know to look for them separately.

The query I deployed that night was not slow in any environment I had tested it in. It was slow for the first user who hit it in production, on a cold table, with a plan the planner chose based on estimates that had not been refreshed in weeks. EXPLAIN ANALYZE showed me a successful result every time I ran it, because every time I ran it, I was the second person to ask that question. The first person had already paid the cost.

Share
X LinkedIn HN
UI

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.

👁 0 6 min read

Comments (0)