It was a Tuesday morning. A report query that normally finished in 40 milliseconds was taking 4 seconds. No deploy the night before. No traffic spike. No schema change. The same query against the same data, ten thousand times slower than yesterday.
We checked the obvious things. The plan had not changed. The indexes were still there. CPU was normal. Disk IOPS were normal. The only thing that looked wrong was the table size on disk, which had grown by about a gigabyte overnight on a table where nothing should have been inserted at that rate.
The culprit was a psql session in a forgotten tmux window. Someone had opened it the previous afternoon, typed BEGIN, run a few SELECTs while debugging, and walked away. Seventeen hours later, the connection was still there, still in a transaction, still doing nothing. And because of it, the database had been holding on to every dead row in the cluster for seventeen hours.
Why MVCC Hangs On To Old Rows
PostgreSQL is MVCC, multi-version concurrency control. Every row update does not actually update a row. It writes a new version of the row and marks the old version as obsolete. Old versions sit there, invisible to new transactions but still on disk, until something cleans them up.
That something is autovacuum. It cannot clean up immediately because of the entire point of MVCC: a transaction that started before the update needs to see the old row, not the new one. Until every transaction that could see the old row has finished, the old row has to stay.
Postgres tracks this with the xmin horizon. The horizon is the smallest transaction ID still active anywhere in the cluster. Any row version older than the horizon can be safely reclaimed. Any row version newer than the horizon cannot, because some transaction might still need it.
This is fine when transactions are short. The horizon advances every time the oldest open transaction finishes. Autovacuum reclaims tuples behind the horizon. The system stays clean.
It is not fine when one transaction stays open for hours. That single transaction's xmin pins the horizon. Nothing newer than it can be reclaimed, anywhere in the cluster. Every dead tuple created in the meantime piles up. Tables grow. Indexes grow. Queries slow down. The database is doing exactly what it is supposed to do, and it is killing you.
What Autovacuum Cannot Do
Autovacuum is not lazy. It is running. It is just not allowed to reclaim what it would normally reclaim.
If you query pg_stat_user_tables during one of these incidents, you will see autovacuum running on the right tables. You will see n_dead_tup climbing. You will see last_autovacuum timestamps that are recent. The vacuum is doing work. None of that work has an effect, because every dead tuple it inspects is still newer than the xmin horizon and cannot be removed.
That part confuses people the first time. They see autovacuum running and assume cleanup is happening. They tune autovacuum_naptime and autovacuum_vacuum_cost_limit looking for the knob that will save them. None of those knobs help. The knob they need is the one that closes the long transaction.
You can verify the xmin horizon directly:
SELECT
pid,
state,
age(backend_xmin) AS xmin_age,
now() - xact_start AS xact_age,
query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;The top row is your problem. If xmin_age is large and xact_age is hours, that is the connection holding the cluster back.
The Symptoms You Actually See
Bloat shows up everywhere downstream, in ways that do not obviously point at a long transaction.
Tables grow on disk faster than the row count grows. A table with a million logical rows might be storing five million physical rows, four of which are dead. Sequential scans read every one of them. The optimizer's row estimates start to drift, because pg_class.reltuples is approximate and gets recalculated during vacuum.
Indexes bloat the same way. An index that should fit in shared buffers spills to disk. Index scans that should be fast become medium. The optimizer notices, and at some bloat threshold, flips to a sequential scan. The query that was instantaneous yesterday is now reading the entire table.
Replication lag spikes if hot_standby_feedback is on. That setting lets replicas tell the primary not to reclaim rows their queries still need. It is on for a reason. A side effect: a long-running query on a replica pins the xmin horizon on the primary, and the bloat shows up on the writer.
The pattern: everything is slightly worse than it was, in ways that do not correlate to anything you changed. The cause is invisible unless you look at pg_stat_activity with the right filter.
Where Long Transactions Come From
The forgotten psql session is the headline cause but not the most common. In production, the usual sources are:
- An application connection returned to the pool mid-transaction. The pool hands it out to the next caller. That next caller does not know it inherited an open transaction. So the connection sits idle in transaction until something forces a rollback. Connection poolers with strict transaction-level pooling avoid this. Session pooling does not.
- A worker that takes a connection, runs a query, then does external work (an HTTP call, a Kafka publish, a file upload) before committing. If the external work is slow or hangs, the transaction is open the whole time.
- An analytics query that takes an hour. Its implicit transaction pins the horizon for an hour. Data team does not see a problem. OLTP team sees a slow morning.
- A two-phase commit prepared but never committed.
pg_prepared_xactswill show it. Until you commit or roll back, that prepared transaction holds locks and pins the xmin horizon forever. Across restarts. - An app process that died mid-transaction. The TCP connection might stay half-open for the OS's keepalive interval, which is twenty minutes or more by default on Linux. During that twenty minutes, Postgres still thinks the transaction is alive.
Different sources, same symptom. The xmin horizon is pinned to whoever has been holding their transaction the longest.
How To Find The Offender
The query above is the diagnostic one. The triage version, when you have a known incident:
SELECT pid, state, now() - xact_start AS age, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
ORDER BY xact_start
LIMIT 10;Anything with an age over a few minutes is suspicious. Anything with an age over an hour is your problem. Kill it:
SELECT pg_terminate_backend(<pid>);pg_terminate_backend sends SIGTERM to the backend process. The connection drops, the transaction rolls back, the xmin horizon advances. Autovacuum can finally do its job.
If you cannot identify which application owns the connection, pg_stat_activity has application_name, client_addr, and usename. Set application_name from your app on connect. It is one line of config. It is the difference between knowing something somewhere has a stuck transaction and knowing the report worker on a specific host is the problem.
Killing It Is Not Enough
Terminating the backend stops the bleeding. It does not heal the wound.
The bloat is still there. Autovacuum will eventually reclaim it, but it works incrementally and respects autovacuum_vacuum_cost_limit. For a table that grew from 2 GB to 20 GB during the incident, full cleanup might take days. During that time, queries are still slower than they should be.
You have three options.
VACUUM FULL rewrites the table without dead tuples. It is fast (proportional to live data, not dead) and reclaims everything. The cost is that it takes an ACCESS EXCLUSIVE lock. No reads, no writes, no DDL on that table until it finishes. For a 20 GB table on a hot OLTP system, this is not an option.
pg_repack rewrites the table the same way, online. It creates a shadow table, replays writes against both, and swaps. The cost is disk space (you need 2x the table size during the rewrite) and operational complexity (it is an extension, it can fail in inconvenient ways). For most cases where you cannot take downtime, this is the answer.
Indexes bloat too, and ordinary VACUUM does not shrink them. Once the table is healthy, REINDEX CONCURRENTLY rebuilds the index in the background without blocking. Run it on the indexes that show high bloat ratios via pgstattuple or your favorite bloat estimation query.
The Defaults That Save You
The defaults that prevent this are the ones every Postgres install should set on day one.
idle_in_transaction_session_timeout kills any connection that sits idle in a transaction for too long. Set it to something like 5 minutes. The application sees an error, retries, moves on. The horizon never gets stuck.
idle_in_transaction_session_timeout = 5minstatement_timeout caps any single statement. Set per-role so OLTP workloads get a low cap (a few seconds) and analytics workloads get a higher one (minutes). A runaway query stops being a runaway query.
log_min_duration_statement logs every query slower than the threshold. Set to something useful (a few hundred milliseconds, then tune). When the slow query is in the logs, you have a fighting chance of finding the long transaction that caused it.
Monitor the oldest xact_start from pg_stat_activity. One Prometheus metric, one alert when it crosses a threshold. The alert fires before the bloat compounds, not after the queries are already slow.
None of these are aggressive, and none break anything that should be running. They put a ceiling on the worst-case behavior of every connection in the cluster.
The smallest transaction in your system sets the ceiling on how much memory your database has to keep. A connection idle in transaction for an hour is telling Postgres to retain an hour of garbage. Postgres will do it. Postgres is good at doing what you tell it.
The whole problem is that nobody knows they are telling it that. The BEGIN someone typed at 4pm yesterday is still in effect at 9am today, and the database is paying for the difference.
Comments (0)