The demo that fooled everyone
Every pgvector demo I have ever seen runs on a few hundred rows. CREATE EXTENSION vector. ALTER TABLE products ADD COLUMN embedding vector(1536). Insert two hundred test documents. Run a similarity query. Get tens-of-milliseconds latency. Write a tweet about how Postgres just ate the vector database market.
I built one of these demos for an internal review. Worked beautifully. Two months later we had real data. The same query took eleven seconds.
Where it falls over
pgvector with no index does a sequential scan. It compares the query vector to every row's embedding. At a few hundred rows this is fine. At a million rows it is not. At ten million rows it is unacceptable.
The fix is to add an index. The question is which one. pgvector ships two index types: IVFFlat and HNSW. Approximation is the rule for both. Each has parameters that matter. And each will mislead you if you copy the defaults from a tutorial.
IVFFlat is gentle but stale
IVFFlat groups vectors into lists clusters at build time. At query time, it scans the closest probes clusters. The recommendation in every tutorial is lists = sqrt(N) for under a million rows, and lists = N / 1000 beyond that. For ten million rows that is ten thousand lists.
Build time is fast. Query time is fast. Recall is decent. So far so good.
The catch is what happens when you insert. IVFFlat builds the cluster centroids once from the data that existed at index-creation time. New vectors are assigned to whichever existing centroid is closest. If your data distribution shifts (which it will, because the embedding model has biases and your corpus has trends), new vectors pile up in clusters they do not belong in. Recall drops month over month. The only fix is to drop and rebuild the index.
For a static corpus, IVFFlat is fine. For anything with a write stream, it is a maintenance liability you have to schedule around.
HNSW is better, and costlier
HNSW (Hierarchical Navigable Small World) builds a layered graph where each vector has links to its neighbors. Queries traverse the graph from a top-layer entry point downward. Recall is higher than IVFFlat. Inserts update the graph incrementally, so new vectors find their place naturally.
The two parameters that matter: m (the maximum number of links per node, usually 16 or 32) and ef_construction (the breadth-first search size during construction, usually 64 to 200).
Three things to know before you run CREATE INDEX:
- Build time is long. On a single Postgres instance, ten million rows with
m=16, ef_construction=64can take several hours. Crankmaintenance_work_memas high as the box can take or you will be there overnight. - Storage is bigger. Roughly two to three times the raw vector data. For ten million 1536-dimensional float32 vectors, that is around 150 GB of index on top of the 58 GB of raw vectors.
- Concurrent inserts get slow. Each insert traverses the graph. Under high write volume, your insert throughput drops and your write transactions hold locks longer than you expect.
The storage math at 10M rows
Run the numbers before you sign up for the bill. Ten million vectors, OpenAI text-embedding-ada-002 dimensions (1536), default float32 storage:
- Raw vector column:
10,000,000 × 1536 × 4 bytes= ~58 GB - HNSW index on top: ~150 GB
- Other table data, WAL, indexes on other columns: easily another 20-40 GB
You are now north of 200 GB of Postgres storage that did not exist before pgvector. RDS or CloudSQL will bill you for it. Worse, you want the working set in RAM, which means upgrading the instance class. The db.r6i.large you started with is no longer the right shape.
The quantization escape
pgvector 0.7 added two storage types that change the math. halfvec uses float16 instead of float32 (16 bits per dimension), cutting raw storage in half with a small recall loss. bit uses one bit per dimension for fully binary vectors.
The serious pattern is binary quantization with rerank. You store both: a binary bit column for fast first-pass search across the whole corpus, and the full vector or halfvec column for reranking the top candidates. The first pass narrows ten million rows down to a few thousand in milliseconds. The second pass reranks those thousands with exact distances. Latency stays low, storage cost drops, recall stays high.
Most production pgvector setups I see in 2026 are running halfvec by default, with binary quantization layered on top when scale crosses the line where halfvec alone is not enough.
The write-rate problem
HNSW handles writes, but not at the rates a busy product expects. If you are ingesting thousands of new vectors per second (which is normal for a logging or messaging app indexing user content), the index becomes a bottleneck and your inserts back up.
Two patterns to consider:
- Batched build. Write new vectors to a staging table without indexing. Periodically merge them into the main table during a maintenance window, rebuilding or extending the HNSW index in batch. Latency on freshly-written rows is higher (they are searchable only after the batch), but throughput stays sane.
- Two-tier setup. Recent vectors live in a small, fully-indexed hot table. Older vectors live in a large, batch-built cold table. Queries union the two. Most reads only touch the hot table.
Both patterns add operational complexity. They are still simpler than running a separate vector database.
When you do not need pgvector at all
Under 100k vectors, no index. Just use Postgres. Sequential scan is fast enough and you save the indexing complexity.
Above 50M vectors with high write rate and tight latency SLAs, a dedicated vector database (Qdrant, Weaviate, or a managed service) starts to make sense. The operational overhead of running it pays for itself in tighter performance characteristics.
The sweet spot for pgvector is between those two: 100k to 10M-ish vectors, mixed read-write, where keeping vectors next to your relational data simplifies a lot of joins and avoids a second system in the architecture. That is most apps that have shipped AI features in the last two years.
What I actually run now
On the production system I tune most often:
halfvecfor storage. The recall hit is below my measurement noise.- HNSW with
m=16, ef_construction=64. Anything higher than 64 stops paying for itself in my tests. maintenance_work_memset to 8 GB during index builds, dropped back to default after.- A weekly batch job that rebuilds the index on a replica, promotes the replica, demotes the old primary. Avoids the build window stalling writes.
- Recall monitoring at the application layer. I keep a ground-truth set of 200 query-document pairs and run them through the live index on a schedule. If recall drops more than 5%, I get paged.
None of this was in the demo. The demo was four lines of SQL. Production was a quarter of an engineering month and a Postgres bill that has its own line in the invoice. pgvector earns the choice, but it earns it. Walk in knowing the numbers.
Comments (0)