The answer was confident, fluent, and made up
A user asked our internal docs assistant a simple question: what is the refund window on annual plans? It answered without hesitation. "Annual plans can be refunded within 30 days, and the cancellation takes effect at the end of the billing period." It cited two source documents. It read exactly like the rest of our help center.
There is no 30-day refund on annual plans. There never was. The real policy is a 14-day refund on monthly plans, and annual plans are non-refundable but cancellable. The model had taken the "30 days" from a document about the cancellation window, the word "refund" from a document about monthly billing, and welded them into a policy that did not exist. Support honored that invented policy for two weeks before someone in finance noticed the refunds going out.
Nobody caught it sooner because the answer had every marker of being correct. It was fluent. It cited sources. It matched the tone of the real docs. Retrieval worked. Generation worked. The system still lied.
Retrieval succeeded. Grounding failed.
When I pulled the trace, the retriever had done its job by every metric we were watching. The top chunks came back with high cosine similarity. The cancellation-window chunk and the monthly-refund chunk both scored near the top, because "refund," "cancel," "window," and "billing period" all sit close together in embedding space.
That is the trap. Similarity measures whether two pieces of text are about the same topic. It does not measure whether a piece of text answers the question. Those are different questions, and a retriever only knows how to answer the first one. So it handed the model two passages that were on-topic and individually true, sitting next to each other, and the model did precisely what we asked of it. It summarized the context faithfully. The context was two half-truths.
Chunking is where the truth gets cut in half
We chunked every document at a fixed 512 tokens with no regard for structure. The code was as blunt as it sounds.
def chunk(text, size=512, overlap=0):
tokens = tokenizer.encode(text)
return [
tokenizer.decode(tokens[i:i + size])
for i in range(0, len(tokens), size)
]The refund policy lived in a small table. Its header row said "Refund policy." Row one was monthly, 14 days. Row two was annual, none. That 512-token boundary fell between the two body rows, so the chunk the retriever scored highest carried the header, the word "refund," and "14 days," but not the row that said annual plans get nothing. A model handed half a table will cheerfully infer the rest.
The fix is to chunk on structure instead of token count. Keep atomic units whole: a table is one chunk, a list item keeps its heading, a rule never gets split from its exception. Add overlap so anything straddling a boundary survives in both neighbors.
def chunk_structured(blocks, max_size=512, overlap=64):
# block = one heading, paragraph, table, or list item
chunks, current = [], []
for block in blocks:
if token_len(current) + token_len(block) > max_size and current:
chunks.append(join(current))
current = current[-overlap:] # carry the tail into the next chunk
current.append(block)
if current:
chunks.append(join(current))
return chunksThis is not flawless. Long tables still need a size cap, and some documents have no clean structure to chunk on. But it stopped cutting tables in half, and that single change removed a whole category of "the model only saw part of the rule" failures.
You tuned recall when you needed precision
My first instinct when retrieval feels wrong is to grab more of it. We pushed k from 4 to 12. Recall climbed, and so did the hallucinations. More chunks means more chances that one of them is topically close but factually irrelevant, and the model treats every chunk in its context as fair game. At k=4 it fused two documents. At k=12 it had six passages that disagreed with each other and confidently blended the most fluent ones.
Here is the counterintuitive part. For grounded question answering, precision matters more than recall. Three exactly-right chunks beat the right chunk buried under eleven plausible distractors. So we went the other way. We dropped k back to 4, added a reranker (a cross-encoder that actually reads the query against each candidate instead of comparing two embeddings), and discarded anything under a relevance threshold. Fewer chunks, higher quality, fewer invented policies.
None of this was real until we wrote the eval set
Now the uncomfortable admission. Every fix I just described, the structural chunking, the reranker, the smaller k, I have been narrating as if we knew each one helped. We did not. Not at first. We were tuning by feel. Ship a change, eyeball five queries, declare victory. The chunking fix might have solved the refund bug and quietly broken twenty other answers, and we would have had no idea.
The change that mattered more than any retrieval tweak was a labeled eval set. A hundred real questions, each paired with the answer we expected and the source chunk that answer should come from. From that you get two numbers. Retrieval hit-rate: did the correct chunk land in the top-k? Faithfulness: is the generated answer actually supported by the retrieved context, graded by a second model or a human?
The hit-rate half is a few lines over pgvector.
def retrieval_hit_rate(eval_set, k=4):
hits = 0
for case in eval_set:
q_emb = embed(case["question"])
rows = db.execute(
"SELECT chunk_id FROM chunks ORDER BY embedding <=> %s LIMIT %s",
(q_emb, k),
)
retrieved = {r["chunk_id"] for r in rows}
if case["gold_chunk_id"] in retrieved:
hits += 1
return hits / len(eval_set)pgvector's <=> operator is cosine distance, so smaller is closer. Faithfulness is the harder score and the one that catches this exact bug: take the generated answer and the retrieved chunks, and ask a grader whether every claim in the answer is supported by those chunks. A confident, wrong, well-cited answer fails faithfulness even when hit-rate looks healthy. That is the refund bug, caught by a number.
Once those two numbers existed, the work stopped being guesswork. The reranker moved hit-rate from 0.71 to 0.89. Dropping k from 12 to 4 raised faithfulness and left hit-rate flat. Every change became falsifiable. It either moved a number or it did not, and the ones that did not, we threw away instead of shipping on a hunch.
RAG relocates hallucination. It does not remove it.
The pitch for retrieval-augmented generation is that grounding the model in your own documents stops it from making things up. What actually happens is quieter. The hallucination moves. It travels from "the model invents a fact out of nothing" to "the model faithfully summarizes the wrong context you handed it." The second kind is more dangerous, because it arrives with citations and reads like the truth.
Your model is only as honest as your retrieval. Your retrieval is only as good as the eval set you keep finding reasons not to write. So write it. It is the one piece of this whole pipeline that tells you whether anything else you did was real.
Comments (0)