// Posted by Umur Inan

// Category Tools

// Posted on May 26, 2026

Your AI Agent Isn't Broken. Your Evals Are.

Your AI agent isn't broken in some mysterious way. You just don't have evals. Why 'works on the demo' is the most expensive sentence in your AI roadmap.

"My agent doesn't work"

The most common thing I hear from teams shipping AI features is "the agent worked great in the demo, but in production it's a mess." They want help debugging the agent. They want me to look at the prompts. They want to talk about temperature, sampling, model choice.

None of that is the problem. The problem is that they have no evals. They are debugging by feel.

What "evals" actually means

An eval is a test case for AI behavior. You give the agent an input, you check whether the output matches what you wanted, you grade it. That is the whole concept. Doing it at the scale and rigor the system actually needs is where the work hides.
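In code, the concept fits in a few lines. This is a minimal sketch, not a framework: `EvalCase`, `grade`, and the stand-in `toy_agent` are hypothetical names I made up for illustration, and a substring check is only the simplest possible grading rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One test case: an input plus what we expect to see in the output."""
    case_id: str
    prompt: str
    must_contain: str  # simplest possible expectation: a required substring


def grade(case: EvalCase, agent: Callable[[str], str]) -> bool:
    """Run the agent on one case and return True if the output passes."""
    output = agent(case.prompt)
    return case.must_contain.lower() in output.lower()


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call (hypothetical, for illustration only)."""
    if "refund" in prompt.lower():
        return "You are eligible for a refund of $20."
    return "I'm not sure."


cases = [
    EvalCase("refund-1", "Can I get a refund for order 123?", "refund"),
    EvalCase("hours-1", "What are your opening hours?", "9am"),
]

# Map each case id to pass/fail.
results = {case.case_id: grade(case, toy_agent) for case in cases}
```

Swap `toy_agent` for a function that calls your real agent and the rest stays the same.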

The same engineers who would never ship a payment endpoint without integration tests will happily ship an agent with zero structured tests. The reason is that AI output is nondeterministic and feels harder to test. It is harder. Not by as much as people think.

The five tiers of AI evals

From worst to best, this is the maturity ladder I see in actual companies:

Tier 0: Vibes. The PM tries the demo at standup. If it doesn't feel weird, you ship. Most teams are here.
Tier 1: Smoke tests. A handful of golden examples in a notebook. Run before each deploy. Catches obvious regressions.
Tier 2: Regression suite. Hundreds of cases in a versioned dataset, with expected outputs or graded rubrics. Run in CI. Catches subtler regressions.
Tier 3: LLM-as-judge. Cases without single right answers (summarization, reasoning, multi-step) get graded by another model against a rubric. Cheaper than human labeling, and good enough when the question is whether one version beats another.
Tier 4: Production logging + replay. Every real production conversation gets logged, tagged, and replayable. New model versions get scored against last week's actual traffic before they ship.

Most teams shipping production AI sit at Tier 0 or Tier 1. They genuinely believe they are at Tier 3 because they have a few prompts saved in a Notion page. They are not.
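To make Tier 3 concrete, here is a rough sketch of LLM-as-judge. The judge model is passed in as a plain callable so the example assumes no particular vendor API. The rubric wording, the `SCORE: <n>` reply format, and the names `build_judge_prompt`, `parse_score`, `judge`, and `fake_judge` are all assumptions of mine.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; a real one should be tuned to your own task.
RUBRIC = (
    "Score the answer from 1 to 5.\n"
    "5 = accurate, complete, and concise; 1 = wrong or unhelpful.\n"
    "Reply with a line of the form 'SCORE: <n>'."
)


def build_judge_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    """Assemble the text sent to the judge model."""
    return f"{rubric}\n\nQuestion:\n{question}\n\nAnswer to grade:\n{answer}\n"


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Optional[int]:
    """Grade one answer using whatever model `call_model` wraps."""
    return parse_score(call_model(build_judge_prompt(question, answer)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "Reasoning: covers the main points.\nSCORE: 4"


score = judge("Summarize the ticket.", "Customer wants a refund.", fake_judge)
```

Returning `None` for an unparseable reply is deliberate: a judge that ignores the format should show up as a grading failure, not as a silent zero.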

What good evals look like

The eval set you need depends on the agent, but the shape is consistent:

Versioned. The dataset has a Git history. You can answer "what did our agent get right two months ago that it gets wrong now?"
Tiered by difficulty. Easy cases (the agent should never fail these), medium cases (improvement frontier), hard cases (research targets).
Adversarial. Prompt injection attempts, ambiguous inputs, conflicting context, role-confusion attacks. If you don't have these, you don't have an eval suite, you have wishful thinking.
Graded per step, not just end-to-end. If the agent has five tool calls, each step needs its own correctness signal. End-to-end success hides a lot of partial failures.
Tracked with cost and latency. Correctness alone is half the picture. An agent that gets the right answer in 90 seconds and $0.40 of tokens is broken even if it's correct.
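The last two properties are the ones that most often go missing, so here is one way they could look in a result record. Treat it as a sketch under assumptions: the field names, the `evaluate` function, and the default budgets (30 seconds, $0.10) are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepResult:
    name: str      # e.g. the tool that was called
    correct: bool  # did this step do the right thing?


@dataclass
class RunResult:
    case_id: str
    difficulty: str  # "easy", "medium", or "hard"
    steps: List[StepResult]
    final_correct: bool
    latency_s: float
    cost_usd: float


def evaluate(run: RunResult, max_latency_s: float = 30.0,
             max_cost_usd: float = 0.10) -> Dict[str, object]:
    """Combine per-step correctness with latency and cost budgets."""
    failed_steps = [s.name for s in run.steps if not s.correct]
    within_budget = run.latency_s <= max_latency_s and run.cost_usd <= max_cost_usd
    return {
        "case_id": run.case_id,
        "final_correct": run.final_correct,
        "failed_steps": failed_steps,
        "within_budget": within_budget,
        # A run passes only if the answer, every step, and the budget all hold.
        "passed": bool(run.final_correct and not failed_steps and within_budget),
    }


# Right answer, but 90 seconds and $0.40: fails on budget.
slow = evaluate(RunResult(
    "refund-7", "easy",
    [StepResult("lookup_order", True), StepResult("compute_refund", True)],
    final_correct=True, latency_s=90.0, cost_usd=0.40,
))

# Right final answer, but one step went wrong along the way.
partial = evaluate(RunResult(
    "refund-8", "medium",
    [StepResult("lookup_order", True), StepResult("compute_refund", False)],
    final_correct=True, latency_s=2.0, cost_usd=0.01,
))
```

Both runs end with a correct final answer, and both fail the eval: `slow` blows the budget, and `partial` has a bad intermediate step that end-to-end grading would have hidden.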

The dirty secret

Most teams I've worked with don't have a versioned eval set. They have screenshots in Slack. They have a Notion page titled "Test cases" that nobody opens. They have a vague sense that things are getting better because the founder said the new prompt felt better at the demo.

When something regresses (and it will, because every model update carries a small chance of a big regression), they cannot tell. They notice when customer complaints spike. They notice when a sales call goes badly. They never catch it before it leaks out, because the only eval is the customer.

Build vs buy

The vendor landscape for eval tooling is now reasonable: Braintrust, LangSmith, Helicone, Arize, Phoenix. Each has its sharp edges, but all of them give you the basic shape: dataset versioning, run history, side-by-side comparison, LLM-as-judge integration.

If you have one engineer who can spend three days, build the first version yourself. A JSON file of cases, a script that runs the agent against each, a CSV that records outputs and grades. That is enough to leave Tier 0. You will move to a vendor or a richer in-house tool when your eval suite outgrows the script.
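The homemade version can be roughly this small. This is a sketch under assumptions: the JSON schema (`id`, `input`, `expected`), the CSV columns, and the `echo_agent` stand-in are all hypothetical. The cases are inlined and the CSV goes to an in-memory buffer so the example runs as-is; a real script would read and write files instead. Token cost is left out because how you obtain it depends on your provider.

```python
import csv
import io
import json
import time
from typing import Callable, Dict, List, TextIO

# In a real setup this would live in a versioned file such as cases.json.
CASES_JSON = """
[
  {"id": "refund-1", "input": "Can I get a refund?", "expected": "refund"},
  {"id": "greet-1", "input": "Hello", "expected": "hi there"}
]
"""

FIELDS = ["id", "input", "output", "elapsed_s", "passed"]


def run_suite(cases: List[Dict[str, str]], agent: Callable[[str], str],
              out: TextIO) -> int:
    """Run every case, write one CSV row per case, return the number passed."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    passed = 0
    for case in cases:
        start = time.perf_counter()
        output = agent(case["input"])
        elapsed = time.perf_counter() - start
        # Simplest possible grade: expected text appears in the output.
        ok = case["expected"].lower() in output.lower()
        passed += int(ok)
        writer.writerow({
            "id": case["id"],
            "input": case["input"],
            "output": output,
            "elapsed_s": f"{elapsed:.3f}",
            "passed": ok,
        })
    return passed


def echo_agent(prompt: str) -> str:
    """Stand-in for the real agent (hypothetical)."""
    return f"Echo: {prompt}"


buffer = io.StringIO()
n_passed = run_suite(json.loads(CASES_JSON), echo_agent, buffer)
```

Commit the cases file and each run's CSV to the same repository as the prompts, and you get dataset versioning and run history for free.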

The mistake is to skip the homemade version and wait for the perfect vendor. The vendor will not arrive. Or it will, and you will not know which features matter to you, because you have never run an eval.

A short war story

I worked with a team that had a customer support agent. The agent gave great responses on the test cases they had. After the model provider released a minor version update, the same agent started refusing to give refund estimates. The team thought the system was broken. We had recently set up an eval suite with 80 cases, including 12 about refund logic. We reran it against the old model and the new model side by side.

Old model: 11 of 12 refund cases passed. New model: 3 of 12. Same prompt. Same temperature. Same tools.

The new model had picked up a more conservative refusal stance during training. Nothing in the changelog mentioned it. Without the eval, we would have spent days re-prompting before suspecting the model itself. With the eval, we had the answer in twenty minutes.
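A side-by-side comparison like that needs very little code once each run is a mapping from case id to pass/fail. The sketch below uses made-up toy data, not the team's actual results, and the function names and the tag scheme are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pass_counts_by_tag(results: Dict[str, bool],
                       tags: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Return {tag: (passed, total)} for one run."""
    counts = defaultdict(lambda: [0, 0])
    for case_id, ok in results.items():
        tag = tags[case_id]
        counts[tag][0] += int(ok)
        counts[tag][1] += 1
    return {tag: (p, t) for tag, (p, t) in counts.items()}


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Case ids that passed on the baseline but fail on the candidate."""
    return sorted(
        case_id for case_id, ok in baseline.items()
        if ok and not candidate.get(case_id, False)
    )


# Toy data for illustration only.
tags = {"r1": "refund", "r2": "refund", "r3": "refund", "s1": "shipping"}
old_model = {"r1": True, "r2": True, "r3": False, "s1": True}
new_model = {"r1": False, "r2": False, "r3": False, "s1": True}

old_counts = pass_counts_by_tag(old_model, tags)
new_counts = pass_counts_by_tag(new_model, tags)
broken = regressions(old_model, new_model)
```

Grouping by tag is what makes the problem jump out: one category collapses while the others hold steady, which points at the model rather than at the prompt as a whole.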

That is the experience that converts a team to caring about evals. Until it happens, it sounds like overhead.

What to build first

If you have an agent in production and no evals, here is the order:

Pick 20 representative inputs from your actual production logs. Real user inputs, not made-up ones. Five easy, ten medium, five hard.
For each, write the output you expect, or the rubric you'd grade against if there's no single right output.
Write a script that runs the agent against each and dumps inputs, outputs, elapsed time and token cost into a CSV.
Manually grade the CSV. Repeat with each prompt change or model change. Diff against the last run.
When the manual grading becomes a bottleneck (around 100-200 cases), introduce LLM-as-judge with a rubric you've tuned against your manual grading.
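For the last step, "tuned against your manual grading" can be checked with a simple agreement rate between the judge's verdicts and your own. This is a minimal sketch on toy data; `agreement_rate` and `disagreements` are hypothetical helper names, and what counts as "good enough" agreement is a call you have to make for your own product.

```python
from typing import Dict, List


def agreement_rate(manual: Dict[str, bool], judged: Dict[str, bool]) -> float:
    """Fraction of shared cases where the judge matches the human grade."""
    shared = manual.keys() & judged.keys()
    if not shared:
        raise ValueError("no cases graded by both the human and the judge")
    return sum(manual[c] == judged[c] for c in shared) / len(shared)


def disagreements(manual: Dict[str, bool], judged: Dict[str, bool]) -> List[str]:
    """Case ids to review by hand when refining the rubric."""
    return sorted(c for c in manual.keys() & judged.keys() if manual[c] != judged[c])


# Toy grades for illustration only.
manual_grades = {"c1": True, "c2": False, "c3": True, "c4": True}
judge_grades = {"c1": True, "c2": True, "c3": True, "c4": True}

rate = agreement_rate(manual_grades, judge_grades)
to_review = disagreements(manual_grades, judge_grades)
```

The disagreement list matters more than the rate itself: each entry is either a rubric gap or a manual grading mistake, and both are worth fixing before you trust the judge on cases nobody reads.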

That is six engineering hours from zero to a working eval pipeline. Less than the time you'll spend the next time a model update breaks your agent in a way you cannot diagnose. Stop telling yourself your agent is broken. Build the thing that tells you whether it actually is.

AI, LLM, Evals, Testing, Production

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
