The Slack message at 9:14 AM Monday
"Did anyone authorize a $39,847 charge on our Anthropic account this weekend?"
Three people looked at the message. Three people opened the Anthropic console. The number on the dashboard read $39,847.20 for the trailing 60 hours. Friday 6 PM to Monday 9 AM. Nobody had touched the system over the weekend. The team had been at a wedding.
That weekend ended up costing us about half of a junior engineer's annual salary. The reason was small, the lesson was big, and this post is the one I wish someone had written for me before we shipped.
What the dashboard showed
Hourly tokens, charted Friday through Monday, looked like a step function. Flat through Friday afternoon. A small bump at 6:17 PM. Then a nearly vertical climb from 7 PM Friday to 4 AM Saturday. Then a flat line at the API's per-minute rate limit ceiling for the next 53 hours.
The rate limit was the only thing keeping us from a six-figure bill.
Root cause: a retry loop, and three missing guardrails
Friday afternoon a dev had pushed a change to our document-processing pipeline. Documents enter, the agent reads, the agent decides what to do. One step in the pipeline could throw a validation error if the document was malformed. The previous version of the agent would catch the error and skip the document. It retried instead, without bounds.
The retry loop had no max attempts. The malformed document was permanent, so every retry produced the same error. Each retry was a fresh API call, fresh prompt, fresh response. The agent was a long-context one (Sonnet), and the prompt was 4,200 tokens. Each retry was about $0.02. Twenty per second, sustained, for 53 hours.
The retry loop was the bug. The reason it became a $40k bug was the three things we did not have:
- A spend cap on the API key. Anthropic supports them. We had not set one.
- An alert on hourly spend. Our cost dashboard polled daily.
- A circuit breaker around the agent call. If the same agent threw the same error 100 times in 5 minutes, the call should stop. It did not.
One of those three would have ended the incident in minutes. We had none of them.
How teams burn LLM money
After the weekend, I read the post-mortem write-ups from every team I could find that had a public LLM-cost incident. The categories that came up most:
- Unbounded retries. Ours. The most common pattern.
- Conversation context bloat. A chat agent that appends every previous message to the next call. By message 40 the context is 100k tokens. Cost scales linearly with conversation length.
- Cached embedding miss. A team had a cache, the cache was supposed to dedupe embedding calls, the cache key was the wrong shape. Every page load regenerated all embeddings.
- Recursive agent calls. Agent A calls Agent B. Agent B calls Agent A. The exit condition has a bug. The loop terminates when the model finally hallucinates the keyword "DONE."
- Dev environment with the prod key. A test run iterates over 10,000 fixtures using a real API key. The fixtures were synthetic but the dollars were real.
- Streaming that does not actually stream. The client appears to stream but the server pre-buffers the whole response. The user closes the page; the server keeps generating until completion. Tokens billed for output nobody saw.
- No model routing. Every call goes to Opus or Sonnet, including the ones that could have been Haiku at one-fifteenth the cost. The team never measured per-route cost.
- Forgot to use prompt caching. A long system prompt repeated on every call. With Anthropic's prompt cache, that gets billed at one-tenth the rate on cache hits. Without it, full price every time.
Each one of these is preventable. Together they account for the vast majority of LLM-cost incidents I have seen in the wild.
The five-line policy we should have had
None of the fixes are clever. All of them are boring. That is why I keep them on a checklist that gets applied to every new agent or LLM feature before it touches prod:
- Spend cap on the API key. Set it to 2x your forecasted daily spend. Anthropic, OpenAI, and most providers offer this in the console. Five minutes.
- Per-user rate limit at your gateway. Not at the provider, at your own layer. So one user (or one bot, or one runaway retry loop) cannot consume the team's entire budget.
- Circuit breaker on repeated errors. If the same prompt produces the same error 50 times in 10 minutes, stop calling. Page someone. The fix is rarely "keep trying."
- Hourly cost alert, not daily. An hourly check would have woken someone at 7:30 PM Friday. A daily check found the carnage Monday morning.
- Per-feature cost attribution. Tag every API call with the feature that made it. You cannot fix what you cannot attribute. The team that had the recursive-agent bug found it in 20 minutes because their dashboard showed "Feature: scheduler" at 800% of forecast.
That is the entire policy. It is short on purpose. The reason most LLM-cost incidents happen is not that this list is wrong. The list simply does not exist in the team.
What happened after
Anthropic forgave the bill. They were good about it. I do not assume they will be good about it twice, and I do not want to find out. The bill becoming a non-event was the kind of luck you should not plan around.
We spent a calm hour that Monday writing the five-line policy and adding the spend cap. The cap is set to $400 per day, which is more than 10x our normal usage. Two months later, we have not hit it once. The number is high on purpose: a normal week stays well below, an anomalous burst hits the cap before it hits a Slack message.
The category mistake
The reason LLM cost feels different from regular cloud cost is that the unit is invisible. You do not see tokens fly by the way you see EC2 instances spin up. There is no console tab where you watch the runaway happen. The first signal is the bill.
Treat the API key like a write-enabled database credential. Cap it. Rate limit at your gateway. Alert on hourly spend. Attribute every call to the feature that made it. The day you stop thinking of the LLM provider as a friendly cost line and start thinking of it as a resource with the same blast radius as a database is the day you stop having weekends like ours.
Comments (0)