← Back to Blog

// Posted by Umur Inan

// Category Thinking

// Posted on May 16, 2026

Your AI Coding Speedup Is Not What You Think

AI coding tools promise a 10x speedup. After a year of daily Claude Code use, the real number on a real codebase is closer to 1.5x. Here is what I measure.

By Umur Inan · 5 min read

A coworker told me he was getting a 5x productivity boost from Claude Code. I asked how he measured it. He said it felt that way. I told him it does not feel that way to me, and I use the same tools more than he does.

I have been writing about Claude Code on this blog for months. The MCP server setup, the Slack integration that lets me remote-control my terminal, the custom agents I run in parallel. I run probably 30 hours of agent time a week. I am as bought in as anyone.

After a year of this, my honest measurement of the speedup on a real production codebase is closer to 1.5x than 5x. Some days are 2x. Some days are 0.8x. Averaged across a week, the number is small enough that I have stopped describing it as a speedup at all. It is a shift in where my time goes, not a multiplication of it.

Where AI Actually Saves Time

The wins are real. They are narrower than the headlines suggest. Three categories show up consistently for me.

Test generation is the biggest. I give Claude a service class and the existing test conventions and it writes the unit tests in a few seconds. The first draft is usually 80 percent there. I fix the edge cases by hand. On a busy week this is the single largest time saving I can measure. Maybe four hours.

Boilerplate is the second. Configuration files, DTO conversions, Spring controllers that wrap existing service methods. Code that has a known shape and no judgment calls. The kind of code I would type slowly and resent. Claude types it in two seconds and I review it in thirty.

Multi-file refactors round out the list. Renaming a concept across a hundred files used to be a half-day grep job. Now I describe the rename and Claude does the mechanical work. I still review every change. I still find one or two cases where the rename was wrong in context. Net win, but smaller than people claim.

That is the list. Notice what is not on it. New features that require thinking. Architecture decisions. Anything where the right answer is not obvious from existing code. Anything in a domain the model has not seen often enough to be reliable.

Where AI Adds Time, and Nobody Talks About This

The hidden costs are bigger than the wins.

Review cost is higher per line of AI code than per line of human code. When I write code I know what I meant, and I can scan my own diff in a minute. When Claude writes code I have to actually read it. The function signatures look right. The variable names look right. Control flow looks right too. But the implementation might be three different ways of doing the same thing layered together, or a defensive null check that hides a real bug, or an exception type that does not match the upstream pattern. I have to understand what was written, not just check that it compiles.

I started timing this once. A 300-line diff written by Claude takes me about 25 minutes to review properly. A 300-line diff written by a teammate I trust takes maybe 8 minutes. The AI diff is not worse on average. It is denser. It packs more decisions into less code, and I have to verify each one.

The taste tax is real and underdiscussed. Claude's first draft of any non-trivial piece of code is plausible. It compiles. It probably works. But it almost never matches what I would have written if I had been thinking carefully. The idiom is slightly off. There's a layer of abstraction I would not have added. Defensive patterns appear where I would have asserted the invariant.

Accepting that first draft feels productive. The code works, the PR is up. But I am inheriting maintenance debt that compounds for as long as the code lives. Every time I touch that file later, I am reading code I did not write, in a style I would not choose, with decisions I do not fully agree with. Multiply that across a hundred AI-assisted commits and you have a codebase that fights you.

Rare or legacy code is where AI confidently goes wrong. When I ask Claude about Spring Boot 4 features that came out three months ago, the suggestions are confidently obsolete. When I ask about an internal framework with five users and no public documentation, the suggestions are competent-sounding hallucinations. Confidence is the problem. The model lacks awareness of its own gaps, and the patterns it falls back on are versions or libraries from years ago. Catching this requires me to already know the right answer, which defeats most of the point.

Architecture decisions get a net negative. I have asked Claude to suggest service boundaries, event topology, data model trade-offs. The suggestions are average. They reflect a thousand similar systems, not the specific constraints of mine. Worse, having the suggestion in front of me anchors my thinking. I find myself defending or modifying Claude's proposal instead of designing from first principles. The cleanest architecture decisions I have made this year were the ones where I did not open Claude.

The Number Nobody Quotes

My honest measurement, on a real codebase, across a typical week: 1.4x to 1.8x. That number includes the review cost, the taste cost, and the architecture cost.

I am genuinely curious what the people quoting 5x and 10x are measuring. My guess: they are measuring lines of code, or first-draft completion time, or how often Claude produces a working answer. None of those are productivity measurements. Productivity is shipping correct, maintainable software. A 5x increase in code produced is not a 5x increase in product velocity if you are also shipping more bugs and more debt.

Every serious productivity study I have read puts the gain in single-digit to low double-digit percentages, depending on what is measured. Not zero. Also not 5x. It matches what I see when I am honest about it.

What I Do Now

I have stopped reaching for Claude as the default. The heuristic that works for me:

If the code is mechanical and known, use Claude.
Judgment call? Write it myself.
If the code sits in an area where the model is likely outdated, write it myself.
Anything that lives long and gets read often: write it myself.

This sounds restrictive. It is not. Mechanical and known is most of what I do in a week. Tests, refactors, boilerplate, config. The judgment work is the smaller slice, but it is the slice that determines whether the codebase is something I want to be in next year.

The other shift: I use Claude as a rubber duck more than as a code generator. Asking it to explain a stack trace, walk through the failure modes of a design, or summarize how an unfamiliar library works. That is a real speedup. It is also the case where the model's tendency to confidently make things up is the most dangerous, so I verify everything against the source.

The Honest Headline

AI coding tools are a meaningful improvement on parts of the job that I did not enjoy. Roughly neutral on the parts that matter most. Net negative on architecture and on rare code. The total speedup is real but small.

If you are running a team and budgeting based on a 5x productivity bump, you will be disappointed by what actually ships. If you are choosing a tool because you genuinely want to spend less time on mechanical work and more on the parts of the job that take judgment, this stuff is great.

Just be honest about the number.

AI Tooling Software Thinking Engineering Practices

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.

GitHub LinkedIn Email

👁 0 5 min read