AI Code Review Is Mostly Noise

AI code reviewers are the hot dev-tool category of 2026. After months of real use on a review queue, the signal-to-noise ratio is bad. Here are the numbers.

Every dev tool company shipped an AI code reviewer in the last twelve months. GitHub's Copilot reviewer. Greptile. CodeRabbit. Cursor's review feature. Anthropic's PR reviewer. The marketing pitch is identical across all of them, sometimes verbatim: a tireless senior engineer who reads every pull request and catches the bugs your humans would miss.

The pitch is great. It does not survive contact with a real review queue.

This post is the sequel to "Your AI Coding Speedup Is Not What You Think." Same framing: I ran the tool, I measured the result, the result is not the brochure.

What the tools actually do

Every AI code reviewer I've used follows the same loop. Read the diff. Generate comments. Some comments flag style. Some flag potential bugs. Some demand more tests. Some suggest "improvements" with refactor diffs. The output volume is high. The variance across vendors is low. Once you've seen comments from two of them, you've seen comments from all of them.

The differentiation in this market is mostly the integration: how it posts to GitHub, whether it gates merges, whether it summarizes the PR for human reviewers, whether it integrates with your existing linters. The actual review content is roughly the same across vendors because the underlying model is roughly the same. They are all asking some foundation model the same question.

The narrow signal

The genuinely useful catches are real, and worth listing honestly:

Obvious typos in identifiers. Unused imports. Null-dereference patterns the type system would have caught if you had one. Missing test coverage on the obvious paths. Unclosed resources (file handles, JDBC connections, HTTP clients). An obviously wrong boolean condition. The classic if (x = 5) instead of if (x == 5) that slipped past your linter.

Most of this is what a tightly-tuned linter, a coverage tool, and a strict type checker catch for free. The AI reviewer's contribution here is convenience: you get the catches without configuring the linter. Configuring the linter takes one afternoon. The AI reviewer costs one subscription per seat per month, plus the cost of reading everything else it writes.

The noise floor

The bulk of the output is not signal. After three months of running an AI reviewer on every PR, the noise patterns are predictable:

Hallucinated nulls. The bot insists a value could be null in code where the type system already proves it cannot. Kotlin code with non-nullable types, Java code with @NonNull annotations, TypeScript code with strict null checks: the bot keeps suggesting defensive null guards that are dead code by construction.

Defensive code in safe zones. Functions whose contracts guarantee non-empty input get "consider checking for empty input" comments. The check would never fire. Adding it pollutes the function. Not adding it produces a comment.

Style nitpicks against established conventions. Every codebase has conventions. The bot has not read your style guide. It suggests refactors that fight the patterns the rest of the file uses. "Extract this to a helper" on the one part of a coordinated four-step function that is obviously not extractable without breaking the four steps.

Comment-on-everything energy. "Consider adding a comment explaining this logic" on lines that are self-explanatory. The implied bar is that every line should have a comment, which is the opposite of how good code reads.

Repeated observations. The same point made on twenty files in the same PR. If the convention applies to the codebase, file an issue against the project once. The bot files it twenty times.

"Did you mean to do X?" The author obviously did. The comment exists because the bot cannot model intent.

Concurrency warnings on single-threaded paths. Comments about thread safety on code that lives behind a single-writer queue. The bot does not know the surrounding architecture.

What it misses

The bugs that actually cost you sleep, and that the bot does not flag:

Business logic errors. The function returns the right type and the right shape, but the value is wrong because the bot does not know what your customer is supposed to see. A discount that should compound but doesn't. A status that should transition through three steps but skips one. The bot reads syntax, not policy.
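
A minimal sketch of the discount case, assuming a hypothetical policy where two discounts are supposed to compound. Both functions are invented for illustration. They have the same signature and return the same type; only someone who knows the policy can say which one is the bug.

```python
def price_additive(base: float, discounts: list[float]) -> float:
    """Sums the discounts first: 10% and 20% become 30% off."""
    return base * (1 - sum(discounts))


def price_compounded(base: float, discounts: list[float]) -> float:
    """Applies each discount to the already-discounted price."""
    price = base
    for d in discounts:
        price *= 1 - d
    return price


# Same inputs, same return type, different answer for the customer.
additive = price_additive(100.0, [0.10, 0.20])
compounded = price_compounded(100.0, [0.10, 0.20])
print(round(additive, 2), round(compounded, 2))
```

Nothing in the diff tells a reviewer which of the two the pricing policy asks for. That information lives in a ticket, a contract, or someone's head.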

Race conditions across multiple files. Each file looks correct in isolation. The race lives in the interaction. The bot reviews files, not interactions.

Ordering issues in async code. Two awaits that look reasonable side by side, where one needs to complete before the other for the postcondition to hold. The bot does not reason about ordering between independent statements.
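
A sketch of what that looks like, assuming a hypothetical postcondition: the audit record must exist before the event is published. The function names and the in-memory store are made up for illustration.

```python
import asyncio


async def write_audit_record(store: list[str], order_id: str) -> None:
    await asyncio.sleep(0.01)  # stands in for a slow database write
    store.append(f"audit:{order_id}")


async def publish_event(store: list[str], order_id: str) -> None:
    store.append(f"event:{order_id}")  # consumers react as soon as this lands


async def handler_wrong_order(store: list[str], order_id: str) -> None:
    # Reads fine in a diff, but the event is visible before the audit row exists.
    await publish_event(store, order_id)
    await write_audit_record(store, order_id)


async def handler_right_order(store: list[str], order_id: str) -> None:
    await write_audit_record(store, order_id)
    await publish_event(store, order_id)


def run(handler) -> list[str]:
    store: list[str] = []
    asyncio.run(handler(store, "o-1"))
    return store


wrong = run(handler_wrong_order)
right = run(handler_right_order)
print(wrong)
print(right)
```

Both handlers contain the same two lines. The only difference is which comes first, and whether that matters depends on a postcondition that is not written in either file.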

Performance issues only visible at scale. The query that runs in 5 ms against your dev seed data and runs in 5 seconds against the production table. Or the N+1 that fires only when the result set is non-empty. Production schema and production data are not in the bot's context.
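
The N+1 variant is easy to reproduce. A minimal sketch with an in-memory SQLite database; the schema, the customer names, and the query counter are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (order_id INTEGER, sku TEXT);
    """
)
stats = {"queries": 0}


def query(sql: str, params: tuple = ()) -> list[tuple]:
    stats["queries"] += 1  # count round trips, not rows
    return conn.execute(sql, params).fetchall()


def load_orders(customer: str) -> dict[int, list[tuple]]:
    orders = query("SELECT id FROM orders WHERE customer = ?", (customer,))
    # One extra query per order: invisible while `orders` is empty.
    return {
        oid: query("SELECT sku FROM items WHERE order_id = ?", (oid,))
        for (oid,) in orders
    }


def queries_for(customer: str) -> int:
    stats["queries"] = 0
    load_orders(customer)
    return stats["queries"]


empty_case = queries_for("nobody")  # dev-like data: no matching orders

conn.executemany(
    "INSERT INTO orders (id, customer) VALUES (?, ?)",
    [(i, "acme") for i in range(1, 51)],
)
loaded_case = queries_for("acme")  # production-like data: 50 orders
print(empty_case, loaded_case)
```

The function is identical in both runs. Only the data changed, and the data is the part the reviewer bot never sees.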

Security issues requiring trust context. The endpoint that interpolates a path parameter into a SQL string, where the path parameter is in fact validated by middleware two layers up. Or where it isn't. The bot does not know what's trusted.
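
A sketch of why the diff alone cannot settle this, using Python's sqlite3 module. The table, the handlers, and middleware_validate are hypothetical. The interpolated handler is safe or unsafe depending entirely on whether something like the validator ran first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)", [(1, "ada"), (2, "grace")]
)


def middleware_validate(user_id: str) -> str:
    """Hypothetical upstream check: ASCII digits only."""
    if not (user_id.isascii() and user_id.isdigit()):
        raise ValueError("invalid user id")
    return user_id


def handler_interpolated(user_id: str) -> list[tuple]:
    # Safe only if every caller went through middleware_validate first.
    return conn.execute(
        f"SELECT name FROM users WHERE id = {user_id}"
    ).fetchall()


def handler_parameterized(user_id: str) -> list[tuple]:
    # Safe regardless of what happened upstream.
    return conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchall()


malicious = "1 OR 1=1"
leaked = handler_interpolated(malicious)  # validator skipped: every row comes back
bound = handler_parameterized(malicious)  # treated as a value: matches nothing
print(len(leaked), len(bound))
```

A bot that flags every interpolated query is noise where the middleware exists and a real catch where it doesn't. Without the call graph it cannot tell the two apart.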

Architectural drift. This PR is fine in isolation. It normalizes a bad pattern that the team is trying to phase out. The bot has no opinion about your direction.

Bugs in the test. The test passes, the code is wrong. Maybe the assertion checks the wrong thing. To the bot, code and tests are parallel artifacts, not a system where the test must independently constrain the code.
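
A minimal sketch of a passing test over wrong code. The refund function and both tests are invented for illustration.

```python
def apply_refund(balance_cents: int, refund_cents: int) -> int:
    """Intended behavior: add the refund to the balance. Bug: it subtracts."""
    return balance_cents - refund_cents


def weak_test() -> bool:
    # A refund of 0 cannot distinguish + from -, so this passes on broken code.
    result = apply_refund(1000, 0)
    return isinstance(result, int) and result == 1000


def constraining_test() -> bool:
    # A non-zero refund pins down the sign, so this fails on the buggy version.
    return apply_refund(1000, 250) == 1250


print(weak_test(), constraining_test())
```

The weak test is green, has an assertion, and covers the line. It looks like a reviewed test in every way except that it does not constrain the code.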

Behavior change in a dependency upgrade. The library went from version 4.0 to 4.1. Buried in the release notes: a default value change. The PR is a one-line version bump that the bot calls "safe."

The numbers

Three months on a real service. One AI code reviewer running on every PR. One human review queue running in parallel.

Comments generated by the bot: about 3,500.

Comments that led to a real change: about 80, or 2.3%.

Of those 80, the breakdown:

~50: things the linter would have caught with one config rule
~20: style nitpicks I agreed with after the fact
~8: real catches that mattered (unused imports, dead code)
~2: actual logic bugs

Comments dismissed as wrong: ~1,100. Comments dismissed as nitpicks I disagreed with: ~1,700. Comments that were correct but immaterial: ~620.

Human reviewers in the same period, on the same PRs, flagged 28 logic bugs.

The ratio is the headline. The bot caught 2 logic bugs to the human reviewers' 28, roughly 7% of the human catch, while producing 40x the comment volume.
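
The arithmetic is short enough to check. A small sketch using only the approximate counts reported above; the variable names are mine.

```python
# Approximate counts as reported in this post.
bot_comments = 3500
led_to_change = {
    "linter_equivalent": 50,
    "style_agreed": 20,
    "real_catches": 8,
    "logic_bugs": 2,
}
dismissed = {"wrong": 1100, "nitpick_disagreed": 1700, "correct_but_immaterial": 620}
human_logic_bugs = 28

acted_on = sum(led_to_change.values())
action_rate = acted_on / bot_comments
accounted_for = acted_on + sum(dismissed.values())
logic_bug_ratio = led_to_change["logic_bugs"] / human_logic_bugs

print(f"acted on: {acted_on} of {bot_comments} ({action_rate:.1%})")
print(f"accounted for: {accounted_for}")
print(f"bot logic bugs relative to humans: {logic_bug_ratio:.0%}")
```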

The hidden cost

The subscription is not the cost. Review fatigue is.

When the bot drops eight comments on every PR, humans skim them. When humans skim bot comments, the skim behavior leaks into how they read human comments too. The reviewer who used to spend ten minutes per PR now spends six, because seven of the eight bot comments are noise and the human has trained themselves to triage faster. Careful review goes away.

Six weeks later, the bug rate climbs. Blame goes to the new hire, the recent dependency upgrade, the holiday week. Nobody connects it to the bot, because the bot didn't introduce the bug. The bot lowered the bar of what counts as "reviewed."

Where it does help

The wins are real but narrow.

Teams with no review culture at all: any review beats no review. A bot that catches typos and missing tests is a net improvement on "merge after CI passes."

Solo developers: a second pair of eyes is better than zero pairs of eyes. The bot will not catch the business logic bugs, but the solo dev wasn't going to either.

Reviewing an unfamiliar codebase: the bot is a worse reviewer than the senior engineer who wrote the code, but that engineer is not available. The bot's average comment is still informed by patterns from millions of repos.

Compliance theater: when a regulator or a customer auditor needs to see "automated code review," the bot fills the box.

Outside those four cases, the trade is bad.

What works better per dollar

The same monthly budget allocated to other things catches more real bugs:

A real linter with project-specific rules. Configure once, runs forever, no noise floor, no hallucinations. Catches every "unused import" without commenting on the rest of the file.

A small set of well-written invariant tests. The kind that fails when a real customer constraint is violated. One good integration test catches the bugs the AI reviewer cannot see.

A senior engineer who reviews carefully and is given time to do it. Slowest, most expensive, catches the most. The whole point of human review is the part the bot cannot do.

A pull request template with two questions: "what could break" and "what did you not test." Forces the author to think before the reviewer has to.

A pre-merge integration test that hits a real database, real cache, real downstream. The bot cannot run this for you. CI can.

The market is in cosplay

AI code review is a market where the demos look good because the demos are run on toy PRs in fresh codebases with no context. In a real review queue with real conventions and real architectural commitments, the signal-to-noise ratio inverts. You get the catches a linter could give you for free, plus a lot of noise that makes humans worse reviewers.

If a tool's value proposition is "we read every PR" and the actual catches are "unused import on line 47," the tool is solving a problem you already had a better solution for. Spend the same budget on tests, tooling, and senior time. The real bugs still need a human who knows the code and the customer. That hasn't changed.

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
