Look at enough internal API code and you will eventually find an endpoint that returns HTTP 200 for everything. Success: 200, body {"success": true, "data": ...}. Failure: 200, body {"success": false, "error": "user not found"}. Server bug: 200, body {"success": false, "error": "internal error"}. One status code for every outcome, and a body field that carries the actual meaning.
The team building it usually thinks the design is clean. One code path on the frontend. One parser on the mobile client. No "weird HTTP error stuff" to handle. The argument is always the same: simpler.
It is not simpler. It costs you retries, caches, load balancers, circuit breakers, and every monitoring tool that has ever been built. HTTP status codes are a contract that the entire web ecosystem was constructed around, and when you opt out you take on the work of rebuilding all of it inside your client code. You never finish that work.
Why teams do it
Five reasons I have actually heard, in real meetings, from real engineers. All of them have an answer.
"It is easier for the JavaScript / iOS / Android client to handle." The difference is five lines vs eight lines. axios and ky reject non-2xx responses into a typed error branch by default; OkHttp and URLSession hand you the status code and a one-line check (response.isSuccessful, HTTPURLResponse.statusCode). The "simpler" argument is a phantom.
"iOS shows ugly errors on 5xx." No, your UI does. The HTTP status does not render anything. The UI layer reads the response and decides what to show. Fix the UI.
"Our monitoring alerts on 5xx and the noise is bad." This one is honest, and it is the worst reason. Tune the alerts. Silencing the signal at the protocol layer is not the fix. The next incident now starts with a customer email, not a page.
"We already put the error in the body; the status code is redundant." The duplication is the point. Status codes are the part of the response that proxies, caches, load balancers, and SDK generators can read without parsing your body. Bodies are for humans and detailed clients. Status codes are for the half-dozen middleware layers between you and the human.
"GraphQL returns 200 with errors in the body, so why not us?" GraphQL is a different protocol with explicit semantics around partial success. HTTP-based REST is not GraphQL. The argument transplants badly.
What you break
Six concrete things. Each one matters; together they are devastating.
Client retries. Not every HTTP library retries out of the box, but every retry layer you will meet (Apache HttpClient's retry strategy, Spring Retry, Resilience4j, Envoy route retries) keys on the status code: retry on 503 and its neighbors, never on 2xx. A 200 wrapping "service temporarily unavailable" tells the client "do not retry, this succeeded." Your transient backend blip becomes a permanent failure for the caller.
HTTP caches. A 200 response to a GET is cacheable by default. CDNs, reverse proxies, and browser caches all respect this. A 200 wrapping "user not found" can land in a CDN cache under the same key as the eventual real user record. The next lookup for that user is served the stale cached failure until the TTL expires.
Load balancers. AWS ALB, nginx upstream health checks, Envoy outlier detection, and every other modern LB tracks 5xx rates to decide whether a backend instance is healthy. A 200-wrapped 500 leaves the sick instance in rotation. The bad node serves traffic until a human notices.
Circuit breakers. Resilience4j, Hystrix, Polly, and the rest decide to open the circuit based on HTTP status or thrown exceptions. A 200 with success: false in the body is success to them. The circuit never opens. Cascading failures get worse, not better.
Observability. Datadog APM, Grafana traces, Prometheus exporters, AWS X-Ray, OpenTelemetry collectors: they all bucket spans by HTTP status. Your error-rate dashboard reads zero forever. The first signal you get of a broken endpoint is a customer ticket.
OpenAPI and SDK generation. Generated client SDKs produce typed result objects, with success and error paths derived from status codes. With everything 200, the generated SDK has no error path. Every consumer has to re-derive the error envelope by hand.
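To make the circuit-breaker item concrete, here is a minimal Python sketch of a breaker that decides from status codes alone. It is a toy under stated assumptions, not how Resilience4j, Polly, or Envoy are implemented: the class name StatusCircuitBreaker, the window of 10 calls, and the 50% threshold are all invented for illustration.

```python
from collections import deque


class StatusCircuitBreaker:
    """Toy breaker (hypothetical): opens when 5xx responses fill enough of a sliding window."""

    def __init__(self, window_size=10, failure_ratio=0.5):
        self.window_size = window_size
        self.failure_ratio = failure_ratio
        # Oldest outcomes fall off automatically once the window is full.
        self._outcomes = deque(maxlen=window_size)

    def record(self, status):
        # Only the status code is inspected; the body is never parsed.
        self._outcomes.append(status >= 500)

    @property
    def is_open(self):
        if len(self._outcomes) < self.window_size:
            return False  # not enough samples yet
        failures = sum(self._outcomes)
        return failures / len(self._outcomes) >= self.failure_ratio


honest = StatusCircuitBreaker()
wrapped = StatusCircuitBreaker()
for _ in range(10):
    honest.record(503)   # the outage is reported in the status line
    wrapped.record(200)  # same outage, hidden behind {"success": false}

print(honest.is_open, wrapped.is_open)
```

The same ten failed calls open one breaker and leave the other closed. The only difference is where the failure was written down.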
The 4xx versus 5xx distinction you are throwing away
The most important loss is the implicit retry policy that HTTP status codes communicate.
A 4xx says "the request itself is the problem; do not retry without changing it." A 5xx says "the server is the problem; retrying may succeed." Every retry policy worth using is built on that contract, so the retry logic comes out correct without anyone having to think about it.
Collapse both to 200 and the contract disappears. Clients hammer endpoints that will never succeed (the 4xx case: invalid input retried thousands of times). Clients silently fail to retry transient errors (the 5xx case: backend blip becomes user-visible). The failure mode is invisible to everyone except the user.
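The contract fits in a dozen lines. The sketch below is one possible retry classifier in Python, not the policy of any particular library; the names RetryDecision and classify are hypothetical, and treating 429 as retryable is my assumption.

```python
from enum import Enum


class RetryDecision(Enum):
    DONE = "done"
    RETRY = "retry"
    GIVE_UP = "give up"


def classify(status):
    """Map an HTTP status code to a retry decision (illustrative policy)."""
    if 200 <= status < 300:
        return RetryDecision.DONE
    # Assumption: 429 is retried (after backing off) alongside every 5xx.
    if status == 429 or 500 <= status < 600:
        return RetryDecision.RETRY
    # Remaining 4xx: the request is the problem, so resending it cannot help.
    # Redirects are assumed to be handled earlier by the HTTP library.
    return RetryDecision.GIVE_UP


honest_outage = classify(503)   # RetryDecision.RETRY
bad_input = classify(422)       # RetryDecision.GIVE_UP
wrapped_outage = classify(200)  # RetryDecision.DONE, whatever the body says
```

A 200-wrapped outage lands in DONE, so no retry happens. The only way to get a retry back is for every client to parse the body and reimplement this function against your error strings.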
The taxonomy that actually matters
You do not need to memorize the full RFC 9110 list. You need to pick the right one in the right place.
400 malformed request (cannot parse the JSON, missing required header)
422 valid request, business-rule violation (overdrawn account, password too short)
401 not authenticated (no credentials, expired token)
403 authenticated, but not authorized (logged in, wrong role)
404 resource does not exist
410 resource existed and was deleted (signals "stop polling forever")
409 conflict on resource state (concurrent update, duplicate create)
412 precondition failed (If-Match header, optimistic locking)
429 rate limited
500 server bug
502 upstream broken
503 service temporarily overloaded
504 upstream timed out
Pick five of these, use them consistently, document them in your OpenAPI spec. That is the whole game.
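One way to keep the codes consistent is to decide them in exactly one place. The Python sketch below assumes a service that raises domain exceptions and maps them to statuses at the edge; the exception names and the status_for helper are hypothetical, and the five codes chosen are just one reasonable pick from the list above.

```python
class MalformedRequest(Exception):
    """The request could not be parsed."""


class RuleViolation(Exception):
    """The request was valid but broke a business rule."""


class NotFound(Exception):
    """The resource does not exist."""


class Conflict(Exception):
    """The resource state conflicts with the request."""


# Single source of truth for which failure maps to which status code.
STATUS_FOR = {
    MalformedRequest: 400,
    RuleViolation: 422,
    NotFound: 404,
    Conflict: 409,
}


def status_for(exc):
    """Return the HTTP status for an exception raised by a handler."""
    for exc_type, status in STATUS_FOR.items():
        if isinstance(exc, exc_type):
            return status
    # Anything the service did not anticipate is a server bug.
    return 500


print(status_for(NotFound("user 42")), status_for(KeyError("oops")))
```

Handlers raise; one function translates. Nobody picks a status code ad hoc inside a route, and the table doubles as the list you document in the spec.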
The body still matters
None of this is an argument against putting structured error information in the body. The status code is the routing key; the body carries the detail. Each layer is useful, and they do not compete.
RFC 9457 Problem Details (application/problem+json), which obsoletes RFC 7807, is the modern shape:
HTTP/1.1 422 Unprocessable Entity
Content-Type: application/problem+json

{
  "type": "https://example.com/errors/insufficient-funds",
  "title": "Insufficient funds",
  "status": 422,
  "detail": "Account 12345 has a balance of $40, but the transfer is for $100.",
  "instance": "/transfers/abc-789",
  "account": "12345",
  "balance": 40,
  "requested": 100
}
The client sees 422 in the status (cannot fix by retrying; the request is the problem). The body carries the human-readable reason and the structured fields the UI needs. Field-level validation errors live in an errors array. Status, then body. Both, not either.
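Producing that shape takes very little code. Here is a minimal Python sketch of a helper that returns the status, headers, and serialized body together, so the two can never disagree. The helper name problem_response and the validation error URL are hypothetical, and a real framework would have its own response type.

```python
import json


def problem_response(status, type_uri, title, detail=None, instance=None, **extensions):
    """Build (status, headers, body) for a Problem Details error response."""
    body = {"type": type_uri, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    # Extension members (account, balance, errors, ...) sit next to the standard ones.
    body.update(extensions)
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)


status, headers, body = problem_response(
    422,
    "https://example.com/errors/validation",  # hypothetical error type URI
    "Validation failed",
    detail="One or more fields are invalid.",
    errors=[{"field": "password", "message": "too short"}],
)
```

The status is passed once and written to both places. The errors array rides along as an extension member, which is where field-level detail belongs.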
The monitoring-noise argument, properly addressed
The team that ships 200-for-everything because their on-call gets paged on 5xx has a real problem. The real problem is not the protocol.
Alert on 5xx rate, not absolute count. Threshold on rolling windows (1% of requests over 5 minutes, not "any 5xx in the last hour"). Page per-route, not per-service. Exclude expected 4xx classes (404s on a content endpoint are normal). Treat 4xx separately as "client behavior signal," not "server health signal."
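As a rough illustration of the rate-based rule, the Python sketch below evaluates one route's rolling window. In practice this lives in your monitoring tool's query language, not in application code; the function name should_page and the min_requests floor are assumptions I added so that a single error on a quiet route does not page anyone.

```python
def should_page(statuses, threshold=0.01, min_requests=100):
    """Decide whether one route's rolling window warrants a page.

    statuses: status codes observed on a single route in the window
    (for example, the last 5 minutes).
    """
    total = len(statuses)
    if total < min_requests:
        return False  # too little traffic for a rate to mean anything
    # Only 5xx counts toward server health; 4xx is client behavior.
    server_errors = sum(1 for s in statuses if 500 <= s < 600)
    return server_errors / total >= threshold


quiet_window = [200] * 980 + [404] * 15 + [500] * 5  # 0.5% 5xx
bad_window = [200] * 970 + [503] * 30                # 3% 5xx

print(should_page(quiet_window), should_page(bad_window))
```

The fifteen 404s in the first window never enter the calculation, and five stray 500s out of a thousand requests stay below the line. That is the noise reduction the team wanted, achieved without lying in the status line.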
Fix the alerts, not the responses. Silencing errors at the protocol layer is the bandaid that destroys every other use case.
The one exception: GraphQL
GraphQL deliberately uses HTTP 200 with errors in the response body, because a GraphQL response can be a partial success: some fields resolved, others failed. The protocol assigns no meaning to HTTP status for partial responses, and that is a feature.
Do not apply the REST argument to a GraphQL endpoint. Do apply it everywhere else.
How to migrate away
If you have inherited an API that does 200-for-everything, two paths.
The incremental path: add proper status codes to new endpoints; keep the wrapper on old ones; version the API and migrate routes one at a time. Existing clients keep parsing the wrapper. New clients (and the SDK generator) get correct status codes.
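For the incremental path, a thin adapter on the new version of a route can translate the legacy envelope into a real status code while the old handler stays untouched. The Python sketch below assumes the envelope shape from the top of this article; the function name to_v2 and the mapping from error strings to statuses are hypothetical and would need to match whatever strings your API actually emits.

```python
# Hypothetical mapping from legacy error strings to status codes.
LEGACY_ERROR_STATUS = {
    "user not found": 404,
    "internal error": 500,
}


def to_v2(envelope):
    """Translate a legacy {"success": ..., ...} envelope into (status, body)."""
    if envelope.get("success") is True:
        return 200, envelope.get("data")
    message = envelope.get("error", "unknown error")
    # Unmapped errors default to 500 so they show up as server failures
    # instead of disappearing into a success bucket.
    status = LEGACY_ERROR_STATUS.get(message, 500)
    return status, {"title": message, "status": status}


ok = to_v2({"success": True, "data": {"id": 1}})
missing = to_v2({"success": False, "error": "user not found"})
unknown = to_v2({"success": False, "error": "something odd"})
```

Defaulting unmapped errors to 500 is deliberate: every string you forgot to map becomes visible on the error-rate dashboard, which tells you what to map next.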
The clean-break path: change the responses, ship a release note, fix the clients you own. Every modern HTTP client handles 4xx/5xx natively; the migration surface is usually smaller than the team thinks. Public APIs need a deprecation window. Internal services often do not.
Either way, do not write a third version that wraps the status code in the body again because "the team is used to it."
Why this is worth the work
The web is built on a contract that says HTTP status codes carry meaning. That contract is implemented in dozens of layers: client libraries, proxies, caches, load balancers, observability tools, SDK generators, OpenAPI tooling, every modern API gateway. When your endpoint opts out, you become the only one responsible for translating the body field into all of those layers' expected behavior. You never finish.
The endpoint that always returns 200 is not simpler. It is hiding complexity in the place where it is hardest to find: spread across every consumer, every middleware, and every dashboard you ever wire up. The simple version is the one where the status code says what happened.