At 3 AM, the on-call engineer is staring at a wall of logs. The service is degraded. A dashboard shows 5x latency on a specific endpoint. Logs are the next stop. Roughly 15,000 lines per minute, all at INFO level, almost all from successful requests, almost all useless. Somewhere in there is the actual error. Finding it takes twenty minutes of grep and luck. The on-call eventually gives up on the logs and starts reading the code instead.
This is the default state of logging in most codebases. Every function logs that it was called. Every successful operation produces an INFO line. Every retryable error produces a WARN that nobody acts on. The actual ERROR lines, when they exist, are buried in the noise. Logging works fine in development because there is no traffic. It falls apart in production because every line you log per request is now multiplied by the request rate.
What Levels Are Actually For
The standard log levels (TRACE, DEBUG, INFO, WARN, ERROR, FATAL) are not a UI gradient. They are a contract between the developer who wrote the line and the operator who runs the system. Each level says something specific about what that line is for and what should happen when someone sees it.
- TRACE: step-by-step execution flow. Useful when debugging a specific function in isolation. Should never run in production.
- DEBUG: developer-relevant detail. Variable values, branches taken, inputs to a calculation. Off in production. On in dev.
- INFO: meaningful state changes that an operator might care about. A new user registered. A batch job completed. A configuration was loaded at startup. Not "function called." Not "request received." Not "value is X."
- WARN: a problem occurred but the system handled it. A retry succeeded after one failure. A fallback was used. The operator does not need to do anything right now, but the count of these matters.
- ERROR: a problem occurred that the system could not handle. A request failed. A job crashed. A required dependency was unreachable. The operator probably needs to look.
- FATAL: the application cannot continue. About to exit. Rare in long-running services. Common in CLI tools and one-shot scripts.
If your team's working definition of these levels is different, that is fine, but write it down. The point is to have a definition. Most teams do not, which is why every level ends up meaning roughly the same thing.
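One way to write the definition down is to put it in the logging setup itself. The following is a minimal sketch using Python's standard logging module; the APP_ENV variable and the function names are assumptions made for illustration. Note that Python's standard library has no TRACE level, and its FATAL is an alias for CRITICAL.

```python
import logging
import os


def level_for_environment(env: str) -> int:
    """Map a deployment environment to the lowest level that gets emitted."""
    if env == "production":
        return logging.INFO  # DEBUG detail stays off in production
    return logging.DEBUG  # developers see the detail locally


def configure_logging(env: str) -> logging.Logger:
    """Apply the level contract to the application's top-level logger."""
    logger = logging.getLogger("app")
    logger.setLevel(level_for_environment(env))
    return logger


# APP_ENV is a hypothetical environment variable name.
logger = configure_logging(os.environ.get("APP_ENV", "development"))
logger.debug("discount inputs: base=%s tier=%s", 100, "gold")  # dropped in production
logger.info("configuration loaded")
```

The useful part is not the three lines of configuration but that the rule lives in one place where the team can read and argue about it.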
The INFO-Is-the-Default Problem
Most logging frameworks default to INFO. Most teams default to "if we might want to see this later, log it." Combined, these produce code that calls log.info() a hundred times per request and means nothing by it.
Disk space is the obvious cost, and it adds up. The real cost is noise. When 99 percent of the lines in your logs are routine successful operations, the human reading the logs becomes desensitized. They scroll past WARN. They scroll past ERROR. They optimize for lines that are visually different from what they have been reading, which usually means looking for stack traces. Anything that is not a stack trace becomes effectively invisible.
The fix is not "log less." The fix is to log the right things at the right levels. Most of what currently lives at INFO should be DEBUG and turned off in production. Most of what currently lives at WARN should be a counter (a metric incremented when the condition happens) plus a single periodic log if the count is unusual.
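The counter-plus-periodic-log idea can be sketched roughly as follows. This is an illustration, not a replacement for a real metrics library: the class name, the threshold, and the interval are all assumptions, and the summary is only checked when a new event arrives.

```python
import logging
import time
from collections import Counter

logger = logging.getLogger("app.retries")


class RecoveredEventCounter:
    """Counts recovered problems; logs one summary per window if the count is unusual."""

    def __init__(self, threshold: int, interval_seconds: float = 60.0, clock=time.monotonic):
        self.threshold = threshold
        self.interval_seconds = interval_seconds
        self._clock = clock
        self._counts = Counter()
        self._window_start = clock()

    def record(self, event: str) -> None:
        """Increment the counter instead of emitting a WARN line per occurrence."""
        self._counts[event] += 1
        self._flush_if_due()

    def _flush_if_due(self) -> None:
        # Only checked when an event is recorded; a timer would be more thorough.
        if self._clock() - self._window_start < self.interval_seconds:
            return
        for event, count in self._counts.items():
            if count >= self.threshold:
                logger.warning(
                    "%s occurred %d times in the last %.0fs",
                    event, count, self.interval_seconds,
                )
        self._counts.clear()
        self._window_start = self._clock()
```

A call site would then use something like `retries.record("db_retry")` where it previously called `log.warn(...)`, so a thousand retries per minute become one line instead of a thousand.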
Three Tests Before You Write the Line
Before you write log.X(...), run the line through three quick tests:
INFO test. Would an operator care about this if they were just glancing at the logs? "User logged in" - yes, that is a state change. "Calculating discount for user" - no, that is just a function being called. Most calculations are not state changes worth logging.
WARN test. Did something go wrong, even though we recovered? "Retried database call after timeout" - yes. "Fell back to cached value because the API was slow" - yes. "Optional config not set, using default" - probably no, that is just INFO at startup if anything.
ERROR test. Would I want to wake up the on-call for this? If yes, ERROR. If no, it is probably WARN, or it should not be a log line at all. ERROR is for things you want someone to look at, not for things that happened to fail in a normal way.
The most common mistake is logging ERROR for every caught exception. A user submitted invalid input. The framework threw a ValidationException. You caught it and returned a 400 to the client. This is normal traffic. This is not an ERROR. Logging it as ERROR pollutes your error rate dashboards and the alerts that watch error counts.
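A sketch of that distinction in a request handler, assuming Python's standard logging module. ValidationError and the process callable are hypothetical stand-ins for whatever your framework and business code provide.

```python
import logging

logger = logging.getLogger("app.http")


class ValidationError(Exception):
    """Hypothetical stand-in for a framework's validation exception."""


def handle_request(payload: dict, process) -> int:
    """Return an HTTP status code; `process` is a hypothetical business function."""
    try:
        process(payload)
        return 200
    except ValidationError as exc:
        # Bad client input is normal traffic: keep it out of the ERROR stream.
        logger.debug("rejected invalid input: %s", exc)
        return 400
    except Exception:
        # Unexpected failure: this is what ERROR is for.
        # logger.exception() logs at ERROR and attaches the traceback.
        logger.exception("request failed unexpectedly")
        return 500
```

If you still want visibility into rejected input, a counter of 400 responses usually says more than the individual lines do.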
Log Volume Is a Performance Problem
INFO logs in a hot path are not free. They have a cost in CPU (formatting strings), memory (allocating those strings), and I/O (writing to disk or a remote log aggregator). At 10,000 requests per second with five INFO lines per request, you are emitting 50,000 lines per second. That is real load.
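The formatting cost is easy to see in Python's standard logging module. In the sketch below, Expensive is a hypothetical object whose string form is costly to build; the point is that an f-string is formatted before the logger decides whether to drop the line, while %-style arguments are only formatted if the record is emitted.

```python
import logging

logger = logging.getLogger("app.pricing")
logger.setLevel(logging.INFO)  # DEBUG is off, as it would be in production


class Expensive:
    """Hypothetical object whose string form is costly to build."""

    calls = 0

    def __str__(self) -> str:
        Expensive.calls += 1
        return "expensive-representation"


cart = Expensive()

# Eager: the f-string is built even though the line is then discarded.
logger.debug(f"cart state: {cart}")

# Lazy: logging formats the arguments only if the record is actually emitted.
logger.debug("cart state: %s", cart)

# Guard: skip any extra work that exists only to produce log arguments.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("cart state: %s", str(cart))

# The arithmetic from the paragraph above.
lines_per_second = 10_000 * 5  # requests per second * INFO lines per request
```

After this runs, Expensive.calls is 1: only the eager f-string paid the formatting cost for a line nobody will ever read.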
Most systems can handle this until something else goes wrong. Then logging becomes the failure mode. Disk fills. Log aggregator throttles. The application slows because it is blocked on log I/O. You have built a system where logging is on the critical path and you did not realize it. The first time you find out is during the incident that the logs were supposed to help you debug.
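One common way to take log I/O off the request path, sketched here with Python's standard QueueHandler and QueueListener, is to hand records to a bounded in-memory queue and let a background thread do the slow writing. The DroppingQueueHandler class, the attach_background_logging helper, and the queue size are assumptions for illustration. Dropping lines when the queue is full is a deliberate trade-off that will not suit every system.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Hands records to a bounded queue and drops them when it is full."""

    def __init__(self, log_queue: queue.Queue):
        super().__init__(log_queue)
        self.dropped = 0  # worth exporting as a metric

    def enqueue(self, record: logging.LogRecord) -> None:
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # Losing a log line is cheaper than stalling a request.
            self.dropped += 1


def attach_background_logging(logger, slow_handler, max_pending=10_000):
    """Route `logger` through a queue; a listener thread feeds `slow_handler`."""
    log_queue = queue.Queue(maxsize=max_pending)
    queue_handler = DroppingQueueHandler(log_queue)
    logger.addHandler(queue_handler)
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return queue_handler, listener  # call listener.stop() at shutdown
```

This does not make logging free. It moves the failure from "requests stall" to "log lines are dropped and counted," which is usually the better failure to have during an incident.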
Structured Logging Helps, But Does Not Fix the Pile
You will hear that the answer to messy logs is "use structured logging." That is correct, and not enough. Logging key-value pairs in JSON or logfmt instead of pre-formatted strings makes logs queryable. You can ask "show me all WARN lines from the payment service in the last hour where the user_id is 12345" instead of grepping for substrings.
Structured logging fixes the search problem. It does not fix the volume problem. It does not fix the level discipline problem. A structured INFO log on every function call is still noise. It is just queryable noise. The level still matters. The decision of "is this worth logging at all" still matters.
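For reference, a minimal structured-logging sketch using only Python's standard library. The JsonFormatter class and the convention of passing key-value pairs under a "fields" key are assumptions made for this example; dedicated libraries offer more, and a real formatter would also include a timestamp.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed via extra={"fields": {...}} becomes record.fields.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.warning(
    "card authorization retried",
    extra={"fields": {"user_id": 12345, "attempts": 2}},
)
```

The output is a line a log aggregator can filter on level, logger, and user_id. Whether the line deserved to exist at all is still a separate decision.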
Correlation IDs Are Not Optional
The one piece of operational logging discipline that is genuinely non-negotiable: every log line that comes from a request should include a correlation ID that identifies that request. When the on-call engineer finds an ERROR, they should be able to filter to all log lines from the same request, across every service, and reconstruct what happened.
In a single-service app, this is a request ID generated at the entry point and passed through the call stack. In a microservices architecture, it is a trace ID propagated in HTTP headers (W3C traceparent, B3, or whatever your tracing system uses). Without it, you have a pile of log lines and no way to associate them with a specific user-visible problem.
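For the single-service case, one possible shape in Python is a context variable set at the entry point plus a logging filter that stamps it onto every record. The names request_id_var, RequestIdFilter, and handle_request are assumptions for this sketch; in a multi-service setup the incoming ID would come from the trace header rather than being generated locally.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the current request's ID; "-" marks lines logged outside any request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("app.orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_request(incoming_id: Optional[str] = None) -> str:
    """Entry point: reuse an upstream ID if one was provided, else generate one."""
    request_id = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("order placed")  # no ID passed explicitly; the filter adds it
        return request_id
    finally:
        request_id_var.reset(token)
```

Because the ID lives in a context variable, code deep in the call stack does not need a request_id parameter threaded through every function signature.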
The Audit That Will Surprise You
If you have not done this recently, run a query against your production logs grouped by level. Most teams find that 95 to 99 percent of their volume is INFO and almost none of it is actionable. WARN is usually around one percent, and most of those lines are routine failures that recovered on their own. ERROR is usually under 0.1 percent, and a meaningful fraction of those are not actual errors but caught exceptions that represent normal traffic.
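If your logs are not in a system you can query, a rough version of the same audit can be run over a plain-text export. The sketch below assumes each line contains its level as an uppercase word; the pattern and the level_breakdown function are illustrative, and a message that happens to contain a word like ERROR before the real level token would be miscounted.

```python
import re
from collections import Counter

# Assumes the level appears as an uppercase token somewhere in each line.
LEVEL_PATTERN = re.compile(r"\b(TRACE|DEBUG|INFO|WARN(?:ING)?|ERROR|FATAL|CRITICAL)\b")

# Fold framework-specific spellings into the six names used in this article.
ALIASES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


def level_breakdown(lines):
    """Return the share of lines at each level, as fractions that sum to 1."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        level = match.group(1) if match else "UNKNOWN"
        counts[ALIASES.get(level, level)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: count / total for level, count in counts.items()}
```

Feeding it an open file, for example `level_breakdown(open("app.log"))` with a hypothetical file name, gives the per-level shares to compare against the percentages above.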
The fix is not to delete the logs. The fix is to audit them. For each repeating INFO line, ask "would I miss this if it were gone?" If the answer is no, demote it to DEBUG or remove it. For each WARN, ask "is this actionable?" If not, replace it with a counter or remove it. For each ERROR, ask "would I wake up the on-call for this?" If not, demote it to WARN.
The goal is logs you can actually read. If you cannot read your logs, they are not a system. They are a pile, and at 3 AM during an outage, a pile is worse than nothing because it gives you the illusion that the answer is in there somewhere.