I used to think monitoring meant having a dashboard. A big screen on the wall with charts that go up and to the right. CPU usage, memory, request count, maybe a P99 latency line that wiggled between 200ms and 400ms. As long as nothing turned red, we were fine.
Then one Friday afternoon, our checkout flow started failing for about 12% of users. The dashboard was green. Every single panel was green. CPU at 30%, memory stable, request rate normal, latency within bounds. A teammate noticed because his wife texted him that she couldn't complete her order. Not because the dashboard told us.
We spent 45 minutes staring at those green panels before someone thought to check the error rate broken down by endpoint. Turns out, one downstream payment service was returning 503s intermittently, but our overall error rate was low enough that the aggregated graph barely twitched. Twelve percent of checkout attempts were failing, and our monitoring setup was smiling at us.
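To make the arithmetic concrete, here is a minimal Python sketch of how a 12% checkout failure rate can vanish inside an aggregate. The endpoint names and traffic volumes are invented for illustration; only the 12% figure comes from the incident.

```python
# Hypothetical one-minute traffic sample: (endpoint, total requests, failed requests).
# Volumes are made up; checkout is set to fail 12% of the time, as in the incident.
traffic = [
    ("/api/products", 8000, 8),
    ("/api/search", 1500, 3),
    ("/api/checkout", 500, 60),
]


def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


total_requests = sum(total for _, total, _ in traffic)
total_failed = sum(failed for _, _, failed in traffic)

# The number an aggregated "5xx rate" panel would show.
overall = error_rate(total_failed, total_requests)

# The same data broken down by endpoint.
per_endpoint = {ep: error_rate(failed, total) for ep, total, failed in traffic}
worst = max(per_endpoint, key=per_endpoint.get)

print(f"overall error rate: {overall:.2%}")
for endpoint, rate in per_endpoint.items():
    print(f"{endpoint}: {rate:.2%}")
print(f"worst endpoint: {worst}")
```

With these assumed volumes the aggregate stays under 1% while checkout sits at 12%, because checkout is only a small slice of total traffic.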
That was the day I realized I didn't understand monitoring at all.
The Dashboard Trap
Dashboards are comfortable. They give you the feeling of being informed without requiring you to actually think. You glance at them, see green, and go back to your work. It's monitoring theater.
The problem with dashboards is that they only show you what you already know to look for. Every panel on a dashboard represents a question someone asked in the past. "What's our CPU usage?" "What's our request latency?" "How many 5xx errors are we returning?" These are fine questions. They're just not the questions that matter when something new goes wrong.
Production incidents are almost never caused by the thing you're already watching. They're caused by the thing you didn't think to watch. The connection pool that silently fills up. The disk that fills because a log rotation config got deleted three deploys ago. A third-party API that starts returning valid-looking but incorrect data. Your dashboard won't catch any of these because nobody built a panel for them.
I've worked at companies with 40-panel Grafana dashboards that missed outages, and I've worked at a startup with three alerts that caught everything. The difference wasn't the tooling. It was the thinking behind it.
Metrics Lie to You in Comfortable Ways
Averages are the worst offender. "Our average response time is 180ms." Great. But what does the distribution look like? If 95% of requests complete in 50ms and 5% take 3 seconds, your average is technically fine but 5% of your users are having a terrible experience. And those 5% are probably the ones doing something important, like checking out or uploading a file, because those are the operations that hit the slow path.
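A quick way to check this is to build the distribution and compare the mean with a few percentiles. The sketch below uses a synthetic sample of 1,000 requests with the split described above; the sample size and the nearest-rank percentile method are my own choices for illustration.

```python
import math
import statistics

# Synthetic sample: 95% of requests take 50 ms, 5% take 3000 ms.
latencies_ms = [50] * 950 + [3000] * 50


def percentile(values, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms} ms")
print(f"p50:  {p50_ms} ms")
print(f"p99:  {p99_ms} ms")
```

The mean lands just under 200 ms, which looks unremarkable on a dashboard, while the P99 sits at the full 3 seconds.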
I once inherited a system where the team proudly reported "99.9% uptime." Sounds great until you do the math. That's about 8 hours and 45 minutes of downtime per year. Spread across a year, it's barely noticeable in monthly reports. But those 8 hours happened to cluster into three incidents, one of which was a 4-hour outage during a product launch. "99.9% uptime" doesn't capture the fact that your system went down during the one moment it absolutely could not go down.
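The math is short enough to write down. This sketch assumes a 365-day year and treats the 4-hour launch outage as a share of the annual downtime that 99.9% allows; the function names are just illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def as_hours_minutes(hours: float) -> str:
    """Format a duration in hours as whole hours and minutes (rounded down)."""
    whole_minutes = int(hours * 60)
    return f"{whole_minutes // 60}h {whole_minutes % 60}m"


budget = downtime_budget_hours(99.9)
launch_outage_hours = 4
share_of_budget = launch_outage_hours / budget

print(f"99.9% allows about {as_hours_minutes(budget)} of downtime per year")
print(f"a single 4-hour outage uses {share_of_budget:.0%} of that")
```

One incident can consume close to half of the yearly allowance and still leave the headline number intact.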
Percentiles help, but they have their own blind spots. P99 latency is the threshold that the slowest 1% of requests exceed. But it doesn't tell you who those users are, what they were doing, or whether they retried and made the problem worse. A P99 spike could mean one user hit a cold cache, or it could mean your database is about to fall over. The number alone doesn't tell you which.
Request counts are another trap. "We're handling 10,000 requests per minute" tells you nothing about whether those requests are succeeding, failing gracefully, or returning garbage data with a 200 status code. I've seen systems that returned empty responses wrapped in a 200 OK because the error handling caught the exception and returned an empty body instead of propagating the failure. The request count looked healthy. The error rate looked healthy. Users were getting blank screens.
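One way to catch this class of failure is to count a response as successful only when it actually carries content. The sketch below is an assumption-heavy illustration: `is_real_success` is a hypothetical helper, the sample responses are invented, and the "non-empty body" rule only makes sense for endpoints that are always supposed to return one (a 204 No Content endpoint would need a different check).

```python
def is_real_success(status_code: int, body: bytes) -> bool:
    """Count a response as a success only if it is 2xx AND has a non-empty body.

    Assumption: this endpoint should always return content.
    """
    return 200 <= status_code < 300 and len(body.strip()) > 0


# Invented sample of (status code, body) pairs from a single endpoint.
responses = [
    (200, b'{"order_id": 1}'),
    (200, b""),                  # swallowed exception, empty body
    (200, b""),                  # swallowed exception, empty body
    (200, b'{"order_id": 2}'),
    (503, b""),
]

# What a status-code-only error rate would count.
status_failures = sum(1 for status, _ in responses if status >= 500)

# What users actually experienced.
real_failures = sum(1 for status, body in responses if not is_real_success(status, body))

print(f"failures by status code: {status_failures} of {len(responses)}")
print(f"failures by content:     {real_failures} of {len(responses)}")
```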
Alerts That Actually Work
The best alerting philosophy I've encountered came from a principal engineer I worked with early in my career. He had one rule: "Alert on symptoms, not causes."
Don't alert on high CPU. Alert on users experiencing slow responses. Don't alert on database connection count. Alert on failed transactions. Don't alert on memory usage. Alert on the thing that breaks when memory runs out.
The reasoning is simple. If CPU is at 90% but users are happy, you don't have a problem. You have a metric that looks scary. If CPU is at 40% but checkout is failing, you absolutely have a problem, and a CPU alert wouldn't have told you.
Symptom-based alerts also reduce noise. Cause-based alerts fire constantly because systems are always doing something that looks alarming in isolation. Disk usage spikes during backups. CPU spikes during deployments. Memory climbs during batch jobs. These are normal behaviors that generate false alerts, which train your team to ignore alerts, which means they ignore the real ones too.
Alert fatigue is the silent killer of on-call teams. I worked on a team that had 47 alerts configured. We got paged 15-20 times a week. After two months, nobody read the alerts anymore. They'd acknowledge the page, glance at the dashboard, see green, then go back to sleep. When a real incident happened, it took 40 minutes for someone to actually investigate because the first three people who got paged assumed it was another false alarm.
We reduced those 47 alerts to 9. All symptom-based. Pages dropped to 2-3 per week. Every single one was real. Response time went from 40 minutes to under 5. Fewer alerts, better outcomes.
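A symptom-based alert of the kind described in this section can be very small. The following is a rough sketch, not the setup that team used: the `SuccessRateAlert` class, the window size, and the minimum-sample guard are all assumptions. The point is that the only input is whether user-facing attempts succeeded; CPU, memory, and connection counts never enter into it.

```python
from collections import deque


class SuccessRateAlert:
    """Page when the success rate over the most recent attempts drops below a threshold."""

    def __init__(self, name: str, threshold: float, window_size: int, min_samples: int = 20):
        self.name = name
        self.threshold = threshold        # e.g. 0.95 means "page below 95% success"
        self.min_samples = min_samples    # avoid paging on a handful of requests
        self.outcomes = deque(maxlen=window_size)  # True = success, False = failure

    def record(self, succeeded: bool) -> None:
        self.outcomes.append(succeeded)

    def success_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_page(self) -> bool:
        rate = self.success_rate()
        if rate is None or len(self.outcomes) < self.min_samples:
            return False
        return rate < self.threshold


# Hypothetical usage: 100 checkout attempts, 12 of which fail.
checkout_alert = SuccessRateAlert("checkout success rate", threshold=0.95, window_size=100)
for attempt in range(100):
    checkout_alert.record(succeeded=(attempt % 100) >= 12)

print(checkout_alert.name, checkout_alert.success_rate(), checkout_alert.should_page())
```

The minimum-sample guard is one simple way to keep low-traffic periods from paging anyone; a production system would more likely evaluate this over a time window in its metrics backend.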
Logs Are Not Monitoring Either
"We have logging" is not the same as "we have observability." I've worked on systems with terabytes of logs that were completely useless during an incident. Unstructured log lines dumped to files that nobody searched unless something was already on fire.
2026-03-15 14:23:01 INFO Processing request for user 12345
2026-03-15 14:23:01 INFO Request completed successfully
What did the request do? How long did it take? What downstream services did it call? What parameters did it use? This log tells you that something happened and that it was fine. During normal operations, that's useless. During an incident, that's worse than useless because it gives you the illusion that you have information when you don't.
Structured logging changed how I think about this. When every log line is a JSON object with consistent fields, you can actually query your logs during an incident.
{
  "timestamp": "2026-03-15T14:23:01Z",
  "level": "INFO",
  "service": "checkout",
  "method": "POST",
  "path": "/api/orders",
  "user_id": "12345",
  "duration_ms": 342,
  "status": 200,
  "payment_provider": "stripe",
  "payment_status": "succeeded",
  "correlation_id": "abc-789"
}
Now I can ask real questions. Show me all requests where payment_status != succeeded in the last hour. Show me the P95 duration for /api/orders grouped by payment_provider. Show me all requests with this correlation_id across every service they touched. That's the difference between having logs and having observability.
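In practice you would run those queries in whatever log store you use, but the logic is simple enough to sketch in plain Python. Everything below except the first record is invented sample data ("backup_provider" and the "payments" service are hypothetical), and the time filter is left out to keep the example short.

```python
import json
import math
from collections import defaultdict

# A few JSON log lines. The first mirrors the example above; the rest are made up.
raw_lines = [
    '{"service": "checkout", "path": "/api/orders", "duration_ms": 342, '
    '"payment_provider": "stripe", "payment_status": "succeeded", "correlation_id": "abc-789"}',
    '{"service": "checkout", "path": "/api/orders", "duration_ms": 2950, '
    '"payment_provider": "stripe", "payment_status": "failed", "correlation_id": "abc-790"}',
    '{"service": "checkout", "path": "/api/orders", "duration_ms": 310, '
    '"payment_provider": "backup_provider", "payment_status": "succeeded", "correlation_id": "abc-791"}',
    '{"service": "payments", "path": "/internal/charge", "duration_ms": 120, '
    '"payment_provider": "stripe", "payment_status": "succeeded", "correlation_id": "abc-789"}',
]
records = [json.loads(line) for line in raw_lines]


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


# Question 1: all requests where payment_status != succeeded.
failed_payments = [r for r in records if r["payment_status"] != "succeeded"]

# Question 2: P95 duration for /api/orders, grouped by payment_provider.
durations = defaultdict(list)
for r in records:
    if r["path"] == "/api/orders":
        durations[r["payment_provider"]].append(r["duration_ms"])
p95_by_provider = {provider: p95(values) for provider, values in durations.items()}

# Question 3: every record for one correlation_id, across services.
request_trail = [r for r in records if r["correlation_id"] == "abc-789"]

print([r["correlation_id"] for r in failed_payments])
print(p95_by_provider)
print([r["service"] for r in request_trail])
```

None of these questions can be answered from the two unstructured lines shown earlier, which is the whole argument for consistent fields.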
But even structured logs have limits. They tell you what happened inside your services. They don't tell you the path a request took through your system, how long each hop took, or where the bottleneck is. That's what traces are for.
The Three Pillars Are Not Equal
Everyone talks about the three pillars of observability: metrics, logs, and traces. What nobody tells you is that they're not equally useful, and most teams invest in the wrong order.
Most teams start with metrics because they're easy. Slap a Prometheus exporter on your service, point Grafana at it, build some dashboards. Done in an afternoon. Logging gets added next, usually by just printing more stuff to stdout. Traces come last, if they come at all, because distributed tracing feels complicated and the setup is annoying.
This is backwards. Here's what I'd recommend if I were starting from scratch.
Start with structured logging. Every request gets a correlation ID. Every log line includes that ID, the service name, the operation, the duration, and the outcome. You can derive most of the metrics you need from well-structured logs. And when something breaks, you can trace a request manually by searching for its correlation ID.
Add symptom-based alerts early. Before you build any dashboards, define the 5-10 things that matter to your users. Can they log in? Can they load their data? Can they complete the core workflow? Alert on those. You can set these up with basic health checks and synthetic monitors before you have any fancy tooling.
Add traces when you have more than two services. If you're running a monolith, you probably don't need distributed tracing. Your logs and a profiler will tell you what's slow. But the moment requests start crossing service boundaries, traces become essential. Without them, debugging a slow request that touches four services is just guessing.
Build dashboards last. Dashboards are for understanding trends over time, not for detecting incidents. Build them after you have alerts that work and logs you can search. A dashboard should answer "is this getting worse over time?" not "is something broken right now?"
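For the first step, structured logging with a correlation ID, here is a minimal sketch using only Python's standard library. It is one possible shape, not a prescribed one: `JsonFormatter`, `build_logger`, `handle_order`, and the field names are assumptions chosen to match the fields listed above, and the order-processing work is a placeholder.

```python
import contextvars
import io
import json
import logging
import time
import uuid

SERVICE_NAME = "checkout"  # hypothetical service name, borrowed from the JSON example

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="unknown")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a consistent set of fields."""

    converter = time.gmtime  # emit timestamps in UTC

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": SERVICE_NAME,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        }
        # Per-call fields arrive through logging's `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(SERVICE_NAME)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_order(logger: logging.Logger, user_id: str, incoming_correlation_id=None) -> None:
    # Reuse the caller's ID when one was passed along; otherwise start a new one.
    correlation_id.set(incoming_correlation_id or str(uuid.uuid4()))
    started = time.perf_counter()
    outcome = "succeeded"  # stand-in for the real order-processing work
    duration_ms = round((time.perf_counter() - started) * 1000)
    logger.info(
        "order processed",
        extra={"fields": {
            "operation": "create_order",
            "user_id": user_id,
            "duration_ms": duration_ms,
            "outcome": outcome,
        }},
    )


buffer = io.StringIO()  # a real service would log to stdout instead
log = build_logger(buffer)
handle_order(log, "12345", incoming_correlation_id="abc-789")
print(buffer.getvalue(), end="")
```

Keeping the correlation ID in a context variable means individual log calls don't have to pass it around, and forwarding the same ID to downstream services is what makes the manual "search by correlation ID" trace possible.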
What Good Monitoring Feels Like
When monitoring is working, incidents feel different. You don't find out about problems from users. You find out from your alerts, and the alert tells you enough to start investigating immediately. You don't spend the first 30 minutes of an incident figuring out what's broken. You spend it figuring out why.
At the best-run team I've been on, our incident flow looked like this. Alert fires: "checkout success rate dropped below 95%." On-call opens the runbook linked in the alert. Runbook says: check payment provider status page, check recent deploys, check database latency. Within 10 minutes, we know whether it's us or a third party. If it's us, we have traces showing exactly which requests are failing and where. If it's a third party, we flip the feature flag to the backup provider and page the vendor.
That didn't happen because we had better tools than anyone else. We used the same Grafana, the same Prometheus, the same ELK stack that everyone uses. It worked because someone sat down and thought about what could go wrong, what we'd need to know when it did, and what we'd do about it. Thinking was the hard part. The tooling was just implementation.
The Question That Changes Everything
If you take one thing from this post, take this question: "If this breaks at 2 AM, what information do I need to fix it without being fully awake?"
That question should drive every monitoring decision you make. Every alert, every log line, every dashboard panel. If it doesn't help a sleep-deprived engineer at 2 AM understand what's wrong and what to do about it, it's not monitoring. It's decoration.
Your dashboards are probably fine. Your alerts might even be fine. But are they actually helping you understand your system, or just making you feel like you understand it? Big difference. And you usually don't find out which one you have until something breaks.