
CQRS Sounds Fancy Until You Have to Debug It

CQRS separates reads from writes but not bugs from confusion about which side caused them. Here is when the pattern helps and when it just adds complexity.

CQRS stands for Command Query Responsibility Segregation. A mouthful, which should have been a warning. The idea is clean: your write model handles commands like create order, cancel subscription, or update profile, and your read model handles queries, using separate data structures optimized for how the UI actually needs to display data. Two sides, no shared schema. The write side writes. The read side reads. Events flow from one to the other, keeping them in sync.
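To make the shape concrete, here is a minimal in-memory sketch of the two sides. Every name in it (OrderPlaced, WriteModel, DashboardProjection, drain) is hypothetical, and a deque stands in for whatever message broker carries the events. It is an illustration of the idea, not a reference implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Event emitted by the write side (hypothetical name)."""
    order_id: str
    customer: str
    total_cents: int


class WriteModel:
    """Command side: validates, stores authoritative state, emits events."""

    def __init__(self, outbox: deque):
        self.orders = {}      # order_id -> authoritative record
        self.outbox = outbox  # stand-in for a message broker

    def place_order(self, order_id: str, customer: str, total_cents: int) -> None:
        if order_id in self.orders:
            raise ValueError(f"order {order_id} already exists")
        self.orders[order_id] = {"customer": customer, "total_cents": total_cents}
        self.outbox.append(OrderPlaced(order_id, customer, total_cents))


class DashboardProjection:
    """Query side: a denormalized view shaped for a single screen."""

    def __init__(self):
        self.recent_orders_by_customer = {}

    def apply(self, event: OrderPlaced) -> None:
        rows = self.recent_orders_by_customer.setdefault(event.customer, [])
        rows.append({"order_id": event.order_id, "total_cents": event.total_cents})


def drain(outbox: deque, projection: DashboardProjection) -> None:
    """Consumer loop. In a real system this runs asynchronously, some time later."""
    while outbox:
        projection.apply(outbox.popleft())


outbox = deque()
write_side = WriteModel(outbox)
read_side = DashboardProjection()

write_side.place_order("o-1", "alice", 4200)
# The command has succeeded, but the consumer has not run yet.
stale_view = read_side.recent_orders_by_customer.get("alice", [])
drain(outbox, read_side)
fresh_view = read_side.recent_orders_by_customer["alice"]
```

The window between place_order returning and drain running is where stale_view comes from, and most of what follows is about that window.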

When you first see it laid out on a whiteboard, it makes sense. Read and write workloads are fundamentally different. Writes need consistency. Reads need speed and shape. Why force them through the same model? Your write model can be normalized and integrity-enforced. Your read model can be a flat, denormalized document perfectly shaped for a single screen. No joins. No N+1 queries. No impedance mismatch between your domain model and your UI. It sounds like an engineering upgrade.

Then you ship it, and someone files a bug.

The Bug Report That Started It

The ticket says something like: "User updated their profile but the app still shows the old name." Or: "Order was placed but the dashboard says no recent orders." Or the worst one: "The totals don't match." In a traditional system, this is usually a simple investigation. You look at the database. The data is either there or it isn't. You look at the code path that wrote it. Something is wrong and it's in one place.

With CQRS, you now have a new question to answer before you can even start debugging: which side is wrong?

Is the command side failing to process the command? Did it process the command but fail to publish the event? Did the event publish but the read model consumer fail to process it? Did the consumer process it but write to the wrong read model? Did it write to the right read model but the data transformation was incorrect? Did everything work correctly but the read is hitting a cache that hasn't expired?

Each of these is a different failure mode in a different part of the system. And they all produce the same symptom: the user sees stale or incorrect data. Working backwards from the symptom to the cause in a CQRS system is not harder by a little. It's harder by an order of magnitude, because the gap between cause and effect is now filled with asynchronous infrastructure.
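One way to make that checklist mechanical is to gather a version number at each hop and walk the pipeline in order. The sketch below assumes every write bumps a per-entity version and that you can look that version up at each stage. The Evidence fields are hypothetical lookups, not something a framework hands you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """What you found for one entity. Every field is a hypothetical lookup."""
    write_version: Optional[int]      # version in the write store, None if never written
    published_version: Optional[int]  # highest version present in the event log
    consumed_version: Optional[int]   # highest version the consumer acknowledged
    read_version: Optional[int]       # version stored in the expected read model
    read_matches_write: bool          # projected fields agree with the write side
    served_version: Optional[int]     # version the API (or cache) actually returned


def seen(version: Optional[int]) -> int:
    """Treat 'nothing found' as older than any real version."""
    return -1 if version is None else version


def first_broken_stage(e: Evidence) -> str:
    """Walk the pipeline in order and name the first stage that fell behind."""
    if e.write_version is None:
        return "command not processed"
    if seen(e.published_version) < e.write_version:
        return "event not published"
    if seen(e.consumed_version) < e.write_version:
        return "consumer lagging or failing"
    if seen(e.read_version) < e.write_version:
        return "consumer wrote to the wrong read model"
    if not e.read_matches_write:
        return "projection transformation incorrect"
    if seen(e.served_version) < seen(e.read_version):
        return "stale cache in front of the read model"
    return "no divergence found"


lagging = first_broken_stage(Evidence(3, 3, 2, 2, True, 2))
cached = first_broken_stage(Evidence(3, 3, 3, 3, True, 2))
```

The function is trivial. The expensive part is filling in Evidence, because each field lives in a different system.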

The Two-Database Problem

Most CQRS implementations use a separate data store for the read model. You write your domain state to PostgreSQL and project it into Redis, Elasticsearch, or a read-optimized set of denormalized tables. This gives you the query performance wins the pattern promises. The price: you now have two databases to look at when something is wrong.

In a traditional system, "show me the state of this record" means one query. In CQRS, you need to look at the write side to see what the authoritative state is, look at the read side to see what the application is actually serving, compare them, figure out whether they differ, and if they differ, figure out why the sync broke.
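That comparison can be scripted. The following sketch assumes both sides carry a version field and that the projection logic is available as a plain function. The names project_user and compare_sides are hypothetical, as is the user record shape.

```python
from typing import Optional


def project_user(row: dict) -> dict:
    """Hypothetical projection: the shape the read model is supposed to hold."""
    return {
        "id": row["id"],
        "display_name": f"{row['first_name']} {row['last_name']}",
        "version": row["version"],
    }


def compare_sides(write_row: dict, read_doc: Optional[dict]) -> dict:
    """Project the authoritative row and report how the served document differs."""
    expected = project_user(write_row)
    if read_doc is None:
        return {"status": "missing", "diff": {k: (v, None) for k, v in expected.items()}}
    diff = {
        key: (expected[key], read_doc.get(key))
        for key in expected
        if expected[key] != read_doc.get(key)
    }
    if not diff:
        status = "in_sync"
    elif read_doc.get("version", -1) < expected["version"]:
        status = "read_side_behind"   # sync has not caught up yet
    else:
        status = "projection_bug"     # same version, different content
    return {"status": status, "diff": diff}


write_row = {"id": "u1", "first_name": "Ada", "last_name": "Lovelace", "version": 4}
behind = compare_sides(write_row, {"id": "u1", "display_name": "Ada Byron", "version": 3})
buggy = compare_sides(write_row, {"id": "u1", "display_name": "Ada", "version": 4})
```

A "read_side_behind" result points at lag or a stuck consumer. A "projection_bug" result points at the transformation code. Which one you get decides which system you open next.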

I once spent three hours debugging what turned out to be a consumer group that had accumulated a 90-second lag. Write side fine. Read side fine. Data correct in both places. The read side just hadn't received the event yet. Meanwhile, the user was refreshing the page every five seconds during those 90 seconds and filing increasingly frantic support tickets. Once the lag cleared, everything looked fine and the issue was "resolved." We still don't know exactly why the lag spiked. The consumer processing rate dropped for a few minutes and then recovered. In a traditional system, there's no such thing as a "lag spike" causing data to temporarily look wrong. In CQRS, it's a failure mode you carry permanently.

Read Your Own Writes

Here's the one that surprises teams most: a user submits a form and is immediately redirected to a page that doesn't show what they just submitted. They think the form didn't work. They submit again. Now you have a duplicate.

This is the read-your-own-writes problem, and it's endemic to CQRS with eventual consistency. Commands are processed synchronously. An event publishes. Meanwhile, the read model consumer picks it up asynchronously. Between the command completing and the consumer updating the read model, the gap is usually small. Tens to hundreds of milliseconds. But it's nonzero, and nonzero is enough to produce a broken user experience.

There are several standard workarounds. You can return the updated projection directly in the command response, bypassing the read model for the immediate redirect. You can add a version number to your read model and poll until the projection reflects the version you just wrote. You can add a short delay before the redirect, which is the approach nobody admits to using but everyone has tried. You can use sticky reads that go directly to the write model for a window of time after a write.
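As an illustration of the version-polling workaround, here is a minimal sketch. It assumes the command response tells you the version it just wrote and that the read model exposes the version it currently holds. wait_for_version and FakeReadModel are hypothetical names, and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for_version(
    read_version: Callable[[], Optional[int]],
    expected: int,
    timeout_s: float = 1.0,
    interval_s: float = 0.02,
) -> bool:
    """Poll the read model until it reflects `expected`, or give up at the deadline."""
    deadline = time.monotonic() + timeout_s
    while True:
        current = read_version()
        if current is not None and current >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class FakeReadModel:
    """Test double: reports version 6 until the Nth poll, then version 7."""

    def __init__(self, catches_up_after: int):
        self.polls = 0
        self.catches_up_after = catches_up_after

    def profile_version(self) -> int:
        self.polls += 1
        return 7 if self.polls >= self.catches_up_after else 6


caught_up = wait_for_version(FakeReadModel(3).profile_version, expected=7)
timed_out = wait_for_version(
    FakeReadModel(10**9).profile_version, expected=7, timeout_s=0.05
)
```

The False branch still needs an answer. One option is to fall back to the first workaround and render the redirect target from the data in the command response.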

Each of these works. Each of them is additional engineering. Each of them is complexity you're adding to compensate for a property of the architecture, not because your domain requires it. You started with CQRS to simplify your data access. You're now building a synchronization protocol between your two models just to make a form submission feel correct.

Projection Rebuilds

At some point, you'll need to rebuild your read model. Maybe you find a bug in the projection logic. Maybe you want to add a new field. Maybe you decide to switch from Redis to Elasticsearch and need to backfill everything. Whatever the reason, you need to throw away your current read model and reconstruct it from the canonical state on the write side.

In a system with one database, schema migrations are already painful. With CQRS, you have a second schema to migrate that's derived from the first, and the derivation process can take hours.

While the rebuild is running, what do you serve? You can take the read model offline, which means taking whatever feature depends on it offline too. You can serve stale data from the old model until the new one is ready, which works until the old model is no longer valid because the schema changed. You can run dual writes to both models during the transition, which requires careful coordination to avoid serving inconsistent results from whichever model a given request hits.
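The dual-write option can be sketched as a side-by-side rebuild: backfill a new generation of the projection while live events are applied to both generations, then switch reads over. The sketch assumes every row carries a version so that an older snapshot row cannot overwrite a newer event. ProjectionStore and its methods are hypothetical names, not an existing library.

```python
from typing import Callable, Iterable

Row = dict
Projector = Callable[[Row], dict]


class ProjectionStore:
    """Read store holding several generations of one projection (hypothetical)."""

    def __init__(self, projector: Projector):
        self.tables = {"v1": {}}
        self.projectors = {"v1": projector}
        self.live = "v1"  # the generation queries are served from

    def query(self, key: str):
        return self.tables[self.live].get(key)

    def upsert(self, generation: str, row: Row) -> None:
        """Write a row unless the table already holds a newer version of it."""
        table = self.tables[generation]
        existing = table.get(row["id"])
        if existing is None or existing["version"] <= row["version"]:
            doc = self.projectors[generation](row)
            doc["version"] = row["version"]
            table[row["id"]] = doc

    def apply_event(self, row: Row) -> None:
        """Live traffic goes to every generation while a rebuild is running."""
        for generation in self.tables:
            self.upsert(generation, row)

    def start_rebuild(self, generation: str, projector: Projector) -> None:
        self.tables[generation] = {}
        self.projectors[generation] = projector

    def backfill(self, generation: str, rows: Iterable[Row]) -> None:
        for row in rows:
            self.upsert(generation, row)

    def promote(self, generation: str) -> None:
        """Switch reads to the new generation and drop the old one."""
        old, self.live = self.live, generation
        del self.tables[old]
        del self.projectors[old]


store = ProjectionStore(lambda r: {"name": r["name"]})
store.apply_event({"id": "u1", "name": "Ann", "version": 1})
snapshot = [{"id": "u1", "name": "Ann", "version": 1}]  # taken before the rebuild

store.start_rebuild("v2", lambda r: {"name": r["name"], "initial": r["name"][0]})
store.apply_event({"id": "u1", "name": "Bea", "version": 2})  # arrives mid-rebuild
during = store.query("u1")      # still served from v1
store.backfill("v2", snapshot)  # the older snapshot row is ignored for u1
store.promote("v2")
after = store.query("u1")
```

The version guard in upsert is what keeps the backfill and the live event stream from corrupting each other. Without it, the snapshot row would overwrite the newer name.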

I've been in production incidents triggered by projection rebuilds. Not because the rebuild itself failed, but because the rebuild consumed enough database I/O to slow down the primary write path, which caused command processing to degrade, which caused users to see errors while trying to make changes, which caused a spike in support tickets, which caused a postmortem about why we didn't test the I/O impact of a full rebuild in staging. We tested it in staging after that. The rebuild took four minutes in staging and forty-two minutes in production because staging had 2% of the data.

Tracing Across the Seam

Modern observability tools are good at tracing requests. A trace ID flows through your request context, and you can see the full chain from API gateway to database and back. CQRS breaks this.

The command goes in. It's processed synchronously. The trace ends. An event is published. Some time later, an entirely separate process (a consumer, running in a different thread or a different service) picks up the event and updates the read model. That consumer has no trace context from the original request unless you explicitly pass it through the event payload.

In practice, this means that when you're investigating why a read model update failed, your traces show two disconnected operations: the command processing (succeeded, no errors) and the consumer processing (failed, no correlated ID, no way to link back to which user triggered it without correlating on a domain entity ID and scanning logs). You're not flying blind, but you're flying with much less instrumentation than you're used to, and what instrumentation remains is something you build by hand rather than get from your framework for free.
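Building that by hand usually means copying the trace ID into the event's metadata on publish and restoring it on consume. The sketch below uses only the standard library. The event envelope, the "meta" key, and the function names are assumptions made for illustration, and a real tracing library would ship its own propagation helpers.

```python
import contextvars
import uuid
from collections import deque

# Holds the trace ID of whatever request or event is currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


def publish(queue: deque, event_type: str, payload: dict) -> None:
    """Copy the current trace ID into the event so it survives the async hop."""
    queue.append({
        "type": event_type,
        "payload": payload,
        "meta": {"trace_id": trace_id_var.get()},
    })


def handle_update_profile(queue: deque, user_id: str, name: str) -> str:
    """Command handler: starts a trace, publishes, returns the trace ID."""
    token = trace_id_var.set(uuid.uuid4().hex)
    try:
        publish(queue, "ProfileUpdated", {"user_id": user_id, "name": name})
        return trace_id_var.get()
    finally:
        trace_id_var.reset(token)


def consume(queue: deque, read_model: dict, log: list) -> None:
    """Consumer: restores the trace ID from the event before doing any work."""
    while queue:
        event = queue.popleft()
        token = trace_id_var.set(event["meta"].get("trace_id"))
        try:
            read_model[event["payload"]["user_id"]] = event["payload"]["name"]
            log.append((trace_id_var.get(), event["type"]))
        finally:
            trace_id_var.reset(token)


queue, read_model, log = deque(), {}, []
request_trace = handle_update_profile(queue, "u1", "Ada")
consume(queue, read_model, log)
```

With the ID restored on the consumer side, a failed projection update can be tied back to the request that caused it without scanning logs by entity ID.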

When CQRS Actually Helps

I want to be accurate here. CQRS is not universally wrong. It has a real use case and it delivers on its promises when that use case is present.

If your read workload and write workload are genuinely different in shape and volume (many more reads than writes, reads that need complex aggregations across multiple entities, reads that need to be served from different stores based on the query), then separating the models lets you optimize each side independently. You can scale your read replicas without scaling your write database. You can index your read model for the specific queries your UI makes without polluting your normalized write schema with read-specific denormalization.

If your application has multiple consumers of the same state who each need it shaped differently (an API returning JSON, a reporting system needing aggregates, a search index needing full-text), CQRS lets each consumer maintain its own projection without forcing the write model to serve all of them. That's a genuine win when those consumers would otherwise be pulling the write model in incompatible directions.

And if you're pairing CQRS with event sourcing because your events are the authoritative record and state is derived, then the pattern fits naturally. The projections are just materialized views of the event stream. They're expected to be eventually consistent. The architecture is honest about that.

The problem is that most teams adopt CQRS before they have these problems. They adopt it as an architectural principle, applied uniformly to every part of the system, before they have evidence that the read/write shapes actually diverge enough to justify the overhead.

The Test You Should Run First

Before adding CQRS to a service, answer two questions honestly.

First: what does your read model look like compared to your write model? If they're largely the same shape (you write a user record and you read a user record), you're adding complexity without benefit. The mismatch that CQRS optimizes for doesn't exist.

Second: what happens when the read model is 30 seconds behind the write model? If the answer is "users see stale data but it's acceptable and temporary," you can probably live with CQRS's eventual consistency. If the answer is "users might take actions based on stale data that we then have to unwind," you have a consistency requirement that CQRS doesn't actually solve. It just hides it.

The Debugging Experience Is the Feature Tax

Every architectural pattern has a feature tax. The tax for CQRS is the debugging experience. You pay it every time something goes wrong and you spend the first twenty minutes figuring out which side of the seam the bug is on. You pay it every time a projection rebuild goes longer than expected. You pay it every time a new engineer joins and needs to understand why there are two databases and why they sometimes disagree and which one to trust.

For systems where the read/write optimization is real and the gains are measurable, the tax is worth paying. For systems where it's theoretical and aspirational, you're paying the tax without collecting the benefit.

CQRS is a good pattern for the right problem. The right problem is not "we want to write clean code." It's "our read and write workloads are genuinely divergent and the simplicity of a single model is actively hurting us." If you can't point to evidence of that pain in your current system, the pattern is probably solving a problem you don't have yet, and the debugging complexity will arrive before the performance benefits do.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
