
Sagas Are Not Transactions

Sagas replace ACID transactions with compensation actions, not rollbacks. Intermediate states are visible to other services, and compensations can fail too.

Booking confirmed the order. Payment charged the card. Inventory reserved the items. Then the shipping service failed. Three operations succeeded, one failed. Now what? How do you undo the first three?

The answer was a saga. The team added compensation logic: cancel the reservation, refund the payment, mark the order as failed. It looked clean on the whiteboard. It worked in staging. Three weeks later, the refund service was down during a shipping failure. The saga released the inventory, failed on the refund, and stopped. Cards stayed charged and orders stayed confirmed for goods that would never ship. No one noticed for two days.

Sagas solve a real problem. They also introduce a class of failure modes that most teams only discover after deploying to production.

What a Saga Actually Is

A saga is a sequence of local transactions. Each step updates a single service and publishes an event or sends a command to trigger the next step. If a step fails, the saga executes compensating transactions in reverse order to undo the steps that already completed.
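As a minimal sketch of that definition, the following runs a list of steps in order and, when one fails, runs the compensations of the completed steps in reverse. The `Step` and `run_saga` names are hypothetical, the "services" are just functions appending to a list, and a failing compensation is deliberately left unhandled here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # local transaction in one service
    compensate: Callable[[], None]  # new transaction that reverses its effect


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not commit, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()  # may itself raise; not handled in this sketch
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipping() -> None:
    raise RuntimeError("shipping service unavailable")


steps = [
    Step("order", lambda: log.append("order confirmed"),
         lambda: log.append("order marked failed")),
    Step("payment", lambda: log.append("card charged"),
         lambda: log.append("payment refunded")),
    Step("inventory", lambda: log.append("items reserved"),
         lambda: log.append("reservation cancelled")),
    Step("shipping", fail_shipping, lambda: None),
]

succeeded = run_saga(steps)
```

After the run, `succeeded` is `False` and `log` holds the three forward actions followed by the three compensations in reverse order.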

There are two coordination patterns. Choreography: each service listens for events and knows what to do next. No central coordinator. The saga emerges from the interactions between services. Orchestration: a central process tells each service what to do and tracks the state. The logic lives in one place.

Choreography feels simpler to start. Orchestration is easier to debug after the first incident. Both handle failures the same way: by running compensations.

Compensation Is Not a Rollback

This is the thing that surprises teams. A database rollback undoes a transaction as if it never happened. No other transaction sees the intermediate state. The data returns to exactly what it was before. This is what most engineers expect when they hear "compensate on failure."

A saga compensation does not do this.

By the time the compensation runs, the original transaction has already committed. Other services have already read that state. Downstream systems may have already acted on it. An email went out. Analytics fired an event. Some external API got called. Compensation cannot reach back and erase those side effects. It can only create a new transaction that attempts to reverse the observable state.

If an order is created and then cancelled, the database does not end up in the state it was in before the order existed. It ends up in a state where an order exists with status CANCELLED. That is not the same thing. A compensated saga leaves a trail.
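The difference is easy to see against a single database. The sketch below uses an in-memory SQLite table with an assumed `orders(id, status)` schema: the first insert is rolled back, the second is committed and then compensated by a separate transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.commit()

# Rollback: the insert never commits, so the table is exactly as it was.
conn.execute("INSERT INTO orders (id, status) VALUES (1, 'CONFIRMED')")
conn.rollback()
rows_after_rollback = conn.execute("SELECT id, status FROM orders").fetchall()

# Compensation: the insert commits, then a second transaction reverses it.
conn.execute("INSERT INTO orders (id, status) VALUES (2, 'CONFIRMED')")
conn.commit()
conn.execute("UPDATE orders SET status = 'CANCELLED' WHERE id = 2")
conn.commit()
rows_after_compensation = conn.execute("SELECT id, status FROM orders").fetchall()
```

`rows_after_rollback` is empty, while `rows_after_compensation` still contains order 2 with status `CANCELLED`. Between the two commits, any reader could have seen it as `CONFIRMED`.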

Intermediate States Are Visible

Transactions provide isolation. Between the start and commit of a transaction, other transactions cannot see the in-progress state. Sagas provide no isolation.

When a saga is running, every step is immediately committed. The order moves through: PENDING, then PAYMENT_CAPTURED, then INVENTORY_RESERVED, then SHIPMENT_FAILED, then INVENTORY_RELEASED, then PAYMENT_REFUNDED, then FAILED. Every one of those states is visible to any reader at the moment it occurs.

Another service that reads order state during a running saga may see an order in PAYMENT_CAPTURED state and draw conclusions from it. It might send a confirmation email. It might update a dashboard. When the saga compensates and the order ends up FAILED, those actions have already happened based on a state that no longer reflects reality.

This is the saga equivalent of a dirty read: a reader sees state written by a saga that has not finished and may still be undone. It is not a bug in your saga implementation. It is the expected behavior of the pattern. If you need to prevent it, you need to design around it: semantic locks that flag records as in progress, ordering steps so the riskiest updates happen last, or avoiding states that external systems should not act on until the saga completes.
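One way to design around it, in the spirit of a semantic lock, is to treat every non-terminal status as "in progress" and have readers refuse to act on it. This is a sketch only: the helper names are hypothetical, and `COMPLETED` is an assumed name for the success outcome.

```python
# FAILED comes from the order lifecycle above; COMPLETED is an assumed name
# for the successful terminal state.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def saga_in_flight(status: str) -> bool:
    """Any non-terminal status means the saga may still be undone."""
    return status not in TERMINAL_STATES


def should_send_confirmation(status: str) -> bool:
    """Readers with irreversible side effects act only on the success state."""
    return status == "COMPLETED"
```

With this rule, a notification service that sees `PAYMENT_CAPTURED` waits instead of emailing the customer. The cost is latency: the confirmation goes out only after the last step commits.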

What Happens When Compensation Fails

This is the question teams do not ask until it has already happened.

Shipping fails. Compensation kicks in. Inventory releases successfully. Then the payment refund fails because the payment processor is down.

Now you have a saga stuck mid-compensation. The order is partially unwound. Some state has been reversed, some has not. Your saga framework will retry the failed compensation. Eventually it will either succeed or exhaust its retry budget.

If it exhausts the retry budget, the saga is in a stuck state. The only way out is a manual intervention or a dead-letter process that operators handle. This is not a failure of your implementation. It is the fundamental limitation of the pattern. Distributed systems can fail at any point. Compensations are just more local transactions, and local transactions can fail too.
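A retry budget with a dead-letter fallback might look roughly like this. It assumes exponential backoff with full jitter as the retry policy. The function names, the default limits, and the in-memory `dead_letters` list are all illustrative; a real system would persist parked sagas somewhere operators can see them.

```python
import random
import time
from typing import Callable, List


def retry_with_backoff(
    operation: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the operation eventually succeeds within the budget."""
    for attempt in range(max_attempts):
        try:
            operation()
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break  # budget exhausted, no point sleeping again
            # Full jitter: random delay up to the capped exponential bound.
            bound = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, bound))
    return False


dead_letters: List[str] = []  # stand-in for a durable dead-letter queue


def compensate_or_park(saga_id: str, compensation: Callable[[], None], **retry_kwargs) -> None:
    """Retry a compensation; park the saga for operators if the budget runs out."""
    if not retry_with_backoff(compensation, **retry_kwargs):
        dead_letters.append(saga_id)


attempts = {"count": 0}


def flaky_refund() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("payment processor down")


def always_down() -> None:
    raise ConnectionError("payment processor down")


delays: List[float] = []
recovered = retry_with_backoff(flaky_refund, sleep=delays.append)
compensate_or_park("saga-42", always_down, max_attempts=3, sleep=lambda _: None)
```

Here `flaky_refund` succeeds on its third attempt, so `recovered` is `True`. The saga whose refund never succeeds ends up in `dead_letters` instead of silently stopping.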

Production-grade saga implementations need:

  • A persistent saga state store that survives process restarts
  • A retry mechanism with exponential backoff and jitter for failed steps and failed compensations
  • A dead-letter queue or manual resolution process for sagas that cannot complete compensation
  • Monitoring that alerts when a saga has been stuck longer than expected
  • Idempotent compensation handlers so retries do not cause double-reversal

The last point matters more than most implementations account for. If your refund compensation is retried three times because the first attempt timed out but actually succeeded, you need the refund handler to detect that the refund already happened and return success without issuing a second refund. This requires idempotency keys on every compensation action and deduplication on the receiving end.
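A sketch of that deduplication, assuming the caller derives a stable idempotency key per saga and compensation step. `RefundHandler` and the key format are hypothetical, and an in-memory dict stands in for what would need to be a durable store with an atomic check-and-insert, such as a unique constraint on the key.

```python
from typing import Dict


class RefundHandler:
    def __init__(self) -> None:
        self.processed: Dict[str, str] = {}  # idempotency key -> refund id
        self.refunds_issued = 0

    def refund(self, idempotency_key: str, amount_cents: int) -> str:
        # A retry with a known key returns the earlier result, no new refund.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.refunds_issued += 1
        refund_id = f"refund-{self.refunds_issued}"
        self.processed[idempotency_key] = refund_id
        return refund_id


handler = RefundHandler()
key = "saga-42:refund-payment"  # assumed format: saga id plus step name
first = handler.refund(key, 4999)
second = handler.refund(key, 4999)  # retry after a timeout
```

Both calls return the same refund id and `handler.refunds_issued` stays at 1, so the retry is safe even though the first attempt had already succeeded.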

Choreography Has a Debugging Problem

In a choreographed saga, the flow is implicit. The order service emits ORDER_CREATED. The payment service listens and charges the card. On success, it emits PAYMENT_CAPTURED. The inventory service listens and reserves items. And so on.
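To make "implicit" concrete, here is a toy in-process event bus, which is an assumption for illustration rather than a real broker. Each service only knows which event it reacts to and which event it emits. `INVENTORY_RESERVED` as an event name is also assumed.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

handlers: DefaultDict[str, List[Callable[[Dict], None]]] = defaultdict(list)
emitted: List[str] = []


def subscribe(event_type: str, handler: Callable[[Dict], None]) -> None:
    handlers[event_type].append(handler)


def emit(event_type: str, payload: Dict) -> None:
    emitted.append(event_type)
    for handler in list(handlers[event_type]):
        handler(payload)


def payment_service(order: Dict) -> None:
    # Charges the card (omitted), then announces the result.
    emit("PAYMENT_CAPTURED", order)


def inventory_service(order: Dict) -> None:
    # Reserves items (omitted), then announces the result.
    emit("INVENTORY_RESERVED", order)


subscribe("ORDER_CREATED", payment_service)
subscribe("PAYMENT_CAPTURED", inventory_service)

emit("ORDER_CREATED", {"order_id": 1})
```

No function in this sketch describes the whole flow. The sequence in `emitted` exists only as a consequence of who subscribed to what.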

When something goes wrong, there is no single place to look. The flow is distributed across the event logs of every service involved. Reconstructing what happened requires correlating events by saga ID across multiple message queues or event streams, sorting them by timestamp, and figuring out which step is stuck.
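The reconstruction step can be sketched as a filter and a sort, assuming every event carries a `saga_id` and a timestamp. The field names and sample events below are made up, and ordering by timestamp is only as reliable as the clocks of the services that wrote them.

```python
from typing import Dict, List

# Events as they might be collected from several services' logs, out of order.
events: List[Dict] = [
    {"saga_id": "s-1", "service": "inventory", "type": "INVENTORY_RESERVED", "ts": 1030},
    {"saga_id": "s-2", "service": "order", "type": "ORDER_CREATED", "ts": 1005},
    {"saga_id": "s-1", "service": "order", "type": "ORDER_CREATED", "ts": 1000},
    {"saga_id": "s-1", "service": "payment", "type": "PAYMENT_CAPTURED", "ts": 1012},
]


def timeline(all_events: List[Dict], saga_id: str) -> List[Dict]:
    """Collect one saga's events across services, ordered by timestamp."""
    own = [e for e in all_events if e["saga_id"] == saga_id]
    return sorted(own, key=lambda e: e["ts"])


history = timeline(events, "s-1")
last_seen = history[-1]["type"]  # the last step that is known to have happened
```

For saga `s-1`, `last_seen` is `INVENTORY_RESERVED`, which tells you where the flow stopped but not why. That answer still lives in whichever service was supposed to react next.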

Orchestrated sagas put the flow in one place. The orchestrator has a state machine. You can query it: show me all sagas in COMPENSATION_FAILED state. You can see exactly what step failed, when it failed, and how many times it has been retried. Debugging takes minutes instead of hours.
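With a persistent state store, that query is a single statement. The sketch below assumes a `sagas` table with hypothetical columns `saga_id`, `state`, `failed_step`, and `retry_count`, using in-memory SQLite as a stand-in for the orchestrator's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sagas ("
    "saga_id TEXT PRIMARY KEY, state TEXT, failed_step TEXT, retry_count INTEGER)"
)
conn.executemany(
    "INSERT INTO sagas VALUES (?, ?, ?, ?)",
    [
        ("s-1", "COMPLETED", None, 0),
        ("s-2", "COMPENSATION_FAILED", "refund_payment", 5),
        ("s-3", "COMPENSATING", "release_inventory", 1),
        ("s-4", "COMPENSATION_FAILED", "refund_payment", 5),
    ],
)
conn.commit()

# Show me all sagas in COMPENSATION_FAILED state, with the step that failed.
stuck = conn.execute(
    "SELECT saga_id, failed_step, retry_count FROM sagas "
    "WHERE state = ? ORDER BY saga_id",
    ("COMPENSATION_FAILED",),
).fetchall()
```

The same query, run on a schedule with a threshold on age, is a reasonable starting point for the stuck-saga alert.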

The tradeoff is coupling. The orchestrator knows about every service it coordinates. Add a new step and you change the orchestrator. In choreography, adding a step means adding a new listener, and the existing services do not change. Teams that start with choreography often migrate to orchestration after the first production incident.

When to Use Sagas

Sagas are the right tool when all of the following are true. The operation spans multiple services, each with its own database. It is long-running. A distributed transaction is off the table because the services use different databases or the latency of two-phase commit is unacceptable. And the business rules define what "compensate" means for each step.

Sagas are the wrong tool when the operation touches a single database (use a local transaction), when the operation spans two services that share a database (use a local transaction across both tables), or when your team does not yet have a persistent saga state store, retry infrastructure, and dead-letter handling in place. Build that infrastructure before adopting the pattern at scale.

The Pattern Is Sound, the Assumptions Are Not

The saga pattern is used correctly in systems that process millions of transactions every day. The failure modes described here are not arguments against sagas. They are arguments against treating sagas as a drop-in replacement for database transactions.

A database transaction gives you atomicity and isolation for free. A saga gives you neither. It gives you a way to coordinate multiple services through a sequence of local transactions, with the ability to reverse the observable effects if something goes wrong. That is a different guarantee, and it requires different infrastructure to implement safely.

Teams that understand the difference build systems that handle failure gracefully. Teams that do not understand it find out during an incident, with stuck sagas, partially refunded orders, and a support queue full of confused customers.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
