← Back to Blog

The Outbox Pattern: Reliable Events Without Two-Phase Commit

Reliable event publishing alongside database writes is harder than it looks. The transactional outbox pattern solves it without distributed transactions.

Here is a scenario that has bitten almost every team that starts publishing events from a service. You save an order to the database, then you publish an OrderPlaced event to Kafka. Most of the time this works fine. Then one day, the database write succeeds but the Kafka publish fails. The order exists but nobody downstream knows about it. Or the Kafka publish succeeds and then the database write fails. Now you have an event for an order that does not exist. You have a dual write problem, and the only way to solve it correctly is to make both writes atomic.

The obvious solution people reach for is a distributed transaction. Lock the database row, publish to Kafka inside the same transaction boundary, commit both. Kafka does not participate in database transactions. You cannot roll back a Kafka message after you have sent it. Two-phase commit across a relational database and a message broker is theoretically possible and practically a disaster. It requires special infrastructure, it introduces coordinator nodes that become single points of failure, and it makes your system slow in exchange for a consistency guarantee that most teams never actually verify works correctly end to end.

The outbox pattern gives you the same guarantee without any of that. The idea is simple: instead of publishing to Kafka directly from your application code, you write an event record to a table in your own database inside the same transaction as your business write. A separate process reads from that table and publishes to Kafka. The two writes that need to be atomic are now both database writes inside a single transaction. You get atomicity for free from the database.

How It Works

The outbox table is just a regular table. Something like:

CREATE TABLE outbox (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  topic       VARCHAR(255) NOT NULL,
  key         VARCHAR(255),
  payload     JSONB NOT NULL,
  created_at  TIMESTAMP NOT NULL DEFAULT NOW(),
  published   BOOLEAN NOT NULL DEFAULT FALSE
);

When you process a command, you write to your business table and the outbox table in a single transaction:

BEGIN;

INSERT INTO orders (id, user_id, total, status)
VALUES ($1, $2, $3, 'PLACED');

INSERT INTO outbox (topic, key, payload)
VALUES (
  'orders',
  $1,
  '{"event": "OrderPlaced", "orderId": "...", "userId": "..."}'
);

COMMIT;

Either both rows land or neither does. Your business data and your event intent are always consistent with each other.

The second piece is the relay, sometimes called the outbox processor. A separate process polls the outbox table for unpublished rows, publishes them to Kafka, and marks them as published. Polling intervals run from a few hundred milliseconds to a few seconds depending on how much latency you can tolerate.

SELECT id, topic, key, payload
FROM outbox
WHERE published = FALSE
ORDER BY created_at
LIMIT 100;

After publishing, you mark the rows:

UPDATE outbox
SET published = TRUE
WHERE id = ANY($1);

You can run cleanup separately to delete old published rows and keep the table from growing forever.

At-Least-Once Delivery

The outbox pattern gives you at-least-once delivery, not exactly-once. If the relay crashes after publishing to Kafka but before marking the row as published, it will publish the same message again when it restarts. This is the right tradeoff in most systems, because making your consumers idempotent is usually much simpler than building exactly-once infrastructure.

Idempotency on the consumer side means that processing the same event twice produces the same result as processing it once. For an OrderPlaced event, that might mean checking whether the order already exists before trying to create it, or using INSERT ... ON CONFLICT DO NOTHING with the order ID as the key. The event ID from the outbox makes a natural idempotency key. You store it on the consumer side and skip processing if you have already seen it.

This is not much extra code. And it is code you would need to write anyway in any distributed system, because networks and processes can always deliver duplicates regardless of whether you use an outbox.

CDC as an Alternative

The polling approach works well for most use cases, but it adds load to your database and introduces some delay. An alternative is Change Data Capture with something like Debezium. Instead of polling a table, Debezium reads the database's replication log directly and streams changes to Kafka in near real time.

CDC eliminates polling and gets you much lower latency. There is also no separate relay process to maintain, because Debezium is doing that job. The tradeoff is operational complexity. Debezium needs access to the replication log, which requires specific database configuration. You are now running another piece of infrastructure that can fail, and when it does, the failure mode is different from a polling process failing.

For most teams, the polling approach is the right starting point. It is simple to understand, simple to operate, and simple to debug. If latency becomes a real problem, CDC is a clean upgrade path.

What the Outbox Does Not Solve

The outbox pattern solves the problem of atomically coupling a business write with an event. It does not solve everything.

Ordering is still your problem. If you process two commands for the same aggregate in quick succession, you need the outbox rows published in the right order. Using the aggregate ID as the Kafka partition key and sequencing outbox rows by a monotonic ID or timestamp gets you there for single-aggregate ordering. Cross-aggregate ordering is a harder problem that the outbox does not address.

Schema evolution is still your problem. Events you have published to Kafka are out there and consumers depend on them. Changing the shape of an event payload without coordinating with consumers breaks things just as it would in any event-driven system.

The outbox table can become a bottleneck if your write volume is high enough. In most systems it will not, because the outbox rows are tiny and you are just appending and updating a boolean. But if you are processing millions of events per second, you need to think about partitioning the outbox table or sharding the relay.

When You Actually Need It

If your service only writes to a database and does not publish events anywhere, you do not need the outbox pattern. If your service publishes events to Kafka but you genuinely do not care about the case where the database write succeeds and the Kafka publish fails, you also do not need it, though I would question how you convinced yourself you do not care about that.

You need the outbox pattern when your downstream systems make decisions based on events from your service and those decisions have real consequences. Payment processing. Inventory reservation. Notification delivery. Fraud detection. In any of these cases, a missed event or a phantom event is not a minor inconsistency you can explain away. It is a bug with a user on the other end of it.

The outbox pattern is one of those solutions that feels like more infrastructure than it should be for what it does. You are adding a table, a relay process, and a cleanup job to solve a problem that looks like it should be a single line of code. But the problem it solves is real. Dual writes fail in production. Quietly. And when they do, the debugging story is usually painful because by the time you notice, you have no clear record of which writes succeeded and which did not. The outbox gives you that record. That alone is worth the extra table.

Share
X LinkedIn HN
UI

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.

👁 0 5 min read

Comments (0)