The cluster we stood up to move three topics
Three of our services needed to talk without calling each other directly. An order service produced events, and a billing service and a notifications service wanted them. That was it. Three topics, a few hundred messages a minute on a busy day.
What we built for that was a three-broker Kafka cluster running KRaft, a schema registry so the events stayed typed, Kafka Connect to land a copy in the warehouse, and a small library wrapping the consumer so every service handled offsets the same way. It worked. And it quietly became the thing we spent the most time operating, for the smallest part of the system. We had bought a streaming platform to send a postcard between three houses on the same street.
Kafka is a commit log, and you bought the whole log
Kafka is a distributed, partitioned, replicated commit log. That sentence is the whole pitch and the whole warning. It is extraordinary at keeping an ordered, durable record of a firehose of events, and at letting many independent consumers read it at their own pace, rewind, and replay history from the beginning.
Every one of those properties assumes you have a firehose. Partitions exist so throughput can scale past a single machine. Consumer groups exist so a fleet of workers can split a stream too big for one of them. Retention and replay exist so you can reprocess weeks of history. When your actual load is a few hundred messages a minute with two consumers that never replay anything, you are paying for a machine built to move an ocean so it can move a bucket.
The operational tax nobody priced
The sticker price of Kafka is the brokers. The real price is everything you learn about it at 2am. When a consumer group rebalances for the first time and processing stalls for thirty seconds, someone has to understand why. The first time lag climbs and nobody can tell whether a consumer is slow, stuck, or dead, someone has to learn the difference between those three states under pressure.
Then there is retention you set wrong and a topic that quietly drops messages older than a day. There is the partition count you picked early and cannot raise later without reshuffling keys. Ordering is guaranteed inside a partition and not across them, so the moment your design needs global order you are back to one partition and no parallelism. None of this is Kafka being bad. It is Kafka being a serious tool that bills you in attention whether or not your scale justifies the charge.
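To make the partition-count trap concrete, here is a minimal sketch of key-based partitioning. It assumes a CRC32 hash as a stand-in (Kafka's default partitioner uses murmur2, and `partition_for` is a hypothetical helper, not a client API), but the modulo step is the part that matters: change the partition count and most keys land somewhere new, so per-key ordering no longer lines up with the old data.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration. Kafka's default partitioner uses
    # murmur2, not CRC32, but it applies the same "hash mod count" idea.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical order keys, hashed under 3 partitions and then under 4.
keys = [f"order-{i}" for i in range(1000)]
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 4)]

print(f"{len(moved)} of {len(keys)} keys land on a different partition")
```

With a roughly uniform hash, going from 3 to 4 partitions moves about three keys in four, which is why the count you pick early tends to stick.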
What "we need events" usually means
When a team says it needs events, it usually wants three small things. It wants the producer to stop blocking on the consumer, so a slow billing run does not back up order placement. A failed handler should retry without dropping the message on the floor. And once in a while, someone wants to look back at what actually happened.
That is decoupling, durability, and the occasional audit. It is a real and worthy list. It is also a far smaller want than ordered, replayable, horizontally partitioned streaming, and almost any queue on earth satisfies it without a cluster anywhere in sight.
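The retry half of that list is small enough to sketch in a few lines. This is an in-memory toy, assuming a hypothetical retry budget of three attempts and a plain list standing in for wherever failed messages get parked; a managed queue does the same thing with a redelivery count.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry budget, not a recommendation


def process(queue: deque, handler, dead_letters: list) -> None:
    # Each queue entry is (message, attempts_so_far).
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                # Out of retries: park the message instead of dropping it.
                dead_letters.append(message)
            else:
                queue.append((message, attempts + 1))


seen = []


def flaky_handler(message: dict) -> None:
    # Hypothetical handler that always fails on message 2.
    seen.append(message["id"])
    if message["id"] == 2:
        raise RuntimeError("billing is down")


queue = deque([({"id": 1}, 0), ({"id": 2}, 0)])
dead_letters = []
process(queue, flaky_handler, dead_letters)
```

Message 1 is handled once. Message 2 is tried three times and then set aside where someone can look at it later, which also covers most of the "occasional audit" want.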
The smaller tools that actually fit
A managed queue is the boring answer and usually the correct one. SQS, Cloud Pub/Sub, or a hosted RabbitMQ gives you decoupling and retries with a dead-letter queue, and the operational surface is a config screen instead of a cluster you babysit. For handing work off reliably from inside a transaction, the outbox pattern does the job with a table and a poller, and I have a whole post on why that beats a two-phase commit. For genuinely low volume, a jobs table that a worker polls is not a sin. It is a queue sized to the problem in front of you.
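For a sense of how little the outbox pattern needs, here is a sketch using SQLite from the standard library. The table names, the `order.placed` topic, and the `publish` callback are all assumptions for illustration; in a real service the callback would hand the event to whatever queue you use.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: int, total_cents: int) -> None:
    # One transaction: the order row and its event commit or roll back together.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def drain_outbox(publish) -> int:
    # The poller: publish unsent rows in insertion order and mark each one
    # sent only after publish returns without raising.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


delivered = []
place_order(1, 4999)
place_order(2, 1250)
drain_outbox(lambda topic, event: delivered.append((topic, event)))
```

A crash between publishing and marking a row sent means the event goes out twice on the next poll, so this sketch is at-least-once delivery and the consumers should tolerate duplicates.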
This is not "use your database as a stream"
I have to be careful here, because I have also argued that your database is not your message queue, and I still mean every word of it. Right-sizing down has a floor. The failure mode in that other post is a team treating a busy Postgres table as a high-throughput event bus, with dozens of workers running SELECT ... FOR UPDATE SKIP LOCKED in a hot loop and reinventing offsets badly.
The line is throughput and intent. A jobs table polled a few times a second by one or two workers is a queue doing honest work. That same table under a real event stream turns into a lock-contention machine and a worse Kafka than Kafka. Reach for the smaller tool, then stop reaching before you turn your database into the thing you were trying to avoid.
The throughput where Kafka starts to win
There is a real line, and past it Kafka stops being overkill and becomes the only sane option. When you are moving tens of thousands of messages a second, a queue's per-message bookkeeping falls over and the log's sequential design pulls ahead. Five or six independent teams reading the same stream at their own pace is another signal, because the commit-log model fits that shape in a way point-to-point queues never will. And if you have to replay a week of events to rebuild a projection, you want retention you can rewind into.
Fan-out, replay, and serious throughput are the signals worth watching. If two or more of them describe your system honestly, stand up the cluster and do not apologize for it.
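If it helps to see the rule of thumb written down, here is one way to encode it. The cutoffs are my assumptions, not measurements: 10,000 messages a second as the low end of "tens of thousands", and five teams for fan-out. `Workload` and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    messages_per_second: float
    independent_consumer_teams: int
    needs_replay: bool


def kafka_signals(w: Workload) -> list[str]:
    signals = []
    if w.messages_per_second >= 10_000:  # assumed cutoff for "tens of thousands"
        signals.append("throughput")
    if w.independent_consumer_teams >= 5:  # assumed cutoff for "five or six teams"
        signals.append("fan-out")
    if w.needs_replay:
        signals.append("replay")
    return signals


def should_run_kafka(w: Workload) -> bool:
    # Two or more honest signals, per the rule above.
    return len(kafka_signals(w)) >= 2


# The system from the opening: taking "a few hundred a minute" as 300,
# with two consumers and no replay.
postcard = Workload(messages_per_second=300 / 60, independent_consumer_teams=2, needs_replay=False)
print(kafka_signals(postcard), should_run_kafka(postcard))
```

The postcard workload trips none of the three signals, and it sits more than three orders of magnitude below the throughput cutoff.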
The "but we might scale" defense
The usual objection is that adopting a queue now means a painful migration later, when the firehose finally arrives. In practice the opposite holds. Code that publishes to an interface does not care whether the other side is SQS or Kafka, and swapping the transport underneath is a week of work you do once you have the volume to justify it and the real data to size it correctly.
Building Kafka first is paying that migration cost up front, every single day, for a scale you may never reach. You carry the operating burden for years to dodge a week of work that might never come due.
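Publishing to an interface can be as small as this sketch. `EventPublisher`, `InMemoryPublisher`, and `OrderService` are hypothetical names, and the in-memory class stands in for a transport; a real implementation would wrap an SQS or Kafka client behind the same single method.

```python
from typing import Protocol


class EventPublisher(Protocol):
    def publish(self, topic: str, event: dict) -> None: ...


class InMemoryPublisher:
    """Stand-in transport for illustration and tests."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, dict]] = []

    def publish(self, topic: str, event: dict) -> None:
        self.sent.append((topic, event))


class OrderService:
    # Depends only on the interface, so swapping the transport later
    # means writing a new publisher class, not touching this one.
    def __init__(self, publisher: EventPublisher) -> None:
        self._publisher = publisher

    def place_order(self, order_id: int) -> None:
        self._publisher.publish("order.placed", {"order_id": order_id})


publisher = InMemoryPublisher()
OrderService(publisher).place_order(42)
```

The caveat is that the interface only stays swappable if you keep transport-specific ideas like partitions and offsets out of it.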
What I actually reach for
My default for service-to-service events is a managed queue, with the outbox pattern when the handoff has to be transactional. I keep that setup until the numbers force a change, which means at least two of these showing up together: tens of thousands of messages a second, several independent consumers, or a genuine need to replay history. At that point Kafka earns its keep and the operational tax becomes a fair price for what it buys.
Three topics and a few hundred messages a minute is not that day. It is a postcard. Send it with a stamp, not a shipping container, and put the months you saved straight back into the product.