I've built the same system three times. Not the same product, but the same pattern. An application that needs to process something asynchronously. Send an email after a user signs up. Generate a PDF after an order is placed. Sync data to a third-party system after a record changes. And every time, someone on the team, including me the first time, suggests the same solution: just add a status column.
The pitch sounds reasonable. Create a table. Insert a row when work needs to be done with status PENDING. A background job polls the table every few seconds, picks up pending rows, processes them, then sets the status to COMPLETED. No new infrastructure. No message broker to manage. No extra operational burden. It's just SQL. Everyone already knows SQL.
And it works. For a while. It always works for a while.
The Polling Problem
The first thing that goes wrong is the polling. Your background job runs a query every 5 seconds:
SELECT * FROM tasks WHERE status = 'PENDING' ORDER BY created_at LIMIT 10At low volume, this is fine. The table has a few hundred rows. The query hits an index. Life is good.
Then the table grows. You're inserting thousands of tasks a day, and most of them are completed. The table has a million rows. Your index on the status column helps, but now PENDING rows are scattered across a table that's mostly COMPLETED. Query planning has to wade through a lot of dead space to find work.
And you're running this query every 5 seconds. On every instance of your application. If you have four instances, that's 48 queries per minute doing the same thing. Most of them return zero rows. You're paying for a full index scan every 5 seconds just to learn there's nothing to do.
I worked on a system where the task table had 8 million rows. The polling query took 200ms on average, but occasionally spiked to 2 seconds when the database was under load from actual user traffic. The poller was competing with the application for database resources. We were burning I/O to ask "is there work?" and the answer was usually no.
A real message queue doesn't have this problem. Messages sit in a buffer until a consumer pulls them. There's no polling. There's no wasted query. The consumer blocks until there's something to consume. Push, not pull.
The Concurrency Nightmare
The second thing that goes wrong is concurrent processing. You have four application instances, each running the same poller. They all query for pending tasks at roughly the same time. They all see the same rows. Now three instances try to process the same task.
The standard fix is SELECT ... FOR UPDATE SKIP LOCKED:
SELECT * FROM tasks WHERE status = 'PENDING'
ORDER BY created_at LIMIT 10
FOR UPDATE SKIP LOCKEDThis locks the selected rows so other pollers skip them. It works correctly. But now you're using your database as a distributed lock manager. Every poll cycle takes row-level locks, holds them while processing, and releases them when done. If processing takes 10 seconds, those rows are locked for 10 seconds. If processing fails and the transaction rolls back, the rows become visible again and get picked up by another instance, which is what you want, but the retry behavior is implicit and hard to reason about.
You also lose visibility. When a task is locked by FOR UPDATE, it disappears from queries that don't use the same locking clause. Your admin dashboard that shows pending tasks will show fewer tasks than actually exist, because some are locked by active workers. I've had support teams ask why tasks "vanish" for a few seconds and then reappear.
Message queues solve this with consumer groups and acknowledgments. A message is delivered to exactly one consumer. If the consumer doesn't acknowledge it within a timeout, the message becomes available again. The semantics are explicit, not bolted on through database locking.
The Ordering Illusion
"But we need ordered processing," someone always says. "The database gives us ordering with ORDER BY created_at."
Except it doesn't. Not the way you think.
Your four pollers each grab a batch of 10 tasks ordered by creation time. Poller A grabs tasks 1-10. Poller B grabs tasks 11-20. Poller A processes task 7 before task 3 because task 3 hit a slow external API. Now your tasks are processed out of order even though you fetched them in order.
And if task 5 fails and gets retried, it's now processed after task 20. Your ordering guarantee was never real. It was a fetch order, not a processing order.
If you actually need ordered processing, you need a single consumer processing tasks sequentially. Which means you can't scale horizontally. Which defeats half the reason you wanted async processing in the first place.
Kafka handles this properly with partitioned topics. Messages within a partition are ordered and consumed by a single consumer in the group. You get ordering where you need it and parallelism where you don't. Trying to replicate this with database tables means reinventing partition assignment, consumer rebalancing, and offset management. In SQL. Please don't.
The Retry Trap
When processing fails, you need to retry. In the database-as-queue pattern, this usually means setting the status back to PENDING or adding a RETRY status and an attempt_count column.
UPDATE tasks SET status = 'PENDING', attempt_count = attempt_count + 1
WHERE id = 42 AND attempt_count < 5Simple enough. But when do you retry? Immediately? That's how you turn a transient failure into a sustained failure. If the external API is down, retrying immediately just hammers it harder. You need exponential backoff. So you add a next_retry_at column and modify your poller:
SELECT * FROM tasks
WHERE status = 'PENDING'
AND next_retry_at <= NOW()
ORDER BY next_retry_at
LIMIT 10
FOR UPDATE SKIP LOCKEDNow your schema has five columns that have nothing to do with your business logic: status, attempt_count, max_attempts, next_retry_at, last_error. Every table that needs async processing gets these same five columns. You're building a message queue schema inside your business database.
And you still haven't handled dead letter queues. What happens after the fifth retry fails? The task sits there with status = 'FAILED' forever. Someone has to manually check for failed tasks. Then diagnose why they failed. Then patch the issue and reset the status. No infrastructure for this. Just a query someone remembers to run.
RabbitMQ gives you dead letter exchanges out of the box. Kafka gives you dead letter topics. SQS gives you dead letter queues with configurable redrive policies. These aren't features you need to build. They're features you get for free by using the right tool.
The Transactional Outbox: The Exception That Proves the Rule
There is one case where using a database table as a queue is not just acceptable but actually the right pattern: the transactional outbox.
The problem it solves is real. You need to update a database record and publish a message, and you need both to happen or neither to happen. If you update the database and then publish to Kafka, the publish might fail. If you publish to Kafka and then update the database, the database write might fail. You can't do a distributed transaction across your database and your message broker. Not reliably.
The outbox pattern says: write the message to an outbox table in the same database transaction as your business data. Then have a separate process read from the outbox table and publish to the actual message queue.
BEGIN;
INSERT INTO orders (user_id, total) VALUES (123, 59.99);
INSERT INTO outbox (topic, key, payload) VALUES ('order-created', '123', '{...}');
COMMIT;Now a poller (or a CDC connector like Debezium) reads from the outbox table and publishes to Kafka. The outbox table is a queue, but it's a very specific kind of queue. It's a staging area between your database and your real message broker. It's not the destination. It's the bridge.
The key difference is scope. The outbox table handles exactly one thing: reliable event publishing. Retries, consumer groups, backpressure, and dead letters all sit outside it. The real queue downstream owns that work.
If you're using the outbox pattern, you're already using a message queue. You're just using the database to get messages into it safely.
When the Database Actually Is Fine
I'm not saying you should never use a database table for async work. There are scenarios where it's the right call.
You have very low volume. If you're processing 50 tasks a day, the polling overhead is negligible. The complexity of adding Kafka or RabbitMQ to your stack isn't worth it. A simple status column works fine and will continue working fine for years.
You need complex querying. Message queues are append-only logs. You can't query them the way you query a database. If you need to say "show me all pending tasks for user X, ordered by priority, excluding tasks that depend on other incomplete tasks," that's a SQL query. Trying to do that with a message queue is painful.
You need durable, auditable state. If every state transition of a task needs to be recorded for compliance reasons, a database table with a full audit trail might be simpler than a message queue plus a separate audit store.
Your team is small and the operational burden matters. Running Kafka in production is not trivial. If you're a team of three and nobody has Kafka experience, introducing it to process a few hundred async tasks is over-engineering. The database approach has well-understood failure modes that your team already knows how to debug.
The tipping point, in my experience, is somewhere around a few thousand tasks per hour, or when you start adding your third or fourth "queue table" to the schema. At that point, the infrastructure cost of a proper message broker is less than the engineering cost of maintaining your homegrown queue system.
What I Recommend Now
When someone on my team proposes the status column pattern, I don't immediately say no. I ask three questions.
What's the expected volume? If it's under a hundred tasks per hour and unlikely to grow much, the database is probably fine. Keep it simple.
How many consumers do you need? If the answer is one, a single poller on a status column works fine. If the answer is "multiple instances need to compete for work," you're going to end up rebuilding consumer group semantics in SQL. Use a queue.
What happens when processing fails? If the answer is "we retry once and then log it," the database is fine. If the answer involves exponential backoff, dead letter handling, and alerting, you're describing the feature set of a message broker. Use one.
The pattern I've settled on for most projects is simple. Start with the database if the volume is low and the requirements are simple. The moment you catch yourself adding retry logic, backoff calculations, or lock management, stop. That's your signal. You're building a message queue. Use one that already exists.
RabbitMQ if you want simple pub/sub with routing. Kafka if you need event streaming, replay, or high throughput. SQS if you're on AWS and want managed infrastructure. All three are battle-tested, well-documented, and cheaper than the engineering time you'll spend maintaining a database-backed queue that grows in complexity every quarter.
Your database is good at being a database. Let it do that. And let your message queue be a message queue.
Comments (0)