← Back to Blog

The Kafka Consumer Group That Stopped Consuming

A Kafka consumer group can stop consuming while every metric looks healthy. Rebalance storms, max poll timeouts, stuck partitions, and how to actually diagnose.

The on-call ticket at 02:14

The payment-events consumer group had lag of zero in the dashboard. The downstream service was paging because no payment-confirmation emails had been sent in 20 minutes. We pulled up the lag-by-partition view. Two of the six partitions had lag of zero. The other four had lag of 4 million and growing.

Aggregate lag across six partitions, averaged: still small. Per-partition: a cliff.

What "consuming" actually means here

A Kafka consumer group is one or more processes that share the work of reading from a topic. Each partition is owned by exactly one consumer at a time. The group coordinator (a broker) decides who owns what. When a consumer joins or leaves, the coordinator triggers a rebalance and reassigns partitions across the surviving members.

This is the source of every consumer-group failure mode you have ever seen. The rebalance protocol assumes consumers behave in specific ways within specific time windows. When they don't, the group misbehaves quietly.

The three timeouts that decide whether you are in the group

You need three numbers in your head before any of the rest makes sense:

  • heartbeat.interval.ms: how often the consumer's background thread tells the broker "I am still here." Default 3 seconds.
  • session.timeout.ms: how long the broker waits without a heartbeat before declaring the consumer dead. Default 45 seconds in modern Kafka.
  • max.poll.interval.ms: how long the broker waits between calls to poll() before declaring the consumer dead. Default 5 minutes.

The first two are about a background heartbeat thread. The third is about your application thread calling poll(). Two different "is this consumer alive?" detectors, two different ways to fail, one shared name in your team's vocabulary: "the consumer dropped out."

Failure mode 1: the heartbeat lies

Your consumer calls poll(). It gets back a batch of records. Processing starts. The first record makes a synchronous HTTP call to a slow downstream service. Four minutes pass. Meanwhile, the heartbeat thread keeps sending heartbeats. Session timeout is happy.

But max.poll.interval.ms is the default five minutes. If processing takes longer than that, the broker decides this consumer is dead, kicks it out of the group, and triggers a rebalance. The consumer eventually finishes processing the record, calls poll() again, gets CommitFailedException because it no longer owns the partition, and the work it just did was wasted because someone else has already started consuming it.

Lesson: max.poll.interval.ms must be larger than your worst-case batch processing time. The heartbeat being healthy means nothing about whether your processing loop is actually making progress.

Failure mode 2: the rebalance storm

You deploy a new version of the consumer service. It rolls out across six pods. Each pod takes eight seconds to start. Each restart triggers a rebalance. Each rebalance takes around ten seconds to complete because all six partitions have to be reassigned. During the rebalance, no consumer is consuming anything.

You get six rebalances back to back. That is roughly a minute of zero consumption. Lag piles up. Worse: if your rolling deploy is configured to wait for "healthy" and "healthy" is defined as "joined the group successfully," the rolling deploy ping-pongs because every new pod that joins triggers another rebalance that briefly knocks the rest out.

The fix here has two parts. First, use static group membership (group.instance.id): a consumer that disappears for less than session.timeout.ms rejoins with its old partition assignments, no rebalance needed. Second, use incremental cooperative rebalancing (partition.assignment.strategy=CooperativeStickyAssignor): rebalances only redistribute the partitions that need to move, not the entire set. Each change is independent. Apply them both.

Failure mode 3: the stuck partition

This is the one from the war story. One of the consumer instances is stuck. Maybe a deadlock in the processing code, maybe a Kafka network thread blocked on a slow DNS lookup, maybe a JVM in a stop-the-world GC pause that just won't stop. The instance isn't calling poll() anymore. But the heartbeat background thread is still running.

From the broker's view, the consumer is healthy. From reality's view, the partition it owns is going nowhere. Aggregate lag looks fine because five other partitions are draining. Per-partition lag for this one partition is a vertical line.

This is why "is the consumer group consuming?" is the wrong question. The right question is "is every partition draining?" You need per-partition lag in your dashboards. Aggregate lag hides this every time.

Failure mode 4: the poison message

The consumer reads a record. Parsing the record throws an exception. Nobody catches it. The consumer's processing loop dies. No more poll() calls. Eventually max.poll.interval.ms fires and the consumer is kicked out.

Then a new consumer in the group picks up the partition. Same record (the offset was never committed). Same exception. Same eviction. Repeat forever.

Real consumers wrap record processing in a try/catch and either skip-and-log poison messages or route them to a dead-letter topic. The exact policy is your call. Having no policy and letting the exception propagate is the bug.

Failure mode 5: the silent skip

Auto-commit is on. The consumer reads 500 records, processes 50, then crashes. Auto-commit had already committed offset 500 because the commit interval expired during processing. The remaining 450 records are now silently skipped. Nothing alerts. The consumer "kept consuming." The data is gone.

Default behavior for many Kafka clients is auto-commit. For anything that matters, turn it off and commit explicitly after the work is done. The two-line change is worth the discipline.

What "consumer lag" actually measures

Lag is the gap between the latest offset in a partition and the offset the consumer has committed. A snapshot, not a derivative. Lag of zero right now does not mean the consumer is processing fast enough. It means the consumer committed an offset equal to the latest one, which could have been done by skipping records (failure 5) or by committing-without-processing patterns.

The metric that catches problems earlier is lag rate: how fast lag is growing or shrinking, per partition. Flat lag means you are keeping up. Growing lag is the early warning. Shrinking lag means a recent backlog is draining. Lag-rate panels in Grafana save you hours of confusion.

Diagnostic order when a consumer group misbehaves

  1. Look at per-partition lag, not aggregate. If only some partitions are lagging, it is failure 3 (stuck consumer) or failure 4 (poison message).
  2. Look at the consumer group's member list (kafka-consumer-groups.sh --describe). If members are appearing and disappearing, it is failure 2 (rebalance storm).
  3. Look at consumer logs for CommitFailedException. That is the signature of max.poll.interval.ms being too low for the work (failure 1).
  4. Look at consumer logs for unhandled exceptions in the processing loop. That is failure 4 (poison message).
  5. Check whether auto-commit is on. If it is, you might be silently losing records (failure 5).

Five checks. Most consumer-group incidents are one of those five.

One thing that will catch the next one

Per-partition lag rate, alerting on positive sustained slope for any one partition. Not aggregate lag. Not member count. Not group state. The single most informative panel for a Kafka consumer is "is every partition draining at the rate I expect." Build it once, point alerts at it, and three months from now you will be paged twenty minutes earlier than you would have been otherwise.

Share
X LinkedIn HN
UI

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.

👁 0 5 min read

Comments (0)