
Webhook Reliability: The Lost Art

Webhooks break predictably: duplicate events, missed deliveries, retry storms. Here is what it actually takes to build receivers that hold up in production.

The first time I implemented a webhook receiver, I treated it like any other POST endpoint. Receive the payload, parse it, do the work, return 200. It worked fine in testing. It worked fine in staging. Then we went live with a payments provider and I learned how much I had missed.

The webhook came in twice within the same second. Same event ID, same timestamp, same payload. Both requests succeeded because my code had no idea it had already processed this event. A duplicate charge notification went out to the customer. Not great.

That was my introduction to idempotency, and it is table stakes for webhook handling. Every webhook delivery needs a unique identifier that the receiver can use to recognize duplicates. Stripe puts an event ID in the id field of the payload. GitHub sends a delivery GUID in the X-GitHub-Delivery header. The name does not matter. What matters is that you store it and you check it before you do anything else.

Idempotency alone is not enough though. You also need to think about what happens between the webhook arriving and your processing finishing. If your database write succeeds but your response back to the sender fails, the sender sees a timeout. A timeout usually triggers a retry. Now you have the same event coming in again, and this time your idempotency check matters.

There is a race condition here that is easy to miss. Two identical webhook requests arrive nearly simultaneously. Each checks the idempotency key at roughly the same time. Each sees that the event has not been processed yet. Each starts processing. You end up with the same work done twice.

Fix it by making the idempotency check and the claim a single atomic step. Nothing fancy required. A database row with a unique constraint on the event ID is often enough. The first request inserts the row and proceeds. A second request gets a duplicate key error and backs off. A database such as Postgres serializes the two inserts, and you are safe from the race.
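A minimal sketch of the unique-constraint approach. It uses Python's built-in sqlite3 module as a stand-in for Postgres so it runs anywhere; the table name processed_events and the helpers claim_event and handle_once are hypothetical names chosen for illustration.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory store with a uniqueness guarantee on event_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True only for the first caller that claims this event_id."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected the insert: someone already claimed it.
        return False


def handle_once(conn: sqlite3.Connection, event_id: str, work) -> str:
    """Run work() at most once per event_id."""
    if not claim_event(conn, event_id):
        return "duplicate"
    work()
    return "processed"
```

Here the claim is committed before the work runs, so a crash in work() would leave the event marked as handled. Where the work is itself a database write, putting the claim and the write in the same transaction is one way to avoid that.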

Then there is the retry logic on the sender side, which varies wildly between services. Some use exponential backoff. Some use fixed intervals. Some give up after three attempts. Some will retry for days. Your webhook endpoint needs to handle all of them gracefully.

A 200 response means success. Obvious enough, but I have seen implementations that return 200 immediately, queue the work, and then fail the actual processing. The sender thinks the delivery succeeded and will not retry. Your queue dies and the event is lost. Return 200 only after you have durably recorded the event. That means it is written to your database, or acknowledged by your queue, or persisted somewhere the webhook handler itself cannot lose it.

What about failures you want the sender to retry? Return a 5xx status code, which is what you want for transient failures like a database being temporarily unavailable. Many webhook senders interpret 4xx as a permanent failure and stop retrying, which is useful for malformed payloads you know you cannot process. Behavior varies, though: some senders retry on any non-2xx response, so check your provider's documentation before relying on the distinction.
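The status code choice can be sketched as a small function. This is an illustration under assumptions, not any provider's contract: TransientError is a hypothetical exception your own storage layer would raise, record_event is a hypothetical callback that durably stores the event, and the payload is assumed to be a JSON object with an id field.

```python
import json


class TransientError(Exception):
    """Hypothetical marker for retryable failures, e.g. database unavailable."""


def respond(raw_body: bytes, record_event) -> int:
    """Map the outcome of handling one delivery to an HTTP status code."""
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return 400  # malformed payload: retrying will not help
    if not isinstance(event, dict) or "id" not in event:
        return 400  # valid JSON, but not an event we can process
    try:
        record_event(event)  # must durably store the event before we say 200
    except TransientError:
        return 503  # temporary problem: ask the sender to try again
    return 200
```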

Verification is another piece that gets skipped. Webhooks can come from anywhere, and attackers scan for unauthenticated endpoints. Every webhook provider I have worked with offers a signature header that proves the payload came from them. Stripe uses Stripe-Signature. GitHub uses X-Hub-Signature-256 (the older X-Hub-Signature is SHA-1 based). Verify it before you process anything. The signature is usually an HMAC of the raw request body using a shared secret. Trim whitespace from the body and the check fails. Parse the body and re-serialize it before verifying and you might have changed the formatting. Verify the signature against the raw bytes you received.
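A minimal sketch of HMAC verification over the raw bytes, using only the standard library. The "sha256=" plus hex digest format is modeled on GitHub-style headers and is an assumption here; other providers, Stripe included, use a different header layout, so treat sign and verify as illustrative helpers rather than a drop-in for any one provider.

```python
import hashlib
import hmac


def sign(secret: bytes, raw_body: bytes) -> str:
    """Compute the header value a sender would attach to this exact body."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    """Check the header against the raw bytes, in constant time."""
    expected = sign(secret, raw_body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

Note that a body which parses to the same JSON but differs by a single space produces a different digest, which is why the check has to run before any parsing or normalization.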

Retry storms are a failure mode that can take down receivers. If your processing is slow and the webhook sender times out, it retries. Now you have two requests running. Each is slow. Each times out. Each gets retried. The exponential backoff helps, but if your endpoint is genuinely broken or overloaded, the retries become a denial of service directed at yourself.

Circuit breakers help here. If webhooks start failing at a high rate, stop processing them temporarily. Return a clear status code that tells the sender to back off, such as 429 or 503, ideally with a Retry-After header. Not every sender honors it, but well-behaved ones will slow down or pause delivery.
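A small circuit breaker might look like the sketch below. It is simplified: it trips on consecutive failures rather than a failure rate, and it reopens fully after the cooldown instead of probing with a single request. The class name, thresholds, and the handle_with_breaker helper are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Trips after N consecutive failures, then rejects until a cooldown passes."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so tests do not need to sleep
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and accept traffic again.
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()


def handle_with_breaker(breaker: CircuitBreaker, process) -> int:
    """Return the HTTP status for one delivery, shedding load when tripped."""
    if not breaker.allow():
        return 429  # tell the sender to back off without doing any work
    try:
        process()
    except Exception:
        breaker.record(False)
        return 500
    breaker.record(True)
    return 200
```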

Speaking of slow processing, webhooks are not a good place for heavy work. The sender is waiting for your response, and many have strict timeouts. Thirty seconds is common. Some are as low as ten. If your processing takes longer, you need to acknowledge the webhook fast and do the work asynchronously. That means a queue, a background worker, and some way to track that the work completed.
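One way to acknowledge fast without losing events is an inbox table: the handler only writes the raw event and returns, and a separate worker drains it. The sketch below again uses sqlite3 as a stand-in for a real database or queue; the inbox table, its columns, and the acknowledge and drain functions are hypothetical names.

```python
import json
import sqlite3


def make_inbox() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inbox ("
        " event_id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def acknowledge(conn: sqlite3.Connection, event_id: str, payload: dict) -> int:
    """Webhook handler: persist the event, then return 200. No heavy work."""
    with conn:
        # OR IGNORE makes a redelivery of the same event a harmless no-op.
        conn.execute(
            "INSERT OR IGNORE INTO inbox (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps(payload)),
        )
    return 200


def drain(conn: sqlite3.Connection, process) -> int:
    """Background worker: process pending events and mark each one done."""
    rows = conn.execute(
        "SELECT event_id, payload FROM inbox WHERE status = 'pending'"
    ).fetchall()
    done = 0
    for event_id, payload in rows:
        process(json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute(
                "UPDATE inbox SET status = 'done' WHERE event_id = ?",
                (event_id,),
            )
        done += 1
    return done
```

The status column doubles as the record that the work completed, and a row left pending after a crash is picked up on the next drain.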

The final piece that gets forgotten is the ordering guarantee, or rather the lack of one. Webhooks do not promise to arrive in order. If a user updates a record twice, webhook A from the first update might arrive after webhook B from the second. If your logic assumes chronological delivery, you will apply stale data over fresh data.

Include a timestamp or version in the webhook payload and check it inside your handler. Only apply the event if its timestamp or version is newer than what you have stored. This is optimistic concurrency control applied to asynchronous notifications.
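A sketch of the stale-update guard, assuming each event carries an integer version that increases with every change to the record. The records table and the apply_update function are hypothetical, and sqlite3 again stands in for whatever database you use.

```python
import sqlite3


def make_records() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        " id TEXT PRIMARY KEY, value TEXT, version INTEGER NOT NULL)"
    )
    return conn


def apply_update(conn: sqlite3.Connection, record_id: str,
                 value: str, version: int) -> bool:
    """Apply the event only if it is newer than what is stored."""
    with conn:
        # The version comparison lives in the WHERE clause, so the check
        # and the write happen in a single statement.
        cur = conn.execute(
            "UPDATE records SET value = ?, version = ? "
            "WHERE id = ? AND version < ?",
            (value, version, record_id, version),
        )
        if cur.rowcount == 1:
            return True
        exists = conn.execute(
            "SELECT 1 FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if exists:
            return False  # stored version is the same or newer: stale event
        conn.execute(
            "INSERT INTO records (id, value, version) VALUES (?, ?, ?)",
            (record_id, value, version),
        )
        return True
```

With this in place, webhook A arriving after webhook B is simply dropped instead of overwriting the fresher data.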

Webhooks are not broken by design. They are just harder than they look because they sit at the intersection of several distributed systems concerns: idempotency, ordering, retries, verification. Most tutorials stop at parsing the JSON. That is where the real work begins.

If you are building webhook receivers, start with the failure modes. What happens if this arrives twice? What happens if processing fails halfway through? What happens if they come out of order? Answer those questions before you write the happy path code. Your future self will thank you when the payment processor has an outage and retries a thousand events at once.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
