Why webhook retries matter (and how to get them right)

If your webhook delivery doesn't have backoff, jitter, idempotency, and a dead-letter queue, you are silently losing customer events.

TL;DR

Without retries, your 'delivered' webhooks are only about 92% delivered in practice. Customer servers go down, networks hiccup, deploys happen. Adding retries with exponential backoff, idempotency keys, dead-letter queues, and HMAC signing takes you from about 92% toward 99.99%. This post covers the four primitives, the math behind them, and the bugs they catch.

Webhook delivery is one of those problems that looks trivial until you ship it. You make a POST request to a customer's URL. They return 200. Done. Except they didn't; the proxy did. Or they did, and your code crashed mid-write, before recording the success. Or the request timed out and you retried, and now you've delivered twice. Or you retried for two hours and gave up, and you have no record of what didn't make it.

Reliable webhook delivery is the same problem as reliable AI job execution: you have an unreliable external dependency, you need to keep trying, and you need a record. This post is about the four primitives that get it right.

The delivery reality

If you graph the response codes from a busy webhook deliverer, the picture looks something like this:

  • ~92% — 200/201/204 on first attempt
  • ~5% — 5xx or timeout, succeeds on retry
  • ~2% — slow URLs (customer deploys, maintenance), eventually succeed
  • ~1% — never recoverable (URL gone, wrong endpoint, auth changed)

If you only deliver on the first attempt, your real delivery rate is 92%. That sounds high until you realize it means a customer with 1,000 events per day is missing roughly 80 of them, and most of those failures are recoverable noise. Retries alone recover the roughly 7% that fail transiently, without changing anything else. The last 1% stays undelivered until the customer fixes their endpoint, which is why those events need to land somewhere visible.
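
The arithmetic is easy to check. The sketch below plugs the illustrative shares from the list above into a small helper and counts the events a customer with 1,000 events per day would miss. The names `shares` and `missedPerDay` are made up for this example, and the percentages are the article's rough figures, not measured data.

```ts
// Illustrative shares from the breakdown above (not measured data).
interface OutcomeShares {
  firstAttempt: number;  // 2xx on the first attempt
  retryable: number;     // 5xx or timeout, succeeds on retry
  slow: number;          // long outage, eventually succeeds
  unrecoverable: number; // never succeeds without customer action
}

const shares: OutcomeShares = {
  firstAttempt: 0.92,
  retryable: 0.05,
  slow: 0.02,
  unrecoverable: 0.01,
};

// Events per day that never arrive, given the share that is delivered.
function missedPerDay(eventsPerDay: number, deliveredShare: number): number {
  return Math.round(eventsPerDay * (1 - deliveredShare));
}

// First attempt only: 80 missed events per day.
const withoutRetries = missedPerDay(1000, shares.firstAttempt);

// With retries covering the transient and slow cases: 10 missed events per day.
const withRetries = missedPerDay(
  1000,
  shares.firstAttempt + shares.retryable + shares.slow,
);
```

Under these assumptions retries recover 70 of the 80 missed events. The other 10 are the unrecoverable ones.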

Primitive 1: Exponential backoff with jitter

The retry curve has to be aggressive at first (catch transient blips quickly) and patient later (don't hammer a server that's down). A typical schedule:

Attempt   Delay before retry
1         10 seconds
2         1 minute
3         5 minutes
4         30 minutes
5         2 hours
6         6 hours
7         24 hours
8         Dead-letter

Add jitter at each step (uniform random within ±25%) so that when many events are queued against a downed customer, they don't all wake up simultaneously when the customer recovers.
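
Here is a minimal sketch of that schedule with ±25% uniform jitter. The function name `nextDelayMs` and the injectable `random` parameter are assumptions made for illustration (the parameter lets tests pin the jitter); this is not any particular library's API.

```ts
// Delays from the schedule above, in milliseconds, indexed by (failed attempt - 1).
const SCHEDULE_MS: readonly number[] = [
  10_000,         // after attempt 1: 10 seconds
  60_000,         // after attempt 2: 1 minute
  5 * 60_000,     // after attempt 3: 5 minutes
  30 * 60_000,    // after attempt 4: 30 minutes
  2 * 3_600_000,  // after attempt 5: 2 hours
  6 * 3_600_000,  // after attempt 6: 6 hours
  24 * 3_600_000, // after attempt 7: 24 hours
];

// Returns the delay before the next retry, or null when the event
// has exhausted its attempts and should be dead-lettered.
function nextDelayMs(
  failedAttempt: number,
  random: () => number = Math.random,
): number | null {
  if (failedAttempt < 1) throw new RangeError("attempts are numbered from 1");
  const base = SCHEDULE_MS[failedAttempt - 1];
  if (base === undefined) return null; // attempt 8 and beyond: dead-letter

  // random() is in [0, 1), so the factor is uniform in [0.75, 1.25).
  const factor = 1 + (random() * 2 - 1) * 0.25;
  return Math.round(base * factor);
}
```

For example, after the first failed attempt the delay lands somewhere between 7.5 and 12.5 seconds, so a thousand events queued against the same downed endpoint spread out over about five seconds instead of firing in the same instant.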

Primitive 2: Idempotency

Retries create duplicates whenever the customer's server processed your request but failed to return a 200 (maybe it crashed after writing to the database, maybe a proxy timed out the response). The fix is on both sides:

  • Your side. Include a stable event ID in the payload AND in a header (e.g. X-Webhook-Event-Id). Use the same ID across all retries of the same event.
  • Their side. Document that customers must deduplicate based on the event ID. Most will already do this; saying it explicitly removes the excuse not to.

Idempotency keys aren't optional

If you don't send an idempotency key, customers can't reliably deduplicate. Their workarounds (deduplicating on payload hash, on timestamp, on user-id+event-type) are all worse. Send a stable ID per event, document it once, and you've offloaded the deduplication problem onto a primitive customers can actually use.
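
Both sides can be sketched in a few lines. The `X-Webhook-Event-Id` header comes from the text above. The `WebhookEvent` shape, `buildRequest`, and `Deduplicator` are hypothetical names for this example, and the in-memory `Set` stands in for whatever durable store a real receiver would use.

```ts
interface WebhookEvent {
  id: string;
  type: string;
  data: unknown;
}

// Sender side: the same event ID goes into the header and the payload
// on every attempt, so retries are recognizable as retries.
function buildRequest(event: WebhookEvent): {
  headers: Record<string, string>;
  body: string;
} {
  return {
    headers: {
      "Content-Type": "application/json",
      "X-Webhook-Event-Id": event.id, // stable across retries
    },
    body: JSON.stringify(event), // the ID also travels inside the payload
  };
}

// Receiver side: remember processed IDs and skip repeats.
// Assumption: a real receiver would use a durable store with a unique
// constraint on the event ID; a Set only survives until the process restarts.
class Deduplicator {
  private readonly seen = new Set<string>();

  // Runs `handler` only the first time an ID is seen. Returns true if it ran.
  handle(eventId: string, handler: () => void): boolean {
    if (this.seen.has(eventId)) return false;
    handler();
    this.seen.add(eventId); // recorded only after the handler succeeded
    return true;
  }
}
```

Recording the ID after the handler runs means a crash in the handler leaves the event eligible for the next retry, which is the behavior the sender's retry schedule relies on.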

Primitive 3: Dead-letter queue

After max attempts, the event must go somewhere visible. A dead-letter queue (DLQ) is exactly that: a holding pen for events that exhausted retries. Two non-negotiables:

  • Inspect. You should be able to filter the DLQ by customer, event type, and date, and see the full request/response history per event.
  • Replay. You should be able to re-enqueue events from the DLQ with one click. Most DLQ entries become deliverable once the customer fixes their endpoint.

Without a DLQ, exhausted retries vanish. With one, the worst case is 'we'll deliver as soon as the customer's URL is back' — which is what customers actually expect.
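
To make the two requirements concrete, here is a minimal in-memory sketch of a DLQ with inspect and replay. All type and method names (`DeadLetter`, `DlqFilter`, `find`, `replay`) are assumptions for this example. A production DLQ would live in a database, and `enqueue` stands in for whatever puts an event back on the delivery queue.

```ts
interface AttemptRecord {
  at: Date;
  status: number | null; // null for timeouts and network errors
  error?: string;
}

interface DeadLetter {
  eventId: string;
  customerId: string;
  eventType: string;
  deadLetteredAt: Date;
  attempts: AttemptRecord[]; // full request/response history
}

interface DlqFilter {
  customerId?: string;
  eventType?: string;
  since?: Date;
}

class DeadLetterQueue {
  private entries: DeadLetter[] = [];

  add(entry: DeadLetter): void {
    this.entries.push(entry);
  }

  // Inspect: filter by customer, event type, and date.
  find(filter: DlqFilter = {}): DeadLetter[] {
    const since = filter.since?.getTime();
    return this.entries.filter(
      (e) =>
        (filter.customerId === undefined || e.customerId === filter.customerId) &&
        (filter.eventType === undefined || e.eventType === filter.eventType) &&
        (since === undefined || e.deadLetteredAt.getTime() >= since),
    );
  }

  // Replay: hand matching entries back to the delivery queue and
  // remove them from the DLQ. Returns how many were re-enqueued.
  replay(filter: DlqFilter, enqueue: (entry: DeadLetter) => void): number {
    const matches = this.find(filter);
    for (const entry of matches) enqueue(entry);
    this.entries = this.entries.filter((e) => !matches.includes(e));
    return matches.length;
  }
}
```

Replaying by filter rather than by single event is a deliberate choice here: when a customer fixes their endpoint, the natural operation is "replay everything for this customer".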

Primitive 4: Signed payloads

Webhooks are POSTs to URLs that anyone can hit. Customers need a way to verify the request is from you. HMAC-SHA256 over the body plus a timestamp is the standard:

webhook/sign.ts

```ts
import crypto from "node:crypto";

export function signWebhook(body: string, secret: string) {
  // Unix time in seconds, sent as a header and covered by the signature.
  const timestamp = Math.floor(Date.now() / 1000).toString();

  // Sign "<timestamp>.<body>" so the timestamp cannot be swapped out.
  const signed = `${timestamp}.${body}`;
  const sig = crypto
    .createHmac("sha256", secret)
    .update(signed)
    .digest("hex");

  return {
    headers: {
      "X-Webhook-Timestamp": timestamp,
      "X-Webhook-Signature": `v1=${sig}` // "v1" names the signing scheme
    }
  };
}
```

Customers verify by recomputing the HMAC with their shared secret. Include the timestamp in the signed payload (not just as a header) to prevent replay attacks — and recommend customers reject signatures older than 5 minutes.
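
On the receiving side, verification might look like the sketch below. It assumes the `v1=<hex>` signature format and the `timestamp.body` signing string from `signWebhook` above. The function name `verifyWebhook`, its parameter shape, and the injectable `nowSeconds` are choices made for this example. The comparison uses `crypto.timingSafeEqual` so the check does not leak how many leading bytes matched.

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SECONDS = 5 * 60; // reject signatures older than 5 minutes

function verifyWebhook(
  body: string, // the raw request body, exactly as received
  headers: { timestamp: string; signature: string },
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  // Reject missing, malformed, or stale timestamps before doing any crypto.
  const timestamp = Number(headers.timestamp);
  if (!Number.isInteger(timestamp)) return false;
  if (Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) return false;

  if (!headers.signature.startsWith("v1=")) return false;
  const received = Buffer.from(headers.signature.slice(3), "hex");

  // Recompute the HMAC over "<timestamp>.<body>" with the shared secret.
  const expected = createHmac("sha256", secret)
    .update(`${headers.timestamp}.${body}`)
    .digest();

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The body must be the raw bytes of the request. Parsing the JSON and re-serializing it before verifying can change whitespace or key order and make a valid signature fail.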

Per-customer secrets, rotation, and revocation are the operational layer on top. Don't share secrets across customers; if one leaks, you don't want to rotate everyone.

Common gotchas

A 200 from a proxy isn't a 200 from the app

If a customer's reverse proxy buffers the request and returns 200 immediately, you can get a 200 before the customer's app has even seen the event. Result: retries don't fire on failures the proxy hides. The fix is on the customer's side, but you should document the recommendation: respond with 200 from the application after persisting, not from the proxy.
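
The recommendation to customers can be shown as a small handler. This is a sketch under assumptions: `handleWebhook` and the `Persist` callback are hypothetical names, and the function returns the status code instead of writing to a real HTTP response so that it stays framework-neutral.

```ts
// Writes the event somewhere durable (database, queue, disk).
type Persist = (eventId: string, rawBody: string) => Promise<void>;

// Returns the HTTP status the application should send back.
async function handleWebhook(
  eventId: string,
  rawBody: string,
  persist: Persist,
): Promise<number> {
  try {
    await persist(eventId, rawBody); // durable write first
    return 200; // acknowledge only after the write succeeded
  } catch {
    return 500; // a 5xx tells the sender to retry later
  }
}
```

The ordering is the whole point: if the 200 goes out before the write, a crash in between loses the event and no retry will ever fire.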

Customers' own rate limits

If a customer has a rate limit on their webhook endpoint, your retry storm can starve them. Cap per-customer concurrency in your deliverer, and respect any Retry-After they return. This matters most during incident recovery — when their server first comes back and your DLQ replays simultaneously.
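
Respecting `Retry-After` mostly comes down to parsing it. The header carries either a number of seconds or an HTTP date, and the sketch below handles both. The function names `retryAfterMs` and `effectiveDelayMs`, and the policy of taking the longer of the two delays, are assumptions for this example.

```ts
// Parses a Retry-After value (delay in seconds, or an HTTP date) into
// milliseconds from `now`. Returns null if the header is absent or unparseable.
function retryAfterMs(value: string | null, now: Date = new Date()): number | null {
  if (value === null) return null;
  const trimmed = value.trim();

  // Form 1: a non-negative integer number of seconds.
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
  const date = Date.parse(trimmed);
  if (Number.isNaN(date)) return null;
  return Math.max(0, date - now.getTime()); // dates in the past mean "retry now"
}

// Waits at least as long as the customer asked for, and never
// less than our own backoff schedule.
function effectiveDelayMs(
  backoffMs: number,
  retryAfter: string | null,
  now: Date = new Date(),
): number {
  const requested = retryAfterMs(retryAfter, now);
  return requested === null ? backoffMs : Math.max(backoffMs, requested);
}
```

A deliverer would typically also cap the requested delay at its own maximum (24 hours in the schedule above) so that a misconfigured endpoint cannot postpone retries indefinitely.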

Slow consumers

Some customer endpoints take 20+ seconds to respond. If your timeout is 30 seconds and your delivery worker concurrency is small, a few slow customers can block everyone else. Use per-customer concurrency limits and aggressive timeouts (15s is reasonable; document it).
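
A per-customer limit plus a hard timeout can be sketched without any queueing library. `PerCustomerLimiter` and `withTimeout` are hypothetical helpers written for this example. Note that `withTimeout` only stops waiting and does not cancel the underlying request, so a real deliverer would also abort the HTTP call.

```ts
// Caps how many deliveries run at once for each customer, so one
// slow endpoint cannot occupy every worker.
class PerCustomerLimiter {
  private readonly maxPerCustomer: number;
  private readonly active = new Map<string, number>();
  private readonly waiting = new Map<string, Array<() => void>>();

  constructor(maxPerCustomer: number) {
    this.maxPerCustomer = maxPerCustomer;
  }

  async run<T>(customerId: string, task: () => Promise<T>): Promise<T> {
    await this.acquire(customerId);
    try {
      return await task();
    } finally {
      this.release(customerId);
    }
  }

  private acquire(customerId: string): Promise<void> {
    const current = this.active.get(customerId) ?? 0;
    if (current < this.maxPerCustomer) {
      this.active.set(customerId, current + 1);
      return Promise.resolve();
    }
    // At the limit: wait until a running delivery hands over its slot.
    return new Promise<void>((resolve) => {
      const queue = this.waiting.get(customerId) ?? [];
      queue.push(() => resolve());
      this.waiting.set(customerId, queue);
    });
  }

  private release(customerId: string): void {
    const next = this.waiting.get(customerId)?.shift();
    if (next) {
      next(); // the slot passes straight to the next waiter
      return;
    }
    this.active.set(customerId, (this.active.get(customerId) ?? 1) - 1);
  }
}

// Rejects if `task` has not settled within `timeoutMs`.
function withTimeout<T>(task: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    task.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}
```

Used together, a delivery would look like `limiter.run(customerId, () => withTimeout(send(), 15_000))`, where `send` is whatever function performs the POST.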

What this looks like with SimpleQ

webhooks/deliver.ts

```ts
await simpleq.enqueue({
  queue: "webhooks",
  type: "http.request",
  idempotencyKey: `evt_${event.id}`, // same key on every retry of this event
  payload: {
    url: customer.webhookUrl,
    method: "POST",
    body: event,
    // The signature covers JSON.stringify(event); the bytes actually
    // sent must match that string exactly for verification to pass.
    headers: signWebhook(JSON.stringify(event), customer.secret).headers,
    timeoutMs: 15_000
  },
  retry: {
    maxAttempts: 8,
    backoff: "exponential",
    initialDelayMs: 10_000,
    maxDelayMs: 24 * 60 * 60 * 1000, // cap at 24 hours
    jitter: "full"
  }
});
```

Idempotency key, signed payload, exponential backoff capped at 24 hours, eight attempts. Deliveries that exhaust all eight attempts land in the queue's dead-letter queue, replayable from the dashboard.

See the webhook delivery use case for the full pattern, or read the docs for the underlying primitives.

Frequently asked questions

When a webhook delivery fails (timeout, 5xx, network error), a webhook system retries the same payload with backoff. Without retries, transient failures become permanent data loss — your customers miss events that occurred but were never successfully delivered.