Webhook Retry Logic Done Right: Exponential Backoff, Jitter, and When to Give Up
Most webhook retry implementations are either too aggressive or too naive. This post breaks down exponential backoff with jitter, idempotency, and how to stop re-implementing retry logic for every integration.
You shipped a webhook integration in an afternoon. Two weeks later, you're staring at a support ticket: "We never got that event."
The payload was sent. The destination returned a 503. Your code didn't retry. The event vanished.
So you add retries. Now you have new problems.
The retry spectrum: too little to too much
Most teams land on one of two extremes.
Too naive: Retry once, immediately. If the destination is down for thirty seconds, you lose the event. If it's rate-limiting you, you just hit it again before it's ready.
Too aggressive: Retry every five seconds, forever. You hammer a recovering server. You burn through rate limits. You create a thundering herd when a popular destination comes back online and every tenant's queued events fire at once.
Neither approach is reliable. The goal is something in between: persistent enough to survive transient failures, polite enough not to make them worse.
Exponential backoff: the basics
Exponential backoff spaces retries further and further apart. Instead of retrying at fixed intervals (1s, 1s, 1s…), you double the wait each time:
Attempt 1: immediate
Attempt 2: wait 1 second
Attempt 3: wait 2 seconds
Attempt 4: wait 4 seconds
Attempt 5: wait 8 seconds
…and so on
This gives the destination time to recover without your system pounding it during an outage.
A typical implementation looks something like this:
```typescript
function getBackoffDelay(attempt: number, baseDelayMs = 1000): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt), 60_000); // cap at 60s
}
```
The cap matters. Without it, your tenth retry waits over seventeen minutes. A cap of sixty seconds (or whatever your SLA tolerates) keeps retries aggressive enough to catch short recovery windows without disappearing into the void.
Why you need jitter
Pure exponential backoff has a hidden problem: synchronization.
If a destination goes down and five hundred events queue up across all your tenants, every one of those events will retry at nearly the same moment—1s, 2s, 4s, 8s—in lockstep. When the destination recovers, it gets hit with a wall of traffic. This is the thundering herd problem.
Jitter adds randomness to the delay so retries spread out over time instead of clustering:
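One common approach is full jitter: instead of waiting the full backoff delay, wait a random amount between zero and that delay. A minimal sketch, reusing the 1-second base and 60-second cap from the earlier example:

```typescript
// Full jitter: pick a uniformly random delay in [0, cappedExponentialDelay].
// Base delay and 60s cap mirror the getBackoffDelay example above.
function getJitteredDelay(attempt: number, baseDelayMs = 1000): number {
  const cappedDelay = Math.min(baseDelayMs * Math.pow(2, attempt), 60_000);
  return Math.random() * cappedDelay;
}
```

With five hundred queued events, each now picks its own point inside the retry window instead of all firing at the same instant.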
AWS published a well-known analysis comparing jitter strategies. Full jitter (randomizing across the entire window) tends to outperform "equal jitter" or "decorrelated jitter" in most webhook-style workloads because it maximizes spread across the retry window.
When to give up: dead letters and alerting
Retries can't go on forever. At some point you need to accept that delivery failed and handle it:
Set a max retry count. Five to ten attempts with exponential backoff covers most transient failures. After that, the issue is probably not transient.
Move failed events to a dead-letter store. Don't discard them. Engineers (or customers) need to inspect what failed, why, and replay it once the root cause is fixed.
Alert on dead-letter volume. A spike in dead letters usually means a destination is misconfigured or permanently down—something a human should look at.
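Putting those pieces together, a delivery loop might look roughly like this sketch, where deliver() and deadLetter() are hypothetical hooks standing in for your HTTP call and your dead-letter store:

```typescript
// Sketch of a delivery loop that caps retries and parks failures for replay.
// deliver() and deadLetter() are hypothetical hooks; in a real system
// deadLetter() would write to durable storage and feed your alerting.
const MAX_ATTEMPTS = 8;

async function deliverWithRetries(
  event: { id: string; payload: unknown },
  deliver: (payload: unknown) => Promise<void>,
  deadLetter: (event: { id: string; payload: unknown }, lastError: unknown) => Promise<void>,
  baseDelayMs = 1000
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      await deliver(event.payload);
      return; // success: stop retrying
    } catch (err) {
      lastError = err;
      // Exponential backoff with full jitter, capped at 60s.
      const delay = Math.random() * Math.min(baseDelayMs * 2 ** attempt, 60_000);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Every attempt failed: hand off for inspection and later replay.
  await deadLetter(event, lastError);
}
```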
The hard part isn't implementing any one of these. The hard part is implementing them per integration. If you have five webhook destinations with different retry semantics, you now maintain five retry policies, five dead-letter stores, and five alerting rules.
Idempotency: because retries mean duplicates
Retries guarantee that some events will be delivered more than once. A destination might receive the payload, process it, and then fail to return a 200 before your timeout fires. You retry. Now the destination has processed the same event twice.
This is why every event should carry a unique ID (like a UUID or a deterministic hash), and consumers should check whether they've already processed it:
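A sketch of that consumer-side check, with an in-memory Set standing in for what would be a database or cache in production:

```typescript
// Consumer-side idempotency: skip events whose ID we've already processed.
// In production the "seen" set would live in a database or cache with a TTL,
// not in process memory.
const processedEventIds = new Set<string>();

function handleEvent(
  event: { id: string; payload: unknown },
  process: (payload: unknown) => void
): boolean {
  if (processedEventIds.has(event.id)) {
    return false; // duplicate delivery: already handled, safely ignore
  }
  process(event.payload);
  processedEventIds.add(event.id);
  return true;
}
```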
This is table stakes for reliable webhook delivery, and it's entirely on the consumer to implement. As the producer, all you can do is include a stable event ID and document your retry behavior so consumers know to expect duplicates.
What this looks like at scale
For one or two integrations, hand-rolling retries with a job queue works fine. But as you add more destinations, the complexity multiplies:
Different destinations have different rate limits.
Some require OAuth token refresh before retrying.
Some return 429 with a Retry-After header you should respect.
Some destinations are per-customer, meaning retry isolation matters.
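For the 429 case specifically, a retry scheduler can prefer the server's own hint over its computed backoff. A hedged sketch that handles only the delay-seconds form of Retry-After (the HTTP-date form is omitted for brevity):

```typescript
// Prefer the destination's Retry-After hint when present; otherwise fall back
// to capped exponential backoff with full jitter.
function nextDelayMs(attempt: number, retryAfterHeader?: string, baseDelayMs = 1000): number {
  if (retryAfterHeader !== undefined) {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return seconds * 1000; // the server told us when it's ready; respect it
    }
  }
  const capped = Math.min(baseDelayMs * 2 ** attempt, 60_000);
  return Math.random() * capped; // full-jitter fallback
}
```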
You end up building a mini delivery engine inside your app—queues, retry tables, dead-letter stores, per-destination configuration, observability dashboards—all to solve a problem that's fundamentally the same across every integration.
A better approach: externalize the retry engine
This is the exact problem a dedicated integration layer solves. Instead of implementing retry logic in your app code, you emit events to a single API and let the integration layer handle delivery:
Exponential backoff with jitter is configured once, not per-destination.
Dead letters are stored and surfaceable per event and per destination.
Rate limiting is respected per connection.
Retry isolation means one failing destination doesn't back up others.
The retries, backoff, dead letters, and delivery guarantees happen outside your codebase. You define them once per connection and forget about them.
Stop re-inventing the retry wheel
If you've built retry logic once, you understand it. If you've built it three or four times for different integrations, you understand why it shouldn't live in your app.
Webhook retry logic is a solved problem—but only if you solve it in one place. A dedicated integration layer like Meshes gives you retries, backoff, jitter, dead letters, and per-destination observability out of the box, so you can focus on what your product actually does.
Ready to stop hand-rolling retries? Join Meshes and let the platform handle delivery while you ship features.