You shipped a webhook integration in an afternoon. Two weeks later, you're staring at a support ticket: "We never got that event."
The payload was sent. The destination returned a 503. Your code didn't retry. The event vanished.
So you add retries. Now you have new problems.
The retry spectrum: too little to too much
Most teams land on one of two extremes.
Too naive: Retry once, immediately. If the destination is down for thirty seconds, you lose the event. If it's rate-limiting you, you just hit it again before it's ready.
Too aggressive: Retry every five seconds, forever. You hammer a recovering server. You burn through rate limits. You create a thundering herd when a popular destination comes back online and every tenant's queued events fire at once.
Neither approach is reliable. The goal is something in between: persistent enough to survive transient failures, polite enough not to make them worse.
Exponential backoff: the basics
Exponential backoff spaces retries further and further apart. Instead of retrying at fixed intervals (1s, 1s, 1s…), you double the wait each time:
- Attempt 1: immediate
- Attempt 2: wait 1 second
- Attempt 3: wait 2 seconds
- Attempt 4: wait 4 seconds
- Attempt 5: wait 8 seconds
- …and so on
This gives the destination time to recover without your system pounding it during an outage.
A typical implementation looks something like this:
```typescript
function getBackoffDelay(attempt: number, baseDelayMs = 1000): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt), 60_000); // cap at 60s
}
```
The cap matters. Without it, your tenth retry waits over seventeen minutes (2^10 seconds). A cap of sixty seconds (or whatever your SLA tolerates) keeps later retries frequent enough to catch short recovery windows instead of disappearing into the void.
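Wiring that helper into a delivery loop might look something like the sketch below. The function and parameter names here (`sendWithRetry`, `DeliveryResult`, the attempt limit of five) are illustrative, not a prescribed API; a real system would also record exhausted deliveries to a dead-letter queue rather than just throwing.

```typescript
// Minimal shape of a delivery outcome; fetch's Response satisfies this.
interface DeliveryResult {
  ok: boolean;   // true for 2xx responses
  status: number;
}

function getBackoffDelay(attempt: number, baseDelayMs = 1000): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt), 60_000); // cap at 60s
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical delivery loop: retry with capped exponential backoff.
async function sendWithRetry(
  send: () => Promise<DeliveryResult>, // e.g. () => fetch(url, { method: "POST", body })
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<DeliveryResult> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (attempt > 0) {
      await sleep(getBackoffDelay(attempt - 1, baseDelayMs)); // 1s, 2s, 4s, ...
    }
    try {
      const res = await send();
      if (res.ok) return res;                   // delivered
      lastError = new Error(`HTTP ${res.status}`); // non-2xx: retry
    } catch (err) {
      lastError = err;                          // network error: retry
    }
  }
  throw lastError; // exhausted retries; a real system would dead-letter here
}
```

Note that the first attempt fires immediately; the backoff only applies to retries, matching the schedule in the list above.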
Why you need jitter
Pure exponential backoff has a hidden problem: synchronization.
If a destination goes down and five hundred events queue up across all your tenants, every one of those events will retry at nearly the same moment—1s, 2s, 4s, 8s—in lockstep. When the destination recovers, it gets hit with a wall of traffic. This is the thundering herd problem.
Jitter adds randomness to the delay so retries spread out over time instead of clustering:
```typescript
function getBackoffWithJitter(attempt: number, baseDelayMs = 1000): number {
  const exponentialDelay = Math.min(baseDelayMs * Math.pow(2, attempt), 60_000);
  return Math.random() * exponentialDelay; // "full jitter"
}
```
AWS published a well-known analysis comparing jitter strategies (Marc Brooker's "Exponential Backoff and Jitter" on the AWS Architecture Blog). In its simulations, full jitter (randomizing across the entire window) compared favorably to "equal jitter" and "decorrelated jitter", and for webhook-style workloads it has a further appeal: it spreads retries across the whole window, which is exactly what breaks up a thundering herd.
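For comparison, the three strategies can be sketched as follows. This is a rough rendering of the formulas from that analysis, not code from it; the cap and base values are illustrative.

```typescript
const CAP_MS = 60_000; // illustrative cap
const BASE_MS = 1000;  // illustrative base delay

// Full jitter: pick uniformly from [0, exponential delay).
function fullJitter(attempt: number): number {
  return Math.random() * Math.min(CAP_MS, BASE_MS * 2 ** attempt);
}

// Equal jitter: keep half the exponential delay fixed, randomize the rest.
function equalJitter(attempt: number): number {
  const half = Math.min(CAP_MS, BASE_MS * 2 ** attempt) / 2;
  return half + Math.random() * half;
}

// Decorrelated jitter: each delay depends on the previous delay,
// not the attempt count: min(cap, random_between(base, prev * 3)).
function decorrelatedJitter(previousDelayMs: number): number {
  return Math.min(CAP_MS, BASE_MS + Math.random() * (previousDelayMs * 3 - BASE_MS));
}
```

Equal jitter guarantees a minimum wait (half the window), which some teams prefer for predictability; full jitter trades that floor away for maximum spread.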