You shipped a webhook integration in an afternoon. Two weeks later, you're staring at a support ticket: "We never got that event."
The payload was sent. The destination returned a 503. Your code didn't retry. The event vanished.
So you add retries. Now you have new problems.
The retry spectrum: too little to too much
Most teams land on one of two extremes.
Too naive: Retry once, immediately. If the destination is down for thirty seconds, you lose the event. If it's rate-limiting you, you just hit it again before it's ready.
Too aggressive: Retry every five seconds, forever. You hammer a recovering server. You burn through rate limits. You create a thundering herd when a popular destination comes back online and every tenant's queued events fire at once.
Neither approach is reliable. The goal is something in between: persistent enough to survive transient failures, polite enough not to make them worse.
Exponential backoff: the basics
Exponential backoff spaces retries further and further apart. Instead of retrying at fixed intervals (1s, 1s, 1s…), you double the wait each time:
- Attempt 1: immediate
- Attempt 2: wait 1 second
- Attempt 3: wait 2 seconds
- Attempt 4: wait 4 seconds
- Attempt 5: wait 8 seconds
- …and so on
This gives the destination time to recover without your system pounding it during an outage.
A typical implementation looks something like this:
```typescript
function getBackoffDelay(attempt: number, baseDelayMs = 1000): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt), 60_000); // cap at 60s
}
```
The cap matters. Without it, your tenth retry waits over seventeen minutes (2^10 seconds). A cap of sixty seconds (or whatever your SLA tolerates) keeps later retries frequent enough to catch short recovery windows instead of disappearing into the void.
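Wiring that helper into a delivery loop might look something like the sketch below. The function and parameter names here (`sendWithRetry`, `DeliveryResult`, the attempt limit of five) are illustrative, not a prescribed API; a real system would also record exhausted deliveries to a dead-letter queue rather than just throwing.

```typescript
// Minimal shape of a delivery outcome; fetch's Response satisfies this.
interface DeliveryResult {
  ok: boolean;   // true for 2xx responses
  status: number;
}

function getBackoffDelay(attempt: number, baseDelayMs = 1000): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt), 60_000); // cap at 60s
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical delivery loop: retry with capped exponential backoff.
async function sendWithRetry(
  send: () => Promise<DeliveryResult>, // e.g. () => fetch(url, { method: "POST", body })
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<DeliveryResult> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (attempt > 0) {
      await sleep(getBackoffDelay(attempt - 1, baseDelayMs)); // 1s, 2s, 4s, ...
    }
    try {
      const res = await send();
      if (res.ok) return res;                   // delivered
      lastError = new Error(`HTTP ${res.status}`); // non-2xx: retry
    } catch (err) {
      lastError = err;                          // network error: retry
    }
  }
  throw lastError; // exhausted retries; a real system would dead-letter here
}
```

Note that the first attempt fires immediately; the backoff only applies to retries, matching the schedule in the list above.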
Why you need jitter
Pure exponential backoff has a hidden problem: synchronization.
If a destination goes down and five hundred events queue up across all your tenants, every one of those events will retry at nearly the same moment—1s, 2s, 4s, 8s—in lockstep. When the destination recovers, it gets hit with a wall of traffic. This is the thundering herd problem.
Jitter adds randomness to the delay so retries spread out over time instead of clustering:
```typescript
function getBackoffWithJitter(attempt: number, baseDelayMs = 1000): number {
  const exponentialDelay = Math.min(baseDelayMs * Math.pow(2, attempt), 60_000);
  return Math.random() * exponentialDelay; // "full jitter"
}
```
AWS published a well-known analysis comparing jitter strategies (Marc Brooker's "Exponential Backoff and Jitter" on the AWS Architecture Blog). In its simulations, full jitter (randomizing across the entire window) compared favorably to "equal jitter" and "decorrelated jitter", and for webhook-style workloads it has a further appeal: it spreads retries across the whole window, which is exactly what breaks up a thundering herd.
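For comparison, the three strategies can be sketched as follows. This is a rough rendering of the formulas from that analysis, not code from it; the cap and base values are illustrative.

```typescript
const CAP_MS = 60_000; // illustrative cap
const BASE_MS = 1000;  // illustrative base delay

// Full jitter: pick uniformly from [0, exponential delay).
function fullJitter(attempt: number): number {
  return Math.random() * Math.min(CAP_MS, BASE_MS * 2 ** attempt);
}

// Equal jitter: keep half the exponential delay fixed, randomize the rest.
function equalJitter(attempt: number): number {
  const half = Math.min(CAP_MS, BASE_MS * 2 ** attempt) / 2;
  return half + Math.random() * half;
}

// Decorrelated jitter: each delay depends on the previous delay,
// not the attempt count: min(cap, random_between(base, prev * 3)).
function decorrelatedJitter(previousDelayMs: number): number {
  return Math.min(CAP_MS, BASE_MS + Math.random() * (previousDelayMs * 3 - BASE_MS));
}
```

Equal jitter guarantees a minimum wait (half the window), which some teams prefer for predictability; full jitter trades that floor away for maximum spread.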