Production webhook retry logic requires exponential backoff, jitter, idempotency tracking, dead letter queues, and per-destination rate limiting. Most teams underestimate the scope: a production-grade retry system runs 1,500–3,000 lines of code before you add any destination-specific logic. Tools like Meshes handle retries automatically with exponential backoff and dead letter management, letting you emit events once and track delivery status through the API.
This post walks through building webhook retry logic yourself in Node.js — not because you should, but so you understand exactly what you're signing up for.
The first approach (and why it fails immediately)
Every webhook retry system starts the same way. Someone writes a function that sends an HTTP POST and wraps it in a loop:
```typescript
async function sendWebhook(url: string, payload: object, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload),
      });
      if (res.ok) return { success: true, attempt };
    } catch (err) {
      if (attempt === retries) throw err;
    }
  }
}
```
This works in a tutorial. It fails in production for a half-dozen reasons:
No delay between retries. If a destination returns a 503, it's probably overloaded. Hammering it three times in a row makes things worse. You need backoff.
No jitter. If you have a hundred events queued for the same destination and they all retry on the same schedule, you create a thundering herd that can DDoS the destination.
No idempotency. If your process crashes between sending the webhook and recording the result, you'll retry an event that already succeeded. The destination processes it twice: two CRM contacts created, two emails sent, two charges attempted.
No persistence. If your server restarts, every in-flight retry is lost. Events disappear silently.
No distinction between retryable and permanent errors. A 503 (Service Unavailable) is worth retrying. A 401 (Unauthorized) is not — the credential is wrong, and retrying the same request a thousand times won't fix it.
No dead letter handling. When retries exhaust, the event vanishes. There's no record it existed, no way to replay it, and no alert that something broke.
Let's fix each of these. It gets complicated fast.
Step 1: Exponential backoff with jitter
The first thing your retry system needs is intelligent delays between attempts. Exponential backoff increases the wait time after each failure. Jitter adds randomness so concurrent retries don't collide.
```typescript
function calculateBackoff(
  attempt: number,
  baseDelay = 1000,
  maxDelay = 300_000
): number {
  // Exponential: 1s, 2s, 4s, 8s, 16s, 32s...
  const exponential = baseDelay * Math.pow(2, attempt - 1);
  // Cap at max delay (5 minutes)
  const capped = Math.min(exponential, maxDelay);
  // Full jitter: random value between 0 and the capped delay
  return Math.floor(Math.random() * capped);
}
```
Full jitter (randomizing between zero and the calculated delay) outperforms equal jitter and decorrelated jitter in most real-world scenarios. AWS published research on this in their architecture blog — the short version is that full jitter minimizes total work across all retrying clients while still providing reasonable retry delays.
With five retry attempts and a one-second base delay, your schedule looks roughly like this:
- Attempt 1: immediate
- Attempt 2: 0–1 second
- Attempt 3: 0–2 seconds
- Attempt 4: 0–4 seconds
- Attempt 5: 0–8 seconds
That's your first fifty lines of real code. Now you need somewhere to track these retries.
Step 2: Persistent queue
In-memory retries are worthless. If your process restarts — a deploy, a crash, an OOM kill — every pending retry is gone.
You need a durable queue. Most teams reach for one of three options:
Database-backed queue (PostgreSQL or SQLite). Simple to set up, no new infrastructure. But you're polling for work, which means either wasted queries or delayed retries.
Redis-backed queue (BullMQ). Fast, battle-tested, purpose-built for job scheduling. But now you're running Redis, which means another service to operate, monitor, and back up.
Managed queue (SQS, Cloud Tasks). Scalable, durable, and hands-off. But now you're coupled to a cloud provider and paying per-message.
Here's a minimal BullMQ implementation:
```typescript
import { Queue, Worker } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis({
  host: '127.0.0.1',
  port: 6379,
  maxRetriesPerRequest: null, // required by BullMQ workers
});
const webhookQueue = new Queue('webhook-delivery', { connection });

// Enqueue a webhook delivery
async function enqueueWebhook(
  destinationUrl: string,
  payload: object,
  eventId: string
) {
  await webhookQueue.add(
    'deliver',
    { destinationUrl, payload, eventId },
    {
      attempts: 5,
      backoff: { type: 'exponential', delay: 1000 },
      removeOnComplete: 1000,
      removeOnFail: false, // Keep failed jobs for dead letter inspection
    }
  );
}

// Process webhook deliveries
const worker = new Worker(
  'webhook-delivery',
  async (job) => {
    const { destinationUrl, payload, eventId } = job.data;
    const res = await fetch(destinationUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Event-Id': eventId,
      },
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(30_000),
    });
    if (!res.ok) {
      throw new Error(`Webhook failed: ${res.status} ${res.statusText}`);
    }
  },
  { connection, concurrency: 10 }
);
```
This looks clean. It is also missing about two hundred lines of code you'll need almost immediately.
Step 3: Error classification
Not all errors deserve retries. You need to distinguish:
Retryable errors: 408 (Request Timeout), 429 (Too Many Requests), 500 (Internal Server Error), 502 (Bad Gateway), 503 (Service Unavailable), 504 (Gateway Timeout), network errors (ECONNREFUSED, ECONNRESET, ETIMEDOUT).
Permanent errors: 400 (Bad Request — your payload is wrong), 401 (Unauthorized — credentials are invalid), 403 (Forbidden), 404 (Not Found — the endpoint doesn't exist), 410 (Gone).
Special cases: 429 with a Retry-After header. You need to parse that header and schedule the retry for the specified time, not your default backoff schedule. This one header introduces branching logic that touches your entire retry calculation.
```typescript
function isRetryable(statusCode: number): boolean {
  const retryableCodes = new Set([408, 429, 500, 502, 503, 504]);
  return retryableCodes.has(statusCode);
}

function getRetryDelay(res: Response, attempt: number): number {
  // Respect the Retry-After header if present
  const retryAfter = res.headers.get('Retry-After');
  if (retryAfter) {
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds)) return seconds * 1000;
    // Retry-After can also be an HTTP date
    const date = new Date(retryAfter);
    if (!isNaN(date.getTime())) {
      return Math.max(0, date.getTime() - Date.now());
    }
  }
  return calculateBackoff(attempt);
}
```
Now your worker needs to catch errors, classify them, and either retry or send to a dead letter queue based on the classification. That's another forty lines of branching logic.
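Stripped to its core, that branching reduces to a pure decision function you can test in isolation. This is an illustrative sketch (the name `decideFailureAction` is not from any library); a network error with no status code is assumed transient:

```typescript
type FailureAction = 'retry' | 'dead-letter';

// Given the failure and the attempt count, decide whether to reschedule
// the job or give up and dead-letter it.
function decideFailureAction(
  statusCode: number | null,
  attempt: number,
  maxAttempts = 5
): FailureAction {
  const retryable = new Set([408, 429, 500, 502, 503, 504]);
  if (statusCode !== null && !retryable.has(statusCode)) {
    return 'dead-letter'; // permanent error: more attempts won't help
  }
  return attempt >= maxAttempts ? 'dead-letter' : 'retry';
}
```

With BullMQ specifically, throwing its exported `UnrecoverableError` from the processor fails the job immediately without consuming the remaining attempts, which is one way to wire up the permanent-error branch.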
Step 4: Idempotency tracking
Webhooks get delivered more than once. It happens when your process crashes after sending but before recording success. It happens when the destination is slow and your timeout fires before their 200 arrives. It happens when a network partition makes both sides think the request failed.
Your system needs to track which events have been successfully delivered to which destinations:
```typescript
// Track deliveries to prevent duplicates.
// Assumes a shared ioredis client (`redis`) is in scope.
async function hasBeenDelivered(
  eventId: string,
  destinationId: string
): Promise<boolean> {
  const key = `delivered:${eventId}:${destinationId}`;
  const exists = await redis.get(key);
  return exists !== null;
}

async function markDelivered(
  eventId: string,
  destinationId: string
): Promise<void> {
  const key = `delivered:${eventId}:${destinationId}`;
  // TTL: keep for 7 days to catch delayed retries
  await redis.set(key, '1', 'EX', 604_800);
}
```
But this only covers your side. You should also be sending an idempotency key so the destination can deduplicate on their end. Some APIs accept Idempotency-Key headers. Some use the event ID. Some don't support idempotency at all, which means duplicate delivery causes duplicate side effects and there's nothing you can do about it from the sending side.
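That per-destination variation tends to end up in a header map. A minimal sketch, assuming a hypothetical `DestinationConfig` with an `idempotencyHeader` field (whether the destination actually honors the key is entirely up to them):

```typescript
interface DestinationConfig {
  // Name of the header this destination deduplicates on, if any
  // (e.g. 'Idempotency-Key'); undefined if it supports none.
  idempotencyHeader?: string;
}

function buildWebhookHeaders(
  eventId: string,
  destination: DestinationConfig
): Record<string, string> {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
    'X-Event-Id': eventId, // always sent, for destinations that key on the event ID
  };
  if (destination.idempotencyHeader) {
    headers[destination.idempotencyHeader] = eventId;
  }
  return headers;
}
```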
Step 5: Per-destination rate limiting
Here's where things get genuinely tricky. Different destinations have different rate limits. HubSpot allows 100 requests per 10 seconds. Salesforce varies by edition. Mailchimp has different limits for different endpoints.
If you're sending webhooks to multiple destinations, you need per-destination rate limiting to avoid getting blocked:
```typescript
import { RateLimiterRedis } from 'rate-limiter-flexible';

// Assumes the same shared ioredis client (`redis`) as above.
const rateLimiters = new Map<string, RateLimiterRedis>();

function getRateLimiter(destinationId: string, maxPerSecond = 10) {
  if (!rateLimiters.has(destinationId)) {
    rateLimiters.set(
      destinationId,
      new RateLimiterRedis({
        storeClient: redis,
        keyPrefix: `rl:${destinationId}`,
        points: maxPerSecond,
        duration: 1,
      })
    );
  }
  return rateLimiters.get(destinationId)!;
}

// In your worker, before sending:
async function sendWithRateLimit(
  destinationId: string,
  destinationUrl: string,
  payload: object
) {
  const limiter = getRateLimiter(destinationId);
  try {
    await limiter.consume(destinationId);
  } catch {
    // Rate limited — throw so the queue reschedules with backoff
    throw new Error('Rate limited, will retry');
  }
  // ... send the webhook
}
```
Now multiply this by every destination you support. Each one has its own rate limits, its own authentication patterns, its own error formats, and its own retry semantics. HubSpot returns rate limit info in response headers. Salesforce uses a daily API call allocation. Some APIs return 429 with a Retry-After header. Some return 429 with no guidance at all.
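That variation usually gets encoded in a per-destination config table. The numbers below are illustrative placeholders, not authoritative limits (check each vendor's current documentation), but the shape is representative:

```typescript
interface DestinationLimits {
  points: number;   // allowed requests...
  duration: number; // ...per this many seconds
}

// Illustrative values only; real limits vary by vendor, plan, and endpoint.
const destinationLimits: Record<string, DestinationLimits> = {
  hubspot: { points: 100, duration: 10 },
  default: { points: 10, duration: 1 },
};

function limitsFor(destinationId: string): DestinationLimits {
  return destinationLimits[destinationId] ?? destinationLimits.default;
}
```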
Step 6: Dead letter queue
When retries exhaust, you need somewhere to put events that couldn't be delivered. A dead letter queue preserves the event, the failure context, and the delivery history so someone can investigate and replay later.
```typescript
import type { Job } from 'bullmq';

interface DeadLetter {
  eventId: string;
  destination: string;
  payload: object;
  attempts: Array<{
    timestamp: Date;
    statusCode: number | null;
    error: string;
  }>;
  createdAt: Date;
  lastAttemptAt: Date;
}

// Assumes `db` is a MongoDB-style client and `alertOps` pages your on-call.
async function moveToDeadLetter(job: Job, error: Error): Promise<void> {
  const deadLetter: DeadLetter = {
    eventId: job.data.eventId,
    destination: job.data.destinationUrl,
    payload: job.data.payload,
    attempts: job.data.attemptHistory || [],
    createdAt: new Date(job.timestamp),
    lastAttemptAt: new Date(),
  };
  await db.collection('dead_letters').insertOne(deadLetter);

  // Alert if the DLQ is growing
  const recentCount = await db.collection('dead_letters').countDocuments({
    destination: job.data.destinationUrl,
    createdAt: { $gte: new Date(Date.now() - 3600_000) },
  });
  if (recentCount > 10) {
    await alertOps(
      `Dead letter spike: ${recentCount} failures for ${job.data.destinationUrl} in the last hour`
    );
  }
}
```
You also need a way to replay dead letters — individually and in bulk — once the root cause is fixed. That's another endpoint, another set of authorization checks, and another set of edge cases (what if the replayed event fails again?).
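The replay path can be sketched against an injected store so it stays testable. The `DeadLetterStore` interface and `replayDeadLetters` below are hypothetical names, not part of any library; each replayed event goes back through the normal retry pipeline, so a second failure simply dead-letters it again rather than looping forever:

```typescript
interface StoredDeadLetter {
  eventId: string;
  destination: string;
  payload: object;
}

interface DeadLetterStore {
  listForDestination(destination: string): Promise<StoredDeadLetter[]>;
  remove(eventId: string): Promise<void>;
}

// Re-enqueue every dead letter for a destination; returns how many were replayed.
async function replayDeadLetters(
  store: DeadLetterStore,
  destination: string,
  enqueue: (d: StoredDeadLetter) => Promise<void>
): Promise<number> {
  const letters = await store.listForDestination(destination);
  for (const letter of letters) {
    await enqueue(letter);
    await store.remove(letter.eventId); // only after a successful enqueue
  }
  return letters.length;
}
```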
Step 7: Observability
You're not done. You now need to know what your retry system is doing:
- How many events are pending, retrying, delivered, and dead-lettered per destination?
- What's the p50/p95/p99 delivery latency?
- Which destinations are failing and why?
- How full is your dead letter queue, and is it growing?
This means metrics, dashboards, and alerts. Prometheus counters for delivery attempts and outcomes. Grafana dashboards for per-destination health. PagerDuty alerts for dead letter spikes and queue depth anomalies.
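Before wiring real Prometheus counters, it helps to pin down the label shape. This in-memory sketch (illustrative, not a metrics library) tracks outcomes per destination the way a counter with `{ destination, outcome }` labels would:

```typescript
type Outcome = 'delivered' | 'retried' | 'dead-lettered';

// In production this would be a Prometheus counter with
// { destination, outcome } labels; the map key mirrors that label set.
const deliveryOutcomes = new Map<string, number>();

function recordOutcome(destination: string, outcome: Outcome): void {
  const key = `${destination}:${outcome}`;
  deliveryOutcomes.set(key, (deliveryOutcomes.get(key) ?? 0) + 1);
}

function outcomeCount(destination: string, outcome: Outcome): number {
  return deliveryOutcomes.get(`${destination}:${outcome}`) ?? 0;
}
```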
The real cost: maintenance
Let's say you build all of this. You've written roughly 1,500–3,000 lines of code across the queue, the worker, the retry logic, the error classification, the rate limiting, the dead letter queue, the idempotency tracker, and the observability layer. And that's for a single destination.
Now maintain it.
- BullMQ releases a new major version with breaking changes to the backoff API. Update and test.
- A destination changes their rate limits. Update your config and verify.
- A customer reports missing events. Dig through your dead letter queue, cross-reference with delivery logs, figure out why your retry classification marked a transient error as permanent.
- Redis hits memory limits during a spike. Your queue backs up. Events are delayed. Customers notice.
- Someone on your team adds a new destination. They need to understand the retry system, the rate limiter configuration, the dead letter schema, and the idempotency tracking. They have questions. You're the only one who knows how it works.
This is the real cost: not the initial build, but the permanent operational overhead of running what is effectively a message delivery system alongside your actual product.
Or skip all of this
Meshes is an event routing platform that handles retry logic, exponential backoff with jitter, per-destination rate limiting, dead letter management, and delivery tracking — so you don't build or maintain any of it.
Your code stays this simple:
```typescript
import { MeshesEventsClient } from '@mesheshq/events';

const meshes = new MeshesEventsClient('your_publishable_key');

await meshes.emit({
  event: 'user.signup',
  payload: {
    email: 'jane@example.com',
    plan: 'pro',
    source: 'website',
  },
});
```
One API call. Meshes fans the event out to every configured destination — HubSpot, Salesforce, Mailchimp, Intercom, Slack, custom webhooks — with automatic retries, exponential backoff, jitter, and dead letter management for each destination independently.
If a destination is down, Meshes retries with backoff. If retries exhaust, the event moves to a dead letter queue with full context: the original payload, every attempt timestamp, and the error chain. When the issue is resolved, you replay the dead letters and every event is delivered.
No Redis to manage. No worker processes to monitor. No rate limiter configs to tune. No 1,500–3,000 lines of retry infrastructure to maintain alongside your product.
Meshes supports 10+ integrations including HubSpot, Salesforce, Intercom, Mailchimp, Resend, Slack, and Zoom — with OAuth token management, field mappings, and workspace-per-tenant isolation included. SDKs are available for Node.js, TypeScript, and Go.
The free tier includes 100 events per month. Paid plans start at $49/month.
When to build it yourself
There are legitimate reasons to build your own retry system:
- You have a single destination and simple delivery requirements.
- Your team has deep operational experience with message queues and you already run the infrastructure.
- You need custom retry behavior that no platform supports.
- You're building the retry system as your core product.
For everyone else — and that's most SaaS teams shipping integrations to CRMs, email tools, and communication platforms — the build-vs-buy math doesn't work. You'll spend weeks building what a platform handles in minutes, and then spend months maintaining it.
The code you don't write is the code you don't debug at 3am.
Want to see delivery reliability instead of coding it? Meshes gives you a per-event delivery timeline (attempts, status codes, retries, DLQ, replay) across every destination. Try it free →