
Why Every SaaS Needs a Dead Letter Queue (And How to Stop Losing Events)

When webhooks fail and retries are exhausted, events disappear unless you have a dead letter queue. This post explains what DLQs are, how to implement them for webhooks, and why most teams get this wrong.


Here's a scenario that plays out at every SaaS company eventually:

A customer opens a support ticket: "We're missing three days of lead data in our CRM." You check your logs. The webhook endpoint they configured returned 500 errors for seventy-two hours. Your system retried a few times, gave up, and moved on.

The events are gone. There's no way to replay them. The customer is furious.

This is what happens when you have retries but no dead letter queue.

What is a dead letter queue?

A dead letter queue (DLQ) is where events go when delivery has failed and all retry attempts are exhausted. Instead of silently discarding the event, you park it in a durable store where it can be inspected, debugged, and replayed.

The name comes from postal services: a "dead letter office" handles mail that can't be delivered to the intended recipient. The letter isn't destroyed—it's held until someone figures out what went wrong.

In webhook delivery, a DLQ serves the same purpose. When an event can't reach its destination after multiple retries, it moves to the dead letter store with context: the original payload, the destination, the failure reason, the number of attempts, and the timestamp of each attempt.

This transforms a silent failure into an actionable one. Instead of "the event disappeared," you get "the event failed to deliver to webhook X after five attempts, last error was HTTP 503, and here's the full payload ready to replay."
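The record parked in the dead letter store can be sketched as a plain object (the field names below are illustrative assumptions, not a fixed schema):

```javascript
// Illustrative shape of a dead-letter record. Field names are assumptions;
// the point is that everything needed to debug and replay travels with it.
function toDeadLetter(event, destination, attempts) {
  return {
    id: event.id,
    destination,                      // the URL that kept failing
    eventType: event.type,
    tenant: event.tenant,             // workspace/tenant scoping
    payload: event.payload,           // full original payload, ready to replay
    attempts: attempts.map((a) => ({  // one entry per delivery attempt
      at: a.timestamp,
      error: a.error,                 // e.g. "HTTP 503"
    })),
    deadLetteredAt: new Date().toISOString(),
  };
}
```

With a record like this, "the event disappeared" becomes a concrete row you can inspect, alert on, and re-deliver.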

Why retries alone aren't enough

Retries handle transient failures: a momentary network blip, a destination that's briefly overloaded, a timeout that resolves on the next attempt. With exponential backoff, most transient failures resolve within a few retry cycles.
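That backoff schedule fits in a few lines. The base delay, cap, and "full jitter" strategy here are illustrative choices, not a prescribed policy:

```javascript
// Exponential backoff with "full jitter": the window doubles each attempt
// up to a cap, and the actual delay is a random point inside the window,
// which spreads retries out instead of hammering a recovering destination.
function backoffMs(attempt, baseMs = 1000, capMs = 60000) {
  const windowMs = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * windowMs; // uniform in [0, windowMs)
}
```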

But some failures aren't transient:

  • The destination URL is wrong. A customer typos their webhook endpoint. Every delivery fails with a DNS resolution error. No amount of retrying will fix it.
  • Credentials expired. The OAuth token for a CRM integration expired and the customer didn't re-authorize. Every delivery fails with a 401.
  • The destination is permanently down. The customer shut down a server or deprecated an endpoint. The URL returns 404 indefinitely.
  • Payload schema mismatch. Your event includes a field the destination rejects. Every delivery fails with a 400. Retrying the same payload won't change the result.

In all of these cases, retries eventually exhaust. What happens next is what separates reliable systems from unreliable ones.
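One way to act on this distinction is to classify each failure before scheduling another retry. The status-code mapping below is a common heuristic, not a universal rule:

```javascript
// Heuristic classification of delivery failures (a sketch, not exhaustive).
// Terminal failures head for the DLQ; transient ones get backed-off retries.
function classifyFailure(status) {
  if (status === 429) return "transient";                  // rate limited: back off
  if (status >= 500) return "transient";                   // server-side, may recover
  if (status === 408) return "transient";                  // request timeout
  if (status === 401 || status === 403) return "terminal"; // expired/invalid credentials
  if (status === 404 || status === 410) return "terminal"; // endpoint gone
  if (status >= 400) return "terminal";                    // e.g. 400 schema mismatch
  return "transient";                                      // network errors, no status
}
```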

Without a DLQ, the event is gone. With a DLQ, the event is preserved, visible, and replayable once the root cause is fixed.

What a good DLQ implementation looks like

Not all dead letter queues are created equal. A useful DLQ provides more than a dumping ground for failed events.

Full event context. Store the complete original payload, the destination that failed, the event type, the workspace or tenant, and the error details from each delivery attempt. Without this context, debugging is guesswork.

Searchability. You need to find dead letters by tenant, by destination, by event type, and by time range. A DLQ that requires you to grep through raw files or scroll through an unsorted list is barely better than no DLQ at all.

Replay capability. The single most valuable feature of a DLQ: the ability to re-deliver dead-lettered events once the underlying issue is fixed. Replay should be available per-event (retry this one event) and in bulk (retry all dead letters for this destination in the last 24 hours).
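Both replay modes can be sketched over an in-memory store, assuming a `deliver` function that performs the actual HTTP delivery (all names here are illustrative):

```javascript
// Replay one dead letter by id. The entry is removed only after `deliver`
// resolves, so a failed replay leaves the dead letter in place.
async function replayOne(store, deliver, id) {
  const dl = store.get(id);
  if (!dl) return false;
  await deliver(dl.destination, dl.payload); // re-deliver the original payload
  store.delete(id);                          // clear only on success
  return true;
}

// Bulk replay, scoped to one destination and an optional time floor.
async function replayBulk(store, deliver, { destination, since }) {
  let replayed = 0;
  for (const [id, dl] of [...store]) {
    if (dl.destination !== destination) continue;
    if (since && dl.deadLetteredAt < since) continue;
    if (await replayOne(store, deliver, id)) replayed++;
  }
  return replayed;
}
```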

Alerting. A growing dead letter count is a signal. Alert on sudden spikes (a destination went down), sustained volume (a credential expired), and per-tenant anomalies (one customer's webhook is misconfigured).
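A spike detector can be as small as comparing the latest interval's dead-letter count to a trailing average; the multiplier here is an assumption you would tune:

```javascript
// Fires when the latest interval's dead-letter count exceeds a multiple of
// the trailing average. The floor of 1 avoids alerting on counts like 0 -> 1.
function isSpike(counts, multiplier = 3) {
  if (counts.length < 2) return false;
  const latest = counts[counts.length - 1];
  const prior = counts.slice(0, -1);
  const avg = prior.reduce((a, b) => a + b, 0) / prior.length;
  return latest > Math.max(1, avg * multiplier);
}
```

Running the same check grouped per tenant or per destination is what surfaces the "one customer's webhook is misconfigured" case.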

Retention policy. Dead letters shouldn't live forever. Define a retention window—thirty days is common—after which unresolved dead letters are archived or purged. But don't make the window too short: if a customer doesn't notice the issue for a week, you need the events to still be there.

The cost of not having a DLQ

The real cost isn't technical. It's operational.

Lost revenue. If you're syncing leads to a CRM and events are lost, your customer's sales pipeline has holes. They don't know about leads that entered their funnel. Deals fall through the cracks.

Broken trust. When a customer discovers data loss, they lose confidence in your integration. If it happened once, they'll assume it can happen again. Rebuilding that trust is expensive.

Debugging black holes. Without dead letters, debugging delivery failures is forensic work. You're reconstructing what happened from application logs, queue metrics, and timestamps. With dead letters, you open the DLQ, find the failed event, and see exactly what went wrong.

Manual reconciliation. When events are lost, someone has to manually re-enter the data or build a one-off script to backfill. This is slow, error-prone, and doesn't scale.

Common DLQ mistakes

Even teams that implement dead letter queues often make mistakes that reduce their effectiveness.

DLQ with no replay. You store failed events but provide no mechanism to re-deliver them. The DLQ becomes an archive of regrets rather than a recovery tool.

Global DLQ with no scoping. All dead letters from all tenants and all destinations land in one undifferentiated queue. Finding one customer's failed events requires filtering through thousands of unrelated entries.

No alerting on DLQ growth. Dead letters accumulate silently. Nobody notices until a customer complains—sometimes days or weeks later. By then, the retention window may have expired.

DLQ as a substitute for good retries. Some teams implement minimal retries (one or two attempts) and lean on the DLQ to catch everything else. This creates noise: the DLQ fills up with events that would have succeeded with a proper retry policy. Save the DLQ for genuinely undeliverable events.

How Meshes handles dead letters

Meshes includes dead letter management as a core part of its delivery engine.

When an event exhausts all retry attempts for a given connection, it's moved to the dead letter store with full context: the original payload, the destination, the error chain, and every attempt timestamp.

Dead letters are scoped per workspace, so you and your customers each see only the failures relevant to your own integrations. You can inspect individual dead letters, replay them one at a time, or replay in bulk.

Because Meshes already handles the retry engine (exponential backoff with jitter, per-destination rate limiting, configurable retry policies), events that land in the DLQ are genuinely undeliverable—not false positives from inadequate retries.

Your app code stays the same regardless of whether events succeed or dead-letter:

await fetch('https://api.meshes.io/api/v1/events', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    workspace: 'ws_customer_456',
    event: 'invoice.paid',
    payload: { invoiceId, customerId, amount },
  }),
});

If the destination is down, the platform retries. If retries exhaust, the event lands in the DLQ with full context. When the customer fixes their endpoint, you replay the dead letters and every event is delivered.

No lost data. No manual reconciliation. No support escalation.

When to add a DLQ to your integration stack

If you're sending webhooks or pushing data to external systems, you need a dead letter queue. The question isn't whether you'll have delivery failures—it's when, and whether you'll be able to recover.

Add a DLQ when:

  • You have any production webhook integration. Even one destination with one customer is enough to justify it.
  • You're seeing "missing data" support tickets and can't explain them. A DLQ turns mystery into traceability.
  • You're building customer-facing integrations. If your customers configure their own endpoints, they will misconfigure them. A DLQ catches the fallout.
  • You want to offer delivery guarantees in your SLA. You can't promise reliable delivery without a recovery mechanism for failed events.

Stop losing events

Every webhook failure without a DLQ is a silent data loss. Every silent data loss is a support ticket waiting to happen. Every support ticket erodes the trust you've built with your customer.

A dead letter queue turns unrecoverable failures into recoverable ones. Paired with a solid retry engine, it's the difference between "we lost your events" and "we caught the failure and here's every event ready to replay."

Meshes gives you DLQs, retries, backoff, and per-destination observability out of the box—so you can stop losing events and start guaranteeing delivery.

Tired of losing events to failed webhooks? Join Meshes and get dead letter management, retries, and replay built in.