What Your Integration Code Is Missing: Delivery Guarantees for Product Events
Your SaaS sends events to external tools, but "sends events" and "guarantees delivery" are not the same thing. This post breaks down the five delivery gaps — retries, idempotency, dead letters, isolation, and visibility — and what it takes to close them.
Your SaaS sends events to external tools. A signup goes to HubSpot. A payment failure goes to Salesforce. A cancellation triggers a message in Intercom.
The code that sends those events probably looks reasonable. An HTTP call, maybe a try/catch, maybe a timeout. It works in development. It works in staging. It works in production — until a timeout duplicates a signup, creates two CRM contacts, and nobody notices until a customer complains that they're getting double onboarding emails.
The gap between "sends events" and "guarantees delivery" is wider than most teams realize. And it's a gap that only shows up when something goes wrong at the worst possible time.
What "delivery" actually means
Most integration code treats delivery as a binary: the HTTP call succeeded, or it threw an error. But delivery in a distributed system is a spectrum, and production traffic lives in the uncomfortable middle.
Delivery means the destination received the event, processed it, and acknowledged it. That's three things, not one. A 200 response from HubSpot's API means something different from a 200 returned by a webhook endpoint you don't control. And a timeout doesn't mean the event wasn't received — it means you don't know.
When teams build integration code, they tend to optimize for the happy path: the network is fast, the destination is healthy, the payload is valid. Delivery guarantees are about everything else.
The five gaps
Watch teams debug integration failures across dozens of SaaS products and the same five gaps show up repeatedly. Most codebases have at least three of them.
To make these concrete, let's follow one event through a system that has all five. Here's what that looks like in production.
A customer upgrades from a free trial to a paid plan. Your app emits a subscription.started event. It needs to reach HubSpot (update the contact lifecycle stage), Salesforce (create an opportunity), and Slack (notify the sales channel). The HubSpot API is having a slow day — responses are timing out at about a 30% rate. Here's what happens.
1. No retry strategy (or the wrong one)
The subscription.started call to HubSpot times out. Your code retries immediately — three times in quick succession. HubSpot is already slow, and now you've tripled the load. All three retries time out too.
This is worse than no retries at all. Immediate retries hit a destination that's already struggling. If the failure was caused by rate limiting or temporary overload, you're making the problem worse. If it was a network blip, you're competing with the original request that may still be in flight.
A real retry strategy needs exponential backoff (wait longer between each attempt), jitter (randomize the wait so retries from different events don't all hit at the same time), and a maximum attempt count. It also needs to distinguish between retryable failures (503, timeout, connection reset) and permanent ones (400, 401, 404). Retrying a malformed payload forever doesn't help anyone.
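Here's a minimal sketch of what that strategy looks like in practice. The `send` callable, the status-code sets, and the backoff parameters are illustrative assumptions, not a prescription — real code would tune them per destination:

```python
import random
import time

# Status codes worth retrying vs. ones where retrying never helps.
RETRYABLE = {429, 500, 502, 503, 504}
PERMANENT = {400, 401, 403, 404, 422}

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: a random wait in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def deliver_with_retries(send, max_attempts=5):
    """send() returns an HTTP status code, or raises on timeout / connection failure."""
    for attempt in range(max_attempts):
        try:
            status = send()
        except (TimeoutError, ConnectionError):
            status = None  # network failure: retryable, but we don't know if it landed
        if status is not None and status < 300:
            return True
        if status in PERMANENT:
            return False  # a malformed payload or bad credentials won't fix themselves
        if attempt < max_attempts - 1:
            time.sleep(backoff_delay(attempt))
    return False
```

The jitter matters more than it looks: without it, every event that failed during an outage retries on the same schedule and hits the recovering destination in synchronized waves.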
Most hand-rolled retry code gets at least one of these wrong. If you want the full picture on retry implementation, webhook retry logic done right covers the backoff math and failure classification in detail.
2. No idempotency protection
Here's the thing about that HubSpot timeout: the first call might have actually gone through. HubSpot received the request and updated the contact, but the response never made it back before your client gave up. Your retry succeeds too — and now HubSpot has processed the same event twice. The contact's lifecycle stage was updated once (fine), but your custom event log shows two subscription.started entries, and the workflow triggered by that event fires duplicate emails.
Idempotency protection means tagging every event with a unique key and ensuring that the destination only processes it once, even if it arrives multiple times. Some APIs support this natively (Stripe's Idempotency-Key header, for example). For destinations that don't, you need deduplication logic on your side — and that means tracking which events have been delivered, which adds state, storage, and cleanup concerns.
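A sketch of what receiver-side deduplication involves, using local SQLite purely for illustration — a production version needs a shared store, TTL cleanup, and careful handling of the claim-then-send window:

```python
import sqlite3

class Deduplicator:
    """Tracks delivered (event_id, destination) pairs so a redelivered event is a no-op."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS delivered (event_id TEXT, destination TEXT, "
            "PRIMARY KEY (event_id, destination))"
        )

    def deliver_once(self, event_id, destination, send):
        # Atomically claim the key; a second arrival of the same event hits the
        # primary-key constraint and is skipped.
        try:
            with self.db:
                self.db.execute(
                    "INSERT INTO delivered (event_id, destination) VALUES (?, ?)",
                    (event_id, destination),
                )
        except sqlite3.IntegrityError:
            return "duplicate"
        try:
            send()
        except Exception:
            # Release the claim so a later retry can attempt delivery again.
            with self.db:
                self.db.execute(
                    "DELETE FROM delivered WHERE event_id = ? AND destination = ?",
                    (event_id, destination),
                )
            raise
        return "delivered"
```

Note the per-destination key: the same event legitimately goes to HubSpot and Salesforce once each, so deduplicating on the event ID alone would be wrong.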
3. No dead letter handling
Let's say HubSpot's bad day gets worse. With retries exhausted, the subscription.started event to HubSpot is simply dropped. A log line gets written — maybe — and the event is gone. Three weeks later, a sales rep notices the customer's lifecycle stage never updated. There's no way to recover the original event, no way to replay it. The data is wrong now — and fixing it is manual.
Retries have a limit. After three, five, or ten attempts, some events will still fail. What happens to them?
In most codebases, the answer is: they disappear.
A dead letter queue captures events that have exhausted all retry attempts. It gives your team visibility into what failed, why it failed, and the ability to replay those events once the underlying issue is fixed. Without one, event loss is silent and permanent.
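The shape of a dead letter queue is simple; the value is in what it captures. A minimal in-memory sketch (a real one needs durable storage and an inspection UI):

```python
import time
from collections import deque

class DeadLetterQueue:
    """Captures events that exhausted retries, with enough context to inspect and replay."""

    def __init__(self):
        self.entries = deque()

    def capture(self, event, destination, reason, attempts):
        self.entries.append({
            "event": event,
            "destination": destination,
            "reason": reason,          # why the final attempt failed
            "attempts": attempts,      # how many times we tried
            "failed_at": time.time(),
        })

    def replay(self, send):
        """Retry every dead-lettered event; keep the ones that fail again. Returns the remainder."""
        still_dead = deque()
        while self.entries:
            entry = self.entries.popleft()
            try:
                send(entry["event"], entry["destination"])
            except Exception as exc:
                entry["reason"] = str(exc)
                still_dead.append(entry)
        self.entries = still_dead
        return len(still_dead)
```

Replay is the whole point: once HubSpot recovers, the three weeks of dropped lifecycle updates become a one-command fix instead of a manual data-repair project.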
4. No per-destination isolation
Remember, the subscription.started event needed to reach three destinations: HubSpot, Salesforce, and Slack. Your code sends them sequentially. HubSpot is timing out, your retry logic blocks for 30 seconds, and now the Salesforce opportunity and the Slack notification are sitting in a queue behind a destination that can't receive. The sales team doesn't find out about the upgrade for another hour. HubSpot's bad day became everyone's bad day.
When you send the same event to multiple destinations, a failure to one shouldn't affect the others. But in most integration code, it does.
Destination isolation means each delivery path has its own retry budget, its own failure state, and its own dead letter queue. A degraded HubSpot API doesn't slow down Slack notifications. An expired Salesforce token doesn't prevent Intercom messages.
This is the fan-out problem: one event in, many deliveries out, each with independent lifecycle management. It sounds simple. The implementation is not.
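The core move is structural: stop sending sequentially. A minimal fan-out sketch with per-destination outcomes, using threads for illustration (a production system would use a queue per destination so each one also gets its own retry budget and dead letter queue):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(event, destinations):
    """Deliver one event to every destination in parallel.

    destinations maps a name to a send callable. Each delivery gets its own
    outcome; a timeout in one never blocks or fails the others."""
    def attempt(name, send):
        try:
            send(event)
            return name, "delivered", None
        except Exception as exc:
            return name, "failed", str(exc)

    with ThreadPoolExecutor(max_workers=len(destinations)) as pool:
        futures = [pool.submit(attempt, name, send) for name, send in destinations.items()]
        return {name: (status, err) for name, status, err in (f.result() for f in futures)}
```

In the scenario above, HubSpot's timeout would show up as one failed entry in the result map while Slack and Salesforce deliveries complete on time.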
5. No delivery visibility
A week later, someone asks: "Did customer X's upgrade event actually reach Salesforce?" Your team checks application logs. There's a generic INFO: event sent line, but no record of the response code, no record of how many attempts were made, no record of whether Salesforce got it on the first try or the third. The investigation takes two hours and ends with "probably?"

That question — did this specific event reach this specific destination last Tuesday? — should take seconds to answer. For most teams, it can't be answered at all.
Most integration code has no audit trail. Events are fire-and-forget by default. When they succeed, nobody asks questions. When they fail, there's nothing to investigate.
Delivery visibility means logging every attempt — not just successes and final failures, but each retry, each response code, each timeout. It means being able to trace a single event from emission through every delivery attempt to every destination.
Without this, debugging integration issues means reading application logs, correlating timestamps, and guessing. With it, you can answer "what happened to this event?" in seconds.
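The data model behind that kind of visibility is small. A sketch of an attempt log with a trace query — field names are illustrative, and a real version lives in a database with retention policies:

```python
import time

class DeliveryLog:
    """Records every delivery attempt so "what happened to this event?" is a query,
    not a two-hour log archaeology session."""

    def __init__(self):
        self.attempts = []

    def record(self, event_id, destination, attempt, status, error=None):
        self.attempts.append({
            "event_id": event_id,
            "destination": destination,
            "attempt": attempt,        # 1 for the first try, 2 for the first retry, ...
            "status": status,          # HTTP status code, or None on timeout
            "error": error,
            "at": time.time(),
        })

    def trace(self, event_id):
        """Every attempt for one event, across all destinations, in order."""
        return [a for a in self.attempts if a["event_id"] == event_id]
```

With this in place, "did customer X's event reach Salesforce?" is a single `trace` call: you see the timeout on attempt one, the 200 on attempt two, and the timestamps of both.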
Why teams skip these
Nobody sets out to build unreliable integrations. These gaps accumulate because each one feels like a separate concern, and each one is genuinely hard to get right.
Retry logic with proper backoff and jitter is maybe 300 lines. Idempotency tracking adds a data store and TTL management. Dead letter queues add storage, a UI for inspection, and replay logic. Destination isolation means rearchitecting from sequential calls to parallel delivery with independent failure handling. Delivery logging means schema design, retention policies, and a way to query it.
Add it up and you're looking at thousands of lines of infrastructure code that have nothing to do with your product's core value. And that's before you handle edge cases like partial failures, API schema drift, per-destination rate limits, or replay safety. This isn't one-time code — it's a system you now own. It needs to be maintained, tested, monitored, and debugged by the same team that's supposed to be shipping features.
So teams compromise. They ship the happy path. They add "TODO: retry logic" comments. They tell themselves they'll come back to it when integrations are more mature.
They usually don't come back to it until something breaks in production.
The build-vs-buy decision
At some point, every team that takes integrations seriously faces the same question: should we build this infrastructure ourselves, or use something purpose-built?
The answer depends on your situation. If you have one integration with one destination and low event volume, a simple retry wrapper is probably fine. If you have multiple destinations, customer-facing integrations, or events where loss has business impact (revenue events, compliance events, user-facing notifications), the math changes quickly. Even at a 1% failure rate, a product sending 10,000 events a day is dropping 100 events daily — and that's before a destination has a bad day.
The questions that matter are: how many hours per month does your team spend debugging silent delivery failures? How confident are you that every event reaches every destination? And what's the cost when one doesn't?
Most teams don't decide to build this infrastructure. They accidentally end up owning it.
Don't want to build and maintain event delivery infrastructure? Join Meshes and get retries, deduplication, fan-out, dead letter queues, and delivery logging for your product events out of the box.