Your webhook integration works perfectly in development. You fire a test event, the destination returns a 200, and you move on to the next feature. Six months later, you're debugging why 300 HubSpot contacts never updated — and nobody noticed until pipeline numbers were wrong.
Webhook failures in production don't look like webhook failures in development. They're intermittent, silent, and context-dependent. They show up at 2am when a destination's API is under maintenance, or gradually over weeks as a token expires and nobody notices.
This post covers the actual failure modes for outbound event delivery — the ones that don't appear until you have real traffic, real destinations, and real consequences.
In production, you're not sending webhooks — you're operating a delivery system.
The failure taxonomy
Most delivery failures aren't outages — they're mismatches between how your system behaves and how external APIs behave under load.
Not all delivery failures are created equal. Some are loud and obvious. Most aren't. The dangerous ones are the failures that look like successes, or the ones that happen so quietly that nobody knows until the data is already wrong.
1. Destination downtime and partial outages
This is the failure mode everyone thinks about and, ironically, the easiest to handle. The destination returns a 503 or a connection refused. Your code sees the error. You retry. Eventually the destination comes back.
The harder version is the partial outage. HubSpot's API is responding, but slowly. Response times jump from 200ms to 8 seconds. Your timeout is set to 10 seconds, so requests technically succeed — but your throughput drops 40-fold, your queue backs up, and events start arriving minutes or hours late. Nothing failed outright, but everything is functionally broken.
Partial outages are more common than full outages, and they're harder to detect because your error rate stays flat. Your latency metrics are the only signal, and most outbound delivery implementations don't track per-destination latency.
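Catching a partial outage means watching latency, not errors. Here's a minimal sketch of per-destination latency tracking — the class name, window size, and slowdown threshold are all illustrative choices, not a prescribed design:

```python
import time
from collections import defaultdict, deque

class DestinationLatencyTracker:
    """Rolling latency window per destination; flags slowdowns even when
    the error rate stays flat (the partial-outage signal)."""

    def __init__(self, window_size=100, slowdown_factor=4.0):
        self.samples = defaultdict(lambda: deque(maxlen=window_size))
        self.slowdown_factor = slowdown_factor

    def record(self, destination, latency_ms):
        self.samples[destination].append(latency_ms)

    def p95(self, destination):
        window = sorted(self.samples[destination])
        if not window:
            return None
        return window[int(0.95 * (len(window) - 1))]

    def is_degraded(self, destination, baseline_ms):
        """True when the current p95 exceeds a healthy baseline by slowdown_factor."""
        current = self.p95(destination)
        return current is not None and current > baseline_ms * self.slowdown_factor

tracker = DestinationLatencyTracker()
for _ in range(100):
    tracker.record("hubspot", 8000)   # responses jumped from ~200ms to ~8s
print(tracker.is_degraded("hubspot", baseline_ms=200))  # True
```

In production the baseline would be learned from observed traffic rather than hard-coded, but the core idea holds: the alert fires on a latency ratio, not on an error count.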
2. Authentication expiry
OAuth tokens expire. API keys get rotated. Service accounts get deactivated. Every authentication mechanism has a failure mode, and most of them are silent.
The typical pattern: your team sets up a HubSpot integration, authenticates via OAuth, and moves on. The access token expires in 6 hours, but you have a refresh token. The refresh token works for 6 months. On month 7, deliveries start failing with 401s. If your code retries 401s (it shouldn't — they're not transient), you burn through your retry budget on every event. If it doesn't retry, events are silently dropped.
Salesforce is worse. A Salesforce admin changes the connected app's permissions, or the integration user's profile gets updated during a security audit. Your token is still technically valid, but now you're getting 403s on endpoints that worked yesterday. The error message says "insufficient privileges" and tells you nothing about what changed.
The fix isn't just "refresh your tokens." In practice, auth failures are rarely just expired access tokens — they include revoked grants, changed scopes, disabled integration users, and drift in destination-side app configuration. Handling auth correctly means monitoring auth health independently from delivery health, alerting before tokens expire rather than after, and handling the difference between "token expired" (refreshable) and "access revoked" (requires human intervention).
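That "token expired" versus "access revoked" distinction can be encoded directly in the delivery path. A hedged sketch — the function name and return labels are illustrative, and real destinations sometimes blur these categories:

```python
def classify_auth_failure(status_code, attempted_refresh):
    """Distinguish refreshable token expiry from revoked access.

    Returns one of:
      "refresh" - 401 and we haven't tried refreshing yet: automated recovery
      "dead"    - 401 after a refresh attempt, or 403: grant revoked or
                  scopes changed; a human must re-authorize
      "deliver" - not an auth failure at all
    """
    if status_code == 401:
        return "dead" if attempted_refresh else "refresh"
    if status_code == 403:
        # Token is still valid, but destination-side permissions changed.
        return "dead"
    return "deliver"
```

The point of the `"dead"` branch is to page a human and pause deliveries to that destination — retrying a revoked grant only burns retry budget and floods the destination with doomed requests.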
3. Payload rejection
Your event schema changed. A field that used to be a string is now an object. A required field that was always present is now sometimes null. The destination's API validates strictly and returns a 400.
This is the failure mode that gets worse as your product matures. Early on, your event payloads are simple and stable. As your product grows, events get richer, schemas evolve, and the surface area for payload mismatches expands.
The subtle version: your payload is technically valid, but the values don't match what the destination expects. You send an email address to HubSpot, but it contains a plus-addressed alias that HubSpot's deduplication doesn't handle the way you expected. The API returns a 200, the contact is created, but it's a duplicate that your sales team discovers three weeks later during a pipeline review. From your system's perspective, everything succeeded. From your business's perspective, your data is now wrong.
400-level errors are permanent failures — retrying them is pointless. But most integration implementations don't distinguish between "this will never work" and "try again later." They either retry everything or retry nothing.
4. Rate limiting
Every API has rate limits. Most of them are generous enough that you'll never hit them during testing. All of them are reachable in production during a traffic spike, a backfill, or a batch operation.
If your SaaS has a signup surge — a Product Hunt launch, a viral moment, a marketing campaign that actually works — and each signup fans out to HubSpot, Salesforce, and Mailchimp, you can burn through destination rate limits in seconds.
The failure pattern depends on how the destination handles rate limiting. Some return 429 with a Retry-After header. Some return 429 without one. Some return 503 when they're actually rate limiting. Some silently queue your requests and process them later, which means your events arrive out of order and your metrics lag behind reality.
Proper rate limit handling means respecting Retry-After headers when they exist, implementing client-side rate tracking when they don't, and ensuring that one destination's rate limits don't stall other destinations — this is the isolation problem, and it's the one most teams miss.
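The Retry-After logic above can be sketched in a few lines. This assumes the delta-seconds form of the header (it can also be an HTTP-date, which this sketch skips past) and uses capped exponential backoff with jitter as the fallback:

```python
import random

def retry_after_delay(status_code, headers, attempt, base=1.0, cap=60.0):
    """Compute a backoff delay for a rate-limited request.

    Respects Retry-After when the destination sends it; otherwise falls
    back to capped exponential backoff with jitter so a fleet of clients
    doesn't retry in lockstep.
    """
    if status_code != 429:
        return None  # not rate limited
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)          # delta-seconds form
        except ValueError:
            pass                               # HTTP-date form: fall through
    delay = min(cap, base * (2 ** attempt))
    return delay / 2 + random.uniform(0, delay / 2)  # half-jitter
```

The destination-specific part — tracking your own request rate so you stop *before* the 429 arrives — sits on top of this, and that budget has to be per destination, which is the isolation problem again.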
5. Timeout ambiguity
A timeout is the worst kind of failure because it's genuinely ambiguous. The request might have been received and processed. It might have been received and queued. It might have been lost in transit. You don't know, and there's no way to find out after the fact unless the destination's API supports idempotency.
The common mistake is treating timeouts like failures and retrying unconditionally. If the destination processed the original request, you've just created a duplicate. If it didn't, the retry is correct. Since you can't tell the difference, you need idempotency keys on every request — and not every destination supports them.
Timeout thresholds matter too. Set them too short and you'll retry requests that would have succeeded. Set them too long and your event processing pipeline stalls waiting for a destination that isn't going to respond. The right timeout is different for every destination, and it should be tuned based on observed latency, not guesswork.
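The safe way to retry an ambiguous timeout is to pin one idempotency key to the event for its whole lifetime. A sketch under assumptions: `send` stands in for whatever transport you use, and the status-code routing here is deliberately simplified:

```python
import uuid

def deliver_with_retry(send, payload, max_attempts=3, timeout_s=5.0):
    """Retry ambiguous failures safely by generating the idempotency key
    once and reusing it on every attempt. `send` is any transport callable
    taking (payload, idempotency_key, timeout_s) and returning an HTTP
    status code, or raising TimeoutError.
    """
    key = str(uuid.uuid4())              # generated once, never per attempt
    for _ in range(max_attempts):
        try:
            status = send(payload, idempotency_key=key, timeout_s=timeout_s)
        except TimeoutError:
            continue                     # ambiguous: retry with the SAME key
        if 200 <= status < 300:
            return status                # delivered
        if 400 <= status < 500 and status != 429:
            return status                # permanent failure: stop retrying
        # 5xx / 429: transient, loop again with the same key
    return None                          # exhausted: dead-letter the event
```

If the destination honors the key, a retry after a processed-but-timed-out request becomes a no-op instead of a duplicate. If it doesn't support idempotency keys at all, the key still matters — it becomes the handle for your own deduplication on the sending side.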
6. DNS and network-layer failures
These are the failures that don't show up in your application logs because they happen below your HTTP client. DNS resolution fails. A TLS certificate expires on the destination's side. A CDN edge node returns a cached error page instead of routing to the API.
Your code sees a connection error or, worse, an HTML error page that your JSON parser chokes on. The stack trace points to your HTTP client library, not to anything useful. Debugging requires network-level visibility that most application logging doesn't provide.
These failures are rare, but when they happen, they affect 100% of traffic to the destination. Every event to that destination fails simultaneously, and if you don't have per-destination isolation, the blast radius extends to every other destination too.
7. Schema drift on the destination side
This one is slow and insidious. The destination's API evolves. A field you depend on gets deprecated. A validation rule gets stricter. An endpoint moves from v2 to v3. The old endpoint still works — until one day it doesn't.
HubSpot deprecated their v1 contacts API. Salesforce periodically retires older API versions. Intercom has migrated endpoints between major versions. If your integration code targets a specific API version and you're not actively tracking deprecation notices, you'll find out when deliveries start failing.
The worst version of this: the destination adds a new required field. Your existing payloads are missing it. The API starts returning 400s on every request. Every event to that destination fails simultaneously, and your retry logic dutifully burns through attempts on events that will never succeed.
Why these failures compound
Any one of these failure modes is manageable in isolation. The problem is that in production, they don't happen in isolation.
HubSpot's API slows down (partial outage). Your requests start timing out (timeout ambiguity). You retry aggressively (wrong retry strategy). You hit HubSpot's rate limit (rate limiting). Now you're getting a mix of 429s and timeouts. Some of the timed-out requests actually went through (idempotency gap). Your retry queue is growing, blocking deliveries to Salesforce and Slack (isolation failure). An hour later, when everything recovers, you have duplicate contacts, a delivery backlog, and no audit trail to untangle it.
That's one incident, involving four failure modes, affecting three destinations. It starts with a slow API and ends with bad data across your entire integration stack.
What "fixed" actually looks like
Fixing outbound delivery failures isn't about handling one failure mode well. It's about building a delivery layer that handles all of them systematically.
Classify failures correctly. Separate transient failures (503, 429, timeout, connection reset) from permanent ones (400, 401, 404). A 403 is effectively permanent until a human re-authorizes. Retry the transient group. Dead-letter the rest. Stop wasting retry budget on requests that will never succeed.
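That classification fits in one routing function. A minimal sketch — the transient set and the label names are illustrative starting points, not an exhaustive taxonomy:

```python
TRANSIENT = {408, 425, 429, 500, 502, 503, 504}

def classify(status_code=None, exception=None):
    """Route a delivery outcome: retry transient failures, dead-letter
    permanent ones, and mark successes done."""
    if exception is not None:
        return "retry"                   # timeout / connection reset: transient
    if status_code in TRANSIENT:
        return "retry"
    if 200 <= status_code < 300:
        return "done"
    return "dead_letter"                 # 400/401/403/404: will never succeed
```

The dead-letter path is the important half: it preserves the failed event and its context so a human can inspect, fix, and replay it, instead of losing it to a retry loop that was never going to converge.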
Track per-destination health independently. Monitor latency, error rate, queue depth, and time-to-drain for each destination separately. A slow HubSpot API should trigger backpressure on HubSpot deliveries, not on your entire pipeline.
Use idempotency everywhere. Tag every event with a unique key. For destinations that support idempotency natively, pass it through. For destinations that don't, you need your own delivery keying and replay discipline — tracking which events have been delivered and ensuring replays don't reprocess what already succeeded.
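For destinations without native idempotency, the delivery keying lives on your side. A sketch of the ledger idea — in production this would be a durable store with a unique constraint, not an in-memory set:

```python
class DeliveryLedger:
    """Tracks (event_id, destination) pairs that already succeeded, so a
    replay skips work that was delivered the first time."""

    def __init__(self):
        self._delivered = set()

    def should_deliver(self, event_id, destination):
        return (event_id, destination) not in self._delivered

    def mark_delivered(self, event_id, destination):
        self._delivered.add((event_id, destination))

ledger = DeliveryLedger()
if ledger.should_deliver("evt_123", "hubspot"):
    # ... send to destination, confirm success ...
    ledger.mark_delivered("evt_123", "hubspot")
print(ledger.should_deliver("evt_123", "hubspot"))  # False: replay skips it
```

Keying on the (event, destination) pair rather than the event alone matters: a replay after a partial outage should re-send to the destination that failed without re-sending to the two that succeeded.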
Log every attempt. Not just outcomes, but timing, retries, response codes, and timeouts for every delivery attempt. When someone asks "what happened to this event?" the answer should take seconds, not hours.
Alert on auth health before it fails. Track token expiration dates. Monitor for 401/403 response codes specifically. Distinguish between "token expired" and "access revoked" — one is automated recovery, the other requires a human.
Isolate destinations from each other. Each destination gets its own delivery pipeline, its own retry budget, its own dead letter queue. One unhealthy destination never affects another.
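The isolation structure is simple to state in code, even though operating it is not. A toy sketch, assuming in-memory queues (real systems would use durable queues and concurrent workers):

```python
from collections import deque

class DestinationPipeline:
    """One queue, retry budget, and dead letter queue per destination,
    so a backlog on one destination never blocks another."""

    def __init__(self, name, retry_budget=3):
        self.name = name
        self.queue = deque()
        self.dead_letters = []
        self.retry_budget = retry_budget

    def enqueue(self, event):
        self.queue.append((event, 0))

    def drain(self, send):
        """Attempt every queued event; requeue failures until the retry
        budget is spent, then capture them as dead letters and move on."""
        while self.queue:
            event, attempts = self.queue.popleft()
            if send(event):
                continue
            if attempts + 1 >= self.retry_budget:
                self.dead_letters.append(event)   # exhausted: capture it
            else:
                self.queue.append((event, attempts + 1))

# Each destination gets its own pipeline; draining one cannot stall the others.
pipelines = {d: DestinationPipeline(d) for d in ("hubspot", "salesforce", "slack")}
```

The structural guarantee is the point: when the hypothetical `hubspot` pipeline is full of retries, the `salesforce` and `slack` queues are empty and flowing, because nothing is shared between them.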
Building all of this from scratch is thousands of lines of infrastructure code that isn't your product. Maintaining it is an ongoing commitment — every new destination, every API version change, every edge case you discover at 3am.
Most teams think they're adding a webhook. In production, they're really taking ownership of a delivery system.
Don't want to own that delivery system? Join Meshes and get per-destination isolation, retries, dead letter capture, idempotent delivery, and full attempt-level logging out of the box.