Your at-least-once pipeline is probably at-most-once
Most teams say their event delivery is at-least-once. The actual semantics are usually weaker — and the gap is one ack-ordering decision in a queue handler. A walk through the patterns that silently downgrade delivery guarantees, and how to tell which one you have.
Ask an engineer about their outbound event pipeline and you'll usually hear "at-least-once." It's the answer that justifies idempotency keys, retries, dead letter queues, the whole apparatus. At-least-once is the floor: messages might be delivered more than once, but they won't be lost.
The actual semantics are usually weaker. Most pipelines self-described as at-least-once are at-most-once in practice, and the gap is invisible until something is already gone. The difference is rarely a missing component. It's an ack-ordering decision inside a queue handler, made years ago, that nobody noticed was load-bearing.
This post is about how to tell which pipeline you actually have.
The semantic ladder
Three delivery guarantees, in order of strength:
At-most-once. A message is delivered zero or one times. Loss is possible; duplication is not. This is what naive code gives you — send the message, hope for the best. Acceptable for low-stakes telemetry; catastrophic for transactional events.
At-least-once. A message is delivered one or more times. Duplication is possible; loss is not. This is the standard guarantee from production message systems. It requires that the message stays in the queue (or log) until the worker confirms successful processing — and "successful" has to mean the work is durably committed, not the worker has it in memory.
Exactly-once. Each message has exactly one effect on downstream state. Achievable end-to-end only through idempotency on top of at-least-once delivery. As a transport guarantee, exactly-once doesn't exist in distributed systems without strong assumptions about coordination that most pipelines can't make.
The ladder matters because each rung requires different machinery, and the difference between rungs isn't a feature you add. It's a property of how acks interact with work.
The ack-ordering trap
This is the single decision that silently downgrades at-least-once to at-most-once. A worker pulls a message from a queue and has two operations to perform: acknowledge the message back to the queue (so it doesn't get redelivered), and do the work the message describes (deliver to HubSpot, fire the webhook, update the database).
The order matters absolutely.
Ack-after-success is at-least-once. The worker pulls the message, does the work, then acks. If the worker crashes between pulling and finishing, the queue redelivers — possibly to another worker — and the work runs again. Idempotency handles the duplicate.
Ack-before-success is at-most-once. The worker pulls the message, acks immediately, then does the work. If the worker crashes between the ack and the work completing, the message is gone. The queue has already forgotten about it.
The seductive thing about ack-before-success is that it usually works. The crash window between ack and work-complete is small — milliseconds at most for a fast handler. In normal operation, it's invisible. Workers process millions of messages and nobody notices the occasional lost one because the loss rate is below the noise floor of every other failure mode.
It becomes visible exactly when you don't want it to. Deploys kill workers mid-handler. Auto-scaling events terminate workers mid-handler. Out-of-memory kills do the same. Every event that should terminate a process gracefully, but doesn't always, is an opportunity to lose every message that process was handling at the time.
// At-most-once. The ack happens before the work completes.
async function handleMessage(msg: Message): Promise<void> {
  await queue.ack(msg);              // gone from queue
  await deliverToDestination(msg);   // if this throws or the process dies, message is lost
}

// At-least-once. The work completes before the ack.
async function handleMessage(msg: Message): Promise<void> {
  await deliverToDestination(msg);   // commit the side effect first
  await queue.ack(msg);              // then mark the message processed
}
The two versions differ only in the order of two lines. The semantics they produce are an entire rung apart.
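To make the crash window concrete, here is a minimal simulation of both orderings. Everything in it is hypothetical: SimQueue is a toy in-memory stand-in for a real broker, and the "crash" is modeled as the worker stopping between its first and second step on one chosen message.

```typescript
// Toy queue (an assumption, not a real broker): a received message stays
// in flight until acked, and unacked messages are requeued after a crash.
class SimQueue {
  private pending: string[];
  private inFlight: string[] = [];

  constructor(messages: string[]) {
    this.pending = messages.slice();
  }

  receive(): string | undefined {
    const msg = this.pending.shift();
    if (msg !== undefined) this.inFlight.push(msg);
    return msg;
  }

  ack(msg: string): void {
    this.inFlight = this.inFlight.filter((m) => m !== msg);
  }

  // Models the broker noticing a dead worker and requeueing its messages.
  redeliverUnacked(): void {
    this.pending.push(...this.inFlight);
    this.inFlight = [];
  }
}

type AckOrder = "ack-first" | "work-first";

// Processes every message. The worker crashes once, on crashOn, after its
// first step and before its second. Returns the deliveries that happened.
function simulate(order: AckOrder, messages: string[], crashOn: string): string[] {
  const queue = new SimQueue(messages);
  const delivered: string[] = [];
  let crashed = false;

  for (let msg = queue.receive(); msg !== undefined; msg = queue.receive()) {
    const crashNow = !crashed && msg === crashOn;
    if (order === "ack-first") {
      queue.ack(msg);
      if (crashNow) { crashed = true; queue.redeliverUnacked(); continue; }
      delivered.push(msg);
    } else {
      delivered.push(msg);
      if (crashNow) { crashed = true; queue.redeliverUnacked(); continue; }
      queue.ack(msg);
    }
  }
  return delivered;
}

console.log(simulate("ack-first", ["a", "b", "c"], "b"));  // "b" never arrives
console.log(simulate("work-first", ["a", "b", "c"], "b")); // "b" arrives twice
```

The ack-first run loses the message outright. The work-first run delivers it twice, which is the case idempotency keys exist to absorb.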
Diagnostic checklist
The ack-ordering trap is the most common downgrade, but not the only one. Six patterns to audit against. Each one breaks at-least-once delivery in a way that's invisible during normal operation.
1. Ack-before-work
The fundamental case described above. The audit question: in the queue handler, does the ack happen before or after the work commits?
In some queue libraries the answer is hidden by the abstraction. Auto-ack modes ack on receipt, not on success. RabbitMQ consumers configured with noAck: true are the standard AMQP example, and Kafka consumers with offset auto-commit enabled can commit an offset before the handler for that message has finished. If you didn't explicitly disable auto-ack, you may be running at-most-once without knowing it.
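As a sketch of what explicit acknowledgement looks like, the handler factory below acks only after the work succeeds and negatively acknowledges with requeue on failure. AckChannel is a hypothetical minimal interface, shaped after the ack and nack methods of an amqplib channel, and deliver stands in for whatever side effect the message triggers.

```typescript
// Hypothetical minimal interface. It is assumed to match the shape of the
// ack(message) and nack(message, allUpTo, requeue) methods on an AMQP channel.
interface AckChannel<M> {
  ack(message: M): void;
  nack(message: M, allUpTo?: boolean, requeue?: boolean): void;
}

function makeHandler<M>(
  channel: AckChannel<M>,
  deliver: (message: M) => Promise<void>, // hypothetical side effect
): (message: M) => Promise<void> {
  return async (message: M): Promise<void> => {
    try {
      await deliver(message);             // commit the side effect first
      channel.ack(message);               // ack only after it succeeded
    } catch (_err) {
      channel.nack(message, false, true); // requeue this one message for retry
    }
  };
}
```

With amqplib this handler would be registered through channel.consume with the noAck option set to false, after filtering out null messages. Immediate requeue can spin on a message that always fails, so a real consumer would usually pair it with a retry limit or a dead letter queue.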
2. Visibility timeout shorter than work duration
A subtler version of the same problem, common in SQS-style queues. The worker pulls a message; the queue marks it invisible for, say, 30 seconds. If the worker doesn't ack within the visibility timeout, the message becomes visible again and gets redelivered.
So far so good — this is exactly what makes at-least-once delivery work in the face of worker crashes. The problem starts when the work routinely takes longer than the visibility timeout. The worker is still busy delivering to a slow destination; the queue has already redelivered the message to a second worker; the second worker delivers, acks, and moves on; the first worker finishes, acks the now-orphaned message, and the queue silently drops the ack because the message is gone.
The first delivery succeeded. The second delivery succeeded. The system claims one delivery. Without idempotency keys, two records now exist downstream. With them, the duplicate side effect is caught, but the duplicate work was still wasted, and worse, the metrics report fewer deliveries than actually occurred. Operationally you're flying blind because the data is wrong, not absent.
The fix is either to extend the visibility timeout to comfortably exceed worst-case work duration, or to heartbeat — periodically extending the timeout while work is in progress.
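The heartbeat option can be sketched as a wrapper around the work. This is an illustration under assumptions: extend is a hypothetical callback that resets the message's visibility timeout (on SQS it would wrap a ChangeMessageVisibility call), and intervalMs should sit well below the configured timeout.

```typescript
// Runs work while calling extend() on a fixed interval, then stops the
// heartbeat whether the work resolves or throws.
async function withHeartbeat<T>(
  work: () => Promise<T>,
  extend: () => Promise<void>, // hypothetical: resets the visibility timeout
  intervalMs: number,
): Promise<T> {
  const timer = setInterval(() => {
    // A failed extension is swallowed here: the worst case is a redelivery,
    // which at-least-once consumers already have to tolerate.
    extend().catch(() => undefined);
  }, intervalMs);
  try {
    return await work();
  } finally {
    clearInterval(timer);
  }
}
```

The ack still happens after work returns. The heartbeat only keeps the queue from handing the message to a second worker while the first is legitimately busy.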
3. In-memory queues without persistence
A second-layer queue inside the worker process — typically used to smooth bursts, batch by destination, or apply per-destination rate limits — is a common pattern. The worker pulls from the durable queue, drops the message into an in-memory buffer, and acks the durable queue. A separate goroutine or async task drains the buffer.
The buffer is at-most-once by construction. If the process dies, the buffer is gone. Acking the durable queue before the work has reached a durable downstream state — including the case where "downstream" is an in-process buffer — collapses the guarantee.
This pattern is often introduced as a performance optimization after the rest of the pipeline is already at-least-once. The optimization quietly downgrades the whole system.
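One way to keep the buffer without losing the guarantee is to carry each message's ack along with it and call it only after the batch has been delivered. The sketch below assumes a hypothetical per-message ack callback and a hypothetical deliverBatch function; neither comes from a specific queue library.

```typescript
interface Buffered<M> {
  message: M;
  ack: () => Promise<void>; // hypothetical: acks the durable queue for this message
}

class DeferredAckBuffer<M> {
  private items: Buffered<M>[] = [];

  // Buffering does not ack. The durable queue still owns the message.
  add(message: M, ack: () => Promise<void>): void {
    this.items.push({ message, ack });
  }

  // Delivers the buffered batch, then acks each message. If deliverBatch
  // throws, nothing is acked and the durable queue redelivers the messages.
  async flush(deliverBatch: (batch: M[]) => Promise<void>): Promise<number> {
    const batch = this.items;
    this.items = [];
    if (batch.length === 0) return 0;
    await deliverBatch(batch.map((item) => item.message));
    for (const item of batch) {
      await item.ack();
    }
    return batch.length;
  }
}
```

The trade-off is that messages now stay unacked for as long as they sit in the buffer, so the visibility timeout (or its equivalent) has to cover buffer dwell time plus delivery time.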
4. Success metrics that count the wrong thing
The clearest tell that a pipeline has been at-most-once all along is that its "successful delivery" metric counts the wrong event.
Counting request sent as success is at-most-once accounting. The metric increments whether or not the destination accepted the message. Network errors, 5xx responses, and timeouts all show up as successes in the dashboard. The pipeline reports green while events are dropping.
Counting 2xx response received as success is closer, but still not the same as durable delivery. Some destinations return 200 on receipt and then fail to persist asynchronously. A signed receipt — destination returns a stable record ID or echoes an idempotency key — is the strongest signal that the message actually landed.
The metric matters because it's what defines "delivery" for the rest of the system. If alerting fires on a drop in success rate and success is defined as "request sent," the alert won't fire when the destination is rejecting every request — only when something stops the worker from sending at all. The actually-broken case is invisible.
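The three levels of evidence can be separated in code so the dashboard counts them as different things. This is a sketch with assumed names: DestinationResponse and its echoedIdempotencyKey field are hypothetical, since what a destination returns as a receipt varies.

```typescript
type DeliveryOutcome = "failed" | "accepted" | "confirmed";

interface DestinationResponse {
  status: number;                // HTTP status code
  echoedIdempotencyKey?: string; // hypothetical receipt field from the destination
}

// Classifies one delivery attempt. An undefined response means the request
// was sent but nothing came back (timeout or network error).
function classifyDelivery(
  sentKey: string,
  response: DestinationResponse | undefined,
): DeliveryOutcome {
  if (response === undefined) return "failed";
  if (response.status < 200 || response.status >= 300) return "failed";
  return response.echoedIdempotencyKey === sentKey ? "confirmed" : "accepted";
}

console.log(classifyDelivery("k1", undefined));                                   // "failed"
console.log(classifyDelivery("k1", { status: 200 }));                             // "accepted"
console.log(classifyDelivery("k1", { status: 200, echoedIdempotencyKey: "k1" })); // "confirmed"
```

A "request sent" metric would have counted all three of those attempts as successes. Alerting on the confirmed rate, where the destination supports it, is what makes a rejecting destination visible.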
5. Queue purge on deploy
Some queue systems, configured carelessly, drop in-flight messages on worker restart. The exact mechanism varies: Redis Streams entries left pending on a dead consumer and never reclaimed, ephemeral queues that vanish with their owning process, Kafka consumer groups that rebalance and resume past messages that were never processed. The symptom is the same: deploys are message-loss events.
The audit question: what happens to in-flight messages when every worker is restarted in sequence over the course of a deploy? If the answer is "they're processed by other workers as the rebalance happens," the pipeline is at-least-once. If the answer is "I'd have to check," the pipeline is probably at-most-once during deploys, and the team has been compensating without realizing it. The 0.3% drop in success rate that always happens during deploys is the silent loss surfacing.
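A staging deploy can answer that question directly if you reconcile what went in against what came out. The sketch below assumes you can collect two lists for the deploy window: the IDs of events enqueued and the IDs of events confirmed at the destination. Both lists, and the auditDeploy name, are hypothetical.

```typescript
interface DeployAudit {
  lost: string[];       // enqueued but never delivered
  duplicated: string[]; // delivered more than once
}

function auditDeploy(enqueuedIds: string[], deliveredIds: string[]): DeployAudit {
  const deliveries = new Map<string, number>();
  for (const id of deliveredIds) {
    deliveries.set(id, (deliveries.get(id) ?? 0) + 1);
  }
  return {
    lost: enqueuedIds.filter((id) => !deliveries.has(id)),
    duplicated: enqueuedIds.filter((id) => (deliveries.get(id) ?? 0) > 1),
  };
}

const audit = auditDeploy(["e1", "e2", "e3", "e4"], ["e1", "e3", "e3"]);
console.log(audit.lost);       // ["e2", "e4"]
console.log(audit.duplicated); // ["e3"]
```

Duplicates after a deploy are what at-least-once is expected to produce. Any entry in lost, once the queue has drained, means the pipeline is at-most-once during deploys.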
6. Log truncation that loses pending retries
If the retry queue lives in the application database (a common bootstrapping choice), retention policies on the queue table can quietly drop pending retries. A retry worker whose query filters on created_at > NOW() - INTERVAL '7 days' will silently abandon any retry whose first attempt was more than seven days ago, even if the next attempt was scheduled for tomorrow.
This isn't a delivery-semantics problem in the formal sense. It's a logical truncation that the rest of the system doesn't know about. The retry was scheduled; the destination was unreachable; the retry never ran; the event is lost. From outside the pipeline, the at-least-once guarantee held until it didn't.
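The difference between an age-based rule and a state-based rule is easy to see side by side. The row shape below is hypothetical (a real retry table will have its own columns); the point is that cleanup eligibility depends on whether the retry has finished, not only on when it was created.

```typescript
// Hypothetical shape of a row in a database-backed retry table.
interface RetryRow {
  createdAt: Date;
  nextAttemptAt: Date | null; // null once no further attempt is scheduled
  status: "pending" | "succeeded" | "exhausted";
}

// Age-based rule: what a created_at retention filter amounts to.
function droppedByAge(row: RetryRow, now: Date, retentionMs: number): boolean {
  return now.getTime() - row.createdAt.getTime() > retentionMs;
}

// State-based rule: a row is only eligible for cleanup once it is both old
// and in a terminal state. Pending retries are never abandoned by age alone.
function safeToPurge(row: RetryRow, now: Date, retentionMs: number): boolean {
  return row.status !== "pending" && droppedByAge(row, now, retentionMs);
}

const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
const stillRetrying: RetryRow = {
  createdAt: new Date("2024-01-01T00:00:00Z"),
  nextAttemptAt: new Date("2024-01-11T00:00:00Z"),
  status: "pending",
};
const checkedAt = new Date("2024-01-10T00:00:00Z");
console.log(droppedByAge(stillRetrying, checkedAt, sevenDaysMs)); // true: silently lost
console.log(safeToPurge(stillRetrying, checkedAt, sevenDaysMs));  // false: kept
```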
How to tell
The diagnostic isn't a single test. It's a few questions, each of which probes a different downgrade mode.
What is your worker code's commit order? Read the queue handler. Identify where the side effect commits (the HTTP request to the destination, the database write, the inner queue enqueue). Identify where the ack to the outer queue happens. If the ack happens first, the pipeline is at-most-once.
What is the relationship between visibility timeout and p99 work duration? Look at the visibility timeout configured on the queue and the p99 latency of the work each message triggers. If p99 work duration is anywhere near the visibility timeout, redeliveries are happening and the system is doing duplicate work without recording it.
Where does the worker actually buffer work? If there's an in-memory queue or batch buffer between the durable queue and the destination, the ack ordering question applies to that interface too. The durable queue should not be acked until the buffer has reached durable state — which usually means until the destination has accepted the work.
What does the success metric count? Look at the dashboard panel labeled "delivery success rate." Find the metric definition. If it increments on request sent, the metric is at-most-once accounting. If it increments on 2xx response, it's closer. If it increments on a confirmed receipt from the destination — an idempotency key echo, a destination-side record ID — it's at-least-once accounting.
What happens to in-flight messages on deploy? Roll a deploy in staging while watching the queue depth. Healthy at-least-once pipelines show queue depth dip during rolling restart and recover; at-most-once pipelines show messages disappear from the queue without arriving downstream.
Where do retries live, and what cleans them up? If retries are stored in a table, find the cleanup job. Look at the retention window. Make sure the window is longer than the maximum retry horizon — including the case where a destination is unreachable for a full day and retries are still backing off.
Three out of six questions answered the wrong way is enough to call the pipeline at-most-once and plan accordingly.
Fixing the gap
Most fixes are small. The ack-ordering fix is a two-line change in the queue handler. Visibility timeouts can be extended in queue configuration. In-memory buffers can be replaced with persistent intermediate queues or eliminated entirely if the downstream tolerates the volume. The metric fix is replacing one increment with another.
The hard part isn't the fix. It's the audit — finding all the places where the downgrade is sitting, none of which look like bugs because none of them break in normal operation. The gap between "we built at-least-once" and "we have at-least-once" is almost entirely in places nobody thought to look, because each one was a small local decision that didn't feel load-bearing at the time.
Once a team has done the audit, the pattern is usually visible. The same engineer who chose ack-before-work in one handler did it in three others, because the abstraction looked the same. Visibility timeouts that were tuned for fast destinations didn't get retuned when slow ones got added. Buffers that were introduced for one rate-limited destination ended up handling all destinations. The downgrade is rarely a single decision — it's a pattern of small ones that didn't get questioned.
Closing
At-least-once delivery is a property of the entire pipeline, not a checkbox on the queue layer. Every place a message changes hands — from queue to worker, from worker to in-memory buffer, from buffer to destination — is a place where the guarantee can break. The difference between at-least-once and at-most-once is one ack-ordering decision, and most teams got it wrong at least once.
The first step is naming the failure modes. The second is auditing against them. The third is fixing the gap and pushing the metric definition forward so the pipeline reports what it actually delivers, not what it attempts to send.
If outbound delivery design is also useful elsewhere in the system, Designing idempotency keys for outbound event delivery covers the complementary problem of handling the duplicates that real at-least-once delivery produces.
Want at-least-once outbound delivery with confirmed-receipt metrics, durable retries, and replay across every destination, without auditing your own queue handlers? Join Meshes: emit one event, and we handle outbound delivery to every destination with the guarantees intact.