Idempotent tools, retry-owning workflows.

At 02:14 on a Tuesday our payments provider charged the same customer twice, and the run log was green for both. Nobody pushed a bad deploy. What happened is the most boring thing in automation: the provider's webhook timed out waiting on us, retried the delivery like it is supposed to, and our n8n workflow happily ran the whole charge flow a second time because nothing in the chain remembered it had already seen that event. The customer got two charges and a very polite Slack message from me at 6am. That incident is the entire thesis of this piece, so I will say it once plainly and then earn it: glue-code automation wins when MCP tools are idempotent and workflows own their retry semantics. Everything else is decoration.

[Illustration: an n8n node graph in coral and teal, its workflow nodes wired by hand-drawn connectors into an MCP tool pipe, with a small idempotency_key label clipped to the wire between a webhook node and a database node.]

The picture I keep sketching in my notebook: n8n nodes on the left, an MCP tool pipe on the right, and the one label that actually saves you clipped onto the wire between them. The `idempotency_key` is doing more work than the whole node graph.

I have spent the better part of a decade building this kind of plumbing, first for logistics, now for climate operations at Grove. Triggers, retries, dead letter queues, the unglamorous wiring that decides whether a pipeline shrugs off a flaky upstream or pages me at 2am. n8n is a lovely tool for that wiring, and bolting MCP onto it lets an agent drive the same nodes I used to drive by hand. But the failure modes did not change just because there is a model in the loop now. If anything they got sharper, because the model retries with the confidence of something that has never been woken up at 2am. So this is not a "look how cool agents are" post. It is the four patterns that keep my n8n plus MCP setups from double-charging anyone, written by someone who has cleaned up the duplicate.

One framing note before we wire anything. I am going to talk about n8n specifically, the node-and-trigger glue, not generic MCP server hardening. If you want the deeper treatment of what a tool contract is, the schema and scope and error vocabulary that live at the boundary, the tool contracts piece is the companion to this one and I lean on it constantly. This post is the n8n-flavored, get-your-hands-dirty half.

Pattern one

Acknowledge fast, process slow

The double-charge started with a timeout, and almost every webhook retry storm does. A sender (Stripe, a CRM, a partner API) fires a webhook and starts a stopwatch. If you do not answer with a 200 before that stopwatch runs out, the sender assumes the delivery failed and sends it again. And again. Your workflow did not fail. It was just slow, because it was busy doing the actual work (charging the card, writing the row, sending the email) before it got around to replying. So the fix that prevents most retry storms is almost insultingly simple: acknowledge the webhook immediately, then do the work asynchronously.

In n8n that means your Webhook node responds with a 200 right away, and the heavy lifting happens on a separate branch or a queued execution. The sender is happy, the stopwatch never expires, and you stop manufacturing duplicate deliveries out of thin air. The n8nlab team makes this point in their error handling write-up, and it matches every postmortem I have run: a lot of "duplicate event" bugs are really "we answered too slowly" bugs wearing a disguise.
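A minimal sketch of that ordering in plain JavaScript, outside n8n. The names makeWebhookHandler, respond, and enqueue are hypothetical stand-ins for the Webhook node's response and whatever queue or sub-workflow picks up the work. The only point is that the 200 goes out before anything slow runs.

```javascript
// Sketch only: acknowledge first, process later.
// `respond` and `enqueue` are hypothetical callbacks, not n8n APIs.
function makeWebhookHandler(respond, enqueue) {
  return function handle(event) {
    respond(200);    // stop the sender's stopwatch immediately
    enqueue(event);  // the slow work happens elsewhere, later
  };
}

// Usage: record the order in which things happen.
const log = [];
const queue = [];
const handle = makeWebhookHandler(
  (status) => log.push(`ack ${status}`),
  (event) => {
    queue.push(event);
    log.push(`queued ${event.id}`);
  }
);
handle({ id: "evt_1" });

// A worker drains the queue after the response has already gone out.
for (const event of queue) log.push(`processed ${event.id}`);
console.log(log.join(" -> "));
// ack 200 -> queued evt_1 -> processed evt_1
```

If the acknowledgement ever appears after the processing step in a trace like this, the sender's timeout is back in play.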

But acknowledging fast only stops the duplicates you cause. It does nothing about the duplicates that are already in flight, the genuine network blip, the sender that retries even after a 200, the agent that calls the same tool twice because the first response got lost. For those you need the next pattern, and it is the load-bearing one.

Pattern two

Claim the event before you touch anything

Idempotency is a fancy word for a simple promise: doing the thing twice has the same effect as doing it once. The way you keep that promise is to claim the event in a place that can say no, before any side effect runs. Not after. Not "we'll check if the charge exists first," because that check-then-act has a race in it the size of a barn door. You want a single atomic operation that either says "you're the first, go ahead" or "someone already claimed this, stop."

Postgres gives you exactly that for free, and it is the snippet I paste into every n8n project I touch. A dedicated webhook_idempotency table, a unique key derived from the business event, and an insert that races against itself safely:

-- Idempotency claim (article demo)
INSERT INTO webhook_idempotency (key, status)
VALUES ($1, 'pending')
ON CONFLICT (key) DO NOTHING
RETURNING key;  -- empty = duplicate, short-circuit

Read what that does carefully, because the whole pattern lives in four lines. The first delivery inserts a row and the RETURNING key hands you back the key, so you know you won the claim and you proceed to the side effects. The second, duplicate delivery hits the unique constraint, ON CONFLICT DO NOTHING swallows it, and RETURNING gives you back nothing. Empty result equals duplicate. You short-circuit, answer 200 so the sender stops retrying, and you do not charge anyone a second time. The database, not your workflow logic and definitely not the prompt, is the thing enforcing the promise. The Automation Labs team frames this well in their production webhook agent walkthrough: the Postgres table "atomically claims the event before any side effect," which is the part teams skip when they hand-roll a "check if it exists" node and then act on a stale read.
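To make the branch concrete, here is a small simulation of the claim in JavaScript. ClaimTable and handleDelivery are hypothetical names, and the Map is only a stand-in for the webhook_idempotency table: in a real setup the atomicity comes from the Postgres unique constraint, not from anything in process memory.

```javascript
// In-memory stand-in for the webhook_idempotency table (illustration only).
// A Map inside one process does not protect you across workers; Postgres does.
class ClaimTable {
  constructor() {
    this.rows = new Map();
  }
  // Mirrors INSERT ... ON CONFLICT (key) DO NOTHING RETURNING key:
  // one row back for the first claim, zero rows for a duplicate.
  claim(key) {
    if (this.rows.has(key)) return [];
    this.rows.set(key, { status: "pending" });
    return [{ key }];
  }
}

// The branch: side effects run only when the claim returned a row.
function handleDelivery(table, key, sideEffect) {
  const claimed = table.claim(key);
  if (claimed.length === 0) {
    return { status: 200, duplicate: true }; // short-circuit, still a 200
  }
  sideEffect();
  return { status: 200, duplicate: false };
}

const table = new ClaimTable();
let charges = 0;
const first = handleDelivery(table, "order-1001", () => { charges += 1; });
const second = handleDelivery(table, "order-1001", () => { charges += 1; });
console.log(first.duplicate, second.duplicate, charges); // false true 1
```

Both deliveries get a 200, and the charge counter moves once.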

Figure 1 · the claim decides

webhook → ack → atomic claim → side effects, or short-circuit

The fork is the whole pattern. RETURNING key is your branch condition: a key means you won the claim and run the side effects on the teal path, an empty result means it is a duplicate and you take the coral path straight to a clean 200. The database arbitrates, so two simultaneous deliveries cannot both win.

Two practical notes from getting this wrong. First, derive the key from the business identity, not from anything the transport hands you. An order_id or an invoice number is stable across retries; a delivery UUID that the sender regenerates on each attempt is not, and you will dedupe nothing. The n8nlab guidance is to check that business ID against the table before any side effect, which is exactly the claim above. Second, you do not need a guard on every node. As flowgenius puts it, "only side-effect nodes (HTTP Request, Send Email, Database Write) need idempotency guards." A node that transforms JSON in memory is safe to run a hundred times. Guarding it just adds latency and a table you have to clean up. Spend the rigor where the money moves.
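The first note can be sketched as a key builder that only looks at stable business fields. The field names here (source, type, order_id, delivery_id) are assumptions about the payload shape, not any particular provider's schema.

```javascript
// Sketch: build the idempotency key from business identity only.
// Field names are hypothetical; adapt them to your sender's payload.
function idempotencyKey(event) {
  if (event.order_id === undefined || event.order_id === null) {
    throw new Error("no business identity on event; refuse to guess a key");
  }
  // delivery_id is deliberately ignored: it changes on every retry.
  return [event.source, event.type, String(event.order_id)].join(":");
}

const firstAttempt = {
  source: "payments",
  type: "charge.requested",
  order_id: 1001,
  delivery_id: "a1",
};
const retryAttempt = { ...firstAttempt, delivery_id: "b2" };

console.log(idempotencyKey(firstAttempt));
// payments:charge.requested:1001
console.log(idempotencyKey(firstAttempt) === idempotencyKey(retryAttempt));
// true
```

Throwing when the business ID is missing is a design choice: falling back to the delivery ID would quietly bring the duplicates back.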

Pattern three

The workflow owns the retry, not the toggle

Here is the trap that looks like a feature. n8n has a "Retry on Fail" toggle on most nodes. Flip it on, set the number of attempts, done, right? Not quite. That toggle gives you linear retries, the same fixed delay every time. Automation Labs says it cleanly: n8n's built-in "Retry on Fail toggle gives you linear delays only, which is what triggers the retry-storm anti-pattern." When an upstream API wobbles, every workflow that was talking to it fails at roughly the same moment, and then every one of them retries at the same fixed interval, in lockstep, hammering the recovering service in a synchronized wave. You have built a thundering herd, and you built it with a checkbox.

The fix is to take retry away from the node toggle and give it to a Code node that does exponential backoff with full jitter. Exponential means each attempt waits longer than the last, so you back off a struggling service instead of pounding it. Full jitter means you add randomness to the delay, so a thousand workflows that failed together do not retry together. They smear across time. The herd disperses.

// Code node: backoff with full jitter (capped)
const base = 500;        // ms
const cap  = 30000;      // 30s ceiling
const n    = Number($json.attempt ?? 0); // 0, 1, 2, ... (0 when no attempt is set yet)
const expo = Math.min(cap, base * 2 ** n);
const wait = Math.random() * expo;   // full jitter: 0..expo
return [{ json: { ...$json, waitMs: Math.round(wait) } }];

That Math.random() * expo is the line that matters. Without it you get exponential backoff that is still synchronized, because every worker computes the same delay. With it, attempt three (n = 3, a ceiling of 500 × 2³ = 4,000 ms) might wait anywhere from zero to four seconds, and the workers fan out. And critically, this retry logic does not wrap your idempotency claim blindly. The retry is safe precisely because pattern two is in place: if a retry re-delivers an event you already claimed, the claim returns empty and the work is skipped. Retry and idempotency are two halves of one mechanism. Bolt on retries without the claim and you are just automating the double-charge.
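To see what those constants actually produce, here is the same formula as a standalone function, runnable in plain Node. backoffDelay and its rand parameter are hypothetical additions so the ceiling for each attempt can be inspected; the base and cap are the ones from the Code node.

```javascript
// Same formula as the Code node, with the random source injectable.
// `rand` is a hypothetical parameter added for inspection and testing.
function backoffDelay(attempt, { base = 500, cap = 30000, rand = Math.random } = {}) {
  const ceiling = Math.min(cap, base * 2 ** attempt);
  return Math.round(rand() * ceiling); // full jitter: 0..ceiling
}

// Upper bound per attempt (rand pinned to 1 to expose the ceiling).
const ceilings = [0, 1, 2, 3, 4, 5, 6, 7].map((n) =>
  backoffDelay(n, { rand: () => 1 })
);
console.log(ceilings);
// [ 500, 1000, 2000, 4000, 8000, 16000, 30000, 30000 ]

// A real draw lands somewhere between 0 and the ceiling for that attempt.
console.log(backoffDelay(3));
```

With these numbers the cap takes over at attempt 6, and six attempts (n = 0 through 5) wait at most 31.5 seconds in total, which is a useful figure when you pick the bounded retry count.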

Figure 2 · two ways to retry

linear toggle stacks the herd · backoff + jitter smears it

Same failure, same number of retries. The coral timeline is the toggle: fixed intervals, every worker firing in lockstep, a spike that re-breaks the thing it is waiting on. The teal timeline is the Code node: growing, randomized gaps that fan the load out. The only code difference is one Math.random(), and it is the difference between recovery and a self-inflicted outage.

Pattern four, the agent edition

The MCP tool that lies about failing

Now the part that is specific to driving n8n with an agent over MCP, and it is genuinely nastier than the webhook case because the failure lies to you. There is an open n8n issue, #31328, where the create_workflow_from_code path can return a 500 to the caller while having already persisted the workflow on the server. The maintainers describe it directly: "the 500 is frequently a false negative: the workflow is persisted in n8n despite the error returned to the client, so a retry creates a duplicate." Sit with that. Your agent calls a tool, gets a clear error, does the sensible thing any retry policy would do, calls it again, and now you have two identical workflows where you wanted one. The error told the truth about the response and a lie about the state.

This is why "retry on failure" is dangerous advice for mutating MCP tools and why pattern two has to follow the agent all the way to the boundary. The rule I drill into every agent setup: verify state before you assume failure. Before retrying a create, search for what you were trying to create. If it is already there, the previous call actually succeeded and you reconcile instead of duplicating. The n8n-ops MCP server makes this practical with tools built for exactly this kind of careful operation, and I gate the destructive ones behind a confirm:

// Agent loop: verify before retrying a mutating call
const existing = await mcp.call("search_workflows", { name });
if (existing.length > 0) {
  // the 500 was a false negative: it actually persisted
  return reconcile(existing[0]);
}
// safe to (re)create only because nothing matched
return mcp.call("create_workflow_from_code", { name, code });

// ops tools used elsewhere, behind confirm gates:
//   n8n_retry_execution   -> only when the step is idempotent
//   n8n_archive_workflow  -> never auto-fired, always confirmed

The n8n-ops-mcp project exposes n8n_retry_execution and n8n_archive_workflow for agent-driven operations, and the reason I like it is that it treats those as operations that deserve confirm gates, not fire-and-forget calls. Use n8n_retry_execution only when the underlying step is genuinely idempotent, which loops you right back to pattern two. And never let an agent auto-archive a workflow; that is a "human says yes" action every time. The broader version of this discipline, idempotency keys and confirm gates at the tool boundary, is the heart of the tool contracts piece, and it is the difference between an agent that operates your n8n and an agent that vandalizes it.
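One way to picture the gate is as a small policy check that runs before any mutating tool call. This is a sketch under my own assumptions: the POLICY map and the decide function are hypothetical, and only the two tool names come from the n8n-ops-mcp project.

```javascript
// Sketch of a confirm gate in front of mutating tools.
// POLICY and decide are hypothetical; only the tool names are from n8n-ops-mcp.
const POLICY = new Map([
  ["n8n_retry_execution", { requiresIdempotentStep: true, requiresConfirm: true }],
  ["n8n_archive_workflow", { requiresIdempotentStep: false, requiresConfirm: true }],
]);

function decide(tool, { stepIsIdempotent = false, humanConfirmed = false } = {}) {
  const rule = POLICY.get(tool);
  if (!rule) return "refuse"; // unknown mutating tool: default closed
  if (rule.requiresIdempotentStep && !stepIsIdempotent) return "refuse";
  if (rule.requiresConfirm && !humanConfirmed) return "needs_confirmation";
  return "run";
}

console.log(decide("n8n_archive_workflow"));
// needs_confirmation
console.log(decide("n8n_retry_execution", { humanConfirmed: true }));
// refuse
console.log(decide("n8n_retry_execution", { stepIsIdempotent: true, humanConfirmed: true }));
// run
```

The second case is the important one: a human saying yes does not make a non-idempotent step safe to retry.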

When retries run out, fail somewhere you can see

Backoff and jitter buy you resilience against transient failures. They do nothing for the permanent ones, the malformed payload, the revoked credential, the upstream that is down for an hour. After a bounded number of attempts you have to stop retrying and accept that this event needs a human. The wrong move is to let it vanish into a failed execution nobody is watching. The right move is a dead letter queue: when retries are exhausted, an Error Trigger catches the failure, writes the full event to a dead letter table, and alerts a channel a person actually reads. Then you give that person a one-click replay path back into the workflow once the underlying problem is fixed.

Figure 3 · the exit ramp

retries exhausted → error trigger → dead letter + alert → manual replay

The dead letter table is the promise that nothing disappears. Retries handle the transient; the Error Trigger catches the permanent and parks it somewhere visible, with an alert and a replay path. And notice the teal replay loops back through the same idempotency claim, so a human re-running a fixed event still cannot double-charge. Every pattern in this piece leans on pattern two.

The replay detail is the one people skip and regret. A dead letter table you can only read is a graveyard. The win is that once the bad credential is rotated or the malformed field is patched, an operator clicks one button and the event flows back through the normal path, claim and all. Because the replay goes through the same idempotency guard, you do not have to reason about whether part of the original run half-completed. The claim already knows.
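Here is a rough simulation of the exit ramp and the replay, with everything in memory. All names are hypothetical, and it rests on one assumption the text above does not spell out: when retries are exhausted the claim row is marked "failed", and a replay is allowed to re-claim a failed row. Without some rule like that, a plain ON CONFLICT DO NOTHING claim would also reject the operator's replay.

```javascript
// Sketch: bounded attempts, then park the event; replay goes through the claim.
// The Map and array stand in for the claim table and a dead letter table.
const claims = new Map(); // key -> "pending" | "done" | "failed"
const deadLetters = [];

function claim(key) {
  const status = claims.get(key);
  // Assumption: a row parked as "failed" may be re-claimed by a replay.
  if (status === undefined || status === "failed") {
    claims.set(key, "pending");
    return true;
  }
  return false; // pending or done: someone else owns it
}

function runWithDeadLetter(event, work, maxAttempts = 3) {
  if (!claim(event.key)) return "duplicate";
  let lastError = null;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      work(event);
      claims.set(event.key, "done");
      return "done";
    } catch (err) {
      lastError = err; // a real workflow would wait with backoff here
    }
  }
  claims.set(event.key, "failed");
  deadLetters.push({ event, error: lastError.message });
  return "dead_lettered";
}

// Usage: a revoked credential, then an operator fixes it and replays.
let credentialValid = false;
const charge = () => {
  if (!credentialValid) throw new Error("401 revoked credential");
};

const firstRun = runWithDeadLetter({ key: "order-1001" }, charge);
credentialValid = true; // credential rotated
const parked = deadLetters.pop();
const replay = runWithDeadLetter(parked.event, charge);
const secondReplay = runWithDeadLetter(parked.event, charge);
console.log(firstRun, replay, secondReplay); // dead_lettered done duplicate
```

The accidental second click on the replay button comes back as a duplicate, which is the behavior you want from an operator-facing tool.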

Where the glue stops being the right tool

I have to be honest about the ceiling here, because I have watched teams fall in love with n8n and then try to make it do something it should not. These patterns make your glue reliable. They do not make it smart. n8n is a visual workflow engine; it is fantastic at "when this happens, do these steps, retry like so, fail safely." It is a poor fit for genuinely branching agent reasoning, the kind where the next step depends on the model thinking hard about the last three. The moment your node graph starts sprouting a dozen IF nodes trying to emulate a decision tree, you are over-engineering a simple flow, and that is a real cost for a solo builder shipping fast.

The honest tradeoff

Use n8n as the glue and the side-effect layer, where idempotency and retry and DLQ earn their keep. Push the actual reasoning to something built for it, a real agent framework with planning and memory. Circuit breakers, extra dead letter tiers, the heavier resilience machinery, only pay off at meaningful API call volume. Below that, they are complexity you will maintain and never need. Match the rigor to the blast radius, not to the diagram you saw on someone's blog.

And know your boundary. Reliability at the workflow level is necessary, not sufficient. A perfectly idempotent, retry-owning workflow will still cheerfully execute a logically wrong instruction the agent handed it; pattern two guarantees it only does so once, which is genuinely valuable and is not the same as doing the right thing. If you are wiring agents into production n8n, the production MCP servers guide is the hardening companion and covers the server side of this. When it is time to justify the build to whoever signs off, the agent ROI playbook is the piece I send to product. Reliability is a number you can defend, and "we have not double-charged anyone since the claim table shipped" is a very good number to bring to that conversation.

Acknowledge fast so you stop manufacturing duplicates. Claim the event atomically so the ones you did not manufacture cannot do damage. Own the retry in code so a wobble does not become a storm. Fail loud into a dead letter queue so nothing vanishes. The toggle was never going to do any of that, and the prompt definitely was not.

None of this is clever. It is the same plumbing discipline we have applied to queues and pipelines for twenty years, pointed at a fuzzier client. The agent is new; the duplicate charge at 2am is not. Wire the claim table first, this week, before you wire anything exciting. Future-you, the one not awake at 6am writing apology messages, will thank present-you for the four boring lines of SQL.

Comments (4)

Kenji Olsen · Awakened · 1/27/2026

Idempotent tools, retry owning workflows is the right division of labour. The mistake I made early was letting the agent own retries, so it would helpfully re run a tool that had already half succeeded and create duplicates. Moving retry logic into the n8n workflow and making the tools idempotent fixed a whole class of weird state bugs.

Freya Wright · Awakened · 1/28/2026

Yes, that is the whole pattern in one paragraph. The agent should decide what to do, the workflow should own whether and how to retry it. Once you stop asking the LLM to be a reliable retry engine, everything calms down. It is not good at being a state machine and it should not have to be.

Kenji Olsen · Awakened · 1/28/2026

As a solo builder this is the stack that actually pays rent. Half my so called agents are a cron and an n8n flow with one model call in the middle, and they are more reliable than the fancy graph version I tried first. Boring glue that works beats a clever loop that flakes, especially when you are the only one on call.

Jasmine Park · Ascendant · 1/29/2026

The retry example was super clear, copied the pattern into my flow already. Thanks!