Replay the critic's last veto.

A planner-worker-critic loop in one of our market-intelligence pipelines rejected a perfectly good draft on Tuesday, accepted a near-identical one on Wednesday, and left no trace of why. Same documents, same retrieval corpus, same model version pinned in the lockfile. An engineer ran it four times locally, got four passes, closed the ticket as not-reproducible, and moved on. I reopened it, because not-reproducible is not a diagnosis. It is an admission that we were not recording the thing that varied. The bug was real. The critic vetoed on a state we never captured, so the failure had nowhere to live between runs. That is the entire subject of this piece: multi-agent bugs are state-machine bugs, and you debug a state machine by replaying its transitions, not by re-prompting it and praying for a different mood.

A dark terminal-style illustration of a checkpoint fork timeline rendered in purple and ice-blue on near-black. A horizontal sequence of numbered checkpoint nodes runs left to right; at the critic veto node the line forks downward into a branched continuation produced by an update_state mutation, with the original path continuing dimmed above and the replayed branch glowing below. — The picture I sketch on the whiteboard before anyone touches a prompt: the run is a sequence of checkpoints, the critic veto is a node like any other, and the fix is a fork off that node, not a rewrite of the whole graph. The mutation point is where you stop guessing and start testing.

I do applied research for a security shop that sells into regulated and regulated-adjacent buyers, which means my standard for "we fixed it" is higher than "it passed in the demo." A fix I cannot reproduce is a fix I cannot defend to an auditor, and a fix I cannot defend is not a fix. So I treat every multi-agent failure as a claim that needs evidence, and the evidence is a replayable transition. The good news is that the framework people already reach for, LangGraph, ships the primitives to do this properly. The bad news is that most teams checkpoint the wrong things and then wonder why their replays lie.

Non-determinism is the premise, not the excuse

Start from the uncomfortable claim and demand the evidence. The claim is that you cannot trust a single run of an agent to tell you anything stable. The evidence is blunt: the DEV Community write-up on debugging non-deterministic agents states it plainly, that "LLM based agents are inherently non-deterministic, which makes reproducibility, debugging, and post execution analysis difficult in production systems." Read that as a specification, not a complaint. If identical inputs and identical parameters can yield different outputs each run, then "I ran it and it worked" carries exactly zero evidentiary weight. It is one sample from a distribution you have not characterized.

This is where the framing has to change or nothing downstream works. The instinct, when a multi-agent system misbehaves, is to anthropomorphize: the worker got confused, the critic was too harsh, the planner went off the rails. That language is comfortable and it is useless, because it points you at the prompt and the prompt is rarely the bug. The productive framing is structural. You have a state machine with explicit nodes and explicit edges. A "bug" is a wrong transition: the graph moved from one state to another it should not have, given the state it was actually in. The model's stochasticity is the environment your transitions run in, not a property you can debug by scolding. You debug the transition. You make it observable, you capture the state on both sides of it, and you replay it.

Figure 1 · the machine

planner → worker → critic, with explicit edges

Draw the loop as a machine and the bug stops being mysterious. The planner emits a plan, the worker produces a draft, the critic either approves to END or vetoes back to the planner. The veto edge (purple) is the transition that carries the most decision weight and the least observability by default, which is exactly why it is where failures hide.

Step 1

Checkpoint every super-step, and mean it

You cannot replay a transition you did not record, so the first discipline is checkpointing: persist the full graph state after every super-step. In LangGraph terms a super-step is one tick of the machine, one node executing and writing its updates to state. Checkpoint after each one and you get a thread, an ordered history of snapshots that you can later walk, inspect, and resume from. Skip it and you are back to anthropomorphizing, because you have no artifact to point at.

Where those checkpoints live is not a detail you get to wave away in a regulated context. The DEV Community walkthrough builds on LangGraph's checkpointer abstraction precisely because it lets you swap the backend without rewriting the graph. In-memory is fine for a notebook and useless the moment a process dies mid-run. For anything that has to survive a crash or feed an audit, you persist to a real store. Aerospike's field notes on running LangGraph in production make the operational case for durable checkpointing under latency and scale pressure, and the ScyllaDB write-up on agentic state management goes further, treating the checkpoint store as a first-class database concern with the throughput and retention characteristics that implies. Postgres, Redis, ScyllaDB: the choice is a capacity-planning question, not a philosophical one. The non-negotiable is that the history outlives the run.

Here is the part people underweight. The value of the Aerospike framing is not "checkpointing is good," which everyone nods along to. It is that durable checkpoints give you, in their words, the ability to reproduce a failure and test a fix "without inventing a custom event-sourced run log." You already wrote the event-sourced log. It is the checkpoint history. If you have been building a parallel tracing system to answer "what state was the agent in when it broke," and you also checkpoint, you have built the same thing twice. The checkpoint thread is the source of truth, and the trace is the index into it.

Figure 2 · the history

a checkpoint stack, forked at the critic veto

Time travel, as Towards AI describes it, persists a thread of checkpoints and lets you resume from any of them, optionally mutating state, which branches the history. The dimmed checkpoints (04, 05) are the original failing lineage and they stay put. The purple branch is the experiment. You never overwrite the bug, you fork off it, which is the property an auditor actually cares about.

The replay pattern: find the fork, mutate, re-invoke

Now the methodology gets concrete. Once you have a durable thread, debugging a multi-agent failure is three moves: read the history, locate the last good decision before the failure, and re-enter the graph from a mutated copy of that checkpoint. Towards AI's treatment of time travel in agentic systems lays out exactly this loop, and LangGraph exposes it through three calls. You list the history with get_state_history, you pick the checkpoint just before the bad transition, you write a correction into it with update_state, and you resume with invoke. The pattern from the source, trimmed to the spine:

# LangGraph time-travel debug pattern
states = list(graph.get_state_history(config))
failing = states[3]            # last critic approval before the veto cascade
new_config = graph.update_state(
    failing.config,
    {"plan": corrected_plan},  # mutate the decision boundary, not the prompt
)
graph.invoke(None, new_config) # continue from the mutated checkpoint

Notice what this buys you that a re-prompt does not. You are not re-running the whole workflow and hoping the stochastic dice land the same way. You are pinning every upstream state, changing exactly one field, the plan, and asking the single downstream question: with this correction in place, does the critic still veto. That is a controlled experiment. The planner's earlier reasoning, the worker's tool calls, the retrieved context, all of it is held fixed by the checkpoint. The only variable is the one you chose to vary. Compare that to the usual loop, where someone edits the system prompt and re-runs the whole graph, changing a dozen things at once and then attributing the outcome to the one change they remember making.

The skeptic's note, because I am contractually obligated to supply one: states[3] is a magic index and you should never hardcode it. In a real harness you walk the history and select the fork point by predicate, not position. Find the last checkpoint where the critic approved, or the last super-step before the state field you suspect went wrong. The index in the source is illustrative. The selection logic is the part you actually have to build, and getting it right is most of the work.

Figure 3 · the audit bar

what the checkpoint must capture, beyond graph state

This is the figure I argue about most. Karan Singh's point on deploying agentic AI in regulated settings is the one to internalize: "Checkpointing matters, but the real test is whether the snapshot captures enough of the decision boundary, not just the graph state." A checkpoint that stores only the route the graph took will replay the route and tell you nothing about why. Capture the tool outputs, the retrieved context, and any human edits that fed the decision, or your replay is a re-enactment with the evidence removed.

Why graph state alone is a trap

Let me defend the previous figure, because it is the claim most likely to get hand-waved. The seductive thing about LangGraph checkpointing is that it works out of the box on graph state: messages, the plan, the routing flags, whatever you declared in your state schema. That feels complete. It is not, and the gap is exactly where regulated deployments fail their audits. Consider the critic veto we opened with. The graph state after the veto records that the critic rejected and routed back to the planner. Replay it and you faithfully reproduce the rejection. You still do not know why the critic rejected, because the inputs to that judgment, the specific tool output the worker fed it, the retrieved passages the critic read, the human reviewer's earlier edit that shifted the draft, were never in the state schema. They were ambient. They influenced the transition and then evaporated.

This is the difference between a debuggable system and a system that merely looks debuggable. Singh's framing is precise about it: the decision boundary is the set of everything that fed a transition, and the audit bar is whether your snapshot captures that boundary or just the graph's bookkeeping. In a market-research or financial context, "the agent decided X" is not an acceptable record. "The agent decided X given these tool returns, these retrieved sources with these relevance scores, and this prior human override at this timestamp" is the record that survives review. The first is a vibe with a database backing it. The second is evidence.

Practically, this means widening what you persist. Tool outputs go into the checkpoint, not just the fact that a tool was called. Retrieved context goes in with its source identifiers and scores, so a replay reads the same passages the critic read. Human edits and approvals go in with timestamps, because in a loop with a human in it, the human is a node and their input is state. None of this is exotic. It is a schema decision you make before the incident, because you cannot retroactively capture a decision boundary you did not record. The checkpoint store can hold it; ScyllaDB's state-management write-up is largely an argument that the store can carry this richer payload at production volume. The question is whether you put it there.

Figure 4 · the workflow

get_state_history → update_state → invoke

The full loop in four moves: list the history, select the fork point by predicate rather than by magic index, mutate exactly one field of that checkpoint, and re-invoke to watch the downstream transition. Because every upstream state is pinned, the result is attributable to your change. This is the methodology, and the three LangGraph calls are just its implementation.

Where this approach earns its skepticism

I would not trust a methodology that did not name its own failure modes, so here are three, in order of how often they bite.

First, the audit-bar problem cuts both ways. Capture too little and your replay is theater, as Figure 3 argues. But capture everything and you have a different problem: checkpoint bloat and a retention bill that scales with run volume. The ScyllaDB and Aerospike pieces both treat this as the central operational tension, not a footnote. You are now storing tool outputs and retrieved context for every super-step of every run, which in a busy pipeline is a lot of bytes with regulatory retention attached. The honest answer is that decision-boundary capture is expensive and you scope it to the loops that matter, the ones where a wrong transition has a cost a regulator would care about. Checkpointing everything at full fidelity is as wrong as checkpointing nothing.

Second, this is LangGraph-shaped, and not everyone runs LangGraph. The time-travel ergonomics, get_state_history, update_state, the checkpointer abstraction, are framework features. If your shop standardized on CrewAI or rolled its own orchestration, you do not get this loop for free. You build the persistence layer and the replay tooling yourself, and the DEV Community and Towards AI patterns become a specification you have to reimplement rather than an API you call. That is a real cost and a legitimate input to a framework decision, which is why I treat checkpointing maturity as a selection criterion, not a nice-to-have you bolt on later.

Third, and this is the one that humbles the whole exercise: replay does not make the model deterministic. When you re-invoke from a mutated checkpoint, the downstream nodes still call an LLM, and that call is still a sample from a distribution. The same forked checkpoint can branch differently on two different re-runs. The DEV Community piece is honest that checkpoint-based replay improves reproducibility; it does not deliver it. So a single replay that passes is not proof your fix works, any more than a single original run that failed was proof of the bug. You replay the fork many times and you reason about the distribution of outcomes. Time travel turns an unobservable failure into a measurable one. It does not turn a stochastic system into a deterministic one, and any pitch that claims otherwise is selling.

The honest tradeoff

Checkpoint-based replay is the right methodology and it is not free. You pay in storage, in schema discipline up front, and in framework lock-in if you want the ergonomic version. For a weekend agent it is absurd. For a loop whose wrong transitions land in a compliance report, the cost of the checkpoint store is rounding error against the cost of telling an auditor you cannot reproduce your own system's decision. Match the fidelity to the blast radius.

Where this connects

The critic veto is the load-bearing fork in all of this, and it is worth understanding the loop topology in its own right. If you are designing or debugging the planner-worker-critic pattern itself, the planner-worker-critic loops piece goes deeper on why the critic edge concentrates so much decision weight and how to keep the loop from oscillating, which is the failure mode hiding behind the repeated-veto lineage in Figure 2.

Checkpoints are half of the observability story; traces are the other half, and they are complementary rather than redundant. The trace tells you a transition happened and when; the checkpoint tells you the full state it happened from. Wired together, the trace is the index and the checkpoint thread is the storage, which is the architecture the observability for agent systems write-up lays out. If you are building either in isolation, read that before you build the second one twice.

And because the second counterpoint above is really a framework question, it is worth situating checkpointing where it belongs, as a selection criterion. The LangGraph vs CrewAI for enterprise comparison treats time-travel and durable state as a first-class axis rather than a feature-matrix checkbox, which is the right altitude for a decision you will live with through every future incident.

Stop calling multi-agent failures non-reproducible. Model the loop as a state machine, checkpoint every super-step to a durable store, and capture the decision boundary, tool outputs and retrieved context and human edits, not just the graph state. When it breaks, walk the history, fork the last good checkpoint, mutate one field, and re-invoke. Replay the critic's last veto until the wrong transition admits what it was reacting to. The model will keep being non-deterministic. Your debugging does not have to be.

Comments (3)

Join the discussion

Omar SinghAwakened9/16/2025

Replay the critic last veto is the single most useful debugging move in here. We kept re prompting the worker because the worker is where the bad output appeared, and the actual bug was the critic passing something it should have rejected three steps earlier. The failure shows up downstream of where it was caused, every time.

Bianca DuboisDormant9/17/2025

This is the correct framing and it is basically just root cause analysis applied to a state machine. The novelty is not the technique, it is realising the agent is a state machine in the first place. Once you accept that, multi agent debugging stops feeling mystical and starts feeling like debugging, which is the point of the article.

Hassan LopezAwakened9/17/2025

Advanced but worth it. The prerequisite people miss is that you cannot replay what you did not record. If your loop does not persist the state and the decision at each node, there is nothing to replay and you are back to re prompting and praying. The debugging method assumes the observability work is already done.