Multi-agent bugs look like model failures but they're state-machine failures. The fix is replaying the critic's last rejection and finding the fork, not re-prompting the worker.
A planner-worker-critic loop in one of our market-intelligence pipelines rejected a perfectly good draft on Tuesday, accepted a near-identical one on Wednesday, and left no trace of why. Same documents, same retrieval corpus, same model version pinned in the lockfile. An engineer ran it four times locally, got four passes, closed the ticket as not-reproducible, and moved on. I reopened it, because not-reproducible is not a diagnosis. It is an admission that we were not recording the thing that varied. The bug was real. The critic vetoed on a state we never captured, so the failure had nowhere to live between runs. That is the entire subject of this piece: multi-agent bugs are state-machine bugs, and you debug a state machine by replaying its transitions, not by re-prompting it and praying for a different mood.

I do applied research for a security shop that sells into regulated and regulated-adjacent buyers, which means my standard for "we fixed it" is higher than "it passed in the demo." A fix I cannot reproduce is a fix I cannot defend to an auditor, and a fix I cannot defend is not a fix. So I treat every multi-agent failure as a claim that needs evidence, and the evidence is a replayable transition. The good news is that the framework people already reach for, LangGraph, ships the primitives to do this properly. The bad news is that most teams checkpoint the wrong things and then wonder why their replays lie.
Start from the uncomfortable claim and demand the evidence. The claim is that you cannot trust a single run of an agent to tell you anything stable. The evidence is blunt: the DEV Community write-up on debugging non-deterministic agents states it plainly, that "LLM based agents are inherently non-deterministic, which makes reproducibility, debugging, and post execution analysis difficult in production systems." Read that as a specification, not a complaint. If identical inputs and identical parameters can yield different outputs each run, then "I ran it and it worked" carries exactly zero evidentiary weight. It is one sample from a distribution you have not characterized.
This is where the framing has to change or nothing downstream works. The instinct, when a multi-agent system misbehaves, is to anthropomorphize: the worker got confused, the critic was too harsh, the planner went off the rails. That language is comfortable and it is useless, because it points you at the prompt and the prompt is rarely the bug. The productive framing is structural. You have a state machine with explicit nodes and explicit edges. A "bug" is a wrong transition: the graph moved from one state to another it should not have, given the state it was actually in. The model's stochasticity is the environment your transitions run in, not a property you can debug by scolding. You debug the transition. You make it observable, you capture the state on both sides of it, and you replay it.
Figure 1 · the machine
planner → worker → critic, with explicit edges
Step 1
You cannot replay a transition you did not record, so the first discipline is checkpointing: persist the full graph state after every super-step. In LangGraph terms a super-step is one tick of the machine, one node executing and writing its updates to state. Checkpoint after each one and you get a thread, an ordered history of snapshots that you can later walk, inspect, and resume from. Skip it and you are back to anthropomorphizing, because you have no artifact to point at.
Where those checkpoints live is not a detail you get to wave away in a regulated context. The DEV Community walkthrough builds on LangGraph's checkpointer abstraction precisely because it lets you swap the backend without rewriting the graph. In-memory is fine for a notebook and useless the moment a process dies mid-run. For anything that has to survive a crash or feed an audit, you persist to a real store. Aerospike's field notes on running LangGraph in production make the operational case for durable checkpointing under latency and scale pressure, and the ScyllaDB write-up on agentic state management goes further, treating the checkpoint store as a first-class database concern with the throughput and retention characteristics that implies. Postgres, Redis, ScyllaDB: the choice is a capacity-planning question, not a philosophical one. The non-negotiable is that the history outlives the run.
Here is the part people underweight. The value of the Aerospike framing is not "checkpointing is good," which everyone nods along to. It is that durable checkpoints give you, in their words, the ability to reproduce a failure and test a fix "without inventing a custom event-sourced run log." You already wrote the event-sourced log. It is the checkpoint history. If you have been building a parallel tracing system to answer "what state was the agent in when it broke," and you also checkpoint, you have built the same thing twice. The checkpoint thread is the source of truth, and the trace is the index into it.
Figure 2 · the history
a checkpoint stack, forked at the critic veto
Now the methodology gets concrete. Once you have a durable thread, debugging a multi-agent failure is three moves: read the history, locate the last good decision before the failure, and re-enter the graph from a mutated copy of that checkpoint. Towards AI's treatment of time travel in agentic systems lays out exactly this loop, and LangGraph exposes it through three calls. You list the history with get_state_history, you pick the checkpoint just before the bad transition, you write a correction into it with update_state, and you resume with invoke. The pattern from the source, trimmed to the spine:
# LangGraph time-travel debug pattern
states = list(graph.get_state_history(config))
failing = states[3] # last critic approval before the veto cascade
new_config = graph.update_state(
failing.config,
{"plan": corrected_plan}, # mutate the decision boundary, not the prompt
)
graph.invoke(None, new_config) # continue from the mutated checkpoint
Notice what this buys you that a re-prompt does not. You are not re-running the whole workflow and hoping the stochastic dice land the same way. You are pinning every upstream state, changing exactly one field, the plan, and asking the single downstream question: with this correction in place, does the critic still veto. That is a controlled experiment. The planner's earlier reasoning, the worker's tool calls, the retrieved context, all of it is held fixed by the checkpoint. The only variable is the one you chose to vary. Compare that to the usual loop, where someone edits the system prompt and re-runs the whole graph, changing a dozen things at once and then attributing the outcome to the one change they remember making.
The skeptic's note, because I am contractually obligated to supply one: states[3] is a magic index and you should never hardcode it. In a real harness you walk the history and select the fork point by predicate, not position. Find the last checkpoint where the critic approved, or the last super-step before the state field you suspect went wrong. The index in the source is illustrative. The selection logic is the part you actually have to build, and getting it right is most of the work.
Figure 3 · the audit bar
what the checkpoint must capture, beyond graph state
Let me defend the previous figure, because it is the claim most likely to get hand-waved. The seductive thing about LangGraph checkpointing is that it works out of the box on graph state: messages, the plan, the routing flags, whatever you declared in your state schema. That feels complete. It is not, and the gap is exactly where regulated deployments fail their audits. Consider the critic veto we opened with. The graph state after the veto records that the critic rejected and routed back to the planner. Replay it and you faithfully reproduce the rejection. You still do not know why the critic rejected, because the inputs to that judgment, the specific tool output the worker fed it, the retrieved passages the critic read, the human reviewer's earlier edit that shifted the draft, were never in the state schema. They were ambient. They influenced the transition and then evaporated.
This is the difference between a debuggable system and a system that merely looks debuggable. Singh's framing is precise about it: the decision boundary is the set of everything that fed a transition, and the audit bar is whether your snapshot captures that boundary or just the graph's bookkeeping. In a market-research or financial context, "the agent decided X" is not an acceptable record. "The agent decided X given these tool returns, these retrieved sources with these relevance scores, and this prior human override at this timestamp" is the record that survives review. The first is a vibe with a database backing it. The second is evidence.
Practically, this means widening what you persist. Tool outputs go into the checkpoint, not just the fact that a tool was called. Retrieved context goes in with its source identifiers and scores, so a replay reads the same passages the critic read. Human edits and approvals go in with timestamps, because in a loop with a human in it, the human is a node and their input is state. None of this is exotic. It is a schema decision you make before the incident, because you cannot retroactively capture a decision boundary you did not record. The checkpoint store can hold it; ScyllaDB's state-management write-up is largely an argument that the store can carry this richer payload at production volume. The question is whether you put it there.
Figure 4 · the workflow
get_state_history → update_state → invoke
I would not trust a methodology that did not name its own failure modes, so here are three, in order of how often they bite.
First, the audit-bar problem cuts both ways. Capture too little and your replay is theater, as Figure 3 argues. But capture everything and you have a different problem: checkpoint bloat and a retention bill that scales with run volume. The ScyllaDB and Aerospike pieces both treat this as the central operational tension, not a footnote. You are now storing tool outputs and retrieved context for every super-step of every run, which in a busy pipeline is a lot of bytes with regulatory retention attached. The honest answer is that decision-boundary capture is expensive and you scope it to the loops that matter, the ones where a wrong transition has a cost a regulator would care about. Checkpointing everything at full fidelity is as wrong as checkpointing nothing.
Second, this is LangGraph-shaped, and not everyone runs LangGraph. The time-travel ergonomics, get_state_history, update_state, the checkpointer abstraction, are framework features. If your shop standardized on CrewAI or rolled its own orchestration, you do not get this loop for free. You build the persistence layer and the replay tooling yourself, and the DEV Community and Towards AI patterns become a specification you have to reimplement rather than an API you call. That is a real cost and a legitimate input to a framework decision, which is why I treat checkpointing maturity as a selection criterion, not a nice-to-have you bolt on later.
Third, and this is the one that humbles the whole exercise: replay does not make the model deterministic. When you re-invoke from a mutated checkpoint, the downstream nodes still call an LLM, and that call is still a sample from a distribution. The same forked checkpoint can branch differently on two different re-runs. The DEV Community piece is honest that checkpoint-based replay improves reproducibility; it does not deliver it. So a single replay that passes is not proof your fix works, any more than a single original run that failed was proof of the bug. You replay the fork many times and you reason about the distribution of outcomes. Time travel turns an unobservable failure into a measurable one. It does not turn a stochastic system into a deterministic one, and any pitch that claims otherwise is selling.
Checkpoint-based replay is the right methodology and it is not free. You pay in storage, in schema discipline up front, and in framework lock-in if you want the ergonomic version. For a weekend agent it is absurd. For a loop whose wrong transitions land in a compliance report, the cost of the checkpoint store is rounding error against the cost of telling an auditor you cannot reproduce your own system's decision. Match the fidelity to the blast radius.
The critic veto is the load-bearing fork in all of this, and it is worth understanding the loop topology in its own right. If you are designing or debugging the planner-worker-critic pattern itself, the planner-worker-critic loops piece goes deeper on why the critic edge concentrates so much decision weight and how to keep the loop from oscillating, which is the failure mode hiding behind the repeated-veto lineage in Figure 2.
Checkpoints are half of the observability story; traces are the other half, and they are complementary rather than redundant. The trace tells you a transition happened and when; the checkpoint tells you the full state it happened from. Wired together, the trace is the index and the checkpoint thread is the storage, which is the architecture the observability for agent systems write-up lays out. If you are building either in isolation, read that before you build the second one twice.
And because the second counterpoint above is really a framework question, it is worth situating checkpointing where it belongs, as a selection criterion. The LangGraph vs CrewAI for enterprise comparison treats time-travel and durable state as a first-class axis rather than a feature-matrix checkbox, which is the right altitude for a decision you will live with through every future incident.
Stop calling multi-agent failures non-reproducible. Model the loop as a state machine, checkpoint every super-step to a durable store, and capture the decision boundary, tool outputs and retrieved context and human edits, not just the graph state. When it breaks, walk the history, fork the last good checkpoint, mutate one field, and re-invoke. Replay the critic's last veto until the wrong transition admits what it was reacting to. The model will keep being non-deterministic. Your debugging does not have to be.
Replay the critic last veto is the single most useful debugging move in here. We kept re prompting the worker because the worker is where the bad output appeared, and the actual bug was the critic passing something it should have rejected three steps earlier. The failure shows up downstream of where it was caused, every time.
This is the correct framing and it is basically just root cause analysis applied to a state machine. The novelty is not the technique, it is realising the agent is a state machine in the first place. Once you accept that, multi agent debugging stops feeling mystical and starts feeling like debugging, which is the point of the article.
Advanced but worth it. The prerequisite people miss is that you cannot replay what you did not record. If your loop does not persist the state and the decision at each node, there is nothing to replay and you are back to re prompting and praying. The debugging method assumes the observability work is already done.
Comments (3)
Join the discussion
Sign in to comment, bookmark threads, and continue lessons across sessions.