Episodic logs are not structured state.

A grower-facing support agent we shipped last season had a tell. Every morning, the first user who signed in got the same opening line: a polite reintroduction, a request for the field ID it had been handed the day before, a re-ask of the crop and the region it had already been told. The model had not regressed over the weekend. The team had simply decided that "memory" meant the chat transcript, and the transcript reset with the session. So the agent walked into the same room each morning and introduced itself to people it had worked with a hundred times.

The fix is not "more memory." It is two memories with different jobs: a fast, disposable working tier for the current run, and a durable, governed tier for what is true across runs.

The vendors describe the symptom well. Moxo puts it plainly in its architecture overview: without persistent memory, an agent is "perpetually introducing itself at a party it's attended a hundred times." That line got passed around our team for weeks, because it named exactly what we had built by accident. The deeper point is that the fix is not to stuff the transcript somewhere persistent. The transcript is the wrong shape. What production needs is structured state, and structured state has tiers.

The transcript is not memory

When people say "the agent has memory," they almost always mean one thing, when the system actually needs two. Clarion AI, writing on production multi-agent systems, draws the line cleanly: production agents need at minimum a short-term checkpointer that persists conversation state within a session, and a long-term store that retrieves user-specific or domain knowledge across sessions. Two tiers, two contracts, two failure modes. Conflate them and you get the morning reintroduction at best, and a cross-tenant data leak at worst.

Figure 1 · The decomposition

Two swimlanes: working memory and durable memory do different work

Top lane is fast and disposable; bottom lane is slow and governed. The status pills are the contract in miniature. Anything marked volatile can be dropped without a review. Anything marked governed needs a tenant key, provenance, and a deletion path before it is allowed to persist.

Problem: a single "memory" bucket forces one set of rules onto two workloads that want opposite things. Constraint: working state has to be fast and cheap to write on every tool loop, while durable state has to be auditable and never silently overwritten. Recommendation: separate the stores physically, not just logically. A session checkpointer for the working tier, a deliberately distinct vector or graph store for the durable tier, and no code path that writes to both with the same call.
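A minimal sketch of what "physically separate" can look like, assuming in-memory stand-ins for both stores. The class names (`WorkingMemory`, `DurableMemory`, `DurableFact`) and their methods are hypothetical, chosen only to show the two contracts side by side: the working tier overwrites freely, the durable tier only appends and refuses a fact without a tenant key or a source.

```python
from dataclasses import dataclass
from typing import Any


class WorkingMemory:
    """Volatile, per-thread state: cheap to write, safe to drop."""

    def __init__(self) -> None:
        self._state: dict[str, dict[str, Any]] = {}

    def checkpoint(self, thread_id: str, state: dict[str, Any]) -> None:
        # Overwriting is fine here: only the latest run state matters.
        self._state[thread_id] = dict(state)

    def load(self, thread_id: str) -> dict[str, Any]:
        return dict(self._state.get(thread_id, {}))

    def drop(self, thread_id: str) -> None:
        # No review needed to discard volatile state.
        self._state.pop(thread_id, None)


@dataclass(frozen=True)
class DurableFact:
    tenant_id: str
    key: str
    value: str
    source: str  # provenance: where the fact came from


class DurableMemory:
    """Governed, append-only facts: never silently overwritten."""

    def __init__(self) -> None:
        self._facts: list[DurableFact] = []

    def append(self, fact: DurableFact) -> None:
        if not fact.tenant_id or not fact.source:
            raise ValueError("durable facts need a tenant key and provenance")
        # Earlier values are kept, so a correction never erases history.
        self._facts.append(fact)

    def history(self, tenant_id: str, key: str) -> list[DurableFact]:
        return [f for f in self._facts
                if f.tenant_id == tenant_id and f.key == key]

    def delete_tenant(self, tenant_id: str) -> int:
        # The deletion path: remove everything a tenant owns, report the count.
        before = len(self._facts)
        self._facts = [f for f in self._facts if f.tenant_id != tenant_id]
        return before - len(self._facts)
```

The two classes share no base class and no common write method on purpose. That is the code-level version of "no code path that writes to both with the same call."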

Split two

Episodic logs versus semantic facts

Even inside the durable tier, there is a second split that teams miss. Episodic memory is what happened: the raw turns, the timeline, the noisy back-and-forth. Semantic memory is what is true: the field is forty hectares, the account is on the regulated tier, the user prefers metric units. The expensive mistake is treating these as the same thing and pouring whole transcripts into a vector database, then calling retrieval "long-term memory." Retrieval then hands the model yesterday's small talk instead of the one fact that mattered.

Side-by-side from a vendor reference deck: the single-store pattern (left) keeps everything in one index and loses both precision and provenance. The tiered pattern (right) separates a session checkpointer from a durable store of extracted facts. Source: vendor architecture references (Clarion AI, Exabeam).

The retrieval machinery for the durable tier is where the vendor patterns converge. Exabeam, in its architecture explainer, notes that long-term memory typically combines vector stores for semantic recall with knowledge graphs for relationships and provenance. That pairing matters in a compliance setting: the vector store tells you a fact is relevant, and the graph tells you where it came from and what it is connected to. If memory is one of the system components you are still mapping, it's worth taking a look at the broader decomposition of an agentic system, where memory sits as one component among six.

Who is allowed to read this row?

This is the question I ask first on any memory design, and on a multi-tenant platform it is not optional. The moment your durable store holds facts from more than one customer, every write and every read is a potential cross-account leak. A vector index does not know that embedding number 9,481 belongs to a different grower than the query that just came in. You have to enforce that, and you have to enforce it on the store, not in the prompt.

Problem: semantic search is similarity-based, so without a hard filter it will happily return the closest match from any tenant. Constraint: in regulated industries, "the model usually does not surface other accounts" is not a control an auditor will accept. Recommendation: row-level isolation on every memory store, with the tenant identifier as a mandatory partition key on both write and query, enforced below the application so a forgotten filter fails closed rather than open. Treat a long-term write without a tenant key as a build break, not a code-review nit.
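To make "fails closed" concrete, here is a small sketch of a similarity index that partitions by tenant, assuming toy two-dimensional embeddings and an in-memory dictionary in place of a real vector database. `TenantScopedIndex`, `Row`, and `cosine` are illustrative names, not any vendor's API. The tenant identifier is a keyword-only argument with no default, so a call that forgets it raises instead of searching every tenant.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    tenant_id: str
    text: str
    embedding: tuple[float, ...]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantScopedIndex:
    """Similarity search that can only ever see one tenant's partition."""

    def __init__(self) -> None:
        self._partitions: dict[str, list[Row]] = {}

    def write(self, *, tenant_id: str, text: str, embedding) -> None:
        if not tenant_id:
            raise ValueError("write rejected: tenant_id is mandatory")
        row = Row(tenant_id, text, tuple(embedding))
        self._partitions.setdefault(tenant_id, []).append(row)

    def query(self, *, tenant_id: str, embedding, k: int = 3) -> list[str]:
        if not tenant_id:
            raise ValueError("query rejected: tenant_id is mandatory")
        # Ranking happens inside the partition, never across tenants.
        rows = self._partitions.get(tenant_id, [])
        ranked = sorted(rows, key=lambda r: cosine(r.embedding, embedding),
                        reverse=True)
        return [r.text for r in ranked[:k]]


index = TenantScopedIndex()
index.write(tenant_id="grower-a", text="field is 40 ha", embedding=[1.0, 0.0])
index.write(tenant_id="grower-b", text="prefers metric units", embedding=[0.0, 1.0])

# The query vector is closest to grower-a's row, but grower-b only sees its own.
print(index.query(tenant_id="grower-b", embedding=[1.0, 0.0]))
```

In a real deployment the same guarantee would more likely come from the database itself (row-level security, a partition key, or a per-tenant namespace) than from application code. The sketch only shows the shape of the contract.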

Working memory

The checkpointer is scoped to a thread

Concretely, the working tier is a checkpointer keyed by a thread or conversation identifier. LangGraph's persistence model is a clean reference here, and the one I point teams to: a checkpointer persists the state of a single run under a thread_id, while a separate store handles cross-thread retrieval for user and organization context. The two keys are different on purpose. Working state is keyed by the conversation. Durable state is keyed by who the conversation is for.

Figure 2 · Scope keys

Checkpointer keyed by thread, store keyed by tenant and user

Switch thread, lose working memory. Switch user, never see another tenant's facts. The checkpointer's scope is the conversation; the store's scope is the customer. When those two keys get crossed, you either drop state mid-run or, far worse, return one account's data to another.
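The two scope keys can be sketched in a few lines of plain Python. This is deliberately not LangGraph's API: `Scope`, `checkpoints`, `facts`, and `load_context` are hypothetical names, and both stores are ordinary dictionaries, used only to show which key governs which tier.

```python
from typing import NamedTuple


class Scope(NamedTuple):
    tenant_id: str
    user_id: str


# Working tier: keyed by the conversation.
checkpoints: dict[str, dict] = {}
# Durable tier: keyed by who the conversation is for.
facts: dict[Scope, dict[str, str]] = {}


def load_context(thread_id: str, scope: Scope) -> dict:
    """Assemble what the agent may see for this run."""
    return {
        "working": dict(checkpoints.get(thread_id, {})),
        "durable": dict(facts.get(scope, {})),
    }


grower = Scope("acme-farms", "user-17")
checkpoints["thread-monday"] = {"pending_tool": "soil_report"}
facts[grower] = {"field_area_ha": "40", "units": "metric"}

# New thread, same user: working state is gone, durable facts remain.
print(load_context("thread-tuesday", grower))

# Same thread id, different tenant: no durable facts leak across.
print(load_context("thread-monday", Scope("other-farms", "user-17")))
```

The second call still returns the thread's working state, which is why thread identifiers themselves need to be issued per user in a real system. The sketch isolates the keying rule only.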

So the durable tier should never receive a raw transcript. It receives facts that have been pulled out of the transcript on purpose, with a source attached. That extraction step is the part teams skip, and it is the part that makes the difference between memory you can audit and a vector index full of yesterday's chatter.

From transcript to durable fact

Figure 3 · The extraction pipeline

Transcripts become facts only after extraction, validation, and provenance

The pipeline is short but non-negotiable: extract candidate facts, validate and scope them, attach provenance, then persist. Every durable fact should be able to answer "where did you come from," because the day an auditor asks is the day you will wish you had stored the answer.

Problem: raw transcripts are episodic and noisy, so persisting them directly gives you recall without precision and storage without provenance. Constraint: a fact you cannot trace to a source is a fact you cannot defend, correct, or delete on request. Recommendation: run an explicit extraction step that turns turns into typed facts, dedupes against what is already stored, scopes each fact to its tenant, and stamps it with provenance before it lands in the durable store. The transcript can be retained separately for replay; it just does not get to be the memory.
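The recommendation above can be sketched as a short pipeline. Two loud assumptions: the regular expressions in `PATTERNS` are a stand-in for whatever extractor you actually use (in practice often a model call), and the fact kinds `field_area_ha` and `crop` are invented for the grower example. The parts that carry over are the order of the steps and the fields on `Fact`.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    tenant_id: str
    kind: str
    value: str
    source_turn: int   # provenance: index of the turn it came from
    extracted_at: str  # provenance: when it was extracted (UTC, ISO 8601)


# Hypothetical patterns standing in for a real extractor.
PATTERNS = {
    "field_area_ha": re.compile(r"(\d+(?:\.\d+)?)\s*hectares", re.I),
    "crop": re.compile(r"\bgrowing\s+(\w+)", re.I),
}


def extract(turns: list[str]) -> list[tuple[str, str, int]]:
    """Step 1: pull candidate (kind, value, turn_index) triples from turns."""
    candidates = []
    for i, turn in enumerate(turns):
        for kind, pattern in PATTERNS.items():
            match = pattern.search(turn)
            if match:
                candidates.append((kind, match.group(1).lower(), i))
    return candidates


def is_valid(kind: str, value: str) -> bool:
    """Step 2: reject candidates that do not fit the declared type."""
    if kind == "field_area_ha":
        return float(value) > 0
    if kind == "crop":
        return value.isalpha()
    return False


def persist(store: list[Fact], tenant_id: str, turns: list[str]) -> list[Fact]:
    """Steps 3-5: dedupe, scope to the tenant, stamp provenance, then store."""
    if not tenant_id:
        raise ValueError("tenant_id is mandatory for durable writes")
    known = {(f.tenant_id, f.kind, f.value) for f in store}
    added = []
    for kind, value, turn_index in extract(turns):
        if not is_valid(kind, value) or (tenant_id, kind, value) in known:
            continue
        fact = Fact(tenant_id, kind, value, turn_index,
                    datetime.now(timezone.utc).isoformat())
        store.append(fact)
        known.add((tenant_id, kind, value))
        added.append(fact)
    return added


transcript = [
    "Hi again, how was the weekend?",
    "The field is 40 hectares and we are growing wheat.",
    "Yes, 40 hectares.",
]
durable_store: list[Fact] = []
for fact in persist(durable_store, "grower-a", transcript):
    print(fact.kind, fact.value, "from turn", fact.source_turn)
```

The small talk in the first turn never reaches the store, and the repeated "40 hectares" in the last turn is dropped as a duplicate. Each surviving fact can point back to the turn it came from.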

But the context window keeps growing

The fair counterpoint, and I hear it in every design review, is that context windows are getting large enough to hold the whole history, so why bother with external memory at all. Just paste it all in. It is a real argument, and for a prototype it is often the right call. At enterprise scale it falls apart on three axes the window size does not fix.

Cost and latency scale with every token you re-send, so replaying a long history on each turn is a bill that grows with tenure rather than value. Auditability does not come for free with a bigger window: a fact buried in a 200,000-token prompt is not governed, scoped, or deletable. And memory injection widens the attack surface, because anything you feed back into the prompt is a place for prompt injection to hide, which is why retrieved memory has to be sanitized and access-controlled before it reaches the model. A bigger window changes what you can do; it does not change what you can defend. What you choose to load, and at what altitude, is itself a design decision. An interesting read for that is the piece on context engineering altitudes, which applies directly to deciding what earns a place in memory.
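The cost point is easy to check with arithmetic. The sketch below compares total prompt tokens over a conversation under two policies: re-sending the full history every turn, versus sending a bounded window of recent turns plus a fixed block of retrieved facts. All numbers (500 tokens per turn, a 5-turn window, 300 tokens of facts) are illustrative assumptions, not measurements.

```python
def replay_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens when every turn re-sends the full history."""
    # Turn i carries i turns' worth of text: 1 + 2 + ... + n.
    return tokens_per_turn * turns * (turns + 1) // 2


def tiered_tokens(turns: int, tokens_per_turn: int,
                  window: int, memory_tokens: int) -> int:
    """Total prompt tokens with a bounded window plus retrieved facts."""
    return sum(min(i, window) * tokens_per_turn + memory_tokens
               for i in range(1, turns + 1))


print(replay_tokens(100, 500))          # 2525000
print(tiered_tokens(100, 500, 5, 300))  # 275000
```

Full replay grows quadratically with the number of turns, while the tiered policy grows linearly once the window is full. That is the "bill that grows with tenure rather than value" in numbers.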

Monday

Audit the memory you already shipped

You do not need a rewrite to start. Take the agent already in production and run it through six questions. Each one maps to a contract from the tiers above, and the first one you answer with a shrug is your next incident.

Figure 4 · The Monday audit

Six questions, each with a status you can defend

The two rows I see come back red most often are the tenant key and the sanitization step. Both are invisible in a demo and both are the kind of finding that turns into a compliance conversation. Run this before someone external runs it for you.

Do this Monday

Open your agent's memory writes and find the single line that persists to the long-term store. Then ask: is the tenant key on that line, and is it impossible to call this path without it? If the answer is "usually" or "we add it in the service," you have a leak waiting for traffic. Make the key mandatory at the store boundary, not a habit at the call site.

What memory is, once you stop calling it the transcript

None of this is exotic. It is the same discipline we bring to any system of record: separate the hot path from the durable path, scope every row to its owner, keep provenance, and write down a deletion policy before someone asks for one. Agents make it feel new because the reasoning core is probabilistic and the inputs are conversational, but the store underneath is sober data engineering, and that is exactly why it holds up under a regulated load.

So the agent that stops introducing itself every morning is not the one with the largest context window. It is the one whose working state is scoped to the run, whose durable facts are scoped to the customer, and whose memory was designed as structured state instead of inherited from the transcript. Draw the two swimlanes for your own system this week, then go find the write that is missing its tenant key.

Episodic logs tell you what was said. Structured state tells you what is true, who it belongs to, and where it came from. Only one of those is memory you can ship.

Comments (4)

Ravi Vargas · aura farming · 8/26/2025

Treating chat history as memory is the exact mistake that cost us a quarter. We kept stuffing the transcript back in and wondering why the agent forgot the one fact that mattered while remembering small talk. The moment we split episodic logs from a structured state store, the forgetting problem mostly went away. It was never a context-window-size problem; it was a "we never decided what to keep" problem.

Ingrid Martinez · Unproven · 8/27/2025

Yes, that is the failure I wrote this around. The transcript feels like memory because it is right there, but it has no schema and no notion of what is durable. Glad the split worked for you. The hard part is usually getting a team to agree on what counts as durable state, because that is a product decision masquerading as an engineering one.

Nadia Reyes · Awakened · 8/27/2025

For the self-hosted crowd, worth saying the structured store does not need to be fancy. We run the long-term knowledge in plain Postgres on-prem and only the embeddings on the GPU box. That keeps the sensitive state inside the boundary, and the memory tier stays auditable without shipping anything to a vendor.

Deepa Brooks · aura farming · 8/28/2025

Where do you draw the session boundary in practice? That's the bit I never get clean. Per-conversation is too short, per-user is too broad, and time-based feels arbitrary. Curious what actually held up in production for you rather than what sounds tidy.