March 15, 20268 min read

The context window is not memory: designing stateful AI systems

Passing a growing conversation transcript into every inference call is the simplest possible state management strategy and also the most brittle. Here is what a serious approach to context actually looks like.

The most common approach to giving an LLM-powered application 'memory' is to concatenate the conversation history into the prompt and send it with every call. This works in demos. In production, it degrades in at least three directions simultaneously: the context window fills up (every frontier model has a hard token limit), the cost grows linearly with conversation length (you pay for every token on every call), and model performance on tasks requiring attention to specific earlier details actually declines as the context grows beyond a few thousand tokens on most current architectures.

Context length is a shared resource, and unmanaged growth of that resource is a reliability problem with a predictable failure mode. The solution is not a longer context window — it is a deliberate architecture for deciding what information to include in any given inference call, and why.

What the model actually needs at each call

Before deciding how to manage context, it helps to be precise about what the model needs to do its job at any given inference call. For most tasks, this breaks into three categories: the standing instructions (system prompt, behavioral constraints, task definition — rarely changes); the relevant background (documents, records, user profile information that is relevant to this specific request — highly selective); and the immediate conversational context (the recent turns that establish what the user is trying to accomplish right now — usually small).

These categories have different update frequencies, different relevance criteria, and different optimal representations. Collapsing them into a single undifferentiated message history is convenient but wasteful. A standing instruction that repeats verbatim in every call is a token tax. A document that was relevant ten turns ago but isn't relevant now is noise that the model has to reason around. Separating these concerns is the first step toward a context management strategy that scales.

Retrieval instead of accumulation

The key architectural shift is from accumulating context to retrieving context. Rather than growing a conversation buffer indefinitely, store information that might be needed later (documents, facts established in conversation, tool outputs) in a retrievable store, and retrieve selectively based on what the current request actually needs.

Embedding-based retrieval over a document store is the well-known version of this, but the same principle applies to conversation memory. Rather than prepending the last twenty turns of conversation, maintain a compact, structured representation of what has been established: the user's goal, any constraints they've expressed, decisions that were made, facts that were confirmed. Retrieve and inject that representation rather than the raw transcript. This is more work to build and dramatically more robust at scale.

The retrieval step is also where you have the most leverage on quality. A vector similarity search returns the most semantically similar chunks, which is often not the same as the most relevant chunks for the task at hand. Reranking, metadata filtering, and hybrid retrieval strategies (dense + sparse) are all tools for improving relevance precision, and each has a measurable effect on downstream output quality that is worth evaluating explicitly.

Structured state vs. unstructured transcript

For any application where continuity across sessions matters — a support assistant that should remember a user's previous issue, a workflow tool that should resume where the user left off — the choice of state representation matters as much as the retrieval strategy. A raw conversation transcript is maximally expressive and minimally structured. A structured state record (a set of typed fields representing what has been established, what is pending, what has been decided) is maximally structured and slightly lossy.

In practice, most production applications benefit from a layered approach: a structured state record for the facts that need to be reliably preserved and acted upon, combined with selective access to relevant verbatim transcript segments for tasks where the exact phrasing matters. The structured layer gives you reliable, cheap access to the most important context; the transcript layer gives you the nuance that structure can't capture.

Designing this architecture up front is cheaper than retrofitting it. An application that starts with raw transcript accumulation and needs to migrate to structured state management at scale faces a difficult migration: the schema of the state record has to be reverse-engineered from the semantics of historical conversations, and there is no clean cutover path while users have ongoing sessions.

Token budgeting as a first-class concern

Context management ultimately reduces to token budgeting: deciding how many tokens each category of context may consume, and enforcing those budgets when assembling the prompt. This is an engineering constraint, not an afterthought, and it should be modeled explicitly in the code rather than discovered empirically when a request fails with a context length error.

A well-designed prompt assembly layer takes the available token budget, allocates a ceiling to each context category (system prompt, retrieved documents, recent history, current turn), and trims or summarizes each category to fit within its ceiling. The trimming strategy for each category should be deliberate: a sliding window over recent turns is appropriate for conversational context; selecting the highest-scoring retrieved documents is appropriate for document context; neither is appropriate for system prompt content, which should rarely be truncated at all.

Track token usage per request and per category in your observability pipeline. Over time, this data tells you whether your budgets are realistic, which categories are consistently over-budget, and whether changes to retrieval strategy are actually reducing the token consumption of the document context category as intended. Without this visibility, token budgeting is guesswork dressed as engineering.

ArchitectureGenAI

What the model actually needs at each call

Retrieval instead of accumulation

Structured state vs. unstructured transcript

Token budgeting as a first-class concern

The context window is not memory: designing stateful AI systems

What the model actually needs at each call

Retrieval instead of accumulation

Structured state vs. unstructured transcript

Token budgeting as a first-class concern

More insights

Production over demos: shipping LLM features that survive real users

Evals as a first-class artifact

The context window is not memory: designing stateful AI systems

What the model actually needs at each call

Retrieval instead of accumulation

Structured state vs. unstructured transcript

Token budgeting as a first-class concern

More insights

Production over demos: shipping LLM features that survive real users

Evals as a first-class artifact