Why Your AI Agent Loses the Thread After 30 Minutes

Claude Code, Cursor, and Copilot start out brilliant and end up contradicting themselves. The problem is tokens. Orchestration, model selection, and persistent memory are the fix.

Published on February 26, 2026

Reading time: 10 min

If you use AI agents to develop software — Claude Code, Cursor, Copilot, Codex, Gemini CLI — this has definitely happened to you:

You start a conversation. Explain the feature. The agent understands perfectly, explores the code, proposes a solid solution. Everything is going great.

At 20 minutes it's still responding well, but you notice it repeats things it already said. At 40 minutes it generates code that contradicts a decision you made together 15 messages ago. After an hour, you're spending more time correcting the agent than coding yourself.

It's not a bug. It's how tokens work — and it can be fixed.

## What are tokens and why they matter

A token is the minimum unit a language model processes — not exactly a word, more like a fragment. "Implement" might be 1-2 tokens; "destructuring" could be 3-4.
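As a rough mental model, you can ballpark counts with the common 4-characters-per-token heuristic. This is only an approximation (real BPE tokenizers vary by model and by language), but it's enough for back-of-envelope budgeting:

```python
# Rough token estimate using the ~4-characters-per-token heuristic for
# English text. Real tokenizers differ; this is a ballpark, not a count.
def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

print(estimate_tokens("Implement"))      # 9 chars  -> ~2 tokens
print(estimate_tokens("destructuring"))  # 13 chars -> ~3 tokens
```

For precise numbers you'd use the provider's own tokenizer, but the heuristic is usually within the right order of magnitude for prose.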

Every model has a context window: the maximum number of tokens it can "see" at once. As of February 2026:

| Model | Context window | Max output |
|---|---|---|
| Claude Opus 4.6 | 200K (1M in beta) | 128K tokens |
| Claude Sonnet 4.6 | 200K (1M in beta) | 64K tokens |
| GPT-5 (API) | 400K | – |
| GPT-5.3-Codex | 192K | – |
| Gemini 2.5 Pro | 1M (2M coming soon) | – |

200K tokens sounds like a lot. But when you're working on a real project — code files, accumulated conversation, system instructions, previous responses — the window fills up faster than you'd expect.

When the conversation exceeds the window, the model compacts previous messages. Both Claude and OpenAI have automatic compaction: when context approaches the limit, the API summarizes the oldest parts. This allows technically "infinite" conversations, but details get discarded. Decisions get summarized. Nuances are lost.
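A minimal sketch of that compaction mechanism. The threshold, the summary size, and the `summarize` stub are assumptions for illustration, not any provider's actual algorithm:

```python
# When the running total of message tokens approaches the window limit,
# the oldest messages are replaced by a short summary. Details are lost.
WINDOW = 200_000
COMPACT_AT = 0.9  # compact at 90% full (threshold is an assumption)

def summarize(messages):
    # Stand-in for a lossy, model-generated summary of older messages.
    return {"role": "system", "tokens": 500,
            "text": f"[summary of {len(messages)} earlier messages]"}

def maybe_compact(history):
    total = sum(m["tokens"] for m in history)
    if total < WINDOW * COMPACT_AT:
        return history
    # Summarize the oldest half of the conversation in place.
    half = len(history) // 2
    return [summarize(history[:half])] + history[half:]

history = [{"role": "user", "tokens": 40_000, "text": "..."} for _ in range(5)]
history = maybe_compact(history)          # 200K total -> triggers compaction
print(sum(m["tokens"] for m in history))  # 500 + 3 * 40_000 = 120_500
```

The conversation fits again, but the first two messages now exist only as a 500-token summary: whatever the summary dropped is gone.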

And the real problem: you don't know what's been lost. The model doesn't tell you "I just forgot we decided to use PostgreSQL." It simply acts as if that decision never existed. Or worse: it partially remembers and generates something inconsistent.

Those are degraded tokens. Tokens that technically exist in the context, but whose information quality has deteriorated.

## The "single conversation" approach: simple but fragile

The workflow most developers use:

  1. Open a conversation
  2. Describe what you need
  3. The agent explores, you discuss, it implements
  4. You correct, iterate, and adjust until done

All in a single context. Intuitive and natural. The problem is that each message accumulates:

| Phase | Approximate tokens |
|---|---|
| System instructions | 2,000–5,000 |
| Code exploration | 15,000–40,000 |
| Discussion + implementation | 15,000–45,000 |
| Corrections and adjustments | 10,000–25,000 |
| Accumulated total | 42,000–115,000 |

A medium session already consumes between 21% and 58% of a 200K window. And that's before counting that the model's own responses re-enter as input tokens on every subsequent turn.
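Summing the table's per-phase ranges against a 200K window makes the arithmetic concrete:

```python
# Per-phase ranges from the table above, and what they add up to
# relative to a 200K-token context window.
phases = {
    "system instructions":         (2_000, 5_000),
    "code exploration":            (15_000, 40_000),
    "discussion + implementation": (15_000, 45_000),
    "corrections and adjustments": (10_000, 25_000),
}
lo = sum(a for a, _ in phases.values())
hi = sum(b for _, b in phases.values())
print(f"{lo:,}-{hi:,} tokens = {lo/200_000:.1%}-{hi/200_000:.1%} of a 200K window")
# 42,000-115,000 tokens = 21.0%-57.5% of a 200K window
```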

When it approaches the limit, the system compacts. The symptoms:

  • Contextual hallucinations: generates functions that already exist
  • Lost decisions: forgets constraints from the beginning
  • Inconsistent code: one file follows one pattern, another follows a different one
  • Repetition: explains things it already explained

And what nobody measures: the cost of correction tokens. Every time you say "no, we had decided to use X," you're consuming tokens. In a long session, between 30% and 50% of tokens can be pure rework.

## The orchestrator + sub-agents pattern

There's another way to work. Instead of a monolithic conversation, you split the flow into phases, and each phase is executed by an independent agent with clean context.

An orchestrator agent coordinates the flow. It doesn't do the work — it only decides which agent to launch, with what information, and presents results between phases.

Think of it this way: instead of one employee who does everything — researches, plans, designs, implements, tests — you have a team coordinator and specialists. The coordinator only needs to know what's been done, what's left, and who does it.

### How it works in practice

Phase 1 — Exploration. A sub-agent reads the relevant code and returns a structured summary. It finishes and its context is released.

Phase 2 — Proposal. Another sub-agent receives the summary and produces a proposal with scope and success criteria. The developer reviews it.

Phase 3 — Specification. Another agent takes the approved proposal and writes detailed specifications.

Phase 4 — Technical design. Another agent analyzes the spec and produces architecture decisions.

Phase 5 — Implementation. Another agent receives specs + design and writes code. No exploration, no debates — context 100% dedicated to implementing.

Phase 6 — Verification. Another agent runs tests and verifies that the code meets the specs.

Each sub-agent is born, does its job, and terminates. The orchestrator only retains summaries between phases.
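The six phases above can be sketched as a simple loop. Here `run_agent` stands in for launching a fresh model conversation; the function names and summary format are illustrative, not any real framework's API:

```python
# Orchestrator sketch: each phase runs in a fresh context and terminates;
# the orchestrator retains only the summaries passed between phases.
PHASES = ["exploration", "proposal", "specification",
          "design", "implementation", "verification"]

def run_agent(phase: str, briefing: str) -> str:
    # A real sub-agent would load code/specs, do its work in a clean
    # context, and return a structured summary before terminating.
    return f"[{phase} summary based on: {briefing[:40]}]"

def orchestrate(task: str) -> list[str]:
    summaries = []      # all the orchestrator ever retains
    briefing = task
    for phase in PHASES:
        summary = run_agent(phase, briefing)  # clean context per phase
        summaries.append(summary)             # sub-agent released here
        briefing = summary                    # next phase gets a summary,
                                              # never the full transcript
    return summaries

results = orchestrate("Add OAuth2 login")
print(len(results))  # one summary per phase: 6
```

The key property: no phase ever sees the full transcript of the phases before it, only a compact handoff.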

(Diagram: an exploration sub-agent, model Sonnet, input project files, output a structured summary, peaks at 35K of its 200K window and is released when done; the orchestrator retains only ~8K.)

## The right model for each task

With an orchestrated flow, each sub-agent can use a different model. And models don't just differ in price — they differ in type of thinking:

| Strength | Models | When to use |
|---|---|---|
| Deep reasoning | Gemini 2.5 Pro, Claude Opus | Architecture, design, critical decisions |
| Clean and precise code | Claude Opus, GPT-5 | Complex implementation, refactoring |
| Speed + balance | Claude Sonnet, GPT-5.3-Codex | Exploration, proposals, standard code |
| Testing and validation | Codex, Claude Opus | Verification, security review |
| Fast autocomplete | Copilot, Claude Haiku | Boilerplate, snippets, repetitive tasks |

The idea: use the premium model where an error has serious consequences, and the efficient model where the work is more mechanical.

In an OAuth2 authentication flow, for example, you'd use a fast model to explore the code and draft the proposal. A deep reasoning model to design the token and session strategy — where a security error is critical. And back to the efficient model for implementation that follows an already-validated spec.
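One lightweight way to encode this routing is a phase-to-tier map. The tier names are placeholders, not real model IDs; swap in whatever your providers offer:

```python
# Route each phase to a model tier. The assignment mirrors the OAuth2
# example above: cheap models for mechanical work, premium models where
# an error propagates downstream.
ROUTING = {
    "exploration":    "fast",            # mechanical: cheap model is fine
    "proposal":       "fast",
    "specification":  "deep-reasoning",  # errors here propagate downstream
    "design":         "deep-reasoning",  # security-critical decisions
    "implementation": "fast",            # follows an already-validated spec
    "verification":   "deep-reasoning",  # last line of defense
}

def pick_model(phase: str) -> str:
    # Default to the premium tier when in doubt: an unnecessarily strong
    # model costs money, a wrongly cheap one costs rework.
    return ROUTING.get(phase, "deep-reasoning")

print(pick_model("design"))       # deep-reasoning
print(pick_model("exploration"))  # fast
```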

(Diagram: the six phases, Exploration, Proposal, Specification*, Design*, Implementation, Verification*, with quality holding at 99%. Sonnet builds; Opus decides and reviews. *Critical phases, where errors propagate downstream.)

The component shows an example with the Claude family — but the principle applies to any combination of models and providers.

## The truth about token consumption

Let's be honest: an orchestrated flow consumes more tokens in total. Each sub-agent needs to load context from scratch — instructions, previous artifacts, source code. There's inevitable duplication.

The difference lies in a line that no cost comparison shows: rework due to degraded context, which drops to zero.

The advantage doesn't show up in small features. If your task fits in 20 minutes, you don't need orchestration. The difference becomes dramatic in features that require more than an hour, that cross multiple domains (frontend, backend, infra), or that extend over several days. In a single conversation, rework cost grows steeply as context degrades. In an orchestrated flow, each sub-agent works with the same quality in phase 6 as in phase 1.

And even though providers offer automatic compaction, it's still lossy compression — the summary discards information that, for your specific task, could be critical.

## The amnesia problem between agents

If each sub-agent is born with clean context, how does it know what happened before?

The obvious answer is to pass it all the previous output. But that would recreate the context accumulation of the monolithic approach. What we need is selective memory: each agent accesses what it needs, without loading the entire history.

### Engram: persistent memory for agents

Engram solves this with a "progressive disclosure applied to tokens" approach.

It's a Go binary with SQLite and full-text search, compatible with any MCP-capable agent. No external dependencies. It installs in a minute with Homebrew, and `engram setup claude-code` configures everything.

Three key concepts:

Observations: the unit of memory. Each one is a structured summary — a decision, a resolved bug, an established pattern. With title, type, content, and timestamp.

Sessions: temporal context. When a session ends, the agent generates a summary, and the next session starts from that context.

Topic keys: identifiers that enable upserts. If you save a decision under the key `auth/strategy` and later update it, Engram modifies the existing record. Memory evolves; it doesn't grow indefinitely.
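The upsert behavior can be pictured with plain SQLite (which Engram uses under the hood); the schema and function below are invented for illustration and are not Engram's actual ones:

```python
# Topic-key upserts via SQLite's ON CONFLICT clause: saving under an
# existing key overwrites the record instead of appending a new one.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE observations (
    topic_key TEXT PRIMARY KEY,
    title     TEXT,
    content   TEXT,
    updated   TEXT DEFAULT CURRENT_TIMESTAMP)""")

def save(key, title, content):
    db.execute("""INSERT INTO observations (topic_key, title, content)
                  VALUES (?, ?, ?)
                  ON CONFLICT(topic_key) DO UPDATE SET
                      title   = excluded.title,
                      content = excluded.content,
                      updated = CURRENT_TIMESTAMP""",
               (key, title, content))

save("auth/strategy", "Auth decision", "Use session cookies")
save("auth/strategy", "Auth decision", "Use JWT with refresh tokens")  # upsert

rows = db.execute("SELECT content FROM observations").fetchall()
print(len(rows), rows[0][0])  # still 1 row: the decision evolved in place
```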

### The progressive disclosure pattern

When a sub-agent needs previous information, it doesn't load the full artifact:

Step 1 — Search: searches Engram with a query. Receives compact results — title, snippet, and an ID. Each result: ~100 tokens.

Step 2 — Selective retrieval: only if it needs the full content, it retrieves the observation by ID.

An implementation sub-agent can search "design decisions for feature X," get 5 results for ~500 tokens, read in detail only the 2 relevant ones, and never load the other 3.
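A toy version of that two-step pattern, with an in-memory store standing in for the database (the store layout and function names are made up; Engram's real API differs):

```python
# Step 1 returns compact hits (title + snippet + ID, ~100 tokens each);
# step 2 expands only the hits the agent actually needs, by ID.
STORE = {
    1: {"title": "Design: token strategy", "body": "Long design doc ... " * 200},
    2: {"title": "Design: session layout", "body": "Long design doc ... " * 200},
    3: {"title": "Bug: flaky login test",  "body": "Long postmortem ... " * 200},
}

def search(query: str):
    # Compact results only: never the full body.
    return [{"id": k, "title": v["title"], "snippet": v["body"][:80]}
            for k, v in STORE.items() if query.lower() in v["title"].lower()]

def get(obs_id: int) -> str:
    return STORE[obs_id]["body"]  # full content, fetched deliberately

hits = search("design")       # 2 compact hits, cheap to read
detail = get(hits[0]["id"])   # expand only what's actually relevant
print(len(hits), len(detail) > 1_000)
```

The third observation's body never touches the context window at all, which is the entire point.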

(Diagram: without Engram, a sub-agent loads every previous artifact in full, Exploration 3K + Proposal 2K + Specification 4K + Design 3K = 12K tokens; with Engram, it starts at zero and retrieves only what the search surfaces.)

At the scale of a complete flow with 6 sub-agents, this accumulated difference can mean 15,000 to 30,000 fewer tokens. Not a spectacular saving in absolute cost, but they're tokens that don't take up space in the window, leaving more room for the sub-agent's real work.

## Tool outputs: the forgotten vector

There's a consumption vector that almost nobody mentions: the outputs of the tools the agent uses.

When an agent works, it uses MCP tools: reads files, executes commands, takes browser snapshots. Each result enters the context window directly.

| Operation | Raw output size |
|---|---|
| Playwright snapshot (repo page) | ~144 KB |
| List 20 GitHub issues (full JSON) | ~62 KB |
| Access log file (500 requests) | ~52 KB |
| Analytics CSV (300 rows) | ~14 KB |
| Test suite output (30 suites, 136 tests) | ~6 KB |

Real data: I navigated with Playwright to the Engram repo, listed 20 issues from vercel/next.js with `gh`, and processed realistic generated log and test output. Those five operations alone total ~278 KB of raw text, which tokenizes to very roughly 70–100K tokens: a large fraction of a 200K window gone before any actual work. Cloudflare identified the same problem with their MCP servers: with 81 active tools, 143K tokens were consumed before the user's first message.

### Context Mode: sandbox for outputs

Context Mode sits between the agent and its tool outputs. Raw data never enters the window — it's processed in an isolated sandbox, and only a summary enters the conversation.

The result measured in real operations: 278 KB of raw output → 2.1 KB. A 99.3% reduction.
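The mechanism can be sketched as a digest function that runs over the raw output before anything reaches the conversation. The digest format here is invented for illustration:

```python
# Raw tool output is processed outside the context window; only a small
# digest crosses into the conversation.
import json

def digest_issues(raw_json: str) -> str:
    issues = json.loads(raw_json)
    lines = [f"#{i['number']} {i['title']}" for i in issues[:5]]
    return f"{len(issues)} issues; first {len(lines)}:\n" + "\n".join(lines)

# Simulate a full-JSON issue listing on the order of the table above.
raw = json.dumps([{"number": n, "title": f"Issue {n}", "body": "x" * 3_000}
                  for n in range(20)])
summary = digest_issues(raw)  # a few hundred bytes enter the context
print(len(raw) > 60_000, len(summary) < 500)
```

The issue bodies, the bulk of the payload, stay in the sandbox; the agent sees counts and titles and can ask for more only if it needs to.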

| Operation | Without Context Mode | With Context Mode | Reduction |
|---|---|---|---|
| Playwright snapshot (repo) | 143.6 KB | 562 B | 99.6% |
| 20 GitHub issues (next.js) | 62.4 KB | 735 B | 98.8% |
| Log of 500 requests | 51.8 KB | 229 B | 99.6% |
| Analytics CSV (300 rows) | 14.1 KB | 223 B | 98.5% |
| Test suite (30 suites, 136 tests) | 5.7 KB | 223 B | 96.2% |
A concrete before/after from the Playwright snapshot, a 99.6% reduction:

Raw (143.6 KB, excerpt):
`- generic [ref=e2]: - banner [ref=e6] - main [ref=e9]: - heading "Gentleman-Programming/engram" [level=1] - tab "Code" [selected] - tab "Issues" - tab "Pull requests"`

Processed (562 B):
"GitHub repo (engram). Nav, tabs, README with instructions, sidebar with releases and contributors."

The cumulative effect: the time before the agent starts degrading goes from ~30 minutes to ~3 hours.

## The four layers together

Each layer attacks the problem from a different angle:

  • Orchestration → clean context per phase
  • Model selection → cost vs quality optimized
  • Engram → selective memory between agents
  • Context Mode → compressed tool outputs

Not everything needs all four layers. For a 30-minute task touching 1-3 files, a direct conversation is perfect. Consider orchestration when the feature requires more than an hour, crosses multiple domains, or you need to pick up the next day. Add persistent memory when you work on a project for days and today's decisions affect tomorrow's code.

## The future: tokens as a resource

Context windows are skyrocketing. Claude Opus already has 1M in beta. Gemini 2.5 Pro works with 1M standard. But more context doesn't mean better context — the more content in the window, the harder it is for the model to focus on what's relevant.

The real optimization isn't cramming more tokens in. It's making sure every token in the window is relevant to the current task.

Focused tokens. Fresh tokens. Chosen tokens. Fairly priced tokens.

It's not about spending less. It's about spending better.