Logan Kelly
Context windows grow silently in agentic systems. Here's the math behind why naive history management costs 5x more — and how to enforce hard limits.

The math isn't complicated. It's just that nobody runs it until they get the bill.
An AI agent handling a 10-turn workflow — reading files, calling tools, revising output — doesn't cost 10x a single query. Because transformer inference processes the entire context on every call, cost compounds with each additional turn. The tenth turn carries everything that preceded it: the original file reads, every tool call and its return payload, every intermediate plan and revision. A team that models agent cost as "turns × average cost per turn" will consistently underprice their system by 3x to 5x.
This is the context window cost problem. It is structural, not anecdotal. And in 2026, with context windows exceeding 200,000 tokens and frontier model input pricing in the range of $2.50–$5 per million tokens, it has become one of the most significant and least-governed cost drivers in production AI systems.
Why Context Compounds
Transformer-based language models have no native memory across turns. Each inference call receives the full context — every prior message, every tool result, the complete system prompt — and pays for all of it. If a message was sent three turns ago, it still occupies tokens on every subsequent call, at full cost.
Consider a debugging agent. On turn one, it reads the codebase: roughly 20,000 tokens. On turn two, it calls a tool that returns 5,000 tokens and produces a plan. By turn ten, the context window contains the original file read, every intermediate plan, every tool call and its return payload, and every revision cycle. A workflow that felt like ten small steps has accumulated 80,000–200,000 tokens — and every token introduced in turn three is being billed again on turns four through ten.
The naive approximation — "each turn costs roughly the same" — ignores this compounding entirely. The accurate model is closer to a triangular series: if each of n turns adds roughly c new tokens, total input cost scales with c · n(n+1)/2 rather than c · n, because every addition is re-billed on each subsequent call. Teams that model per-turn costs independently consistently underestimate multi-step agentic workflow costs by 3x to 5x once context accumulation, tool call payloads, and system prompt repetition are accounted for.
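The compounding is easy to see in a few lines of Python. This sketch reuses the illustrative numbers from the debugging-agent example above (a 20,000-token codebase read, then roughly 5,000 new tokens per turn) and the $5/M input price; all figures are assumptions for illustration, not benchmarks.

```python
# Illustrative only: compares the naive "sum of new tokens" cost model with
# the compounding model, in which every prior token is re-billed each turn.
# Token counts and the $5/M input price are assumptions from this article.

PRICE_PER_TOKEN = 5.00 / 1_000_000  # $5 per million input tokens

def naive_cost(new_tokens_per_turn: list[int]) -> float:
    """Prices each turn as if it paid only for its own new tokens."""
    return sum(t * PRICE_PER_TOKEN for t in new_tokens_per_turn)

def compounding_cost(new_tokens_per_turn: list[int]) -> float:
    """Prices each turn on the full accumulated context, as transformer
    inference actually does."""
    total, context = 0.0, 0
    for new in new_tokens_per_turn:
        context += new                       # everything added so far rides along...
        total += context * PRICE_PER_TOKEN   # ...and is billed again this turn
    return total

# 10-turn debugging agent: 20k-token codebase read, then ~5k tokens per turn
turns = [20_000] + [5_000] * 9
print(f"naive:       ${naive_cost(turns):.2f}")
print(f"compounding: ${compounding_cost(turns):.2f}")
print(f"ratio:       {compounding_cost(turns) / naive_cost(turns):.1f}x")
```

The exact multiple depends on how tokens arrive across turns: uniform additions over n turns give a ratio of roughly (n+1)/2, and front-loading context (a large initial file read) pushes it higher.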
At current frontier pricing — Claude Opus 4.7 at approximately $5/M input tokens, GPT-5.4 at approximately $2.50/M input tokens — this spread translates directly into budget overruns that appear unpredictable until the underlying architecture is understood.
Where the Money Disappears
There are four principal context cost drivers in agentic systems that teams routinely fail to model:
System prompt duplication. System prompts are included on every turn. An agent with a 4,000-token system prompt running 20 turns will spend 80,000 tokens on system prompt repetition alone — roughly 16% of the total bill for a 500,000-token workflow, paid not for reasoning but for structural overhead. System prompts rarely appear as a line item in cost dashboards.
Tool call return payloads. MCP servers, APIs, and retrieval layers return raw payloads, and those payloads accumulate in the context window. A search tool returning 3,000 tokens per call across 8 calls contributes 24,000 tokens of accumulated results — many of which are no longer relevant to the agent's current reasoning step. Standard agentic stacks have no native mechanism to truncate stale tool outputs from the active context.
Re-retrieved redundant information. Agents without memory management will frequently re-retrieve documents they have already read when a new task step begins. Each redundant retrieval event adds tokens to an already-loaded context. In multi-step research or coding workflows, this pattern is common and expensive.
Idle context carrying. The planning output from step one is still in the context window at step ten, whether or not it remains relevant. Without explicit summarization or pruning policies, rejected approaches, superseded plans, and obsolete tool outputs carry through the entire workflow — contributing to cost without contributing to reasoning quality.
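Two of these drivers can be tallied with back-of-envelope arithmetic. The sketch below reuses the illustrative figures from this section (a 4,000-token system prompt over 20 turns, 8 search calls at 3,000 tokens each, a 500,000-token workflow); they are assumptions, not measurements.

```python
# Back-of-envelope tally of system prompt repetition and tool payload
# accumulation. All figures are illustrative assumptions from the text.

def system_prompt_overhead(prompt_tokens: int, turns: int) -> int:
    """The system prompt is resent in full on every turn."""
    return prompt_tokens * turns

def tool_payload_total(payload_tokens: int, calls: int) -> int:
    """Tokens that tool results add to the context window."""
    return payload_tokens * calls

WORKFLOW_TOKENS = 500_000  # assumed total input tokens for the workflow

prompt_tokens = system_prompt_overhead(4_000, 20)  # 80,000 tokens
tool_tokens = tool_payload_total(3_000, 8)         # 24,000 tokens
prompt_share = prompt_tokens / WORKFLOW_TOKENS     # ~16% of the bill

print(prompt_tokens, tool_tokens, f"{prompt_share:.0%}")
```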
None of these cost drivers requires a runaway loop or an agentic failure to appear. They are present in ordinary, well-functioning multi-step workflows. The cost problem here is not exceptional behavior; it is normal behavior, unmanaged.
The Enforcement Gap
Platforms like LangSmith, Helicone, and Arize Phoenix offer cost tracking for agentic workflows. This is useful for retrospective analysis — identifying which agents are expensive after the fact, and correlating spend with model version, prompt configuration, or task type.
What these platforms cannot do is intervene. They observe cost as it accumulates, but they do not operate in the execution path. They cannot halt a workflow when a per-session token budget ceiling is reached. They cannot enforce a maximum context size before the inference call is submitted. They cannot trigger a compression or summarization subroutine mid-session when context approaches a cost threshold.
This is not a criticism of observability tooling — it is a description of its scope. Observability is logging and analysis. What production agentic systems additionally require is enforcement: runtime controls that apply cost policy before spend is incurred, rather than reporting on it after the fact.
The gap between "we can see how much this agent costs" and "we can enforce how much this agent is allowed to cost" is the governance gap that most teams in 2026 have not yet closed.
How Waxell Runtime Handles This
Waxell Runtime operates in the execution path, not alongside it. Before an agent submits an inference call, Runtime evaluates whether that call complies with active policies — including token budget policies that limit total context accumulation per session, per agent class, or per task type.
This creates hard stops, not soft alerts. An agent that has accumulated 150,000 tokens in a session configured with a 100,000-token policy ceiling will not silently proceed to the next turn. Runtime can be configured to halt the workflow, trigger a compression subroutine, or escalate to human review — depending on the policy definition and the risk tier of the agent.
Waxell Runtime ships with 26 policy categories out of the box, including cost hard stops, context window enforcement, budget-triggered escalation paths, and loop detection. The enforcement architecture requires no rebuilds: Runtime deploys as a governance plane above existing agents, without requiring modification of the agent code itself.
Waxell Observe, the SDK-level observability layer, complements Runtime with real-time telemetry — providing per-turn, per-call cost visibility that feeds Runtime's policy decisions. Observe initializes in two lines of code and auto-instruments 157+ libraries, which means cost attribution begins immediately, at full fidelity, without a custom integration effort.
Together, they create the architecture that observability-only platforms cannot deliver: cost policy enforced at execution time, not reviewed in a dashboard after the bill arrives.
FAQ
What is the most expensive hidden cost in agentic AI systems in 2026?
Context maintenance — the accumulated cost of carrying prior turns, tool call results, and system prompts through every inference call — is consistently underestimated. Because cost scales roughly with the compound growth of context across turns rather than linearly with turn count, teams that model per-turn costs independently will underprice multi-step agentic workflows by 3x to 5x.
Do large context windows make agentic systems more expensive or more efficient?
Both, simultaneously — which is why context window size alone is a poor cost metric. A 200,000-token context window can enable a more capable single-pass workflow that avoids expensive re-retrieval cycles. But it also increases the cost of every subsequent turn that carries that loaded context. The efficient approach manages what enters the context window and when it is pruned, not just how large the window can get.
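One minimal version of "manage what enters the context window" is to drop stale tool outputs before each call. The sketch below is illustrative only: the message schema and the keep-last-N policy are assumptions, not a prescription for any particular framework.

```python
# Minimal sketch of context pruning before an inference call: keep the
# system prompt and recent turns, drop tool payloads older than a window.
# The message schema and keep-last-N policy are illustrative assumptions.

def prune_context(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    """Drop tool-result messages older than the last `keep_recent` messages."""
    recent = messages[-keep_recent:]
    older = messages[:-keep_recent]
    kept = [m for m in older if m["role"] != "tool"]  # stale tool payloads go
    return kept + recent

history = [
    {"role": "system", "content": "You are a debugging agent."},
    {"role": "user", "content": "Fix the failing test."},
    {"role": "tool", "content": "...3,000 tokens of old search results..."},
    {"role": "assistant", "content": "Plan: inspect the parser module."},
    {"role": "tool", "content": "...current file contents..."},
    {"role": "assistant", "content": "Patched the off-by-one error."},
    {"role": "user", "content": "Now run the suite."},
]
pruned = prune_context(history, keep_recent=4)
# The old search payload is gone; the recent tool result is intact.
```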
Why can't LangSmith or Helicone stop runaway context costs?
Observability platforms sit outside the execution path. They record what happened after inference calls return. Enforcing a cost limit requires operating before the inference call — validating the pending context size against a budget policy and blocking or modifying the call if the policy would be violated. This is the function of a runtime governance layer, not an observability layer.
What is a token budget policy and how does it work in practice?
A token budget policy defines a maximum token allocation for an agent within a defined scope — per session, per task type, or per time period. At runtime, the governance layer evaluates each pending inference call against the active budget, comparing the proposed context size against remaining quota. If the call would exceed the limit, the governance layer can block, compress, summarize, or escalate — depending on the configured policy response.
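The evaluation step described above can be sketched generically. This is the pattern, not Waxell's actual API; the class, field names, and response strings are hypothetical.

```python
# Illustrative sketch of a token budget policy check in the execution path.
# Generic pattern only — not Waxell's actual API; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class TokenBudgetPolicy:
    scope: str           # e.g. "session", "task_type", "daily"
    limit: int           # max tokens allowed within the scope
    on_violation: str    # "block", "compress", or "escalate"

def evaluate_call(policy: TokenBudgetPolicy, used: int, pending: int) -> str:
    """Decide what happens to a pending inference call of `pending` tokens,
    given `used` tokens already spent within the policy's scope."""
    if used + pending <= policy.limit:
        return "allow"
    return policy.on_violation  # configured response: block/compress/escalate

policy = TokenBudgetPolicy(scope="session", limit=100_000, on_violation="compress")
print(evaluate_call(policy, used=92_000, pending=15_000))  # over budget
print(evaluate_call(policy, used=40_000, pending=20_000))  # within budget
```

The key property is that the check runs before the call is submitted, so the violating spend never occurs.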
Is automatic context compression safe to apply to all agent workflows?
Compression strategies — summarization, pruning, retrieval replacement — involve tradeoffs between cost reduction and information fidelity. Automatic compression is appropriate for intermediate planning text and superseded outputs. It is less appropriate for verbatim technical payloads — code snippets, regulatory text, contract language — where precision matters. Governance policies should distinguish between content types when defining compression rules.
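One way to encode that distinction is a content-type rule that whitelists compression rather than applying it by default. The type labels below are illustrative assumptions, not a standard taxonomy.

```python
# Sketch of a content-type-aware compression rule per the tradeoff above:
# summarize disposable planning text, never touch verbatim technical payloads.
# The content-type labels are illustrative assumptions.

SAFE_TO_COMPRESS = {"plan", "superseded_output", "scratchpad"}
PRESERVE_VERBATIM = {"code", "regulatory_text", "contract_language"}

def compression_action(content_type: str) -> str:
    if content_type in PRESERVE_VERBATIM:
        return "keep"        # precision matters: no lossy summarization
    if content_type in SAFE_TO_COMPRESS:
        return "summarize"   # intermediate reasoning can be condensed
    return "keep"            # default to safety for unknown types

print(compression_action("plan"))  # candidate for summarization
print(compression_action("code"))  # preserved verbatim
```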
How does Waxell Connect help with context cost in third-party or vendor agent scenarios?
Waxell Connect governs agents that a team did not build — vendor agents, third-party integrations, and MCP-native agents — with no SDK and no code changes required. This matters for cost control because vendor agents often have opaque context management behaviors that cannot be modified. Connect enforces budget policies externally, without requiring access to or modification of the vendor agent's internals.
Sources
Company of Agents — "AI Agent Unit Economics: Scaling Your Agentic Fleet in 2026": https://www.companyofagents.ai/blog/en/ai-agent-unit-economics-scaling (accessed May 2026) — context maintenance framing and general agentic cost analysis
AI Credits — "The Real Cost of Building an AI Agent in 2026": https://www.aicredits.co/en/blogs/real-cost-of-ai-agents-2026 (accessed May 2026) — source for 3–5x cost underestimation figure and coding agent token volume estimates
Byteiota — "Agentic AI Coding Costs: Why Devs Ask 'Which Tool Won't Torch My Credits?'": https://byteiota.com/agentic-coding-economics/ (accessed May 2026) — practitioner cost framing
Hacker News — "Effective context engineering for AI agents": https://news.ycombinator.com/item?id=45418251 (accessed May 2026) — community discussion on context engineering tradeoffs
Hacker News — "In my experience with AI coding, very large context windows aren't useful in practice": https://news.ycombinator.com/item?id=42834527 (accessed May 2026) — practitioner perspective on large context limitations
Hacker News — "Show HN: Context Lens – See what's inside your AI agent's context window": https://news.ycombinator.com/item?id=46947786 (accessed May 2026) — practitioner tooling for context visibility
Anthropic model pricing (verified 2026-05-04): https://platform.claude.com/docs/en/about-claude/models/overview — Claude Opus 4.7: $5/M input tokens, $25/M output tokens
OpenAI API pricing (verified 2026-05-04): https://openai.com/api/pricing/ — GPT-5.4: $2.50/M input tokens, $15/M output tokens; GPT-5.5: $5.00/M input tokens