Logan Kelly

Mar 2, 2026

The Hidden Cost of AI Agents: Why Token Spend Spirals and How to Control It

Engineers think in requests. Agents run in loops. Here's the math behind why agent costs explode, and four practical strategies to control token spend in production.

There's a mental model mismatch that causes a predictable category of production disasters, and it comes down to this: engineers think in requests. Agents run in loops.

When you're building a traditional API-backed feature, cost math is simple. One user action equals one API call. One API call has a known cost. You multiply by your DAU and you have your monthly bill, approximately. Budget accordingly.

When you're building agents, this model falls apart. An agent doesn't make one call. It reasons, retrieves, calls tools, observes the results, reasons again, calls more tools, synthesizes — and each of those steps is a context window, and each context window grows as the conversation goes on because everything that came before gets appended. The token count isn't static. It compounds.

AI agent cost control is the practice of defining and enforcing token budgets, session limits, and spend guardrails at the infrastructure layer — not inside agent code. Unlike traditional API cost management, agent costs don't scale linearly with requests: they compound with context accumulation, loop behavior, and tool call overhead, making proactive enforcement essential. A 5-step agent loop serving 200 concurrent users can cost 10× what naive per-request estimates suggest. (See also: What is agentic governance →)

Why Do AI Agent Costs Multiply With Each Loop?

Take a modest agent. It handles customer support queries. For a typical question, it goes through this cycle: initial reasoning call, two tool calls (retrieve relevant docs, look up account status), a synthesis call to compose the response. Four LLM calls. Say your context window at each step is, on average, 4,000 tokens. That's 16,000 tokens for one resolved query.

That doesn't sound terrible. At GPT-4 pricing, that's maybe 50 cents for a complex query. You're building a support product. You're charging for it. Fine.

Now add scale. You're serving 200 concurrent users. You're now at 3,200,000 tokens per "round" of queries. If an average session involves five exchanges before it's resolved, you're at 16,000,000 tokens per hour of support traffic.
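The arithmetic above can be sketched in a few lines, using the assumptions from the example (4 LLM calls per exchange, a 4,000-token average context, 200 concurrent users, 5 exchanges per session):

```python
# Back-of-envelope token math for the support-agent example.
# All figures are the article's illustrative assumptions, not measurements.

CALLS_PER_EXCHANGE = 4        # initial reasoning + 2 tool calls + synthesis
AVG_CONTEXT_TOKENS = 4_000    # average context window per call
CONCURRENT_USERS = 200
EXCHANGES_PER_SESSION = 5

tokens_per_exchange = CALLS_PER_EXCHANGE * AVG_CONTEXT_TOKENS       # 16,000
tokens_per_round = tokens_per_exchange * CONCURRENT_USERS           # 3,200,000
tokens_per_session_set = tokens_per_round * EXCHANGES_PER_SESSION   # 16,000,000

print(f"{tokens_per_exchange:,} tokens per resolved exchange")
print(f"{tokens_per_round:,} tokens per round of concurrent queries")
print(f"{tokens_per_session_set:,} tokens for a full set of sessions")
```

Swapping in your own call counts and context sizes turns this into a first-pass budget estimate; the point of the article is that the real bill is dominated by what this estimate leaves out.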

Now add the thing nobody budgets for: variance. Some queries don't resolve in five exchanges. Some agents get into loops — they call a tool, the tool returns something unexpected, they reason about it, call the tool again, get a similar result, reason again. Without a hard stop, a single looping session can consume 100x the token budget of a normal one. You don't notice it until it's in the tail of your cost distribution, and by the time it's notable in your aggregate, it's happened hundreds of times.
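A minimal defense against the looping failure mode is a guard that caps total steps and flags an agent re-issuing the same tool call. This is a sketch with a stubbed agent loop, not any framework's real API; the step cap and repeat-detection rule are illustrative choices:

```python
# Sketch of a loop guard: cap total iterations and stop when the agent
# repeats an identical tool call. The step list stands in for an agent loop.

MAX_STEPS = 12  # illustrative hard stop on steps per session

def run_with_guard(steps):
    """Iterate over (tool, args) steps; halt on a repeat or the step cap."""
    seen = set()
    for i, (tool, args) in enumerate(steps):
        if i >= MAX_STEPS:
            return "stopped: step cap reached"
        signature = (tool, args)
        if signature in seen:
            return f"stopped: repeated call {tool}({args})"
        seen.add(signature)
    return "completed"

# A session that re-issues the same lookup gets cut off early,
# instead of burning 100x a normal session's tokens.
looping = [("lookup", "acct-7"), ("reason", "r1"), ("lookup", "acct-7")]
print(run_with_guard(looping))
```

Exact-repeat detection is deliberately crude; in practice you might allow one retry or compare near-identical arguments, but even this crude version bounds the tail of the cost distribution.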

This is not a hypothetical. Every company that has shipped agents at scale has a version of this story.

Where Do AI Agent Cost Spirals Actually Come From?

Context accumulation. The longer a conversation runs, the more tokens get included in every subsequent call. A 20-turn conversation doesn't cost 20x a 1-turn conversation — it costs significantly more, because every turn includes the entire history. If you're not actively managing context window size (through summarization, pruning, or hard turn limits), you're letting cost compound with every exchange.
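One way to see the compounding: if each exchange adds roughly a fixed number of tokens and every call resends the full history, total tokens grow quadratically with turn count, not linearly. A toy model, assuming an illustrative 500 tokens added per turn:

```python
# Toy model of context accumulation: every call resends the full history.
# TOKENS_PER_TURN is an illustrative assumption, not a measured value.

TOKENS_PER_TURN = 500

def total_tokens(turns: int) -> int:
    """Sum of context sizes across all calls when history is never pruned."""
    # Call k carries k turns of history, so the total is t * N(N+1)/2.
    return sum(TOKENS_PER_TURN * k for k in range(1, turns + 1))

one_turn = total_tokens(1)       # 500 tokens
twenty_turns = total_tokens(20)  # 105,000 tokens
print(f"20-turn conversation costs {twenty_turns // one_turn}x a 1-turn one")
```

Under this model a 20-turn conversation costs 210x a 1-turn one, not 20x, which is exactly why summarization and pruning pay for themselves.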

Tool call overhead. Each tool call requires the agent to explain what it's calling and why, get the result back, and reason about the result — all of which adds tokens. Agents that call tools aggressively, or that call tools whose results are verbose, pay a significant overhead per call. A tool that returns 500 tokens when 50 would do is costing you 10x more than necessary.

Retries. When a tool call fails, or when the model's output doesn't pass a validation step, the agent retries. Retries are full calls at full cost. An agent that retries aggressively on flaky tools can run up a large bill in a short time.

Concurrent sessions at peak. If your agent is customer-facing, you have peak hours. During peak, the number of concurrent sessions multiplies your per-session costs. If your agent also tends to take longer (more reasoning steps, more tool calls) on high-complexity queries, and high-complexity queries cluster at peak... you can see where this goes.

What Are the Most Effective Strategies for Controlling AI Agent Costs?

1. Session-level token budgets with hard stops. The most direct intervention. Set a maximum token budget per session. When a session approaches the limit, either summarize the conversation so far and continue with a compressed context, or gracefully terminate and hand off to a human. The key word is "hard stop" — a soft warning that the agent can ignore doesn't help. The governance layer needs to be able to terminate a session before it burns past the ceiling.

This sounds aggressive. In practice, a well-tuned budget catches the outlier sessions (the loops, the unusually complex queries) without affecting the median session at all.
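A session-level budget with a genuine hard stop can be sketched as follows. The class and exception names here are illustrative, not a real library's API; the point is that the ceiling check happens before the call is made, outside the agent's control:

```python
# Minimal sketch of a session-level token budget with a hard stop.
# Names (SessionBudget, BudgetExceeded) are illustrative, not a real API.

class BudgetExceeded(Exception):
    """Raised when a session would exceed its hard token ceiling."""

class SessionBudget:
    def __init__(self, ceiling: int, warn_at: float = 0.6):
        self.ceiling = ceiling
        self.warn_at = warn_at
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record spend; enforce the ceiling before the call goes out."""
        if self.spent + tokens > self.ceiling:
            raise BudgetExceeded(f"{self.spent + tokens} > {self.ceiling}")
        self.spent += tokens
        if self.spent >= self.warn_at * self.ceiling:
            print(f"warning: session at {self.spent}/{self.ceiling} tokens")

budget = SessionBudget(ceiling=50_000)
budget.charge(16_000)      # normal exchange: fine
budget.charge(16_000)      # crosses 60% of the ceiling: fires a warning
try:
    budget.charge(40_000)  # would blow the ceiling: hard stop
except BudgetExceeded:
    print("session terminated before overspend")
```

In a real deployment the `except` branch would trigger the compress/terminate logic described below rather than just printing, but the enforcement shape is the same: the model never gets to ignore the limit.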

2. Context pruning and compression. Rather than including the full conversation history in every call, summarize older turns. After every N exchanges, run a lightweight summarization call that compresses the conversation history, replace the raw history with the summary, and proceed. You lose some fine-grained context but retain the semantic substance. For most agent tasks, this tradeoff is highly favorable.

The cost of the summarization call is paid back immediately in the reduced token count of every subsequent call. For long sessions, the savings compound.
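The summarize-and-replace loop can be sketched like this. The thresholds are illustrative, and `summarize` is a placeholder for the lightweight LLM call, not a real function:

```python
# Sketch of context pruning: once history exceeds a threshold, replace
# older turns with a summary. `summarize` stands in for a cheap model call.

SUMMARIZE_EVERY = 6   # illustrative: compress after this many raw turns
KEEP_RECENT = 2       # keep the last few turns verbatim for fine detail

def summarize(turns: list[str]) -> str:
    # Placeholder for a lightweight summarization call.
    return f"[summary of {len(turns)} earlier turns]"

def maybe_compress(history: list[str]) -> list[str]:
    """Compress history once it exceeds the threshold; otherwise pass through."""
    if len(history) <= SUMMARIZE_EVERY:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(older)] + recent

history = [f"turn {i}" for i in range(1, 9)]   # 8 raw turns
history = maybe_compress(history)
print(history)  # 1 summary plus 2 recent turns instead of 8 raw turns
```

Every call after the compression carries the short summary instead of six raw turns, which is where the payback comes from.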

3. Tiered model routing. Not every step in your agent's reasoning requires your most capable (and most expensive) model. Tool selection, parameter extraction, routine classification — these can often be handled by smaller, faster, cheaper models. Reserve GPT-4-class models for reasoning steps that genuinely require them: complex synthesis, nuanced judgment calls, multi-step reasoning chains.

This requires knowing, at design time, which steps in your agent's workflow actually need heavy reasoning. It's worth the analysis. The cost difference between a frontier model and a smaller model on simple tasks is often 10x or more.
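Because the step-to-model mapping is decided at design time, the router itself can be trivially simple. A sketch, with made-up model names and an assumed workflow (the defaulting choice is a design decision, not a requirement):

```python
# Sketch of tiered model routing: cheap steps go to a small model,
# heavy reasoning to a frontier model. Model names and the step-to-tier
# mapping are illustrative assumptions.

CHEAP_MODEL = "small-fast-model"
FRONTIER_MODEL = "frontier-model"

# Decided at design time, not at runtime by the agent itself.
STEP_TIERS = {
    "tool_selection": CHEAP_MODEL,
    "parameter_extraction": CHEAP_MODEL,
    "classification": CHEAP_MODEL,
    "synthesis": FRONTIER_MODEL,
    "judgment": FRONTIER_MODEL,
}

def route(step: str) -> str:
    # Unknown steps default to the frontier model: fail expensive, not wrong.
    return STEP_TIERS.get(step, FRONTIER_MODEL)

print(route("tool_selection"))  # routed to the cheap tier
print(route("synthesis"))       # routed to the frontier tier
```

The default-to-frontier choice trades a little cost for safety on steps you haven't classified yet; the opposite default is defensible for internal tools where a weak answer is recoverable.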

4. Real-time spend visibility with alerting before thresholds. This sounds obvious. It isn't standard practice. Most teams find out about cost anomalies after the billing period closes. You need to know about a spiraling session while it's happening — specifically, before it hits a threshold that matters.

Set a per-session alert at 60% of your budget ceiling. Set a fleet-level alert when aggregate spend for the hour is trending beyond your daily allocation. These alerts should go somewhere someone is actually watching, not just into a log.
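Both checks are simple enough to sketch directly. The ceilings and the hourly-rate extrapolation below are illustrative assumptions; in practice the alert sink would be a pager or chat hook rather than a print:

```python
# Sketch of pre-threshold alerting: a per-session alert at 60% of the
# ceiling, and a fleet alert when hourly spend trends past the daily budget.

SESSION_CEILING = 50_000            # illustrative per-session token ceiling
DAILY_TOKEN_BUDGET = 500_000_000    # illustrative fleet-wide daily budget

def session_alert(spent: int) -> bool:
    """Fire when a session crosses 60% of its ceiling."""
    return spent >= 0.6 * SESSION_CEILING

def fleet_alert(hourly_spend: int, hours_in_day: int = 24) -> bool:
    """Fire when the current hourly rate would exceed the daily allocation."""
    return hourly_spend * hours_in_day > DAILY_TOKEN_BUDGET

print(session_alert(35_000))    # past 60% of the ceiling: alert now
print(fleet_alert(25_000_000))  # 25M/h * 24h outruns 500M/day: alert now
```

The fleet check extrapolates naively from the current hour; a production version would account for known traffic shape, but even the naive version fires before the billing period closes, which is the whole point.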

What Do AI Agent Budget Guardrails Look Like in Practice?

A budget guardrail isn't a feature you add to your agent code. It's a policy at the infrastructure layer — something that observes session costs in real time, tracks against a defined budget, and takes action when thresholds are reached.

The action options are: warn (send an alert but let the session continue), compress (trigger a context compression step to reduce future call costs), cap (refuse to initiate new LLM calls until the session ends), or terminate (end the session and hand off appropriately).

Which action is right depends on your product. A customer support agent might compress first, then cap; termination would be a last resort because it creates a bad experience. An internal automation agent might terminate cleanly — the job can be requeued when budget resets.

The important thing is that these decisions are made at policy definition time, not ad hoc in the moment. The governance layer enforces the policy. The policy is the decision.

Who Owns AI Agent Cost Governance?

One more thing worth naming: cost governance for agents requires cross-functional alignment that most organizations haven't established yet.

Engineering sets the technical budget ceiling. Finance needs to understand that agent costs are variable in ways traditional SaaS infrastructure costs are not. Product needs to understand the cost implications of features that increase session depth or tool call frequency. Without this alignment, you get a cost spike, a panicked response, and a retroactive patch — instead of a thoughtful policy that was in place from the start.

The time to build that alignment is before the spike. It's a boring conversation to have. Have it anyway.

Agent costs are controllable. The math isn't mysterious. The tools exist. What's usually missing is the governance layer to enforce the policies that translate knowledge into behavior — automatically, in production, without requiring an engineer to be watching.

How Waxell handles this: Waxell's Cost policy type enforces session-level token budgets in real time — tracking spend as it accumulates and triggering configurable responses (compress, cap, or terminate) when sessions approach defined thresholds. Fleet-level spend alerts fire before your daily allocation is consumed, not after. You define the budget once at the policy layer; Waxell enforces it across every session without touching agent code. See how it works →

Frequently Asked Questions

Why do AI agent costs spiral in production? Agent costs spiral because they compound rather than scale linearly. Each turn in a conversation extends the context window, making every subsequent LLM call more expensive. Tool calls add overhead at each step. Loops — where an agent retries a failing operation — can run up 100× the normal session cost. Without a governance layer enforcing budget ceilings, a single outlier session can consume as much as hundreds of typical sessions.

How do you calculate the real cost of an AI agent session? Multiply the average number of LLM calls per session by the average context window size at each call. For a typical 4-step agent handling a query with a 4,000-token average context per step, that's 16,000 tokens per resolved query. At 200 concurrent users, 5 exchanges per session, you're at roughly 16 million tokens per hour of traffic — before accounting for variance from long sessions or loops. Most teams' initial estimates are off by 5–10× before they do this math explicitly.

What is a token budget for AI agents? A token budget is a hard limit on the total tokens a single agent session is allowed to consume. When the session approaches the limit, the governance layer triggers a predefined response: summarize the conversation history and continue with compressed context, stop accepting new LLM calls until the session ends, or gracefully terminate and hand off to a human. A well-tuned budget catches outlier sessions — loops, unusually long conversations — without affecting typical sessions at all.

What causes AI agent cost explosions? Four mechanisms cause most cost explosions: context accumulation (full conversation history included in every subsequent call), tool call overhead (verbose tool responses inflating every context window), retry behavior (failed calls at full cost), and concurrent session peaks (multiplied per-session costs during high traffic). Any one of these is manageable. When they compound — a peak traffic moment with verbose tools and some agents in retry loops — costs can spike dramatically in minutes.

How do you stop an AI agent from going over budget? Budget enforcement has to be at the infrastructure layer, not in the agent's system prompt. The governance layer observes token spend in real time, compares against the defined budget, and triggers hard stops — not suggestions the model can override. Effective implementations alert at 60% of the budget ceiling (while there's still time to act), compress context at 80%, and cap or terminate at the ceiling. The key word is "hard stop": a soft warning the model can reason around does not constitute cost governance.

Agentic Governance, Explained

Waxell

Waxell provides a governance and orchestration layer for building and operating autonomous agent systems in production.

© 2026 Waxell. All rights reserved.

Patent Pending.
