Logan Kelly
Most teams optimize for cheaper tokens. The actual cost driver in agentic systems is loop count × context accumulation — and cheaper models won't fix that.

GPT-4o input pricing halved in August 2024, dropping from $5 to $2.50 per million input tokens. One team's four-agent market research system still ran up $47,000 in a single month. The cheaper the tokens got, the worse the problem became — because they reinvested the savings in more agent capacity, not in anything that would stop a session gone wrong.
Most teams chasing AI cost reductions are optimizing the wrong variable. Per-token price is a fixed external factor. The variables you actually control are loop count, context window accumulation, and whether anything terminates a session before it compounds past its useful life. A 50% drop in token price means nothing if your agents run twice as many loops.
The teams that consistently underspend on AI aren't the ones with the best API contracts. They're the ones with enforcement at the execution layer, not just visibility at the billing layer.
The loop tax is the cost multiplier inherent to every agentic workload: each reasoning cycle — plan, retrieve, call, verify, respond — consumes tokens at every step, and each step's output adds to the context window that subsequent steps must process. A user request that costs $0.01 as a direct LLM call can cost $0.10–$1.00 as a multi-step agent session, not because token prices changed, but because the agent ran 10–100 LLM calls to complete it, each one processing an ever-growing input window. Agentic governance policies that enforce per-session token budgets terminate sessions before the loop tax compounds to an unacceptable level — before the next call runs, not after the session closes.
Why do AI agent costs scale differently than LLM API costs?
When you call an LLM directly, cost is linear: one call, one bill. The input tokens go in, the output tokens come back, you pay for the sum. If you move to a cheaper model, your bill drops proportionally.
Agentic systems break this model at the architecture level.
A typical agentic workflow for a non-trivial task runs through at least five LLM call types: a planning call to decompose the task, one or more retrieval calls to gather context, tool execution calls (which may each spawn sub-calls), a verification call to check the work, and a response synthesis call. That's five calls minimum — and each one processes the full accumulated input context from prior steps, not just the new information for the current step.
Here's the math that matters. Consider a 10-step ReAct agent where each step adds 1,000 tokens of new context — tool outputs, intermediate reasoning, retrieval results. Step 1 processes 1,000 input tokens. Step 2 processes 2,000. By step 10, each LLM call processes 10,000 tokens. Total input token consumption across the session: 55,000 tokens. The naïve estimate — ten calls at 1,000 tokens apiece — is 10,000 tokens. The actual session cost is 5.5× higher, and that gap widens the longer the session runs.
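The arithmetic can be checked in a few lines. This is a minimal model of the session described above, not any framework's actual accounting:

```python
# Model the session from the text: 10 steps, each adding 1,000 tokens
# of new context, with every call re-processing the accumulated window.
STEPS = 10
NEW_TOKENS_PER_STEP = 1_000

def session_input_tokens(steps: int, per_step: int) -> int:
    """Total input tokens when each call processes all prior context."""
    return sum(step * per_step for step in range(1, steps + 1))

def naive_input_tokens(steps: int, per_step: int) -> int:
    """The naive estimate: each call processes only its new context."""
    return steps * per_step

actual = session_input_tokens(STEPS, NEW_TOKENS_PER_STEP)
naive = naive_input_tokens(STEPS, NEW_TOKENS_PER_STEP)
print(actual, naive, actual / naive)  # 55000 10000 5.5
```

The gap is quadratic in session length: double the steps and the actual-to-naive ratio grows from 5.5x to 10.5x.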
To make this concrete: the tenth call alone costs as much as the first four steps combined (steps 1–4 total 10,000 tokens; step 10 is 10,000 tokens). That's the arithmetic progression at work, and it means per-token price improvements have diminishing returns as session length increases. Even a 50% price drop still leaves you paying 2.75× the modeled cost for a session that's 5.5× your estimate.
The $47,000 incident that circulated in late 2025 was a textbook loop tax case. A team deployed four LangChain agents using A2A coordination for market research. Two agents entered an endless dialogue loop — each responding to the other's clarification requests with more questions, treating failed handoffs as retries. The loop ran for 11 consecutive days, with costs escalating from $127 in week one to $891 in week two, $6,240 in week three, and $18,400 in week four, for a $47,000 total before the team shut it down. No per-session budget policy existed — only a monthly aggregate view, which saw gradual increases until the sudden spike.
What is context window accumulation and why does it compound?
Context accumulation is the specific mechanism behind most runaway agent costs, and it's the part that's structurally invisible in cost dashboards that display per-call totals.
Standard observability dashboards show you cost per LLM call. They don't surface the growth trajectory: call 1 at 1,000 tokens, call 5 at 5,000 tokens, call 10 at 10,000 tokens. Each log entry reads like the last — another "LLM call" row with a small dollar figure — but the figures are climbing with the input window. The session gets more expensive with every step, and the dashboard doesn't show that as a trend unless you're actively querying per-step token counts.
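A sketch of the per-step query such dashboards omit, assuming call logs arrive as simple (step, input_tokens, cost) tuples rather than any vendor's actual schema:

```python
# Hypothetical per-call log records: (step, input_tokens, cost_usd).
# The per-call view shows similar-looking rows; the trend only appears
# when each step's input size is compared to the one before it.
def context_growth_rate(records: list[tuple[int, int, float]]) -> float:
    """Average step-over-step growth in input tokens across a session."""
    sizes = [input_tokens for _, input_tokens, _ in records]
    deltas = [later - earlier for earlier, later in zip(sizes, sizes[1:])]
    return sum(deltas) / len(deltas) if deltas else 0.0

# The 10-step session from earlier: linear accumulation, 1,000 tokens/step.
session = [(step, step * 1_000, 0.002 * step) for step in range(1, 11)]
print(context_growth_rate(session))  # 1000.0
```

A positive, steady growth rate across many steps is exactly the "context accumulation" signature; a sudden jump usually marks a large raw tool output entering the window.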
Waxell's telemetry layer surfaces per-step token consumption as part of the full execution trace, which is what lets your team see context window growth as it's happening rather than discovering it on the invoice.
Context accumulates from three main sources in agent workflows, each of which can be managed but is routinely left unchecked:
Full conversation history appended naively. Many agent implementations pass the entire conversation history — all prior tool outputs, all intermediate reasoning steps, all retrieval results — to every subsequent LLM call. If a tool returns a 2,000-token database record on step 3, that record rides in the input window for every call from step 4 onward. By step 10, you've paid for that database record seven additional times.
Raw tool outputs included without compression. Tool calls frequently return more data than the agent needs for the next step. A web search returning 5,000 tokens of content, an API response with a 3,000-token JSON blob, a code execution output with full stack traces — all of these inflate the context window at every step they're appended. Passing a structured summary instead of the raw output can cut per-step context significantly for tool-heavy agents.
Verification and reflection loops. Agents that self-check their work run additional LLM calls on the full existing context. If verification fails and triggers a retry, you get another complete execution pass with an even larger accumulated context. An agent that re-verifies three times has paid for the same initial context four times total.
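The compression pattern from the second item above can be sketched as follows. The kept-field list and character cap are illustrative and workload-specific, not a fixed schema:

```python
import json

# Sketch of tool-output compression: instead of appending a raw tool
# result to the running context, append only the fields the next step
# needs, with a hard truncation fallback for non-JSON output.
def compress_tool_output(raw: str, keep_fields: list[str], max_chars: int = 500) -> str:
    """Reduce a raw tool result to a compact summary for the context window."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return raw[:max_chars]  # non-JSON output: truncate rather than forward
    summary = {field: record[field] for field in keep_fields if field in record}
    return json.dumps(summary)

# A 3,000-character payload shrinks to the two fields the agent needs.
raw = json.dumps({"id": 42, "status": "ok", "payload": "x" * 3_000})
compact = compress_tool_output(raw, keep_fields=["id", "status"])
print(compact)  # {"id": 42, "status": "ok"}
```

Because every appended token is re-processed by every subsequent call, a compression step pays for itself repeatedly: shrinking a step-3 output saves tokens on every call from step 4 onward.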
Anthropic's prompt caching (cache_control) helps here: cache reads are billed at 10% of the standard input rate for matching prefixes. That's a real cost reduction for long, stable system prompts. But caching reduces the per-token price on cached content — it doesn't reduce loop count. The loop tax still applies; it's now cheaper per iteration.
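That claim can be sanity-checked with a quick cost model. The prices and token counts below are placeholders for illustration, not a quote from any price list; only the 10% cache-read ratio comes from Anthropic's documentation:

```python
# Illustrative model: a 10-step session with a stable 5,000-token cached
# prefix. Cache reads are billed at 10% of the base input rate (per
# Anthropic's prompt caching docs); the base rate here is a placeholder.
BASE_RATE = 3.00 / 1_000_000       # $ per input token (placeholder)
CACHE_READ_RATE = BASE_RATE * 0.10
PREFIX = 5_000                     # stable cached prefix, in tokens
NEW_PER_STEP = 1_000               # fresh context added each step

def session_cost(steps: int, cached: bool) -> float:
    cost = 0.0
    for step in range(1, steps + 1):
        # The accumulating context is new each step and never cache-hits here.
        cost += step * NEW_PER_STEP * BASE_RATE
        # The stable prefix is the only part the cache discounts.
        cost += PREFIX * (CACHE_READ_RATE if cached else BASE_RATE)
    return cost

print(session_cost(10, cached=False))
print(session_cost(10, cached=True))
```

The prefix term drops 10x with caching, but the accumulating-context term is untouched: the loop tax survives the discount, which is the article's point.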
Why does cheaper-model arbitrage fail for agentic workloads?
Here's the counterintuitive part: LLM price drops tend to worsen the loop tax problem for most teams, because teams use the savings to expand agent usage rather than to build enforcement.
When GPT-4o input pricing halved in August 2024, the rational response looked like: "we can run the same workloads at half the cost." Many teams did run the same workloads cheaper — and then expanded scope, raised concurrency, and gave agents longer autonomous task chains. The per-token savings were reinvested in more agent capability, not in per-session cost ceilings.
Model routing compounds this. Smart routing directs simpler tasks to cheap, fast models — GPT-4.1 Nano at $0.10 per million tokens is genuinely economical — and complex tasks to larger ones. This is the right strategy for per-call cost efficiency. But routing by task complexity doesn't constrain session duration. A simple task routed to a $0.10/million model can still enter a retry loop. A $0.10/million model running 1,000 loops costs exactly as much as a $10/million model running 10 loops when the per-loop context is identical.
Routing optimizes cost-per-call. What matters for runaway spend is cost-per-session. These aren't the same metric.
The FinOps Foundation's 2026 State of FinOps report captures the gap: 98% of organizations now manage some form of AI spend, up from 63% the prior year — but the same report found only 44% have financial guardrails in place. The monitoring adoption is near-universal. The enforcement is not. Most teams know their monthly AI bill to the dollar. Far fewer have defined the maximum any single agent session is permitted to spend before something stops it.
Dashboards tell you what you spent. Per-session enforcement stops you from spending it. That's not a subtle distinction — it's the difference between a monitoring capability and a governance capability.
What does actual cost enforcement look like — and why isn't alerting enough?
Budget alerting — the kind that Helicone, LangSmith, and Braintrust provide — is reactive by design. The alert fires when cumulative spend crosses a threshold. The session that generated the spend is already complete. Helicone's budget alerts work well for monthly budget management; they don't terminate an individual session mid-execution when it hits a per-session ceiling.
This isn't a criticism of those tools. They're built for cost visibility, and they deliver it accurately. The architectural gap is structural: alerting evaluates at the billing layer, which sits outside the execution loop. Per-session enforcement requires a layer that evaluates before each agent step, not after the session finishes.
Real cost enforcement for agents needs two components:
Per-session token budgets. Define a maximum token allocation for any single session — 10,000 tokens, or whatever threshold fits your workload. When a session hits that ceiling, the next LLM call doesn't run. The check happens at the call, not after the session closes. The difference between a session that costs $0.10 and one that costs $10 is frequently a single error-retry loop that nothing intercepted.
Loop circuit breakers. Separately from absolute token budgets, a circuit breaker fires when a session executes more than N consecutive tool calls with the same signature — a reliable signal for error-retry behavior. This catches the $47,000 scenario before it even approaches the token budget ceiling, because the loop pattern is detectable from call signatures before the cost accumulates to alarming levels.
Both mechanisms are governance policies, not monitoring alerts. They act on the session before cost accretes, not after it's logged.
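Both policies can be sketched as a single pre-call gate. Class and method names here are illustrative, not Waxell's API; the point is that the check runs before the next LLM call, inside the execution loop:

```python
class PolicyViolation(Exception):
    """Raised when a governance policy blocks the next call."""

class SessionGovernor:
    """Illustrative pre-call gate: token budget plus loop circuit breaker."""

    def __init__(self, token_budget: int, max_identical_calls: int):
        self.token_budget = token_budget
        self.max_identical_calls = max_identical_calls
        self.tokens_used = 0
        self.last_signature = None
        self.repeat_count = 0

    def check_before_call(self, estimated_tokens: int, tool_signature: str) -> None:
        """Raise BEFORE the call runs if either policy would be violated."""
        if self.tokens_used + estimated_tokens > self.token_budget:
            raise PolicyViolation(
                f"token budget: {self.tokens_used} + {estimated_tokens} "
                f"exceeds {self.token_budget}"
            )
        if tool_signature == self.last_signature:
            self.repeat_count += 1
        else:
            self.last_signature, self.repeat_count = tool_signature, 1
        if self.repeat_count > self.max_identical_calls:
            raise PolicyViolation(f"loop breaker: {tool_signature!r} repeated")

    def record_call(self, actual_tokens: int) -> None:
        self.tokens_used += actual_tokens

governor = SessionGovernor(token_budget=10_000, max_identical_calls=3)
# An error-retry loop trips the circuit breaker long before the budget:
try:
    for _ in range(5):
        governor.check_before_call(estimated_tokens=500, tool_signature="search('q')")
        governor.record_call(actual_tokens=500)
except PolicyViolation as violation:
    print(f"session terminated: {violation}")
```

Note the breaker fires on the fourth identical call, when only 1,500 of the 10,000 budgeted tokens have been spent: the two policies catch different failure shapes, which is why they're listed as separate components.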
The enforcement layer also produces something alerting doesn't: a durable record of the enforcement event. When Waxell terminates a session on a cost policy, that termination is logged as a governance event — what policy triggered, what threshold was hit, which session was stopped. For teams with CFO or engineering leadership oversight of AI spend, this record changes the conversation from "our monitoring noticed this" to "our governance stopped this."
How Waxell handles this
Waxell's per-session token budgets are defined once as governance policies and enforced at the execution layer across every agent session, regardless of framework. When a session hits its token ceiling, the next LLM call is blocked — not logged as a budget exceedance after the fact. Waxell's telemetry captures per-step token consumption as part of the full execution trace, making context window growth visible in real time rather than as a post-session total. Three lines of SDK to instrument; policies defined once in the governance layer; no agent code changes required when limits change.
When a session terminates on a cost policy, the enforcement event is embedded in the execution trace alongside every other session event: what policy evaluated, what threshold was hit, what action Waxell took. Your finance team can audit it. Your on-call engineer didn't have to manually catch the loop.
Frequently Asked Questions
Why do AI agent costs spiral out of control?
Agent costs spiral because agentic systems run in multi-step loops, and each step adds to the context window that subsequent steps must process. Unlike direct LLM calls where cost is linear, a 10-step agent session accumulates input tokens at each step — the final calls are significantly more expensive than the first. Runaway costs typically originate from error-retry loops (an agent retrying a failing tool call repeatedly), naïve context accumulation (appending full conversation history and raw tool outputs to every call), or multi-agent systems where agents exchange full message histories without compression. Without a per-session enforcement layer, there's no mechanism to stop the accumulation mid-session.
What is a token budget for AI agents?
A token budget is a maximum token allowance for a single agent session. When the session hits the ceiling, the next LLM call is blocked rather than executed. Unlike monthly or daily aggregate caps, a per-session budget constrains individual session runaway — which is where most dramatic cost incidents originate. Token budgets work at the execution layer: they evaluate before each LLM call, not after the session closes. Setting an effective one requires enforcement at the infrastructure layer, not just a post-session monitoring alert.
How do I set per-session cost limits for AI agents?
Per-session cost limits require a governance layer at the execution level that evaluates token consumption before each agent step. The common approaches: build the limit check into agent code (fragile — breaks when code changes, doesn't apply consistently across frameworks), use an API proxy that enforces at the HTTP level (works for LLM calls but misses multi-step context tracking and tool-call accounting), or use a governance control plane like Waxell that enforces session limits as a policy defined once and applied across all agent sessions regardless of framework. The policy approach decouples enforcement from agent code — when limits change, you update the policy, not every agent.
What's the difference between cost alerting and cost enforcement for AI agents?
Cost alerting fires after spend has occurred — it tells you your monthly budget hit 80%, or that a session exceeded a threshold after it completed. Cost enforcement terminates sessions before they exceed their ceiling — the session reaches its limit and the next call does not execute. Alerting is a monitoring capability; enforcement is a governance capability. Most observability platforms (LangSmith, Helicone, Braintrust, Arize) offer alerting at the aggregate level. Per-session enforcement requires a layer inside the execution loop, not one observing it from outside.
Why don't cheaper models solve the AI agent cost problem?
Cheaper models reduce cost per token, not cost per session. If your agent enters a retry loop or accumulates context across 50 steps, a 50% price reduction still means paying for those 50 loops at half the prior rate — which may or may not keep you within budget depending on session volume. Model routing improves per-call efficiency but doesn't constrain session duration or loop count. The cost driver for most runaway incidents isn't token price; it's unconstrained execution behavior. A session that's 5.5× more expensive than your model predicted is still 2.75× over budget after a 50% price drop.
How do I stop a runaway AI agent from burning my API budget?
Three mechanisms work in combination: per-session token budgets that block execution when a session hits a ceiling (governance layer), loop circuit breakers that detect consecutive identical-signature tool calls (a reliable signal for error-retry loops), and context compression that prevents the context window from growing unconstrained across steps (engineering pattern applied at the agent level). The first two are governance policies enforced at the execution layer; the third is an agent implementation choice. Budget dashboards and aggregate alerts cannot stop a session already running — they report after the fact. Stopping runaway sessions requires enforcement inside the execution loop.
Sources
Medium / CodeOrbit, "Our $47,000 AI Agent Production Lesson: The Reality of A2A and MCP" (2025) — https://medium.com/@theabhishek.040/our-47-000-ai-agent-production-lesson-the-reality-of-a2a-and-mcp-60c2c000d904
Tech Startups, "AI Agents Horror Stories: How a $47,000 AI Agent Failure Exposed the Hype and Hidden Risks of Multi-Agent Systems" (Nov 14, 2025) — https://techstartups.com/2025/11/14/ai-agents-horror-stories-how-a-47000-failure-exposed-the-hype-and-hidden-risks-of-multi-agent-systems/
FinOps Foundation, State of FinOps 2026 — https://data.finops.org (98% manage AI spend, 44% have guardrails)
Anthropic, Prompt Caching — Claude API Documentation (2025) — https://platform.claude.com/docs/en/build-with-claude/prompt-caching
Silicon Data, "OpenAI API Pricing Per 1M Tokens" — https://www.silicondata.com/use-cases/openai-api-pricing-per-1m-tokens/
LangChain, State of Agent Engineering (2026) — https://www.langchain.com/state-of-agent-engineering