Logan Kelly

Jun 17, 2026

AI Agent Cost Audit: A 5-Step Framework for Finding Where Your Agent Fleet Budget Actually Goes

A developer left 7 AI agents running for 2 hours and burned $200. Here's the 5-step cost audit that finds fleet budget waste before your next invoice arrives.

In October 2025, a developer building an AI-powered website tool stepped away from their desk to get coffee. They had kicked off a suite of seven autonomous agents to run a test. Two hours later, they checked their API dashboard: the bill had jumped $200. One agent had been running continuously the entire time, calling the API in a loop with no stopping condition. By the time they caught it, the money was spent.

An AI agent cost audit is a structured process for determining where agent fleet spending actually originates — which agents, which workflows, which patterns — before the invoice arrives. It is distinct from cost tracking (which records what individual LLM calls cost after the fact) and from cost enforcement (which applies runtime limits). An audit is the diagnostic layer between observing costs and controlling them.

Cost tracking is now standard on every major AI observability platform. Arize AX calculates cost per span and aggregates at the trace level. LangSmith surfaces token usage per run. Helicone shows per-model usage across API calls. These are useful starting points. But retrospective, per-trace cost data doesn't answer the questions that drive actual cost reduction:

Which agent in the fleet accounts for most of the monthly bill?
Which orchestration pattern is calling expensive models for tasks that don't require them?
Which workflow is silently looping, consuming tokens without terminating?
Is context prepended to every call when only a fraction of calls actually need it?

Answering those questions requires a structured investigation. The five steps below constitute a repeatable framework for conducting one.

Step 1: Build a Complete Inventory of What's Running

An AI agent cost audit cannot begin without enumerating the full active fleet. Most teams discover during this step that more agents are running than anyone accounted for.

A complete inventory requires knowing, for every currently active agent: what model or models it calls, how frequently it executes, what data sources it accesses, and who owns it. Many production environments contain a mix of internally built agents, experimental agents from past pilots that were never decommissioned, and vendor agents — third-party integrations that run on shared infrastructure and generate API costs that appear in the same invoice as everything else but often lack attribution.

Without a formal agent registry, the inventory step requires tracing runtime API calls back to their origin manually — a time-consuming process that also reveals why a registry is a governance prerequisite rather than an optional convenience. The inventory step also reliably surfaces the first cost driver teams consistently underestimate: test or pilot agents that were never deactivated and have been generating token costs for weeks or months.

The output of step 1 is a complete agent inventory with cost-relevant attributes for every entry. Step 2 won't work without it.
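As a rough illustration of what an inventory entry might hold, here is a minimal sketch of a registry record plus a check for the gaps this step tends to surface. The `AgentRecord` fields, the `status` values, and the sample agents are all hypothetical; a real registry would carry whatever attributes your team actually tracks.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentRecord:
    """One inventory entry. Every field name here is illustrative."""
    name: str
    models: list
    runs_per_day: float
    data_sources: list = field(default_factory=list)
    owner: Optional[str] = None      # None means nobody has claimed the agent
    status: str = "production"       # assumed values: "production", "pilot", "vendor"


def inventory_gaps(fleet):
    """Return agent names that need follow-up before attribution can start."""
    return {
        "unowned": [a.name for a in fleet if a.owner is None],
        "pilots_still_running": [
            a.name for a in fleet if a.status == "pilot" and a.runs_per_day > 0
        ],
    }


fleet = [
    AgentRecord("contract-review", ["large-model"], 40, ["dms"], "legal-eng"),
    AgentRecord("ticket-router", ["small-model"], 2880, ["helpdesk"], "support-eng"),
    AgentRecord("q3-summarizer-pilot", ["large-model"], 24, [], None, "pilot"),
]

gaps = inventory_gaps(fleet)
print(gaps)
```

In this made-up fleet, the pilot agent shows up in both lists: it has no owner and it is still executing, which is exactly the kind of entry the inventory is meant to expose.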

Step 2: Attribute Cost to Specific Agents and Workflows

With an inventory in place, attribution maps total fleet cost to the specific agents and workflows generating it.

This is where per-trace observability platforms reach their structural limit. A cost dashboard that surfaces per-trace spend is useful for debugging individual workflow failures, but it doesn't aggregate spending by agent identity across the full fleet. If twelve agents are running and three of them are generating 80% of monthly spend, the team needs fleet-level cost aggregation by agent — not per-call granularity. Those are different data models, and most observability tools are built for the latter.
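The difference between the two data models can be sketched in a few lines. The example below assumes you can export per-trace rows as `(agent_id, trace_id, cost_usd)` tuples; that export format, the agent names, and the costs are assumptions for illustration, not any platform's actual schema.

```python
from collections import defaultdict

# Hypothetical per-trace export: (agent_id, trace_id, cost_usd).
traces = [
    ("contract-review", "t1", 4.25),
    ("contract-review", "t2", 4.25),
    ("ticket-router", "t3", 0.5),
    ("ticket-router", "t4", 0.5),
    ("faq-bot", "t5", 0.5),
]


def cost_by_agent(rows):
    """Roll per-trace costs up to one total per agent."""
    totals = defaultdict(float)
    for agent_id, _trace_id, cost in rows:
        totals[agent_id] += cost
    return dict(totals)


def top_spenders(totals, share=0.8):
    """Smallest set of agents, largest first, covering `share` of fleet spend."""
    fleet_total = sum(totals.values())
    picked, running = [], 0.0
    for agent, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        picked.append(agent)
        running += cost
        if running >= share * fleet_total:
            break
    return picked


totals = cost_by_agent(traces)
print(totals)
print(top_spenders(totals))
```

The per-trace rows are the same data a trace dashboard shows; the fleet-level view only appears once they are grouped by agent identity.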

Attribution typically surfaces two categories of surprise:

High-frequency, low-cost-per-call agents that accumulate significant cost through volume. An agent making small model calls every 30 seconds can generate more monthly spend than an agent making expensive calls once an hour. Each individual trace looks inexpensive. The cumulative total is not. These agents are invisible in per-trace views precisely because no single call is remarkable enough to flag.

Context accumulation in multi-step workflows. Some orchestration patterns pass growing conversation history forward at each execution step. If each step appends a similar amount of history, the input to each call grows roughly linearly with the step number, so the cumulative cost of the workflow grows roughly quadratically. A workflow that appears to make 20 calls may be paying for 20 progressively more expensive calls, with later steps consuming 5–10× the tokens of early ones.

Both patterns are invisible without fleet-level cost attribution. Finding them is the primary output of this step.
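To put rough numbers on both surprises, here is a small arithmetic sketch. The per-call prices, the 30-day month, the 1,000-token starting prompt, and the 400 tokens added per step are made-up assumptions chosen for illustration, not measurements.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_cost(interval_seconds, cost_per_call_usd):
    """Monthly spend for an agent that makes one call every `interval_seconds`."""
    calls = SECONDS_PER_MONTH // interval_seconds
    return calls * cost_per_call_usd


def input_tokens_per_step(steps, base_tokens, added_per_step):
    """Input size of each call when history grows by a fixed amount per step."""
    return [base_tokens + added_per_step * k for k in range(steps)]


# Surprise 1: volume. A cheap call every 30 seconds vs. a costly call each hour.
chatty = monthly_cost(30, 0.001)
heavy = monthly_cost(3600, 0.05)
print(f"chatty agent: ${chatty:.2f}/month, heavy agent: ${heavy:.2f}/month")

# Surprise 2: context accumulation over a 20-step workflow.
per_step = input_tokens_per_step(steps=20, base_tokens=1000, added_per_step=400)
total_tokens = sum(per_step)
flat_tokens = 20 * 1000  # what the workflow would cost if context never grew
print(per_step[0], per_step[-1], total_tokens, flat_tokens)
```

With these assumed numbers, the 30-second agent outspends the hourly one, and the final call of the 20-step workflow carries 8.6× the input of the first, for 96,000 input tokens in total against 20,000 if the context had stayed flat.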

Step 3: Identify the Three Patterns That Drive Most Waste

Once attribution is established, pattern analysis identifies the specific behaviors responsible for disproportionate cost. In practice, most agent cost waste concentrates in three patterns:

Runaway reflection loops. Agents configured to reflect on their own output before responding re-enter the LLM at each reflection step. When the stopping condition is poorly defined or absent, the agent iterates without bound. The $200-in-two-hours incident described above shows the same failure mode: one agent calling the API in a loop with no stopping condition. A developer building AI coding tools reported a similar failure: Gemini CLI, tasked with building a banking microservice, "got lost in its own loops," and one day "ended with it racking up $300 in charges all by itself," according to his August 2025 account. Loops are the highest-severity cost pattern because they have no natural ceiling: a single malfunctioning agent running overnight can generate more cost than the rest of the fleet combined.

Model-task mismatch. Not every task requires the most capable available model. Classification, routing decisions, structured data extraction, and format conversion typically run at equivalent or better accuracy on smaller, less expensive models. Teams that default all agent calls to their largest available model because it's the path of least resistance consistently overpay for tasks that don't require it. The audit should categorize every task type in the fleet and verify that model selection matches actual task complexity.

Dead context. Context prepended to every LLM call inflates token cost across the full fleet even when most calls don't need it. Common examples: system prompts containing regulatory boilerplate included for all calls when only a fraction involve regulated workflows; RAG retrieval that fetches a fixed-size chunk regardless of whether the specific query requires all of it. Dead context is easy to overlook because its cost is distributed across thousands of calls — no single call looks expensive, but the aggregate is substantial.

Each pattern is identifiable in usage data once fleet-level attribution is in place. Finding them is a function of knowing what to look for.
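As a sketch of what "knowing what to look for" could mean in practice, the snippet below scans a list of call records for all three patterns. The `Call` fields, the task-type and model sets, and the repeat threshold are hypothetical; in particular, the `needs_boilerplate` flag assumes someone has already labeled which workflows actually use the prepended context.

```python
from collections import Counter
from dataclasses import dataclass

# Assumptions for this sketch: which task types a smaller model handles well,
# and which model names count as "large" in this fleet.
SMALL_MODEL_TASKS = {"classification", "routing", "extraction", "format_conversion"}
LARGE_MODELS = {"large-model"}


@dataclass(frozen=True)
class Call:
    agent: str
    session: str
    step: str
    task_type: str
    model: str
    boilerplate_tokens: int   # fixed context prepended to this call
    needs_boilerplate: bool   # does this call's workflow actually use it?


def find_patterns(calls, max_step_repeats=3):
    """Flag the three waste patterns in a list of Call records."""
    # Loops: the same step repeated too often within one session.
    repeats = Counter((c.agent, c.session, c.step) for c in calls)
    loops = sorted({agent for (agent, _, _), n in repeats.items() if n > max_step_repeats})

    # Model-task mismatch: a simple task routed to a large model.
    mismatches = sorted({
        c.agent for c in calls
        if c.task_type in SMALL_MODEL_TASKS and c.model in LARGE_MODELS
    })

    # Dead context: prepended tokens on calls that do not need them.
    dead_tokens = Counter()
    for c in calls:
        if c.boilerplate_tokens and not c.needs_boilerplate:
            dead_tokens[c.agent] += c.boilerplate_tokens

    return {
        "loops": loops,
        "model_task_mismatch": mismatches,
        "dead_context_tokens": dict(dead_tokens),
    }


calls = [Call("drafter", "s1", "reflect", "generation", "large-model", 0, False)] * 5
calls += [
    Call("router", "s2", "route", "routing", "large-model", 800, False),
    Call("router", "s3", "route", "routing", "large-model", 800, True),
]
report = find_patterns(calls)
print(report)
```

The thresholds here are placeholders. The point is only that each pattern reduces to a simple query once calls carry agent, session, and task attributes.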

Step 4: Enforce Budget Boundaries at the Execution Layer

A cost audit that ends with findings but no enforcement infrastructure will produce identical findings at the next audit cycle. The purpose of the audit is not a report — it is the prevention system the report justifies building.

Enforcement operates at two levels:

Hard limits per agent. Every agent in the fleet should have a defined budget ceiling — a maximum token expenditure per execution, per session, or per calendar period. When that ceiling is reached, the agent stops. Not "generates an alert that someone might review in time." Stops. Hard budget limits treat cost enforcement the way a circuit breaker treats electrical load: protection is automatic, and it doesn't depend on a human responding fast enough to matter.

Policy-layer enforcement. Beyond per-agent budget ceilings, Waxell Runtime's 50+ policy categories include policies that address the specific patterns identified in step 3: loop detection (stopping agents that re-enter the same step beyond a configured threshold), model-selection controls (preventing a low-priority task from routing to a high-cost model), and context size gates (rejecting calls whose context window exceeds a defined maximum). These policies enforce the corrections the audit identifies at the execution layer — before the call is made, not after it's logged.
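The circuit-breaker idea can be illustrated with a generic guard that checks each planned call before it runs. This is a minimal sketch and not Waxell Runtime's API: `ExecutionGuard`, `PolicyViolation`, and the token limits are hypothetical names and values, and the budget counts only input tokens to keep the example short.

```python
class PolicyViolation(RuntimeError):
    """Raised before a call is made when a limit would be breached."""


class ExecutionGuard:
    """Per-agent token ceiling plus a per-call context size gate."""

    def __init__(self, budget_tokens, max_context_tokens):
        self.budget_tokens = budget_tokens
        self.max_context_tokens = max_context_tokens
        self.used = 0

    def authorize(self, context_tokens):
        """Check a planned call; raise instead of letting it run."""
        if context_tokens > self.max_context_tokens:
            raise PolicyViolation("context size gate")
        if self.used + context_tokens > self.budget_tokens:
            raise PolicyViolation("budget ceiling reached")
        self.used += context_tokens


guard = ExecutionGuard(budget_tokens=10_000, max_context_tokens=4_000)
completed = 0
stopped_reason = None
try:
    for planned_tokens in [3_000, 3_000, 3_000, 3_000]:
        guard.authorize(planned_tokens)  # runs before the (simulated) LLM call
        completed += 1                   # stand-in for the real call
except PolicyViolation as exc:
    stopped_reason = str(exc)

print(completed, stopped_reason)
```

The fourth call never happens: the guard raises before any tokens are spent, which is the difference between enforcement and an alert that fires after the fact.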

The distinction between alerts and enforcement is critical for cost control. An alert tells the team that overspend is occurring. A policy prevents it. For runaway loops — where 30 minutes of unconstrained execution can exhaust a monthly budget — alerts are insufficient by design.

Waxell Runtime deploys policy enforcement without requiring code changes or rebuilds to existing agents. The audit findings become executable controls without modifying the application layer.

Step 5: Establish a Cost Baseline and Set Drift Thresholds

The final step converts one-time audit findings into a standing monitoring posture. The goal is not to repeat a full audit every month — it is to establish normal cost behavior for each agent so that deviations trigger specific, actionable alerts automatically.

For each agent in the fleet, define: expected cost per execution, expected execution frequency, and expected context size range. These become the baselines against which ongoing real-time telemetry is measured. When an agent's actual usage diverges from its baseline by more than a defined threshold, the alert is specific: not "spend is up this week" but "the contract-review agent's average context size increased 38% from baseline over the past seven days, suggesting context accumulation."

Specific alerts are actionable. Generic alerts are noise.
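Here is one way a baseline comparison could produce that kind of specific alert. The `Baseline` fields mirror the three quantities named above; the 25% threshold and the sample values are assumptions, and the sketch assumes every baseline value is non-zero.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    cost_per_run_usd: float
    runs_per_day: float
    avg_context_tokens: float


def drift_alerts(agent, baseline, observed, threshold=0.25):
    """Return one message per metric that moved more than `threshold`."""
    alerts = []
    for metric in ("cost_per_run_usd", "runs_per_day", "avg_context_tokens"):
        expected = getattr(baseline, metric)
        actual = getattr(observed, metric)
        change = (actual - expected) / expected  # assumes expected is non-zero
        if abs(change) > threshold:
            alerts.append(f"{agent}: {metric} changed {change:+.0%} from baseline")
    return alerts


baseline = Baseline(cost_per_run_usd=0.40, runs_per_day=50, avg_context_tokens=5000)
last_week = Baseline(cost_per_run_usd=0.42, runs_per_day=52, avg_context_tokens=6900)

alerts = drift_alerts("contract-review", baseline, last_week)
print(alerts)
```

Only the context-size metric crosses the threshold in this example, so the alert names one agent and one metric rather than reporting that spend is up.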

Establishing baselines also accelerates subsequent audits significantly. Rather than investigating the full fleet from scratch, the next audit focuses on agents whose actual usage diverged materially from established patterns — typically a small fraction of the total fleet. The scope narrows from everything to the meaningful outliers.

This step transforms cost visibility from a lagging indicator (last month's invoice) into a leading indicator (what is changing now, before it compounds into a budget surprise). The audit framework is the setup; continuous drift monitoring is the operational payoff.

How Waxell Handles This

Most AI observability tools stop at the trace level — they record what happened, per call, after the fact. Waxell is built around the full audit cycle: instrument once, observe at fleet level, enforce proactively.

Waxell Observe deploys in 2 lines of code and auto-instruments 200+ libraries with no rebuilds required. It aggregates cost and usage data at the fleet level by agent, by workflow type, and by session — delivering the attribution layer that step 2 requires without building a custom aggregation pipeline over raw trace data.

Waxell Runtime enforces the outputs of the audit: hard budget limits per agent, loop detection policies, model-selection controls, and context size gates drawn from 50+ policy categories. When the audit identifies a runaway loop as the source of 30% of last month's spend, Waxell Runtime deploys a loop detection policy to that agent without a code change or a redeployment.

Waxell Connect governs agents the team didn't build: vendor agents, third-party integrations, and MCP-native agents running on shared infrastructure. In a cost audit, vendor agents are frequently the attribution gap: they generate API costs that appear in the fleet bill but can't be attributed using instrumentation the team installed. Waxell Connect applies the same budget enforcement and policy controls to external agents with no SDK and no code changes required.

FAQ

What is an AI agent cost audit?
An AI agent cost audit is a structured process for determining where an agent fleet's spending actually originates — which agents, which workflow patterns, which behaviors. It is distinct from cost tracking (which records per-call costs retrospectively) and cost enforcement (which applies limits at runtime). An audit is the diagnostic layer between observing costs and controlling them: it identifies root causes so that the right enforcement controls can be applied.

How often should teams run a full AI agent cost audit?
For teams running more than five active agents, a structured audit at monthly cadence makes sense during initial deployment, until baselines are established and drift thresholds are configured. Once continuous anomaly detection is operating against established baselines, a full audit is typically reserved for material changes: new agents entering the fleet, model upgrades, or significant changes in usage volume. The goal of the first audit is to make routine full audits unnecessary by handing ongoing detection to automated monitoring.

Why can't we just use our observability platform's cost dashboard?
Per-call or per-trace cost dashboards record individual LLM call costs. They don't aggregate spending by agent identity across the fleet — which is the view required to do attribution and identify which specific agent or workflow is generating most of the spend. Fleet-level cost attribution requires a different data model than trace-level observability. Most platforms are built for the latter.

What's the fastest way to detect a runaway loop before it generates significant cost?
Pre-execution loop detection is the most effective control: a policy that stops any agent that re-enters the same step more than a configured number of times in a single session. Without it, loop detection relies on wall-clock time alerts or cost thresholds, both of which fire only after the loop has consumed substantial budget. Pre-execution enforcement catches a loop as soon as it crosses the configured threshold, typically within a few iterations, not after hours of unconstrained execution.
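A minimal sketch of such a check, assuming the agent framework lets you run a hook before each step. `LoopGuard`, `before_step`, and the threshold of three visits are hypothetical names and values, not a specific product's interface.

```python
from collections import Counter


class LoopDetected(RuntimeError):
    """Raised when a step is about to run more often than the policy allows."""


class LoopGuard:
    def __init__(self, max_visits_per_step=3):
        self.max_visits = max_visits_per_step
        self.visits = Counter()

    def before_step(self, session_id, step_name):
        """Call this before executing a step; it raises instead of allowing a repeat."""
        key = (session_id, step_name)
        if self.visits[key] >= self.max_visits:
            raise LoopDetected(f"{step_name} already ran {self.max_visits} times")
        self.visits[key] += 1


# Simulate an agent whose reflection step has no stopping condition.
guard = LoopGuard(max_visits_per_step=3)
llm_calls = 0
try:
    while True:
        guard.before_step("session-1", "reflect")
        llm_calls += 1  # stand-in for the real LLM call
except LoopDetected as exc:
    print("stopped:", exc)

print(llm_calls)
```

The loop that would otherwise run indefinitely is cut off after three calls, because the check runs before each step rather than reacting to elapsed time or accumulated cost.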

Does hard budget enforcement interrupt legitimate long-running workflows?
It can, if limits are set without calibration. The correct approach is to derive limits from the baselines established in step 5 — the expected cost per execution for each specific agent, with headroom built in for normal variation. A limit set at 2× the expected baseline stops pathological behavior without interrupting normal execution. Generic hard limits applied across the fleet without agent-specific calibration create false positives.
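To show why calibration matters, the sketch below derives a per-agent limit as 2× a baseline and compares it with a single fleet-wide limit. Using the median of recent runs as the "expected baseline" is an assumption of this example, as are the agent names, the cost histories, and the $0.10 generic limit.

```python
import statistics


def calibrated_limit(costs_per_run, headroom=2.0):
    """Per-execution limit: headroom times the median of recent run costs."""
    return headroom * statistics.median(costs_per_run)


# Hypothetical recent cost per execution, in USD.
history = {
    "contract-review": [0.38, 0.40, 0.42, 0.41, 0.39],
    "ticket-router": [0.002, 0.002, 0.003, 0.002, 0.002],
}

limits = {agent: calibrated_limit(costs) for agent, costs in history.items()}
print(limits)

# A single generic limit applied to every agent, for comparison.
GENERIC_LIMIT = 0.10
blocked_normal_runs = [
    agent for agent, costs in history.items() if max(costs) > GENERIC_LIMIT
]
print(blocked_normal_runs)
```

With these numbers, the generic limit would interrupt every normal contract-review run while leaving the router room to cost many times its usual amount before tripping. The calibrated limits avoid both problems.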

How does Waxell Connect help with cost attribution for vendor agents?
Waxell Connect applies cost tracking, budget limits, and enforcement policies to agents you didn't build — vendor agents, MCP-native agents, platform integrations — without requiring SDK installation or code changes in those systems. In most cost audits, vendor agents running on shared infrastructure are the attribution gap that dashboards miss: they generate API costs but lack the instrumentation to identify them at the fleet level. Waxell Connect closes that gap.

Sources

AI Agents Are Notorious: A $200 Lesson in Autonomous Systems (DEV Community, Oct 14, 2025) — Developer account of 7 autonomous agents running uninterrupted for 2 hours, generating $200 in API costs.
How to Tame Your AI Agents: From $900 in 18 Days to Coding Smarter (DEV Community, Aug 12, 2025) — Developer account of $900 in AI agent API costs over 18 days; includes Gemini loop incident generating $300 in one day.
I spent $638 on AI coding agents in 6 weeks (Hacker News, Nov 2025) — Founder/CTO account of AI coding tool costs Oct–Nov 2025.
Show HN: Agent-Audit – Lint and cost-estimate your AI agent (Hacker News) — Community-built tool for agent cost estimation.
Show HN: AgentCost – Track, control, and optimize your AI spending (Hacker News) — Open-source cost tracking and optimization tool.
OSS Tool: Hard spending limits for AI agents (Hacker News) — Open-source hard-limit enforcement for agents.
Are the costs of AI agents also rising exponentially? (Hacker News) — HN discussion on escalating agent cost patterns.
Track Costs — Arize AX Docs — Reference for Arize AX's per-span cost tracking and trace-level aggregation.
AI agent analytics: A buyer's guide — Arize AI — Competitor positioning on agent analytics.
