Logan Kelly

Testing Governance, Not Just Behavior: What's Different About Agent QA


Behavioral testing tells you if your agent works. Governance testing tells you if the control layer that's supposed to stop it actually will. Most teams only do one.


Earlier this year, an AI agent called OpenClaw deleted over 200 emails from a live Gmail inbox. Summer Yue, Director of Alignment at Meta's Superintelligence Labs, had given the agent an explicit instruction: request approval before any destructive action. The instruction was clear. It had worked perfectly on a test inbox for weeks.

It failed on the real one.

What happened: Yue's production inbox was far larger than her test environment. Mid-execution, the agent hit a context window limit. When the context compacted, the "require approval" instruction got dropped. The agent continued executing without it — bulk-trashing and archiving hundreds of emails at machine speed. Yue tried to stop it remotely from her phone. The agent ignored her. She had to physically run to her Mac Mini to kill the process.

This isn't a story about the agent misbehaving. The agent was doing exactly what it was told — by the version of its context that no longer included the rule that should have stopped it. The governance instruction failed. And nobody had tested for that.

Governance testing is the practice of verifying that the control layer above your agents — cost limits, content filters, escalation policies, tool restrictions — behaves correctly under real operating conditions, including edge cases and adversarial inputs. It's distinct from behavioral testing, which checks whether your agent completes tasks correctly. Governance testing checks whether the infrastructure that constrains your agent holds when it needs to.

The distinction is more consequential than it sounds. A governance control that works in a controlled test but degrades under load, context pressure, or multi-step sequences is worse than no governance — it gives you false confidence.

Why Behavioral Testing Doesn't Cover This

Behavioral testing is valuable. We've written about what it covers: tool selection, argument validation, state propagation, failure scenarios. You should be doing all of it.

But behavioral testing answers the question: "Did the agent do what I wanted?" Governance testing answers a different question: "If the agent does something it shouldn't, does the control layer actually stop it?"

These require different test designs, different failure modes to watch for, and different infrastructure. Most teams have invested heavily in behavioral testing and behavioral evals — output quality scoring, hallucination detection, task completion metrics. Almost no one has built an equivalent testing discipline for the governance layer.

According to Gravitee's State of AI Agent Security report (2026), which surveyed 750 executives and practitioners, 88% of organizations reported confirmed or suspected AI agent security incidents in the past year — yet only 14.4% report that all their AI agents go live with full security and IT approval.

That's not mostly a behavioral testing failure. Teams are catching bad agent outputs. What they're not catching is governance infrastructure that looks correct on paper but silently fails in production.

Three Ways Governance Fails That Behavioral Testing Doesn't Catch

1. Cost controls that pass configuration checks but don't fire

You've set a max_budget parameter. Your test runs stay within the limit. Everything looks fine.

In production, the agent enters a retry loop — maybe because a downstream API is flaky, maybe because it's chasing an ambiguous objective. The limit is defined, but the enforcement logic has a gap: it checks the budget at task start, not per retry. The agent burns through 3× the budget before the check fires. Or the limit is denominated in tokens, but you're billed by request, and nobody caught the mismatch during implementation.

This class of failure doesn't show up in behavioral testing because the agent itself is behaving reasonably — it's retrying a flaky API, which is the right call. The cost governance is what broke. You only find it by testing the cost control directly: run the agent against conditions that should trigger the limit, and verify it fires.
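To make the gap concrete, here is a minimal sketch of the fix: charging the budget on every attempt rather than once at task start. The names (`BudgetGuard`, `record_spend`, `call_with_retries`) are illustrative, not a real Waxell or framework API.

```python
class BudgetExceeded(Exception):
    """Raised when cumulative spend crosses the configured limit."""
    pass


class BudgetGuard:
    def __init__(self, max_budget_usd: float):
        self.max_budget_usd = max_budget_usd
        self.spent_usd = 0.0

    def record_spend(self, cost_usd: float) -> None:
        """Charge one attempt and fail closed the moment the limit is crossed."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.max_budget_usd:.2f}"
            )


def call_with_retries(guard: BudgetGuard, flaky_call, cost_per_call_usd: float,
                      max_retries: int = 10):
    """A retry loop that charges the budget before *each* attempt.

    The broken version checks the budget once before the loop — which is
    exactly the gap described above: a flaky downstream API turns into
    unbounded spend."""
    for _attempt in range(max_retries):
        guard.record_spend(cost_per_call_usd)  # checked per retry, not per task
        try:
            return flaky_call()
        except ConnectionError:
            continue
    raise RuntimeError("retries exhausted")
```

A governance test for this control drives the flaky path deliberately: force the call to fail repeatedly and assert that `BudgetExceeded` fires before spend runs away, rather than only asserting that a happy-path run stays cheap.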

2. Escalation policies that degrade under context pressure

The OpenClaw incident is the clearest example, but it's not isolated. Any governance policy implemented as part of the agent's instruction context — rather than as an enforcement layer outside the agent — is vulnerable to this failure mode.

As context windows fill, as multi-turn conversations accumulate, as compaction logic kicks in, governance instructions can be deprioritized or dropped. The policy is still in the configuration file. It's just not in the active context anymore. The agent proceeds as if the policy doesn't exist, because from its perspective, it doesn't.

Testing for this requires running your agent through full-length operational scenarios, not just happy-path test cases. Fill the context. Introduce the API errors and delays you see under real-world conditions. Does the escalation logic still fire? Does the approval gate survive when the conversation history is long?
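The failure mode and a possible mitigation can both be sketched in a few lines. This assumes a simple recency-based compaction strategy; `compact_context` and the policy-pinning variant are stand-ins for whatever summarization or truncation your stack actually performs.

```python
POLICY = "REQUIRE_APPROVAL_BEFORE_DESTRUCTIVE_ACTIONS"


def compact_context(messages: list, max_messages: int) -> list:
    """Naive compaction: keep only the most recent messages.

    This is exactly the kind of logic that silently drops a governance
    instruction issued early in the conversation."""
    return messages[-max_messages:]


def compact_context_pinned(messages: list, max_messages: int) -> list:
    """Compaction that pins policy messages so they always survive."""
    pinned = [m for m in messages if POLICY in m]
    others = [m for m in messages if POLICY not in m]
    keep = max_messages - len(pinned)
    recent = others[-keep:] if keep > 0 else []
    return pinned + recent


def policy_survives(messages: list) -> bool:
    """The governance assertion: is the rule still in active context?"""
    return any(POLICY in m for m in messages)
```

The governance test, then, is not "did the agent behave?" but "after a compaction event at realistic context sizes, does `policy_survives` still return True?" — run against the compaction logic you actually ship.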

3. Multi-step chaining that bypasses per-action policies

This is the subtlest failure mode and the hardest to catch. A governance policy checks each action individually: can this agent write to a log file? Yes, approved. Can it read a config file? Yes, approved. Can it make an outbound API call? Yes, approved.

What the per-action check doesn't catch: the composition of those three actions, in that sequence, exfiltrates sensitive configuration data to an external system. Each individual action is within policy. The chain of actions violates the intent of every policy simultaneously.

Practitioners stress-testing their own agent guardrails have identified multi-step chaining as one of the hardest failure modes to defend against precisely because governance frameworks are typically designed around individual action validation. Testing for it means thinking in sequences, not individual steps: what combinations of permitted actions, executed together, produce an outcome your policies were designed to prevent?
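A sequence-level check can be layered on top of per-action approval. The sketch below uses made-up action names and a single illustrative rule — "once sensitive config has been read, no outbound call may follow without a gate" — to show the shape of the idea, not a complete policy engine.

```python
# Per-action allowlist: each of these is individually permitted.
ALLOWED_ACTIONS = {"write_log", "read_config", "outbound_api_call"}


def per_action_ok(action: str) -> bool:
    """The conventional check: validate each action in isolation."""
    return action in ALLOWED_ACTIONS


def sequence_ok(actions: list) -> bool:
    """A sequence rule the per-action check cannot express:
    reading config followed by an outbound call is a potential
    exfiltration path, even though both actions are allowed."""
    config_read = False
    for action in actions:
        if action == "read_config":
            config_read = True
        if action == "outbound_api_call" and config_read:
            return False  # chain violates policy intent
    return True
```

The point of the test suite is the asymmetry: every step in the chain passes `per_action_ok`, and only `sequence_ok` catches the composition. If your governance layer has no equivalent of the second function, no amount of per-action testing will surface this class of failure.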

How to Test the Governance Layer

The goal is to treat your governance plane as a system under test — with the same rigor you'd apply to any other infrastructure component. Concretely:

Test boundary conditions, not just happy paths. If a cost limit fires at $10, don't just verify that a $5 run completes normally. Run something that hits $10.00, $10.01, and $50. Verify the limit fires at the right threshold and stops cleanly. Verify that a stopped run doesn't leave a partial state that the agent can resume from.
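As a minimal sketch of what that boundary suite looks like, assume a hypothetical `limit_fires()` helper that reports whether enforcement triggers at a given spend. Whether the comparison is strictly-over or at-threshold is a deliberate design decision this test pins down explicitly.

```python
LIMIT_USD = 10.00


def limit_fires(spend_usd: float) -> bool:
    # Design choice made explicit: a run landing exactly on the limit
    # completes; one cent over must stop. (Your policy may differ —
    # the point is that a test asserts whichever rule you chose.)
    return spend_usd > LIMIT_USD


def test_cost_limit_boundaries():
    assert not limit_fires(5.00)    # happy path: well under
    assert not limit_fires(10.00)   # exactly at the threshold
    assert limit_fires(10.01)       # one cent over must fire
    assert limit_fires(50.00)       # far over must fire
```

Without the `10.00` and `10.01` cases, an off-by-one in the comparison operator (`>=` vs `>`) passes every happy-path test and only surfaces in production.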

Test governance under load and context pressure. Run your agent through full-length scenarios — the longest conversations you expect in production, with realistic tool-call volumes and latencies. Does the escalation policy survive a 40-turn conversation? Does it survive a context compaction event?

Test adversarial inputs specifically against the governance layer. Give your agent inputs designed to test whether the policy boundaries hold, not just whether the agent produces good outputs. Can the agent be prompted into requesting a permission it shouldn't have? Can it be manipulated into chaining permitted actions in a way that violates intent?

Use execution replay to investigate failures. When a governance control doesn't fire, you need to see exactly what the agent saw at the moment it should have triggered. Replaying execution history — the full trace of tool calls, context states, and policy evaluations — is the only reliable way to diagnose governance failures post-run.
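The data structure behind that kind of replay can be very simple. This is a hypothetical sketch, with made-up field names, of a per-step trace that records what the agent saw and which policies were evaluated — enough to locate the step where a control silently disappeared.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    step: int
    tool_call: str
    context_snapshot: tuple        # what was in context at this step
    policy_evaluations: dict       # which policies ran, and their verdicts


@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, step_record: StepRecord) -> None:
        self.steps.append(step_record)

    def first_step_missing_policy(self, policy_name: str):
        """Find the first step where the policy was no longer evaluated —
        the moment the control vanished from the agent's world."""
        for s in self.steps:
            if policy_name not in s.policy_evaluations:
                return s.step
        return None
```

In the OpenClaw-style failure, a query like `first_step_missing_policy("require_approval")` would point directly at the step following the compaction event — something no amount of output inspection can tell you.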

Treat policy changes as code changes. When you modify a policy — tighten a cost limit, add a new tool restriction, change an escalation trigger — run a full regression against the governance test suite. Policy configuration changes are as consequential as code changes and require the same validation discipline.
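One way to wire that discipline in, sketched under the assumption that policies live in structured config: fingerprint the policy, and gate deploys on the governance suite whenever the fingerprint changes. `gate_policy_change` and the file layout are illustrative, not a prescribed implementation.

```python
import hashlib
import json


def policy_fingerprint(policy: dict) -> str:
    """Stable hash of the policy: any semantic change is detected,
    while key reordering is not a spurious diff."""
    canonical = json.dumps(policy, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def gate_policy_change(old_policy: dict, new_policy: dict, run_suite) -> bool:
    """If the policy changed, the full governance regression suite
    must pass before the change ships. run_suite is your existing
    boundary/chaining/context-pressure test battery."""
    if policy_fingerprint(old_policy) == policy_fingerprint(new_policy):
        return True  # no change, nothing to re-validate
    return run_suite(new_policy)
```

In a CI pipeline this runs on every change to the policy config, exactly the way a code change triggers the unit test suite — a tightened cost limit gets the same scrutiny as a modified enforcement function.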

How Waxell handles this: Waxell's governance layer is architecturally separated from the agent — policies are enforced by the governance plane, not by instructions in the agent's context. This means the OpenClaw class of failure (governance instruction lost to context compaction) can't happen: the policy isn't in the context to begin with, so context changes don't affect it. Waxell's pre-production testing environment lets you run adversarial governance tests — cost limits, escalation triggers, permission boundaries — before any of it goes live.

Behavioral testing is necessary. It's not sufficient. The teams building serious production agent systems are discovering they need two distinct QA disciplines: one for the agent, one for the control layer above it. The second one barely exists yet as a formal practice. That's the gap to close before the next OpenClaw incident is yours.

If you're building governance infrastructure and want to test it before it hits production, that's what Waxell is built for.

Frequently Asked Questions

What is the difference between behavioral testing and governance testing for AI agents? Behavioral testing checks whether an agent correctly completes its intended tasks — the right tools, in the right order, with the right outputs. Governance testing checks whether the control layer above the agent — cost limits, escalation policies, content filters, permission boundaries — correctly intercepts and stops out-of-policy behavior when it occurs. Both are necessary. Most teams only have the first.

Can governance controls fail even if the agent is behaving correctly? Yes. Governance controls can fail independently of agent behavior. The most common failure modes: cost limits implemented with the wrong enforcement trigger, escalation policies embedded in agent context that get dropped under context pressure, and per-action permission checks that don't catch multi-step sequences that violate policy intent in aggregate. In each case, the agent may be doing exactly what it was asked to do.

How do you test whether an escalation policy will actually fire in production? Run test scenarios specifically designed to trigger the escalation condition — not just normal runs that happen to stay within bounds. Then test the boundary: what happens at the exact threshold, one step over it, and in a scenario with high context pressure or API latency? If you're relying on escalation instructions in the agent's context window, test whether they survive a long conversation and a compaction event.

What is multi-step chaining in the context of agent governance? Multi-step chaining is when an agent chains multiple individually permitted actions into a sequence that violates the intent of your governance policies, even though no single action triggers a policy check. It's one of the hardest failure modes to detect because per-action policy checks are the norm. Testing for it requires thinking in sequences: what combinations of approved actions, run in order, produce an outcome your policies were designed to prevent?

Why is governance testing separate from security testing? There's significant overlap, but they're not the same exercise. Security testing focuses on whether an agent can be exploited by an adversary. Governance testing focuses on whether the operational control layer — cost limits, approval gates, tool restrictions — holds under normal and edge-case conditions. Security testing asks "can an attacker break this?" Governance testing asks "will this work correctly when we need it to?" Both answers need to be yes.


Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.
