Back to Blog
AI Governance

Glasswing Is Day Zero. What Comes Next?

Anthropic’s infrastructure vulnerability scanner is real progress. But hardening the platform is a different problem from governing what your AI agents actually do in production.

Joe Braidwood
Joe Braidwood
Co-founder & CEO
· April 2026 · 8 min read

Anthropic just announced Glasswing — a vulnerability-scanning programme that pays researchers to find bugs in Claude’s infrastructure. Buffer overflows, privilege escalation, prompt injection in the scaffolding itself. This is good work. It deserves credit.

But it surfaces a distinction that matters enormously for anyone deploying AI in healthcare, financial services, or insurance: the difference between infrastructure security and runtime behavioural governance.

Put simply: Glasswing is Day 0. Governing AI deployments at scale is Day 1.

The categorical difference

Glasswing asks: “Is the platform itself secure?” Does Claude’s container environment have exploitable bugs? Can an attacker escape the sandbox? Can a crafted input corrupt the model-serving pipeline?

These are infrastructure questions. They belong to the same category as Kubernetes CVEs, TLS misconfigurations, and memory-safety bugs. Important. Necessary. But they live in a different universe from the question regulated organisations actually need answered:

“What is my AI agent doing right now, and can I prove it?”

Day 0 — Infrastructure

Glasswing

  • Finds bugs in Claude’s code and container
  • Patches platform-level vulnerabilities
  • Secures the model-serving pipeline
  • Point-in-time, per-vulnerability
Day 1+ — Governance

Runtime Monitoring

  • Monitors what AI agents actually do
  • Detects behavioral drift and guardrail erosion
  • Catches discrimination, PHI leakage, policy violations
  • Continuous, across every interaction

Both matter. But they aren’t the same problem, they don’t require the same tooling, and one doesn’t solve the other. A perfectly patched infrastructure can still host an agent that leaks protected health information, denies claims along racial lines, or gradually abandons its safety guardrails over the course of a five-turn conversation.

What happens after Day 0?

Assume Glasswing works perfectly. Every infrastructure vulnerability in Claude is found, patched, and verified. The platform is airtight. Now you deploy your prior-authorisation agent, your ambient clinical scribe, your claims-adjudication bot.

What happens next?

The agent processes 10,000 interactions a day. On Tuesday, the model provider pushes a minor update. On Thursday, your ops team adjusts the system prompt to handle a new edge case. On Friday, a user discovers that if they build rapport with the agent over five turns — gradually escalating trust — the agent relaxes its refusal boundaries and starts volunteering information it was explicitly instructed to withhold.

None of these are infrastructure vulnerabilities. Glasswing won’t find them. They’re behavioural failures — and they’re the ones that generate regulatory action, malpractice exposure, and reputational damage.

What we’re actually seeing

We built autoredteam to do continuous adversarial evaluation of AI systems in production. Here’s what it finds that no infrastructure scanner can:

Demographic disparity in high-stakes decisions

Autoredteam flagged a 6% demographic disparity in a healthcare prior-authorisation agent. Same clinical scenario, different patient demographics, materially different approval rates. The infrastructure was flawless. The model was behaving within its technical parameters. The outcome was discriminatory — and invisible without behavioural testing that varies demographic inputs systematically.

This isn’t a bug in Claude’s code. It’s a bias in the agent’s deployed behaviour. Glasswing can’t find it because it isn’t looking for it — and shouldn’t be. That’s not Glasswing’s job.

Trust-building escalation attacks

Five-turn conversations where the adversary builds rapport, establishes a collaborative tone, and then makes progressively boundary-pushing requests. By turn five, the agent is volunteering information its system prompt explicitly forbids. Not because it was jailbroken in the traditional sense — because the model’s attention mechanism weighted the accumulated conversational trust more heavily than its instruction hierarchy.

This is a behavioural vulnerability. It lives in the interaction pattern, not the codebase. A bug bounty can’t find it because there’s no discrete “bug” to report — it’s an emergent property of multi-turn conversation.

Behavioural drift that point-in-time testing misses

We use CUSUM-based statistical process control to detect subtle behavioural degradation over time. A model that was 94% policy-compliant last Tuesday might be 88% today — not because anyone changed anything, but because a provider-side model update shifted the latent decision boundary. Point-in-time red-teaming (including a bug bounty) catches what’s broken now. CUSUM catches the system that’s drifting toward failure — and flags it before it gets there.

We’ve mapped these failure modes across 18 root cause categories, each linked to specific OVERT controls. The taxonomy didn’t come from theory — it came from running continuous adversarial evaluation against real production deployments.

The Day 1 arc: Scan, Harden, Prove

If Glasswing represents the mature version of Day 0 — infrastructure hardened, platform secured — then Day 1 is the operational governance loop that runs continuously on top of it:

1 Scan

Automated adversarial evaluation. Behavioural probes across demographics, escalation patterns, policy boundaries. Continuous, not one-off.

2 Harden

Discovered bypasses become training data for a defender model that deploys at runtime. Every vulnerability you find makes the system stronger.

3 Prove

Every control execution gets a cryptographic attestation receipt via OVERT. Tamper-evident, independently verifiable, audit-ready.

The loop runs continuously. Attack success rate drops. A new baseline gets committed. The loop runs again. Drift gets caught by CUSUM before it becomes a compliance violation. Every interaction gets an evidence chain that a regulator, an auditor, or a plaintiff’s attorney can verify independently.

Complementary, not competitive

Let’s be direct: Glasswing is something Anthropic should be doing. Model providers securing their own infrastructure is a baseline expectation, and making it a formal programme with external researchers is the right approach. We want Anthropic to find and fix every infrastructure vulnerability in Claude. That makes the foundation more trustworthy for everyone building on top of it.

What Glasswing doesn’t do — and shouldn’t try to do — is govern what you build on that infrastructure. Your agent, your system prompt, your tool integrations, your data pipelines, your compliance obligations. That’s a different problem, with different tooling, owned by a different team.

Anthropic can’t tell you whether your clinical decision-support agent is exhibiting demographic bias. They can’t monitor whether your ambient scribe is leaking PHI into its summaries. They can’t prove to your state attorney general that your AI guardrails actually executed on a specific patient interaction at a specific time. That’s not a criticism — it’s a scope boundary. And it’s the scope boundary that defines where Day 0 ends and Day 1 begins.

The question to ask

If you’re deploying AI in a regulated industry, the useful question isn’t “Is the model provider’s infrastructure secure?” — though you should confirm it is. The question is: “Do I have continuous, evidence-grade visibility into what my AI agents are doing in production, and can I prove it to someone who doesn’t trust me?”

If the answer is no, you’ve solved Day 0 but haven’t started Day 1.

See it yourself

Free 30-minute behavioural scan of your AI deployment. We’ll show you exactly what autoredteam finds — demographic disparity, escalation vulnerabilities, drift signals — against your actual system.

Book a scan 30 minutes. We run autoredteam against your system, live.
autoredteam.com Run it yourself. Open source, Apache 2.0, five minutes to first results.
overt.is The open standard for cryptographic AI runtime trust.

Ready for Day 1?

Free 30-minute behavioural scan. We’ll run autoredteam against your AI system and show you what continuous governance looks like in practice.

Book a Scan