After a Prompt Injection Attack, Prove What Held

Joe Braidwood, Co-founder & CEO

June 2026 · 5 min read

A single prompt injection attack just demonstrated how much is at stake. According to reporting from Bloomberg and CNBC, the US Commerce Department ordered Anthropic on 11–13 June 2026 to suspend access to its most capable models — Fable 5 and Mythos 5 — for all foreign nationals, citing national-security concerns. The reported trigger was a discovered method of jailbreaking Fable 5: a cybersecurity vulnerability. One injection finding, and a frontier model came off the market for a whole class of users.

For security reviewers and AppSec teams defending LLM applications, the lesson is uncomfortable but clarifying. No system that takes instructions in natural language can be guaranteed injection-proof. The defensible move — and what increasingly decides whether a deployment survives scrutiny — is to hold independent, tamper-evident proof of which controls fired and which boundaries held, verifiable after the fact without exposing the protected content behind them.

What a prompt injection attack actually is

A prompt injection attack manipulates a model into ignoring its operating instructions by smuggling adversarial text into its input — directly in a user message, or indirectly through a document, a webpage, a tool result, or an email the model later reads. The model has no reliable way to separate trusted instructions from untrusted data, because to a language model both are just tokens.

The consequences scale with what the model can reach. A chatbot tricked into rude output is an embarrassment. An agent with tool access (able to query a database, call an API, move money, or touch a patient record) that is tricked into exfiltrating data or taking an unauthorised action is an incident. This is why AI agent security is the sharp edge of the problem: the blast radius is the union of every tool the agent can invoke. A jailbreak that bypasses a model's safety training and an indirect injection that hijacks an agent's task are different mechanisms with the same downstream question: what did the system do next, and what stopped it?

Why prevention alone is not a defensible position

The honest engineering reality is that no input filter catches every injection. Adversarial phrasing evolves; encodings shift; the attack surface includes content you don't author. Teams layer defences (input sanitisation, output checks, allowlists on tool calls, human-in-the-loop on high-risk actions), and each layer reduces risk without eliminating it. Mature agentic AI security treats prevention as one tier, not the whole strategy.
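One of those layers, the allowlist on tool calls with a human-in-the-loop step for high-risk actions, can be sketched in a few lines. This is a minimal illustration rather than a complete defence: the tool names, the two policy sets, and the `screen_tool_call` function are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    PERMIT = "permit"
    DENY = "deny"
    ESCALATE = "escalate"


# Hypothetical policy: the tools the agent may call at all, and the subset
# that needs a human sign-off before it runs.
ALLOWED_TOOLS = {"search_docs", "read_record", "send_payment"}
HIGH_RISK_TOOLS = {"send_payment"}


def screen_tool_call(tool_name: str) -> Decision:
    """Decide on a requested tool call by its name alone.

    The check never reads the text that led the model to request the call,
    so injected instructions have nothing to argue with at this layer.
    """
    if tool_name not in ALLOWED_TOOLS:
        return Decision.DENY
    if tool_name in HIGH_RISK_TOOLS:
        return Decision.ESCALATE
    return Decision.PERMIT
```

A check like this narrows what an injected instruction can reach; it does not stop a hijacked agent from misusing a tool that is legitimately on the list, which is why it counts as one tier among several.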

So when an incident lands, prevention is not the question a reviewer, a regulator, or an insurer asks first. They ask: what happened, and can you show us? And here most teams discover their evidence is thin. Application logs are operator-controlled — the same party under scrutiny wrote them, and could in principle have edited them. A screenshot of a dashboard is an assertion, not proof. Self-reported logs record what a system believed it did; they are recollections, not evidence. As the OVERT standard puts it: governance has always been able to say what ought to be done, and has rarely been able to prove what was.

That gap, between "we have controls" and "here is independently checkable proof that a specific control executed at the moment it mattered", is the verification gap. The Fable 5 episode is what it looks like when that gap meets a regulator with the power to act on a single finding.

Runtime evidence: proof that the guardrail fired

The alternative to trusting a log is producing a receipt. A runtime control — the component that inspects a prompt, screens a tool call, denies an action, or escalates to a human — can emit, as a by-product of doing its work, a signed record that an outside party can verify. Not a richer log: evidence. Tamper-evident, independently checkable, and silent about everything it need not disclose.

This is the motion GLACIS calls runtime coverage. Controls enforce at the inference, tool-call, and agent boundary. Each enforcement event — permit, deny, override, escalation, response — produces a signed receipt. Crucially, only cryptographic fingerprints, signatures, and verification metadata cross the trust boundary. The prompt, the document, the patient record, the customer data — the protected content stays inside your environment. The result is proof the guardrail held without turning the evidence trail into a new data-egress channel.
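As a rough sketch of that idea, the snippet below builds a receipt that carries a SHA-256 fingerprint of the protected content instead of the content itself, then signs it. Everything here is an assumption made for illustration: the field names, the `make_receipt` and `verify_receipt` helpers, and the demo key are hypothetical and are not the GLACIS or OVERT receipt format. HMAC with a shared key stands in for the signature only to keep the example within the standard library; an independent verifier would need a public-key scheme so that it never holds the signing secret.

```python
import hashlib
import hmac
import json

# Assumption: a demo key shared by signer and verifier. A real deployment
# would sign with a private key and verify with the matching public key.
DEMO_KEY = b"demo-key-not-for-production"


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of protected content; the content is not kept."""
    return hashlib.sha256(content).hexdigest()


def _canonical(fields: dict) -> bytes:
    # Sorted keys and fixed separators give signer and verifier identical bytes.
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode()


def make_receipt(boundary: str, decision: str, content: bytes, timestamp: str) -> dict:
    """Build a signed record of one enforcement event (hypothetical fields)."""
    fields = {
        "boundary": boundary,
        "decision": decision,
        "content_sha256": fingerprint(content),
        "timestamp": timestamp,
    }
    signature = hmac.new(DEMO_KEY, _canonical(fields), hashlib.sha256).hexdigest()
    return {**fields, "signature": signature}


def verify_receipt(receipt: dict) -> bool:
    """Recompute the signature over every field except the signature itself."""
    fields = {k: v for k, v in receipt.items() if k != "signature"}
    expected = hmac.new(DEMO_KEY, _canonical(fields), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt.get("signature", ""))
```

Changing any field after signing, for example flipping `deny` to `permit`, makes `verify_receipt` return `False`, while the prompt text never appears anywhere in the receipt.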

Concretely, mature security operations need five things from this kind of runtime evidence, and the open OVERT standard is built to provide them:

Trusted execution evidence — which enforcing component, in which configuration, was active when a governed action occurred.
Reliable coverage accounting — what was in scope, what was excluded, and how the denominators were derived, so "we screen tool calls" comes with a measured rate rather than an adjective.
Tamper-evident telemetry — records not reducible to operator-controlled logs.
Independent verification of enforcement events — permits, denials, overrides, and escalations an outside party can check.
Post-incident reconstruction without routine content disclosure — replay the event history of an injection attempt without re-exposing the sensitive data involved.
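Two of these properties lend themselves to a short sketch: tamper-evident telemetry, shown here as a hash chain in which each entry commits to the one before it, and coverage accounting, shown as a plain ratio with an explicit denominator. The event fields and function names are hypothetical, and hash chaining is one common technique chosen for illustration, not a mechanism taken from the OVERT text.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def append_event(chain: list, event: dict) -> None:
    """Append an enforcement event, linking it to the current chain head."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev_hash, "hash": _entry_hash(prev_hash, event)})


def chain_is_intact(chain: list) -> bool:
    """Recompute every link; an edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


def coverage_rate(screened: int, in_scope: int) -> float:
    """Share of in-scope actions that actually passed through a control."""
    if in_scope <= 0:
        raise ValueError("in_scope must be positive")
    return screened / in_scope
```

A chain like this exposes tampering only if its latest hash is also held by someone other than the operator, who could otherwise rebuild the whole chain. The events record decisions and boundaries, not prompts or records, so replaying them reconstructs the sequence without re-exposing content.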

After a prompt injection attack, that last property is the difference between a defensible afternoon and a bad one. The record can show, cryptographically, that the malicious instruction hit the tool-call boundary and was denied — or, if a control failed, exactly which one and when — without handing investigators the very data the attacker was after.

Independence is what makes the proof worth anything

Self-attestation is not independent attestation. A receipt the governed party could have written, and could have altered, proves little under adversarial scrutiny. OVERT's design separates the two roles by structure: whoever attests is distinct from whoever is governed, and the verifier checks signatures rather than taking the operator's word. That structural independence is what makes OVERT an AI security standard rather than another self-report format.

The 1.1.0 release of OVERT, published 11 June 2026, hardens exactly the machinery that makes this work across an organisational boundary. It is an additive, backward-compatible minor release — an implementation conformant to 1.0 stays conformant to 1.1 unmodified — and its new normative Annex G adds the cross-boundary plumbing: a local content-addressed storage model for evidence retrieval and retention integrity, an HTTP transport binding so attestation can travel between an operator and an external verifier, an automated auditor-discovery protocol via a well-known endpoint, and a reference schema for the ControlAction artifact that records each enforcement event. The scanner and a local classifier are defined as supporting components. In plain terms: it makes "prove the guardrail held" a thing an outside auditor can request and verify over the wire, on demand, without your content ever leaving home.
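The general idea behind content-addressed storage can be shown briefly: a blob is stored under the hash of its own bytes, so anyone retrieving it can recheck that it still matches its address. The class below is an in-memory sketch under that assumption only. Its name and methods are hypothetical, and it does not reproduce the Annex G storage model, transport binding, or ControlAction schema, none of which are detailed here.

```python
import hashlib


class ContentAddressedStore:
    """In-memory stand-in: each blob is stored under its own SHA-256 digest."""

    def __init__(self) -> None:
        self._blobs = {}  # maps hex digest -> stored bytes

    def put(self, blob: bytes) -> str:
        """Store a blob and return the address derived from its content."""
        address = hashlib.sha256(blob).hexdigest()
        self._blobs[address] = blob
        return address

    def get(self, address: str) -> bytes:
        """Return the blob, refusing it if it no longer hashes to its address."""
        blob = self._blobs[address]
        if hashlib.sha256(blob).hexdigest() != address:
            raise ValueError("stored blob no longer matches its address")
        return blob
```

Because the address is derived from the bytes, a verifier who already holds the address needs no trust in the storage layer: any altered blob fails the check on retrieval.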

What to do before the next finding

The Fable 5 jailbreak is a reminder that the question is no longer hypothetical and no longer slow. For teams defending LLM applications, the practical posture is to assume some injection attempt will land, and to make sure that when one does, the answer is proof rather than a hastily assembled narrative.

That means enforcing at the boundaries where agents can act, and emitting independent, tamper-evident receipts for every permit, deny, and escalation, so that coverage is a measured number, incidents are reconstructable, and the proof survives a hostile read. Documentation describes intent. Receipts prove execution. When a single finding can move a market, the operators who can answer "prove it" are the ones who keep their deployments. The same posture is what mature AI security solutions are converging on: verifiable enforcement, not asserted compliance.

If you want to see what a verifiable enforcement record looks like, you can verify a receipt yourself, or get runtime coverage for the boundaries your agents act on. The open standard lives at overt.is and /standard.
