Utah OAIP · Research team
April 22, 2026

Evidence infrastructure for the sandbox era.

A conversation with the Utah Office of AI Policy research team.

GLACIS · April 22, 2026
§ I · Why we asked for this meeting 02 / 06

We listened to Zach at CHAI.
Three things he said changed how we're thinking about the sandbox.

Each one, in his words. Each one implies something about what your sandbox needs next.

i · Case review
I know it when I see it. I can evaluate any case. I can't yet write the rules that let someone else do what my team does.
— Zach Boyd, CHAI 2026

The sandbox runs on staff judgment today. The next phase is turning that judgment into generalizable rules.

ii · Oversight budget
‘Human in the loop’ is an allocation problem. When can a clinician responsibly delegate ninety to one hundred percent?
— Zach Boyd, CHAI 2026

The honest answer needs evidence of what the system actually did — not vendor self-reports.

iii · Standards drift
Once standards are set, they ossify. The underlying technology keeps moving. The standard stops moving with it.
— Zach Boyd, CHAI 2026

Standards at the policy layer. Evidence at the runtime layer. One evolves without re-certifying the other.

§ II · The gap we see 03 / 06

Three questions your sandbox has to answer eventually — and none can be answered from vendor self-reports alone.

Q I

Was the declared control actually running at the moment of the decision?

Q II

Did the system's behavior drift from its baseline over time — and if so, when?

Q III

Can an independent auditor verify the first two offline, years later, without trusting us or the vendor?

§ III · The mechanism 04 / 06

What we built
— the mechanism, not the product.

Three components. Each answers one of the questions on the previous slide. Each is independently verifiable.

i · Attestation

A receipt for every inference call.

Every inference call is wrapped. A cryptographic receipt is emitted with the inputs, the declared policy, and the output — signed at the moment of execution.

Ed25519 · canonical JSON
No PHI egress
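A minimal, stdlib-only sketch of the receipt idea. The field names and the chaining scheme here are illustrative assumptions, not the published receipt format; a production emitter would additionally sign each receipt's hash with an Ed25519 key at the moment of execution, which the standard library alone cannot do.

```python
import hashlib
import json

def canonical_json(obj) -> bytes:
    # Sorted keys, no whitespace: the same receipt always serializes
    # to the same bytes, so a hash or signature over it is stable.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

def make_receipt(prev_hash: str, inputs: dict, policy_id: str, output: str) -> dict:
    # Hypothetical receipt layout: inputs, declared policy, output,
    # chained to the previous receipt by hash. Signing the hash field
    # (Ed25519 in the real mechanism) is omitted to stay stdlib-only.
    body = {"prev": prev_hash, "inputs": inputs, "policy": policy_id, "output": output}
    return {**body, "hash": hashlib.sha256(canonical_json(body)).hexdigest()}

# Two consecutive inference calls, chained receipt to receipt.
r1 = make_receipt("0" * 64, {"prompt": "triage note"}, "policy-v1", "route:clinic")
r2 = make_receipt(r1["hash"], {"prompt": "follow-up"}, "policy-v1", "route:nurse")
```

In practice the `inputs` field would carry a hash or redacted digest of the prompt rather than raw text, which is how the receipt stream avoids PHI egress.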
ii · Drift detection

Continuous statistical monitoring.

A behavioral baseline is established and watched. Distributional shift is caught as it happens — not weeks later in a retrospective review.

Page-CUSUM · auto-tuning
Streaming, per‑deployment
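The detector can be sketched as a one-sided Page CUSUM over a scalar behavior score: accumulate deviations above the baseline mean and alarm when the running sum crosses a threshold. The slack `k` and threshold `h` are hand-set constants here; the auto-tuning the deck describes would derive them from the baseline itself.

```python
class PageCusum:
    # One-sided Page CUSUM. Streams one observation at a time,
    # so it runs per-deployment with O(1) state.
    def __init__(self, baseline_mean: float, k: float = 0.5, h: float = 5.0):
        self.mean = baseline_mean  # established during the baseline phase
        self.k = k                 # slack: tolerated deviation per sample
        self.h = h                 # alarm threshold on the running sum
        self.s = 0.0

    def update(self, x: float) -> bool:
        # Accumulate only upward deviations; reset at zero.
        self.s = max(0.0, self.s + (x - self.mean - self.k))
        return self.s > self.h

cusum = PageCusum(baseline_mean=0.0)
stream = [0.1, -0.2, 0.0] + [2.0] * 5   # baseline behavior, then a level shift
alarms = [cusum.update(x) for x in stream]
# The alarm fires a few samples into the shift, not weeks later.
```

The delay between the shift and the alarm is the usual CUSUM trade-off: a larger `h` suppresses false alarms on noisy baselines at the cost of slower detection.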
iii · Verification

Offline, independent replay.

A Python verifier reads the receipt stream and confirms the chain. Your auditor verifies our evidence without trusting us, our servers, or our certificates.

407 lines · stdlib only
Zero network
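A sketch of the offline replay step, under the same illustrative receipt layout as above (chained SHA-256 hashes; the real verifier would also check Ed25519 signatures). The point the code makes: the auditor recomputes everything locally, so tampering or reordering is caught without trusting the emitter.

```python
import hashlib
import json

def _canonical(obj) -> bytes:
    # Same canonical-JSON rule the emitter is assumed to use.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

def verify_chain(stream: list) -> bool:
    # Offline replay of a JSON-lines receipt stream. Each receipt must
    # hash to its own `hash` field and link to its predecessor via
    # `prev`. No network access, no trust in the emitter's servers.
    prev = "0" * 64
    for line in stream:
        r = json.loads(line)
        body = {k: r[k] for k in ("prev", "inputs", "policy", "output")}
        if hashlib.sha256(_canonical(body)).hexdigest() != r["hash"]:
            return False   # receipt altered after emission
        if r["prev"] != prev:
            return False   # chain broken or reordered
        prev = r["hash"]
    return True

def _receipt(prev, inputs, policy, output):
    body = {"prev": prev, "inputs": inputs, "policy": policy, "output": output}
    h = hashlib.sha256(_canonical(body)).hexdigest()
    return json.dumps({**body, "hash": h})

good = [_receipt("0" * 64, {"q": "dose check"}, "policy-v1", "ok")]
good.append(_receipt(json.loads(good[0])["hash"], {"q": "refill"}, "policy-v1", "ok"))
tampered = [good[0].replace("dose check", "dose chek")] + good[1:]
```

Because the check is pure recomputation over a file, it runs identically months or years later, with zero network.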
§ IV · What we'd like to propose 05 / 06

A structured pilot, not a procurement.

— i

Instrument one existing sandbox participant with our evidence layer.

— ii

Zero cost. Zero PHI egress. Zero procurement process.

— iii

Monthly evidence report your team can independently verify — and in return, we learn what format is most useful to your review process.

§ V · What we'd like to show you now 06 / 06

A five-stage walkthrough of how this works end to end.

I

Declare

The participant declares, in machine-readable form, the policy the system is supposed to follow.
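A sketch of what a machine-readable declaration might look like. The field names and control identifiers here are hypothetical, not a published schema; the only requirement the stage imposes is that every declared control is something the runtime layer can actually check.

```python
# Hypothetical policy declaration for one sandbox participant.
declared_policy = {
    "policy_id": "policy-v1",
    "model": "triage-assistant",
    "controls": [
        {"id": "no-phi-egress", "enforce": "block"},
        {"id": "human-review-above-risk-threshold", "enforce": "flag"},
    ],
    # Ties back to the oversight-budget question: the ceiling on how
    # much a clinician may delegate is itself part of the declaration.
    "oversight": {"delegation_ceiling": 0.9},
}

REQUIRED = {"policy_id", "model", "controls"}

def is_declarable(policy: dict) -> bool:
    # A declaration is usable only if its machine-checkable fields
    # exist: an id, a model, and controls with enforcement modes.
    return REQUIRED <= policy.keys() and all(
        "id" in c and "enforce" in c for c in policy["controls"]
    )
```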

II

Discover

We establish the behavioral baseline from real, consented traffic before anything goes live.

III

Defend

Every inference call in production is wrapped and the declared control is enforced at runtime.

IV

Detect

Drift from the baseline is flagged as it happens, with the evidence attached.

V

Prove

Your auditor re-runs the verifier offline and confirms the record, months or years later.

— After the demo · two questions
What would the ideal evidence format look like from your side?
Who else from your office should see this?