A conversation with the Utah Office of AI Policy research team.
Three concerns, each in his own words. Each implies something about what your sandbox needs next.
I know it when I see it. I can evaluate any case. I can't yet write the rules that let someone else do what my team does. — Zach Boyd, CHAI 2026
The sandbox runs on staff judgment today. The next phase is turning that judgment into generalizable rules.
‘Human in the loop’ is an allocation problem. When can a clinician responsibly delegate ninety to one hundred percent? — Zach Boyd, CHAI 2026
The honest answer needs evidence of what the system actually did — not vendor self-reports.
Once standards are set, they ossify. The underlying technology keeps moving. The standard stops moving with it. — Zach Boyd, CHAI 2026
Standards at the policy layer. Evidence at the runtime layer. One evolves without re-certifying the other.
Was the declared control actually running at the moment of the decision?
Did the system's behavior drift from its baseline over time — and if so, when?
Can an independent auditor verify the first two offline, years later, without trusting us or the vendor?
Three components. Each answers one of the questions on the previous slide. Each is independently verifiable.
Every inference call is wrapped. A cryptographic receipt is emitted with the inputs, the declared policy, and the output — signed at the moment of execution.
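The receipt emission could look something like the following sketch. The field names and the HMAC signing key are illustrative assumptions; a production system would use an asymmetric signature so verification does not require the secret.

```python
import hashlib
import hmac
import json
import time

# Hypothetical key for illustration only; a real deployment would sign with
# a private key whose public half is published for auditors.
SIGNING_KEY = b"demo-signing-key"

def sign_receipt(prev_hash: str, inputs: str, policy_id: str, output: str) -> dict:
    """Emit a receipt binding inputs, declared policy, and output,
    signed at the moment of execution and chained to the prior receipt."""
    body = {
        "ts": time.time(),
        "prev": prev_hash,
        "inputs_sha256": hashlib.sha256(inputs.encode()).hexdigest(),
        "policy": policy_id,
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    payload = json.dumps(body, sort_keys=True).encode()
    body["sig"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return body

receipt = sign_receipt("genesis", "deidentified vitals", "clinician-review-v1", "flag: low risk")
```

The inputs and outputs themselves never leave the participant's environment; only their hashes appear in the receipt, which is what keeps PHI egress at zero.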
A behavioral baseline is established and watched. Distributional shift is caught as it happens — not weeks later in a retrospective review.
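One minimal form of this check, sketched with simulated data: fit the baseline mean and spread from pre-launch traffic, then flag any production window whose mean drifts significantly. The score distribution, window size, and z-threshold here are all assumptions for illustration.

```python
import math
import random
import statistics

random.seed(0)

# Simulated stand-in for the behavioral baseline established from real,
# consented traffic before go-live (here, a model risk-score stream).
baseline = [random.gauss(0.30, 0.05) for _ in range(1000)]
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

def drift_alert(window: list[float], z_threshold: float = 4.0) -> bool:
    """Flag when a production window's mean deviates from the baseline
    mean by more than z_threshold standard errors."""
    se = sigma / math.sqrt(len(window))
    z = abs(statistics.mean(window) - mu) / se
    return z > z_threshold

stable = [random.gauss(0.30, 0.05) for _ in range(100)]
drifted = [random.gauss(0.45, 0.05) for _ in range(100)]
```

Because the check runs per window as traffic arrives, the alert fires when the shift happens, not weeks later in a retrospective review.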
A Python verifier reads the receipt stream and confirms the chain. Your auditor verifies our evidence without trusting us, our servers, or our certificates.
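The chain check at the heart of such a verifier can be sketched in a few lines: each receipt must name the hash of its predecessor, so any tampering or deletion breaks the chain. The receipt layout is an assumption for illustration.

```python
import hashlib
import json

def receipt_hash(receipt: dict) -> str:
    """Canonical hash of a receipt, excluding its own signature field."""
    body = {k: v for k, v in receipt.items() if k != "sig"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def verify_chain(receipts: list[dict]) -> bool:
    """Confirm every receipt's 'prev' field matches the hash of the one before it."""
    prev = "genesis"
    for r in receipts:
        if r["prev"] != prev:
            return False
        prev = receipt_hash(r)
    return True
```

Everything the function needs is in the receipt stream itself, which is why the check runs offline, years later, without trusting the operator, their servers, or their certificates.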
Instrument one existing sandbox participant with our evidence layer.
Zero cost. Zero PHI egress. Zero procurement process.
Monthly evidence report your team can independently verify — and in return, we learn what format is most useful to your review process.
The participant declares the policy the system is supposed to follow, in machine-readable form.
We establish the behavioral baseline from real, consented traffic before anything goes live.
Every inference call in production is wrapped and the declared control is enforced at runtime.
Drift from the baseline is flagged as it happens, with the evidence attached.
Your auditor re-runs the verifier offline and confirms the record, months or years later.
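The machine-readable declaration in step one might look like the sketch below. Every field name here is a hypothetical schema for illustration, not the sandbox's actual format; the point is that the declared control is structured data the runtime can enforce and the receipts can reference.

```python
# Hypothetical policy declaration (illustrative schema, not a real spec).
policy = {
    "policy_id": "clinician-review-v1",
    "control": "human_review_required",
    "applies_when": {"risk_score_gte": 0.7},
    # The delegation question made explicit: at most 90% of calls
    # may complete without clinician sign-off.
    "max_autonomous_fraction": 0.9,
}

REQUIRED_FIELDS = {"policy_id", "control", "applies_when"}

def is_valid(p: dict) -> bool:
    """Check the declaration carries the fields the runtime needs to enforce it."""
    return REQUIRED_FIELDS <= p.keys()
```

The `policy_id` is what each receipt would carry, tying every production decision back to the exact control the participant declared.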