Legion Sentinel

A fail-closed oversight layer for AI-authorized psychiatric refills.

Inspiration

In March 2026, Utah's Office of Artificial Intelligence Policy signed a first-of-its-kind regulatory agreement with Legion Health: AI is now allowed to authorize psychiatric medication refills for Utah residents. The motivation is real and urgent: all 29 of Utah's counties have designated mental-health shortages, and up to 500,000 residents (nearly 20%) receive no behavioral-health care at all. Refill renewals are high-volume, high-friction, and tied directly to continuity of care. Automating them could help a lot of people.

But psychiatric refills are exactly the place you cannot afford to miss someone. An autopilot that quietly renews an antidepressant for a patient who has become suicidal, manic, pregnant, or who is having a dangerous cardiovascular side effect isn't a convenience — it's a hazard. The executed agreement knows this: it mandates seven hard-stop escalation triggers, a fail-closed posture, and a phased physician-review gate before any scaling.

So the regulation already defines the safety surface. The question that inspired us was: who builds the layer that actually enforces it? That's Legion Sentinel. Legion Health's tool is the autopilot; Sentinel is the gate that decides, at every single refill, whether it's safe to clear or whether a human needs to see this one.

What it does

At each refill check-in, Sentinel runs two channels through a triage brain and returns one of three decisions:

  • Routine clear — the refill goes through, with a pre-visit brief.
  • Med-safety hold — an unsafe refill is held and routed to a prescriber, with a bridge supply offered so the patient never runs out.
  • Acute-crisis escalation — a patient in crisis is handed to a human, immediately, with a warm voice handoff.

The two channels are deliberately complementary. The voice check-in hears what the patient says and how they say it — speech rate, pauses, and energy are themselves triage signals, not just transcription. The wearable stream reads what their body shows — resting heart rate, HRV, sleep, activity — catching the things a patient might downplay or not mention. The body catches what words miss, and vice versa.

The whole system is built on one principle: fail-closed. Sentinel clears a refill only when it has complete, affirmative evidence of safety. Absence of a signal is never a clear. Everything ambiguous, degraded, or uncertain routes to a human.
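A minimal sketch of what "absence of a signal is never a clear" can look like in code. The type and field names here (ChannelStatus, CheckInEvidence, hasAffirmativeEvidence) are hypothetical illustrations, not Sentinel's actual contract.

```typescript
// Hypothetical shapes for illustration; not the project's real Tier-0 contract.
type ChannelStatus = "ok" | "degraded" | "missing";

interface CheckInEvidence {
  voice: ChannelStatus;
  wearable: ChannelStatus;
  safetyScreenCompleted: boolean;
  // undefined means "no answer was obtained", which is not the same as "no risk".
  selfHarmRuledOut?: boolean;
}

// True only when every piece of evidence is present AND affirmative.
// Anything degraded, missing, or undefined makes this false, so the caller
// routes the refill to a human instead of clearing it.
function hasAffirmativeEvidence(e: CheckInEvidence): boolean {
  return (
    e.voice === "ok" &&
    e.wearable === "ok" &&
    e.safetyScreenCompleted &&
    e.selfHarmRuledOut === true
  );
}
```

The strict comparison against true is the point of the sketch: an unanswered self-harm question counts as a reason to hold, not as a pass.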

How we built it

The architecture is a deterministic safety spine with an LLM sensitizing layer on top — never the other way around.

The deterministic spine (no LLM, fail-closed): A pre-gate enforces the agreement's exact 15-medication non-controlled formulary and the clinician-touch cadence (≤10 automated refills or 6 months). A separate self-harm safety net runs a lexicon over the transcript and the mandatory safety screen. These are pure functions — fast, testable, and impossible for a model error to override.
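The pre-gate described above could be sketched as a pure function like the one below. This is an assumption-laden illustration: the names (preGate, RefillRequest, PreGateResult) are hypothetical, the formulary set is a placeholder that contains only the one in-formulary drug named in this writeup, and reading the cadence cap as "block once 10 automated refills or 6 months have elapsed since the last clinician touch" is our interpretation of the "≤10 automated refills or 6 months" rule.

```typescript
type PreGateResult =
  | { pass: true }
  | { pass: false; reason: "missing_data" | "off_formulary" | "cadence_exceeded" };

// Placeholder only: the agreement lists 15 non-controlled medications,
// which are not reproduced here.
const FORMULARY: ReadonlySet<string> = new Set(["venlafaxine"]);

const MAX_AUTOMATED_REFILLS = 10; // from the cadence rule in the text
const MAX_MONTHS_SINCE_CLINICIAN = 6;

// Hypothetical request shape; every field may be absent upstream.
interface RefillRequest {
  medication?: string;
  automatedRefillsSinceClinician?: number;
  monthsSinceClinician?: number;
}

function preGate(r: RefillRequest): PreGateResult {
  const { medication, automatedRefillsSinceClinician: refills, monthsSinceClinician: months } = r;
  // Fail closed: missing or non-finite inputs never fall through to a pass.
  if (
    medication === undefined ||
    refills === undefined || !Number.isFinite(refills) ||
    months === undefined || !Number.isFinite(months)
  ) {
    return { pass: false, reason: "missing_data" };
  }
  if (!FORMULARY.has(medication.trim().toLowerCase())) {
    return { pass: false, reason: "off_formulary" };
  }
  // Assumed reading: this refill would exceed the cap if 10 automated refills
  // have already happened, or if more than 6 months have passed.
  if (refills >= MAX_AUTOMATED_REFILLS || months > MAX_MONTHS_SINCE_CLINICIAN) {
    return { pass: false, reason: "cadence_exceeded" };
  }
  return { pass: true };
}
```

The explicit Number.isFinite checks matter because a comparison such as NaN >= 10 evaluates to false, which would otherwise let a corrupted counter slip through as a pass.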

The seven detectors (xAI Grok): One detector per Utah-mandated trigger: suicidality, mania/hypomania, pregnancy, severe adverse effect, worsening symptoms or loss of efficacy, identity/prescription mismatch, and out-of-scope. Each returns a calibrated probability; a central aggregator sets each trigger's fired flag and makes the final call. A clear is reachable only through the last, hardest branch, after self-harm has been affirmatively ruled out.
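To illustrate the "clear is the last branch" ordering, here is a hedged sketch of an aggregator. The threshold value, the choice of which triggers count as acute, and the decision to hold (rather than escalate) when self-harm has not been ruled out are all assumptions made for the example, as are the identifier names.

```typescript
type Trigger =
  | "suicidality" | "mania" | "pregnancy" | "severe_adverse_effect"
  | "worsening_symptoms" | "identity_mismatch" | "out_of_scope";
type Decision = "routine_clear" | "med_safety_hold" | "acute_crisis_escalation";

const ALL_TRIGGERS: readonly Trigger[] = [
  "suicidality", "mania", "pregnancy", "severe_adverse_effect",
  "worsening_symptoms", "identity_mismatch", "out_of_scope",
];
// Assumption: these two are treated as acute for the purposes of this sketch.
const ACUTE: ReadonlySet<Trigger> = new Set<Trigger>(["suicidality", "mania"]);
const FIRE_THRESHOLD = 0.2; // hypothetical value, not a calibrated one

function aggregate(
  probs: Partial<Record<Trigger, number>>,
  selfHarmRuledOut: boolean,
): { decision: Decision; fired: Trigger[] } {
  const fired: Trigger[] = [];
  let incomplete = false;
  for (const t of ALL_TRIGGERS) {
    const p = probs[t];
    // A missing or out-of-range probability is treated as missing evidence.
    if (p === undefined || !Number.isFinite(p) || p < 0 || p > 1) {
      incomplete = true;
      continue;
    }
    if (p >= FIRE_THRESHOLD) fired.push(t);
  }
  if (fired.some((t) => ACUTE.has(t))) {
    return { decision: "acute_crisis_escalation", fired };
  }
  if (fired.length > 0 || incomplete || !selfHarmRuledOut) {
    return { decision: "med_safety_hold", fired };
  }
  // Last branch: all seven detectors reported, none fired, self-harm ruled out.
  return { decision: "routine_clear", fired };
}
```

Because the clear sits after every other return, adding a new failure condition can only remove paths to a clear, never create one by accident.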

Voice (xAI): Speech-to-Text (/v1/stt) and Text-to-Speech (grok-tts), with acoustic features derived from STT word timestamps plus local PCM analysis. A mandatory safety screen runs every check-in, and the crisis handoff is pre-rendered and never generative — a guarded script with a demo-truthful 988 line and a persistent SIMULATION badge, so nothing improvised ever plays to someone in distress.
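As a rough illustration of deriving acoustic features from word timestamps, the sketch below computes speech rate and pause statistics. The WordTiming shape is a hypothetical stand-in and is not the xAI STT response schema; the 0.5-second pause cutoff is likewise an assumed value. Energy, which the text says comes from local PCM analysis, is left out.

```typescript
// Hypothetical timing shape (seconds); not the actual STT response format.
interface WordTiming { word: string; start: number; end: number; }

interface PauseFeatures {
  wordsPerMinute: number;
  pauseCount: number;
  meanPauseSec: number;
}

const PAUSE_MIN_SEC = 0.5; // assumed cutoff for what counts as a pause

// Returns null when there is too little speech to measure, so the caller
// can treat the voice channel as degraded instead of inventing a value.
function pauseFeatures(words: readonly WordTiming[]): PauseFeatures | null {
  const first = words[0];
  const last = words[words.length - 1];
  if (first === undefined || last === undefined || words.length < 2) return null;
  const durationSec = last.end - first.start;
  if (!(durationSec > 0)) return null;

  let pauseCount = 0;
  let pauseTotalSec = 0;
  let prev: WordTiming | undefined;
  for (const w of words) {
    if (prev !== undefined) {
      const gap = w.start - prev.end; // silence between consecutive words
      if (gap >= PAUSE_MIN_SEC) {
        pauseCount += 1;
        pauseTotalSec += gap;
      }
    }
    prev = w;
  }
  return {
    wordsPerMinute: (words.length / durationSec) * 60,
    pauseCount,
    meanPauseSec: pauseCount > 0 ? pauseTotalSec / pauseCount : 0,
  };
}
```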

Wearables: A simulator emits physiological summaries that are compared against fixed enrollment baselines (so a slow drift can't quietly become the new "normal"). The body's signal can corroborate a trigger but never drives one on its own.
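The two wearable rules, a baseline frozen at enrollment and corroborate-only semantics, can be sketched as follows. The names (EnrollmentBaseline, restingHrZ, corroborates) and the z-score cutoff of 2 are assumptions for illustration.

```typescript
// Frozen at enrollment and never updated from later readings, so a slow
// drift is always measured against the original reference point.
interface EnrollmentBaseline {
  readonly restingHrMean: number; // beats per minute
  readonly restingHrSd: number;   // beats per minute
}

const Z_CUTOFF = 2; // hypothetical threshold

// z-score of the current resting heart rate against the enrollment baseline.
// Returns null when the reading or baseline is unusable.
function restingHrZ(b: EnrollmentBaseline, currentBpm: number): number | null {
  if (!Number.isFinite(currentBpm) || !Number.isFinite(b.restingHrMean) || !(b.restingHrSd > 0)) {
    return null;
  }
  return (currentBpm - b.restingHrMean) / b.restingHrSd;
}

// The physiological signal can only strengthen a trigger that a detector
// has already fired; it can never fire one by itself.
function corroborates(z: number | null, detectorFired: boolean): boolean {
  return detectorFired && z !== null && Math.abs(z) >= Z_CUTOFF;
}
```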

Evaluation: An adversarial harness with hand-curated gold for the catastrophic buckets (suicidality, mania), run blind, producing an OAIP-style report with Wilson confidence intervals and a separate, explicitly empty physician-gate panel.
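For readers unfamiliar with the Wilson score interval mentioned above, here is a small self-contained sketch using the standard formula. The function name and the default z of 1.96 (a two-sided 95% interval) are choices made for this example, not details taken from the project's harness.

```typescript
interface Interval { lower: number; upper: number; }

// Wilson score interval for a binomial proportion.
// successes: number of concordant cases; n: total cases; z: normal quantile.
function wilsonInterval(successes: number, n: number, z = 1.96): Interval {
  if (!Number.isInteger(successes) || !Number.isInteger(n) || n <= 0 || successes < 0 || successes > n) {
    throw new RangeError("need integers with 0 <= successes <= n and n > 0");
  }
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}
```

As a hypothetical example, a perfect 100 out of 100 still gives a lower bound of about 0.963, which shows why a high point estimate on a small set cannot be placed next to a 0.98 bar.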

Surface: Next.js (App Router), React, and Tailwind, generated with v0 and deployed on Vercel — a patient check-in, a clinician dashboard, and the OAIP report, plus a ?demo=baked path so the demo never depends on a live network call.

We grounded every rule in the real executed agreement: we OCR'd the signed PDF and clause-mapped each obligation to its section. To build fast, we froze a TypeScript "Tier-0" contract first, then fanned out seven parallel agents (Cursor) against it.

Challenges we ran into

Designing fail-closed is harder than it sounds. A 61-pass adversarial review found fail-open as the systemic root cause — dozens of subtle paths where a missing signal could slip into a clear. Fixing it meant inverting the default: the system holds unless it can prove safety, and clear is the last branch, not the first.

The agreement was a scanned PDF with no text layer. We OCR'd it and reconstructed its tables, and then discovered that our med-safety demo persona was on atomoxetine, which isn't actually on the agreement's formulary. She would have failed the pre-gate before our detector ever ran. We switched her to an in-formulary SNRI (venlafaxine, which can genuinely raise heart rate), and the demo beat survived, more accurate than before.

Coordinating parallel agents created its own failure modes. A contract amendment for the cadence cap rippled across tracks; multiple agents independently built their own copies of the personas; the Next.js shell got scaffolded ad hoc; a colon in the workspace path silently broke the build tooling; and a stub-vs-real detector import collision could have made the whole pipeline look real while running on stubs. Catching these before integration was as important as the features themselves.

xAI's voice API took some reverse-engineering — TTS is a native /v1/tts call (not OpenAI-compatible), STT is a dedicated /v1/stt endpoint, and batch STT returns no per-token confidence, which makes the live confidence-degradation path coarser than the design wants.

Staying honest under time pressure was the hardest discipline. Our internal red-team concordance came out high — and the tempting move is to show it next to the regulator's 0.98 bar. We didn't. The report leads with the confidence interval (whose lower bound sits below 0.98), keeps the regulatory bars only on the empty physician-gate panel, and labels the whole thing an internal pre-screen — never a substitute for Utah's real 250-case physician review.

What we learned

The biggest lesson was that safety-critical AI is mostly about what you refuse to let the model decide. The load-bearing parts of Sentinel are deterministic; the LLM is a sensitizer, not a barrier.

The most surprising lesson came from the agreement itself. Its concordance metric is asymmetric: over-escalation — holding a refill a doctor would have approved — explicitly does not count against the 98% goal. The regulator says, in writing, that a "risk-averse safety slant in product" is fine. That single clause validated our entire fail-closed design and changed how we scored ourselves: we benchmark only the unsafe direction, because that's the one Utah actually measures.
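One way to express that asymmetry in a scorer is sketched below: a case counts against concordance only when the system was less cautious than the physician. The action names, the severity ordering, and the function name are our assumptions for illustration; the agreement's exact scoring rules are not reproduced here.

```typescript
type Action = "clear" | "hold" | "escalate";

// Assumed ordering of caution, lowest to highest.
const SEVERITY: Record<Action, number> = { clear: 0, hold: 1, escalate: 2 };

interface ScoredCase { system: Action; physician: Action; }

interface Concordance {
  concordant: number;
  total: number;
  rate: number | null; // null when there are no cases: no evidence, no number
}

// Over-escalation (system more cautious than the physician) counts as
// concordant; only under-escalation is scored as a miss.
function unsafeDirectionConcordance(cases: readonly ScoredCase[]): Concordance {
  let concordant = 0;
  for (const c of cases) {
    if (SEVERITY[c.system] >= SEVERITY[c.physician]) concordant += 1;
  }
  const total = cases.length;
  return { concordant, total, rate: total > 0 ? concordant / total : null };
}
```

Returning null for an empty case list keeps the empty physician-gate panel honestly empty rather than showing a misleading 0% or 100%.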

We also learned that voice is a real signal, not a UI flourish — paralinguistics are triage features — and that honest evaluation means confidence intervals over point estimates, hand-curated gold over self-grading, and never confusing an internal pre-screen with the regulatory gate.

What's next

Wire the live streaming voice path for real per-token confidence; add a true per-patient voice baseline so prosody is interpreted relative to each person rather than population cutoffs; integrate a real identity-verification vendor and wearable feeds; and run the actual phased physician-review gate that turns our empty panel into real numbers. The architecture is built to be audited against the executed agreement — clause by clause — which is exactly what a real deployment would require.
