Inspiration

In a real emergency you can't read a web page. Your hands are busy, you're panicking, and a wrong instruction can kill. Open models are almost good enough, but "mostly right" is unacceptable for CPR or choking, and fine-tuning is slow and still probabilistic.

What it does

Lifeline is a hands-free, voice-first first-aid assistant that covers 17 emergencies and speaks protocol-correct steps back to you. Every step is checked against the official Red Cross, AHA, and NHS protocols before you hear it, so it never speaks an unverified instruction.

How we built it

We keep Google's DiffusionGemma-26B frozen and spend compute at inference time (denoising depth times best-of-N sampling); a deterministic verifier then keeps only the answers that pass the protocol. If each sample is independently correct with probability p, the chance that at least one of N samples is correct is 1 - (1 - p)^N, so cheap extra samples plus a near-free verifier take single-shot 79.6% to 98.45% with zero training. The live app even runs the recognizer and protocols client-side.
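The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `best_of_n_success`, `select_verified`, and the `generate` and `verify` callables are hypothetical names, and the formula assumes independent samples and a verifier that accepts exactly the correct answers. The measured 98.45% is an empirical result and need not equal this idealized value for any particular N.

```python
from typing import Callable, Optional


def best_of_n_success(p: float, n: int) -> float:
    """Idealized chance that at least one of n independent samples is correct,
    given per-sample accuracy p and a perfect verifier."""
    return 1.0 - (1.0 - p) ** n


def select_verified(
    generate: Callable[[], str],    # hypothetical: draws one answer from the frozen model
    verify: Callable[[str], bool],  # hypothetical: deterministic protocol check
    n: int,
) -> Optional[str]:
    """Return the first candidate that passes the verifier, or None.

    Returning None (instead of the "least bad" candidate) is what keeps an
    unverified instruction from ever being spoken.
    """
    for _ in range(n):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Per-sample accuracy taken from the write-up; N values are illustrative.
    for n in (1, 2, 3, 4):
        print(n, round(best_of_n_success(0.796, n), 4))
```

The design point is that the loop fails closed: if none of the N samples passes, nothing is returned, and the caller has to handle that case explicitly.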

Challenges we ran into

vLLM can't serve the diffusion architecture, and the 26B model only fits with explicit single-GPU placement, not naive offload. The verifier was the hard part: a negation-blind check let a fluent-but-dangerous "withhold CPR" answer slip through, which we fixed with clause-scoped negation and forbidden-action rules.
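To make the negation problem concrete, here is a minimal sketch of a clause-scoped check with forbidden-action rules. It is an assumption-laden illustration rather than Lifeline's verifier: the cue list, the clause splitter, and the example phrases are all hypothetical, and a real rule set would be derived from the protocols themselves.

```python
import re

# Hypothetical negation cues; a real list would be curated per protocol.
NEGATION_CUES = re.compile(r"\b(do not|don't|never|withhold|avoid|stop|no)\b")
# Hypothetical clause boundaries: punctuation plus a few conjunctions.
CLAUSE_SPLIT = re.compile(r"[.;,!?]|\bbut\b|\band then\b")


def clauses(text: str) -> list[str]:
    """Lowercase the answer and split it into rough clauses."""
    return [c.strip() for c in CLAUSE_SPLIT.split(text.lower()) if c.strip()]


def is_negated(clause: str, phrase: str) -> bool:
    """True if a negation cue precedes the phrase inside the same clause."""
    idx = clause.find(phrase)
    return idx >= 0 and NEGATION_CUES.search(clause[:idx]) is not None


def verify(answer: str, required: list[str], forbidden: list[str]) -> bool:
    """Accept only if every required action is affirmed and never negated,
    and no forbidden action is affirmed."""
    cs = clauses(answer)
    for phrase in forbidden:
        if any(phrase in c and not is_negated(c, phrase) for c in cs):
            return False  # forbidden action is recommended
    for phrase in required:
        hits = [c for c in cs if phrase in c]
        if not hits or any(is_negated(c, phrase) for c in hits):
            return False  # required action missing or negated
    return True


if __name__ == "__main__":
    rules = {"required": ["cpr"], "forbidden": ["give food"]}
    print(verify("Call 911. Start CPR now. Do not give food or water.", **rules))
    print(verify("Withhold CPR until help arrives.", **rules))
```

A negation-blind keyword match would accept the second answer because it contains "cpr"; scoping the negation cue to the clause that holds the phrase rejects it, while a "do not" in a neighbouring clause leaves an affirmed instruction intact.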

Accomplishments that we're proud of

On a frozen open model with zero training, inference-time compute reaches 98.45% verified, and the recipe generalizes (Qwen2.5-7B went from 65% to 98.3%). On a 42-candidate adversarial set the deterministic verifier caught all 12 fluent-but-dangerous answers with zero unsafe leaks, and the whole thing ships as a live, installable web app with passing CI.

What we learned

More compute only becomes more safety if the selector is a rule, not a vibe: an LLM-as-judge gets reward-hacked by confident wrong text and flip-flops from run to run. The verifier is the load-bearing piece, because a cheap, deterministic check is what makes best-of-N both scalable and trustworthy.

What's next for Lifeline

More protocols and languages, on-device deployment, and a live 911 hand-off with confidence-calibrated follow-up questions. The deeper bet is that this is a general reliability harness: point it at any task with a checkable protocol (code plus tests, structured extraction, compliance) and frozen open models become safe to ship.
