NightCanary

Inspiration

The brief said to avoid generic lifestyle advice and support specific, realistic action. That killed our first instinct, which was another symptom checker. So we flipped the question: where in the UK is the gap between vague symptoms and a working NHS pathway actually fatal?

Sleep apnoea kept coming up. About 1 in 25 UK adults have it, and roughly 85% are undiagnosed. The symptoms are easy to misread, since tiredness, snoring, low mood and morning headaches all get pinned on stress or age or having young kids. People often get a "let's see how it goes" or two before anyone takes it seriously, and NHS sleep clinic waits run 6 to 18 months. In that gap people have strokes and heart attacks, or fall asleep at the wheel.

What was missing wasn't another diagnostic. It was the prep step before the GP visit: a way to turn a vague worry into structured evidence that fits inside a 10-minute appointment.

What it does

NightCanary is a five-step web app.

1. Voice or text intake. The user describes how they've been feeling and Whisper transcribes it.
2. A short About-You form: age, sex, height and weight (with a live BMI calculator), neck circumference, blood pressure. Five fields, about twenty seconds.
3. An AI-led conversation. Claude asks focused follow-ups, covering only the symptom items the form didn't already capture (snoring, observed apnoeas, daytime sleepiness, the Epworth scenarios). A sidebar checklist ticks each validated screening item as the user confirms it, so judges can watch the AI map plain speech onto clinical instruments in real time.
4. Overnight pulse oximetry, via CSV upload, live USB streaming through the Web Serial API, or a seeded sample for an instant demo.
5. Results: a risk band (low, moderate or high), the overnight SpO₂ chart with desaturation markers, a plain-English explanation, and a NICE-aligned GP referral letter the user can download or email.
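
The desaturation markers and summary numbers on the results screen come down to plain arithmetic over the overnight trace. Here is a rough sketch of that idea, not our production scoring code. It assumes one SpO₂ sample per second, a drop of 3 percentage points as the event threshold, and a rolling baseline taken as the highest reading in the preceding two minutes. Those parameters, and every name in the block (OxiSample, summariseOximetry and so on), are illustrative assumptions.

```typescript
// Hypothetical sample shape: one reading per second.
interface OxiSample {
  t: number;    // seconds since the recording started
  spo2: number; // oxygen saturation in percent, e.g. 96
}

interface DesatEvent {
  startT: number; // time the drop was first detected
  nadir: number;  // lowest SpO2 seen during the event
}

interface OxiSummary {
  events: DesatEvent[];
  odi: number;     // desaturation events per hour of recording
  t90: number;     // percentage of samples with SpO2 below 90
  minSpo2: number; // lowest reading of the night
}

// Assumed parameters, chosen for illustration only.
const DROP_THRESHOLD = 3;    // percentage points below baseline
const BASELINE_WINDOW = 120; // samples (seconds) used for the rolling baseline

function summariseOximetry(samples: OxiSample[]): OxiSummary {
  if (samples.length === 0) {
    return { events: [], odi: 0, t90: 0, minSpo2: NaN };
  }
  const events: DesatEvent[] = [];
  let current: DesatEvent | null = null;
  let below90 = 0;
  let minSpo2 = Infinity;

  for (let i = 0; i < samples.length; i++) {
    const s = samples[i];
    minSpo2 = Math.min(minSpo2, s.spo2);
    if (s.spo2 < 90) below90++;

    // Baseline: highest reading in the preceding window, excluding this sample.
    let baseline = s.spo2;
    for (let j = Math.max(0, i - BASELINE_WINDOW); j < i; j++) {
      baseline = Math.max(baseline, samples[j].spo2);
    }

    const desaturated = baseline - s.spo2 >= DROP_THRESHOLD;
    if (desaturated && current === null) {
      current = { startT: s.t, nadir: s.spo2 }; // a new event begins
      events.push(current);
    } else if (desaturated && current !== null) {
      current.nadir = Math.min(current.nadir, s.spo2); // event continues
    } else {
      current = null; // recovered: the next drop counts as a new event
    }
  }

  const hours = (samples[samples.length - 1].t - samples[0].t + 1) / 3600;
  return {
    events,
    odi: events.length / hours,
    t90: (below90 / samples.length) * 100,
    minSpo2,
  };
}
```

Each entry in events gives the chart one marker position, and odi, t90 and minSpo2 are the kind of numbers a risk band can be built from. The baseline rule here is deliberately simple; a clinical implementation would need to handle artefacts and very long desaturations.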

There's also a standalone /compare page showing a healthy night next to a moderate-OSA night, and a /record page that streams from a real USB pulse oximeter and saves to localStorage so an overnight session survives a browser refresh.
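
Surviving a refresh means writing each reading somewhere durable as it arrives and reading it back on load. The sketch below shows one way to do that. It assumes a hypothetical storage key and sample shape, and it takes the store as a parameter (the KeyValueStore interface is our illustration; the browser's localStorage satisfies it) so the logic can run outside a browser.

```typescript
// Minimal subset of the Web Storage interface; window.localStorage matches it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical shape of one streamed reading.
interface RecordedSample {
  t: number;     // seconds since recording start
  spo2: number;  // percent
  pulse: number; // beats per minute
}

const RECORDING_KEY = "nightcanary:recording"; // hypothetical key name

// Read back whatever has been saved so far; an empty array if nothing usable is there.
function loadRecording(store: KeyValueStore): RecordedSample[] {
  const raw = store.getItem(RECORDING_KEY);
  if (raw === null) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? (parsed as RecordedSample[]) : [];
  } catch {
    return []; // corrupted entry: start a fresh recording
  }
}

// Append one reading and persist the whole recording again.
function appendSample(store: KeyValueStore, sample: RecordedSample): void {
  const samples = loadRecording(store);
  samples.push(sample);
  store.setItem(RECORDING_KEY, JSON.stringify(samples));
}
```

In the browser the call would be appendSample(localStorage, sample). Rewriting the full array on every reading is the simplest thing that works; batching writes every few seconds would be kinder to a full-night recording.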

How we built it

The stack is Next.js 16 (App Router), TypeScript throughout, and Tailwind v4. We hand-rolled a small UI library (button, card, badge, progress) because getting shadcn working with v4 wasn't worth the time.

Claude Sonnet 4.6, via the Anthropic SDK, does three jobs: running the conversation, pulling structured STOP-BANG and Epworth answers out of the transcript, and writing the GP letter. Whisper handles voice-to-text in about ten lines. Recharts draws the overnight SpO₂ chart, react-markdown with remark-gfm renders the letter, and the Web Serial API talks to the pulse oximeter.

The scoring is a pure TypeScript clinical library with no LLM involved: ODI per NICE NG202, STOP-BANG per Chung 2008, Epworth per Johns 1991, plus T90 and minimum SpO₂ feeding the risk band. Eighteen Vitest tests cover it.
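
To give a flavour of what "no LLM involved" means, here is a simplified sketch of the two questionnaire scores. It is not a copy of our library: the field and function names are hypothetical. The item cut-offs follow the published STOP-BANG questionnaire (BMI over 35, age over 50, neck over 40 cm, male sex), the bands use the commonly cited 0 to 2, 3 to 4 and 5 to 8 split, and the Epworth total uses the usual "above 10" flag for excessive daytime sleepiness.

```typescript
// Field names are hypothetical; one entry per STOP-BANG item.
interface StopBangAnswers {
  snoring: boolean;           // S: loud snoring
  tired: boolean;             // T: daytime tiredness
  observedApnoea: boolean;    // O: someone has seen breathing pauses
  highBloodPressure: boolean; // P: high blood pressure
  bmi: number;                // B: scores when BMI > 35
  age: number;                // A: scores when age > 50
  neckCm: number;             // N: scores when neck circumference > 40 cm
  male: boolean;              // G: male sex
}

type StopBangBand = "low" | "intermediate" | "high";

// BMI from the About-You form values: kg divided by metres squared.
function bmiFrom(heightCm: number, weightKg: number): number {
  const metres = heightCm / 100;
  return weightKg / (metres * metres);
}

// One point per positive item, then a band from the total.
function scoreStopBang(a: StopBangAnswers): { score: number; band: StopBangBand } {
  const items = [
    a.snoring,
    a.tired,
    a.observedApnoea,
    a.highBloodPressure,
    a.bmi > 35,
    a.age > 50,
    a.neckCm > 40,
    a.male,
  ];
  const score = items.filter(Boolean).length;
  const band: StopBangBand = score >= 5 ? "high" : score >= 3 ? "intermediate" : "low";
  return { score, band };
}

// Epworth: eight situations, each rated 0 to 3, summed to a total out of 24.
function scoreEpworth(ratings: number[]): { total: number; excessiveSleepiness: boolean } {
  if (ratings.length !== 8) {
    throw new Error("Epworth needs exactly 8 ratings");
  }
  for (const r of ratings) {
    if (!Number.isInteger(r) || r < 0 || r > 3) {
      throw new Error("Each Epworth rating must be 0, 1, 2 or 3");
    }
  }
  const total = ratings.reduce((sum, r) => sum + r, 0);
  return { total, excessiveSleepiness: total > 10 };
}
```

Because these are pure functions of their inputs, they are easy to unit test and behave the same on every run, which is the property the LLM cannot offer.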

Hosting is Vercel, with an in-memory session store stashed on globalThis so it survives Turbopack hot reloads in dev and warm-instance reuse in production.

That split is the decision everything else followed from: the AI articulates, deterministic code scores, the GP decides.

Challenges we ran into

A lot.

Tailwind v4 and shadcn didn't want to play together. We gave up after half an hour and hand-rolled the components instead.

The Vercel build failed with OPENAI_API_KEY is not set, because the OpenAI SDK was throwing at module load during static analysis. Lazy-initialising both LLM clients fixed it, so the build works whether or not the keys are present.
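
The shape of the fix, as a minimal sketch. FakeLlmClient stands in for a real SDK client whose constructor throws without a key, and lazy, getClient and the env object are hypothetical names for illustration; in a Next.js route the key would come from process.env.

```typescript
// Stand-in for an SDK client that refuses to construct without a key.
class FakeLlmClient {
  readonly apiKey: string;
  constructor(apiKey: string) {
    if (apiKey === "") throw new Error("API key is not set");
    this.apiKey = apiKey;
  }
}

// Defer construction until first use and cache the result afterwards.
function lazy<T>(create: () => T): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) instance = create();
    return instance;
  };
}

// Hypothetical environment bag; a real route would read process.env instead.
const env: Record<string, string | undefined> = {};

// Evaluating this line never throws, even when no key is configured.
// The constructor only runs when a request handler calls getClient().
const getClient = lazy(() => new FakeLlmClient(env["OPENAI_API_KEY"] ?? ""));
```

Handlers call getClient() at request time, so a missing key becomes a runtime error on the affected route rather than a failed build.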

The coverage checklist hallucinated coverage. Claude was ticking STOP-BANG items the user hadn't actually answered: "I don't know, I live alone" still ticked snoring, and daytime tiredness got ticked before the user was even asked. We rewrote the prompt with worked examples and added server-side validation: a cumulative union of covered codes, whitelisted character codes, and explicit evidence requirements with bad and good examples.
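
A rough sketch of what server-side checks like these can look like. The item codes, the CoverageClaim shape and the rule that evidence must be a verbatim quote from the user are assumptions made for illustration, not a transcript of our validator.

```typescript
// Hypothetical single-character codes for the eight STOP-BANG items.
const ALLOWED_CODES = new Set(["S", "T", "O", "P", "B", "A", "N", "G"]);

interface CoverageClaim {
  code: string;     // item the model says has been covered
  evidence: string; // user quote the model offers as proof
}

// Replies that do not answer the question (illustrative list, not exhaustive).
const NON_ANSWERS = /\b(i don'?t know|not sure|no idea)\b/i;

// Merge the model's new claims into the running set, dropping anything unproven.
function mergeCoverage(
  previous: ReadonlySet<string>,
  claims: CoverageClaim[],
  userTranscript: string,
): Set<string> {
  const covered = new Set(previous); // cumulative union: ticked items stay ticked
  const haystack = userTranscript.toLowerCase();

  for (const { code, evidence } of claims) {
    const quote = evidence.trim().toLowerCase();
    if (!ALLOWED_CODES.has(code)) continue;  // not a known item code
    if (quote.length === 0) continue;        // no evidence offered
    if (!haystack.includes(quote)) continue; // the user never said this
    if (NON_ANSWERS.test(quote)) continue;   // "I don't know" is not an answer
    covered.add(code);
  }
  return covered;
}
```

With this in place, a claim of snoring backed by "I don't know, I live alone" is rejected no matter what the model returns, and an item cannot be ticked from words the user never said.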

The chat asked compound questions like "Roughly how old are you, and is your neck over 40 cm?", which confused people. We pulled the bundling out of the prompt and moved all the deterministic facts into the About-You form.

Sessions kept vanishing in dev because Turbopack's hot reload wiped the in-memory Map on every save. The fix is the standard Next.js trick: stash the store on globalThis so it survives module reloads.
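
In outline the trick looks like this. The property name __nightCanarySessions and the CanarySession shape are hypothetical; the point is only that the Map hangs off globalThis, which outlives a module reload, rather than off a module-level variable, which does not.

```typescript
// Hypothetical session shape.
interface CanarySession {
  id: string;
  createdAt: number;
}

// Typed view of globalThis with our store attached under a hypothetical property.
const globalStore = globalThis as typeof globalThis & {
  __nightCanarySessions?: Map<string, CanarySession>;
};

// Reuse the Map if an earlier evaluation of this module already created it.
function getSessionStore(): Map<string, CanarySession> {
  if (!globalStore.__nightCanarySessions) {
    globalStore.__nightCanarySessions = new Map<string, CanarySession>();
  }
  return globalStore.__nightCanarySessions;
}
```

Every caller goes through getSessionStore(), so a hot reload re-runs the module but finds the existing Map instead of creating an empty one.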

The pulse oximeter refused to talk over USB at first. We wrote a CMS50-family serial parser and a CSV fallback, so the demo holds up either way.
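
The CSV side of that fallback is the simpler half. As a sketch, assuming a header row followed by seconds, SpO₂ and pulse columns (this layout and the names below are assumptions; real oximeter exports vary by vendor):

```typescript
interface CsvOxiRow {
  t: number;     // seconds since recording start
  spo2: number;  // percent
  pulse: number; // beats per minute
}

// Empty cells become NaN so they are rejected rather than read as zero.
function toNumber(cell: string): number {
  const trimmed = cell.trim();
  return trimmed === "" ? NaN : Number(trimmed);
}

// Assumed layout: one header line, then "seconds,spo2,pulse" per row.
function parseOximetryCsv(text: string): CsvOxiRow[] {
  const rows: CsvOxiRow[] = [];
  const lines = text.split(/\r?\n/).slice(1); // skip the header

  for (const line of lines) {
    if (line.trim() === "") continue;
    const [t, spo2, pulse] = line.split(",").map(toNumber);
    // Skip rows with missing or non-numeric cells.
    if (![t, spo2, pulse].every((v) => Number.isFinite(v))) continue;
    // Skip readings outside the possible SpO2 range, e.g. sensor-off placeholders.
    if (spo2 <= 0 || spo2 > 100) continue;
    rows.push({ t, spo2, pulse });
  }
  return rows;
}
```

Dropping bad rows instead of throwing keeps a single dropout, such as a finger slipping off the sensor, from invalidating a whole night's upload.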

The voice intake had two empty textareas, which was confusing. We collapsed it into one shared field that fills from voice or typing, with a status indicator under the mic.

Every one of these was a real fix, and the commits show it.

Accomplishments that we're proud of

We shipped a deployed, working product in 24 hours that judges can actually click through. The live coverage checklist turned out to be the clearest way to show the AI-safety story without explaining it; people get it the moment they watch the boxes tick. Every clinical threshold is sourced (Chung 2008, Johns 1991, NICE NG202, AASM 2017), and we wrote a 200-line clinical-rules.md so judges can audit the claims. The scoring library passes all 18 of its unit tests.

The /compare page ended up being a better demo opener than the assessment itself, since judges grasp the problem in about five seconds. The GP letter renders as Markdown and reads like a real document rather than source code, so a receiving GP would actually open it. And the build deploys to Vercel with or without API keys, which is a small thing but exactly the kind of detail that matters when someone opens your URL cold.

What we learned

Prompt engineering is fragile, so deterministic safeguards aren't optional. Every "the model will probably do X" assumption broke at least once.

Structured forms beat chat for deterministic facts. Voice and conversation are great for "how do you feel" and terrible for "how old are you". Splitting the journey along that line fixed three usability bugs at once.

Live UI feedback does more in a demo than any amount of talking. The checklist ticking in real time sold the safe-AI story better than we could.

Lazy-init your SDK clients. It's five lines and it prevents a whole category of build failures.

And the hardest part of health AI wasn't the AI. It was deciding what the AI is and isn't allowed to claim, and writing the docs that prove we held the line.

What's next for NightCanary

GP Connect / NHS App integration. The letter is structured Markdown today, one rewrite away from a FHIR Composition that lands straight in the GP's workflow.

Real persistence. Swapping the in-memory session for Redis or Postgres means cold starts won't lose state, and patients could record overnight without leaving a laptop running.

Multi-condition support. The same framework (symptom intake, a cheap consumer sensor, deterministic scoring, a GP letter) extends to COPD (walking SpO₂ plus breathlessness), atrial fibrillation (smartwatch ECG plus a palpitation log), and heart failure (SpO₂, weight, breathlessness). All are under-diagnosed and well suited to home triage.

A local LLM option through Ollama or LM Studio, for NHS settings where patient data can't leave the device. The architecture already allows it, since Claude sits behind a single swappable interface.

Validation against real clinical data. Our "moderate OSA" sample is currently synthetic, so the next research step is working with a UK sleep clinic to test the ODI algorithm against real overnight recordings.