Inspiration

Psychiatrists operate under extreme cognitive load. In each 15-to-30-minute session, they must build rapport, track subtle behavioral cues, recall long patient histories, and map symptoms to hundreds of DSM-5 criteria, all under time pressure.

This environment amplifies cognitive biases such as anchoring, where early impressions dominate and later signals are missed. Misdiagnosis is common for conditions with overlapping symptoms, and many patients spend years receiving ineffective or harmful treatment before correction.

Arden exists because this is a bandwidth problem. Real-time AI can continuously track signals and surface patterns during the session itself, when clinical judgment has the greatest impact.

What it does

Arden is a real-time AI copilot that watches and listens to psychiatric interviews, surfacing diagnostic signals the clinician might miss.

During a session:

  • Video analysis: Overshoot's vision API extracts 28 biometric measurements continuously, spanning eye contact percentage, gaze stability, blink rate, facial tension, posture, fidgeting, breathing patterns, and distress signals
  • Voice analysis: LiveKit's agent framework transcribes speech in real time while detecting 50+ crisis keyword patterns across 6 categories (suicidal ideation, self-harm, hopelessness, severe depression, anxiety crisis, substance crisis)
  • Multimodal fusion: Visual emotion signals flow to the voice agent via data channels, allowing the AI to modulate its tone based on observed patient state
  • Clinician dashboard: Live biometric timeline, differential diagnosis suggestions, and crisis alerts with severity levels and recommended actions

After a session:

  • AI-generated clinical report with DSM-5 codes, treatment suggestions, and follow-up questions
  • Exportable transcript, biometric timeline, and assessment scores

All analysis runs during the session, when intervention can still change outcomes.

How we built it

Vision Pipeline (Overshoot)

Using @overshoot/sdk with Qwen3-VL-30B, we process video at 30 FPS with 15 percent frame sampling, producing ~1-second inference windows at roughly 300 ms latency.
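
For intuition on what that sampling budget means in practice (frame selection is actually handled inside @overshoot/sdk; this is only back-of-the-envelope arithmetic):

```typescript
// Illustrative only: how a 15 percent sampling budget over a 30 FPS stream
// translates into frames per ~1-second inference window. The real pipeline
// delegates frame selection to @overshoot/sdk.
const FPS = 30;
const SAMPLE_FRACTION = 0.15;                    // share of frames sent to the model
const STRIDE = Math.round(1 / SAMPLE_FRACTION);  // keep roughly every 7th frame

function shouldSample(frameIndex: number): boolean {
  return frameIndex % STRIDE === 0;
}

// ~30 * 0.15 ≈ 4-5 sampled frames accumulate per one-second inference window
const framesPerWindow = Math.round(FPS * SAMPLE_FRACTION);
console.log({ STRIDE, framesPerWindow }); // { STRIDE: 7, framesPerWindow: 5 }
```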

A custom 75-line clinical prompt extracts 28 structured biometric fields spanning affect, eye behavior, facial tension, posture, breathing, engagement, and distress. To avoid repeated reactions to static emotional states, we implemented a 60-second temporal memory filter that forwards only novel observations.
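
To make "structured biometric fields" concrete, here is a minimal sketch of the shape each observation is expected to conform to; the field names and value ranges below are illustrative, not the exact production schema:

```typescript
// Illustrative subset of the 28-field observation schema returned by the
// vision prompt. Field names and ranges here are representative examples.
interface BiometricObservation {
  timestamp: string;            // ISO-8601, end of the inference window
  affect: "flat" | "constricted" | "full" | "labile";
  eyeContactPct: number;        // 0-100, share of window with camera-directed gaze
  gazeStability: number;        // 0-1, lower = more scanning/darting
  blinkRatePerMin: number;
  facialTension: number;        // 0-1 composite of brow/jaw tension cues
  posture: "upright" | "slumped" | "rigid" | "restless";
  fidgetingScore: number;       // 0-1
  breathingPattern: "regular" | "shallow" | "rapid" | "held";
  engagement: number;           // 0-1
  distressSignals: string[];    // e.g. ["tearfulness", "psychomotor agitation"]
  // ...remaining fields follow the same pattern
}
```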

Voice Pipeline (LiveKit)

We built a custom PsychiatricAssistant agent class using the LiveKit Agents Framework (Python):

  • Speech-to-text (STT): AssemblyAI Universal Streaming
  • LLM: OpenAI GPT-4.1-mini
  • Text-to-speech (TTS): Cartesia Sonic-3 with a warm, professional voice
  • Voice Activity Detection (VAD): Silero with multilingual turn detection
  • Noise cancellation: BVC telephony mode

The agent supports structured assessments (PHQ-9, GAD-7, C-SSRS) and adapts responses using emotion signals sent over LiveKit data channels.
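
On the frontend side, the fusion hand-off is a small amount of code: emotion signals from the vision pipeline are serialized and published on a LiveKit data channel that the agent listens to. A minimal sketch using livekit-client (the topic name and payload shape are illustrative assumptions):

```typescript
import { Room } from "livekit-client";

// Sketch of the frontend half of the fusion loop: forward a visual emotion
// signal to the voice agent over a LiveKit data channel.
async function publishEmotionSignal(
  room: Room,
  signal: { emotion: string; intensity: number; source: "vision" }
): Promise<void> {
  const payload = new TextEncoder().encode(JSON.stringify(signal));
  await room.localParticipant.publishData(payload, {
    reliable: true,   // low-rate updates, so the reliable channel is fine
    topic: "emotion", // the agent filters incoming data packets by topic
  });
}
```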

Frontend

React 18 + TypeScript + Vite with Tailwind CSS and Radix UI. Real-time biometric visualization uses Recharts. Crisis keyword highlighting uses regex-based detection with severity classification.

Backend

Supabase for Postgres storage and Edge Functions. Session insights generated via Gemini 2.0 Flash (experimental) with structured JSON output.
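
The report-generation step is essentially one Edge Function that sends the transcript and biometric timeline to Gemini and asks for JSON back. A simplified sketch (Deno runtime; the request shape follows Google's public generateContent REST API, while the prompt and response handling are heavily abbreviated relative to the production function):

```typescript
// Sketch of the insight-generation Edge Function (Supabase Deno runtime).
Deno.serve(async (req: Request): Promise<Response> => {
  const { transcript, biometricTimeline } = await req.json();

  const geminiRes = await fetch(
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-exp:generateContent" +
      `?key=${Deno.env.get("GEMINI_API_KEY")}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        contents: [{
          parts: [{
            text:
              "Summarize this session as structured clinical insights.\n" +
              `Transcript: ${transcript}\n` +
              `Biometrics: ${JSON.stringify(biometricTimeline)}`,
          }],
        }],
        // Forces the model to return a single JSON object we can store directly
        generationConfig: { responseMimeType: "application/json" },
      }),
    },
  );

  const data = await geminiRes.json();
  const insights = JSON.parse(data.candidates[0].content.parts[0].text);
  return new Response(JSON.stringify(insights), {
    headers: { "Content-Type": "application/json" },
  });
});
```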

Challenges we ran into

1. Biometric schema consistency

Early prompts produced inconsistent JSON from the vision model. We solved this by defining a strict 28-field output schema and iterating on prompt structure until the model reliably returned valid measurements for every field.

2. Observation flooding

Without filtering, the dashboard received a new observation every second, triggering redundant UI updates and downstream reactions. We implemented a temporal memory system that hashes key observation fields and suppresses duplicates within a 60-second window.
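
A minimal sketch of that filter; the production version hashes the key fields, while here a composite string key stands in, and which fields define "the same observation" is an illustrative choice:

```typescript
// Temporal memory filter: build a key from the fields that define an
// observation's "meaning" and suppress repeats seen in the last 60 seconds.
const WINDOW_MS = 60_000;
const lastSeen = new Map<string, number>(); // observation key -> last emit time

type Observation = { affect: string; posture: string; distressSignals: string[] };

function observationKey(obs: Observation): string {
  return [obs.affect, obs.posture, [...obs.distressSignals].sort().join("|")].join("::");
}

function isNovel(obs: Observation, now = Date.now()): boolean {
  const key = observationKey(obs);
  const prev = lastSeen.get(key);
  lastSeen.set(key, now);
  return prev === undefined || now - prev > WINDOW_MS; // forward only novel observations
}
```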

3. Voice agent latency

Initial turn detection was too aggressive, cutting off patients mid-sentence. We switched to LiveKit's multilingual turn detection model and enabled preemptive generation to reduce perceived latency.

4. Crisis keyword precision

Naive keyword matching produced false positives ("I'm dying to try that restaurant"). We restructured patterns to require word boundaries and added contextual phrases, reducing noise while maintaining sensitivity to genuine crisis language.
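
A trimmed-down sketch of the pattern structure after the fix; the patterns and severity labels shown are illustrative examples rather than the full 50+ pattern set:

```typescript
// Word-boundary anchors prevent matches inside longer, benign phrases
// ("I'm dying to try that restaurant" no longer triggers an alert).
type Severity = "monitor" | "elevated" | "critical";

const CRISIS_PATTERNS: { pattern: RegExp; category: string; severity: Severity }[] = [
  { pattern: /\bwant(?:ed)? to die\b/i,              category: "suicidal ideation", severity: "critical" },
  { pattern: /\bhurt(?:ing)? myself\b/i,             category: "self-harm",         severity: "critical" },
  { pattern: /\bno point in (?:anything|living)\b/i, category: "hopelessness",      severity: "elevated" },
  { pattern: /\bcan'?t stop drinking\b/i,            category: "substance crisis",  severity: "elevated" },
];

function detectCrisisLanguage(utterance: string) {
  return CRISIS_PATTERNS
    .filter(({ pattern }) => pattern.test(utterance))
    .map(({ category, severity }) => ({ category, severity }));
}

// detectCrisisLanguage("some days I just want to die")
//   -> [{ category: "suicidal ideation", severity: "critical" }]
```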

5. Git merge conflicts at 4am

Two parallel feature branches (Overshoot integration and LiveKit voice panel) diverged significantly. We resolved 800+ lines of merge conflicts by isolating shared state and refactoring the Dashboard component.

Accomplishments that we're proud of

  • End-to-end multimodal system with sub-2-second latency
  • 28 structured biometric measurements per observation with enforced schema consistency
  • 50+ crisis patterns with severity levels and actionable guidance
  • Temporal memory system that reduces alert fatigue without losing signal
  • Stable live demo across 8,000+ lines of code built in 24 hours
  • Clinical-grade prompting aligned with DSM-5 criteria and standard assessments

What we learned

Domain-specific prompting is hard. Generic "describe what you see" prompts don't produce clinically useful output. We spent time iterating on prompt structure to get reliable, structured biometric extraction.

Real-time multimodal fusion is architecturally non-trivial. You can't simply merge two data streams; you need a shared state model, deduplication logic, and careful timing to prevent race conditions.

Psychiatrists don't necessarily want more data; they want less cognitive load. Early designs showed too much information. We learned to surface only novel observations and actionable alerts.

Recording consent matters. Even in a hackathon prototype, building the consent flow early forced us to think about real deployment constraints.

What's next for Arden

  1. Deploy the LiveKit agent to production to support concurrent multi-user sessions
  2. Clinical validation pilot with psychiatry residency programs (IRB-friendly research tool)
  3. Fine-tune vision prompts on psychiatric case video datasets for higher accuracy
  4. HIPAA certification audit for clinical deployment
  5. EHR integration (Epic FHIR, Cerner) to incorporate longitudinal patient context
