DrDejaVu — Your AI Medical Memory
Inspiration
We've all walked out of a doctor's office and immediately forgotten half of what was said. "Was my A1c improving? What medication did they adjust? What diet advice did they give six months ago?" Patients lose critical health context between visits, and paper aftercare summaries don't cut it. We wanted to build something that listens, remembers, and speaks back — a medical memory powered by voice AI.
What We Learned
- Orchestrating 4 Eigen AI models (Higgs ASR V3.0, Higgs Audio V2.5, Higgs Audio Understanding V3.5, gpt-oss-120b) into a single coherent pipeline
- Building a voice-first RAG system — embedding spoken consultations with Sentence-Transformers, storing the vectors in ChromaDB, and retrieving them with semantic search
- The importance of chunking strategies — splitting transcripts at sentence boundaries (~1000 chars) dramatically improved retrieval quality (a chunking sketch follows this list), with chunks ranked by cosine similarity:
$$\text{similarity}(q, d) = \frac{q \cdot d}{|q| \cdot |d|}$$
- Converting LLM markdown output into voice-friendly text for natural TTS delivery
- Using Higgs Audio V2.5's voice cloning to make AI responses sound like the patient's actual doctor — building familiarity and trust
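Here is a minimal sketch of that sentence-boundary chunking, assuming a plain regex sentence splitter and the ~1000-character cap mentioned above; the helper name is illustrative, not our exact code:

```python
import re

def chunk_transcript(text: str, max_chars: int = 1000) -> list[str]:
    """Split a transcript at sentence boundaries, keeping chunks near max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk once adding this sentence would exceed the cap
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```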
How We Built It
Architecture: A full-stack voice RAG pipeline in ~1,300 lines of code.
Voice Input → Higgs ASR V3.0 (transcribe)
→ gpt-oss-120b (summarize)
→ ChromaDB (index as vectors)
→ Patient asks a question (voice/text)
→ RAG retrieval (top-10 cosine similarity)
→ gpt-oss-120b (generate contextual answer)
→ Higgs Audio V2.5 (speak the answer in doctor's cloned voice)
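To make the answer path above concrete, here is a runnable skeleton of the chain; the four model calls are local stand-ins for our Eigen API wrappers (not the real SDK), so the async flow can be read end to end:

```python
import asyncio

# Stand-ins for the four Eigen AI calls; the real versions hit the Eigen API.
async def transcribe(audio: bytes) -> str:                      # Higgs ASR V3.0
    return "How has my A1c changed since my last visit?"

async def retrieve(patient_id: str, query: str) -> list[str]:  # ChromaDB top-10
    return ["March consultation: A1c discussed, medication dose adjusted."]

async def generate(query: str, context: list[str]) -> str:     # gpt-oss-120b
    return "At your March visit your doctor noted your A1c was improving."

async def speak(text: str) -> bytes:                            # Higgs Audio V2.5
    return text.encode()  # placeholder for cloned-voice audio

async def answer_question(patient_id: str, question_audio: bytes) -> bytes:
    question = await transcribe(question_audio)
    context = await retrieve(patient_id, question)
    answer = await generate(question, context)
    return await speak(answer)

if __name__ == "__main__":
    print(asyncio.run(answer_question("patient-001", b"...")))
```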
Stack:
- Frontend: React 18 + TypeScript + Vite — Dashboard, Upload, and History pages with real-time voice recording via MediaRecorder API
- Backend: FastAPI (Python) — async endpoints for transcription, chat, and RAG queries
- Vector DB: ChromaDB with cosine similarity + Sentence-Transformers embeddings
- Metadata: SQLite for consultation records
- Deployment: Docker Compose — one command to run everything
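A minimal sketch of the indexing and retrieval layer described in the stack list, assuming a generic Sentence-Transformers checkpoint (the model name and collection settings are placeholders; the values in our repo may differ):

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
client = chromadb.Client()  # in production, a persistent instance behind Docker Compose
collection = client.create_collection(
    name="consultations", metadata={"hnsw:space": "cosine"}  # cosine similarity
)

def index_chunks(patient_id: str, date: str, chunks: list[str]) -> None:
    """Embed transcript chunks and store them with patient/date metadata."""
    collection.add(
        ids=[f"{patient_id}-{date}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"patient_id": patient_id, "date": date} for _ in chunks],
    )

def retrieve_chunks(patient_id: str, question: str, top_k: int = 10) -> list[str]:
    """Return the top-k cosine-similar chunks, scoped to a single patient."""
    result = collection.query(
        query_embeddings=embedder.encode([question]).tolist(),
        n_results=top_k,
        where={"patient_id": patient_id},  # patient-scoped filtering
    )
    return result["documents"][0]
```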
Eigen AI Models Used:

| Model | Role |
|-------|------|
| Higgs ASR V3.0 | Speech-to-text transcription (9.12% WER) |
| Higgs Audio V2.5 | Text-to-speech with voice cloning — responses sound like the patient's own doctor (~150ms latency) |
| Higgs Audio Understanding V3.5 | Tone, sentiment & wellbeing analysis from patient voice |
| gpt-oss-120b | Summarization + RAG-powered chat completions |
Voice Cloning Flow: During consultation upload, the doctor's voice is captured from the audio recording. Higgs Audio V2.5 clones this voice profile (with permission) so that when the patient later asks a question, the AI answer is delivered in their doctor's familiar voice — making the experience feel like a real follow-up conversation, not a robotic chatbot.
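The cloning prompt comes from the consultation audio itself. Below is a hedged sketch of how the doctor's segments can be spliced into a single reference clip (pydub and the helper name are assumptions for illustration; the actual Higgs Audio V2.5 cloning call is not shown because it goes through the Eigen API):

```python
from pydub import AudioSegment  # assumed audio library for this sketch

def extract_doctor_reference(
    consult_path: str,
    doctor_segments: list[tuple[float, float]],  # (start_s, end_s) from the diarized transcript
    out_path: str = "doctor_reference.wav",
) -> str:
    """Concatenate the doctor's speech segments into one clean reference clip,
    which is then used as the voice-cloning prompt for Higgs Audio V2.5."""
    audio = AudioSegment.from_file(consult_path)
    reference = AudioSegment.empty()
    for start_s, end_s in doctor_segments:
        reference += audio[int(start_s * 1000):int(end_s * 1000)]  # pydub slices in milliseconds
    reference.export(out_path, format="wav")
    return out_path
```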
Challenges We Faced
- Audio pipeline orchestration — Coordinating transcribe → summarize → index → retrieve → generate → speak across 4 different models required careful async handling
- Voice cloning quality — Extracting a clean doctor voice profile from two-speaker consultation audio required isolating the doctor's segments to use as the cloning input
- TTS formatting — LLM responses contain markdown, emojis, and bullet points that sound terrible when spoken aloud. We built a conversion layer to produce voice-friendly text (a sketch follows this list)
- RAG retrieval quality — Early attempts returned irrelevant chunks. Adding date metadata and hybrid retrieval (transcript chunks + summaries) with patient-scoped filtering (where: {patient_id}) fixed it
- Latency budget — A voice-first UX demands fast responses. Higgs Audio V2.5's ~150ms first-token latency was critical for keeping the experience conversational
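A condensed version of the TTS conversion layer mentioned above; the regexes are illustrative and our full layer handles more cases:

```python
import re

def to_voice_friendly(markdown: str) -> str:
    """Flatten LLM markdown into plain sentences that read naturally when spoken."""
    text = re.sub(r"`{3}.*?`{3}", "", markdown, flags=re.DOTALL)      # drop code blocks
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)              # keep link text, drop URLs
    text = re.sub(r"[*_`#>]+", "", text)                              # strip markdown symbols
    text = re.sub(r"^\s*[-•]\s*", "", text, flags=re.MULTILINE)       # remove bullet markers
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)  # strip common emoji ranges
    text = re.sub(r"\s*\n+\s*", ". ", text)                           # join lines into sentences
    text = re.sub(r"\.\s*\.", ".", text)                              # collapse doubled periods
    return re.sub(r"\s{2,}", " ", text).strip()
```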
Built With
- all-eigen-ai-integrations
- python
- react
- typescript
- vite