Inspiration
This idea came from one of our friends who is hard of hearing: current captioning tools gave them a wall of text without much context. We also talked about how video game captions put a character's name before each line of dialogue, so we thought about how to bring that into a project.
What it does
It's a live captioning tool with a voice identification engine built on the Deepgram API and the Resemblyzer neural network. The two work in tandem to attach the names of previously enrolled users to live captions.
How we built it
We built it with the following technologies:
Python 3.9 — Backend server, audio processing, voice matching engine
JavaScript (ES6+) / JSX — Frontend UI and real-time audio capture
SQL — Database schema and Row Level Security policies
Frontend
React 19 — Component-based UI with hooks (useState, useEffect, useRef) for state management, real-time transcript rendering, and audio lifecycle control
Vite 8 — Build tool and dev server with hot module replacement
Tailwind CSS (CDN) — Utility-first styling for the landing page, auth forms, transcript panel, and voice profile cards
Web Audio API — Captures live microphone audio in the browser, creates a ScriptProcessorNode that converts float32 PCM samples to 16-bit int16 at 16kHz mono, and streams raw audio bytes over the WebSocket in real time (the conversion math is sketched after this list)
Iconify — Icon library for UI icons (Solar icon set)
DM Sans (Google Fonts) — Typography
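The float32 → int16 conversion mentioned in the Web Audio API item has a mirror image on the backend, where NumPy converts the bytes back before embedding. A minimal sketch of the PCM math (the browser side does the equivalent in JavaScript; function names here are illustrative):

```python
import numpy as np

def float32_to_int16(samples: np.ndarray) -> np.ndarray:
    # Clamp float32 PCM to [-1.0, 1.0], then scale to 16-bit signed integers
    return (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)

def int16_to_float32(samples: np.ndarray) -> np.ndarray:
    # Inverse mapping, applied server-side before the audio reaches the encoder
    return samples.astype(np.float32) / 32768.0
```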
Backend
FastAPI — Python async web framework serving REST API endpoints (/profiles, /sessions, /auth/signup) and a WebSocket endpoint (/ws/transcribe) that handles the entire real-time pipeline
Uvicorn — ASGI server with hot-reload for development
WebSockets (Python websockets library) — Maintains a persistent connection to Deepgram's streaming API, relaying audio from the browser and receiving transcription results back
Speech-to-Text API
Deepgram Nova-3 — Real-time streaming speech-to-text via WebSocket.
Configured with:
- diarize=true — Speaker diarization (labels each word as speaker_0, speaker_1, etc.)
- smart_format=true — Auto-formats numbers, dates, and currencies
- punctuate=true — Adds punctuation
- numerals=true — Converts spoken numbers to digits
- filler_words=true — Captures "um", "uh", etc.
- endpointing=250 — Finalizes results 250ms after silence for low latency
- vad_events=true — Voice activity detection events
- encoding=linear16, sample_rate=16000, channels=1 — 16-bit PCM mono at 16kHz
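These options are passed as query parameters on Deepgram's streaming endpoint. A hedged sketch of how the connection URL could be assembled in Python (the parameter names are Deepgram's documented options; the exact wiring in our backend is more involved):

```python
from urllib.parse import urlencode

params = {
    "model": "nova-3", "diarize": "true", "smart_format": "true",
    "punctuate": "true", "numerals": "true", "filler_words": "true",
    "endpointing": "250", "vad_events": "true",
    "encoding": "linear16", "sample_rate": "16000", "channels": "1",
}
DEEPGRAM_URL = f"wss://api.deepgram.com/v1/listen?{urlencode(params)}"
# The socket is opened with an "Authorization: Token <DEEPGRAM_API_KEY>" header,
# then raw int16 PCM chunks are relayed as binary WebSocket messages.
```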
Voice Identification Engine
Resemblyzer — A pretrained d-vector neural network model (VoiceEncoder) that converts any audio clip into a 256-dimensional voice embedding (a numerical fingerprint of someone's voice). This is the core of how we go from Deepgram's anonymous speaker_0 / speaker_1 labels to real names like "Alex" or "Israr".
How voice matching works step-by-step:
Enrollment — A user uploads a 5-10 second voice sample for each person they want to identify. The audio is loaded via librosa (supports MP3, M4A, WAV, etc. through ffmpeg). We use a windowed averaging technique: the clip is split into overlapping 3-second windows with 1.5-second hops, an embedding is generated for each window, and all embeddings are averaged and L2-normalized. This produces a more stable, representative voice profile than a single embedding. The resulting 256-float vector is base64-encoded and stored in Supabase.
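A condensed sketch of that windowed-averaging step (the function name and structure are illustrative, and it assumes the clip is at least one 3-second window long):

```python
import base64
import librosa
import numpy as np
from resemblyzer import VoiceEncoder

encoder = VoiceEncoder()  # pretrained d-vector model

def enroll_profile(path: str, window_s: float = 3.0, hop_s: float = 1.5) -> str:
    # librosa decodes MP3/M4A/WAV/... via ffmpeg and resamples to 16 kHz mono
    wav, sr = librosa.load(path, sr=16000, mono=True)
    win, hop = int(window_s * sr), int(hop_s * sr)
    # One embedding per overlapping 3 s window, hopping 1.5 s at a time
    embeds = [
        encoder.embed_utterance(wav[start:start + win])
        for start in range(0, len(wav) - win + 1, hop)
    ]
    profile = np.mean(embeds, axis=0)
    profile /= np.linalg.norm(profile)  # L2-normalize the averaged embedding
    # 256 float32 values, base64-encoded for storage in Supabase
    return base64.b64encode(profile.astype(np.float32).tobytes()).decode()
```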
Rolling Audio Buffer — During a live session, the backend keeps a 15-second sliding window of raw PCM audio in memory. As Deepgram returns transcription results with timestamps and speaker labels, we use those timestamps to extract the exact audio segment for each speaker turn from the buffer (with 0.3s padding on each side for context).
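The idea in code, as a simplified sketch (class and method names are ours for illustration, not our exact implementation):

```python
import numpy as np

SR = 16_000            # samples per second
BUFFER_S, PAD_S = 15.0, 0.3

class RollingBuffer:
    """15-second sliding window of int16 PCM, indexed by stream time."""

    def __init__(self):
        self.samples = np.zeros(0, dtype=np.int16)
        self.start_time = 0.0  # stream time (s) of samples[0]

    def append(self, chunk: np.ndarray) -> None:
        self.samples = np.concatenate([self.samples, chunk])
        overflow = len(self.samples) - int(BUFFER_S * SR)
        if overflow > 0:       # drop audio older than 15 s
            self.samples = self.samples[overflow:]
            self.start_time += overflow / SR

    def slice(self, t0: float, t1: float) -> np.ndarray:
        # t0/t1 come from Deepgram word timestamps; pad 0.3 s each side
        i0 = max(int((t0 - PAD_S - self.start_time) * SR), 0)
        i1 = min(int((t1 + PAD_S - self.start_time) * SR), len(self.samples))
        return self.samples[i0:i1]
```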
Audio Accumulation — Rather than matching on a single short utterance (which is unreliable), we accumulate up to 12 seconds of audio per speaker_id across multiple segments. Matching only begins once at least 1.5 seconds of audio has been accumulated for that speaker.
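A minimal sketch of the per-speaker accumulator (names are illustrative):

```python
import numpy as np

SR = 16_000
MAX_ACCUM_S, MIN_MATCH_S = 12.0, 1.5
accumulated: dict[str, np.ndarray] = {}  # Deepgram speaker_id -> int16 PCM

def accumulate(speaker_id: str, segment: np.ndarray) -> np.ndarray | None:
    old = accumulated.get(speaker_id, np.zeros(0, dtype=np.int16))
    buf = np.concatenate([old, segment])[-int(MAX_ACCUM_S * SR):]  # cap at 12 s
    accumulated[speaker_id] = buf
    # Hold off on matching until at least 1.5 s of audio has been collected
    return buf if len(buf) >= MIN_MATCH_S * SR else None
```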
Cosine Similarity Matching — The accumulated audio is preprocessed and converted to a Resemblyzer embedding. This embedding is compared against every enrolled profile using cosine similarity (dot product of normalized vectors). Scores range from 0 to 1 where higher = more similar.
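Because Resemblyzer embeddings come out L2-normalized, the comparison reduces to a dot product:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # For unit-length vectors, the dot product equals the cosine of the angle
    return float(np.dot(a, b))
```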
Three-Gate Acceptance — A match is only accepted if it passes three checks (sketched in code after this list):
Score gate: cosine similarity ≥ 0.65 (the SIMILARITY_THRESHOLD)
Margin gate: the gap between the best and second-best candidate ≥ 0.05 (MIN_MARGIN) — prevents ambiguous "coin flip" matches where two profiles score almost identically
Already-taken gate: if a name is already confirmed for a different speaker_id, that name is skipped and the system tries the next best candidate — prevents two different speakers from getting the same name
Fallback — If no enrolled profile passes all three gates, the speaker is labeled generically as "Speaker 1", "Speaker 2", etc. rather than guessing wrong.
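Putting the gates together, a simplified sketch of the acceptance logic (the exact ordering of checks in our backend may differ):

```python
import numpy as np

SIMILARITY_THRESHOLD, MIN_MARGIN = 0.65, 0.05

def identify(embedding, profiles, confirmed):
    """profiles: {name: enrolled embedding}; confirmed: {speaker_id: name}."""
    ranked = sorted(
        ((float(np.dot(embedding, emb)), name) for name, emb in profiles.items()),
        reverse=True,
    )
    taken = set(confirmed.values())
    for i, (score, name) in enumerate(ranked):
        if score < SIMILARITY_THRESHOLD:        # score gate
            break                               # list is sorted, so no one below passes
        runner_up = ranked[i + 1][0] if i + 1 < len(ranked) else 0.0
        if score - runner_up < MIN_MARGIN:      # margin gate: reject "coin flips"
            break
        if name in taken:                       # already-taken gate
            continue                            # try the next best candidate
        return name
    return None  # caller falls back to "Speaker 1", "Speaker 2", ...
```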
Database & Auth Supabase — Backend-as-a-service built on PostgreSQL, used for: Authentication — Email/password signup and login. The backend uses the Supabase Admin API (service_role key) to create users with auto-confirmed emails, so no email verification is required. The frontend uses @supabase/supabase-js for client-side session management (JWT stored in localStorage, auto-refreshed).
PostgreSQL Database — Two tables:
voice_profiles — Stores each enrolled speaker's name and their Resemblyzer embedding (base64-encoded 256-dim float32 vector), linked to the user via a user_id foreign key to auth.users
chat_sessions — Stores completed conversation transcripts as JSONB arrays (each entry has speaker name, text, timestamp, and color index), along with session duration
Row Level Security (RLS) — Both tables have RLS enabled so users can only read, insert, and delete their own data. The backend bypasses RLS using the service_role key for admin-level operations.
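A sketch of what the auto-confirmed signup can look like with the supabase-py client (an assumption about the exact call shape, not a copy of our endpoint; the service_role key is required for admin operations and must stay server-side):

```python
import os
from supabase import create_client

# service_role key grants admin privileges and bypasses RLS; never ship it to the browser
admin = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])

user = admin.auth.admin.create_user({
    "email": "alex@example.com",       # hypothetical values
    "password": "a-strong-password",
    "email_confirm": True,             # auto-confirm so no verification email is sent
})
```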
Audio Processing
NumPy — Array operations for audio data (int16 ↔ float32 conversion, embedding math, cosine similarity)
librosa — Audio file loading (supports any format via ffmpeg — MP3, M4A, WAV, FLAC, etc.)
ffmpeg — System-level audio codec support used by librosa under the hood
Infrastructure
Git / GitHub — Version control and repository hosting
python-dotenv — Environment variable management from a single .env file (shared between frontend VITE_ vars and backend secrets)
Challenges we ran into
Keeping Deepgram's text timestamps perfectly synced with our backend's raw audio buffer was incredibly complex. We had to engineer a rolling 15-second sliding window and manually add 300ms of padding to each segment to make sure we were slicing exactly the right audio bytes for the neural network.
We kept running into weird matches where two profiles scored similarly. We had to engineer a strict "Three-Gate Acceptance" system (Score Gate, Margin Gate, and an Already-Taken Gate) to filter matches and fall back to generic labels rather than guessing incorrectly.
Relying on a single short audio snippet for voice matching was too unreliable. We solved this by accumulating up to 12 seconds of audio per speaker and only attempting a match once at least 1.5 seconds had been collected.
Accomplishments that we're proud of
We are honestly incredibly proud of the matching logic. Seeing the system reject false positives and confidently lock onto the correct family member's voice in real time, without tanking the backend CPU or causing massive UI lag, was a huge win for us.
What we learned
We learned concepts like sample rate, PCM conversion, and cosine similarity, and turned them into tangible code in our project. We also learned the value of lazy matching: waiting for good data rather than forcing the AI to guess on bad data.
What's next for Defy
Moving forward, we want to implement a summarization feature using the Gemini API and tune our diarization to better match voices, including those of strangers.
Built With
- css
- deepgramapi
- html
- javascript
- python
- resemblyzer
- supabase