Inspiration
Speech is how most people connect with the world, but it fails a lot of people in a lot of moments. People who are mute or have lost their voice cannot be heard. You cannot speak freely in a library or a quiet office. A microphone is useless when it's loud and full of overlapping noise. We also saw that in January 2026 Apple spent nearly $2 billion to acquire Q.ai, an audio AI startup focused on interpreting whispered and even silent speech, in its biggest deal since Beats. A company that size betting billions on silent speech told us this was not niche; it was the next way people will connect. We wanted to build an open, accessible version.
What it does
Conduit reads your silent lip movements through an ordinary webcam and speaks the words aloud. It is open-vocabulary, with no fixed phrase list, and it needs no microphone. That makes it usable by mute users, in private moments where you cannot speak out loud, and in environments too loud for audio. It restores the connection between what you mean to say and being understood.
How we built it
We started with research instead of product code, building a small offline test just to see whether silent open-vocabulary lipreading was even possible. We took a pretrained lip-reading model, Auto-AVSR through the chaplin wrapper, kept it frozen, and wrote a thin Python layer around it. That layer fixes a frame-rate mismatch and pulls out the model's full list of ranked guesses, which the original code normally throws away. Rather than trust one guess, we show a few options and let the user choose. Around the model we built a FastAPI backend that runs it alongside a local LLM, Qwen3 served through Ollama, which cleans up the top guess and re-ranks the rest. The frontend is Next.js with Tailwind and Framer Motion: it records video-only webcam clips, shows the top three options to pick from, and speaks the chosen one aloud through the browser. Everything runs on your device, so nothing is ever uploaded.
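As a rough illustration of the "show a few options" step, here is a minimal sketch of turning a ranked list of guesses into three distinct candidates. The `Hypothesis` type, the `score` field, and `top_candidates` are hypothetical names for this example, not the Auto-AVSR or chaplin API, and the LLM cleanup and re-ranking pass is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One guess from the model (hypothetical shape, for illustration only)."""
    text: str
    score: float  # assumed: higher is better, e.g. a beam-search log-probability


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical guesses compare equal."""
    return " ".join(text.lower().split())


def top_candidates(hypotheses: list[Hypothesis], k: int = 3) -> list[str]:
    """Return up to k distinct, non-empty guesses, best score first."""
    seen: set[str] = set()
    picked: list[str] = []
    for hyp in sorted(hypotheses, key=lambda h: h.score, reverse=True):
        key = normalize(hyp.text)
        if not key or key in seen:
            continue  # skip empty guesses and duplicates of a better guess
        seen.add(key)
        picked.append(hyp.text.strip())
        if len(picked) == k:
            break
    return picked


if __name__ == "__main__":
    guesses = [
        Hypothesis("pass the bat", -1.2),
        Hypothesis("Pass the  bat", -1.5),
        Hypothesis("pass the mat", -1.9),
        Hypothesis("past the pat", -2.4),
    ]
    print(top_candidates(guesses))
```

Deduplicating before cutting to three matters because a beam search often returns several guesses that differ only in casing or spacing, which would otherwise waste the user's choices.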
Challenges we ran into
Open-vocabulary silent lipreading is genuinely unsolved. Many sounds are visually identical on the lips: p, b, and m, for example, all share the same mouth shape, so even a strong model cannot reliably commit to a single transcription. That pushed us toward a design that offers several candidates instead of guessing once. We also had to keep the full capture-to-speech loop fast enough for real conversation. Along the way we caught a frame-rate mismatch in the baseline, where video recorded at one rate was being fed to a model expecting another. We now correct this on a per-clip basis.
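One simple way to correct a frame-rate mismatch per clip is to pick, for each frame the model expects, the nearest earlier source frame. The sketch below shows that idea only; the 25 fps default is an assumption for illustration, `resample_indices` is a hypothetical helper, and our actual fix may differ in detail.

```python
def resample_indices(num_frames: int, src_fps: float, dst_fps: float = 25.0) -> list[int]:
    """Map a clip recorded at src_fps onto frame indices at dst_fps.

    Returns source-frame indices to keep (or repeat), so the clip's duration
    is preserved while the frame count matches what the model expects.
    """
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")
    if num_frames <= 0:
        return []
    duration_s = num_frames / src_fps
    out_count = max(1, round(duration_s * dst_fps))
    # For output frame i, take the source frame showing at the same timestamp.
    return [min(num_frames - 1, int(i * src_fps / dst_fps)) for i in range(out_count)]


if __name__ == "__main__":
    # A 2-second clip recorded at 30 fps, resampled for a 25 fps model.
    indices = resample_indices(60, src_fps=30.0)
    print(len(indices), indices[:8])
```

Because the indices are computed from each clip's own recorded rate, a browser that delivers 30 fps and one that delivers 24 fps both end up on the same time base.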
Accomplishments that we're proud of
We built a working open-vocabulary system on a normal webcam, with no microphone, in one hackathon, aimed at the same problem a $2 billion acquisition was made to solve.
What we learned
The biggest lesson was to test things ourselves instead of trusting the numbers in papers. Those accuracy scores are measured on clear, spoken, well-lit speech, and they fall apart when someone is silently mouthing at a webcam. Sounds like p, b, and m look exactly the same on the lips, so no model can tell them apart from video alone, and we only found that out by measuring it ourselves. The answer wasn't to chase one better guess. It was to accept that the model will never be perfect and design around it, so we show a few good options and let the person pick. We also learned a lot about how speech models work, and that reading other people's code closely is worth it, since that's how we found the frame-rate bug that was hurting accuracy.
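To make the p, b, m ambiguity concrete, here is a toy sketch that collapses letters sharing a mouth shape into one class and groups words that become indistinguishable. It works on spelling rather than real phonemes, and the class table and function names are hypothetical, so treat it as an illustration of the idea and not as how the model works.

```python
# Toy table: letters that share one lip shape map to the same class symbol.
# Only the p/b/m group from the text is included.
LIP_CLASSES = {"p": "B", "b": "B", "m": "B"}


def lip_key(word: str) -> str:
    """Replace each letter with its lip-shape class, leaving other letters as-is."""
    return "".join(LIP_CLASSES.get(ch, ch) for ch in word.lower())


def group_lookalikes(words: list[str]) -> dict[str, list[str]]:
    """Group words that look the same once p, b, and m are merged."""
    groups: dict[str, list[str]] = {}
    for word in words:
        groups.setdefault(lip_key(word), []).append(word)
    return groups


if __name__ == "__main__":
    for key, members in group_lookalikes(["pat", "bat", "mat", "cat"]).items():
        print(key, members)
```

Words that land in the same group are exactly the ones where offering several candidates helps more than a single guess.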
What's next for Conduit
Better accuracy, first in general and then across speakers and languages; lower latency; and phoneme-based prediction with a language model backbone.