Project Architecture

Our Story

What inspired us

We were trying to talk to someone from the Modal team during the hackathon, and it was kinda awkward since there were so many people and so much noise around us. It made us think about what it's like to work in that kind of environment. Voice typing is so much faster than regular typing, and we've been using it for a while, but we thought: how could you do it with that much noise and that many people around? So we built Vizem Flow - an agent that runs in the background and, with one click on the keyboard, reads your lips and outputs text wherever you want it.

Why it matters

Two main selling points:

1/ Awkward or loud scenarios - Library, open office, café, anywhere it's too loud or weird to talk. Vizem Flow is faster, private and less awkward than speaking into your phone.

2/ Social impact - Important for people who have speech difficulties. They can use lip movements to enter text faster than they could by typing.

How we built it

We researched how lips move when you speak and found out about visemes, the distinct mouth shapes that speech sounds produce. We took our own pictures to train our own model, then ran the whole thing on Modal. The pipeline is basically:

$$ \text{keyboard click} \rightarrow \text{video frames} \rightarrow \text{lip geometry} \rightarrow \text{visemes} \rightarrow \text{phonemes} \rightarrow \text{words} $$

MediaPipe gets the lip geometry per frame, we match it to our reference set (from those photos), then map visemes to phonemes. Modal is the nice part: we don't manage any servers. There's one FastAPI endpoint; when you POST an MP4, Modal spins up a container, runs the full pipeline (OpenCV, MediaPipe, our code) in the cloud, and returns the decoded sentence. We use an LLM (OpenAI) plus Supermemory to turn the phoneme sequence into a real sentence. Supermemory is also how the model is meant to adapt to each user's speaking style over time, so accuracy should improve with use.
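
To make the matching step concrete, here's a rough sketch of the "lip geometry to visemes to phonemes" part. It's an illustration, not our exact code: it assumes each frame has already been reduced to two numbers (mouth width and mouth opening), and the reference values, the viseme labels, the phoneme groups and the function names are all made up for the example. Matching by nearest Euclidean distance is also just one simple way to do it.

```python
import math

# Hypothetical reference set: viseme label -> (mouth width, mouth opening).
# Values are invented for illustration, not measured from real photos.
REFERENCE_VISEMES = {
    "closed": (0.40, 0.00),   # lips pressed together
    "open": (0.45, 0.35),     # jaw dropped
    "rounded": (0.25, 0.15),  # lips pursed
    "spread": (0.55, 0.10),   # lips stretched wide
}

# Several phonemes look the same on the lips, so each viseme maps to a group.
# These groups are illustrative, not a complete or authoritative table.
VISEME_TO_PHONEMES = {
    "closed": ["p", "b", "m"],
    "open": ["aa", "ae"],
    "rounded": ["uw", "ow", "w"],
    "spread": ["iy", "eh"],
}


def nearest_viseme(features, references=REFERENCE_VISEMES):
    """Return the reference label closest to `features` (Euclidean distance)."""
    return min(references, key=lambda label: math.dist(features, references[label]))


def frames_to_phoneme_candidates(frames):
    """Map each frame's lip features to the phonemes its viseme could stand for."""
    return [VISEME_TO_PHONEMES[nearest_viseme(f)] for f in frames]
```

Because one viseme covers several phonemes, the output is a list of candidate groups rather than a single phoneme string. That ambiguity is why a later step has to pick real words.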

Challenges

Correctly matching visemes and phonemes to real words was hard - pacing and the transitions between mouth shapes make the matching tricky. Getting the LLM to do the conversion properly (one clean sentence, no junk) also took work. We tuned the smoothing and the prompt to get there.
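
For a feel of what smoothing and a strict prompt can look like, here's a minimal sketch. It's an assumption-heavy example rather than our tuned version: the window size, the minimum run length, the function names and the prompt wording are all hypothetical. It takes a majority vote over neighbouring frames, collapses repeated labels, and then builds a prompt that asks for exactly one sentence.

```python
from collections import Counter
from itertools import groupby


def smooth_labels(labels, window=3):
    """Replace each label with the most common label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        counts = Counter(labels[max(0, i - half): i + half + 1])
        best = max(counts.values())
        winners = [label for label, count in counts.items() if count == best]
        # On a tie, keep the frame's own label if it is among the winners.
        smoothed.append(labels[i] if labels[i] in winners else winners[0])
    return smoothed


def collapse_runs(labels, min_run=2):
    """Merge consecutive duplicates, dropping runs shorter than `min_run` frames."""
    kept = [label for label, run in groupby(labels) if len(list(run)) >= min_run]
    # Dropping a short run can leave identical neighbours; merge those too.
    return [label for label, _ in groupby(kept)]


def build_prompt(phoneme_candidates):
    """Format candidate phoneme groups into a request for a single sentence."""
    groups = " | ".join("/".join(group) for group in phoneme_candidates)
    return (
        "Each group below lists phonemes that look alike on the lips, in spoken order.\n"
        "Reply with the single most likely English sentence and nothing else.\n"
        f"Groups: {groups}"
    )
```

A single stray frame in the middle of a steady mouth shape gets voted out by its neighbours, so a transition doesn't turn into a phantom viseme. The trade-off is that a very short real viseme can be dropped too, which is why the window and run length need tuning.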