SignSpeak
Inspiration
A hearing parent whose child is Deaf spends months waiting for formal ASL classes to begin. In the meantime, they have no practical way to practice fingerspelling at home — existing tools are either static image charts with no feedback, or full ASL course platforms that assume months of commitment before any interactive practice begins. Neither solves the immediate problem: someone who needs to communicate now, wants to practice independently, and needs to know whether what they're doing is actually correct.
SignSpeak was built for that gap. The goal was a system that watches your hand, tells you in real time whether your sign is being read correctly, and gives you something to practice against — without requiring a human teacher or a pre-existing foundation in sign language.
What it does
SignSpeak is a real-time sign language recognition and learning platform built around webcam-based gesture inference.
A user opens the app, holds a letter sign up to their webcam, and the system returns a predicted letter with a confidence score in near real time. There is no upload step and no separate HTTP request per frame: frames stream over a persistent local WebSocket connection to a FastAPI backend, which runs inference on them as they arrive.
Beyond raw recognition, the platform includes a structured learning flow: reference materials for each letter, a timed practice mode for building muscle memory, and a Hangman-style challenge mode where the game state advances only when the correct sign is held with sufficient confidence. Progress is tracked within the session so users can see which letters they consistently miss.
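As a rough illustration of how a challenge mode can advance only on confirmed signs, here is a minimal sketch of the game state. The class name `HangmanRound`, the `max_misses` limit, and the rule that each confirmed sign counts as one guess are assumptions for illustration, not details taken from the project.

```python
from typing import Set


class HangmanRound:
    """Hypothetical game state that changes only on confirmed signs."""

    def __init__(self, word: str, max_misses: int = 6) -> None:
        self.word = word.upper()
        self.guessed: Set[str] = set()
        self.misses = 0
        self.max_misses = max_misses  # assumed; the write-up gives no limit

    def apply_confirmed_sign(self, letter: str) -> bool:
        """Apply a letter the recognizer already confirmed.

        Returns True if the game state changed, False if the event was ignored.
        """
        letter = letter.upper()
        if self.finished or letter in self.guessed:
            return False  # repeated confirmations of the same letter are no-ops
        self.guessed.add(letter)
        if letter not in self.word:
            self.misses += 1
        return True

    @property
    def masked(self) -> str:
        """The word with unguessed letters shown as underscores."""
        return "".join(c if c in self.guessed else "_" for c in self.word)

    @property
    def finished(self) -> bool:
        """True once the word is solved or the miss limit is reached."""
        return self.misses >= self.max_misses or "_" not in self.masked
```

Because unconfirmed or repeated predictions never reach `apply_confirmed_sign` as state changes, a noisy frame cannot cost the player a guess in this sketch.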
How we built it
The core inference pipeline runs as follows: the webcam feed is captured in the browser and streamed over a WebSocket to the FastAPI backend. Each frame is passed through MediaPipe Hands, which extracts 21 3D landmarks for each detected hand. Those landmark sequences feed into a custom 2-layer LSTM classifier trained on gesture sequences rather than individual frames. The LSTM was chosen specifically because ASL letters like J and Z involve motion, so a single-frame classifier cannot tell them apart from static letters with a similar handshape (J, for example, starts from the I handshape). The model outputs a letter prediction and a confidence score, both of which are returned to the frontend over the same WebSocket connection.
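The step between landmark extraction and the sequence classifier can be sketched as a sliding window over flattened landmark frames. This is a minimal sketch under assumptions: the window length of 30 frames and the names `flatten_frame` and `SequenceBuffer` are hypothetical, and the real MediaPipe and LSTM calls are deliberately left out.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, z) for one hand landmark

NUM_LANDMARKS = 21  # MediaPipe Hands reports 21 landmarks per detected hand
WINDOW_SIZE = 30    # assumed sequence length; the write-up does not state it


def flatten_frame(landmarks: Sequence[Landmark]) -> List[float]:
    """Turn 21 (x, y, z) landmarks into one 63-value feature vector."""
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}")
    return [coord for point in landmarks for coord in point]


class SequenceBuffer:
    """Keeps the most recent frames so a sequence model sees motion, not one pose."""

    def __init__(self, window_size: int = WINDOW_SIZE) -> None:
        self._window_size = window_size
        self._frames: Deque[List[float]] = deque(maxlen=window_size)

    def push(self, landmarks: Sequence[Landmark]) -> Optional[List[List[float]]]:
        """Add a frame; return the full window once enough frames have arrived."""
        self._frames.append(flatten_frame(landmarks))
        if len(self._frames) < self._window_size:
            return None  # not enough history yet to classify
        return list(self._frames)
```

Each returned window has shape (window_size, 63) and would be the input to the sequence classifier. The oldest frame drops out automatically as new frames arrive.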
Training data was collected as sequences of hand landmark coordinates across multiple signers under varied lighting conditions. The model was validated on a held-out test split with per-letter accuracy tracked to identify weak letters and guide retraining. Letters with motion components (J, Z) required augmented sequence data to reach stable accuracy.
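Per-letter accuracy on a held-out split can be computed with a few lines of bookkeeping. The sketch below assumes the evaluation produces (true letter, predicted letter) pairs. The function names and the 0.9 cutoff for a "weak" letter are illustrative choices, not values from the project.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def per_letter_accuracy(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute accuracy per true letter from (true_letter, predicted_letter) pairs."""
    total: Counter = Counter()
    correct: Counter = Counter()
    for truth, predicted in pairs:
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return {letter: correct[letter] / total[letter] for letter in total}


def weakest_letters(accuracy: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Return letters below the (assumed) threshold, worst first."""
    weak = [letter for letter, acc in accuracy.items() if acc < threshold]
    return sorted(weak, key=lambda letter: accuracy[letter])
```

Sorting the weak letters worst-first gives a simple priority order for collecting or augmenting more sequences.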
On the frontend, React and TypeScript handle state across the learning modes. WebSocket lifecycle management required careful work to prevent duplicate game-state updates when prediction events fired faster than UI state could process them — a race condition that took significant debugging to stabilize.
Features
- Real-time sign recognition via webcam with per-prediction confidence scores
- Reference guide for all 26 ASL letters
- Timed practice mode with per-letter completion tracking
- Hangman challenge mode — game state advances only on confirmed correct signs
- Session progress indicators and achievement feedback
Challenges we ran into
The hardest problem was WebSocket state synchronization. Prediction events from the backend fire at inference speed, but game state updates in React don't process that fast — this caused duplicate completion events, phantom notifications, and game state desync. Solving it required debouncing the prediction stream on the backend and adding gesture stability checks so a letter is only confirmed after it has been held at sufficient confidence for multiple consecutive frames, not just detected once.
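One way to debounce confirmed predictions so that a held sign produces a single completion event is an edge-triggered filter with a cooldown. This is a sketch of the general idea, not the project's actual code: the class name `ConfirmationDebouncer` and the one-second cooldown are assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Optional


class ConfirmationDebouncer:
    """Forwards a confirmed letter at most once per hold."""

    def __init__(self, cooldown_s: float = 1.0) -> None:
        self.cooldown_s = cooldown_s  # assumed value, not from the project
        self._last_letter: Optional[str] = None
        self._last_sent_at: float = float("-inf")

    def should_emit(self, letter: Optional[str], now: float) -> bool:
        """Decide whether to send this frame's result to the frontend.

        letter is the confirmed letter for this frame, or None if nothing
        is currently confirmed. now is a timestamp in seconds.
        """
        if letter is None:
            self._last_letter = None  # sign released: re-arm for the next hold
            return False
        is_new_letter = letter != self._last_letter
        cooled_down = now - self._last_sent_at >= self.cooldown_s
        if is_new_letter and cooled_down:
            self._last_letter = letter
            self._last_sent_at = now
            return True
        return False  # same hold, or too soon after the previous event
```

With a filter like this on the backend, the frontend receives one event per deliberate sign instead of one per frame, which removes the main source of duplicate state updates.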
Deploying reference image assets correctly was a separate class of problem. The build pipeline has multiple asset directories serving different purposes — Vite's public folder, build output, and backend-served static assets — and manual image replacements were silently overwritten by the build process until the copy chain was mapped and fixed at the source.
Accomplishments we're proud of
The Hangman mode working end-to-end is the accomplishment that matters most. It's the proof that the system is accurate and stable enough to drive an interactive game — if the classifier were noisy or the WebSocket unreliable, the game would be unplayable. Getting it to a state where a user can actually sit down and play a round without prediction errors breaking the session was the real milestone.
What we learned
The main technical lesson was that real-time ML applications fail at the integration boundary, not at the model level. The LSTM classifier was solid. The failures were all in the WebSocket synchronization, the frame rate management, and the state consistency between backend prediction events and frontend game logic. Building a working system meant solving distributed state problems, not just ML problems.
Responsible AI considerations
SignSpeak operates with two concrete safeguards. First, a confidence threshold is enforced — a prediction is only accepted when the model returns confidence above 0.75 for a minimum of three consecutive frames. A single high-confidence frame is not enough to advance game state or register a practice completion. Second, the system surfaces a low-confidence warning in the UI when predictions are unstable, signaling to the user that their sign is not being read clearly rather than silently returning incorrect output.
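The two safeguards can be sketched as a small stateful gate. The 0.75 threshold and the three-frame requirement come from the description above. The names `StabilityGate` and `GateResult`, and the choice to raise the warning on any single below-threshold frame, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.75  # from the write-up: confidence must be above this
REQUIRED_FRAMES = 3          # from the write-up: minimum consecutive frames


@dataclass
class GateResult:
    confirmed_letter: Optional[str]  # set only once the streak is long enough
    low_confidence: bool             # True when the UI should show a warning


class StabilityGate:
    """Accepts a letter only after enough consecutive confident frames."""

    def __init__(self) -> None:
        self._letter: Optional[str] = None
        self._streak = 0

    def update(self, letter: str, confidence: float) -> GateResult:
        """Feed one prediction; report confirmation and warning status."""
        if confidence <= CONFIDENCE_THRESHOLD:
            # A weak frame breaks the streak (assumed warning policy).
            self._letter, self._streak = None, 0
            return GateResult(None, low_confidence=True)
        if letter == self._letter:
            self._streak += 1
        else:
            # A different letter starts a new streak from one frame.
            self._letter, self._streak = letter, 1
        confirmed = letter if self._streak >= REQUIRED_FRAMES else None
        return GateResult(confirmed, low_confidence=False)
```

A single confident frame never confirms a letter here, and a change of letter or a dip in confidence restarts the count.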
Known limitation: MediaPipe hand landmark extraction degrades under low ambient light and shows reduced accuracy on darker skin tones due to contrast sensitivity in the underlying hand detection model. This is a bias inherited from MediaPipe's training distribution, not from the LSTM classifier. Retraining on a more diverse dataset across skin tones and lighting conditions is the first named item on the roadmap.
What's next for SignSpeak
The immediate priority is full word and sentence recognition, which requires moving from letter-level classification to a sequence-to-sequence model. Per-user progress persistence, mobile support, and additional sign language variants follow from there.