VAST - Visual ASL Speech Translator

Inspiration

This project is a digital voice built for someone we care about. For a close friend who is mute, interacting in a predominantly speaking world can be exhausting. We wanted to build a bridge. Our translator turns silent gestures into audible speech in real-time. This isn’t just a technical challenge; it’s a tool for true inclusivity, ensuring that a speech impairment never means your voice goes unheard.

What it does

Our program provides real-time translation of American Sign Language (ASL) fingerspelling. By capturing individual letter gestures via webcam, the system identifies characters and groups them into words, bridging the communication gap between ASL users and non-signers.

How we built it

We engineered a multi-stage pipeline using:

MediaPipe: For high-fidelity 3D hand landmark extraction and spatial tracking.
PyTorch: To build and train the core neural network responsible for classifying specific hand configurations into ASL letters.
Gemini API: To provide a linguistic "smoothing" layer, taking raw letter predictions and using LLM reasoning to correct typos and format them into coherent sentences.
ElevenLabs: To transform the translated text into high-quality, natural-sounding speech for an accessible user experience.

Challenges we ran into

As a team new to computer vision and machine learning, we faced a steep learning curve in understanding how to translate raw pixel data into meaningful coordinates. A significant hurdle was data variance: ensuring the model could distinguish between visually similar signs (like 'M', 'N', and 'S') across different lighting conditions and hand shapes.

Accomplishments that we're proud of

High-Precision Gesture Classification: Successfully trained a PyTorch model to achieve reliable accuracy on the ASL alphabet, overcoming the "noise" inherent in live video feeds.
Seamless Hardware-to-Software Integration: Developed a robust computer vision algorithm that tracks 21 distinct hand landmarks in real-time, maintaining low latency even during rapid fingerspelling.
Contextual Linguistic Refinement: Leveraged the power of Large Language Models to interpret sequences of letters. This allows the system to not only recognize "signs" but to understand language, using the Gemini API to provide autocorrect and semantic structure to the final output.
End-to-End Accessibility Loop: Created a functional prototype that moves from physical movement to synthesized speech in seconds, demonstrating a viable path for assistive technology. ## What we learned Building VAST taught us that real-time data is a lot messier than we expected. We had to figure out how to manage an asynchronous pipeline so that the computer vision, LLM, and audio engine could all run at once without lagging or crashing. We also got a crash course in spatial data,learning that raw coordinates from MediaPipe need a lot of filtering and "debounce" logic before they actually become reliable for sign recognition. Most importantly, we learned that for accessibility tools to be useful, they don't just need to be accurate; they have to be fast enough to feel like a natural conversation.

What's next for VAST - Visual ASL Speech Translator

The next step is moving beyond single-letter fingerspelling toward full-sentence ASL recognition, which means tracking how signs move and change over time. We also want to implement a custom-trained model to better handle the specific grammar and syntax of ASL, which is different from English. On the hardware side, we’re looking into optimizing the engine to run on mobile devices so VAST can eventually become a portable, real-time communication tool people can use anywhere.

Built With

elevenlabs
gemini
python

Submitted to

HackUSF 2026

Created by

I built the 4-layer PyTorch neural network that classifies ASL gestures from 63 live MediaPipe coordinates. I engineered wrist-anchoring normalization for position invariance, left-hand mirroring for both dominant hands, and a confidence threshold with stable-frames debounce to eliminate junk predictions.

Steven Yodice-Smith
I built a low-latency audio engine to handle the final stage of the translation. I also implemented a stable-frame debounce system so the program doesn't spam the same letter repeatedly. By running the audio on a dedicated background thread with a bounded queue, I ensured the ML pipeline never has to wait for a sound to finish before processing the next frame, automatically dropping any stale gestures to stay in sync.

Aswath Sundar
Happy Patel

Updates

Aswath Sundar started this project — Mar 29, 2026 12:28 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.