Inspiration
This project is a digital voice built for someone we care about. For a close friend who is mute, interacting in a predominantly speaking world can be exhausting. We wanted to build a bridge. Our translator turns silent gestures into audible speech in real-time. This isn’t just a technical challenge; it’s a tool for true inclusivity, ensuring that a speech impairment never means your voice goes unheard.
What it does
Our program provides real-time translation of American Sign Language (ASL) fingerspelling. By capturing individual letter gestures via webcam, the system identifies characters and groups them into words, bridging the communication gap between ASL users and non-signers.
How we built it
We engineered a multi-stage pipeline using:
- MediaPipe: For high-fidelity 3D hand landmark extraction and spatial tracking.
- PyTorch: To build and train the core neural network responsible for classifying specific hand configurations into ASL letters.
- Gemini API: To provide a linguistic "smoothing" layer, taking raw letter predictions and using LLM reasoning to correct typos and format them into coherent sentences.
- ElevenLabs: To transform the translated text into high-quality, natural-sounding speech for an accessible user experience.
Challenges we ran into
As a team new to computer vision and machine learning, we faced a steep learning curve in understanding how to translate raw pixel data into meaningful coordinates. A significant hurdle was data variance: ensuring the model could distinguish between visually similar signs (like 'M', 'N', and 'S') across different lighting conditions and hand shapes.
Accomplishments that we're proud of
- High-Precision Gesture Classification: Successfully trained a PyTorch model to achieve reliable accuracy on the ASL alphabet, overcoming the "noise" inherent in live video feeds.
- Seamless Hardware-to-Software Integration: Developed a robust computer vision algorithm that tracks 21 distinct hand landmarks in real-time, maintaining low latency even during rapid fingerspelling.
- Contextual Linguistic Refinement: Leveraged the power of Large Language Models to interpret sequences of letters. This allows the system to not only recognize "signs" but to understand language, using the Gemini API to provide autocorrect and semantic structure to the final output.
- End-to-End Accessibility Loop: Created a functional prototype that moves from physical movement to synthesized speech in seconds, demonstrating a viable path for assistive technology. ## What we learned Building VAST taught us that real-time data is a lot messier than we expected. We had to figure out how to manage an asynchronous pipeline so that the computer vision, LLM, and audio engine could all run at once without lagging or crashing. We also got a crash course in spatial data,learning that raw coordinates from MediaPipe need a lot of filtering and "debounce" logic before they actually become reliable for sign recognition. Most importantly, we learned that for accessibility tools to be useful, they don't just need to be accurate; they have to be fast enough to feel like a natural conversation.
What's next for VAST - Visual ASL Speech Translator
The next step is moving beyond single-letter fingerspelling toward full-sentence ASL recognition, which means tracking how signs move and change over time. We also want to implement a custom-trained model to better handle the specific grammar and syntax of ASL, which is different from English. On the hardware side, we’re looking into optimizing the engine to run on mobile devices so VAST can eventually become a portable, real-time communication tool people can use anywhere.
Log in or sign up for Devpost to join the conversation.