Inspiration

We were inspired by the communication barriers faced by Deaf and hard-of-hearing individuals in daily interactions. Many existing solutions rely on expensive gloves or sensors, so we wanted to design a low-cost, camera-only system that anyone could use. Our goal was to make real-time, inclusive communication more accessible through computer vision and deep learning.

What it does

S2S recognizes sign language gestures from a live webcam feed and converts them into spoken words using an on-device machine learning model. It serves as a real-time AI interpreter, bridging the gap between signers and non-signers.

How we built it

We built S2S using Python, OpenCV, and MediaPipe Hands to capture and process real-time video input from a webcam. The system detects 21 hand landmarks per frame, normalizes their coordinates, and converts them into structured numerical features. These features are then passed to a trained machine learning classifier built with scikit-learn, which predicts the corresponding letter in American Sign Language (A–Z).
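The normalization step above can be sketched as follows. This is a minimal illustration, not our exact code: it assumes the landmarks arrive as 21 (x, y) pairs from MediaPipe Hands (landmark 0 is the wrist), and the function name and scaling scheme are illustrative.

```python
def normalize_landmarks(landmarks):
    """Translate landmarks so the wrist is the origin, then scale so the
    largest coordinate magnitude is 1, making the features invariant to
    where the hand sits in the frame and how close it is to the camera.

    `landmarks` is a list of 21 (x, y) pairs; landmark 0 is the wrist.
    Returns a flat 42-value feature vector for the classifier.
    """
    wrist_x, wrist_y = landmarks[0]
    shifted = [(x - wrist_x, y - wrist_y) for x, y in landmarks]
    scale = max(max(abs(x), abs(y)) for x, y in shifted) or 1.0
    return [v / scale for pair in shifted for v in pair]
```

The flattened vector is what gets fed to the scikit-learn classifier, one sample per frame.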

Challenges we ran into

  • Latency issues: Real-time frame processing required optimizing frame skip and model size.
  • Limited compute: Training on a local machine restricted model complexity and batch size.
  • Trouble acquiring data: Interpreting full words requires a large amount of video data and a more complex model; we were unable to get this working, largely because we struggled to obtain the data itself.
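The frame-skip optimization mentioned above amounts to classifying only every Nth frame while still displaying every frame. A simplified sketch of that loop (frame capture is replaced by a plain sequence so it runs anywhere, and the names are our own):

```python
def process_stream(frames, classify, skip=3):
    """Run the classifier on every `skip`-th frame and reuse the last
    prediction in between, trading a little responsiveness for speed."""
    last_prediction = None
    predictions = []
    for i, frame in enumerate(frames):
        if i % skip == 0:  # only pay the inference cost here
            last_prediction = classify(frame)
        predictions.append(last_prediction)
    return predictions
```

With `skip=3`, inference runs on one frame out of three, roughly cutting model compute to a third while the on-screen label updates several times per second.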

Accomplishments that we're proud of

  • Built a fully functional real-time sign-to-speech prototype using only a webcam.
  • Designed a lightweight model that runs smoothly on consumer hardware.
  • Demonstrated the potential for inclusive communication technology without specialized hardware.

What we learned

We learned how to apply computer vision and machine learning to real-time gesture recognition. Using MediaPipe and OpenCV taught us how to process and normalize hand landmarks, while training classifiers in scikit-learn helped us understand model optimization and feature handling. We also learned to balance accuracy and latency through frame-skipping and temporal stability techniques. Integrating text-to-speech gave us insight into end-to-end accessibility systems and the challenges of scaling from letter recognition to full word translation.
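The temporal-stability idea mentioned above can be illustrated with a majority vote over a sliding window of recent per-frame predictions, which suppresses one-off misclassifications before a letter is spoken. This is a minimal sketch; the class name and window size are our own choices, not necessarily what S2S ships.

```python
from collections import Counter, deque

class PredictionSmoother:
    """Keep the last `window` per-frame letter predictions and emit the
    most common one, so a single glitched frame cannot flip the output."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, letter):
        self.history.append(letter)
        # Majority vote over the recent window of predictions.
        return Counter(self.history).most_common(1)[0][0]
```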

What's next for S2S

The next major goal for S2S is to extend recognition beyond individual letters to full words, phrases, and gestures, enabling fluid real-time conversations rather than just spelling. Achieving this will require a larger and more diverse video dataset, improved model architectures (such as recurrent or transformer-based networks), and better temporal modeling to capture the motion dynamics of continuous signing.

We also plan to improve the user interface and accessibility by adding visual feedback, voice control, and customizable gesture mappings. Deploying S2S on mobile and embedded devices like the Raspberry Pi or Jetson Nano would make it more portable and usable in everyday interactions. Finally, collaborating with the Deaf and hard-of-hearing community will be key to refining the system’s accuracy, inclusivity, and real-world practicality.

Built With

python, opencv, mediapipe, scikit-learn