Inspiration

A professional ASL interpreter costs between $50 and $150 per hour and typically requires at least 48 hours of advance notice. For the 500,000+ people in the United States who communicate in ASL, this means that spontaneous, everyday communication, like a quick video call or a conversation at a pharmacy counter, often simply doesn't happen.

Every existing real-time ASL tool either requires expensive specialized hardware, runs through a slow server connection that adds too much latency for real conversation, or sits behind a subscription that recreates the exact cost barrier it claims to solve.

EchoSense is built for the moments that are too small for a scheduled interpreter and too important to abandon. We want anyone with a laptop or phone camera to be able to communicate instantly and freely, with whomever they choose.

What it does

EchoSense is a real-time ASL gesture interpreter that runs entirely in your browser. Open the app, point your camera at a signing hand, and EchoSense does three things simultaneously:

  1. Tracks 21 precise landmarks across your hand in real time using computer vision, rendering a live skeleton overlay so you can see the AI working.
  2. Recognizes ASL gestures and translates them into plain English text on screen: Yes, No, Hello, Stop, Wait, I love you, and more.
  3. Speaks each recognized phrase aloud in a human voice using ElevenLabs, so users can communicate with people who aren't looking at a screen.

Every recognized sign is added to a running transcript that can be copied or cleared, and the live feedback doubles as a practice tool for users who are learning ASL.

How we built it

EchoSense is a fully client-side React and TypeScript application built with Vite.

The computer vision pipeline runs entirely in the browser using Google MediaPipe Tasks Vision. MediaPipe's GestureRecognizer model loads via WebAssembly and processes the webcam feed frame by frame at 30fps, extracting 21 3D hand landmarks per frame and classifying the hand shape into one of its recognized gesture categories.
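As a rough sketch of what each frame yields: MediaPipe's per-frame result pairs candidate gesture labels with the 21 landmarks for each detected hand. The types and the `topGesture` helper below are our own illustrative names mirroring that shape, not MediaPipe's actual API:

```typescript
// Minimal shapes mirroring the per-frame recognizer result (illustrative).
interface Category { categoryName: string; score: number; }
interface Landmark { x: number; y: number; z: number; }
interface RecognizerResult {
  gestures: Category[][];   // one array of candidate gestures per detected hand
  landmarks: Landmark[][];  // 21 landmarks per detected hand
}

// Pull the best gesture label for the first detected hand, or null when
// nothing was detected or the top confidence is below a cutoff.
function topGesture(result: RecognizerResult, minScore = 0.5): string | null {
  const candidates = result.gestures[0];
  if (!candidates || candidates.length === 0) return null;
  const best = candidates.reduce((a, b) => (b.score > a.score ? b : a));
  return best.score >= minScore ? best.categoryName : null;
}

// Example: a frame where the recognizer saw one hand making a thumbs-up.
const frame: RecognizerResult = {
  gestures: [[{ categoryName: "Thumb_Up", score: 0.91 }]],
  landmarks: [Array.from({ length: 21 }, () => ({ x: 0, y: 0, z: 0 }))],
};
console.log(topGesture(frame)); // "Thumb_Up"
```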

We built a gesture debounce system on top of the raw MediaPipe output: a gesture must be held consistently for a threshold number of consecutive frames before it commits to the transcript. This prevents the common problem of false positives during natural hand movement between intentional signs.
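The debounce logic is a small piece of pure state, so it can be sketched independently of MediaPipe. Class and method names here are illustrative, and the default threshold is a placeholder, not our tuned value:

```typescript
// Frame-count debouncer: a gesture only commits after being held for
// `threshold` consecutive frames, suppressing transient detections that
// occur as hands move between intentional signs.
class GestureDebouncer {
  private current: string | null = null;
  private streak = 0;
  private committed: string | null = null;

  constructor(private threshold: number = 10) {}

  // Feed one per-frame classification (null = no gesture). Returns the
  // gesture label only on the frame where it first crosses the threshold,
  // so a held sign fires exactly once.
  push(gesture: string | null): string | null {
    if (gesture === this.current) {
      this.streak += 1;
    } else {
      this.current = gesture;   // classification changed: restart the count
      this.streak = 1;
      this.committed = null;
    }
    if (
      gesture !== null &&
      this.streak >= this.threshold &&
      this.committed !== gesture
    ) {
      this.committed = gesture;
      return gesture;
    }
    return null;
  }
}
```

With a threshold of 3, one or two noisy frames of `"Hello"` return nothing; only the third consecutive frame commits, and further held frames stay silent until the hand changes.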

Voice output is powered by the ElevenLabs Turbo v2 API, which generates natural-sounding speech from the recognized text. We implemented a silent fallback to the browser's built-in Web Speech API if the ElevenLabs call fails, so the app always speaks regardless of network conditions.
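The fallback itself is just a try/catch around two interchangeable speech backends. Here is a minimal sketch of that pattern with both backends injected; the type and function names are ours, not the ElevenLabs or Web Speech APIs:

```typescript
// Any speech backend: takes text, resolves when audio has been requested.
type Speak = (text: string) => Promise<void>;

// Wrap a primary backend (e.g. an ElevenLabs TTS call) with a silent
// fallback (e.g. the browser's built-in speechSynthesis), so speaking
// always succeeds regardless of network conditions.
function withFallback(primary: Speak, fallback: Speak): Speak {
  return async (text: string) => {
    try {
      await primary(text);
    } catch {
      // Swallow the primary failure and speak via the fallback instead.
      await fallback(text);
    }
  };
}
```

In the app, the primary closure performs the network TTS request and plays the returned audio, while the fallback calls the browser's speech synthesis; because both are injected, the policy is testable without either real backend.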

The app is deployed on Vercel with custom Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers configured in vercel.json, which is required for MediaPipe's WebAssembly runtime to access SharedArrayBuffer in a deployed environment.
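Using Vercel's standard `headers` syntax, the relevant fragment of `vercel.json` looks roughly like this (these are the standard COOP/COEP values that make a page cross-origin isolated so `SharedArrayBuffer` is available):

```json
{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" },
        { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" }
      ]
    }
  ]
}
```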

The entire stack: React, TypeScript, Vite, MediaPipe Tasks Vision, ElevenLabs API, TailwindCSS, and TerpAI, deployed on Vercel.

Challenges we ran into

The biggest deployment challenge was MediaPipe's WebAssembly runtime requirement. MediaPipe needs access to SharedArrayBuffer, which browsers block unless the page is served with specific Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers. Getting those headers configured in vercel.json without breaking the React app's routing took careful trial and error.

The second challenge was gesture debouncing. MediaPipe fires a gesture classification on every single frame, which meant raw output produced a flood of detections as hands naturally moved between intentional signs. We had to design a frame-count threshold system that felt responsive without generating false positives. Balancing sensitivity and reliability required testing across different people's hand sizes, lighting conditions, and signing speeds.

We also had to train our model on over 85,000 images of sign language charts so it could translate hand gestures accurately. This proved harder than we expected: training takes days of work, which was a real constraint given the time we had at Bitcamp. As a result, we couldn't yet track two-handed gestures or faster-paced signing, both of which we plan to work on after the hackathon.

Accomplishments that we're proud of

We created the fundamentals of a real-time ASL interpreter in under 30 hours with no prior experience in computer vision.

The live hand skeleton overlay (21 tracked landmarks rendering at 30fps on a canvas element layered over the webcam feed) is something we're genuinely proud of. It makes the AI pipeline visible in a way that's immediately compelling to anyone who sees it.

We're also proud of the ElevenLabs integration: the voice output sounds natural and human, which matters deeply for an accessibility tool. A robotic voice would undermine the dignity of the communication experience we're trying to enable.

Most importantly, the app actually works. You can sign in front of it right now and it responds. For a team that started from scratch at our first hackathon, shipping something that genuinely functions as described feels like the real accomplishment.

What we learned

We learned how browser-based computer vision actually works: how WebAssembly enables running ML models client-side, and how to process a real-time video stream frame by frame without blocking the browser's main thread.

We learned that raw Machine Learning output needs product thinking on top of it. MediaPipe gives you a gesture classification every frame, but turning that into a usable experience requires deliberate design decisions around debouncing, feedback, and error states.

We also learned how to ship fast under real pressure, determining what to cut, what to keep, and how to make decisions as a team.

What's next for EchoSense

The most important next step is expanding vocabulary and accuracy. We want EchoSense to isolate the signer's hands even when other people's hands are moving in the background. MediaPipe's built-in recognizer covers 7 gestures, but a real ASL vocabulary has thousands of signs. The architecture we built is designed to swap in a custom TensorFlow.js model trained on the full ASL alphabet and common words: the MediaPipe landmark pipeline stays the same; only the classification layer changes.
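One way to picture that swap point: if recognition sits behind a small interface, nothing upstream of the classifier ever changes. This is an illustrative sketch with placeholder logic, not our exact code:

```typescript
// A single hand landmark as produced by the landmark pipeline.
interface Landmark { x: number; y: number; z: number; }

// The swap point: 21 hand landmarks in, a gesture label (or null) out.
interface GestureClassifier {
  classify(landmarks: Landmark[]): string | null;
}

// Today: a thin wrapper over the built-in gesture categories (stubbed here;
// the real wrapper reads the recognizer's per-frame output).
class BuiltInClassifier implements GestureClassifier {
  classify(landmarks: Landmark[]): string | null {
    if (landmarks.length !== 21) return null; // not a full hand
    return "Thumb_Up"; // placeholder: real code returns the recognized label
  }
}

// Later: a `CustomTfjsClassifier implements GestureClassifier` backed by a
// model trained on the full ASL alphabet; the landmark pipeline is untouched.
```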

We also want to add mobile support as a Progressive Web App, so a person who communicates in ASL can get real-time translation on their own phone without having to hand the device to anyone.

Longer term, we plan to add support for British Sign Language, French Sign Language, and other regional sign languages, and to partner with schools, hospitals, and pharmacies, the exact environments where the spontaneous communication gap hurts most.

Built With

  • elevenlabs
  • react
  • tailwindcss
  • tensorflow.js
  • terpai
  • typescript
  • vercel
  • vite
  • webassembly