EchoSense

Bridging silence and sound.

Real-time American Sign Language interpretation — no hardware, no install, no cost. Just a camera and a browser.


What it does

EchoSense is a browser-native ASL interpreter that runs entirely on your device. Point any camera at a signing hand and EchoSense tracks 21 hand landmarks using Google MediaPipe at 30fps, recognizes ASL gestures, translates them to text, and speaks each one aloud in a natural human voice using ElevenLabs.

The app supports three translation modes:

  • Phrase mode — recognizes whole ASL gestures and common signs instantly. Ignores individual letter signs entirely and always prioritizes word-level interpretation.
  • Spell mode — users fingerspell words letter by letter using the full ASL alphabet. Each letter requires a 2-second hold, then the completed word is auto-corrected to the closest English word using AI.
  • Sentence mode — signs accumulate into a buffer, pass through a custom ASL grammar parser, and a language model renders the parsed structure as a grammatically correct English sentence, handling ASL's topic-comment structure, negation, and dropped pronouns automatically.
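The three modes above differ mainly in which signs they accept and where the output goes. A minimal sketch of that routing logic, assuming hypothetical names (`TranslationMode`, `routeSign`) that are not EchoSense's actual API:

```typescript
// Illustrative sketch only — names are assumptions, not the project's real code.
type TranslationMode = "phrase" | "spell" | "sentence";

interface RecognizedSign {
  label: string;     // e.g. "HELLO", or a single letter like "A"
  isLetter: boolean; // true for fingerspelled alphabet signs
}

function routeSign(mode: TranslationMode, sign: RecognizedSign): string | null {
  switch (mode) {
    case "phrase":
      // Phrase mode ignores letters and emits word-level signs immediately.
      return sign.isLetter ? null : sign.label;
    case "spell":
      // Spell mode only accepts letters; word assembly happens elsewhere.
      return sign.isLetter ? sign.label : null;
    case "sentence":
      // Sentence mode buffers everything for the grammar pipeline.
      return sign.label;
  }
}
```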

EchoSense also integrates TerpAI, UMD's generative AI gateway built on Microsoft Azure, which monitors the full conversation history and generates three contextual alternative sentence suggestions after each translated phrase — making the app smarter the longer you use it.


How we built it

EchoSense is a fully client-side React + TypeScript application built with Vite.

The computer vision pipeline runs in the browser via MediaPipe Tasks Vision, loading Google's GestureRecognizer model through WebAssembly. The model extracts 21 3D hand landmarks per frame at 30fps with no server round-trip. On top of the raw landmark output, we built a geometric rule-based classifier for the full ASL alphabet and numbers, and scaffolded a TensorFlow.js inference layer that automatically activates when trained CNN or LSTM model files are dropped into the project.
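To give a flavor of the geometric rules, here is a toy version of one check over MediaPipe's 21 hand landmarks, assuming an upright hand in image coordinates (y increases downward). The helper names are illustrative, not the classifier's actual code:

```typescript
// Sketch of a geometric rule over MediaPipe's 21 hand landmarks.
// Assumption: upright hand, image coordinates with y increasing downward.
interface Landmark { x: number; y: number; z: number }

// MediaPipe landmark indices: tip/PIP-joint pairs for index, middle, ring, pinky.
const FINGERS = [
  { tip: 8, pip: 6 },   // index
  { tip: 12, pip: 10 }, // middle
  { tip: 16, pip: 14 }, // ring
  { tip: 20, pip: 18 }, // pinky
];

// A finger counts as extended when its tip sits above its PIP joint.
function extendedFingers(hand: Landmark[]): boolean[] {
  return FINGERS.map(f => hand[f.tip].y < hand[f.pip].y);
}

// Example rule: the ASL letter "B" holds all four fingers extended.
function looksLikeB(hand: Landmark[]): boolean {
  return extendedFingers(hand).every(Boolean);
}
```

The real classifier layers many such checks (plus thumb position and orientation) per letter; this shows only the shape of the approach.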

The sentence builder uses a three-layer pipeline:

  1. Lexer — normalizes the raw sign stream, groups consecutive letters into words
  2. ASL Parser — detects grammar patterns (topic-comment, negation, question form, fingerspelling)
  3. LLM Evaluator — sends the parsed ASL structure to claude-sonnet-4-6 with grammar expansion rules and failsafes, returning a single grammatically correct English sentence
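The lexer stage above can be sketched as a small pure function: runs of single letters collapse into fingerspelled words, while whole-word signs pass through. Names (`lexSigns`, `Token`) are illustrative, not the project's real module:

```typescript
// Illustrative sketch of the lexer stage — not EchoSense's actual code.
interface Token { kind: "word" | "fingerspelled"; text: string }

function lexSigns(signs: string[]): Token[] {
  const tokens: Token[] = [];
  let letters = "";
  const flush = () => {
    if (letters) {
      tokens.push({ kind: "fingerspelled", text: letters });
      letters = "";
    }
  };
  for (const s of signs) {
    if (s.length === 1) letters += s;                 // single letters accumulate
    else { flush(); tokens.push({ kind: "word", text: s }); }
  }
  flush();
  return tokens;
}
```

For example, the stream `["HELLO", "J", "O", "E", "STORE"]` becomes a word, a fingerspelled name, and another word — the shape the parser stage consumes.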

Voice output is powered by the ElevenLabs Turbo v2 streaming API with a silent Web Speech API fallback. Auth0 handles authentication with transcript persistence across sessions. The entire app is deployed on Vercel with custom Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers required for MediaPipe's WebAssembly runtime.
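The silent-fallback pattern is simple to express: try the primary engine, and on any failure hand the text to the backup without surfacing an error. A minimal sketch, with illustrative names (the real code calls the ElevenLabs streaming endpoint as the primary and wraps `speechSynthesis` as the fallback):

```typescript
// Sketch of the silent TTS fallback — function names are assumptions.
type SpeakFn = (text: string) => Promise<void>;

async function speakWithFallback(
  text: string,
  primary: SpeakFn,  // e.g. a fetch to the ElevenLabs streaming API
  fallback: SpeakFn, // e.g. wrapping window.speechSynthesis.speak
): Promise<void> {
  try {
    await primary(text);
  } catch {
    // Swallow the primary failure so the user still hears the phrase.
    await fallback(text);
  }
}
```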


Challenges we ran into

MediaPipe in production. MediaPipe's WebAssembly runtime requires SharedArrayBuffer, which browsers block unless the page is served with specific security headers. Getting these configured correctly in Vercel's deployment environment without breaking the React app's routing took significant debugging.
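For context, the header pair browsers require before they enable SharedArrayBuffer is Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp. In Vercel, one way to apply them site-wide is a `headers` rule in `vercel.json` — a sketch of what such a config might look like, not necessarily the project's exact file:

```json
{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" },
        { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" }
      ]
    }
  ]
}
```

Note that require-corp also forces every cross-origin subresource (fonts, scripts, model files) to opt in via CORS or CORP headers, which is often where the "breaking the React app" part comes from.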

Gesture debouncing. MediaPipe fires a classification on every single frame — roughly 30 per second. Raw output produced a flood of repeated detections as hands moved naturally between intentional signs. We designed a two-phase lock system: a 2-second hold threshold to commit a sign, followed by a mandatory cooldown period before the next sign can register. Tuning this to feel responsive without generating false positives required testing across different hand sizes, lighting conditions, and signing speeds.
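The two-phase lock can be sketched as a small state machine: a candidate label must survive unchanged for the hold window to commit, and after a commit nothing can register until the cooldown expires. The class name and the cooldown constant below are illustrative (the real values were hand-tuned):

```typescript
// Minimal sketch of the two-phase lock — names and COOLDOWN_MS are assumptions.
const HOLD_MS = 2000;    // hold threshold from the app's spell mode
const COOLDOWN_MS = 750; // assumed value; tuned by hand in the real app

class SignDebouncer {
  private candidate: string | null = null;
  private heldSince = 0;
  private lockedUntil = 0;

  /** Feed one per-frame classification; returns a sign only when it commits. */
  update(label: string, now: number): string | null {
    if (now < this.lockedUntil) return null; // phase 2: cooldown, ignore input
    if (label !== this.candidate) {          // new candidate resets the hold timer
      this.candidate = label;
      this.heldSince = now;
      return null;
    }
    if (now - this.heldSince >= HOLD_MS) {   // phase 1 satisfied: commit the sign
      this.lockedUntil = now + COOLDOWN_MS;
      this.candidate = null;
      return label;
    }
    return null;
  }
}
```

At 30 fps this turns ~60 identical per-frame detections into a single committed sign, and transitional hand shapes between signs reset the timer instead of firing.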

ASL grammar is not English. ASL uses topic-comment structure — the object comes before the verb. It drops pronouns. Negation comes after the verb. WH-questions appear at the end of a sentence. Building a parser that understood these rules and passed them correctly to the LLM evaluator — rather than just sending the raw sign stream — was one of the trickiest parts of the project.
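A toy sketch of the kind of pattern the parser flags before the LLM pass — a trailing WH-sign and a post-verb NOT become structured markers instead of raw words. Names are illustrative, not the project's real parser:

```typescript
// Illustrative sketch of ASL pattern detection — not EchoSense's actual parser.
interface ParsedClause {
  glosses: string[];         // signs with grammar markers stripped
  negated: boolean;          // ASL places NOT after the verb
  whQuestion: string | null; // WH-signs end the sentence in ASL
}

const WH_SIGNS = new Set(["WHO", "WHAT", "WHERE", "WHEN", "WHY", "HOW"]);

function parseClause(glosses: string[]): ParsedClause {
  let rest = [...glosses];
  let whQuestion: string | null = null;
  if (rest.length && WH_SIGNS.has(rest[rest.length - 1])) {
    whQuestion = rest.pop()!;            // "STORE YOU GO WHERE" -> where-question
  }
  const negated = rest.includes("NOT");
  rest = rest.filter(g => g !== "NOT");  // negation becomes a flag, not a word
  return { glosses: rest, negated, whQuestion };
}
```

The evaluator then receives structure like `{ glosses: ["I", "LIKE"], negated: true }` rather than a bare "I LIKE NOT", which is what lets it produce "I don't like …" reliably.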

Built With

  • anthropic
  • api
  • audio
  • auth0
  • azure
  • canvas
  • claude-sonnet-4-6
  • cloudforce
  • elevenlabs
  • godaddy
  • mediapipe
  • microsoft
  • react
  • registry
  • speech
  • tailwindcss
  • tasks
  • tensorflow.js
  • terpai
  • turbo
  • typescript
  • v2
  • vercel
  • vision
  • vite
  • web
  • webassembly