Sign Language Bridge

Inspiration

Our motivation for Sign Language Bridge stemmed from the communication barriers faced by the more than 11 million people in the United States who are deaf or have serious difficulty hearing, many of whom use American Sign Language (ASL). Deaf and hard-of-hearing individuals often struggle to access emergency services, telehealth, and customer support: human interpreters cost $100–150/hour, with wait times of 2–24 hours. Existing solutions are either text-only (losing ASL nuance), phone-bound (Video Relay Service), or dependent on expensive hardware. We saw an opportunity to build a real-time, web-based bridge that captures ASL through a webcam, classifies signs with a self-hosted model, and converts them to text and spoken audio in multiple languages, empowering Deaf individuals to communicate with anyone, anywhere.

What it does

Sign Language Bridge is a web application that provides:

Real-Time Sign Recognition:

  • Capture American Sign Language through your webcam at 10 fps.
  • MediaPipe Holistic extracts a 27-node skeleton (pose + both hands) from each frame (see the sketch after this list).
  • An ST-GCN model trained on the ASL Citizen dataset classifies signs in real time.
  • Frequently used signs are cached in Redis for instant lookup (~60–70% hit rate).
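
For illustration, here is a minimal sketch of assembling a 27-node skeleton from MediaPipe Holistic output. The specific landmark indices below are assumptions chosen to total 27 nodes, not our exact node list:

```python
import cv2
import mediapipe as mp
import numpy as np

# Hypothetical index choices (5 pose + 11 per hand = 27); our actual
# node list follows the OpenHands convention and may differ.
POSE_IDX = [0, 11, 12, 13, 14]                      # nose, shoulders, elbows
HAND_IDX = [0, 4, 8, 12, 16, 20, 5, 9, 13, 17, 2]   # wrist, fingertips, knuckles

holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

def extract_skeleton(frame_bgr: np.ndarray) -> np.ndarray:
    """Return a (27, 3) array of (x, y, z) landmarks, zero-filled for
    any body part MediaPipe fails to detect in this frame."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    nodes = np.zeros((27, 3), dtype=np.float32)

    def fill(landmarks, idx, offset):
        if landmarks is not None:
            pts = landmarks.landmark
            for i, j in enumerate(idx):
                nodes[offset + i] = (pts[j].x, pts[j].y, pts[j].z)

    fill(results.pose_landmarks, POSE_IDX, 0)          # 5 pose nodes
    fill(results.left_hand_landmarks, HAND_IDX, 5)     # 11 left-hand nodes
    fill(results.right_hand_landmarks, HAND_IDX, 16)   # 11 right-hand nodes
    return nodes
```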

Translation & Speech:

  • Convert recognized gloss sequences into natural English text.
  • Translate to Spanish or French using Amazon Nova Micro.
  • Generate spoken audio in the target language using Amazon Nova Sonic.
  • All translation and TTS output is cached to minimize latency and API costs (sketched below).
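
A sketch of the translate-with-cache path; the model ID, prompt, and cache-key scheme here are illustrative assumptions rather than our exact implementation:

```python
import hashlib

import boto3
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def translate(text: str, target_lang: str) -> str:
    """Translate English text with Nova Micro, behind a 24 h Redis cache."""
    key = f"translation:{target_lang}:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = r.get(key)
    if cached is not None:
        return cached                                # cache hit: no Bedrock call
    resp = bedrock.converse(
        modelId="amazon.nova-micro-v1:0",            # assumed model ID; may vary by region
        messages=[{"role": "user", "content": [
            {"text": f"Translate this into {target_lang}: {text}"}]}],
    )
    translated = resp["output"]["message"]["content"][0]["text"]
    r.set(key, translated, ex=24 * 3600)             # 24 h TTL, matching our cache policy
    return translated
```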

Session Management:

  • Users can create and manage translation sessions (ChatGPT-like experience).
  • Full translation history (gloss sequences, source text, and translations) is persisted in PostgreSQL; a model sketch follows this list.
  • Download transcripts for offline use.
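
A hedged sketch of how a translation row might be modeled with SQLAlchemy; the table and column names are hypothetical stand-ins for our actual schema:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Translation(Base):
    """One recognized-and-translated utterance within a session."""
    __tablename__ = "translations"

    id = Column(Integer, primary_key=True)
    session_id = Column(Integer, ForeignKey("sessions.id"), nullable=False)  # assumed sessions table
    gloss_sequence = Column(Text, nullable=False)   # e.g. "HELLO NAME WHAT"
    source_text = Column(Text, nullable=False)      # natural English rendering
    translated_text = Column(Text)                  # Spanish or French output
    target_lang = Column(String(5))                 # e.g. "es" or "fr"
    created_at = Column(DateTime, default=datetime.utcnow)
```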

How we built it

Technology Stack:

  • Backend: Python 3.11 + FastAPI with WebSocket support for real-time frame streaming (endpoint sketched after this list), integrating MediaPipe Holistic for pose extraction and a custom ST-GCN model for sign classification.
  • Frontend: React 18 + TypeScript + Vite + Tailwind v4 + Zustand + Radix UI for webcam capture, transcript display, session sidebar, and audio playback.
  • ML Pipeline: PyTorch ST-GCN trained on the ASL Citizen dataset; MediaPipe Holistic extracts 543 landmarks per frame, with a 27-node skeleton subset used for inference.
  • Database: PostgreSQL 16 for users, sessions, and translation history.
  • Cache: Redis 7 for sign predictions, translations, and TTS audio.
  • AI Services: Amazon Nova Micro (Bedrock) for EN→ES/FR translation; Amazon Nova Sonic for multilingual text-to-speech.
  • Auth: JWT (PyJWT) with bcrypt password hashing.
  • Containerization: Docker Compose for single-command local deployment.
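
To make the backend concrete, a minimal sketch of the frame-streaming WebSocket endpoint; the route, message shape, and the `extract_skeleton`/`classify` helpers are assumptions for illustration:

```python
import base64

import cv2
import numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/recognize")                      # route name is an assumption
async def recognize(ws: WebSocket):
    """Receive base64-encoded JPEG frames; stream predictions back."""
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_json()            # e.g. {"frame": "<base64 jpeg>"}
            raw = base64.b64decode(msg["frame"])
            frame = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
            skeleton = extract_skeleton(frame)       # 27-node extraction (sketched earlier)
            prediction = classify(skeleton)          # hypothetical ST-GCN wrapper
            await ws.send_json(prediction)           # e.g. {"gloss": "HELLO", "conf": 0.93}
    except WebSocketDisconnect:
        pass                                         # client reconnects with backoff
```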

Key Implementation Details:

  • Local-First ML: Sign classification runs entirely on the backend; no cloud dependency for recognition.
  • WebSocket Pipeline: Frames stream from browser to backend and predictions stream back; the client reconnects with exponential backoff.
  • Multi-Layer Caching: Sign cache (1hr TTL), translation cache (24hr TTL), TTS cache (24hr TTL) to reduce latency and API costs (helper sketched below).
  • Session Persistence: Every translation is saved to PostgreSQL with gloss sequence, source text, translated text, and timestamps.
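
A small helper illustrating the three cache layers and their TTLs; the key scheme is an assumption:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# TTLs mirror the three layers above: signs 1 h, translations and TTS 24 h.
TTL = {"sign": 3600, "translation": 24 * 3600, "tts": 24 * 3600}

def cached(layer: str, key: str, compute):
    """Look up layer:key in Redis; on a miss, compute the value, store it
    with the layer's TTL, and return it."""
    full_key = f"{layer}:{key}"
    hit = r.get(full_key)
    if hit is not None:
        return json.loads(hit)
    value = compute()
    r.set(full_key, json.dumps(value), ex=TTL[layer])
    return value

# Usage (names hypothetical): cached("sign", clip_hash, lambda: model.predict(clip))
```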

Challenges we ran into

Model Training Complexity:

  • Adapting the ST-GCN architecture from OpenHands to the ASL Citizen dataset required careful handling of MediaPipe Holistic’s 543-landmark output and selecting the right 27-node subset.
  • MediaPipe Holistic is only available in older releases (0.10.x), requiring Python 3.10 for the ML pipeline while the backend uses Python 3.11.

Real-Time Performance:

  • The ST-GCN consumes a 128-frame sliding window, which at 10 fps spans roughly 12.8 seconds of context and introduces a delay between signing and prediction; balancing sequence length against latency was critical (see the sketch below).
  • Redis caching for frequent signs (HELLO, YES, NO, THANK-YOU) proved essential to keep response times acceptable.
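
A minimal sketch of that window, assuming (27, 3) skeletons from the extraction step; the warm-up behavior is where the signing-to-prediction delay comes from:

```python
from collections import deque

import numpy as np

WINDOW = 128                       # frames per ST-GCN input
window = deque(maxlen=WINDOW)      # oldest frame drops off automatically

def push_frame(skeleton: np.ndarray):
    """Append one (27, 3) skeleton; return a model-ready clip once full."""
    window.append(skeleton)
    if len(window) < WINDOW:
        return None                               # still warming up
    clip = np.stack(window)                       # (128, 27, 3)
    # Rearrange to (batch, channels, frames, nodes) for the ST-GCN.
    return clip.transpose(2, 0, 1)[None]          # (1, 3, 128, 27)
```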

Gloss-to-Text Conversion:

  • Converting ASL gloss sequences (e.g., HELLO, NAME, WHAT) into natural English is an open research problem; we used rule-based mapping for common phrases with fallback to simple concatenation (see the sketch below).
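
A simplified sketch of that rule-based mapping with a concatenation fallback; the phrase table is a hypothetical subset of the real one:

```python
# Hypothetical rule table; the real one covers more phrases.
PHRASES = {
    ("HELLO", "NAME", "WHAT"): "Hello, what is your name?",
    ("HELP", "NEED"): "I need help.",
    ("THANK-YOU",): "Thank you.",
}

def gloss_to_text(glosses: list[str]) -> str:
    """Longest phrase match first; unmatched glosses fall back to
    lowercase concatenation, e.g. ["STORE", "GO"] -> "store go"."""
    if not glosses:
        return ""
    glosses = [g.upper() for g in glosses]
    for length in range(len(glosses), 0, -1):       # prefer longer matches
        prefix = tuple(glosses[:length])
        if prefix in PHRASES:
            return (PHRASES[prefix] + " " + gloss_to_text(glosses[length:])).strip()
    # No phrase starts here: emit the gloss itself and keep scanning.
    return (glosses[0].lower() + " " + gloss_to_text(glosses[1:])).strip()
```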

WebSocket Reliability:

  • Maintaining a stable connection for continuous frame streaming required exponential backoff reconnection and graceful handling of camera permission errors on the frontend.

Accomplishments that we're proud of

End-to-End Pipeline:

  • Built a complete flow from webcam → skeleton extraction → sign classification → gloss-to-text → translation → TTS → audio output, with session history persisted in PostgreSQL.

Inclusive Design:

  • The application addresses a real need for Deaf and hard-of-hearing users, providing a bridge to communicate with hearing individuals in English, Spanish, and French.

Production-Ready Architecture:

  • Modular codebase with clear separation between model service, cache service, translation service, and TTS service; Docker Compose enables one-command deployment.

Caching Strategy:

  • Achieved ~60–70% cache hit rate for sign predictions, significantly reducing inference load and improving response times for common signs.

Robust Frontend:

  • 40+ Radix UI components, Zustand state management, custom hooks for WebSocket, camera, and audio playback—all with TypeScript strict mode.

What we learned

Skeleton-Based Sign Recognition:

  • ST-GCN’s graph convolution over spatial-temporal skeleton sequences is well-suited for sign language, capturing both hand shape and motion dynamics.
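
To illustrate the idea (a minimal sketch, not our OpenHands-derived implementation), one ST-GCN block pairs a graph convolution over the skeleton's adjacency with a temporal convolution along the frame axis:

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatial-temporal graph convolution block.

    Input x: (batch, channels, frames, nodes); A: (nodes, nodes) normalized
    adjacency of the 27-node skeleton.
    """
    def __init__(self, in_ch: int, out_ch: int, A: torch.Tensor, t_kernel: int = 9):
        super().__init__()
        self.register_buffer("A", A)
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # per-node channel mixing
        pad = (t_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1), padding=(pad, 0))
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate over skeleton neighbors
        x = self.temporal(x)                          # convolve along the time axis
        return self.relu(x)
```

Stacking several such blocks, then pooling over frames and nodes into a linear classifier, yields a sign classifier: the spatial step captures hand shape, the temporal step captures motion dynamics.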

Importance of Caching:

  • A small set of signs (greetings, yes/no, help) dominates typical conversations; caching these dramatically improves perceived performance.

Cloud vs. Local Trade-offs:

  • Keeping sign classification self-hosted on our backend preserves privacy and reduces cost; using Bedrock for translation and TTS provides high-quality output without building custom models.

Session-Based UX:

  • A ChatGPT-like session sidebar with full translation history makes the product feel familiar and useful for repeated use.

What's next for Sign Language Bridge

Expanded Vocabulary:

  • Scale from 50–100 signs to hundreds by training on more ASL Citizen data and fine-tuning the ST-GCN model.

Bidirectional Translation:

  • Add speech-to-sign: hearing users speak, and the system displays an avatar or animation performing the corresponding signs.

Improved Gloss-to-Text:

  • Integrate an LLM (e.g., Nova Micro) for gloss-to-natural-text conversion instead of rule-based mapping, improving fluency for complex sentences.

Mobile & Offline Support:

  • Port to mobile or PWA with optional offline mode for environments with limited connectivity.

Healthcare Integration:

  • Partner with telehealth and emergency service providers to embed Sign Language Bridge into their workflows.
