Sign Language Bridge

Inspiration

Our motivation for Sign Language Bridge stemmed from the communication barriers faced by the more than 11 million people in the United States who are deaf or have serious difficulty hearing, many of whom use American Sign Language (ASL). Deaf and hard-of-hearing individuals often struggle to access emergency services, telehealth, and customer support: human interpreters cost $100–150/hour, with wait times of 2–24 hours. Existing solutions are either text-only (losing ASL nuance), phone-bound (Video Relay Service), or dependent on expensive hardware. We saw an opportunity to build a real-time, web-based bridge that captures ASL through a webcam, classifies signs with a self-hosted model, and converts them to text and spoken audio in multiple languages, empowering Deaf individuals to communicate with anyone, anywhere.

What it does

Sign Language Bridge is a web application that provides:

Real-Time Sign Recognition:

  • Capture American Sign Language through your webcam at 10 fps.
  • MediaPipe Holistic extracts a 27-node skeleton (pose + both hands) from each frame (see the sketch after this list).
  • An ST-GCN model trained on the ASL Citizen dataset classifies signs in real time.
  • Frequently used signs are cached in Redis for instant lookup (~60–70% hit rate).
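
For illustration, here is a minimal sketch of assembling a 27-node skeleton from MediaPipe Holistic output. The specific landmark indices below are assumptions chosen to total 27 nodes, not our exact node list:

```python
import cv2
import mediapipe as mp
import numpy as np

# Hypothetical index choices (5 pose + 11 per hand = 27); our actual
# node list follows the OpenHands convention and may differ.
POSE_IDX = [0, 11, 12, 13, 14]                      # nose, shoulders, elbows
HAND_IDX = [0, 4, 8, 12, 16, 20, 5, 9, 13, 17, 2]   # wrist, fingertips, knuckles

holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

def extract_skeleton(frame_bgr: np.ndarray) -> np.ndarray:
    """Return a (27, 3) array of (x, y, z) landmarks, zero-filled for
    any body part MediaPipe fails to detect in this frame."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    nodes = np.zeros((27, 3), dtype=np.float32)

    def fill(landmarks, idx, offset):
        if landmarks is not None:
            pts = landmarks.landmark
            for i, j in enumerate(idx):
                nodes[offset + i] = (pts[j].x, pts[j].y, pts[j].z)

    fill(results.pose_landmarks, POSE_IDX, 0)          # 5 pose nodes
    fill(results.left_hand_landmarks, HAND_IDX, 5)     # 11 left-hand nodes
    fill(results.right_hand_landmarks, HAND_IDX, 16)   # 11 right-hand nodes
    return nodes
```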

Translation & Speech:

  • Convert recognized gloss sequences into natural English text.
  • Translate to Spanish or French using Amazon Nova Micro.
  • Generate spoken audio in the target language using Amazon Nova Sonic.
  • All translation and TTS output is cached to minimize latency and API costs (sketched below).
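
A sketch of the translate-with-cache path; the model ID, prompt, and cache-key scheme here are illustrative assumptions rather than our exact implementation:

```python
import hashlib

import boto3
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def translate(text: str, target_lang: str) -> str:
    """Translate English text with Nova Micro, behind a 24 h Redis cache."""
    key = f"translation:{target_lang}:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = r.get(key)
    if cached is not None:
        return cached                                # cache hit: no Bedrock call
    resp = bedrock.converse(
        modelId="amazon.nova-micro-v1:0",            # assumed model ID; may vary by region
        messages=[{"role": "user", "content": [
            {"text": f"Translate this into {target_lang}: {text}"}]}],
    )
    translated = resp["output"]["message"]["content"][0]["text"]
    r.set(key, translated, ex=24 * 3600)             # 24 h TTL, matching our cache policy
    return translated
```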

Session Management:

  • Users can create and manage translation sessions (ChatGPT-like experience).
  • Full translation history (gloss sequences, source text, and translations) is persisted in PostgreSQL; a model sketch follows this list.
  • Download transcripts for offline use.
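
A hedged sketch of how a translation row might be modeled with SQLAlchemy; the table and column names are hypothetical stand-ins for our actual schema:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Translation(Base):
    """One recognized-and-translated utterance within a session."""
    __tablename__ = "translations"

    id = Column(Integer, primary_key=True)
    session_id = Column(Integer, ForeignKey("sessions.id"), nullable=False)  # assumed sessions table
    gloss_sequence = Column(Text, nullable=False)   # e.g. "HELLO NAME WHAT"
    source_text = Column(Text, nullable=False)      # natural English rendering
    translated_text = Column(Text)                  # Spanish or French output
    target_lang = Column(String(5))                 # e.g. "es" or "fr"
    created_at = Column(DateTime, default=datetime.utcnow)
```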

How we built it

Technology Stack:

  • Backend: Python 3.11 + FastAPI with WebSocket support for real-time frame streaming (endpoint sketched after this list), integrating MediaPipe Holistic for pose extraction and a custom ST-GCN model for sign classification.
  • Frontend: React 18 + TypeScript + Vite + Tailwind v4 + Zustand + Radix UI for webcam capture, transcript display, session sidebar, and audio playback.
  • ML Pipeline: PyTorch ST-GCN trained on the ASL Citizen dataset; MediaPipe Holistic extracts 543 landmarks per frame, with a 27-node skeleton subset used for inference.
  • Database: PostgreSQL 16 for users, sessions, and translation history.
  • Cache: Redis 7 for sign predictions, translations, and TTS audio.
  • AI Services: Amazon Nova Micro (Bedrock) for EN→ES/FR translation; Amazon Nova Sonic for multilingual text-to-speech.
  • Auth: JWT (PyJWT) with bcrypt password hashing.
  • Containerization: Docker Compose for single-command local deployment.
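
To make the backend concrete, a minimal sketch of the frame-streaming WebSocket endpoint; the route, message shape, and the `extract_skeleton`/`classify` helpers are assumptions for illustration:

```python
import base64

import cv2
import numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/recognize")                      # route name is an assumption
async def recognize(ws: WebSocket):
    """Receive base64-encoded JPEG frames; stream predictions back."""
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_json()            # e.g. {"frame": "<base64 jpeg>"}
            raw = base64.b64decode(msg["frame"])
            frame = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
            skeleton = extract_skeleton(frame)       # 27-node extraction (sketched earlier)
            prediction = classify(skeleton)          # hypothetical ST-GCN wrapper
            await ws.send_json(prediction)           # e.g. {"gloss": "HELLO", "conf": 0.93}
    except WebSocketDisconnect:
        pass                                         # client reconnects with backoff
```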

Key Implementation Details:

  • Local-First ML: Sign classification runs entirely on the backend; no cloud dependency for recognition.
  • WebSocket Pipeline: Frames stream from browser to backend and predictions stream back; the client reconnects with exponential backoff.
  • Multi-Layer Caching: Sign cache (1hr TTL), translation cache (24hr TTL), TTS cache (24hr TTL) to reduce latency and API costs (helper sketched below).
  • Session Persistence: Every translation is saved to PostgreSQL with gloss sequence, source text, translated text, and timestamps.
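
A small helper illustrating the three cache layers and their TTLs; the key scheme is an assumption:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# TTLs mirror the three layers above: signs 1 h, translations and TTS 24 h.
TTL = {"sign": 3600, "translation": 24 * 3600, "tts": 24 * 3600}

def cached(layer: str, key: str, compute):
    """Look up layer:key in Redis; on a miss, compute the value, store it
    with the layer's TTL, and return it."""
    full_key = f"{layer}:{key}"
    hit = r.get(full_key)
    if hit is not None:
        return json.loads(hit)
    value = compute()
    r.set(full_key, json.dumps(value), ex=TTL[layer])
    return value

# Usage (names hypothetical): cached("sign", clip_hash, lambda: model.predict(clip))
```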

Challenges we ran into

Model Training Complexity:

  • Adapting the ST-GCN architecture from OpenHands to the ASL Citizen dataset required careful handling of MediaPipe Holistic’s 543-landmark output and selecting the right 27-node subset.
  • MediaPipe Holistic is only available in older releases (0.10.x), requiring Python 3.10 for the ML pipeline while the backend uses Python 3.11.

Real-Time Performance:

  • The ST-GCN consumes a 128-frame sliding window, which at 10 fps spans roughly 12.8 seconds of context and introduces a delay between signing and prediction; balancing sequence length against latency was critical (see the sketch below).
  • Redis caching for frequent signs (HELLO, YES, NO, THANK-YOU) proved essential to keep response times acceptable.
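
A minimal sketch of that window, assuming (27, 3) skeletons from the extraction step; the warm-up behavior is where the signing-to-prediction delay comes from:

```python
from collections import deque

import numpy as np

WINDOW = 128                       # frames per ST-GCN input
window = deque(maxlen=WINDOW)      # oldest frame drops off automatically

def push_frame(skeleton: np.ndarray):
    """Append one (27, 3) skeleton; return a model-ready clip once full."""
    window.append(skeleton)
    if len(window) < WINDOW:
        return None                               # still warming up
    clip = np.stack(window)                       # (128, 27, 3)
    # Rearrange to (batch, channels, frames, nodes) for the ST-GCN.
    return clip.transpose(2, 0, 1)[None]          # (1, 3, 128, 27)
```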

Gloss-to-Text Conversion:

  • Converting ASL gloss sequences (e.g., HELLO, NAME, WHAT) into natural English is an open research problem; we used rule-based mapping for common phrases with fallback to simple concatenation (see the sketch below).
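
A simplified sketch of that rule-based mapping with a concatenation fallback; the phrase table is a hypothetical subset of the real one:

```python
# Hypothetical rule table; the real one covers more phrases.
PHRASES = {
    ("HELLO", "NAME", "WHAT"): "Hello, what is your name?",
    ("HELP", "NEED"): "I need help.",
    ("THANK-YOU",): "Thank you.",
}

def gloss_to_text(glosses: list[str]) -> str:
    """Longest phrase match first; unmatched glosses fall back to
    lowercase concatenation, e.g. ["STORE", "GO"] -> "store go"."""
    if not glosses:
        return ""
    glosses = [g.upper() for g in glosses]
    for length in range(len(glosses), 0, -1):       # prefer longer matches
        prefix = tuple(glosses[:length])
        if prefix in PHRASES:
            return (PHRASES[prefix] + " " + gloss_to_text(glosses[length:])).strip()
    # No phrase starts here: emit the gloss itself and keep scanning.
    return (glosses[0].lower() + " " + gloss_to_text(glosses[1:])).strip()
```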

WebSocket Reliability:

  • Maintaining a stable connection for continuous frame streaming required exponential backoff reconnection and graceful handling of camera permission errors on the frontend.

Accomplishments that we're proud of

End-to-End Pipeline:

  • Built a complete flow from webcam → skeleton extraction → sign classification → gloss-to-text → translation → TTS → audio output, with session history persisted in PostgreSQL.

Inclusive Design:

  • The application addresses a real need for Deaf and hard-of-hearing users, providing a bridge to communicate with hearing individuals in English, Spanish, and French.

Production-Ready Architecture:

  • Modular codebase with clear separation between model service, cache service, translation service, and TTS service; Docker Compose enables one-command deployment.

Caching Strategy:

  • Achieved ~60–70% cache hit rate for sign predictions, significantly reducing inference load and improving response times for common signs.

Robust Frontend:

  • 40+ Radix UI components, Zustand state management, custom hooks for WebSocket, camera, and audio playback—all with TypeScript strict mode.

What we learned

Skeleton-Based Sign Recognition:

  • ST-GCN’s graph convolution over spatial-temporal skeleton sequences is well-suited for sign language, capturing both hand shape and motion dynamics.
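
To illustrate the idea (a minimal sketch, not our OpenHands-derived implementation), one ST-GCN block pairs a graph convolution over the skeleton's adjacency with a temporal convolution along the frame axis:

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatial-temporal graph convolution block.

    Input x: (batch, channels, frames, nodes); A: (nodes, nodes) normalized
    adjacency of the 27-node skeleton.
    """
    def __init__(self, in_ch: int, out_ch: int, A: torch.Tensor, t_kernel: int = 9):
        super().__init__()
        self.register_buffer("A", A)
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # per-node channel mixing
        pad = (t_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1), padding=(pad, 0))
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate over skeleton neighbors
        x = self.temporal(x)                          # convolve along the time axis
        return self.relu(x)
```

Stacking several such blocks, then pooling over frames and nodes into a linear classifier, yields a sign classifier: the spatial step captures hand shape, the temporal step captures motion dynamics.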

Importance of Caching:

  • A small set of signs (greetings, yes/no, help) dominates typical conversations; caching these dramatically improves perceived performance.

Cloud vs. Local Trade-offs:

  • Keeping sign classification self-hosted on our backend preserves privacy and reduces cost; using Bedrock for translation and TTS provides high-quality output without building custom models.

Session-Based UX:

  • A ChatGPT-like session sidebar with full translation history makes the product feel familiar and useful for repeated use.

What's next for Sign Language Bridge

Expanded Vocabulary:

  • Scale from 50–100 signs to hundreds by training on more ASL Citizen data and fine-tuning the ST-GCN model.

Bidirectional Translation:

  • Add speech-to-sign: hearing users speak, and the system displays an avatar or animation performing the corresponding signs.

Improved Gloss-to-Text:

  • Integrate an LLM (e.g., Nova Micro) for gloss-to-natural-text conversion instead of rule-based mapping, improving fluency for complex sentences.

Mobile & Offline Support:

  • Port to mobile or PWA with optional offline mode for environments with limited connectivity.

Healthcare Integration:

  • Partner with telehealth and emergency service providers to embed Sign Language Bridge into their workflows.
