Inspiration

The core inspiration for Project SignBridge stemmed from the massive and often frustrating communication gap that exists between the Deaf community and individuals who do not know American Sign Language. This barrier impacts daily interactions, from simple retail transactions to essential services, limiting accessibility and inclusion.

We were motivated to create a truly zero-friction solution that uses ubiquitous technology—a standard webcam and a web browser—to instantly bridge this gap without the need for specialized hardware or app downloads, making communication seamless and universally accessible.


What it does

Project SignBridge is a web application that uses artificial intelligence to translate American Sign Language (ASL) into English text in real time.

  • Hand Tracking: It captures the user's hand movements via a standard webcam and processes the spatial coordinates using Google's MediaPipe framework.
  • Neural Inference: It sends this normalized data to a Python backend, which uses a custom-trained neural network to identify signed letters and common words, displaying the translated text on screen.
  • Advanced Processing: We integrate the Gemini API for advanced post-processing to predict the user's intended next words and synthesize the on-screen text into natural-sounding spoken audio.
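Concretely, each tracked frame is serialized into a compact coordinate message before it leaves the browser. A hypothetical payload is sketched below (field names are illustrative, not our exact schema; MediaPipe reports each landmark as normalized x, y, z values):

```python
import json

# Illustrative message for one tracked frame (first 2 of 21 landmarks shown)
message = {
    "hand": "right",
    "landmarks": [
        {"x": 0.41, "y": 0.72, "z": -0.03},   # wrist
        {"x": 0.44, "y": 0.65, "z": -0.05},   # thumb base
        # ... 19 more landmark points
    ],
}

payload = json.dumps(message)  # a few hundred bytes, vs ~1 MB per raw frame
```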

How we built it

The project was built in three phases:

  1. Phase 1 (Frontend & Tracking): We implemented the front-end to utilize the browser's WebRTC API to access the webcam. We integrated the Google MediaPipe Hand Landmarker model to detect and extract 21 coordinate points from the user's hands in real time.
  2. Phase 2 (Data & Modeling): We focused on the backend, converting raw 3D landmarks into normalized, relative coordinates to build our custom training dataset. We then trained a lightweight deep neural network using PyTorch on an ASL fingerspelling dataset to recognize static letters, and trained an additional model using the WLASL dataset to interpret common words.
  3. Phase 3 (Integration & AI): We integrated the complete application using WebSockets to send normalized hand coordinates from the browser to the Python ML backend. We rendered predictions onto the UI and integrated the Gemini API for intelligent word prediction and text-to-speech output.

Example: Normalized landmark data sent via WebSocket

landmarks = extract_landmarks(frame)           # MediaPipe hand detection
normalized = normalize_coordinates(landmarks)  # Convert to relative coords
await websocket.send(json.dumps(normalized))   # Stream coords (not frames) to backend
prediction = model.predict(normalized)         # Neural network inference on backend
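On the backend side, the WebSocket handler only ever sees these compact coordinate payloads. A minimal sketch of the message-decoding step is below (function and field names are illustrative, not our exact code):

```python
import json

NUM_LANDMARKS = 21  # MediaPipe Hand Landmarker emits 21 points per hand


def decode_message(payload: str) -> list[tuple[float, float, float]]:
    """Parse one WebSocket text frame into a list of (x, y, z) landmarks."""
    data = json.loads(payload)
    points = [(p["x"], p["y"], p["z"]) for p in data["landmarks"]]
    if len(points) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(points)}")
    return points


# In the full app, this feeds the classifier inside an async handler, roughly:
#   async for message in websocket:
#       points = decode_message(message)
#       prediction = model.predict(points)
```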

Challenges we ran into

  • Latency: Ensuring low latency for real-time conversation was a primary challenge. The computational demands of the ML model on a standard server posed a significant hurdle. We addressed this by strictly sending only the mathematical coordinate data—not raw video frames—and optimizing our deep learning model for edge performance.
  • Accuracy: Achieving high, consistent accuracy for fingerspelling across different users and varying lighting conditions required extensive data normalization and careful model tuning.
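The bandwidth saving from sending coordinates instead of video frames is easy to quantify. Back-of-the-envelope figures (exact sizes depend on serialization and resolution):

```python
# One hand: 21 landmarks x 3 coordinates x 4-byte float
coords_bytes = 21 * 3 * 4          # 252 bytes per frame

# One uncompressed 640x480 RGB video frame
frame_bytes = 640 * 480 * 3        # 921,600 bytes per frame

print(f"coords: {coords_bytes} B, frame: {frame_bytes:,} B")
print(f"reduction: ~{frame_bytes // coords_bytes}x")  # roughly 3,657x smaller
```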

Accomplishments that we're proud of

  • End-to-End Delivery: Successfully launching a complete web application within the hackathon timeframe that seamlessly integrates real-time hand-tracking, a custom Python ML backend, and the Gemini API.
  • Zero-Friction Accessibility: Delivering a solution that requires zero downloads, zero specialized hardware, and zero accounts.
  • Speed: Achieving an end-to-end latency of approximately 300 milliseconds, a major technical achievement that ensures the tool is viable for natural, unscripted conversations.

What we learned

We learned the critical importance of data processing and normalization in machine learning for computer vision. Converting pixel data into relative coordinates was essential for creating a model that generalizes well across different devices and viewing distances.

The key normalization insight was:

$$\text{normalized}_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$
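A direct implementation of that min-max scaling over one axis of the landmark coordinates (a simplified sketch; in practice each axis is normalized and the degenerate all-equal case must be handled):

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Scale values to [0, 1] relative to their own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # degenerate case: all points coincide
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


xs = [0.32, 0.45, 0.61, 0.50]
print(min_max_normalize(xs))  # smallest maps to 0.0, largest to 1.0
```

Because the output depends only on each point's position relative to the hand's own bounding range, the same sign produces similar features regardless of camera distance or frame resolution.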

Furthermore, we gained valuable experience in optimizing deep learning models for low-latency, real-time performance and utilizing efficient WebSocket communication to bridge the gap between a browser-based frontend and a Python ML backend.

Built With

  • Python & PyTorch (custom neural networks)
  • Google MediaPipe (hand landmark tracking)
  • WebRTC & WebSockets (browser capture and streaming)
  • Gemini API (word prediction and text-to-speech)
