Inspiration

In Pakistan, there are over 10 million individuals who are Deaf or Hard of Hearing, yet the country faces a staggering shortage of professional sign language interpreters. This creates a "silence barrier" in critical settings like hospitals, banks, and schools. We built SignSpeak to empower this community with a portable, digital interpreter that understands not just the signs, but the cultural and linguistic nuances of Pakistan Sign Language (PSL).

What it does

SignSpeak is a bi-directional, real-time multimodal translation platform:

Sign-to-Speech (Deaf ➔ Hearing): The system tracks 147 3D skeletal landmarks. Unlike standard "sign dictionaries" that output single words, SignSpeak uses Gemini 2.0 Flash as a linguistic reasoning engine. It takes raw detected keywords and "reasons" them into polite, grammatically correct Urdu and English sentences, which are then vocalized via high-quality speech synthesis.
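The keyword-to-sentence step can be sketched as a prompt-construction helper. The function name and prompt wording below are illustrative assumptions, not the actual SignSpeak code:

```python
# Hypothetical sketch: turning raw detected sign keywords into a
# linguistic-expansion prompt for Gemini. Function name and prompt
# wording are illustrative, not the real SignSpeak implementation.

def build_expansion_prompt(keywords, language="Urdu"):
    """Build a sentence-expansion prompt from raw sign keywords."""
    joined = ", ".join(keywords)
    return (
        f"You are a Pakistan Sign Language interpreter. The signer produced "
        f"these raw keywords: {joined}. "
        f"Rewrite them as one polite, grammatically correct {language} "
        f"sentence that preserves the signer's intent. "
        f"Return only the sentence."
    )

prompt = build_expansion_prompt(["hospital", "where"], language="English")
```

In the deployed app, a prompt like this would be sent to Gemini 2.0 Flash through the google-genai SDK (e.g. via `client.models.generate_content`), with the returned sentence passed on to speech synthesis.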

Speech-to-Sign (Hearing ➔ Deaf): Hearing users can speak naturally into the app. Gemini acts as a linguistic parser, simplifying complex natural language into core sign-keywords, which are then performed by a 3D AI Avatar in a seamless sequence.
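Once Gemini has reduced the spoken sentence to sign-keywords, sequencing the avatar reduces to an ordered library lookup. The clip names and `SIGN_LIBRARY` contents below are assumptions for illustration:

```python
# Illustrative sketch: mapping Gemini-simplified keywords to an ordered
# queue of avatar sign clips. The asset paths and SIGN_LIBRARY contents
# are assumptions, not the real SignSpeak asset list.

SIGN_LIBRARY = {
    "hello": "clips/hello.mp4",
    "hospital": "clips/hospital.mp4",
    "where": "clips/where.mp4",
}

def plan_avatar_sequence(keywords):
    """Return the ordered clip list, skipping signs with no asset."""
    return [SIGN_LIBRARY[k] for k in keywords if k in SIGN_LIBRARY]

sequence = plan_avatar_sequence(["hospital", "where", "unknownword"])
```

The avatar player would then perform the clips back-to-back, giving the "seamless sequence" described above.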

How we built it

Linguistic AI: We utilized the latest Google Gemini 2.0 Flash API via the google-genai SDK for low-latency multimodal reasoning and sentence formulation.

Computer Vision: Implemented MediaPipe Holistic to capture high-fidelity 3D coordinates for hands, arms, and shoulders simultaneously.
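Holistic returns separate pose and hand landmark lists per frame, which must be flattened into one fixed-length vector before matching. The exact subset below (7 upper-body pose points plus both 21-point hands, 49 points × 3 coordinates = 147 floats) is our reading of the "147 landmarks" figure and is an assumption:

```python
# Sketch of flattening MediaPipe Holistic output into a fixed-length
# feature vector. The 7 upper-body pose indices (nose, shoulders,
# elbows, wrists) are an assumed subset; missing parts are zero-filled
# so the vector length is always 147.

UPPER_BODY = [0, 11, 12, 13, 14, 15, 16]  # nose, shoulders, elbows, wrists

def flatten_landmarks(results):
    """Flatten pose + both hands into a 147-float vector."""
    vector = []

    def extend(landmark_list, indices=None):
        if landmark_list is None:  # part not detected this frame
            count = len(indices) if indices else 21
            vector.extend([0.0] * (count * 3))
            return
        points = landmark_list.landmark
        chosen = [points[i] for i in indices] if indices else points
        for p in chosen:
            vector.extend([p.x, p.y, p.z])

    extend(results.pose_landmarks, UPPER_BODY)
    extend(results.left_hand_landmarks)
    extend(results.right_hand_landmarks)
    return vector
```

`results` here is the object returned by Holistic's `process()` call; zero-filling keeps the downstream matcher's input shape stable when a hand leaves the frame.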

Mathematics: Built a custom Fast Dynamic Time Warping (FastDTW) engine to align live movement trajectories against our trained sign library, tolerating differences in signing speed and timing.
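The time-warping idea behind that matching engine can be shown with a minimal, unoptimized version (the real FastDTW adds a multiresolution approximation, and the frames are 147-dimensional rather than scalar; both are omitted here for clarity):

```python
# Minimal dynamic time warping sketch on 1-D sequences. This O(n*m)
# version shows only the alignment idea; the production engine uses
# the FastDTW approximation on full landmark frames.

def dtw_distance(a, b):
    """Return the DTW alignment cost between two numeric sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame in a
                                 cost[i][j - 1],      # skip a frame in b
                                 cost[i - 1][j - 1])  # match frames
    return cost[n][m]
```

A live movement is matched to the library sign whose DTW cost is lowest (and under a tuned threshold), which is why a sign performed slowly still matches a faster recording.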

Backend: A Python Flask architecture with thread-safe AI processing locks.

Voice: Integrated Google Text-to-Speech (gTTS) for natural Urdu vocalization.

Challenges we ran into

One of the biggest hurdles was "Spatial Invariance"—ensuring the AI recognized signs correctly whether the user was 2 feet or 5 feet from the camera. We solved this by developing a Shoulder-Width Normalization algorithm that scales all 147 points relative to the user's body size. We also overcame Gemini Free Tier rate limits by building a hybrid local-pattern/AI-refinement architecture that ensures the app remains functional even during high traffic.
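The normalization step can be sketched as recentering every point on the shoulder midpoint and dividing by the shoulder distance, making the representation invariant to camera distance. The point layout (shoulders at fixed indices) is an assumption for illustration:

```python
# Sketch of shoulder-width normalization: recenter on the shoulder
# midpoint and scale by shoulder distance, so the same sign produces
# the same coordinates at 2 ft or 5 ft. Index layout is assumed.

import math

def normalize_frame(points, left_shoulder=0, right_shoulder=1):
    """points: list of (x, y, z) tuples. Returns a scale- and
    translation-invariant copy of the frame."""
    lx, ly, lz = points[left_shoulder]
    rx, ry, rz = points[right_shoulder]
    cx, cy, cz = (lx + rx) / 2, (ly + ry) / 2, (lz + rz) / 2
    width = math.dist((lx, ly, lz), (rx, ry, rz)) or 1.0  # guard /0
    return [((x - cx) / width, (y - cy) / width, (z - cz) / width)
            for x, y, z in points]
```

Applied to all 147 values per frame, the same gesture yields (near-)identical vectors regardless of how far the signer stands from the camera, which is what makes the downstream matching distance-invariant.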

Accomplishments that we're proud of

We are incredibly proud of our Linguistic Expansion feature. Moving beyond a simple "dictionary" was our goal. Seeing Gemini take a raw "Hospital" sign and transform it into: "Excuse me, can you please help me find the nearest hospital?" showed us that we had created a tool for truly natural communication.

What we learned

We learned the power of Multimodal AI. Gemini 2.0 Flash allowed us to bridge the gap between raw vision data (coordinates) and human language (context). We also gained deep experience in optimizing real-time data transmission by using hidden low-resolution canvases to reduce network latency without sacrificing UI quality.

What's next for SignSpeak

Our vision is to scale SignSpeak globally:

Dataset Expansion: Growing our library from conversational basics to thousands of specialized signs for medical and legal use.

Three.js Integration: Moving from a video-based avatar to a fully rendered 3D rigged character for fluid, real-time sign generation.

Global Localization: Using Gemini’s multilingual power to adapt SignSpeak for ASL, BSL, and ISL, creating a universal communication bridge for the global Deaf community.
