Inspiration

Communication barriers have isolated Zimbabwe's 15,000+ Deaf community members for too long. With virtually no digital resources for Zimbabwe Sign Language (ZSL), we saw an opportunity to leverage Gemini 3 Pro's advanced multimodal capabilities to create something transformative. The goal was simple yet ambitious: turn cutting-edge AI into an accessible tool that preserves linguistic heritage while empowering real-time communication.

What We Built

ZSL Vibe Interpreter is a multimodal web application with three modes, all built entirely in Google AI Studio:

Interpreter Mode: Real-time computer vision analyzes handshapes and signing patterns through live video. Gemini 3 Pro deconstructs frames to identify ZSL signs, including the full A-to-Z Manual Alphabet, and instantly translates them into fluent English, delivered via a custom "Zola" audio persona for natural speech output.

Live Tutor Mode: Powered by the Gemini Live API, this interactive coach "watches" users practice signing and provides immediate, low-latency audio feedback. It corrects technique, suggests improvements, and creates a patient learning environment that adapts to individual progress.

Reference Generator Mode: Using Nano Banana, users can generate on-demand AI visual references for any specific sign, creating an infinite, personalized learning library that grows with their needs.

How We Built It

The entire application was developed within Google AI Studio, harnessing Gemini 3 Pro's native multimodal reasoning. We structured the workflow as follows:

Video Input Processing: Streamed webcam frames directly to Gemini's vision endpoint
Sign Recognition: Engineered prompts to identify handshapes, palm orientation, and movement patterns specific to ZSL
Translation Pipeline: Mapped recognized signs to English phrases with contextual awareness
Audio Generation: Synthesized responses using the custom Zola voice persona via Gemini's audio capabilities
Live API Integration: Established persistent connections for real time tutoring feedback
Reference Generation: Implemented Nano Banana for dynamic sign visualization
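
The frame-to-recognition step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the model id, SDK usage shown in the comment (`@google/genai`), and prompt wording are all assumptions.

```javascript
// Sketch: build a multimodal request for one captured webcam frame.
// `frameBase64` is a JPEG frame encoded as base64, e.g. taken from
// canvas.toDataURL('image/jpeg') with the data-URL prefix stripped.
function buildSignRecognitionRequest(frameBase64) {
  return {
    model: 'gemini-3-pro-preview', // assumed model id, for illustration
    contents: [
      {
        role: 'user',
        parts: [
          // The frame goes in as inline image data...
          { inlineData: { mimeType: 'image/jpeg', data: frameBase64 } },
          // ...followed by an illustrative recognition prompt.
          {
            text:
              'Identify the Zimbabwe Sign Language sign in this frame. ' +
              'Describe the handshape, palm orientation, and movement, ' +
              'then give the English translation as a single phrase.',
          },
        ],
      },
    ],
  };
}

// Usage (network call omitted; assumes the @google/genai SDK):
// const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
// const response = await ai.models.generateContent(
//   buildSignRecognitionRequest(frameBase64));
```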

Key optimizations focused on the speed and accuracy critical for natural sign-language conversation flow.

Challenges We Overcame

Limited ZSL Training Data: No existing datasets, so we engineered few-shot learning prompts with detailed anatomical descriptions
Latency Requirements: Sign language requires near-instantaneous feedback; we optimized API calls and implemented streaming responses
Multimodal Synchronization: Coordinating video analysis, text translation, and audio output simultaneously required careful prompt engineering and state management
Cultural Accuracy: Consulted with ZSL community members to ensure proper sign recognition and respectful representation
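
The few-shot approach to the missing-dataset problem can be sketched like this. The example signs and anatomical descriptions below are simplified illustrations, not the project's actual ZSL prompt data, and the helper name is hypothetical.

```javascript
// Illustrative few-shot examples pairing a sign with an anatomical
// description (handshape, palm orientation, movement).
const FEW_SHOT_EXAMPLES = [
  {
    sign: 'A',
    description:
      'Closed fist, thumb resting against the side of the index finger, ' +
      'palm facing the viewer, no movement.',
  },
  {
    sign: 'B',
    description:
      'Four fingers extended and together, thumb folded across the palm, ' +
      'palm facing the viewer, no movement.',
  },
];

// Compose the instruction text that precedes each analyzed frame.
function buildFewShotPrompt(examples) {
  const shots = examples
    .map((e) => `Sign "${e.sign}": ${e.description}`)
    .join('\n');
  return (
    'You are a Zimbabwe Sign Language recognizer. Use these reference ' +
    'descriptions of handshape, palm orientation, and movement:\n' +
    shots +
    '\nFor the next frame, name the closest matching sign.'
  );
}
```

Encoding the anatomy in text this way lets the model match what it sees against described handshapes, sidestepping the need for a labeled ZSL image dataset.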

What We Learned

Gemini 3 Pro's multimodal capabilities exceeded our expectations; its ability to reason across vision, audio, and text within a single model simplified our architecture dramatically. We discovered that prompt engineering with anatomical context (handshape classification, palm orientation) was more effective than traditional computer vision approaches. Most importantly, we learned that AI can be a powerful force for cultural preservation when built with community needs at the center.

Built With

  • audio-synthesis
  • computer-vision
  • gemini-3-pro
  • gemini-live-api
  • google-ai-studio
  • javascript
  • mediapipe
  • multimodal-ai
  • nano-banana
  • real-time
  • webrtc