Inspiration

Seventy million deaf people are systematically excluded from meetings, classrooms, and emergency rooms — not because of disability, but because of infrastructure failure. Ninety-five percent of hearing colleagues never learn sign language, and a human ASL interpreter costs $150/hr and needs 48-hour notice. Existing accessibility tools (Otter.ai, Zoom captions) handle text but ignore sign language entirely. We thought: what if agentic AI could close that gap in real time?

The "Build with AI" theme made this the right moment: Gemini's multimodal reasoning, the maturity of pyannote.audio, and open-source projects like sign.mt finally made a real solution possible.

What it does

SignBridge turns any multi-speaker meeting into a fully accessible experience for deaf participants. Six specialized AI agents — coordinated by Gemini — handle the full pipeline:

  • 🎤 Live mic capture in the browser via MediaRecorder (no Google dependency)
  • 🗣️ Speech-to-text via OpenAI Whisper-large-v3 on HuggingFace Inference
  • 👥 Speaker diarization via pyannote.audio 4.0.4 (real ML, runs on the laptop)
  • ✍️ Transcript cleanup + translation into 10 languages via Gemini, with automatic Claude fallback
  • 🤟 Continuous sign-language avatar via sign.mt — one fluid signing animation, not separate clips
  • 📋 End-of-meeting summary + action items with owners, deadlines, and priorities, auto-extracted

Plus a live-polling stats footer showing per-call latency, token usage, and LLM-router fallback rate. Zero mocks — /api/health confirms mock_mode: false.

How we built it

Backend (Python 3.13 + FastAPI) with six agent classes coordinated by an Orchestrator. Each agent owns one job and emits typed events that stream to the frontend over Server-Sent Events. We built a custom LLM Router that prefers Gemini 2.5 Flash and auto-falls-back to Claude Sonnet 4.6 on failure, with full per-call telemetry (provider, latency, tokens, fallback count).
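
The fallback behaviour of the router can be sketched in a few lines. This is a minimal illustration and not the project's actual code: the class names, the `(text, tokens)` provider signature, and the stub providers are all assumptions made for the example, and the real Gemini and Claude SDK calls are replaced by plain functions.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical provider shape: takes a prompt, returns (text, tokens_used).
Provider = Callable[[str], tuple[str, int]]


@dataclass
class CallRecord:
    provider: str
    latency_ms: float
    tokens: int
    was_fallback: bool


@dataclass
class LLMRouter:
    # Ordered (name, provider) pairs; the first entry is the preferred one.
    providers: list[tuple[str, Provider]]
    history: list[CallRecord] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        errors = []
        for index, (name, call) in enumerate(self.providers):
            start = time.perf_counter()
            try:
                text, tokens = call(prompt)
            except Exception as exc:  # any provider failure triggers fallback
                errors.append(f"{name}: {exc}")
                continue
            latency_ms = (time.perf_counter() - start) * 1000
            # index > 0 means the preferred provider did not answer this call.
            self.history.append(CallRecord(name, latency_ms, tokens, index > 0))
            return text
        raise RuntimeError("all providers failed: " + "; ".join(errors))

    @property
    def fallback_rate(self) -> float:
        if not self.history:
            return 0.0
        return sum(r.was_fallback for r in self.history) / len(self.history)


# Stub providers standing in for the real SDK calls (assumptions).
def failing_primary(prompt: str) -> tuple[str, int]:
    raise RuntimeError("placeholder API key")


def working_secondary(prompt: str) -> tuple[str, int]:
    return f"cleaned: {prompt}", len(prompt.split())


router = LLMRouter([("gemini", failing_primary), ("claude", working_secondary)])
reply = router.complete("um so the meeting is uh tomorrow")
```

Keeping one record per successful call is what would let a stats footer report latency, token usage, and fallback rate without any extra bookkeeping in the agents.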

ML pipeline:

  • pyannote/speaker-diarization-3.1 runs locally for real speaker ID (~6s init, ~2s inference)
  • openai/whisper-large-v3 via HuggingFace InferenceClient for ASR
  • The WLASL dataset (Word-Level American Sign Language; 2,000 glosses across 21,083 video instances) drives our 1,959-sign vocabulary lookup
  • sign.mt — Moryossef et al.'s open-source spoken-to-signed translator — embedded for the continuous avatar
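
As a rough illustration of the vocabulary lookup mentioned above, a word-to-sign lookup might look like the sketch below. The two-entry dictionary, the placeholder URLs, and the function names are hypothetical. The real table would be built from WLASL metadata, and the write-up does not describe its exact structure.

```python
import re
from typing import Optional

# Hypothetical excerpt: gloss -> video URL. Placeholder URLs, not real WLASL links.
VOCAB: dict[str, str] = {
    "hello": "https://example.invalid/hello.mp4",
    "meeting": "https://example.invalid/meeting.mp4",
}


def lookup_signs(sentence: str, vocab: dict[str, str]) -> list[tuple[str, Optional[str]]]:
    """Pair each lowercase word with its sign video URL, or None if uncovered."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [(word, vocab.get(word)) for word in words]


def coverage(pairs: list[tuple[str, Optional[str]]]) -> float:
    """Fraction of words in the sentence that have a sign in the vocabulary."""
    if not pairs:
        return 0.0
    return sum(url is not None for _, url in pairs) / len(pairs)


pairs = lookup_signs("Hello, the meeting!", VOCAB)
```

Words that come back as `None` are the ones a fixed vocabulary cannot sign, which is one reason a continuous translator such as sign.mt is a useful complement to a lookup table.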

Frontend (Next.js 15 + React 19 + Framer Motion) with a glass-morphism design, animated gradient backdrops, real-time SSE consumer, and a MediaRecorder → 16kHz PCM WAV → backend pipeline that sidesteps Google's flaky Web Speech API entirely.
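
The WAV encoder itself lives in the browser and is not reproduced here. To show the target format, the Python sketch below packs float samples into the same kind of 16 kHz, mono, 16-bit PCM WAV using only the standard library. The function name and the choice of 16-bit samples are assumptions for illustration.

```python
import io
import struct
import wave


def encode_wav(samples: list[float], sample_rate: int = 16000) -> bytes:
    """Pack float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip first so out-of-range values cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)  # 16 kHz by default
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at 16 kHz.
wav_bytes = encode_wav([0.0] * 16000)
```

A browser encoder would do the equivalent with a `DataView` over an `ArrayBuffer`: write the RIFF header, then the little-endian samples.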

17 pytest tests cover the LLM router fallback, all REST endpoints, the SSE stream, and the full meeting lifecycle.
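
For readers unfamiliar with Server-Sent Events, the typed events the agents emit could be framed for the stream roughly as follows. The `AgentEvent` fields and the event names are hypothetical. Only the `event:` and `data:` line format followed by a blank line comes from the SSE specification.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentEvent:
    agent: str     # which agent produced the event (hypothetical field)
    kind: str      # event type, used as the SSE event name (hypothetical field)
    payload: dict  # event-specific data


def to_sse(event: AgentEvent) -> str:
    """Serialize one event as an SSE frame: event line, data line, blank line."""
    # json.dumps without indent never emits newlines, so one data line suffices.
    data = json.dumps(asdict(event), ensure_ascii=False)
    return f"event: {event.kind}\ndata: {data}\n\n"


frame = to_sse(AgentEvent("diarization", "speaker_turn", {"speaker": "S1", "start": 0.0}))
```

On the browser side, an `EventSource` listener registered for `speaker_turn` would receive the JSON string in `event.data`.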

Challenges we ran into

  • pyannote.audio gated three separate models behind HuggingFace acceptance — speaker-diarization-3.1, segmentation-3.0, AND speaker-diarization-community-1. The 403 error didn't tell us which one. Took an hour of API spelunking to find the third.
  • HuggingFace Inference API URL format changed mid-hackathon. Direct REST POST to api-inference.huggingface.co/models/... returned 404 for every Whisper variant. Solution: switch to the official huggingface_hub.InferenceClient SDK, which handles routing automatically.
  • Web Speech API "network" errors on macOS Chrome — Google's Speech endpoint is unreachable behind some VPNs/networks. We replaced it entirely with browser MediaRecorder + AudioContext PCM capture + a custom WAV encoder, then sent audio to our own Whisper backend.
  • Every free ASL video CDN blocks hot-linking (Spread The Sign, HandSpeak, SignDict all return 403). Tried four sources before finding Microsoft signstock.blob.core.windows.net URLs in WLASL metadata that actually serve direct MP4. Then pivoted to embedding sign.mt entirely for one continuous signing flow.
  • pyannote 4.x renamed use_auth_token to token and changed the return type from Annotation to DiarizeOutput, a fun discovery during demo prep.
  • A stale Next.js .next cache caused Cannot find module './403.js' errors while multiple dev servers competed for ports. Solved by killing every dev server and running rm -rf .next.

Accomplishments that we're proud of

  • Zero mocks at runtime. Every component is real ML or real API. /api/health returns {"mock_mode": false, "mock_diarization": false, "mock_sign_lookup": false}.
  • 17/17 tests passing including a tricky SSE-streaming endpoint test with race-condition handling.
  • 1,959 real ASL signs from the WLASL dataset (98% of WLASL's 2,000 glosses).
  • A genuine multi-agent system — six specialized Python agents with shared context, autonomous routing, tool use, and event streaming. Not a single LLM in a trench coat.
  • Robust LLM fallback — when our Gemini key was a placeholder for testing, the router silently routed every call to Claude. The user never saw an error.
  • Beautiful, polished UI with Framer Motion animations, glassmorphism, live agent activity panel, and sign.mt 3D avatar embedded as a continuous signing flow.

What we learned

  • Agentic architecture beats single LLM calls. Six specialized agents emitting typed events let us tell the story visually (the agent network panel) and produce more accurate per-step outputs than asking one model to do everything.
  • Always build the fallback. Our LLM router saved the demo when Gemini was misconfigured. Production AI systems live or die on graceful degradation.
  • Browser APIs are the wild west. Web Speech, MediaRecorder, AudioContext, iframe COEP/COOP — every one had a quirk. Investing in our own audio pipeline (MediaRecorder + WAV encoder + Whisper backend) was the right call.
  • Real datasets > synthetic placeholders. Going from 40 hand-curated words to 1,959 real WLASL signs took 2 hours and transformed the demo's credibility.
  • Honesty in scope wins trust. Calling out what's curated (the sign vocabulary lookup) vs. what's full ML (Whisper, pyannote, LLMs) makes the rest more believable.

What's next for SignBridge

  • Q3 2026 — Domain-specific modes (Medical, Legal, Education) with preset vocabulary and tone
  • Q4 2026 — WLASL camera input: deaf user signs at webcam → text out for hearing colleagues (true two-way conversation)
  • Q1 2027 — RAG over past meetings, Firebase persistence, Slack/Zoom plugin
  • Vision — Every meeting accessible. Every classroom captioned. Every emergency room signed. For the 70 million who deserve to be heard.

Built With

  • anthropic-api
  • claude
  • claude-sonnet-4.6
  • fastapi
  • framer-motion
  • gemini
  • gemini-2.5-flash
  • google-ai-studio
  • huggingface
  • mediarecorder
  • next.js
  • openai-whisper
  • pyannote.audio
  • pydantic
  • pytest
  • python
  • react
  • sign.mt
  • sse-starlette
  • tailwindcss
  • typescript
  • uvicorn
  • web-audio-api
  • whisper-large-v3
  • wlasl