Inspiration

Over 40 million people worldwide are non-verbal or minimally verbal, many of them autistic. Commercial AAC (Augmentative and Alternative Communication) devices cost $3,000–$15,000, and most apps require expensive subscriptions. I asked: What if AI could make a genuinely useful communication tool that costs almost nothing to run, works on any tablet, and actually sounds natural, not robotic?

What it does

Second Voice is a tap-to-speak communication board:

  • Tap picture tiles organized into categories (Needs, Feelings, People, Food, Places, Actions) to build a message.
  • Gemini AI composes the tapped keywords into a natural first-person sentence ("water" + "want" becomes "I want some water, please.").
  • Kokoro TTS speaks it aloud in a natural human voice — entirely on CPU, no cloud dependency for speech.
  • The phrase bank learns — tiles you use most float to the top automatically.
  • Accessibility-first — high-contrast mode, adjustable text size, and a scanning mode for single-switch users with motor impairments.
  • Installable PWA — works full-screen on a tablet like a dedicated AAC device.
  • Quick-reply tiles ("Yes", "No", "I need help", "Thank you") speak instantly with one tap — critical for urgent communication.
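
The "tiles you use most float to the top" behavior can be sketched as a stable sort on a usage counter. This is only an illustration under stated assumptions, not the project's actual code: the `Tile` dataclass, its `uses` field, and the `record_tap` and `order_tiles` helpers are hypothetical names, and the real phrase bank may store counts differently or also weight recency.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """Hypothetical tile record; the field names are assumptions."""
    label: str
    uses: int = 0  # number of times this tile has been tapped


def record_tap(tile: Tile) -> None:
    """Increment the usage counter when a tile is tapped."""
    tile.uses += 1


def order_tiles(tiles: list[Tile]) -> list[Tile]:
    """Return a new list with the most-used tiles first.

    sorted() is stable even with reverse=True, so tiles with equal
    counts keep their original board order.
    """
    return sorted(tiles, key=lambda t: t.uses, reverse=True)
```

Keeping ties in their original order matters for AAC users, who often rely on tiles staying in predictable positions; a sort that reshuffled equally used tiles would undo that.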

How I built it

  • Backend (Python + FastAPI):
    • SQLite phrase bank with 7 categories and 70 seeded tiles
    • Google Gemini 2.5 Flash (free tier) for keyword-to-sentence composition with a graceful plain-join fallback
    • kokoro-onnx (the same 82M-parameter Kokoro engine bundled by Voicebox) for CPU text-to-speech returning WAV audio
  • Frontend (React + TypeScript + Tailwind + Vite):
    • Zustand state management with persisted accessibility settings
    • Responsive tile grid, sentence-builder bar with chips, and a large Speak button
    • PWA via vite-plugin-pwa with a service worker for offline caching
    • Scanning mode implemented with interval-based highlight cycling plus Space/Enter selection
  • Testing:
    • Backend smoke tests hitting all 5 endpoints
    • 4 Playwright E2E tests covering board loading, quick-reply speech, full compose+speak flow, and sentence editing
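
The "graceful plain-join fallback" mentioned for sentence composition could look roughly like the sketch below. It is a minimal illustration, not the actual backend code: `plain_join`, `compose_sentence`, and the `llm_compose` callable (standing in for whatever wraps the Gemini request) are all hypothetical names, and the exact fallback wording is an assumption.

```python
from typing import Callable, Sequence


def plain_join(keywords: Sequence[str]) -> str:
    """Fallback: join the tapped keywords into a simple speakable phrase."""
    words = [k.strip() for k in keywords if k.strip()]
    if not words:
        return ""
    text = " ".join(words)
    # Capitalize the first letter and end with a period.
    return text[0].upper() + text[1:] + "."


def compose_sentence(
    keywords: Sequence[str],
    llm_compose: Callable[[Sequence[str]], str],
) -> str:
    """Try the model first; fall back to plain_join on any failure.

    llm_compose is a hypothetical callable wrapping the model request.
    """
    try:
        sentence = llm_compose(keywords).strip()
    except Exception:
        # Network error, quota exhaustion, timeout, or a malformed
        # response: the user should still be able to say something.
        return plain_join(keywords)
    # An empty model reply is treated the same as a failure.
    return sentence or plain_join(keywords)
```

Injecting the model call as a parameter keeps the fallback path testable without network access, which suits the kind of smoke tests listed above.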

Challenges I ran into

  • Thinking tokens eating output — Gemini 2.5 Flash's "thinking" mode consumed the output budget, returning truncated sentences. Fixed by disabling thinking via ThinkingConfig(thinking_budget=0).
  • No GPU available — full Voicebox requires a GPU. I extracted just the Kokoro ONNX engine (~350MB), which runs great on CPU with sub-second synthesis.
  • Browser autoplay restrictions — audio won't play without a user gesture. I ensured every speak action originates from a tap/click event.

Accomplishments that I am proud of

  • Zero cost to run: no paid APIs, no GPU, no subscriptions. Gemini free tier plus CPU-only TTS.
  • Real accessibility: scanning mode for switch users isn't a checkbox feature; I implemented proper interval cycling with keyboard selection that works for users with motor impairments.
  • Sub-second speech: kokoro-onnx on CPU delivers natural audio in under a second. The compose+speak round trip takes about 3 seconds in total.
  • Inspired by real open source: I didn't just call APIs; I studied Voicebox's architecture and extracted its core engine in a lightweight form.

What I learned

  • How AAC systems actually work and what makes them usable (large targets, scanning, symbol+text, frequency-based ordering)
  • Gemini's free-tier constraints and how to work within them (minimum request deadlines, thinking budget control)
  • ONNX runtime for running neural TTS models on CPU without framework overhead
  • The gap between "AI that's cool" and "AI that helps someone communicate basic needs" — and how small that gap actually is with the right tools

What's next for Second Voice

  • Short term (next 2–4 weeks):
    • Deploy to DigitalOcean with a public URL and HTTPS
    • Add a text-input mode for literate users who want to type freely
    • Integrate Unlimited-OCR (free Hugging Face Space) for a "read-the-world" mode: point at a sign or menu and hear it read aloud
  • Mid term (1–3 months):
    • Personal voice cloning via free Hugging Face Spaces (F5-TTS/XTTS) so users hear their own voice
    • Caregiver dashboard for customizing boards, adding tiles, and reviewing usage analytics
    • Gemini-powered next-tile prediction based on usage patterns and context
    • Multilingual support (Kokoro supports 8 languages; Chatterbox supports 23)
  • Long term (3–6 months):
    • Speech-to-text input for partially verbal users
    • Cross-session memory that learns routines ("morning" context surfaces breakfast/bathroom tiles)
    • Offline-first mode with bundled local voice for connectivity-free use
    • Open-source release with a contributor guide so the AAC community can add symbol sets and languages
