Inspiration

Over 34 million children worldwide live with disabling hearing loss, yet the world around them is built entirely around sound. Classrooms, conversations, public spaces: all designed for those who can hear. We were moved by the stories of children and families navigating a world that simply wasn't built for them, and we asked ourselves: what if the world could come to them instead?

What it does

VisibleVoice is a real-time captioning system built on Snap Spectacles. Speech from anyone nearby is captured through a phone's microphone, transcribed instantly, and displayed as an AR text overlay directly in the wearer's field of vision.

Unlike traditional captioning apps that require you to hold up a phone or look down at a screen, VisibleVoice keeps the wearer present in the conversation. The transcript appears right where they're already looking. Sentences fade in as they're spoken and disappear after a natural pause, keeping the display clean and readable without overwhelming the user.

How we built it

VisibleVoice is a distributed real-time pipeline connecting three independent systems:

  1. Phone (speech capture): a lightweight web app hosted on Vercel uses the browser's Web Speech API (its SpeechRecognition interface) to transcribe speech in real time. The app distinguishes between interim results (words still being spoken) and final results (completed sentences), using a custom 800ms pause timer to detect natural sentence breaks faster than the browser's default end-of-utterance detection. Anyone can use it simply by visiting a URL; no download is required. A sketch of this capture loop follows the list.

  2. Firebase (cloud relay): Google's Firebase Realtime Database acts as the low-latency middleman between the phone and the glasses. Every speech update is written as a simple JSON object containing the interim text, the final text, and a timestamp. Firebase propagates these writes in under 100ms, a far better fit for streaming captions than routing each update through a conventional request/response database server (see the second sketch below).

  3. Snap Spectacles (AR display): a TypeScript script running in Lens Studio polls Firebase every 60-100ms using Snap's InternetModule HTTP client. The script distinguishes between interim and final updates: interim text appears instantly, as close to real time as possible, while completed sentences trigger a smooth fade-out before clearing the screen. This keeps the display feeling natural and readable rather than flickery or overwhelming (see the third sketch below).
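
To make the pipeline concrete, here is a minimal sketch of the phone-side capture loop, assuming a browser that exposes SpeechRecognition (or the webkit-prefixed variant). The publishUpdate helper and the PAUSE_MS constant are our placeholder names, not the app's actual identifiers:

```typescript
// Forward declaration; a possible implementation appears in the Firebase sketch below.
declare function publishUpdate(u: { interim: string; final: string; timestamp: number }): void;

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;      // keep listening across utterances
recognition.interimResults = true;  // emit partial hypotheses as words arrive

const PAUSE_MS = 800;               // treat 800ms of silence as a sentence break
let pauseTimer: number | undefined;

recognition.onresult = (event: any) => {
  let interim = "";
  let final = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const text = event.results[i][0].transcript;
    if (event.results[i].isFinal) final += text;
    else interim += text;
  }

  publishUpdate({ interim, final, timestamp: Date.now() });

  // Restart the pause timer on every update. If no new words arrive within
  // PAUSE_MS, promote the current interim text to a final sentence rather
  // than waiting for the browser's slower end-of-utterance detection.
  window.clearTimeout(pauseTimer);
  if (interim) {
    pauseTimer = window.setTimeout(() => {
      publishUpdate({ interim: "", final: interim, timestamp: Date.now() });
    }, PAUSE_MS);
  }
};

recognition.start();
```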
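
The relay itself needs very little code. Below is a possible implementation of the publishUpdate helper from the sketch above, written against Firebase's Realtime Database REST endpoint so no SDK is required; the database URL and the transcript path are placeholders:

```typescript
// Placeholder URL; the real project's database name would go here.
const DB_URL = "https://YOUR-PROJECT-default-rtdb.firebaseio.com/transcript.json";

interface SpeechUpdate {
  interim: string;   // words still being spoken
  final: string;     // completed sentence, if any
  timestamp: number; // ms since epoch; lets the glasses skip stale data
}

async function publishUpdate(update: SpeechUpdate): Promise<void> {
  // PUT overwrites the single transcript node. The glasses only ever need
  // the latest state, not a history of every intermediate update.
  await fetch(DB_URL, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(update),
  });
}
```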
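
Finally, a simplified model of the glasses-side loop. On-device this logic runs inside a Lens Studio script component and the HTTP call goes through Snap's InternetModule; here we use standard fetch and setTimeout so the sketch is runnable anywhere, and the render stub stands in for updating the AR text component. The poll interval and fade delay are illustrative values:

```typescript
const DB_URL = "https://YOUR-PROJECT-default-rtdb.firebaseio.com/transcript.json";
const POLL_MS = 80;         // within the 60-100ms polling window
const FADE_DELAY_MS = 2000; // how long a finished sentence lingers before clearing

interface SpeechUpdate { interim: string; final: string; timestamp: number; }

let lastTimestamp = 0;
let fadeTimer: ReturnType<typeof setTimeout> | undefined;

function render(text: string): void {
  // On Spectacles this would set the AR text overlay; stubbed here.
  console.log("caption:", text);
}

async function poll(): Promise<void> {
  try {
    const update: SpeechUpdate = await (await fetch(DB_URL)).json();

    // Only react to updates we haven't displayed yet.
    if (update.timestamp > lastTimestamp) {
      lastTimestamp = update.timestamp;

      if (update.interim) {
        // Interim text shows immediately and cancels any pending fade.
        clearTimeout(fadeTimer);
        render(update.interim);
      } else if (update.final) {
        // A completed sentence stays briefly, then the display clears.
        render(update.final);
        fadeTimer = setTimeout(() => render(""), FADE_DELAY_MS);
      }
    }
  } catch {
    // Network hiccups are expected; just try again on the next tick.
  }
  setTimeout(poll, POLL_MS);
}

poll();
```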

Challenges we faced

We hit a fundamental limitation early on: Snap's native VoiceML/ASR module captures only the wearer's own voice, which would have made VisibleVoice useful only for transcribing yourself, the opposite of what we needed. We solved this by routing audio through an external phone microphone instead, which captures the full conversation happening around the wearer and dramatically expands the use case.

What we learned

We learned that accessibility-focused design requires a fundamentally different mindset. Every decision (font size, subtitle position, update speed, word wrap) has a direct impact on whether the experience actually works for someone in a real-world social setting. We also deepened our understanding of how powerful AR can be not just as entertainment, but as a genuine tool for inclusion.

What's next for VisibleVoice

We want to expand support to multiple languages, add speaker identification so users can tell who is speaking, and explore integration with educational settings where students who are hard of hearing could follow classroom lectures in real time, independently, and without drawing attention to themselves.

Built With

firebase, lens-studio, snap-spectacles, typescript, vercel, web-speech-api
