Inspiration
Conversation is one of the few parts of daily life that still breaks down instantly when accessibility fails. In a classroom, a hallway, a family dinner, a doctor’s office, or a fast-moving group discussion, missing just a few words can mean missing the point, the joke, the instruction, or the chance to respond at all. Most captioning tools stop at transcription, which is insufficient because live conversation is messy and social: full of interruptions, overlapping speakers, uncertainty, and moments you need to recover in real time.
We built LumenAR because accessibility should not end at an infodump of captions; it should help a user keep up, recover missed moments, identify who is speaking, and stay socially included without having to constantly ask others to repeat themselves. We wanted to build something that feels integrated into a user's life: one that works in the browser, extends into AR, and is designed around dignity, independence, and real human interaction.
What it does
LumenAR is a real-time accessibility assistant for live conversations.
At its core, it streams microphone audio through a secure local proxy to generate live captions with clear confidence indicators, so users can immediately tell when speech was heard reliably and when it was not. But the real value is that it goes beyond captioning.
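On the captioning piece itself, the confidence indicators boil down to mapping per-word confidence scores onto display states the UI can style. A minimal sketch of that idea, assuming Deepgram-style word confidences between 0 and 1 (the thresholds and type names here are illustrative, not our exact values):

```typescript
// Sketch: map word-level confidence to a display state for the caption UI.
// Assumes Deepgram-style words with a 0..1 confidence score; thresholds are illustrative.

type CaptionWord = { word: string; confidence: number };
type DisplayState = "reliable" | "uncertain" | "low";

function displayStateFor(word: CaptionWord): DisplayState {
  if (word.confidence >= 0.9) return "reliable";   // render normally
  if (word.confidence >= 0.6) return "uncertain";  // render dimmed or underlined
  return "low";                                    // flag as a candidate missed moment
}

// Example: tag each word so the front end can style it.
const words: CaptionWord[] = [
  { word: "meet", confidence: 0.97 },
  { word: "at", confidence: 0.95 },
  { word: "noon?", confidence: 0.48 },
];
console.log(words.map((w) => `${w.word} [${displayStateFor(w)}]`).join(" "));
```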
LumenAR includes speaker profiles, so conversation stays tied to identity instead of becoming a blur of anonymous text. When a new diarized speaker appears, the system can automatically create a profile, and the user can edit, assign, or refine it across both the web app and the AR interface. That means the product does not just show words; it helps preserve social context.
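Roughly, the profile layer keeps a mapping from raw diarization labels to editable profiles, creating a placeholder the first time a new label appears. A simplified sketch of that behavior (the field names are illustrative, not our exact schema):

```typescript
// Sketch: auto-create an editable profile the first time a diarization label appears.
// Field names are illustrative, not the exact schema used in LumenAR.

interface SpeakerProfile {
  id: string;          // raw diarization label, e.g. "speaker_0"
  displayName: string; // user-editable, e.g. "Dr. Patel"
  color: string;       // caption color for this speaker
  confirmed: boolean;  // true once the user has assigned or edited it
}

const profiles = new Map<string, SpeakerProfile>();

function ensureProfile(diarizedSpeaker: string): SpeakerProfile {
  let profile = profiles.get(diarizedSpeaker);
  if (!profile) {
    // New speaker detected: create a placeholder the user can rename later.
    profile = {
      id: diarizedSpeaker,
      displayName: `Speaker ${profiles.size + 1}`,
      color: ["#4f9", "#f94", "#49f"][profiles.size % 3],
      confirmed: false,
    };
    profiles.set(diarizedSpeaker, profile);
  }
  return profile;
}

// Example: the same label always maps back to the same profile.
console.log(ensureProfile("speaker_0").displayName); // "Speaker 1"
console.log(ensureProfile("speaker_0") === ensureProfile("speaker_0")); // true
```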
It also includes a missed-moment repair loop, which is one of the most important parts of real accessibility. If the user misses something, they can instantly mark that moment, ask the speaker to repeat, slow down, or face them, and link the repeated speech back to the original missed section. That is a fundamentally different design philosophy from passive captioning. It actively helps the user recover the conversation.
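Conceptually, a missed moment is just a marked time range that a later utterance can be linked back to. A simplified sketch of that linking (the types and fields are illustrative, not LumenAR's exact data model):

```typescript
// Sketch: mark a missed moment, then link a later repeated utterance back to it.
// Types and fields are illustrative, not LumenAR's exact data model.

interface MissedMoment {
  id: string;
  startMs: number;           // where the user tapped "I missed that"
  endMs: number;
  speakerId?: string;        // who was talking, if diarization knew
  repairTranscript?: string; // filled in once the speaker repeats themselves
}

const missed: MissedMoment[] = [];

function markMissed(startMs: number, endMs: number, speakerId?: string): MissedMoment {
  const moment: MissedMoment = { id: `miss-${missed.length + 1}`, startMs, endMs, speakerId };
  missed.push(moment);
  return moment;
}

function linkRepair(momentId: string, repeatedText: string): void {
  const moment = missed.find((m) => m.id === momentId);
  if (moment) moment.repairTranscript = repeatedText; // the gap is now recoverable in the transcript
}

// Example: the user misses two seconds of speech, asks for a repeat, and the repeat is linked back.
const gap = markMissed(61_000, 63_000, "speaker_0");
linkRepair(gap.id, "The assignment is due Friday at 5pm.");
console.log(missed);
```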
LumenAR also supports conversation memory, including rolling summaries, extracted action items, and replay for recent captions. In AR, it provides a camera-first caption overlay with gesture-based controls, speaker lookup, and live translation settings. Altogether, the system turns accessibility from a single transcript into a full live-conversation support layer.
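The replay piece can be thought of as a bounded rolling buffer of recent captions. A minimal sketch of that structure (the 60-second window is an assumption for illustration, not a fixed product value):

```typescript
// Sketch: rolling buffer of recent captions for "replay the last N seconds".
// The 60-second window is an illustrative default.

interface CaptionEntry { atMs: number; speakerId: string; text: string }

class ReplayBuffer {
  private entries: CaptionEntry[] = [];
  constructor(private windowMs = 60_000) {}

  push(entry: CaptionEntry): void {
    this.entries.push(entry);
    // Drop anything older than the window so memory stays bounded.
    const cutoff = entry.atMs - this.windowMs;
    this.entries = this.entries.filter((e) => e.atMs >= cutoff);
  }

  replay(lastMs: number, nowMs: number): CaptionEntry[] {
    return this.entries.filter((e) => e.atMs >= nowMs - lastMs);
  }
}

// Example: replay the last 10 seconds of captions.
const buffer = new ReplayBuffer();
buffer.push({ atMs: 5_000, speakerId: "speaker_0", text: "Turn to page twelve." });
buffer.push({ atMs: 12_000, speakerId: "speaker_1", text: "Which paragraph?" });
console.log(buffer.replay(10_000, 15_000)); // only the second caption
```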
How we built it
The first piece of Lumen is a React + Vite web app, which provides the main live captioning interface, transcript history, speaker profile management, missed-moment repair tools, typed communication, and session memory controls. This gave us a fast, flexible front end where users can interact with the system in a familiar browser environment.
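In the web app, the live captioning view just subscribes to caption events coming back from the local proxy. A rough React hook sketch, assuming the proxy emits JSON messages over a local WebSocket (the URL and message shape are assumptions for illustration):

```typescript
// Sketch: a React hook that subscribes to caption events from the local proxy.
// The ws://localhost:3001/captions URL and message shape are assumptions.

import { useEffect, useState } from "react";

interface Caption { speakerId: string; text: string; confidence: number }

export function useLiveCaptions(url = "ws://localhost:3001/captions"): Caption[] {
  const [captions, setCaptions] = useState<Caption[]>([]);

  useEffect(() => {
    const socket = new WebSocket(url);
    socket.onmessage = (event) => {
      const caption: Caption = JSON.parse(event.data);
      // Keep only recent captions here; the full transcript history lives elsewhere.
      setCaptions((prev) => [...prev.slice(-50), caption]);
    };
    return () => socket.close(); // clean up when the component unmounts
  }, [url]);

  return captions;
}
```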
The second piece is the AR app, built with Electron. This creates a webcam-based overlay experience where captions appear directly in a camera-first interface. We added speaker-aware controls, a profile side panel, gesture input, and translation settings. We prioritized seamless adoption by users unfamiliar with complex technology, and that goal drove our UI decisions for the overlay.
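At the Electron level, the camera-first overlay is essentially a frameless, always-on-top window whose renderer draws the webcam feed and captions. A rough sketch of the main-process setup (file paths and window dimensions are illustrative):

```typescript
// Sketch: Electron main-process window setup for a camera-first caption overlay.
// File paths and window dimensions are illustrative.

import { app, BrowserWindow } from "electron";

function createOverlayWindow(): BrowserWindow {
  const win = new BrowserWindow({
    width: 1280,
    height: 720,
    frame: false,      // no native chrome, so the camera view fills the window
    alwaysOnTop: true, // captions stay visible over other apps
    webPreferences: {
      contextIsolation: true, // keep the renderer sandboxed; no API keys live here
    },
  });
  // The renderer requests the webcam with getUserMedia and draws captions on top of it.
  win.loadFile("renderer/index.html");
  return win;
}

app.whenReady().then(createOverlayWindow);
```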
The third piece is a local Node proxy server that handles Deepgram transcription, translation routing, and shared speaker-profile storage. This was a major architectural decision. Instead of exposing API keys in the browser or the Electron client, we isolated the Deepgram key inside the local proxy. That gave us a cleaner, safer, more serious system design while also letting the web and AR apps share the same live profile state.
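Shape-wise, the proxy accepts WebSocket connections from the web and AR clients and relays audio frames to Deepgram's live endpoint, attaching the API key only on the server. A trimmed-down sketch using the `ws` package (the port, query parameters, and message handling are simplified for illustration):

```typescript
// Sketch: local proxy that keeps the Deepgram key server-side and relays audio both ways.
// Port, query parameters, and message handling are simplified for illustration.

import { WebSocket, WebSocketServer } from "ws";

const DEEPGRAM_KEY = process.env.DEEPGRAM_API_KEY!; // never shipped to the browser or Electron client

const server = new WebSocketServer({ port: 3001 });

server.on("connection", (client) => {
  // One upstream Deepgram connection per client session.
  const upstream = new WebSocket(
    "wss://api.deepgram.com/v1/listen?diarize=true&interim_results=true",
    { headers: { Authorization: `Token ${DEEPGRAM_KEY}` } }
  );

  // Client -> proxy -> Deepgram: forward raw audio frames.
  client.on("message", (audioChunk) => {
    if (upstream.readyState === WebSocket.OPEN) upstream.send(audioChunk);
  });

  // Deepgram -> proxy -> client: forward transcription results.
  upstream.on("message", (result) => client.send(result.toString()));

  client.on("close", () => upstream.close());
  upstream.on("close", () => client.close());
});
```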
From there, we layered in speaker diarization, dominant-speaker selection, profile syncing, missed-speech linking, translation fallbacks, sound alerts, demo-safe flows, and gesture-based interaction. Every major piece was built to solve an actual conversational accessibility problem rather than exist as a flashy feature.
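For example, dominant-speaker selection can be approximated by counting recent speech per diarization label inside a short window. A simplified sketch of that idea (the window length and word counting are illustrative simplifications):

```typescript
// Sketch: pick the "dominant" speaker as whoever spoke the most in a recent window.
// The 5-second window and word counting are illustrative simplifications.

interface Utterance { speakerId: string; endMs: number; wordCount: number }

function dominantSpeaker(utterances: Utterance[], nowMs: number, windowMs = 5_000): string | null {
  const counts = new Map<string, number>();
  for (const u of utterances) {
    if (u.endMs >= nowMs - windowMs) {
      counts.set(u.speakerId, (counts.get(u.speakerId) ?? 0) + u.wordCount);
    }
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [speaker, count] of counts) {
    if (count > bestCount) { best = speaker; bestCount = count; }
  }
  return best;
}

// Example: speaker_1 has said more in the last five seconds, so the overlay highlights them.
console.log(
  dominantSpeaker(
    [
      { speakerId: "speaker_0", endMs: 9_000, wordCount: 3 },
      { speakerId: "speaker_1", endMs: 9_500, wordCount: 12 },
    ],
    10_000
  )
); // "speaker_1"
```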
Challenges we ran into
The hardest challenge was that live accessibility is much harder than live transcription. A transcript alone is not enough when speech is fast, multiple people are talking, confidence varies across utterances, and the user needs to know who said what in real time. We had to think beyond “speech-to-text” and build around conversational recovery.
One major challenge was speaker identity. Raw diarization labels are not useful on their own if a user wants to know whether the current speaker is a teacher, a friend, or a doctor. We had to design a profile system that could automatically create placeholders for new speakers while still allowing manual confirmation, assignment, editing, and synchronization across interfaces.
Another challenge was architecture and security. We wanted live captions powered by Deepgram, but we did not want the API key leaking into front-end code. That meant building and maintaining a local proxy layer that could handle WebSocket streaming, translation routes, and shared state while staying responsive enough for a real-time experience.
The AR side introduced its own difficulties. Traditional UI patterns do not work well in a camera-first overlay, and native controls are often awkward for gesture input. We had to build custom interaction patterns, including hold-click timing, visual progress feedback, and gesture-friendly dropdown behavior. On top of that, we needed the whole project to remain demo-resilient even if live audio or external services misbehaved, so we created fallback scenarios and scripted flows that still communicate the product clearly.
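The hold-click pattern comes down to a timer started on pointer-down that only fires the action if the pointer stays down long enough, reporting progress along the way so the UI can draw a progress ring. A browser-side sketch (the 800 ms hold duration and callback shape are illustrative):

```typescript
// Sketch: hold-to-activate control for gesture-friendly input in the AR overlay.
// The 800 ms hold duration and progress callback shape are illustrative.

function attachHoldAction(
  element: HTMLElement,
  onProgress: (fraction: number) => void, // drives a visual progress indicator
  onActivate: () => void,
  holdMs = 800
): void {
  let rafId = 0;

  const start = () => {
    const startedAt = performance.now();
    const tick = () => {
      const fraction = Math.min((performance.now() - startedAt) / holdMs, 1);
      onProgress(fraction);
      if (fraction >= 1) {
        onActivate(); // held long enough: fire the action
      } else {
        rafId = requestAnimationFrame(tick);
      }
    };
    rafId = requestAnimationFrame(tick);
  };

  const cancel = () => {
    cancelAnimationFrame(rafId);
    onProgress(0); // released early: reset the progress indicator
  };

  element.addEventListener("pointerdown", start);
  element.addEventListener("pointerup", cancel);
  element.addEventListener("pointerleave", cancel);
}
```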
Accomplishments that we're proud of
We are proud that LumenAR is not just another AI slop caption tool. What Nirvaan and I have built is a fully thought-through accessibility system with a clear user, a real need, and a design that respects how live conversation actually works.
We are especially proud of the missed-moment repair loop, because it solves a real human problem that most caption tools ignore. Accessibility should include recovery, not just detection, and this feature makes that principle concrete. We are also proud of the shared speaker-profile system, which turns anonymous diarization into something socially meaningful across both the web and AR experiences.
From a technical standpoint, we are proud of the local-proxy architecture that protects the Deepgram API key and keeps sensitive configuration out of client code. That makes the project feel serious and deployable rather than superficial. We are also proud that the AR app is not a gimmick: it has actual caption overlays, profile interaction, translation controls, and gesture support that make hands-free usage possible.
Most importantly, we are proud that this project is clearly aimed at social good. It addresses communication access, autonomy, inclusion, and dignity for people who are deaf or hard of hearing. It is a project with real beneficiaries, not just a technical stunt.
What we learned
We learned a lot about real-time streaming architecture, WebSocket proxying, cross-client shared state, gesture-driven interface design, and the complexity of combining speech, translation, and AR into one coherent product.
What's next for LumenAR
An important next step in getting Lumen out to the public is testing the technology against less-than-ideal conditions. Current testing shows that Lumen works well in quiet environments without much background noise, but most real conversations do not happen in those conditions. Beyond robustness, we hope to add better translation quality, richer memory and summary controls, more precise sound-event detection, and stronger privacy options for sensitive environments like classrooms, clinics, and workplaces. On the AR side, we want to make the overlay lighter, faster, and more wearable, so the interface feels increasingly invisible while still being useful.
Built With
- css
- deepgram
- html
- javascript
- typescript
