Inspiration

Current screen readers feel about 10 years out of date. Many of them rely on stepping through HTML elements like buttons, links, and headings one by one. That makes web browsing slow and confusing, and makes it harder for visually impaired people to act independently.

We wanted to build a screen reader that feels more natural, like talking to a person who explains what's on your computer.

What it does

BirdBox.ai helps visually impaired users understand and navigate webpages using AI and voice.

When a user opens a page, BirdBox.ai gives a quick summary of what is there. Then the user can ask questions like “Where do I log in?” or “What buttons are on this page?” The app responds out loud and helps guide them through the page.

How we built it

The app is built in Swift on macOS. It reads the page through accessibility APIs and sends that context to our FastAPI backend on Railway. Claude turns it into short summaries and answers, while Deepgram handles the voice loop: Nova-3 for speech-to-text and Aura for text-to-speech. Some sites, like Gmail, barely expose anything through accessibility, so we fall back to Browserbase, which loads the page in headless Chrome and pulls the real content. Redis stores summaries and reading position so the app remembers where you left off. Auth runs through Supabase, and we use Sentry and Arize Phoenix to catch bugs along the way.
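As a rough illustration of the "remembers where you left off" part, here is a minimal Python sketch of storing a summary and reading position per user and page. The key format, the function names, and the plain dict standing in for Redis are all assumptions for illustration, not the project's actual code.

```python
import json
from typing import MutableMapping, Optional


def state_key(user_id: str, url: str) -> str:
    # Hypothetical key layout: one entry per user and page.
    return f"birdbox:state:{user_id}:{url}"


def save_state(store: MutableMapping[str, str], user_id: str, url: str,
               summary: str, position: int) -> None:
    # Serialize to JSON so the value is a plain string, as a key-value store expects.
    store[state_key(user_id, url)] = json.dumps(
        {"summary": summary, "position": position}
    )


def load_state(store: MutableMapping[str, str], user_id: str,
               url: str) -> Optional[dict]:
    # Returns None when the user has not visited this page before.
    raw = store.get(state_key(user_id, url))
    return json.loads(raw) if raw is not None else None


# Usage with a dict standing in for the real store.
store: dict = {}
save_state(store, "u1", "https://example.com", "News homepage.", 3)
print(load_state(store, "u1", "https://example.com"))
```

With a real Redis client, the item assignment and lookup would become set and get calls on the client.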

Technologies we used

Browserbase

Our Mac client reads pages through Chrome’s accessibility tree first. When that fails on SPAs like Gmail, our backend spins up headless Browserbase sessions to fetch the live DOM (headings, paragraphs, links) and feed it to Claude. Running that headlessly server-side let us enrich page context on demand without opening a second browser or blocking the voice pipeline.
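The fallback decision described above can be sketched in a few lines of Python. This is a simplified illustration, not the project's code: the character threshold, the PageContext type, and the fetch_headless callable (standing in for the Browserbase session) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold: below this many characters we treat the
# accessibility tree as exposing too little to summarize.
MIN_USEFUL_CHARS = 200


@dataclass
class PageContext:
    source: str  # "accessibility" or "headless"
    text: str


def build_page_context(ax_text: Optional[str],
                       fetch_headless: Callable[[], str],
                       min_chars: int = MIN_USEFUL_CHARS) -> PageContext:
    """Prefer accessibility-tree text; fall back to a headless fetch when it is too thin."""
    cleaned = (ax_text or "").strip()
    if len(cleaned) >= min_chars:
        return PageContext(source="accessibility", text=cleaned)
    # fetch_headless is only called on the fallback path, so pages with a
    # usable accessibility tree never pay for a headless session.
    return PageContext(source="headless", text=fetch_headless().strip())
```

Passing the headless fetch in as a callable keeps the expensive path lazy, which matches the idea of enriching page context only on demand.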

Deepgram

We used Deepgram to build a voice-first screen reader where users never rely on a visual UI. When they first open a webpage, Aura automatically reads a page summary built from the DOM information. When they hold a hotkey to ask a follow-up question, Nova-3 transcribes their speech and Aura speaks the answer back as it streams, so they hear responses before generation finishes.

On the technical side, we call both APIs directly from the Swift client using ephemeral keys issued by our backend (/api/deepgram/token), with sentence-level TTS chunking for low latency.
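To show what sentence-level chunking means in practice, here is a minimal sketch written in Python for brevity (the real client is in Swift). It buffers streamed text and emits each sentence as soon as it is complete, so speech can start before the full answer exists. The function name and the simple punctuation rule are assumptions, and abbreviations like "e.g." would be split incorrectly by this rule.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text fragments."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


stream = ["The login ", "button is at ", "the top. ", "Press Tab ", "twice to reach it."]
for sentence in sentence_chunks(stream):
    print(sentence)
```

Each yielded sentence could be handed to text-to-speech right away, which is where the latency win comes from.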

Voice is essential. For blind users, voice and hearing are the primary way of interacting with a computer, so Deepgram was a core part of the app.

Sentry

We added Sentry early so we could catch crashes and slow requests while building. For example, when we shipped faster page summaries, speech started overlapping after frequent clicks. We traced the logs in Sentry and found that it was a timing issue in the voice pipeline, so we fixed the confirmation order and audio cancellation before the demo.
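One common way to avoid this kind of overlap is to tag every utterance with an ID and drop audio that belongs to an older one. The Python sketch below illustrates that pattern only. It is not the fix we shipped, and the class and method names are hypothetical.

```python
from typing import List


class SpeechSession:
    """Tracks which utterance is current so stale audio can be dropped."""

    def __init__(self) -> None:
        self._current = 0
        self.played: List[str] = []

    def start_utterance(self) -> int:
        # Starting a new utterance invalidates every earlier one.
        self._current += 1
        return self._current

    def play(self, utterance_id: int, sentence: str) -> bool:
        # Audio from a superseded utterance is discarded instead of overlapping.
        if utterance_id != self._current:
            return False
        self.played.append(sentence)
        return True


session = SpeechSession()
first = session.start_utterance()
session.play(first, "Summary of page one.")
second = session.start_utterance()       # user clicked again
session.play(first, "Late audio.")       # dropped: belongs to the old utterance
session.play(second, "Summary of page two.")
print(session.played)
```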

Arize

We added Arize Phoenix early so we'd have something to look at when things broke. When voice commands started going down the wrong path, we pulled up the traces and saw read requests getting treated like clicks, while off-page tasks never reached the right handler. The trace data from Phoenix was really useful for this kind of debugging.

Challenges we ran into

Modern websites often don’t give us much through the accessibility tree. Gmail was basically unreadable until we added Browserbase as a fallback.

Speed mattered a lot too. If you switch tabs and wait too long to hear anything, the whole app feels broken. We spent a lot of time cutting the delay from tab change to first spoken word.

Writing for voice turned out to be harder than we expected. Blind users need short, clear answers, not long chatbot paragraphs. We rewrote our prompts so summaries stay around 20 words and answers stay at one sentence, and we had to stop the model from saying things like “above” or “you can see.”
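Rules like these can also be checked after generation. The Python sketch below flags text that runs past the word limit or uses visual phrasing. The 20-word limit and the two phrases come from the paragraph above, while the function name and the idea of a post-generation check are assumptions rather than a description of our pipeline.

```python
import re
from typing import List

MAX_SUMMARY_WORDS = 20
# Only the two phrases named in the writeup; a real list would be longer.
VISUAL_PHRASES = ("above", "you can see")


def voice_issues(text: str, max_words: int = MAX_SUMMARY_WORDS) -> List[str]:
    """Return a list of problems that make text awkward to hear aloud."""
    issues = []
    word_count = len(text.split())
    if word_count > max_words:
        issues.append(f"too long: {word_count} words")
    lowered = text.lower()
    for phrase in VISUAL_PHRASES:
        # Word boundaries keep "above" from matching inside other words.
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            issues.append(f"visual phrase: {phrase}")
    return issues


print(voice_issues("Gmail inbox with 12 unread messages."))
print(voice_issues("You can see the login button above."))
```

A check like this could trigger a retry or a rewrite when the list is not empty.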

Routing voice commands was another headache. “Click sign in” should work on the page, but “email my professor” shouldn’t try to click a random button. On top of that, we hit bugs with Deepgram TTS playback and with making sure speech actually stops when you tap the hotkey.
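To make the routing problem concrete, here is a deliberately simple rule-based sketch in Python. It only clicks when the spoken target matches a label that exists on the page, and sends everything it cannot place to an off-page handler. The three routes, the verb lists, and the exact-match rule are assumptions for illustration; this is not how our router actually works.

```python
import re
from enum import Enum
from typing import Iterable


class Route(Enum):
    READ = "read"
    CLICK = "click"
    OFF_PAGE = "off_page"


# Hypothetical verb lists, kept short for the example.
READ_VERBS = ("read", "what", "where", "summarize")
CLICK_VERBS = ("click", "press", "open", "select")


def route_command(utterance: str, page_labels: Iterable[str]) -> Route:
    # Lowercase and drop punctuation so "Click sign in." still matches.
    words = re.sub(r"[^\w\s]", "", utterance.lower()).split()
    first = words[0] if words else ""
    if first in READ_VERBS:
        return Route.READ
    if first in CLICK_VERBS:
        target = " ".join(words[1:])
        labels = {label.lower() for label in page_labels}
        # Only click when the target is a real element on this page.
        if target in labels:
            return Route.CLICK
    # Anything else is treated as a task that leaves the current page.
    return Route.OFF_PAGE


labels = ["Sign in", "Help"]
print(route_command("Click sign in", labels))
print(route_command("Email my professor", labels))
print(route_command("Where do I log in?", labels))
```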

Accomplishments that we're proud of

We are proud that we built a working prototype that makes web navigation feel easier and more conversational.

We are also proud that this project solves a real accessibility problem. Being able to use a computer affects school, work, privacy, and independence.

What we learned

We learned that accessibility is not just about making information available. It also has to be easy to understand and use.

We also learned that AI can be really useful when it helps simplify complex interfaces instead of making users deal with every small detail.

What's next for BirdBox.ai

Next, we want BirdBox.ai to do more than explain and navigate pages. Visually impaired users should be able to use technology in a fully agentic way, using only their voice.

Our goal is to make BirdBox.ai a smarter, more natural screen reader for the AI era.

Built With

  • arize-phoenix
  • browserbase
  • chrome
  • claude-haiku-(anthropic)
  • deepgram-(nova-3-stt-+-aura-tts)
  • events
  • fastapi
  • macos-accessibility-apis
  • python
  • railway
  • redis
  • sentry
  • server-sent
  • supabase
  • swift