# About Shorka

## Inspiration
My grandma is blind. She isn't bad with computers — she knows exactly what she wants to do. She'll tell me, step by step: "open my email, read the new ones, then reply to the one from my doctor." When I'm there, I do it for her. When I'm not, the computer she paid for sits dark on the desk.
Existing screen readers like NVDA and JAWS read out what's currently focused — a button, a menu item, a paragraph. They don't help her turn intent into action. She still has to know which keyboard shortcut, which submenu, which Tab-press gets her to the unread email. The interface is built for sighted users, with audio bolted on.
We wanted to flip that. Tell the computer what you want, in plain English, and let it figure out how. That's Shorka.
## What it does
Shorka is a voice-only desktop assistant for blind users on Windows. Say "Hey Shorka" to wake it. Then ask for anything — "open Chrome and search for tomato soup recipes", "read me what's on the screen", "switch to the British voice", "undo that" — and Shorka announces what it's doing, does it, and tells you the result. Start speaking mid-sentence and it stops talking immediately so you can interrupt. Say "stop" to put it back to sleep.
## How we built it
The pipeline is end-to-end streaming so the perceived latency stays under two seconds:
$$ T_{\text{response}} = T_{\text{VAD-end}} + T_{\text{STT}} + T_{\text{LLM-TTFB}} + T_{\text{TTS-TTFB}} \lesssim 2.0\,\text{s} $$
Concretely:
- Mic in (16 kHz mono) → silero-vad with confidence threshold $\tau = 0.8$ and silence hangover $\Delta t = 500\,\text{ms}$ to detect utterance boundaries.
- STT: OpenAI `gpt-4o-mini-transcribe` (Groq Whisper as fallback).
- Brain: Claude Sonnet 4.6 streaming + tool-use loop. Tools cover the four MVP capability domains — apps, keyboard, web, screen — plus voice switching, undo, and confirmation gating.
- TTS: ElevenLabs `eleven_flash_v2_5` over WebSocket at $24\,\text{kHz}$ PCM, $\sim 75\,\text{ms}$ TTFB. Text chunks from Claude stream straight into the TTS WS as they arrive.
- Audio out (24 kHz mono) → single `sounddevice` callback that mixes TTS + cue tones with ducking ($0.5\times$ during cues).
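The endpointing in the first step can be sketched as a small state machine over per-frame speech probabilities. This is a hedged reconstruction: the 32 ms frame size and the way silero-vad is consumed are our assumptions; only the $\tau = 0.8$ threshold and the 500 ms hangover come from the text.

```python
# Sketch of VAD endpointing: per-frame speech probabilities (as silero-vad
# produces) become utterance start/end events via a confidence threshold and
# a silence hangover. The probabilities below are made up for illustration.

FRAME_MS = 32          # assumed: one 512-sample silero-vad frame at 16 kHz
TAU = 0.8              # speech-confidence threshold (from the pipeline above)
HANGOVER_MS = 500      # silence required before declaring utterance end

def endpoints(probs, frame_ms=FRAME_MS, tau=TAU, hangover_ms=HANGOVER_MS):
    """Yield ('start', t_ms) / ('end', t_ms) events from speech probabilities."""
    in_speech = False
    silence_ms = 0
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= tau:
            if not in_speech:
                in_speech = True
                yield ("start", t)
            silence_ms = 0          # any speech frame resets the hangover
        elif in_speech:
            silence_ms += frame_ms
            if silence_ms >= hangover_ms:
                in_speech = False
                yield ("end", t)    # utterance boundary -> hand audio to STT

# 20 speech frames, then enough silence to trip the 500 ms hangover
probs = [0.95] * 20 + [0.05] * 20
events = list(endpoints(probs))
```

The hangover is what lets a speaker pause mid-sentence without the utterance being cut off early.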
A Flutter overlay sits in a screen corner showing live state — a glowing orb whose color and animation reflect listening / speaking / awaiting-confirmation / offline — backed by a tiny asyncio HTTP API on 127.0.0.1:8412.
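A local state API like that can be as small as a hand-rolled asyncio server. In this sketch the `/state` endpoint, the JSON shape, and the state names are illustrative guesses; only the 127.0.0.1:8412 address comes from the text.

```python
# Minimal sketch of a local HTTP state API an overlay could poll.
# Endpoint path and state fields are assumptions, not Shorka's real API.
import asyncio
import json

STATE = {"mode": "listening"}  # e.g. listening / speaking / awaiting-confirmation / offline

async def handle(reader, writer):
    await reader.readline()                    # request line, e.g. b"GET /state HTTP/1.1"
    while (await reader.readline()).strip():   # drain headers until the blank line
        pass
    body = json.dumps(STATE).encode()
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n"
        + f"Content-Length: {len(body)}\r\n\r\n".encode()
        + body
    )
    await writer.drain()
    writer.close()

async def demo():
    # Serve on the documented loopback address, then make one self-request.
    server = await asyncio.start_server(handle, "127.0.0.1", 8412)
    reader, writer = await asyncio.open_connection("127.0.0.1", 8412)
    writer.write(b"GET /state HTTP/1.1\r\n\r\n")
    await writer.drain()
    response = await reader.read(-1)           # read until the server closes
    writer.close()
    server.close()
    await server.wait_closed()
    return response

response = asyncio.run(demo())
```

Binding to loopback only means the overlay can read state but nothing off-machine can.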
We built it in seven slices, each ending with something runnable: hello-world voice loop → first tool call → voice switching → barge-in → screen reading → web + safety + undo → polish.
## What we learned
- Streaming everywhere or nowhere. Latency is the sum of every blocking step. Stream the LLM, stream the TTS, never wait for end-of-turn before speaking.
- Acoustic Echo Cancellation isn't the only way. AEC is hard. Running VAD continuously while the assistant is speaking, and cancelling TTS the instant a real human speech-start fires, is simpler and more reliable. The barge-in budget we hit:
$$ T_{\text{interrupt}} \approx T_{\text{VAD}} + T_{\text{flush}} \approx 32\,\text{ms} + 20\,\text{ms} \approx 50\,\text{ms} $$
- The Anthropic Messages API is strict about tool-use ↔ tool-result pairing. Cancel a streaming response mid-tool-use without patching the history, and every future request fails. We built a `_repair_messages()` step that injects stub `tool_result` blocks for orphaned `tool_use` blocks.
- UIAutomation alone isn't enough. Browsers render most of their content to a canvas; UIA returns nothing useful. We added a `see_screen` tool that takes a screenshot and asks GPT-4o-mini to describe it.
- HTTP transports are weirder than they look. Dart's `HttpClient` defaults to chunked transfer encoding for POSTs unless you set `contentLength` explicitly. It took us an hour to figure out why voice switching was silently failing.
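The history-repair step can be sketched like this. The message shapes follow the Anthropic Messages API convention (assistant `tool_use` blocks must be answered by `tool_result` blocks in the next user message); the function body is our reconstruction for illustration, not Shorka's actual `_repair_messages()`.

```python
# Reconstruction of a history-repair pass: every assistant tool_use block
# left unanswered by a cancelled stream gets a stub tool_result injected,
# so the next API request is accepted.

def _stub(tool_use_id):
    # Placeholder result so the API sees the tool_use as answered.
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": "[tool call was cancelled before it ran]"}

def repair_messages(messages):
    out, pending = [], []          # pending: tool_use ids awaiting a result
    for msg in messages:
        if pending:
            blocks = msg["content"] if isinstance(msg["content"], list) else []
            answered = {b.get("tool_use_id") for b in blocks
                        if isinstance(b, dict) and b.get("type") == "tool_result"}
            stubs = [_stub(t) for t in pending if t not in answered]
            if stubs and msg["role"] == "user":
                # Merge stubs into the following user message.
                content = blocks or [{"type": "text", "text": msg["content"]}]
                msg = {**msg, "content": stubs + content}
            elif stubs:
                out.append({"role": "user", "content": stubs})
        pending = []
        out.append(msg)
        if msg["role"] == "assistant" and isinstance(msg["content"], list):
            pending = [b["id"] for b in msg["content"]
                       if isinstance(b, dict) and b.get("type") == "tool_use"]
    if pending:                    # stream cancelled right after the tool_use
        out.append({"role": "user", "content": [_stub(t) for t in pending]})
    return out
```

A well-formed history passes through unchanged, so the pass can run unconditionally before every request.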
## Challenges we ran into
- PortAudio device contention. Opening separate input and output streams from different components races on the device handle and crashes intermittently. Fix: a single `AudioBus` owner that arbitrates record + playback in one duplex callback.
- `pywinauto` tree walks blocking the loop. Walking the UI tree of an open Chrome window with `pywinauto` could take 60 seconds. We switched to the `uiautomation` library directly and capped depth + time per call.
- The ElevenLabs SDK shells out to `mpv` for playback, which we don't want. We wrote a raw WebSocket client that pulls PCM bytes ourselves and feeds the AudioBus.
- Wake-word + barge-in interaction. While narrating, the assistant must still hear "stop" and react in under 100 ms. Solved by always running VAD on the mic, even during TTS — speech-start during playback triggers an immediate flush of the audio buffer (within one $\sim 20\,\text{ms}$ output callback period).
- Confirmation flow. Dangerous tools (close window, delete) need a spoken yes/no with timeout, "wait" extensions, and re-prompting — without blocking the rest of the loop. We used an `asyncio.Future` on the session that the main listener routes the next utterance into.
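The Future-based confirmation gate might look like the sketch below. The `Session` class, the phrase lists, and the timeout value are invented for illustration; only the idea of an `asyncio.Future` on the session that the listener routes the next utterance into comes from the text.

```python
# Sketch of a non-blocking spoken confirmation gate. While the Future on the
# session is armed, the main listener feeds the next utterance into it
# instead of the normal intent pipeline.
import asyncio

class Session:
    def __init__(self):
        self.pending_confirmation = None   # asyncio.Future[str] when armed

async def confirm(session, prompt, speak, timeout=8.0):
    """Speak `prompt`, then wait for yes/no; 'wait' re-arms the timeout."""
    await speak(prompt)
    while True:
        session.pending_confirmation = asyncio.get_running_loop().create_future()
        try:
            answer = await asyncio.wait_for(session.pending_confirmation, timeout)
        except asyncio.TimeoutError:
            await speak("No answer; cancelling.")
            return False
        finally:
            session.pending_confirmation = None
        if answer in ("yes", "do it"):
            return True
        if answer in ("no", "cancel"):
            return False
        if answer == "wait":
            continue                       # extension: re-arm and keep waiting
        await speak("Please say yes or no.")

def route_utterance(session, text):
    """Listener side: divert an utterance into the armed Future, if any."""
    fut = session.pending_confirmation
    if fut is not None and not fut.done():
        fut.set_result(text.strip().lower())
        return True
    return False
```

Because the dangerous tool awaits the Future rather than blocking a thread, the wake-word and barge-in paths keep running underneath the question.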
## What's next
Outlook integration so my grandma can finally read her own email. NVDA bridge for the apps Shorka can't yet see. A friend-and-family mode where she can ask "call my granddaughter" and it just works.
She's the user. Everything else is implementation detail.