# About Shorka

## Inspiration
My grandma is blind. She isn't bad with computers — she knows exactly what she wants to do. She'll tell me, step by step: "open my email, read the new ones, then reply to the one from my doctor." When I'm there, I do it for her. When I'm not, the computer she paid for sits dark on the desk.
Existing screen readers like NVDA and JAWS read out what's currently focused — a button, a menu item, a paragraph. They don't help her turn intent into action. She still has to know which keyboard shortcut, which submenu, which Tab-press gets her to the unread email. The interface is built for sighted users, with audio bolted on.
We wanted to flip that. Tell the computer what you want, in plain English, and let it figure out how. That's Shorka.
## What it does
Shorka is a voice-only desktop assistant for blind users on Windows. Say "Hey Shorka" to wake it. Then ask for anything — "open Chrome and search for tomato soup recipes", "read me what's on the screen", "switch to the British voice", "undo that" — and Shorka announces what it's doing, does it, and tells you the result. Start speaking mid-sentence and it stops talking immediately so you can interrupt. Say "stop" to put it back to sleep.
## How we built it
The pipeline is end-to-end streaming so the perceived latency stays under two seconds:
$$ T_{\text{response}} = T_{\text{VAD-end}} + T_{\text{STT}} + T_{\text{LLM-TTFB}} + T_{\text{TTS-TTFB}} \lesssim 2.0\,\text{s} $$
Concretely:
- Mic in (16 kHz mono) → silero-vad with confidence threshold $\tau = 0.8$ and silence hangover $\Delta t = 500\,\text{ms}$ to detect utterance boundaries.
- STT: OpenAI `gpt-4o-mini-transcribe` (Groq Whisper as fallback).
- Brain: Claude Sonnet 4.6 streaming + tool-use loop. Tools cover the four MVP capability domains — apps, keyboard, web, screen — plus voice switching, undo, and confirmation gating.
- TTS: ElevenLabs `eleven_flash_v2_5` over WebSocket at $24\,\text{kHz}$ PCM, $\sim 75\,\text{ms}$ TTFB. Text chunks from Claude stream straight into the TTS WS as they arrive.
- Audio out (24 kHz mono) → single `sounddevice` callback that mixes TTS + cue tones with ducking ($0.5\times$ during cues).
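The endpointing in the first step can be sketched as a small state machine over per-frame speech probabilities. This is a hedged reconstruction: the 32 ms frame size and the way silero-vad is consumed are our assumptions; only the $\tau = 0.8$ threshold and the 500 ms hangover come from the text.

```python
# Sketch of VAD endpointing: per-frame speech probabilities (as silero-vad
# produces) become utterance start/end events via a confidence threshold and
# a silence hangover. The probabilities below are made up for illustration.

FRAME_MS = 32          # assumed: one 512-sample silero-vad frame at 16 kHz
TAU = 0.8              # speech-confidence threshold (from the pipeline above)
HANGOVER_MS = 500      # silence required before declaring utterance end

def endpoints(probs, frame_ms=FRAME_MS, tau=TAU, hangover_ms=HANGOVER_MS):
    """Yield ('start', t_ms) / ('end', t_ms) events from speech probabilities."""
    in_speech = False
    silence_ms = 0
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= tau:
            if not in_speech:
                in_speech = True
                yield ("start", t)
            silence_ms = 0          # any speech frame resets the hangover
        elif in_speech:
            silence_ms += frame_ms
            if silence_ms >= hangover_ms:
                in_speech = False
                yield ("end", t)    # utterance boundary -> hand audio to STT

# 20 speech frames, then enough silence to trip the 500 ms hangover
probs = [0.95] * 20 + [0.05] * 20
events = list(endpoints(probs))
```

The hangover is what lets a speaker pause mid-sentence without the utterance being cut off early.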
A Flutter overlay sits in a screen corner showing live state — a glowing orb whose color and animation reflect listening / speaking / awaiting-confirmation / offline — backed by a tiny asyncio HTTP API on 127.0.0.1:8412.
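A local state API like that can be as small as a hand-rolled asyncio server. In this sketch the `/state` endpoint, the JSON shape, and the state names are illustrative guesses; only the 127.0.0.1:8412 address comes from the text.

```python
# Minimal sketch of a local HTTP state API an overlay could poll.
# Endpoint path and state fields are assumptions, not Shorka's real API.
import asyncio
import json

STATE = {"mode": "listening"}  # e.g. listening / speaking / awaiting-confirmation / offline

async def handle(reader, writer):
    await reader.readline()                    # request line, e.g. b"GET /state HTTP/1.1"
    while (await reader.readline()).strip():   # drain headers until the blank line
        pass
    body = json.dumps(STATE).encode()
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n"
        + f"Content-Length: {len(body)}\r\n\r\n".encode()
        + body
    )
    await writer.drain()
    writer.close()

async def demo():
    # Serve on the documented loopback address, then make one self-request.
    server = await asyncio.start_server(handle, "127.0.0.1", 8412)
    reader, writer = await asyncio.open_connection("127.0.0.1", 8412)
    writer.write(b"GET /state HTTP/1.1\r\n\r\n")
    await writer.drain()
    response = await reader.read(-1)           # read until the server closes
    writer.close()
    server.close()
    await server.wait_closed()
    return response

response = asyncio.run(demo())
```

Binding to loopback only means the overlay can read state but nothing off-machine can.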
We built it in seven slices, each ending with something runnable: hello-world voice loop → first tool call → voice switching → barge-in → screen reading → web + safety + undo → polish.
## What we learned
- Streaming everywhere or nowhere. Latency is the sum of every blocking step. Stream the LLM, stream the TTS, never wait for end-of-turn before speaking.
- Acoustic Echo Cancellation isn't the only way. AEC is hard. Running VAD continuously while the assistant is speaking, and cancelling TTS the instant a real human speech-start fires, is simpler and more reliable. The barge-in budget we hit:
$$ T_{\text{interrupt}} \approx T_{\text{VAD}} + T_{\text{flush}} \approx 32\,\text{ms} + 20\,\text{ms} \approx 50\,\text{ms} $$
- The Anthropic Messages API is strict about tool-use ↔ tool-result pairing. Cancel a streaming response mid-tool-use without patching the history, and every future request fails. We built a `_repair_messages()` step that injects stub `tool_result` blocks for orphaned `tool_use` blocks.
- UIAutomation alone isn't enough. Browsers render most of their content to a canvas; UIA returns nothing useful. We added a `see_screen` tool that takes a screenshot and asks GPT-4o-mini to describe it.
- HTTP transports are weirder than they look. Dart's `HttpClient` defaults to chunked transfer encoding for POSTs unless you set `contentLength` explicitly. It took us an hour to figure out why voice switching was silently failing.
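The history-repair step can be sketched like this. The message shapes follow the Anthropic Messages API convention (assistant `tool_use` blocks must be answered by `tool_result` blocks in the next user message); the function body is our reconstruction for illustration, not Shorka's actual `_repair_messages()`.

```python
# Reconstruction of a history-repair pass: every assistant tool_use block
# left unanswered by a cancelled stream gets a stub tool_result injected,
# so the next API request is accepted.

def _stub(tool_use_id):
    # Placeholder result so the API sees the tool_use as answered.
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": "[tool call was cancelled before it ran]"}

def repair_messages(messages):
    out, pending = [], []          # pending: tool_use ids awaiting a result
    for msg in messages:
        if pending:
            blocks = msg["content"] if isinstance(msg["content"], list) else []
            answered = {b.get("tool_use_id") for b in blocks
                        if isinstance(b, dict) and b.get("type") == "tool_result"}
            stubs = [_stub(t) for t in pending if t not in answered]
            if stubs and msg["role"] == "user":
                # Merge stubs into the following user message.
                content = blocks or [{"type": "text", "text": msg["content"]}]
                msg = {**msg, "content": stubs + content}
            elif stubs:
                out.append({"role": "user", "content": stubs})
        pending = []
        out.append(msg)
        if msg["role"] == "assistant" and isinstance(msg["content"], list):
            pending = [b["id"] for b in msg["content"]
                       if isinstance(b, dict) and b.get("type") == "tool_use"]
    if pending:                    # stream cancelled right after the tool_use
        out.append({"role": "user", "content": [_stub(t) for t in pending]})
    return out
```

A well-formed history passes through unchanged, so the pass can run unconditionally before every request.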
## Challenges we ran into
- PortAudio device contention. Opening separate input and output streams from different components races on the device handle and crashes intermittently. Fix: a single `AudioBus` owner that arbitrates record + playback in one duplex callback.
- `pywinauto` tree walks blocking the loop. Walking the UI tree of an open Chrome window with `pywinauto` could take 60 seconds. We switched to the `uiautomation` library directly and capped depth + time per call.
- The ElevenLabs SDK shells out to `mpv` for playback, which we don't want. We wrote a raw WebSocket client that pulls PCM bytes ourselves and feeds the AudioBus.
- Wake-word + barge-in interaction. While narrating, the assistant must still hear "stop" and react in under 100 ms. Solved by always running VAD on the mic, even during TTS — speech-start during playback triggers an immediate flush of the audio buffer (within one $\sim 20\,\text{ms}$ output callback period).
- Confirmation flow. Dangerous tools (close window, delete) need a spoken yes/no with timeout, "wait" extensions, and re-prompting — without blocking the rest of the loop. We used an `asyncio.Future` on the session that the main listener routes the next utterance into.
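The Future-based confirmation gate might look like the sketch below. The `Session` class, the phrase lists, and the timeout value are invented for illustration; only the idea of an `asyncio.Future` on the session that the listener routes the next utterance into comes from the text.

```python
# Sketch of a non-blocking spoken confirmation gate. While the Future on the
# session is armed, the main listener feeds the next utterance into it
# instead of the normal intent pipeline.
import asyncio

class Session:
    def __init__(self):
        self.pending_confirmation = None   # asyncio.Future[str] when armed

async def confirm(session, prompt, speak, timeout=8.0):
    """Speak `prompt`, then wait for yes/no; 'wait' re-arms the timeout."""
    await speak(prompt)
    while True:
        session.pending_confirmation = asyncio.get_running_loop().create_future()
        try:
            answer = await asyncio.wait_for(session.pending_confirmation, timeout)
        except asyncio.TimeoutError:
            await speak("No answer; cancelling.")
            return False
        finally:
            session.pending_confirmation = None
        if answer in ("yes", "do it"):
            return True
        if answer in ("no", "cancel"):
            return False
        if answer == "wait":
            continue                       # extension: re-arm and keep waiting
        await speak("Please say yes or no.")

def route_utterance(session, text):
    """Listener side: divert an utterance into the armed Future, if any."""
    fut = session.pending_confirmation
    if fut is not None and not fut.done():
        fut.set_result(text.strip().lower())
        return True
    return False
```

Because the dangerous tool awaits the Future rather than blocking a thread, the wake-word and barge-in paths keep running underneath the question.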
## What's next
Outlook integration so my grandma can finally read her own email. NVDA bridge for the apps Shorka can't yet see. A friend-and-family mode where she can ask "call my granddaughter" and it just works.
She's the user. Everything else is implementation detail.