Inspiration
The purpose of Koda is to let you use your computer from anywhere via a phone call. Imagine being an on-call software engineer who can enjoy time with family and still reach the codebase with nothing but a phone. Koda can also be used to access and send files from your computer while you're out and about.
What it does
Koda turns a standard phone call into a direct command line for your macOS desktop. There's no app to install and no complex web interface; it's just a phone number backed by Deepgram's ultra-low-latency speech engine. You tell it what to do, like "Open VS Code" or "Find the invoice in Downloads", and Deepgram converts your voice to text faster than you could type the command. Koda then translates your speech into actual mouse movements and keystrokes in real time, creating a hands-free, eyes-free bridge to your digital workspace.
How we built it
It took a layered tech stack to make this work:
The Ears (Vapi.ai & Deepgram): We used Vapi.ai to handle the telephony and Deepgram’s Nova-3 model to transcribe speech instantly. This setup allows for "barge-in," so you can interrupt the AI just like a real person.
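For the curious, the telephony side is mostly configuration. Here's a rough sketch of creating a Vapi assistant that transcribes with Nova-3; the field names follow Vapi's REST API as we used it, but treat them as illustrative and check the current docs.

```python
# Hedged sketch: create a Vapi assistant backed by Deepgram Nova-3.
# Field names are illustrative; verify against Vapi's current API docs.
import os
import requests

resp = requests.post(
    "https://api.vapi.ai/assistant",
    headers={"Authorization": f"Bearer {os.environ['VAPI_API_KEY']}"},
    json={
        "name": "Koda",
        "transcriber": {
            "provider": "deepgram",
            "model": "nova-3",  # the low-latency speech-to-text model
        },
        # Forward call events to the local gateway exposed via ngrok
        "serverUrl": "https://<your-tunnel>.ngrok.app/vapi/webhook",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["id"])  # assistant id, attached to a phone number in Vapi
```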
The Brain (Claude 4.5 Sonnet): We pipe the text to Claude, which acts as the decision-maker. It’s one of the few models smart enough to understand UI navigation without hallucinating buttons that don't exist.
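A minimal sketch of the brain, assuming the official anthropic Python SDK; the JSON action schema here is our own invention for illustration, not an Anthropic or Agent-S standard:

```python
# Hedged sketch: ask Claude to turn a transcript into ONE structured
# GUI action. The action schema is ours, invented for this example.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You control a macOS desktop. Reply with exactly one JSON object: "
    '{"action": "click" | "type" | "hotkey" | "open_app", '
    '"target": "<ui element or app name>", "text": "<keys to type, if any>"}'
)

def decide(transcript: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # model alias current as of writing
        max_tokens=200,
        system=SYSTEM,
        messages=[{"role": "user", "content": transcript}],
    )
    return json.loads(msg.content[0].text)

# decide("open vs code")
# -> {"action": "open_app", "target": "Visual Studio Code", "text": ""}
```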
The Hands (Agent-S): This is the Python engine running locally that actually "grabs" the OS accessibility layer to move the mouse and type.
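Agent-S has its own grounding and execution stack, so the following is only a hedged sketch of the kind of primitives it ultimately drives, illustrated with pyautogui; on macOS you'll be prompted to grant Accessibility permissions first:

```python
# Illustrative execution primitives (NOT Agent-S internals), via pyautogui.
# macOS requires Accessibility permission for synthetic input like this.
import time
import pyautogui

def run_action(action: dict) -> None:
    kind = action["action"]
    if kind == "open_app":
        # Spotlight: Cmd+Space, type the app name, hit Enter
        pyautogui.hotkey("command", "space")
        time.sleep(0.4)  # give Spotlight a beat to appear
        pyautogui.write(action["target"], interval=0.03)
        pyautogui.press("enter")
    elif kind == "click":
        x, y = action["x"], action["y"]       # coordinates from a grounding step
        pyautogui.moveTo(x, y, duration=0.2)  # visible, human-ish motion
        pyautogui.click()
    elif kind == "type":
        pyautogui.write(action["text"], interval=0.02)
```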
The Plumbing (FastAPI & ngrok): To get the cloud to talk to a laptop sitting behind a home firewall, we built a local server and tunneled it through ngrok.
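Put together, the gateway is a small FastAPI app; the route name and payload shape below are our own, and `decide()`/`run_action()` are the sketches from the sections above:

```python
# Hedged sketch of the local gateway. Expose it with `ngrok http 8000`;
# the cloud then POSTs transcripts to the tunnel URL.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/vapi/webhook")
async def handle_transcript(request: Request):
    payload = await request.json()
    transcript = payload.get("transcript", "")  # payload shape is illustrative
    action = decide(transcript)   # the "brain" sketch above
    run_action(action)            # the "hands" sketch above
    return {"status": "ok", "action": action}

# Run locally:  uvicorn gateway:app --port 8000
# Then tunnel:  ngrok http 8000
```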
Challenges we ran into
Our own computers tried to stop us.
McAfee treated our tunneling software (ngrok) as if it were malware, quarantining it about five times before we finally managed to exclude it from the firewall.
We learned that a 3-second delay on a phone call feels like forever. We relentlessly optimized our system prompts and server logic to shave off milliseconds.
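In case it helps anyone else, here's a minimal sketch of how we hunted for those milliseconds: time each stage against a budget so you know whether the model, the server logic, or the tunnel is the bottleneck (stage names and budgets are our own):

```python
# Hedged sketch: per-stage latency timing with budgets.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, budget_ms: float):
    start = time.perf_counter()
    yield
    elapsed = (time.perf_counter() - start) * 1000
    flag = "  <-- over budget!" if elapsed > budget_ms else ""
    print(f"{stage}: {elapsed:.0f} ms{flag}")

# Usage inside the webhook handler, for example:
# with timed("llm", budget_ms=800):
#     action = decide(transcript)
# with timed("execute", budget_ms=300):
#     run_action(action)
```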
Furthermore, debugging a voice agent is weird. We spent a lot of time yelling at our phones while staring blankly at a monitor, hoping something would move. Eventually, we got the voice-assistant gateway server talking to our MCP server.
Accomplishments that we're proud of
The first time we called the number and watched the mouse move on its own was awesome. It felt like the computer already knew what to do.
We managed to get Agent-S (which is pretty cutting-edge and experimental) to play nice with a stable telephony API.
We also built a system that doesn't sound like a robot reading a script; thanks to Deepgram, it genuinely feels like there's a helper on the other end of the line.
What we learned
Transcribing speech is easy, but actually capturing intent and mapping it to a GUI action is incredibly difficult.
Every app is built differently, so teaching an AI to navigate a messy desktop is a crash course in spatial reasoning; a taste of the per-app quirks is sketched below.
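To make that concrete, the same spoken intent often maps to a different gesture in every app. Here's a hedged sketch of the kind of per-app bindings table this pushes you toward (app names and bindings are purely illustrative):

```python
# Illustrative only: one intent, different gestures per app.
INTENT_BINDINGS = {
    ("close_tab", "Google Chrome"): {"action": "hotkey", "keys": ["command", "w"]},
    ("close_tab", "Terminal"):      {"action": "type",   "text": "exit\n"},
    ("save",      "TextEdit"):      {"action": "hotkey", "keys": ["command", "s"]},
    # ...and some apps expose no shortcut at all, forcing a located click:
    ("save",      "LegacyApp"):     {"action": "click",  "target": "Save button"},
}
```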
Opening a tunnel to your own computer is scary, so learning how to lock down our endpoints was an important lesson in cybersecurity, and one we'll keep digging into.
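As a baseline, the simplest lockdown is rejecting any request that doesn't carry a shared secret. A minimal sketch, assuming a header name and environment variable of our own choosing:

```python
# Hedged sketch: shared-secret check on the webhook. The header name and
# KODA_WEBHOOK_SECRET env var are our own conventions, not Vapi's.
import hmac
import os

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
SECRET = os.environ["KODA_WEBHOOK_SECRET"]

@app.post("/vapi/webhook")
async def webhook(x_koda_secret: str = Header(default="")):
    # constant-time comparison so timing doesn't leak the secret
    if not hmac.compare_digest(x_koda_secret, SECRET):
        raise HTTPException(status_code=401)
    return {"status": "ok"}
```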
What's next for Koda: Agentic Remote Desktop
- Going cross-platform: right now we're Mac-focused, but we want to fully implement the Windows drivers so Koda is OS-agnostic.
- Adding voice biometrics so Koda only obeys its owner's voice, preventing unauthorized access if someone else calls the number.
- Teaching Koda to handle multi-step workflows "blindly", like "Find the cheapest flight to NYC and email it to me", without needing to confirm every single click.
Built With
- claude
- deepgram
- fastapi
- mcp
- ngrok
- python
- vapi