Inspiration
We kept running into the same frustration: every AI assistant we used was good at talking and bad at doing. You could ask ChatGPT to draft an email, but you still had to copy it, open Gmail, paste it, find the recipient, and hit send. The AI handled maybe 20% of the work and handed the rest back to you.

We also spend a lot of time context-switching. Slack is open, Gmail is open, GitHub is open, iMessage is open, and Spotify is running in the background. None of them knows about the others. You are the glue holding your own digital life together, and that is exhausting.

The question we started with was simple: what if you could just talk to your computer the way you would talk to a really competent person, and it would actually go handle things? Not suggest things. Not draft things for you to review in a separate tab. Actually handle them. That became BUDDY.
What it does
BUDDY is an agentic AI operator that runs on your Mac. You talk to it by voice or text, it reasons through your request using Claude or a local LLM, and it takes real actions across your real apps and services.

You can say things like "text Jake I'm running late," and BUDDY will search your actual Contacts, find Jake, draft the iMessage, read it back to you for confirmation, and send it. You can say "play something chill on Spotify," and it interprets the mood and starts playback. You can say "remember I prefer morning meetings" and it stores that fact and uses it automatically in future conversations.

BUDDY speaks back using ElevenLabs TTS so every response sounds natural rather than robotic, and it listens using Whisper for real-time speech-to-text. It can run fully in the cloud using the Anthropic Claude API, or switch to a local Qwen 2.5 32B model via Ollama for full offline and private operation, with one flag change.

Current integrations include Gmail, iMessage, Slack, Discord, GitHub, Spotify, Notion, Jira, Twitter, Google Calendar, Reminders, Notes, Chrome, web search, files, shell, and travel (Google Flights and Apple Maps).
How we built it
The core of BUDDY is a Python async agent loop. We use asyncio and threading to keep voice capture, chat input, LLM calls, and tool execution running concurrently without blocking each other.

The LLM layer supports two backends. When USE_CLAUDE is True, BUDDY connects to the Anthropic API using Claude Sonnet and passes a full JSON tool schema on every call. Claude selects which tools to invoke, the ToolDispatcher routes those calls to the right Python functions, and the results are returned as tool_result messages for Claude to reason over. When USE_CLAUDE is False, the same schema goes to a local Ollama instance running Qwen 2.5 32B. The abstraction is clean enough that swapping backends requires no code changes beyond the flag.

For voice input, we use OpenAI Whisper running locally for real-time transcription via PyAudio. For voice output, we use the ElevenLabs Python SDK, specifically the eleven_turbo_v2_5 model with custom voice settings tuned for conversational warmth. Audio is written to a temp file and played through afplay in a daemon thread, so BUDDY never blocks on speech. If ELEVENLABS_API_KEY is not set, BUDDY falls back to macOS's built-in speech automatically.

Memory is a lightweight JSON store at ~/.buddy/memory.json. The MemoryStore class handles reads, writes, and auto-extraction of facts from natural language using regex patterns. Every call to the LLM includes the current memory contents injected into the system prompt.

Each integration is its own Python module, with OAuth credentials loaded from environment variables. Native macOS apps like iMessage, Contacts, Calendar, Reminders, and Notes are automated via subprocess and AppleScript. We also wrote a contacts fuzzy-matching layer, because Claude should never guess a phone number; it should always look one up first. The sketches below show roughly how the backend routing, the memory store, and the contacts flow fit together.
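To make the backend swap concrete, here is a minimal sketch of how a single USE_CLAUDE flag can route the same request to either the Anthropic API or a local Ollama server and normalize the result. The helper names, model ids, and response shape below are illustrative, not BUDDY's actual code.

```python
# Minimal sketch of flag-based backend routing (illustrative, not the real module).
import os
import requests
from anthropic import Anthropic

USE_CLAUDE = True                              # the one flag that selects the backend
CLAUDE_MODEL = "claude-3-5-sonnet-latest"      # placeholder: any tool-capable Claude model
OLLAMA_URL = "http://localhost:11434/api/chat"
OLLAMA_MODEL = "qwen2.5:32b"


def call_llm(messages, system_prompt, tools):
    """Return (text, tool_calls) so the agent loop never cares which backend answered."""
    if USE_CLAUDE:
        client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        resp = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=1024,
            system=system_prompt,
            messages=messages,
            tools=tools,                       # full JSON tool schema on every call
        )
        text = "".join(b.text for b in resp.content if b.type == "text")
        calls = [{"id": b.id, "name": b.name, "input": b.input}
                 for b in resp.content if b.type == "tool_use"]
        return text, calls

    # Local path: same tool definitions, re-wrapped into Ollama's OpenAI-style format.
    ollama_tools = [{"type": "function",
                     "function": {"name": t["name"],
                                  "description": t.get("description", ""),
                                  "parameters": t["input_schema"]}}
                    for t in tools]
    resp = requests.post(OLLAMA_URL, json={
        "model": OLLAMA_MODEL,
        "messages": [{"role": "system", "content": system_prompt}, *messages],
        "tools": ollama_tools,
        "stream": False,
    }).json()
    msg = resp["message"]
    calls = [{"name": c["function"]["name"], "input": c["function"]["arguments"]}
             for c in msg.get("tool_calls", [])]
    return msg.get("content", ""), calls
```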
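The memory layer is small enough to sketch almost in full. The path comes from the description above; the method names and regex patterns here are illustrative, not our exact implementation.

```python
# Simplified sketch of a JSON-backed memory store; method names and the
# regex patterns are illustrative, the path is the one described above.
import json
import re
from pathlib import Path

MEMORY_PATH = Path.home() / ".buddy" / "memory.json"


class MemoryStore:
    def __init__(self, path=MEMORY_PATH):
        self.path = path
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else []

    def remember(self, fact):
        if fact not in self.facts:
            self.facts.append(fact)
            self.path.write_text(json.dumps(self.facts, indent=2))

    def auto_extract(self, utterance):
        # Catch explicit "remember ..." and "I prefer ..." style statements.
        for pattern in (r"\bremember (?:that )?(.+)", r"\bI prefer (.+)"):
            m = re.search(pattern, utterance, re.IGNORECASE)
            if m:
                self.remember(m.group(1).strip().rstrip("."))

    def system_prompt_block(self):
        # Injected into the system prompt on every LLM call.
        if not self.facts:
            return ""
        return "Known facts about the user:\n" + "\n".join(f"- {f}" for f in self.facts)
```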
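And here is roughly what the look-it-up-then-send flow looks like, using Python's difflib as a stand-in for our fuzzy-matching layer and the standard Messages AppleScript pattern (the exact AppleScript syntax varies a bit between macOS versions). Treat the names and the script as a sketch, not the actual integration module.

```python
# Sketch of the contact-lookup-then-send flow; difflib stands in for the real
# fuzzy matcher, and the AppleScript is the common Messages automation pattern.
import difflib
import subprocess


def search_contacts(query, contacts):
    """contacts: display name -> phone number, already pulled from Contacts."""
    hits = difflib.get_close_matches(query, contacts.keys(), n=1, cutoff=0.6)
    return (hits[0], contacts[hits[0]]) if hits else None


def send_imessage(phone, body):
    # Real code must escape any quotes in `body` before interpolating it here.
    script = f'''
    tell application "Messages"
        set targetService to 1st account whose service type = iMessage
        set targetBuddy to participant "{phone}" of targetService
        send "{body}" to targetBuddy
    end tell
    '''
    subprocess.run(["osascript", "-e", script], check=True)


# "text Jake I'm running late" -> look up first, confirm, then send:
# hit = search_contacts("Jake", contacts)
# if hit:
#     name, phone = hit
#     send_imessage(phone, "I'm running late")
```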
Challenges we ran into
Getting the voice pipeline truly bidirectional and non-blocking was harder than expected. The naive approach was to play audio and wait for it to finish before listening again, but that made every interaction feel stilted. We had to move audio playback into a daemon thread with a lock and implement interrupt logic so a new command can cut off the current response (sketched at the end of this section).

Tool reliability was a constant battle. Claude is very good at picking the right tool, but handling the long tail of edge cases required significant prompt engineering. The most important rule in the system prompt turned out to be "ALWAYS call search_contacts before sending any message or email. Never guess a contact." Without that constraint, early versions would confidently make up phone numbers.

The dual-LLM abstraction was conceptually simple but took real work to get right in practice. Ollama and the Anthropic API have different streaming formats, different error conditions, and slightly different behavior around tool use. We had to normalize the response handling so the rest of the agent loop does not need to know which backend it is talking to.

macOS permissions were a recurring headache. Contacts, Calendar, iMessage, Microphone, and Full Disk Access all require separate user approval, and macOS will silently refuse access rather than throwing a useful error. We had to build detection logic and add helpful error messages for each case.

ElevenLabs voice selection across different account tiers was also tricky. Free accounts have a different set of available voices than paid accounts, so we built a candidate list of known free-tier voice IDs and fall through them in order until one resolves successfully for the current account.
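The interruptible playback ended up roughly like the sketch below: the afplay process lives in a daemon thread behind a lock, and a new command simply terminates it. The class and method names are illustrative, and the audio bytes are assumed to have already come back from ElevenLabs (or the macOS fallback).

```python
# Illustrative sketch of non-blocking, interruptible playback (not the real class).
import subprocess
import tempfile
import threading


class Speaker:
    def __init__(self):
        self._lock = threading.Lock()
        self._proc = None

    def speak(self, audio_bytes):
        self.interrupt()                       # a new command cuts off the old response
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
            f.write(audio_bytes)
            path = f.name

        def _play():
            with self._lock:
                proc = subprocess.Popen(["afplay", path])
                self._proc = proc
            proc.wait()                        # waits in the daemon thread, not the agent loop

        threading.Thread(target=_play, daemon=True).start()

    def interrupt(self):
        with self._lock:
            if self._proc and self._proc.poll() is None:
                self._proc.terminate()
```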
Accomplishments that we're proud of
The dual-model architecture working seamlessly is our biggest technical win. Flipping one boolean and restarting gives you a completely local, offline, private AI operator that behaves identically to the cloud version. That is a real design achievement and not something most AI projects bother with.

The ElevenLabs voice pipeline genuinely sounds good. A lot of voice AI demos sound functional, but nobody would actually want to listen to them all day. BUDDY sounds like something you would want to keep talking to, which changes the experience entirely.

We got the contacts lookup and iMessage sending working reliably with real fuzzy matching. This sounds simple, but it is one of the hardest parts because it involves macOS permissions, AppleScript automation, and a contacts search layer that can handle nicknames, partial names, and ambiguous results gracefully.

The memory system, while simple technically, works exactly how you want it to. You tell BUDDY something once, and it remembers it forever. That persistent context makes conversations feel fundamentally different from starting from scratch every time.

We also wrote a clean install script that walks through all the environment setup, which means we can actually hand this to someone else and have them run it.
What we learned
Agentic AI is mostly a reliability problem, not a capability problem. The models are smart enough. The challenge is building the scaffolding around them so that tool selection is consistent, errors surface clearly, permissions are handled gracefully, and the user always knows what BUDDY is doing and why. The interesting engineering is all in the infrastructure.

Prompt design has a bigger effect on agent behavior than we expected. Moving from a generic "you are a helpful assistant" system prompt to one with specific rules around contact lookup, confirmation before irreversible actions, and natural spoken language reduced failure modes dramatically. Small wording changes produced large behavioral shifts.

Async Python requires a lot of discipline. It is easy to introduce subtle bugs where a blocking call in the wrong thread stalls the whole agent loop. We learned to be very deliberate about which operations run in which threads and where the locks live.

Voice interaction design is its own discipline. What works in text often falls flat in speech. Responses that are concise and use natural contractions feel much better than verbose or formal replies. The system prompt rule to never use bullet points in spoken responses was more important than we initially gave it credit for.
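To make the prompt-design point concrete: the difference is a handful of explicit rules rather than anything exotic. The excerpt below is illustrative; the contact-lookup, confirmation, and no-bullet-points rules are the real ones, but the exact wording is paraphrased.

```python
# Illustrative excerpt of the rule-heavy system prompt (wording paraphrased).
SYSTEM_PROMPT = """
You are BUDDY, a voice-first operator on the user's Mac.

Rules:
- ALWAYS call search_contacts before sending any message or email. Never guess a contact.
- Confirm with the user before any irreversible action (sending, deleting, posting).
- Keep spoken replies short and conversational; use contractions; never use bullet points.
- If a tool fails, say plainly what failed, so the user always knows what you are doing and why.
"""
```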
What's next for Buddy
The most important near-term addition is a proper UI. Right now, BUDDY is a terminal application, which limits who can actually use it. We want a minimal menubar app that shows what BUDDY is hearing and doing without getting in the way.

We also want to make memory smarter. The current system stores explicit facts that you tell it directly. The next version should also extract implicit context from conversations, notice patterns over time, and surface relevant memories proactively rather than waiting for them to be injected via the system prompt.

Multi-step planning is another priority. Right now, BUDDY handles individual tasks well but does not have a good model of longer-running goals. We want to add the ability to say "every morning at 9, check my email and tell me what needs attention" and have BUDDY schedule and execute that reliably.

Cross-device support is on the roadmap. The macOS-specific integrations are valuable but limiting. A version that works on Linux and Windows, or that syncs memory and preferences across machines, would make BUDDY much more useful. We also want to open-source the tool integration framework so others can contribute new services. Adding a new integration right now is clean and well-defined, and we think there is a real community opportunity around shared tool modules.