OpenClaw2026_Ryann_Pip
Pip — Local macOS Screen-Aware AI Pet Agent
Ryann
OpenClaw Agenthon 2026
Inspiration
AI agents are powerful, but for most people they still feel like developer tools: API keys, cloud setup, and answers trapped in a chat window. When you are already stressed inside a complex app—like Final Cut Pro with dozens of panels—you do not want another tab of text; you want something that sees what you see, shows you where to go, and talks you through the next step.
I built Pip to be that companion: a small pet agent that is easy to run locally, friendly to use, and grounded in the screen you are actually working on.
What it does
Pip is an AI pet agent that you talk to with Control + Option push-to-talk.
- Listens with Apple Speech and speaks back with macOS text-to-speech
- Captures your screen with ScreenCaptureKit when you trigger a task
- Reasons locally with Ollama (default: gemma3) over what is on screen
- Points at UI by flying Pip’s paw to on-screen targets parsed from [POINT:x,y:label:screenN] tags (see the parsing sketch after this list)
- Helps with tasks via local tools: open apps/sites, browser searches, reminders/events/notes, guarded Desktop cleanup, and research-to-PDF export
- Shows its work in the menu bar panel with an agent step log
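As a concrete example of the pointing flow, here is a minimal sketch of pulling [POINT:x,y:label:screenN] tags out of a model reply. The PointTarget struct and parsePointTags function are illustrative names, not Pip's actual API.

```swift
import Foundation

// Minimal sketch of parsing [POINT:x,y:label:screenN] tags from a model reply.
// PointTarget and parsePointTags are illustrative names, not Pip's actual types.
struct PointTarget {
    let x: Double
    let y: Double
    let label: String
    let screenIndex: Int
}

func parsePointTags(in reply: String) -> [PointTarget] {
    // Example input: "Click here [POINT:812,431:Export button:screen1]"
    let pattern = #"\[POINT:(\d+),(\d+):([^:\]]+):screen(\d+)\]"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(reply.startIndex..., in: reply)
    return regex.matches(in: reply, range: fullRange).compactMap { match in
        guard
            let xRange = Range(match.range(at: 1), in: reply),
            let yRange = Range(match.range(at: 2), in: reply),
            let labelRange = Range(match.range(at: 3), in: reply),
            let screenRange = Range(match.range(at: 4), in: reply),
            let x = Double(reply[xRange]),
            let y = Double(reply[yRange]),
            let screen = Int(reply[screenRange])
        else { return nil }
        return PointTarget(x: x, y: y,
                           label: String(reply[labelRange]),
                           screenIndex: screen)
    }
}
```

Targets parsed this way are what the overlay animates the paw toward.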
Default build: no paid API keys — Apple Speech, Ollama, and macOS speech synthesis are enough for demos and everyday use.
Example flows:
- “Inspect my screen and tell me what to do next”
- “Point at the export button”
- “Open YouTube and search rockets”
- “Organize my Desktop” with confirmation
- “Research reusable rockets and make me a PDF”
How we built it
- App shell: SwiftUI + AppKit, menu-bar-only (LSUIElement), custom NSPanel for the control surface and a full-screen transparent overlay for Pip
- Voice pipeline: AVAudioEngine + Apple Speech; global push-to-talk via a listen-only CGEvent tap for reliable Ctrl + Option detection (see the tap sketch after this list)
- Brain: OllamaAgentClient streaming /api/chat with multi-monitor screenshots; optional llama3.2-vision for stronger visual reasoning (a request sketch follows at the end of this section)
- Agent loop: CompanionManager orchestrates dictation → screen capture → optional tools → confirmations for risky actions → Ollama → point-tag parsing → overlay animation → TTS
- On-screen teaching: OverlayWindow maps coordinates across monitors and animates Pip along bezier arcs; response bubble + waveform next to the pet
- Operator mode: UIAutomationExecutor + BrowserTaskExecutor for visible browser/desktop actions, with URL fallback when UI automation fails
- Optional sidecar: local Playwright server in agent-sidecar/ for heavier browser automation
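For the push-to-talk piece specifically, a listen-only flags-changed tap is enough to notice when the Ctrl + Option chord is pressed and released. This is a minimal sketch; the class name, callback wiring, and threading are illustrative rather than Pip's actual code, and the tap is still subject to the permission handling described in the challenges below.

```swift
import AppKit

// Minimal sketch of a listen-only event tap that reports whether the
// Ctrl + Option chord is currently held, for push-to-talk.
final class PushToTalkMonitor {
    /// Called with true when the chord is pressed and false when it is released.
    var onChordChanged: ((Bool) -> Void)?
    private var tap: CFMachPort?

    func start() {
        let mask = CGEventMask(1 << CGEventType.flagsChanged.rawValue)
        let callback: CGEventTapCallBack = { _, _, event, userInfo in
            let monitor = Unmanaged<PushToTalkMonitor>
                .fromOpaque(userInfo!)
                .takeUnretainedValue()
            let flags = event.flags
            let chordHeld = flags.contains(.maskControl) && flags.contains(.maskAlternate)
            DispatchQueue.main.async { monitor.onChordChanged?(chordHeld) }
            return Unmanaged.passUnretained(event)   // listen-only: never consumes the event
        }
        tap = CGEvent.tapCreate(
            tap: .cgSessionEventTap,
            place: .headInsertEventTap,
            options: .listenOnly,
            eventsOfInterest: mask,
            callback: callback,
            userInfo: Unmanaged.passUnretained(self).toOpaque()
        )
        guard let tap else { return }
        let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
        CFRunLoopAddSource(CFRunLoopGetMain(), source, .commonModes)
        CGEvent.tapEnable(tap: tap, enable: true)
    }
}
```

When the chord state flips to true the app can start dictation, and when it flips back to false it can stop and hand the transcript to the agent loop.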
Architecture is MVVM on @MainActor, with a visible tool-step log and agent cards for follow-up tasks.
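To make the Brain step concrete, here is a minimal sketch of streaming a screen-aware question through Ollama's local /api/chat endpoint with URLSession, assuming Ollama is running on its default port. The payload shape follows Ollama's public API; the ChatChunk type and streamScreenQuestion function are illustrative and far simpler than the app's OllamaAgentClient.

```swift
import Foundation

// Minimal sketch of streaming a screen-aware prompt through Ollama's /api/chat.
struct ChatChunk: Decodable {
    struct Message: Decodable { let content: String }
    let message: Message?
    let done: Bool
}

func streamScreenQuestion(prompt: String, screenshotBase64: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/chat")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = [
        "model": "gemma3",
        "stream": true,
        "messages": [[
            "role": "user",
            "content": prompt,
            "images": [screenshotBase64]   // base64-encoded screenshot(s)
        ]]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    var reply = ""
    let (bytes, _) = try await URLSession.shared.bytes(for: request)
    for try await line in bytes.lines {                  // one JSON object per line
        guard !line.isEmpty else { continue }
        let chunk = try JSONDecoder().decode(ChatChunk.self, from: Data(line.utf8))
        reply += chunk.message?.content ?? ""
        if chunk.done { break }
    }
    return reply
}
```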
Challenges we ran into
- macOS permissions (TCC): Screen Recording, Accessibility, and microphone access had to be requested and tested carefully; we avoided terminal xcodebuild so permission state stays reliable during demos
- Multi-monitor pointing: Mapping model coordinates to the correct display and animating Pip smoothly across screens took extra geometry and state handling (see the mapping sketch after this list)
- Browser automation vs. reality: Sites change layouts constantly; we combined Accessibility-first interaction with safe URL fallbacks when typing/clicking does not land
- Trust for risky actions: File moves and PDF exports needed clear preview/confirm steps so Pip feels helpful, not reckless
- Local-only constraints: Staying off paid APIs meant tuning prompts and models (gemma3 vs. llama3.2-vision) for good-enough screen understanding without cloud vision APIs
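For the multi-monitor pointing challenge above, the core of the fix is a small coordinate transform: the model reports points with a top-left origin inside one screen's capture, while AppKit overlay windows live in a global, bottom-left-origin space. A minimal sketch, assuming the model's coordinates are already in points for the indexed screen (display scaling is omitted):

```swift
import AppKit

// Minimal sketch: convert a model-reported point (top-left origin, within one
// screen's capture) into the global bottom-left-origin space AppKit windows use.
// Assumes model coordinates are already in points; backing-scale handling is omitted.
func globalPoint(x: CGFloat, y: CGFloat, screenIndex: Int) -> CGPoint? {
    let screens = NSScreen.screens
    guard screens.indices.contains(screenIndex) else { return nil }
    let frame = screens[screenIndex].frame   // global AppKit coordinates

    // Flip the y-axis within the screen, then offset by the screen's position.
    return CGPoint(x: frame.origin.x + x,
                   y: frame.origin.y + (frame.height - y))
}
```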
Accomplishments that we’re proud of
- A fully local, key-free default stack judges and users can run with Ollama + Apple Speech
- A visible agent — not just chat — with paw pointing, spoken replies, and a step log
- End-to-end flows: voice in → screen context → tools → point → speak out
- Guarded Desktop cleanup and research PDF workflows with human confirmation
- A polished menu bar + notch HUD + pet overlay experience that feels like a product, not a script
What we learned
- On-screen guidance beats paragraphs when users are overwhelmed by feature-heavy apps
- Setup friction is the real barrier to agent adoption; local defaults matter as much as model quality
- Transparency builds trust: logging steps and confirming risky actions makes autonomous behavior feel acceptable
- macOS is a great agent surface if you embrace menu bar UX, non-activating overlays, and system speech APIs
- Small, bounded operator loops (try UI automation → observe → fall back) are more reliable than pretending one-shot prompts always work
What’s next for OpenClaw2026_Ryann_Pip
- Deeper in-app tutoring for creative tools such as CapCut, Figma, and Xcode, with richer step-by-step “follow the paw” lessons
- Stronger UI grounding with better vision models and element detection beyond coordinate tags
- Smarter task memory so Pip remembers context across a session and multi-step projects
- Plugin / skill system so communities can add tools without forking the app
- Broader operator coverage via the Playwright sidecar and safer action policies
- Accessibility focus: voice-first guidance for users who want hands-free help navigating dense UIs
Goal: Pip becomes the Mac copilot anyone can run in one minute — and trust when the UI gets hard.
Built With
- accessibility-api
- appkit
- apple-speech-framework
- avfoundation
- javascript
- macos
- node.js
- nsspeechsynthesizer
- ollama
- playwright
- screencapturekit
- swift
- swiftui
- ui-automation
- xcode