OpenClaw2026_Ryann_Pip
Pip — Local macOS Screen-Aware AI Pet Agent
Ryann
OpenClaw Agenthon 2026
Inspiration
AI agents are powerful, but for most people they still feel like developer tools: API keys, cloud setup, and answers trapped in a chat window. When you are already stressed inside a complex app—like Final Cut Pro with dozens of panels—you do not want another tab of text; you want something that sees what you see, shows you where to go, and talks you through the next step.
I built Pip to be that companion: a small pet agent that is easy to run locally, friendly to use, and grounded in the screen you are actually working on.
What it does
Pip is an AI pet agent that you talk to with Control + Option push-to-talk.
- Listens with Apple Speech and speaks back with macOS text-to-speech
- Captures your screen with ScreenCaptureKit when you trigger a task
- Reasons locally with Ollama (default: gemma3) over what is on screen
- Points at UI by flying Pip’s paw to on-screen targets parsed from [POINT:x,y:label:screenN] tags (see the parsing sketch after this list)
- Helps with tasks via local tools: open apps/sites, browser searches, reminders/events/notes, guarded Desktop cleanup, and research-to-PDF export
- Shows its work in the menu bar panel with an agent step log
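As a concrete example of the pointing flow, here is a minimal sketch of pulling [POINT:x,y:label:screenN] tags out of a model reply. The PointTarget struct and parsePointTags function are illustrative names, not Pip's actual API.

```swift
import Foundation

// Minimal sketch of parsing [POINT:x,y:label:screenN] tags from a model reply.
// PointTarget and parsePointTags are illustrative names, not Pip's actual types.
struct PointTarget {
    let x: Double
    let y: Double
    let label: String
    let screenIndex: Int
}

func parsePointTags(in reply: String) -> [PointTarget] {
    // Example input: "Click here [POINT:812,431:Export button:screen1]"
    let pattern = #"\[POINT:(\d+),(\d+):([^:\]]+):screen(\d+)\]"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(reply.startIndex..., in: reply)
    return regex.matches(in: reply, range: fullRange).compactMap { match in
        guard
            let xRange = Range(match.range(at: 1), in: reply),
            let yRange = Range(match.range(at: 2), in: reply),
            let labelRange = Range(match.range(at: 3), in: reply),
            let screenRange = Range(match.range(at: 4), in: reply),
            let x = Double(reply[xRange]),
            let y = Double(reply[yRange]),
            let screen = Int(reply[screenRange])
        else { return nil }
        return PointTarget(x: x, y: y,
                           label: String(reply[labelRange]),
                           screenIndex: screen)
    }
}
```

Targets parsed this way are what the overlay animates the paw toward.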
Default build: no paid API keys — Apple Speech, Ollama, and macOS speech synthesis are enough for demos and everyday use.
Example flows:
- “Inspect my screen and tell me what to do next”
- “Point at the export button”
- “Open YouTube and search rockets”
- “Organize my Desktop” with confirmation
- “Research reusable rockets and make me a PDF”
How we built it
- App shell: SwiftUI + AppKit, menu-bar-only (LSUIElement), custom NSPanel for the control surface and a full-screen transparent overlay for Pip
- Voice pipeline: AVAudioEngine + Apple Speech; global push-to-talk via a listen-only CGEvent tap for reliable Ctrl + Option detection (see the tap sketch after this list)
- Brain: OllamaAgentClient streaming /api/chat with multi-monitor screenshots; optional llama3.2-vision for stronger visual reasoning (a request sketch follows at the end of this section)
- Agent loop: CompanionManager orchestrates dictation → screen capture → optional tools → confirmations for risky actions → Ollama → point-tag parsing → overlay animation → TTS
- On-screen teaching: OverlayWindow maps coordinates across monitors and animates Pip along bezier arcs; response bubble + waveform next to the pet
- Operator mode: UIAutomationExecutor + BrowserTaskExecutor for visible browser/desktop actions, with URL fallback when UI automation fails
- Optional sidecar: local Playwright server in agent-sidecar/ for heavier browser automation
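For the push-to-talk piece specifically, a listen-only flags-changed tap is enough to notice when the Ctrl + Option chord is pressed and released. This is a minimal sketch; the class name, callback wiring, and threading are illustrative rather than Pip's actual code, and the tap is still subject to the permission handling described in the challenges below.

```swift
import AppKit

// Minimal sketch of a listen-only event tap that reports whether the
// Ctrl + Option chord is currently held, for push-to-talk.
final class PushToTalkMonitor {
    /// Called with true when the chord is pressed and false when it is released.
    var onChordChanged: ((Bool) -> Void)?
    private var tap: CFMachPort?

    func start() {
        let mask = CGEventMask(1 << CGEventType.flagsChanged.rawValue)
        let callback: CGEventTapCallBack = { _, _, event, userInfo in
            let monitor = Unmanaged<PushToTalkMonitor>
                .fromOpaque(userInfo!)
                .takeUnretainedValue()
            let flags = event.flags
            let chordHeld = flags.contains(.maskControl) && flags.contains(.maskAlternate)
            DispatchQueue.main.async { monitor.onChordChanged?(chordHeld) }
            return Unmanaged.passUnretained(event)   // listen-only: never consumes the event
        }
        tap = CGEvent.tapCreate(
            tap: .cgSessionEventTap,
            place: .headInsertEventTap,
            options: .listenOnly,
            eventsOfInterest: mask,
            callback: callback,
            userInfo: Unmanaged.passUnretained(self).toOpaque()
        )
        guard let tap else { return }
        let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
        CFRunLoopAddSource(CFRunLoopGetMain(), source, .commonModes)
        CGEvent.tapEnable(tap: tap, enable: true)
    }
}
```

When the chord state flips to true the app can start dictation, and when it flips back to false it can stop and hand the transcript to the agent loop.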
Architecture is MVVM on @MainActor, with a visible tool-step log and agent cards for follow-up tasks.
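To make the Brain step concrete, here is a minimal sketch of streaming a screen-aware question through Ollama's local /api/chat endpoint with URLSession, assuming Ollama is running on its default port. The payload shape follows Ollama's public API; the ChatChunk type and streamScreenQuestion function are illustrative and far simpler than the app's OllamaAgentClient.

```swift
import Foundation

// Minimal sketch of streaming a screen-aware prompt through Ollama's /api/chat.
struct ChatChunk: Decodable {
    struct Message: Decodable { let content: String }
    let message: Message?
    let done: Bool
}

func streamScreenQuestion(prompt: String, screenshotBase64: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/chat")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = [
        "model": "gemma3",
        "stream": true,
        "messages": [[
            "role": "user",
            "content": prompt,
            "images": [screenshotBase64]   // base64-encoded screenshot(s)
        ]]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    var reply = ""
    let (bytes, _) = try await URLSession.shared.bytes(for: request)
    for try await line in bytes.lines {                  // one JSON object per line
        guard !line.isEmpty else { continue }
        let chunk = try JSONDecoder().decode(ChatChunk.self, from: Data(line.utf8))
        reply += chunk.message?.content ?? ""
        if chunk.done { break }
    }
    return reply
}
```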
Challenges we ran into
- macOS permissions (TCC): Screen Recording, Accessibility, and microphone access had to be requested and tested carefully; we avoided terminal xcodebuild so permission state stays reliable during demos
- Multi-monitor pointing: Mapping model coordinates to the correct display and animating Pip smoothly across screens took extra geometry and state handling (see the mapping sketch after this list)
- Browser automation vs. reality: Sites change layouts constantly; we combined Accessibility-first interaction with safe URL fallbacks when typing/clicking does not land
- Trust for risky actions: File moves and PDF exports needed clear preview/confirm steps so Pip feels helpful, not reckless
- Local-only constraints: Staying off paid APIs meant tuning prompts and models (gemma3 vs. llama3.2-vision) for good-enough screen understanding without cloud vision APIs
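For the multi-monitor pointing challenge above, the core of the fix is a small coordinate transform: the model reports points with a top-left origin inside one screen's capture, while AppKit overlay windows live in a global, bottom-left-origin space. A minimal sketch, assuming the model's coordinates are already in points for the indexed screen (display scaling is omitted):

```swift
import AppKit

// Minimal sketch: convert a model-reported point (top-left origin, within one
// screen's capture) into the global bottom-left-origin space AppKit windows use.
// Assumes model coordinates are already in points; backing-scale handling is omitted.
func globalPoint(x: CGFloat, y: CGFloat, screenIndex: Int) -> CGPoint? {
    let screens = NSScreen.screens
    guard screens.indices.contains(screenIndex) else { return nil }
    let frame = screens[screenIndex].frame   // global AppKit coordinates

    // Flip the y-axis within the screen, then offset by the screen's position.
    return CGPoint(x: frame.origin.x + x,
                   y: frame.origin.y + (frame.height - y))
}
```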
Accomplishments that we’re proud of
- A fully local, key-free default stack judges and users can run with Ollama + Apple Speech
- A visible agent — not just chat — with paw pointing, spoken replies, and a step log
- End-to-end flows: voice in → screen context → tools → point → speak out
- Guarded Desktop cleanup and research PDF workflows with human confirmation
- A polished menu bar + notch HUD + pet overlay experience that feels like a product, not a script
What we learned
- On-screen guidance beats paragraphs when users are overwhelmed by feature-heavy apps
- Setup friction is the real barrier to agent adoption; local defaults matter as much as model quality
- Transparency builds trust: logging steps and confirming risky actions makes autonomous behavior feel acceptable
- macOS is a great agent surface if you embrace menu bar UX, non-activating overlays, and system speech APIs
- Small, bounded operator loops (try UI automation → observe → fall back) are more reliable than pretending one-shot prompts always work
What’s next for OpenClaw2026_Ryann_Pip
- Deeper in-app tutoring for creative tools such as CapCut, Figma, and Xcode, with richer step-by-step “follow the paw” lessons
- Stronger UI grounding with better vision models and element detection beyond coordinate tags
- Smarter task memory so Pip remembers context across a session and multi-step projects
- Plugin / skill system so communities can add tools without forking the app
- Broader operator coverage via the Playwright sidecar and safer action policies
- Accessibility focus: voice-first guidance for users who want hands-free help navigating dense UIs
Goal: Pip becomes the Mac copilot anyone can run in one minute — and trust when the UI gets hard.
Built With
- accessibility-api
- appkit
- apple-speech-framework
- avfoundation
- javascript
- macos
- node.js
- nsspeechsynthesizer
- ollama
- playwright
- screencapturekit
- swift
- swiftui
- ui-automation
- xcode