Inspiration
We watched Andrej Karpathy's vision of "vibe coding" take off: developers speaking naturally to AI agents that write code for them. But as agent swarms became real (Claude Code teams, parallel agents), there was still no good way to see what they were all doing at once. You're stuck alt-tabbing between terminal windows. We thought: what if you could stand in a room surrounded by your coding agents, watch them work in real time, and just talk to direct them? That's the command center every developer will want when they're running 10 agents at once.
What It Does
VibeView is a spatial command center for AI coding agent swarms. You spawn Claude Code agents in terminals, and VibeView gives you a real-time dashboard showing every agent's terminal output with full color rendering, a kanban task board synced with Claude Code's task system, cost/token/model analytics, and — the centerpiece — a talking avatar you can speak to that types your commands directly into any agent's terminal and narrates what's happening back to you.
On Apple Vision Pro, each panel (agents, tasks, stats, avatar) opens as its own floating window you can arrange in 3D space with your hands. You can pop out individual agent terminals into their own spatial windows, glance at the task board floating to your left, and speak to the avatar hovering to your right — all simultaneously.
How We Built It
Three-process architecture: a Next.js 15 frontend with React 19 and Tailwind v4, a Python FastAPI bridge server, and a LiveKit voice agent with LemonSlice avatar rendering.
The bridge is the brain — it scans tmux sessions every 5 seconds to auto-detect Claude Code agents, captures their terminal output with ANSI escape preservation, parses status bars for cost/token metrics, watches ~/.claude/tasks/ for real-time task updates, and broadcasts everything over WebSocket. When an agent finishes (working→idle transition detected via tmux title markers), it auto-generates a spoken summary using GPT and ElevenLabs TTS.
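The scan-and-capture step can be sketched in a few lines of Python. This is a minimal illustration, not the bridge's actual code: the names Pane, parse_panes, looks_like_agent, list_panes, and capture_pane are hypothetical, and the rule that a pane is an agent when "claude" appears in its command or title is an assumption. The tmux flags themselves (list-panes -a -F, capture-pane -p -e) are standard.

```python
import subprocess
from dataclasses import dataclass

# tmux format string: one tab-separated row per pane.
PANE_FORMAT = "#{pane_id}\t#{pane_current_command}\t#{pane_title}"


@dataclass(frozen=True)
class Pane:
    pane_id: str
    command: str
    title: str


def parse_panes(listing: str) -> list[Pane]:
    """Parse the output of `tmux list-panes -a -F PANE_FORMAT`."""
    panes = []
    for line in listing.splitlines():
        parts = line.split("\t", 2)
        if len(parts) == 3:
            panes.append(Pane(*parts))
    return panes


def looks_like_agent(pane: Pane) -> bool:
    # Assumption: an agent pane mentions "claude" in its command or title.
    return "claude" in pane.command.lower() or "claude" in pane.title.lower()


def list_panes() -> list[Pane]:
    """List every pane in every tmux session (requires a running tmux server)."""
    result = subprocess.run(
        ["tmux", "list-panes", "-a", "-F", PANE_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_panes(result.stdout)


def capture_pane(pane_id: str, history_lines: int = 200) -> str:
    """Capture pane text; -e keeps the ANSI escape sequences for color rendering."""
    result = subprocess.run(
        ["tmux", "capture-pane", "-p", "-e",
         "-S", f"-{history_lines}", "-t", pane_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A polling loop would call list_panes() on the 5-second interval described above, filter with looks_like_agent, and broadcast capture_pane() for each match.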
The voice pipeline routes speech through LiveKit Cloud to our Python agent, which connects to the bridge via WebSocket and types transcriptions directly into the target agent's tmux pane using tmux send-keys. A narrator loop reads terminal output every 4 seconds and speaks a brief summary of what the agent is doing.
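Typing a transcription into a pane might look like the sketch below. The function names are hypothetical and the whitespace normalization is our own assumption; the only parts taken from tmux itself are send-keys, its -t target flag, and -l, which sends the text literally instead of interpreting words such as "Enter" as key names.

```python
import subprocess


def build_send_keys(pane_id: str, transcript: str) -> list[list[str]]:
    """Build the tmux commands that type a transcript and then submit it."""
    # Collapse newlines and repeated spaces so a multi-line transcript
    # cannot submit the prompt halfway through.
    text = " ".join(transcript.split())
    if not text:
        return []
    return [
        # -l sends the text literally; "--" stops option parsing in case
        # the transcript starts with a dash.
        ["tmux", "send-keys", "-t", pane_id, "-l", "--", text],
        # A separate, non-literal call presses the Enter key.
        ["tmux", "send-keys", "-t", pane_id, "Enter"],
    ]


def type_into_pane(pane_id: str, transcript: str) -> None:
    """Run the commands against a live tmux server."""
    for command in build_send_keys(pane_id, transcript):
        subprocess.run(command, check=True)
```

Sending the text and the Enter key as two calls keeps a spoken word like "Enter" or "Escape" from being treated as a keypress.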
For spatial computing, we used WebSpatial — window.open() calls get hijacked into native visionOS floating windows with glass material backgrounds and depth elevation. Pointer events replace HTML5 drag-and-drop (which doesn't work on visionOS), and we handle Safari/WKWebView quirks for microphone access and MediaRecorder encoding.
Challenges We Ran Into
Voice-to-terminal was deceptively hard. LiveKit's agent dispatch system, LemonSlice avatar rendering, and tmux keystroke injection all had to work together. We hit TTS crashes when the narrator responded with just a period character, duplicate transcript handling when LiveKit emitted the same final transcript twice, and ANSI escape sequences that needed stripping before narration but preserving for terminal display.
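The three fixes mentioned here are small enough to sketch together. This is an illustration under assumptions, not our production code: strip_ansi, is_speakable, and TranscriptDeduper are hypothetical names, and the 2-second deduplication window is an arbitrary choice.

```python
import re
import time

# CSI sequences (colors, cursor movement) and OSC sequences (window titles).
ANSI_RE = re.compile(
    r"\x1b\[[0-?]*[ -/]*[@-~]"                   # CSI ... final byte
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"        # OSC ... BEL or ST
)


def strip_ansi(text: str) -> str:
    """Remove escape sequences before text goes to narration."""
    return ANSI_RE.sub("", text)


def is_speakable(text: str) -> bool:
    """Reject replies such as "." that contain no letters or digits."""
    return any(ch.isalnum() for ch in text)


class TranscriptDeduper:
    """Drop a final transcript repeated within a short window."""

    def __init__(self, window_s: float = 2.0, clock=time.monotonic):
        self._window_s = window_s
        self._clock = clock
        self._last_text = None
        self._last_time = 0.0

    def accept(self, text: str) -> bool:
        normalized = " ".join(text.split()).lower()
        now = self._clock()
        duplicate = (
            normalized == self._last_text
            and (now - self._last_time) < self._window_s
        )
        self._last_text = normalized
        self._last_time = now
        return not duplicate
```

The terminal view keeps the raw capture, while the narrator sees only strip_ansi(capture), and a narration is sent to TTS only when is_speakable returns True.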
Spatial layout breaks everything you know about CSS. visionOS sets body { height: auto } for native scroll gestures, which destroys every flex-1 min-h-0 pattern. Full-screen overlays with absolute inset-0 don't work because the parent height chain doesn't resolve. We had to switch to position: fixed with explicit offsets and set minimum heights everywhere.
Safari on Vision Pro lies. MediaRecorder.isTypeSupported('audio/webm') returns true but produces broken audio. Web Speech API exists in WKWebView but isn't permitted. We had to detect the runtime, skip mimeType entirely (let Safari pick its default), chunk recordings with start(1000), and convert Safari's AAC-encoded mp4 to mp3 via ffmpeg before sending to Whisper.
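The server-side conversion step could look roughly like this. The function names and the quality setting are assumptions made for illustration; the ffmpeg options used (-y, -i, -vn, -codec:a libmp3lame, -q:a) are standard, and the sketch assumes an ffmpeg build with libmp3lame on the PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that re-encodes an audio file to MP3."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", str(src),            # Safari's AAC-in-mp4 recording
        "-vn",                     # ignore any video stream
        "-codec:a", "libmp3lame",  # encode audio as MP3
        "-q:a", "4",               # VBR quality level (assumed value)
        str(dst),
    ]


def convert_to_mp3(src: Path) -> Path:
    """Convert src next to itself and return the new .mp3 path."""
    dst = src.with_suffix(".mp3")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True, capture_output=True)
    return dst
```

With check=True, a failed conversion raises subprocess.CalledProcessError, so a broken recording is caught before anything is sent for transcription.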
Accomplishments We're Proud Of
We built a fully working voice-controlled spatial interface for AI agent swarms in 48 hours. You can literally stand in a room on Vision Pro, surrounded by floating terminal windows, say "build me a tic-tac-toe game," and watch Claude Code do it while the avatar narrates the progress. The entire pipeline (speech recognition, agent routing, terminal capture, narration, avatar lip-sync) works end-to-end.
The terminal rendering is pixel-perfect — full SGR, 256-color, and 24-bit RGB ANSI color support matching Ghostty's Catppuccin Mocha palette. The task board does real drag-and-drop on visionOS using pointer events. And the bridge auto-detects agents with zero configuration — just open a terminal, run Claude Code, and it appears in the grid.
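To show what 256-color and 24-bit support involves, here is a sketch of resolving the extended color parameters that follow SGR codes 38 (foreground) and 48 (background). Our renderer lives in the frontend, so this Python version is only an illustration; the function names are hypothetical, and base16 stands in for the theme's 16 base colors, which the caller must supply. The 6x6x6 cube levels and the grayscale ramp follow the usual xterm convention.

```python
RGB = tuple[int, int, int]

# Channel intensities of the xterm 6x6x6 color cube (indices 16-231).
CUBE_LEVELS = (0, 95, 135, 175, 215, 255)


def xterm256_to_rgb(index: int, base16: list[RGB]) -> RGB:
    """Map a 256-color index to RGB; indices 0-15 come from the theme."""
    if not 0 <= index <= 255:
        raise ValueError(f"color index out of range: {index}")
    if index < 16:
        return base16[index]
    if index < 232:
        offset = index - 16
        return (
            CUBE_LEVELS[offset // 36],
            CUBE_LEVELS[(offset // 6) % 6],
            CUBE_LEVELS[offset % 6],
        )
    level = 8 + 10 * (index - 232)  # 24-step grayscale ramp
    return (level, level, level)


def parse_extended_color(params: list[int], base16: list[RGB]) -> RGB:
    """Resolve the parameters after SGR 38 or 48: [5, n] or [2, r, g, b]."""
    if len(params) >= 2 and params[0] == 5:
        return xterm256_to_rgb(params[1], base16)
    if len(params) >= 4 and params[0] == 2:
        r, g, b = params[1:4]
        if not all(0 <= c <= 255 for c in (r, g, b)):
            raise ValueError(f"RGB component out of range: {params[1:4]}")
        return (r, g, b)
    raise ValueError(f"unsupported extended color: {params}")
```

For the escape sequence ESC[38;5;196m, the caller would pass [5, 196]; for ESC[38;2;30;30;46m, it would pass [2, 30, 30, 46].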
What We Learned
Spatial computing changes how you think about developer tools. When you can see 8 agent terminals simultaneously in your peripheral vision and just speak to redirect them, it's fundamentally different from managing them in a flat 2D terminal. In our experience, the cognitive load of tracking which agent is doing what drops noticeably.
We also learned that the "last mile" of voice interfaces is harder than the AI part. Getting LiveKit, LemonSlice, ElevenLabs, and tmux to all cooperate reliably — handling reconnects, deduplication, ANSI stripping, TTS edge cases — took more time than any individual feature.
What's Next for VibeView
Smarter voice commands — parsing intent so you can say "spawn a team to refactor auth" and have it create agents automatically, or "kill agent 3" without touching the UI. Spatial audio — agent narrations positioned in 3D space matching their window locations. Persistent layouts — save and restore your spatial window arrangements. Multi-user — multiple developers sharing the same spatial command center, each seeing the same agent swarm from different perspectives. And deeper agent awareness — the avatar understanding git diffs, test results, and build status so it can proactively tell you "the auth refactor agent just broke 3 tests, want me to have it fix them?"
