SensAI — Your AI Desktop Companion That Sees, Hears, Speaks, and Acts

Inspiration

We spend our entire day staring at screens, but AI assistants are still trapped in browser tabs. You have to switch context, type out what you're looking at, and wait for a text response. What if your AI could just be there — floating on your desktop, watching what you see, listening when you speak, and acting when you ask? That's SensAI. She's not a chatbot. She's a companion.

What it does

SensAI is an AI desktop companion built as an Electron app with Google's Gemini Live API. She lives on your screen as a floating orb and can:

  • See your screen in real-time — she watches what you're doing at ~1fps and understands the visual context
  • Have natural voice conversations — bidirectional audio with barge-in (you can interrupt her mid-sentence, just like talking to a real person)
  • Control your computer — click, type, drag, scroll, launch apps, run shell commands, read/write files, open URLs, all through voice commands with a safety confirmation system
  • Draw on your screen — arrows, circles, highlights, and text annotations directly on your display to point things out
  • Remember things across sessions — persistent memory synced to Google Cloud Firestore, so she knows your name, preferences, and context from previous conversations
  • Connect to any tool — via the MCP (Model Context Protocol) system, SensAI discovers and uses tools from any MCP-compatible server you configure. This makes her a platform, not just an app.
  • Switch personalities — 6 built-in personas (Wise Sage, Commander, Pro Coach, Hype Caster, Chill Buddy, raw Gemini) with 8 voice options, plus fully custom personas
  • Stay always-on — dormant mode keeps her running in the background, ready when you click the mic toggle on the floating orb

The key differentiator is the MCP tool system. Instead of hardcoding a fixed set of capabilities, SensAI reads an mcp_config.json file, spawns MCP servers via stdio, discovers their tools at session start, and merges them with Gemini's function declarations. Want her to manage your GitHub repos? Plug in the GitHub MCP server. Want filesystem access? Add the filesystem server. Security tools, dev tools, custom integrations — she can use anything you give her.
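
A minimal sketch of what such a config can look like (field names follow the common MCP client convention; the server entries and token placeholder are illustrative, not our exact schema):

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "<your-token>" }
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/you/allow"]
    }
  }
}
```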

How we built it

Frontend/Desktop: Electron with a multi-window architecture. The main control window handles settings, persona selection, and chat transcript. A floating orb overlay (always-on-top, click-through) shows AI state via animated canvas. A fullscreen transparent overlay draws screen annotations. A hidden offscreen BrowserWindow handles screen capture without blocking the main thread.
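
For a sense of how the orb overlay works, here's a minimal Electron sketch (the window options are real Electron APIs; file names and sizes are illustrative, not our exact code):

```javascript
// main.js sketch: an always-on-top, click-through orb window
const { app, BrowserWindow } = require('electron');

function createOrbWindow() {
  const orb = new BrowserWindow({
    width: 120,
    height: 120,
    transparent: true,   // no chrome, just the canvas-drawn orb
    frame: false,
    alwaysOnTop: true,
    skipTaskbar: true,
    hasShadow: false,
  });
  // Let clicks pass through to whatever is underneath, but still
  // forward mouse-move events so the orb can react on hover.
  orb.setIgnoreMouseEvents(true, { forward: true });
  orb.loadFile('orb.html');
  return orb;
}

app.whenReady().then(createOrbWindow);
```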

AI/Voice: Google's Gemini Live API (@google/genai SDK) for real-time bidirectional audio streaming. Audio captured as PCM 16kHz mono, sent via WebSocket. Gemini responds with audio at 24kHz plus optional function calls. We implemented a queue-based audio scheduler with proper async AudioContext management to handle Chrome's autoplay policies and prevent audio dropout.
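
At the capture end, the Float32-to-Int16 conversion is the standard Web Audio pattern; a generic sketch:

```javascript
// Convert Web Audio Float32 samples to 16-bit PCM before
// base64-encoding them for the Live API (generic sketch).
function floatTo16BitPCM(float32) {
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to [-1, 1]
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // scale to int16 range
  }
  return pcm;
}
```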

Computer Control: RobotJS (@jitsi/robotjs) running in a forked worker process to avoid blocking Electron's main thread. Mouse, keyboard, drag, and scroll actions all use normalized 0-1 coordinates mapped to the screen resolution. A voice confirmation system with a 15-second timeout and grace windows keeps things safe.
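
A rough sketch of the worker-side mapping (the robotjs calls are the library's real API; the message shape is illustrative):

```javascript
// control-worker.js, launched via child_process.fork() so RobotJS's
// synchronous native calls never block Electron's main thread (sketch).
const robot = require('@jitsi/robotjs');

process.on('message', (msg) => {
  const { width, height } = robot.getScreenSize();
  if (msg.type === 'click') {
    // Coordinates arrive normalized to 0-1, so the same command
    // works at any screen resolution.
    robot.moveMouse(Math.round(msg.x * width), Math.round(msg.y * height));
    robot.mouseClick(msg.button || 'left');
    process.send({ id: msg.id, ok: true });
  }
});
```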

MCP Integration: Custom MCP client that speaks JSON-RPC over stdio. Reads config, spawns servers as child processes, does the MCP handshake (initialize → tools/list), converts MCP tool schemas to Gemini function declarations, and routes tool calls back to the correct server. SensAI can even add new MCP servers to herself via voice command.
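
The handshake itself is three newline-delimited JSON-RPC messages; a sketch with an illustrative server (method names are from the MCP spec):

```javascript
// MCP handshake over stdio (sketch; the filesystem server is just an example)
const { spawn } = require('child_process');
const proc = spawn('npx', ['-y', '@modelcontextprotocol/server-filesystem', '/tmp'], {
  stdio: ['pipe', 'pipe', 'inherit'],
});
const sendLine = (msg) => proc.stdin.write(JSON.stringify(msg) + '\n');

// 1. initialize: the server replies with its capabilities
sendLine({ jsonrpc: '2.0', id: 1, method: 'initialize', params: {
  protocolVersion: '2024-11-05',
  clientInfo: { name: 'sensai', version: '1.0.0' },
  capabilities: {},
}});
// 2. notify the server that the client is ready (no id: a notification)
sendLine({ jsonrpc: '2.0', method: 'notifications/initialized' });
// 3. discover tools; each tool's inputSchema becomes the `parameters`
//    field of a Gemini function declaration
sendLine({ jsonrpc: '2.0', id: 2, method: 'tools/list', params: {} });
```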

Cloud: Google Cloud Firestore for persistent memory sync, conversation logs, and session analytics. Memory syncs bidirectionally — pull from cloud on session start, push on every save. Conversation logs are archived per-session. Session analytics track persona usage, tools invoked, and duration.
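
A sketch of the pull-on-start / push-on-save pattern using the Firebase Admin SDK (collection and field names are assumptions, not the actual schema in electron/firestore-sync.js):

```javascript
const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

// Pull memory from the cloud at session start.
async function pullMemory(userId) {
  const snap = await db.collection('memories').doc(userId).get();
  return snap.exists ? snap.data() : {};
}

// Push on every local save; merge so fields written elsewhere survive.
async function pushMemory(userId, memory) {
  await db.collection('memories').doc(userId).set(
    { ...memory, updatedAt: admin.firestore.FieldValue.serverTimestamp() },
    { merge: true },
  );
}
```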

Performance: Screen capture moved to an offscreen renderer process (MediaStream fast path with canvas.toBlob, falling back to desktopCapturer.getSources). Audio base64 encoding uses chunked loops instead of spread operators to avoid call stack pressure. These optimizations eliminated the UI lag caused by blocking Electron's main thread.
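
The chunked encoding fix looks roughly like this (generic sketch):

```javascript
// String.fromCharCode(...bytes) on a large frame blows the argument
// limit / call stack, so build the binary string in fixed-size slices.
function toBase64(bytes /* Uint8Array */) {
  const CHUNK = 0x8000; // 32 KB per fromCharCode call
  let binary = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary); // runs in a renderer, where btoa is available
}
```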

Google Cloud Services Used

  • Gemini Live API (via @google/genai SDK) — the core AI brain. Real-time voice + vision + function calling over WebSocket.
  • Google Cloud Firestore — cloud persistence for memory, conversation history, and session analytics. See electron/firestore-sync.js.
  • Google Search (native Gemini tool) — web search capability available in every session.
  • Code Execution (native Gemini tool) — Gemini can write and run code to answer questions.

Challenges we ran into

Electron audio pipeline: Chrome's AudioContext autoplay policy caused our biggest bug — SensAI would randomly stop talking mid-sentence. The root cause was that AudioContext.resume() wasn't being awaited before scheduling audio chunks, so chunks were silently dropped when the context was still suspended. We rebuilt the scheduler with an async queue that buffers chunks during suspension and drains them after resume completes.
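
The fix, in sketch form (chunk format and variable names are illustrative; chunks are Int16 PCM at 24kHz as described above):

```javascript
const ctx = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;
const queue = [];
let draining = false;

async function enqueuePcmChunk(int16) {
  queue.push(int16);
  if (draining) return;
  draining = true;
  // The fix: await resume() BEFORE scheduling, so nothing is scheduled
  // against a suspended context and silently dropped.
  if (ctx.state === 'suspended') await ctx.resume();
  while (queue.length) {
    const chunk = queue.shift();
    const buf = ctx.createBuffer(1, chunk.length, 24000);
    const data = buf.getChannelData(0);
    for (let i = 0; i < chunk.length; i++) data[i] = chunk[i] / 0x8000;
    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    // Schedule each chunk at the tail of the previous one for gapless playback.
    nextStartTime = Math.max(nextStartTime, ctx.currentTime);
    src.start(nextStartTime);
    nextStartTime += buf.duration;
  }
  draining = false;
}
```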

Screen capture blocking the main thread: desktopCapturer.getSources() called from Electron's main process takes 100-400ms and blocks all IPC, window management, and audio routing. On a laptop with GPU switching, this caused severe stuttering. We moved capture to a hidden renderer BrowserWindow that uses MediaStream — the main thread now only receives lightweight IPC messages with the frame data.
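
A sketch of the renderer-side fast path (the chromeMediaSource constraint is Electron's standard desktop-capture pattern; sendFrameOverIpc is a hypothetical helper):

```javascript
// Hidden-renderer sketch: the main process sends a desktopCapturer
// source id over IPC; the renderer turns it into a stream and grabs
// frames off a <video> element.
async function startCapture(sourceId) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      mandatory: {
        chromeMediaSource: 'desktop',
        chromeMediaSourceId: sourceId,
      },
    },
  });
  const video = document.createElement('video');
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx2d = canvas.getContext('2d');

  setInterval(() => {
    ctx2d.drawImage(video, 0, 0, canvas.width, canvas.height);
    // toBlob is async and runs entirely outside the main process;
    // only the encoded frame travels over IPC.
    canvas.toBlob((blob) => blob && sendFrameOverIpc(blob), 'image/jpeg', 0.7);
  }, 1000); // ~1fps, matching the capture rate above
}
```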

MCP protocol edge cases: Spawning MCP servers via stdio requires careful handling of JSON-RPC framing, newline-delimited messages, and server initialization timeouts. Some servers send partial JSON lines that need buffering. We built a robust line-based parser with timeout fallbacks.
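
The parser boils down to buffering the trailing partial line; a sketch:

```javascript
// stdout can deliver partial lines, so keep the trailing fragment in a
// buffer until its newline arrives.
function createLineParser(onMessage) {
  let buffer = '';
  return (chunk) => {
    buffer += chunk.toString('utf8');
    const lines = buffer.split('\n');
    buffer = lines.pop(); // last element is a partial line (or '')
    for (const line of lines) {
      if (!line.trim()) continue;
      try {
        onMessage(JSON.parse(line));
      } catch {
        // Stray non-JSON log output from the server; skip it.
      }
    }
  };
}

// Usage: proc.stdout.on('data', createLineParser(handleJsonRpcMessage));
```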

Voice confirmation UX: Getting the confirmation flow right for computer control actions was tricky. The system needs to save/restore the user's foreground window (via Win32 API), present the confirmation prompt on the orb overlay, listen for voice approval, handle timeouts, and support a grace window for rapid sequential actions — all without stealing focus from whatever the user is working on.
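
Stripped of the Win32 focus handling, the gate logic looks roughly like this (the grace-window duration and overlay helpers are illustrative):

```javascript
const CONFIRM_TIMEOUT_MS = 15_000;
const GRACE_WINDOW_MS = 10_000; // duration assumed for illustration
let lastApprovedAt = 0;

async function confirmAction(description, waitForVoiceApproval) {
  // Within the grace window after an approval, follow-up actions run
  // without re-prompting, so rapid sequences aren't interrupted.
  if (Date.now() - lastApprovedAt < GRACE_WINDOW_MS) return true;
  showPromptOnOrb(description); // hypothetical overlay call
  const approved = await Promise.race([
    waitForVoiceApproval(),     // resolves true/false on a voice reply
    new Promise((resolve) => setTimeout(() => resolve(false), CONFIRM_TIMEOUT_MS)),
  ]);
  hidePromptOnOrb();            // hypothetical overlay call
  if (approved) lastApprovedAt = Date.now();
  return approved;
}
```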

Accomplishments we're proud of

  • The MCP platform architecture — SensAI isn't a fixed-capability assistant. She's a platform that can use any tool you give her. This is the first desktop companion we know of that combines Gemini Live voice + screen vision + MCP tool extensibility.
  • The floating orb — during the demo, there's no "app window." Just the user's desktop with a floating AI companion.
  • Zero-cost operation — the Gemini API free tier covers everything. No paid services required.

What we learned

  • Electron's desktopCapturer is a known performance trap — always capture in a renderer process, never the main process.
  • Chrome's Web Audio API has subtle async pitfalls that cause silent audio dropout. Always await AudioContext.resume() before scheduling.
  • The MCP protocol is elegant but stdio transport has edge cases around buffering and server lifecycle management.
  • Building a voice-first UI requires fundamentally different thinking than visual UI — confirmation flows, barge-in handling, and state feedback through audio cues matter more than pixel-perfect layouts.

What's next for SensAI

  • Multi-monitor support — capture and annotate across multiple displays
  • Plugin marketplace — curated MCP server packages for common workflows (coding, research, productivity)
  • Conversation memory search — semantic search across all past conversations via Firestore
  • Mobile companion — lightweight mobile app that connects to the same Firestore backend for cross-device memory
  • Community personas — share and import custom personas with system prompts and voice configs

Built With

  • Gemini Live API
  • Google Cloud Firestore
  • Electron
  • JavaScript
  • Node.js
  • RobotJS
  • Model Context Protocol (MCP)
  • Web Audio API
