Jarvis - Always-On Ambient AI Copilot

Startup
Wake Word Detection
Task Delegation
Architecture

Inspiration

Most AI assistants are pull-based: you open a chat, type a question, wait. I wanted an ambient agent that passively observes your screen and audio, building context over time, and only activates when you say its name. When it wakes up, it already knows what you're looking at - no copy-pasting, no describing, no tab-switching.

What It Does

Jarvis runs three layers:

Layer 1 (Local, $0): Continuously captures system audio via faster-whisper and screen via mss+Pillow, building a rolling context buffer. Zero API calls.
Layer 2 (Gemini 2.5 Flash Live API): Activated by wake word "Jarvis." Opens a real-time WebSocket session, injects buffered context, streams live audio + screen at 1fps, and provides voice responses.
Layer 3 (Gemini 3 Flash): When the user requests something complex, the Live API delegates via function calling to Gemini 3 Flash, which runs autonomously with code_execution, then reports results back via voice.

How I Built It

Python + asyncio for concurrent audio/screen/session management. The google-genai SDK handles both the Live API WebSocket and Gemini 3 Flash calls. faster-whisper runs locally for transcription.