Inspiration
Most AI assistants are pull-based: you open a chat, type a question, wait. I wanted an ambient agent that passively observes your screen and audio, building context over time, and only activates when you say its name. When it wakes up, it already knows what you're looking at - no copy-pasting, no describing, no tab-switching.
What It Does
Jarvis runs three layers:
- Layer 1 (Local, $0): Continuously captures system audio via faster-whisper and screen via mss+Pillow, building a rolling context buffer. Zero API calls.
- Layer 2 (Gemini 2.5 Flash Live API): Activated by wake word "Jarvis." Opens a real-time WebSocket session, injects buffered context, streams live audio + screen at 1fps, and provides voice responses.
- Layer 3 (Gemini 3 Flash): When the user requests something complex, the Live API delegates via function calling to Gemini 3 Flash, which runs autonomously with code_execution, then reports results back via voice.
How I Built It
Python + asyncio for concurrent audio/screen/session management. The google-genai SDK handles both the Live API WebSocket and Gemini 3 Flash calls. faster-whisper runs locally for transcription.
Challenges
- macOS system audio capture requires a virtual audio device (BlackHole)
- Managing Live API session lifecycle (2-min video limit, reconnection)
- Coordinating async background tasks with the live voice session


Log in or sign up for Devpost to join the conversation.