About The Project — ARGUS

The Inspiration

Every AI assistant I've ever used has the same fatal flaw: you have to explain everything from scratch, every single time.

You're debugging for 45 minutes. Three failed attempts. Stack Overflow tabs everywhere. You finally give up and open an AI assistant — and the first thing it asks is: "What's the problem?"

You just spent 45 minutes living the problem. Now you have to explain it in words?

That frustration is what built ARGUS.


What I Built

ARGUS is an ambient AI screen agent, built to be proactive rather than reactive.

It runs silently in the background, capturing your screen every 10 seconds, analyzing each frame with Gemini 2.0 Flash vision, and building a rolling 1-minute context window of exactly what you've been doing. When you say "ARGUS" — it doesn't ask what's wrong. It already knows.

"I've been watching. You hit that TypeError twice and visited Stack Overflow three times. The fix is on line 4 — you're missing an await. Let me apply it now."

The mouse moves on its own. The fix is applied. Tests pass.


How I Built It

The architecture is split into two parts:

Local Client (the eyes and hands)

  • mss captures screenshots every 10 seconds
  • A pixel diff filter compares each frame against the last — if less than 15% of pixels changed, the frame is skipped entirely. This reduced API calls by ~80% and made the free tier viable.
  • SpeechRecognition listens for the wake word "ARGUS"
  • PyAutoGUI executes real mouse movements and keyboard actions
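The pixel diff filter in the list above can be sketched in a few lines of NumPy (the function name and threshold constant are illustrative, not the actual ARGUS code):

```python
import numpy as np

CHANGE_THRESHOLD = 0.15  # skip the frame if fewer than 15% of pixels changed

def frame_changed(prev: np.ndarray, curr: np.ndarray,
                  threshold: float = CHANGE_THRESHOLD) -> bool:
    """Return True if enough pixels differ to justify a Gemini API call.

    Frames are HxWx3 uint8 arrays, e.g. np.array(screenshot) from an mss
    capture. A pixel counts as changed if any of its channels differ.
    """
    if prev.shape != curr.shape:
        return True  # resolution changed (monitor switch, etc.): always send
    changed = np.any(prev != curr, axis=-1)  # HxW boolean mask of changed pixels
    return bool(changed.mean() >= threshold)
```

Because only frames that pass this check reach Gemini, long idle stretches (reading, thinking, a paused video) cost zero API calls.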

Google Cloud Backend (the brain)

  • FastAPI server hosted on Google Cloud Shell with a persistent WebSocket endpoint
  • Gemini 2.0 Flash analyzes each screenshot and extracts structured context: app open, activity, errors visible, URLs, files
  • A rolling context manager stores observations in Firestore with timestamps, automatically dropping anything older than 1 minute
  • On wake word activation, the full context summary is sent to Gemini alongside the user's command — Gemini returns a narration + action + pixel coordinates
  • Cloud Storage logs every screenshot and action as an audit trail

The entire system communicates over a persistent WebSocket connection — no polling, no delays, real-time bidirectional flow.


The Challenges

1. API quota on the free tier: Gemini's free tier has strict rate limits. Without the pixel diff filter, ARGUS would exhaust its daily quota in under an hour. The solution was a NumPy-based frame comparison that only sends genuinely changed frames, cutting API usage by ~80%.

2. Coordinate precision: Gemini returns pixel coordinates for UI elements, but high-DPI Windows screens have a scaling factor. A coordinate Gemini thinks is at (450, 300) might actually be at (900, 600) on a 200% DPI screen. I solved this with a screen-resolution scaling layer in the executor.
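That scaling layer reduces to a thin mapping (a sketch; the helper name is hypothetical, and the scale factor would come from the Windows DPI APIs):

```python
def logical_to_physical(x: int, y: int, scale: float) -> tuple[int, int]:
    """Map a coordinate from the screenshot space Gemini analyzed to the
    physical pixel PyAutoGUI must actually click.

    `scale` is the display scaling factor, e.g. 2.0 on a 200% DPI screen.
    """
    return round(x * scale), round(y * scale)
```

With a 200% scaling factor, Gemini's (450, 300) lands on the physical pixel (900, 600).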

3. WebSocket stability: Long-running WebSocket connections drop silently. I built an automatic reconnection loop that retries with exponential backoff, so ARGUS never permanently loses its connection.
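A minimal sketch of such a reconnection loop (names are illustrative; `connect` stands in for whatever call opens the WebSocket and blocks until it drops):

```python
import random
import time

def next_delay(delay: float, max_delay: float = 60.0) -> float:
    """Double the backoff delay, capped at max_delay."""
    return min(delay * 2, max_delay)

def run_forever(connect, base: float = 1.0, max_delay: float = 60.0) -> None:
    """Keep the connection alive indefinitely.

    `connect` opens the WebSocket and blocks while connected, raising (or
    returning) when the connection drops.
    """
    delay = base
    while True:
        try:
            connect()      # blocks while connected
            delay = base   # clean disconnect: reset the backoff
        except Exception:
            # jitter avoids synchronized reconnect storms across clients
            time.sleep(delay + random.uniform(0, delay / 2))
            delay = next_delay(delay, max_delay)
```

The cap matters: without it, a backend outage of a few minutes would push the retry interval into hours.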

4. The demo problem: Proving ambient awareness in a short video is hard — you can't film 5 minutes of context building and then a 4-minute demo. Solution: compress the context window to 1 minute for the demo, and film the entire context-building phase on camera so judges see it happen live, not just claimed.


What I Learned

  • Gemini 2.0 Flash's vision accuracy for coordinate detection is genuinely impressive — it correctly identifies UI elements in complex, cluttered screens with high confidence
  • Ambient agents are fundamentally different from reactive agents in architecture — the challenge isn't the AI logic, it's the data pipeline that feeds it
  • The most important engineering decision was the pixel diff filter — a single algorithmic choice that made free-tier operation viable
  • WebSocket reconnection logic is unglamorous but critical for any real-world agent deployment

What's Next for ARGUS

  • Vertex AI embeddings for long-term memory beyond 1 minute — ARGUS should remember how you fixed that login bug last month
  • Native Gemini Live API streaming for true real-time interruption handling
  • Multi-monitor support
  • Mobile screen support via ADB
  • Cross-app workflow chaining — one voice command that spans Gmail → Sheets → Slack

In Greek mythology, Argus Panoptes had 100 eyes and never slept. He saw everything. That's the philosophy.

Most AI is reactive. ARGUS never stops watching.
