Inspiration
Most AI assistants are trapped inside a chat box. They can tell you how to book a flight or fill a form — but they can't just do it. We wanted to build something that works the way a capable human colleague would: you say the goal out loud, and it takes over the screen and handles it end-to-end, on any website or desktop app, without needing browser plugins, DOM access, or site-specific code.
What it does
Cortex41 is a voice-driven, universal desktop and browser automation agent. You speak (or type) a goal — "Book the cheapest flight to Dubai next Friday" — and Cortex41:
Plans: Gemini 3 Flash decomposes your goal into an ordered list of atomic sub-tasks (find flights, select cheapest, fill passenger details, confirm) — once per goal.
Sees: For each step, it captures a full screenshot and sends it to Gemini with a coordinate grid overlay (see the sketch after this list). Gemini visually reasons about what it sees and decides the next action.
Acts: The agent executes the action — mouse click, keyboard input, scroll, open URL, launch app — directly on the desktop.
Narrates: Voice feedback via the Gemini Live API keeps you informed in real time. You can interrupt mid-task with a new goal at any moment.
Learns: After every successful task, Gemini Pro reflects on what it learned and distills a reusable skill (e.g., "On Kayak, the cheapest flight tile always has a green badge — click that first"). On future similar tasks, that skill is automatically injected into the planner's context, so Cortex41 gets measurably smarter with every run.
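To make the "Sees" step concrete, here is a minimal sketch of a coordinate grid overlay using Pillow — the helper name, grid spacing, and colors are illustrative, not the project's actual code:

```python
from PIL import Image, ImageDraw

def add_coordinate_grid(img: Image.Image, spacing: int = 100) -> Image.Image:
    """Overlay labeled gridlines so the model can name pixel targets."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for x in range(0, w, spacing):
        draw.line([(x, 0), (x, h)], fill="red", width=1)
        draw.text((x + 2, 2), str(x), fill="red")   # column label
    for y in range(0, h, spacing):
        draw.line([(0, y), (w, y)], fill="red", width=1)
        draw.text((2, y + 2), str(y), fill="red")   # row label
    return out
```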
How we built it
The core idea was simple: give Gemini a screenshot and ask it what to do next. Everything else was figuring out how to make that loop fast, reliable, and actually useful.
We started with a two-level execution model. A planner powered by Gemini Pro takes the user's goal, looks at the current screen, and breaks it down into a small number of concrete sub-tasks. That happens once. Then a tighter loop runs for each sub-task — capture the screen, ask Gemini what action to take, execute it, repeat. This separation meant the planner could think carefully once while the executor moved fast.
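In sketch form, the two-level model looks roughly like this — all helper names are hypothetical stand-ins for the real vision and action layers:

```python
import asyncio

# Hypothetical helpers standing in for the real screenshot/model/action code.
def capture_screen() -> bytes: ...
async def plan(goal: str, screen: bytes) -> list[str]: ...
async def decide(subtask: str, screen: bytes) -> dict: ...
def execute(action: dict) -> bool: ...

async def run_goal(goal: str) -> None:
    # Plan once: a single careful model call decomposes the goal
    subtasks = await plan(goal, capture_screen())
    # Then a tight see -> decide -> act loop runs per sub-task
    for subtask in subtasks:
        done = False
        while not done:
            screen = capture_screen()               # see
            action = await decide(subtask, screen)  # decide (model call or cache hit)
            done = execute(action)                  # act: click / type / scroll
```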
The hardest problem early on was navigation reliability. pyautogui pixel-coordinate clicking on a full macOS desktop is brittle — Retina scaling, Mission Control spaces, and wrong-window captures all break it in different ways. We solved this by adding a Playwright browser layer that runs its own isolated Chromium viewport at a fixed 1280×800. For any web task, the agent uses that viewport instead of the desktop. Playwright's accessibility tree gives Gemini the names and roles of every interactive element on the page, so instead of guessing coordinates it can say "click the link named Attention Is All You Need" and have that work deterministically.
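With Playwright's Python API, that deterministic targeting is a one-liner: locate an element by its accessible role and name rather than by pixels. A minimal sketch (the URL and link text are illustrative):

```python
import asyncio
from playwright.async_api import async_playwright

async def demo() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page(viewport={"width": 1280, "height": 800})
        await page.goto("https://example.org")  # illustrative target page
        # Click by accessible role/name instead of guessing pixel coordinates
        await page.get_by_role("link", name="Attention Is All You Need").click()
        await browser.close()

asyncio.run(demo())
```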
To keep costs down we built a two-tier semantic cache. Before calling Gemini on any step, we hash the screenshot with pHash and check if we've seen a visually identical state before. If that misses, we generate an embedding and query Firestore for a semantically similar past action. Most repetitive sub-steps — navigating to the same page, clicking the same button — get served from cache without a model call at all. A model router then decides whether each step that does need a model call should go to Flash or Pro based on confidence, page complexity, and action type. In practice around 70% of steps go to Flash.
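A simplified sketch of the first cache tier, assuming the imagehash library (the cache shape and helper names are ours, and the embedding tier is only indicated in a comment):

```python
import imagehash
from PIL import Image

# Tier 1: exact visual-state cache keyed by perceptual hash
phash_cache: dict[str, dict] = {}

def cached_action(screenshot: Image.Image) -> dict | None:
    key = str(imagehash.phash(screenshot))
    if key in phash_cache:
        return phash_cache[key]  # visually identical state seen before
    # Tier 2 (not shown): embed the state and query Firestore for a
    # semantically similar past action before paying for a model call
    return None

def remember(screenshot: Image.Image, action: dict) -> None:
    phash_cache[str(imagehash.phash(screenshot))] = action
```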
The skill system came out of noticing that the agent kept relearning the same things. After every successful task, Gemini Pro reads through what happened and extracts any non-obvious knowledge — quirks of a specific site's UI, recovery strategies for common failure modes, multi-step patterns that aren't obvious from the goal alone. Those get stored in Firestore as embeddings and injected back into the planner the next time a similar goal comes in.
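Skill retrieval at plan time can be as simple as a similarity scan over stored embeddings. Here is an in-memory sketch using cosine similarity — a simplification of the actual Firestore-backed lookup, with thresholds and names chosen for illustration:

```python
import numpy as np

def relevant_skills(goal_emb: np.ndarray,
                    skills: list[tuple[np.ndarray, str]],
                    k: int = 3, threshold: float = 0.75) -> list[str]:
    """Return the k stored skills most similar to the goal, above a floor."""
    scored = []
    for emb, text in skills:
        sim = float(goal_emb @ emb /
                    (np.linalg.norm(goal_emb) * np.linalg.norm(emb)))
        if sim >= threshold:
            scored.append((sim, text))
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
```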
Voice was wired in through the Gemini Live API, which handles the full bidirectional audio stream. The agent transcribes speech, feeds it into the same goal pipeline, and can be interrupted mid-task simply by speaking a new goal.
Challenges we ran into
macOS Space switching: pyautogui actions caused macOS to switch virtual desktops, breaking the screenshot feedback loop. Fixed by using mss for screen capture (it bypasses the compositor) and carrying screenshots between actions rather than recapturing immediately after each one.
Loop detection: Without careful tracking, the agent could click the same button repeatedly when a page doesn't change. We added multi-signal loop detection — consecutive cache hits, repeated narration strings, stale screenshot hashes, and a step-counter cap — with a graceful recovery strategy for each signal.
WebSocket + asyncio: Firestore client initialization is synchronous and blocks the event loop for about a second, causing the WebSocket to time out before the agent was ready. Solved by running construction in asyncio.to_thread (see the sketch below).
Gemini Live API interruption: Coordinating voice interruptions mid-task required careful async task management — the voice handler needs to cancel the running goal task and start a new one without leaving the desktop in a broken state.
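The last two fixes are plain asyncio patterns. A sketch of both — the class and function names are ours, not the project's, and run_goal refers to the earlier sketch:

```python
import asyncio
from google.cloud import firestore

async def make_firestore_client() -> firestore.Client:
    # Client construction is synchronous and slow, so run it in a worker
    # thread instead of stalling the WebSocket event loop
    return await asyncio.to_thread(firestore.Client)

class GoalRunner:
    """Lets a new voice goal cleanly preempt the one in flight."""
    def __init__(self) -> None:
        self.current: asyncio.Task | None = None

    async def start(self, goal: str) -> None:
        if self.current and not self.current.done():
            self.current.cancel()    # interrupt the running goal task
            try:
                await self.current   # wait for its cleanup to finish
            except asyncio.CancelledError:
                pass  # run_goal's finally block restores desktop state
        self.current = asyncio.create_task(run_goal(goal))
```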
Accomplishments that we're proud of
A fully working end-to-end loop: voice goal → hierarchical plan → vision reasoning → desktop action → skill extraction → improved next run
The two-tier semantic cache genuinely works — repeat steps within a session hit the pHash cache with sub-millisecond latency
The skill system progressively improves the agent across sessions with no manual intervention
Production-quality deployment pipeline: Terraform IaC, Cloud Run, Cloud Build CI/CD — ready to scale
What we learned
Pure vision-based control (no DOM/accessibility tree) is more capable than expected — Gemini 3 Flash can reliably reason about pixel coordinates from screenshot grids
Separating planning from execution (two-level hierarchy) is essential for reliability; re-planning every step is wasteful and less coherent
Perceptual hashing is a surprisingly effective first-pass cache — it catches the vast majority of repeated visual states at zero API cost
Gemini Live API's bidirectional streaming makes natural voice interaction feel genuinely different from push-to-talk
What's next for Cortex41
Multi-monitor support
CAPTCHA detection and human-handoff flow
Cloud-hosted browser option (for headless deployments)
Fine-tuned skill extraction prompts per application category
Shared public skill library (opt-in crowd-sourced knowledge base)
Built With
- cloud-build
- docker
- fastapi
- gemini-3-flash-preview
- gemini-live-api
- gemini-native-audio
- google-cloud-firestore
- google-cloud-run
- mss
- phash
- pyautogui
- python-playwright
- react
- terraform
- vite
- websockets