Lobster: The First Native Live Agent Browser

[Figure: Architecture]

Inspiration

As a solo developer, I kept wishing for a colleague: someone to help with my workflow and monitor things in the background while I focused on deep work. I was frustrated by browser extensions and AI copilots that sit on top of existing browsers. They can read a page and maybe summarize it, but they can't truly act. They can't open new tabs, navigate autonomously, fill out forms on LinkedIn while I read Reddit, or check a website every 5 minutes and tell me what changed without interrupting my flow.

I asked myself: What if the AI wasn't bolted onto the browser — what if the AI was the browser?

That question led me to build Lobster — a desktop browser built from scratch in Electron where the AI agent is a first-class citizen with its own tabs, its own vision, and its own voice. You talk to Lobster like talking to a real colleague. It talks back. And it works.

What it does

Lobster is the world's first native live-agent browser. It combines:

Always-on voice conversation — powered by Gemini Live API's bidirectional streaming. No push-to-talk, no wake words needed. Just speak naturally. Lobster hears you, understands context, and responds with personality.
Autonomous browser control — Lobster opens its own background tabs, navigates websites, clicks buttons, fills forms, types messages, draws on canvases, and scrolls pages — all without touching your active tab.
Vision-based understanding: every action is guided by screenshots plus a numbered DOM element map. The agent sees the page and clicks elements by reference ID, so each click resolves to exactly one element instead of depending on fragile CSS selectors.
Multi-tab parallel execution (Tab Swarm) — say "Compare prices on Amazon, eBay, and Walmart" and Lobster opens 3 tabs simultaneously, gathers data in parallel, and synthesizes results.
Scheduled monitoring (Cron) — say "Check Reddit for new posts every 5 minutes" and Lobster runs the task on autopilot, proactively notifying you when something changes.
Creative capabilities — Lobster can draw on Excalidraw, generate AI images via Gemini Imagen, and display results in a built-in Gallery.
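
The scheduled-monitoring behavior can be pictured as a polling loop that fingerprints the page content and speaks up only when the fingerprint changes. The following is a minimal Python sketch under stated assumptions, not Lobster's actual code: `monitor`, `fetch_snapshot`, and `notify` are hypothetical names, and the fake fetcher stands in for the agent reading a real page.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, List, Optional


def fingerprint(text: str) -> str:
    """Return a stable hash of the page text so changes are cheap to detect."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


async def monitor(
    fetch_snapshot: Callable[[], Awaitable[str]],  # hypothetical: returns page text
    notify: Callable[[str], None],                 # hypothetical: tells the user
    interval_seconds: float,
    max_checks: int,
) -> int:
    """Poll the page and notify only when its content differs from the last check."""
    last: Optional[str] = None
    changes = 0
    for check in range(max_checks):
        text = await fetch_snapshot()
        current = fingerprint(text)
        # The first check only establishes a baseline; later checks compare to it.
        if last is not None and current != last:
            changes += 1
            notify(text)
        last = current
        if check < max_checks - 1:
            await asyncio.sleep(interval_seconds)
    return changes


def run_demo() -> List[str]:
    """Simulate four checks of a page whose content changes once."""
    pages = iter(["post A", "post A", "post A, post B", "post A, post B"])
    seen: List[str] = []

    async def fake_fetch() -> str:
        return next(pages)

    asyncio.run(monitor(fake_fetch, seen.append, interval_seconds=0.0, max_checks=4))
    return seen
```

In this sketch the user would hear about the page once, on the third check, and the unchanged checks stay silent. A real monitor would use an interval such as 300 seconds and run until cancelled.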

How I built it

Two-Brain Architecture

The core innovation is splitting the agent into two specialized brains:

Brain 1 — The Conductor (Gemini Live API, gemini-2.5-flash-native-audio)

Maintains a real-time bidirectional voice conversation with the user
Handles personality, context, and task routing
Delegates browser tasks to the Executor via tool calls
Receives screenshots from the Executor to stay visually informed

Brain 2 — The Executor (Google GenAI SDK, gemini-2.5-flash with vision)

Receives screenshots + DOM element maps from browser tabs
Plans and executes multi-step browser automation
Uses ReAct reasoning (Observe → Think → Act → Verify)
Reports results back to the Conductor, who speaks them to the user
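
The Executor's Observe → Think → Act → Verify cycle can be sketched as a plain loop. This is an illustrative Python sketch under assumptions, not the project's implementation: `Observation`, `Action`, and `run_task` are hypothetical names, the screenshot is omitted, and `think` is a hard-coded stand-in for the vision model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Observation:
    """What the Executor sees at each step (screenshot omitted in this sketch)."""
    url: str
    element_map: Dict[int, str]


@dataclass
class Action:
    kind: str                        # e.g. "click" or "done"
    target_id: Optional[int] = None  # numbered element reference, if any


def run_task(
    observe: Callable[[], Observation],
    think: Callable[[Observation], Action],
    act: Callable[[Action], None],
    verify: Callable[[Observation, Action, Observation], bool],
    max_steps: int = 10,
) -> List[str]:
    """Run Observe -> Think -> Act -> Verify until the planner says it is done."""
    log: List[str] = []
    for _ in range(max_steps):
        before = observe()                  # Observe
        action = think(before)              # Think
        if action.kind == "done":
            log.append("done")
            return log
        act(action)                         # Act
        after = observe()
        ok = verify(before, action, after)  # Verify
        log.append(f"{action.kind}:{action.target_id}:{'ok' if ok else 'retry'}")
    log.append("gave_up")
    return log


def demo() -> List[str]:
    """Drive the loop against a fake two-page site."""
    state = {"url": "https://example.com/search"}

    def observe() -> Observation:
        if state["url"].endswith("/search"):
            return Observation(state["url"], {15: "Search Input", 42: "Add to Cart button"})
        return Observation(state["url"], {})

    def think(obs: Observation) -> Action:
        # Stand-in for the model: click "Add to Cart" if it is visible, else stop.
        for element_id, label in obs.element_map.items():
            if "Add to Cart" in label:
                return Action("click", target_id=element_id)
        return Action("done")

    def act(action: Action) -> None:
        if action.kind == "click" and action.target_id == 42:
            state["url"] = "https://example.com/cart"

    def verify(before: Observation, action: Action, after: Observation) -> bool:
        # A click counts as verified here if the page changed.
        return after.url != before.url

    return run_task(observe, think, act, verify)
```

The `max_steps` cap is a design choice of the sketch: it keeps a confused planner from looping forever, and the final log is what the Executor would hand back to the Conductor.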

Tech Stack

Frontend: Electron 40 + React 19 + TypeScript + Framer Motion + Tailwind CSS 4
Backend: FastAPI + Google ADK (Agent Development Kit) + Google GenAI SDK
Cloud: Google Cloud Run (backend hosting) + Firestore (session memory) + Cloud Storage (screenshot archive) + Vertex AI (production model access)
Infrastructure: Terraform + Cloud Build + deploy.sh one-click deployment

Element Map System

Instead of fragile CSS selectors or XPath, I built a numbered element reference system. Before each action, the browser scans the DOM and assigns every interactive element a data-lobster-id. The agent sees a clean, simplified mapping of IDs to actionable items (e.g., [42] -> "Add to Cart" button, [15] -> "Search Input"). Rather than hallucinating complex XPath queries or guessing dynamic CSS classes, the agent simply returns {"action": "click", "target_id": 42}. Because the IDs are reassigned on every scan, Lobster stays fast and accurate and is largely unaffected by website redesigns or React/Tailwind class obfuscation.
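
To make the mapping concrete, here is a small Python sketch of the idea, assuming the DOM scan has already produced a flat list of elements. The names `Element`, `build_element_map`, `render_for_model`, and `resolve_action` are hypothetical, and the real scan runs inside the browser and writes the ID to the data-lobster-id attribute.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Element:
    tag: str
    label: str
    visible: bool = True
    interactive: bool = True


def build_element_map(elements: List[Element]) -> Dict[int, Element]:
    """Number visible, interactive, labeled elements; drop the rest as noise."""
    kept = [e for e in elements if e.visible and e.interactive and e.label.strip()]
    return {index: element for index, element in enumerate(kept, start=1)}


def render_for_model(element_map: Dict[int, Element]) -> str:
    """Format the map as the compact text the agent reads next to the screenshot."""
    return "\n".join(
        f'[{index}] -> "{element.label}" {element.tag}'
        for index, element in element_map.items()
    )


def resolve_action(action: Dict[str, Any], element_map: Dict[int, Element]) -> Element:
    """Turn the agent's {"action": "click", "target_id": N} reply into an element."""
    if action.get("action") != "click":
        raise ValueError(f"unsupported action: {action.get('action')!r}")
    target_id = action.get("target_id")
    if target_id not in element_map:
        # Reject IDs the model invented instead of clicking something arbitrary.
        raise KeyError(f"unknown target_id: {target_id!r}")
    return element_map[target_id]


# Example scan result: two usable controls and two pieces of noise.
scanned = [
    Element("input", "Search"),
    Element("div", "ad tracker", visible=False),
    Element("svg", "", interactive=False),
    Element("button", "Add to Cart"),
]
element_map = build_element_map(scanned)
prompt_text = render_for_model(element_map)
clicked = resolve_action({"action": "click", "target_id": 2}, element_map)
```

Validating `target_id` against the current map is the point of the design: a reply that names an ID not on the page fails loudly instead of triggering a wrong click.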

Challenges I ran into

Building a native AI browser from scratch as a solo developer brought completely different challenges than a standard web app:

Handling the Live Voice Stream Sync: Keeping the Gemini Live API bidirectional audio stream open and highly responsive while the "Executor" brain was doing heavy DOM parsing and tab switching required complex asynchronous queuing. I had to ensure the "Conductor" didn't freeze or drop context while the Executor was working in the background.
DOM Noise Reduction: Modern websites are incredibly noisy (hidden elements, ad trackers, complex SVG trees). I spent a significant amount of time refining my DOM parser to strip out the garbage and only send a semantic, actionable "Element Map" to the Executor. This prevented blowing through token limits and kept reasoning latency low.
Swarm State Synchronization: Orchestrating multiple hidden browser tabs concurrently and aggregating their asynchronous results back into a single, coherent voice response from the Conductor was a massive state-management and orchestration challenge.
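
The swarm aggregation problem can be sketched with `asyncio.gather`: each tab runs as its own coroutine, failures are captured per tab, and the results are folded into one sentence for the Conductor to speak. This is a simplified Python sketch under assumptions, not the project's code. `run_swarm`, `summarize`, and `check_price` are hypothetical names, and the fake price lookup stands in for a real background tab.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional


@dataclass
class TabResult:
    site: str
    price: Optional[float]
    error: Optional[str] = None


async def run_swarm(
    sites: List[str],
    check_price: Callable[[str], Awaitable[float]],  # hypothetical per-tab task
    timeout_seconds: float,
) -> List[TabResult]:
    """Run one task per site concurrently; a failing tab never sinks the others."""

    async def run_one(site: str) -> TabResult:
        try:
            price = await asyncio.wait_for(check_price(site), timeout=timeout_seconds)
            return TabResult(site, price)
        except Exception as exc:  # record the failure instead of propagating it
            return TabResult(site, None, error=type(exc).__name__)

    # gather returns results in the same order as the input sites.
    return list(await asyncio.gather(*(run_one(site) for site in sites)))


def summarize(results: List[TabResult]) -> str:
    """Fold per-tab results into one sentence for the voice response."""
    ok = [r for r in results if r.price is not None]
    failed = [r.site for r in results if r.price is None]
    if not ok:
        return "No prices found."
    best = min(ok, key=lambda r: r.price)
    summary = f"Cheapest is {best.site} at ${best.price:.2f}."
    if failed:
        summary += f" Could not check: {', '.join(failed)}."
    return summary


def demo() -> str:
    """Simulate three tabs, one of which fails."""
    prices = {"Amazon": 19.99, "eBay": 17.5}

    async def fake_check(site: str) -> float:
        await asyncio.sleep(0)  # yield control, as a real tab would while loading
        return prices[site]     # raises KeyError for an unknown site

    results = asyncio.run(run_swarm(["Amazon", "eBay", "Walmart"], fake_check, 1.0))
    return summarize(results)
```

Capturing errors inside `run_one` keeps a single slow or broken site from discarding the data the other tabs already gathered, so the spoken summary can still report partial results.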

Accomplishments that I'm proud of

True Zero-Click Browsing: Successfully completing end-to-end tasks (like executing a Tab Swarm to check prices across three different e-commerce sites and synthesizing the result) using strictly voice commands and agent autonomy.
The "Interruption" Flow: Nailing the UX. You can interrupt Lobster mid-sentence, change its task while it's parsing a site, and have it seamlessly pivot to the new context via the Live API.
Production-Ready Desktop App: I didn't just build a script; I shipped a full Electron desktop app backed by a scalable Google Cloud Run and Vertex AI infrastructure, proving the concept is viable for production.

What I learned

I learned that the true power of the Gemini Live API isn't just in making chatbots "sound human"—it's in using voice as the ultimate orchestrator. By separating the "Voice/Reasoning" brain from the "Execution/Vision" brain, I discovered a highly scalable, robust pattern for building agents that don't just talk about the web, but actively work on it.

What's next for Lobster

I am moving from a "web browser" to a "life orchestrator."

OS-Level Control: Breaking Lobster out of the browser sandbox to manipulate local desktop applications, files, and terminal environments.
Persistent Memory: Expanding the Firestore memory so Lobster remembers your preferences across sessions (e.g., "Always decline optional cookies," "I prefer dark mode," "Book middle seats on flights").
Custom Executor Scripts: Allowing developers to write custom routines for the Executor brain, tailoring Lobster to specific, highly complex workflows.

Built With

cloud-storage
electron
fastapi
firestore
framer-motion
gemini-2.5-flash
gemini-live-api
google-adk
google-cloud-run
google-genai-sdk
python
react
tailwind-css
terraform
typescript
vertex-ai
websocket

Updates

Bartosz Idzik started this project — Mar 15, 2026 07:34 AM EDT
