Inspiration

As a solo developer, I constantly felt like I needed a colleague—someone to help me with my workflow and monitor things in the background while I focused on deep work. I was frustrated by the limitations of browser extensions and AI copilots that sit on top of existing browsers. They can read the page, maybe summarize it — but they can't truly act. They can't open new tabs, navigate autonomously, fill out forms on LinkedIn while I am reading Reddit, or monitor a website every 5 minutes and tell me what changed without interrupting my flow.

I asked myself: What if the AI wasn't bolted onto the browser — what if the AI was the browser?

That question led me to build Lobster — a desktop browser built from scratch in Electron where the AI agent is a first-class citizen with its own tabs, its own vision, and its own voice. You talk to Lobster like talking to a real colleague. It talks back. And it works.

What it does

Lobster is the world's first native live-agent browser. It combines:

  • Always-on voice conversation — powered by Gemini Live API's bidirectional streaming. No push-to-talk, no wake words needed. Just speak naturally. Lobster hears you, understands context, and responds with personality.
  • Autonomous browser control — Lobster opens its own background tabs, navigates websites, clicks buttons, fills forms, types messages, draws on canvases, and scrolls pages — all without touching your active tab.
  • Vision-based understanding: every action is guided by a screenshot plus a numbered DOM element map. The agent sees the page and clicks elements by reference ID, so it never depends on fragile CSS selectors.
  • Multi-tab parallel execution (Tab Swarm) — say "Compare prices on Amazon, eBay, and Walmart" and Lobster opens 3 tabs simultaneously, gathers data in parallel, and synthesizes results.
  • Scheduled monitoring (Cron) — say "Check Reddit for new posts every 5 minutes" and Lobster runs the task on autopilot, proactively notifying you when something changes.
  • Creative capabilities — Lobster can draw on Excalidraw, generate AI images via Gemini Imagen, and display results in a built-in Gallery.
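
The scheduled-monitoring feature comes down to "fetch, compare with last time, notify on a difference." Here is a minimal sketch of that comparison step, not Lobster's actual code. The ChangeMonitor class and its fetch callback are hypothetical names, and it assumes the watched content can be reduced to a string.

```python
import hashlib
from typing import Callable, Optional


class ChangeMonitor:
    """Remembers the digest of the last snapshot and reports when it differs."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        # Hypothetical callback: returns the text of the page being watched.
        self._fetch = fetch
        self._last_digest: Optional[str] = None

    def check(self) -> bool:
        """Fetch once; return True only if the content changed since the last check."""
        digest = hashlib.sha256(self._fetch().encode("utf-8")).hexdigest()
        # The very first check has nothing to compare against, so it never fires.
        changed = self._last_digest is not None and digest != self._last_digest
        self._last_digest = digest
        return changed
```

A scheduler would call check() on the requested interval (every 5 minutes in the Reddit example) and only hand the result to the voice agent when it returns True, which is what keeps the monitoring from interrupting the user.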

How I built it

Two-Brain Architecture

The core innovation is splitting the agent into two specialized brains:

Brain 1 — The Conductor (Gemini Live API, gemini-2.5-flash-native-audio)

  • Maintains a real-time bidirectional voice conversation with the user
  • Handles personality, context, and task routing
  • Delegates browser tasks to the Executor via tool calls
  • Receives screenshots from the Executor to stay visually informed

Brain 2 — The Executor (Google GenAI SDK, gemini-2.5-flash with vision)

  • Receives screenshots + DOM element maps from browser tabs
  • Plans and executes multi-step browser automation
  • Uses ReAct reasoning (Observe → Think → Act → Verify)
  • Reports results back to the Conductor, who speaks them to the user
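
The Executor's Observe → Think → Act → Verify cycle can be sketched as a bounded loop. This is an illustration under assumptions rather than the real implementation: observe, think, and act are hypothetical callbacks standing in for the screenshot capture, the Gemini vision call, and the browser action, and verification is modeled as the next observation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Action = Dict[str, object]


@dataclass
class Observation:
    screenshot: bytes              # raw image bytes of the tab
    element_map: Dict[int, str]    # reference ID -> human-readable label


@dataclass
class ExecutorResult:
    done: bool
    steps: List[Action] = field(default_factory=list)


def run_executor(
    task: str,
    observe: Callable[[], Observation],
    think: Callable[[str, Observation, List[Action]], Optional[Action]],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> ExecutorResult:
    """Run the loop until think() returns None (task complete) or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = observe()                     # Observe (also verifies the previous action)
        action = think(task, obs, history)  # Think: the model picks the next action
        if action is None:
            return ExecutorResult(done=True, steps=history)
        act(action)                         # Act: perform it in the background tab
        history.append(action)
    # Step budget exhausted without the model declaring the task complete.
    return ExecutorResult(done=False, steps=history)
```

The step cap is a design choice in this sketch: it guarantees the Executor eventually reports back to the Conductor even if the model never decides the task is finished.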

Tech Stack

  • Frontend: Electron 40 + React 19 + TypeScript + Framer Motion + Tailwind CSS 4
  • Backend: FastAPI + Google ADK (Agent Development Kit) + Google GenAI SDK
  • Cloud: Google Cloud Run (backend hosting) + Firestore (session memory) + Cloud Storage (screenshot archive) + Vertex AI (production model access)
  • Infrastructure: Terraform + Cloud Build + deploy.sh one-click deployment

Element Map System

Instead of fragile CSS selectors or XPath, I built a numbered element reference system. Before each action, the browser scans the DOM and assigns every interactive element a data-lobster-id. The agent sees a clean, simplified mapping of IDs to actionable items (e.g., [42] -> "Add to Cart" button, [15] -> "Search Input"). Instead of hallucinating complex XPath queries or guessing dynamic CSS classes, the agent simply returns {"action": "click", "target_id": 42}. Because the IDs are reassigned from the live DOM before every action, this keeps Lobster fast and accurate, and resilient to website redesigns and React/Tailwind class obfuscation.
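
To make the idea concrete, here is a small Python model of the element map. It is a sketch under assumptions: in Lobster the scan runs inside the browser against the real DOM, whereas here a hypothetical Node dataclass stands in for DOM elements, and the INTERACTIVE_TAGS set is an illustrative guess at what counts as interactive.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: a simplified notion of "interactive" for illustration only.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


@dataclass
class Node:
    tag: str
    label: str
    visible: bool = True


def build_element_map(nodes: List[Node]) -> Dict[int, Node]:
    """Drop hidden and non-interactive nodes, then number the rest from 1."""
    interactive = [n for n in nodes if n.visible and n.tag in INTERACTIVE_TAGS]
    return {ref_id: node for ref_id, node in enumerate(interactive, start=1)}


def render_for_model(element_map: Dict[int, Node]) -> str:
    """Format the map as the compact text listing the agent reads."""
    return "\n".join(f'[{i}] -> "{n.label}" {n.tag}' for i, n in element_map.items())


def resolve(action: Dict[str, object], element_map: Dict[int, Node]) -> Node:
    """Turn the agent's {"action": ..., "target_id": ...} reply back into a node."""
    target_id = action["target_id"]
    if target_id not in element_map:
        raise KeyError(f"unknown target_id: {target_id!r}")
    return element_map[target_id]
```

Rejecting unknown IDs in resolve() matters in practice: if the model returns an ID that is not in the current map, the action fails loudly instead of clicking the wrong element.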

Challenges I ran into

Building a native AI browser from scratch as a solo developer brought completely different challenges than a standard web app:

  1. Handling the Live Voice Stream Sync: Keeping the Gemini Live API bidirectional audio stream open and highly responsive while the "Executor" brain was doing heavy DOM parsing and tab switching required complex asynchronous queuing. I had to ensure the "Conductor" didn't freeze or drop context while the Executor was working in the background.
  2. DOM Noise Reduction: Modern websites are incredibly noisy (hidden elements, ad trackers, complex SVG trees). I spent a significant amount of time refining my DOM parser to strip out the garbage and only send a semantic, actionable "Element Map" to the Executor. This prevented blowing through token limits and kept reasoning latency low.
  3. Swarm State Synchronization: Orchestrating multiple hidden browser tabs concurrently and aggregating their asynchronous results back into a single, coherent voice response from the Conductor was a massive state-management and orchestration challenge.
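
The swarm aggregation problem in item 3 can be sketched with asyncio. This is one possible shape, not the code Lobster ships: run_swarm and the run_tab callback are hypothetical names, and each tab is given a timeout so one slow or failing site cannot block the combined answer.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_swarm(
    sites: List[str],
    run_tab: Callable[[str], Awaitable[str]],
    timeout_s: float = 30.0,
) -> Dict[str, str]:
    """Run one background-tab task per site concurrently and collect every outcome."""

    async def guarded(site: str) -> str:
        # Convert timeouts and errors into a result string so the swarm
        # always returns one entry per site.
        try:
            return await asyncio.wait_for(run_tab(site), timeout=timeout_s)
        except Exception as exc:
            return f"failed: {exc!r}"

    # gather() preserves input order, so results line up with sites.
    results = await asyncio.gather(*(guarded(site) for site in sites))
    return dict(zip(sites, results))
```

If the swarm is started with asyncio.create_task rather than awaited directly, the coroutine that services the voice stream keeps running while the tabs work, which is the behavior item 1 describes.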

Accomplishments that I'm proud of

  • True Zero-Click Browsing: Successfully completing end-to-end tasks (like executing a Tab Swarm to check prices across three different e-commerce sites and synthesizing the result) using strictly voice commands and agent autonomy.
  • The "Interruption" Flow: Nailing the UX. You can interrupt Lobster mid-sentence, change its task while it's parsing a site, and have it seamlessly pivot to the new context via the Live API.
  • Production-Ready Desktop App: I didn't just build a script; I shipped a full Electron desktop app backed by a scalable Google Cloud Run and Vertex AI infrastructure, proving the concept is viable for production.

What I learned

I learned that the true power of the Gemini Live API isn't just in making chatbots "sound human"—it's in using voice as the ultimate orchestrator. By separating the "Voice/Reasoning" brain from the "Execution/Vision" brain, I discovered a highly scalable, robust pattern for building agents that don't just talk about the web, but actively work on it.

What's next for Lobster

I am moving from a "web browser" to a "life orchestrator."

  1. OS-Level Control: Breaking Lobster out of the browser sandbox to manipulate local desktop applications, files, and terminal environments.
  2. Persistent Memory: Expanding the Firestore memory so Lobster remembers your preferences across sessions (e.g., "Always decline optional cookies," "I prefer dark mode," "Book middle seats on flights").
  3. Custom Executor Scripts: Allowing developers to write custom routines for the Executor brain, tailoring Lobster to specific, highly complex workflows.

Built With

  • cloud-storage
  • electron
  • fastapi
  • firestore
  • framer-motion
  • gemini-2.5-flash
  • gemini-live-api
  • google-adk
  • google-cloud-run
  • google-genai-sdk
  • python
  • react
  • tailwind-css
  • terraform
  • typescript
  • vertex-ai
  • websocket