CANE

Browsers gave us access to the web, but not help doing the work inside it.

Most digital work now happens in the browser: research, operations, applications, onboarding, purchasing, support, data entry, and countless repetitive workflows. But even with modern AI, the browser still expects the user to do everything manually: open tabs, click through interfaces, copy information around, fill forms, dismiss blockers, and keep track of progress step by step.

That friction is a productivity problem for everyone, and an even bigger accessibility problem for people with motor limitations, fatigue, or repetitive strain, and for anyone who finds heavy typing and clicking difficult or exhausting.

CANE was built to extend the browser from a passive window into an active side-pilot.


Inspiration

We kept seeing the same pattern: AI assistants are good at answering questions, but when real work moves into the browser, they stop short. They tell you what to do, but you still have to do all the clicking, typing, tab-switching, and navigation yourself.

For a lot of modern workflows, that is the real bottleneck.

We wanted to build something that felt more useful than a chatbot and more adaptive than a rigid macro tool: an agent that can understand what is happening in a live browser, reason about the current page, and actually help complete the task.

For the Gemini Live Agent Challenge, that made CANE a natural fit for the UI Navigator category. The goal was to turn Gemini into an agent that can interpret a browser visually and structurally, then act on the user’s behalf in real time.


What it does

CANE is a browser agent that works alongside the user in a live browser session.

It can:

Capability | What CANE does
Understand intent | Takes a plain-language objective and turns it into actionable browser steps
Observe the page | Reads the DOM first and falls back to screenshots when the page structure is incomplete or misleading
Navigate interfaces | Moves through websites, web apps, multi-step flows, and tab-based workflows
Take real actions | Clicks, types, fills forms, opens tabs, downloads files, and verifies outcomes
Work in parallel | Acts like a browser side-pilot, helping in one tab while the user stays focused in another
Improve accessibility | Reduces repetitive clicking and typing, making browser workflows less manual and less exhausting
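
The "Observe the page" row describes a DOM-first strategy with a screenshot fallback. A minimal TypeScript sketch of that idea might look like the following. Every name here (DomSnapshot, domLooksUnreliable, observePage) and both thresholds are hypothetical illustrations, not taken from CANE's code, and the heuristic for an "incomplete or misleading" DOM is an assumption.

```typescript
// Hypothetical shapes for illustration only.
interface DomElementSummary {
  selector: string;
  role: string;
  label: string; // accessible name or visible text; may be empty
}

interface DomSnapshot {
  url: string;
  interactive: DomElementSummary[];
}

type Observation =
  | { kind: "dom"; snapshot: DomSnapshot }
  | { kind: "dom+screenshot"; snapshot: DomSnapshot; screenshotPng: Uint8Array; reason: string };

// Assumed heuristic: distrust the DOM when it exposes very few interactive
// elements, or when most of them carry no usable label.
function domLooksUnreliable(
  snapshot: DomSnapshot,
  minElements = 3,
  maxUnlabeledRatio = 0.5,
): string | null {
  const total = snapshot.interactive.length;
  if (total < minElements) return `only ${total} interactive elements found`;
  const unlabeled = snapshot.interactive.filter((e) => e.label.trim() === "").length;
  if (unlabeled / total > maxUnlabeledRatio) {
    return `${unlabeled} of ${total} interactive elements have no label`;
  }
  return null; // DOM looks usable on its own
}

// Read the DOM first; capture a screenshot only when the DOM looks unreliable.
async function observePage(
  readDom: () => Promise<DomSnapshot>,
  captureScreenshot: () => Promise<Uint8Array>,
): Promise<Observation> {
  const snapshot = await readDom();
  const reason = domLooksUnreliable(snapshot);
  if (reason === null) return { kind: "dom", snapshot };
  const screenshotPng = await captureScreenshot();
  return { kind: "dom+screenshot", snapshot, screenshotPng, reason };
}
```

Passing readDom and captureScreenshot in as functions keeps the sketch independent of any particular browser API, and skipping the screenshot on well-structured pages avoids sending image data the model does not need.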

Core value

Instead of saying, "Click here, then type this, then open that page," CANE can do those steps directly in the browser.


How we built it

We split CANE into two major systems: a local desktop execution layer and a cloud reasoning layer.

System overview

Layer | Responsibility | Tech
Desktop app | Hosts the live browser, UI, tabs, downloads, screenshots, and action execution | Electron, TypeScript
Agent backend | Runs the agent loop, orchestration, tool routing, and session management | Node.js, TypeScript
Agent framework | Defines the main agent plus specialist workers | Google ADK
Model layer | Provides reasoning and multimodal interpretation | Gemini via Vertex AI
Cloud hosting | Serves the backend and model-connected runtime | Google Cloud Run
Shared protocol | Keeps desktop and backend in sync over structured messages | WebSocket protocol package
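
To make the "structured messages" of the shared protocol concrete, here is a hedged sketch of what a typed message contract between desktop and backend could look like. The message names ("task.start", "tool.call", and so on) and their fields are assumptions for illustration; the actual CANE protocol package is not shown in this writeup.

```typescript
// Hypothetical message shapes; not the real CANE protocol.
type ClientMessage =
  | { type: "task.start"; sessionId: string; objective: string }
  | { type: "tool.result"; sessionId: string; callId: string; ok: boolean; output: string };

type ServerMessage =
  | { type: "tool.call"; sessionId: string; callId: string; tool: string; args: Record<string, unknown> }
  | { type: "task.done"; sessionId: string; summary: string };

// Serialize a message for sending as a WebSocket text frame.
function encode(msg: ClientMessage | ServerMessage): string {
  return JSON.stringify(msg);
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validate untrusted text from the socket before the desktop acts on it.
// Returns null for anything that is not a well-formed server message.
function parseServerMessage(raw: string): ServerMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isRecord(data)) return null;
  const { type, sessionId, callId, tool, args, summary } = data;
  if (typeof sessionId !== "string") return null;
  if (type === "tool.call" && typeof callId === "string" && typeof tool === "string" && isRecord(args)) {
    return { type: "tool.call", sessionId, callId, tool, args };
  }
  if (type === "task.done" && typeof summary === "string") {
    return { type: "task.done", sessionId, summary };
  }
  return null;
}
```

A discriminated union on the type field lets both sides switch exhaustively over message kinds, and a callId on each tool call lets the backend match a result to the request that produced it.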

Architecture flow

  1. The user gives CANE a task in the desktop app.
  2. The Electron client sends the task and browser context to the backend over WebSocket.
  3. The backend runs a Google ADK-based main_agent.
  4. Gemini reasons about the next best step.
  5. The agent either uses tools directly or delegates to specialist workers like:
    • explore_agent
    • data_agent
    • unblock_agent
  6. Tool calls are sent back to the Electron app, where the live browser actually exists.
  7. The desktop app executes the action, captures the result, and returns updated state.
  8. The backend feeds that state back into Gemini and continues until the objective is complete.
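
The flow above can be sketched as a small loop. This is an illustration under stated assumptions, not CANE's implementation: the AgentDeps interface is hypothetical, decide stands in for the Gemini call made through Google ADK, and executeOnDesktop stands in for the WebSocket round trip to the Electron app. Only the worker names (explore_agent, data_agent, unblock_agent) come from the description above.

```typescript
interface BrowserState {
  url: string;
  title: string;
}

type Decision =
  | { kind: "tool"; tool: string; args: Record<string, unknown> }
  | { kind: "delegate"; worker: "explore_agent" | "data_agent" | "unblock_agent"; goal: string }
  | { kind: "done"; summary: string };

// Hypothetical dependencies, injected so the loop itself stays testable.
interface AgentDeps {
  // Stands in for the model deciding the next best step (step 4).
  decide(objective: string, state: BrowserState, history: string[]): Promise<Decision>;
  // Stands in for sending a tool call to the desktop and awaiting updated state (steps 6-7).
  executeOnDesktop(tool: string, args: Record<string, unknown>): Promise<BrowserState>;
  // Stands in for a specialist worker handling a bounded sub-task (step 5).
  runWorker(worker: string, goal: string, state: BrowserState): Promise<BrowserState>;
}

async function runTask(
  objective: string,
  initial: BrowserState,
  deps: AgentDeps,
  maxSteps = 20, // assumed safety budget so a confused agent cannot loop forever
): Promise<{ summary: string; steps: number; history: string[] }> {
  let state = initial;
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const decision = await deps.decide(objective, state, history);
    if (decision.kind === "done") {
      return { summary: decision.summary, steps: step, history };
    }
    if (decision.kind === "tool") {
      state = await deps.executeOnDesktop(decision.tool, decision.args);
      history.push(`tool:${decision.tool} -> ${state.url}`);
    } else {
      state = await deps.runWorker(decision.worker, decision.goal, state);
      history.push(`worker:${decision.worker} -> ${state.url}`);
    }
    // The fresh state is fed back into the next decide() call (step 8).
  }
  throw new Error(`step budget of ${maxSteps} exhausted before the objective was met`);
}
```

The key property is that every decision is made against the state returned by the most recent action, never against a state the model remembered from earlier in the session.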

Why Google mattered here

Google’s stack is central to how CANE works:

Google technology | Role in CANE
Gemini | Multimodal reasoning for understanding browser state and deciding actions
Vertex AI | Model access and deployment-friendly runtime path
Google ADK | Agent orchestration, tool calling, and worker-agent composition
Google Cloud Run | Backend hosting and scalable deployment target

Challenges we ran into

Building a browser agent is very different from building a chatbot.

Challenge | Why it was hard | How we approached it
Unreliable page structure | Real websites are messy, dynamic, and inconsistent | We used a DOM-first strategy with screenshot-based recovery
Grounding actions | The agent needs to act on what is true now, not what used to be on screen | We kept the loop tightly tied to current browser state
Multi-step workflows | A lot of browser tasks depend on intermediate results | We used a main orchestrator plus worker agents for bounded tasks
Safe automation | Some actions are easy, others are risky | We added confirmation gates and verification-oriented execution
Desktop-cloud coordination | The browser lives locally but reasoning happens in the cloud | We designed a protocol layer to bridge both sides cleanly
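
As an illustration of the "confirmation gates and verification-oriented execution" row, a gate could be sketched roughly as below. The risk policy shown here (a short list of tool names plus a keyword pattern) is an assumption chosen for brevity, and all identifiers are hypothetical; the writeup does not say how CANE decides which actions count as risky.

```typescript
// Hypothetical names throughout; the tool list and keyword pattern are an assumed policy.
interface PlannedAction {
  tool: string;
  target: string;
  description: string;
}

type Risk = "low" | "high";

const HIGH_RISK_TOOLS: readonly string[] = ["submit_form", "download_file"];
const HIGH_RISK_WORDS = /\b(pay|buy|purchase|delete|send|submit)\b/i;

// Classify an action before it runs: risky tools or risky wording need approval.
function classifyRisk(action: PlannedAction): Risk {
  if (HIGH_RISK_TOOLS.indexOf(action.tool) !== -1) return "high";
  return HIGH_RISK_WORDS.test(action.description) ? "high" : "low";
}

type GateResult =
  | { status: "declined" }
  | { status: "executed"; verified: boolean };

interface GateDeps {
  askUser(action: PlannedAction): Promise<boolean>; // e.g. a confirmation dialog
  execute(action: PlannedAction): Promise<void>;
  verify(action: PlannedAction): Promise<boolean>; // re-inspect the page for the expected outcome
}

// High-risk actions wait for explicit approval; every executed action is verified afterwards.
async function executeWithGate(action: PlannedAction, deps: GateDeps): Promise<GateResult> {
  if (classifyRisk(action) === "high") {
    const approved = await deps.askUser(action);
    if (!approved) return { status: "declined" };
  }
  await deps.execute(action);
  const verified = await deps.verify(action);
  return { status: "executed", verified };
}
```

Keyword matching is a blunt instrument and would misclassify some actions in practice. The point of the sketch is the ordering: classify, confirm if needed, execute, then verify.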

Accomplishments that we're proud of

We are proud that CANE feels like an actual agent, not just a wrapper around a model.

  • Built a working desktop browser agent with a real live browser session
  • Connected an Electron execution layer to a Google Cloud-hosted agent backend
  • Used Gemini through Vertex AI as the reasoning engine
  • Structured the backend around ADK with a main agent and specialist worker agents
  • Created a browser control loop that can inspect, act, verify, and continue
  • Framed the product around real browser work and accessibility, not just novelty

Highlights at a glance

Area | Outcome
Product | A browser side-pilot that helps complete real web tasks
Technical architecture | Desktop execution + cloud reasoning working together in real time
Hackathon fit | Strong alignment with the UI Navigator track
Accessibility | Reduced manual typing, clicking, and repetitive interaction load

What we learned

The biggest lesson was that browser agents need much stronger grounding than ordinary chat systems.

  • A browser agent has to constantly re-check state
  • DOM-only automation is not enough for modern interfaces
  • Multimodal reasoning becomes much more valuable when paired with real execution
  • Good orchestration matters as much as model quality
  • Accessibility is not a side benefit here; it is one of the clearest reasons this product should exist
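
The first lesson, constantly re-checking state, can be illustrated with a small guard that refuses to act on a plan made against an outdated page. This is a sketch under assumptions: PageFingerprint and guardAction are hypothetical names, and domVersion is assumed to be a counter that the desktop layer increments whenever the page's DOM changes (for example from a MutationObserver callback).

```typescript
// Hypothetical fingerprint of the page at the moment it was observed.
interface PageFingerprint {
  url: string;
  domVersion: number; // assumed counter, bumped by the desktop layer on each DOM change
}

// An action remembers which observation the model was looking at when it chose it.
interface GroundedAction {
  selector: string;
  basedOn: PageFingerprint;
}

type GuardOutcome =
  | { kind: "proceed"; selector: string }
  | { kind: "reobserve"; reason: string };

// Compare the fingerprint the plan was based on with the page as it is right now.
function guardAction(action: GroundedAction, current: PageFingerprint): GuardOutcome {
  if (action.basedOn.url !== current.url) {
    return { kind: "reobserve", reason: "page navigated since the plan was made" };
  }
  if (action.basedOn.domVersion !== current.domVersion) {
    return { kind: "reobserve", reason: "DOM changed since the plan was made" };
  }
  return { kind: "proceed", selector: action.selector };
}
```

A strict equality check like this would trigger many re-observations on busy pages, so a real system would likely tolerate changes that do not touch the targeted element. The sketch only shows the principle of checking before acting.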

We also learned that separating the system into a cloud "brain" and desktop "hands" works well as a pattern for agentic software: reasoning runs in the cloud, while every action executes locally, where the live browser actually exists.


What's next for CANE

We see CANE growing from a browser automation prototype into a true assistive web co-pilot.

Next step | Why it matters
Voice-first interaction | Make CANE feel more natural and reduce dependence on typing
Stronger multimodal input | Expand beyond screenshots into richer live visual/browser context
Better accessibility workflows | Support users with mobility, fatigue, and assistive browsing needs more directly
More robust task memory | Let CANE resume and manage longer-running workflows
Safer automation controls | Improve review, permission, and user-approval boundaries
Broader workflow coverage | Handle more real-world browser tasks across research, operations, and admin work

Summary

CANE tackles a simple but important problem: the browser is where work happens, but users are still expected to do all of it by hand.

By combining an Electron desktop browser, Google ADK orchestration, Gemini on Vertex AI, and a backend deployed on Google Cloud Run, we built a system that can understand live web interfaces and act on them in real time.

The result is a browser agent designed to make the web more usable, more assistive, and far less manual.
