About The Project

UISurf was built to explore a simple but important idea: AI should not stop at generating text; it should be able to complete real tasks across the interfaces people already use every day.

Main repository: https://github.com/haruiz/uisurf-agentic-platform

Inspiration

The inspiration for UISurf came from a gap we kept seeing in existing AI experiences. Many systems are good at answering questions, summarizing content, or generating structured output, but they often stop right before the most useful part: actually doing the work.

Real-world tasks usually span multiple interfaces. A user may need to browse websites, compare information across sources, switch to a desktop application, and save or organize the result. That kind of workflow is still tedious and manual, even with modern AI assistants.

We wanted to build a platform where agents could go beyond chat and actually operate software environments. The goal was to create a system where a user could give a prompt like:

Go to Walmart, Best Buy, and Amazon, find the price of the latest MacBook Pro, then open a text editor and save the comparison results in a text file on the desktop.

That requires much more than text generation. It requires multimodal understanding, browser automation, desktop control, orchestration, and safe execution. UISurf was built to make that possible.

What We Built

UISurf is a multimodal agentic UI automation platform where agents can see, reason, and collaborate across browser and desktop environments to complete end-to-end tasks.

The platform is composed of three main parts:

  • uisurf-agent: the runtime for UI automation agents
  • uisurf-admin: the session orchestration and control-plane service
  • uisurf-app: the main full-stack user application

At the core of the system are two specialized automation agents:

  • a Browser Agent that uses Playwright to navigate and interact with websites (a minimal sketch follows this list)
  • a Desktop Agent that interacts with the operating system desktop and local applications
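
To make that concrete, here is a minimal sketch of the kind of step the Browser Agent performs: open a page with Playwright and capture a screenshot of its visual state for the multimodal model. The function name and URL are illustrative, not the actual uisurf-agent API.

```python
# Minimal sketch of a Playwright-based browser step.
# Names (fetch_page_state, the URL) are illustrative, not UISurf's API.
import asyncio

from playwright.async_api import async_playwright


async def fetch_page_state(url: str) -> bytes:
    """Open a page and return a screenshot for the multimodal model."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        screenshot = await page.screenshot(full_page=True)
        await browser.close()
        return screenshot


if __name__ == "__main__":
    png_bytes = asyncio.run(fetch_page_state("https://www.example.com"))
    print(f"Captured {len(png_bytes)} bytes of visual state")
```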

The platform uses Gemini Live's multimodal capabilities to reason over UI screenshots and visual state, helping the system decide the next best step during automation. Instead of treating the screen as a blind command surface, UISurf lets agents interpret what is visible and adapt their actions accordingly.
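
As a rough illustration of that loop, the sketch below sends a screenshot to a Gemini model and asks it to pick the next action. For simplicity it uses the google-genai SDK in single-turn mode rather than the Live API the platform actually uses, and the model name and prompt are placeholders.

```python
# Sketch: ask a Gemini model to choose the next UI action from a screenshot.
# Single-turn google-genai call for simplicity; UISurf itself uses Gemini
# Live. The model name and prompt wording are placeholders.
from google import genai
from google.genai import types

client = genai.Client()  # reads API key / Vertex AI config from the environment


def decide_next_action(screenshot_png: bytes, goal: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=screenshot_png, mime_type="image/png"),
            f"Task: {goal}\n"
            "Based on what is visible in this screenshot, describe the single "
            "next UI action to take (e.g. click, type, scroll).",
        ],
    )
    return response.text
```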

How We Built It

We built UISurf as a cloud-connected, multi-service platform:

  • a Next.js frontend provides the user interface
  • a FastAPI backend powers the main application APIs
  • Firebase Authentication secures user access
  • Firestore stores chat sessions and session metadata
  • Google ADK is used for the multi-agent orchestration layer
  • Vertex AI supports agent session management and orchestration
  • uisurf-admin runs as a FastAPI service that provisions isolated automation sessions (a provisioning sketch follows this list)
  • uisurf-agent runs inside sandboxed Docker containers and exposes the Browser Agent and Desktop Agent through the Agent2Agent (A2A) protocol
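
The provisioning path can be pictured as a small FastAPI endpoint that launches one sandboxed container per session. This simplified sketch uses the Docker SDK for Python; the image name, port mapping, and route are assumptions rather than the real uisurf-admin service, which also handles auth, lifecycle tracking, and session restore.

```python
# Simplified sketch of a session-provisioning endpoint in the admin service.
# The image name, port, and route are illustrative assumptions.
import uuid

import docker
from fastapi import FastAPI

app = FastAPI()
docker_client = docker.from_env()


@app.post("/sessions")
def create_session():
    session_id = uuid.uuid4().hex[:8]
    container = docker_client.containers.run(
        "uisurf-agent:latest",      # hypothetical sandbox image
        name=f"uisurf-session-{session_id}",
        detach=True,
        ports={"8080/tcp": None},   # let Docker pick a free host port
    )
    return {"session_id": session_id, "container_id": container.short_id}
```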

The workflow looks like this:

  1. A user starts a session from the UISurf application.
  2. The system creates an orchestration session using ADK and Vertex AI.
  3. The backend calls the admin service to provision a sandboxed automation environment.
  4. The admin service starts a dedicated uisurf-agent container.
  5. The container exposes browser and desktop automation agents.
  6. Higher-level orchestration agents connect to those A2A-compatible agents and execute the task (see the sketch after this list).
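
Step 6 is where ADK ties everything together: a root orchestration agent treats the containerized browser and desktop agents as remote A2A sub-agents. The sketch below follows the remote-agent pattern from ADK's A2A documentation; the import path, agent-card URLs, and model name are assumptions and may differ by ADK version and from our actual wiring.

```python
# Sketch of an ADK root agent delegating to remote A2A agents.
# Import path, agent-card URLs, and model name are assumptions,
# not UISurf's actual wiring.
from google.adk.agents import Agent
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

browser_agent = RemoteA2aAgent(
    name="browser_agent",
    description="Navigates and interacts with websites via Playwright.",
    agent_card="http://localhost:8001/.well-known/agent.json",  # hypothetical
)

desktop_agent = RemoteA2aAgent(
    name="desktop_agent",
    description="Controls the desktop and local applications.",
    agent_card="http://localhost:8002/.well-known/agent.json",  # hypothetical
)

root_agent = Agent(
    name="uisurf_orchestrator",
    model="gemini-2.0-flash",
    instruction=(
        "Break the user's task into browser and desktop steps and delegate "
        "each step to the appropriate sub-agent."
    ),
    sub_agents=[browser_agent, desktop_agent],
)
```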

In simplified form, the architecture is:

$$ \text{User} \rightarrow \text{UISurf App} \rightarrow \text{UISurf Admin} \rightarrow \text{UISurf Agent} \rightarrow \text{Browser/Desktop Environment} $$

This separation made it possible to keep the system modular, observable, and safer to operate.

What We Learned

This project taught us a lot about what it really takes to move from “AI that answers” to “AI that acts.”

Some of the biggest lessons were:

  • UI automation is fundamentally multimodal. It is not enough to generate action plans; the system must understand what is actually visible on the screen.
  • Specialized agents are more effective than one monolithic agent. Separating browser automation from desktop automation created a cleaner model for collaboration and execution.
  • Session isolation matters. Running each automation task inside its own sandboxed container makes the platform safer, easier to debug, and easier to scale.
  • Infrastructure is part of the product. Building a useful agentic experience required more than model prompts. We needed orchestration, authentication, session recovery, remote access, and lifecycle management.
  • End-to-end user experience matters. A good agent system is not just about model intelligence; it is also about giving users a clean way to launch, observe, and manage real automation sessions.

Challenges We Faced

One of the hardest parts was designing a system that could coordinate multiple layers at once:

  • multimodal reasoning over screenshots
  • browser automation
  • desktop automation
  • session provisioning
  • authenticated user workflows
  • cloud-connected orchestration

Another major challenge was making the system work across boundaries. A single task might start in the browser, continue through information extraction, and finish with desktop actions like opening a text editor and saving a file. That required careful coordination between different agents and services.

We also had to think carefully about execution safety and observability. Because these agents can control real interfaces, we needed isolated sandboxed sessions and a clean control plane for creating, listing, restoring, and deleting automation environments.

Finally, one of the biggest engineering challenges was making the platform feel like one product even though it is composed of multiple services. We had to connect the frontend, backend, session manager, and automation containers into a single experience that feels coherent from the user’s perspective.

Why This Project Matters

UISurf is our attempt to push AI beyond text-based assistance into real interface-level execution.

Instead of asking users to adapt to AI, UISurf allows AI agents to adapt to the software environments users already depend on. By combining multimodal reasoning, browser automation, desktop control, and cloud-based session orchestration, we built a platform that can turn natural language into real actions across real interfaces.

That is the problem we wanted to solve, and UISurf is our answer.
