Inspiration

Every time I tried to automate web tasks, I'd spend hours fighting with CSS selectors and XPath that would break the next day when the website changed. It was frustrating!

Then I discovered Gemini 2.0 Flash with its powerful multimodal vision capabilities. I thought: What if AI could just "see" the page like a human does?

Instead of parsing DOM structures, we could now have AI analyze screenshots and understand visual elements - just like how humans navigate websites. The vision was born: create an agent that could complete web tasks by "seeing" rather than "parsing."

What it does

UI Navigator is a Visual UI Understanding & Interaction Agent - it becomes your hands on screen. The agent:

  1. Sees: Takes screenshots of any webpage and analyzes them with Gemini 2.0 Flash
  2. Understands: Identifies buttons, text fields, links, and other UI elements visually
  3. Acts: Executes browser actions (click, type, scroll, navigate) autonomously
  4. Completes: Runs multi-step workflows until the task is finished

Example Tasks

  • "Search for the president of Kenya" → Opens Google → Types query → Shows results
  • "Find weather in Nairobi" → Navigates to weather site → Reports conditions
  • "Go to Wikipedia and find info about AI" → Opens Wikipedia → Navigates to AI page

Key Difference: No DOM access needed. The agent works purely through vision - making it work on ANY website regardless of framework.

How we built it

Architecture

┌─────────────────────────────────────────────────────────────┐
│ Frontend (React + Vite + Tailwind)                        │
│ - Task input, live viewport, action history                │
└────────────────────────────┬────────────────────────────────┘
                            │ HTTPS
                            ▼
┌─────────────────────────────────────────────────────────────┐
│ Backend (Express + Node.js on Cloud Run)                    │
│ - Orchestrator: Agent workflow loop                         │
│ - Gemini Integration: Vision analysis                       │
│ - Playwright: Browser automation                           │
└────────────────────────────┬────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
   Vertex AI          Firestore          Cloud Storage
   (Gemini)           (Sessions)         (Screenshots)

Tech Stack (Challenge Requirements)

  • Gemini 2.0 Flash: Multimodal vision to interpret screenshots (✓ Mandatory)
  • Google GenAI SDK: @google-cloud/vertexai integration (✓ Mandatory)
  • Google Cloud: Vertex AI, Cloud Run, Cloud Storage, Firestore (✓ Mandatory)

Key Code: Gemini Integration

const generativeModel = vertexAI.preview.getGenerativeModel({
    model: 'gemini-2.0-flash',
    systemInstruction: {
        role: 'system',
        parts: [{ text: systemInstruction }],
    },
});

The system instruction tells Gemini to output JSON with executable actions (click, type, scroll, etc.)

Challenges we ran into

Challenge 1: Vision-Based Element Finding

Problem: How do we find elements without DOM access? Solution: Use Gemini to "see" the page and identify elements visually using bounding boxes.

Challenge 2: Screenshot Pipeline

Problem: Screenshots needed efficient transmission to Gemini Solution: JPEG compression (quality 80), base64 encoding, proper data URI prefixes

Challenge 3: Multi-Tab Handling

Problem: Agent was opening new tabs instead of using existing one Solution: Force single-tab usage with new_tab: false parameter

Challenge 4: Verification Logic

Problem: Verification was too strict, blocking successful workflows Solution: Made verification lenient - trust the agent when it believes task is complete

Challenge 5: Search Task Performance

Problem: Agent wasn't thoroughly searching for answers Solution: Added explicit instructions to scroll through ALL results

Challenge 6: Cloud Run Deployment

Problem: Playwright in containers can be tricky Solution: Proper Dockerfile dependencies, source-based deployment, 2Gi memory allocation

Accomplishments that we're proud of

Production-Ready Deployment: Backend running on Cloud Run, frontend on Cloud Storage ✅ All Mandatory Requirements Met: Gemini 2.0 Flash, Google GenAI SDK, Multimodal Vision, Executable Actions ✅ Universal Compatibility: Works on ANY website - no DOM parsing needed ✅ Automated Deployment: One-command deployment via deploy.sh script ✅ Real Multimodal Agent: End-to-end vision-based automation loop ✅ Session Persistence: Firestore for history and recovery

What we learned

  1. Vision > DOM: For general web automation, vision-based approaches are more robust than DOM parsing since they work regardless of how the page is built.

  2. Prompt Engineering Matters: Small changes to system prompts significantly affect agent behavior. Adding explicit instructions for search tasks improved success rate dramatically.

  3. Cloud Run Works for Browser Automation: Despite initial concerns, Playwright with Chromium works in Cloud Run containers when proper dependencies are installed.

  4. Normalized Coordinates: Using normalized [0,1] coordinates instead of absolute pixels makes automation work across different viewport sizes.

  5. Error Recovery: Building retry logic and error categorization helps the agent recover from failures gracefully.

What's next for UI Navigator

  1. Enhanced Element Detection: Use bounding boxes with higher precision for better click accuracy

  2. Multi-Modal Input: Accept voice commands in addition to text

  3. Smarter Recovery: Better error handling with automatic retry and fallback strategies

  4. User Authentication: Add user login for personalized sessions and saved preferences

  5. Screenshots to Cloud Storage: For very long sessions, offload screenshots to GCS instead of memory

  6. Mobile Support: Extend to mobile app automation using similar vision-based approach


Built for the #GeminiLiveAgentChallenge - "Your AI Hands on Screen"

Built With

Share this project:

Updates