🔭 Voyance

AI-Powered Visual Web Research Agent

Describe your research task in plain English. Watch an intelligent agent navigate live websites using Gemini vision. Receive a spoken briefing and a structured comparison report — no DOM scraping, no site-specific APIs, no code required.



💡 The Problem

Competitive research today is broken:

| Pain Point | Reality |
|---|---|
| CSS selectors | Break every time a site is redesigned |
| Site-specific scrapers | Need constant, costly maintenance |
| Manual copy-paste | Hours of work across dozens of tabs |
| Data reliability | No way to verify AI-extracted claims |

We asked a different question: What if an AI agent could see the web the way a human analyst does?

Instead of fighting DOM structures, Voyance uses computer vision + multi-model planning to browse like a real analyst — and was built specifically for the Gemini Live Agent Challenge (UI Navigator track), which asks: Can an agent observe and interact with interfaces using multimodal vision?


✨ What Voyance Does

You type:   "Compare pricing for the top 5 CRM tools"
              ↓
 Agent plans → navigates → extracts → verifies → reports
              ↓
You receive: A sortable comparison table + Vera's spoken briefing

The full output includes:

  • 📊 Sortable comparison table — company, segment, pricing tiers, key features, and confidence badges
  • 🎙️ Spoken briefing — ElevenLabs persona Vera narrates findings with analyst-grade voice quality
  • 📥 Export — Download results as CSV or HTML
  • 📡 Live transcript — Full research pipeline narrated step-by-step in real time

Zero DOM dependencies. Because it reads pages the way a human does, the agent keeps working across site redesigns.


🏗️ How It Works

The Agent Loop

  ┌─────────────┐     ┌───────────────┐     ┌─────────────────┐
  │  1. PLAN    │────▶│  2. NAVIGATE  │────▶│  3. EXTRACT     │
  │  Perplexity │     │  Playwright   │     │  Firecrawl      │
  │  + Gemini   │     │  Chromium     │     │  → Gemini Vision│
  └─────────────┘     └───────────────┘     └────────┬────────┘
                                                      │
  ┌─────────────┐     ┌───────────────┐              │
  │  5. REPORT  │◀────│  4. VERIFY    │◀─────────────┘
  │  Gemini     │     │  Perplexity   │
  │  + ElevenLabs│    │  Fact-check   │
  └─────────────┘     └───────────────┘

Step-by-Step

① Planning — Research Intent Generation

  • Perplexity API looks up "top 5 CRM tools" → returns real, live URLs
  • Gemini 2.0 Flash generates a structured JSON research plan: intent, target sites, data points, exclusions
  • Keyword-based fallback lists ensure the agent always has targets
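
To make the planning output concrete, here is a minimal Python sketch of what a research plan and the keyword fallback might look like. The four fields mirror the ones listed above; the `ResearchPlan` class, the `FALLBACK_TARGETS` table, and the `plan_targets` helper are hypothetical names, and the URLs are placeholders rather than Voyance's actual lists.

```python
from dataclasses import dataclass, field

# Hypothetical keyword -> URL fallback table (placeholder URLs, not the real lists).
FALLBACK_TARGETS = {
    "crm": [
        "https://crm-one.example/pricing",
        "https://crm-two.example/pricing",
    ],
}


@dataclass
class ResearchPlan:
    """Structured plan with the four fields named in the planning step."""
    intent: str
    target_sites: list[str]
    data_points: list[str] = field(default_factory=list)
    exclusions: list[str] = field(default_factory=list)


def plan_targets(query: str, discovered_urls: list[str]) -> list[str]:
    """Prefer live URLs from discovery; otherwise fall back to keyword lists."""
    if discovered_urls:
        return discovered_urls
    lowered = query.lower()
    for keyword, urls in FALLBACK_TARGETS.items():
        if keyword in lowered:
            return list(urls)
    return []
```

With this shape, an empty discovery result for a CRM query still yields targets, which is the guarantee the fallback bullet describes.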

② Navigation — Screenshot Capture

  • Playwright + headless Chromium loads each URL silently
  • Captures raw base64 screenshots — zero DOM interaction, zero selectors
  • Pixel data is the only input to the vision layer
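
A screenshot-only capture can be sketched with Playwright's synchronous Python API as follows. This is an illustration under assumptions, not the project's actual code: the function names, the viewport size, and the choice of a full-page screenshot are all hypothetical.

```python
import base64


def encode_screenshot(png_bytes: bytes) -> str:
    """Base64-encode raw PNG bytes for transport to the vision model."""
    return base64.b64encode(png_bytes).decode("ascii")


def capture_screenshot(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return a base64 PNG screenshot."""
    # Imported lazily so encode_screenshot works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # Viewport size is an assumed value.
            page = browser.new_page(viewport={"width": 1280, "height": 800})
            page.goto(url, timeout=timeout_ms)
            # No selectors or DOM queries: only pixels leave this function.
            return encode_screenshot(page.screenshot(full_page=True))
        finally:
            browser.close()
```

Calling `capture_screenshot("https://example.com")` requires `playwright install chromium` to have been run first.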

③ Extraction — Dual-Path Intelligence

  • Fast path: Firecrawl API → structured JSON (company, pricing tiers, features, segment). Low latency on standard pages.
  • Vision fallback: Base64 screenshot → Gemini 2.0 Flash multimodal prompt. Handles SPAs, paywalls, and rate-limited pages gracefully.
  • Robustness: Missing company names inferred from domain (e.g., assetpanda.com → "Assetpanda"). Enterprise "Contact Sales" pages return real tier names, not "Unknown."
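
The dual-path logic and the domain-based name fallback can be sketched like this. The two extractors are passed in as plain callables so the example stays self-contained. `extract`, `infer_company_from_url`, and the record keys are hypothetical, and the domain parsing is deliberately naive (it does not handle suffixes such as `.co.uk`).

```python
from typing import Callable, Optional
from urllib.parse import urlparse


def infer_company_from_url(url: str) -> str:
    """Title-case the label before the TLD, e.g. assetpanda.com -> Assetpanda."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    labels = host.split(".")
    name = labels[-2] if len(labels) >= 2 else host
    return name.capitalize()


def extract(
    url: str,
    screenshot_b64: str,
    fast_path: Callable[[str], Optional[dict]],
    vision_path: Callable[[str], Optional[dict]],
) -> dict:
    """Try the fast path first; fall back to vision; then fill a missing name."""
    try:
        record = fast_path(url)
    except Exception:
        record = None  # e.g. rate limit or unsupported page
    if not record:
        record = vision_path(screenshot_b64)
    record = dict(record or {})
    if not record.get("company"):
        record["company"] = infer_company_from_url(url)
    return record
```

Injecting the extractors keeps the fallback order testable without network access; in a real pipeline they would wrap the Firecrawl and Gemini calls.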

④ Verification — Claim Cross-Check

  • For each competitor and key claim (e.g., "HubSpot Starter is $20/seat/mo"):
    • Perplexity API (sonar model, citations enabled, low temperature)
    • Returns a verified flag + confidence score
    • Powers UI badges: ✅ Verified · ⚠️ Unconfirmed · 🔴 Low Confidence
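
As an illustration of how a verified flag and a confidence score could map to the three badges, consider the sketch below. The 0.5 cut-off, the assumption that confidence lies in [0, 1], and all names are hypothetical; the write-up does not state the actual thresholds.

```python
from dataclasses import dataclass

BADGE_VERIFIED = "✅ Verified"
BADGE_UNCONFIRMED = "⚠️ Unconfirmed"
BADGE_LOW = "🔴 Low Confidence"

# Assumed cut-off; the real threshold is not given in the source.
LOW_CONFIDENCE_THRESHOLD = 0.5


@dataclass
class Verification:
    verified: bool
    confidence: float  # assumed to be in [0, 1]


def badge_for(result: Verification) -> str:
    """Map a verification result to one of the three UI badges."""
    if result.confidence < LOW_CONFIDENCE_THRESHOLD:
        return BADGE_LOW
    if result.verified:
        return BADGE_VERIFIED
    return BADGE_UNCONFIRMED
```

In this sketch low confidence takes priority over the verified flag, so a claim marked verified with a weak score is still shown as low confidence. That ordering is a design choice of the example, not a documented behavior.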

⑤ Report — Synthesis & Voice

  • Gemini 2.0 Flash aggregates all records and writes a concise narrative
  • ElevenLabs TTS (Vera / Rachel voice, Multilingual v2) renders step narration + final synthesis as MP3
  • Frontend plays audio via <audio> element; backend returns base64 MP3

↩️ Interrupts — Mid-Run User Redirection

  • Users can say "skip this site" or "focus on HubSpot" while the agent is running
  • Agent stores the instruction, re-plans with Gemini, and adjusts the URL list on the next iteration — no session loss
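
The store-then-apply pattern behind interrupts can be sketched as below. Note the simplification: where Voyance re-plans with Gemini, this example swaps in plain string matching so that it runs offline. `AgentSession` and its methods are hypothetical, and "skip this site" is interpreted here as dropping the next queued URL.

```python
from collections import deque
from typing import Optional


class AgentSession:
    """Holds the URL queue and any instructions received mid-run."""

    def __init__(self, urls: list[str]) -> None:
        self.pending: deque[str] = deque(urls)
        self.instructions: list[str] = []

    def interrupt(self, text: str) -> None:
        """Store the instruction; it takes effect on the next iteration."""
        self.instructions.append(text)

    def next_url(self) -> Optional[str]:
        """Apply stored instructions, then hand out the next URL (or None)."""
        while self.instructions:
            self._apply(self.instructions.pop(0))
        return self.pending.popleft() if self.pending else None

    def _apply(self, text: str) -> None:
        # Stand-in for the Gemini re-plan: simple keyword handling only.
        lowered = text.lower().strip()
        if lowered == "skip this site":
            if self.pending:
                self.pending.popleft()
        elif lowered.startswith("focus on "):
            target = lowered[len("focus on "):]
            focused = [u for u in self.pending if target in u.lower()]
            if focused:
                self.pending = deque(focused)
```

Because instructions are only applied at iteration boundaries, the session object survives the interrupt unchanged, which is what "no session loss" amounts to in this sketch.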

🛠️ Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| AI Backbone | Gemini 2.0 Flash | Planning, vision analysis, report synthesis |
| SDK | Google GenAI SDK | All Gemini API calls, streamed via WebSocket |
| Browser | Playwright + Chromium | Headless screenshot-only navigation |
| Extraction | Firecrawl API | Fast structured extraction (primary path) |
| Vision Fallback | Gemini Multimodal | Screenshot-based understanding when Firecrawl fails |
| Verification | Perplexity sonar | Live web fact-checking + URL discovery |
| Voice | ElevenLabs TTS | Vera persona, Multilingual v2, step-level narration |
| Backend | FastAPI + WebSockets | Real-time agent ↔ UI communication |
| Frontend | React 19 + Vite + Tailwind | Live progress feed, sortable table, audio player |
| State | Google Firestore | Session persistence + research result storage |
| Infra | Cloud Run + Terraform + Cloud Build | Production-grade IaC deployment |

[Figure: Voyance architecture diagram]


🎯 Key Engineering Decisions & Challenges

1. Screenshot Parsing at Scale

Challenge: Gemini vision can be slow and noisy on complex pages. When a logo falls outside the captured viewport, the output reads "Unknown Company."
Solution: URL-based name inference as a final fallback (extract and title-case the domain). Screenshots are cached per session to avoid redundant vision calls.
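
A per-session screenshot cache might look like the following sketch. It assumes an in-memory dictionary keyed by session and URL; `ScreenshotCache` is a hypothetical helper, and the real storage backend is not described in this write-up.

```python
from typing import Callable


class ScreenshotCache:
    """In-memory cache keyed by (session_id, url)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}

    def get_or_capture(
        self, session_id: str, url: str, capture: Callable[[str], str]
    ) -> str:
        """Return the cached screenshot, capturing only on a cache miss."""
        key = (session_id, url)
        if key not in self._store:
            self._store[key] = capture(url)
        return self._store[key]

    def clear_session(self, session_id: str) -> None:
        """Drop every entry that belongs to one session."""
        for key in [k for k in self._store if k[0] == session_id]:
            del self._store[key]
```

Scoping the key to the session means a new research run always sees a fresh capture, while repeated vision calls inside one run reuse the same pixels.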

2. WebSocket Timeout on Cloud Run

Challenge: Load-balancer idle timeout (~10s) dropped long-running agent sessions mid-research.
Solution: Client-side ping every 5s. If connection drops, the frontend transparently polls the backend for the latest session state. Graceful degradation with zero UX interruption.
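
The ping-then-poll logic is sketched below in Python with asyncio, purely to show the control flow; the actual client runs in the browser. `keep_alive`, `send_ping`, and `poll_state` are hypothetical names, and the two callables stand in for the WebSocket send and the HTTP state request.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def keep_alive(
    send_ping: Callable[[], Awaitable[None]],
    poll_state: Callable[[], Awaitable[Any]],
    interval_s: float = 5.0,
    max_pings: Optional[int] = None,
) -> Any:
    """Ping on a fixed interval; if the socket fails, fall back to polling.

    Returns the polled session state after a failure, or None if
    max_pings is reached without any failure.
    """
    sent = 0
    while max_pings is None or sent < max_pings:
        try:
            await send_ping()
        except ConnectionError:
            # Connection dropped: fetch the latest session state instead.
            return await poll_state()
        sent += 1
        await asyncio.sleep(interval_s)
    return None
```

The 5-second default matches the interval quoted above and sits below the stated idle timeout, so a healthy connection never goes quiet long enough to be dropped.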

3. Hallucinations & Unverified Data

Challenge: Gemini would occasionally invent pricing or features during extraction fallback.
Solution: The Perplexity verification step cross-checks every key claim against live web sources with citations. Only verified claims appear with full confidence; others are flagged explicitly.

4. Playwright Memory & Timeouts

Challenge: Headless Chromium needed >512 MiB; some sites took >30s to load.
Solution: Cloud Run bumped to 1 GiB RAM. 30s request timeout added. Up to 5 concurrent navigations run in parallel.
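
One common way to cap concurrency and enforce a per-page timeout is an asyncio semaphore combined with `asyncio.wait_for`. The sketch below uses the limits quoted above (5 concurrent navigations, 30 s); `navigate_all` and the injected `navigate` coroutine are hypothetical, and the real implementation may differ.

```python
import asyncio
from typing import Awaitable, Callable, Optional

MAX_CONCURRENT = 5    # concurrent navigations, per the text above
NAV_TIMEOUT_S = 30.0  # per-page timeout, per the text above


async def navigate_all(
    urls: list[str],
    navigate: Callable[[str], Awaitable[str]],
    max_concurrent: int = MAX_CONCURRENT,
    timeout_s: float = NAV_TIMEOUT_S,
) -> dict[str, Optional[str]]:
    """Run navigations in parallel, bounded by a semaphore and a timeout."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def one(url: str) -> Optional[str]:
        async with semaphore:
            try:
                return await asyncio.wait_for(navigate(url), timeout=timeout_s)
            except asyncio.TimeoutError:
                return None  # slow site: give up and move on

    results = await asyncio.gather(*(one(u) for u in urls))
    return dict(zip(urls, results))
```

Returning `None` for timed-out pages lets the rest of the pipeline continue with partial results instead of failing the whole run.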

5. Live Voice During Execution

Challenge: Users wanted to hear narration as the agent works, not only a final summary.
Solution: Step-level TTS — ElevenLabs narrates each step in real time ("Visiting HubSpot… reading pricing page…"), streamed to the frontend incrementally.


🏆 Accomplishments

| # | Achievement |
|---|---|
| 🥇 | True visual UI navigation: zero DOM dependency; works on any site after any redesign |
| 🥈 | Hybrid extraction pipeline: Firecrawl + Gemini vision fallback means we never fail silently |
| 🥉 | Production-grade verification: Perplexity fact-checking prevents hallucinated data from reaching reports |
| 4 | Intelligent mid-run interrupts: users redirect the agent without losing session state |
| 5 | Vera voice briefings: ElevenLabs persona makes research feel human and immersive |
| 6 | Full IaC deployment: Terraform + Cloud Build, production-ready from day one |

📚 What We Learned

Multimodal vision beats DOM scraping — CSS selectors break. Screenshot-based perception is resilient by design. Build agents that perceive like humans.

Users want control mid-execution — Fire-and-forget isn't enough. Interrupt + re-plan capability made the agent feel trustworthy.

Verification separates useful tools from demos — LLMs extract well but hallucinate freely without grounding. Perplexity fact-checks turned Voyance from impressive to reliable.

Stream everything — Real-time WebSocket updates made the experience feel alive. Agentic UX is about feedback loops, not silent background processing.

Memory budgets matter — 512 MiB ≠ enough for 5 concurrent Playwright + Gemini calls. Always profile before deploying headless browsers to serverless infra.


🔮 What's Next

  • Gemini Live API — End-to-end voice: speak → Gemini Live transcribes → agent plans & executes. True voice-first research.
  • Screenshot Replay UI — Every table row links to the exact screenshot it was extracted from. Fully auditable, transparent intelligence.
  • Structured Fact-Checking — JSON claims + citations per competitor. Multi-claim verification with granular confidence scoring.
  • Custom Research Schemas — User-defined extraction fields, reusable across industries.
  • Scheduled Reports: "Compare these 5 tools every Monday" → recurring research + email summaries.
  • Competitor Tracking Dashboard — Historical pricing trends, feature parity charts, market insights over time.

🚀 Try It Now

Live Demo

voyance-beta.vercel.app

Run Locally (5 minutes)

Prerequisites: Node.js 18+, Python 3.10+, API keys for Gemini, ElevenLabs, Firecrawl, and Perplexity.

# 1. Clone & install
git clone https://github.com/ibtisamafzal/voyance.git
cd voyance && npm install

# 2. Backend
cd backend
pip install -r requirements.txt
playwright install chromium
cp .env.example .env        # ← add your API keys here
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# 3. Frontend (new terminal, from repo root)
npm run dev

Open http://localhost:5173, enter "Compare pricing for the top 5 CRM tools", and click Research.

Cloud Deployment (IaC)

Full Terraform + Cloud Build pipeline in infra/:

gcloud auth login
gcloud config set project YOUR_GCP_PROJECT

# Infrastructure (run from the directory that holds main.tf)
cd infra
terraform init
terraform apply -var="gcp_project=YOUR_GCP_PROJECT" -var="region=us-central1"

# Then connect your repo to Cloud Build — backend auto-deploys on every push

Files: infra/cloudbuild.yaml (CI/CD pipeline) · infra/main.tf (Cloud Run, Firestore, IAM, secrets)


📦 Built With

AI & Multimodal: Gemini 2.0 Flash · Google GenAI SDK · Gemini Multimodal Vision
Verification & Data: Perplexity API (sonar) · Firecrawl API
Browser: Playwright · Chromium (headless, Docker)
Voice: ElevenLabs API (Vera / Rachel, Multilingual v2)
Backend: FastAPI · WebSockets · Python 3.11 · Pydantic
Frontend: React 19 · Vite · Tailwind CSS · Radix UI · Framer Motion · Lucide React
State: Google Firestore · Google Cloud Storage
Infra: Docker · Google Cloud Run · Cloud Build · Terraform · Vercel · Google Secret Manager


Built for the Gemini Live Agent Challenge — UI Navigator track.
