Inspiration
Every developer ships UI bugs that real users find first: confusing navigation, tiny tap targets, poor contrast, unclear copy. Manual UX audits are expensive and slow. We asked: what if an AI agent could navigate your website like a real user and find friction before anyone else does? The Gemini Live Agent Challenge was the perfect catalyst to build an autonomous UX testing agent powered by multimodal AI.
What It Does
PrismUX is an AI-powered UX friction detector. Give it any URL and a goal (e.g. "Find opening hours" or "Complete checkout"), and it:
- Perceives the page via Gemini 2.5 Flash vision, detecting all interactive elements with bounding boxes
- Plans the optimal next action using confidence-gated reasoning
- Acts via Playwright browser automation with DOM+Vision fusion for precise targeting
- Evaluates before/after screenshots to assess progress and detect friction
It identifies 7 categories of UX friction (navigation, contrast, affordance, copy, error, performance, accessibility), generates severity-scored reports with actionable suggestions, and supports persona-based testing (elderly users, screen reader users, non-native speakers, etc.) to surface issues specific to different audiences.
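A severity-scored friction finding could be modeled roughly as below. The field names and the 1–5 severity scale are illustrative assumptions, not PrismUX's actual report schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FrictionCategory(Enum):
    # The seven categories PrismUX reports on
    NAVIGATION = "navigation"
    CONTRAST = "contrast"
    AFFORDANCE = "affordance"
    COPY = "copy"
    ERROR = "error"
    PERFORMANCE = "performance"
    ACCESSIBILITY = "accessibility"

@dataclass
class FrictionFinding:
    category: FrictionCategory
    severity: int                 # assumption: 1 (minor) to 5 (blocking)
    description: str              # what the agent observed
    suggestion: str               # actionable fix
    screenshot_url: Optional[str] = None  # annotated evidence, if captured

finding = FrictionFinding(
    category=FrictionCategory.CONTRAST,
    severity=3,
    description="Primary CTA text fails contrast against its background",
    suggestion="Darken the button text or lighten the background",
)
```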
Multimodal features: Real-time voice narration of agent thoughts via Web Speech API, synthesized audio cues for each action type, voice input for mid-navigation hints, annotated screenshot export with friction overlays, and a live thought stream showing the agent's reasoning in real time.
How We Built It
- Backend: FastAPI + Playwright for browser automation, Gemini 2.5 Flash for multimodal perception/planning/evaluation, Google ADK for agent orchestration
- Frontend: React 19 + TypeScript + Tailwind v4 with SSE-powered real-time streaming
- Core Architecture: PPAE loop (Perceive-Plan-Act-Evaluate) with confidence gating, stuck detection with 7-level escalating recovery + Gemini-powered intelligent recovery, cross-page memory, DOM+Vision fusion, and grounding verification
- Safety: URL allowlisting, PII detection, action-type blocking, page content scanning
- Infrastructure: Docker Compose, GCS screenshot storage, structured JSON schemas for all Gemini calls with repair retry
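The PPAE loop with confidence gating can be sketched as follows. Every function body here is a stand-in: in PrismUX, perceive/plan/evaluate are Gemini 2.5 Flash calls and act drives a Playwright page. The 0.6 threshold and all names are illustrative assumptions:

```python
# Minimal, self-contained sketch of a Perceive-Plan-Act-Evaluate loop.
CONFIDENCE_THRESHOLD = 0.6  # assumption: below this, re-perceive instead of acting

def perceive(page):
    # Real version: send a screenshot to Gemini, get elements + bounding boxes.
    return page["elements"]

def plan(goal, elements):
    # Real version: Gemini proposes the next action with a confidence score.
    target = elements[0]
    return {"action": ("click", target), "confidence": target["confidence"]}

def act(page, action):
    # Real version: Playwright click/type/scroll at fused coordinates.
    page["clicked"] = action[1]["label"]

def evaluate(goal, page):
    # Real version: Gemini compares before/after screenshots for progress.
    return page.get("clicked") == goal

def run(page, goal, max_steps=5):
    for _ in range(max_steps):
        elements = perceive(page)
        decision = plan(goal, elements)
        if decision["confidence"] < CONFIDENCE_THRESHOLD:
            continue  # gate: too uncertain to act, perceive again
        act(page, decision["action"])
        if evaluate(goal, page):
            return True
    return False

page = {"elements": [{"label": "Opening hours", "confidence": 0.9}]}
print(run(page, "Opening hours"))  # → True
```

The gate is what keeps the agent from acting on shaky perception: a low-confidence plan triggers another perceive pass rather than a possibly wrong click.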
Challenges We Ran Into
- Cookie consent overlays blocking navigation on nearly every European site — solved with a multi-layer dismissal system that searches the main page, shadow DOM, and all iframes for consent buttons in 5 languages
- Gemini coordinate accuracy — vision-detected bounding boxes sometimes miss by 20–50px. Solved with DOM+Vision fusion that cross-references Gemini targets with Playwright DOM elements and adjusts coordinates
- Stuck loops — the agent would repeat the same failed action. Built a stuck detector with URL/action/screenshot fingerprinting and escalating recovery from scroll → Escape → click outside → go_back → Tab → Gemini AI recovery → abandon
- Structured output reliability — Gemini occasionally returns malformed JSON. Added schema enforcement + repair retry that sends the broken response back with a fix prompt
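The repair-retry pattern from the last bullet can be sketched like this. `call_gemini` is a placeholder for the real API call, and the repair prompt wording is an illustrative assumption:

```python
import json

def call_with_repair(call_gemini, prompt, max_repairs=1):
    """Parse a model response as JSON; on failure, send the broken
    response back with a fix prompt and try again."""
    raw = call_gemini(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        for _ in range(max_repairs):
            # Echo the invalid output back and ask for corrected JSON only.
            raw = call_gemini(
                f"The following JSON is invalid ({err.msg}). "
                f"Return only the corrected JSON:\n{raw}"
            )
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                continue
    raise ValueError("response could not be repaired into valid JSON")
```

Combined with a strict response schema on the original call, this two-step approach catches the long tail of truncated or malformed outputs.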
Accomplishments That We're Proud Of
- 88 passing tests covering the agent core, stuck detection, friction analysis, persona engine, safety guards, and reporting
- DOM+Vision fusion running in parallel with zero added latency — DOM extraction happens concurrently with the Gemini vision call
- Persona-based testing that surfaces genuinely different friction for elderly users vs. screen reader users vs. non-native speakers
- Full multimodal UX: voice narration, audio cues, voice input, annotated screenshots, real-time thought streaming — the agent feels alive while navigating
- Gemini-powered intelligent recovery — when fixed recovery fails, the agent sends the screenshot to Gemini and asks "what should I try?" for context-aware unblocking
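The zero-added-latency fusion works because both perception sources are awaited concurrently. A sketch, where `get_dom_elements` and `gemini_detect_elements` are placeholders for the real Playwright and Gemini calls and the sleeps simulate their latencies:

```python
import asyncio

async def get_dom_elements(page):
    await asyncio.sleep(0.05)   # stand-in for Playwright DOM extraction
    return [{"selector": "button#buy", "x": 120, "y": 340}]

async def gemini_detect_elements(screenshot):
    await asyncio.sleep(0.05)   # stand-in for the Gemini vision call
    return [{"label": "Buy", "x": 135, "y": 355}]  # slightly off, as observed

async def perceive(page, screenshot):
    # Both calls run concurrently: total time is roughly the max of the
    # two latencies, not their sum.
    dom, vision = await asyncio.gather(
        get_dom_elements(page),
        gemini_detect_elements(screenshot),
    )
    # Simplified fusion: snap each vision target to the nearest DOM
    # element's coordinates to correct the 20-50px vision drift.
    fused = []
    for v in vision:
        nearest = min(dom, key=lambda d: abs(d["x"] - v["x"]) + abs(d["y"] - v["y"]))
        fused.append({**v, "x": nearest["x"], "y": nearest["y"]})
    return fused

fused = asyncio.run(perceive(None, None))
# fused keeps the vision label but uses the DOM coordinates:
# [{'label': 'Buy', 'x': 120, 'y': 340}]
```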
What We Learned
- Multimodal AI is remarkably good at understanding web UIs but still needs DOM grounding for pixel-accurate interactions
- Structured output schemas dramatically improve Gemini response reliability but repair retry is still essential
- Cookie consent is the #1 blocker for autonomous web agents — every framework needs a dedicated dismissal system
- Persona-based testing reveals friction that generic testing completely misses
- Real-time audio feedback transforms a "watch and wait" experience into an engaging one
What's Next for PrismUX
- Comparative A/B testing — run the same goal on two URL variants and diff the friction
- CI/CD integration — GitHub Action that runs PrismUX on every deploy and fails the build if friction score exceeds threshold
- Multi-page journey mapping — chain goals across pages for full user journey analysis
- Gemini 2.5 Pro upgrade for deeper reasoning on complex interaction patterns
- Accessibility compliance scoring — map friction to WCAG 2.2 success criteria automatically
Built With
- pytest
- python
- react
- typescript
- vitest