Inspiration
Every developer ships UI bugs that real users find first: confusing navigation, tiny tap targets, poor contrast, unclear copy. Manual UX audits are expensive and slow. We asked: what if an AI agent could navigate your website like a real user and find friction before anyone else does? The Gemini Live Agent Challenge was the perfect catalyst to build an autonomous UX testing agent powered by multimodal AI.
What It Does
PrismUX is an AI-powered UX friction detector. Give it any URL and a goal (e.g. "Find opening hours" or "Complete checkout"), and it:
- Perceives the page via Gemini 2.5 Flash vision, detecting all interactive elements with bounding boxes
- Plans the optimal next action using confidence-gated reasoning
- Acts via Playwright browser automation with DOM+Vision fusion for precise targeting
- Evaluates before/after screenshots to assess progress and detect friction
It identifies 7 categories of UX friction (navigation, contrast, affordance, copy, error, performance, accessibility), generates severity-scored reports with actionable suggestions, and supports persona-based testing (elderly users, screen reader users, non-native speakers, etc.) to surface issues specific to different audiences.
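A severity-scored friction finding could be modeled roughly as below. The field names and the 1–5 severity scale are illustrative assumptions, not PrismUX's actual report schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FrictionCategory(Enum):
    # The seven categories PrismUX reports on
    NAVIGATION = "navigation"
    CONTRAST = "contrast"
    AFFORDANCE = "affordance"
    COPY = "copy"
    ERROR = "error"
    PERFORMANCE = "performance"
    ACCESSIBILITY = "accessibility"

@dataclass
class FrictionFinding:
    category: FrictionCategory
    severity: int                 # assumption: 1 (minor) to 5 (blocking)
    description: str              # what the agent observed
    suggestion: str               # actionable fix
    screenshot_url: Optional[str] = None  # annotated evidence, if captured

finding = FrictionFinding(
    category=FrictionCategory.CONTRAST,
    severity=3,
    description="Primary CTA text fails contrast against its background",
    suggestion="Darken the button text or lighten the background",
)
```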
Multimodal features: Real-time voice narration of agent thoughts via Web Speech API, synthesized audio cues for each action type, voice input for mid-navigation hints, annotated screenshot export with friction overlays, and a live thought stream showing the agent's reasoning in real time.
How We Built It
- Backend: FastAPI + Playwright for browser automation, Gemini 2.5 Flash for multimodal perception/planning/evaluation, Google ADK for agent orchestration
- Frontend: React 19 + TypeScript + Tailwind v4 with SSE-powered real-time streaming
- Core Architecture: PPAE loop (Perceive-Plan-Act-Evaluate) with confidence gating, stuck detection with 7-level escalating recovery + Gemini-powered intelligent recovery, cross-page memory, DOM+Vision fusion, and grounding verification
- Safety: URL allowlisting, PII detection, action-type blocking, page content scanning
- Infrastructure: Docker Compose, GCS screenshot storage, structured JSON schemas for all Gemini calls with repair retry
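The PPAE loop with confidence gating can be sketched as follows. Every function body here is a stand-in: in PrismUX, perceive/plan/evaluate are Gemini 2.5 Flash calls and act drives a Playwright page. The 0.6 threshold and all names are illustrative assumptions:

```python
# Minimal, self-contained sketch of a Perceive-Plan-Act-Evaluate loop.
CONFIDENCE_THRESHOLD = 0.6  # assumption: below this, re-perceive instead of acting

def perceive(page):
    # Real version: send a screenshot to Gemini, get elements + bounding boxes.
    return page["elements"]

def plan(goal, elements):
    # Real version: Gemini proposes the next action with a confidence score.
    target = elements[0]
    return {"action": ("click", target), "confidence": target["confidence"]}

def act(page, action):
    # Real version: Playwright click/type/scroll at fused coordinates.
    page["clicked"] = action[1]["label"]

def evaluate(goal, page):
    # Real version: Gemini compares before/after screenshots for progress.
    return page.get("clicked") == goal

def run(page, goal, max_steps=5):
    for _ in range(max_steps):
        elements = perceive(page)
        decision = plan(goal, elements)
        if decision["confidence"] < CONFIDENCE_THRESHOLD:
            continue  # gate: too uncertain to act, perceive again
        act(page, decision["action"])
        if evaluate(goal, page):
            return True
    return False

page = {"elements": [{"label": "Opening hours", "confidence": 0.9}]}
print(run(page, "Opening hours"))  # → True
```

The gate is what keeps the agent from acting on shaky perception: a low-confidence plan triggers another perceive pass rather than a possibly wrong click.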
Challenges We Ran Into
- Cookie consent overlays blocking navigation on nearly every European site — solved with a multi-layer dismissal system that searches the main page, shadow DOM, and all iframes for consent buttons in 5 languages
- Gemini coordinate accuracy — vision-detected bounding boxes sometimes miss by 20–50px. Solved with DOM+Vision fusion that cross-references Gemini targets with Playwright DOM elements and adjusts coordinates
- Stuck loops — the agent would repeat the same failed action. Built a stuck detector with URL/action/screenshot fingerprinting and escalating recovery from scroll → Escape → click outside → go_back → Tab → Gemini AI recovery → abandon
- Structured output reliability — Gemini occasionally returns malformed JSON. Added schema enforcement + repair retry that sends the broken response back with a fix prompt
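The repair-retry pattern from the last bullet can be sketched like this. `call_gemini` is a placeholder for the real API call, and the repair prompt wording is an illustrative assumption:

```python
import json

def call_with_repair(call_gemini, prompt, max_repairs=1):
    """Parse a model response as JSON; on failure, send the broken
    response back with a fix prompt and try again."""
    raw = call_gemini(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        for _ in range(max_repairs):
            # Echo the invalid output back and ask for corrected JSON only.
            raw = call_gemini(
                f"The following JSON is invalid ({err.msg}). "
                f"Return only the corrected JSON:\n{raw}"
            )
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                continue
    raise ValueError("response could not be repaired into valid JSON")
```

Combined with a strict response schema on the original call, this two-step approach catches the long tail of truncated or malformed outputs.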
Accomplishments That We're Proud Of
- 88 passing tests covering the agent core, stuck detection, friction analysis, persona engine, safety guards, and reporting
- DOM+Vision fusion running in parallel with zero added latency — DOM extraction happens concurrently with the Gemini vision call
- Persona-based testing that surfaces genuinely different friction for elderly users vs. screen reader users vs. non-native speakers
- Full multimodal UX: voice narration, audio cues, voice input, annotated screenshots, real-time thought streaming — the agent feels alive while navigating
- Gemini-powered intelligent recovery — when fixed recovery fails, the agent sends the screenshot to Gemini and asks "what should I try?" for context-aware unblocking
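The zero-added-latency fusion works because both perception sources are awaited concurrently. A sketch, where `get_dom_elements` and `gemini_detect_elements` are placeholders for the real Playwright and Gemini calls and the sleeps simulate their latencies:

```python
import asyncio

async def get_dom_elements(page):
    await asyncio.sleep(0.05)   # stand-in for Playwright DOM extraction
    return [{"selector": "button#buy", "x": 120, "y": 340}]

async def gemini_detect_elements(screenshot):
    await asyncio.sleep(0.05)   # stand-in for the Gemini vision call
    return [{"label": "Buy", "x": 135, "y": 355}]  # slightly off, as observed

async def perceive(page, screenshot):
    # Both calls run concurrently: total time is roughly the max of the
    # two latencies, not their sum.
    dom, vision = await asyncio.gather(
        get_dom_elements(page),
        gemini_detect_elements(screenshot),
    )
    # Simplified fusion: snap each vision target to the nearest DOM
    # element's coordinates to correct the 20-50px vision drift.
    fused = []
    for v in vision:
        nearest = min(dom, key=lambda d: abs(d["x"] - v["x"]) + abs(d["y"] - v["y"]))
        fused.append({**v, "x": nearest["x"], "y": nearest["y"]})
    return fused

fused = asyncio.run(perceive(None, None))
# fused keeps the vision label but uses the DOM coordinates:
# [{'label': 'Buy', 'x': 120, 'y': 340}]
```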
What We Learned
- Multimodal AI is remarkably good at understanding web UIs but still needs DOM grounding for pixel-accurate interactions
- Structured output schemas dramatically improve Gemini response reliability but repair retry is still essential
- Cookie consent is the #1 blocker for autonomous web agents — every framework needs a dedicated dismissal system
- Persona-based testing reveals friction that generic testing completely misses
- Real-time audio feedback transforms a "watch and wait" experience into an engaging one
What's Next for PrismUX
- Comparative A/B testing — run the same goal on two URL variants and diff the friction
- CI/CD integration — GitHub Action that runs PrismUX on every deploy and fails the build if friction score exceeds threshold
- Multi-page journey mapping — chain goals across pages for full user journey analysis
- Gemini 2.5 Pro upgrade for deeper reasoning on complex interaction patterns
- Accessibility compliance scoring — map friction to WCAG 2.2 success criteria automatically
Built With
- pytest
- python
- react
- typescript
- vitest