Inspiration
We noticed that UX testing is broken. Recruiting real users costs thousands, scheduling takes weeks, and test panels tend to skew young and tech-savvy. Elderly users, blind users, ESL speakers, and people on slow connections are almost never represented. We wanted to make UX testing instant, free, and actually diverse.
What it does
You paste any website URL and pick from 8 AI personas (or build your own with one sentence), and they all browse your site simultaneously in real Chromium browsers. Each persona has a detailed cognitive model: a 72-year-old who calls tabs "little windows", a teen who rage-quits after 3 seconds, a blind developer using a screen reader. They scroll every page, click around, and report confusion. The result is a full UX audit with a 5-dimension score, letter grade, click heatmap, sentiment timeline, accessibility violations, navigation funnel, persona conflict detection, and AI-generated recommendations. The whole run takes about 2 minutes.
How we built it
The backend is Python/FastAPI on Railway, running Playwright to control real Chromium browsers. At each step, GPT-4o vision looks at a screenshot and decides what to click, scroll, or type. All personas run in parallel via asyncio, and Server-Sent Events stream the live browser view to the frontend. The frontend is Next.js on Vercel with Framer Motion animations, a CSS variable theme system for light/dark mode, and tabbed results with SVG charts. axe-core is injected into each page for accessibility auditing. Scoring is based on established UX research methodologies (SUS, the HEART framework, Nielsen's heuristics).
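To make the parallel-personas-plus-streaming idea concrete, here is a minimal sketch using only the standard library. It is not Phantom's actual code: `run_persona`, `run_all`, and `sse_frame` are hypothetical names, and the persona session is a stub where a real one would drive a browser. It shows `asyncio.gather` running sessions concurrently while each one emits events formatted as Server-Sent Events frames.

```python
import asyncio
import json


def sse_frame(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: event name, JSON data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def run_persona(name: str, steps: int, queue: asyncio.Queue) -> dict:
    """Hypothetical stand-in for one persona session (no real browser here)."""
    for step in range(1, steps + 1):
        await asyncio.sleep(0)  # yield control so other personas can make progress
        await queue.put(sse_frame("step", {"persona": name, "step": step}))
    return {"persona": name, "steps": steps}


async def run_all(personas: dict[str, int]) -> tuple[list[dict], list[str]]:
    """Run every persona concurrently and collect the frames they emitted."""
    queue: asyncio.Queue = asyncio.Queue()
    # gather() returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(
        *(run_persona(name, steps, queue) for name, steps in personas.items())
    )
    frames = []
    while not queue.empty():
        frames.append(queue.get_nowait())
    return list(results), frames


if __name__ == "__main__":
    results, frames = asyncio.run(run_all({"senior": 2, "teen": 1}))
    print(results)
    print("".join(frames), end="")
```

In a FastAPI app, frames like these would typically be yielded from an async generator behind a streaming response with the `text/event-stream` media type, rather than collected into a list as done here for simplicity.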
Challenges we ran into
Getting the AI to actually click things was hard. Sites like Apple use custom-styled elements that aren't real buttons, so we had to add cursor:pointer detection, a click-by-text fallback, and a 3-strategy click chain (hover + mouse click, JS dispatchEvent, JS .click()). The agent also kept getting stuck in loops, describing the same page without acting, so we built anti-loop detection, action failure feedback, and repetition alerts. Playwright's browser version had to be pinned to match the Docker image exactly, or Chromium wouldn't launch. Railway's port handling and Vercel's env var behavior each took multiple iterations to get right.
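The click chain and the anti-loop check can be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: `click_with_fallbacks` and `LoopDetector` are hypothetical names, the strategies are passed in as plain callables (in the real agent they would wrap browser calls), and the window and repeat thresholds are made-up defaults.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def click_with_fallbacks(
    strategies: Iterable[tuple[str, Callable[[], bool]]]
) -> Optional[str]:
    """Try each click strategy in order; return the name of the first that succeeds."""
    for name, attempt in strategies:
        try:
            if attempt():
                return name
        except Exception:
            # A failing strategy is expected; fall through to the next one.
            continue
    return None  # nothing worked, so the caller can feed the failure back to the model


class LoopDetector:
    """Flags when the same (url, action) pair keeps recurring in recent steps."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.recent: deque = deque(maxlen=window)  # only the last `window` steps count
        self.max_repeats = max_repeats

    def record(self, url: str, action: str) -> bool:
        """Record one step; return True if the agent looks stuck."""
        self.recent.append((url, action))
        return self.recent.count((url, action)) >= self.max_repeats


def broken_click() -> bool:
    raise RuntimeError("element not clickable")


if __name__ == "__main__":
    used = click_with_fallbacks([
        ("hover+mouse", broken_click),
        ("dispatchEvent", lambda: False),
        ("js-click", lambda: True),
    ])
    print(used)
```

Returning the name of the strategy that worked (or `None`) gives the agent something concrete to report back to the model, which is the kind of action failure feedback described above.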
Accomplishments that we're proud of
The persona conflict detector finds pages where one user type is completely fine but another is totally lost, which surfaces real design tradeoffs we haven't seen other tools show. We're also proud of the mid-test interaction, where the AI asks the human operator for guidance at choice points, and of the 5-dimension scoring system grounded in actual UX research. Most of all, the whole thing works end to end: paste a URL, watch AI personas browse live, and get a research-quality report in 2 minutes.
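One plausible way to detect persona conflicts is to compare per-persona scores for each page and flag pages where the gap between the best and worst persona is large. The sketch below assumes a 0-100 score per persona per page and an arbitrary 40-point threshold; the names `Conflict` and `find_conflicts` and the data shape are hypothetical, not taken from Phantom.

```python
from dataclasses import dataclass


@dataclass
class Conflict:
    page: str
    best_persona: str
    worst_persona: str
    gap: float


def find_conflicts(
    scores: dict[str, dict[str, float]], min_gap: float = 40.0
) -> list[Conflict]:
    """scores maps page -> {persona: score on an assumed 0-100 scale}."""
    conflicts = []
    for page, by_persona in scores.items():
        if len(by_persona) < 2:
            continue  # a conflict needs at least two personas on the page
        best = max(by_persona, key=by_persona.get)
        worst = min(by_persona, key=by_persona.get)
        gap = by_persona[best] - by_persona[worst]
        if gap >= min_gap:
            conflicts.append(Conflict(page, best, worst, gap))
    # Largest disagreements first.
    return sorted(conflicts, key=lambda c: c.gap, reverse=True)


if __name__ == "__main__":
    example = {
        "/checkout": {"teen": 90.0, "senior": 35.0},
        "/home": {"teen": 80.0, "senior": 75.0},
    }
    for conflict in find_conflicts(example):
        print(conflict)
```

With the example data, only the checkout page is flagged, since the home page scores are within a few points of each other.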
What we learned
The gap between "AI can see a screenshot" and "AI can actually use a website" is massive. Vision models are great at describing what they see but terrible at taking action without very explicit guardrails, failure feedback, and anti-stuck mechanisms. We also learned how much UX research exists that nobody applies because testing is too expensive. The scoring methodologies we used (SUS, HEART, Nielsen) have been around for decades, but most products never run them.
What's next for Phantom
Session replay with AI-generated think-aloud voiceover so you can watch each persona's journey like a real user testing video. A/B comparison mode to test two URLs side by side with the same personas. Mobile viewport testing since half the personas would realistically be on phones. CI/CD integration so teams can run Phantom as a GitHub Action and block deploys if the UX score drops below a threshold. And fix-it code suggestions that generate actual CSS/HTML patches for each issue found.
Built With
- asyncio
- axe-core
- docker
- fastapi
- framer-motion
- javascript
- next.js
- openai-gpt-4o
- openai-gpt-4o-mini
- playwright
- python
- railway
- server-sent-events
- tailwind-css
- typescript
- vercel