Inspiration
Traditional A/B testing requires weeks of real user traffic before you get actionable results, meaning bad UX ships to real users while teams wait for statistical significance. What if you could get a directional signal in 15 minutes, before sending a single real user through your flow? That question led us to build an AI-powered UX simulation engine that runs synthetic personas through web variants and automatically surfaces friction.
What it does
The simulator takes two variants of a web experience (Variant A and Variant B) and runs AI-driven personas through both in a real browser. Each persona attempts realistic tasks (creating an account, finding pricing, learning about the company), and the system measures success rates, step counts, backtracking, and where sessions fail. The results feed into a live React dashboard showing a side-by-side comparison, per-task metrics, and a clear verdict for each task. A population model weights scores across four sub-agent behavioral types (focused, distracted-returns, blended-goals, fully-distracted) to reflect how real user populations behave rather than how ideal users do. Judges can adjust persona group weights in real time and instantly recompute scores.
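To make the weighting concrete, here is a minimal TypeScript sketch of the idea. The archetype names come from above, but the types, weights, and helper function are illustrative assumptions, not the project's actual code:

```ts
// Minimal sketch of population-weighted scoring. Archetype names are from the
// writeup; everything else (types, weights, helper) is a hypothetical illustration.
type Archetype = "focused" | "distracted-returns" | "blended-goals" | "fully-distracted";

// Share of each archetype in the simulated audience; should sum to 1.
type WeightConfig = Record<Archetype, number>;

// Success rate per archetype for one task on one variant.
type ArchetypeScores = Record<Archetype, number>;

function weightedScore(scores: ArchetypeScores, weights: WeightConfig): number {
  return (Object.keys(weights) as Archetype[]).reduce(
    (sum, archetype) => sum + weights[archetype] * scores[archetype],
    0,
  );
}

// Example: an audience skewed toward focused users.
const audience: WeightConfig = {
  "focused": 0.5,
  "distracted-returns": 0.2,
  "blended-goals": 0.2,
  "fully-distracted": 0.1,
};
```

Because this aggregation is cheap to re-run, changing the weights can recompute the A/B scores instantly, which is what makes the real-time adjustment possible.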
How we built it
- Playwright drives a real browser in an observe -> decide -> act -> log loop, with GPT-4o making navigation decisions from page state and extracted elements (see the sketch after this list)
- A deterministic success-detection layer uses tight string matching per task to prevent false positives; no LLM subjectivity in the scoring
- A TypeScript monorepo with pnpm workspaces, Zod schemas, and clean package separation between the runner, the evaluator, and shared types
- An Express API server exposes /api/summary, /api/results, and /api/weights (GET + POST) so the dashboard can poll live results and write new weight configs back to disk
- A React + Vite + Tailwind dashboard polls every 3 seconds, renders per-task metrics and persona tables, and includes a live weight editor with client-side score preview and validation
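For flavor, here is a minimal sketch of the loop shape. The Playwright calls (launch, newPage, goto, click, fill) are the real library API; extractElements, decideNextAction, logStep, and the Action type are hypothetical stand-ins for the runner's internals:

```ts
import { chromium, type Page } from "playwright";

// Hypothetical stand-ins for the runner's internals; only the Playwright
// calls below are the real library API.
type Action =
  | { kind: "click"; selector: string }
  | { kind: "fill"; selector: string; value: string }
  | { kind: "done" };

declare function extractElements(page: Page): Promise<string[]>; // observe
declare function decideNextAction(
  task: string,
  state: { url: string; title: string },
  elements: string[],
): Promise<Action>; // decide (GPT-4o call)
declare function logStep(entry: object): void; // log

async function runSession(url: string, task: string, maxSteps = 20): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  for (let step = 0; step < maxSteps; step++) {
    // Observe: capture page state and candidate interactive elements.
    const state = { url: page.url(), title: await page.title() };
    const elements = await extractElements(page);

    // Decide: the LLM picks the next action given the task and what it sees.
    const action = await decideNextAction(task, state, elements);
    if (action.kind === "done") break;

    // Act: perform the chosen action in the real browser.
    if (action.kind === "click") await page.click(action.selector);
    else await page.fill(action.selector, action.value);

    // Log: record each step for the deterministic evaluator.
    logStep({ step, state, action });
  }
  await browser.close();
}
```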
Challenges we ran into
Getting deterministic success detection right was harder than expected. Early versions had false positives: "welcome to shopease" (without the exclamation mark) matched the cluttered variant's h1, and generic strings like "free plan" matched an announcement bar instead of the actual pricing page. We tightened every pattern and switched from textContent to innerText to prevent hidden elements from triggering false positives.

The sub-agent population model required careful design. We replaced an earlier random chaos system with four fixed behavioral archetypes per persona, each with its own scoring formula, to make results reproducible and explainable. Getting the weighted aggregation math right across 120 runs (3 tasks × 2 variants × 5 personas × 4 sub-agents) took several iterations.
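As an illustration of the fix, a deterministic check built on innerText might look like the sketch below. The function and pattern table are assumptions; only the "welcome to shopease!" string (with the exclamation mark) mirrors the tightened pattern described above:

```ts
import type { Page } from "playwright";

// Illustrative per-task success patterns; only "welcome to shopease!" is taken
// from the writeup, the rest (and the function itself) are hypothetical.
const successPatterns: Record<string, string> = {
  "create-account": "welcome to shopease!", // punctuation included to stay tight
  "find-pricing": "choose your plan",
};

async function taskSucceeded(page: Page, taskId: string): Promise<boolean> {
  // innerText returns only rendered, visible text (unlike textContent),
  // so hidden elements can no longer trigger a false positive.
  const visibleText = await page.locator("body").innerText();
  return visibleText.toLowerCase().includes(successPatterns[taskId]);
}
```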
Accomplishments that we're proud of
- A fully working end-to-end simulation pipeline: one command (pnpm run demo) runs 120 sessions, scores them, and populates a live dashboard
- The population model, the idea that results should reflect a realistic mix of user behaviors rather than just ideal focused users, feels like a genuine intellectual contribution to the UX testing space
- The live weight editor lets you adjust who's in the "audience" and watch the A/B scores recompute in real time; that interactivity makes the concept immediately tangible
- Robust false-positive prevention that makes the results actually trustworthy
What we learned
Deterministic evaluation is everything. The temptation in an LLM-powered project is to let the model judge its own output, but that compounds errors and makes results untrustworthy. Keeping the success-detection rule-based while using the LLM only for navigation decisions was the right call, producing consistent, explainable results. We also learned that simulating populations rather than individual users dramatically improves the believability of results. A single focused agent completing a task isn't interesting. Showing that 80% of focused users succeed but only 20% of distracted ones do, and mapping that to your actual user mix, is where the real insight lives.
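To put hypothetical numbers on that: if your actual audience is 70% focused and 30% distracted, the population-weighted success rate is 0.7 × 0.8 + 0.3 × 0.2 = 0.62, so a variant that looks fine for ideal users works for only about 62% of the audience you actually have.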
What's next for HOMC
- Generalize the runner to accept any URL with a JSON task config; right now it's scoped to the demo variants, but the observe-decide-act loop is fully site-agnostic (a hypothetical config sketch follows this list)
- LLM-generated friction summaries: plain-English explanations of why each persona type failed, not just that they did
- A session replay viewer, so you can watch the agent navigate step by step; the most compelling way to communicate failure modes to non-technical stakeholders
- CI/CD integration so teams can run a simulation check automatically before shipping a UX change
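As a sketch of where that task config could go, here is a hypothetical Zod schema for it (the project already uses Zod); every field name is an assumption, not a shipped format:

```ts
import { z } from "zod";

// Hypothetical shape for the planned JSON task config; all field names are
// assumptions, not a shipped schema.
const TaskConfig = z.object({
  url: z.string().url(),          // any target site, not just the demo variants
  tasks: z.array(
    z.object({
      id: z.string(),             // e.g. "find-pricing"
      goal: z.string(),           // natural-language goal handed to the agent
      successPattern: z.string(), // deterministic match string for the evaluator
    }),
  ),
  maxSteps: z.number().int().positive().default(20),
});

type TaskConfig = z.infer<typeof TaskConfig>;
```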