Inspiration
Traditional A/B testing requires significant traffic, time, and resources, which puts small teams at a clear disadvantage. We wanted to flip that. What if you could simulate hundreds of real customer behaviors before going live and already know what to fix?
That's TestSim. Drop in a URL, and we deploy an army of AI browser agents, each one embodying a different type of real shopper, to crawl your store and tell you exactly where people drop off and what to test.
What it does
You paste an ecommerce URL into TestSim. We spin up 500 browser agents in the background, each one playing a different customer persona: budget hunters, impulse buyers, methodical researchers, mobile-first shoppers, first-time visitors, loyal returners, and more. Each agent actually browses your site: it clicks through categories, opens product pages, adds items to the cart, and tries to check out.
As each agent finishes, its findings stream live to the UI, and you watch the activity feed populate in real time. When all agents are done, Claude synthesizes everything into a ranked list of A/B test recommendations: specific hypotheses with variants, affected personas, and priority levels. You walk away knowing exactly what to test first.
How we built it
The core of the system is Browser Use, a library that wraps Playwright with an LLM agent loop (we run it with Claude). Each agent gets a task string: the persona's character description plus instructions to browse the site, report friction points, and output structured JSON. Browser Use handles all the actual browser automation: navigating pages, clicking elements, scrolling, and filling forms.
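As a rough illustration of how such a task string could be assembled, here is a minimal sketch. The function name `build_task`, the instruction wording, and the example persona text are all hypothetical, not our exact prompt; the JSON key names are the ones our agents report.

```python
def build_task(persona_description: str, url: str) -> str:
    """Combine a persona description with browsing instructions into one task string."""
    # Illustrative wording only; the real prompt may differ.
    instructions = (
        f"Browse the store at {url} as this shopper would. "
        "Click through categories, open product pages, add items to the cart, "
        "and try to check out. Note every friction point you hit. "
        "When you are done, output ONLY a JSON object with the keys "
        "journey_steps, converted, friction_points, drop_off_reason, "
        "and ab_test_suggestion."
    )
    return f"You are {persona_description}\n\n{instructions}"


task = build_task(
    "a budget hunter who compares prices and abandons carts over hidden fees.",
    "https://example.com",
)
```

The resulting string is what would be handed to the agent as its task; how it is passed to Browser Use depends on the library version, so that call is left out here.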
Each agent runs in a headless Chrome session and outputs a final JSON blob:
{
    "journey_steps": ["landed on homepage", "searched for shoes", "opened product page", ...],
    "converted": false,
    "friction_points": ["no size guide", "shipping cost hidden until checkout"],
    "drop_off_reason": "couldn't find return policy before committing",
    "ab_test_suggestion": "Add a visible return policy badge on product pages"
}
We extract this with a regex parser since LLM output isn't always perfectly clean.
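A minimal sketch of what that extraction step can look like is below. The helper name `extract_json` and the exact pattern are assumptions for illustration; the idea is simply to grab the `{...}` span out of whatever surrounds it and return `None` when nothing parses.

```python
import json
import re
from typing import Optional


def extract_json(raw: str) -> Optional[dict]:
    """Return the {...} block found in raw as a dict, or None if nothing parses."""
    # Greedy match from the first "{" to the last "}" so nested braces survive.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


fence = "`" * 3  # build a markdown fence without typing one literally
raw = (
    "Sure, here is my report:\n"
    + fence + "json\n"
    + '{"converted": false, "friction_points": ["no size guide"]}\n'
    + fence + "\nLet me know if you need more."
)
result = extract_json(raw)
```

One caveat with this pattern: if the text after the JSON contains another `}`, the greedy match grabs too much, parsing fails, and the function returns `None`, so the caller should treat `None` as "agent produced no usable report".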
Persona Distribution
We modeled 8 customer personas with weighted sampling to reflect real ecommerce traffic patterns:
| Persona | Weight | Rationale |
|---|---|---|
| Budget Hunter | 22% | Price is the #1 purchase driver |
| Impulse Buyer | 19% | ~40% of online purchases are impulse |
| Research Rachel | 17% | Common for mid/high-ticket items |
| Mobile Maya | 13% | Majority of traffic is mobile |
| First-Time Visitor | 12% | New customer acquisition is a major segment |
| Loyal Returner | 9% | Returning customers convert higher but are a smaller slice |
| Gift Giver | 5% | Situational — spikes around holidays |
| Convenience Seeker | 3% | Rare but high-intent |
We use random.choices with these weights to sample 500 personas per run, so the simulation distribution mirrors real traffic.
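The sampling step can be sketched roughly as follows. The weights come from the table above; the names `PERSONA_WEIGHTS` and `sample_personas` and the optional seed are assumptions added for illustration.

```python
import random
from collections import Counter

# Weights (in percent) taken from the persona table.
PERSONA_WEIGHTS = {
    "Budget Hunter": 22,
    "Impulse Buyer": 19,
    "Research Rachel": 17,
    "Mobile Maya": 13,
    "First-Time Visitor": 12,
    "Loyal Returner": 9,
    "Gift Giver": 5,
    "Convenience Seeker": 3,
}


def sample_personas(n=500, seed=None):
    """Draw n persona names with replacement, proportional to their weights."""
    rng = random.Random(seed)  # a seed makes a run reproducible (assumption)
    return rng.choices(
        list(PERSONA_WEIGHTS),
        weights=list(PERSONA_WEIGHTS.values()),
        k=n,
    )


personas = sample_personas(500, seed=42)
counts = Counter(personas)  # per-persona counts for this run
```

Because `random.choices` samples with replacement, the per-persona counts in any single run only approximate the target percentages.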
Concurrency Model
Running 500 browser agents is expensive, so we batch them 3 at a time with asyncio.gather and never have more than 3 Chrome instances open at once. Each batch completes before the next starts, which works out to 167 batches for 500 agents (166 full batches of 3 plus a final batch of 2).
# Run agents in batches of 3; each batch must finish before the next starts.
for i in range(0, len(personas), 3):
    batch = personas[i:i + 3]
    batch_results = await asyncio.gather(*[run_one(p) for p in batch])
    results.extend(batch_results)
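To try the batching pattern without launching any browsers, here is a self-contained sketch with a stub in place of the real agent. The stub `run_one`, the wrapper `run_in_batches`, and the `return_exceptions=True` handling are assumptions for illustration, not necessarily how our code handles failures.

```python
import asyncio
import math

BATCH_SIZE = 3


async def run_one(persona):
    """Stand-in for a real browser agent run (hypothetical stub)."""
    await asyncio.sleep(0)  # yield control, as a real agent does while waiting on the browser
    return {"persona": persona, "converted": False}


async def run_in_batches(personas):
    """Run agents BATCH_SIZE at a time; each batch finishes before the next starts."""
    results = []
    for i in range(0, len(personas), BATCH_SIZE):
        batch = personas[i:i + BATCH_SIZE]
        # return_exceptions=True keeps one failed agent from discarding its batch mates.
        batch_results = await asyncio.gather(
            *(run_one(p) for p in batch), return_exceptions=True
        )
        results.extend(r for r in batch_results if not isinstance(r, BaseException))
    return results


personas = [f"persona-{n}" for n in range(500)]
results = asyncio.run(run_in_batches(personas))
num_batches = math.ceil(len(personas) / BATCH_SIZE)  # 167 for 500 agents
```

The tradeoff of strict batching is that every batch waits on its slowest agent before the next one starts.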
Streaming with SSE
The backend streams results to the frontend via Server-Sent Events as each agent finishes. We use an asyncio.Queue as a bridge between the agent callbacks and the SSE generator. This means the UI updates live: you see each persona's result pop in as it comes back, not all at once at the end.
Event types: status, scraped, persona_start, persona_done, complete, error.
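The queue-to-SSE bridge can be sketched without any web framework, as below. The names `format_sse`, `sse_stream`, and `fake_agents` are hypothetical, and the payload shapes are made up for the demo; only the event names come from our list. In a real server, the async generator would be handed to the framework's streaming response.

```python
import asyncio
import json


def format_sse(event, data):
    """Serialize one event in the text/event-stream wire format."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def sse_stream(queue):
    """Yield SSE frames from the queue until a terminal event arrives."""
    while True:
        event, data = await queue.get()
        yield format_sse(event, data)
        if event in ("complete", "error"):
            break


async def fake_agents(queue):
    """Hypothetical producer standing in for the agent callbacks."""
    for name in ("Budget Hunter", "Impulse Buyer"):
        await queue.put(("persona_start", {"persona": name}))
        await queue.put(("persona_done", {"persona": name, "converted": False}))
    await queue.put(("complete", {"total": 2}))


async def demo():
    queue = asyncio.Queue()
    producer = asyncio.create_task(fake_agents(queue))
    frames = [frame async for frame in sse_stream(queue)]
    await producer
    return frames


frames = asyncio.run(demo())
```

Each frame is an `event:` line, a `data:` line, and a blank line, which is what the browser's `EventSource` expects; because the generator awaits the queue, a frame goes out as soon as a producer puts an event in.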
Synthesis with Claude
After all agents finish, we send all their findings to Claude in a single structured prompt. Claude reads every persona's journey, friction points, and drop-off reasons and returns a prioritized list of A/B tests: each with a title, hypothesis, variant to test, affected personas, and priority level. This is the "so what" layer that turns raw agent output into actionable recommendations.
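A minimal sketch of how the findings might be packed into one prompt is shown below. The function name `build_synthesis_prompt`, the prompt wording, and the output key spellings (`affected_personas`, and so on) are assumptions for illustration; the actual call to Claude is left out.

```python
import json


def build_synthesis_prompt(findings):
    """Pack every agent's findings into one prompt asking for ranked A/B tests."""
    # Illustrative wording only; the real prompt may differ.
    header = (
        "Below are findings from simulated shoppers, one JSON object per line. "
        "Return a JSON array of A/B test recommendations, highest priority first. "
        "Each item needs: title, hypothesis, variant, affected_personas, priority."
    )
    lines = [json.dumps(finding, sort_keys=True) for finding in findings]
    return header + "\n\n" + "\n".join(lines)


findings = [
    {
        "persona": "Budget Hunter",
        "converted": False,
        "friction_points": ["shipping cost hidden until checkout"],
        "drop_off_reason": "couldn't find return policy before committing",
    },
    {
        "persona": "Mobile Maya",
        "converted": True,
        "friction_points": ["no size guide"],
        "drop_off_reason": None,
    },
]
prompt = build_synthesis_prompt(findings)
```

Keeping one finding per line makes the prompt easy to inspect and to truncate if a run produces more findings than fit in a single request.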
Challenges we ran into
Getting Browser Use to output structured JSON reliably. The agent is a Claude loop that controls a real browser, so it doesn't always end cleanly with a JSON object. Sometimes it adds explanation text before or after, and sometimes it wraps the JSON in markdown code fences. We ended up using a regex search for the first {...} block in the output rather than trying to parse the whole string. Not elegant, but it works.
Accomplishments that we're proud of
We built a fully functional end-to-end platform within a limited timeframe, successfully simulating realistic user behavior using AI agents across multiple website variants. We were able to generate actionable insights without relying on live traffic, demonstrating that meaningful A/B testing can be done even at zero scale. We are especially proud of translating a complex concept like agent-based simulation into an intuitive product that non-technical users can easily interact with.
What we learned
Honestly, the biggest thing we learned is that LLMs are not deterministic output machines; they're conversational. Even when you tell the agent "output ONLY a JSON object, no other text," it will sometimes add a preamble, wrap the output in a code block, or just decide to explain itself. You have to build your parsing layer with that in mind from day one, not as an afterthought.
What's next for TestSim
Next, we plan to improve the realism and accuracy of our simulations by refining agent behavior and incorporating more diverse personas and datasets. We also want to expand TestSim to support more complex user flows across full websites, not just individual pages. On the product side, we aim to build integrations with common tools like analytics platforms and website builders to make adoption seamless for teams of all sizes.
Built With
- browseruse
- claude
- javascript
- python