Inspiration

A/B testing is the gold standard for data-driven decision-making, yet the workflow remains fragmented and technically demanding. I saw product managers struggling to design statistically rigorous tests, engineers burdened with implementing tracking code, and analysts bottlenecked by manual reporting. I wanted to democratize experimentation. My inspiration came from a simple question: What if an AI agent could act as my dedicated Data Scientist, Product Strategist, and Engineer all at once? This vision drove me to build a system that doesn't just "run" tests but understands business goals and the mathematics behind them.

What it does

AB Alchemy is an autonomous agent that automates the entire A/B testing lifecycle. It acts as a comprehensive experimentation team in a box:

  • Strategist: It analyzes business goals to generate data-driven hypotheses.
  • Statistician: It designs rigorous experiments, calculating sample sizes and performing power analysis (power 1 − β = 0.8, significance level α = 0.05) to ensure validity.
  • Analyst: It interprets complex results, translating raw data into actionable business insights in plain English.
  • Simulator: It allows users to "test their tests" by generating realistic synthetic user data before going live.

How I built it

I engineered AB Alchemy as a modular agentic system powered by Google's Gemini 3 Pro.
  • The Brain: I utilized Gemini 3 Pro for its exceptional speed and reasoning capabilities, assigning it distinct "personas" to handle different stages of the testing lifecycle.
  • The Engine: I built a robust simulation engine using Python and Pandas (data_simulator.py) that generates realistic user behavior, including seasonality and time-of-day patterns.
  • The Interface: I used Streamlit to create a clean, interactive dashboard that guides the user from ideation to analysis.
  • Visualization: I integrated Plotly to render interactive charts for conversion rates, confidence intervals, and funnel analysis.

Challenges I ran into
  • Structured Output from LLMs: Getting the LLM to consistently return valid JSON for application logic while maintaining creativity for hypothesis generation was difficult. I solved this by implementing robust JSON cleaning and validation layers.
  • Statistical Rigor vs. Hallucination: LLMs excel at text but can struggle with precise calculations. I mitigated this by using the LLM to design the test parameters, but delegating the actual math (p-values, confidence intervals) to standard Python libraries like scipy and statsmodels.
  • Simulation Realism: Creating a dummy data generator that felt "real" was a complex task. I had to implement intricate logic for day-of-week trends and distinct user segments to ensure the analysis dashboard looked authentic.

Accomplishments that I'm proud of

I am particularly proud of the Simulation Engine. It doesn't just spit out random numbers; it models user behavior with seasonality and segment-specific conversion rates, making the "test drive" experience feel authentic. I'm also proud of the Latency Optimization: by leveraging Gemini 3 Pro, I achieved a near-instant response time for hypothesis generation, making the tool feel like a real-time collaborator rather than a slow background process.

What I learned
  • Agents need "Guardrails": I learned that giving an AI agent a specific persona (e.g., "You are a PhD Statistician") significantly improves the quality and specificity of its output compared to generic prompts.
  • The "Cold Start" Problem: Synthetic data is incredibly powerful for testing agentic workflows. By simulating data, I could iterate on my analysis prompts much faster than if I had waited for real-world traffic.
  • The Power of Speed: The low latency of the model was critical. Users expect instant feedback, and optimizing for speed transformed the user experience.

What's next for AB Alchemy
  • Live API Integrations: I plan to connect the agent directly to Google Analytics 4 (GA4) and Mixpanel for real-time analysis of live data.
  • Bayesian Optimization: I want to implement multi-armed bandit algorithms for dynamic traffic allocation to maximize conversions during the test itself.
  • Visual Editor: My ultimate goal is to build a visual editor that allows the agent to generate and inject the HTML/CSS for test variants directly into the user's application.
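
To make the multi-armed bandit idea above concrete, here is a minimal sketch of dynamic traffic allocation using Thompson sampling with Beta(1, 1) priors. This is only one possible approach and is not yet part of AB Alchemy; the function name, the conversion rates, and the visitor count are hypothetical values chosen for illustration.

```python
import random


def thompson_allocate(true_rates, n_visitors, seed=0):
    """Simulate Thompson sampling over variants with the given conversion rates.

    true_rates is a hypothetical list of per-variant conversion probabilities;
    in a live test these would be unknown and only the outcomes observed.
    Returns (successes, failures) counts per variant.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k

    for _ in range(n_visitors):
        # Draw one plausible conversion rate per variant from its Beta posterior.
        draws = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)]
        # Send this visitor to the variant with the highest sampled rate.
        arm = max(range(k), key=lambda i: draws[i])
        # Simulate whether the visitor converts, then update the posterior counts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures


if __name__ == "__main__":
    wins, losses = thompson_allocate([0.05, 0.20], n_visitors=5000)
    traffic = [w + l for w, l in zip(wins, losses)]
    print("visitors per variant:", traffic)
```

Because each visitor is routed by a fresh posterior draw, traffic shifts toward the stronger variant as evidence accumulates, while weaker variants still receive occasional exploration.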

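As an illustration of the Statistician role and of delegating the math to deterministic code rather than the LLM, here is a rough sketch of a sample-size calculation and a two-proportion z-test at α = 0.05 and power 0.8. The project itself uses scipy and statsmodels; this sketch swaps in Python's standard-library NormalDist to stay dependency-free, and the function names and example rates are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def sample_size_per_variant(p_control, p_variant, alpha=0.05, power=0.8):
    """Visitors needed in each variant for a two-sided two-proportion z-test."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = _STD_NORMAL.inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_variant) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(
            p_control * (1 - p_control) + p_variant * (1 - p_variant)
        )
    ) ** 2
    return math.ceil(numerator / (p_variant - p_control) ** 2)


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    Assumes the pooled rate is strictly between 0 and 1, so the standard
    error is non-zero.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical scenario: 10% baseline conversion, hoping to detect 12%.
    print("visitors per variant:", sample_size_per_variant(0.10, 0.12))
    print("p-value:", two_proportion_p_value(100, 1000, 150, 1000))
```

With this split, the LLM only proposes inputs such as the baseline rate and the minimum detectable effect, and the numbers shown to the user come from code that can be checked.
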