Inspiration

Modern LLMs are great at testing code. AI is far better than humans at remembering edge cases and grinding through repetitive tasks. But one aspect of agent-based testing isn't as seamless: user experience.

That's because, without stated intent, agents are forced to infer it from the code architecture, the very thing they are testing. When faced with a decision involving user experience, agents take the full context of the app into account, something human users rarely do.

What it does

“Flamboyance finds where your UI lies to users.”

I built Flamboyance, a distributed, perspective-based UX tester. On a query from the CLI or via MCP, Flamboyance spins up multiple subagents, each with its own tailored personality, to crawl the website. Each agent is given a deliberately limited context window, some operating with heuristic state logic and some with vLLM-backed decision trees, all to simulate a "beta launch" for UX.

By limiting each subagent's context according to its personality factors, we can more accurately simulate real users, and thus surface human frustration points.
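As a minimal sketch of that context-limiting idea (the `Persona` class, its fields, and the element attributes below are all hypothetical names, not Flamboyance's actual API), a persona can act as a filter that drops the parts of a page its simulated user would realistically miss:

```python
from dataclasses import dataclass

# Hypothetical sketch: each subagent gets a persona that limits what it "sees".
@dataclass
class Persona:
    name: str
    patience: int              # max actions before the agent gives up
    reads_fine_print: bool     # whether small text is visible to this user
    uses_keyboard_only: bool   # simulates keyboard-only navigation

    def filter_context(self, elements: list[dict]) -> list[dict]:
        """Drop page elements this persona would realistically miss."""
        visible = []
        for el in elements:
            if not self.reads_fine_print and el.get("font_px", 16) < 12:
                continue  # impatient users skip tiny text
            if self.uses_keyboard_only and not el.get("focusable", True):
                continue  # unreachable without a mouse
            visible.append(el)
        return visible

rushed = Persona("rushed-shopper", patience=5,
                 reads_fine_print=False, uses_keyboard_only=False)
page = [{"text": "Buy now", "font_px": 18},
        {"text": "Terms apply", "font_px": 9}]
print([el["text"] for el in rushed.filter_context(page)])  # ['Buy now']
```

A rushed shopper never sees the fine print, so any flow that depends on it becomes a surfaced frustration point rather than silently passing.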

Through MCP, Flamboyance can be chained with a prompt that fixes the surfaced frustration points. A prompt such as "run flamboyance, then fix user experiences" triggers a thorough agentic workflow, ultimately leading to a well-tested final product.
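To illustrate the tool surface an MCP client would call in that chained workflow (a hypothetical sketch with a canned example report, not Flamboyance's real schema), the key idea is that the tester returns machine-readable frustration points that the follow-up "fix" prompt can act on:

```python
import json

# Hypothetical sketch of an MCP-exposed tool: run the swarm, return
# frustration points as JSON for a downstream fixing agent to consume.
def run_flamboyance(url: str) -> str:
    # In the real tool this would launch the persona subagents against `url`;
    # here a canned report just shows the shape of the output.
    report = {
        "url": url,
        "frustrations": [
            {"persona": "rushed-shopper",
             "page": "/checkout",
             "issue": "coupon field hidden below the fold"},
        ],
    }
    return json.dumps(report)

issues = json.loads(run_flamboyance("http://localhost:3000"))["frustrations"]
print(len(issues))  # 1
```

Because the output is structured JSON rather than free text, the fixing agent can address each frustration point individually instead of re-parsing a narrative report.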

How we built it

Core language: Python
Browser automation: Playwright
LLM summarization: 3.8B model via Ollama
Visual LLM: LLaVA
Cloud inference: Groq

Challenges we ran into

Baseline models are already strong at UX fixes. I would classify this tool as successful if it finds even one bug that current frontier models can't detect. Simple prompts like "fix UX" handle all the obvious issues, so I was forced to design non-trivial, perception-based failure cases.

Accomplishments that we're proud of

Building a working multi-agent UX testing system from scratch. Starting from simple master-to-worker orchestration, and taking inspiration from the Apache Spark architecture, I created a pipeline where UX issues are detected, explained, and then automatically fixed. I was able to demonstrate cases where standard prompts fail to detect issues that Flamboyance surfaces.
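The master-to-worker fan-out can be sketched with standard-library concurrency (the persona names, `crawl_as` helper, and URL below are illustrative placeholders, not the actual Flamboyance internals):

```python
import concurrent.futures as cf

# Hypothetical sketch of the Spark-inspired master-to-worker orchestration:
# the master hands each persona a crawl task and collects the reports.
PERSONAS = ["rushed-shopper", "screen-reader-user", "first-time-visitor"]

def crawl_as(persona: str, url: str) -> dict:
    # Placeholder for the real Playwright crawl driven by this persona.
    return {"persona": persona, "url": url, "frustrations": []}

def run_swarm(url: str) -> list[dict]:
    """Fan out one crawl per persona and gather the results."""
    with cf.ThreadPoolExecutor(max_workers=len(PERSONAS)) as pool:
        futures = [pool.submit(crawl_as, p, url) for p in PERSONAS]
        return [f.result() for f in cf.as_completed(futures)]

reports = run_swarm("http://localhost:3000")
print(len(reports))  # 3
```

Keeping each persona in its own worker means one agent's crawl state never leaks into another's, which is exactly the context isolation the design depends on.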

What we learned

The larger issue with websites isn't functionality anymore; functional bugs can be fixed easily with coding agents. But more powerful agents aren't always better. By constraining what the agents could actually see, I improved the overall output. Constraints created realism.

What's next for Flamboyance

I aim to scale the agent swarm with better hardware for broader coverage. I also hope to replace local inference with more powerful hosted models for richer reasoning.

Built With

python, playwright, ollama, llava, groq
