Inspiration

I'm a principal PM and the kind of gamer who signs up for pre-release builds before the studio's sure they want me there. Handed a month to build anything, I went toward games — engine-building and logic are what I love, balance is the systems problem I already chew on at work, and I'd been wanting an excuse to point Monte Carlo at something real. What I found wasn't a toy. It was the one part of making a game that still hasn't gotten easier.

A designer isn't really balancing numbers — they're balancing three things at once, and all three run straight through the economy. Fun: does earning and spending feel rewarding, or like a grind? Engagement: do players keep coming back? Revenue: does any of it pay for the game? Every faucet that hands out currency, every store that drains it, every wallet it pools in, every rate that converts one currency to another — each one tips all three. Give too much and players hoard, nothing feels worth buying, and the money dries up. Give too little and free players quit in frustration while the big spenders lose the crowd that made spending worth it. And the only place to find that line is a spreadsheet — hours of it, for a balance that won't survive the next update. I checked this with working game designers before I built anything; I wasn't inventing the problem.

And it isn't only a small-studio problem. Even Blizzard built a real-money auction house into Diablo III and shut it down less than two years later, conceding it "undermines Diablo's core game play: kill monsters to get cool loot." If a studio with that much firepower can ship an economy that breaks, the person who just wants to make something fun deserves a stress-test of their own, in a browser, before they commit.

What it does

Open Loot and an economy is already running — currency moving through the whole system: sources that mint it (logins, quest rewards), sinks that drain it (shops, upgrades), and the converters and pools in between, all balanced green. From there, four things change.

You prototype at the speed of inspiration: The double-gem weekend that used to mean four hours of spreadsheet tabs is now a sentence — "add a 3-day double-gem weekend with a 500-gem cap" — and a slider you can push to five days, to a bigger cap, to whatever the lead asks next, in seconds.

You see the impact the instant you make it: Drag one slider and the change ripples outward node by node (flows thicken, a downstream system flips amber, then red) while three readouts move with it: Source/Sink (player spend against currency created), Revenue, and player growth. Behind every slider release sit a thousand scenarios across a thousand players and ninety days, so the colors come from simulated outcomes, not a guess, and you're stewarding the economy's health for the long game instead of finding the damage a patch later.

You're never stuck at red: When a change breaks something, Loot hands you a math-checked fix — the exact lever, bisected against the engine to the value that lands genuinely green, confirmed before you touch it.

You never need to know the math: No formulas, no modeling tool, no Monte Carlo vocabulary — you describe what you want the way you'd say it to a teammate, and read a story instead of a spreadsheet. The same engine balances a SaaS free-tier plan or a loyalty program, because anywhere users earn and spend, one change cascades through everything else.

How we built it

Two demands fight each other: a designer needs feedback the instant they move a slider, and an answer deep enough to trust. Loot runs at two speeds to give them both. Every slider drag fires a deterministic preview in the browser — pure math, under 16 milliseconds, so the graph responds while your hand is still moving. Every time you let go, a 1,000-run Monte Carlo fires on a single AWS Lambda — about 1.7 seconds for the full distribution.

DRAG     →  browser preview      →  < 16 ms   instant, deterministic
RELEASE  →  Lambda Monte Carlo   →  ~1.7 s    1,000 runs × 1,000 players × 90 days
                                                  │
              both must agree on health color at days 30 / 60 / 90
                          — or the build breaks
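
The two speeds can be illustrated with a small Python sketch. This is not Loot's engine: the model (a flat daily faucet and a single sink that each player buys with a fixed daily probability), every function and parameter name, and the scaled-down run counts are assumptions chosen so the example runs quickly with only the standard library.

```python
import random

def preview_balance(faucet, sink_price, buy_prob, days):
    """Fast path: expected balance per player, computed in closed form."""
    return days * (faucet - sink_price * buy_prob)

def monte_carlo_balance(faucet, sink_price, buy_prob, days,
                        runs=200, players=50, seed=0):
    """Slow path: simulate every player on every day; one average per run.

    Simplification (assumed): no affordability check, so a balance may dip
    below zero. That keeps the expected value identical to the preview.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        total = 0.0
        for _ in range(players):
            balance = 0.0
            for _ in range(days):
                balance += faucet
                if rng.random() < buy_prob:
                    balance -= sink_price
            total += balance
        results.append(total / players)
    return results

fast = preview_balance(10, 100, 0.04, 90)
slow = monte_carlo_balance(10, 100, 0.04, 90)
slow_mean = sum(slow) / len(slow)
spread = max(slow) - min(slow)  # the distribution the preview cannot show
```

The preview returns one number immediately, while the simulation returns a list of outcomes whose mean should land near the preview and whose spread is the extra information the slow path buys.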

The two implementations, TypeScript in the browser and NumPy on the server, are held to one truth by parity fixtures and golden files: they have to agree on the health color at days 30, 60, and 90, or the build fails. That contract is why you can trust the colors.

Plain-English prompts check 42 cached intents first and only fall through to Bedrock (Claude Haiku 4.5) when nothing matches, because the common questions should be instant and free, not a model call.

689 tests guard all of it, 476 in the frontend and 213 in the backend, from the underlying math to the colors on screen.

I'm a principal PM; I don't write code. I built Loot in eight days with Claude Code as the build partner and one source-of-truth file (every architectural decision, every "do not," every locked contract) holding the whole thing coherent. Novus is wired into the production app, tracking every drag, cascade, prompt, and template switch.
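
The parity contract (both engines must agree on the health color at days 30, 60, and 90) might look something like the following check. The color thresholds, the ratio inputs, and all names here are hypothetical; the source does not say how Loot maps numbers to colors.

```python
CHECKPOINT_DAYS = (30, 60, 90)

def health_color(ratio):
    """Map a source/sink ratio to a color. Thresholds are assumed, not Loot's."""
    deviation = abs(ratio - 1.0)
    if deviation <= 0.10:
        return "green"
    if deviation <= 0.25:
        return "amber"
    return "red"

def parity_mismatches(preview_ratios, monte_carlo_ratios):
    """Return (day, preview_color, monte_carlo_color) wherever the engines disagree.

    Each argument maps a day number to that engine's ratio on that day.
    """
    mismatches = []
    for day in CHECKPOINT_DAYS:
        preview_color = health_color(preview_ratios[day])
        server_color = health_color(monte_carlo_ratios[day])
        if preview_color != server_color:
            mismatches.append((day, preview_color, server_color))
    return mismatches

# Illustrative numbers only: the engines agree at days 30 and 90, not at day 60.
preview = {30: 1.02, 60: 1.08, 90: 1.20}
server = {30: 1.04, 60: 1.12, 90: 1.22}
problems = parity_mismatches(preview, server)
```

A test suite in this style would fail the build whenever `problems` is non-empty, which is the behavior the contract describes.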

Challenges we ran into

The math was genuinely hard, and I got it wrong before I got it right: Put 100 gems in a day and take 100 out, and you'd expect balance. Loot showed deep red — it looked wildly broken. It wasn't: the faucet ran every single day, but players only spend when they choose to, and most don't spend most days, so a sink "worth" 100 actually drained about four a day. The engine was right; my mental model of how players behave was wrong. Getting an economy to behave like a real one took a few iterations to wrap my head around — I had to rebuild how sinks work so spending tracked real player habits, not a flat daily number.
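
The arithmetic behind "about four a day" can be written out in a few lines. The 4% daily purchase probability is an assumption picked to reproduce the figure in the text, not a measured player statistic, and the function names are illustrative.

```python
def expected_daily_drain(sink_price, daily_buy_probability):
    """Average currency a sink removes per player per day."""
    return sink_price * daily_buy_probability

def net_daily_flow(faucet_per_day, sink_price, daily_buy_probability):
    """Currency created minus currency expected to be spent, per player per day."""
    return faucet_per_day - expected_daily_drain(sink_price, daily_buy_probability)

# A faucet of 100 a day against a sink priced at 100 that is bought on 4% of days.
drain = expected_daily_drain(100, 0.04)  # roughly 4 per day, not 100
net = net_daily_flow(100, 100, 0.04)     # roughly 96 per day of surplus
```

Under that assumption the wallet grows by about 96 gems a day, which is why a setup that looks balanced on paper reads as deep red.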

The slider made the health score move the wrong way: I'd drag the earn rate up — clearly making the economy more inflationary — and the danger score would improve. Worse should never look better. The cause was simpler than it seemed: I was measuring health over too short a window. Players binge-spend in cycles, so a short window caught an inconsistent number of those spikes and the score bounced around for no real reason. Stretching the window exposed what the short one had been hiding — inflation looked fine early on but was plainly out of control by day 90. Measure over too short a horizon and a slow leak looks like calm water.
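
The window effect is easy to reproduce with a toy model. In this sketch players spend only in a spike every seventh day; the faucet size, spike size, and cycle length are invented for illustration, and it shows only the noise half of the problem (how many spikes a window happens to catch), not the long-run inflation.

```python
def source_sink_ratio(faucet_per_day, binge_size, cycle_days, start_day, window_days):
    """Currency created divided by currency spent over one measurement window.

    Spending happens only on days divisible by cycle_days (assumed pattern).
    """
    created = faucet_per_day * window_days
    spikes = sum(1 for day in range(start_day, start_day + window_days)
                 if day % cycle_days == 0)
    spent = binge_size * spikes
    return created / spent if spent else float("inf")

def ratio_spread(window_days, faucet_per_day=10, binge_size=70, cycle_days=7):
    """Lowest and highest ratio across every possible window start offset."""
    ratios = [
        source_sink_ratio(faucet_per_day, binge_size, cycle_days, start, window_days)
        for start in range(cycle_days)
    ]
    return min(ratios), max(ratios)

short_low, short_high = ratio_spread(10)  # a 10-day window catches 1 or 2 spikes
long_low, long_high = ratio_spread(90)    # a 90-day window catches 12 or 13
```

With a 10-day window the ratio swings between roughly 0.71 and 1.43 depending only on where the window starts; at 90 days the same economy reads between roughly 0.99 and 1.07.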

And my own tests lied to me: The graph's connections disappeared (fifteen of them sitting in the page's data, none of them actually drawn) and my automated test suite swore everything rendered perfectly the whole time. Every green checkmark was a false positive, and the bug would not reproduce in the environment where the tests ran. A quick probe of the live page found it in seconds: the graph library was quietly dropping the measurements each connection needs to know where to attach, so the lines were computed and then never placed. One line of code fixed it. The real lesson cost more: stop trusting the test that's convenient and go look at the thing the user actually sees.

Accomplishments that we're proud of

A free, public app with an AI behind it is how you wake up to a four-figure bill — so the first thing I built once the model worked was the switch that turns it off. One toggle cuts all model spending in under a minute, and beneath it sit four layers of fallback: with the entire backend dead, the economy still runs, the sliders still move, the templates still load, and the everyday prompts still work from a built-in library. For a solo project, that's the thing I'm proudest of — an AI tool anyone can open that can't bankrupt me and won't fully go dark.
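
One way to picture the kill switch and the fallback order is a resolver that tries the built-in library first, calls the model only when the toggle allows it, and degrades quietly otherwise. This is a sketch under assumptions: the cache entry, the `model_enabled` flag, and the injected `call_model` function are hypothetical stand-ins, and it shows three steps rather than the four layers the text mentions.

```python
# Hypothetical cache entry; the real library is described as 42 cached intents.
CACHED_INTENTS = {
    "double gem weekend": {"action": "add_event", "multiplier": 2, "days": 3},
}

def normalize(prompt):
    """Lowercase and collapse whitespace so near-identical prompts share a key."""
    return " ".join(prompt.lower().split())

def resolve_prompt(prompt, model_enabled, call_model):
    """Return (source, result), trying the cheapest option first."""
    key = normalize(prompt)
    if key in CACHED_INTENTS:
        return "cache", CACHED_INTENTS[key]
    if model_enabled:
        try:
            return "model", call_model(prompt)
        except Exception:
            pass  # backend unreachable: fall through to the offline answer
    return "offline", None

def broken_model(prompt):
    """Stand-in for a model call while the backend is down."""
    raise RuntimeError("backend unavailable")

hit = resolve_prompt("Double  Gem Weekend", model_enabled=False, call_model=broken_model)
switched_off = resolve_prompt("rebalance my shop", model_enabled=False, call_model=broken_model)
backend_dead = resolve_prompt("rebalance my shop", model_enabled=True, call_model=broken_model)
```

Passing the model call in as a parameter keeps the toggle in one place: with `model_enabled` set to false, no code path can reach the paid call.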

Loot doesn't just tell you what broke — it proves how to fix it. When the economy turns red, the suggested fix isn't a guess: Loot runs the engine on the proposed change and finds the exact setting that lands the system genuinely back in green, confirmed before you ever see it, fast enough to feel instant. When one type of suggestion turned out to be quietly doing nothing, I rebuilt the mechanism behind it instead of shipping a fix that didn't work. Honest fixes over plentiful ones.
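
The source describes the fix as bisected against the engine; a minimal Python sketch of that idea follows. It assumes the lever has one known-red value and one known-green value with a single flip between them, and the `is_green_for` rule below is a made-up stand-in for a real engine run.

```python
def bisect_to_green(is_green, red_value, green_value, tolerance=0.01):
    """Find the setting nearest red_value that the engine still confirms green.

    Assumes is_green(red_value) is False, is_green(green_value) is True, and
    health flips exactly once between the two.
    """
    if is_green(red_value) or not is_green(green_value):
        raise ValueError("need a red start and a green end to bracket the fix")
    red, green = red_value, green_value
    while abs(green - red) > tolerance:
        mid = (red + green) / 2
        if is_green(mid):
            green = mid  # mid is safe: move the green bound toward red
        else:
            red = mid    # mid is still broken: move the red bound toward green
    return green  # every value stored in `green` passed the check

def is_green_for(earn_rate):
    """Hypothetical health rule: green while earn rate stays within 10% of a drain of 4."""
    return earn_rate / 4.0 <= 1.10

suggested = bisect_to_green(is_green_for, red_value=10.0, green_value=2.0)
```

Returning the green bound rather than the midpoint means the suggestion has always been checked by the engine itself, which matches the "confirmed before you ever see it" claim.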

689 tests. Eight days. One principal PM who doesn't write code. 476 tests in the frontend and 213 in the backend hold everything from the underlying math to the colors on screen. And it's real right now: a stranger lands on the URL with no signup, an economy already running the moment the page paints, and Novus tracking every drag, every cascade, every prompt. Not a demo — a product.

What we learned

The engine was more honest than my intuition. Over and over, I'd form a confident picture of how the economy should behave, the math would disagree, and the math was right. I once wrote a whole redesign plan on three assumptions about how the engine worked; tests proved all three wrong before I shipped a line of it. The discipline that mattered most wasn't cleverness — it was measuring instead of guessing.

The simple fix usually hid behind the heroic one. When a template was unbalanced, my instinct was to add new pieces — fresh sources and sinks to push it back to center. They did nothing: the engine only counts a currency as healthy if it has both a faucet and a drain, so my additions were ignored. The real fix was three small number changes. Measure before you rebuild.

Claude Code is a faster typist than I'll ever be — but it is not, by itself, a product manager. Building stopped being the hard part. The hard part moved upstream: deciding what to build, and knowing when it was actually right. The single most valuable thing I made wasn't a feature — it was one document of decisions and hard rules that kept eight days of fast building from drifting into a mess.

When the design and the math genuinely disagree, name it — don't fake it. A few times, what I wanted the product to do and what the engine could honestly show were in conflict. The tempting move is to fudge the colors green. I chose to surface the tradeoff with real numbers and decide it on purpose. A tool people trust has to be willing to tell them something they don't want to hear.

What's next for Loot

Compare two economies side by side — your live build next to the change you're considering, same players, same ninety days, with only the differences lit up.

Push to 100,000 scenarios when you need the rare events, not just the shape — the once-a-season edge case that quietly breaks a whale's week.

Export the balanced economy as a config your game can read, so the thing you stress-tested is the thing you ship.

Connect it to your real players: feed in live behavior, tune the model to how your players actually spend, and get warned when next week's update will misbehave for your economy — not the textbook one.
