GRPO My Vending Machine

Long-horizon agents fail economically before they fail intellectually. Capability isn't always the bottleneck; consistency and solvency are. The same model with the same seed can swing from 120 units sold to zero, so mean reward and bankruptcy rate capture what max reward hides.

Survival and profit pull in different directions. An agent can stay alive while under-pricing. The reward correctly optimizes for not dying first; margin is a second objective.

Thinking has a price, and models can learn the sweet spot. Compute drain creates natural pressure toward efficient action loops, with no separate penalty shaping needed.

Plain RL collapses; population search helps. The baseline wins the early steps (0 to 2); genome-conditioned agents pull ahead on mean and bankruptcy from step 3 on. Evolution over risk postures plus RL execution beats RL alone on consistency.

Evolution needs a story, not just a fitness number. Population grids (tiles dying, arrows from winners reseeding slots) make co-evolution legible to humans watching a training run.

Prompt honesty matters. Telling agents the true objective (survive first, every action costs money) changed behavior alongside reward design.
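To make "survive first, every action costs money" concrete, here is a minimal sketch of one way such a reward could be shaped. It is not the project's actual reward function: `EpisodeResult`, `charge_compute`, `survival_first_reward`, and the `bankruptcy_penalty` default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Outcome of one simulated vending episode (all fields are hypothetical)."""
    final_balance: float  # cash left after sales, restocking, and compute charges
    days_survived: int    # days completed before bankruptcy or the horizon
    horizon_days: int     # maximum episode length


def charge_compute(balance: float, tokens_used: int, cost_per_token: float) -> float:
    """Deduct the price of 'thinking' directly from the agent's cash balance."""
    return balance - tokens_used * cost_per_token


def survival_first_reward(result: EpisodeResult, bankruptcy_penalty: float = 100.0) -> float:
    """Reward survival first; profit only counts for agents that stay solvent."""
    survival_fraction = result.days_survived / result.horizon_days
    died_early = result.days_survived < result.horizon_days
    if result.final_balance <= 0 or died_early:
        # Bankrupt: a penalty that shrinks the longer the agent lasted.
        return -bankruptcy_penalty * (1.0 - survival_fraction)
    # Solvent: the balance already reflects every compute charge.
    return result.final_balance
```

Because compute is charged against the same balance that decides bankruptcy, this sketch needs no separate length or token penalty: an agent that over-thinks simply runs out of money sooner.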

Step means:

Step   Genome mean   Baseline mean    Delta
━━━━   ━━━━━━━━━━━   ━━━━━━━━━━━━━   ━━━━━━
0            -12.2           130.0   -142.2
1            -46.5            43.7    -90.2
2             27.5            80.0    -52.5
3            141.5           -93.1   +234.6
4             88.7            67.7    +21.0
5            134.8            18.7   +116.0
6            107.4            74.3    +33.2
7             42.5           -61.6   +104.2
8            141.1            42.2    +98.8
9            126.1           106.5    +19.6
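The crossover can be checked directly from the reported means. The short script below copies the two mean columns from the table, recomputes the delta, and finds the first step where the genome-conditioned mean exceeds the baseline. The variable names are illustrative. Because the inputs are already rounded to one decimal, a recomputed delta can differ from the reported one by 0.1.

```python
# Per-step means copied from the table above.
genome_mean = [-12.2, -46.5, 27.5, 141.5, 88.7, 134.8, 107.4, 42.5, 141.1, 126.1]
baseline_mean = [130.0, 43.7, 80.0, -93.1, 67.7, 18.7, 74.3, -61.6, 42.2, 106.5]

# Delta is genome minus baseline, so a positive value means the genome run leads.
deltas = [round(g - b, 1) for g, b in zip(genome_mean, baseline_mean)]
steps_genome_leads = [step for step, d in enumerate(deltas) if d > 0]
crossover_step = steps_genome_leads[0]

for step, (g, b, d) in enumerate(zip(genome_mean, baseline_mean, deltas)):
    print(f"{step:>4} {g:>8.1f} {b:>8.1f} {d:>+8.1f}")
print("first step where genome leads:", crossover_step)
print("genome leads on steps:", steps_genome_leads)
```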

What's next for GRPO My Vending Machine

Extend co-evolution runs: More steps and seeds; confirm the genome advantage holds under harder econ.
Live training → dashboard: Stream real rollouts from prime-rl over WebSocket so spectators watch genomes evolve in real time.
Full evolution leaderboard: Balance timeline, lifespan-aligned cohorts, 4×4 population grid with lineage arrows.
Harder econ: Turn up demand_scale, compute_cost, and bankruptcy pressure to test whether co-evolution buys robustness that pure GRPO doesn't.
Pricing curriculum: Staged training from "survive" toward "survive and price like the oracle."
Scale impact: If marginal agent profit exceeds marginal compute cost, the loop becomes a renewable source of training signal: agents keep operating, generating trajectories, and adapting under environment selection.
Publish the env: Package vending-bench-survival for the verifiers community as a standard survival + compute-cost benchmark.
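The 4×4 population grid with lineage arrows suggests a simple reseeding step: dead tiles are refilled with mutated copies of the fittest survivors, and each new tile remembers its parent slot. The sketch below shows one way that could look. It is not the project's implementation: the `Genome` fields, the `reseed` function, and the mutation scale `sigma` are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Genome:
    """Hypothetical risk posture carried by one grid slot."""
    price_markup: float    # how far above cost the agent prices
    restock_buffer: float  # share of cash kept in reserve
    parent_slot: int = -1  # lineage pointer; -1 marks a founder


def reseed(population, fitness, alive, rng, n_winners=4, sigma=0.05):
    """Replace dead slots with mutated copies of the fittest survivors."""
    survivors = [i for i, ok in enumerate(alive) if ok]
    if not survivors:
        raise ValueError("no survivors to reseed from")
    winners = sorted(survivors, key=lambda i: fitness[i], reverse=True)[:n_winners]
    next_gen = list(population)  # survivors keep their slots unchanged
    for slot, ok in enumerate(alive):
        if ok:
            continue
        parent = rng.choice(winners)
        src = population[parent]
        next_gen[slot] = Genome(
            price_markup=max(0.0, src.price_markup + rng.gauss(0.0, sigma)),
            restock_buffer=min(1.0, max(0.0, src.restock_buffer + rng.gauss(0.0, sigma))),
            parent_slot=parent,  # the "arrow" from a winner to the reseeded tile
        )
    return next_gen


# Usage on a 16-slot (4x4) grid with made-up fitness values.
rng = random.Random(0)
population = [Genome(price_markup=0.3, restock_buffer=0.2) for _ in range(16)]
fitness = [float(i) for i in range(16)]
alive = [i % 4 != 0 for i in range(16)]  # slots 0, 4, 8, 12 went bankrupt
next_gen = reseed(population, fitness, alive, rng)
lineage_arrows = [(g.parent_slot, slot) for slot, g in enumerate(next_gen) if g.parent_slot >= 0]
```

Storing `parent_slot` on each genome is what makes the lineage arrows cheap to draw: a dashboard only needs the `(parent, child)` pairs in `lineage_arrows`.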

Built With

python

Updates

Sanat Mouli started this project — Jun 20, 2026 07:55 PM EDT
