Inspiration
Prediction markets aggregate information astonishingly well, which makes them brutal to beat with naive sentiment. We wanted to test whether a calibrated LLM, sized with proper risk math, could find real edge — and whether we could do it on a $0 LLM budget.
What it does
A long-lived paper-trading bot that competes on Prophet Arena's 15-minute-tick prediction-market benchmark over a $10,000 starting bankroll. Each tick it:
- Claims the tick lease and pulls the candidate market universe.
- Forecasts every market that survives our filters with Groq's llama-3.3-70b-versatile - primary call plus a contrarian audit on high-divergence picks.
- Combines the two estimates via the geometric mean of odds, then applies post-hoc calibration to dampen overconfidence.
- Sizes any positive-edge trade with a 0.25× fractional Kelly bet, bounded by per-market and gross-exposure caps.
- Re-evaluates held positions every tick - exits when edge collapses, flips when our view reverses.
Scored on the combined rank of Sharpe + PnL.
How we built it
bot.py is a single self-contained Python file with four layers:
- Forecaster - wraps Groq's OpenAI-compatible chat completions API (response_format=json_object, temperature=0.3, max_tokens=200). Strict JSON schema means tiny output tokens and predictable parsing.
- Pricing helpers - converts a MarketQuote into BUY/SELL fill prices honouring Prophet Arena's execution semantics (BUY YES @ best_ask, BUY NO @ 1 − best_bid, etc.).
- Portfolio view - folds the live PortfolioResponse into a mutable working state and is updated after every decision so subsequent decisions in the same tick respect what we've already committed.
- Decisioning - re-evaluates held positions first, then scans new candidates ranked by '|mid − 0.5|' descending so the highest-asymmetry markets are analysed before our per-tick LLM-call budget bites.
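As a rough illustration of the pricing and ranking layers above, here is a minimal sketch. The Quote class is a hypothetical stand-in for MarketQuote, and the two SELL rows are assumed by symmetry, since only the two BUY cases are spelled out above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Hypothetical stand-in for MarketQuote; both prices refer to the YES contract."""
    market_id: str
    best_bid: float
    best_ask: float

    @property
    def mid(self) -> float:
        return (self.best_bid + self.best_ask) / 2


def fill_price(quote: Quote, action: str, side: str) -> float:
    """Price paid (BUY) or received (SELL) per contract on the given side."""
    table = {
        ("BUY", "YES"): quote.best_ask,
        ("BUY", "NO"): 1.0 - quote.best_bid,
        # Assumption: SELL prices mirror the BUY cases; not stated in the write-up.
        ("SELL", "YES"): quote.best_bid,
        ("SELL", "NO"): 1.0 - quote.best_ask,
    }
    return table[(action, side)]


def rank_candidates(quotes: list[Quote]) -> list[Quote]:
    """Order candidates by |mid - 0.5| descending (most asymmetric first)."""
    return sorted(quotes, key=lambda q: abs(q.mid - 0.5), reverse=True)
```

With a 0.30 bid and a 0.34 ask, buying NO costs roughly 0.70 per contract rather than 1 minus the ask, so the spread is always paid by the taker.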
Tick loop: claim_tick → load_candidates → get_portfolio → re-evaluate held → scan new → put_plan → submit_intents → finalize → complete_tick.
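The within-tick bookkeeping done by the portfolio view can be sketched roughly as follows. WorkingPortfolio and its method names are hypothetical, not the names used in bot.py; the cap values are the $1,000 per-market and $10,000 gross-exposure limits mentioned in this write-up.

```python
from dataclasses import dataclass, field

PER_MARKET_CAP = 1_000.0   # per-market cap from the write-up
GROSS_CAP = 10_000.0       # gross-exposure cap from the write-up


@dataclass
class WorkingPortfolio:
    """Hypothetical mutable view, seeded from the portfolio snapshot at tick start."""
    cash: float
    exposure: dict[str, float] = field(default_factory=dict)

    def room(self, market_id: str) -> float:
        """Dollars still available for this market given everything committed so far."""
        market_room = PER_MARKET_CAP - self.exposure.get(market_id, 0.0)
        gross_room = GROSS_CAP - sum(self.exposure.values())
        return max(0.0, min(market_room, gross_room, self.cash))

    def commit_buy(self, market_id: str, dollars: float) -> float:
        """Record a buy, clipped to the remaining room, and return the amount spent."""
        spend = min(dollars, self.room(market_id))
        self.cash -= spend
        self.exposure[market_id] = self.exposure.get(market_id, 0.0) + spend
        return spend
```

Because each commit mutates the view, a second decision in the same tick sees the reduced cash and exposure instead of the tick-start snapshot.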
Six key design decisions
- Ensemble forecasting. Primary call + a contrarian second call on markets where the primary diverges from the market mid by > 0.10. Combined via geometric mean of YES odds - preserves the prior when the two agree, cancels overconfidence when they don't.
- Kelly sizing with a 0.25× scale. 'kelly = edge / (p × (1 − p))'; dollar amount = 'kelly × cash × 0.25'; clipped by the per-market $1,000 cap, the $10,000 gross-exposure cap, and remaining cash.
- Calibration before sizing. 'calibrated = 0.85·raw + 0.075', compressing '[0, 1]' to '[0.075, 0.925]'. Bites hardest exactly where naive sizing would otherwise be most dangerous.
- Selective trading. Pre-LLM skip when 'best_ask ∈ [0.40, 0.60]', and also when 'best_ask < 0.05' or 'best_ask > 0.95' (tail markets where linear calibration creates phantom edge). Post-LLM, only trades when '|edge| > 0.10'.
- Active position management. Each tick, we re-forecast every held market. If the edge has flipped sign, we SELL the held side and BUY the new side as a separate intent in the same tick, since Prophet Arena does not auto-flip.
- Cost-aware. Hard cap on LLM calls per tick (20); short prompts (<200 system tokens); JSON-only outputs; token usage logged per tick. Total LLM spend over the 14-day window: $0 (Groq free tier).
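The numeric steps in the list above can be sketched as a handful of small functions. This is a sketch under assumptions, not the code in bot.py: the function names are hypothetical, the clamp epsilon is our own choice, and the formulas are transcribed as written above, including the Kelly expression, whose 'p' is not defined further in this write-up.

```python
import math

EPS = 1e-6  # assumption: clamp to keep odds finite at 0 and 1


def skip_before_llm(best_ask: float) -> bool:
    """Pre-LLM filter: skip near-coin-flip markets and the two tails."""
    return 0.40 <= best_ask <= 0.60 or best_ask < 0.05 or best_ask > 0.95


def needs_audit(primary: float, mid: float) -> bool:
    """Trigger the contrarian call when the primary diverges from mid by > 0.10."""
    return abs(primary - mid) > 0.10


def combine_odds(p1: float, p2: float) -> float:
    """Geometric mean of the two YES odds, mapped back to a probability."""
    p1 = min(max(p1, EPS), 1 - EPS)
    p2 = min(max(p2, EPS), 1 - EPS)
    odds = math.sqrt((p1 / (1 - p1)) * (p2 / (1 - p2)))
    return odds / (1 + odds)


def calibrate(raw: float) -> float:
    """Linear shrinkage toward 0.5: maps [0, 1] onto [0.075, 0.925]."""
    return 0.85 * raw + 0.075


def kelly_dollars(edge: float, p: float, cash: float,
                  market_room: float, gross_room: float,
                  scale: float = 0.25) -> float:
    """Fractional Kelly stake in dollars, clipped by the caps and remaining cash.

    `p` is used exactly as in the formula above; the write-up does not say
    whether it is the fill price or the forecast probability.
    """
    kelly = edge / (p * (1 - p))
    amount = kelly * cash * scale
    return max(0.0, min(amount, market_room, gross_room, cash))
```

Two forecasts that agree pass through combine_odds unchanged, while 0.80 and 0.20 cancel to 0.50. With edge 0.15, p = 0.5 and $10,000 cash, the raw stake of $1,500 is clipped to the $1,000 per-market cap.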
What's novel
- Two-stage ensemble combined via the geometric mean of odds - not the more common arithmetic mean, which is biased near extremes.
- Edge measured in the chosen side's price space. Most naive bots compute one edge in the YES space and reuse it for NO trades, which results in Kelly sizing being wrong. We re-frame to 'p_eff − fill_price' per side.
- Sequential within-tick decision accounting. The portfolio view is mutated as each decision is committed, so per-market and gross-exposure caps apply to the cumulative state inside a single tick, not the snapshot at tick-start.
- Tail-market filter discovered from live data. First three live ticks revealed that our linear calibration was creating a phantom ~0.075 edge on "Will Australia win the 2026 World Cup?"-style tail markets. We added a 'best_ask < 0.05 or > 0.95' filter without touching CONFIG_JSON, so the running experiment kept its config_hash and resumed cleanly.
- Free-tier-only. No paid LLM, no Kalshi key - entirely Groq + Prophet Arena. The bot can run on a $4 droplet or a free Oracle Cloud ARM VM with zero marginal cost.
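To make the per-side edge idea concrete, here is a minimal sketch. The function names are hypothetical, and it assumes p_eff is the calibrated YES probability for a YES trade and its complement for a NO trade, with BUY fill prices of best_ask for YES and 1 − best_bid for NO.

```python
from typing import Optional


def side_edges(p_yes: float, best_bid: float, best_ask: float) -> dict[str, float]:
    """Edge of buying each side, measured as p_eff - fill_price in that side's own price space."""
    return {
        "YES": p_yes - best_ask,                  # pay best_ask for YES
        "NO": (1.0 - p_yes) - (1.0 - best_bid),   # pay 1 - best_bid for NO
    }


def best_side(p_yes: float, best_bid: float, best_ask: float,
              threshold: float = 0.10) -> Optional[tuple[str, float]]:
    """Pick the side with the larger edge, or None if it does not clear the threshold."""
    edges = side_edges(p_yes, best_bid, best_ask)
    side = max(edges, key=edges.get)
    return (side, edges[side]) if edges[side] > threshold else None
```

For a forecast of 0.30 against a 0.48 bid and 0.52 ask, the NO edge is about 0.18, smaller than the 0.22 gap between the forecast and the YES ask, because the NO buyer fills against the bid.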
Challenges we ran into
- Slug placeholder fail. First live launch ran under the literal string eval_<your-handle> because the placeholder text was never replaced. Caught it from the leaderboard, killed three duplicate processes that were fighting over the same lease, and relaunched under eval_sravya.
- Calibration vs tails. Our linear shrinkage toward 0.5 made tail markets look tradeable when they weren't. Fixed mid-run with a hash-stable filter so the leaderboard entry stayed intact.
- config_hash stability discipline. Discovered the hard way that changing anything in CONFIG_JSON mid-run forks the experiment. Built every subsequent tweak (EDGE_THRESHOLD env, tail filter) as a non-CONFIG_JSON knob.
What we learned
- Prediction markets price tails very well - the LLM rarely beats the market on "Will X tiny-prob event happen?" questions.
- Linear calibration is dangerous at the extremes; log-odds shrinkage would handle tails more gracefully (future work).
- Robust structured logging (one JSON line per event) is worth the 20 minutes it takes to set up - every diagnostic in this submission came from 'tail -f bot.log | jq'.
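One-JSON-line-per-event logging of the kind described above needs very little code. The sketch below is illustrative only: log_event and the field names are hypothetical, not the ones used in bot.py.

```python
import io
import json
import sys
import time


def log_event(event: str, stream=sys.stdout, **fields) -> str:
    """Write one JSON object per line so the log can be piped straight into jq."""
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    stream.write(line + "\n")
    stream.flush()  # keep `tail -f` readers up to date
    return line


# Usage: write to an in-memory buffer here; a real bot would pass an open log file.
buf = io.StringIO()
log_event("llm_call", stream=buf, market_id="m1", prompt_tokens=150)
```

Keeping every record flat and self-describing is what makes filters such as jq 'select(.event == "llm_call")' practical.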
What's next
- Log-odds calibration to replace the linear formula.
- Multi-model ensemble (Groq + a Gemini free-tier model) to diversify forecasts.
- Position-level PnL attribution to learn which market families our edge actually exists in.
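One possible shape for the log-odds calibration is to shrink the logit toward zero instead of shrinking the probability toward 0.5. This is a sketch of the idea only: the shrink factor k = 0.85 is an arbitrary placeholder rather than a tuned value, and the function names are hypothetical.

```python
import math

EPS = 1e-6  # assumption: clamp so the logit stays finite at 0 and 1


def linear_calibrate(raw: float) -> float:
    """The current linear formula, for comparison."""
    return 0.85 * raw + 0.075


def logodds_calibrate(raw: float, k: float = 0.85) -> float:
    """Scale the log-odds by k (0 < k < 1), then map back to a probability."""
    raw = min(max(raw, EPS), 1 - EPS)
    logit = math.log(raw / (1 - raw))
    return 1.0 / (1.0 + math.exp(-k * logit))
```

Both versions leave 0.5 unchanged, but on a raw forecast of 0.01 the linear formula returns more than 0.08 while the log-odds version stays below 0.03, so it should not manufacture the same phantom edge on tail markets.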
Run slug
"eval_gradientprophets"