Edge Hunter - ensemble-Kelly prediction-market trader

Inspiration

Prediction markets aggregate information astonishingly well, which makes them brutal to beat with naive sentiment. We wanted to test whether a calibrated LLM, sized with proper risk math, could find real edge — and whether we could do it on a $0 LLM budget.

What it does

A long-lived paper-trading bot that competes on Prophet Arena's 15-minute-tick prediction-market benchmark with a $10,000 starting bankroll. Each tick it:

Claims the tick lease and pulls the candidate market universe.
Forecasts every market that survives our filters with Groq's llama-3.3-70b-versatile - primary call plus a contrarian audit on high-divergence picks.
Combines the two estimates via the geometric mean of odds, then applies post-hoc calibration to dampen overconfidence.
Sizes any positive-edge trade with a 0.25× fractional Kelly bet, bounded by per-market and gross-exposure caps.
Re-evaluates held positions every tick - exits when edge collapses, flips when our view reverses.

Scored on the combined rank of Sharpe + PnL.

How we built it

bot.py is a single self-contained Python file with four layers:

Forecaster - wraps Groq's OpenAI-compatible chat completions API (response_format=json_object, temperature=0.3, max_tokens=200). Strict JSON schema means tiny output tokens and predictable parsing.
Pricing helpers - converts a MarketQuote into BUY/SELL fill prices honouring Prophet Arena's execution semantics (BUY YES @ best_ask, BUY NO @ 1 − best_bid, etc.).
Portfolio view - folds the live PortfolioResponse into a mutable working state and is updated after every decision so subsequent decisions in the same tick respect what we've already committed.
Decisioning - re-evaluates held positions first, then scans new candidates ranked by |mid − 0.5| descending so the highest-asymmetry markets are analysed before our per-tick LLM-call budget bites.
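The pricing and ranking layers can be sketched roughly as follows. This is an illustration, not the code in bot.py: Quote is a hypothetical stand-in for MarketQuote, and its field names (best_bid and best_ask as YES-side quotes) are assumptions. The BUY rules follow the description above; the SELL rules are an assumed mirror image, since the write-up only says "etc.".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Hypothetical stand-in for MarketQuote; field names are assumptions."""
    market_id: str
    best_bid: float  # best YES bid, in [0, 1]
    best_ask: float  # best YES ask, in [0, 1]

    @property
    def mid(self) -> float:
        return (self.best_bid + self.best_ask) / 2


def fill_price(quote: Quote, action: str, side: str) -> float:
    """Price paid (BUY) or received (SELL) per share of the given side."""
    if side not in ("YES", "NO"):
        raise ValueError(f"unknown side: {side}")
    if action == "BUY":
        # Stated in the write-up: BUY YES @ best_ask, BUY NO @ 1 - best_bid.
        return quote.best_ask if side == "YES" else 1 - quote.best_bid
    if action == "SELL":
        # Assumed mirror image: SELL YES @ best_bid, SELL NO @ 1 - best_ask.
        return quote.best_bid if side == "YES" else 1 - quote.best_ask
    raise ValueError(f"unknown action: {action}")


def rank_candidates(quotes: list[Quote]) -> list[Quote]:
    """Order markets by |mid - 0.5| descending (most asymmetric first)."""
    return sorted(quotes, key=lambda q: abs(q.mid - 0.5), reverse=True)
```

Ranking before forecasting means that when the per-tick LLM-call budget runs out, the markets left unanalysed are the ones closest to a coin flip.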

Tick loop: claim_tick → load_candidates → get_portfolio → re-evaluate held → scan new → put_plan → submit_intents → finalize → complete_tick.

Six key design decisions

Ensemble forecasting. Primary call + a contrarian second call on markets where the primary diverges from the market mid by more than 0.10. Combined via geometric mean of YES odds - returns the shared estimate when the two calls agree and tempers the primary's overconfidence when they don't.
Kelly sizing with a 0.25× scale. 'kelly = edge / (p × (1 − p))'; dollar amount = 'kelly × cash × 0.25'; clipped by the per-market $1,000 cap, the $10,000 gross-exposure cap, and remaining cash.
Calibration before sizing. 'calibrated = 0.85·raw + 0.075', compressing '[0, 1]' to '[0.075, 0.925]'. Bites hardest exactly where naive sizing would otherwise be most dangerous.
Selective trading. Pre-LLM skip when 'best_ask ∈ [0.40, 0.60]', and also when 'best_ask < 0.05' or 'best_ask > 0.95' (tail markets where linear calibration creates phantom edge). Post-LLM, we only trade when '|edge| > 0.10'.
Active position management. Each tick, we re-forecast every held market. If the edge has flipped sign, we SELL the held side and BUY the new side as a separate intent in the same tick - Prophet Arena does not auto-flip.
Cost-aware. Hard cap on LLM calls per tick (20); short prompts (<200 system tokens); JSON-only outputs; token usage logged per tick. Total LLM spend over the 14-day window: $0 (Groq free tier).
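The forecasting-to-sizing path described in these decisions can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the code in bot.py: the function names, the eps clamp, and the market_exposure / gross_exposure parameters are hypothetical. The constants and formulas are taken from the list above. The sizing formula is reproduced as written; the textbook Kelly fraction for a binary contract bought at price c would be (p − c) / (1 − c).

```python
import math

EDGE_THRESHOLD = 0.10     # post-LLM minimum edge
KELLY_SCALE = 0.25        # fractional Kelly multiplier
PER_MARKET_CAP = 1_000.0  # dollars
GROSS_CAP = 10_000.0      # dollars


def combine_odds(p_primary: float, p_contrarian: float, eps: float = 1e-6) -> float:
    """Geometric mean of the two YES odds, mapped back to a probability.

    eps is an assumed clamp that keeps the odds finite at 0 or 1.
    """
    a = min(max(p_primary, eps), 1 - eps)
    b = min(max(p_contrarian, eps), 1 - eps)
    odds = math.sqrt((a / (1 - a)) * (b / (1 - b)))
    return odds / (1 + odds)


def calibrate(raw: float) -> float:
    """Linear shrinkage toward 0.5: maps [0, 1] onto [0.075, 0.925]."""
    return 0.85 * raw + 0.075


def skip_before_llm(best_ask: float) -> bool:
    """Pre-LLM filter: skip near-coin-flip markets and extreme tails."""
    return 0.40 <= best_ask <= 0.60 or best_ask < 0.05 or best_ask > 0.95


def kelly_dollars(p: float, fill_price: float, cash: float,
                  market_exposure: float = 0.0,
                  gross_exposure: float = 0.0) -> float:
    """Dollar size for buying one side whose win probability is p."""
    edge = p - fill_price
    if edge <= EDGE_THRESHOLD or not 0.0 < p < 1.0:
        return 0.0
    kelly = edge / (p * (1 - p))          # formula as given in the write-up
    dollars = kelly * cash * KELLY_SCALE
    # Clip by per-market cap, gross-exposure cap, and remaining cash.
    room = min(PER_MARKET_CAP - market_exposure,
               GROSS_CAP - gross_exposure,
               cash)
    return max(0.0, min(dollars, room))
```

For example, primary and contrarian forecasts of 0.80 and 0.60 combine to roughly 0.71 and calibrate to roughly 0.68. Against a fill price of 0.30 with $10,000 in cash, the raw fractional-Kelly amount exceeds the per-market cap, so the result is clipped to $1,000.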

What's novel

Two-stage ensemble combined via the geometric mean of odds - not the more common arithmetic mean, which is biased near extremes.
Edge measured in the chosen side's price space. Most naive bots compute one edge in YES space and reuse it for NO trades, which mis-sizes the Kelly bet. We re-frame the edge as p_eff − fill_price per side.
Sequential within-tick decision accounting. The portfolio view is mutated as each decision is committed, so per-market and gross-exposure caps apply to the cumulative state inside a single tick, not the snapshot at tick-start.
Tail-market filter discovered from live data. The first three live ticks revealed that our linear calibration was creating a phantom ~0.075 edge on tail markets in the style of "Will Australia win the 2026 World Cup?". We added a best_ask < 0.05 or > 0.95 filter without touching CONFIG_JSON, so the running experiment kept its config_hash and resumed cleanly.
Free-tier-only. No paid LLM, no Kalshi key — entirely Groq + Prophet Arena. The bot can run on a $4 droplet or a free Oracle Cloud ARM VM with zero marginal cost.
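Two of the points above, per-side edge and sequential within-tick accounting, lend themselves to a short sketch. The names here (side_edges, WorkingPortfolio, room, commit) are hypothetical, and p_eff is assumed to mean p for YES and 1 − p for NO; the cap values come from the design decisions. Treat it as an illustration of the idea rather than the bot's actual data structures.

```python
from dataclasses import dataclass, field


def side_edges(p_yes: float, best_bid: float, best_ask: float) -> dict[str, float]:
    """Edge per side as p_eff - fill_price, each in its own price space."""
    return {
        "YES": p_yes - best_ask,             # p_eff = p_yes, fill = best_ask
        "NO": (1 - p_yes) - (1 - best_bid),  # p_eff = 1 - p_yes, fill = 1 - best_bid
    }


@dataclass
class WorkingPortfolio:
    """Mutable within-tick view; caps apply to the cumulative state."""
    cash: float
    per_market_cap: float = 1_000.0
    gross_cap: float = 10_000.0
    exposure: dict[str, float] = field(default_factory=dict)

    def room(self, market_id: str) -> float:
        """Dollars still allowed in this market given what is committed so far."""
        held = self.exposure.get(market_id, 0.0)
        gross = sum(self.exposure.values())
        return max(0.0, min(self.per_market_cap - held,
                            self.gross_cap - gross,
                            self.cash))

    def commit(self, market_id: str, dollars: float) -> float:
        """Record a decision immediately so later decisions see it."""
        spend = min(dollars, self.room(market_id))
        self.exposure[market_id] = self.exposure.get(market_id, 0.0) + spend
        self.cash -= spend
        return spend
```

With a forecast of 0.30 against a 0.38 / 0.42 quote, the YES edge is −0.12 and the NO edge is +0.08. Negating the YES edge would instead give +0.12 for NO, overstating it by the bid-ask spread.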

Challenges we ran into

Slug placeholder fail. First live launch ran under the literal string eval_<your-handle> because the placeholder text was never replaced. Caught it from the leaderboard, killed three duplicate processes that were fighting over the same lease, and relaunched under eval_sravya.
Calibration vs tails. Our linear shrinkage toward 0.5 made tail markets look tradeable when they weren't. Fixed mid-run with a hash-stable filter so the leaderboard entry stayed intact.
config_hash stability discipline. Discovered the hard way that changing anything in CONFIG_JSON mid-run forks the experiment. Built every subsequent tweak (EDGE_THRESHOLD env, tail filter) as a non-CONFIG_JSON knob.

What we learned

Prediction markets price tails very well - the LLM rarely beats the market on "Will X tiny-prob event happen?" questions.
Linear calibration is dangerous at the extremes; log-odds shrinkage would handle tails more gracefully (future work).
Robust structured logging (one JSON line per event) is worth the 20 minutes it takes to set up - every diagnostic in this submission came from tail -f bot.log | jq.
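One way to get one-JSON-line-per-event logging with only the standard library is a custom logging.Formatter. The sketch below is an assumption about how such a setup could look, not the logger in bot.py; JsonLineFormatter, build_logger, the "fields" key, and the event names are all hypothetical.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": record.created,           # epoch seconds set by logging
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields are passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(handler: logging.Handler, name: str = "bot") -> logging.Logger:
    """Attach the JSON formatter to a handler and return a ready logger."""
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A call such as log = build_logger(logging.FileHandler("bot.log")) followed by log.info("tick_done", extra={"fields": {"llm_calls": 7}}) writes one line that jq can filter, for example with tail -f bot.log | jq 'select(.event == "tick_done")'.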

What's next

Log-odds calibration to replace the linear formula.
Multi-model ensemble (Groq + a Gemini free-tier model) to diversify forecasts.
Position-level PnL attribution to learn which market families our edge actually exists in.
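To make the first item concrete, here is a hedged sketch of what log-odds shrinkage could look like next to the current linear formula. The shrink factor of 0.85 and the eps clamp are illustrative assumptions, not tuned values, and logodds_calibrate is a hypothetical name.

```python
import math


def linear_calibrate(raw: float) -> float:
    """Current formula from the write-up: 0.85 * raw + 0.075."""
    return 0.85 * raw + 0.075


def logodds_calibrate(raw: float, shrink: float = 0.85, eps: float = 1e-6) -> float:
    """Shrink toward 0.5 in log-odds space instead of probability space.

    shrink and eps are assumed values chosen for illustration.
    """
    p = min(max(raw, eps), 1 - eps)       # keep the logit finite
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-shrink * logit))
```

For a raw forecast of 0.01, the linear formula returns 0.0835, which is the phantom ~0.075 edge seen on tail markets. The log-odds version returns roughly 0.02, so a tail forecast stays a tail forecast. Both leave 0.5 unchanged.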

Run slug

"eval_gradientprophets"

Built With

ai-prophet-core
groq
httpx
llama-3.3-70b
prediction-markets
python
python-dotenv

Updates

Sravya Rachakonda started this project — May 17, 2026 06:15 PM EDT
