Inspiration
I'm a labor economist working on AI's impact on labor markets. I wanted to see how far an LLM agent could go on real-world forecasting, and where it would fail.
What it does
Takes a binary or multi-outcome event from Kalshi (e.g., "Will the Fed cut rates in June?") and returns a probability for each outcome. Handles three event types: binary, exclusive multi-outcome, and cumulative thresholds.
How I built it
predict(event)Python function — the agent- Claude Sonnet 4.6 via OpenRouter with
:onlineweb search - Category-aware system prompt covering Sports, Elections, Politics, Entertainment, Economics
- Defensive JSON parsing, probability clipping to [0.02, 0.98], normalization to sum to 1
- FastAPI wrapper, deployed on Modal as an always-warm container
What I learned
- Retrieval matters more than model size. Appending
:onlinewas a one-line change that grounded predictions in current information instead of stale training data. - Plumbing-to-intelligence ratio was about 20:1. Most of the time went to environment setup, schema mismatches, and deployment — not to the actual prediction logic.
- Calibration is the open problem. LLMs default to confident; teaching them to output 0.5 on genuine unknowns is harder than it sounds.
Challenges
- The installed CLI rejected my predictions due to a schema mismatch with the official docs.
response_format={"type": "json_object"}broke Claude via OpenRouter and produced hallucinated templates.- Cumulative threshold events aren't mathematically "sum to 1" events but the spec requires it.
- ngrok email verification at the submission deadline.
:onlinemode costs ~$0.36/call — over the $50 budget at full 200 calls.
What's next
Per-category prompts and routing, ensembling across Claude / GPT / Gemini, calibration corrections from a held-out validation set, and a writeup for the ICML 2026 workshop on forecasting.
Log in or sign up for Devpost to join the conversation.