Inspiration

I'm a labor economist working on AI's impact on labor markets. I wanted to see how far an LLM agent could go on real-world forecasting, and where it would fail.

What it does

Takes a binary or multi-outcome event from Kalshi (e.g., "Will the Fed cut rates in June?") and returns a probability for each outcome. Handles three event types: binary, exclusive multi-outcome, and cumulative thresholds.

How I built it

  • A Python function, predict(event), that acts as the agent
  • Claude Sonnet 4.6 via OpenRouter with :online web search
  • Category-aware system prompt covering Sports, Elections, Politics, Entertainment, Economics
  • Defensive JSON parsing, probability clipping to [0.02, 0.98], normalization to sum to 1
  • FastAPI wrapper, deployed on Modal as an always-warm container
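
A minimal sketch of what the parsing, clipping, and normalization step might look like. The function names, the uniform fallback, and the regex-based extraction are assumptions for illustration, not the project's actual code; only the [0.02, 0.98] bounds and the sum-to-1 requirement come from the list above.

```python
import json
import math
import re

CLIP_LO, CLIP_HI = 0.02, 0.98  # clipping bounds stated in the writeup


def parse_probabilities(raw_text, outcomes):
    """Extract {outcome: probability} from model output (hypothetical helper).

    Falls back to a uniform guess whenever no usable JSON object is found.
    """
    uniform = {o: 1.0 / len(outcomes) for o in outcomes}
    # Take the outermost {...} span so code fences or chatter around it are ignored.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        return uniform
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return uniform
    if not isinstance(data, dict):
        return uniform
    probs = {}
    for o in outcomes:
        try:
            value = float(data.get(o))
        except (TypeError, ValueError):
            value = uniform[o]  # missing or non-numeric entry
        # Guard against NaN/inf, which json.loads can produce.
        probs[o] = value if math.isfinite(value) else uniform[o]
    return probs


def clip_and_normalize(probs):
    """Clip each probability to [CLIP_LO, CLIP_HI], then rescale to sum to 1."""
    clipped = {k: min(max(v, CLIP_LO), CLIP_HI) for k, v in probs.items()}
    total = sum(clipped.values())
    return {k: v / total for k, v in clipped.items()}


raw = 'Here is my answer:\n{"Yes": 0.995, "No": 0.005}'
print(clip_and_normalize(parse_probabilities(raw, ["Yes", "No"])))
```

Note that rescaling after clipping can push a value slightly outside the clip range when there are many outcomes. Whether to clip again afterwards is a design choice this sketch leaves open.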

What I learned

  • Retrieval matters more than model size. Appending :online was a one-line change that grounded predictions in current information instead of stale training data.
  • The plumbing-to-intelligence ratio was about 20:1. Most of the time went to environment setup, schema mismatches, and deployment, not to the actual prediction logic.
  • Calibration is the open problem. LLMs default to confident answers; teaching them to output 0.5 on genuine unknowns is harder than it sounds.
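
To make the calibration point concrete, here is a small worked example using the Brier score, a common scoring rule for probability forecasts. Using Brier here is an assumption for illustration; nothing above says how the forecasts were actually scored. The sketch computes the expected score of a binary forecast on a genuine coin-flip event.

```python
def expected_brier(forecast, true_prob):
    """Expected Brier score (lower is better) of a binary forecast
    when the event actually occurs with probability true_prob."""
    return true_prob * (1.0 - forecast) ** 2 + (1.0 - true_prob) * forecast ** 2


# A genuinely uncertain event: true probability 0.5.
for forecast in (0.5, 0.7, 0.9):
    print(forecast, round(expected_brier(forecast, 0.5), 3))
```

On a true 50/50 event, forecasting 0.5 gives an expected score of 0.25, while a confident 0.9 gives about 0.41, so unearned confidence is penalized under this rule.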

Challenges

  • The installed CLI rejected my predictions due to a schema mismatch with the official docs.
  • response_format={"type": "json_object"} broke Claude via OpenRouter and produced hallucinated templates.
  • Cumulative threshold events are nested rather than mutually exclusive, so their probabilities don't naturally sum to 1, yet the spec requires that they do.
  • ngrok demanded email verification right at the submission deadline.
  • :online mode costs ~$0.36/call, or roughly $72 for the full 200 calls, which is over the $50 budget.
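
The cumulative-threshold issue can be illustrated with a short sketch. The outcome labels and numbers below are hypothetical, and plain rescaling is just one simple way to satisfy a sum-to-1 requirement, not necessarily what this project did.

```python
def rescale_to_unit_sum(threshold_probs):
    """Divide each threshold probability by the total so the outputs sum to 1.

    Falls back to a uniform split if the total is not positive.
    """
    total = sum(threshold_probs.values())
    if total <= 0:
        n = len(threshold_probs)
        return {k: 1.0 / n for k in threshold_probs}
    return {k: v / total for k, v in threshold_probs.items()}


# Hypothetical nested thresholds: each event contains the next one,
# so coherent probabilities are non-increasing and can sum to more than 1.
raw = {"above 3.0%": 0.90, "above 3.5%": 0.60, "above 4.0%": 0.20}
scaled = rescale_to_unit_sum(raw)

print(sum(raw.values()))     # raw total is well above 1
print(scaled)                # rescaled values sum to 1
```

Rescaling preserves the ranking of the thresholds, but the resulting numbers can no longer be read as "at or above this level" probabilities, which is the mismatch described in the list above.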

What's next

Per-category prompts and routing, ensembling across Claude / GPT / Gemini, calibration corrections from a held-out validation set, and a writeup for the ICML 2026 workshop on forecasting.

Built With

  • anthropic
  • claude
  • fastapi
  • github
  • modal
  • ngrok
  • openai-sdk
  • openrouter
  • pydantic
  • python