Inspiration

We chose the forecasting track because it sits at an interesting intersection: it's not just about making LLMs correct, it's about making them honest about uncertainty. The Brier scoring rule rewards calibration over bravado: a hedged 0.55 that turns out wrong costs little more than a coin flip, but a 0.95 that turns out wrong is catastrophic. That changes the design problem completely. Most LLM applications optimize for sounding sure; ours optimizes for being appropriately unsure.

The other thing that drew us in: forecasting is one of the few LLM tasks where you can't just throw a bigger model at it. The bottleneck is information, not reasoning. An obscure W15 tennis match doesn't get more predictable with GPT-5 — it gets more predictable with a search result. That meant designing a hybrid system where every component has to earn its keep.

What it does

Our agent receives an event — a sports match, election, entertainment outcome — with a list of possible outcomes, and returns a calibrated probability distribution over those outcomes. The grader then scores us with Brier score across ~200 evaluation events.
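The Brier numbers quoted in this writeup (0.50 for a coin flip, 0.93 for uniform over 14 outcomes) are consistent with the multi-outcome form of the score: the sum of squared differences between each predicted probability and the one-hot result. A minimal sketch of that form follows. The function name brier_score is ours for illustration, and the convention is inferred from those numbers rather than taken from the grader's code.

```python
def brier_score(probs: dict[str, float], winner: str) -> float:
    """Multi-outcome Brier score: sum of (p - y)^2 over outcomes. Lower is better."""
    return sum(
        (p - (1.0 if label == winner else 0.0)) ** 2
        for label, p in probs.items()
    )


# A 50/50 call on a two-outcome event.
coin_flip = brier_score({"A": 0.5, "B": 0.5}, "A")

# A confident call that turns out wrong.
confident_wrong = brier_score({"A": 0.95, "B": 0.05}, "B")

# Uniform over 14 outcomes, as in a large reality-TV field.
uniform_14 = brier_score({f"o{i}": 1 / 14 for i in range(14)}, "o0")
```

Under this convention the coin flip scores 0.5, the confident miss scores 1.805, and uniform over 14 outcomes scores 13/14, or roughly 0.93.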

The pipeline:

  1. Event parsing — extract title, description, rules, close time, and the exact outcome labels we must match.
  2. LLM forecasting call — Claude Sonnet 4.5 via OpenRouter, with web search enabled. The system prompt explicitly instructs the model to use search for pre-event context (form, odds, news) and to ignore any search results that reveal the actual outcome.
  3. Robust JSON parsing — extracts JSON from prose-wrapped or markdown-fenced responses, with one automatic retry on parse failure.
  4. Label matching — normalizes Unicode (NFKC), smart quotes, em dashes, and non-breaking spaces, then does fuzzy substring matching. This rescues cases where the model returns "Atlanta Braves" when the canonical label is "Atlanta".
  5. Calibration — confidence-aware shrinkage toward uniform. Predictions that are far from the uniform prior get pulled harder than ones near it. Multi-outcome events with no clear signal (top probability < 20% across ≥10 outcomes) are forced to exact uniform to avoid noisy near-uniform Brier penalties.
  6. Output validation and fallback — every output is checked for exact label matches, valid probabilities, and sum-to-one. Any failure at any pipeline step falls back to a valid uniform distribution rather than raising. predict() is guaranteed to never throw.
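Step 6 of the pipeline above can be sketched as a validator plus a wrapper that swallows every failure. This is an illustration under assumptions, not our exact code: the "outcomes" key, the tolerance, and the names uniform, is_valid and safe_predict are hypothetical.

```python
import math


def uniform(outcomes: list[str]) -> dict[str, float]:
    """Equal probability on every label; empty input yields an empty dict."""
    return {label: 1.0 / len(outcomes) for label in outcomes} if outcomes else {}


def is_valid(dist, outcomes: list[str], tol: float = 1e-6) -> bool:
    """Check exact label match, finite probabilities in [0, 1], and sum-to-one."""
    if not isinstance(dist, dict) or set(dist) != set(outcomes):
        return False
    for value in dist.values():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isfinite(value) or not 0.0 <= value <= 1.0:
            return False
    return abs(sum(dist.values()) - 1.0) <= tol


def safe_predict(event, forecast) -> dict[str, float]:
    """Run a forecast callable; fall back to uniform on any error or bad output."""
    try:
        # "outcomes" is an assumed field name for the event's label list.
        outcomes = [str(o) for o in event.get("outcomes", [])]
    except Exception:
        return {}
    try:
        dist = forecast(event)
        return dist if is_valid(dist, outcomes) else uniform(outcomes)
    except Exception:
        return uniform(outcomes)
```

The wrapper never inspects why a forecast failed. Any exception or invalid distribution collapses to the same uniform answer, which keeps the contract simple to test.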

The whole thing is wrapped in a FastAPI server deployed on Railway. The grader POSTs an event; we return a valid distribution every single time.

How we built it

We built incrementally, validating each layer before adding the next:

Step 0 — The safety net. Before writing any LLM logic, we wrote predict() to return a guaranteed-valid uniform distribution with 13 contract tests covering every adversarial input we could think of (empty outcomes, malformed input, smart-quote labels, etc.). This bulletproof base meant that every subsequent feature could fail loudly without breaking the contract.

Step 1 — Single LLM call. Added Claude Sonnet 4.5 via OpenRouter with strict JSON output, fuzzy label matching, and a simple fixed-α shrinkage toward uniform.

Step 2 — Evaluation harness. Built a local pipeline to fetch the sample-resolved dataset (26 events with known outcomes), run predict() over it with resumable JSONL output, and compute Brier scores with per-category breakdowns. This was the moment we could finally measure ourselves. Baseline: 0.6449 vs uniform 0.6912.
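A harness of this shape can be sketched in a few lines. The record fields (event_id, category, probs, winner) and the function names are assumptions for illustration; the real dataset schema may differ.

```python
import json
from collections import defaultdict


def brier(probs: dict, winner: str) -> float:
    """Multi-outcome Brier score for one event."""
    return sum((p - (label == winner)) ** 2 for label, p in probs.items())


def score_jsonl(lines):
    """Return (overall mean Brier, {category: mean Brier}) from JSONL records."""
    by_category = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        by_category[record["category"]].append(
            brier(record["probs"], record["winner"])
        )
    scores = [s for group in by_category.values() for s in group]
    overall = sum(scores) / len(scores) if scores else float("nan")
    return overall, {c: sum(s) / len(s) for c, s in by_category.items()}


def done_ids(lines) -> set:
    """Event ids already written, so a resumed run can skip them."""
    return {json.loads(line)["event_id"] for line in lines if line.strip()}
```

Appending one JSON line per event is what makes the run resumable: on restart, read the existing file, collect done_ids, and only call predict() for the remaining events.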

Step 3 — Bug fixes and calibration upgrade. Discovered two silent failures: a smart-quote vs. straight-quote label mismatch on a Survivor event, and a SCOTUS event where the model returned prose instead of JSON. Added Unicode normalization and a one-shot retry. Upgraded calibration to confidence-aware shrinkage. Added web search via OpenRouter's :online suffix.
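The JSON side of this fix can be sketched as follows. It is a simplified illustration: the helper names and the retry wording are hypothetical, and call_model stands in for whatever function sends a prompt and returns the raw reply text.

```python
import json
import re

# Matches a markdown code fence (three backticks), optionally tagged "json".
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def extract_json(text: str) -> dict:
    """Pull the first JSON object out of a reply that may be fenced or wrapped in prose."""
    decoder = json.JSONDecoder()
    # Try fenced blocks first, then the whole reply.
    for chunk in FENCE.findall(text) + [text]:
        for brace in re.finditer(r"\{", chunk):
            try:
                obj, _ = decoder.raw_decode(chunk, brace.start())
            except json.JSONDecodeError:
                continue
            if isinstance(obj, dict):
                return obj
    raise ValueError("no JSON object found")


def parse_with_retry(call_model, prompt: str) -> dict:
    """One automatic retry with a stricter instruction if the first reply has no JSON."""
    try:
        return extract_json(call_model(prompt))
    except ValueError:
        return extract_json(call_model(prompt + "\nReturn only a JSON object."))
```

If the retry also fails, the ValueError propagates and the outer fallback layer turns it into a uniform distribution.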

Step 4 — Anti-leakage prompt engineering. Web search initially gave us a Brier of roughly 0.06, but that turned out to be the model just looking up final scores on resolved events. At real submission time the events haven't happened yet, so search can't leak, and the number said nothing about live performance. We rewrote the system prompt to explicitly instruct the model: "If search results show the outcome, IGNORE that information. Reason from pre-event context only." This brought our score on the resolved sample set back up to an honest 0.6290.

Step 5 — Multi-outcome rule and deployment. Added a calibration rule that forces exact uniform when an event has ≥10 outcomes and the model's top probability is below 20%. These are signal-less guesses, where small deviations from uniform are most likely noise that only adds to the expected Brier score. Wrote a FastAPI wrapper, deployed to Railway, verified all integration paths.
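The two calibration rules can be sketched together. The 10-outcome and 20% thresholds come from the rule described above. The shrinkage formula and the base_alpha and gain values are assumptions chosen to illustrate "pull harder the further you are from uniform", not our tuned parameters.

```python
def calibrate(probs, base_alpha=0.1, gain=0.2, min_outcomes=10, min_top=0.20):
    """Confidence-aware shrinkage toward uniform, with a flat-field override."""
    n = len(probs)
    if n < 2:
        return dict(probs)
    top = max(probs.values())

    # Multi-outcome rule: a large field with no clear favorite becomes exact uniform.
    if n >= min_outcomes and top < min_top:
        return {label: 1.0 / n for label in probs}

    # Distance of the top pick from uniform: 0 at uniform, 1 at certainty.
    distance = (top - 1.0 / n) / (1.0 - 1.0 / n)

    # Hypothetical schedule: shrinkage strength grows linearly with distance.
    alpha = min(1.0, max(0.0, base_alpha + gain * distance))
    return {label: (1.0 - alpha) * p + alpha / n for label, p in probs.items()}
```

Because the output is a convex mix of the input and the uniform distribution, it still sums to one whenever the input does.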

We ended with 60 unit tests + 3 server smoke tests, all passing, and a deployed endpoint at a stable Railway URL.

Challenges we ran into

The silent leakage trap. Our first run with web search gave a Brier of 0.0589 — looked like a breakthrough until we realized the resolved test set had events that had already happened, so search was just returning final scores. The model wasn't forecasting, it was retrieving. This is the kind of bug that doesn't show up until you ask "wait, is this actually predictive?" We spent significant time building an evaluation methodology that simulates submission-time conditions (where outcomes don't exist yet on the web).

Silent label mismatches. A Survivor event had a contestant named Benjamin "Coach" Wade — with curly quotes in the canonical label. The model returned straight quotes. Result: zero label match, fallback to uniform, Brier penalty. The fix was Unicode normalization, but the finding required reading our scoring report carefully. There are dozens of these kinds of edge cases lurking in label data.
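A sketch of the normalization and matching step follows. The replacement table covers the character classes named in the pipeline description. The function names and the rule that a substring match must be unique are illustrative choices, not necessarily what our code does.

```python
import unicodedata

_REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2013": "-", "\u2014": "-",  # en and em dashes
    "\u00a0": " ",                 # non-breaking space
}


def normalize(label: str) -> str:
    """NFKC-normalize, straighten quotes and dashes, collapse whitespace, casefold."""
    text = unicodedata.normalize("NFKC", label)
    for src, dst in _REPLACEMENTS.items():
        text = text.replace(src, dst)
    return " ".join(text.split()).casefold()


def match_label(returned: str, canonical: list[str]):
    """Map a model-returned label to a canonical one, or None if ambiguous or absent."""
    want = normalize(returned)
    if not want:
        return None
    for label in canonical:
        if normalize(label) == want:
            return label
    # Fuzzy pass: substring in either direction, accepted only if exactly one hit.
    hits = [c for c in canonical if normalize(c) in want or want in normalize(c)]
    return hits[0] if len(hits) == 1 else None
```

The function always returns the canonical spelling, so downstream validation can still require exact label equality.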

Windows-specific subprocess crashes. The prophet CLI prints Unicode arrows in summary lines, which Windows' cp1252 codec rejects. The file was always written successfully, but the CLI's summary would crash and cause our wrapper to think the fetch failed. Fixed by injecting PYTHONIOENCODING=utf-8 into the subprocess environment and treating "non-zero exit + file exists" as success with a warning.
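The workaround looks roughly like this. The wrapper name run_fetch is hypothetical, and the demo uses a stand-in child process that writes a file and then exits non-zero, since the real CLI invocation is not reproduced here.

```python
import os
import subprocess
import sys
import tempfile
import warnings
from pathlib import Path


def run_fetch(cmd: list[str], out_path: Path) -> bool:
    """Run a fetch command; treat 'non-zero exit but file exists' as success."""
    # Force UTF-8 stdio in the child so Unicode summary lines don't crash on cp1252.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    result = subprocess.run(cmd, env=env, capture_output=True)
    if result.returncode == 0:
        return out_path.exists()
    if out_path.exists():
        warnings.warn(
            f"command exited {result.returncode} but {out_path} exists; "
            "treating as success"
        )
        return True
    return False


# Stand-in for the real CLI: writes the output file, then fails.
child = "import sys; open(sys.argv[1], 'w').write('ok'); sys.exit(1)"
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "events.json"
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        recovered = run_fetch([sys.executable, "-c", child, str(target)], target)
    missing = run_fetch(
        [sys.executable, "-c", "import sys; sys.exit(1)"],
        Path(tmp) / "none.json",
    )
```

One caveat with this approach: a stale file from an earlier run would also count as success, so a stricter version might delete the target or check its modification time before the call.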

Server design under hostile conditions. Early server code raised HTTP 500 on internal errors. We caught this in code review: the grader counts non-200 responses as missed predictions, which directly reduces n_matched on the leaderboard. We rewrote the server so every error path returns a 200 with a valid uniform fallback. Under this grading scheme, a uniform guess that still gets scored is better than a prediction that never gets counted.
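The core of the always-200 design can be shown without the web framework. In this sketch, handle_request is a hypothetical function that a route handler would call with the raw request body. The "outcomes" and "probabilities" field names are assumptions, not the grader's actual schema.

```python
import json


def uniform(outcomes: list[str]) -> dict[str, float]:
    """Equal probability on every label; empty input yields an empty dict."""
    return {o: 1.0 / len(outcomes) for o in outcomes} if outcomes else {}


def handle_request(raw_body: bytes, predict) -> tuple[int, dict]:
    """Return (status, payload). The status is 200 on every path."""
    outcomes: list[str] = []
    try:
        event = json.loads(raw_body)
        # Capture labels before predicting so the fallback can still use them.
        outcomes = [str(o) for o in event["outcomes"]]
        return 200, {"probabilities": predict(event)}
    except Exception:
        # Malformed body, missing fields, or a predictor crash all land here.
        return 200, {"probabilities": uniform(outcomes)}
```

Parsing the body by hand, rather than letting the framework validate it, is what keeps malformed input from being rejected with a non-200 status before our code runs.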

Calibration parameter sweeps. We built a tuning script that swept 1,050 combinations of calibration parameters against existing predictions. The result told us, paradoxically, that the best params were (0, 0) — i.e., don't calibrate at all. We figured out this was because we were applying calibration to already-calibrated predictions (double shrinkage). The right tuning approach would require storing raw LLM outputs before calibration, which we didn't have time to retrofit. We kept defaults that match domain intuition.
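The confound is easiest to see with plain linear shrinkage toward uniform, which is a simplification of our confidence-aware rule. Shrinking by a and then by b is the same as shrinking once by 1 - (1 - a)(1 - b), so a sweep over stored, already-shrunk predictions only measures extra shrinkage on top of what was applied. The values below are illustrative.

```python
def shrink(probs: dict[str, float], alpha: float) -> dict[str, float]:
    """Linear shrinkage toward uniform with fixed strength alpha."""
    n = len(probs)
    return {label: (1 - alpha) * p + alpha / n for label, p in probs.items()}


raw = {"A": 0.8, "B": 0.2}           # what the LLM said (not stored in our runs)
stored = shrink(raw, 0.3)            # what was written to disk
swept = shrink(stored, 0.2)          # what a sweep over stored predictions evaluates

# The same result from a single pass over the raw output.
combined = shrink(raw, 1 - (1 - 0.3) * (1 - 0.2))
```

Here swept and combined agree, which means a sweep result of "no calibration" on stored predictions says only that no additional shrinkage helps, not that the raw outputs need none.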

Accomplishments that we're proud of

  • Never crashes. predict() is guaranteed to never raise. The server is guaranteed to never return non-200. Whatever the grader throws at us — malformed JSON, weird Unicode, network failures, empty outcomes — we always return a valid distribution. Completion rate is 100%.
  • Honest calibration. We resisted the temptation to ship our 0.0589 "leaked" Brier as our headline number. The 0.6290 we report comes from the leak-free evaluation on the resolved sample set, so it is our honest estimate of live performance, not a vanity metric.
  • Tested thoroughly. 60 unit tests + 3 server smoke tests, covering parsing, label matching, calibration math, output validation, server response codes, and end-to-end LLM calls. We caught the server 500 regression because we had tests.
  • Modular architecture. Every component is swappable. Calibration mode, model name, search on/off, retry behavior — all controllable via environment variables without code changes.

What we learned

Brier score changes the game. When the loss function punishes overconfidence quadratically, every design decision flips. You stop trying to make the model more confident and start trying to make it more honest. Calibration isn't a nice-to-have; it's the entire job.

Measure before you tune. We almost shipped our 0.06 leaked score before realizing it was meaningless. Building the eval harness in Step 2 — before adding features — meant every subsequent change was measurable. Without honest measurement, you're optimizing for the wrong number.

Defense in depth pays off. Every layer of the pipeline has its own try/except, its own fallback. We initially thought this was overkill. By the end, three different failure modes (smart-quote mismatch, JSON parse failure, malformed grader body) would have caused crashes without these layers.

The model has different failure modes by event type. Tennis matches fail by lack of information (model honestly says 50/50, scores 0.50 Brier). Multi-outcome reality TV fails the same way but worse (uniform on 14 outcomes scores 0.93). Geopolitics fails by confident-wrong (model says 80% Orbán, Magyar wins, Brier 0.81). Each failure type needs different treatment — uniform fallback for low-info, lighter calibration for confident-correct cases, heavier shrinkage for confident calls in volatile categories.

What's next for Cost-Aware Calibrated Forecaster

  • Multi-agent debate for high-stakes events. The original design called for optimist/pessimist/judge agents on uncertain events. We deferred this because our error analysis showed the bottleneck was information (search-shaped), not reasoning (debate-shaped). With more time, we'd add debate selectively for events where the model is confident but has thin search evidence — exactly the situations where debate adds the most signal.
  • Per-category calibration tuning. Sports, political, and entertainment events each behave differently. With access to a larger ground-truth dataset, we'd fit a separate calibration curve per category.
  • Smarter search query generation. Currently the model decides what to search. We could pre-process events to generate explicit search queries tuned for forecasting signal (odds, expert opinions, recent form) rather than retrieval (final scores, winners).
  • Ensemble across models. Run the same event through Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5, then weighted-average the predictions. Disagreement between models is itself a calibration signal.
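The ensemble idea in the last item is not implemented. As a rough sketch of what it could look like, the code below takes a weighted average of per-model distributions and reports a disagreement score. The function name, the weights, and the choice of total-variation distance as the disagreement measure are all assumptions.

```python
def ensemble(predictions: dict[str, dict[str, float]], weights: dict[str, float]):
    """Weighted average of per-model distributions, plus a disagreement score."""
    models = list(predictions)
    labels = list(predictions[models[0]])
    total = sum(weights[m] for m in models)

    average = {
        label: sum(weights[m] * predictions[m][label] for m in models) / total
        for label in labels
    }

    # Disagreement: mean total-variation distance of each model from the average.
    disagreement = sum(
        0.5 * sum(abs(predictions[m][label] - average[label]) for label in labels)
        for m in models
    ) / len(models)
    return average, disagreement
```

A high disagreement score could feed back into calibration, for example by shrinking harder toward uniform on events where the models diverge.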
