Cross-Family LLM Forecasting with Dual Criteria Extraction

Inspiration

The naïve way to forecast with an LLM is to ask it once and trust the answer. That produces wildly overconfident predictions, and a single Brier-0.9 disaster can sink any 2-week eval window. Real expert forecasters don't work that way. Tetlock's superforecasters decompose questions, look at base rates, and update slowly. Hyndman's textbook (Forecasting: Principles and Practice, Ch. 6) treats the Delphi method as near-canonical: experts make independent estimates without seeing each other, then iteratively revise after reading anonymized peer rationales, until they either converge or deliberately stop. Prediction markets are the same idea at scale: Polymarket prices aggregate thousands of small, independent bets. I wanted to build the pipeline a panel of careful forecasters would actually use, not the one that maximizes "LLM-IQ" on a single call.

What it does

Given a Prophet Arena event payload (binary, ordered-numeric, or multi-outcome categorical), the endpoint returns a calibrated probability distribution over outcomes in the official response shape. Under the hood: a dual-analyst criteria extractor (Sonnet 4.6 primary + Opus 4.7 skeptic with native web_search) reads sources directly; a K=4 cross-family voter ensemble (Sonnet, Haiku, Gemini, GPT-5, DeepSeek) runs Tetlock-style three-stage reasoning in parallel; high-disagreement events trigger Delphi revision rounds; the result is blended in logit space with a resolution-window-adaptive market prior and parallel external signals from Polymarket, Manifold, Metaculus, CoinGecko, and Odds-API.
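To make the "blended in logit space" step concrete, here is a minimal sketch for the binary case. The function names, the clamping constant eps, and the idea of a single scalar market weight are assumptions for illustration, not the repo's actual code.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clamped away from 0 and 1 (eps is an assumed constant)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def blend_in_logit_space(p_model: float, p_market: float, w_market: float) -> float:
    """Weighted average of two probabilities in log-odds space.

    w_market is the share of weight given to the market prior (hypothetical
    parameter name); the remainder goes to the voter ensemble.
    """
    if not 0.0 <= w_market <= 1.0:
        raise ValueError("w_market must be in [0, 1]")
    z = (1.0 - w_market) * logit(p_model) + w_market * logit(p_market)
    return sigmoid(z)
```

Averaging log-odds rather than raw probabilities means a confident source (say 0.99) pulls harder than it would in a linear average, which is why the weight given to each source matters so much.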

How we built it

Python 3.11 + FastAPI on Fly.io, with ThreadPoolExecutor for parallel voters and parallel external-signal fetching. The architecture is three pipelines that share one shape — binary, ordered-numeric, and multi-outcome categorical all run the same K-sample → aggregate → shrink → blend loop, just with different aggregation primitives (Bayes-style σ inflation for numeric, per-outcome shrinkage for multi-outcome). All component decisions are A/B-tested and documented in ARCHITECTURE_LOCKED.md. Everything has a graceful failure mode: the top-level predict() is globally wrapped and returns a calibrated uniform distribution on any internal error — the eval server never sees a 500.
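The parallel-voter and fail-safe behavior described above could look roughly like the sketch below. The event shape (a dict with an "outcomes" list), the voter signature, and the plain mean aggregation are all placeholders I am assuming here; the real pipeline uses its own aggregation, shrinkage, and blending steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Distribution = Dict[str, float]
Voter = Callable[[dict], Distribution]  # hypothetical voter signature


def uniform(outcomes: List[str]) -> Distribution:
    """Fallback: equal probability on every outcome."""
    return {o: 1.0 / len(outcomes) for o in outcomes}


def run_voters(voters: List[Voter], event: dict) -> List[Distribution]:
    """Run all voters in parallel threads and drop the ones that raise."""
    results: List[Distribution] = []
    with ThreadPoolExecutor(max_workers=max(1, len(voters))) as pool:
        futures = [pool.submit(v, event) for v in voters]
        for f in futures:
            try:
                results.append(f.result())
            except Exception:
                continue  # one failed voter should not sink the event
    return results


def predict(event: dict, voters: List[Voter]) -> Distribution:
    """Globally wrapped entry point: any internal error yields a uniform answer."""
    outcomes = list(event.get("outcomes", []))
    try:
        votes = run_voters(voters, event)
        if not votes:
            raise RuntimeError("no voter produced a distribution")
        # Placeholder aggregation (simple mean), then renormalize.
        mean = {o: sum(v[o] for v in votes) / len(votes) for o in outcomes}
        total = sum(mean.values())
        return {o: p / total for o, p in mean.items()}
    except Exception:
        return uniform(outcomes)
```

Catching errors at two levels is deliberate in this sketch: a single bad voter is skipped, while anything that breaks the whole aggregation falls back to the uniform distribution instead of surfacing as a server error.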

Challenges we ran into

Directionality misreads on legal events (SCOTUS Louisiana v. Callais) drove the dual-analyst design: the primary and the skeptic surface a direct_answer to voters in Round 2 of Delphi but never anchor Round 1, so dissenting voters can override a wrong extractor.

The second challenge was a threshold-ladder catastrophe on the official Prophet Arena Subset-1200 release: events with nested cumulative outcomes ("Score ≥ 50, ≥ 55, …, ≥ 90") cost us 0.78 Brier because voters concentrated mass on the top threshold instead of treating each as an independent binary. I solved it with a 3-layer safety stack: a prompt hint forcing value-first reasoning, PAV isotonic regression projecting onto the monotone cone, and a 2-auditor (Sonnet + GPT-5) agentic sanity check that proposes a point estimate of the underlying scalar and blends in a logistic-CDF correction. That single fix cut worst-case Brier from 0.91 to 0.17.
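For readers unfamiliar with the PAV step: on a ladder like "Score ≥ 50, ≥ 55, …", the probabilities must be non-increasing as the threshold rises. Pool-adjacent-violators finds the closest (least-squares) sequence that satisfies this. Below is a small pure-Python sketch with equal weights; the function names are mine, and the repo may well use a library routine instead.

```python
from typing import List


def pav_non_decreasing(values: List[float]) -> List[float]:
    """Least-squares projection onto non-decreasing sequences (equal weights)."""
    blocks: List[List[float]] = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1.0])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    out: List[float] = []
    for s, n in blocks:
        out.extend([s / n] * int(n))
    return out


def project_threshold_ladder(probs: List[float]) -> List[float]:
    """Make P(score >= t) non-increasing in t.

    probs must be ordered by ascending threshold. Reversing turns the
    non-increasing constraint into a non-decreasing one, so the same
    PAV routine can be reused.
    """
    return pav_non_decreasing(probs[::-1])[::-1]
```

Violating neighbors are replaced by their average, so a ladder that is already monotone passes through unchanged, and the correction only touches the thresholds where voters contradicted themselves.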

Accomplishments that we're proud of

The pipeline is conservatively engineered: I built and deleted multiple techniques (coherence filter, multi-agent debate, TTA, decomposition, meta-classifier) because A/B testing showed they hurt Brier or added complexity with no measurable win. What survived: Bayes-optimal disagreement shrinkage with an extreme-σ tier that pulls the estimate 85% of the way to the prior on events voters genuinely don't know, plus a resolution-window-adaptive market blend that gives markets up to 85% weight on far-out events and as little as 20% on events already in the past. The output validator survives 8/8 hostile test inputs without raising. Everything runs end-to-end on a single shared-cpu-2x Fly machine with p50 ≈ 50 s and p99 ≈ 150 s, well under the org's 10-minute budget.
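As a rough illustration of the two surviving mechanisms, the sketch below shrinks the voter mean toward a prior as disagreement grows and picks a market weight from the time left until resolution. Only the 85% and 20% figures come from the description above; the linear ramps, the extreme_sigma cutoff of 0.30, the 30-day horizon, and the choice to shrink in probability space are assumptions made for the example.

```python
import statistics
from typing import List


def shrink_weight(sigma: float, extreme_sigma: float = 0.30,
                  max_shrink: float = 0.85) -> float:
    """Share of weight moved to the prior, rising with voter disagreement.

    At or above extreme_sigma (assumed cutoff) the full 85% tier applies;
    below it, a linear ramp is assumed.
    """
    if sigma >= extreme_sigma:
        return max_shrink
    return max_shrink * max(sigma, 0.0) / extreme_sigma


def shrink_toward_prior(votes: List[float], prior: float) -> float:
    """Pull the mean vote toward the prior in proportion to disagreement."""
    mean = statistics.fmean(votes)
    sigma = statistics.pstdev(votes)
    w = shrink_weight(sigma)
    return (1.0 - w) * mean + w * prior


def market_weight(days_to_resolution: float, horizon_days: float = 30.0,
                  w_min: float = 0.20, w_max: float = 0.85) -> float:
    """Market share of the blend: 20% for past events, up to 85% far out.

    horizon_days and the linear interpolation are assumed for illustration.
    """
    if days_to_resolution <= 0:
        return w_min
    frac = min(days_to_resolution / horizon_days, 1.0)
    return w_min + (w_max - w_min) * frac
```

The intuition is that when voters agree, their mean is kept almost as is; when they split, the prior takes over. Likewise, once an event's resolution date has passed, fresh evidence from sources should outweigh a possibly stale market price.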

What we learned

The biggest Brier wins came from fact extraction and humility, not from cleverness. Adding Opus + web_search up front beat any voter-count increase. Treating prediction markets as a strong prior (not a thing to override) won more than any voter-prompt tuning. The Delphi structure — independent priors first, anonymized revision second — consistently beat a single big call. And every time I trusted A/B numbers over my own intuition I was right to: most "clever" additions hurt Brier; what worked was conservative, principled, and boring.

Public repo

As encouraged by the organizers, here is the link to my public repo: https://github.com/emilpartow/prophet-hacks-partini

Built With

claude
coingecko
deepseek
fastapi
fly.io
manifold
metaculus
odds
openai
polymarket
python

Updates

Emil Partow started this project — May 17, 2026 04:47 PM EDT
