Prophet Hacks Agent

A six-stage forecasting agent for the Prophet Arena Forecasting Track. Decomposes every question into Fermi sub-questions, runs twelve research tools in parallel, ensembles Opus 4.7 + GPT-5.5 forecasts, and red-teams its own answer before committing a calibrated probability.

Inspiration

Tetlock's superforecasters don't think harder — they think more systematically. They decompose every question into specific empirical sub-questions, anchor on base rates, and red-team their own confidence before committing. We wanted to see how close a single LLM agent could get to that workflow if we structured the whole pipeline around it.

Brier scoring,

$$\text{Brier} = \frac{1}{N}\sum_i (p_i - y_i)^2,$$

punishes overconfidence quadratically — so the agent needs both information edge and calibration discipline. Every design decision came back to that tension.
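To make the quadratic penalty concrete, here is a minimal sketch of the formula above for binary outcomes. The function name and the example numbers are illustrative, not taken from the agent's code.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Both forecasts below are wrong (the event did not happen, y = 0),
# but the overconfident one pays roughly twice the penalty.
hedged = brier_score([0.70], [0])         # about 0.49
overconfident = brier_score([0.99], [0])  # about 0.98
print(hedged, overconfident)
```

Moving from 70% to 99% on a wrong call roughly doubles the loss, which is why calibration matters as much as information edge.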

Architecture

The agent is a six-stage async pipeline. Each stage hands a well-typed payload to the next; every stage logs into a per-request trace dump for debugging.

① Planner — Opus 4.7. Reads the event and a dynamically-rendered catalog of available tools. Produces a JSON plan with 3–7 Fermi sub-questions and a list of which tools to invoke with what args. The planner is the only stage that sees the full tool catalog; downstream stages only see the brief.

② Parallel Research. Twelve research tools fan out via asyncio.gather. Each tool is a self-registering module gated on its API key — the planner only sees tools whose env vars are set, and failures are timeout-isolated so one bad tool can't sink the pipeline. The catalog spans cross-market signals (Kalshi, Polymarket), news search (Anthropic web search, a custom semantic-search server), structured data (FRED, Financial Modeling Prep, CoinGecko, Congress.gov, CourtListener, Odds API), background (Wikipedia), and native code execution for math.
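A minimal sketch of what timeout-isolated fan-out can look like with asyncio.gather. The names run_tool and fan_out, the result-record shape, and the three stub tools are all assumptions for illustration; the real tool interfaces are not shown in this writeup.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolFn = Callable[[str], Awaitable[Any]]


async def run_tool(name: str, fn: ToolFn, query: str, timeout_s: float) -> dict:
    """Run one tool; turn timeouts and errors into a result record."""
    try:
        data = await asyncio.wait_for(fn(query), timeout=timeout_s)
        return {"tool": name, "ok": True, "data": data}
    except asyncio.TimeoutError:
        return {"tool": name, "ok": False, "error": "timeout"}
    except Exception as exc:  # one bad tool must not sink the others
        return {"tool": name, "ok": False, "error": repr(exc)}


async def fan_out(tools: dict[str, ToolFn], query: str, timeout_s: float) -> list[dict]:
    """Start every tool concurrently; results come back in input order."""
    return await asyncio.gather(
        *(run_tool(name, fn, query, timeout_s) for name, fn in tools.items())
    )


# Hypothetical stand-ins for real research tools.
async def fast_tool(query: str) -> str:
    return f"result for {query}"


async def slow_tool(query: str) -> str:
    await asyncio.sleep(1.0)  # exceeds the timeout used below
    return "never returned"


async def broken_tool(query: str) -> str:
    raise RuntimeError("boom")


results = asyncio.run(
    fan_out({"fast": fast_tool, "slow": slow_tool, "broken": broken_tool}, "q", timeout_s=0.05)
)
for record in results:
    print(record)
```

Because each coroutine catches its own failures, gather never sees an exception, and downstream stages can simply skip records where ok is false.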

③ Synthesis — Sonnet 4.6. Collapses raw research into a structured evidence brief. Required to include a dedicated section that answers each sub-question with source-tagged 1–3 sentence answers, plus market anchors and calibration anchors.

④ Two parallel forecasters — Opus 4.7 + GPT-5.5. Both receive the same brief and must walk through the sub-questions, writing a one-sentence inference for each before committing probabilities. The two models run concurrently via asyncio.gather, so the second forecast adds no wall-clock latency.

⑤ Devil's Advocate — Sonnet 4.6 (no extended thinking). Red-teams both forecasts: which sub-questions did they underweight, where do they disagree, are they over- or under-confident given the brief.

⑥ Aggregator — Opus 4.7. Synthesizes brief + both forecasts + critique into the final probability distribution. Explicitly empowered to disagree with both upstream forecasters when the critique is sharp — not an averaging step.

Key design decisions

Model tiering. Opus 4.7 for the three high-leverage stages (planner, forecaster, aggregator). Sonnet 4.6 for synthesis and the Devil's Advocate (DA), since those stages are mostly summarization and pattern-matching rather than generation. Dropping the DA from Opus to Sonnet cut per-event cost by roughly 80 cents with no measurable calibration loss.

Extended thinking on, but at effort=low. Adaptive thinking on Claude 4 (via thinking.type=adaptive + output_config.effort=low) plus reasoning_effort=low on GPT-5.5 hits the sweet spot: roughly a quarter of the thinking tokens of effort=high, with most of the calibration benefit retained. Per-event cost: about \$1.30 vs about \$3.70 at high.

Adaptive probability floor. Fixed-0.01 floors crowd confident answers in multi-outcome events — a 23-team FA Cup with a 0.01 per-loser floor locks 22% of probability mass away from the correct answer. We use

$$\text{floor}(N) = \max\!\left(0.002,\ \frac{0.05}{N}\right)$$

so total floor mass stays at about 5% for events with up to 25 outcomes. Beyond that the 0.002 minimum takes over and the total grows as 0.002N.
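A minimal sketch of the floor formula applied to a forecast. The adaptive_floor function follows the formula directly; the clamp-then-renormalize step in apply_floor is an assumption, since the writeup does not say how floored probabilities are rescaled. Note that after rescaling, floored entries can land slightly below the nominal floor.

```python
def adaptive_floor(n_outcomes: int) -> float:
    """Per-outcome minimum probability: max(0.002, 0.05 / N)."""
    return max(0.002, 0.05 / n_outcomes)


def apply_floor(probs: list[float]) -> list[float]:
    """Clamp each probability to the floor, then renormalize to sum to 1.

    The renormalization step is an assumption for illustration.
    """
    floor = adaptive_floor(len(probs))
    clamped = [max(p, floor) for p in probs]
    total = sum(clamped)
    return [p / total for p in clamped]


# 23-outcome event: mass reserved for the 22 losing outcomes.
fixed_mass = 22 * 0.01                   # 0.22 under a fixed 0.01 floor
adaptive_mass = 22 * adaptive_floor(23)  # about 0.048 under the adaptive floor
print(fixed_mass, adaptive_mass)

# A confident forecast keeps most of its mass on the favorite.
floored = apply_floor([1.0] + [0.0] * 22)
print(floored[0])
```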

Bucket-interpretation guidance. For threshold-style outcomes like "Above 3.50%" / "Above 3.75%" on Fed rate questions, every prompt explicitly spells out that the outcomes are mutually-exclusive buckets, not cumulative thresholds. Without this, Opus and GPT-5.5 silently disagreed — we caught it via a side-by-side trace and the fix moved probability mass from 22% on the wrong bucket to 84% concentrated on the two real scenarios.

Self-registering tool catalog. Tools live in tools/, each a single file with a @register decorator. Adding a new tool is one file; no plumbing edits. The planner's prompt is regenerated per-request from the registry, so newly-added tools become visible to the planner the moment they're available.
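A minimal sketch of a self-registering catalog gated on environment variables. Only the @register decorator name comes from the description above; the registry layout, the helper functions, the decorator arguments, and the environment variable name are hypothetical.

```python
import os
from typing import Callable, Optional

# Hypothetical registry: tool name -> metadata and callable.
TOOL_REGISTRY: dict[str, dict] = {}


def register(name: str, description: str, env_var: Optional[str] = None):
    """Decorator that records a tool in the registry when its module is imported."""
    def decorator(fn: Callable) -> Callable:
        TOOL_REGISTRY[name] = {"fn": fn, "description": description, "env_var": env_var}
        return fn
    return decorator


def available_tools() -> dict[str, dict]:
    """Keep only tools that need no key or whose key is set in the environment."""
    return {
        name: spec
        for name, spec in TOOL_REGISTRY.items()
        if spec["env_var"] is None or os.environ.get(spec["env_var"])
    }


def render_catalog() -> str:
    """Build the tool list for the planner prompt from the live registry."""
    return "\n".join(
        f"- {name}: {spec['description']}"
        for name, spec in sorted(available_tools().items())
    )


@register("wikipedia", "Background lookups")
async def wikipedia(query: str) -> str:
    return f"stub article for {query}"


@register("fred", "Macroeconomic time series", env_var="EXAMPLE_FRED_API_KEY")
async def fred(query: str) -> str:
    return f"stub series for {query}"


print(render_catalog())
```

Because the catalog is rendered at request time, setting or unsetting a key changes what the planner sees without touching any other code.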

Key innovations

Fermi decomposition threaded through every stage — sub-questions are generated by the planner, answered explicitly in the synthesis brief, used as reasoning anchors by both forecasters, audited by the DA, and cross-referenced by the aggregator.
Devil's Advocate that knows the interpretation rules. Same bucket-interpretation block as the forecasters, so it critiques actual flaws instead of arguing against the framing.
Native Anthropic server tools — code_execution_20250522 and web_search_20250305 bundle retrieval/computation with reasoning in a single API call.
Per-request trace dump + ?debug=1 — every prediction writes a full JSON trace of plan → research → brief → forecasts → critique → final to disk, and the deployed endpoint exposes it via a query param for remote inspection.
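A minimal sketch of a per-request trace writer, assuming one JSON file per request keyed by a random ID. The function name, file layout, and stage keys are illustrative; the writeup does not show the actual trace format or how the debug query parameter is wired into the endpoint.

```python
import json
import tempfile
import uuid
from pathlib import Path


def write_trace(trace_dir: Path, stages: dict) -> Path:
    """Write one request's stage outputs to a JSON file and return its path."""
    trace_dir.mkdir(parents=True, exist_ok=True)
    request_id = uuid.uuid4().hex
    path = trace_dir / f"{request_id}.json"
    payload = {"request_id": request_id, **stages}
    # default=str keeps the dump from failing on non-JSON types such as datetimes.
    path.write_text(json.dumps(payload, indent=2, default=str))
    return path


# Placeholder stage outputs in pipeline order.
example_stages = {
    "plan": {"sub_questions": ["placeholder sub-question"]},
    "research": [],
    "brief": "placeholder brief",
    "forecasts": [],
    "critique": "placeholder critique",
    "final": {"Yes": 0.6, "No": 0.4},
}
trace_path = write_trace(Path(tempfile.mkdtemp()), example_stages)
print(trace_path)
```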

How we built it

Python 3.12 + FastAPI on Render Hobby. Anthropic + OpenAI SDKs. httpx + asyncio for the research fan-out. pydantic for typed payloads between stages. tenacity for retry/backoff on rate-limited public APIs. Total: about 2,500 LOC across 20 Python files.

What we learned

Decomposition is the work. Once each sub-question has an explicit answer in the brief and the forecaster must write a one-sentence inference per sub-question, calibration improves and rationales become legible.
Red-team stages need the same interpretation guidance as forecasters, or they argue against the framing.
Multi-outcome events need adaptive floors; a fixed 0.01 minimum silently degrades the Brier score when N > 5.
Native Anthropic tools (code execution, web search) are underused — they bundle reasoning with retrieval/execution in one call and the trace stays clean.

Built With

claude
python
render

Updates

Will Wu started this project — May 17, 2026 09:22 PM EDT
