Prophet Hacks Agent
A six-stage forecasting agent for the Prophet Arena Forecasting Track. Decomposes every question into Fermi sub-questions, runs twelve research tools in parallel, ensembles Opus 4.7 + GPT-5.5 forecasts, and red-teams its own answer before committing a calibrated probability.
Inspiration
Tetlock's superforecasters don't think harder — they think more systematically. They decompose every question into specific empirical sub-questions, anchor on base rates, and red-team their own confidence before committing. We wanted to see how close a single LLM agent could get to that workflow if we structured the whole pipeline around it.
Brier scoring,
$$\text{Brier} = \frac{1}{N}\sum_i (p_i - y_i)^2,$$
punishes overconfidence quadratically — so the agent needs both information edge and calibration discipline. Every design decision came back to that tension.
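To make the quadratic penalty concrete, here is a minimal sketch of the score for binary outcomes. The function name `brier` and the sample forecasts are illustrative assumptions, not taken from the agent's code.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# One event that did NOT happen (y = 0), forecast at two confidence levels.
modest = brier([0.70], [0])      # about 0.49
confident = brier([0.95], [0])   # about 0.90

# The same two forecasts when the event DID happen (y = 1).
modest_right = brier([0.70], [1])     # about 0.09
confident_right = brier([0.95], [1])  # about 0.0025
```

Moving from 0.70 to 0.95 saves under 0.09 when the forecast is right but costs about 0.41 when it is wrong, which is why extra confidence only pays off when it is backed by evidence.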
Architecture
The agent is a six-stage async pipeline. Each stage hands a well-typed payload to the next; every stage logs into a per-request trace dump for debugging.
① Planner — Opus 4.7. Reads the event and a dynamically-rendered catalog of available tools. Produces a JSON plan with 3–7 Fermi sub-questions and a list of which tools to invoke with what args. The planner is the only stage that sees the full tool catalog; downstream stages only see the brief.
② Parallel Research. Twelve research tools fan out via asyncio.gather. Each tool is a self-registering module gated on its API key — the planner only sees tools whose env vars are set, and failures are timeout-isolated so one bad tool can't sink the pipeline. The catalog spans cross-market signals (Kalshi, Polymarket), news search (Anthropic web search, a custom semantic-search server), structured data (FRED, Financial Modeling Prep, CoinGecko, Congress.gov, CourtListener, Odds API), background (Wikipedia), and native code execution for math.
③ Synthesis — Sonnet 4.6. Collapses raw research into a structured evidence brief. Required to include a dedicated section that answers each sub-question with source-tagged 1–3 sentence answers, plus market anchors and calibration anchors.
④ Two parallel forecasters — Opus 4.7 + GPT-5.5. Both receive the same brief, must walk through the sub-questions and write a one-sentence inference per sub-question before committing probabilities. The two models are run via asyncio.gather for free latency parallelism.
⑤ Devil's Advocate — Sonnet 4.6 (no extended thinking). Red-teams both forecasts: which sub-questions did they underweight, where do they disagree, are they over- or under-confident given the brief.
⑥ Aggregator — Opus 4.7. Synthesizes brief + both forecasts + critique into the final probability distribution. Explicitly empowered to disagree with both upstream forecasters when the critique is sharp — not an averaging step.
Key design decisions
Model tiering. Opus 4.7 for the three high-leverage stages (planner, forecaster, aggregator). Sonnet 4.6 for synthesis and DA — those stages are mostly summarization / pattern-matching, not generation. Dropping the DA from Opus to Sonnet cut per-event cost by roughly 80 cents with no measurable calibration loss.
Extended thinking on, but at effort=low. Adaptive thinking on Claude 4 (via thinking.type=adaptive + output_config.effort=low) plus reasoning_effort=low on GPT-5.5 hits the sweet spot: roughly a quarter of the thinking tokens of effort=high, with most of the calibration benefit retained. Per-event cost: about \$1.30 vs about \$3.70 at high.
Adaptive probability floor. Fixed-0.01 floors crowd confident answers in multi-outcome events — a 23-team FA Cup with a 0.01 per-loser floor locks 22% of probability mass away from the correct answer. We use
$$\text{floor}(N) = \max\!\left(0.002,\ \frac{0.05}{N}\right)$$
so total floor mass stays at 5% for any event with up to 25 outcomes; beyond that the 0.002 minimum takes over and the total grows as 0.002 per outcome.
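A minimal sketch of the floor formula follows. The write-up does not say how the floor is applied to a raw distribution, so `apply_floor` is an assumption: it gives every outcome the floor and spreads the remaining mass in proportion to the raw probabilities, which keeps the result normalized with no entry below the floor.

```python
def adaptive_floor(n_outcomes):
    """Per-outcome minimum probability: max(0.002, 0.05 / N)."""
    if n_outcomes < 1:
        raise ValueError("need at least one outcome")
    return max(0.002, 0.05 / n_outcomes)


def apply_floor(probs):
    """Assumed application: floor for everyone, remainder shared proportionally."""
    n = len(probs)
    floor = adaptive_floor(n)
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have positive total mass")
    free_mass = 1.0 - n * floor
    return [floor + free_mass * (p / total) for p in probs]


# 23-outcome event where the model is certain of outcome 0.
floored = apply_floor([1.0] + [0.0] * 22)
winner = floored[0]  # about 0.952, versus 0.78 under a fixed 0.01 floor
```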
Bucket-interpretation guidance. For threshold-style outcomes like "Above 3.50%" / "Above 3.75%" on Fed rate questions, every prompt explicitly spells out that the outcomes are mutually-exclusive buckets, not cumulative thresholds. Without this, Opus and GPT-5.5 silently disagreed — we caught it via a side-by-side trace and the fix moved probability mass from 22% on the wrong bucket to 84% concentrated on the two real scenarios.
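The two readings produce very different numbers from the same belief. The sketch below converts cumulative "above threshold" probabilities into mutually exclusive buckets; the helper name, the two thresholds, and the probabilities are illustrative assumptions, and the exact bucket boundaries of a real market come from its rules.

```python
def cumulative_to_buckets(above):
    """above[i] = P(value > threshold_i), thresholds in ascending order.

    Returns [P(below first), P(between t0 and t1), ..., P(above last)].
    """
    if not above:
        raise ValueError("need at least one threshold")
    if any(lower < higher for lower, higher in zip(above, above[1:])):
        raise ValueError("P(value > t) cannot increase as t rises")
    buckets = [1.0 - above[0]]
    buckets += [lower - higher for lower, higher in zip(above, above[1:])]
    buckets.append(above[-1])
    return buckets


# Cumulative reading: P(rate > 3.50%) = 0.90, P(rate > 3.75%) = 0.30.
cumulative = [0.90, 0.30]          # sums to about 1.2, not a distribution
buckets = cumulative_to_buckets(cumulative)
# Bucket reading: about 0.10 below 3.50%, 0.60 in between, 0.30 above 3.75%.
```

A forecaster that answers with the cumulative numbers against bucket-style outcomes puts 0.90 on a bucket it actually believes has 0.60, which is the kind of silent disagreement the shared prompt block is meant to prevent.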
Self-registering tool catalog. Tools live in tools/, each a single file with a @register decorator. Adding a new tool is one file; no plumbing edits. The planner's prompt is regenerated per-request from the registry, so newly-added tools become visible to the planner the moment they're available.
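One way such a registry could work is sketched below. Only the `@register` decorator and the env-var gating come from the description above; the `Tool` record, `available_tools`, `render_catalog`, the decorator's parameters, and the two stub tools are hypothetical.

```python
import os
from typing import Callable, Dict, NamedTuple, Optional


class Tool(NamedTuple):
    name: str
    description: str
    env_var: Optional[str]  # API-key variable that gates the tool, if any
    fn: Callable


REGISTRY: Dict[str, Tool] = {}


def register(name, description, env_var=None):
    """Decorator: add the function to the registry when its module is imported."""
    def decorator(fn):
        REGISTRY[name] = Tool(name, description, env_var, fn)
        return fn
    return decorator


def available_tools(env=None):
    """Tools with no key requirement, or whose key is set in the environment."""
    env = os.environ if env is None else env
    return [t for t in REGISTRY.values() if t.env_var is None or env.get(t.env_var)]


def render_catalog(env=None):
    """Text block for the planner prompt, rebuilt on every request."""
    return "\n".join(f"- {t.name}: {t.description}" for t in available_tools(env))


@register("fred", "Macro time series from FRED", env_var="FRED_API_KEY")
async def fred(series_id):
    raise NotImplementedError  # stub for illustration


@register("wikipedia", "Background articles")
async def wikipedia(title):
    raise NotImplementedError  # stub for illustration
```

With this shape, a tool whose key is missing never appears in the planner's prompt, so the planner cannot request it.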
Key innovations
- Fermi decomposition threaded through every stage — sub-questions are generated by the planner, answered explicitly in the synthesis brief, used as reasoning anchors by both forecasters, audited by the DA, and cross-referenced by the aggregator.
- Devil's Advocate that knows the interpretation rules. Same bucket-interpretation block as the forecasters, so it critiques actual flaws instead of arguing against the framing.
- Native Anthropic server tools: code_execution_20250522 and web_search_20250305 bundle retrieval/computation with reasoning in a single API call.
- Per-request trace dump + ?debug=1: every prediction writes a full JSON trace of plan → research → brief → forecasts → critique → final to disk, and the deployed endpoint exposes it via a query param for remote inspection.
How we built it
Python 3.12 + FastAPI on Render Hobby. Anthropic + OpenAI SDKs. httpx + asyncio for the research fan-out. pydantic for typed payloads between stages. tenacity for retry/backoff on rate-limited public APIs. Total: about 2,500 LOC across 20 Python files.
What we learned
- Decomposition is the work. Once each sub-question has an explicit answer in the brief and the forecaster must write a one-sentence inference per sub-question, calibration improves and rationales become legible.
- Red-team stages need the same interpretation guidance as forecasters, or they argue against the framing.
- Multi-outcome events need adaptive floors; a fixed 0.01 minimum silently kills Brier when N > 5.
- Native Anthropic tools (code execution, web search) are underused — they bundle reasoning with retrieval/execution in one call and the trace stays clean.
Built With
- claude
- python
- render