Dr Strange

Inspiration

Forecasting is one of the clearest ways to test whether an AI system can reason about the real world. Instead of only answering static questions, a forecasting agent has to gather evidence, compare signals, handle uncertainty, and commit to calibrated probabilities. Prophet Hacks gave us a concrete way to build and evaluate that kind of system on live events.

What We Built

We built CodexProphet, a Codex-powered forecasting endpoint for the Prophet Hacks forecasting track. The system exposes a public /predict API endpoint that receives event payloads from the evaluator and launches a Codex forecasting loop. The agent uses local tools for market, sports, finance, and Kalshi-style lookups, and the service returns a probability distribution over the event outcomes.

How It Works

When the Prophet evaluator sends an event, CodexProphet saves the full input and passes it into a structured Codex forecasting prompt. The agent works with the repo’s tools and forecasting instructions, and the service validates the final output against the expected schema before returning probabilities plus a short rationale.
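The flow above can be sketched roughly as follows. This is a minimal illustration, not the actual CodexProphet code: the field names ("outcomes", "probabilities", "rationale") and the functions handle_event and validate_forecast are hypothetical, and the real Prophet Hacks schema may differ. The agent is passed in as a plain callable so the validation step can be exercised without launching Codex.

```python
import json
import math
from typing import Callable


def validate_forecast(outcomes: list[str], forecast: dict) -> dict:
    """Check that a forecast covers exactly the event's outcomes with valid probabilities.

    The field names here are assumptions for illustration only.
    """
    probs = forecast.get("probabilities")
    if not isinstance(probs, dict):
        raise ValueError("forecast must contain a 'probabilities' object")
    if set(probs) != set(outcomes):
        raise ValueError("probabilities must cover exactly the event outcomes")
    for name, p in probs.items():
        is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
        if not is_number or not 0.0 <= p <= 1.0:
            raise ValueError(f"invalid probability for {name!r}")
    if not math.isclose(sum(probs.values()), 1.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    return {"probabilities": probs, "rationale": str(forecast.get("rationale", ""))}


def handle_event(event: dict, run_agent: Callable[[dict], str]) -> dict:
    """Run the agent on one event and return a validated response."""
    raw = run_agent(event)        # hypothetical: the agent returns JSON text
    forecast = json.loads(raw)    # raises ValueError on malformed JSON
    return validate_forecast(event["outcomes"], forecast)
```

Keeping validation separate from the agent call means a malformed agent answer fails loudly inside the service instead of reaching the evaluator.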

The system logs every request and prediction locally on the Mac mini, including the event payload, the final probabilities, the rationale, the runtime, and any validation or execution errors.
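One simple way to keep such a log is an append-only JSON Lines file, one record per request. The sketch below assumes that layout; the function name log_prediction, the record keys, and the file path are illustrative and not taken from the project.

```python
import json
import time
from pathlib import Path


def log_prediction(log_path, event, probabilities, rationale, runtime_s, error=None):
    """Append one request/prediction record to a JSON Lines file (hypothetical layout)."""
    record = {
        "logged_at": time.time(),      # Unix timestamp of the log write
        "event": event,                # full payload as received
        "probabilities": probabilities,
        "rationale": rationale,
        "runtime_s": runtime_s,
        "error": error,                # validation or execution error, if any
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

An append-only file like this is easy to tail during evaluation and to replay later for backtests.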

What We Learned

A major lesson was that forecasting quality depends heavily on calibration, not just research depth. A model can sound persuasive while still producing poorly calibrated probabilities. We also learned that prediction-market data is useful but incomplete: many events do not map cleanly to an existing market, so the agent needs to search for related signals rather than assume a direct market exists.
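To make the calibration point concrete, here is a small sketch using the multi-outcome Brier score, a standard scoring rule for probabilistic forecasts. We are not claiming this is the metric the Prophet Hacks evaluator uses; the function name and the example numbers are illustrative.

```python
def brier_score(probabilities: dict[str, float], actual: str) -> float:
    """Multi-outcome Brier score: sum of squared errors against the realized outcome.

    0.0 is a perfect forecast; lower is better.
    """
    return sum(
        (p - (1.0 if name == actual else 0.0)) ** 2
        for name, p in probabilities.items()
    )


# Two forecasts for the same event, which resolved "No".
confident = {"Yes": 0.95, "No": 0.05}   # persuasive but wrong
hedged = {"Yes": 0.60, "No": 0.40}      # also leaned wrong, but less sure

print(brier_score(confident, "No"))
print(brier_score(hedged, "No"))
```

The overconfident forecast is penalized far more heavily than the hedged one (roughly 1.8 versus 0.72), even though both leaned the wrong way. That is the sense in which a persuasive answer can still be a poor forecast.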

On the infrastructure side, we learned that temporary tunnel URLs are not reliable for evaluation. We moved from an ephemeral trycloudflare.com tunnel to a persistent named Cloudflare tunnel at predict.hansonwen.dev, with CI/CD through a GitHub self-hosted runner on a Mac mini.

Challenges

The hardest parts were making the endpoint reliable and keeping the output compatible with the evaluator. We had to ensure the service returned the exact schema expected by Prophet Hacks, handled multi-outcome events correctly, and stayed online through a persistent public URL.
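For multi-outcome events, one common safeguard is to clip and renormalize the agent's raw numbers so that every listed outcome gets a probability and the values sum to 1. The sketch below shows that idea under our own assumptions: the function name normalize_distribution and the 0.01 floor are hypothetical and not part of the Prophet Hacks specification.

```python
def normalize_distribution(outcomes: list[str], raw: dict, floor: float = 0.01) -> dict:
    """Return a probability for every outcome, summing to 1.

    Missing or too-small values are raised to `floor` before renormalizing, so
    no outcome ends up at exactly zero. After renormalizing, a floored value
    can land slightly below `floor`. Keys in `raw` that are not listed
    outcomes are ignored.
    """
    if not outcomes:
        raise ValueError("event has no outcomes")
    clipped = {name: max(float(raw.get(name, 0.0)), floor) for name in outcomes}
    total = sum(clipped.values())
    return {name: value / total for name, value in clipped.items()}


# The agent forgot one of three outcomes; it still receives a small probability.
dist = normalize_distribution(["A", "B", "C"], {"A": 0.5, "B": 0.5})
print(dist)
```

A nonzero floor trades a little sharpness for protection against assigning zero probability to an outcome that then occurs.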

Another challenge was tool design. The agent needs enough information to make good forecasts, but too much raw data can overwhelm it. We focused on giving it structured lookup tools and clear instructions while preserving the agent’s responsibility to make the final judgment.

What’s Next

Next, we would improve the retrieval layer, add stronger news and market search, expand sports and finance coverage, and run systematic backtests to tune calibration. We would also add a lightweight dashboard for monitoring incoming tasks, prediction logs, latency, and failure modes during evaluation.

Built With

codex