OracleFlow

Inspiration

Prediction markets are one of the best forecasting tools humans have built, but AI agents consistently underperform them. We wanted to know why, and whether we could close the gap. The core problem: LLMs trained with RLHF systematically hedge toward 50%, and a single LLM call can't separate what it knows from what the evidence actually says. We set out to fix that.

What it does

OracleFlow is a structured-debate forecasting agent that produces calibrated probability estimates for real-money prediction market questions on Kalshi. For every question, it runs a 5-stage pipeline: anchors to live market prices, retrieves evidence in parallel from 7 sources (news, finance, weather, ESPN, Manifold Markets, historical resolved markets, and question decomposition), runs a structured 3-call LLM debate, dynamically extremizes based on confidence, and applies Platt scaling calibration fitted on 1,393 resolved markets.
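To make the last stage concrete, here is a minimal sketch of Platt scaling applied in log-odds space. It is not the repo's actual code: the function names (`fit_platt`, `calibrate`), the learning rate, and the step count are all assumptions, and plain gradient descent is used only to keep the example dependency-free. A real fit would more likely use an off-the-shelf logistic regression.

```python
import math

def logit(p, eps=1e-6):
    # Clip so 0 and 1 do not produce infinite log-odds.
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)

def fit_platt(probs, outcomes, lr=0.1, steps=5000):
    """Fit p_cal = sigmoid(a * logit(p) + b) by minimizing log loss.

    probs: raw probabilities; outcomes: 0/1 resolutions.
    lr and steps are illustrative values, not tuned ones.
    """
    xs = [logit(p) for p in probs]
    a, b = 1.0, 0.0  # start at the identity mapping
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = sigmoid(a * x + b) - y  # d(log loss)/dz
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(p, a, b):
    # Apply the fitted mapping to a new raw probability.
    return sigmoid(a * logit(p) + b)

# Toy data: raw 0.6 resolves YES 9 times in 10, raw 0.4 only 1 in 10,
# i.e. the raw forecasts hedge toward 50%.
raw = [0.6] * 10 + [0.4] * 10
resolved = [1] * 9 + [0] + [1] + [0] * 9
a, b = fit_platt(raw, resolved)
```

On this toy data the fitted slope `a` comes out well above 1, so `calibrate(0.6, a, b)` moves toward 0.9. That is the behavior you want when the raw forecasts are underconfident.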

How we built it

We forked the ai-prophet SDK and built the full pipeline in Python using OpenRouter (Claude 3.5 Haiku + GPT-4o-mini). The three-call debate — outside view, inside view, reconciler — is the core architectural innovation. Each call has a strict role: the outside view reasons from base rates only, the inside view updates on evidence, and the reconciler makes a final judgment call with explicit rules. The disagreement between outside and inside views feeds a unified confidence signal that controls both log-odds extremization and market price blending. We added parallel retrieval via ThreadPoolExecutor, Platt scaling calibration, and a historical RAG system using keyword Jaccard overlap over resolved Kalshi markets — no embeddings needed.
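The way view disagreement can drive both extremization and market blending is easier to see in code. The sketch below is one plausible reading of the description above, not the project's implementation: the linear confidence ramp, the `threshold`, `max_alpha`, and `min_market_weight` values, and the order of the two steps are all assumptions.

```python
import math

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)

def confidence(outside, inside, threshold=0.30):
    # 1.0 when the two views agree, falling linearly to 0.0 once
    # their gap reaches the (hypothetical) threshold.
    gap = abs(outside - inside)
    return max(0.0, 1.0 - gap / threshold)

def extremize(p, conf, max_alpha=1.5):
    # Scale log-odds by alpha >= 1. Zero confidence gives alpha = 1,
    # which leaves the probability unchanged.
    alpha = 1.0 + (max_alpha - 1.0) * conf
    return sigmoid(alpha * logit(p))

def blend_with_market(p, market_price, conf, min_market_weight=0.3):
    # Zero confidence defers fully to the market; full confidence
    # still keeps a floor weight on the market price.
    w_market = 1.0 - (1.0 - min_market_weight) * conf
    return w_market * market_price + (1.0 - w_market) * p

def forecast(outside, inside, reconciled, market_price):
    conf = confidence(outside, inside)
    return blend_with_market(extremize(reconciled, conf), market_price, conf)
```

Because one `conf` value feeds both functions, extremization switches off at exactly the disagreement level where the forecast falls back to the market price. For example, `forecast(0.2, 0.6, 0.5, 0.35)` has a gap above the threshold and returns the market price, 0.35.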

Challenges we ran into

Getting multi-outcome markets right was the biggest correctness fix. Kalshi's "Who wins the NBA Finals?" question spawns 30 separate binary markets, and naively using the event-level ticker gave a meaningless 50% baseline. We built a focal-team extractor that parses the description, fetches the full Kalshi series, and overrides the baseline with the specific team's live price. We also hit several silent bugs: a variable name collision that dropped our historical RAG evidence from prompts, a category filter that suppressed all empirical base rates for categorized events, and stale fred_block/fred_data references that made every prediction fall back silently to 50%.

Accomplishments that we're proud of

- +79.8% improvement over coin flip on contested real-money markets with live price signals
- 17.7% Brier score improvement from Platt calibration fitted on resolved market data, with no agent re-runs needed
- A coherent two-stage confidence system where view disagreement drives both extremization and market blending, reaching maximum conservatism at the same threshold
- Semantic opposition detection that prevents blending Manifold Markets signals asking the opposite question (e.g. "Will the Fed cut rates?" vs "Will the Fed hike rates?")

What we learned

The market price is the strongest signal you have. Beating it requires clear, recent, high-quality evidence, not just LLM confidence. Structured debate forces the model to separate what it knows from what the data says, which is the key to avoiding RLHF hedging. And silent bugs that fall back to a default 50% are the hardest to catch: the agent keeps running, just wrong.
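One way to make the silent 50% fallback visible is to count and log every fallback, then fail the run if any stage falls back too often. The sketch below illustrates the idea only. The wrapper name, the stage labels, and the 5% limit are hypothetical and not taken from the repo.

```python
import logging
from collections import Counter

logger = logging.getLogger("oracleflow.sketch")  # hypothetical logger name
fallbacks = Counter()  # fallback count per pipeline stage
DEFAULT_PROB = 0.5

def with_visible_fallback(stage, fn, *args, **kwargs):
    """Run one pipeline step; on any error, fall back to 50% but record it.

    Returns (value, fell_back) so callers can tell a real 0.5 from a default.
    """
    try:
        return fn(*args, **kwargs), False
    except Exception as exc:  # includes NameError from stale references
        fallbacks[stage] += 1
        logger.warning("stage %r fell back to %.2f: %r", stage, DEFAULT_PROB, exc)
        return DEFAULT_PROB, True

def check_fallback_rate(n_predictions, max_rate=0.05):
    # Fail loudly if any stage defaulted more often than expected.
    for stage, count in fallbacks.items():
        if count / n_predictions > max_rate:
            raise RuntimeError(
                f"stage {stage!r} fell back on {count}/{n_predictions} predictions"
            )
```

Calling `check_fallback_rate` at the end of a batch turns "the agent keeps running, just wrong" into a hard failure that shows up in the first test run.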

What's next for AI Forecasting Track

Fitting calibration on actual agent predictions (not market price as proxy), adding real-time election data feeds, and expanding the historical RAG corpus. The structured debate architecture is model-agnostic — swapping in a stronger reconciler model should yield immediate gains without changing anything else.

Code and run instructions: https://github.com/mitchellcrevier/uncommon-hacks/tree/main