The Prophet Oracle

"In the beginning, there was uncertainty. And the Norns said: 'Let there be calibrated probability estimates.' And it was good."

Inspiration

In Norse mythology, three Norns sit beneath Yggdrasil — Urd (what was), Verdandi (what is), and Skuld (what shall be) — weaving the destiny of all beings into the fabric of reality.

We thought: what if we gave them API keys?

The Prophet Hacks Forecasting Track challenged us to build an autonomous agent that receives prediction market events and returns calibrated probabilities. Not just guesses: calibrated predictions, where saying 70% means the event happens about 70% of the time. This is one of the hardest problems in AI forecasting, and we attacked it with the rigor of academic research and the hubris of people who named their project after mythological fate-weavers.

What It Does

The Prophet Oracle receives a prediction market event (e.g., "Will the Fed cut rates before August 2026?"), autonomously researches it using real-time web search, reasons about it through a council of three AI models, reconciles its prediction with market prices, and returns a calibrated probability — all within 10 minutes.

It handles:

Binary events (Yes/No)
Multi-outcome events (Who wins among N candidates)
Non-mutually-exclusive events (Which K of N will qualify)
Any category — economics, geopolitics, sports, science, entertainment, technology

How We Built It — The Three Norns Architecture

graph TD
    A[📨 Event Request] --> B[🔀 Router]
    B --> C[Category Detection]
    B --> D[Complexity Assessment]
    C --> E[🔍 Urd - Research Phase]
    D --> E
    E --> F[Tavily Search]
    E --> G[Serper.dev Fallback]
    E --> H[DuckDuckGo Tertiary]
    F --> I[📊 Evidence Synthesis]
    G --> I
    H --> I
    I --> J[⚖️ Verdandi - Ensemble Reasoning]
    J --> K[Claude Sonnet 4]
    J --> L[Gemini 2.5 Flash]
    J --> M[GPT-4o]
    K --> N[Logit-Space Averaging]
    L --> N
    M --> N
    N --> O{Models Disagree >15%?}
    O -->|Yes| P[🎯 Qwen 72B Tiebreaker]
    O -->|No| Q[🔮 Skuld - Calibration]
    P --> Q
    Q --> R[Supervisor Reconciliation]
    R --> S[Market Anchoring]
    S --> T[Platt Scaling + Shrinkage]
    T --> U[✅ Final Prediction]
🕰️ Urd — The Past (Research Phase)
Our agent doesn't guess. It investigates.

Using Tavily (primary), Serper.dev (secondary), and DuckDuckGo (tertiary), it gathers real-time evidence — news articles, market data, historical precedents. Category-specific search strategies target the most relevant sources:

| Category | Preferred Sources |
| --- | --- |
| Economics | FRED, BLS, Reuters |
| Geopolitics | AP News, Foreign Affairs |
| Sports | ESPN, Sports Reference |
| Science | Nature, Science, arXiv |
| Technology | TechCrunch, Ars Technica |

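The primary, secondary, tertiary search chain can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `search_with_fallback` and the three stub providers are hypothetical names, and real providers would wrap the Tavily, Serper.dev, and DuckDuckGo clients.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a query and returns a list of result snippets.
SearchFn = Callable[[str], list[str]]


def search_with_fallback(
    query: str, providers: Sequence[tuple[str, SearchFn]]
) -> tuple[str, list[str]]:
    """Try providers in order; return (name, results) from the first that yields results."""
    for name, search in providers:
        try:
            results = search(query)
        except Exception:
            # Quota exhaustion, timeouts, and HTTP errors all fall through to the next provider.
            continue
        if results:
            return name, results
    # Every provider failed or returned nothing.
    return "none", []


# Stand-in providers for illustration only (assumptions, not real client calls).
def tavily_stub(query: str) -> list[str]:
    raise RuntimeError("credits exhausted")


def serper_stub(query: str) -> list[str]:
    return [f"result for: {query}"]


def ddg_stub(query: str) -> list[str]:
    return ["unused"]


provider, hits = search_with_fallback(
    "Fed rate cut August 2026",
    [("tavily", tavily_stub), ("serper", serper_stub), ("duckduckgo", ddg_stub)],
)
```

Here the first provider raises, so the second one answers and the third is never called.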
⚖️ Verdandi — The Present (Ensemble Reasoning)
Three LLMs independently analyze the evidence using FutureSearch-style structured prompting:

Resolution Analysis — "What EXACTLY needs to happen for this to resolve YES?"
Base Rate Assessment — Historical frequency of similar events
YES Thesis — Steelman the affirmative case
NO Thesis — Steelman the negative case
Key Factors — What 2-3 variables matter most?
Synthesis — Final probability assignment
Their predictions are aggregated using logit-space averaging — the state-of-the-art method from Bayesian Logit Forecasting research:

$$
\hat{p} = \sigma\left(\frac{1}{N}\sum_{i=1}^{N}\log\frac{p_i}{1-p_i}\right)
$$

Averaging in logit space respects extreme probabilities: a confident forecast near 0 or 1 keeps its weight instead of being diluted, as it would be by a simple average in probability space.
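The pooling rule can be sketched in a few lines of Python. The function names and the `eps` clamp (which keeps the logarithm finite if a model returns exactly 0 or 1) are assumptions made for illustration.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clamped away from 0 and 1 so the log stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit_average(probs: list[float]) -> float:
    """Average the forecasts in log-odds space, then map back to a probability."""
    if not probs:
        raise ValueError("need at least one forecast")
    return sigmoid(sum(logit(p) for p in probs) / len(probs))


probs = [0.9, 0.6, 0.99]
pooled = logit_average(probs)      # roughly 0.917
simple = sum(probs) / len(probs)   # 0.83
```

With these three forecasts the pooled estimate lands near 0.917, while the plain mean is 0.83: the confident 0.99 forecast counts for more in log-odds space.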

🔮 Skuld — The Future (Calibration & Reconciliation)
A Supervisor Agent reconciles our prediction with market prices — the collective wisdom of traders with real money at stake. Then adaptive calibration applies:

Time-to-Resolution Weighting: Near-term events (1-3 days) anchor 80% to market. Long-term events (8-14 days) anchor only 30%.
Category Multipliers: Sports get 2x market anchoring (we're not arrogant enough to fight the sharps). Geopolitics gets 0.7x (markets are slow on political events).
Overconfidence Shrinkage: Extreme predictions (>90% or <10%) are pulled toward 50% to prevent catastrophic Brier score penalties.
Confidence Threshold: If our prediction deviates less than 5% from market, we just return market prices. No edge = no deviation.
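The last two rules can be sketched as below. The 10%/90% bounds and the 5% threshold come from the list above; everything else is an assumption for illustration: the function names, the `keep` factor (how much of the excess beyond the bound survives), and the order in which the two steps run.

```python
def shrink_extremes(p: float, lo: float = 0.10, hi: float = 0.90, keep: float = 0.5) -> float:
    """Compress the part of p that lies beyond [lo, hi], pulling it toward 0.5.

    `keep` is an assumed value: 0.5 means half of the excess beyond the bound is kept.
    The rule is continuous at the bounds, so the ordering of forecasts is preserved.
    """
    if p > hi:
        return hi + keep * (p - hi)
    if p < lo:
        return lo - keep * (lo - p)
    return p


def apply_confidence_threshold(p: float, market: float, min_edge: float = 0.05) -> float:
    """Return the market price unless our forecast differs from it by at least min_edge."""
    if abs(p - market) < min_edge:
        return market
    return p


def finalize(p: float, market: float) -> float:
    """Assumed ordering: shrink first, then compare against the market."""
    return apply_confidence_threshold(shrink_extremes(p), market)
```

For example, `shrink_extremes(0.98)` gives about 0.94, and `finalize(0.62, 0.60)` returns the market price 0.60 because the gap is under the threshold.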
Key Technical Innovations
1. Counter-Evidence Debiasing
When the initial prediction is strong (>70%), we actively search for reasons it might be wrong:

query = f"{event.title} why {top_outcome} might NOT happen unlikely"
This combats confirmation bias — the #1 killer of forecasting accuracy.

2. Iterative Research (BLF-Inspired)
For moderate-confidence predictions (40-70%), we do a second research pass with refined queries. The Bayesian Logit Forecasting paper shows iterative belief updating significantly improves accuracy.
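One way to picture how these two follow-up steps are chosen is a small routing function keyed on the initial confidence. The thresholds are the ones stated above; `plan_follow_up` and the two refined query templates are hypothetical, since the actual refined queries are not shown here.

```python
def plan_follow_up(title: str, top_outcome: str, top_prob: float) -> list[str]:
    """Pick follow-up search queries from the initial probability of the top outcome."""
    if top_prob > 0.70:
        # Strong prediction: look for reasons it might be wrong.
        return [f"{title} why {top_outcome} might NOT happen unlikely"]
    if 0.40 <= top_prob <= 0.70:
        # Moderate confidence: second research pass (query templates are assumed).
        return [f"{title} latest news", f"{title} {top_outcome} analysis"]
    # Below 40% (possible for multi-outcome events): no extra pass in this sketch.
    return []


queries = plan_follow_up("Will the Fed cut rates before August 2026?", "Yes", 0.82)
```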

3. Resolution Analysis
The prompt explicitly forces the model to analyze resolution mechanics:

"What EXACTLY needs to happen for each outcome to resolve? Are there edge cases, technicalities, or ambiguities?"

This catches scenarios like: "Will the US ban TikTok?" — where the answer depends on whether an executive order counts, whether a court injunction delays it, etc.

4. Adaptive Market Anchoring
Not all markets are equally efficient:

| Time Horizon | Anchor Weight | Rationale |
| --- | --- | --- |
| 1-3 days | 80% | Market very efficient near resolution |
| 4-7 days | 50% | Balanced |
| 8-14 days | 30% | Market less efficient, trust research |

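A minimal sketch of the anchoring step, using the weights from the table and the category multipliers described under Skuld. Several details are assumptions: the linear blend in probability space, the cap on the weight at 1.0 (needed once the sports multiplier pushes it past 100%), and reusing the 30% weight for horizons beyond 14 days.

```python
def base_anchor_weight(days_to_resolution: int) -> float:
    """Market weight by time horizon, following the table."""
    if days_to_resolution <= 3:
        return 0.80
    if days_to_resolution <= 7:
        return 0.50
    return 0.30  # 8-14 days; longer horizons reuse this value (assumption)


# Multipliers from the calibration section; unlisted categories default to 1.0.
CATEGORY_MULTIPLIER = {"sports": 2.0, "geopolitics": 0.7}


def anchor_to_market(model_p: float, market_p: float, days: int, category: str) -> float:
    """Blend the model forecast with the market price (linear blend is an assumption)."""
    weight = base_anchor_weight(days) * CATEGORY_MULTIPLIER.get(category, 1.0)
    weight = min(weight, 1.0)  # cap so the blend stays between the two inputs
    return weight * market_p + (1.0 - weight) * model_p
```

With a model forecast of 0.70 and a market price of 0.60, a two-day economics event comes out near 0.62, while a five-day sports event collapses to the market price because the doubled weight is capped at 1.0.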
5. Graceful Degradation
The system is built to degrade rather than crash. Depending on how many models respond:

3 succeed → logit-space average
2 succeed → average of 2
1 succeeds → use single result
0 succeed → uniform distribution (safe fallback)
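For a binary event, this ladder collapses into one function. The sketch below is illustrative: `aggregate_binary` is a hypothetical name, failed calls are represented as `None`, and it pools in logit space at every rung, which is an assumption since the two-model case is described only as an "average of 2".

```python
import math
from typing import Optional


def aggregate_binary(results: list[Optional[float]]) -> float:
    """Pool whichever model outputs arrived; fall back to 0.5 if none did.

    Outputs of exactly 0 or 1 are treated as malformed and dropped (assumption),
    which also keeps the log-odds finite.
    """
    valid = [p for p in results if p is not None and 0.0 < p < 1.0]
    if not valid:
        return 0.5  # uniform over Yes/No
    mean_logit = sum(math.log(p / (1.0 - p)) for p in valid) / len(valid)
    return 1.0 / (1.0 + math.exp(-mean_logit))
```

With a single surviving model the function simply returns that model's forecast, since averaging one log-odds value changes nothing.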
Challenges We Faced
"The Norns Argued"
Model disagreement handling was our biggest challenge. GPT-5 kept returning empty responses. Gemini 3.1 Pro refused to output clean JSON. We had to swap to more reliable models (Gemini 2.5 Flash, GPT-4o) mid-development. Lesson: reliability beats sophistication.

"Yggdrasil's Branches Are Tangled"
Non-mutually-exclusive events (top-K) are fundamentally hard. LLMs think in "who wins" mode, not "who qualifies" mode. We implemented detection heuristics and skip normalization for these events, but the models still struggle to output independent probabilities.

"The Well of Urd Ran Dry"
Tavily credits burned faster than expected during testing (~800 credits in development). We implemented API key rotation and search fallback chains to ensure resilience.

"Ragnarök of the Rate Limits"
Featherless AI's Qwen 72B has a concurrency limit of 4 units, and the model costs 4 units per request, so we can only call it sequentially, never in parallel. If the tiebreaker call fails, the system carries on with the three-model ensemble result.
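A sketch of how sequential-only calls with soft failure might look using `asyncio`. The helper name `call_tiebreaker`, the 60-second timeout, and the stand-in coroutines are assumptions; a real `call` would wrap the Featherless AI request.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def call_tiebreaker(
    call: Callable[[], Awaitable[float]],
    slot: asyncio.Semaphore,
    timeout_s: float = 60.0,
) -> Optional[float]:
    """Run one tiebreaker request at a time; return None instead of raising on failure."""
    async with slot:  # one request uses the whole concurrency budget
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except Exception:
            # Timeouts, rate limits, and malformed responses all degrade to "no tiebreaker".
            return None


async def demo() -> list[Optional[float]]:
    slot = asyncio.Semaphore(1)  # created inside the running event loop

    async def ok() -> float:
        return 0.64

    async def broken() -> float:
        raise RuntimeError("rate limited")

    # Both calls are scheduled together, but the semaphore serializes them.
    return list(await asyncio.gather(call_tiebreaker(ok, slot), call_tiebreaker(broken, slot)))


results = asyncio.run(demo())
```

The caller only has to check for `None` and keep the ensemble result when the tiebreaker is unavailable.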

What We Learned
Market anchoring is the single biggest edge. When in doubt, trust the crowd. Traders with real money at stake are usually right.

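Calibration > Accuracy. Saying 90% when such events happen only 80% of the time scores worse than honestly saying 80%. Brier score punishes overconfidence quadratically: the expected penalty grows with the square of the gap between the stated probability and the true frequency.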
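The quadratic penalty can be checked numerically. The sketch below computes the expected Brier score of a binary forecast when the event's true frequency is known; `expected_brier` is a name chosen for illustration.

```python
def expected_brier(forecast: float, true_freq: float) -> float:
    """Expected squared error of a binary forecast when the event occurs with true_freq."""
    return true_freq * (1.0 - forecast) ** 2 + (1.0 - true_freq) * forecast ** 2


honest = expected_brier(0.80, 0.80)         # about 0.16
overconfident = expected_brier(0.90, 0.80)  # about 0.17
reckless = expected_brier(0.99, 0.80)       # about 0.196

# The extra cost equals the squared gap between forecast and true frequency.
penalty_small = overconfident - honest      # about 0.01   = (0.90 - 0.80) ** 2
penalty_large = reckless - honest           # about 0.0361 = (0.99 - 0.80) ** 2
```

Roughly doubling the gap from 0.10 to 0.19 almost quadruples the penalty, which is why shrinking extreme forecasts is cheap insurance.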

The hardest part isn't the prediction — it's understanding the resolution criteria. Edge cases in how events resolve account for more forecasting errors than bad reasoning.

Ensemble diversity matters more than individual model quality. Three different models that disagree productively outperform one excellent model that's confidently wrong.

Tech Stack
| Component | Technology |
| --- | --- |
| Framework | FastAPI (Python 3.14) |
| LLM Ensemble | Claude Sonnet 4 + Gemini 2.5 Flash + GPT-4o via OpenRouter |
| Tiebreaker | Qwen 72B via Featherless AI |
| Search | Tavily (primary) + Serper.dev (secondary) + DuckDuckGo (tertiary) |
| Market Data | Kalshi Public API |
| Calibration | Platt Scaling (√3 coefficient) + Adaptive Shrinkage |
| Hosting | AWS EC2 (t2.small, Ubuntu 26.04) |
| Monitoring | Custom real-time dashboard with OpenRouter balance tracking |
| Aggregation | Logit-space averaging (BLF method) |

What's Next
The evaluation runs May 17-31. Since we're self-hosting, we can iterate live — tuning parameters based on actual prediction outcomes as they resolve. The Norns never stop weaving.

Planned improvements during evaluation:

Calibration curve analysis after first 50 predictions
Dynamic shrinkage adjustment based on observed accuracy
Category-specific model weighting based on track record
"We do not predict the future. We calculate the probability that the future has already decided."

— The Prophet Oracle 🔮


---

## Built With
`python` `fastapi` `openrouter` `claude-sonnet-4` `gemini` `gpt-4o` `tavily` `aws-ec2` `docker` `qwen` `featherless-ai`

Updates

Shyam Sharma started this project — May 17, 2026 03:04 PM EDT
