The Lost Context of Prediction Markets

https://github.com/storoeb/UncommonHacks2026

Inspiration

Prediction markets look simple: read the question, output a probability. But the scoring problem is more subtle. Copying the market price is usually well-calibrated, but it gives you no edge. Deviating from the market can create upside, but it can also destroy calibration.

That tension became the inspiration for The Lost Context of Prediction Markets: a forecasting agent that does not treat each market as a one-off prompt. Instead, it tries to recover the missing context behind a prediction: similar resolved markets, historical base rates, model disagreement, and when it is worth trusting the market versus challenging it.

What it does

The Lost Context of Prediction Markets is an OpenAI-compatible forecasting agent for prediction-market questions.

Given a market prompt, it:

Parses the question, outcomes, market price, and metadata.
Runs a parallel ensemble of LLM forecasters through Wafer.ai.
Searches Snowflake for similar resolved Kalshi markets using Cortex embeddings and vector similarity.
Aggregates historical neighbors into base-rate features.
Uses a calibrator to adjust raw model probabilities.
Applies an alpha policy that decides how far to shade toward or away from the market price.
Returns probabilities through /v1/chat/completions and logs the full trace for debugging.

We also built a Streamlit-in-Snowflake dashboard with live forecasts, neighbor search, Brier vs AVER tradeoffs, and by-category performance.

How we built it

We built the backend in Python with FastAPI, exposing an OpenAI-style /v1/chat/completions endpoint so external evaluation harnesses can call it directly.

Snowflake stores our resolved Kalshi market history, including question text, outcomes, prices, resolutions, and embeddings. We use Snowflake Cortex embeddings plus cosine similarity to retrieve the most relevant historical markets for a new forecast.

The forecasting pipeline combines:

A three-model Wafer.ai ensemble
Snowflake-backed historical retrieval
A sklearn/Snowflake-compatible meta-calibrator
A learned alpha policy for balancing calibrated beliefs against market prices
Streamlit-in-Snowflake for the demo UI
Tests and reproducible scripts for bootstrapping, importing history, embedding questions, and backfilling features

Challenges we ran into

The hardest challenge was that “good forecasting” has two competing meanings. Brier rewards calibration against reality, while return rewards being right relative to the market. A system that simply copies Kalshi can look reasonable on Brier while earning zero edge, so we had to design around both metrics from the beginning.

We also ran into practical data and infrastructure issues: normalizing Kalshi market history, handling binary versus multi-outcome markets, backfilling market prices, keeping LLM calls within latency and rate limits, and making Snowflake ML work within hackathon constraints. To keep the demo reliable, we added fallbacks so the agent can still run even if Snowflake AutoML or retrieval is unavailable.

Accomplishments that we're proud of

We are proud that this is more than a wrapper around an LLM. The system has an actual forecasting architecture: ensemble beliefs, institutional memory, historical base rates, calibration, alpha shading, and a judge-compatible API.

We are also proud of making the pipeline inspectable. Every prediction can expose the model votes, neighbor count, similarity scores, base rate, calibrated probability, alpha value, and final output. That made the project feel less like a black box and more like a forecasting system we could debug and improve.

Our holdout experiments showed the value of the approach: the learned policy improved both calibration and relative return compared with raw ensemble predictions and a market-only baseline.

What we learned

We learned that prediction markets are not just about knowing facts. They are about knowing when the market is already right, when an LLM has useful independent signal, and when history says similar questions behave differently than intuition suggests.

We also learned that retrieval for forecasting is different from normal RAG. We were not retrieving articles to quote; we were retrieving resolved market rows with prices and outcomes, then turning them into empirical base rates.

Finally, we learned that honest fallbacks matter. A hackathon demo needs ambitious architecture, but it also needs to keep running when credentials, trial-tier limits, or APIs behave differently than expected.

What's next for The Lost Context of Prediction Markets

Next, we want to expand the historical dataset, improve multi-outcome handling, and move more of the calibrator and alpha-policy training fully into Snowflake ML.

We also want to run larger evaluations across categories, expose better explanations for why the agent deviated from the market, and use the logged AGENT_PREDICTIONS table to continuously retrain as markets resolve. The long-term goal is a forecasting agent that learns when to trust consensus, when to trust history, and when it has a real edge.

https://github.com/storoeb/UncommonHacks2026/