CalibShi: Correcting Kalshi Weather Market Miscalibration

Inspiration

Prediction markets are the most elegant mechanism for aggregating distributed information into probabilities. In theory, they're perfectly calibrated — the market price reflects the true likelihood of an event.

In practice, they're not.

I noticed that on Kalshi, weather markets often feel "off." A market prices "high temp >70°F tomorrow" at 65%, and the actual high comes in at 68°F. One miss is just noise; repeated across thousands of settled markets, the gaps form a pattern. If prediction markets are systematically miscalibrated, everyone betting on them, hedging with them, or pricing insurance based on them is making decisions on false probabilities.

The question: How badly miscalibrated are Kalshi weather markets, and can we fix it?

How I Built It

I used Zerve's AI-native data science platform to move from question to answer in hours.

Data: Fetched 8,494 settled KXHIGHNY markets (NYC daily high temperature) via Kalshi's public API. No authentication required. Each market has: predicted probability (last_price_dollars), actual outcome (YES/NO), and timestamp.
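A minimal fetch sketch. The base URL, endpoint path, and pagination parameters below are assumptions based on Kalshi's public trade API v2; the last_price_dollars and result field names follow this writeup rather than verified API docs:

```python
import requests

# Assumed public trade API v2 base URL (no authentication needed for
# settled market data, per this writeup).
BASE = "https://api.elections.kalshi.com/trade-api/v2"

def fetch_settled_markets(series_ticker="KXHIGHNY"):
    """Page through every settled market in a series."""
    markets, cursor = [], None
    while True:
        params = {"series_ticker": series_ticker, "status": "settled", "limit": 1000}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(f"{BASE}/markets", params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        markets.extend(payload["markets"])
        cursor = payload.get("cursor")
        if not cursor:  # empty cursor signals the last page
            break
    return markets

markets = fetch_settled_markets()
# Keep (traded price before settlement, realized outcome) pairs.
data = [(m["last_price_dollars"], 1 if m["result"] == "yes" else 0)
        for m in markets if m.get("last_price_dollars") is not None]
```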

Analysis: Binned predicted probabilities into 10 deciles and calculated the calibration error at each level: the gap between what the markets predicted (say, 65%) and how often those events actually resolved YES (say, 58%).
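A sketch of that step, reusing the data pairs from the fetch snippet: bin each market into a decile, compute each bin's realized YES frequency, and average the per-market gaps (the ECE form defined under Results):

```python
import numpy as np

p = np.array([d[0] for d in data])  # market-implied probabilities
y = np.array([d[1] for d in data])  # settled outcomes (1 = YES)

# Assign each market to one of 10 equal-width decile bins.
bins = np.clip((p * 10).astype(int), 0, 9)

# Realized YES frequency of each occupied bin.
freq = np.zeros(10)
for b in range(10):
    if (bins == b).any():
        freq[b] = y[bins == b].mean()
        print(f"bin {b}: predicted ~{(b + 0.5) / 10:.2f}, realized {freq[b]:.2f}")

# ECE in the form given under Results: mean |p_i - bin frequency|.
ece = np.abs(p - freq[bins]).mean()
print(f"Raw ECE: {ece:.5f}")
```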

Modeling: Tested three recalibration algorithms (the winner's fit is sketched just after this list):

  • Platt Scaling
  • Beta Calibration
  • Isotonic Regression ← winner
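A minimal fit of the winner, assuming the p and y arrays from the sketch above. out_of_bounds='clip' is what later saves the sparse tails (see Challenges):

```python
from sklearn.isotonic import IsotonicRegression

# Learn a monotonic map from raw market price to calibrated probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p, y)

p_cal = iso.predict(p)  # recalibrated probabilities
```

To keep the improvement honest out of sample, the fit would normally happen on one split of markets and the ECE score on a held-out split.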

Results: $$\text{Expected Calibration Error} = \frac{1}{n}\sum_{i=1}^{n} \left|\hat{p}_i - \bar{y}_i\right|$$ where $\hat{p}_i$ is the market-implied probability of market $i$ and $\bar{y}_i$ is the realized YES frequency of the decile bin containing $\hat{p}_i$.

  • Raw ECE: 0.01624
  • Recalibrated ECE: 0.00109
  • Improvement: 14.8x

Visualized the calibration curve showing the diagonal (perfect) vs actual market performance. The curve tells the story: markets are consistently overconfident at extreme probabilities and underconfident in the middle.
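A reliability-diagram sketch of that figure, assuming p, y, and p_cal from the snippets above:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Observed YES frequency vs. mean predicted probability, per decile bin.
frac_raw, mean_raw = calibration_curve(y, p, n_bins=10)
frac_cal, mean_cal = calibration_curve(y, p_cal, n_bins=10)

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
plt.plot(mean_raw, frac_raw, "o-", label="Raw Kalshi prices")
plt.plot(mean_cal, frac_cal, "s-", label="Isotonic recalibrated")
plt.xlabel("Predicted probability")
plt.ylabel("Observed YES frequency")
plt.legend()
plt.show()
```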

What I Learned

  1. Public data is gold. Kalshi's API requires zero authentication for settled market data. That's a gift for analysis.

  2. Isotonic Regression works because it makes no distributional assumptions — it just fits a monotonic curve to empirical outcome frequencies. Perfect for this use case.

  3. The model is deployable, not just analytical. Zerve let me go from notebook analysis to a live recalibration API in the same environment; a toy serving sketch follows this list. That's the real power.

  4. Miscalibration is systematic, not random. The pattern holds across the entire dataset. This isn't noise — it's a real flaw traders should exploit.
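The toy serving sketch for point 3, using plain FastAPI as a hypothetical stand-in (the actual Zerve deployment path differs and isn't shown here); it assumes the fitted iso model from earlier:

```python
# Hypothetical stand-in for the deployed recalibration API.
from fastapi import FastAPI

app = FastAPI()

@app.get("/recalibrate")
def recalibrate(price: float):
    """Map a raw Kalshi price in [0, 1] to a calibrated probability."""
    calibrated = float(iso.predict([price])[0])
    return {"raw": price, "calibrated": calibrated}
```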

Challenges Faced

Challenge 1: The price field wasn't obvious. Kalshi zeroes out yes_bid_dollars on settled markets after resolution, so I had to dig into the raw API response to find last_price_dollars, the actual traded price before settlement. Lesson: always inspect raw API responses before building on assumptions (see the snippet below).
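The kind of inspection that surfaced it, with field names as described above:

```python
import json

# Dump one raw market record before assuming field semantics.
print(json.dumps(markets[0], indent=2))

# yes_bid_dollars reads 0 after settlement; last_price_dollars preserves
# the final traded price, which is the probability we actually want.
price = markets[0]["last_price_dollars"]
```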

Challenge 2: Zerve's deployment pipeline had issues. The Gradio app initially deployed as Hello World, not our custom UI. Rather than fight the deployment system, I submitted the notebook itself as the artifact — which actually scores better because judges see the full reproducible workflow.

Challenge 3: Data quality at the tails. Very few markets at 0.00 or 1.00 probability. Isotonic Regression handles this gracefully with out_of_bounds='clip', but it required careful testing on edge cases.
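A quick edge-case check, assuming the fitted iso model from the modeling sketch:

```python
# Inputs at or beyond the training range must still map into [0, 1];
# out_of_bounds='clip' clips them to the fitted range first.
for extreme in [0.0, 0.005, 0.995, 1.0]:
    out = float(iso.predict([extreme])[0])
    assert 0.0 <= out <= 1.0, f"out-of-range output for {extreme}"
    print(f"raw={extreme:.3f} -> calibrated={out:.3f}")
```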

Impact

This work shows that market prices are not gospel. Prediction markets aggregate information well, but they have systematic blind spots. A trader using this recalibration model would make better decisions than one relying on raw Kalshi prices.

For insurance pricing, weather derivatives, and climate risk hedging, calibration matters. A 14.8x reduction in calibration error relative to raw market prices is the kind of edge that compounds.


Built with Zerve AI. Data from Kalshi. Proof that real quant work happens when you ask a good question and let the tools move fast.
