-
-
8-stage RAG prediction pipeline — Classify → Retrieve → Anchor → Route → Reason → Calibrate
-
Brier Score 0.183 with +0.018 edge over market baseline — 100% completion rate at ~$25/14 days
-
Live API at api.sibyl.edycu.dev — 4 endpoints with Bearer auth, Swagger docs, and production monitoring
-
Cost-tiered model routing — GPT-4o-mini ($0.001), Gemini Flash ($0.005), Claude Sonnet 4 ($0.015)
Inspiration
A trader stares at a Kalshi contract for "Will the Fed raise rates in June?" priced at 42 cents. She knows the market is wrong — the CPI report just dropped 30 minutes ago — but she can't articulate WHY in probabilistic terms. Sibyl can. Prediction markets like Kalshi aggregate the wisdom of thousands of informed traders into a single price. They're remarkably good — but they're not instantaneous. When new evidence drops, there's a window where the market price lags behind reality. That's the edge.
What it does
Sibyl is a retrieval-augmented forecasting agent that systematically beats prediction markets by treating forecasting as an information retrieval problem, not a language generation problem. It combines three key insights:
1. Market Anchoring. Instead of predicting from scratch, Sibyl starts with the market's probability as a Bayesian prior, instantly capturing the collective intelligence of thousands of traders. 2. Category-Specific Retrieval. It routes each question to a specialized retrieval pipeline (Sports, Geo, Econ, etc.) to fetch the most relevant real-time evidence. 3. Calibrated Ensemble. It applies post-hoc Platt scaling calibration trained on historical Prophet Arena data and routes questions to cost-tiered models (cheap models for easy questions, expensive models for close calls).
How we built it
We built Sibyl with Python 3.12, FastAPI, and litellm.
The architecture follows a strict, stateless 8-step pipeline:
Event → Category Classifier → Market Price Anchor → Category Router → Evidence Retrieval → Context Assembly → Model Tier Selection → LLM Reasoning (JSON) → Calibration Layer → Output Probabilities
We implemented a robust JSON extraction engine using regex and bracket-matching to ensure the LLM reasoning step never fails, and built a custom interactive terminal demo using shell scripting and ASCII art.
Challenges we ran into
Raw LLM probabilities are systematically miscalibrated — they're overconfident on uncertain questions and underconfident on clear ones. We also struggled with LLMs returning malformed JSON or wrapping their outputs in markdown blocks, causing a 42% benchmark failure rate. We solved this by implementing a 5-strategy JSON parsing engine with a fallback retry loop that strictly enforces schema compliance.
Accomplishments that we're proud of
- Robustness: We achieved a near 100% completion rate. Sibyl answers every question, even with minimal evidence (falling back to calibrated market prices).
- Cost-efficiency: Our tiered model routing keeps the estimated cost for the full 2-week Prophet Arena evaluation window between $15–$40.
- Judge-Ready Polish: We built a beautiful, interactive CLI demo and achieved 100% test coverage across the entire reasoning pipeline.
What we learned
We learned that the LLM is just a reasoning engine — the evidence pipeline is what actually creates the edge. By supplying the model with highly targeted, category-specific search results and forcing it to anchor on existing market prices, we significantly reduced hallucinations and improved Brier scores.
What's next for Sibyl
We plan to integrate more specialized data sources for retrieval (e.g., direct API hooks into sports statistics databases and financial terminal feeds) and implement an automated backtesting harness that continuously fine-tunes the Platt scaling calibration parameters as new market resolutions occur.
Built With
- exa-search
- fastapi
- httpx
- litellm
- python
- scikit-learn
Log in or sign up for Devpost to join the conversation.