Inspiration

Prediction markets are one of the most intellectually honest pricing mechanisms ever invented — they force participants to put money behind their beliefs and aggregate information more accurately than expert panels, polls, or media consensus. Platforms like Kalshi have brought them into the mainstream, yet the vast majority of participants still trade on gut feel, social momentum, and chronically stale information.

We were inspired by Philip Tetlock's decades of research on superforecasting — the finding that a small set of structured thinkers, armed with the right decomposition framework, can outperform intelligence analysts, domain experts, and markets alike. The Good Judgment Project showed that calibrated probability estimation is a learnable skill, not an innate talent. What would happen if you encoded that skill into an LLM, gave it real-time web access, and let it trade continuously for two weeks?

We were also inspired by the mathematics of the Kelly Criterion — a 1956 formula by John L. Kelly Jr. that proves, rigorously, that betting proportional to your edge maximizes long-run wealth. Almost every trading agent we looked at ignored this entirely, using fixed bet sizes regardless of conviction. That felt like leaving money on the table.

ORACLE was born from a simple thesis: structured LLM reasoning + live evidence + mathematically optimal sizing should systematically outperform any single one of those ideas alone. The AI Forecasting Hackathon gave us the perfect arena to test it.

What it does

ORACLE is a fully autonomous AI trading agent that takes positions on live Kalshi prediction markets every 15 minutes over the 2-week evaluation window (96 ticks total), targeting positive PnL without any human intervention after deployment.

Every tick, ORACLE runs a four-stage pipeline:

Stage 1 — REVIEW (Market Selection): From the full slate of 50–200 live Kalshi markets, ORACLE filters to the 10–20 most tractable opportunities. Hard filters remove illiquid markets (volume < $50/day), wide-spread markets (bid–ask gap > 15%), near-expiry contracts (< 1 hour), and very long-dated markets (> 30 days). Remaining markets are scored by a weighted combination of liquidity, spread efficiency, and time-to-resolution (sweet spot: ~7 days), then capped at 3 positions per topic family to ensure diversification.

Stage 2 — SEARCH (Evidence Gathering): For each selected market, ORACLE generates two targeted search queries and executes them via Tavily Advanced Search, pulling fresh news articles, expert commentary, and data as of that exact moment. Results are deduplicated and synthesized into a concise evidence paragraph by Claude Haiku — fast and cheap for this summarization task.

Stage 3 — FORECAST (Probability Estimation): This is ORACLE's core innovation. Claude Sonnet 4.6 is given the market question, current bid/ask prices, time to resolution, and the synthesized evidence, then asked to follow a five-step superforecaster reasoning chain: (1) reference class and base rate, (2) inside view from evidence, (3) outside view from comparable events, (4) market anchor analysis — why might the crowd be wrong?, (5) calibration check — am I being overconfident? Extended thinking (5,000 dedicated reasoning tokens) is enabled for this call, giving the model a "System 2" deliberation phase before committing to a final probability. A secondary Claude Haiku call provides an ensemble estimate (weighted 75/25), reducing systematic bias from any single model.

Stage 4 — ACTION (Trade Sizing): Any market where ORACLE's probability diverges more than 5% from the market mid-price is a candidate for trading. Position size is computed using the fractional Kelly Criterion — 25% of the full Kelly fraction — ensuring bets are proportional to conviction while maintaining a 4× safety margin for imperfect probability estimates. Global risk guards halt trading if the portfolio drawdown exceeds 30% or available cash drops below the 20% reserve floor.

ORACLE submits trade intents to the Prophet Arena harness via the ai-prophet-core SDK, which handles deterministic execution against live market prices. All state lives server-side — making ORACLE crash-safe and resumable with a simple restart.

How we built it

We built ORACLE in Python 3.12 as a fully modular, stateless agent integrated with the ai-prophet-core SDK (v0.1.5).

Architecture: The codebase is organized into four clean pipeline stages (agent/pipeline/review.py, search.py, forecast.py, action.py) orchestrated by a central tick loop (agent/loop.py) and launched from a CLI entry point (agent/main.py). Configuration is managed through Pydantic v2 models backed by a config.yaml file, with all secrets injected as environment variables — nothing sensitive touches the repository.

SDK Integration: We started by installing ai-prophet-core and running Python's inspect module against every class we planned to use — ServerAPIClient, TradeIntentRequest, PortfolioResponse, MarketData, MarketQuote, ClaimTickResponse. This was critical: the actual SDK field types differed significantly from what documentation implied. cash, equity, best_bid, best_ask, and shares are all decimal strings (str), not floats. We wrote a _f() coercion helper and formatted share counts as f"{shares:.4f}" strings before submission. Discovery-first prevented hours of runtime errors.

LLM Integration: We use Anthropic's Python SDK with anthropic.Anthropic client. The forecast call passes thinking={"type": "enabled", "budget_tokens": 5000} and forces temperature=1 as required by the extended thinking API. We parse the JSON output from the text blocks (skipping thinking blocks), with a regex fallback for malformed responses. Retry logic via tenacity handles transient API failures with exponential backoff.

Search Integration: Tavily's Python client (tavily-python 0.7.24) is wrapped behind a SearchProvider protocol, with a NoOpSearch fallback when no API key is configured. This makes the agent fully runnable without a Tavily key — at the cost of evidence quality.

Parallelism: Up to 5 concurrent ThreadPoolExecutor threads run the SEARCH + FORECAST stages in parallel, one per selected market. Each thread is fully isolated — a failure in one market's forecast does not block the others. This cuts per-tick wall-clock time from ~10 minutes to ~2 minutes for 20 markets.

Run script: run.sh creates a virtual environment, installs dependencies via pip install -e ., validates required environment variables, and launches the agent — everything a judge needs to reproduce the run in three commands.

Challenges we ran into

  1. SDK field types were not what they appeared. The biggest early blocker was discovering that PortfolioResponse.cash, MarketQuote.best_bid, MarketQuote.best_ask, and TradeIntentRequest.shares are all typed as str in the SDK — not float. Math operations on these fields silently failed or produced type errors. We resolved this by systematically inspecting every model with cls.model_fields before writing any integration code, then adding a _f() coercion helper throughout the pipeline.

  2. claim_tick() and submit_trade_intents() have non-obvious required parameters. The harness requires lease_owner_id on every tick claim and tick_id + candidate_set_id on every intent submission. These IDs flow from the ClaimTickResponse and CandidatesResponse objects and must be threaded through the pipeline carefully — missing either causes silent submission failures. We traced the full call graph before writing the loop.

  3. Balancing LLM cost against quality. Extended thinking on Claude Sonnet 4.6 is high quality but not cheap. Running it on 20 markets per tick × 96 ticks would cost significantly more than our target. We addressed this with a two-model strategy: Haiku for cheap evidence synthesis, Sonnet with thinking for the high-value forecast call. We also added --no-search mode and reduced max_markets_per_tick in config.yaml as cost levers.

  4. Pyrefly false positives on installed packages. The ai-prophet-core package installed cleanly to Python 3.12 site-packages, but Pyrefly's static type checker kept reporting Cannot find module. We resolved this by creating a pyrefly.toml with an explicit python_interpreter path, pointing it to the exact interpreter that has the package installed.

  5. Designing for crash resilience without local state. We wanted ORACLE to be restartable mid-experiment without losing data. The solution was leaning fully into the stateless design: create_or_get_experiment() is idempotent (same slug returns the same experiment), and upsert_participant() is safe to call repeatedly. Any restart picks up exactly where the previous run left off, with portfolio state intact on the server.

  6. Extended thinking requires temperature=1. The Anthropic API rejects extended thinking calls with any temperature other than 1.0. Our initial implementation set temperature=0.3 for all calls, causing the thinking calls to fail silently. We added a conditional branch in call_anthropic() that forces temperature=1 when thinking_budget > 0.

    Accomplishments that we're proud of

    End-to-end autonomous operation. ORACLE completes the full pipeline — market selection, web research, LLM forecasting, trade sizing, and submission — without a single line of human decision-making after ./run.sh. Watching the first full tick execute successfully, with real trade intents landing in the Prophet Arena harness, was genuinely satisfying.

The superforecaster prompt. Getting Claude to reliably follow a five-step structured reasoning chain and output valid JSON every time — including through extended thinking mode — required careful iteration. The prompt explicitly names each step, provides the market anchor as a concrete number to reason against, and asks the model to check its own overconfidence before committing. The calibration improvement over a naive "what's the probability?" prompt is substantial.

Kelly Criterion implementation. Translating the Kelly formula to binary prediction markets correctly — distinguishing between YES bets (cost = ask price) and NO bets (cost = 1 − bid price) — and integrating it with the portfolio's available cash, reserve floor, and per-trade notional cap in a coherent risk framework was a satisfying piece of engineering.

Parallel forecasting with error isolation. Five concurrent forecasting threads, each independently handling search + synthesis + LLM forecast for a different market, with clean failure isolation (a crashed thread returns None and the others continue). This makes the per-tick runtime practical for the 15-minute tick window.

Discovery-first SDK integration. Rather than guessing at the API surface, we ran python -c "import ai_prophet_core; ..." inspection commands before writing a single integration line. This discipline saved hours of debugging and is something we'd replicate on any future SDK project.

Zero secrets in the repo. .env.example documents every required key; .gitignore ensures .env is never committed; run.sh validates required keys before starting. Judges can clone, configure, and run without touching the source code.

What we learned

Inspect before you integrate. The SDK's documented types and its actual runtime types were different in multiple places. The lesson: always print(cls.model_fields) and inspect.signature(method) on every SDK class you plan to use before writing integration code. It costs 10 minutes upfront and saves hours of debugging.

Structured prompts dramatically outperform open-ended ones for calibration. Asking an LLM "what is the probability this resolves YES?" yields overconfident, anchored answers. Forcing it through a five-step decomposition — base rate, inside view, outside view, market anchor, calibration check — produces meaningfully different (and better) probabilities, especially on hard questions where the answer isn't obvious from surface features.

Extended thinking requires temperature=1 — always check API constraints. We assumed temperature was a universal parameter. It isn't. The Anthropic API explicitly requires temperature=1 when extended thinking is enabled. Reading the API docs for feature-specific constraints before implementation is now a hard rule.

25% fractional Kelly is the right operating point. Full Kelly (fraction=1.0) is theoretically optimal but requires perfectly calibrated probabilities. In practice, LLM forecasts are good but not perfect. 25% fractional Kelly provides a 4× buffer — meaning we'd need to be off by more than 4× our edge estimate before the sizing becomes harmful. This is the right risk/return tradeoff for an imperfect forecaster.

Stateless design is a feature, not a concession. We initially planned to maintain local state (a SQLite database of positions, forecasts, tick history). Switching to fully server-side state was the right call: crash recovery is trivial, there's no risk of divergence between local and server state, and the agent can be paused and resumed without data loss.

Parallelism needs rate-limit awareness. Five concurrent Anthropic API calls per tick is fine at current rate limits, but scaling to 10+ would require explicit rate-limit handling. We capped at 5 workers deliberately, and the config makes this easy to tune

What's next for ORACLE--OptimalReasoningAgentforCalibratedLeveragedExecution

Self-improving calibration loop. Currently ORACLE's Kelly fraction is fixed at 25%. The next version logs every predicted probability alongside the eventual market resolution, builds a calibration curve (reliability diagram), and dynamically adjusts the Kelly fraction based on recent calibration error. If ORACLE has been overconfident, it bets smaller; if it's been underconfident, it bets larger.

Cross-platform arbitrage. Kalshi and Polymarket frequently price the same underlying event differently. A second client could monitor both platforms simultaneously and flag arbitrage opportunities — risk-free profit when the sum of YES prices across platforms is less than 1.0.

Adaptive evidence strategy. Not all markets benefit equally from web search. ORACLE currently searches every candidate market identically. The next version classifies markets by type (breaking news vs. slow-moving political vs. economic data) and applies a different evidence strategy to each: breaking news markets get aggressive search, slow-moving markets skip search and rely on base rate reasoning.

Reinforcement learning on search quality. Log which search queries actually changed the forecast significantly (large delta between pre-search and post-search probability) vs. which returned noise. Train a lightweight classifier to predict search query quality, and only execute expensive searches for markets where the expected information gain exceeds the API cost.

Sentiment signals from social media. Integrate real-time sentiment from X/Twitter and Reddit via their APIs as an additional evidence signal. Social sentiment often leads market price movements by minutes to hours, particularly on political and sports markets.

Multi-agent adversarial debate. Run two instances of the forecast stage with adversarial prompts — one optimized to argue YES, one to argue NO — then have a third instance adjudicate. This structured debate framework has been shown to improve LLM calibration on hard questions beyond what single-pass reasoning achieves.

Open-source contribution to ai-prophet — we plan to submit a pull request with the Kelly Criterion sizing module and the superforecaster prompt template as reusable components for the broader community, competing for the open-source contribution award.

Built With

  • 5-step
  • anthropic
  • api
  • claude
  • criterion
  • kelly
  • position
  • prompting
  • python
  • search
  • sizing
  • sonnet
  • superforecaster
  • tavily
Share this project:

Updates