Inspiration
LLMao was inspired by the idea that strong forecasting should feel less like one model guessing and more like a small research team debating under uncertainty. Prediction markets reward calibration, evidence, and humility, so I wanted an agent that retrieves context, critiques assumptions, simulates scenarios, and produces probabilities that are actually usable.
What it does
LLMao is a multi-agent forecasting endpoint for Prophet Hacks. Given a forecasting question and a list of possible outcomes, it returns a probability distribution over every outcome:
\[ \sum_i p_i = 1, \quad 0 \le p_i \le 1 \]
The system uses retrieval, base-rate reasoning, domain analysis, skepticism, scenario simulation, calibrated debate, and final aggregation to produce evidence-aware forecasts.
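To make the constraint above concrete, here is a minimal sketch of turning raw per-outcome scores into a valid distribution. It is an illustration only, not the code in my_agent.py: the function name `normalize_forecast` and the `floor` parameter are assumptions introduced for this example.

```python
from typing import Dict


def normalize_forecast(raw_scores: Dict[str, float], floor: float = 1e-6) -> Dict[str, float]:
    """Rescale non-negative scores so they sum to 1 (hypothetical helper)."""
    if not raw_scores:
        raise ValueError("at least one outcome is required")
    # Raise every score to at least `floor` so no outcome ends up at exactly zero.
    clipped = {name: max(float(score), floor) for name, score in raw_scores.items()}
    total = sum(clipped.values())
    return {name: score / total for name, score in clipped.items()}


if __name__ == "__main__":
    print(normalize_forecast({"Yes": 3.0, "No": 1.0}))
```

The floor keeps the forecast from assigning zero probability to any listed outcome, which matters under scoring rules that penalize confident misses heavily.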
How we built it
I built LLMao in Python around a custom my_agent.py pipeline. The agent breaks forecasting into stages: search, knowledge graph construction, prior estimation, domain-specific reasoning, skeptical critique, scenario generation, simulation, calibration, and final probability aggregation.
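One way to picture a staged pipeline like this is as a list of functions that each take a shared state and return an updated one. The sketch below is an assumption-laden toy, not the actual my_agent.py: the stage names, the `evidence_weights` and `shrink` keys, and the three-stage selection are all hypothetical, and the real stages would call search and language models rather than do arithmetic.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


def prior_stage(state: State) -> State:
    """Start from a uniform prior over the listed outcomes."""
    outcomes = state["outcomes"]
    return {**state, "probs": {o: 1.0 / len(outcomes) for o in outcomes}}


def evidence_stage(state: State) -> State:
    """Reweight the prior by (hypothetical) evidence weights, then renormalize."""
    weights = state.get("evidence_weights", {})
    scored = {o: p * weights.get(o, 1.0) for o, p in state["probs"].items()}
    total = sum(scored.values())
    return {**state, "probs": {o: s / total for o, s in scored.items()}}


def skeptic_stage(state: State) -> State:
    """Shrink the forecast toward uniform to model a skeptical critique."""
    w = state.get("shrink", 0.2)
    n = len(state["probs"])
    return {**state, "probs": {o: (1 - w) * p + w / n for o, p in state["probs"].items()}}


def run_pipeline(state: State, stages: List[Stage]) -> State:
    """Apply each stage in order, threading the state through."""
    for stage in stages:
        state = stage(state)
    return state


if __name__ == "__main__":
    start = {"outcomes": ["Yes", "No"], "evidence_weights": {"Yes": 3.0, "No": 1.0}}
    print(run_pipeline(start, [prior_stage, evidence_stage, skeptic_stage])["probs"])
```

Keeping each stage a plain function makes it easy to drop or reorder stages when latency is tight.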
I deployed the forecasting endpoint on AWS EC2 behind Nginx with HTTPS. The public /predict endpoint accepts event JSON and returns forecast JSON. For local testing against already-resolved events, I used OpenRouter; for the submitted evaluation endpoint, I switched the same logic to the OpenAI API for reliability.
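Switching providers without touching the pipeline logic can be done by reading the provider from the environment. The following is a rough sketch under stated assumptions: the `LLM_PROVIDER` variable, the `ProviderConfig` class, and `load_provider` are hypothetical names, and the key variable names are conventional choices rather than anything confirmed by the project. It only resolves configuration and makes no network calls.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    base_url: str
    api_key: str


# Base URL and the (assumed) environment variable holding each provider's key.
_PROVIDERS = {
    "openai": ("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "openrouter": ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"),
}


def load_provider(env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Pick the model provider from LLM_PROVIDER (hypothetical variable name)."""
    env = os.environ if env is None else env
    name = env.get("LLM_PROVIDER", "openai").lower()
    if name not in _PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    base_url, key_var = _PROVIDERS[name]
    api_key = env.get(key_var, "")
    if not api_key:
        raise RuntimeError(f"{key_var} is not set")
    return ProviderConfig(name=name, base_url=base_url, api_key=api_key)
```

Because both providers expose an OpenAI-compatible API, a single client pointed at `base_url` can serve local testing and the deployed endpoint alike.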
Challenges we ran into
The biggest challenge was balancing forecast quality with latency. The full pipeline runs multiple reasoning stages, searches, simulations, and debate rounds, which can be slow for a web endpoint. I added configuration controls for simulation count, debate rounds, and agent count so the system could stay responsive while preserving the core reasoning structure. Deployment also had its own challenges: configuring EC2, DNS, Nginx, SSL, environment variables, and keeping the server alive for evaluation.
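The configuration controls mentioned above could look something like the sketch below. The variable names (`LLMAO_SIMULATIONS`, `LLMAO_DEBATE_ROUNDS`, `LLMAO_AGENTS`), the defaults, and the bounds are all assumptions chosen for illustration, not the values used in the deployed system.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DepthConfig:
    simulations: int
    debate_rounds: int
    agents: int


def _bounded_int(env: Mapping[str, str], key: str, default: int, low: int, high: int) -> int:
    """Read an integer setting, fall back to the default if unparsable, clamp to [low, high]."""
    try:
        value = int(env.get(key, default))
    except (TypeError, ValueError):
        value = default
    return max(low, min(high, value))


def load_depth_config(env: Optional[Mapping[str, str]] = None) -> DepthConfig:
    """Build the reasoning-depth settings from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    return DepthConfig(
        simulations=_bounded_int(env, "LLMAO_SIMULATIONS", 20, 1, 200),
        debate_rounds=_bounded_int(env, "LLMAO_DEBATE_ROUNDS", 2, 0, 5),
        agents=_bounded_int(env, "LLMAO_AGENTS", 3, 1, 8),
    )
```

Clamping each knob to a fixed range means a mistyped setting degrades the forecast depth rather than stalling the endpoint.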
Accomplishments that we're proud of
I'm proud that LLMao is not just a prompt wrapper. It has a structured forecasting process with self-critique, scenario simulation, and calibration. I also built and deployed a working HTTPS endpoint that returns valid probability distributions and handles different event shapes, including events without a close time. Most importantly, the system produces complete probability distributions over all outcomes, not just a single answer.
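Handling events with or without a close time mostly comes down to treating that field as optional when parsing the request. Here is a hedged sketch: the field names `question`, `outcomes`, and `close_time` are assumptions about the event JSON, not the documented Prophet Hacks schema, and `Event` and `parse_event` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    question: str
    outcomes: List[str]
    close_time: Optional[datetime] = None


def parse_event(payload: str) -> Event:
    """Parse event JSON, tolerating a missing or null close time."""
    data = json.loads(payload)
    question = data.get("question")
    outcomes = [str(o) for o in data.get("outcomes", [])]
    if not question or not outcomes:
        raise ValueError("event needs a question and at least one outcome")
    raw_close = data.get("close_time")
    close_time = None
    if raw_close:
        # Accept a trailing "Z" on Python versions whose fromisoformat rejects it.
        close_time = datetime.fromisoformat(str(raw_close).replace("Z", "+00:00"))
    return Event(question=str(question), outcomes=outcomes, close_time=close_time)
```

Downstream stages can then branch on `close_time is None` instead of failing on a missing key.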
What we learned
I learned that forecasting agents need both reasoning depth and operational reliability. A clever model chain is not useful if it times out or returns malformed JSON. I also learned that explicit skepticism helps: forcing the system to challenge its own evidence and assumptions leads to more careful forecasts. On the engineering side, I learned a lot about deploying AI endpoints reliably on EC2 with HTTPS, environment-based model switching, and production-style endpoint testing.
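A simple guard against malformed model output is to validate it and fall back to a uniform forecast when validation fails. This is a minimal sketch assuming the model is asked to return a JSON object mapping each outcome to a number; `safe_forecast` is a hypothetical name, and the uniform fallback is one possible policy rather than the one LLMao necessarily uses.

```python
import json
import math
from typing import Dict, List


def safe_forecast(raw_text: str, outcomes: List[str]) -> Dict[str, float]:
    """Return a valid distribution, falling back to uniform on bad model output."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    uniform = {o: 1.0 / len(outcomes) for o in outcomes}
    try:
        data = json.loads(raw_text)
        probs = {o: float(data[o]) for o in outcomes}
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, missing outcomes, and non-numeric values.
        return uniform
    if any(not math.isfinite(p) or p < 0 for p in probs.values()):
        return uniform
    total = sum(probs.values())
    if total <= 0:
        return uniform
    return {o: p / total for o, p in probs.items()}
```

The endpoint then always responds with well-formed JSON, even when an upstream model call misbehaves.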
What's next for LLMao
Next, I want to add better observability, request-level logs, smarter caching, and adaptive reasoning depth. Simple questions should return quickly, while complex or high-uncertainty events should trigger deeper simulations and debate. I also want to improve calibration over time by comparing forecasts against resolved outcomes and using those errors to tune the agent’s confidence and aggregation strategy.
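Comparing forecasts against resolved outcomes needs a scoring rule. One common choice is the multi-outcome Brier score, sketched below as a possible starting point; the function names are hypothetical and the project does not state which metric it would use.

```python
from typing import Dict, List, Tuple


def brier_score(forecast: Dict[str, float], resolved: str) -> float:
    """Sum of squared errors between the forecast and the one-hot resolved outcome."""
    if resolved not in forecast:
        raise ValueError(f"resolved outcome {resolved!r} is not in the forecast")
    return sum(
        (p - (1.0 if outcome == resolved else 0.0)) ** 2
        for outcome, p in forecast.items()
    )


def mean_brier(history: List[Tuple[Dict[str, float], str]]) -> float:
    """Average Brier score over (forecast, resolved outcome) pairs; lower is better."""
    if not history:
        raise ValueError("history is empty")
    return sum(brier_score(f, r) for f, r in history) / len(history)
```

Under this definition a perfect forecast scores 0 and a fully confident miss scores 2, so tracking the mean over time shows whether tuning the aggregation is helping.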
Built With
- amazon-ec2
- amazon-web-services
- brave-search-api
- openai-api
- python