Inspiration
This idea first came from a simple question: "Who do you think is going to win Spartahack 11 this year?" My friends from other universities bragged in the channel that their school was going to win. Of course, Go Green. But let's really think here: how does one realistically predict the future?
I took a closer look at Spartahack. I thought about which factors affect the judging process: a judge's seniority, their years of experience, their engagement, their university affiliation. There are hundreds of factors. But what if we knew all of them? What if we knew who each person was, built an exact replica of them, and asked: "Who do you think is going to win?" Each simulated person would cast a vote for their most likely winner. And then, not maybe, just then... we would know the future.
This analogy can be recreated in so many scenarios: Who will win the 2028 election? Who is going to win the next FIFA World Cup? Who is going to perform at the next Super Bowl? The questions are endless. There are big-brand businesses that want to know what the next trend will be, and investors asking for the next unicorn. So, let's give it to them.
What it does
To simulate a person, we give them as many demographic attributes as possible, based on the US Census. We feed that specific demographic profile to an AI agent. For example, one AI agent might be a 21-year-old White man, another a 67-year-old African woman. We generate roughly 100 of these AI agents and have them cast a binary vote.
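As a rough sketch, persona generation can look something like the following. The attribute pools and weights here are illustrative placeholders, not actual Census figures, and the function names are hypothetical:

```python
import random

# Illustrative attribute pools and weights -- placeholders, not real Census data.
DEMOGRAPHIC_POOLS = {
    "age": ([21, 34, 45, 67], [0.25, 0.30, 0.25, 0.20]),
    "gender": (["male", "female"], [0.49, 0.51]),
    "race": (["White", "Black", "Asian", "Hispanic"], [0.58, 0.13, 0.06, 0.23]),
    "education": (["high school", "bachelor's", "graduate"], [0.40, 0.40, 0.20]),
}

def sample_persona(rng: random.Random) -> dict:
    """Sample one simulated person from the demographic distributions."""
    return {
        attr: rng.choices(values, weights=weights, k=1)[0]
        for attr, (values, weights) in DEMOGRAPHIC_POOLS.items()
    }

def persona_prompt(persona: dict, question: str) -> str:
    """Turn a sampled demographic profile into a system prompt for one agent."""
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (
        f"You are a simulated survey respondent with this profile: {profile}. "
        f"Answer the question with a single YES or NO vote, "
        f"then one sentence of reasoning.\n\nQuestion: {question}"
    )

rng = random.Random(42)
agents = [sample_persona(rng) for _ in range(100)]  # ~100 simulated voters
```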
We then send these AI agents to a chatroom to debate one another and reach a consensus. For example, in 2024 we could have had these AI agents vote and argue over who would win the 2024 election: Trump or Biden. They all argue and reach a single conclusion that everyone can agree on, such as "Trump will be the next president of the United States."
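Conceptually, the chatroom is repeated rounds of prompting in which each agent sees the running transcript and re-votes. A minimal sketch, with a hypothetical `ask_agent` helper standing in for the actual Gemini call:

```python
def debate(agents, question, ask_agent, rounds=3):
    """Round-robin debate: each agent sees the transcript so far and re-votes.

    `ask_agent(persona, transcript, question)` is a hypothetical helper that
    prompts one Gemini-backed agent and returns ("YES" or "NO", argument_text).
    """
    transcript, votes = [], {}
    for _ in range(rounds):
        for i, persona in enumerate(agents):
            vote, argument = ask_agent(persona, transcript, question)
            votes[i] = vote
            transcript.append(f"Agent {i} ({vote}): {argument}")
        yes = sum(1 for v in votes.values() if v == "YES")
        if yes in (0, len(agents)):  # unanimous: the room has converged
            break
    yes = sum(1 for v in votes.values() if v == "YES")
    return ("YES" if yes * 2 > len(agents) else "NO"), transcript
```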
Although strong, we needed our model to be stronger. So we wrapped a self-reinforcing learning model inside a prediction market, where consumers can bet against our AI's predictions based on whether they believe our model has called the outcome correctly. Just like in a popular prediction market, Kalshi: if you beat the AI, you win money, and we win too, because our model learns from the miss and becomes stronger going forward.
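The market mechanics reduce to a payout rule plus a learning signal. The sketch below assumes an even-money bet and a single scalar confidence parameter, both simplifications of what a real market or model would use:

```python
def settle_bet(stake: float, bettor_says_ai_wrong: bool, ai_was_right: bool) -> float:
    """Even-money settlement: bettors who correctly fade the AI double their stake."""
    bettor_correct = bettor_says_ai_wrong != ai_was_right
    return 2.0 * stake if bettor_correct else 0.0

def update_confidence(confidence: float, ai_was_right: bool, lr: float = 0.05) -> float:
    """Toy update: nudge a global confidence weight toward observed accuracy."""
    target = 1.0 if ai_was_right else 0.0
    return confidence + lr * (target - confidence)
```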
How we built it
We first decided to build the API route in Python for simplicity, since we were going to be dealing with agents communicating with each other. We approached it by spawning a thread for every task handed to Gemini through Google's Agent Development Kit. From there, we differentiated the types of agents to represent a large distribution of people; for example, we had Bayesian updaters, contrarian skeptics, black swan hunters, and so on. This covers almost every type of reasoner, from those who hedge to those who take extreme risks. We decided not to group people by phenotypic or genetic characteristics, but rather to focus on how people think and group them based on that.
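A minimal sketch of that fan-out, using Python's standard thread pool and the google-genai client rather than the full Agent Development Kit setup; the archetype instructions are illustrative stand-ins for our real prompts:

```python
from concurrent.futures import ThreadPoolExecutor
from google import genai  # google-genai client; assumes GOOGLE_API_KEY is set

client = genai.Client()

# Illustrative archetype instructions -- the real prompts were longer.
ARCHETYPES = {
    "bayesian_updater": "Weigh new evidence against your prior and update gradually.",
    "contrarian_skeptic": "Challenge the consensus; look for what the crowd misses.",
    "black_swan_hunter": "Focus on low-probability, high-impact outcomes.",
}

def run_agent(archetype: str, instruction: str, question: str) -> str:
    """One agent task = one Gemini call with an archetype-specific instruction."""
    prompt = f"You reason as a {archetype}. {instruction}\n\nQuestion: {question}"
    response = client.models.generate_content(
        model="gemini-2.0-flash", contents=prompt
    )
    return response.text

def fan_out(question: str) -> dict:
    # One thread per agent task, mirroring how we parallelized the simulation.
    with ThreadPoolExecutor(max_workers=len(ARCHETYPES)) as pool:
        futures = {
            name: pool.submit(run_agent, name, instr, question)
            for name, instr in ARCHETYPES.items()
        }
        return {name: f.result() for name, f in futures.items()}
```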
To give this the highest chance of accuracy, the simulation adjusts how much weight each agent's contribution carries as the agents "speak" to each other. Over time, this keeps the simulation unbiased, since no single agent's presence can permanently outweigh the rest. On top of that, we used humans betting against the AI as a self-reinforcing learning signal to increase accuracy: whenever humans disagreed with the model and the outcome proved them right, that signal was treated as feedback to adjust weighting, improve calibration, and reduce the likelihood of repeating the same failure mode. In other words, the system didn't just learn from internal agent debate; it also learned from real external pressure, where humans effectively acted as adversarial evaluators whose incentives were tied to being correct.
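One way to realize that weighting is a multiplicative-weights update, where agents lose influence whenever a resolved outcome proves them wrong; the same rule applies whether the resolution comes from an internal evaluation or from the betting market. A sketch under that assumption:

```python
def reweight(weights: list[float], votes: list[str], outcome: str,
             eta: float = 0.2) -> list[float]:
    """Multiplicative-weights update: penalize agents whose vote missed the outcome."""
    updated = [
        w * (1.0 if vote == outcome else 1.0 - eta)
        for w, vote in zip(weights, votes)
    ]
    total = sum(updated)
    return [w / total for w in updated]  # renormalize to keep a distribution

# Human bettors who beat the model feed the same signal: the resolved
# `outcome` simply arrives from the market instead of from inside the system.
```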
Challenges we ran into
We ran into a lot of issues, one of them being the number of tokens each simulation consumed. Each simulation had many steps and took a long time, and since Gemini handled how the embedding space worked, there wasn't much we could do on our end. We got around that by limiting the context we gave each agent and having the agents generate compressed context from the other agents' behavior. This sped things up, since the model didn't need to process as much information at each step.
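Concretely, limiting the context meant handing each agent a compressed view of the transcript rather than the full history. The tail-window digest below is a naive stand-in for the summarization the agents did themselves:

```python
def compress_transcript(transcript: list[str], max_lines: int = 10) -> list[str]:
    """Keep only the most recent turns; fold older turns into a one-line digest.

    A simplified stand-in: in the real system, agents generated the digest
    from each other's behavior instead of re-reading full arguments.
    """
    if len(transcript) <= max_lines:
        return transcript
    dropped = len(transcript) - max_lines
    digest = f"[{dropped} earlier turns summarized away to save tokens]"
    return [digest] + transcript[-max_lines:]
```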
We also ran into issues generating responses for each market we had, and it was very impractical to do that at scale. Instead, we chose a topic that was very broad and had the agents argue over the consensus of that topic to get a distilled, but still vague, answer. Using that answer as a shared baseline, the LLMs were able to answer a lot of the market-specific questions more efficiently, since they were effectively starting from a near-complete frame of reference rather than rebuilding the full reasoning from scratch each time.
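The baseline-reuse trick amounts to computing one consensus answer per broad topic and prepending it to every market-specific prompt. A sketch, where `consensus` is a hypothetical wrapper around the debate loop described above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def topic_baseline(topic: str) -> str:
    """Run the full agent debate once per broad topic and cache the consensus.

    `consensus` is a hypothetical helper wrapping the multi-agent debate.
    """
    return consensus(f"What is the overall outlook for: {topic}?")

def answer_market(topic: str, market_question: str) -> str:
    # Every market question starts from the cached frame of reference,
    # so agents refine a shared baseline instead of reasoning from scratch.
    prompt = (
        f"Shared baseline for '{topic}': {topic_baseline(topic)}\n\n"
        f"Market question: {market_question}"
    )
    return consensus(prompt)
```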
Accomplishments that we're proud of
We started with a clear hypothesis: as we increase the number of demographic factors, the model’s accuracy should increase linearly (assuming no self-learning algorithm is involved). Over the past hour, we tested this by systematically adding demographic dimensions and re-running evaluations after each increment to track how performance changed. What we observed was encouraging, but it didn’t match the linear expectation—accuracy continued to improve as we added more factors, but the gains followed a logarithmic pattern instead.
In other words, the early additions of demographic information produced the largest jumps in accuracy, and each additional factor still helped, but with diminishing returns over time. This is a meaningful result because it suggests that demographic diversity does improve predictive quality, but there’s a saturation point where adding more demographic granularity becomes less impactful relative to the complexity it introduces. The key accomplishment here is that we validated the direction of the hypothesis (more demographic factors increased accuracy), while also discovering the true shape of the relationship (logarithmic rather than linear), giving us a more realistic basis for how to scale the model going forward.
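For the curious, the shape test is a simple curve fit: regress accuracy on both the raw factor count and its logarithm, then compare fit quality. The numbers below are illustrative, not our actual measurements:

```python
import numpy as np

# Illustrative data: accuracy after adding k demographic factors (not real results).
factors = np.array([1, 2, 4, 8, 16, 32])
accuracy = np.array([0.55, 0.61, 0.66, 0.70, 0.73, 0.75])

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    residual = np.sum((y - y_hat) ** 2)
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - residual / total

# Linear hypothesis:       accuracy ~ a * k + b
lin = np.polyfit(factors, accuracy, 1)
# Logarithmic hypothesis:  accuracy ~ a * log(k) + b
log = np.polyfit(np.log(factors), accuracy, 1)

print("linear R^2:", r_squared(accuracy, np.polyval(lin, factors)))
print("log R^2:   ", r_squared(accuracy, np.polyval(log, np.log(factors))))
# With diminishing returns, the logarithmic fit scores higher,
# matching the saturation we observed.
```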
What we learned
We learned a lot about how agents can be very powerful when you make them argue against each other. The adversarial setup forces each agent to expose weak assumptions, defend claims, and refine its reasoning in a way that a single, isolated model often won’t. However, we also learned that the opposite can happen: once agents start optimizing against each other, they can converge toward a “stable” position that isn’t necessarily the most truthful or insightful one (we realized this through the idea of a Nash equilibrium). In that situation, the system can stop exploring better answers and instead settle into whatever balance prevents any single agent from “losing,” even if that balance is mediocre.
Something else we learned through emergence is that these agents are sometimes not very powerful when compared to a well-scoped, single-model approach—especially if the task is already straightforward or if the agents don’t have enough meaningful differences in incentives, information, or capabilities. If the agents are too similar, they tend to produce the same reasoning with extra overhead, and the “debate” becomes more like repetition than refinement. Even worse, when constraints like limited context or token budgets are tight, the multi-agent setup can amplify shallow reasoning, because agents react to compressed signals rather than fully grounded explanations. Overall, we learned that agent debate is not automatically better—it’s strongest when we design real tension (different priors, objectives, or access to information) and when we actively guard against equilibrium behavior that trades truth for stability.
What's next for PredictTheFuture
We are planning on using the potential winnings to attend pitch competitions across the globe, and even to apply to Y Combinator, in order to grow this company into Michigan State University's very own unicorn. That means growing the company and making the model even more accurate. We will use the reputation of a Spartahack win to pursue partnerships with popular startups in Silicon Valley and beyond (such as Kalshi) to scale and grow our user base. If we win, we also plan to take this project through Michigan State's Burgess Institute because of its potential.
Built With
- agent-development-kit
- css
- fastapi
- gemini
- html
- javascript
- python