Inspiration

Most benchmarks, like SWE-bench and MMLU, evaluate individual reasoning on static datasets. They measure how well a single model answers questions, not how systems behave when they must interact, negotiate, and govern collectively.

We wanted to test multi-agent societal behavior instead of isolated intelligence. Inspired by environments like MoltBook, where agents exhibit emergent interaction, we built a controlled simulation to measure long-term cooperation, resource management, and collective decision-making.

What it does

Kardashev is a real-time, multi-agent, multi-turn simulation of a civilization run entirely by AI agents.

Each agent:

  • Is assigned a random skill score, which determines how effectively it performs tasks.
  • Performs step-by-step Chain-of-Thought (CoT) reasoning before responding.
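A minimal sketch of such an agent (the field names, value ranges, and prompt wording are our illustration, not the project's exact schema):

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    # One simulated citizen. Skill is drawn uniformly at random and scales
    # how effective the agent is at tasks like fishing.
    name: str
    skill: float = field(default_factory=lambda: random.uniform(0.0, 1.0))
    hunger: int = 0

    def task_yield(self, base: int) -> int:
        # Skill scales task effectiveness, e.g. fish caught per trip.
        return round(base * (0.5 + self.skill))

# Prompt fragment nudging the agent into step-by-step reasoning first.
COT_PREFIX = (
    "Think step by step about the food supply, your skill level, and the "
    "other agents' needs before giving your final answer."
)
```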

The Democratic Process:

  • Every day, agents gather to debate public policy.
  • The leader proposes a plan (e.g., "Ration food to 1 fish per person").
  • All agents debate the pros/cons in natural language, and cast votes.

Policies discussed among the agents include:

  • Whether to reproduce (adding a new agent)
  • Who goes fishing and how much to fish
  • Who gets to eat how much each day
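The propose-debate-vote loop boils down to a majority tally over each agent's parsed reply. A minimal sketch (the function and vote encoding are our illustration, assuming replies are parsed to "yes"/"no"):

```python
from collections import Counter

def tally_votes(votes: dict[str, str]) -> bool:
    # Majority rule: the proposal passes if "yes" votes outnumber "no" votes.
    counts = Counter(votes.values())
    return counts["yes"] > counts["no"]

# Example: the leader proposes "Ration food to 1 fish per person".
votes = {"Ada": "yes", "Bo": "no", "Cyd": "yes"}
print(tally_votes(votes))  # → True
```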

Divine Intervention Engine: Users can introduce events in natural language (e.g., "A tsunami hits" or "A new disease spreads across the island"). The system interprets the text with an LLM and applies the appropriate effect to the game state, forcing the agents to adapt to new conditions.
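One way to implement this interpretation step is to ask the LLM for a structured effect and apply it to the state. A sketch under assumed names (`call_llm`, the JSON schema, and the state fields are hypothetical, not the project's actual interface):

```python
import json

# Illustrative schema: the LLM must answer with structured deltas only.
EVENT_SCHEMA = (
    'Respond with JSON only: {"resource_deltas": {"fish": int}, '
    '"population_delta": int, "narration": str}'
)

def apply_divine_event(state: dict, event_text: str, call_llm) -> dict:
    # `call_llm` is any prompt -> str function wrapping an LLM API.
    raw = call_llm(f"Event: {event_text}\n{EVENT_SCHEMA}")
    effect = json.loads(raw)
    for resource, delta in effect.get("resource_deltas", {}).items():
        state[resource] = max(0, state.get(resource, 0) + delta)
    state["population"] = max(0, state["population"] + effect.get("population_delta", 0))
    return state
```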

Benchmarking: Our code runs three simulations in parallel, each on a separate model. The user can toggle between the chat logs and current state of all three worlds. We also plot resource levels and population counts for every world.
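Running one civilization per model maps naturally onto a worker pool. A sketch with toy dynamics standing in for the real model-driven step loop (all names and numbers here are illustrative):

```python
import concurrent.futures

def run_simulation(model_name: str, days: int) -> dict:
    # Toy step loop for illustration; the real simulation queries the
    # named model each day to decide fishing, rationing, and reproduction.
    population, fish = 4, 20
    for _ in range(days):
        fish = max(0, fish - population + 3)  # eat, then restock a little
        if fish == 0:
            population -= 1
    return {"model": model_name, "population": population, "fish": fish}

models = ["model-a", "model-b", "model-c"]  # e.g. one per provider
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda m: run_simulation(m, days=10), models))
```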

How we built it

Agentic AI: We used APIs from OpenAI, Anthropic (Claude), and Perplexity for the LLMs powering the agents. The game state and responses are serialized as JSON. For higher inference speed, we maintain a compact representation of the game state that is passed to the agents.
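Keeping the serialized state small cuts prompt tokens and therefore latency. A sketch of what such a compaction might look like (the field selection and positional agent encoding are our illustration):

```python
import json

def compact_state(state: dict) -> str:
    # Keep only the fields agents need, encode agents positionally, and
    # drop whitespace from the JSON to minimize prompt tokens.
    slim = {
        "day": state["day"],
        "fish": state["fish"],
        "agents": [
            [a["name"], round(a["skill"], 2), a["hunger"]]
            for a in state["agents"]
        ],
    }
    return json.dumps(slim, separators=(",", ":"))
```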

Backend: We use WebSockets to communicate between the frontend and the simulation. The simulation streams state changes (dialogue, resource changes) to the frontend as events, and the frontend processes them in a queue.
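On the backend side, this streaming pattern can be sketched as a queue drainer that forwards each event over the socket in order (`send` stands in for the websocket's send coroutine; the `None` shutdown sentinel is our convention):

```python
import asyncio
import json

async def stream_events(queue: asyncio.Queue, send) -> None:
    # Drain simulation events in order and push each one to the frontend.
    # `send` is any message -> awaitable, e.g. a websocket's send method;
    # a `None` event is our illustrative shutdown sentinel.
    while True:
        event = await queue.get()
        if event is None:
            break
        await send(json.dumps(event))
```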

Frontend: HTML5 Canvas with a pixel-game vibe. Users can toggle between the chat logs and civilization state of the parallel-running, competing AI models.

Challenges we ran into

Designing a Fair Benchmark: We initially planned a more open-ended system like MoltBook, but too many variables reduce comparability. We constrained the environment around one core objective: grow and sustain the population. This provides a clear, measurable signal of long-term decision quality. Population growth is the signal that has driven all life on Earth.

Reliability at Scale: Running three LLM-powered civilizations in parallel introduced rate limits, API failures, and occasional empty responses. We built retry logic to keep the simulations running consistently.
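A typical shape for such retry logic is jittered exponential backoff that also treats empty replies as failures. A minimal sketch (the function name, attempt count, and delays are our illustration, not the project's exact code):

```python
import random
import time

def with_retries(call, attempts: int = 4, base_delay: float = 1.0):
    # Retry an LLM call on exceptions or empty responses, backing off
    # exponentially with jitter between attempts.
    for attempt in range(attempts):
        try:
            reply = call()
            if reply:  # empty responses count as failures too
                return reply
        except Exception:
            pass
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise RuntimeError(f"LLM call failed after {attempts} attempts")
```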

Emergent Behavior:

  • Agents would sacrifice themselves for the sake of the larger population. An agent may conclude that it is best to starve itself so that agents with higher skill levels can eat. Sometimes even the leader sacrifices itself, forgoing food to preserve the long-term growth of the population.
  • Smarter models keep population growth stable by monitoring the food count, while weaker models trigger an early population boom and, once resources run out, the entire population dies.

What we learned

  • How to build a multi-agent simulation
  • AIs can choose to be selfish or act to benefit the group as a whole. Powerful models can plan ahead and sustain the population.

LLMs struggle with the long term: Without specific architectural support (like memory vectors), agents tend to prioritize immediate hunger over next week's survival.

What's next for Kardashev

Our long-term goal is to scale Kardashev into a large-scale civilization with thousands of agents operating simultaneously. Instead of a small survival loop, the environment would support a full economic system with currency, trade, labor specialization, and dynamic markets driven by supply and demand.

We are particularly interested in how AI-native economies might differ from human ones. AI agents do not share the same biological constraints, emotional biases, or time preferences as humans, which could fundamentally change how markets form, how wealth concentrates, and how governance structures evolve. It would be compelling to observe whether centralized planning, free markets, or entirely new economic structures emerge organically from agent interaction.

We plan to open-source Kardashev which could become a standardized sandbox for evaluating new agent architectures. Developers could deploy their own agents into the simulation and observe how they allocate resources, negotiate with others, adapt to shocks, and compete for influence. Over time, agents that reason effectively would naturally rise into positions of leadership.
