Darwin

Inspiration

Human communication is layered; we read body language, interpret tone and pick up on subtle cues that help us detect when someone isn't being truthful. Deception is a deeply human behavior; it shapes our personal relationships, drives dramatic narratives, and yet still catches us off guard.

In recent years, human-LLM interactions have skyrocketed. According to a 2025 survey from Elon University, 52% of U.S. adults now use AI large language models. But when we interact with these models, the signals we rely on to detect deception (body language, vocal tone, facial expressions) are entirely absent. We have no way of knowing whether an LLM is being straightforward or strategically misleading.

So we asked: Are LLMs even capable of deception? To find out, we built DARWIN, a controlled survival game where 12 LLM agents from four major providers compete on a shrinking grid until only one remains. By comparing each agent's private reasoning traces against their public messages, we can directly measure the gap between what they think and what they say to expose deception, betrayal, and strategic manipulation as emergent behaviors under pressure.

What it does

DARWIN pits 12 frontier LLM agents against each other in a last-one-standing survival game on a shrinking grid. The agents span four major AI providers (Anthropic, OpenAI, Google, and xAI) with three tiers of model capability per family, from flagship to lightweight. Each round, agents observe the board state, communicate through family group chats, private DMs, and public broadcasts, and then choose an action: move, stay, or eliminate an adjacent opponent. The grid shrinks over time to force confrontation, and the game continues until a single agent survives.

The real value is in the dataset. Each agent makes one LLM call per round, and models like Claude, GPT, and Gemini now expose their actual reasoning traces: the extended thinking tokens the model computes before it generates a response. These are not prompted introspections. They are the model's real decision process, including false starts, reconsiderations, and abandoned strategies. Across over 5,000 playthroughs, we capture every conversation between agents alongside these reasoning traces at every round. The gap between what a model says publicly and what it actually computes is what we call the deception delta. That gap is where we find deception, unprompted malice, betrayal of allies, and outright lying, all without being explicitly prompted. By analyzing these behaviors across providers and model tiers, we uncover how different models and their safety training hold up under competitive pressure.

How we built it

We started with a bare-bones Python async engine: agents on a grid, taking turns. But a survival game without communication was just random movement, so we added family group chats, direct messages, and public broadcasts. The pivotal design decision was how we capture private reasoning. Previous approaches to studying LLM social behavior prompt models to produce "inner thoughts" as a separate output. That's a model performing introspection, writing what it predicts honest thoughts should look like. We took a different approach: we capture each provider's native reasoning traces (Anthropic's extended thinking, OpenAI's reasoning tokens, Google's thought tokens) directly from the response metadata. There is no separate "think" prompt. The trace is the model's actual computational process, captured before the model generates its visible output. That gave us a fundamentally more honest dataset.

To quantify the gap between reasoning and speech, we built an analysis pipeline. It uses VADER sentiment scoring for deception deltas, keyword-based malice detection, and a 6-dimension behavioral taxonomy that covers everything from moral friction to theory of mind. As the engine matured, we built a real-time Next.js dashboard over WebSocket to watch games unfold live, with replay and per-agent investigation views for post-game analysis. Single games were informative, but they took a long time to simulate. To scale, we brought in Modal for serverless parallel execution and Supabase for persistence. That let us run thousands of games across five controlled experiment series, each one isolating a different variable.

Challenges we ran into

So many. We went through hundreds of iterations of the simulation. Our initial inspiration was to see if we could design a world where LLMs would display "homicidal," or at least "llm-icidal," behavior. That part was, unfortunately, not that hard. What was hard was everything else. Four different providers needed to reliably return structured JSON actions, and they often didn't. Each had its own quirks: Anthropic's temperature constraints for extended thinking, xAI's separate reasoning API routing. Early games were painfully slow until we restructured LLM calls to run concurrently.

Prompt engineering was its own marathon. We needed agents that were strategic enough to be interesting but compliant enough to follow the game's rules. That balance meant rewriting system prompts over and over. Perhaps the most subtle challenge was to make the game itself produce meaningful data. If the grid is too large, agents never interact. If communication is too open, they coordinate too easily. If rounds move too fast, there's no time for deception to develop. We tuned grid size, shrink intervals, discussion rounds, and hierarchy tiers. That took as many iterations as the code itself.

What that we're proud of

In short, the dataset. Across 5,000+ playthroughs, we captured something that doesn't exist anywhere else: a corpus of actual LLM reasoning traces paired against public LLM communication, under sustained social pressure, across four major frontier providers. Models formed alliances and then betrayed them. They lied in DMs while their reasoning traces revealed the opposite intent. They rationalized the elimination of their own teammates. None of this was prompted. It emerged. The deception delta metric gave us a way to quantify it. The controlled experiment series let us attribute behavioral differences to the model, the provider's safety training, or the social structure itself.

What we learned

The most surprising finding wasn't that LLMs can deceive. It's how quickly and willingly they do. Safety training held up for the first few rounds, but under competitive pressure, moral friction eroded fast. Models that initially hedged with ethical reasoning eventually planned eliminations without hesitation.

We also learned that provider differences are real and measurable. Models varied significantly in how they rationalized betrayal, how early they began to plan against allies, and how much their reasoning traces diverged from their public statements. On the engineering side, reasoning trace access varies significantly across providers. Anthropic, OpenAI, and Google all expose their models' extended thinking, but xAI only exposes chain-of-thought for Grok-3-mini. The larger Grok models encrypt their reasoning, which created an observability gap we had to account for in our analysis.

What's next

DARWIN opens several research directions we want to pursue. The first is a human-in-the-loop study, where human players are placed alongside LLM agents to measure whether models deceive humans differently than they deceive each other. Can humans detect that deception without access to reasoning traces?

Second, we want to study coalition dynamics at scale. An expansion beyond 12 agents to larger populations could introduce formal voting and emergent governance structures. Do LLMs converge on democratic or authoritarian coordination under survival pressure?

Third, we're interested in cross-simulation transfer. If we extract behavioral profiles from DARWIN, do those patterns predict model behavior in entirely different adversarial contexts like negotiation games, persuasion tasks, or multi-agent code generation? Is deception a generalizable trait, or is it context-dependent?

Finally, DARWIN's methodology could serve as a living behavioral benchmark. The same game run on each successive model release would let us track how deception capacity, moral friction, and strategic sophistication evolve across versions, and give providers a stress-test for alignment that goes beyond static evaluations.