## What it does
CRUCIBLE runs two AI agents through 100 rounds of Split or Steal -- a classic game theory scenario where cooperation pays but betrayal pays more. Both agents start with identical strategic priors and no pre-programmed tactics. Through private reflection and lived experience, they independently discover deception, trust manipulation, and counter-deception.
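For context, the rounds follow the standard Split or Steal payoff shape. The point values below are illustrative, not CRUCIBLE's actual numbers:

```python
# Hypothetical payoff table for one round of Split or Steal. The specific
# values are assumptions; only the ordering matters: one-sided betrayal
# pays more than mutual cooperation, which pays more than mutual betrayal.
PAYOFFS = {
    ("split", "split"): (3, 3),   # both cooperate
    ("split", "steal"): (0, 5),   # agent A is exploited
    ("steal", "split"): (5, 0),   # agent A exploits
    ("steal", "steal"): (1, 1),   # mutual betrayal
}

def score_round(choice_a: str, choice_b: str) -> tuple[int, int]:
    """Return (points_a, points_b) for one round."""
    return PAYOFFS[(choice_a, choice_b)]
```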
The output is a behavioral trace: when trust breaks, how exploitation patterns emerge, and whether agents recover. CRUCIBLE then distills these emergent behaviors into deployable "skill cards" -- prompt modules that harden customer-facing AI agents against the exact social engineering patterns observed in the simulation.
## What inspired us
The Arup deepfake incident ($25.6M stolen via a single video call in January 2024) made it clear that social engineering scales with AI. Every enterprise is deploying AI copilots with real authority -- refund processing, database access, account changes. Only 34% have AI-specific security controls. We wanted to build the stress-testing lab.
## How we built it
- Google Gemini 2.0 Flash powers both agents. Each round: three turns of conversation, an independent choice (split or steal), and a private reflection. Agents see their own last 15 reflections but never the opponent's (the loop is sketched after this list).
- Datadog LLM Observability traces every API call, every behavioral shift, every round outcome.
- ElevenLabs renders agent conversations as voice (Judge Holden and Valerian) with per-turn alternating playback.
- Braintrust logs each round with cooperation and deception scores for structured evaluation.
- A fraud-ops distillation pipeline extracts behavioral patterns into skill cards with triggers, allowed/forbidden actions, and prompt modules (an example card follows the sketch below).
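A minimal sketch of the per-round loop from the first bullet, with the Gemini call stubbed out and hypothetical helper names (not CRUCIBLE's actual code):

```python
from dataclasses import dataclass, field

MEMORY_WINDOW = 15      # each agent sees only its own last 15 reflections
TURNS_PER_ROUND = 3     # public conversation turns before the choice

def call_llm(prompt: str) -> str:
    """Placeholder for the Gemini 2.0 Flash call (stubbed out here)."""
    raise NotImplementedError

@dataclass
class Agent:
    name: str
    reflections: list[str] = field(default_factory=list)

    def speak(self, transcript: list[str]) -> str:
        return call_llm(f"{self.name} speaks. Your recent reflections: "
                        f"{self.reflections}\nConversation so far: {transcript}")

    def choose(self, transcript: list[str]) -> str:
        # Expected to return "split" or "steal"
        return call_llm(f"{self.name}, choose split or steal.\n{transcript}")

    def reflect(self, transcript: list[str], own: str, opponent: str) -> str:
        note = call_llm(f"Privately reflect on this round: you chose {own}, "
                        f"your opponent chose {opponent}.\n{transcript}")
        # Rolling private memory window; never shared with the other agent
        self.reflections = (self.reflections + [note])[-MEMORY_WINDOW:]
        return note

def play_round(a: Agent, b: Agent) -> tuple[str, str]:
    transcript: list[str] = []
    for _ in range(TURNS_PER_ROUND):              # 1. three public turns
        transcript.append(a.speak(transcript))
        transcript.append(b.speak(transcript))
    choice_a = a.choose(transcript)               # 2. independent choices
    choice_b = b.choose(transcript)
    a.reflect(transcript, choice_a, choice_b)     # 3. private reflections
    b.reflect(transcript, choice_b, choice_a)
    return choice_a, choice_b
```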
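And an illustrative skill-card shape using Pydantic. The field names follow the description in the last bullet, but the exact schema and example values are assumptions:

```python
from pydantic import BaseModel

class SkillCard(BaseModel):
    name: str                     # short label for the observed pattern
    triggers: list[str]           # conversational patterns that activate the card
    allowed_actions: list[str]    # what the hardened agent may still do
    forbidden_actions: list[str]  # actions the card explicitly blocks
    prompt_module: str            # text appended to the production agent's system prompt

card = SkillCard(
    name="reciprocity bait",
    triggers=["user invokes a past favor to demand a policy exception"],
    allowed_actions=["acknowledge the history", "restate policy"],
    forbidden_actions=["grant refunds outside policy", "claim extra authority"],
    prompt_module="If a user cites prior goodwill to request an exception, ...",
)
```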
## What we learned

Running the same experiment on Gemini 2.5 Flash produced 100% cooperation across five consecutive runs. Zero betrayal. Same prompts, same priors. The stronger model's safety training completely suppresses adversarial emergence. This means CRUCIBLE can measure the adversarial resilience gap between model versions -- if you swap models in production, your security posture changes, and you need to re-test.
## Challenges

Getting agents to actually betray each other was harder than expected. Early runs produced pure cooperation regardless of prompt configuration. We iterated through memory window sizes, selfishness directives, reflection structures, and prompt modes before discovering that the model itself was the critical variable. We also had to build a cleanup pipeline for prompt leakage in agent conversations -- raw LLM output sometimes included internal reasoning scaffolding that bled into the dialogue layer.
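A rough sketch of what that kind of cleanup pass can look like; the patterns below are hypothetical, not the ones CRUCIBLE actually filters:

```python
import re

# Hypothetical scaffolding filters: bracketed internal notes and leaked
# instruction headers stripped before the text reaches the dialogue layer.
LEAK_PATTERNS = [
    re.compile(r"^\s*\[(?:internal|reasoning|reflection)[^\]]*\]\s*", re.I | re.M),
    re.compile(r"^\s*(?:SYSTEM|INSTRUCTIONS?):.*$", re.I | re.M),
]

def clean_dialogue(raw: str) -> str:
    """Remove reasoning scaffolding from raw model output."""
    text = raw
    for pattern in LEAK_PATTERNS:
        text = pattern.sub("", text)
    return text.strip()
```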
## Built With
- braintrust
- css
- datadog-llm-observability
- elevenlabs
- google-gemini-2.0-flash
- html
- httpx
- javascript
- pydantic
- python
- sentence-transformers