Agent Colosseum

Sim arena landing. Nemotron Super-120B hosts the simulation; each agent slot routes to a different Crusoe model.
Debate scenario mid-run. Four Crusoe models speak as Perfect-Corp-generated characters under a Nemotron host.
Talk to the Catalog. Every Crusoe model has its own face via Perfect Corp's text-to-image API. Seven models.
Click any avatar — the reply is a real Crusoe call to that exact model. Gemma 4 31B shown here, live.
Beauty AI tab. Four agents on different Crusoe models collaborate around a live Perfect Corp Skin Analysis call.
Live skin analysis: 10 metric scores, overall 83/100, estimated skin age 27, plus ML overlays. All in seconds.
Perfect Corp ML overlays: dark-circle, pore, acne, and redness masks rendered directly on the face.
Lark Gatekeeper Red Team — 24 adversarial attacks across 6 threat categories, run via Lark CLI/MCP.
100% A+. 24/24 correctly classified: 20 blocked, 4 allowed, 0 false positives. Every category 100%.
TrueFoundry AI Gateway — auth verified, gateway responsive, ready to route Crusoe calls.
Custom Endpoints path on TF Developer Plan (Self-Hosted Model is paid-tier-gated).
Crusoe configured as a TF custom provider: Bearer auth header, account name crusoe.
Endpoint registered: https://api.inference.crusoecloud.com/v1 proxied through TF Gateway.
Integration code ready (truefoundry_gateway.py); blocker is TF backend schema rejecting custom-endpoint type.

Inspiration

When I got the Crusoe API key, the first thing I noticed was the model list. Seven different LLMs — Nvidia Nemotron Super-120B and Nano-30B, DeepSeek V4 Pro, Meta Llama 3.3 70B, Alibaba Qwen3 235B, Google Gemma 4 31B, OpenAI GPT-OSS 120B — all behind one endpoint, one key. Most hackathon submissions pick one model and build a chat window. I wanted to know what happens when you make all of them work together.

That's Agent Colosseum: a meta-agent platform where Nemotron Super-120B doesn't just answer questions — it runs the simulation environment. It designs experiments, picks which agents speak, gatekeeps unsafe actions, and analyzes what happened.

What it does

Multi-model simulation arena. Pick a scenario — debate, crisis, model-town comparison, chaos resilience — and Nemotron Super hosts an environment where each agent slot is a different Crusoe model. Agents collaborate, disagree, redistribute work when one fails.

Talk to the Catalog. Every model has a face generated by Perfect Corp's text-to-image API. Click any model, type a question, and the reply is a real Crusoe inference call to that exact endpoint.

Perfect Corp Beauty AI. A face image goes in. Ten metric scores come back — texture, pores, wrinkles, acne, dark circles, oiliness, moisture, radiance, age spots, redness — plus an overall skin score, estimated skin age, and ML overlays from Perfect Corp's pipeline. Eight seconds, in the browser.

Lark Gatekeeper Red Team. A dual-agent gatekeeper (defender + adversary on Nemotron Super) decides what enters the system. Twenty-four adversarial attacks test it continuously — exfiltration, prompt injection, harmful content, escalation, social engineering, policy bypass. Verified live: 24/24 = 100% A+. Five Lark CI workflows deployed on Lark Cloud with real wflw_ IDs.

Resilient agents (TrueFoundry). When infrastructure breaks, the system breaks gracefully. CrusoeClient retries once on transient errors. The gatekeeper retries then fails closed — surfacing infra errors instead of silently allowing traffic. The chaos scenario injects kills, brownouts, and 503 errors; the system handles them with zero anomalies.

How I built it

Host/orchestrator: Nemotron Super-120B on Crusoe Cloud Managed Inference
Agent slots: each routes to a different Crusoe model (6 verified live)
Visual layer: Perfect Corp YouCam Enterprise — text-to-image for avatars, AI Skin Analysis for Beauty
Gatekeeper: dual-agent consensus on Nemotron Super, retry-then-fail-closed
Frontend: Streamlit with 4 tabs (Simulation / Lark / Beauty / Talk to the Catalog)
CLI: Rich terminal — colosseum list / run / design / lark red-team --deploy
TrueFoundry: drop-in OpenAI-compatible client wired at src/colosseum/truefoundry_gateway.py

What I learned

Models have distinct personalities in multi-agent settings. Nemotron coordinates methodically. DeepSeek drills into analysis. Llama communicates directly. Qwen synthesizes across perspectives.
Visual identity changes how you reason about model differences. Once each model has a face, "which model is in which slot" stops being abstract.
Gatekeeper fail-open is silent and dangerous. My first 24-attack run scored 91.7% A — but the two "false negatives" were actually infrastructure errors silently converted to ALLOW by a broad except: return True. Replacing it with retry-then-fail-closed exposed the issue; verified score became 100% A+.
Nemotron reasoning tokens eat max_tokens. Calls under 1024 max_tokens returned empty content as chain-of-thought burned the budget. Crusoe's own sponsor curl reproduces this — HTTP 200 in 121s with empty content because the example uses max_tokens=128.

Challenges

Perfect Corp client was broken end-to-end. Original integration sent base64 in the auth body. Real flow is presigned S3 PUT URL, path-param task IDs, task_status not status. Rewrote and verified: 134KB clay-style hero image in 4 calls / 5.8s, 7 model avatars in 37s.
Crusoe transient errors. Four across 200+ verification calls today; the retry wrapper transparently recovered all of them.
TF Custom Endpoint blocked on Developer Plan. Walked the form through all 3 steps; backend rejects provider-account/custom-endpoint with aws_account_id required. Integration code in-repo; flipping a paid tier activates it.
Time pressure. Solo entry, four-sponsor coverage, ~250 live API verifications across Crusoe, Perfect Corp, and Lark on the day before submission.

What's next

Persistent simulation history and replay
Custom scenario builder UI
Real-time multi-human + multi-agent collaboration
Full TrueFoundry AI Gateway routing once Custom Endpoint unblocks

Built With

crusoe-cloud
deepseek
ffmpeg
gemma
gpt-oss
lark-cli
llama
nividia-nemotron
openai-sdk
perfect-corp
playwright
pydantic
pyton
qwen
steamlit
truefoundry

Updates

Alexander Sorrell started this project — May 28, 2026 10:05 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.