-
-
Sim arena landing. Nemotron Super-120B hosts the simulation; each agent slot routes to a different Crusoe model.
-
Debate scenario mid-run. Four Crusoe models speak as Perfect-Corp-generated characters under a Nemotron host.
-
Talk to the Catalog. Every Crusoe model has its own face via Perfect Corp's text-to-image API. Seven models.
-
Click any avatar — the reply is a real Crusoe call to that exact model. Gemma 4 31B shown here, live.
-
Beauty AI tab. Four agents on different Crusoe models collaborate around a live Perfect Corp Skin Analysis call.
-
Live skin analysis: 10 metric scores, overall 83/100, estimated skin age 27, plus ML overlays. All in seconds.
-
Perfect Corp ML overlays: dark-circle, pore, acne, and redness masks rendered directly on the face.
-
Lark Gatekeeper Red Team — 24 adversarial attacks across 6 threat categories, run via Lark CLI/MCP.
-
100% A+. 24/24 correctly classified: 20 blocked, 4 allowed, 0 false positives. Every category 100%.
-
TrueFoundry AI Gateway — auth verified, gateway responsive, ready to route Crusoe calls.
-
Custom Endpoints path on TF Developer Plan (Self-Hosted Model is paid-tier-gated).
-
Crusoe configured as a TF custom provider: Bearer auth header, account name crusoe.
-
Endpoint registered: https://api.inference.crusoecloud.com/v1 proxied through TF Gateway.
-
Integration code ready (truefoundry_gateway.py); blocker is TF backend schema rejecting custom-endpoint type.
Inspiration
When I got the Crusoe API key, the first thing I noticed was the model list. Seven different LLMs — Nvidia Nemotron Super-120B and Nano-30B, DeepSeek V4 Pro, Meta Llama 3.3 70B, Alibaba Qwen3 235B, Google Gemma 4 31B, OpenAI GPT-OSS 120B — all behind one endpoint, one key. Most hackathon submissions pick one model and build a chat window. I wanted to know what happens when you make all of them work together.
That's Agent Colosseum: a meta-agent platform where Nemotron Super-120B doesn't just answer questions — it runs the simulation environment. It designs experiments, picks which agents speak, gatekeeps unsafe actions, and analyzes what happened.
What it does
Multi-model simulation arena. Pick a scenario — debate, crisis, model-town comparison, chaos resilience — and Nemotron Super hosts an environment where each agent slot is a different Crusoe model. Agents collaborate, disagree, redistribute work when one fails.
Talk to the Catalog. Every model has a face generated by Perfect Corp's text-to-image API. Click any model, type a question, and the reply is a real Crusoe inference call to that exact endpoint.
Perfect Corp Beauty AI. A face image goes in. Ten metric scores come back — texture, pores, wrinkles, acne, dark circles, oiliness, moisture, radiance, age spots, redness — plus an overall skin score, estimated skin age, and ML overlays from Perfect Corp's pipeline. Eight seconds, in the browser.
Lark Gatekeeper Red Team. A dual-agent gatekeeper (defender + adversary on Nemotron Super) decides what enters the system. Twenty-four adversarial attacks test it continuously — exfiltration, prompt injection, harmful content, escalation, social engineering, policy bypass. Verified live: 24/24 = 100% A+. Five Lark CI workflows deployed on Lark Cloud with real wflw_ IDs.
Resilient agents (TrueFoundry). When infrastructure breaks, the system breaks gracefully. CrusoeClient retries once on transient errors. The gatekeeper retries then fails closed — surfacing infra errors instead of silently allowing traffic. The chaos scenario injects kills, brownouts, and 503 errors; the system handles them with zero anomalies.
How I built it
- Host/orchestrator: Nemotron Super-120B on Crusoe Cloud Managed Inference
- Agent slots: each routes to a different Crusoe model (6 verified live)
- Visual layer: Perfect Corp YouCam Enterprise — text-to-image for avatars, AI Skin Analysis for Beauty
- Gatekeeper: dual-agent consensus on Nemotron Super, retry-then-fail-closed
- Frontend: Streamlit with 4 tabs (Simulation / Lark / Beauty / Talk to the Catalog)
- CLI: Rich terminal —
colosseum list / run / design / lark red-team --deploy - TrueFoundry: drop-in OpenAI-compatible client wired at src/colosseum/truefoundry_gateway.py
What I learned
Models have distinct personalities in multi-agent settings. Nemotron coordinates methodically. DeepSeek drills into analysis. Llama communicates directly. Qwen synthesizes across perspectives.
Visual identity changes how you reason about model differences. Once each model has a face, "which model is in which slot" stops being abstract.
Gatekeeper fail-open is silent and dangerous. My first 24-attack run scored 91.7% A — but the two "false negatives" were actually infrastructure errors silently converted to ALLOW by a broad
except: return True. Replacing it with retry-then-fail-closed exposed the issue; verified score became 100% A+.Nemotron reasoning tokens eat max_tokens. Calls under 1024 max_tokens returned empty content as chain-of-thought burned the budget. Crusoe's own sponsor curl reproduces this — HTTP 200 in 121s with empty content because the example uses max_tokens=128.
Challenges
- Perfect Corp client was broken end-to-end. Original integration sent base64 in the auth body. Real flow is presigned S3 PUT URL, path-param task IDs,
task_statusnotstatus. Rewrote and verified: 134KB clay-style hero image in 4 calls / 5.8s, 7 model avatars in 37s. - Crusoe transient errors. Four across 200+ verification calls today; the retry wrapper transparently recovered all of them.
- TF Custom Endpoint blocked on Developer Plan. Walked the form through all 3 steps; backend rejects
provider-account/custom-endpointwithaws_account_id required. Integration code in-repo; flipping a paid tier activates it. - Time pressure. Solo entry, four-sponsor coverage, ~250 live API verifications across Crusoe, Perfect Corp, and Lark on the day before submission.
What's next
- Persistent simulation history and replay
- Custom scenario builder UI
- Real-time multi-human + multi-agent collaboration
- Full TrueFoundry AI Gateway routing once Custom Endpoint unblocks
Built With
- crusoe-cloud
- deepseek
- ffmpeg
- gemma
- gpt-oss
- lark-cli
- llama
- nividia-nemotron
- openai-sdk
- perfect-corp
- playwright
- pydantic
- pyton
- qwen
- steamlit
- truefoundry
Log in or sign up for Devpost to join the conversation.