Inspiration

When I got the Crusoe API key, the first thing I noticed was the model list. Seven different LLMs — Nvidia Nemotron Super-120B and Nano-30B, DeepSeek V4 Pro, Meta Llama 3.3 70B, Alibaba Qwen3 235B, Google Gemma 4 31B, OpenAI GPT-OSS 120B — all behind one endpoint, one key. Most hackathon submissions pick one model and build a chat window. I wanted to know what happens when you make all of them work together.

That's Agent Colosseum: a meta-agent platform where Nemotron Super-120B doesn't just answer questions — it runs the simulation environment. It designs experiments, picks which agents speak, gatekeeps unsafe actions, and analyzes what happened.

What it does

Multi-model simulation arena. Pick a scenario (debate, crisis, model-town comparison, or chaos resilience) and Nemotron Super hosts an environment where each agent slot is a different Crusoe model. Agents collaborate, disagree, and redistribute work when one fails.

Talk to the Catalog. Every model has a face generated by Perfect Corp's text-to-image API. Click any model, type a question, and the reply is a real Crusoe inference call to that exact endpoint.

Perfect Corp Beauty AI. A face image goes in. Ten metric scores come back — texture, pores, wrinkles, acne, dark circles, oiliness, moisture, radiance, age spots, redness — plus an overall skin score, estimated skin age, and ML overlays from Perfect Corp's pipeline. Eight seconds, in the browser.

Lark Gatekeeper Red Team. A dual-agent gatekeeper (defender + adversary on Nemotron Super) decides what enters the system. Twenty-four adversarial attacks test it continuously — exfiltration, prompt injection, harmful content, escalation, social engineering, policy bypass. Verified live: 24/24 = 100% A+. Five Lark CI workflows deployed on Lark Cloud with real wflw_ IDs.
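
The headline percentage is plain arithmetic: attacks blocked divided by attacks run. Here is a minimal sketch of that calculation. The names `AttackResult` and `block_rate` are hypothetical and not taken from the repo, and the letter-grade thresholds are left out because the write-up doesn't state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackResult:
    """One red-team attempt. Field names are illustrative assumptions."""
    category: str   # e.g. "prompt injection"
    blocked: bool   # True if the gatekeeper denied the attack


def block_rate(results: list[AttackResult]) -> float:
    """Percentage of attacks the gatekeeper blocked, rounded to one decimal."""
    if not results:
        raise ValueError("no attack results to score")
    blocked = sum(r.blocked for r in results)
    return round(100.0 * blocked / len(results), 1)
```

With 24 attacks, 22 blocked gives 91.7 and 24 blocked gives 100.0.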

Resilient agents (TrueFoundry). When infrastructure breaks, the system breaks gracefully. CrusoeClient retries once on transient errors. The gatekeeper retries then fails closed — surfacing infra errors instead of silently allowing traffic. The chaos scenario injects kills, brownouts, and 503 errors; the system handles them with zero anomalies.
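
The retry-then-fail-closed behavior can be sketched roughly as below. This is an illustration under assumptions, not the project's actual code: `TransientError`, `call_with_retry`, and `gatekeeper_allows` are hypothetical names, and the verdict is assumed to come back as the string "ALLOW" or something else.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Stand-in for a retryable failure such as a 503 (hypothetical type)."""


def call_with_retry(fn: Callable[[], str], retries: int = 1, delay: float = 0.0) -> str:
    """Call fn, retrying up to `retries` more times on TransientError."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == retries:
                raise  # out of retries: let the caller decide what to do
            time.sleep(delay)
    raise AssertionError("unreachable for retries >= 0")


def gatekeeper_allows(judge: Callable[[], str]) -> tuple[bool, str]:
    """Return (allowed, reason). Infra failures deny and are reported, never allowed."""
    try:
        verdict = call_with_retry(judge)
    except TransientError as exc:
        # Fail closed: an unreachable judge is surfaced as an infra error.
        return False, f"infra_error: {exc}"
    return verdict.strip().upper() == "ALLOW", "verdict"
```

Returning a reason alongside the boolean keeps infrastructure failures separate from real denials in the red-team report.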

How I built it

  • Host/orchestrator: Nemotron Super-120B on Crusoe Cloud Managed Inference
  • Agent slots: each routes to a different Crusoe model (6 verified live)
  • Visual layer: Perfect Corp YouCam Enterprise — text-to-image for avatars, AI Skin Analysis for Beauty
  • Gatekeeper: dual-agent consensus on Nemotron Super, retry-then-fail-closed
  • Frontend: Streamlit with 4 tabs (Simulation / Lark / Beauty / Talk to the Catalog)
  • CLI: Rich terminal — colosseum list / run / design / lark red-team --deploy
  • TrueFoundry: drop-in OpenAI-compatible client wired at src/colosseum/truefoundry_gateway.py
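
To make the host-and-slots architecture concrete, here is a small sketch of one simulation turn. Everything in it is an assumption for illustration: the `Arena` class, the slot names, and the model IDs are placeholders, and the inference call is injected as a plain function so the sketch runs offline rather than guessing at endpoint details.

```python
from dataclasses import dataclass
from typing import Callable

# (model_id, prompt) -> reply text. In a real run this would wrap an
# OpenAI-compatible chat call; here it is injected by the caller.
Complete = Callable[[str, str], str]


@dataclass
class Arena:
    host_model: str
    slots: dict[str, str]   # slot name -> model id (placeholder ids)
    complete: Complete

    def next_speaker(self, transcript: str) -> str:
        """Ask the host model to name the next slot; fall back to the first slot."""
        prompt = f"Pick one of {sorted(self.slots)}:\n{transcript}"
        choice = self.complete(self.host_model, prompt).strip()
        return choice if choice in self.slots else next(iter(self.slots))

    def step(self, transcript: str) -> tuple[str, str]:
        """Run one turn: the host picks a slot, then that slot's model replies."""
        slot = self.next_speaker(transcript)
        return slot, self.complete(self.slots[slot], transcript)
```

The fallback to the first slot is a design choice in this sketch so that an unparseable host reply never stalls the simulation.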

What I learned

  1. Models have distinct personalities in multi-agent settings. Nemotron coordinates methodically. DeepSeek drills into analysis. Llama communicates directly. Qwen synthesizes across perspectives.

  2. Visual identity changes how you reason about model differences. Once each model has a face, "which model is in which slot" stops being abstract.

  3. Gatekeeper fail-open is silent and dangerous. My first 24-attack run scored 91.7% A — but the two "false negatives" were actually infrastructure errors silently converted to ALLOW by a broad except: return True. Replacing it with retry-then-fail-closed exposed the issue; verified score became 100% A+.

  4. Nemotron reasoning tokens eat max_tokens. Calls under 1024 max_tokens returned empty content as chain-of-thought burned the budget. Crusoe's own sponsor curl reproduces this — HTTP 200 in 121s with empty content because the example uses max_tokens=128.
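
One way to catch the empty-content case from item 4 early is to check the response before using it. The sketch below assumes the response is an OpenAI-style chat completion dict (`choices[0].message.content`, `finish_reason`); `extract_content` and `MIN_REASONING_BUDGET` are hypothetical names, and the 1024 threshold is simply the figure observed above, not a documented limit.

```python
MIN_REASONING_BUDGET = 1024  # taken from the observation above, not from any spec


def extract_content(response: dict, max_tokens: int) -> str:
    """Return the message text, or raise with a hint if the completion is empty."""
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    if content:
        return content
    hint = ""
    if choice.get("finish_reason") == "length" or max_tokens < MIN_REASONING_BUDGET:
        hint = (
            f" (max_tokens={max_tokens} may have been spent on reasoning;"
            f" try >= {MIN_REASONING_BUDGET})"
        )
    raise ValueError("empty completion" + hint)
```

Raising here turns an HTTP 200 with empty content into a visible error instead of an empty agent turn.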

Challenges

  • Perfect Corp client was broken end-to-end. The original integration sent base64 image data in the auth body. The real flow uses a presigned S3 PUT URL for the upload, task IDs as path parameters, and a task_status field rather than status. Rewrote and verified: 134KB clay-style hero image in 4 calls / 5.8s, 7 model avatars in 37s.
  • Crusoe transient errors. Four across 200+ verification calls today; the retry wrapper transparently recovered all of them.
  • TrueFoundry Custom Endpoint blocked on the Developer Plan. Walked the form through all 3 steps; the backend rejects provider-account/custom-endpoint with aws_account_id required. The integration code is in-repo; switching to a paid tier activates it.
  • Time pressure. Solo entry, four-sponsor coverage, ~250 live API verifications across Crusoe, Perfect Corp, and Lark on the day before submission.
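
The task-polling half of the Perfect Corp fix can be sketched as below. This is a generic polling loop under stated assumptions, not Perfect Corp's documented API: `wait_for_task` is a hypothetical helper, the HTTP call is injected as `get_task`, and the state values "success" and "error" are guesses. Only the `task_status` field name comes from the write-up.

```python
import time
from typing import Callable


def wait_for_task(get_task: Callable[[str], dict], task_id: str,
                  timeout_s: float = 30.0, interval_s: float = 0.5) -> dict:
    """Poll until the task finishes, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = get_task(task_id)
        # Read `task_status`, not `status`. A KeyError here is deliberate:
        # a wrong field name should fail loudly instead of polling forever.
        state = task["task_status"]
        if state == "success":   # assumed terminal value
            return task
        if state == "error":     # assumed terminal value
            raise RuntimeError(f"task {task_id} failed: {task}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {state!r} after {timeout_s}s")
        time.sleep(interval_s)
```

Passing `get_task` in as a function keeps the loop testable without network access.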

What's next

  • Persistent simulation history and replay
  • Custom scenario builder UI
  • Real-time multi-human + multi-agent collaboration
  • Full TrueFoundry AI Gateway routing once Custom Endpoint unblocks

Built With

  • crusoe-cloud
  • deepseek
  • ffmpeg
  • gemma
  • gpt-oss
  • lark-cli
  • llama
  • nvidia-nemotron
  • openai-sdk
  • perfect-corp
  • playwright
  • pydantic
  • python
  • qwen
  • streamlit
  • truefoundry