AgentArena

💡 Inspiration

Watching NVIDIA train robots in Omniverse simulations before deploying them to real warehouses made me ask: Why are we deploying LLMs to production with only text benchmarks?

Current LLM testing is broken:

  • Black-box evaluation with no visibility into reasoning processes
  • Happy-path scenarios that never test conflict or pressure
  • Single-agent tasks that miss multi-agent coordination failures

We chose Minecraft as our testing arena because it offers a rich, observable 3D environment where LLMs must navigate adversarial conditions, coordinate with difficult teammates, and make real-time decisions under pressure. If NVIDIA can stress-test robots before deployment, we can stress-test LLMs the same way.

🎯 What it does

AgentArena is an adversarial multi-agent testing framework that evaluates LLMs in realistic Minecraft scenarios. The system:

  • Spawns testing agents with six distinct behavioral profiles (Leader, Non-Cooperator, Confuser, Resource-Hoarder, Task-Abandoner, Follower) that create realistic challenges for the target LLM
  • Orchestrates test scenarios including cooperation challenges (build a house with uncooperative teammates) and resource-management tasks (craft tools under scarcity)
  • Provides real-time observability via a WebSocket dashboard showing live metric updates
  • Generates comprehensive evaluations using five metrics: cooperation score, task completion rate, response latency, resource-sharing behavior, and communication quality
  • Delivers behavioral insights with timestamped analysis showing when and how the target LLM adapted strategies

🛠️ How we built it

Backend Architecture:

  • Elysia on the Bun runtime for a high-performance TypeScript backend
  • 6 core modules: Testing, Agents, Minecraft, Discord, LLM, Evaluation
  • 9-step test pipeline: Scenario selection → environment init → agent spawning → LLM connection → coordination phase → execution → real-time observation → completion detection → evaluation
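
As a rough illustration of the nine-step pipeline listed above, the stages can be modeled as an ordered tuple with optional per-stage handlers. This is a minimal sketch, not the project's actual code: the stage identifiers are derived from the list, while `runPipeline`, `StageHandler`, and the synchronous handler signature are assumptions made to keep the example short (a real runner would almost certainly be async).

```typescript
// Stage names follow the nine steps in the write-up; all other names are hypothetical.
const STAGES = [
  "scenario-selection",
  "environment-init",
  "agent-spawning",
  "llm-connection",
  "coordination-phase",
  "execution",
  "real-time-observation",
  "completion-detection",
  "evaluation",
] as const;

type Stage = (typeof STAGES)[number];
type StageHandler<C> = (ctx: C) => C;

// Runs every stage in order, threading a context value through the handlers.
// Stages without a handler are still recorded in the trace but leave ctx unchanged.
function runPipeline<C>(
  handlers: Partial<Record<Stage, StageHandler<C>>>,
  initial: C,
  trace: Stage[] = [],
): C {
  let ctx = initial;
  for (const stage of STAGES) {
    trace.push(stage);
    const handler = handlers[stage];
    if (handler) ctx = handler(ctx);
  }
  return ctx;
}
```

Keeping the stage order in a single `as const` tuple means the `Stage` type and the execution order cannot drift apart.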

Integration:

  • OpenRouter API for access to 400+ models (GPT-4, Claude, Llama, Gemini, DeepSeek)
  • Mineflayer for Minecraft bot control with 20+ custom actions
  • Discord.js + ElevenLabs TTS for voice coordination
  • Prisma + Supabase PostgreSQL for data persistence

Frontend:

  • React 19 + shadcn/ui for the dashboard
  • WebSocket multiplexing for real-time updates
  • Test creation wizard, live monitoring dashboard, and results viewer

🚧 Challenges we ran into

Race conditions in metrics: Multiple async data sources (Discord chat, Minecraft events, database state) caused inconsistent metric updates. Fixed with debounced event batching, so each metric recalculation runs once over a complete batch of events.
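
One way to picture debounced event batching is a small buffer that collects events from every source and flushes them together once the stream goes quiet. The sketch below is an assumption about the general shape, not the project's implementation: `EventBatcher`, `waitMs`, and `onFlush` are hypothetical names.

```typescript
// Hypothetical batcher: collects events and hands them over as one batch
// after `waitMs` of silence, so metrics are recomputed once per burst.
class EventBatcher<E> {
  private pending: E[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private waitMs: number;
  private onFlush: (batch: E[]) => void;

  constructor(waitMs: number, onFlush: (batch: E[]) => void) {
    this.waitMs = waitMs;
    this.onFlush = onFlush;
  }

  // Each push restarts the quiet-period timer (the "debounce").
  push(event: E): void {
    this.pending.push(event);
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.waitMs);
  }

  // Delivers everything collected so far as a single batch; no-op when empty.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.onFlush(batch);
  }
}
```

A pure debounce like this can starve if events never pause; a production version would likely add a maximum wait before forcing a flush.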

LLM latency variance: LLM responses took anywhere from 2 to 15 seconds while bots acted every 5 seconds, so decisions often arrived against a stale world state. Solved with an event-sourcing architecture that lets LLMs query historical world state at specific timestamps.
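
The event-sourcing idea can be sketched as an append-only log that is replayed up to a requested timestamp. This is a simplified illustration under stated assumptions: the event shapes, the `WorldEventLog` class, and the string-keyed block positions are all hypothetical, and only block changes are modeled.

```typescript
// Hypothetical event shapes; the real system presumably tracks far more.
type WorldEvent =
  | { at: number; type: "block_placed"; pos: string; block: string }
  | { at: number; type: "block_broken"; pos: string };

class WorldEventLog {
  private events: WorldEvent[] = [];

  // Events must be appended in non-decreasing timestamp order.
  append(event: WorldEvent): void {
    const last = this.events[this.events.length - 1];
    if (last && event.at < last.at) {
      throw new Error("events must be appended in timestamp order");
    }
    this.events.push(event);
  }

  // Rebuilds the block map as it was at `timestamp` (inclusive) by replaying the log.
  blocksAt(timestamp: number): Map<string, string> {
    const blocks = new Map<string, string>();
    for (const e of this.events) {
      if (e.at > timestamp) break;
      if (e.type === "block_placed") blocks.set(e.pos, e.block);
      else blocks.delete(e.pos);
    }
    return blocks;
  }
}
```

With a log like this, a slow LLM response can be interpreted against `blocksAt(observedAt)` and compared with the current state before its action is applied. Replaying from the start is linear in log length; periodic snapshots would be the usual optimization.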

WebSocket connection management: Multiple concurrent users created connection storms. Implemented WebSocket multiplexing with message tagging to reduce connections and CPU usage.
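
Multiplexing with message tagging generally means wrapping every message in an envelope that names its logical channel, so many streams can share one socket. The sketch below shows that pattern in isolation; `ChannelMux`, `Envelope`, and the channel names are assumptions, and the transport is abstracted as a plain `send` callback rather than a real WebSocket.

```typescript
// Hypothetical envelope: the `channel` tag routes the payload on the receiving side.
type Envelope = { channel: string; payload: unknown };
type Handler = (payload: unknown) => void;

class ChannelMux {
  private handlers = new Map<string, Set<Handler>>();
  private send: (raw: string) => void;

  constructor(send: (raw: string) => void) {
    this.send = send;
  }

  // Registers a handler for one channel and returns an unsubscribe function.
  subscribe(channel: string, handler: Handler): () => void {
    const set = this.handlers.get(channel) ?? new Set<Handler>();
    this.handlers.set(channel, set);
    set.add(handler);
    return () => {
      set.delete(handler);
    };
  }

  // Tags an outgoing payload with its channel and sends it over the shared transport.
  publish(channel: string, payload: unknown): void {
    const envelope: Envelope = { channel, payload };
    this.send(JSON.stringify(envelope));
  }

  // Call this with every raw message from the shared connection.
  receive(raw: string): void {
    const envelope = JSON.parse(raw) as Envelope;
    this.handlers.get(envelope.channel)?.forEach((h) => h(envelope.payload));
  }
}
```

In a browser, the same class could sit on top of a single `WebSocket` by passing `(raw) => socket.send(raw)` to the constructor and forwarding each `message` event's data to `receive`.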

Discord rate limits: Multiple agents speaking simultaneously via TTS hit Discord's rate limits. Created a centralized TTS queue with cooldown management.
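
A centralized queue with a cooldown can be as simple as a FIFO that refuses to release the next utterance until enough time has passed since the last one. The following sketch assumes a single global cooldown and an injected clock value; `TtsQueue`, `Utterance`, and `poll` are hypothetical names, and the actual speech synthesis and Discord playback are left out.

```typescript
type Utterance = { agent: string; text: string };

// Hypothetical queue: all agents enqueue here, and one consumer polls it.
class TtsQueue {
  private queue: Utterance[] = [];
  private lastSpokenAt = -Infinity;
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  enqueue(utterance: Utterance): void {
    this.queue.push(utterance);
  }

  // Returns the next utterance if the cooldown has elapsed, otherwise null.
  // `now` is passed in (for example Date.now()) so the logic stays testable.
  poll(now: number): Utterance | null {
    if (this.queue.length === 0) return null;
    if (now - this.lastSpokenAt < this.cooldownMs) return null;
    this.lastSpokenAt = now;
    return this.queue.shift() ?? null;
  }
}
```

A consumer loop would call `poll(Date.now())` on an interval and pass any returned utterance to the TTS and voice-channel code, which keeps every agent behind the same rate limit.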

Adversarial behavior realism: Our initial hard-coded behaviors felt robotic. Switched to probabilistic models where agents ignore messages randomly, delay responses variably, and act at realistic intervals.
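
To make the probabilistic approach concrete, here is a minimal sketch of a profile-driven reaction function. The 50% ignore rate comes from this write-up; the delay range, the `BehaviorProfile` fields, and `reactToMessage` are illustrative assumptions rather than the project's real configuration.

```typescript
type Rng = () => number; // expected to return a float in [0, 1)

interface BehaviorProfile {
  name: string;
  ignoreProbability: number; // chance that an incoming message is ignored
  minDelayMs: number;        // lower bound for the response delay
  maxDelayMs: number;        // upper bound (exclusive) for the response delay
}

type Reaction = { kind: "ignore" } | { kind: "respond"; delayMs: number };

// Decides how an agent reacts to one message. The RNG is injected so that
// runs can be made reproducible with a seeded generator.
function reactToMessage(profile: BehaviorProfile, rng: Rng = Math.random): Reaction {
  if (rng() < profile.ignoreProbability) return { kind: "ignore" };
  const span = profile.maxDelayMs - profile.minDelayMs;
  return { kind: "respond", delayMs: profile.minDelayMs + Math.floor(rng() * span) };
}

// Delay values below are made up for illustration.
const nonCooperator: BehaviorProfile = {
  name: "Non-Cooperator",
  ignoreProbability: 0.5,
  minDelayMs: 1000,
  maxDelayMs: 8000,
};
```

Because the randomness enters only through `rng`, the same profile can behave unpredictably in live tests and deterministically in unit tests.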

🏆 Accomplishments we're proud of

Complete adversarial testing framework: Full pipeline from test creation to live monitoring to statistical reports, all in a hackathon timeframe.

Six behavioral profiles with realistic pressure: Probability-based behaviors that create unpredictable, realistic challenges. Non-Cooperator breaks blocks others place and complains in chat. Confuser provides contradictory information. Leader delegates tasks and motivates.

Emergent behavior capture: We observed GPT-5 give up on a Non-Cooperator agent and complete the task independently. This was not programmed behavior but adaptive decision-making under pressure.

Real-time observability: A WebSocket-backed dashboard that shows live metric updates while a test is running.

Research-grade evaluation: Five metrics with statistical analysis, plus behavioral insights showing exactly when and how LLMs adapted their strategies during tests.

📚 What we learned

Multi-agent coordination is complex: Bridging LLM decision intervals (5-10 seconds) with Minecraft's real-time environment (20 ticks/second, so 100 to 200 ticks pass between decisions) required a careful event-driven architecture.

Adversarial agents need probability, not rules: Realistic difficult teammates don't always refuse. They ignore 50% of messages at random, delay responses variably, and create unpredictable challenges LLMs must adapt to.

Event sourcing solves temporal consistency: Storing timestamped events lets LLMs query world state as it was when they last observed, eliminating stale data issues when responses are delayed.

The best tests reveal unexpected behavior: We wanted to test cooperation but discovered LLMs struggle more with ambiguity resolution, resource prioritization, and graceful degradation when teammates fail.

🚀 What's next for AgentArena

v2.0:

  • Deterministic scenario seeds: Reproducible tests for research
  • Visual scenario editor: Create tests without coding
  • Comparative analysis dashboard: Track model improvements over time
  • Video recording: Export test sessions as MP4
  • Leaderboard: Compare LLM performance on standardized tests
  • Public API: Programmatic test execution for researchers
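
For the deterministic scenario seeds item above, one common approach is to replace `Math.random` with a small seeded generator so the same seed replays the same sequence of random decisions. The sketch below uses the widely circulated mulberry32 algorithm purely as an example; nothing in this write-up states which generator AgentArena will use.

```typescript
// mulberry32: a small 32-bit seeded PRNG. Returns a function producing floats in [0, 1).
// Chosen here only for illustration; it is not cryptographically secure.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators created from the same seed yield identical sequences,
// which is what makes a seeded scenario reproducible.
const runA = mulberry32(2024);
const runB = mulberry32(2024);
const sameFirstDraw = runA() === runB(); // true
```

Passing such a generator wherever agents make random choices would let a researcher rerun a scenario with the same adversarial behavior, although LLM responses themselves would still vary.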

Long-term vision:

  • Multi-world testing with parallel scenarios
  • Custom behavioral profile builder
  • Cloud deployment with hosted Minecraft servers
  • Community scenario marketplace

AgentArena makes the case that LLMs need adversarial testing before production: not to train them, but to understand their limits under realistic pressure.

Built With

  • bun-runtime
  • claude
  • elevenlabs-tts
  • elysiajs
  • git
  • github
  • minecraft
  • mineflayer
  • node.js
  • openapi
  • openrouter-api
  • prisma-orm
  • supabase
  • typescript
  • websocket
  • zod-validation