AgentArena
💡 Inspiration
Watching NVIDIA train robots in Omniverse simulations before deploying them to real warehouses made me ask: Why are we deploying LLMs to production with only text benchmarks?
Current LLM testing is broken:
- Black-box evaluation with no visibility into reasoning processes
- Happy-path scenarios that never test conflict or pressure
- Single-agent tasks that miss multi-agent coordination failures
We chose Minecraft as our testing arena because it offers a rich, observable 3D environment where LLMs must navigate adversarial conditions, coordinate with difficult teammates, and make real-time decisions under pressure. If NVIDIA can stress-test robots before deployment, we can stress-test LLMs the same way.
🎯 What it does
AgentArena is an adversarial multi-agent testing framework that evaluates LLMs in realistic Minecraft scenarios. The system:
- Spawns testing agents with six distinct behavioral profiles (Leader, Non-Cooperator, Confuser, Resource-Hoarder, Task-Abandoner, Follower) that create realistic challenges for the target LLM
- Orchestrates test scenarios including cooperation challenges (build a house with uncooperative teammates) and resource-management tasks (craft tools under scarcity)
- Provides real-time observability via a WebSocket dashboard showing live metric updates
- Generates comprehensive evaluations using five metrics: cooperation score, task completion rate, response latency, resource-sharing behavior, and communication quality
- Delivers behavioral insights with timestamped analysis showing when and how the target LLM adapted strategies
🛠️ How we built it
Backend Architecture:
- Elysia + Bun for high-performance TypeScript runtime
- 6 core modules: Testing, Agents, Minecraft, Discord, LLM, Evaluation
- 9-step test pipeline: Scenario selection → environment init → agent spawning → LLM connection → coordination phase → execution → real-time observation → completion detection → evaluation
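One way to keep the nine steps above in order is to track them as a small state machine. This is only a sketch: the step identifiers mirror the pipeline description, but `PIPELINE_STEPS`, `PipelineTracker`, and its methods are hypothetical names, not AgentArena's actual code.

```typescript
// Step names mirror the pipeline description; the identifiers are assumptions.
const PIPELINE_STEPS = [
  "scenario-selection",
  "environment-init",
  "agent-spawning",
  "llm-connection",
  "coordination-phase",
  "execution",
  "real-time-observation",
  "completion-detection",
  "evaluation",
] as const;

type PipelineStep = (typeof PIPELINE_STEPS)[number];

// Hypothetical tracker: enforces that steps complete strictly in order.
class PipelineTracker {
  private index = 0;

  // The step that must run next, or null once evaluation has finished.
  get current(): PipelineStep | null {
    return PIPELINE_STEPS[this.index] ?? null;
  }

  get finished(): boolean {
    return this.index >= PIPELINE_STEPS.length;
  }

  // Marks `step` as done and rejects out-of-order completion.
  complete(step: PipelineStep): void {
    if (step !== this.current) {
      throw new Error(`Expected "${this.current}" but got "${step}"`);
    }
    this.index += 1;
  }
}
```

Rejecting out-of-order completion means, for example, that evaluation cannot start before completion detection has run.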
Integration:
- OpenRouter API for access to 400+ LLMs (GPT-4, Claude, Llama, Gemini, DeepSeek)
- Mineflayer for Minecraft bot control with 20+ custom actions
- Discord.js + ElevenLabs TTS for voice coordination
- Prisma + Supabase PostgreSQL for data persistence
Frontend:
- React 19 + shadcn/ui for the dashboard
- WebSocket multiplexing for real-time updates
- Test creation wizard, live monitoring dashboard, and results viewer
🚧 Challenges we ran into
Race conditions in metrics: Multiple async data sources (Discord chat, Minecraft events, database state) caused inconsistent metric updates. Fixed by debouncing incoming events into batches so that each metric calculation runs atomically.
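A minimal sketch of debounced event batching, under stated assumptions: `MetricEvent`, `DebouncedMetricBatcher`, and the 100 ms quiet window are hypothetical, and the real metric calculations are reduced here to summing deltas per metric.

```typescript
// Hypothetical event shape; the three sources are the ones named above.
interface MetricEvent {
  source: "discord" | "minecraft" | "database";
  metric: string; // e.g. "cooperation"
  delta: number;
}

type BatchHandler = (totals: Map<string, number>) => void;

class DebouncedMetricBatcher {
  private pending: MetricEvent[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private onBatch: BatchHandler;
  private quietMs: number;

  constructor(onBatch: BatchHandler, quietMs = 100) {
    this.onBatch = onBatch;
    this.quietMs = quietMs;
  }

  // Buffers the event and restarts the quiet-period timer.
  push(event: MetricEvent): void {
    this.pending.push(event);
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.quietMs);
  }

  // Applies every buffered event in one pass, then reports once.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    const totals = new Map<string, number>();
    batch.forEach((e) => {
      totals.set(e.metric, (totals.get(e.metric) ?? 0) + e.delta);
    });
    this.onBatch(totals);
  }
}
```

Because the handler only ever sees a completed batch, a dashboard update cannot observe a metric that is halfway through being recalculated.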
LLM latency variance: LLM responses took anywhere from 2 to 15 seconds while bots acted every 5 seconds, so decisions were often based on stale world state. Solved with an event-sourcing architecture that lets LLMs query historical world state at specific timestamps.
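The idea of querying world state at a timestamp can be illustrated with a small event log that replays events up to the requested moment. This is a sketch, not the project's implementation: `BlockEvent`, `WorldEventLog`, and the block-only world model are assumptions made for the example.

```typescript
// Hypothetical event: a block placed or broken at a position such as "0,64,0".
interface BlockEvent {
  timestamp: number; // milliseconds since test start
  kind: "place" | "break";
  position: string;
  block: string;
}

class WorldEventLog {
  private events: BlockEvent[] = [];

  // Inserts the event so the log stays sorted by timestamp,
  // even if events arrive slightly out of order.
  append(event: BlockEvent): void {
    let i = this.events.length;
    while (i > 0 && this.events[i - 1]!.timestamp > event.timestamp) i--;
    this.events.splice(i, 0, event);
  }

  // Rebuilds the world as it was at `timestamp` by replaying earlier events.
  stateAt(timestamp: number): Map<string, string> {
    const world = new Map<string, string>();
    for (const e of this.events) {
      if (e.timestamp > timestamp) break;
      if (e.kind === "place") world.set(e.position, e.block);
      else world.delete(e.position);
    }
    return world;
  }
}
```

An LLM whose response arrives late can then be answered against `stateAt(observedAt)` rather than against whatever the world looks like now. Replaying from the start is linear in the number of events, and a longer-running system would likely add periodic snapshots.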
WebSocket connection management: Multiple concurrent users created connection storms. Implemented WebSocket multiplexing with message tagging to reduce connections and CPU usage.
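Message tagging can be sketched as a small envelope plus a router that fans messages out by channel, so one socket serves many dashboard views. The names here (`Envelope`, `encode`, `ChannelDemux`) and the JSON wire format are assumptions for illustration. A real client would call `dispatch` from the socket's message handler.

```typescript
// Hypothetical wire format: every message carries the channel it belongs to.
interface Envelope {
  channel: string; // e.g. "metrics" or "chat"
  payload: unknown;
}

type Handler = (payload: unknown) => void;

function encode(channel: string, payload: unknown): string {
  const envelope: Envelope = { channel, payload };
  return JSON.stringify(envelope);
}

function isEnvelope(value: unknown): value is Envelope {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { channel?: unknown }).channel === "string"
  );
}

class ChannelDemux {
  private handlers = new Map<string, Set<Handler>>();

  // Registers a handler and returns a function that removes it again.
  subscribe(channel: string, handler: Handler): () => void {
    let subs = this.handlers.get(channel);
    if (subs === undefined) {
      subs = new Set<Handler>();
      this.handlers.set(channel, subs);
    }
    subs.add(handler);
    const owned = subs;
    return () => {
      owned.delete(handler);
    };
  }

  // Routes one raw message; returns false if it was malformed or unclaimed.
  dispatch(raw: string): boolean {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch (_err) {
      return false;
    }
    if (!isEnvelope(parsed)) return false;
    const subs = this.handlers.get(parsed.channel);
    if (subs === undefined || subs.size === 0) return false;
    const payload = parsed.payload;
    subs.forEach((handler) => handler(payload));
    return true;
  }
}
```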
Discord rate limits: Multiple agents speaking simultaneously via TTS hit Discord's rate limits. Created a centralized TTS queue with cooldown management.
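A centralized queue with a cooldown can be sketched as follows. `Utterance`, `TtsQueue`, and the polling design are hypothetical, and the cooldown value would have to be tuned against the limits actually observed. The clock is passed in explicitly so the logic stays easy to test.

```typescript
// Hypothetical unit of speech waiting to be synthesized and played.
interface Utterance {
  agentId: string;
  text: string;
}

class TtsQueue {
  private queue: Utterance[] = [];
  private nextAllowedAt = 0;
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // All agents enqueue here instead of speaking directly.
  enqueue(utterance: Utterance): void {
    this.queue.push(utterance);
  }

  // Returns the next utterance to speak, or null while cooling down or empty.
  poll(now: number): Utterance | null {
    if (now < this.nextAllowedAt) return null;
    const next = this.queue.shift();
    if (next === undefined) return null;
    this.nextAllowedAt = now + this.cooldownMs;
    return next;
  }

  get size(): number {
    return this.queue.length;
  }
}
```

A caller would invoke `poll(Date.now())` on a timer and hand any returned utterance to the TTS and voice layer, so that agents speak one at a time in first-in, first-out order.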
Adversarial behavior realism: Our initial hard-coded behaviors felt robotic. Switched to probabilistic models where agents ignore messages randomly, delay responses variably, and act at realistic intervals.
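A probabilistic profile of this kind might look like the sketch below. The 0.5 ignore probability echoes the 50% figure mentioned later in this write-up, but the delay bounds, `BehaviorProfile`, and `decide` are illustrative assumptions rather than AgentArena's actual values or API.

```typescript
// Hypothetical profile parameters for a testing agent.
interface BehaviorProfile {
  name: string;
  ignoreProbability: number; // chance of ignoring an incoming message
  minDelayMs: number;
  maxDelayMs: number;
}

type Decision =
  | { kind: "ignore" }
  | { kind: "respond"; delayMs: number };

// Decides how an agent reacts to one message. The random source is
// injectable so that a seeded generator can make runs reproducible.
function decide(
  profile: BehaviorProfile,
  rng: () => number = Math.random,
): Decision {
  if (rng() < profile.ignoreProbability) return { kind: "ignore" };
  const span = profile.maxDelayMs - profile.minDelayMs;
  const delayMs = profile.minDelayMs + Math.round(rng() * span);
  return { kind: "respond", delayMs };
}

// Example values only: 0.5 follows the write-up, the delays are made up.
const nonCooperator: BehaviorProfile = {
  name: "Non-Cooperator",
  ignoreProbability: 0.5,
  minDelayMs: 1000,
  maxDelayMs: 5000,
};
```

Calling `decide(nonCooperator)` for each incoming message yields either an ignore or a delayed response, so the target LLM cannot rely on a fixed pattern.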
🏆 Accomplishments we're proud of
Complete adversarial testing framework: Full pipeline from test creation to live monitoring to statistical reports, all in a hackathon timeframe.
Six behavioral profiles with realistic pressure: Probability-based behaviors that create unpredictable, realistic challenges. Non-Cooperator breaks blocks others place and complains in chat. Confuser provides contradictory information. Leader delegates tasks and motivates.
Emergent behavior capture: We observed GPT-5 give up on a Non-Cooperator agent and complete the task independently. This was not programmed behavior, but adaptive decision-making under pressure.
Real-time observability: A WebSocket-backed dashboard that streams live metric updates while a test runs.
Research-grade evaluation: Five metrics with statistical analysis, plus behavioral insights showing exactly when and how LLMs adapted their strategies during tests.
📚 What we learned
Multi-agent coordination is complex: Bridging LLM decision intervals (5-10 seconds) with Minecraft's real-time environment (20 ticks/second) required a carefully designed event-driven architecture.
Adversarial agents need probability, not rules: Realistic difficult teammates don't always refuse. They ignore 50% of messages at random, delay responses variably, and create unpredictable challenges LLMs must adapt to.
Event sourcing solves temporal consistency: Storing timestamped events lets LLMs query world state as it was when they last observed it, eliminating stale data issues when responses are delayed.
The best tests reveal unexpected behavior: We wanted to test cooperation but discovered LLMs struggle more with ambiguity resolution, resource prioritization, and graceful degradation when teammates fail.
🚀 What's next for AgentArena
v2.0:
- Deterministic scenario seeds: Reproducible tests for research
- Visual scenario editor: Create tests without coding
- Comparative analysis dashboard: Track model improvements over time
- Video recording: Export test sessions as MP4
- Leaderboard: Compare LLM performance on standardized tests
- Public API: Programmatic test execution for researchers
Long-term vision:
- Multi-world testing with parallel scenarios
- Custom behavioral profile builder
- Cloud deployment with hosted Minecraft servers
- Community scenario marketplace
AgentArena makes the case that LLMs need adversarial testing before production: not to train them, but to understand their limits under realistic pressure.
Built With
- bun-runtime
- claude
- elevenlabs-tts
- elysiajs
- git
- github
- minecraft
- mineflayer
- node.js
- openapi
- openrouter-api
- prisma-orm
- supabase
- typescript
- websocket
- zod-validation