Inspiration

I've always loved AWS Bedrock's model evaluation feature, but honestly, it takes me a moment to wrap my head around the results when I'm comparing models. Then I saw the Frankenstein track for the Kiroween hackathon, and the idea clicked instantly: what if I turned different foundation models into monsters and let them fight 1-on-1? Metric Monsters was born.

The core insight was simple: visually comparing models as battling creatures is way more intuitive and fun than reading benchmark tables. Plus, it captures the competitive spirit of AI model improvements in a way that resonates with anyone who's ever played Pokémon.


What it does

Metric Monsters is an interactive AI Model Battle Arena where:

  1. You pick two models from a roster of 12 AI models (Claude, Llama, Mistral, DeepSeek, Amazon Nova, and others)
  2. You enter a use case - any scenario or task you want to test (e.g., "Generate Code" or "Customer Support")
  3. The system orchestrates a real battle:

    • Generates test prompts using Claude 3.5 Haiku
    • Invokes both models with the prompts
    • Evaluates their responses using an LLM-as-judge (grading correctness, completeness, and creativity)
    • Calculates damage based on actual performance metrics
    • Displays an animated 3D battle sequence with dynamic HP bars
  4. You watch the models fight with attack animations, hit reactions, and victory poses
  5. You see the winner and understand why based on performance metrics

This turns abstract LLM evaluation into a visceral, visual experience that's both educational and entertaining.
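
To make the "damage from performance metrics" step more concrete, here is a minimal sketch of how judge scores could be turned into damage. The 0-10 score range, the weights, and the `MAX_DAMAGE_PER_ROUND` constant are all assumptions for illustration, not the values the project actually uses.

```typescript
// Hypothetical shape of one judge verdict; each score is assumed to be 0..10.
interface JudgeScores {
  correctness: number;
  completeness: number;
  creativity: number;
}

// Assumed weights (they sum to 1) and an assumed damage cap per round.
const WEIGHTS: JudgeScores = { correctness: 0.5, completeness: 0.3, creativity: 0.2 };
const MAX_DAMAGE_PER_ROUND = 40;

// Keep a score inside the expected 0..10 range in case the judge misbehaves.
function clampScore(value: number): number {
  return Math.min(10, Math.max(0, value));
}

// Weighted average of the three criteria, normalized to 0..1.
function overallScore(scores: JudgeScores): number {
  const weighted =
    WEIGHTS.correctness * clampScore(scores.correctness) +
    WEIGHTS.completeness * clampScore(scores.completeness) +
    WEIGHTS.creativity * clampScore(scores.creativity);
  return weighted / 10;
}

// The better the attacker's response was judged, the harder its attack lands.
function damageDealt(attackerScores: JudgeScores): number {
  return Math.round(overallScore(attackerScores) * MAX_DAMAGE_PER_ROUND);
}
```

With this kind of mapping, a model that scores well on every criterion lands close to the maximum hit each round, while a weak response barely scratches the opponent.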


How I built it

Phase 1: Foundation Building

I started by writing out my idea in an IDEA.md file with the complete game concept.

IDEA.md

Then I acquired 3D monster assets and asked Kiro to spin up a Three.js canvas to render them. This foundation was clean and worked beautifully right away.

Render Monsters

Phase 2: Model Evaluation System

For the core evaluation logic, I integrated the AWS Knowledge MCP Server.

Then, I built a quick "model vs. model" comparison page with Kiro. This was probably the most impressive single code generation. MCP made the whole process feel effortless.

Model vs. Model

Phase 3: Full Development

Once both foundations were solid, I dropped my IDEA.md into steering.

Steering

Then, I kicked off full spec-driven development. The spec-driven approach worked well here because the foundational pieces already existed.

Spec

From there, I:

  • Rendered fighting stances and battle positions
  • Set up the 3D environment with animated trees and terrain
  • Built the actual battle sequence with animation callbacks
  • Created a Pokémon-style card interface for model selection
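
As a rough illustration of what a battle sequence driven by animation callbacks can look like, here is a small sketch. Every name in it (`AnimationPlayer`, `runBattle`, `Turn`, and so on) is hypothetical and not taken from the project; the real 3D layer is replaced by an interface so the sequencing logic stands on its own.

```typescript
type Side = "left" | "right";

// One attack in the battle, with damage already computed from the evaluation.
interface Turn {
  attacker: Side;
  damage: number;
}

interface BattleState {
  hp: Record<Side, number>;
  log: string[];
}

// Stand-in for the 3D layer: plays an attack clip, calls onHit at the impact
// moment and onDone when the clip has finished.
interface AnimationPlayer {
  playAttack(attacker: Side, onHit: () => void, onDone: () => void): void;
}

// Higher remaining HP wins; equal HP is a draw (null).
function pickWinner(state: BattleState): Side | null {
  if (state.hp.left === state.hp.right) return null;
  return state.hp.left > state.hp.right ? "left" : "right";
}

// Runs turns one at a time. HP only changes inside onHit, so the HP bar
// drops at the same moment the attack visually connects.
function runBattle(
  turns: Turn[],
  player: AnimationPlayer,
  state: BattleState,
  onFinished: (winner: Side | null) => void,
): void {
  const step = (index: number): void => {
    const turn = turns[index];
    const knockedOut = state.hp.left <= 0 || state.hp.right <= 0;
    if (turn === undefined || knockedOut) {
      onFinished(pickWinner(state));
      return;
    }
    const defender: Side = turn.attacker === "left" ? "right" : "left";
    player.playAttack(
      turn.attacker,
      () => {
        state.hp[defender] = Math.max(0, state.hp[defender] - turn.damage);
        state.log.push(`${turn.attacker} hits ${defender} for ${turn.damage}`);
      },
      () => step(index + 1), // start the next turn only after the clip ends
    );
  };
  step(0);
}
```

Keeping the HP update inside the hit callback, and advancing to the next turn only from the done callback, is one simple way to avoid damage numbers appearing before or after the animation they belong to.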

Phase 4: Polish & Refinement

The final phase was iterative back-and-forth with Kiro on visual design. I asked for Pokémon-vibes card interfaces, tweaked animations, refined the 3D positioning, and ensured smooth transitions between battle states.

Tech Stack

  • Framework: Next.js 14 (App Router)
  • 3D Graphics: Three.js + React Three Fiber
  • Animation: Mixamo FBX animations
  • AI/LLM: AWS Bedrock (multiple models)
  • Integration: AWS Knowledge MCP Server
  • Styling: Tailwind CSS
  • Asset Storage: AWS S3
  • Deployment: Vercel

Challenges I ran into

  1. 3D Asset Management & Performance: Bundling 12 monster models and their animations into the app would bloat the Vercel deployment. Solution: I stored all large 3D assets in AWS S3 and loaded them dynamically at runtime. This required careful URL construction and CORS handling.

  2. Animation Synchronization: Getting attack animations to sync with the on-screen damage numbers and the underlying damage calculations required precise callback timing and state management. The BattleEngine needed careful orchestration.

  3. Real-time Evaluation: Keeping the battle animation smooth while waiting for AWS Bedrock API responses (which can take 5-10 seconds) required thoughtful UX design with loading states and placeholders.
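
For the first challenge above, the URL construction part can be sketched roughly like this. The bucket URL, the folder layout, and the file names are placeholders I made up for the example, not the project's real paths. Note that the bucket also needs a CORS rule that allows GET requests from the site's origin, otherwise the browser will block the loader.

```typescript
// Placeholder base URL; the real bucket name and region are not shown here.
const ASSET_BASE_URL = "https://example-bucket.s3.amazonaws.com";

// Joins path segments onto a base URL, encoding each segment so that spaces
// or other special characters in a monster name do not break the URL.
function assetUrl(baseUrl: string, ...segments: string[]): string {
  const trimmedBase = baseUrl.replace(/\/+$/, "");
  const path = segments.map((segment) => encodeURIComponent(segment)).join("/");
  return `${trimmedBase}/${path}`;
}

// Assumed layout: monsters/<id>/model.fbx
function monsterModelUrl(monsterId: string): string {
  return assetUrl(ASSET_BASE_URL, "monsters", monsterId, "model.fbx");
}

// Assumed layout: monsters/<id>/animations/<clip>.fbx
function monsterAnimationUrl(monsterId: string, clip: string): string {
  return assetUrl(ASSET_BASE_URL, "monsters", monsterId, "animations", `${clip}.fbx`);
}
```

Building every asset URL through one helper keeps the bucket location in a single place, so moving the assets later only means changing one constant.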


Accomplishments that I'm proud of

✅ Successfully integrated Three.js with React in a way that feels natural and performs well

✅ Created a battle system where damage calculations are based on actual LLM evaluation metrics, not just random numbers

✅ The neobrutalist card design + 3D battle animations create a cohesive, memorable experience

✅ The system works with 12 foundation models from different providers (Anthropic, Meta, Mistral, Amazon, DeepSeek)


What I learned

  1. Spec-Driven Development Works (When Foundations Exist): The harsh truth I discovered on this project is that the spec-driven approach only worked well once the foundational pieces were already built. But when they are, it's incredibly powerful and keeps the project moving fast.

  2. MCP is a Game-Changer: Integrating the AWS Knowledge MCP Server made complex multi-step interactions feel effortless. It's a paradigm shift for how we think about extending AI capabilities.

  3. Vibe Coding is Still Valuable: Sometimes the best approach is iterative "vibe coding" with back-and-forth refinement, especially for visual design and UX polish. Not everything needs a perfect spec upfront.


What's next for Metric Monsters

  • Custom Scoring: Let users define their own evaluation criteria (e.g., "Prioritize speed over accuracy")
  • Battle History: Track past battles and show win rates/statistics for each model
  • Custom Monsters: Let users upload their own 3D models to represent specific models
  • Leaderboard: Global rankings of model performance across different use cases

Built With

  • aws-bedrock
  • kiro
  • nextjs