Inspiration

As we race to deploy teams of AI agents into the real world, we're operating in the dark. Current benchmarks measure models in a vacuum, ignoring the complex, high-stakes social dynamics that will define their success or failure in a cooperative environment. This isn't just a gap in testing; it's a critical vulnerability, leaving us blind to significant risks in collaboration, deception, and security. We recognized the urgent need for a new generation of benchmarks that measure:

  • Privacy/Security Capabilities: How effectively do agents protect sensitive information?
  • Collaboration Capabilities: How well can agents work together toward shared goals?
  • Deception Capabilities: What are their skills in strategic manipulation and persuasion?
  • Trust Dynamics: How is trust built and maintained in multi-agent environments?
  • Strategic Reasoning: How do agents make complex decisions in competitive and cooperative scenarios?

Inspired by pioneering projects like Gandalf the Red, which crowdsources prompt injections, and ChatEval, which evaluates LLMs with existing metrics, we decided to synthesize the best of both. Our platform, Frodo, pushes the boundaries of AI evaluation by allowing users to benchmark their own models in complex social scenarios: from security tests where a model must hide a password from a human user, to cooperative tasks with distinct evaluation criteria.

What it does

Frodo is a Multi-Agent Collaboration Benchmark Platform designed to evaluate advanced AI capabilities far beyond traditional LLM benchmarks. The system creates realistic scenarios where multiple AI agents must collaborate, negotiate, and make collective decisions, testing their ability to work together in complex social situations that mirror human interaction.

Core Functionality:

  • Multi-Agent Simulations: Runs scenarios with 2-4 AI agents that must collaborate to solve problems.
  • Scenario Framework: Utilizes configurable JSON-based scenarios covering diverse domains like resource management, auctions, negotiations, and crisis response.
  • Real-time Collaboration: Agents communicate through a moderated deliberation process, enabling genuine interaction.
  • Evaluation Engine: Provides automated assessment of collaboration quality, trustworthiness, and collective utility.
  • Interactive & Automated Modes: Supports both user-interactive sessions and fully automated multi-agent simulations.

Game Types:

  • Secret Game: A classic security test pitting an interrogator against a secret-keeper.
  • Collaboration Games: Multi-agent scenarios requiring cooperation, negotiation, and collective decision-making where an AI moderator oversees agent interactions and evaluates outcomes based on predefined rubrics.

How we built it

Architecture:

Our platform is built on a robust and scalable architecture designed for complex multi-agent simulations.

  • Backend: A Flask-based REST API handles all session management and agent coordination.
  • Frontend: A responsive React.js web interface allows for session creation, model selection, and real-time user interaction.
  • Agent Framework: A modular agent system built around a BaseAgent abstract class allows for easy integration of different models and specialized agent behaviors.
  • Communication: We use the uAgents framework for asynchronous agent-to-agent messaging and coordination, ensuring reliable real-time interaction.

Key Components:

  • Agent System:
    • BaseAgent: An abstract base class with standardized API integration for models like Google Gemini and Hugging Face.
    • CollabAgent: A specialized agent designed for collaborative scenarios with-in deliberation protocols.
    • SimpleModerator: An agent that oversees multi-agent sessions and evaluates outcomes.
    • SecretKeeperAgent/InterrogatorAgent: Agents designed for the traditional secret-guessing games.
  • Session Management:
    • GameSession: A central state management class for all game types, tracking conversation history and outcomes.
    • Supports both synchronous and asynchronous execution to handle various simulation needs.
  • Scenario Framework:
    • Scenarios are defined in JSON, including individual agent prompts, decision instructions, and evaluation rubrics.
    • The framework is flexible enough to support complex multi-agent scenarios involving four or more agents.

Technical Stack:

  • Python: Core backend logic and agent implementations.
  • Flask: REST API for session management.
  • React: Frontend interface for user interaction.
  • uAgents (FetchAI): Asynchronous agent communication framework.
  • Google Gemini API: Primary LLM integration.
  • HuggingFace: Alternative model support.
  • MySQL: Session persistence and analytics.

Deployment:

  • Docker: The entire application is containerized using Dockerfile for easy and reproducible deployment.
  • Google Cloud: Designed for cloud deployment with scripts and storage integration.

Challenges we ran into

Our primary challenge was standardization on multiple fronts. Integrating various LLM APIs required us to build a flexible data ingestion pipeline to normalize differently formatted responses in real-time. Simultaneously, containerizing the system with Docker, while crucial for scalability, demanded a highly generalized framework to process diverse agents and scenarios uniformly. This pursuit of standardization was compounded by the difficulty of designing meaningful evaluation metrics for abstract concepts like "trust" and "deception," which required moving beyond simple keyword matching to develop more nuanced, context-aware scoring mechanisms.

Accomplishments that we're proud of

We are particularly proud of creating a dynamic environment where multiple LLM agents can interact and deliberate in real-time. Building a platform that not only supports these complex interactions but also allows users to easily benchmark their own custom models makes it a truly open and adaptable evaluation tool. Successfully containerizing this entire system and deploying it on Google Cloud was another key victory, ensuring our research environment is both reproducible and robust.

What we learned

This project was a deep dive into the complexities of multi-agent systems. We learned that evaluating LLMs on social dynamics is fundamentally different from traditional benchmarks. It requires a new way of thinking about what "success" means. We also gained invaluable experience in API integration, real-time data processing, and full-stack development. Most importantly, we learned that the future of AI safety and collaboration hinges on our ability to understand and measure these complex interactive behaviors.

What's next for Frodo

We're excited about the future of Frodo and have a clear roadmap for what's next:

  • Auto-Evaluator Modules: We plan to implement real-time accuracy scoring and other automated assessments of model behavior.
  • Enhanced UI/UX: We will be adding an "attack success" meter to visualize deception and a clear audit trail for all interactions.
  • Industry-Standard Benchmarks: We will integrate open evaluation datasets like MT-Bench and TruthfulQA to align with established standards.
  • Logging and Telemetry Dashboard: A comprehensive dashboard will provide visibility into system behavior, user interaction patterns, and model outcomes.

Built With

Share this project:

Updates