Chorus: The Immune System for AI Agents
Predict agent conflicts before they cascade.
🎯 Problem Statement
As AI agents increasingly operate in decentralized environments—autonomous trading bots, smart city infrastructure, robotic swarms—they create unpredictable feedback loops and cascading failures.
Consider this scenario: Agent A detects low inventory and orders supplies. Agent B, seeing the same signal, does the same. Agent C observes the sudden demand spike and raises prices. The system spirals into a deadlock—or worse, a market crash.
Current solutions fail because:
- Traditional monitoring is reactive, not predictive
- Centralized orchestrators become single points of failure
- Few existing systems apply Game Theory to multi-agent conflict detection
The cost of inaction: Cascading failures in autonomous systems can cause millions in damages, safety incidents, and complete system collapse.
💡 Solution
Chorus is a real-time AI safety layer that acts as an "immune system" for multi-agent networks. It:
- Observes agent interactions via high-throughput event streaming
- Predicts conflicts using Game Theory analysis powered by Google Gemini
- Intervenes automatically by quarantining risky agents before failures cascade
- Alerts operators with voice notifications for critical incidents
Unlike traditional monitoring, Chorus is proactive—it predicts and prevents failures rather than just observing them.
🛠️ Services Used
Google Gemini 3 Pro ⭐ (Core Intelligence)
- Role: Primary conflict prediction engine
- Implementation: Direct API integration via the `google-generativeai` SDK
- How it works: Batched agent intentions are sent to Gemini for Game Theory analysis. The model calculates Nash Equilibria and detects non-cooperative behaviors (resource hoarding, deadlocks) in <50ms.
- Key Feature: Generates quantitative risk scores (0-100) that drive automated quarantine decisions
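A minimal sketch of how the prediction step could be wired, assuming a JSON-reply prompt and a simple parser; the prompt wording, field names, and the fail-closed default are illustrative, not the project's actual implementation:

```python
import json
import re

def build_conflict_prompt(intentions):
    """Serialize a batch of agent intentions into a game-theoretic analysis prompt."""
    body = json.dumps(intentions, indent=2)
    return (
        "You are a game-theory analyst for a multi-agent system.\n"
        "For the agent intentions below, detect non-cooperative behavior\n"
        "(resource hoarding, deadlock risk) and reply with JSON:\n"
        '{"risk_score": <integer 0-100>, "rationale": "..."}\n\n' + body
    )

def parse_risk_score(model_text):
    """Extract the 0-100 risk score from the model's reply, clamped to range."""
    match = re.search(r'"risk_score"\s*:\s*(\d+)', model_text)
    # Fail closed: an unparseable reply is treated as maximum risk.
    score = int(match.group(1)) if match else 100
    return max(0, min(100, score))

# The actual call (requires `pip install google-generativeai` and an API key;
# model name depends on deployment):
# import google.generativeai as genai
# model = genai.GenerativeModel("gemini-pro")
# reply = model.generate_content(build_conflict_prompt(batch))
# score = parse_risk_score(reply.text)
```

Keeping the parser separate from the API call makes the scoring path unit-testable without network access.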
Confluent Kafka ⭐ (Event Streaming Backbone)
- Role: High-throughput message bus for agent communication
- Implementation:
  - `agent-messages-raw`: Agents publish intentions
  - `agent-decisions-processed`: Backend publishes intervention decisions
  - `system-alerts`: Critical notifications
- Throughput: 1,000+ messages/second
- Why Confluent: Decouples high-velocity agent streams from analysis. Enables Event Sourcing for post-mortem failure analysis.
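As a sketch of the publishing side, assuming a JSON message schema (the field names here are illustrative, not the project's actual schema):

```python
import json
import time

TOPIC_INTENTIONS = "agent-messages-raw"  # topic name from the project

def encode_intention(agent_id, action, payload):
    """Build the JSON message an agent publishes before acting."""
    return json.dumps({
        "agent_id": agent_id,
        "action": action,
        "payload": payload,
        "ts": time.time(),  # producer-side timestamp for event sourcing
    }).encode("utf-8")

# Publishing (requires `pip install confluent-kafka` and a reachable broker):
# from confluent_kafka import Producer
# producer = Producer({"bootstrap.servers": "localhost:9092"})
# producer.produce(
#     TOPIC_INTENTIONS,
#     key=b"agent-7",  # keying by agent keeps one agent's events ordered
#     value=encode_intention("agent-7", "order_supplies", {"sku": "X1", "qty": 500}),
# )
# producer.flush()
```

Keying messages by agent ID preserves per-agent ordering within a partition, which matters when replaying the immutable log for post-mortems.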
Datadog ⭐ (Observability & Trust Verification)
- Role: Real-time monitoring and alerting
- Implementation:
  - Custom metrics: `agent.trust_score`, `system.conflict_risk`, `intervention.count`
  - APM tracing for Conflict Prediction Engine latency
  - Live dashboards for swarm health visualization
- Why Datadog: Provides a "trust verification layer"—proving to operators that the system is functioning correctly and enabling root-cause analysis.
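One way the per-cycle metrics could be emitted, sketched here with the metric names from the project; the helper function and tag format are assumptions:

```python
def cycle_metrics(trust_scores, conflict_risk, interventions):
    """Flatten one prediction cycle into (metric, value, tags) tuples for DogStatsD."""
    metrics = [
        ("system.conflict_risk", conflict_risk, []),
        ("intervention.count", interventions, []),
    ]
    metrics += [
        ("agent.trust_score", score, [f"agent:{agent_id}"])
        for agent_id, score in trust_scores.items()
    ]
    return metrics

# Shipping them (requires `pip install datadog` and a running Datadog agent):
# from datadog import statsd
# for name, value, tags in cycle_metrics({"agent-7": 62.5}, 41.0, 1):
#     statsd.gauge(name, value, tags=tags)
```

Building the tuple list first keeps the metric shape testable without a live agent.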
ElevenLabs ⭐ (Voice-First Incident Response)
- Role: Voice alerts for critical failures
- Implementation: Converts structured alert JSON into natural-language narrations using the `eleven_multilingual_v2` model
- Voice ID: `21m00Tcm4TlvDq8ikWAM` (Rachel)
- Why ElevenLabs: Critical failures in autonomous systems require immediate attention. Voice alerts reduce operator reaction time by explaining exactly why an agent was quarantined.
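A sketch of the alert-to-speech path. The alert field names and narration wording are illustrative; the REST endpoint and payload shape follow the public ElevenLabs text-to-speech API, but verify against current docs before relying on them:

```python
def narrate_alert(alert):
    """Turn a structured quarantine alert into a spoken-word narration."""
    return (
        f"Critical alert. Agent {alert['agent_id']} has been quarantined. "
        f"Reason: {alert['reason']}. "
        f"Conflict risk was {alert['risk_score']} out of 100."
    )

# Synthesis via the ElevenLabs REST API (requires an API key):
# import requests
# VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # "Rachel", as used in the project
# resp = requests.post(
#     f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
#     headers={"xi-api-key": API_KEY},
#     json={"text": narrate_alert(alert), "model_id": "eleven_multilingual_v2"},
# )
# audio_bytes = resp.content  # audio stream for playback
```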
🏗️ Architecture
```
Agent Network → Kafka Streaming → Gemini Analysis → Trust Scoring → Intervention → Voice Alerts
      ↓               ↓                 ↓                ↓               ↓              ↓
  Simulation    Event Sourcing    Risk Scoring      Redis Store     Quarantine    ElevenLabs
```
Data Flow:
- Agents publish actions to Confluent Kafka (`agent-messages-raw`)
- Backend batches intentions and sends them to Gemini for Game Theory analysis
- Trust scores updated in Redis with sub-millisecond latency
- Metrics pushed to Datadog on every prediction cycle
- Critical alerts trigger ElevenLabs voice synthesis
- Real-time state pushed to React dashboard via WebSockets
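The trust-scoring step in the flow above could look like the following; the exponential-moving-average update rule, the `alpha` value, and the Redis hash name are assumptions for illustration, not the project's documented formula:

```python
def update_trust(prev_trust, risk_score, alpha=0.3):
    """Blend the latest risk score into the running trust score.

    EMA of inverted risk: high risk drags trust down, low risk restores it.
    With prev_trust and risk_score both in [0, 100], the result stays in range.
    """
    return round((1 - alpha) * prev_trust + alpha * (100 - risk_score), 2)

# Persisting (requires `pip install redis` and a Redis server; hash name is illustrative):
# import redis
# r = redis.Redis()
# r.hset("trust:scores", "agent-7", update_trust(80.0, 90))
```

A single Redis hash keeps all reads and writes on one authority, matching the "single source of truth" approach described later in the Challenges section.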
💭 Inspiration
We were inspired by the human immune system—a decentralized network that detects and neutralizes threats without a central controller. As AI agent systems grow in complexity (autonomous vehicles, DeFi bots, industrial automation), we realized they need the same kind of self-regulating safety mechanism.
The question that drove us: "What happens when AI agents start working together—and against each other?"
📚 What We Learned
- Game Theory is powerful for AI safety: Nash Equilibrium calculations can predict agent conflicts before they manifest. Gemini's reasoning capabilities made this tractable in real-time.
- Event Sourcing is essential: Confluent Kafka's immutable log allows us to "replay" failures for post-mortem analysis—crucial for understanding emergent behaviors.
- Voice alerts reduce cognitive load: In high-stress situations, operators respond faster to spoken explanations than dashboards full of metrics.
- Trust must be dynamic: Static access control fails in multi-agent systems. Continuous trust scoring based on behavior is the only scalable approach.
🔨 How We Built It
Backend (Python/FastAPI):
- Conflict Prediction Engine with Gemini 3 Pro integration
- Trust Management System with Redis persistence
- Intervention Engine with automated quarantine logic
- WebSocket server for real-time dashboard updates
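The Intervention Engine's quarantine decision might reduce to a rule like the following; the threshold values, the trust cutoff, and the manual-override mechanism are illustrative sketches of the described behavior, not the tuned production logic:

```python
QUARANTINE_THRESHOLD = 75  # illustrative; the real threshold was tuned against false positives

def decide_intervention(agent_id, trust, risk_score, overrides=frozenset()):
    """Quarantine when risk is high and trust is low, unless an operator override is set."""
    if agent_id in overrides:
        return "allow"  # manual override, added after early over-aggressive quarantines
    if risk_score >= QUARANTINE_THRESHOLD and trust < 50:
        return "quarantine"
    return "monitor"
```

Requiring both a high risk score and a low trust score means one anomalous prediction cycle cannot quarantine a historically well-behaved agent on its own.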
Frontend (React/TypeScript):
- Real-time trust visualization with color-coded agent cards
- Conflict alerts as toast notifications and dashboard panels
- System health monitoring (Redis, Gemini, Kafka status)
- Cyberpunk "Glassmorphism" aesthetic with neon accents
Infrastructure:
- Dockerized deployment with single-command launch
- Kubernetes-ready with Helm charts
- Comprehensive test suite (260+ tests, 92.7% pass rate)
- Property-based testing with Hypothesis for correctness invariants
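A property-based test in the style mentioned above might assert a range invariant on trust scoring; the update function here is a stand-in, not the project's actual code:

```python
from hypothesis import given, strategies as st

def blend_trust(prev_trust, risk_score, alpha=0.3):
    """Illustrative trust update: EMA of inverted risk, expected to stay in [0, 100]."""
    return (1 - alpha) * prev_trust + alpha * (100 - risk_score)

score = st.floats(min_value=0, max_value=100)

@given(prev=score, risk=score)
def test_trust_stays_in_range(prev, risk):
    # Hypothesis generates ~100 (prev, risk) pairs and checks the invariant on each.
    assert 0.0 <= blend_trust(prev, risk) <= 100.0

test_trust_stays_in_range()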
🚧 Challenges We Faced
- Sub-50ms Prediction Latency: Getting Gemini to return Game Theory analysis fast enough for real-time intervention required careful prompt engineering and request batching.
- Trust Score Consistency: In a distributed system, maintaining consistent trust scores across components was challenging. We solved this with Redis as a single source of truth.
- False Positive Quarantines: Early versions quarantined too aggressively. We tuned confidence thresholds and added manual override capabilities.
- Voice Alert Timing: Generating voice alerts added latency. We made ElevenLabs calls asynchronous so they don't block critical intervention actions.
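The asynchronous voice-alert pattern described in the last point can be sketched with `asyncio`; the function names and the placeholder synthesis coroutine are illustrative:

```python
import asyncio

async def synthesize_alert(text):
    """Stand-in for the slow ElevenLabs round-trip (a network call in the real system)."""
    await asyncio.sleep(0)
    return f"audio:{text}"

async def handle_incident(alert_text, quarantine):
    """Run the critical quarantine action first; kick off voice synthesis without awaiting it."""
    quarantine()  # the intervention never waits on TTS latency
    return asyncio.create_task(synthesize_alert(alert_text))

async def demo():
    events = []
    task = await handle_incident("agent-7 quarantined", lambda: events.append("quarantined"))
    events.append(await task)  # the dashboard can await the audio later
    return events
```

`asyncio.create_task` schedules the synthesis on the event loop and returns immediately, so the quarantine completes regardless of how long the voice call takes.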
📊 Validation & Results
- ✅ 90.9% system validation success rate
- ✅ 260+ automated tests
- ✅ <50ms conflict prediction latency
- ✅ 1,000+ agents tested concurrently
- ✅ 10,000+ events/second throughput
🔗 Links & Resources
- Live Demo: `./run_frontend_demo.sh`
- Full Documentation: `/docs/`
- Tech Stack: Python, FastAPI, React, TypeScript, Redis, Confluent Kafka, Google Gemini, Datadog, ElevenLabs
Chorus Team — December 2025
Built With
- confluent
- datadog
- gemini-3-pro
- google-cloud
- kafka
