Cerberus: Scalable Dynamic Multi-Agent Observability Platform
Submission Details
| Requirement | Link |
|---|---|
| GitHub Repository | https://github.com/BrianIsaac/Cerberus |
| Datadog Organisation | AI Singapore (Region: ap1) |
| Public Dashboard | https://p.ap1.datadoghq.com/sb/9c600638-dc7c-11f0-b6b4-561844e885ae-c547a3d20805be1c166be03cd945f6d3 |
Hosted Applications
| Application | URL |
|---|---|
| Ops Assistant Frontend | https://ops-assistant-frontend-i4ney2dwya-uc.a.run.app |
| SAS Query Generator UI | https://sas-query-generator-i4ney2dwya-uc.a.run.app |
| Dashboard Enhancer UI | https://dashboard-enhancer-ui-i4ney2dwya-uc.a.run.app |
Repository Contents
| Requirement | Location |
|---|---|
| License | LICENSE (Apache 2.0) |
| Deployment Instructions | README.md |
| Datadog Configurations | submission/dashboard.json, submission/monitors.json, submission/slos.json |
| Traffic Generator | scripts/traffic_gen.py |
| Evidence Screenshots | images/ |
Inspiration
The rapid adoption of AI agents in production environments has outpaced our ability to observe and govern them effectively. While traditional services have mature observability practices, AI agents introduce unique challenges: non-deterministic behaviour, unbounded execution loops, hallucination risks, and the need for human-in-the-loop controls.
We were inspired by the research paper "Measuring Agents in Production" (Pan et al., 2025), which reports that 68% of production agents execute at most 10 steps before requiring human intervention. Yet many agents still lack explicit budget controls. We asked ourselves: what if governance constraints weren't an afterthought, but the foundation of observability itself?
The second inspiration came from a common pain point: every new AI agent requires manual dashboard creation, monitor configuration, and SLO setup. This doesn't scale. We envisioned an AI agent that could analyse other agents and automatically provision personalised observability—zero-touch onboarding for the AI agent fleet.
What it does
Cerberus is a scalable, self-referential observability platform for AI agents. It delivers two core innovations:
1. Governance-First Agent Onboarding
Every agent onboarded to Cerberus automatically inherits bounded autonomy controls:
- Step budgets (max 8 steps) to prevent runaway agents
- Tool call limits (max 6) and model call limits (max 5)
- Confidence thresholds (0.7) that trigger human escalation
- Security validation (prompt injection detection, PII scanning)
- Human-in-the-loop approval gates for high-impact actions
These governance constraints aren't just guardrails—they emit metrics (ai_agent.governance.*) that feed directly into SLOs and detection rules.
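As a rough illustration of the bounded autonomy controls listed above, the sketch below tracks the three budgets and the confidence threshold in plain Python. The class name `GovernanceBudget` and its methods are hypothetical and are not taken from the repository's shared/governance/ module. Treating a confidence of exactly 0.7 as acceptable is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class GovernanceBudget:
    """Hypothetical budget tracker; default limits mirror the list above."""

    max_steps: int = 8
    max_tool_calls: int = 6
    max_model_calls: int = 5
    confidence_threshold: float = 0.7
    steps: int = 0
    tool_calls: int = 0
    model_calls: int = 0

    def record(self, kind: str) -> None:
        """Count one unit of work; kind is 'step', 'tool', or 'model'."""
        if kind == "step":
            self.steps += 1
        elif kind == "tool":
            self.tool_calls += 1
        elif kind == "model":
            self.model_calls += 1
        else:
            raise ValueError(f"unknown kind: {kind}")

    def exceeded(self) -> list[str]:
        """Return the names of every budget that has been exceeded."""
        violations = []
        if self.steps > self.max_steps:
            violations.append("steps")
        if self.tool_calls > self.max_tool_calls:
            violations.append("tool_calls")
        if self.model_calls > self.max_model_calls:
            violations.append("model_calls")
        return violations

    def should_escalate(self, confidence: float) -> bool:
        """Escalate to a human on low confidence or any budget violation."""
        return confidence < self.confidence_threshold or bool(self.exceeded())
```

In a real agent, each violation returned by `exceeded()` would be a natural place to emit an `ai_agent.governance.*` metric.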
2. Dynamic Personalised Observability
The Dashboard Enhancement Agent analyses new agents and automatically:
- Discovers workflow operations, LLM calls, and tool invocations from code and telemetry
- Proposes domain-specific metrics using Gemini (not generic infrastructure metrics)
- Provisions span-based metrics in Datadog automatically
- Designs a personalised widget group and adds it to the fleet dashboard
The platform includes three production AI agents demonstrating these capabilities:
- Ops Assistant: Triages incidents by querying Datadog metrics, logs, and traces
- SAS Generator: Generates SAS code from natural language with quality evaluation
- Dashboard Enhancer: The meta-agent that onboards other agents
How we built it
Architecture: Three-layer design with AI agents (FastAPI + LangGraph), MCP tool servers (FastMCP), and shared modules for observability and governance.
LLM Integration: Google Gemini via Vertex AI powers all agents—Gemini 2.0 Flash for the Dashboard Enhancer and SAS Generator, Gemini 1.5 Flash for the Ops Triage Agent. We use structured outputs with JSON response schemas for reliable parsing.
Agent Framework: LangGraph provides the state machine backbone. The Ops Triage Agent uses a 7-node workflow (intake → escalate → collect → synthesis → approval → writeback → complete) with conditional routing based on budget checks and confidence scores.
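The conditional routing described above can be sketched as a plain routing function of the kind a LangGraph conditional edge accepts. This is an illustration only: the state fields, the choice of the synthesis node as the branching point, and the function name are assumptions, not the project's actual graph definition.

```python
from typing import TypedDict

# Limits taken from the governance defaults described earlier.
MAX_STEPS = 8
CONFIDENCE_THRESHOLD = 0.7


class TriageState(TypedDict):
    """Hypothetical subset of the workflow state used for routing."""

    steps_used: int
    confidence: float


def route_after_synthesis(state: TriageState) -> str:
    """Pick the next node name based on budget and confidence checks."""
    if state["steps_used"] > MAX_STEPS:
        return "escalate"  # budget exhausted: hand over to a human
    if state["confidence"] < CONFIDENCE_THRESHOLD:
        return "escalate"  # low confidence: hand over to a human
    return "approval"  # otherwise continue to the approval gate
```

Keeping the router a pure function of state makes it easy to unit test without building the full graph.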
Tool Interface: Model Context Protocol (MCP) via FastMCP enables clean separation between agents and their tools. Three MCP servers expose Datadog APIs, SAS data tools, and dashboard management operations.
Observability Stack:
- Datadog APM via ddtrace-run for automatic instrumentation
- LLM Observability with workflow → agent → tool span hierarchy
- Custom metrics via DogStatsD using standardised ai_agent.* prefix
- RAGAS evaluations for faithfulness and answer relevancy
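To make the custom-metric path concrete, here is a minimal sketch of the DogStatsD wire format (`name:value|type|#tag:value,...`) using only the standard library. The metric name `ai_agent.governance.steps_used`, the tag values, and the UDP transport on port 8125 are assumptions for illustration. The project itself may use the Datadog client library and a Unix socket instead.

```python
import socket
from typing import Mapping, Optional


def format_dogstatsd(
    name: str,
    value: float,
    metric_type: str = "c",
    tags: Optional[Mapping[str, str]] = None,
) -> str:
    """Build one DogStatsD datagram, e.g. 'name:1|c|#env:prod'."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort tags so the output is deterministic.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send a datagram over UDP; 8125 is the default DogStatsD port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Example: a gauge carrying a hypothetical governance metric.
datagram = format_dogstatsd(
    "ai_agent.governance.steps_used",
    3,
    "g",
    {"service": "ops-assistant", "env": "prod"},
)
```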
Deployment: Google Cloud Run with Datadog Agent sidecar containers. Multi-container pattern enables trace collection and metric forwarding without code changes.
Shared Modules: shared/observability/ and shared/governance/ provide reusable components. Agents import factory functions and get consistent telemetry and bounded autonomy out of the box.
Challenges we ran into
Cloud Run Sidecar Orchestration: Getting the Datadog Agent sidecar to start before the main application required careful container dependency configuration. We solved this with Knative's container-dependencies annotation and shared memory volumes for socket communication.
LLMObs Agentless vs Sidecar Mode: Running agentless LLM Observability (for span submission) alongside sidecar-based APM tracing required conditional configuration. LLM Observability spans go directly to Datadog, while APM traces and DogStatsD metrics flow through the sidecar.
SLO Tag Indexing: Fleet-wide SLOs using team:ai-agents required ensuring the tag was indexed in Datadog before SLO creation. We learned that metric-based SLOs need explicit tag configuration.
Gemini JSON Response Reliability: Early iterations had parsing failures when Gemini returned malformed JSON. Adding response_mime_type="application/json" and Pydantic response schemas dramatically improved reliability.
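Even with a JSON MIME type and a response schema, it can help to parse model output defensively. The sketch below uses the standard library as a stand-in for the Pydantic schemas mentioned above. The `MetricProposal` shape and its field names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricProposal:
    """Hypothetical shape of one metric proposed by the model."""

    name: str
    description: str


def parse_proposal(raw: str) -> Optional[MetricProposal]:
    """Return a validated proposal, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: caller can retry or escalate
    if not isinstance(data, dict):
        return None
    name = data.get("name")
    description = data.get("description")
    if not isinstance(name, str) or not isinstance(description, str):
        return None  # missing or wrongly typed fields
    return MetricProposal(name=name, description=description)
```

Returning `None` rather than raising lets the caller decide whether to retry the model call or count the failure against a governance budget.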
Service-to-Service Authentication: Cloud Run services calling each other required GCP identity tokens. We implemented automatic token fetching from the metadata server with graceful fallback for local development.
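The token fetch with local fallback might look roughly like the sketch below. The metadata endpoint and the `Metadata-Flavor: Google` header are the standard GCP ones. The function names, the injectable `opener` parameter, and the choice to send no auth header when running locally are assumptions made for illustration.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional

METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def fetch_identity_token(
    audience: str,
    timeout: float = 2.0,
    opener: Callable = urllib.request.urlopen,
) -> Optional[str]:
    """Fetch an identity token for the target service URL (the audience)."""
    query = urllib.parse.quote(audience, safe="")
    request = urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?audience={query}",
        headers={"Metadata-Flavor": "Google"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError):
        # No metadata server reachable, e.g. local development.
        return None


def auth_headers(
    audience: str, opener: Callable = urllib.request.urlopen
) -> Dict[str, str]:
    """Return a Bearer header on GCP, or no header when running locally."""
    token = fetch_identity_token(audience, opener=opener)
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Passing the opener as a parameter keeps the fallback path testable without network access.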
Governance Metric Cardinality: Initial designs emitted too many tag combinations, risking metric cardinality explosion. We standardised on a minimal tag set (service, team, agent_type, env) across all agents.
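One simple way to enforce a minimal tag set is an allow-list applied before metrics are emitted. The sketch below assumes the four tag keys named above. The function name and the sample values are hypothetical.

```python
from typing import List, Mapping

# The minimal tag set described above; anything else is dropped.
ALLOWED_TAGS = ("service", "team", "agent_type", "env")


def normalise_tags(tags: Mapping[str, str]) -> List[str]:
    """Keep only allow-listed tags, in a fixed order, as 'key:value'."""
    return [f"{key}:{tags[key]}" for key in ALLOWED_TAGS if key in tags]


# A high-cardinality tag such as user_id is silently discarded.
example = normalise_tags(
    {"service": "sas-generator", "user_id": "42", "env": "prod", "team": "ai-agents"}
)
```

Since the number of distinct time series grows with the product of distinct values per tag, dropping unbounded keys such as user or request identifiers keeps that product small.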
Accomplishments that we're proud of
Self-Referential Observability: The Dashboard Enhancer agent is fully observable through the same platform it provisions. It's agents all the way down.
Zero-Touch Onboarding: A new agent can be analysed, have custom metrics provisioned, and receive a personalised dashboard widget group—all through a single API call or UI interaction.
Governance as SLOs: We turned bounded autonomy into measurable targets. "99% of requests within step budgets" is now a tracked SLO, not just a hope.
6 Detection Rules + 4 SLOs: Comprehensive coverage including escalation rate monitoring, PII detection alerts, and governance budget tracking—all using fleet-wide queries with per-agent drill-down.
Research-Backed Defaults: Our governance limits (8 steps, 5 model calls, 6 tool calls) are grounded in empirical research, not arbitrary numbers.
Production-Ready Architecture: The sidecar pattern, shared modules, and factory scripts mean onboarding a new agent takes minutes, not days.
What we learned
Governance enables observability, not the other way around: By building governance controls first, observability signals emerged naturally. Budget violations, escalations, and approval decisions all became metrics without additional instrumentation.
MCP is powerful for tool isolation: Separating Datadog API operations into MCP servers made agents cleaner and tools reusable. The same dashboard MCP server powers both the Ops Assistant and Dashboard Enhancer.
LLM-as-judge scales quality evaluation: Using Gemini to evaluate code quality and propose domain-specific metrics proved surprisingly effective. The key is structured prompts with clear evaluation criteria.
Standardisation compounds: The shared/ modules paid dividends quickly. Once we had emit_request_complete() working correctly, every agent got consistent metrics instantly.
Fleet thinking beats service thinking: Designing for team:ai-agents from day one meant monitors and SLOs automatically included new agents. No manual updates required.
What's next for Cerberus
Automated Incident Response: Extend the Ops Assistant to not just triage incidents but take remediation actions—scaling services, toggling feature flags, or rolling back deployments—with appropriate approval gates.
Cross-Agent Collaboration: Enable agents to delegate sub-tasks to other agents in the fleet, with trace correlation across the handoff.
Continuous Quality Monitoring: Run RAGAS evaluations on a sample of production traffic automatically, not just during development.
Governance Policy Language: Create a declarative format for governance policies that can be version-controlled and applied across agents.
Open Source Release: Package the shared/ modules as a standalone library so other teams can adopt governance-first observability patterns.
Multi-Cloud Support: Extend beyond Cloud Run to Kubernetes, AWS Lambda, and Azure Container Apps with appropriate sidecar or daemonset patterns.
Built With
- datadog
- fastapi
- fastmcp
- gemini
- google-cloud
- langgraph
- python
- ragas
- streamlit
- uv
- vertexai