Cerberus: Scalable Dynamic Multi-Agent Observability Platform

Submission Details

Requirement           Link
GitHub Repository     https://github.com/BrianIsaac/Cerberus
Datadog Organisation  AI Singapore (Region: ap1)
Public Dashboard      https://p.ap1.datadoghq.com/sb/9c600638-dc7c-11f0-b6b4-561844e885ae-c547a3d20805be1c166be03cd945f6d3

Hosted Applications

Application             URL
Ops Assistant Frontend  https://ops-assistant-frontend-i4ney2dwya-uc.a.run.app
SAS Query Generator UI  https://sas-query-generator-i4ney2dwya-uc.a.run.app
Dashboard Enhancer UI   https://dashboard-enhancer-ui-i4ney2dwya-uc.a.run.app

Repository Contents

Requirement              Location
License                  LICENSE (Apache 2.0)
Deployment Instructions  README.md
Datadog Configurations   submission/dashboard.json, submission/monitors.json, submission/slos.json
Traffic Generator        scripts/traffic_gen.py
Evidence Screenshots     images/

Inspiration

The rapid adoption of AI agents in production environments has outpaced our ability to observe and govern them effectively. While traditional services have mature observability practices, AI agents introduce unique challenges: non-deterministic behaviour, unbounded execution loops, hallucination risks, and the need for human-in-the-loop controls.

We were inspired by the research paper "Measuring Agents in Production" (Pan et al., 2025), which highlighted that 68% of production agents complete tasks in under 10 steps, yet many lack proper budget controls. We asked ourselves: What if governance constraints weren't an afterthought, but the foundation of observability itself?

The second inspiration came from a common pain point: every new AI agent requires manual dashboard creation, monitor configuration, and SLO setup. This doesn't scale. We envisioned an AI agent that could analyse other agents and automatically provision personalised observability—zero-touch onboarding for the AI agent fleet.

What it does

Cerberus is a scalable, self-referential observability platform for AI agents. It delivers two core innovations:

1. Governance-First Agent Onboarding

Every agent onboarded to Cerberus automatically inherits bounded autonomy controls:

  • Step budgets (max 8 steps) to prevent runaway agents
  • Tool call limits (max 6) and model call limits (max 5)
  • Confidence thresholds (0.7) that trigger human escalation
  • Security validation (prompt injection detection, PII scanning)
  • Human-in-the-loop approval gates for high-impact actions

These governance constraints aren't just guardrails—they emit metrics (ai_agent.governance.*) that feed directly into SLOs and detection rules.
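
The bounded autonomy controls listed above can be pictured with a minimal sketch. The limits (8 steps, 6 tool calls, 5 model calls, 0.7 confidence) come from the text; the class and method names are hypothetical and are not taken from the Cerberus shared/governance/ module.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GovernanceLimits:
    """Default limits quoted in the write-up; the class itself is hypothetical."""
    max_steps: int = 8
    max_tool_calls: int = 6
    max_model_calls: int = 5
    confidence_threshold: float = 0.7


class BudgetExceeded(Exception):
    """Raised when an agent run goes past one of its budgets."""


class BudgetTracker:
    """Counts steps, tool calls, and model calls for a single agent run."""

    def __init__(self, limits: Optional[GovernanceLimits] = None):
        self.limits = limits or GovernanceLimits()
        self.counts = {"steps": 0, "tool_calls": 0, "model_calls": 0}

    def record(self, kind: str) -> None:
        """Count one event of the given kind and raise once the budget is exceeded."""
        self.counts[kind] += 1
        limit = getattr(self.limits, f"max_{kind}")
        if self.counts[kind] > limit:
            # A real implementation would presumably emit a governance metric here.
            raise BudgetExceeded(f"{kind} budget exceeded: {self.counts[kind]} > {limit}")

    def needs_escalation(self, confidence: float) -> bool:
        """True when confidence falls below the threshold for human escalation."""
        return confidence < self.limits.confidence_threshold
```

In this sketch the ninth call to `record("steps")` raises `BudgetExceeded`, and a confidence of 0.65 triggers escalation while 0.7 does not.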

2. Dynamic Personalised Observability

The Dashboard Enhancement Agent analyses new agents and automatically:

  • Discovers workflow operations, LLM calls, and tool invocations from code and telemetry
  • Proposes domain-specific metrics using Gemini (not generic infrastructure metrics)
  • Provisions span-based metrics in Datadog automatically
  • Designs a personalised widget group and adds it to the fleet dashboard
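
As a rough illustration of the provisioning step, the sketch below builds a request body in the shape that Datadog's span-based metrics API (`POST /api/v2/apm/config/metrics`) expects. It only constructs the payload and makes no network call. The metric name, span query, and helper function are hypothetical, not the ones Cerberus uses.

```python
import json


def span_metric_payload(metric_id: str, span_query: str, group_by_tags: list) -> dict:
    """Build a span-based metric definition that counts spans matching a query."""
    return {
        "data": {
            "type": "spans_metrics",
            "id": metric_id,
            "attributes": {
                "compute": {"aggregation_type": "count"},
                "filter": {"query": span_query},
                "group_by": [{"path": tag, "tag_name": tag} for tag in group_by_tags],
            },
        }
    }


payload = span_metric_payload(
    "ai_agent.sas_generator.queries_generated",  # hypothetical metric name
    "service:sas-query-generator",               # hypothetical span query
    ["env"],
)
print(json.dumps(payload, indent=2))
```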

The platform includes three production AI agents demonstrating these capabilities:

  • Ops Assistant: Triages incidents by querying Datadog metrics, logs, and traces
  • SAS Generator: Generates SAS code from natural language with quality evaluation
  • Dashboard Enhancer: The meta-agent that onboards other agents

How we built it

Architecture: Three-layer design with AI agents (FastAPI + LangGraph), MCP tool servers (FastMCP), and shared modules for observability and governance.

LLM Integration: Google Gemini via Vertex AI powers all agents—Gemini 2.0 Flash for the Dashboard Enhancer and SAS Generator, Gemini 1.5 Flash for the Ops Triage Agent. We use structured outputs with JSON response schemas for reliable parsing.

Agent Framework: LangGraph provides the state machine backbone. The Ops Triage Agent uses a 7-node workflow (intake → escalate → collect → synthesis → approval → writeback → complete) with conditional routing based on budget checks and confidence scores.
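
Conditional routing of this kind can be sketched in plain Python, without LangGraph itself. The node names come from the workflow above; the routing rules and state keys are assumptions made for illustration, since the write-up does not spell out the actual edges. In LangGraph, functions like these would typically be registered through `add_conditional_edges`.

```python
def route_after_synthesis(state: dict) -> str:
    """Pick the next node from budget and confidence checks (assumed rules)."""
    if state["steps_used"] >= state["max_steps"]:
        return "escalate"  # budget exhausted: hand over to a human
    if state["confidence"] < state["confidence_threshold"]:
        return "escalate"  # low confidence: hand over to a human
    return "approval"


def route_after_approval(state: dict) -> str:
    """Write back only when a human has approved the proposed action."""
    return "writeback" if state.get("approved") else "complete"
```

Keeping each router a pure function of the state makes the budget and confidence branches easy to unit-test in isolation.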

Tool Interface: Model Context Protocol (MCP) via FastMCP enables clean separation between agents and their tools. Three MCP servers expose Datadog APIs, SAS data tools, and dashboard management operations.

Observability Stack:

  • Datadog APM via ddtrace-run for automatic instrumentation
  • LLM Observability with workflow → agent → tool span hierarchy
  • Custom metrics via DogStatsD using standardised ai_agent.* prefix
  • RAGAS evaluations for faithfulness and answer relevancy
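
For readers unfamiliar with DogStatsD, a custom metric is a short text datagram of the form `name:value|type|#tag:value,...`. The sketch below formats one and sends it over UDP. The metric name and tag values are hypothetical, and UDP on port 8125 is an assumption; a deployment that shares a socket volume with the sidecar would use a Unix domain socket instead.

```python
import socket


def format_dogstatsd(name: str, value, metric_type: str, tags: dict) -> str:
    """Format one DogStatsD datagram, e.g. 'my.metric:1|c|#env:prod'."""
    tag_str = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return f"{name}:{value}|{metric_type}|#{tag_str}"


def send_metric(datagram: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send to a local Datadog Agent (assumed address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


datagram = format_dogstatsd(
    "ai_agent.request.count",  # hypothetical name under the ai_agent.* prefix
    1,
    "c",                       # "c" is the DogStatsD counter type
    {"service": "ops-assistant", "team": "ai-agents", "agent_type": "triage", "env": "prod"},
)
```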

Deployment: Google Cloud Run with Datadog Agent sidecar containers. The multi-container pattern enables trace collection and metric forwarding without code changes.

Shared Modules: shared/observability/ and shared/governance/ provide reusable components. Agents import factory functions and get consistent telemetry and bounded autonomy out of the box.

Challenges we ran into

Cloud Run Sidecar Orchestration: Getting the Datadog Agent sidecar to start before the main application required careful container dependency configuration. We solved this with the container-dependencies annotation on the Cloud Run (Knative-style) service definition and shared in-memory volumes for socket communication.

LLMObs Agentless vs Sidecar Mode: Balancing agentless LLM Observability (for span submission) with sidecar-based APM tracing required conditional configuration. Some telemetry flows through the sidecar, while the rest goes directly to Datadog.

SLO Tag Indexing: Fleet-wide SLOs using team:ai-agents required ensuring the tag was indexed in Datadog before SLO creation. We learned that metric-based SLOs need explicit tag configuration.

Gemini JSON Response Reliability: Early iterations had parsing failures when Gemini returned malformed JSON. Adding response_mime_type="application/json" and Pydantic response schemas dramatically improved reliability.
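
The defensive parsing side of this can be sketched as follows. To stay dependency-free, the example swaps in a standard-library dataclass with manual checks where the project uses Pydantic response schemas. The `MetricProposal` schema and its fields are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricProposal:
    """Hypothetical schema for one metric proposed by the model."""
    name: str
    description: str


def parse_proposal(raw: str) -> Optional[MetricProposal]:
    """Return a validated proposal, or None if the response is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # e.g. the model wrapped its JSON in prose or code fences
    if not isinstance(data, dict):
        return None
    name, description = data.get("name"), data.get("description")
    if not isinstance(name, str) or not isinstance(description, str):
        return None  # missing or wrongly typed field
    return MetricProposal(name=name, description=description)
```

Returning `None` instead of raising lets the caller decide whether to retry the model call or fall back, which matters when each retry counts against a model call budget.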

Service-to-Service Authentication: Cloud Run services calling each other required GCP identity tokens. We implemented automatic token fetching from the metadata server with graceful fallback for local development.
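
A minimal sketch of this pattern, using only the standard library, is shown below. The metadata server URL and the `Metadata-Flavor` header are the standard GCP ones; the function names and the choice to return empty headers as the local fallback are assumptions, not the project's actual code.

```python
import urllib.parse
import urllib.request
from typing import Callable

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def fetch_from_metadata_server(audience: str, timeout: float = 2.0) -> str:
    """Request an identity token for the given audience (the target service URL)."""
    query = urllib.parse.urlencode({"audience": audience})
    request = urllib.request.Request(
        f"{METADATA_URL}?{query}", headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def auth_headers(
    audience: str, fetch: Callable[[str], str] = fetch_from_metadata_server
) -> dict:
    """Return Authorization headers, or no headers when no metadata server exists."""
    try:
        token = fetch(audience)
    except OSError:
        # URLError and socket timeouts are OSError subclasses; this is the
        # local-development path, where the metadata server is unreachable.
        return {}
    return {"Authorization": f"Bearer {token}"}
```

Passing the fetcher in as a parameter keeps the fallback path testable without network access.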

Governance Metric Cardinality: Initial designs emitted too many tag combinations, risking metric cardinality explosion. We standardised on a minimal tag set (service, team, agent_type, env) across all agents.
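
One simple way to enforce such a tag set is an allowlist applied before any metric is emitted. The tag keys below are the four named above; the helper name and the example values are hypothetical.

```python
ALLOWED_TAG_KEYS = ("service", "team", "agent_type", "env")


def minimal_tags(tags: dict) -> list:
    """Keep only allowlisted tag keys, in a fixed order, as 'key:value' strings."""
    return [f"{key}:{tags[key]}" for key in ALLOWED_TAG_KEYS if key in tags]


# High-cardinality keys such as a per-request ID are dropped.
example = minimal_tags({"service": "ops-assistant", "request_id": "abc123", "env": "prod"})
```

Dropping unbounded keys such as request IDs keeps the number of distinct tag combinations proportional to the number of agents and environments rather than to traffic.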

Accomplishments that we're proud of

Self-Referential Observability: The Dashboard Enhancer agent is fully observable through the same platform it provisions. It's agents all the way down.

Zero-Touch Onboarding: A new agent can be analysed, have custom metrics provisioned, and receive a personalised dashboard widget group—all through a single API call or UI interaction.

Governance as SLOs: We turned bounded autonomy into measurable targets. "99% of requests within step budgets" is now a tracked SLO, not just a hope.
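
The arithmetic behind such an SLO is straightforward. The sketch below computes the SLI and the remaining error budget for a 99% target; the function names and request counts are illustrative only, not measured results.

```python
def sli(good: int, total: int) -> float:
    """Fraction of requests that stayed within their step budget."""
    return 1.0 if total == 0 else good / total


def error_budget_remaining(good: int, total: int, target: float = 0.99) -> float:
    """Share of the error budget still unspent (negative once the SLO is breached)."""
    allowed_bad = (1 - target) * total
    actual_bad = total - good
    return 1.0 if allowed_bad == 0 else 1 - actual_bad / allowed_bad


# Illustrative numbers: 10,000 requests, 50 of which exceeded the step budget.
current_sli = sli(9_950, 10_000)
remaining = error_budget_remaining(9_950, 10_000)
```

With these illustrative counts the SLI is 99.5%, and roughly half of the error budget is left: a 99% target allows about 100 over-budget requests in 10,000, and 50 have been used.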

6 Detection Rules + 4 SLOs: Comprehensive coverage including escalation rate monitoring, PII detection alerts, and governance budget tracking—all using fleet-wide queries with per-agent drill-down.

Research-Backed Defaults: Our governance limits (8 steps, 5 model calls, 6 tool calls) are grounded in empirical research, not arbitrary numbers.

Production-Ready Architecture: The sidecar pattern, shared modules, and factory scripts mean onboarding a new agent takes minutes, not days.

What we learned

Governance enables observability, not the other way around: By building governance controls first, observability signals emerged naturally. Budget violations, escalations, and approval decisions all became metrics without additional instrumentation.

MCP is powerful for tool isolation: Separating Datadog API operations into MCP servers made agents cleaner and tools reusable. The same dashboard MCP server powers both the Ops Assistant and Dashboard Enhancer.

LLM-as-judge scales quality evaluation: Using Gemini to evaluate code quality and propose domain-specific metrics proved surprisingly effective. The key is structured prompts with clear evaluation criteria.

Standardisation compounds: The shared/ modules paid dividends quickly. Once we had emit_request_complete() working correctly, every agent got consistent metrics instantly.

Fleet thinking beats service thinking: Designing for team:ai-agents from day one meant monitors and SLOs automatically included new agents. No manual updates required.

What's next for Cerberus

Automated Incident Response: Extend the Ops Assistant to not just triage incidents but take remediation actions—scaling services, toggling feature flags, or rolling back deployments—with appropriate approval gates.

Cross-Agent Collaboration: Enable agents to delegate sub-tasks to other agents in the fleet, with trace correlation across the handoff.

Continuous Quality Monitoring: Run RAGAS evaluations on a sample of production traffic automatically, not just during development.

Governance Policy Language: Create a declarative format for governance policies that can be version-controlled and applied across agents.

Open Source Release: Package the shared/ modules as a standalone library so other teams can adopt governance-first observability patterns.

Multi-Cloud Support: Extend beyond Cloud Run to Kubernetes, AWS Lambda, and Azure Container Apps with appropriate sidecar or daemonset patterns.
