Inspiration

As LLMs move from experiments into real production systems, I kept running into a problem that traditional monitoring doesn’t really solve. LLMs don’t always fail loudly. Sometimes they hallucinate, sometimes latency slowly creeps up, and sometimes costs spike without anything obviously “breaking.” These issues are hard to reason about if all you have are generic infrastructure metrics.

Datadog already provides powerful observability tools, but while building LLM-driven systems I felt there was still a gap between raw telemetry and actual operational understanding. This project came from that gap. I wanted to treat LLM failures as real production incidents, not as vague AI problems that are hard to explain, debug, or act on.

What it does

LLM Incident Commander is a Datadog-native observability and incident management system built specifically for LLM workloads.

It instruments an LLM-powered application and captures signals that actually matter in production, such as request volume, errors, latency behavior, cost-related indicators, and semantic risk signals. These signals are visualized through Datadog dashboards, evaluated using detection rules, and escalated into incidents when reliability thresholds are crossed.
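
To make that concrete, here is a minimal sketch of how such signals could be emitted as custom metrics through DogStatsD. The metric names, tags, and helper function are illustrative placeholders, not the exact instrumentation used in the project.

```python
# Hypothetical sketch: emitting LLM-specific signals as Datadog custom metrics.
# Metric names and tags are placeholders, not the project's exact instrumentation.
from datadog import initialize, statsd

# In the real service the agent address comes from the environment;
# defaults are hardcoded here only to keep the sketch self-contained.
initialize(statsd_host="localhost", statsd_port=8125)

def record_llm_call(model: str, latency_s: float, risk_score: float, error: bool = False) -> None:
    tags = [f"model:{model}", "service:llm-incident-commander"]

    statsd.increment("llm.requests", tags=tags)                     # request volume
    if error:
        statsd.increment("llm.errors", tags=tags)                   # error-rate input
    statsd.histogram("llm.latency_seconds", latency_s, tags=tags)   # latency behavior
    statsd.gauge("llm.semantic_risk_score", risk_score, tags=tags)  # semantic risk signal
```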

The goal isn’t just to monitor numbers, but to turn LLM behavior into concrete, debuggable operational signals that teams can reason about.

How it was built

I built the project around three clear layers.

At the core is an instrumented LLM application built with FastAPI and Gemini on Vertex AI. The service emits metrics, traces, and logs to Datadog and is intentionally designed to fail fast, use environment-based configuration, and avoid any embedded secrets.
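
As a rough illustration of this layer (not the project's actual code), a traced FastAPI endpoint calling Gemini on Vertex AI might look like the sketch below; the route, environment variable names, and span name are assumptions.

```python
# Illustrative sketch only: a traced FastAPI endpoint calling Gemini on Vertex AI.
# Route, environment variable names, and span name are assumptions for this example.
import os

import vertexai
from ddtrace import tracer
from fastapi import FastAPI
from pydantic import BaseModel
from vertexai.generative_models import GenerativeModel

# Fail fast: os.environ[...] raises KeyError at startup if anything required is
# missing, so the container never boots half-configured and no secret is embedded.
PROJECT_ID = os.environ["GCP_PROJECT_ID"]
LOCATION = os.environ["GCP_LOCATION"]
MODEL_NAME = os.environ["GEMINI_MODEL_NAME"]

vertexai.init(project=PROJECT_ID, location=LOCATION)
model = GenerativeModel(MODEL_NAME)
app = FastAPI()

class Prompt(BaseModel):
    text: str

@app.post("/generate")
async def generate(prompt: Prompt) -> dict:
    # One APM span per LLM call, so latency and errors are attributable in Datadog.
    with tracer.trace("llm.generate", service="llm-incident-commander"):
        response = await model.generate_content_async(prompt.text)
    return {"completion": response.text}
```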

On top of that sits the Datadog observability layer. I created custom dashboards to visualize LLM health and behavior; monitors to detect latency regressions, error-rate increases, cost anomalies, and semantic risk; and incident workflows to simulate how these issues would be handled in a real production environment.
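
For a sense of what one of those monitors looks like, here is a hedged sketch of a latency-regression alert created through the Datadog API. The query, thresholds, and notification handle are illustrative; the project's real monitors live as exported configuration rather than imperative code.

```python
# Hypothetical sketch of a latency-regression monitor; the query, thresholds, and
# notification handle are illustrative. The project's real monitors are exported JSON.
import os

from datadog import api, initialize

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

api.Monitor.create(
    type="metric alert",
    name="LLM latency regression",
    query="avg(last_10m):avg:llm.latency_seconds{service:llm-incident-commander} > 5",
    message="LLM latency is above threshold. Investigate model and provider health. @slack-llm-oncall",
    tags=["service:llm-incident-commander"],
    options={
        "thresholds": {"critical": 5, "warning": 3},
        "notify_no_data": True,
        "no_data_timeframe": 20,
    },
)
```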

Finally, the observability setup itself is captured as configuration. Dashboards, monitors, and SLOs are exported as JSON so the monitoring strategy can be reviewed, reproduced, and reasoned about independently of a specific Datadog account. This separation closely mirrors how observability is handled in real enterprise systems.

Challenges I ran into

One of the biggest challenges was realizing that traditional infrastructure metrics are not enough for LLM systems. Problems like hallucinations or cost drift require domain-aware signals, not just CPU usage or request counts.
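
As an example of what I mean by a domain-aware signal, a cost-drift indicator can be derived from token usage rather than from infrastructure counters. The sketch below uses placeholder per-token prices and a hypothetical metric name.

```python
# Hypothetical domain-aware signal: estimated spend per request derived from token
# usage. The prices below are placeholders, not real Vertex AI pricing.
from datadog import statsd

PRICE_PER_1K_INPUT_TOKENS = 0.000125   # placeholder
PRICE_PER_1K_OUTPUT_TOKENS = 0.000375  # placeholder

def emit_cost_signal(prompt_tokens: int, completion_tokens: int, model: str) -> None:
    estimated_cost = (
        prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
        + completion_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    )
    # Summed over time, this counter makes gradual cost drift visible to a monitor
    # long before the monthly bill does.
    statsd.increment("llm.estimated_cost_usd", value=estimated_cost, tags=[f"model:{model}"])
```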

Another challenge was deciding what should live in application code versus what belongs in Datadog configuration. Designing metrics and thresholds that were meaningful without being noisy took multiple iterations.

I also spent time making sure the system felt production-realistic. That meant enforcing fail-closed configuration, avoiding any hardcoded credentials, and ensuring everything ran cleanly in containers instead of relying on local shortcuts.

Accomplishments that I’m proud of

I’m particularly proud of treating LLM failures as real production incidents instead of abstract AI issues. Designing Datadog monitors around semantic and cost-based signals pushed the project beyond basic observability.

I’m also proud of making the entire setup reproducible through exported Datadog configuration, and of building the system with security and reliability in mind rather than optimizing only for a demo.

What I learned

This project reinforced that observability for AI systems requires intentional design. LLMs need clear reliability targets, domain-specific telemetry, and explicit incident workflows if they are going to be trusted in production.

It also showed me that Datadog can act as more than a monitoring backend. Used correctly, it becomes a control plane for AI reliability, helping bridge the gap between model behavior and operational accountability.

What’s next for LLM Incident Commander — Datadog-Native Observability for LLMs

Going forward, this system could be extended with tighter feedback loops between incidents and prompt or model configuration, automated remediation suggestions, and more detailed cost attribution. The core idea stays the same: making LLM systems observable, reliable, and safe to operate in real production environments.

Built With

  • asyncio
  • datadog
  • datadog-apm
  • datadog-metrics-&-tracing-apis
  • docker
  • dogstatsd
  • fastapi
  • gcp
  • gemini
  • google-cloud
  • langchain
  • python
  • uvicorn
  • vertex-ai
  • vertex-ai-generative-models-api
  • vertexaiembeddings
  • vertexaivectorsearch