Inspiration

LLM applications are fundamentally different from traditional software. They're non-deterministic, expensive to operate, and can fail silently in ways that infrastructure monitoring cannot detect. A service can return HTTP 200 while the AI output is completely unusable. We were inspired to solve this observability gap: making AI behavior visible, measurable, and actionable.

What it does

Compliance Copilot is an AI-powered transaction compliance system that demonstrates production-grade observability for LLM applications. It uses Vertex AI Gemini 2.5 Flash to analyze financial transactions for compliance risks, while Datadog monitors every aspect of the AI's behavior.

The system implements three critical detection rules:

  • Semantic Drift: Validates that AI outputs match the expected JSON schema, catching model regressions before downstream systems break
  • Budget Burn: Monitors call volume to prevent cost overruns from retry loops or traffic spikes
  • UX Friction: Tracks error rates to protect user experience

Every LLM call emits structured telemetry including latency, token usage, validation results, prompt version, and route mode—all correlated via trace IDs for full observability.
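
As a rough illustration of the per-call telemetry described above, the record could look like the sketch below. The interface, the field names, and the `buildTelemetry` helper are assumptions made for this example; the project's actual schema is not shown in this write-up.

```typescript
// Hypothetical telemetry record for one LLM call. Field names are
// illustrative assumptions, not the project's actual schema.
interface LlmCallTelemetry {
  traceId: string; // correlates this record with the APM trace
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  schemaValid: boolean; // outcome of output validation
  promptVersion: string;
  routeMode: string;
}

// Raw measurements collected around a single model call (also hypothetical).
interface LlmCallInput {
  traceId: string;
  startedAtMs: number;
  finishedAtMs: number;
  promptTokens: number;
  completionTokens: number;
  schemaValid: boolean;
  promptVersion: string;
  routeMode: string;
}

// Derives latency and total token count, and passes the rest through.
function buildTelemetry(input: LlmCallInput): LlmCallTelemetry {
  return {
    traceId: input.traceId,
    latencyMs: input.finishedAtMs - input.startedAtMs,
    promptTokens: input.promptTokens,
    completionTokens: input.completionTokens,
    totalTokens: input.promptTokens + input.completionTokens,
    schemaValid: input.schemaValid,
    promptVersion: input.promptVersion,
    routeMode: input.routeMode,
  };
}
```

Keeping derived values such as latency and total tokens in one builder means every call site reports them the same way.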

How we built it

Tech Stack:

  • Backend: Node.js/Express with Vertex AI Gemini 2.5 Flash integration
  • Frontend: Next.js with Datadog RUM
  • Observability: Datadog APM (dd-trace), Logs, Monitors, Incidents
  • Infrastructure: Google Cloud (Vertex AI, Cloud Run-ready)

Architecture: We structured the project as a monorepo with shared packages for telemetry, prompt versioning, and type definitions. The API instruments every request with dd-trace for automatic APM, emits structured logs to Datadog's HTTP endpoint, and validates all LLM responses against strict schemas using TypeScript type guards.
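
A minimal sketch of validating an LLM response with a TypeScript type guard follows. The `ComplianceResult` shape and both function names are assumptions for illustration only; the real schema lives in the project's shared packages and is not reproduced here.

```typescript
// Hypothetical result shape; the project's real schema is not shown
// in this write-up.
interface ComplianceResult {
  riskLevel: "low" | "medium" | "high";
  reasons: string[];
}

const RISK_LEVELS = ["low", "medium", "high"];

// Type guard: narrows an unknown value to ComplianceResult only when
// the required fields are present and have the right types.
function isComplianceResult(value: unknown): value is ComplianceResult {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const riskLevel = record["riskLevel"];
  const reasons = record["reasons"];
  if (typeof riskLevel !== "string") return false;
  if (RISK_LEVELS.indexOf(riskLevel) === -1) return false;
  if (!Array.isArray(reasons)) return false;
  return reasons.every((reason: unknown) => typeof reason === "string");
}

// Parses raw model text; returns null for invalid JSON or a schema
// mismatch so the caller can record the failure in telemetry.
function parseLlmOutput(raw: string): ComplianceResult | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (_err) {
    return null;
  }
  return isComplianceResult(parsed) ? parsed : null;
}
```

Returning null rather than throwing keeps a schema mismatch an ordinary, loggable outcome instead of an exception path.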

The three Datadog monitors use log-based queries with carefully tuned thresholds. When monitors trigger, they create actionable incidents with context and runbooks. A traffic generator simulates diverse scenarios across countries, risk profiles, and prompt versions to demonstrate the system under realistic load.

Challenges we ran into

Vertex AI Model Access: We initially struggled with model availability, as several Gemini variants returned 404 errors. We resolved this by switching to gemini-2.5-flash, which was accessible and performed well for our use case.

Datadog Log Ingestion: Logs weren't appearing initially. The issue was the log payload format—we needed to send telemetry fields as top-level attributes, not nested in the message string, for Datadog to properly index and create facets.
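
The payload fix can be sketched roughly as below: telemetry fields become top-level JSON attributes next to `message`, so they can be queried as `@schema_valid`, `@latency_ms`, and so on. The service name and the attribute names are assumptions for this example; `message`, `ddsource`, and `service` are standard Datadog log attributes.

```typescript
// One log entry as sent to Datadog's HTTP log intake. The reserved
// attributes are real; everything else is a custom attribute.
interface DatadogLogEntry {
  message: string;
  ddsource: string;
  service: string;
  [attribute: string]: unknown;
}

type TelemetryFields = Record<string, string | number | boolean>;

// Builds an entry with telemetry fields at the top level rather than
// serialized into the message string. "compliance-api" is a
// hypothetical service name.
function toLogEntry(message: string, fields: TelemetryFields): DatadogLogEntry {
  const entry: DatadogLogEntry = {
    message: message,
    ddsource: "nodejs",
    service: "compliance-api",
  };
  for (const key of Object.keys(fields)) {
    // Do not let custom fields overwrite the reserved attributes.
    if (!(key in entry)) {
      entry[key] = fields[key];
    }
  }
  return entry;
}
```

An array of such entries, serialized as JSON, is what the intake endpoint would receive; the HTTP call itself is omitted from this sketch.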

Monitor Query Syntax: Hit multiple validation errors when creating log-based monitors. Learned that Datadog's query language requires @ prefix for facets, specific aggregation functions (e.g., avg instead of p95 for some metrics), and that monitor types cannot be changed after creation.
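
To make the query syntax concrete, here is a small sketch that assembles a log-monitor query string in the general form Datadog uses for log alerts. The `LogMonitorSpec` type, the builder function, and the facet and service names are assumptions for illustration; the thresholds used in the project are not stated in this write-up.

```typescript
// Hypothetical description of a log-based monitor.
interface LogMonitorSpec {
  search: string; // log search; custom facets carry the @ prefix
  aggregation: "count" | "avg";
  measure?: string; // required for "avg", e.g. "@latency_ms"
  window: string; // evaluation window, e.g. "5m"
  threshold: number;
}

// Assembles the query string. The search text is inserted as-is, so
// it must not contain double quotes in this simplified sketch.
function buildLogMonitorQuery(spec: LogMonitorSpec): string {
  let rollup: string;
  if (spec.aggregation === "count") {
    rollup = 'rollup("count")';
  } else {
    if (!spec.measure) {
      throw new Error("avg aggregation requires a measure");
    }
    rollup = `rollup("avg", "${spec.measure}")`;
  }
  return `logs("${spec.search}").index("*").${rollup}.last("${spec.window}") > ${spec.threshold}`;
}
```

For example, a Semantic Drift monitor under these assumptions might count logs matching `service:compliance-api @schema_valid:false` over a five-minute window.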

Schema Validation Tuning: Balancing strict validation with real LLM behavior was tricky. Gemini sometimes adds helpful context beyond the strict schema. We made validation strict enough to catch real drift while accepting minor variations.
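
One way to express that balance, sketched under assumptions: treat missing required keys as drift, but record unexpected extra keys without failing the response. The function name and the key names are hypothetical, and this is not necessarily how the project's validator is written.

```typescript
interface ValidationOutcome {
  valid: boolean; // false only when a required key is missing
  missing: string[];
  extra: string[]; // tolerated, but worth logging for trend analysis
}

// Compares the keys of a parsed LLM output against the required set.
function validateKeys(
  output: Record<string, unknown>,
  required: string[]
): ValidationOutcome {
  const keys = Object.keys(output);
  const missing = required.filter((key) => keys.indexOf(key) === -1);
  const extra = keys.filter((key) => required.indexOf(key) === -1);
  return { valid: missing.length === 0, missing: missing, extra: extra };
}
```

Logging the `extra` list alongside `valid` makes it possible to see when the model starts adding fields, even though those responses still pass.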

What we learned

LLM Observability is Different: Traditional metrics like CPU and memory aren't enough. You need to monitor semantic correctness, token economics, and user experience simultaneously.

Structured Telemetry is Critical: Correlating traces, logs, and user sessions via IDs enables fast root cause analysis. The investment in structured logging pays off immediately when debugging production issues.

Prompt Versioning Matters: Tracking which prompt version produced which outcome enables A/B testing and instant rollback—essential for maintaining reliability as you iterate on prompts.
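
A minimal in-memory sketch of prompt versioning with rollback is shown below. The `PromptRegistry` class and its methods are hypothetical and exist only to illustrate the idea; the project's prompt-versioning package may work differently.

```typescript
interface PromptVersion {
  version: string;
  template: string;
}

// Keeps prompt versions in registration order and tracks which one
// is currently active.
class PromptRegistry {
  private versions: PromptVersion[] = [];
  private activeIndex = -1;

  // Registers a new version and makes it the active one.
  register(version: string, template: string): void {
    this.versions.push({ version: version, template: template });
    this.activeIndex = this.versions.length - 1;
  }

  // Returns the active version; its label would be attached to
  // every telemetry record.
  active(): PromptVersion {
    const current = this.versions[this.activeIndex];
    if (!current) {
      throw new Error("no prompt version registered");
    }
    return current;
  }

  // Steps back to the previously registered version.
  rollback(): PromptVersion {
    if (this.activeIndex <= 0) {
      throw new Error("no earlier version to roll back to");
    }
    this.activeIndex -= 1;
    return this.active();
  }
}
```

Because each telemetry record carries the active version label, a regression that appears after a prompt change can be attributed to that version and reverted with a single `rollback()` call.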

Datadog + Vertex AI: The combination is powerful. Vertex AI handles the AI infrastructure complexity, Datadog handles observability complexity, and together they enable shipping production-grade LLM applications with confidence.

What's next

  • Deploy to Cloud Run for a public demo URL
  • Implement self-healing: automatically switch route modes when monitors trigger
  • Add more sophisticated detection rules for specific compliance patterns
  • Expand prompt version testing with automated quality scoring
  • Build a Firestore-backed routing configuration for dynamic behavior adjustments
