Inspiration

LLM applications are becoming critical to enterprise operations, but traditional monitoring solutions only observe problems as they happen; they do not prevent them. I wanted to build something that actively protects LLM infrastructure rather than just watching it.

What it does

LLM Governance Monitor is an active, self-healing observability solution for LLM applications. It implements a closed-loop governance architecture that:

  • Detects threats in real time (prompt injection, jailbreaking attempts)
  • Decides on an appropriate response (allow or block)
  • Acts autonomously to neutralize threats before they reach the LLM
  • Adapts its security posture based on observed patterns

The system operates in two modes:

  • STANDARD MODE: Monitors and alerts on suspicious activity
  • STRICT MODE: Automatically blocks unsafe requests, saving API costs
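As a rough illustration of the detect, decide, and act steps under the two modes, here is a minimal Python sketch. It is not the project's actual code: the regex patterns, the `Decision` type, and the `detect`, `handle`, and `call_llm` names are all assumptions made for the example, and a real detector would need far more than two patterns.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    STANDARD = "standard"  # monitor and alert only
    STRICT = "strict"      # block flagged requests


# Hypothetical patterns for illustration only; not the project's real rules.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"pretend (you are|to be) .* without (any )?restrictions", re.IGNORECASE),
]


@dataclass
class Decision:
    flagged: bool             # the detector matched a suspicious pattern
    blocked: bool             # the request was stopped before the LLM
    response: Optional[str]   # LLM output, or None when blocked


def detect(prompt: str) -> bool:
    """Detect: return True if any suspicious pattern appears in the prompt."""
    return any(p.search(prompt) for p in SUSPICIOUS_PATTERNS)


def handle(prompt: str, mode: Mode, call_llm: Callable[[str], str]) -> Decision:
    """Decide and act: block only when the prompt is flagged in STRICT mode."""
    flagged = detect(prompt)
    if flagged and mode is Mode.STRICT:
        # The LLM is never called here, so a blocked request costs no tokens.
        return Decision(flagged=True, blocked=True, response=None)
    # STANDARD mode lets flagged prompts through; the caller can alert on `flagged`.
    return Decision(flagged=flagged, blocked=False, response=call_llm(prompt))
```

Passing `call_llm` in as a parameter keeps the sketch independent of any particular model client and makes it easy to confirm that a blocked request never reaches the model.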

Key Features

  • Real-time metrics: latency, tokens, cost, safety score, health score
  • Active Governance Engine with automatic threat blocking
  • Self-healing architecture with automatic mode escalation
  • 5 Datadog monitors including ML-based anomaly detection
  • Event logging with actionable incidents for engineers
  • Cost optimization: blocked requests don't call the API
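The automatic mode escalation can be pictured as a small state machine. The sketch below is only one possible design: the escalation trigger (a number of flagged requests inside a sliding window), the cool-down before returning to STANDARD, and every name and default value are assumptions for illustration, not the project's actual rules.

```python
from collections import deque
from typing import Deque, Optional


class GovernanceStateMachine:
    """Escalates to STRICT after repeated threats; relaxes after a quiet period."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0,
                 cooldown_s: float = 300.0) -> None:
        self.threshold = threshold      # flagged requests needed to escalate (assumed)
        self.window_s = window_s        # sliding window length in seconds (assumed)
        self.cooldown_s = cooldown_s    # quiet time before de-escalating (assumed)
        self.mode = "STANDARD"
        self._recent: Deque[float] = deque()   # timestamps of recent flagged requests
        self._last_threat: Optional[float] = None

    def record(self, now: float, flagged: bool) -> str:
        """Register one request at time `now` and return the resulting mode."""
        if flagged:
            self._recent.append(now)
            self._last_threat = now
        # Forget flagged requests that have fallen out of the sliding window.
        while self._recent and now - self._recent[0] > self.window_s:
            self._recent.popleft()

        if self.mode == "STANDARD" and len(self._recent) >= self.threshold:
            self.mode = "STRICT"
        elif (self.mode == "STRICT" and self._last_threat is not None
              and now - self._last_threat >= self.cooldown_s):
            self.mode = "STANDARD"
        return self.mode
```

Taking the timestamp as an argument, instead of reading the clock inside the class, keeps the transitions deterministic and easy to unit-test.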

How I built it

  • Backend: Python with FastAPI for a high-performance async API
  • AI Model: Google Vertex AI with Gemini 2.0 Flash
  • Hosting: Google Cloud Run for serverless deployment
  • Monitoring: Datadog APM, Metrics API, and Events API
  • Frontend: HTML/JavaScript with Tailwind CSS
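To give a feel for the Datadog Metrics API part, here is a hedged sketch that builds a gauge payload for the v2 series endpoint using only the standard library. The metric names, tags, and function names are hypothetical, the URL assumes the default US site (other Datadog sites use a different host), and the project itself may well use an official client library instead.

```python
import json
import time
import urllib.request
from typing import Dict, List, Optional

# Assumes the default US site; other Datadog sites use a different host.
DATADOG_SERIES_URL = "https://api.datadoghq.com/api/v2/series"
GAUGE = 3  # metric type enum in the v2 series payload (3 = gauge)


def build_series_payload(values: Dict[str, float], tags: List[str],
                         timestamp: Optional[int] = None) -> dict:
    """Build one gauge series per metric name, all sharing a timestamp and tags."""
    ts = int(time.time()) if timestamp is None else timestamp
    return {
        "series": [
            {
                "metric": name,  # e.g. "llm.governance.safety_score" (hypothetical name)
                "type": GAUGE,
                "points": [{"timestamp": ts, "value": float(value)}],
                "tags": tags,
            }
            for name, value in values.items()
        ]
    }


def submit(payload: dict, api_key: str, timeout_s: float = 5.0) -> int:
    """POST the payload to Datadog and return the HTTP status code."""
    request = urllib.request.Request(
        DATADOG_SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Separating payload construction from the network call means the payload shape can be checked in tests without an API key.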

Challenges I faced

  • Implementing the governance state machine that transitions between STANDARD and STRICT modes
  • Calculating a unified Health Score that combines performance, cost, safety, and reliability
  • Sending custom metrics and events to Datadog in real time
  • Balancing security (blocking threats) with usability (not blocking legitimate requests)
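One simple way to combine performance, cost, safety, and reliability into a single Health Score is a weighted average of normalized sub-scores. The sketch below assumes each sub-score is already scaled to the range 0 to 1 (1 is best); the weights and the function name are illustrative assumptions, not the values the project uses.

```python
from typing import Dict, Optional

# Assumed weights for illustration; safety is weighted highest here by choice.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "performance": 0.25,
    "cost": 0.15,
    "safety": 0.35,
    "reliability": 0.25,
}


def health_score(performance: float, cost: float, safety: float, reliability: float,
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Return a 0-100 score from four sub-scores, each expected in [0, 1]."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    subs = {
        "performance": performance,
        "cost": cost,
        "safety": safety,
        "reliability": reliability,
    }
    total_weight = sum(w[name] for name in subs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Clamp each sub-score so a single bad reading cannot push the result out of range.
    weighted = sum(w[name] * min(1.0, max(0.0, value)) for name, value in subs.items())
    return 100.0 * weighted / total_weight
```

Dividing by the total weight means the weights do not have to sum to exactly 1, which makes them easier to tune when trading off security against usability.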

What I learned

  • How to implement closed-loop governance for AI systems
  • Datadog's Metrics API and Event Management for custom observability
  • The importance of actionable incidents over simple alerts
  • How to build self-healing architectures that adapt autonomously

What's next

  • Add semantic safety analysis using Vertex AI's safety settings
  • Implement ECONOMY mode for cost optimization with token truncation
  • Add user-level tracking for abuse detection
  • Integrate with Datadog Incident Management for automated runbooks
