-
-
LLM Governance Monitor in STANDARD MODE - real-time metrics: latency, tokens, cost, health score
-
STRICT MODE activated - unsafe request blocked, threat neutralized without API cost
-
Datadog Dashboard - monitoring tokens, cost, latency, safety score, and health score
-
Datadog Event Management - automatic incident creation with context for engineers
-
Five custom Datadog monitors: High Latency Alert, Anomaly Detection with ML, Low Safety Score, High Token Usage, and Critical Health Score
-
Real-time incident notifications -Threat Blocked and STRICT MODE Activated events with severity levels, enabling immediate engineer response
Inspiration
LLM applications are becoming critical to enterprise operations, but traditional monitoring solutions only watch problems happen — they don't prevent them. I wanted to build something that actively protects LLM infrastructure, not just observes it.
What it does
LLM Governance Monitor is an active, self-healing observability solution for LLM applications. It implements a closed-loop governance architecture that:
- Detects threats in real-time (prompt injection, jailbreaking attempts)
- Decides on appropriate response (allow or block)
- Acts autonomously to neutralize threats before they reach the LLM
- Adapts its security posture based on observed patterns
The system operates in two modes:
- STANDARD MODE: Monitors and alerts on suspicious activity
- STRICT MODE: Automatically blocks unsafe requests, saving API costs
Key Features
- Real-time metrics: latency, tokens, cost, safety score, health score
- Active Governance Engine with automatic threat blocking
- Self-healing architecture with automatic mode escalation
- 5 Datadog monitors including ML-based anomaly detection
- Event logging with actionable incidents for engineers
- Cost optimization: blocked requests don't call the API
How I built it
- Backend: Python with FastAPI for high-performance async API
- AI Model: Google Vertex AI with Gemini 2.0 Flash
- Hosting: Google Cloud Run for serverless deployment
- Monitoring: Datadog APM, Metrics API, and Events API
- Frontend: HTML/JavaScript with Tailwind CSS
Challenges I faced
- Implementing the governance state machine that transitions between STANDARD and STRICT modes
- Calculating a unified Health Score that combines performance, cost, safety, and reliability
- Sending custom metrics and events to Datadog in real-time
- Balancing security (blocking threats) with usability (not blocking legitimate requests)
What I learned
- How to implement closed-loop governance for AI systems
- Datadog's Metrics API and Event Management for custom observability
- The importance of actionable incidents over simple alerts
- How to build self-healing architectures that adapt autonomously
What's next
- Add semantic safety analysis using Vertex AI's safety settings
- Implement ECONOMY mode for cost optimization with token truncation
- Add user-level tracking for abuse detection
- Integrate with Datadog Incident Management for automated runbooks
Log in or sign up for Devpost to join the conversation.