Inspiration
In early 2025, Deloitte's AI chatbot made headlines for hallucinating—generating false information that damaged trust and required costly remediation. This isn't an isolated incident. LLMs are everywhere powering chatbots, code assistants, and customer service but when they break, traditional monitoring tools are blind.
Datadog can tell you your API is slow, but it can't tell you your LLM is hallucinating, leaking your system prompt, or burning through your budget on a single malicious request.
I've seen teams discover their LLM was producing harmful content only after users complained. By then, the damage was done. We needed observability built specifically for the AI era.
What it does
LLM Observability Copilot is a real-time monitoring platform purpose-built for LLM applications:
- Multi-Dimensional Quality Scoring - Tracks hallucination risk, performance, response quality, and abuse detection with a unified health score (0-100)
- SAFE Mode Security - Detects and blocks 15+ attack patterns including prompt injection, jailbreak attempts, and system prompt theft
- AI-Powered Incident Triage - Uses Gemini 2.5 Pro to analyze anomalies and generate natural language root cause explanations
- Predictive Alerts - "Error rate trending up, will breach SLO in 15 minutes"
- Cost Intelligence - Per-request cost tracking with budget alerts and forecasting
- Datadog Integration - Ships LLM-specific metrics to your existing observability stack
How I built it
- Datadog Organization Name: Neuralrocks
- Frontend: React with a custom dark-mode dashboard optimized for operations teams
- Backend: Python/FastAPI handling LLM request instrumentation
- AI Triage: Gemini 2.5 Pro for intelligent incident analysis
- Metrics: Custom integration with Datadog for LLM-specific telemetry
- Novel Metrics: Invented the "latency-to-token ratio" metric:
latency_ms / (prompt_tokens + completion_tokens)for meaningful LLM performance measurement
Challenges I ran into
Hallucination Detection - No ground truth to compare against. Built a probabilistic scorer using linguistic patterns and confidence indicators.
Real-time Security Scanning - SAFE mode needs to analyze requests before they hit the LLM without adding latency. Optimized pattern matching to stay under 10ms.
Making AI Explain AI - Getting the triage LLM to produce actionable, structured incident reports (not just "something went wrong") required extensive prompt engineering.
Accomplishments that I am proud of
- Built a complete observability platform from scratch
- Created novel metrics specifically designed for LLM workloads
- Achieved sub-10ms overhead for security scanning
- Filed a provisional patent for the core technology
What I learned
- Traditional APM concepts don't translate directly to LLMs—you need new metrics
- Security for LLMs is fundamentally different (semantic attacks vs. technical exploits)
- Using AI to monitor AI creates powerful feedback loops
What's next for LLM Observability Copilot
- Multi-model support (OpenAI, Anthropic, Cohere, local models)
- Automated remediation actions
- Compliance reporting for regulated industries
- Open-source SDK for easy integration
Built With
- css
- datadog
- fastapi
- gemini-2.5-pro
- google-cloud-run
- google-firestore
- javascript
- python
- react
- typescript
- vertex-ai
Log in or sign up for Devpost to join the conversation.