Inspiration
Every day, LLMs confidently tell users things that aren't true. They hallucinate product specs, fabricate statistics, and invent facts—all while appearing completely certain.
Traditional observability tools track latency and errors, but hallucinations aren't "errors"—they're confident, coherent, and completely wrong. We asked: what if we could monitor trust, not just speed?
That's when we discovered calibration—a statistical concept from machine learning research. A well-calibrated model's confidence actually matches its accuracy. When it says "80% confident," it should be right 80% of the time.
We built the LLM Reliability Observatory to bring calibration-first observability to production AI.
What It Does
Our platform tracks Expected Calibration Error (ECE)—the gap between what an LLM thinks it knows and what it actually knows.
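Concretely, ECE bins predictions by confidence and averages the gap between each bin's mean confidence and its accuracy, weighted by bin size. A minimal sketch (bin count and inputs are illustrative, not our production values):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A model that says 0.9 but is right only half the time: ECE is roughly 0.4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```

A perfectly calibrated model scores 0; our dashboards alert when ECE drifts upward.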
Key features:
- 📊 ECE Monitoring: Real-time calibration tracking with Datadog dashboards
- 🚨 High-Confidence Failure Detection: Catches the most dangerous errors—confident but wrong
- 🔬 HaluGate Detection: Token-level hallucination identification inspired by vLLM research
- 🔐 Security Signals: Blocks 40+ prompt injection patterns
- 📈 Reliability Diagrams: Visual proof of calibration quality
How We Built It
Google Cloud Stack:
- Vertex AI (Gemini 1.5 Flash): LLM inference with multi-signal confidence estimation
- Cloud Run: Serverless deployment
- Cloud Storage: Audit logs for responses and feedback
- Cloud Logging: Structured application logging
Datadog Integration:
- Custom metrics via HTTP API (agentless for Cloud Run)
- 6 dashboards tracking ECE, security, HaluGate, and Q&A interactions
- 8 monitors with automated alerting
- Log management for security events
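With no Agent available on Cloud Run, points can be pushed straight to Datadog's v2 metrics intake over HTTPS. A sketch of that agentless path (the metric name and tags are illustrative, and `submit` requires a real `DD_API_KEY`):

```python
import json
import os
import time
import urllib.request

DD_METRICS_URL = "https://api.datadoghq.com/api/v2/series"

def build_series(metric, value, tags):
    """Datadog v2 series payload: one gauge point stamped with the current time."""
    return {
        "series": [{
            "metric": metric,
            "type": 3,  # 3 = gauge in the v2 metrics API
            "points": [{"timestamp": int(time.time()), "value": value}],
            "tags": tags,
        }]
    }

def submit(metric, value, tags):
    """POST a single point to Datadog; needs DD_API_KEY in the environment."""
    req = urllib.request.Request(
        DD_METRICS_URL,
        data=json.dumps(build_series(metric, value, tags)).encode(),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": os.environ["DD_API_KEY"],
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_series("llm.calibration.ece", 0.07, ["service:reliability-observatory"])
print(payload["series"][0]["metric"])
```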
Core Innovation:
- Self-consistency sampling (5 LLM calls) for robust confidence
- Bootstrap confidence intervals for ECE
- Statistical calibration from ML research applied to production LLMs
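The self-consistency idea: sample the same prompt several times at nonzero temperature and treat answer agreement as a confidence proxy. A sketch with a stand-in sampler (`sample_llm` is a placeholder for the real Gemini call):

```python
from collections import Counter

def self_consistency_confidence(sample_llm, prompt, n_samples=5):
    """Sample n times; confidence = share of samples agreeing with the modal answer."""
    answers = [sample_llm(prompt) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples

# Stand-in sampler: a real implementation would call Gemini with temperature > 0.
fake_samples = iter(["Paris", "Paris", "Paris", "Lyon", "Paris"])
answer, confidence = self_consistency_confidence(lambda p: next(fake_samples), "Capital of France?")
print(answer, confidence)  # Paris 0.8
```

In production this signal is blended with other confidence estimators before being logged to Datadog.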
Challenges We Faced
1. Agentless Metrics in Cloud Run: Cloud Run doesn't have a Datadog Agent. We solved this by implementing direct HTTP API metric submission—a hybrid approach that falls back to API mode when StatsD isn't available.
2. Vertex AI Model Access: Getting Gemini models working required navigating GCP's permission model. We built a fallback system: Vertex AI → Google AI API → mock responses.
3. Real-Time Calibration: Computing ECE requires ground truth. We built a feedback loop where users mark responses correct or incorrect, enabling live calibration updates.
4. Hallucination Detection Without Fine-Tuning: Inspired by HaluGate, we implemented rule-based detection for demo purposes, with the architecture ready for ModernBERT-based token classification.
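The rule-based stand-in flags risky spans with simple patterns rather than a trained token classifier. Roughly (the patterns here are illustrative, not our full rule set):

```python
import re

# Illustrative heuristics: specific-sounding claims that often accompany hallucinations.
RISK_PATTERNS = [
    (re.compile(r"\b\d{1,3}(\.\d+)?%"), "unsourced statistic"),
    (re.compile(r"\bstud(y|ies) show(s)?\b", re.I), "vague citation"),
    (re.compile(r"\b(definitely|guaranteed|always)\b", re.I), "overclaiming"),
]

def flag_spans(text):
    """Return (matched_text, reason) pairs for spans a token classifier would score."""
    flags = []
    for pattern, reason in RISK_PATTERNS:
        for match in pattern.finditer(text):
            flags.append((match.group(), reason))
    return flags

print(flag_spans("Studies show 73% of users definitely prefer it."))
```

A ModernBERT token classifier would replace `flag_spans` with per-token hallucination scores while keeping the same downstream interface.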
What We Learned
- Calibration > Accuracy: A model can be accurate but poorly calibrated. ECE reveals hidden unreliability.
- High-Confidence Failures Are Critical: When confidence is high but accuracy is low, users get hurt. This is THE metric to monitor.
- Serverless + Observability: Cloud Run requires creative solutions for metrics—API mode was our breakthrough.
- Security Is Observability: Prompt injection detection is part of the reliability story.
What's Next
- Production ModernBERT: Replace rule-based HaluGate with fine-tuned token classifier
- Confidence Calibration: Temperature scaling to improve calibration post-hoc
- Multi-Model Comparison: Track ECE across different LLMs
- Datadog AI Agents: Automated root cause analysis for calibration drift
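Temperature scaling fits a single scalar T on held-out feedback and rescales confidences post-hoc. A minimal sketch using grid search over T (a real implementation would minimize NLL with an optimizer; inputs are assumed to be probabilities strictly between 0 and 1):

```python
import math

def scale(confidence, temperature):
    """Temperature-scale a probability via the logit transform."""
    logit = math.log(confidence / (1 - confidence))
    return 1 / (1 + math.exp(-logit / temperature))

def fit_temperature(confidences, correct, grid=None):
    """Pick T minimizing negative log-likelihood on held-out (confidence, correct) pairs."""
    grid = grid or [t / 10 for t in range(5, 101)]  # search T in [0.5, 10.0]
    def nll(t):
        eps = 1e-12
        return -sum(
            math.log(max(scale(c, t), eps)) if ok else math.log(max(1 - scale(c, t), eps))
            for c, ok in zip(confidences, correct)
        )
    return min(grid, key=nll)

# Overconfident model: says 0.9 but is right only 60% of the time, so T > 1 softens it.
confs = [0.9] * 10
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
t = fit_temperature(confs, labels)
print(t, round(scale(0.9, t), 2))
```

T = 1 leaves confidences unchanged; T > 1 pulls overconfident predictions back toward the model's observed accuracy, directly shrinking ECE.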
When your model says 80% confident, we verify it's actually right 80% of the time.
Built With
- chart.js
- cloud-logging
- cloud-run
- cloud-storage
- css
- datadog
- datadog-apm
- datadog-custom-metrics
- datadog-dashboards
- datadog-monitors
- ddtrace
- dogstatsd
- fastapi
- gemini
- google-cloud
- html
- javascript
- pydantic
- python
- secret-manager
- uvicorn
- vertex-ai