Inspiration
Every day, LLMs confidently tell users things that aren't true. They hallucinate product specs, fabricate statistics, and invent facts—all while appearing completely certain.
Traditional observability tools track latency and errors, but hallucinations aren't "errors"—they're confident, coherent, and completely wrong. We asked: what if we could monitor trust, not just speed?
That's when we discovered calibration—a statistical concept from machine learning research. A well-calibrated model's confidence actually matches its accuracy. When it says "80% confident," it should be right 80% of the time.
We built the LLM Reliability Observatory to bring calibration-first observability to production AI.
What It Does
Our platform tracks Expected Calibration Error (ECE)—the gap between what an LLM thinks it knows and what it actually knows.
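Concretely, ECE bins predictions by confidence and averages the gap between each bin's mean confidence and its accuracy, weighted by bin size. A minimal sketch (bin count and inputs are illustrative, not our production values):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A model that says 0.9 but is right only half the time: ECE is roughly 0.4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```

A perfectly calibrated model scores 0; our dashboards alert when ECE drifts upward.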
Key features:
- 📊 ECE Monitoring: Real-time calibration tracking with Datadog dashboards
- 🚨 High-Confidence Failure Detection: Catches the most dangerous errors—confident but wrong
- 🔬 HaluGate Detection: Token-level hallucination identification inspired by vLLM research
- 🔐 Security Signals: Blocks 40+ prompt injection patterns
- 📈 Reliability Diagrams: Visual proof of calibration quality
How We Built It
Google Cloud Stack:
- Vertex AI (Gemini 1.5 Flash): LLM inference with multi-signal confidence estimation
- Cloud Run: Serverless deployment
- Cloud Storage: Audit logs for responses and feedback
- Cloud Logging: Structured application logging
Datadog Integration:
- Custom metrics via HTTP API (agentless for Cloud Run)
- 6 dashboards tracking ECE, security, HaluGate, and Q&A interactions
- 8 monitors with automated alerting
- Log management for security events
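With no Agent available on Cloud Run, points can be pushed straight to Datadog's v2 metrics intake over HTTPS. A sketch of that agentless path (the metric name and tags are illustrative, and `submit` requires a real `DD_API_KEY`):

```python
import json
import os
import time
import urllib.request

DD_METRICS_URL = "https://api.datadoghq.com/api/v2/series"

def build_series(metric, value, tags):
    """Datadog v2 series payload: one gauge point stamped with the current time."""
    return {
        "series": [{
            "metric": metric,
            "type": 3,  # 3 = gauge in the v2 metrics API
            "points": [{"timestamp": int(time.time()), "value": value}],
            "tags": tags,
        }]
    }

def submit(metric, value, tags):
    """POST a single point to Datadog; needs DD_API_KEY in the environment."""
    req = urllib.request.Request(
        DD_METRICS_URL,
        data=json.dumps(build_series(metric, value, tags)).encode(),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": os.environ["DD_API_KEY"],
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_series("llm.calibration.ece", 0.07, ["service:reliability-observatory"])
print(payload["series"][0]["metric"])
```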
Core Innovation:
- Self-consistency sampling (5 LLM calls) for robust confidence
- Bootstrap confidence intervals for ECE
- Statistical calibration from ML research applied to production LLMs
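The self-consistency idea: sample the same prompt several times at nonzero temperature and treat answer agreement as a confidence proxy. A sketch with a stand-in sampler (`sample_llm` is a placeholder for the real Gemini call):

```python
from collections import Counter

def self_consistency_confidence(sample_llm, prompt, n_samples=5):
    """Sample n times; confidence = share of samples agreeing with the modal answer."""
    answers = [sample_llm(prompt) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples

# Stand-in sampler: a real implementation would call Gemini with temperature > 0.
fake_samples = iter(["Paris", "Paris", "Paris", "Lyon", "Paris"])
answer, confidence = self_consistency_confidence(lambda p: next(fake_samples), "Capital of France?")
print(answer, confidence)  # Paris 0.8
```

In production this signal is blended with other confidence estimators before being logged to Datadog.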
Challenges We Faced
1. Agentless Metrics in Cloud Run: Cloud Run doesn't have a Datadog Agent. We solved this by implementing direct HTTP API metric submission—a hybrid approach that falls back to API mode when StatsD isn't available.
2. Vertex AI Model Access: Getting Gemini models working required navigating GCP's permission model. We built a fallback system: Vertex AI → Google AI API → mock responses.
3. Real-Time Calibration: Computing ECE requires ground truth. We built a feedback loop where users mark responses correct or incorrect, enabling live calibration updates.
4. Hallucination Detection Without Fine-Tuning: Inspired by HaluGate, we implemented rule-based detection for demo purposes, with the architecture ready for ModernBERT-based token classification.
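The rule-based stand-in flags risky spans with simple patterns rather than a trained token classifier. Roughly (the patterns here are illustrative, not our full rule set):

```python
import re

# Illustrative heuristics: specific-sounding claims that often accompany hallucinations.
RISK_PATTERNS = [
    (re.compile(r"\b\d{1,3}(\.\d+)?%"), "unsourced statistic"),
    (re.compile(r"\bstud(y|ies) show(s)?\b", re.I), "vague citation"),
    (re.compile(r"\b(definitely|guaranteed|always)\b", re.I), "overclaiming"),
]

def flag_spans(text):
    """Return (matched_text, reason) pairs for spans a token classifier would score."""
    flags = []
    for pattern, reason in RISK_PATTERNS:
        for match in pattern.finditer(text):
            flags.append((match.group(), reason))
    return flags

print(flag_spans("Studies show 73% of users definitely prefer it."))
```

A ModernBERT token classifier would replace `flag_spans` with per-token hallucination scores while keeping the same downstream interface.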
What We Learned
- Calibration > Accuracy: A model can be accurate but poorly calibrated. ECE reveals hidden unreliability.
- High-Confidence Failures Are Critical: When confidence is high but accuracy is low, users get hurt. This is THE metric to monitor.
- Serverless + Observability: Cloud Run requires creative solutions for metrics—API mode was our breakthrough.
- Security Is Observability: Prompt injection detection is part of the reliability story.
What's Next
- Production ModernBERT: Replace rule-based HaluGate with fine-tuned token classifier
- Confidence Calibration: Temperature scaling to improve calibration post-hoc
- Multi-Model Comparison: Track ECE across different LLMs
- Datadog AI Agents: Automated root cause analysis for calibration drift
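Temperature scaling fits a single scalar T on held-out feedback and rescales confidences post-hoc. A minimal sketch using grid search over T (a real implementation would minimize NLL with an optimizer; inputs are assumed to be probabilities strictly between 0 and 1):

```python
import math

def scale(confidence, temperature):
    """Temperature-scale a probability via the logit transform."""
    logit = math.log(confidence / (1 - confidence))
    return 1 / (1 + math.exp(-logit / temperature))

def fit_temperature(confidences, correct, grid=None):
    """Pick T minimizing negative log-likelihood on held-out (confidence, correct) pairs."""
    grid = grid or [t / 10 for t in range(5, 101)]  # search T in [0.5, 10.0]
    def nll(t):
        eps = 1e-12
        return -sum(
            math.log(max(scale(c, t), eps)) if ok else math.log(max(1 - scale(c, t), eps))
            for c, ok in zip(confidences, correct)
        )
    return min(grid, key=nll)

# Overconfident model: says 0.9 but is right only 60% of the time, so T > 1 softens it.
confs = [0.9] * 10
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
t = fit_temperature(confs, labels)
print(t, round(scale(0.9, t), 2))
```

T = 1 leaves confidences unchanged; T > 1 pulls overconfident predictions back toward the model's observed accuracy, directly shrinking ECE.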
When your model says 80% confident, we verify it's actually right 80% of the time.
Built With
- chart.js
- cloud-logging
- cloud-run
- cloud-storage
- css
- datadog
- datadog-apm
- datadog-custom-metrics
- datadog-dashboards
- datadog-monitors
- ddtrace
- dogstatsd
- fastapi
- gemini
- google-cloud
- html
- javascript
- pydantic
- python
- secret-manager
- uvicorn
- vertex-ai