Inspiration
As LLMs move from experimentation into production, teams struggle to monitor how these systems behave in real-world conditions. Traditional observability tools fail to capture LLM-specific signals such as token usage, inference latency, and content safety. We were inspired to bridge this gap by building a platform that makes LLMs observable, reliable, and production-ready.
What it does
LLM Sentinel is an observability and safety platform for production LLM applications. It captures real-time signals such as response latency, token consumption, and content safety ratings for LLMs running on Google Cloud Vertex AI. These signals are streamed into Datadog, where dashboards, detection rules, and alerts help engineers quickly identify performance issues, cost anomalies, and safety risks.
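For instance, a latency detection rule can be managed as code through Datadog's monitor API. The snippet below is a hedged sketch using the official datadog Python client; the metric name llm.sentinel.latency_ms, the 2 s threshold, and the tags are placeholder assumptions rather than the project's actual configuration.

```python
# Sketch: create a Datadog detection rule (metric monitor) for latency spikes.
# Metric name, threshold, and tags are placeholders, not the project's config.
import os

from datadog import initialize, api

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:llm.sentinel.latency_ms{env:prod} > 2000",
    name="LLM Sentinel: high response latency",
    message="Average LLM response latency exceeded 2s over the last 5 minutes.",
    tags=["service:llm-sentinel"],
)
```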
How we built it
We built a full-stack LLM application using a React-based frontend and a FastAPI backend hosted on Google Cloud. The backend invokes Gemini models via Vertex AI for inference. During each request, we capture LLM-specific telemetry and stream it to Datadog using custom metrics. Datadog dashboards visualize system health, while detection rules trigger alerts or incidents when anomalies are detected.
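A minimal sketch of that request path follows, assuming a local Datadog Agent receiving DogStatsD metrics; the GCP project ID, Gemini model name, and metric names (llm.sentinel.*) are illustrative placeholders, not the project's actual values.

```python
# Minimal sketch of the request path: FastAPI -> Vertex AI Gemini -> Datadog.
# Project ID, model name, and metric names are illustrative assumptions.
import time

import vertexai
from datadog import initialize, statsd
from fastapi import FastAPI
from pydantic import BaseModel
from vertexai.generative_models import GenerativeModel

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # DogStatsD via local Agent
vertexai.init(project="my-gcp-project", location="us-central1")

app = FastAPI()
model = GenerativeModel("gemini-1.5-flash")

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat")
def chat(req: ChatRequest):
    start = time.perf_counter()
    response = model.generate_content(req.prompt)
    latency_ms = (time.perf_counter() - start) * 1000

    # LLM-specific telemetry streamed to Datadog as custom metrics.
    statsd.histogram("llm.sentinel.latency_ms", latency_ms)
    usage = response.usage_metadata
    statsd.increment("llm.sentinel.tokens.prompt", usage.prompt_token_count)
    statsd.increment("llm.sentinel.tokens.completion", usage.candidates_token_count)

    return {"text": response.text}
```

Emitting histograms rather than gauges lets Datadog compute percentiles (p95, p99) for latency, which is what the dashboards and detection rules key off.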
Challenges we ran into
One of the main challenges was designing meaningful LLM-specific metrics that go beyond traditional system monitoring. Translating concepts like token usage and content safety into actionable observability signals required careful design (a sketch of one such mapping follows). Another challenge was keeping the architecture simple and demo-friendly while still reflecting production-grade observability practices.
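As one illustration of that mapping, each safety rating returned by Gemini can be flattened into a tagged counter so Datadog can alert on elevated harm probabilities. The helper below is a sketch: the metric name and tag scheme are assumptions, and it relies on the safety_ratings field exposed by the vertexai SDK's response objects.

```python
# Sketch: flatten Gemini safety ratings into tagged Datadog counters.
# Metric name and tag scheme are assumptions; safety_ratings comes from the
# vertexai.generative_models response object.
from datadog import statsd

def emit_safety_metrics(response) -> None:
    # One counter per (category, probability) pair on the first candidate.
    for rating in response.candidates[0].safety_ratings:
        statsd.increment(
            "llm.sentinel.safety.rating",
            tags=[
                f"category:{rating.category.name.lower()}",
                f"probability:{rating.probability.name.lower()}",
            ],
        )
```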
Accomplishments that we're proud of
Built an end-to-end working LLM observability system within a limited time
Successfully integrated Google Cloud Vertex AI with Datadog monitoring
Designed real-time dashboards and alerting logic for LLM-specific signals
Delivered a clear, production-relevant demo within a 3-minute constraint
What we learned
We learned that observability is just as critical for AI systems as it is for traditional software. LLMs introduce new failure modes—cost spikes, latency degradation, and safety risks—that require specialized monitoring. We also gained hands-on experience integrating cloud AI services with enterprise observability platforms.
What's next for LLM Sentinel: Observability & Safety for Production AI
Next, we plan to integrate deeper safety analysis, automated remediation workflows, and support for multiple LLM providers. We also aim to extend the platform with cost forecasting, prompt-level tracing, and advanced anomaly detection to further enhance production readiness.