Inspiration
As LLMs move from experimentation into production, teams struggle to monitor how these systems behave in real-world conditions. Traditional observability tools fail to capture LLM-specific signals such as token usage, inference latency, and content safety. We were inspired to bridge this gap by building a platform that makes LLMs observable, reliable, and production-ready.
What it does
LLM Sentinel is an observability and safety platform for production LLM applications. It captures real-time signals such as response latency, token consumption, and content safety ratings for LLMs running on Google Cloud Vertex AI. These signals are streamed into Datadog, where dashboards, detection rules, and alerts help engineers quickly identify performance issues, cost anomalies, and safety risks.
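For instance, a latency detection rule can be managed as code through Datadog's monitor API. The snippet below is a hedged sketch using the official datadog Python client; the metric name llm.sentinel.latency_ms, the 2 s threshold, and the tags are placeholder assumptions rather than the project's actual configuration.

```python
# Sketch: create a Datadog detection rule (metric monitor) for latency spikes.
# Metric name, threshold, and tags are placeholders, not the project's config.
import os

from datadog import initialize, api

initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:llm.sentinel.latency_ms{env:prod} > 2000",
    name="LLM Sentinel: high response latency",
    message="Average LLM response latency exceeded 2s over the last 5 minutes.",
    tags=["service:llm-sentinel"],
)
```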
How we built it
We built a full-stack LLM application using a React-based frontend and a FastAPI backend hosted on Google Cloud. The backend invokes Gemini models via Vertex AI for inference. During each request, we capture LLM-specific telemetry and stream it to Datadog using custom metrics. Datadog dashboards visualize system health, while detection rules trigger alerts or incidents when anomalies are detected.
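A minimal sketch of that request path follows, assuming a local Datadog Agent receiving DogStatsD metrics; the GCP project ID, Gemini model name, and metric names (llm.sentinel.*) are illustrative placeholders, not the project's actual values.

```python
# Minimal sketch of the request path: FastAPI -> Vertex AI Gemini -> Datadog.
# Project ID, model name, and metric names are illustrative assumptions.
import time

import vertexai
from datadog import initialize, statsd
from fastapi import FastAPI
from pydantic import BaseModel
from vertexai.generative_models import GenerativeModel

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # DogStatsD via local Agent
vertexai.init(project="my-gcp-project", location="us-central1")

app = FastAPI()
model = GenerativeModel("gemini-1.5-flash")

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat")
def chat(req: ChatRequest):
    start = time.perf_counter()
    response = model.generate_content(req.prompt)
    latency_ms = (time.perf_counter() - start) * 1000

    # LLM-specific telemetry streamed to Datadog as custom metrics.
    statsd.histogram("llm.sentinel.latency_ms", latency_ms)
    usage = response.usage_metadata
    statsd.increment("llm.sentinel.tokens.prompt", usage.prompt_token_count)
    statsd.increment("llm.sentinel.tokens.completion", usage.candidates_token_count)

    return {"text": response.text}
```

Emitting histograms rather than gauges lets Datadog compute percentiles (p95, p99) for latency, which is what the dashboards and detection rules key off.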
Challenges we ran into
One of the main challenges was designing meaningful LLM-specific metrics that go beyond traditional system monitoring. Translating concepts like token usage and content safety into actionable observability signals required careful design (a sketch of one such mapping follows). Another challenge was keeping the architecture simple and demo-friendly while still reflecting production-grade observability practices.
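As one illustration of that mapping, each safety rating returned by Gemini can be flattened into a tagged counter so Datadog can alert on elevated harm probabilities. The helper below is a sketch: the metric name and tag scheme are assumptions, and it relies on the safety_ratings field exposed by the vertexai SDK's response objects.

```python
# Sketch: flatten Gemini safety ratings into tagged Datadog counters.
# Metric name and tag scheme are assumptions; safety_ratings comes from the
# vertexai.generative_models response object.
from datadog import statsd

def emit_safety_metrics(response) -> None:
    # One counter per (category, probability) pair on the first candidate.
    for rating in response.candidates[0].safety_ratings:
        statsd.increment(
            "llm.sentinel.safety.rating",
            tags=[
                f"category:{rating.category.name.lower()}",
                f"probability:{rating.probability.name.lower()}",
            ],
        )
```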
Accomplishments that we're proud of
Built an end-to-end working LLM observability system within a limited time
Successfully integrated Google Cloud Vertex AI with Datadog monitoring
Designed real-time dashboards and alerting logic for LLM-specific signals
Delivered a clear, production-relevant demo within a 3-minute constraint
What we learned
We learned that observability is just as critical for AI systems as it is for traditional software. LLMs introduce new failure modes—cost spikes, latency degradation, and safety risks—that require specialized monitoring. We also gained hands-on experience integrating cloud AI services with enterprise observability platforms.
What's next for LLM Sentinel: Observability & Safety for Production AI
Next, we plan to integrate deeper safety analysis, automated remediation workflows, and support for multiple LLM providers. We also aim to extend the platform with cost forecasting, prompt-level tracing, and advanced anomaly detection to further enhance production readiness.