Inspiration

LLMs are moving from prototypes into real customer-facing products, and the failure modes change. A service can be “up” while users still suffer: long LLM latency, sudden token/cost spikes from prompt regressions, or security risks like prompt injection and PII leakage. I built LLM Reliability & Security Sentinel to treat these LLM-specific risks as first-class production signals—measurable, monitorable, and actionable—so teams can ship LLM features with confidence.

What it does

LLM Reliability & Security Sentinel is a small FastAPI service that calls Gemini on Vertex AI and streams LLM and runtime telemetry to Datadog, then automatically provisions monitoring and response workflows. It provides deterministic request modes to reproduce real-world issues on demand:

  • normal: standard LLM call
  • slow: adds a 6–10s async delay to reliably trigger latency monitoring
  • error: forces an LLM error trace + HTTP 500 to reliably trigger reliability monitoring
  • token_spike: inflates prompt/output to spike total tokens
  • cost_spike: loops multiple Gemini calls to spike estimated cost
  • prompt_injection: injects obvious attack patterns, emits a SECURITY signal log + metric
  • pii_leak: emits clearly labeled FAKE PII patterns, triggers PII detection

When a detection rule triggers, Datadog notifies a webhook endpoint in the app, and the app creates Datadog Cases with context, triage steps, and runbook guidance (with deduplication to prevent alert spam).
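The deduplication step can be pictured as a small time-window filter in front of case creation. The sketch below is an illustration only, not the project's actual code: the class name `CaseDeduplicator`, the fingerprint fields (`monitor_id`, `signal_type`), and the 15-minute default window are all assumptions made for the example.

```python
import hashlib
import time
from typing import Optional


class CaseDeduplicator:
    """Suppress repeat case creation for the same alert within a time window.

    Hypothetical helper: the names and the default window are illustrative.
    """

    def __init__(self, window_seconds: float = 900.0) -> None:
        self.window_seconds = window_seconds
        self._last_seen: dict[str, float] = {}

    @staticmethod
    def fingerprint(monitor_id: str, signal_type: str) -> str:
        # Stable short key so the same monitor + signal maps to one case.
        raw = f"{monitor_id}:{signal_type}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()[:16]

    def should_create_case(
        self, monitor_id: str, signal_type: str, now: Optional[float] = None
    ) -> bool:
        # `now` is injectable so the behaviour can be checked without sleeping.
        now = time.monotonic() if now is None else now
        key = self.fingerprint(monitor_id, signal_type)
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: skip case creation
        self._last_seen[key] = now
        return True
```

A webhook handler would call `should_create_case(...)` before talking to the Cases API and simply acknowledge the notification when it returns `False`. An in-memory dictionary like this only deduplicates within a single process, so a multi-instance deployment would need shared state instead.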

How I built it

  • Backend: Python 3.11 + FastAPI, with request ID middleware and structured JSON logging.
  • LLM: Vertex AI Gemini via the google-genai SDK (import google.genai as genai) using Vertex mode (project + region).
  • Observability pipeline (Datadog):
      • LLM Observability (ddtrace LLMObs): wraps model calls and annotates spans with provider/model, prompt/response sizes, token estimates, and cost estimates.
      • Agentless logs: sent to Datadog Logs HTTP intake (structured, queryable fields).
      • Agentless metrics: sent via Datadog Metrics API using required metric names (HTTP counts/latency/errors, mode counts, security signals, cost fallback).
  • Security detection: simple, transparent heuristics (regex-based prompt injection detection; regex-based PII detection with “FAKE” label to keep the demo safe).
  • Automation: scripts to bootstrap GCP, deploy to Cloud Run, provision Datadog resources via API, export JSON configs for submission, and generate traffic to trigger monitors quickly.
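To make the security heuristics concrete, here is a minimal sketch of what regex-based detection of this kind can look like. The specific patterns, the pattern names, and the function names are assumptions chosen for illustration; the project's real rule set is not shown in this write-up.

```python
import re

# Illustrative patterns only; a real rule set would be broader and tuned.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your |the )?system prompt", re.IGNORECASE),
    re.compile(r"disregard (your |the )?(rules|guardrails)", re.IGNORECASE),
]

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def detect_prompt_injection(text: str) -> bool:
    """Return True if any known injection phrase appears in the text."""
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)


def detect_pii(text: str) -> list[str]:
    """Return the sorted names of PII patterns that match the text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))
```

For example, `detect_pii("FAKE SSN 123-45-6789")` reports `["ssn"]`, which the service could then attach to a log line and a security metric. Heuristics like these are easy to audit but also easy to evade, which is the trade-off the "transparent heuristics" wording points at.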

Challenges we ran into

  • Model availability and region differences: Gemini model access can vary by region/project. We added fallback logic across common model names and locations to keep deployments resilient.
  • Making LLM observability “demoable” reliably: not every provider path produces the same metrics in every environment, so we added span annotations and a cost fallback metric to keep dashboards meaningful.
  • Log indexing differences: some orgs don’t facet env the same way for custom HTTP intake logs. We tuned the dashboard’s security log stream query to match the actual indexed fields so it populates reliably.
  • End-to-end action items: incidents APIs can vary by org plan. We made response creation robust by using Datadog Cases (and replay tooling to make the demo reliable for recording).
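The model and region fallback described above can be sketched as a loop over candidate pairs. This is a hedged illustration rather than the project's implementation: `generate_with_fallback` and the `call_model` callable are hypothetical names, and the actual Gemini call is injected so the sketch makes no claims about the SDK's API.

```python
from typing import Callable, Iterable, Tuple


def generate_with_fallback(
    prompt: str,
    candidates: Iterable[Tuple[str, str]],
    call_model: Callable[[str, str, str], str],
) -> Tuple[str, str, str]:
    """Try each (model, location) pair in order and return the first success.

    `call_model(model, location, prompt)` is a hypothetical wrapper around the
    real LLM client; it should raise on failure and return the response text.
    """
    errors = []
    for model, location in candidates:
        try:
            text = call_model(model, location, prompt)
            return model, location, text
        except Exception as exc:  # broad on purpose: any failure moves on
            errors.append(f"{model}@{location}: {exc}")
    raise RuntimeError("all model candidates failed: " + "; ".join(errors))
```

Returning the model and location that actually served the request makes it possible to tag traces and metrics with the path that was used, so a silent fallback still shows up on a dashboard.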

Accomplishments that we're proud of

  • End-to-end, API-only setup: monitors, SLOs, dashboard, webhook integration, and case project are created and exported without manual UI configuration.
  • Deterministic “LLM chaos testing”: one endpoint with modes that reproduce latency, errors, token spikes, cost spikes, injection attempts, and PII leakage, which makes it ideal for validation and demos.
  • Actionable response automation: webhook-triggered, deduplicated Cases with investigation context (dashboard/monitor links, queries, runbook steps).
  • Production-minded engineering: request IDs for correlation, timeouts, retries/backoff, structured logs, and clean error handling.
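As an illustration of the retries/backoff item, the following is a minimal sketch of exponential backoff with a small jitter. The function name, the default attempt count, and the delay values are assumptions for the example, not settings taken from the project.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` until it succeeds or `max_attempts` is exhausted.

    The delay doubles after each failure (capped at `max_delay`) and gets up
    to 10% random jitter. `sleep` is injectable so tests do not have to wait.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

In practice the retried callable would be limited to operations that are safe to repeat, such as sending a metrics batch, and a production version would usually catch only transient error types rather than every exception.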

What we learned

  • LLM systems need LLM-native telemetry. CPU and generic HTTP metrics aren’t enough: tokens, cost, and model latency must be first-class signals.
  • “Agentless” observability can still be production-grade if you keep the data structured and automate provisioning and exports.
  • Even simple security heuristics create immediate value when they’re consistently logged, metered, and tied to an action workflow.
  • SLOs are most credible when they’re built from the same signals that drive alerting (monitor-based SLOs).

What's next for LLM Reliability & Security Sentinel

  • Replace heuristic detectors with more robust policy enforcement (structured output validation, configurable redaction/blocking, allow/deny lists).
  • Add multi-tenant or per-feature tagging (team/app/route) so larger organizations can segment reliability and cost by product surface.
  • Add automated remediation playbooks (rate limiting, output caps, safe-mode prompting, or model fallback) triggered by monitors.
  • Expand evaluation signals to include quality metrics (refusal rates, hallucination indicators, tool-call failures) while keeping the same end-to-end observability and response pattern.

Built With

  • ai
  • api
  • artifact
  • build
  • cloud
  • datadog
  • datadog-api-client-python
  • ddtrace
  • fastapi
  • gemini
  • google
  • google-genai
  • gunicorn
  • httpx
  • llm
  • logs
  • metrics
  • observability
  • pydantic
  • python
  • registry
  • run
  • uvicorn
  • vertex