Inspiration
LLM features are everywhere now—support bots, copilots, RAG search—but almost nobody has clear SLOs for things like hallucinations, prompt injection, or unsafe content. Teams usually find out something went wrong because a user complains or screenshots it, not because observability tools raised a signal. At the same time, Google Cloud makes it easy to run Gemini apps on Cloud Run, and Datadog is investing heavily in AI observability. I wanted a small, concrete example that shows what AI SLOs + guardrails could look like in practice with Google Cloud and Datadog, not just another “chat with an LLM” toy.
What it does
AI SLO Guardrail Center is an observability and safety layer for Gemini-powered LLM apps running on Google Cloud Run.
For every request, the service:
- Calls a Gemini model to generate the main answer.
- Calls a second Gemini “judge” model that returns strict JSON with:
  - hallucinationSuspected
  - promptInjectionSuspected
  - unsafeContentSuspected
  - qualityScore (0–1)
Then it emits one structured JSON log per request containing:
- requestId, userId
- Truncated prompt and responseText
- latencyMs, tokensIn, tokensOut, model
- The full evaluation object from the judge
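As a rough illustration of the per-request log event listed above, here is a minimal TypeScript sketch. The field names come from this write-up; the field types, the 500-character cap (`MAX_LOGGED_CHARS`), and the helper names `truncate` and `buildLogEvent` are assumptions made for the example, not the project's actual code.

```typescript
// Judge verdict, using the four field names from the write-up.
interface EvaluationResult {
  hallucinationSuspected: boolean;
  promptInjectionSuspected: boolean;
  unsafeContentSuspected: boolean;
  qualityScore: number; // expected range: 0 to 1
}

// One event per LLM request. Types are assumed; names follow the write-up.
interface LlmRequestEvent {
  eventType: "llm_request";
  requestId: string;
  userId: string;
  prompt: string; // truncated before logging
  responseText: string; // truncated before logging
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
  model: string;
  evaluation: EvaluationResult;
}

// Hypothetical cap: the project does not state its truncation length.
const MAX_LOGGED_CHARS = 500;

function truncate(text: string, max: number = MAX_LOGGED_CHARS): string {
  return text.length <= max ? text : text.slice(0, max) + "...";
}

// Builds the event and truncates the two free-text fields.
function buildLogEvent(
  fields: Omit<LlmRequestEvent, "eventType">
): LlmRequestEvent {
  return {
    ...fields,
    eventType: "llm_request",
    prompt: truncate(fields.prompt),
    responseText: truncate(fields.responseText),
  };
}
```

Serializing the result with `JSON.stringify` yields the single JSON line per request that the rest of the pipeline consumes.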
These logs go into Cloud Logging and are designed to be streamed via the standard Cloud Logging → Pub/Sub → Dataflow → Datadog pipeline, so teams can build dashboards and alerts on top.
On the frontend, a small “control tower” UI shows:
- A request panel (User ID + Prompt)
- Live Performance card (latency, tokens, model)
- Quality card (score with color coding)
- Safety card (badges for hallucination, prompt injection, unsafe content)
- A collapsible Raw JSON panel for debugging
How we built it
- Backend: Node.js + TypeScript + Express, deployed on Google Cloud Run.
- Models: Uses the official Google Gen AI SDK (@google/generative-ai) to call Gemini for:
  - The main answer generation
  - The “judge” evaluation call
- Judge module: A dedicated helper that:
  - Crafts a constrained prompt for the judge
  - Parses the judge’s JSON into a typed EvaluationResult
  - Falls back to safe defaults if parsing or the call fails
- Structured logging: The /api/llm route:
  - Measures latency and token usage
  - Builds a logEvent object with all fields (eventType, requestId, userId, prompt, responseText, latencyMs, tokens, evaluation, status, errorMessage)
  - Prints one JSON line via console.log(JSON.stringify(logEvent)) so it’s easy to forward into Datadog
- Frontend: A simple HTML/CSS/JS single-page dashboard:
  - Dark “control panel” aesthetic
  - Left: request form
  - Right: Performance / Quality / Safety cards + Raw JSON
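The request flow described in this section (main call, judge call, latency measurement, one JSON log line) might look roughly like the sketch below. To keep it runnable without the SDK or Express, the model calls, clock, and log sink are injected through a hypothetical `PipelineDeps` interface. The names `handleLlmRequest`, `ModelReply`, and `PipelineDeps`, the `"ok"`/`"error"` status values, and the choice to include the judge call in `latencyMs` are all assumptions, not details confirmed by the project.

```typescript
interface EvaluationResult {
  hallucinationSuspected: boolean;
  promptInjectionSuspected: boolean;
  unsafeContentSuspected: boolean;
  qualityScore: number;
}

interface ModelReply {
  text: string;
  tokensIn: number;
  tokensOut: number;
}

// Hypothetical seam: real code would wrap the Gemini SDK behind these.
interface PipelineDeps {
  generate: (prompt: string) => Promise<ModelReply>;
  judge: (prompt: string, answer: string) => Promise<EvaluationResult>;
  log: (line: string) => void; // e.g. console.log
  now: () => number; // e.g. Date.now
}

interface LogEvent {
  eventType: "llm_request";
  requestId: string;
  userId: string;
  prompt: string;
  latencyMs: number;
  status: "ok" | "error"; // assumed status values
  responseText?: string;
  tokensIn?: number;
  tokensOut?: number;
  evaluation?: EvaluationResult;
  errorMessage?: string;
}

async function handleLlmRequest(
  deps: PipelineDeps,
  requestId: string,
  userId: string,
  prompt: string
): Promise<LogEvent> {
  const startedAt = deps.now();
  // Prompt truncation is omitted here for brevity.
  const base = { eventType: "llm_request" as const, requestId, userId, prompt };
  let logEvent: LogEvent;
  try {
    const reply = await deps.generate(prompt);
    const evaluation = await deps.judge(prompt, reply.text);
    logEvent = {
      ...base,
      responseText: reply.text,
      tokensIn: reply.tokensIn,
      tokensOut: reply.tokensOut,
      evaluation,
      latencyMs: deps.now() - startedAt, // assumed to include the judge call
      status: "ok",
    };
  } catch (err) {
    logEvent = {
      ...base,
      latencyMs: deps.now() - startedAt,
      status: "error",
      errorMessage: err instanceof Error ? err.message : String(err),
    };
  }
  // Exactly one JSON line per request, on success and on failure.
  deps.log(JSON.stringify(logEvent));
  return logEvent;
}
```

In an Express handler, this function would be called with `console.log` as `log` and `Date.now` as `now`. Injecting them keeps the logic testable without network access.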
Challenges we ran into
- Model / API evolution: Keeping up with the latest Gemini models and SDK patterns, and avoiding deprecated endpoints.
- Strict JSON from an LLM: Getting the judge to always return parseable JSON (no markdown, no extra text) required careful prompt design and defensive parsing.
- Windows + Cloud Run deployment: Many examples assume Linux-style shells; we had to adapt deployment to clean, single-line gcloud run deploy commands that work reliably on Windows.
- New observability UIs: Both Cloud Logging and Datadog have newer onboarding/query flows than many tutorials show, so verifying logs and wiring the pipeline required some exploration and debugging.
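To make the "defensive parsing" point above concrete, here is one way such a parser could look. It is a sketch under assumptions: the project does not say what its safe defaults are, so `SAFE_DEFAULTS` (all flags false, score 0) is a guess, and `parseJudgeOutput` is a hypothetical name.

```typescript
interface EvaluationResult {
  hallucinationSuspected: boolean;
  promptInjectionSuspected: boolean;
  unsafeContentSuspected: boolean;
  qualityScore: number;
}

// Assumed defaults; the write-up only says "safe defaults".
const SAFE_DEFAULTS: EvaluationResult = {
  hallucinationSuspected: false,
  promptInjectionSuspected: false,
  unsafeContentSuspected: false,
  qualityScore: 0,
};

function parseJudgeOutput(raw: string): EvaluationResult {
  // Keep only the outermost {...} so markdown fences or stray prose
  // around the JSON are ignored.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return { ...SAFE_DEFAULTS };

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return { ...SAFE_DEFAULTS };
  }
  if (typeof parsed !== "object" || parsed === null) return { ...SAFE_DEFAULTS };

  const obj = parsed as Record<string, unknown>;
  // A flag counts as set only if it is the literal boolean true.
  const flag = (key: string): boolean => obj[key] === true;
  const rawScore = obj["qualityScore"];
  const score =
    typeof rawScore === "number" && Number.isFinite(rawScore)
      ? Math.min(1, Math.max(0, rawScore)) // clamp into [0, 1]
      : SAFE_DEFAULTS.qualityScore;

  return {
    hallucinationSuspected: flag("hallucinationSuspected"),
    promptInjectionSuspected: flag("promptInjectionSuspected"),
    unsafeContentSuspected: flag("unsafeContentSuspected"),
    qualityScore: score,
  };
}
```

Treating a missing or non-boolean flag as `false` is a design choice. A stricter variant could instead mark the whole evaluation as failed, so that malformed judge output is never mistaken for a clean result.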
Accomplishments that we're proud of
- Built a real Gemini API service on Cloud Run instead of a purely local prototype.
- Implemented a reusable LLM-as-a-judge pattern that surfaces hallucination, prompt injection, and unsafe content per request.
- Designed a structured log schema (llm_request events) that is immediately ready to plug into the Google Cloud → Datadog log streaming pattern.
- Created a small but effective “guardrail control tower” UI that makes performance, quality, and safety signals understandable at a glance.
- Kept the architecture thin and composable so it can sit in front of any Gemini/Vertex-based app without rewriting the core product.
What we learned
- AI observability is about semantics, not just metrics. Latency and error rate are not enough; you also need to know “Was this answer safe?”, “Was this probably a hallucination?”, and “Did someone try prompt injection?”
- The Google Gen AI SDK is a solid bridge between the Gemini Developer API and Vertex AI Gemini: you can prototype quickly and still have a path to a more enterprise Vertex deployment later.
- Cloud Run + Cloud Logging give a clean base for structured logs, and designing logs as “one JSON event per LLM call” makes integration with tools like Datadog much simpler.
- In a hackathon setting, a tight vertical slice (one endpoint, one judge, one log format, one dashboard) is more realistic and valuable than trying to clone a full enterprise observability platform.
What's next for AI SLO Guardrail Center
- Full Vertex AI integration: Switch from API-key mode to Vertex AI Gemini with service accounts and pull in model-level metrics (latency, errors, resource usage) from Vertex AI monitoring.
- Richer Datadog dashboards: Build full dashboards and log-based monitors for p95 latency, hallucination rate, prompt injection incidents, unsafe content, and cost per request.
- Business KPIs on top of SLOs: Correlate guardrail breaches with ticket resolution time, user satisfaction, conversion, or churn, so incidents are measured in terms of real impact.
- Auto-remediation workflows: Use Datadog monitors and workflows (or other orchestrators) to automatically roll back models/prompts or route risky traffic to humans when thresholds are exceeded.
- Multi-app control tower: Extend the pattern so multiple LLM services (chatbots, copilots, RAG APIs) can all send llm_request events into one shared control center, each with its own SLOs but a unified incident view.
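As a rough idea of what the planned dashboard metrics above would compute, the sketch below derives p95 latency and the three guardrail rates from a batch of llm_request events. In practice these would more likely be log-based metrics inside Datadog; this local version only illustrates the definitions. The names `percentile` and `summarize` are hypothetical, and the nearest-rank percentile method is an assumption.

```typescript
// Only the fields needed for these indicators.
interface SliInput {
  latencyMs: number;
  evaluation: {
    hallucinationSuspected: boolean;
    promptInjectionSuspected: boolean;
    unsafeContentSuspected: boolean;
  };
}

// Nearest-rank percentile: the smallest value such that at least
// p percent of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index] ?? NaN;
}

interface SliSummary {
  p95LatencyMs: number;
  hallucinationRate: number;
  promptInjectionRate: number;
  unsafeContentRate: number;
}

function summarize(events: SliInput[]): SliSummary {
  const total = events.length;
  // Fraction of events matching a predicate; 0 for an empty batch.
  const rate = (pred: (e: SliInput) => boolean): number =>
    total === 0 ? 0 : events.filter(pred).length / total;
  return {
    p95LatencyMs: percentile(events.map((e) => e.latencyMs), 95),
    hallucinationRate: rate((e) => e.evaluation.hallucinationSuspected),
    promptInjectionRate: rate((e) => e.evaluation.promptInjectionSuspected),
    unsafeContentRate: rate((e) => e.evaluation.unsafeContentSuspected),
  };
}
```

An SLO could then be phrased against these values, for example "hallucinationRate stays below some agreed threshold over a rolling window", with the threshold chosen per application.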
Built With
- antigravity
- artifact-registry
- cloud-build
- cloud-storage
- css
- datadog
- dataflow
- express.js
- git
- github
- google-cloud-logging
- google-cloud-run
- google-gen-ai-sdk-(@google/generative-ai)
- html
- javascript
- node.js
- npm
- pub/sub
- secret-manager
- ts-node-dev
- typescript
- vanilla-javascript