Project Story — LLM Guardian

🚀 Inspiration

As AI applications scale, ensuring reliability, safety, and observability becomes increasingly difficult. Traditional monitoring tools aren’t designed for the failure modes specific to LLMs, such as prompt injection, drift in model responses, unpredictable latency spikes, and hidden token usage.

We wanted to create a system that gives AI engineers the same level of visibility that SRE teams have for cloud apps. This inspired us to build LLM Guardian, an end-to-end AI observability platform combining the intelligence of Google Vertex AI/Gemini with the powerful monitoring ecosystem of Datadog.


🧠 What We Learned

Building this project taught us several important lessons:

  • Modern LLM telemetry is more than logs: it includes latency, token count, context size, and safety scores.
  • Datadog APM and Metrics can be extended to support AI workloads using structured events.
  • Vertex AI/Gemini exposes fine-grained model metadata useful for monitoring.
  • Combining real-time anomaly detection with LLM telemetry improves reliability and reduces debugging time.
  • Observability for AI requires both performance monitoring and prompt-level security monitoring.
  • A well-designed pipeline can surface issues before they reach users.

We also learned how to design alerting thresholds using statistical formulas, such as

\[ \text{Latency Threshold} = \mu_{\text{latency}} + 3\sigma_{\text{latency}} \]

to detect abnormal model behavior.
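A minimal sketch of the three-sigma rule above, assuming latencies are collected in milliseconds and that a population standard deviation is acceptable for the baseline. The function names `latency_threshold` and `is_anomalous` are illustrative, not part of the project's code.

```python
import statistics


def latency_threshold(latencies_ms, k=3.0):
    """Return mean + k * standard deviation of the baseline latencies."""
    mu = statistics.fmean(latencies_ms)
    sigma = statistics.pstdev(latencies_ms)  # population std dev (assumption)
    return mu + k * sigma


def is_anomalous(latency_ms, baseline_ms, k=3.0):
    """True when a new latency exceeds the k-sigma threshold of the baseline."""
    return latency_ms > latency_threshold(baseline_ms, k)
```

With a baseline of 90, 100, and 110 ms, the threshold lands near 124.5 ms, so a 130 ms request would be flagged and a 120 ms request would not.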


🏗️ How We Built the Project

We designed LLM Guardian as a modular, cloud-native system:

1. LLM Backend

  • Built using Google Vertex AI / Gemini models
  • Exposes an API for prompts, responses, and metadata
  • Emits structured telemetry (latency, token counts, safety flags)

2. Telemetry Pipeline

  • Google Cloud Run + Cloud Functions collect runtime logs
  • Logs flow into Datadog via Log Intake API
  • Metrics (token count, success rate, latency) pushed to Datadog Metrics API
  • Traces captured using Datadog APM SDK
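As a rough illustration of the metrics step, the sketch below builds a request body in the shape expected by Datadog's v2 series endpoint, where type `3` denotes a gauge. The metric names, tags, and the `build_series_payload` helper are assumptions made for this example, and the HTTP call itself is omitted; check the current Datadog API reference before relying on the exact fields.

```python
import json
import time

GAUGE = 3  # metric type code for "gauge" in the v2 series payload


def build_series_payload(latency_ms, total_tokens, model, now=None):
    """Build a metrics payload for one LLM request (hypothetical metric names)."""
    ts = int(now if now is not None else time.time())
    tags = [f"model:{model}", "service:llm-guardian"]

    def gauge(name, value):
        return {
            "metric": name,
            "type": GAUGE,
            "points": [{"timestamp": ts, "value": float(value)}],
            "tags": tags,
        }

    return {
        "series": [
            gauge("llm.request.latency_ms", latency_ms),
            gauge("llm.request.total_tokens", total_tokens),
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_series_payload(420.5, 318, "gemini"), indent=2))
```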

3. Datadog Observability Layer

We built:

  • Dashboards for latency, token usage, model reliability, and throughput
  • Security dashboard for prompt anomalies
  • Detection rules for:

    • Latency spikes
    • Error rate deviations
    • Suspicious query patterns
    • Sudden token inflation

4. Incident Automation

When detection rules fire:

  • Datadog triggers an Incident
  • Context includes logs, last 5 prompts, error messages, and statistical summaries
  • AI engineers can take action instantly
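One way the incident context could be assembled is sketched below, assuming the service keeps a small in-memory buffer per model. The `IncidentContext` class and its field names are hypothetical; only the "last 5 prompts" and "statistical summaries" requirements come from the description above.

```python
import statistics
from collections import deque


class IncidentContext:
    """Keeps the most recent prompts and latencies for incident reports."""

    def __init__(self, max_prompts=5):
        self.prompts = deque(maxlen=max_prompts)  # oldest entries drop off
        self.latencies_ms = []

    def record(self, prompt, latency_ms):
        self.prompts.append(prompt)
        self.latencies_ms.append(latency_ms)

    def snapshot(self, error_message):
        """Return a JSON-friendly summary to attach to an incident."""
        return {
            "error": error_message,
            "last_prompts": list(self.prompts),
            "latency_mean_ms": statistics.fmean(self.latencies_ms),
            "latency_max_ms": max(self.latencies_ms),
            "request_count": len(self.latencies_ms),
        }
```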

🧩 Challenges We Faced

Building observability specifically for LLMs came with unique challenges:

1. Structuring Telemetry for AI Models

LLM responses don’t always produce consistent metadata. Designing a schema that captured:

  • token usage
  • response length
  • latency
  • safety signals

required custom formatting.
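A schema along these lines can tolerate missing metadata by making most fields optional. This is a sketch under assumptions: the class name `LLMTelemetryEvent` and the field names are invented for illustration and do not reflect the project's actual schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LLMTelemetryEvent:
    """One structured telemetry record per model call (hypothetical fields)."""

    model: str
    latency_ms: float
    prompt_tokens: Optional[int] = None      # may be absent in some responses
    completion_tokens: Optional[int] = None  # may be absent in some responses
    response_chars: Optional[int] = None     # response length in characters
    safety_flags: List[str] = field(default_factory=list)

    def to_log(self):
        """Serialize to a dict, dropping fields the model did not report."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Dropping `None` values keeps log events compact and avoids emitting placeholder numbers that would skew dashboards.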

2. Real-Time Detection Sensitivity

If thresholds were too strict, we got too many false alerts; if they were too loose, incidents were missed.

We solved this by using rolling window statistics:

\[ \text{Dynamic Threshold} = \mu_{t-5:t} + 2\sigma_{t-5:t} \]
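The rolling-window rule can be sketched as follows, assuming the window holds the five most recent observations and that each new value is compared against the threshold computed before it joins the window. The `RollingThreshold` class is an illustrative helper, not the project's implementation.

```python
import statistics
from collections import deque


class RollingThreshold:
    """Flags values above mean + k * std dev of a sliding window."""

    def __init__(self, window=5, k=2.0):
        self.values = deque(maxlen=window)
        self.k = k

    def threshold(self):
        """Current threshold, or None until at least two values are seen."""
        if len(self.values) < 2:
            return None
        mu = statistics.fmean(self.values)
        sigma = statistics.pstdev(self.values)
        return mu + self.k * sigma

    def observe(self, value):
        """Check value against the current window, then add it to the window."""
        limit = self.threshold()
        is_alert = limit is not None and value > limit
        self.values.append(value)
        return is_alert
```

Note that a flagged spike still enters the window in this sketch, which temporarily raises the threshold; excluding flagged values is a possible variation.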

3. Visualizing AI-specific metrics

Datadog dashboards needed new custom metrics, such as:

  • “Prompt Toxicity Confidence”
  • “Token Drift Factor”
  • “Semantic Response Variability”

4. Streaming Logs Efficiently

LLM logs can be large; we optimized payloads by batching and compressing JSON events.
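Batching and compression of this kind can be sketched with the standard library alone. The batch size of 100 and the helper names are assumptions; a gzip-compressed body would typically be sent with a `Content-Encoding: gzip` header, and the send step is left out here.

```python
import gzip
import json


def batch_events(events, max_batch=100):
    """Yield consecutive slices of at most max_batch events."""
    for start in range(0, len(events), max_batch):
        yield events[start:start + max_batch]


def compress_batch(batch):
    """Serialize a batch to compact JSON and gzip it."""
    body = json.dumps(batch, separators=(",", ":")).encode("utf-8")
    return gzip.compress(body)
```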

5. Mitigating Prompt Attacks

We added detection rules for patterns like:

  • SQL injection attempts
  • Security bypass prompts
  • Role play override prompts
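A simple pattern-based pre-filter for these categories might look like the sketch below. The regular expressions and rule names are illustrative assumptions chosen for this example; they are far from exhaustive, and a phrase such as "act as" will also match benign prompts, so matches are better treated as signals for review than as verdicts.

```python
import re

# Hypothetical rule set: one compiled pattern per attack category.
RULES = {
    "sql_injection": re.compile(
        r"\bunion\s+select\b|\bdrop\s+table\b|\bor\s+1\s*=\s*1\b",
        re.IGNORECASE,
    ),
    "security_bypass": re.compile(
        r"\b(ignore|disregard)\s+(all\s+)?(previous|prior|above)\s+instructions\b",
        re.IGNORECASE,
    ),
    "role_override": re.compile(
        r"\b(you\s+are\s+now|pretend\s+(to\s+be|you\s+are)|act\s+as)\b",
        re.IGNORECASE,
    ),
}


def match_rules(prompt):
    """Return the sorted names of all rules whose pattern appears in the prompt."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(prompt))
```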
