About the Project

Inspiration

As LLM-based applications move into production, I noticed a recurring gap: observability tools show metrics, but they don’t explain failures. Diagnosing latency spikes, rising token costs, hallucinations, and silent errors often requires manual correlation across dashboards, logs, and tribal knowledge. I wanted to build an AI SRE copilot that reasons like a human SRE, looking at numbers, logs, and visual dashboards together, and explains why an incident happened and what to do next.

What I Built

AI SRE Copilot is a multimodal SRE assistant for LLM systems. It streams LLM telemetry to Datadog, detects anomalies, captures live dashboard screenshots, and uses Google Cloud Vertex AI (Gemini) to perform multimodal root-cause analysis across:

  • Text (incident metadata)
  • Metrics (latency, error rate, tokens, cost)
  • Logs (structured events)
  • Images (Datadog dashboard snapshots)

The output is a clear, structured response: root cause, impact, immediate mitigation, and prevention steps.
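
To make the four-part output concrete, here is a minimal sketch of how such a response could be represented and validated. The class name `RCAReport`, the function `parse_rca`, and the JSON key names are assumptions chosen for illustration; the project's actual schema is not shown in this write-up.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RCAReport:
    """Hypothetical container for the four parts of the copilot's answer."""
    root_cause: str
    impact: str
    immediate_mitigation: list[str] = field(default_factory=list)
    prevention: list[str] = field(default_factory=list)


def parse_rca(raw: str) -> RCAReport:
    """Parse a JSON string (for example, model output) into an RCAReport.

    Raises ValueError if a required key is missing.
    """
    data = json.loads(raw)
    missing = [key for key in ("root_cause", "impact") if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return RCAReport(
        root_cause=data["root_cause"],
        impact=data["impact"],
        # Optional sections default to empty lists.
        immediate_mitigation=list(data.get("immediate_mitigation", [])),
        prevention=list(data.get("prevention", [])),
    )
```

Validating the shape up front means a malformed model response fails loudly instead of producing a half-empty incident report.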

How I Built It

  • Backend: FastAPI deployed on Cloud Run
  • AI: Gemini 1.5 Pro via Vertex AI for multimodal reasoning
  • Observability: Datadog for metrics, logs, dashboards, monitors, and the Snapshot API
  • Multimodal Ingestion: Dashboard PNGs are captured via Datadog’s Snapshot API and passed to Gemini alongside metrics and logs
  • Deployment: Dockerized service with serverless scaling

High-level flow:

  1. LLM app emits custom metrics and logs
  2. Datadog monitors detect anomalies
  3. A dashboard snapshot (PNG) is captured
  4. Gemini analyzes text + metrics + logs + image
  5. Actionable RCA is returned
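
The flow above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's code: `Incident`, `is_anomalous`, and `run_pipeline` are hypothetical names, the fixed latency threshold stands in for a Datadog monitor, and the snapshot capture and Gemini call are passed in as callables so that no Datadog or Vertex AI API signature is assumed here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Incident:
    """Hypothetical bundle of the four modalities sent to the model."""
    title: str                 # text: incident metadata
    metrics: dict[str, float]  # metrics: summary numbers
    logs: list[str]            # logs: structured events as lines
    snapshot_png: bytes        # image: dashboard snapshot


def is_anomalous(latency_ms: Sequence[float], threshold_ms: float) -> bool:
    """Stand-in for step 2: mean latency above a fixed threshold."""
    if not latency_ms:
        return False
    return sum(latency_ms) / len(latency_ms) > threshold_ms


def run_pipeline(
    title: str,
    latency_ms: Sequence[float],
    logs: Sequence[str],
    capture_snapshot: Callable[[], bytes],
    analyze: Callable[[Incident], str],
    threshold_ms: float = 2000.0,
) -> Optional[str]:
    """Run steps 2 to 5; return None when nothing looks wrong."""
    if not is_anomalous(latency_ms, threshold_ms):
        return None
    incident = Incident(
        title=title,
        metrics={"avg_latency_ms": sum(latency_ms) / len(latency_ms)},
        logs=list(logs),
        snapshot_png=capture_snapshot(),  # step 3: only capture on anomaly
    )
    return analyze(incident)  # steps 4 and 5: multimodal analysis, RCA text
```

Injecting `capture_snapshot` and `analyze` keeps the orchestration testable with fakes, which also helps when a demo has to trigger incidents on demand.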

What I Learned

  • Multimodal reasoning is powerful for SRE workflows: dashboards are first-class data, not just visuals.
  • LLM reliability requires LLM-native signals (tokens, cost, hallucination risk), not just generic infra metrics.
  • Clear, explainable AI outputs matter more than black-box automation in incident response.
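
As an illustration of the LLM-native signals mentioned above, the sketch below derives token and cost metrics for a single request. The function name, the metric names, and the per-1K-token prices are all placeholders, not the project's real metric schema or any provider's actual pricing.

```python
def llm_request_metrics(
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> dict[str, float]:
    """Derive per-request LLM signals (metric names are hypothetical)."""
    cost = (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )
    return {
        "llm.tokens.prompt": float(prompt_tokens),
        "llm.tokens.completion": float(completion_tokens),
        "llm.tokens.total": float(prompt_tokens + completion_tokens),
        "llm.cost.usd": round(cost, 6),
        "llm.latency.ms": float(latency_ms),
    }
```

Each key could then be emitted as a custom metric, so that cost and token usage can be monitored alongside latency and error rate.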

Challenges

  • Designing a reliable demo flow that consistently triggers incidents
  • Ensuring strict compliance (using only Google Cloud AI)
  • Keeping multimodal prompts concise while preserving context
  • Balancing realism with hackathon scope and time limits
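
One way to keep a multimodal prompt concise while preserving context is to cap the number and length of log lines that accompany the image. The sketch below shows that idea only; `build_prompt`, its limits, and the prompt wording are assumptions, not the prompt used in the project.

```python
from typing import Mapping, Sequence


def build_prompt(
    title: str,
    metrics: Mapping[str, float],
    logs: Sequence[str],
    max_log_lines: int = 20,
    max_line_chars: int = 200,
) -> str:
    """Build the text part of the prompt with a bounded log section."""
    # Keep only the most recent lines; guard against the [-0:] slice pitfall.
    recent = list(logs)[-max_log_lines:] if max_log_lines > 0 else []
    trimmed = [
        line if len(line) <= max_line_chars else line[:max_line_chars] + "..."
        for line in recent
    ]
    metric_lines = [f"- {name}: {value}" for name, value in sorted(metrics.items())]
    sections = [
        f"Incident: {title}",
        "Metrics:",
        *metric_lines,
        f"Recent logs (last {len(trimmed)} of {len(logs)}):",
        *trimmed,
        "Respond with: root cause, impact, immediate mitigation, prevention.",
    ]
    return "\n".join(sections)
```

Stating how many log lines were dropped lets the model know the context is partial instead of assuming it has seen everything.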

Outcome

AI SRE Copilot demonstrates how multimodal AI can turn observability data into decisions, reducing mean time to resolution and enabling safer, more cost-aware LLM operations in production.

Compliance Note: This project uses only Google Cloud Vertex AI (Gemini) and Datadog. No third-party AI services are used.

Built With

  • datadog (metrics, logs, dashboards, snapshot-api)
  • fastapi
  • google-cloud-run
  • google-cloud-vertex-ai (gemini-1.5-pro)
  • python