AI SRE Copilot — Multimodal LLM Observability

About the Project

Inspiration

As LLM-based applications move into production, I noticed a recurring gap: observability tools show metrics, but they don’t explain failures. Diagnosing latency spikes, rising token costs, hallucinations, and silent errors often requires manual correlation across dashboards, logs, and tribal knowledge. I wanted to build an AI SRE copilot that reasons like a human SRE, looking at numbers, logs, and visual dashboards together, and explains why an incident happened and what to do next.

What I Built

AI SRE Copilot is a multimodal SRE assistant for LLM systems. It streams LLM telemetry to Datadog, detects anomalies, captures live dashboard screenshots, and uses Google Cloud Vertex AI (Gemini) to perform multimodal root-cause analysis across:

Text (incident metadata)
Metrics (latency, error rate, tokens, cost)
Logs (structured events)
Images (Datadog dashboard snapshots)

The output is a clear, structured response: root cause, impact, immediate mitigation, and prevention steps.
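As a minimal sketch of what that four-part response could look like in code, the snippet below assumes the model is asked to reply in JSON. The class name RcaReport, the function parse_rca, and the key names are illustrative assumptions, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RcaReport:
    """Hypothetical container for the four-part response described above."""
    root_cause: str
    impact: str
    immediate_mitigation: list[str] = field(default_factory=list)
    prevention: list[str] = field(default_factory=list)


def parse_rca(raw: str) -> RcaReport:
    """Parse a JSON reply from the model into an RcaReport.

    Raises ValueError when a required key is missing, so a malformed
    reply fails loudly instead of producing a partial report.
    """
    data = json.loads(raw)
    required = ("root_cause", "impact", "immediate_mitigation", "prevention")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"RCA reply is missing keys: {missing}")
    return RcaReport(
        root_cause=str(data["root_cause"]),
        impact=str(data["impact"]),
        immediate_mitigation=[str(step) for step in data["immediate_mitigation"]],
        prevention=[str(step) for step in data["prevention"]],
    )
```

Validating the reply before returning it keeps the API response predictable even when the model drifts from the requested format.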

How I Built It

Backend: FastAPI deployed on Cloud Run
AI: Gemini 1.5 Pro via Vertex AI for multimodal reasoning
Observability: Datadog for metrics, logs, dashboards, monitors, and the Snapshot API
Multimodal Ingestion: Dashboard PNGs are captured via Datadog’s Snapshot API and passed to Gemini alongside metrics and logs
Deployment: Dockerized service with serverless scaling

High-level flow:

LLM app emits custom metrics and logs
Datadog monitors detect anomalies
A dashboard snapshot (PNG) is captured
Gemini analyzes text + metrics + logs + image
Actionable RCA is returned
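The last three steps of this flow can be sketched as a small orchestration function. This is an illustration under stated assumptions: Incident, build_prompt, and run_rca are hypothetical names, and the snapshot and model calls are passed in as plain callables so the sketch does not depend on any specific Datadog or Vertex AI client signature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Incident:
    """Hypothetical incident record assembled from a monitor alert."""
    title: str
    metrics: dict[str, float]
    logs: list[str]


def build_prompt(incident: Incident) -> str:
    """Flatten incident text, metrics, and logs into one prompt string."""
    metric_lines = [
        f"- {name}: {value}" for name, value in sorted(incident.metrics.items())
    ]
    log_lines = [f"- {line}" for line in incident.logs]
    return "\n".join([
        f"Incident: {incident.title}",
        "Metrics:",
        *metric_lines,
        "Logs:",
        *log_lines,
        "Explain root cause, impact, immediate mitigation, and prevention.",
    ])


def run_rca(
    incident: Incident,
    capture_snapshot: Callable[[Incident], bytes],
    analyze: Callable[[str, bytes], str],
) -> str:
    """Capture a dashboard PNG, build the prompt, and ask the model."""
    png = capture_snapshot(incident)   # assumed to wrap the snapshot call
    prompt = build_prompt(incident)
    return analyze(prompt, png)        # assumed to wrap the Gemini call
```

Injecting the two callables also makes the flow easy to exercise in a demo or test with canned snapshots and replies.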

What I Learned

Multimodal reasoning is powerful for SRE workflows: dashboards are first-class data, not just visuals.
LLM reliability requires LLM-native signals (tokens, cost, hallucination risk), not just generic infra metrics.
Clear, explainable AI outputs matter more than black-box automation in incident response.
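To make the idea of LLM-native signals concrete, here is a rough sketch that derives token and cost values from a single model call. The function name, the metric names, and the per-1k-token rates are all assumptions for illustration; real rates depend on the model and pricing in use, and a hallucination-risk score is left out because the writeup does not say how it is computed.

```python
def llm_signals(
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> dict[str, float]:
    """Derive per-request LLM metrics from token counts and latency.

    The metric names below are hypothetical; any naming scheme works
    as long as dashboards and monitors use it consistently.
    """
    cost_usd = (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )
    return {
        "llm.tokens.prompt": float(prompt_tokens),
        "llm.tokens.completion": float(completion_tokens),
        "llm.tokens.total": float(prompt_tokens + completion_tokens),
        "llm.cost.usd": cost_usd,
        "llm.latency.ms": float(latency_ms),
    }
```

Each entry in the returned dictionary could then be submitted as a custom metric, which is what lets a monitor alert on cost or token growth and not only on latency.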

Challenges

Designing a reliable demo flow that consistently triggers incidents
Ensuring strict compliance (using only Google Cloud AI)
Keeping multimodal prompts concise while preserving context
Balancing realism with hackathon scope and time limits
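One simple way to keep a multimodal prompt concise, shown here only as an assumed approach and not necessarily the one used in the project, is to cap the log section at a character budget and keep the most recent lines. The name trim_logs and the budget parameter are hypothetical.

```python
def trim_logs(lines: list[str], max_chars: int) -> list[str]:
    """Keep the newest log lines that fit within max_chars.

    Lines are assumed to be ordered oldest to newest. Each kept line
    counts its length plus one character for the joining newline.
    """
    kept: list[str] = []
    used = 0
    for line in reversed(lines):
        cost = len(line) + 1
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Keeping the tail of the log favors the events closest to the alert, which is usually where the relevant context sits.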

Outcome

AI SRE Copilot demonstrates how multimodal AI can turn observability data into decisions, reducing mean time to resolution and enabling safer, more cost-aware LLM operations in production.

Compliance Note: This project uses only Google Cloud Vertex AI (Gemini) and Datadog. No third-party AI services are used.

Built With

Datadog (metrics, logs, dashboards, Snapshot API)
FastAPI
Google Cloud Run
Google Cloud Vertex AI (Gemini 1.5 Pro)
Python

Updates

Darshan Linge Gowda started this project — Dec 30, 2025 05:16 AM EST
