ReliabilityAgent : AI That Audits Itself

ReliabilityAgent: Incident → Self-Check → Reinvestigation → Human Approval → Telemetry
AI investigates. Humans authorize. Every finding is reviewed before action is taken.
From incident to downloadable postmortem in seconds with root cause and remediation.
Launch a live incident autopsy powered by Gemini, Dynatrace Grail, and OpenTelemetry.
The agent audits its own investigation and records the verdict as Dynatrace telemetry.
(Dashboard) Not just AI observability. Observability of the AI itself.

Inspiration

Most AI incident-response agents are black boxes. They investigate an incident, return an answer, and ask humans to trust their reasoning. We wanted to build something different. ReliabilityAgent is an AI SRE that makes its own reasoning observable. Every trace query, log query, LLM call, self-check, reinvestigation, and approval is recorded as OpenTelemetry telemetry in Dynatrace Grail.

Our inspiration came from a simple question: What if Dynatrace could monitor the AI agent that is using Dynatrace?

What it does

ReliabilityAgent autonomously investigates production incidents while continuously auditing its own investigation quality.

Workflow : Self-Check → Reinvestigation → Human Approval → Telemetry.

Every trace query, log query, LLM call, self-check, and reinvestigation is recorded as an OpenTelemetry span in Dynatrace Grail.

The agent: Queries Dynatrace traces and logs Collects deployment and operational evidence Uses Gemini 2.5 Flash/3.1 Flash lite to generate hypotheses Performs a deterministic self-check on its own investigation Triggers recursive reinvestigation when evidence is insufficient Requires human approval before generating a post-mortem Records every action as OpenTelemetry spans in Dynatrace Grail

Key results: 248+ real spans recorded 14 investigations completed 2 autonomous reinvestigations triggered Downloadable incident post-mortems Full observability of AI reasoning

Incident
    │
    ▼
Investigation
(query_dynatrace_traces · query_gcp_logs · execute_runbook)
    │
    ▼
Self-Check
(reads own tool-call registry → verdict: PASSED / FAILED)
    │
    ├── PASSED ──────────────────────────┐
    │                                    │
    └── FAILED → Reinvestigation         │
                (corrective prompt +     │
                 new tool calls)         │
                      │                  │
                      └────────────────► ▼
                              Human Approval
                          (/autopsy/{id}/feedback)
                                   │
                                   ▼
                             Postmortem
                      (Gemini-generated, structured)
                                   │
                                   ▼
                      Dynatrace Grail Telemetry
              (every span queryable via DQL in real time)

Features

#	Feature	Why It Matters
1	Recursive self-observability loop	After investigation, the agent queries its own Dynatrace trace to verify it called ≥ 3 distinct tools. Insufficient coverage triggers a full reinvestigation — automatically.
2	Dynatrace as a cognitive feedback signal	The agent doesn't just emit telemetry to Dynatrace — it reads back from it to govern its own next action. Observability becomes part of the control plane.
3	Zero mock data, zero fake verdicts	Every tool returns real API responses or honest error structs. `self_check.verdict` is driven by actual recorded tool calls in an in-process registry — not a counter, flag, or modulo.
4	Human-in-the-loop autopsy API	`/autopsy/start` → Gemini gathers evidence → structured hypothesis with confidence score → operator approves or redirects → Gemini writes the postmortem. Every step is an auditable state machine.
5	Full OTel span hierarchy, importable dashboard	7 span types, 20+ typed attributes, cost accounting (`llm.cost.usd`), phase durations, verdict distribution — all queryable via Grail DQL. Dashboard ships as an importable JSON.
6	Production-grade observability stack	`TracerProvider` + `MeterProvider` + `LoggerProvider` all wired to Dynatrace via OTLP HTTP. Structured logs carry `trace_id` correlation. Token cost computed per-model and recorded as a span attribute and OTel metric.
7	Docker + Cloud Run ready	`Dockerfile` and `deploy/gcloud/deploy.sh` included. One command deploys to GCP Cloud Run with correct memory, CPU, and timeout settings.

How we built it

ReliabilityAgent was built using:

Google ADK, Gemini 2.5 Flash/3.1 Flash lite, Dynatrace Grail DQL, OpenTelemetry, FastAPI, MCP Server, Google Cloud Run, Google Secret Manager The most unique component is the self-check engine. Instead of asking another LLM to evaluate the investigation, ReliabilityAgent performs a deterministic audit: Counts investigation tool calls Verifies evidence thresholds Records verdicts as span attributes Triggers reinvestigation when necessary Example span attributes: self_check.verdict = PASSED self_check.tool_count = 11 self_check.backend = in_process_registry self_check.dql_attempted = true These become permanently queryable in Dynatrace Grail.

Challenges we ran into

Real self-checks instead of simulated ones : Early prototypes used synthetic pass/fail logic. We replaced it with real telemetry-driven verification using Dynatrace data and deterministic validation.

Dynatrace observability integration : Getting OpenTelemetry traces, logs, and metrics flowing correctly into Dynatrace Grail required careful configuration and tuning.

Human oversight design : We wanted approval to be meaningful rather than a checkbox. The final Approve/Redirect workflow ensures humans remain accountable for production decisions.

Accomplishments that we're proud of

Our biggest accomplishment is making AI reasoning observable. A judge can open Dynatrace, click a self-check span, and see: Investigation verdict Tool count Verification backend Reinvestigation status Self-check telemetry in real time. ReliabilityAgent doesn't just generate telemetry.

ReliabilityAgent generates telemetry about its own reasoning. We also successfully demonstrated autonomous reinvestigation, where the agent detected weak analysis, restarted its investigation, and produced a stronger result before a human reviewed it.

What we learned

The biggest lesson was that observability should not be added after the AI agent is built. Observability must be part of the architecture from day one.

We also learned that: Human oversight creates trust Graceful degradation is better than hidden failures Deterministic self-checks are more reliable than LLMs grading LLMs Telemetry can become an agent's memory and audit trail

What's next for ReliabilityAgent : AI That Audits Itself

Future versions will include: Native Grail-powered self-verification Confidence scoring as OpenTelemetry metrics Historical incident memory using Dynatrace DQL Expanded MCP integrations Multi-service root cause correlation Autonomous reliability optimization recommendations

Our long-term vision is simple: ReliabilityAgent should not only investigate incidents. It should continuously observe, audit, and improve its own reasoning. ReliabilityAgent - Incident → Self-Check → Reinvestigation → Human Approval → Telemetry.

Built With

dynatrace-dashboards
dynatrace-distributed-tracing
dynatrace-dql
dynatrace-grail
fastapi
gemini-2.5-flash
gemini-api
google-adk
google-cloud-run
google-secret-manager
mcp-server
opentelemetry
otlp-telemetry-pipeline
python
rest-apis

Updates

Robert samuel started this project — Jun 02, 2026 06:11 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.