Inspiration

Every on-call engineer knows the dance. You get a page that says "checkout-api latency is spiking." You open your observability tool, hunt for the right query, scroll through open problems, correlate with recent deployments, and twenty minutes later you find the root cause. That whole loop is conversation-shaped. So is Gemini's strength. I wanted to see how far a Gemini agent could collapse that twenty minutes.

What it does

gemini-ops-agent treats every production symptom as an SRE ticket. You hand it a one-line description ("checkout-api latency just spiked, what changed?") and the agent works the case using the Dynatrace MCP server:

  • Calls list_problems to see what Davis AI has already detected.
  • Calls execute_dql to pull supporting metrics or logs.
  • Uses find_entity_by_name to resolve service IDs when the user references things by name.
  • Uses generate_dql_from_natural_language when it needs to translate the user's question into a runnable DQL query.

The output comes back in a format an on-call engineer can act on: a one-line root cause, two to four evidence bullets with specific timestamps and numbers, and one concrete next step.
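As a rough illustration of that output contract, here is a minimal sketch of a container that enforces the shape described above (one-line root cause, two to four evidence bullets, one next step). The `Finding` class and its field names are hypothetical and are not taken from the project's code; the values in the usage example are placeholders, not real investigation output.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical container for one investigation result."""

    root_cause: str
    evidence: list[str]
    next_step: str

    def __post_init__(self) -> None:
        # The root cause is meant to be a single line.
        if "\n" in self.root_cause.strip():
            raise ValueError("root cause must fit on one line")
        # The write-up asks for two to four evidence bullets.
        if not 2 <= len(self.evidence) <= 4:
            raise ValueError("expected two to four evidence bullets")

    def render(self) -> str:
        """Format the finding as plain text for an on-call engineer."""
        lines = [f"Root cause: {self.root_cause.strip()}"]
        lines += [f"  • {item}" for item in self.evidence]
        lines.append(f"Next step: {self.next_step.strip()}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = Finding(
        root_cause="<one-line root cause>",
        evidence=["<evidence with timestamp>", "<evidence with number>"],
        next_step="<one concrete next step>",
    )
    print(example.render())
```

Validating the shape in code, rather than only in the system prompt, would let a dashboard reject a malformed answer instead of showing it to the engineer.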

How we built it

  • Google Cloud Agent Builder for the agent framework. The whole agent fits in six lines of ADK (Agent Development Kit) code: one LlmAgent, one McpToolset, a Gemini model, and a system prompt that defines the SRE workflow.
  • Gemini 2.5 Flash on Vertex AI for reasoning. Flash is fast enough for an interactive incident-investigation loop and cheap enough that judges can fire as many investigations as they want without burning credits.
  • Dynatrace MCP server for tools. The agent talks to the official dynatrace-oss/dynatrace-mcp tool surface (list_problems, execute_dql, find_entity_by_name, generate_dql_from_natural_language). A local stub MCP server ships with the repo so the demo runs without a Dynatrace tenant; flip one flag and the same agent code targets a real tenant via the official npm package.
  • Streamlit for the dashboard.
  • Cloud Run for hosting.
  • GeminiLens (a small companion library I'm open sourcing) wraps every Gemini call so the agent observes itself: cost in USD, latency, and an audit log of which tools were called land in ~/.gemini-ops-agent/traces.jsonl for every investigation.
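To make the self-observation idea concrete, here is a minimal sketch of appending one trace record per investigation to a JSONL file and summing the recorded cost afterwards. This is not GeminiLens's actual API: the function names and the record fields (`ts`, `question`, `tools_called`, `cost_usd`, `latency_s`) are assumptions chosen to mirror the cost, latency, and tool audit log described above.

```python
import json
import time
from pathlib import Path


def append_trace(path: Path, question: str, tools_called: list[str],
                 cost_usd: float, latency_s: float) -> dict:
    """Append one investigation record as a single JSON line."""
    record = {
        "ts": time.time(),             # wall-clock time of the write
        "question": question,
        "tools_called": tools_called,  # audit log of MCP tool names
        "cost_usd": round(cost_usd, 6),
        "latency_s": round(latency_s, 3),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def total_cost(path: Path) -> float:
    """Sum the recorded cost across all investigations in the file."""
    with path.open(encoding="utf-8") as fh:
        return sum(json.loads(line)["cost_usd"] for line in fh if line.strip())
```

One JSON object per line keeps writes append-only, so a crashed investigation cannot corrupt earlier records and the file can be tailed or filtered with ordinary line-based tools.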

Challenges we ran into

The Agent Development Kit is new and the Python MCP SDK API surface still varies between versions. The McpToolset connection-params shape changed between point releases. Getting the stub server's stdio handshake right was the longest single bit of debugging. Dynatrace's tool input schemas aren't published as a single JSON file; I rebuilt them by reading the open-source MCP server's TypeScript.

Accomplishments that we're proud of

  • A real end-to-end Vertex AI agent call returns a structured SRE root-cause analysis with cited problem IDs and DQL evidence. Not a demo with a hand-written response. The agent reasons over real tool output.
  • Twelve passing tests cover the stub server's responses and the agent's wiring.
  • The stub-vs-real split means reviewers can run the project on their machine in under five minutes without provisioning a Dynatrace tenant.
  • Self-observation: GeminiLens records the agent's own cost and latency so reviewers can see what each investigation costs in Vertex AI tokens.

What we learned

The Model Context Protocol holds up as an abstraction: the same agent code talks to both a stubbed MCP server and a real one. That portability is the deeper bet here: a PagerDuty MCP or Datadog MCP server should drop in alongside Dynatrace without rewriting the agent.
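A minimal sketch of what that stub-versus-real switch could look like: the agent code stays the same and only the command used to launch the MCP server over stdio changes. Everything named here is an assumption for illustration: the `USE_REAL_DYNATRACE` environment variable, the `stub_server.py` filename, and the npm package name `@dynatrace-oss/dynatrace-mcp-server` should all be checked against the repo and the official Dynatrace MCP documentation.

```python
import os
import sys


def mcp_server_command(use_real_tenant: bool) -> list[str]:
    """Return the command that launches the MCP server over stdio."""
    if use_real_tenant:
        # Assumption: the official server is published on npm under this name.
        return ["npx", "-y", "@dynatrace-oss/dynatrace-mcp-server"]
    # Assumption: the local stub is a Python script with this filename.
    return [sys.executable, "stub_server.py"]


def use_real_from_env() -> bool:
    """Read the (hypothetical) flag that flips the agent to a real tenant."""
    return os.environ.get("USE_REAL_DYNATRACE", "0") == "1"


if __name__ == "__main__":
    print(mcp_server_command(use_real_from_env()))
```

Because both servers expose the same tool names, the returned command is the only thing the toolset configuration needs to vary between a local demo and a real tenant.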

What's next for gemini-ops-agent

  • Plug in additional partner MCP servers (PagerDuty, Datadog, Sentry) and let the agent fan out across them.
  • A "remediation" mode where the agent can propose a rollback or a feature-flag flip via Slack MCP after diagnosing root cause.
  • A scheduled "morning report" that runs every weekday at 9am, fans out across all open problems, and emails the on-call a triage summary.

Built With

  • agent-development-kit
  • dynatrace
  • dynatrace-mcp
  • gemini
  • gemini-2.5
  • geminilens
  • google-cloud-agent-builder
  • google-cloud-run
  • mcp
  • python
  • streamlit
  • vertex-ai