Inspiration: Every engineer has lived this: it's 2am, PagerDuty fires, and you spend the next two hours manually grep-ing through logs, git-blaming commits, and trying to figure out which deploy broke production, while users are hitting errors in real time.

The problem isn't that engineers can't solve incidents. It's that the first 30–60 minutes are pure mechanical work: find the spike, search the logs, trace the commit, file the issue. An agent can do all of that. We wanted to prove it.

What it does: Incident Copilot is an autonomous DevOps agent that detects production anomalies in Elasticsearch, traces root causes to specific GitLab commits, and files structured incident issues — in under 30 seconds, with no human intervention.

How we built it: Incident Copilot is built on Google ADK and Gemini 2.5 Flash (served through Vertex AI). It integrates Elasticsearch for log search and anomaly detection, GitLab for commit tracing and issue management, and Arize Phoenix for full LLM observability of every agent decision. We demonstrated it on 5 realistic incident scenarios: NPE spikes, Redis failures, DB timeouts, multi-service cascades, and full site outages.

Challenges we ran into:

  1. Arize Phoenix authentication — the SDK expects a Bearer token via phoenix.otel.register(), not a plain api_key header. Took significant debugging to get traces flowing correctly to app.phoenix.arize.com.

  2. GitLab issue creation 500 errors: GitLab returned 500 when labels referenced in the API call did not already exist on the project. Fixed with an automatic fallback retry that creates the issue without labels, so a missing label never blocks incident filing.

  3. detect_error_rate_spike false positives: dividing recent errors by the baseline when both were 0 produced an infinite (or undefined) ratio, flagging healthy services as spiking. Fixed with an explicit zero-count guard.

  4. Gemini API rate limits: the free-tier Gemini API (5 RPM) is insufficient for a multi-step agent making 8–10 tool calls per triage. Switched to Vertex AI, paid for with $999 in GenAI App Builder credits, where we hit no rate limits and had full Gemini 2.5 Flash capability.

  5. Seed data accumulation — running seed scripts multiple times inflated the 60-minute baseline, making spikes impossible to detect. Fixed by deleting and rebuilding the index on each seed run.
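The zero-count guard from challenge 3 can be sketched roughly as follows. This is a minimal illustration, not the project's actual detect_error_rate_spike: the function name check_spike, the SpikeResult type, and the threshold and min_errors defaults are all assumptions, and both counts are assumed to be normalized to the same window length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpikeResult:
    is_spike: bool
    ratio: Optional[float]  # None when the ratio is undefined (empty baseline)


def check_spike(recent_errors: int, baseline_errors: int,
                threshold: float = 3.0, min_errors: int = 5) -> SpikeResult:
    """Hypothetical stand-in for detect_error_rate_spike with zero guards."""
    if recent_errors < 0 or baseline_errors < 0:
        raise ValueError("error counts must be non-negative")

    # Guard 1: no recent errors can never be a spike, whatever the baseline.
    if recent_errors == 0:
        return SpikeResult(False, 0.0 if baseline_errors > 0 else None)

    # Guard 2: an empty baseline makes the ratio undefined, so fall back
    # to an absolute floor instead of dividing by zero.
    if baseline_errors == 0:
        return SpikeResult(recent_errors >= min_errors, None)

    ratio = recent_errors / baseline_errors
    return SpikeResult(ratio >= threshold and recent_errors >= min_errors, ratio)
```

With these guards a quiet service (0 recent, 0 baseline) is reported as healthy, while a service that goes from 0 errors to many is still caught through the absolute floor rather than through a division.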

Accomplishments that we're proud of:

  • A fully autonomous 5-step triage pipeline: detect anomaly → search logs → trace commit → find merge request → file incident issue — with zero human input at any step.

  • 35 tests all passing across config, Arize, Elasticsearch tools, GitLab tools, API, and full end-to-end agent scenarios.

  • 5 realistic demo scenarios covering the full spectrum: single service NPE, Redis session failure, DB connection pool exhaustion, multi-service cascade, and full site outage — all triaged correctly by the agent.

  • Complete LLM observability via Arize Phoenix — every agent decision, tool call, and reasoning step is traced and visible, making the agent's behavior auditable and debuggable.

  • The agent handles multi-service failures without confusion — when payment, auth, and cart fail simultaneously, it correctly files three separate, distinct incident issues rather than collapsing them into one.
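The five-step pipeline and the one-issue-per-service behavior described above can be sketched as plain orchestration code. This is only an illustration under assumptions: in the real project the Gemini agent decides which tools to call, whereas here the steps are hard-wired, and every name (triage and the five injected callables) is hypothetical.

```python
from typing import Callable, Dict, List, Optional


def triage(
    services: List[str],
    detect: Callable[[str], bool],                      # step 1: anomaly check
    search_logs: Callable[[str], List[str]],            # step 2: fetch error logs
    trace_commit: Callable[[str, List[str]], Optional[str]],  # step 3
    find_mr: Callable[[str], Optional[str]],            # step 4: merge request
    file_issue: Callable[[str, List[str], Optional[str], Optional[str]], Dict],
) -> List[Dict]:
    """Run the five steps per service and file one issue per affected service."""
    filed: List[Dict] = []
    for service in services:
        if not detect(service):
            continue  # healthy service: nothing to file
        logs = search_logs(service)
        commit = trace_commit(service, logs)
        # Only look for a merge request when a suspect commit was found.
        merge_request = find_mr(commit) if commit else None
        # Step 5: each failing service gets its own, separate issue.
        filed.append(file_issue(service, logs, commit, merge_request))
    return filed
```

Because the loop files an issue inside each per-service iteration, simultaneous failures in payment, auth, and cart yield three distinct issues instead of one merged report.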

What we learned:

  • Google ADK is the right tool for this — not Vertex AI Agent Builder. ADK gives you full control over tool definitions, agent reasoning, and orchestration without locking you into a GUI-based flow.

  • Multi-step agents need real observability from day one. Without Arize Phoenix tracing, debugging why the agent made a wrong tool call would have been nearly impossible.

  • Prompt design matters more than model choice. The same Gemini 2.5 Flash model produced wildly different triage quality depending on how tool descriptions and system prompts were structured.

  • Self-healing error handling is essential for agentic systems. The GitLab label bug would have silently broken incident creation in production — building fallback retries into every tool made the agent robust against API quirks.
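A fallback retry of the kind described for the GitLab label bug might look like the sketch below. It is an assumption-laden illustration rather than the project's code: the HTTP call is injected as a hypothetical post callable returning (status_code, body) so the example stays self-contained, and the labels_dropped flag is an invented field. The only GitLab-specific detail relied on is that the issues endpoint accepts a labels field in the request payload.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger("incident_copilot")

# Hypothetical transport: takes the issue payload, returns (status, JSON body).
Post = Callable[[Dict[str, str]], Tuple[int, dict]]


def create_issue_with_fallback(post: Post, payload: Dict[str, str]) -> dict:
    """Create an issue; on a 5xx with labels set, retry once without labels."""
    labels_dropped = False
    status, body = post(payload)

    if status >= 500 and "labels" in payload:
        # Log the downgrade so the fallback itself is visible in traces.
        logger.warning("issue creation returned %s; retrying without labels", status)
        retry_payload = {k: v for k, v in payload.items() if k != "labels"}
        status, body = post(retry_payload)
        labels_dropped = True

    if status >= 400:
        # Still failing after the retry: surface the error to the caller.
        raise RuntimeError(f"issue creation failed with HTTP {status}")

    return {**body, "labels_dropped": labels_dropped}
```

Logging the retry and returning labels_dropped keeps the recovery observable, so a missing label degrades the issue rather than hiding a failure.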

What's next for Incident Copilot:

  • Real-time streaming from Elasticsearch — shift from on-demand triage to continuous monitoring with automatic agent invocation when anomaly thresholds are crossed.

  • Rollback automation — once root cause is identified and the bad commit is traced, trigger a GitLab CI/CD pipeline revert automatically with one confirmation step.

  • Post-mortem generation — after incident resolution, automatically draft a structured post-mortem document from the triage timeline, root cause, and resolution steps.

  • Multi-repo support: today the agent searches one GitLab repo. The next step is to extend it to multi-repo microservice architectures, where the offending commit may live in a different service's repo.

  • Slack/PagerDuty integration — surface triage results directly into the incident channel where the on-call engineer is already working, rather than requiring them to open a separate UI.

All agent runs are traced via the Arize Phoenix MCP server, so past incidents and trace spans can be queried directly from the agent using phoenix_* tools.
