Inspiration
Every engineering team has been there: production is down at 2am, and before anyone can start fixing anything, someone spends 20-30 minutes manually tracing what's connected to the broken service, scrolling through recent commits, and hunting down who owns each affected part. None of GitLab's five foundational agents handles live root-cause correlation during an active incident. That gap is what we built for.
What it does
Incident Blast-Radius Triage is a GitLab Duo Agent Platform custom flow that triggers on a single mention in an incident issue. It queries GitLab Orbit's SDLC knowledge graph for the blast radius of the failing component, correlates recent merge requests using the MR authorship graph, ranks root-cause hypotheses by graph proximity and recency, and auto-creates a structured triage issue with a native Mermaid dependency map, ranked suspects, and owners to page, all in under 60 seconds.
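To give a feel for the dependency map in the triage issue, here is a minimal sketch of turning dependency edges into Mermaid flowchart text. The function name, the (source, target) edge format, and the red highlight on the failing file are illustrative assumptions, not the project's actual generator.

```python
def mermaid_dependency_map(edges, failing):
    """Render (source, target) dependency edges as Mermaid flowchart text.

    Hypothetical helper: the edge format and styling are assumptions.
    File names containing double quotes would need escaping first.
    """
    ids = {}

    def node_id(name):
        # Mermaid node ids should be simple tokens, so map each file to n0, n1, ...
        if name not in ids:
            ids[name] = f"n{len(ids)}"
        return ids[name]

    lines = ["graph TD"]
    for src, dst in edges:
        lines.append(f'    {node_id(src)}["{src}"] --> {node_id(dst)}["{dst}"]')
    if failing in ids:
        # Highlight the failing component so it stands out in the rendered map.
        lines.append(f"    style {ids[failing]} fill:#f66")
    return "\n".join(lines)
```

Pasting the returned text into a Mermaid code fence in an issue description should render the graph, since GitLab renders Mermaid natively.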
How we built it
Built on the GitLab Duo Agent Platform using a hybrid two-component architecture: a DeterministicStepComponent handles the timestamp with zero loop risk, then an AgentComponent calls Orbit's query_graph MCP tool for blast radius traversal (File → CALLS/IMPORTS edges, depth 3) and owner lookup (MergeRequestDiffFile → HAS_LATEST_DIFF → MergeRequest → AUTHORED → User). A Python scoring engine computes score = 1/hop_distance + 60/max(minutes_since_change, 1) and generates the Mermaid dependency diagram. Published to the AI Catalog as a public, MIT-licensed flow.
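The scoring formula above can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's scoring engine: the in-memory adjacency dict stands in for the graph that the real flow gets from Orbit's query_graph tool, the function names are hypothetical, and skipping the failing file itself (hop distance 0) is our guess at how the division by zero is avoided.

```python
from collections import deque


def hop_distances(graph, start, max_depth=3):
    """Breadth-first search returning {node: hops from start}, capped at max_depth."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_depth:
            continue  # do not expand past the depth limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def suspect_score(hop_distance, minutes_since_change):
    """score = 1/hop_distance + 60/max(minutes_since_change, 1)."""
    return 1 / hop_distance + 60 / max(minutes_since_change, 1)


def rank_suspects(graph, failing, minutes_since_change):
    """Rank recently changed files in the blast radius, highest score first.

    minutes_since_change maps file path -> minutes since its last change.
    Assumption: unreachable files and the failing file itself are skipped.
    """
    dist = hop_distances(graph, failing)
    scored = [
        (path, suspect_score(dist[path], minutes))
        for path, minutes in minutes_since_change.items()
        if dist.get(path, 0) >= 1
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this formula, a file one hop away that changed 60 minutes ago scores 1 + 1 = 2.0, while a file two hops away that changed 120 minutes ago scores 0.5 + 0.5 = 1.0, so closer and more recent changes rank first.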
Challenges we ran into
The biggest challenge was an agent-loop bug on the beta platform: the AgentComponent repeated tool calls indefinitely despite receiving clean, successful results every time. Maximally explicit prompt instructions didn't stop it. We fixed it architecturally by moving deterministic work into a DeterministicStepComponent, which has zero loop risk. A second challenge was that Orbit Remote's DSL had 7 breaking changes relative to the documented v1 format; we discovered and corrected all of them by testing directly against the live API.
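The idea behind the architectural fix can be illustrated in plain Python. This is not the platform's flow definition: the function names, the context dict, the call_tool callback, and the max_calls cap are all hypothetical, and the cap is an extra guard that the writeup does not mention.

```python
from datetime import datetime, timezone


def deterministic_timestamp(context):
    """Plain function: it runs exactly once, so there is nothing to loop."""
    context["triaged_at"] = datetime.now(timezone.utc).isoformat()
    return context


def run_agent_step(context, call_tool, max_calls=5):
    """Model-driven step: calls a tool until it reports done, with a hard cap.

    call_tool is a hypothetical callback returning a dict with a "done" flag.
    """
    for attempt in range(1, max_calls + 1):
        result = call_tool(context)
        if result.get("done"):
            context["agent_result"] = result
            context["tool_calls"] = attempt
            return context
    raise RuntimeError(f"agent step exceeded {max_calls} tool calls")


def run_flow(context, call_tool):
    # Deterministic work first, model-driven work second.
    return run_agent_step(deterministic_timestamp(context), call_tool)
```

The point of the split is that anything with a single correct answer, such as a timestamp, never passes through the part of the flow that decides for itself whether to call a tool again.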
Accomplishments that we're proud of
End-to-end working flow confirmed live multiple times: mention trigger → Orbit blast radius traversal → ranked triage issue with Mermaid diagram, differentiated scores (2.17 vs 1.04), and real owner attribution. Proven graceful degradation: when triage_cli.py failed mid-run during testing, the agent wrote the issue body itself and still completed all steps successfully, exactly as designed.
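The graceful-degradation behavior described above follows a familiar pattern: try the preferred renderer, and fall back to something simpler instead of aborting. The sketch below is an assumption-laden illustration. In the real flow the agent itself wrote the fallback body, whereas here a deterministic function stands in for it, and the way triage_cli.py is invoked (no arguments, body on stdout) is hypothetical.

```python
import subprocess
import sys

# Hypothetical invocation; the real CLI's arguments are not described in the writeup.
DEFAULT_RENDERER = (sys.executable, "triage_cli.py")


def fallback_body(suspects):
    """Minimal triage body used when the renderer is unavailable."""
    lines = ["Triage summary (fallback rendering)"]
    for rank, (path, score) in enumerate(suspects, start=1):
        lines.append(f"{rank}. {path} (score {score:.2f})")
    return "\n".join(lines)


def build_issue_body(suspects, renderer_cmd=DEFAULT_RENDERER):
    """Try the external renderer first; fall back to a plain body on any failure."""
    try:
        done = subprocess.run(
            list(renderer_cmd), capture_output=True, text=True, check=True, timeout=30
        )
        if done.stdout.strip():
            return done.stdout
    except (OSError, subprocess.SubprocessError):
        # Renderer missing, crashed, or timed out: degrade instead of aborting.
        pass
    return fallback_body(suspects)
```

Either way the caller gets an issue body back, so the remaining steps of the flow can still run.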
What we learned
When NOT to trust an LLM with a step: moving deterministic work out of the agent's control was the correct architectural fix for the loop bug, not more prompt engineering. Orbit Remote's DSL requires hands-on testing against the live API; the published docs lagged the actual API by 7 breaking changes. Graceful degradation isn't just a nice-to-have; it's what makes an agentic system actually deployable.
What's next for Incident Blast-Radius Triage
- Real-time recency scoring via runner git checkout, eliminating the static .owners.json dependency
- Label-based auto-trigger when an "incident" label is applied to any issue
- Cross-repo blast radius traversal for microservice architectures spanning multiple GitLab projects
- Slack/PagerDuty integration for the owners-to-page notification list
Built With
- agentcomponent
- ci/cd
- deterministicstepcomponent
- gitlab
- mermaid
- python