Incident Replay

Inspiration

After every major production incident, software teams religiously hold post-mortems. They fill out a template, list root causes, construct a timeline, and most importantly, write down a checklist of action items. Then, almost inevitably, nothing happens.

The evidence across the industry is stark:

Google's SRE team openly states they are "by no means perfect at formulating and executing postmortem action items" (Google SRE Book, Ch. 15)
Recurring incidents caused by incomplete fixes and unresolved action items are a well-documented industry problem (Atlassian: Incident Management Post-Mortems)

The gap isn't a lack of knowledge; engineers know what went wrong. The gap is the "last mile": turning passive text in a closed issue into assigned, tracked work on an agile board. I wanted to build a native GitLab agent that ensures post-mortem lessons don't rot in a checklist graveyard.

What it does

Incident Replay is a GitLab Duo Agent flow that automatically turns post-mortem documentation into tracked, assigned prevention work natively within GitLab.

When you mention the agent in an incident or post-mortem issue, it:

Analyzes the Context: Reads the issue and discussion to extract root causes and proposed action items.
Gathers Log Evidence: Infers the affected service and timeframe, executing an external request to fetch real error logs from a Google Cloud log server.
Identifies Patterns: Searches the repository for past issues with similar root causes, surfacing unaddressed recurring problems.
Creates Tracked Work: Generates individual, trackable GitLab issues for every single action item, securely appending the log evidence and linking back to the original source.
Summarizes Findings: Posts a clean prevention summary comment back on the original issue.

By automating this workflow, it turns post-mortem action items into visible tasks on your team's issue board as soon as the agent finishes its run.

How we built it

Incident Replay is a custom multi-agent flow built on the newly released GitLab Duo Agent Platform. When an engineer comments "@ai-incident-replay-gitlab-ai-hackathon please analyze this" on a post-mortem, the flow kicks into action:

Agent 1: Incident Analyzer: This agent uses native GitLab tools (get_issue, list_issue_notes) to read the post-mortem. It extracts the root cause and proposed action items. Crucially, it uses the gitlab_issue_search tool to find past incidents with similar traits, and the run_command tool to execute curl against a Google Cloud Run log server to fetch actual error logs.
Agent 2: Action Orchestrator: Armed with this structured analysis, the orchestrator takes over. It uses create_issue to spin up a new, trackable GitLab issue for every single action item. It appends the GCP log evidence, links back to the original post-mortem, and uses update_issue to add labels (like incident-prevention). Finally, it uses create_issue_note to post a prevention summary back on the original incident.
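The handoff between the two agents can be pictured as a small data structure. The following is a minimal Python sketch, not the project's actual code: in the real flow the agents exchange this information through prompts and GitLab tools such as create_issue. The IncidentAnalysis class, its field names, and the "[Prevention]" title prefix are all assumptions made for illustration; only the incident-prevention label comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentAnalysis:
    """Hypothetical structure handed from the analyzer to the orchestrator."""
    source_issue_url: str
    root_cause: str
    action_items: list[str]
    log_evidence: str = ""
    similar_issue_urls: list[str] = field(default_factory=list)


def build_issue_payloads(analysis: IncidentAnalysis) -> list[dict]:
    """Turn each action item into one issue payload (title, description, labels)."""
    payloads = []
    for item in analysis.action_items:
        parts = [
            f"Root cause: {analysis.root_cause}",
            f"Source post-mortem: {analysis.source_issue_url}",
        ]
        # Evidence and related incidents are optional, so add them only if present.
        if analysis.log_evidence:
            parts.append(f"Log evidence:\n{analysis.log_evidence}")
        if analysis.similar_issue_urls:
            parts.append(
                "Similar past incidents:\n" + "\n".join(analysis.similar_issue_urls)
            )
        payloads.append(
            {
                "title": f"[Prevention] {item}",  # assumed title convention
                "description": "\n\n".join(parts),
                "labels": ["incident-prevention"],
            }
        )
    return payloads
```

Producing one payload per action item mirrors the orchestrator's rule of one trackable issue per item, with every issue carrying a link back to the source post-mortem.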

To support the agent's need for real observability data, I deployed a sample Python FastAPI service to Google Cloud Run, simulating a /logs endpoint that returns timeline-specific telemetry.
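As a rough illustration of what "timeline-specific" could mean for such an endpoint, here is a standard-library Python sketch of the filtering logic a /logs handler might call. It is not the deployed service: the record fields, the sample data, and the query_logs name are assumptions, and the FastAPI routing layer is deliberately left out.

```python
from datetime import datetime

# Hypothetical in-memory records; a real service would read from a log store.
SAMPLE_LOGS = [
    {"timestamp": "2026-03-20T10:00:00+00:00", "service": "checkout",
     "severity": "ERROR", "message": "upstream timeout"},
    {"timestamp": "2026-03-20T10:05:00+00:00", "service": "checkout",
     "severity": "INFO", "message": "request ok"},
    {"timestamp": "2026-03-20T12:00:00+00:00", "service": "search",
     "severity": "ERROR", "message": "index unavailable"},
]


def query_logs(service, start, end, records=SAMPLE_LOGS, severity="ERROR"):
    """Return records for `service` at `severity` with start <= timestamp <= end.

    All timestamps are ISO 8601 strings with an explicit UTC offset, so the
    parsed datetimes are timezone-aware and safe to compare.
    """
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    matches = []
    for record in records:
        ts = datetime.fromisoformat(record["timestamp"])
        if (
            record["service"] == service
            and record["severity"] == severity
            and start_dt <= ts <= end_dt
        ):
            matches.append(record)
    return matches
```

A route handler could pass its query parameters straight into query_logs and return the resulting list as JSON, which keeps the filtering logic testable without starting a server.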

Challenges we ran into

Pivoting Away from MCP: My original architectural vision was to use Model Context Protocol (MCP) servers to ingest the observability data, which would have provided a standardized and secure connection to the logs. However, I quickly discovered that deploying and connecting custom, external MCP servers inside the hackathon's Duo Agent Platform sandbox wasn't fully supported yet. This limitation forced a major mid-project pivot: abandoning the clean MCP design in favor of explicitly crafting raw HTTP requests using curl through the run_command tool.
Querying External URLs: Getting the agent to communicate with the Google Cloud Run log server was a major initial hurdle. The agent's attempts to fetch logs repeatedly failed with curl exit code 56, which curl reports when it fails to receive network data. It took deep digging to realize that GitLab Duo enforces strict outbound network policies. I had to create and configure a .gitlab/duo/agent-config.yml file to explicitly allowlist the external domain before the agent was allowed to make the connection.
Environment Variable Limitations: I originally designed the agent to depend entirely on a GCP_LOGS_ENDPOINT CI/CD variable to know where the log server was. However, the hackathon sandbox repository didn't make it easy to set CI/CD variables. To work around this, I reworked the prompt so the agent parses its trigger comment, allowing users to pass the endpoint URL inline while keeping the CI/CD variable as a fallback.
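The endpoint fallback and the curl-based request described above can be sketched in Python. In the project itself this behavior comes from prompt instructions rather than code, so treat the following as an illustration only. The function names, the URL pattern, the assumption that the inline URL is the service's base URL, and the service, start, and end query parameters are all hypothetical; GCP_LOGS_ENDPOINT and the /logs path come from the text.

```python
import os
import re
import shlex
from urllib.parse import urlencode

# Simple pattern for an http(s) URL embedded in free text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def resolve_logs_endpoint(comment, env=None):
    """Prefer a URL given inline in the trigger comment; else use GCP_LOGS_ENDPOINT."""
    env = os.environ if env is None else env
    match = URL_PATTERN.search(comment)
    if match:
        # Drop punctuation that commonly trails a URL at the end of a sentence.
        return match.group(0).rstrip(".,;)")
    return env.get("GCP_LOGS_ENDPOINT") or None


def build_curl_command(endpoint, service, start, end):
    """Build a shell-safe curl command; the query parameter names are assumptions."""
    query = urlencode({"service": service, "start": start, "end": end})
    url = f"{endpoint.rstrip('/')}/logs?{query}"
    # --fail makes curl exit non-zero on HTTP errors; --max-time bounds the wait.
    return shlex.join(
        ["curl", "--silent", "--show-error", "--fail", "--max-time", "20", url]
    )
```

Quoting the command with shlex.join matters here because the query string contains characters such as & and ? that a shell would otherwise interpret.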

Accomplishments that we're proud of

Seamless External Integration: I'm incredibly proud of successfully bridging the native GitLab Duo ecosystem with an external, deployed Google Cloud Run application. Orchestrating a native agent to securely construct and execute complex curl requests purely through prompt instruction and the run_command tool felt like a massive breakthrough.
Robust Agent Architecture: Instead of building a fragile agent that wildly guesses when confused, I engineered a resilient, conditional two-agent workflow.
Solving a Pervasive Industry Problem: Post-mortem rot is a well-known cultural problem in software engineering. I took a notoriously tedious chore, manually transferring root causes and checklists into issue trackers, and used AI to completely automate the "agile bureaucracy." Ensuring every post-mortem automatically generates tracked, verifiable fixes feels like a significant win for engineering productivity.

What we learned

The Convenience of Enforced Structure: GitLab Duo is incredibly powerful because it actively encourages engineering best practices through its abstractions. By enforcing the use of flows (where I separated the Incident Analyzer from the Action Orchestrator), the framework naturally guides developers away from creating brittle, monolithic "god agents." This structured flow architecture made the agents vastly more predictable and easier to debug.
The Power of the Native Ecosystem: The sheer amount of context the Duo ecosystem has out of the box is staggering. Because the agents run natively within GitLab, I didn't have to waste time writing complex API wrappers, handling pagination, or managing OAuth tokens to access repository data. The agent immediately had robust, native access to project tools like get_issue, list_issue_notes, and gitlab_issue_search. This native connectivity turned what would normally be a massive integration headache into a seamless, highly capable automation pipeline.

What's next for Incident Replay

I see this hackathon MVP as the foundation for a much larger automated reliability pipeline. The roadmap includes:

Cross-Project Pattern Detection: Expanding the agent's incident search capability to the GitLab Group level to reveal systemic, org-wide vulnerabilities that transcend a single repository.
Auto-Generating Code Fixes: Moving beyond just creating issue tickets to automatically generating direct Merge Requests with the code changes necessary to fix common issues (like injecting missing timeout configurations or retry logic).
Shift-Left Prevention: Integrating the agent into the pre-merge review process, actively scanning incoming code against root causes extracted from past historical incidents to prevent bugs before they reach production.
Automated Compliance Reporting: Structuring the tracking and resolution of incident action items to automatically generate airtight evidence reports for SOC 2 and ISO 27001 compliance audits.
Model Context Protocol (MCP) Integration: Transitioning away from customized HTTP curl scripts by building native support for MCP servers. This open standard will grant the agent secure, unified access to a much wider array of observability tools and telemetry data.

By tackling the well-documented problem of recurring incidents, Incident Replay aims to make every post-mortem the last time that specific incident happens.

Built With

gitlab
python

Updates

Herbert Kagumba started this project — Mar 25, 2026 01:44 PM EDT
