Incident Oracle

Inspiration

Production incidents are still too often discovered after deployment, even when changes pass review and tests. We wanted to build something that shifts that discovery earlier, into the merge request itself, where teams can still act cheaply and safely. The core idea behind Incident Oracle is simple: use historical failure patterns plus semantic code understanding to predict which changes are likely to cause incidents before they ship.

What it does

Incident Oracle analyzes merge requests and predicts production incident risk before merge. It identifies the semantic intent of a change, estimates blast radius, compares the change against historical incident patterns, assigns a structured risk score from 0 to 100, and recommends a merge decision such as normal merge, staged rollout, or block merge. It can also post structured findings back into GitLab so the prediction becomes part of the development workflow instead of a separate report nobody reads.
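
As a rough illustration of how a 0 to 100 score could map to the three merge recommendations named above, here is a minimal TypeScript sketch. The threshold values (40 and 75), the type names, and the decide function are assumptions made for this example; the write-up does not state the actual cut-offs or data shapes.

```typescript
// Hypothetical sketch: maps a 0-100 risk score to a merge recommendation.
// The thresholds below are illustrative assumptions, not the project's real values.
type MergeDecision = "normal_merge" | "staged_rollout" | "block_merge";

interface RiskFinding {
  score: number; // clamped to the 0..100 range and rounded
  decision: MergeDecision;
}

const STAGED_ROLLOUT_MIN = 40; // assumed cut-off
const BLOCK_MERGE_MIN = 75; // assumed cut-off

function decide(rawScore: number): RiskFinding {
  if (!Number.isFinite(rawScore)) {
    throw new RangeError("risk score must be a finite number");
  }
  // Clamp first so out-of-range inputs still yield a valid structured score.
  const score = Math.round(Math.min(100, Math.max(0, rawScore)));
  let decision: MergeDecision = "normal_merge";
  if (score >= BLOCK_MERGE_MIN) {
    decision = "block_merge";
  } else if (score >= STAGED_ROLLOUT_MIN) {
    decision = "staged_rollout";
  }
  return { score, decision };
}
```

Under these assumed thresholds, decide(82) would recommend block_merge and decide(12) would recommend normal_merge. Keeping the mapping in one pure function makes the recommendation easy to audit and to tune.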

How we built it

We built Incident Oracle as a GitLab-native hackathon project with two layers. The first layer is a custom GitLab Duo flow and companion agent for native MR analysis inside GitLab. The second layer is a Google Cloud runtime on Cloud Run that receives merge request webhooks, queries BigQuery for historical incident context, uses Vertex AI for structured reasoning, falls back to a deterministic risk model when the Vertex AI call fails, and writes results back to GitLab through the API. We also seeded a small incident-history corpus and encoded a repeatable scoring rubric so the system produces consistent decisions instead of vague summaries.
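
To make the idea of a repeatable scoring rubric concrete, the following TypeScript sketch shows one way a deterministic risk model could combine a few change signals into a 0 to 100 score. The signal names, weights, and caps are all assumptions chosen for illustration; the project's actual rubric is not described in this write-up.

```typescript
// Hypothetical deterministic rubric: every field and weight here is an
// assumption for illustration, not the project's actual scoring model.
interface ChangeSignals {
  filesChanged: number; // size of the merge request
  touchesSchemaOrConfig: boolean; // e.g. migrations or deployment config
  dependentServices: number; // rough blast-radius estimate
  similarPastIncidents: number; // matches found in the incident history
}

// Converts a raw count into points, bounded to the range 0..max.
function cappedPoints(count: number, perUnit: number, max: number): number {
  return Math.min(max, Math.max(0, count * perUnit));
}

function fallbackRiskScore(signals: ChangeSignals): number {
  const sizePoints = cappedPoints(signals.filesChanged, 2, 20);
  const configPoints = signals.touchesSchemaOrConfig ? 20 : 0;
  const blastPoints = cappedPoints(signals.dependentServices, 10, 30);
  const historyPoints = cappedPoints(signals.similarPastIncidents, 15, 30);
  // The caps sum to 100, so the total always stays within 0..100.
  return sizePoints + configPoints + blastPoints + historyPoints;
}
```

Because the function is pure and uses fixed weights, the same merge request always receives the same score, which is the property a fallback path needs when the model is unavailable.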

Challenges we ran into

The hardest part was turning an ambitious architecture into something that actually works end to end in a hackathon window. We had to solve cloud deployment issues, secret access permissions, project billing and API enablement, model availability problems in Vertex AI, and GitLab integration edge cases. We also had to make the system resilient enough that a model failure would not collapse the whole demo path, which led us to build a deterministic fallback predictor and fail-soft side-effect handling.
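
The resilience pattern described above can be sketched in TypeScript roughly as follows. The function names (predictWithFallback, failSoft) and the Prediction shape are hypothetical; the callbacks stand in for the model call, the deterministic predictor, and side effects such as posting a comment to GitLab, none of whose real signatures are given in this write-up.

```typescript
// Hypothetical sketch of a model-first prediction with a deterministic
// fallback, plus a wrapper that keeps side-effect failures from aborting
// the request. All names here are assumptions for illustration.
interface Prediction {
  score: number;
  source: "model" | "fallback";
}

async function predictWithFallback(
  modelPredict: () => Promise<number>,
  fallbackPredict: () => number,
): Promise<Prediction> {
  try {
    return { score: await modelPredict(), source: "model" };
  } catch {
    // Any model error (unavailable model, timeout, bad response) lands here.
    return { score: fallbackPredict(), source: "fallback" };
  }
}

// Runs a side effect and reports success instead of throwing on failure.
async function failSoft(
  label: string,
  effect: () => Promise<void>,
): Promise<boolean> {
  try {
    await effect();
    return true;
  } catch (err) {
    console.warn(`side effect "${label}" failed:`, err);
    return false;
  }
}
```

Recording the source on each prediction lets the posted findings say whether the score came from the model or from the fallback, so reviewers know how much weight to give it.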

Accomplishments that we're proud of

We are proud that Incident Oracle is not just a concept deck. It includes a real GitLab agent and flow, a live Cloud Run service, BigQuery-backed historical context, structured risk scoring, a reproducible demo path, and an end-to-end webhook-driven analysis flow. We are also proud that the project tells a clear story: small changes can cause expensive outages, and Incident Oracle helps teams catch them before merge.

What we learned

We learned that the most valuable AI developer tooling is not just “smart chat,” but systems that take action in context and fit naturally into existing workflows. We also learned that reliability matters as much as intelligence in hackathon demos. A strong prompt is not enough on its own; you need deterministic fallbacks, good deployment hygiene, and a clear explanation layer if you want judges and developers to trust the output.

What's next for Incident Oracle

Next, we want to deepen the historical learning loop with larger incident datasets, improve reviewer routing and merge gating policies, add better observability and replay tooling, and expand the blast-radius model across multi-service systems. Beyond the hackathon, the goal is to turn Incident Oracle into a production-ready pre-merge incident prediction system that continuously learns from real deployments and helps engineering teams reduce avoidable outages.

Built With

bigquery
cloud-build
cloud-run
custom-gitlab-flows-and-agents
docker
gitlab-api
gitlab-duo-agent-platform
gitlab-webhooks
json
node.js
secret-manager
sql
typescript
vertex-ai

Updates

Chungu Chipimo Chama started this project — Mar 25, 2026 01:45 PM EDT
