Inspiration

When your best SRE quits, MTTR goes up 77%. IncidentBrain captures their expertise in a knowledge graph so every engineer diagnoses like a 10-year veteran.

  • Claude agent receives alert → queries Neo4j for similar past incidents → "I've seen this before, INC-042 had identical symptoms" → proposes fix → human approves → applies fix → writes resolution back to graph
  • Demo wow: First incident takes 45 min. Same pattern again → 30 seconds. Live.

What it does

IncidentBrain is an AI SRE agent that turns your team's incident history into institutional memory. When an alert fires, it investigates the Kubernetes cluster, queries a Neo4j knowledge graph of past incidents, finds matching patterns, and proposes verified fixes — all in under 30 seconds. After resolution, it writes the new incident back to the graph, so every fix makes the system smarter. It's the difference between a junior engineer Slacking the senior at 3 AM and a junior engineer resolving it themselves in seconds.

How we built it

  • Claude Agent SDK as the reasoning engine — receives alerts, plans investigation steps, queries tools, proposes fixes
  • Neo4j Aura as the temporal knowledge graph — stores incidents, services, symptoms, root causes, and fixes as connected nodes with relationships (AFFECTED, CAUSED_BY, RESOLVED_BY)
  • Kubernetes (GKE) as the live environment — a real cluster with a deliberately broken app (DB_POOL_SIZE=1) that the agent diagnoses and
    fixes via kubectl
  • 3d-force-graph for the real-time 3D visualization — nodes light up as the agent investigates, the "MATCH FOUND" moment is visible as
    connections form
  • Browser STT + Cartesia TTS for voice interaction — talk to the agent, hear it respond
  • GCP VM as shared dev environment — three teammates, three SSH accounts, one codebase ## Challenges we ran into
  • Neo4j's browser JavaScript driver doesn't support neo4j+ssc:// connections — we had to proxy through a server-side route instead of
    connecting directly from the frontend
  • Getting the 3D graph to render node labels required careful library loading order — SpriteText had to load before the graph initialized
  • Coordinating three people building three different layers (agent, graph, UI) that all needed to connect — we solved this by defining a
    clear API contract early: the agent returns {match, root_cause, fix, confidence}, and both the graph and UI consume that
  • Making the demo reliable under time pressure — we pre-seeded 5 realistic incidents and built a deterministic "SIMULATE ALERT" flow so
    the demo never fails on stage

    Accomplishments that we're proud of

  • The full loop works end-to-end: alert fires → agent investigates → graph returns match → fix applied → new incident written back to graph → system is smarter for next time

  • 45 minutes to 28 seconds on the same incident pattern — that's a 96x improvement in MTTR, and it's real, not mocked

  • The 3D knowledge graph visualization makes the "MATCH FOUND" moment visually dramatic — judges can see the connections form in real time

  • Built and deployed on real GKE infrastructure, not localhost — the agent actually runs kubectl commands against a live cluster

What we learned

  • Temporal knowledge graphs (Graphiti pattern) are far more powerful than flat RAG for incident management — you need to know not just
    "what happened" but "when did we learn this" and "is this still true"
  • Claude's tool-use capability is perfect for SRE — it naturally chains investigation steps the way a human engineer would: check pods →
    read logs → query graph → verify hypothesis → propose fix
  • The hardest part of building an AI agent isn't the AI — it's the data model. Getting the Neo4j schema right (Incident → AFFECTED → Service, Incident → CAUSED_BY → RootCause) determined whether matches were accurate
  • Voice interaction changes the demo completely — being able to say "what's wrong with api-server?" and hear the answer makes it feel like a real colleague, not a dashboard ## What's next for IncidentBrain
  • Integrate with real alerting — PagerDuty, OpsGenie, Grafana webhooks as trigger sources instead of simulated alerts
  • Slack/Teams bot — the agent lives where on-call engineers already work, no separate dashboard needed
  • Automated runbook extraction — ingest existing postmortems and runbooks into the knowledge graph automatically using Claude's document understanding
  • GRPO-trained reasoning — use reinforcement learning to make the agent's diagnostic reasoning better over time, rewarding faster and more accurate root cause identification
  • Multi-cluster support — monitor multiple Kubernetes clusters and cross-reference incidents across environments (staging issues that
    predict production failures)

Built With

Share this project:

Updates