Heal-K8s — Submission Story

Inspiration

At 3 AM, most Kubernetes incidents are not new failures but familiar ones recurring under pressure: OOMKilled pods, crash loops, and image pull failures, each requiring manual intervention at a critical moment.

That led us to ask:

Why are we paged on incidents that a system should be able to predict, classify, and safely recover from?

Heal-K8s was designed to ease that pain with a simple cycle: prediction, deterministic diagnosis, approval-driven remediation, and memory.


What it does

Heal-K8s is a human-in-the-loop Kubernetes incident response system that:

  • Predicts OOM failures before they happen
  • Detects known failure signatures immediately without using LLMs
  • Falls back to an LLM only when it encounters an unknown failure
  • Provides a suggested remediation command in the dashboard
  • Executes only after receiving Human Approval
  • Learns from outcomes using Incident Memory
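
The signature-first routing described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the SIGNATURES table, the labels, and the function names are hypothetical, and only the Kubernetes status reasons (OOMKilled, CrashLoopBackOff, ImagePullBackOff, ErrImagePull) are real strings.

```python
import re
from typing import Optional

# Hypothetical signature table: compiled pattern -> failure label.
# The matched strings are standard Kubernetes status reasons.
SIGNATURES = [
    (re.compile(r"OOMKilled"), "oom_killed"),
    (re.compile(r"CrashLoopBackOff"), "crash_loop"),
    (re.compile(r"ImagePullBackOff|ErrImagePull"), "image_pull_failure"),
]


def match_signature(event_text: str) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for pattern, label in SIGNATURES:
        if pattern.search(event_text):
            return label
    return None


def diagnose(event_text: str) -> dict:
    """Route known failures deterministically; flag the rest for the LLM."""
    label = match_signature(event_text)
    if label is not None:
        return {"source": "signature", "failure": label}
    # No signature matched: the caller would invoke the LLM fallback here.
    return {"source": "llm_fallback", "failure": None}
```

Because the table is checked before anything else, a known failure never reaches the LLM path, which keeps the common case fast and repeatable.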

Core loop:

Predict → Diagnose → Approve → Execute → Learn
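
One way to make this loop enforceable is a small state machine in which execution is unreachable without approval. The sketch below is an assumption about how such a gate could look; the Stage and Incident names are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    PREDICTED = "predicted"
    DIAGNOSED = "diagnosed"
    APPROVED = "approved"
    EXECUTED = "executed"
    LEARNED = "learned"


# Each stage has exactly one legal successor, so no step can be skipped.
NEXT_STAGE = {
    Stage.PREDICTED: Stage.DIAGNOSED,
    Stage.DIAGNOSED: Stage.APPROVED,
    Stage.APPROVED: Stage.EXECUTED,
    Stage.EXECUTED: Stage.LEARNED,
}


@dataclass
class Incident:
    stage: Stage = Stage.PREDICTED
    history: List[Stage] = field(default_factory=list)

    def advance(self, to: Stage) -> None:
        """Move to the next stage, rejecting any out-of-order transition."""
        if NEXT_STAGE.get(self.stage) is not to:
            raise ValueError(f"cannot go from {self.stage.value} to {to.value}")
        self.history.append(self.stage)
        self.stage = to
```

With this shape, a call such as advance(Stage.EXECUTED) on a merely diagnosed incident raises an error instead of running a command.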


How we built it

We built Heal-K8s as a modular stack:

  • Backend (FastAPI): API orchestration, state management, and incident routing
  • Predictive Engine: time-series analysis of memory growth to estimate time to OOM
  • Signature Engine: regex/rule-based matching for common Kubernetes error types
  • LLM Fallback: structured JSON diagnosis path for unknown error types
  • Memory Layer (SQLite): storing results and confidence levels for repeated incidents
  • Execution Layer: Kubernetes Python client integration and safe execution controls
  • Frontend (Vanilla JS + Chart.js): real-time dashboard display with countdown, confidence levels, and approval flow
  • Telemetry Input: Prometheus polling and deterministic fake triggers for reproducible demos
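
As an illustration of the Predictive Engine's time-to-OOM idea, the sketch below fits a least-squares line to recent memory samples and extrapolates to the container's memory limit. The linear model, the function name, and the sample format are assumptions made for this example; the project's actual estimator may differ.

```python
from typing import Optional, Sequence, Tuple


def seconds_to_oom(
    samples: Sequence[Tuple[float, float]], limit_bytes: float
) -> Optional[float]:
    """Estimate seconds until memory reaches limit_bytes.

    samples are (timestamp_seconds, memory_bytes) pairs, oldest first.
    Returns None when no upward trend can be established.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Least-squares slope: memory growth in bytes per second.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking memory, nothing to predict
    _, latest_m = samples[-1]
    if latest_m >= limit_bytes:
        return 0.0
    return (limit_bytes - latest_m) / slope
```

For example, samples of 100, 200, and 300 bytes taken 10 seconds apart against a 500-byte limit give a growth rate of 10 bytes per second and an estimate of 20 seconds.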

The core idea behind memory confidence:

confidence = success_count / (success_count + failure_count)
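
A minimal sketch of how this ratio could be stored and computed with SQLite follows. The table name, column names, and helper functions are hypothetical, and returning None for a signature with no recorded outcomes is an assumption made here to avoid dividing by zero.

```python
import sqlite3
from typing import Optional


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) outcome table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incident_memory ("
        "signature TEXT PRIMARY KEY, "
        "success_count INTEGER NOT NULL DEFAULT 0, "
        "failure_count INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_outcome(conn: sqlite3.Connection, signature: str, success: bool) -> None:
    """Increment the success or failure counter for a signature."""
    # column is chosen from two fixed names, never from user input.
    column = "success_count" if success else "failure_count"
    conn.execute(
        "INSERT OR IGNORE INTO incident_memory (signature) VALUES (?)",
        (signature,),
    )
    conn.execute(
        f"UPDATE incident_memory SET {column} = {column} + 1 WHERE signature = ?",
        (signature,),
    )
    conn.commit()


def confidence(conn: sqlite3.Connection, signature: str) -> Optional[float]:
    """Return success_count / (success_count + failure_count), or None."""
    row = conn.execute(
        "SELECT success_count, failure_count FROM incident_memory "
        "WHERE signature = ?",
        (signature,),
    ).fetchone()
    if row is None or sum(row) == 0:
        return None  # no history yet, so no confidence to report
    successes, failures = row
    return successes / (successes + failures)
```

Three successful fixes and one failed fix for the same signature yield a confidence of 0.75.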


Challenges we ran into

  • Live infra vs demo reliability: real Kubernetes + Prometheus is powerful but can be brittle in a time-constrained demo
  • False positives in prediction: without a safeguard that requires sustained growth, a short memory spike can be mistaken for a leak
  • Frontend consistency: countdown timing, stale messages, and execution-state transitions all needed careful tuning
  • Mode switching: balancing deterministic test flows with live telemetry proof without muddling the narrative
  • Safety controls: maintaining explicit approval gating and command constraints while keeping it fast
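
For the false-positive challenge, one possible safeguard is to require that memory rises across most of a recent window before a leak is flagged. The sketch below is an illustrative assumption, including the function name and the default thresholds, and is not the project's confirmed logic.

```python
from typing import Sequence


def is_sustained_growth(
    memory_samples: Sequence[float],
    min_samples: int = 5,
    min_rising_fraction: float = 0.8,
) -> bool:
    """Return True only if memory rose across most of the recent window."""
    if min_samples < 2 or len(memory_samples) < min_samples:
        return False  # not enough evidence to call it a leak
    window = memory_samples[-min_samples:]
    # Count consecutive pairs where memory increased.
    rises = sum(1 for prev, cur in zip(window, window[1:]) if cur > prev)
    rising_fraction = rises / (len(window) - 1)
    return rising_fraction >= min_rising_fraction and window[-1] > window[0]
```

A single jump at the end of an otherwise flat window fails the check, while a steady climb passes it.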

Accomplishments that we're proud of

  • Built a complete incident loop, not just an assistant wrapper
  • Showed a prediction before failure with a visible countdown
  • Used LLM as a fallback, not primary decision logic
  • Included a form of incident memory to resolve repeated failures more quickly
  • Provided a polished judge-ready flow with deterministic demo paths
  • Maintained a clear human-in-the-loop safety model throughout execution

What we learned

  • In operational tooling, reliability and clarity are more important than the number of features
  • Deterministic systems (prediction + signatures + memory) create trust faster than AI-only behavior
  • Approval-gated automation is vital for safe remediation, especially in a production-like environment
  • Observability UX is about reducing cognitive load under pressure, not adding more panels
  • Demo-ready and production-ready are distinct, and both require intentional design

What's next for Heal-K8s

  • Expanded signature coverage (CPU throttling, disk pressure, network path failures)
  • Multi-cluster support and stronger RBAC-aware execution policies
  • Alert and approval workflow integration with Slack and PagerDuty
  • Smarter incident memory with richer context and confidence calibration
  • Deployment paths beyond local Minikube, enabling pilots on staging and production clusters

Closing note

Heal-K8s demonstrates that incident response can be proactive, explainable, and safer, all without removing humans from the loop.
