Heal-K8s — Submission Story

Inspiration

At 3 AM, most Kubernetes incidents are not new failures but familiar ones recurring under pressure: OOMKilled pods, crash loops, image pull failures, and fixes that still depend on urgent manual intervention.

That led us to one question:

Why are we paged on incidents that a system should be able to predict, classify, and safely recover from?

Heal-K8s was designed to ease that pain with a simple cycle: prediction, deterministic diagnosis, approval-gated remediation, and incident memory.

What it does

Heal-K8s is a Human-in-the-loop Kubernetes Incident Response System that:

Predicts OOM failures before they happen
Detects known failure signatures immediately without using LLMs
Utilizes an LLM Fallback only when encountering unknown failures
Provides a suggested remediation command in the dashboard
Executes only after receiving Human Approval
Learns from outcomes using Incident Memory

Core loop:

Predict → Diagnose → Approve → Execute → Learn
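
The Diagnose step can be illustrated with a minimal sketch: try deterministic signatures first and call the LLM only when nothing matches. The patterns, labels, and the `llm_fallback` callable below are hypothetical stand-ins, not the project's actual rule set or API.

```python
import re
from typing import Callable, Optional

# Hypothetical signature table: compiled pattern -> diagnosis label.
SIGNATURES = [
    (re.compile(r"OOMKilled", re.IGNORECASE), "oom_killed"),
    (re.compile(r"CrashLoopBackOff", re.IGNORECASE), "crash_loop"),
    (re.compile(r"ImagePullBackOff|ErrImagePull", re.IGNORECASE), "image_pull_failure"),
]


def match_signature(log_text: str) -> Optional[str]:
    """Return the first matching signature label, or None if nothing matches."""
    for pattern, label in SIGNATURES:
        if pattern.search(log_text):
            return label
    return None


def diagnose(log_text: str, llm_fallback: Callable[[str], dict]) -> dict:
    """Deterministic path first; the LLM is consulted only for unknown failures."""
    label = match_signature(log_text)
    if label is not None:
        return {"source": "signature", "diagnosis": label}
    # Assumption: the fallback returns a JSON-like dict describing the diagnosis.
    result = dict(llm_fallback(log_text))
    result["source"] = "llm"
    return result
```

Keeping the fallback as an injected callable means the deterministic path can be tested without any model access.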

How we built it

We built Heal-K8s as a modular stack:

Backend (FastAPI): API orchestration, state management, and incident routing
Predictive Engine: time series memory growth analysis for time to OOM calculation
Signature Engine: regex/rule-based matching for common Kubernetes error types
LLM Fallback: structured JSON diagnosis path for unknown error types
Memory Layer (SQLite): storing results and confidence levels for repeated incidents
Execution Layer: Kubernetes Python client integration and safe execution controls
Frontend (Vanilla JS + Chart.js): real-time dashboard display with countdown, confidence levels, and approval flow
Telemetry Input: Prometheus polling and deterministic fake triggers for reproducible demos
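
As an illustration of the Predictive Engine idea, here is a minimal sketch that fits a straight line to recent memory samples and extrapolates to the container limit. A least-squares linear fit is an assumption made for this example, and the function name and parameters are hypothetical; the actual engine may use a different model.

```python
from typing import Optional, Sequence, Tuple


def estimate_seconds_to_oom(
    samples: Sequence[Tuple[float, float]], limit_bytes: float
) -> Optional[float]:
    """Estimate seconds until memory reaches limit_bytes.

    samples: (timestamp_seconds, memory_bytes) pairs, oldest first.
    Returns None when no growth trend can be established.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Least-squares slope in bytes per second.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking memory: no OOM predicted
    latest_memory = samples[-1][1]
    if latest_memory >= limit_bytes:
        return 0.0
    return (limit_bytes - latest_memory) / slope
```

For example, samples of 100, 200, and 300 bytes taken 10 seconds apart against a 500-byte limit give a slope of 10 bytes per second and an estimate of 20 seconds.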

A core memory confidence idea:

confidence = success_count / (success_count + failure_count)
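
A minimal sketch of how this ratio could be stored and computed with SQLite follows. The table name, columns, and function names are assumptions for illustration, not the project's actual schema.

```python
import sqlite3
from typing import Optional


def init_memory(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row of outcome counters per failure signature.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS incident_memory (
               signature TEXT PRIMARY KEY,
               success_count INTEGER NOT NULL DEFAULT 0,
               failure_count INTEGER NOT NULL DEFAULT 0
           )"""
    )


def record_outcome(conn: sqlite3.Connection, signature: str, succeeded: bool) -> None:
    conn.execute(
        "INSERT OR IGNORE INTO incident_memory (signature) VALUES (?)", (signature,)
    )
    # The column name comes from a fixed pair, never from user input.
    column = "success_count" if succeeded else "failure_count"
    conn.execute(
        f"UPDATE incident_memory SET {column} = {column} + 1 WHERE signature = ?",
        (signature,),
    )
    conn.commit()


def confidence(conn: sqlite3.Connection, signature: str) -> Optional[float]:
    """success_count / (success_count + failure_count), or None with no history."""
    row = conn.execute(
        "SELECT success_count, failure_count FROM incident_memory WHERE signature = ?",
        (signature,),
    ).fetchone()
    if row is None or row[0] + row[1] == 0:
        return None  # avoid dividing by zero for unseen incidents
    return row[0] / (row[0] + row[1])
```

Returning None for an unseen signature keeps "no history" distinct from "zero confidence", which the formula alone does not define.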

Challenges we ran into

Live infra vs demo reliability: real Kubernetes + Prometheus can be powerful but brittle in a time-constrained demo scenario
False positives in prediction: without a safeguard that requires sustained growth, a brief memory spike can be misread as a leak
Frontend consistency: countdown timing, stale messages, and execution-state transitions all needed careful tuning
Mode switching: balancing deterministic test flows with live telemetry proof without muddling the narrative
Safety controls: maintaining explicit approval gating and command constraints while keeping it fast
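
One way to guard against the false positives mentioned above is to require growth across most of a recent window before treating it as a leak. The sketch below is an assumption about how such a safeguard might look; the function name and thresholds are hypothetical and untuned.

```python
from typing import Sequence


def is_sustained_growth(
    memory_bytes: Sequence[float],
    min_samples: int = 5,
    min_rising_fraction: float = 0.8,
) -> bool:
    """Return True only if memory rose across most of the recent window."""
    if min_samples < 2:
        raise ValueError("min_samples must be at least 2")
    if len(memory_bytes) < min_samples:
        return False  # not enough history to call it a trend
    window = memory_bytes[-min_samples:]
    deltas = [later - earlier for earlier, later in zip(window, window[1:])]
    rising = sum(1 for delta in deltas if delta > 0)
    # A single spike raises one delta; a leak raises nearly all of them.
    return rising / len(deltas) >= min_rising_fraction and window[-1] > window[0]
```

A steadily climbing series passes the check, while a flat series that ends in one large jump does not.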

Accomplishments that we're proud of

Built a complete incident loop, not just an assistant wrapper
Showed a prediction before failure with a visible countdown
Used LLM as a fallback, not primary decision logic
Included a form of incident memory to resolve repeated failures more quickly
Provided a polished judge-ready flow with deterministic demo paths
Maintained a clear human-in-the-loop safety model throughout execution
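
The human-in-the-loop safety model can be sketched as a gate that refuses to run anything unapproved or outside an allowlist. The `Remediation` type, the allowlist entries, and the `runner` callable are hypothetical; the actual Execution Layer uses the Kubernetes Python client and may constrain commands differently.

```python
import shlex
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical allowlist of command prefixes considered safe to run.
ALLOWED_PREFIXES = [
    ["kubectl", "rollout", "restart"],
    ["kubectl", "set", "resources"],
]


@dataclass
class Remediation:
    command: str
    approved: bool = False  # set only by an explicit human action


def is_allowed(command: str) -> bool:
    tokens = shlex.split(command)
    return any(tokens[: len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)


def execute_if_approved(
    remediation: Remediation, runner: Callable[[List[str]], object]
) -> object:
    """Run the command only after approval and an allowlist check."""
    if not remediation.approved:
        raise PermissionError("remediation has not been approved by a human")
    if not is_allowed(remediation.command):
        raise ValueError("command is outside the allowlist")
    # Pass an argv list rather than a shell string, so no shell parsing occurs.
    return runner(shlex.split(remediation.command))
```

Checking approval before anything else means an unapproved command never reaches the runner, regardless of how it was produced.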

What we learned

In operational tooling, reliability and clarity are more important than the number of features
Deterministic systems (prediction + signatures + memory) create trust faster than AI-only behavior
Approval-gated automation is vital for safe remediation, especially in a production-like environment
Observability UX is about reducing cognitive load under pressure, not adding more panels
Demo-ready and production-ready are distinct, and both require intentional design

What's next for Heal-K8s

Expanded signature coverage (CPU throttling, disk pressure, network path failures)
Multi-cluster support and stronger RBAC-aware execution policies
Slack and PagerDuty integration for alert and approval workflows
Improved memory intelligence with more context and better confidence calibration
Additional deployment paths beyond local Minikube, to support pilots on staging and production clusters