Outage-Pilot

✨ Features

  • Autonomous Kubernetes Monitoring: Lightweight Go agents run directly in your cluster to detect common issues like CrashLoopBackOff, ImagePullBackOff, and OOMKilled.
  • AI-Powered Root Cause Analysis: A central "Brain" orchestrator powered by Google Gemini uses a ReAct loop to perform deep investigations into incident signals.
  • Dynamic Tool Use: The AI Brain chooses at each step which of its read-only Kubernetes tools (exposed via MCP) to call, gathering context the way a human SRE would.
  • One-Click Remediation: The AI proposes a remediation plan (e.g., "Set image to nginx:latest" or "Rollback deployment"), which you can approve directly from the dashboard.
  • Real-Time Dashboard: A React-based frontend provides a live stream of incidents, agent status, and pending approvals.
  • Safe & Secure: Privileged operations are executed only after explicit user approval. All actions are audited.
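
The approval and audit behavior described above could look roughly like the following. This is a minimal sketch, not the project's actual code: `RemediationPlan`, `ApprovalGate`, `PlanStatus`, and the audit record fields are hypothetical names chosen for illustration, and the audit log is an in-memory list rather than a database table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class PlanStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    EXECUTED = "executed"


@dataclass
class RemediationPlan:
    # Hypothetical shape of a plan proposed by the AI.
    incident_id: str
    action: str  # e.g. "Rollback deployment"
    status: PlanStatus = PlanStatus.PENDING


@dataclass
class ApprovalGate:
    # Assumption: an in-memory list stands in for a persistent audit table.
    audit_log: list = field(default_factory=list)

    def _audit(self, plan: RemediationPlan, event: str, actor: str) -> None:
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "incident_id": plan.incident_id,
            "action": plan.action,
            "event": event,
            "actor": actor,
        })

    def approve(self, plan: RemediationPlan, user: str) -> None:
        if plan.status is not PlanStatus.PENDING:
            raise ValueError("only pending plans can be approved")
        plan.status = PlanStatus.APPROVED
        self._audit(plan, "approved", user)

    def execute(self, plan: RemediationPlan, run):
        # Refuse to run anything a user has not explicitly approved,
        # and record the blocked attempt as well.
        if plan.status is not PlanStatus.APPROVED:
            self._audit(plan, "blocked", "system")
            raise PermissionError("plan has not been approved")
        result = run(plan.action)
        plan.status = PlanStatus.EXECUTED
        self._audit(plan, "executed", "system")
        return result
```

In this sketch, calling `execute` on a pending plan raises `PermissionError` and still leaves an audit entry, so rejected attempts are as visible as successful ones.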

🏗️ Architecture

OutagePilot uses a distributed, multi-agent architecture designed for safety and scalability.

  1. Go Watchdog Agents: These lightweight binaries run in your cluster, monitor for specific failure modes, and send a Signal to the backend when an issue is detected.
  2. Python Backend: Hosts the central "Brain" orchestrator. It receives signals, creates incidents, and orchestrates the AI investigation.
  3. Gemini AI Brain: A prompt and ReAct (Reason + Act) loop that uses Gemini to investigate an incident. Like a human SRE, it works step by step: it reasons about what it knows so far, calls a tool, reads the result, and repeats until it can propose a remediation.
  4. MCP Server: A Node.js server that acts as a secure gateway to the Kubernetes API. It exposes a set of "tools" (e.g., k8s_describe_pod) that the AI can call. This prevents the AI from having direct, unrestricted API access.
  5. React Frontend: The mission control center. It provides a real-time view of the system and is the single point for approving any mutating actions proposed by the AI.
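
The interaction between the ReAct loop and the tool gateway can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: `react_loop`, `TOOLS`, and `scripted_model` are hypothetical names, the model is replaced by a scripted stand-in rather than a real Gemini call, and the tool runs in-process instead of going through the MCP server. Only the tool name `k8s_describe_pod` comes from the description above; its body here is a stub.

```python
from typing import Callable


def k8s_describe_pod(name: str) -> str:
    # Stub: a real tool would query the Kubernetes API through the gateway.
    return f"pod {name}: state=Waiting reason=ImagePullBackOff"


# Allow-list of read-only tools. Anything not registered here cannot be called.
TOOLS: dict[str, Callable[[str], str]] = {"k8s_describe_pod": k8s_describe_pod}


def react_loop(signal: str, decide: Callable[[list], dict], max_steps: int = 5) -> dict:
    """Alternate between model decisions and tool observations.

    `decide` stands in for the model. It receives the transcript so far and
    returns either {"type": "tool", "tool": ..., "arg": ...} or
    {"type": "final", "plan": ...}.
    """
    transcript = [f"signal: {signal}"]
    for _ in range(max_steps):
        step = decide(transcript)
        if step["type"] == "final":
            return {"plan": step["plan"], "transcript": transcript}
        tool = TOOLS.get(step["tool"])
        if tool is None:
            # Unknown or mutating tools are rejected instead of executed.
            observation = f"error: tool {step['tool']!r} is not allowed"
        else:
            observation = tool(step["arg"])
        transcript.append(f"action: {step['tool']}({step['arg']})")
        transcript.append(f"observation: {observation}")
    # Step budget exhausted without a conclusion.
    return {"plan": None, "transcript": transcript}


def scripted_model(transcript: list) -> dict:
    # Hypothetical stand-in for Gemini: inspect the pod once, then conclude.
    if not any(line.startswith("observation:") for line in transcript):
        return {"type": "tool", "tool": "k8s_describe_pod", "arg": "web-1"}
    return {"type": "final", "plan": "Set image to nginx:latest"}


result = react_loop("ImagePullBackOff on web-1", scripted_model)
```

The allow-list is what keeps the model away from direct API access: a request for a tool that is not registered becomes an error observation the model can read, and nothing is executed.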

🛠️ Tech Stack

  • Backend: Python, FastAPI, SQLAlchemy, Alembic
  • Frontend: TypeScript, React, Vite, Zustand, Tailwind CSS
  • Agents: Go
  • AI: Google Gemini
  • Tooling Protocol: MCP (Model Context Protocol)
  • Database: PostgreSQL
  • Real-time UI: Server-Sent Events (SSE)
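
For readers unfamiliar with SSE, the wire format is plain text: optional `id:` and `event:` fields, one or more `data:` lines, and a blank line that ends the event. The helper below is a minimal sketch of how a backend might frame an incident update. The function name `format_sse` and the payload fields are assumptions for illustration and are not taken from the project.

```python
import json


def format_sse(data, event=None, event_id=None) -> str:
    """Frame one Server-Sent Event as text for a streaming HTTP response."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines inside strings, so the payload fits on
    # a single data line.
    payload = json.dumps(data, separators=(",", ":"))
    lines.append(f"data: {payload}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


frame = format_sse({"incident_id": 1, "status": "pending_approval"}, event="incident")
```

On the browser side, an `EventSource` listener registered for the `incident` event type would receive the JSON string in `event.data`.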
