Outage-Pilot
✨ Features
- Autonomous Kubernetes Monitoring: Lightweight Go agents run directly in your cluster to detect common issues like CrashLoopBackOff, ImagePullBackOff, and OOMKilled.
- AI-Powered Root Cause Analysis: A central "Brain" orchestrator powered by Google Gemini uses a ReAct loop to perform deep investigations into incident signals.
- Dynamic Tool Use: The AI Brain dynamically uses a rich set of read-only Kubernetes tools (via MCP) to gather context, just like a human SRE would.
- One-Click Remediation: The AI proposes a remediation plan (e.g., "Set image to nginx:latest" or "Rollback deployment"), which you can approve directly from the dashboard.
- Real-Time Dashboard: A React-based frontend provides a live stream of incidents, agent status, and pending approvals.
- Safe & Secure: Privileged operations are only executed after explicit user approval. All actions are audited.
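The approval and audit behavior described above can be illustrated with a minimal Python sketch. It is not OutagePilot's actual code: the names `RemediationPlan`, `ApprovalGate`, and the `runner` callback are hypothetical, and the audit log is an in-memory list rather than a database table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional


@dataclass
class RemediationPlan:
    """A mutating action proposed by the AI (hypothetical shape)."""
    incident_id: str
    action: str
    approved_by: Optional[str] = None


@dataclass
class ApprovalGate:
    """Runs a plan only after explicit approval and audits every step."""
    audit_log: List[Dict[str, Optional[str]]] = field(default_factory=list)

    def _audit(self, event: str, plan: RemediationPlan) -> None:
        # Append one audit record per event, stamped in UTC.
        self.audit_log.append({
            "event": event,
            "incident_id": plan.incident_id,
            "action": plan.action,
            "user": plan.approved_by,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def approve(self, plan: RemediationPlan, user: str) -> None:
        plan.approved_by = user
        self._audit("approved", plan)

    def execute(self, plan: RemediationPlan, runner: Callable[[str], str]) -> str:
        # Refuse any plan that no user has approved, and record the attempt.
        if plan.approved_by is None:
            self._audit("rejected_unapproved", plan)
            raise PermissionError("plan has not been approved")
        result = runner(plan.action)
        self._audit("executed", plan)
        return result
```

In this sketch, calling `execute` before `approve` raises `PermissionError` and still leaves an audit record, so refused attempts are as visible as executed ones.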
🏗️ Architecture
OutagePilot uses a distributed, multi-agent architecture designed for safety and scalability.
- Go Watchdog Agents: These lightweight binaries run in your cluster, monitor for specific failure modes, and send a Signal to the backend when an issue is detected.
- Python Backend: The central "Brain" of the system. It receives signals, creates incidents, and orchestrates the AI investigation.
- Gemini AI Brain: A sophisticated prompt and ReAct (Reason+Act) loop that uses Gemini to investigate an incident. It behaves like a human SRE, using tools to gather information step-by-step.
- MCP Server: A Node.js server that acts as a secure gateway to the Kubernetes API. It exposes a set of "tools" (e.g., k8s_describe_pod) that the AI can call. This prevents the AI from having direct, unrestricted API access.
- React Frontend: The mission control center. It provides a real-time view of the system and is the single point for approving any mutating actions proposed by the AI.
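To make the investigation flow above concrete, here is a rough Python sketch of a ReAct-style loop that can only reach an allowlist of read-only tools. Several pieces are assumptions for illustration: the Gemini call is replaced by a scripted `decide` function, the MCP server is replaced by an in-process dictionary, the signal fields are invented, and `k8s_describe_pod` returns canned text instead of querying a cluster.

```python
from typing import Callable, Dict, List, Tuple


def k8s_describe_pod(pod: str) -> str:
    # Stubbed output; a real tool would query the Kubernetes API via MCP.
    return f"pod {pod}: state=Waiting reason=ImagePullBackOff"


# Hypothetical allowlist standing in for the tools the MCP server exposes.
READ_ONLY_TOOLS: Dict[str, Callable[[str], str]] = {
    "k8s_describe_pod": k8s_describe_pod,
}


def call_tool(name: str, arg: str) -> str:
    # The model never touches the API directly; unknown tools are refused.
    if name not in READ_ONLY_TOOLS:
        raise PermissionError(f"tool not allowed: {name}")
    return READ_ONLY_TOOLS[name](arg)


def react_loop(signal: dict, decide: Callable[[dict, List[str]], Tuple], max_steps: int = 5) -> str:
    """Alternate reasoning (decide) and acting (call_tool) until a conclusion.

    decide returns ("act", tool_name, arg) or ("finish", conclusion).
    """
    observations: List[str] = []
    for _ in range(max_steps):
        step = decide(signal, observations)
        if step[0] == "finish":
            return step[1]
        _, tool, arg = step
        observations.append(call_tool(tool, arg))
    return "inconclusive: step limit reached"


def scripted_decide(signal: dict, observations: List[str]) -> Tuple:
    # Stand-in for the model: describe the pod first, then conclude.
    if not observations:
        return ("act", "k8s_describe_pod", signal["pod"])
    if "ImagePullBackOff" in observations[-1]:
        return ("finish", "image cannot be pulled; check the image name and registry credentials")
    return ("finish", "no root cause found")
```

With a signal such as `{"kind": "ImagePullBackOff", "pod": "web-1"}`, `react_loop(signal, scripted_decide)` takes one tool step and then finishes. The step limit is a design choice in this sketch to keep a misbehaving model from looping indefinitely.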
🛠️ Tech Stack
- Backend: Python, FastAPI, SQLAlchemy, Alembic
- Frontend: TypeScript, React, Vite, Zustand, Tailwind CSS
- Agents: Go
- AI: Google Gemini
- Tooling Protocol: MCP (Model Context Protocol)
- Database: PostgreSQL
- Real-time UI: Server-Sent Events (SSE)
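As a small illustration of the SSE entry above, the following Python sketch formats one message in the Server-Sent Events wire format. The helper name `format_sse` and the `incident` event name are hypothetical; how OutagePilot actually names and streams its events is not stated here.

```python
import json


def format_sse(event: str, data: dict) -> str:
    # Compact JSON stays on one line, so a single "data:" field is enough.
    payload = json.dumps(data, separators=(",", ":"))
    # A blank line terminates each SSE message.
    return f"event: {event}\ndata: {payload}\n\n"
```

A browser `EventSource` listening for the `incident` event would receive the JSON string as the message data.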
Built With
- a2a
- alembic
- fastapi
- gemini
- go
- mcp
- postgresql
- python
- react
- sqlalchemy
- sse
- typescript
- vite
- zustand

