About Kubeguardian
Inspiration
Kubernetes clusters are powerful but complex, and managing reliability at scale can be challenging for SRE teams. During the GKE Hackathon, I wanted to explore building a system that acts as an autonomous site reliability engineer, capable of monitoring clusters, analyzing events, and remediating issues without constant human intervention. The idea was inspired by real-world operational challenges where downtime or misconfigurations can have an immediate impact on applications.
Project Overview
Kubeguardian is an autonomous reliability assistant for Kubernetes. It functions as a multi-agent system, integrating background automation and interactive user-driven commands:
Publisher & Subscriber Agents: These background pods form the automation pipeline. The publisher watches Kubernetes events for Pods, Deployments, Services, and other resources and publishes them to RabbitMQ; the subscriber consumes them and passes them to a Remediator Agent, which executes corrective actions.
Chat Agent & Frontend: Users interact via a web UI that allows chatting with a Chat Agent, sending cluster commands, and requesting remediation actions.
MCP Server & kubectl-ai: Expose custom tools and kubectl functions for programmatic cluster control and automation.
Data Persistence: PostgreSQL stores user sessions and metadata for agent interactions.
This hybrid system ensures autonomous monitoring while allowing human-in-the-loop control, providing a full SRE experience in a Kubernetes-native environment.
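The background flow described above (publisher, message queue, subscriber, remediator) can be sketched in a few lines. This is a minimal illustration, not the project's code: it uses Python's in-memory `queue.Queue` as a stand-in for RabbitMQ, and the `ClusterEvent` fields and the `publish`, `remediate`, and `drain` names are assumptions made for the example.

```python
import json
import queue
from dataclasses import dataclass, asdict


@dataclass
class ClusterEvent:
    """Hypothetical, minimal shape for a Kubernetes event message."""
    kind: str        # e.g. "Pod", "Deployment"
    namespace: str
    name: str
    reason: str      # e.g. "BackOff"


def publish(bus: queue.Queue, event: ClusterEvent) -> None:
    """Publisher side: serialize the event as JSON and put it on the bus."""
    bus.put(json.dumps(asdict(event)))


def remediate(event: ClusterEvent) -> str:
    """Remediator stand-in: describe an action instead of executing one."""
    return f"inspect {event.kind.lower()} {event.namespace}/{event.name} ({event.reason})"


def drain(bus: queue.Queue) -> list[str]:
    """Subscriber side: consume queued messages and hand each to the remediator."""
    actions = []
    while not bus.empty():
        payload = json.loads(bus.get())
        actions.append(remediate(ClusterEvent(**payload)))
    return actions


bus = queue.Queue()  # in-memory stand-in for a RabbitMQ queue (assumption)
publish(bus, ClusterEvent("Pod", "default", "web-1", "BackOff"))
print(drain(bus))
```

Keeping the message a plain JSON document means the publisher and subscriber only need to agree on field names, which is one reason a queue is a convenient boundary between the watching and remediating sides.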
Features & Functionality
Event Streaming: Real-time Kubernetes event collection from cluster resources.
Automated Remediation: Background agents detect and fix cluster issues automatically.
Interactive Chat Agent: Frontend interface for user commands and action requests.
Custom MCP Tools: Extensible tool interface for exposing new automation capabilities.
User Management: Persistent storage of user data and session history in PostgreSQL.
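To make the automated remediation feature more concrete, here is one way a remediator could map an event reason to a kubectl command. It is a sketch under stated assumptions: the `RULES` table, its two entries, and the `plan_action` function are hypothetical and do not describe Kubeguardian's actual policy. The function only builds the argument list; it does not run anything.

```python
from typing import Optional

# Hypothetical policy table: event reason -> kubectl argument template.
# The entries are illustrative, not the project's real rules.
RULES = {
    # Delete a crash-looping pod so its controller recreates it.
    "BackOff": ["delete", "pod", "{name}", "-n", "{namespace}"],
    # Scheduling failures are only inspected, not changed.
    "FailedScheduling": ["describe", "pod", "{name}", "-n", "{namespace}"],
}


def plan_action(reason: str, name: str, namespace: str) -> Optional[list[str]]:
    """Return a kubectl argument list for a known reason, or None to escalate."""
    template = RULES.get(reason)
    if template is None:
        return None  # unknown reason: leave it to a human via the chat interface
    return ["kubectl"] + [part.format(name=name, namespace=namespace) for part in template]


print(plan_action("BackOff", "web-1", "default"))
print(plan_action("SomethingElse", "web-1", "default"))
```

Returning `None` for unknown reasons keeps the automated path narrow and leaves everything else to the human-in-the-loop side of the system.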
Technologies Used
GKE (Google Kubernetes Engine): Orchestrates pods, services, and ingress.
ADK (Agent Development Kit): Powers autonomous agent logic.
MCP (Model Context Protocol): Exposes custom tools and reasoning for agents.
kubectl-ai: MCP server providing programmatic Kubernetes operations.
RabbitMQ: Event-driven communication between publisher, subscriber, and remediator agents.
PostgreSQL: Stores user accounts, sessions, and metadata.
React: Frontend UI for chat-based interaction.
Python: For all agent and backend services.
Other Data Sources
Kubernetes API for cluster events.
RabbitMQ message queues for real-time event streaming.
PostgreSQL database for persistent session data.
Challenges & Learnings
Cluster Networking: Configuring ingress, services, and pod communication in GKE required careful debugging.
Event-driven Architecture: Designing a pipeline (publisher → RabbitMQ → subscriber → remediator) that works reliably with minimal delay.
Agent Integration: Combining ADK agents with MCP, kubectl-ai, and custom tools for real-time reasoning.
Hybrid Automation: Balancing automated remediation with interactive human-driven control via the Chat Agent.
Deployment: Managing multiple pods, database connections, and frontend in a single GKE cluster.
Through this project, I gained hands-on experience with multi-agent systems, Kubernetes event streams, GKE networking, RabbitMQ integration, and autonomous SRE operations, as well as designing a reliable architecture that can handle failures gracefully.
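One reliability concern in a queue-based pipeline like this is redelivery: with acknowledgement-based consumption, a broker can deliver the same message more than once, so a remediator may see a duplicate event. The sketch below shows one common mitigation, skipping events whose ID has already been handled. The `DedupingHandler` class and its parameters are hypothetical, and whether Kubeguardian deduplicates this way is not stated in the write-up.

```python
from collections import OrderedDict
from typing import Any, Callable


class DedupingHandler:
    """Wrap a handler so each event ID is processed at most once (bounded memory)."""

    def __init__(self, handler: Callable[[Any], None], max_seen: int = 1000):
        self._handler = handler
        self._seen: "OrderedDict[str, bool]" = OrderedDict()
        self._max_seen = max_seen

    def handle(self, event_id: str, payload: Any) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        # Record the ID only after the handler succeeds, so a failed
        # attempt can still be retried when the message is redelivered.
        self._handler(payload)
        self._seen[event_id] = True
        if len(self._seen) > self._max_seen:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True


handled = []
dedupe = DedupingHandler(handled.append)
dedupe.handle("uid-1", {"reason": "BackOff"})
dedupe.handle("uid-1", {"reason": "BackOff"})  # simulated redelivery, skipped
print(len(handled))
```

The bounded `OrderedDict` trades perfect deduplication for fixed memory use: an ID older than the last `max_seen` events could be processed again, which is acceptable only if the remediation actions themselves are safe to repeat.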
Summary
Kubeguardian demonstrates how autonomous agents and human-in-the-loop interactions can work together to improve Kubernetes reliability. It combines event-driven architecture, agent-based automation, and a chat-based interface, making it both educational and practical for real-world SRE scenarios.
