About Kubeguardian

Inspiration

Kubernetes clusters are powerful but complex, and managing reliability at scale can be challenging for SRE teams. During the GKE Hackathon, I wanted to explore building a system that acts as an autonomous SRE, capable of monitoring clusters, analyzing events, and remediating issues without constant human intervention. The idea was inspired by real-world operational challenges where downtime or misconfigurations can have an immediate impact on applications.

Project Overview

Kubeguardian is an autonomous reliability assistant for Kubernetes. It functions as a multi-agent system, integrating background automation and interactive user-driven commands:

Publisher & Subscriber Agents: Background pods that move cluster events through the system. The publisher watches Kubernetes events (Pods, Deployments, Services, etc.) and publishes them to RabbitMQ; the subscriber consumes them and feeds them to a Remediator Agent, which executes corrective actions.

Chat Agent & Frontend: Users interact via a web UI that allows chatting with a Chat Agent, sending cluster commands, and requesting remediation actions.

MCP Server & kubectl-ai: Exposes custom tools and kubectl functions for programmatic cluster control and automation.

Data Persistence: PostgreSQL stores user sessions and metadata for agent interactions.

This hybrid system ensures autonomous monitoring while allowing human-in-the-loop control, providing a full SRE experience in a Kubernetes-native environment.
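
As a rough illustration of the publisher and subscriber roles described above, here is a minimal sketch in Python. It is not Kubeguardian's actual code: the ClusterEvent fields, the routing-key naming scheme, and the function names are all assumptions made for the example, and an in-memory queue.Queue stands in for RabbitMQ so the snippet runs without a broker.

```python
import json
import queue
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClusterEvent:
    """Hypothetical event envelope; the field names are assumptions."""
    kind: str        # e.g. "Pod", "Deployment"
    namespace: str
    name: str
    reason: str      # e.g. "BackOff", "FailedScheduling"
    message: str


def routing_key(event: ClusterEvent) -> str:
    """Build a topic-style key such as 'k8s.pod.backoff' (assumed scheme)."""
    return f"k8s.{event.kind.lower()}.{event.reason.lower()}"


def publish(broker: queue.Queue, event: ClusterEvent) -> None:
    """Publisher side: serialize the event to JSON and hand it to the broker."""
    broker.put((routing_key(event), json.dumps(asdict(event))))


def consume(broker: queue.Queue) -> list[ClusterEvent]:
    """Subscriber side: drain the broker and decode each message."""
    events = []
    while not broker.empty():
        _key, body = broker.get()
        events.append(ClusterEvent(**json.loads(body)))
    return events
```

In a real deployment the queue would be replaced by a RabbitMQ client, and the routing key would let the subscriber bind only to the event types it cares about.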

Features & Functionality

Event Streaming: Real-time Kubernetes event collection from cluster resources.

Automated Remediation: Background agents detect and fix cluster issues automatically.

Interactive Chat Agent: Frontend interface for user commands and action requests.

Custom MCP Tools: Extendable protocol for exposing new automation capabilities.

User Management: Persistent storage of user data and session history in PostgreSQL.
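
To make the automated remediation feature more concrete, the sketch below shows one simple way a remediator could map an event reason to a corrective command. The rule table, the chosen actions, and the plan_remediation function are hypothetical and are not taken from the project; only the kubectl subcommands themselves (delete pod, describe pod) are standard. Events with no matching rule return None, which a caller could treat as "escalate to a human via the Chat Agent".

```python
from typing import Optional

# Hypothetical rule table: Kubernetes event reason -> kubectl argv template.
# Deleting a crash-looping pod assumes a controller will recreate it.
RULES = {
    "BackOff": ["kubectl", "delete", "pod", "{name}", "-n", "{namespace}"],
    "Unhealthy": ["kubectl", "describe", "pod", "{name}", "-n", "{namespace}"],
}


def plan_remediation(reason: str, name: str, namespace: str) -> Optional[list[str]]:
    """Return the command to run for this event, or None if no rule applies."""
    template = RULES.get(reason)
    if template is None:
        return None  # unknown reason: leave the decision to a human
    return [part.format(name=name, namespace=namespace) for part in template]
```

Returning an argument list rather than a shell string keeps the plan easy to inspect or log before anything is executed.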

Technologies Used

GKE (Google Kubernetes Engine): Orchestrates pods, services, and ingress.

ADK (Agent Development Kit): Powers autonomous agent logic.

MCP (Model Context Protocol): Exposes custom tools and reasoning for agents.

kubectl-ai: MCP server providing programmatic Kubernetes operations.

RabbitMQ: Event-driven communication between publisher, subscriber, and remediator agents.

PostgreSQL: Stores user accounts, sessions, and metadata.

React: Frontend UI for chat-based interaction.

Python: Implementation language for all agent and backend services.

Other Data Sources

Kubernetes API for cluster events.

RabbitMQ message queues for real-time event streaming.

PostgreSQL database for persistent session data.

Challenges & Learnings

Cluster Networking: Configuring ingress, services, and pod communication in GKE required careful debugging.

Event-driven Architecture: Making the publisher → RabbitMQ → subscriber → remediator pipeline work reliably and with minimal delay.

Agent Integration: Combining ADK agents with MCP, kubectl-ai, and custom tools for real-time reasoning.

Hybrid Automation: Balancing automated remediation with interactive human-driven control via the Chat Agent.

Deployment: Managing multiple pods, database connections, and frontend in a single GKE cluster.

Through this project, I gained hands-on experience with multi-agent systems, Kubernetes event streams, GKE networking, RabbitMQ integration, and autonomous SRE operations. I also learned how to design a reliable architecture that can handle failures gracefully.
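
One common way to handle transient failures gracefully in a pipeline like this is to retry a failed step with exponential backoff. The sketch below is a generic illustration of that technique, not the project's implementation; the with_retries helper, its parameters, and its defaults are assumptions chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action, retrying on any exception with exponential backoff.

    The delay doubles after each failure (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can escalate it.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The sleep function is injectable so the helper can be exercised in tests without actually waiting.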

Summary

Kubeguardian demonstrates how autonomous agents and human-in-the-loop interactions can work together to improve Kubernetes reliability. It combines event-driven architecture, agent-based automation, and a chat-based interface, making it both educational and practical for real-world SRE scenarios.
