Inspiration
Running apps on Kubernetes is powerful, but when things go wrong, troubleshooting can be overwhelming. Pods crash, services fail, metrics spike, and you end up chasing logs across different dashboards. I wanted to explore a question: what if Kubernetes could help us troubleshoot itself? That idea, giving microservices an AI upgrade, became the inspiration for my project.
What it does
The K8s MCP Troubleshooting Agent is an AI-powered assistant that runs on Google Kubernetes Engine (GKE). It:
- Monitors pods, services, deployments, and cluster resources.
- Gathers logs, events, and metrics when requested.
- Analyzes issues and provides AI-powered troubleshooting suggestions.
- Performs remediation actions like restarting pods, scaling deployments, or cleaning up failed resources.
- Integrates seamlessly with GKE, Prometheus, and Cloud Monitoring.
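The "cleaning up failed resources" action above could look roughly like the following. This is a minimal sketch rather than the project's actual code: the function name `cleanup_failed_pods` and the `dry_run` default are assumptions made for illustration. It expects a `CoreV1Api` object from the Kubernetes Python client and relies only on its `list_namespaced_pod` and `delete_namespaced_pod` methods.

```python
def cleanup_failed_pods(core_v1, namespace: str, dry_run: bool = True) -> list[str]:
    """Find pods whose phase is Failed, optionally delete them, return their names.

    core_v1 is expected to be a kubernetes.client.CoreV1Api instance.
    With dry_run=True (an assumed default) nothing is deleted; the names are
    only reported so a human or the agent can confirm first.
    """
    # status.phase is a field selector supported by the pods API.
    failed = core_v1.list_namespaced_pod(
        namespace=namespace, field_selector="status.phase=Failed"
    )
    names = [pod.metadata.name for pod in failed.items]
    if not dry_run:
        for name in names:
            core_v1.delete_namespaced_pod(name=name, namespace=namespace)
    return names
```

Defaulting to a dry run keeps a destructive action behind an explicit opt-in, which seems a sensible guard when an AI agent is the caller.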
Example use case: If a pod is stuck in Pending, the agent can describe it, detect resource constraints, suggest scaling adjustments, and even execute the fix — all through a simple interaction.
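The "detect resource constraints" step in this use case can be sketched with the standard library alone. The snippet below is an illustration under assumptions, not the agent's real logic: `diagnose_pending` and its return shape are hypothetical. It takes a pod as the plain dictionary you get from `kubectl get pod -o json` and looks at the `PodScheduled` condition, whose message typically names the resources the scheduler could not find.

```python
import re

# Pulls resource names out of scheduler messages such as
# "0/3 nodes are available: 1 Insufficient memory, 2 Insufficient cpu."
_INSUFFICIENT = re.compile(r"Insufficient ([\w/-]+(?:\.[\w/-]+)*)")


def diagnose_pending(pod: dict) -> dict:
    """Report whether a pod is Pending and which resources it is short on."""
    status = pod.get("status", {})
    result = {
        "pending": status.get("phase") == "Pending",
        "short_on": [],
        "suggestion": None,
    }
    if not result["pending"]:
        return result

    for cond in status.get("conditions", []):
        unschedulable = (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        )
        if unschedulable:
            found = _INSUFFICIENT.findall(cond.get("message", ""))
            result["short_on"] = sorted(set(found))

    if result["short_on"]:
        result["suggestion"] = (
            "Scheduler reports insufficient "
            + ", ".join(result["short_on"])
            + "; consider lowering the pod's requests or adding node capacity."
        )
    return result
```

In the agent, a structured result like this would be handed to the model as context, so the suggestion text here is only a placeholder for the AI-generated advice.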
How I built it
- Core platform: Google Kubernetes Engine (GKE)
- Intelligence layer: Google ADK agents + Vertex AI for analysis
- Tooling: Model Context Protocol (MCP) server to expose cluster tools like list_pods, get_pod_logs, scale_deployment, etc.
- Languages & frameworks: Python 3.11, Kubernetes Python client, httpx, requests
- Deployment: Docker + Artifact Registry + Cloud Build + kubectl manifests
- Data sources: Kubernetes API, GCP project metadata, metrics-server, Prometheus, Cloud Monitoring
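To make the tooling layer more concrete, here is a rough sketch of what the bodies of `list_pods`, `get_pod_logs`, and `scale_deployment` might look like. The tool names come from the list above, but the signatures, return shapes, and the choice to pass the API objects in as arguments are assumptions. Registration with the MCP server is left out. The calls used (`list_namespaced_pod`, `read_namespaced_pod_log`, `patch_namespaced_deployment_scale`) are methods of the Kubernetes Python client's `CoreV1Api` and `AppsV1Api`.

```python
def list_pods(core_v1, namespace: str) -> list[dict]:
    """Return name and phase for each pod in a namespace."""
    pods = core_v1.list_namespaced_pod(namespace=namespace)
    return [{"name": p.metadata.name, "phase": p.status.phase} for p in pods.items]


def get_pod_logs(core_v1, name: str, namespace: str, tail_lines: int = 100) -> str:
    """Return the last tail_lines lines of a pod's log as text."""
    return core_v1.read_namespaced_pod_log(
        name=name, namespace=namespace, tail_lines=tail_lines
    )


def scale_deployment(apps_v1, name: str, namespace: str, replicas: int) -> int:
    """Patch a deployment's scale subresource and return the requested count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    body = {"spec": {"replicas": replicas}}
    apps_v1.patch_namespaced_deployment_scale(
        name=name, namespace=namespace, body=body
    )
    return replicas
```

Inside the cluster, the two API objects would typically be created with `kubernetes.client.CoreV1Api()` and `kubernetes.client.AppsV1Api()` after calling `kubernetes.config.load_incluster_config()`. Passing them in as parameters also makes the tools easy to test with fakes.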
Challenges I ran into
- RBAC permissions: making sure the agent had just enough access to manage pods and deployments securely.
- Metrics hunting: finding the right GKE metrics (CPU, memory, pod lifecycle events) and wiring them into the MCP tools.
- Debugging: plenty of “2 a.m. errors” with kubeconfig, service accounts, and failed image pulls.
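As an illustration of the "just enough access" goal from the RBAC point above, the following sketch builds a namespaced Role manifest as a Python dictionary. The role name `k8s-mcp-agent` and the exact rule set are assumptions chosen to cover reading pods, logs, and events, deleting pods (for restarts and cleanup), and patching the deployment scale subresource. The real project may grant a different set.

```python
import json


def build_agent_role(namespace: str, name: str = "k8s-mcp-agent") -> dict:
    """Build a least-privilege Role manifest for the troubleshooting agent.

    The name and rule set are illustrative assumptions, not the project's
    actual manifest.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Read-only access for diagnosis.
            {"apiGroups": [""], "resources": ["pods", "events"],
             "verbs": ["get", "list", "watch"]},
            {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
            # Deleting a pod lets its controller recreate it (a "restart").
            {"apiGroups": [""], "resources": ["pods"], "verbs": ["delete"]},
            {"apiGroups": ["apps"], "resources": ["deployments"],
             "verbs": ["get", "list", "watch"]},
            # Scaling only needs the scale subresource, not full write access.
            {"apiGroups": ["apps"], "resources": ["deployments/scale"],
             "verbs": ["get", "patch"]},
        ],
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_agent_role("default"), indent=2))
```

A RoleBinding that ties this Role to the agent's service account would still be needed. Avoiding wildcard verbs and resources keeps the impact small if the agent misbehaves.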
Accomplishments that I'm proud of
- Built a working AI assistant that can interact with GKE clusters in real time.
- Integrated MCP tools with ADK + Vertex AI to provide intelligent troubleshooting suggestions.
- Designed a solution that extends Kubernetes without touching core application code, making it flexible and non-intrusive.
- Contributed a fresh idea to the GKE Turns 10 Hackathon community.
What I learned
- GKE’s APIs are powerful for building intelligent extensions.
- MCP offers a flexible way to plug AI into Kubernetes workflows.
- Combining monitoring with AI insights can significantly reduce troubleshooting time.
- Sometimes the biggest learnings come from solving errors that seem small, like permissions or missing metrics, but block everything else.
What's next for A K8S MCP Troubleshooting Agent
- Add predictive analytics for scaling decisions.
- Train ML models for anomaly detection in cluster behavior.
- Expand remediation actions for more self-healing capabilities.
- Integrate with more Google Cloud services to broaden coverage.
Built With
- adk
- docker
- gcp
- gke
- httpx
- kubernetes
- kubernetes-python-client
- mcp
- python
- vertex


