Inspiration

Running applications on GKE involves constant monitoring, scaling, upgrades, and security checks. SREs often juggle multiple dashboards and alerts, which leads to slow responses and human error. We wanted to create an AI-powered copilot that makes Kubernetes operations smarter, safer, and easier to manage.

What it does

SRE-Copilot for GKE adds intelligent, Gemini-powered agents on top of the Online Boutique microservice app. These agents continuously monitor the cluster, detect risks, predict scaling issues, and propose safe remediations. The dashboard translates complex Kubernetes signals into natural-language insights and allows SREs to “approve & apply” changes confidently.

How we built it

Deployed Online Boutique on Google Kubernetes Engine (GKE) as the baseline app.

Built specialized agents:

API Change Guardian for safe manifest changes.

Performance & Capacity Optimizer for scaling and quota.

Security Sentinel for policy drift and vulnerabilities.

Cluster Maintenance Concierge for upgrade planning.

Used Gemini AI for natural-language reasoning.

Used the Model Context Protocol (MCP) to connect Gemini to cluster APIs, and the Agent-to-Agent (A2A) protocol to orchestrate the agents.

Added kubectl-ai for AI-assisted but human-approved actions.
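To make the "AI-assisted but human-approved" idea concrete, here is a minimal sketch of an approval gate. It is not the project's actual code: `Proposal`, `ApprovalGate`, and the `apply_fn` callback are hypothetical names, and the example patch is invented for illustration. The only point it shows is that a proposal cannot be applied until someone has explicitly approved it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Proposal:
    """A remediation suggested by an agent (hypothetical structure)."""
    agent: str
    summary: str
    patch: Dict[str, Any]
    approved: bool = False


class ApprovalGate:
    """Queues proposals and refuses to apply any that are not approved."""

    def __init__(self, apply_fn: Callable[[Proposal], None]) -> None:
        # apply_fn is whatever actually changes the cluster; here it is
        # injected so the gate itself never talks to Kubernetes.
        self._apply_fn = apply_fn
        self._proposals: List[Proposal] = []

    def submit(self, proposal: Proposal) -> int:
        """Queue a proposal and return its id (its index in the queue)."""
        self._proposals.append(proposal)
        return len(self._proposals) - 1

    def approve(self, proposal_id: int) -> None:
        """Record a human approval for the given proposal."""
        self._proposals[proposal_id].approved = True

    def apply(self, proposal_id: int) -> None:
        """Apply an approved proposal; raise PermissionError otherwise."""
        proposal = self._proposals[proposal_id]
        if not proposal.approved:
            raise PermissionError(f"proposal {proposal_id} has not been approved")
        self._apply_fn(proposal)


if __name__ == "__main__":
    applied: List[str] = []
    gate = ApprovalGate(lambda p: applied.append(p.summary))
    pid = gate.submit(
        Proposal(
            agent="Performance & Capacity Optimizer",
            summary="Raise frontend replicas to 3",  # invented example
            patch={"spec": {"replicas": 3}},
        )
    )
    gate.approve(pid)
    gate.apply(pid)
    print(applied)
```

Injecting `apply_fn` keeps the approval logic testable without a cluster; in a real setup that callback would be where the human-approved command is handed off for execution.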

Challenges we ran into

Integrating multiple agents without modifying the Online Boutique core code.

Handling quota limits and resource constraints in GKE Autopilot clusters.

Making the dashboard intuitive yet powerful for both AI-driven recommendations and manual approvals.

Ensuring production-grade reliability while running experimental AI workflows.

Accomplishments that we're proud of

Built a fully agentic extension to a real-world microservice demo (Online Boutique) without breaking its architecture.

Successfully connected Gemini AI with cluster APIs via MCP.

Designed an intuitive Approve & Apply flow to keep humans in control.

Created a modular design so new agents can be added easily.
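One way a modular design like this could look is a small registry that agents plug into. The sketch below is an assumption about the shape of such a design, not the project's implementation: the `register` decorator, the `AGENTS` table, the snapshot keys (`privileged_pods`, `cpu_utilization`), and the 80% threshold are all hypothetical.

```python
from typing import Any, Callable, Dict, List

ClusterSnapshot = Dict[str, Any]
Agent = Callable[[ClusterSnapshot], List[str]]

# Registry of agent name -> agent function (hypothetical design).
AGENTS: Dict[str, Agent] = {}


def register(name: str) -> Callable[[Agent], Agent]:
    """Decorator that adds an agent function to the registry under `name`."""
    def decorator(fn: Agent) -> Agent:
        if name in AGENTS:
            raise ValueError(f"agent {name!r} is already registered")
        AGENTS[name] = fn
        return fn
    return decorator


@register("Security Sentinel")
def security_sentinel(snapshot: ClusterSnapshot) -> List[str]:
    # Hypothetical rule: flag every pod listed as privileged.
    return [f"{pod} runs privileged" for pod in snapshot.get("privileged_pods", [])]


@register("Performance & Capacity Optimizer")
def capacity_optimizer(snapshot: ClusterSnapshot) -> List[str]:
    # Hypothetical rule: warn when CPU utilization exceeds 80%.
    cpu = snapshot.get("cpu_utilization", 0.0)
    return ["CPU utilization above 80%"] if cpu > 0.8 else []


def run_all(snapshot: ClusterSnapshot) -> Dict[str, List[str]]:
    """Run every registered agent and collect its findings by agent name."""
    return {name: agent(snapshot) for name, agent in AGENTS.items()}


if __name__ == "__main__":
    print(run_all({"privileged_pods": ["cartservice-abc"], "cpu_utilization": 0.9}))
```

With this shape, adding a new agent means writing one decorated function; nothing in `run_all` or the existing agents has to change.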

What we learned

How agentic AI can augment cloud-native platforms without replacing human decision-making.

Best practices for connecting external AI services (Gemini) with Kubernetes APIs safely.

The importance of orchestration protocols (MCP & A2A) for multi-agent collaboration.

Practical limits of GKE Autopilot quotas and how AI can help anticipate them.
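As a rough illustration of what "anticipating" a quota limit can mean, the sketch below fits a straight line to recent usage samples and estimates how many more sampling intervals remain before the quota is reached. This is a deliberately simple linear trend, swapped in for illustration; the write-up does not say how the agents actually forecast, and the function name and sample values are hypothetical.

```python
from typing import Optional, Sequence


def intervals_until_quota(usage: Sequence[float], quota: float) -> Optional[float]:
    """Estimate sampling intervals left before usage reaches the quota.

    Fits a least-squares line to equally spaced usage samples.
    Returns 0.0 if the latest sample already meets the quota, and
    None if the trend is flat or declining (no projected breach).
    """
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two usage samples")
    if usage[-1] >= quota:
        return 0.0

    # Ordinary least squares with x = 0, 1, ..., n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(usage))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # x position where the fitted line crosses the quota, measured
    # relative to the index of the most recent sample.
    crossing = (quota - intercept) / slope
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    # Invented samples: vCPU requests growing by 2 per interval, quota of 24.
    print(intervals_until_quota([10, 12, 14, 16], quota=24))
```

A real forecaster would likely need to handle seasonality and bursty workloads; the linear fit only shows the basic idea of turning usage history into lead time.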

What's next for SRE-Copilot for GKE

Support for multi-cluster federation and hybrid cloud setups.

Tighter integration with Gemini CLI for developer workflows.

Automated runbooks/playbooks powered by the Agent Development Kit (ADK).

Expanding beyond Online Boutique to support other GKE-based microservice applications.

Built With

  • agent-to-agent-(a2a)
  • agentic-ai
  • ai-driven-operations
  • cloud-native
  • cluster-maintenance
  • devops
  • gemini-ai
  • gemini-cli
  • google-kubernetes-engine-(gke)
  • kubectl-ai
  • kubernetes-automation
  • microservices
  • model-context-protocol-(mcp)
  • multi-agent-systems
  • online-boutique
  • reliability
  • resource-optimization
  • scaling
  • security
  • site-reliability-engineering
  • sre
  • zero-downtime-upgrades