Inspiration

As an SRE, I find that managing Kubernetes clusters and optimizing cloud costs are two of the most complex challenges facing DevOps teams today. While working with multiple GKE clusters and analyzing GCP billing data, I realized that these tasks required constant context-switching between different tools, dashboards, and APIs. I wanted to explore how AI agents could not only simplify each of these workflows but also collaborate on problems that span both domains. When I discovered Google's Agent Development Kit (ADK) and the A2A (Agent-to-Agent) protocol, I saw an opportunity to build specialized AI agents that work together like a team of expert copilots. The vision was simple: what if you could ask natural language questions about your infrastructure and get actionable answers, whether about cluster health, resource usage, or cost optimization?

What it does

Cloud AI Copilots is a multi-agent platform consisting of three Cloud Run services:

  1. K8s Copilot - Kubernetes Management Agent
     • Provides natural language interface to Kubernetes clusters
     • Monitors cluster health, pod status, and resource utilization
     • Troubleshoots deployment issues and suggests optimizations
     • Manages deployments, services, and configurations through conversation
     • Exposes both A2A protocol and web UI for human interaction

  2. Cost Optimization Copilot - Cloud Cost Analysis Agent
     • Analyzes GCP billing data and resource usage patterns
     • Identifies idle resources and cost-saving opportunities
     • Provides budget forecasts and spending recommendations
     • Detects anomalies in cloud spending
     • Offers rightsizing suggestions for compute resources

  3. Copilot Dashboard - Unified Interface
     • Single-page application that connects both agents
     • Provides agent discovery and status monitoring
     • Routes user queries to appropriate agents
     • Built with TanStack Start (React) for modern UX

Agent Collaboration (A2A Protocol): The magic happens when agents work together. For example, when you ask K8s Copilot: "Are there any pods consuming excessive resources that might be increasing our costs?" - it automatically collaborates with Cost Copilot via the A2A protocol. K8s Copilot identifies resource-heavy pods while Cost Copilot analyzes the financial impact, providing a comprehensive answer.
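To make the collaboration flow concrete, here is a minimal Python sketch of the idea, not the project's actual code. The Kubernetes side filters for resource-heavy pods, then hands only those pods to a cost estimator and merges the two results. The names PodUsage, CostEstimator, and answer_cost_question are hypothetical, and the estimator is a plain callable standing in for the real A2A request to Cost Copilot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PodUsage:
    """Hypothetical snapshot of one pod's resource usage."""
    name: str
    cpu_cores: float
    memory_gib: float


# Stand-in for the A2A call to Cost Copilot (an assumption for illustration):
# takes the heavy pods and returns an estimated monthly cost per pod name.
CostEstimator = Callable[[List[PodUsage]], Dict[str, float]]


def find_heavy_pods(pods: List[PodUsage], cpu_threshold: float) -> List[PodUsage]:
    """Kubernetes side: keep only pods using more CPU than the threshold."""
    return [pod for pod in pods if pod.cpu_cores > cpu_threshold]


def answer_cost_question(
    pods: List[PodUsage],
    estimate_cost: CostEstimator,
    cpu_threshold: float = 2.0,
) -> str:
    """Combine cluster data with the cost agent's estimate into one answer."""
    heavy = find_heavy_pods(pods, cpu_threshold)
    if not heavy:
        return "No pods exceed the CPU threshold."
    costs = estimate_cost(heavy)  # delegated to the other agent
    ranked = sorted(heavy, key=lambda pod: costs[pod.name], reverse=True)
    return "\n".join(
        f"{pod.name}: {pod.cpu_cores:.1f} cores, ~${costs[pod.name]:.2f}/month"
        for pod in ranked
    )
```

Passing the estimator in as a parameter keeps the Kubernetes logic testable without a running Cost Copilot; in the deployed system that callable would wrap the A2A request.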

How we built it

Architecture:

  • 3 Cloud Run Services deployed in europe-west1
  • CI/CD Pipeline using Cloud Build and Artifact Registry
  • Secret Management via Google Secret Manager
  • Container Registry for Docker images with multi-stage builds

Tech Stack:

Backend (Python):

  • Google Agent Development Kit (ADK) - Agent framework
  • Gemini 2.5 Flash - Large language model for intelligent responses
  • Kubernetes Python Client - Direct cluster API access
  • GCP Billing API & Cloud Asset Inventory - Cost data analysis
  • Uvicorn - ASGI server for high-performance async handling
  • A2A Protocol - Agent-to-agent communication standard

Frontend (TypeScript):

  • TanStack Start - Full-stack React framework with SSR
  • React 19 - UI library
  • TypeScript - Type safety and better DX
  • Tailwind CSS - Utility-first styling
  • shadcn/ui - Beautiful, accessible components

Development Process:

  1. Agent Development - Built specialized agents with ADK, defining custom tools for Kubernetes API and GCP Billing API interactions
  2. A2A Integration - Implemented agent discovery and cross-agent communication using Google's A2A protocol specification
  3. Dockerization - Created optimized multi-stage Dockerfiles for each service
  4. Cloud Run Deployment - Configured auto-scaling (0-3 instances), secret mounting, and environment variables
  5. Dashboard Development - Built a static-first React dashboard for instant loading
  6. CI/CD Pipeline - Automated builds, image pushes, and deployments with Cloud Build
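The agent discovery in step 2 can be sketched with the standard library alone. This is an illustrative sketch under assumptions, not the project's implementation: it fetches the agent card from the /.well-known/agent-card.json path and reads a few fields (name, description, url, skills) that the A2A agent card format is assumed to provide. AgentSummary, discover_agent, and the injectable fetch parameter are hypothetical names introduced here.

```python
import json
from dataclasses import dataclass
from typing import Callable, Tuple
from urllib.request import urlopen

AGENT_CARD_PATH = "/.well-known/agent-card.json"


@dataclass(frozen=True)
class AgentSummary:
    """Hypothetical, trimmed-down view of an agent card."""
    name: str
    description: str
    url: str
    skill_names: Tuple[str, ...]


def agent_card_url(base_url: str) -> str:
    """Build the well-known card URL, tolerating a trailing slash."""
    return base_url.rstrip("/") + AGENT_CARD_PATH


def http_get(url: str) -> str:
    """Default fetcher: plain HTTP GET returning the body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def discover_agent(
    base_url: str, fetch: Callable[[str], str] = http_get
) -> AgentSummary:
    """Fetch and summarize one agent's card.

    Only "name" is treated as required; the other fields fall back to
    defaults so that a sparse card does not break discovery.
    """
    card = json.loads(fetch(agent_card_url(base_url)))
    return AgentSummary(
        name=card["name"],
        description=card.get("description", ""),
        url=card.get("url", base_url),
        skill_names=tuple(s.get("name", "") for s in card.get("skills", [])),
    )
```

The fetch parameter exists so the parsing logic can be exercised with a canned JSON string instead of a live Cloud Run URL.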

Challenges we ran into

  1. A2A Protocol Implementation
     • The A2A protocol was relatively new, with limited examples beyond the official docs
     • Challenge: Getting agents to properly discover each other and route queries
     • Solution: Implemented agent card discovery at /.well-known/agent-card.json and tested cross-agent communication extensively

  2. Secret Management in Cloud Run
     • Needed to securely mount Kubernetes kubeconfig and API keys
     • Challenge: Different secrets for different services (kubeconfig only for k8s-copilot)
     • Solution: Used Google Secret Manager with granular IAM permissions and selective secret mounting per service

  3. Cost Data Analysis Without BigQuery Export
     • Initially planned to use BigQuery billing export for detailed cost analysis
     • Challenge: Setting up billing export requires organization-level permissions
     • Solution: Pivoted to using GCP Billing API and Cloud Asset Inventory for real-time cost data

  4. Dashboard Static vs Dynamic Agent Discovery
     • First implementation used dynamic agent discovery with API calls on every page load
     • Challenge: Slow loading times, CORS issues, and complex error handling
     • Solution: Switched to static configuration with environment variables, achieving instant page loads while maintaining flexibility. This has room for improvement long term.

  5. Gemini API Rate Limits During Development
     • Hit rate limits during testing with multiple concurrent agent requests
     • Solution: Implemented exponential backoff retry logic and added caching for repeated queries
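The rate-limit fix combines two general techniques, exponential backoff and response caching, which can be sketched as follows. This is a minimal illustration under assumptions rather than the project's code: RateLimitError is a placeholder for whatever exception the model client raises on a rate-limit response, and CachedModel, call_with_backoff, and the delay values are hypothetical.

```python
import random
import time
from typing import Callable, Dict


class RateLimitError(Exception):
    """Placeholder for the client's rate-limit (HTTP 429) exception."""


def call_with_backoff(
    fn: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry fn on rate limits, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so concurrent agents do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")


class CachedModel:
    """Wraps a prompt -> answer callable with retries and an in-memory cache."""

    def __init__(self, ask: Callable[[str], str], **backoff_options) -> None:
        self._ask = ask
        self._backoff_options = backoff_options
        self._cache: Dict[str, str] = {}

    def query(self, prompt: str) -> str:
        # Normalize case and whitespace so trivially repeated prompts hit the cache.
        key = " ".join(prompt.lower().split())
        if key not in self._cache:
            self._cache[key] = call_with_backoff(
                lambda: self._ask(prompt), **self._backoff_options
            )
        return self._cache[key]
```

An in-memory cache like this is lost whenever a Cloud Run instance scales to zero, so it mainly helps with bursts of repeated queries within a single instance's lifetime.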

Accomplishments that we're proud of

  • Successfully implemented A2A protocol - Agents can genuinely collaborate on complex queries spanning Kubernetes and cost optimization
  • Fully serverless architecture - All 3 services auto-scale from 0 to 3 instances based on demand, minimizing costs
  • Production-ready deployment - Automated CI/CD pipeline with Cloud Build, proper secret management, and monitoring
  • Clean architecture - Separation of concerns with 3 independent services that communicate via well-defined protocols
  • Type-safe full-stack application - TypeScript on the frontend, type hints in the Python agents
  • Real-world utility - Solves actual DevOps pain points with natural language interfaces to complex systems
  • Comprehensive documentation - README, architecture diagrams, and code comments for maintainability

What we learned

Technical Learnings:

  • How to build production-grade AI agents using Google ADK
  • Implementing agent-to-agent communication with the A2A protocol
  • Deploying multi-service architectures on Cloud Run with proper networking
  • Managing secrets and sensitive data in serverless environments
  • Optimizing Docker images for faster Cloud Run deployments
  • Building modern React applications with TanStack Start

Architectural Insights:

  • The power of specialized agents vs monolithic AI systems
  • Benefits of serverless for AI workloads (cost efficiency, auto-scaling)
  • Importance of proper error handling in AI agent responses
  • Static-first approaches for better UX in agent discovery

AI/ML Insights:

  • How to design effective tools for Gemini to interact with external APIs
  • Prompt engineering for consistent, structured responses
  • Handling context windows and conversation history in agents
  • Balancing between agent autonomy and user control

What's next for Cloud Copilots

Short-term (Next Month):

  • Add BigQuery billing export integration for deeper cost analysis
  • Implement conversation history and multi-turn interactions
  • Add support for multiple Kubernetes clusters
  • Create scheduled cost reports and alerts
  • Add authentication and multi-user support

Medium-term (Next Quarter):

  • Add more specialized agents:
     • Security Copilot - Vulnerability scanning and compliance
     • Observability Copilot - Logs, metrics, and traces analysis
     • CI/CD Copilot - Pipeline optimization and deployment automation
  • Implement agent orchestration for complex workflows
  • Add voice interface for hands-free operations
  • Build Slack/Teams integration for conversational DevOps

Long-term Vision:

  • Create an agent marketplace where teams can publish and share custom agents
  • Support for multi-cloud environments (AWS, Azure)
  • Advanced agent collaboration with hierarchical task delegation
  • ML-powered anomaly detection and predictive analytics
  • Open-source the framework for building domain-specific AI agents

Community Goals:

  • Publish blog posts and tutorials on building AI agents with ADK
  • Create video courses on A2A protocol implementation
  • Contribute improvements back to Google ADK
  • Build a community around collaborative AI agents for DevOps
