Inspiration
As an SRE, I see managing Kubernetes clusters and optimizing cloud costs as two of the most complex challenges facing DevOps teams today. While working with multiple GKE clusters and analyzing GCP billing data, I realized that these tasks required constant context-switching between different tools, dashboards, and APIs. I wanted to explore how AI agents could not only simplify these workflows individually but also collaborate on problems that span multiple domains.
When I discovered Google's Agent Development Kit (ADK) and the Agent-to-Agent (A2A) protocol, I saw an opportunity to build specialized AI agents that work together like a team of expert copilots. The vision was simple: what if you could ask natural language questions about your infrastructure and get intelligent, actionable answers, whether about cluster health, resource usage, or cost optimization?
What it does
Cloud AI Copilots is a multi-agent platform consisting of three Cloud Run services:
K8s Copilot - Kubernetes Management Agent:
- Provides natural language interface to Kubernetes clusters
- Monitors cluster health, pod status, and resource utilization
- Troubleshoots deployment issues and suggests optimizations
- Manages deployments, services, and configurations through conversation
- Exposes both A2A protocol and web UI for human interaction
Cost Optimization Copilot - Cloud Cost Analysis Agent:
- Analyzes GCP billing data and resource usage patterns
- Identifies idle resources and cost-saving opportunities
- Provides budget forecasts and spending recommendations
- Detects anomalies in cloud spending
- Offers rightsizing suggestions for compute resources
Copilot Dashboard - Unified Interface:
- Single-page application that connects both agents
- Provides agent discovery and status monitoring
- Routes user queries to appropriate agents
- Built with TanStack Start (React) for modern UX
Agent Collaboration (A2A Protocol): The magic happens when agents work together. For example, when you ask K8s Copilot, "Are there any pods consuming excessive resources that might be increasing our costs?", it automatically collaborates with Cost Copilot via the A2A protocol. K8s Copilot identifies the resource-heavy pods while Cost Copilot analyzes their financial impact, and together they provide a comprehensive answer.
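The collaboration flow described above can be sketched in plain Python. This is only an illustration of the idea, not the project's code: the function names, the CPU threshold, and the flat per-core rate are all assumptions, and the `ask_cost_agent` callable stands in for the real A2A request to Cost Copilot.

```python
from typing import Callable

# Hypothetical data shape: pod name -> CPU cores currently in use.
PodUsage = dict[str, float]


def find_heavy_pods(usage: PodUsage, cpu_threshold: float) -> list[str]:
    """K8s side: return pods above the CPU threshold, heaviest first."""
    heavy = [name for name, cores in usage.items() if cores > cpu_threshold]
    return sorted(heavy, key=lambda name: usage[name], reverse=True)


def answer_cost_question(
    usage: PodUsage,
    cpu_threshold: float,
    ask_cost_agent: Callable[[str, float], float],
) -> list[dict]:
    """Combine K8s findings with cost estimates from a peer agent.

    `ask_cost_agent` is a placeholder for the A2A call to the cost agent;
    it takes (pod name, CPU cores) and returns an estimated monthly cost.
    """
    report = []
    for pod in find_heavy_pods(usage, cpu_threshold):
        report.append(
            {
                "pod": pod,
                "cpu_cores": usage[pod],
                "monthly_cost_usd": ask_cost_agent(pod, usage[pod]),
            }
        )
    return report


def fake_cost_agent(pod: str, cores: float) -> float:
    """Stub for the remote agent: an invented rate of 25 USD per core per month."""
    return round(cores * 25.0, 2)


if __name__ == "__main__":
    usage = {"api": 2.0, "worker": 0.3, "batch": 3.5}
    for row in answer_cost_question(usage, 1.0, fake_cost_agent):
        print(row)
```

Passing the cost lookup in as a callable keeps the Kubernetes logic testable without a second agent running, which is one reasonable way to structure cross-agent calls.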
How we built it
Architecture:
- 3 Cloud Run Services deployed in europe-west1
- CI/CD Pipeline using Cloud Build and Artifact Registry
- Secret Management via Google Secret Manager
- Container Registry for Docker images with multi-stage builds
Tech Stack:
Backend (Python):
- Google Agent Development Kit (ADK) - Agent framework
- Gemini 2.5 Flash - Large language model for intelligent responses
- Kubernetes Python Client - Direct cluster API access
- GCP Billing API & Cloud Asset Inventory - Cost data analysis
- Uvicorn - ASGI server for high-performance async handling
- A2A Protocol - Agent-to-agent communication standard
Frontend (TypeScript):
- TanStack Start - Full-stack React framework with SSR
- React 19 - UI library
- TypeScript - Type safety and better DX
- Tailwind CSS - Utility-first styling
- shadcn/ui - Beautiful, accessible components
Development Process:
- Agent Development - Built specialized agents with ADK, defining custom tools for Kubernetes API and GCP Billing API interactions
- A2A Integration - Implemented agent discovery and cross-agent communication using Google's A2A protocol specification
- Dockerization - Created optimized multi-stage Dockerfiles for each service
- Cloud Run Deployment - Configured auto-scaling (0-3 instances), secret mounting, and environment variables
- Dashboard Development - Built static-first React dashboard for instant loading
- CI/CD Pipeline - Automated builds, image pushes, and deployments with Cloud Build
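To make the agent discovery step more concrete, here is a minimal sketch that builds the discovery URL and reads a few fields from an agent card. The path `/.well-known/agent-card.json` is the one the project uses; the helper names, the example host, and the subset of card fields read here (`name`, `url`, `skills[].id`) are assumptions for illustration, so check them against the A2A specification before relying on them.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

AGENT_CARD_PATH = "/.well-known/agent-card.json"


def agent_card_url(base_url: str) -> str:
    """Build the discovery URL; the absolute path replaces any path in base_url."""
    return urljoin(base_url, AGENT_CARD_PATH)


def summarize_agent_card(raw_json: str) -> dict:
    """Extract the fields a caller might use to decide where to route a query."""
    card = json.loads(raw_json)
    return {
        "name": card["name"],
        "url": card.get("url", ""),
        "skills": [skill["id"] for skill in card.get("skills", [])],
    }


def fetch_agent_card(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch and summarize a remote agent card (requires network access)."""
    with urlopen(agent_card_url(base_url), timeout=timeout) as response:
        return summarize_agent_card(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical card contents, used instead of a live request.
    example = json.dumps(
        {
            "name": "Cost Copilot",
            "url": "https://cost-copilot.example.com",
            "skills": [{"id": "idle-resources"}, {"id": "forecast"}],
        }
    )
    print(agent_card_url("https://cost-copilot.example.com"))
    print(summarize_agent_card(example))
```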
Challenges we ran into
A2A Protocol Implementation:
- The A2A protocol was relatively new, with limited examples beyond the official docs
- Challenge: Getting agents to properly discover each other and route queries
- Solution: Implemented agent card discovery at /.well-known/agent-card.json and tested cross-agent communication extensively
Secret Management in Cloud Run:
- Needed to securely mount Kubernetes kubeconfig and API keys
- Challenge: Different secrets for different services (kubeconfig only for k8s-copilot)
- Solution: Used Google Secret Manager with granular IAM permissions and selective secret mounting per service
Cost Data Analysis Without BigQuery Export:
- Initially planned to use BigQuery billing export for detailed cost analysis
- Challenge: Setting up billing export requires organization-level permissions
- Solution: Pivoted to using GCP Billing API and Cloud Asset Inventory for real-time cost data
Dashboard Static vs Dynamic Agent Discovery:
- First implementation used dynamic agent discovery with API calls on every page load
- Challenge: Slow loading times, CORS issues, and complex error handling
- Solution: Switched to static configuration with environment variables, achieving instant page loads while maintaining flexibility. This approach still has room for improvement in the long term.
Gemini API Rate Limits During Development:
- Hit rate limits during testing with multiple concurrent agent requests
- Solution: Implemented exponential backoff retry logic and added caching for repeated queries
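The rate-limit fix combines two standard techniques, retry with exponential backoff and caching of repeated queries. A minimal sketch of that combination follows. It is not the project's implementation: `RateLimitError` is a hypothetical stand-in for whatever error the model client raises on a rate limit, and the delays, attempt count, and cache size are assumed values.

```python
import random
import time
from functools import lru_cache


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit (HTTP 429) error."""


def call_with_backoff(func, *args, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call func(*args), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so concurrent agents do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


def cached_model_call(model_call, maxsize=128, **backoff_kwargs):
    """Wrap model_call so identical prompts are answered from an in-memory cache."""

    @lru_cache(maxsize=maxsize)
    def ask(prompt: str):
        return call_with_backoff(model_call, prompt, **backoff_kwargs)

    return ask


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_model(prompt: str) -> str:
        """Fake model that is rate limited on its first two calls."""
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitError()
        return prompt.upper()

    ask = cached_model_call(flaky_model, base_delay=0.01)
    print(ask("list idle pods"), attempts["count"])
    print(ask("list idle pods"), attempts["count"])  # second call hits the cache
```

Failed calls are not cached, since `lru_cache` only stores returned values, so a prompt that exhausted its retries is attempted afresh the next time it is asked.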
Accomplishments that we're proud of
- Successfully implemented A2A protocol - Agents can genuinely collaborate on complex queries spanning Kubernetes and cost optimization
- Fully serverless architecture - All 3 services auto-scale from 0 to 3 instances based on demand, minimizing costs
- Production-ready deployment - Automated CI/CD pipeline with Cloud Build, proper secret management, and monitoring
- Clean architecture - Separation of concerns with 3 independent services that communicate via well-defined protocols
- Type-safe full-stack application - TypeScript on frontend, strong typing in Python agents
- Real-world utility - Solves actual DevOps pain points with natural language interfaces to complex systems
- Comprehensive documentation - README, architecture diagrams, and code comments for maintainability
What we learned
Technical Learnings:
- How to build production-grade AI agents using Google ADK
- Implementing agent-to-agent communication with the A2A protocol
- Deploying multi-service architectures on Cloud Run with proper networking
- Managing secrets and sensitive data in serverless environments
- Optimizing Docker images for faster Cloud Run deployments
- Building modern React applications with TanStack Start
Architectural Insights:
- The power of specialized agents vs monolithic AI systems
- Benefits of serverless for AI workloads (cost efficiency, auto-scaling)
- Importance of proper error handling in AI agent responses
- Static-first approaches for better UX in agent discovery
AI/ML Insights:
- How to design effective tools for Gemini to interact with external APIs
- Prompt engineering for consistent, structured responses
- Handling context windows and conversation history in agents
- Balancing between agent autonomy and user control
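As an illustration of the tool-design and structured-response points above, here is a sketch of a function written so that a model can call it predictably: typed parameters, a descriptive docstring, and a dictionary result with an explicit `status` field. ADK can expose plain Python functions as tools, but everything specific here is an assumption for the example: the function name, the pod record shape, and the restart threshold are invented, and pod data is passed in directly rather than fetched through the Kubernetes client.

```python
def summarize_pod_health(pods: list[dict], restart_threshold: int = 5) -> dict:
    """Summarize pod health so the agent can report problems concisely.

    Args:
        pods: Pod records, each with "name", "phase", and "restarts" keys.
        restart_threshold: Restart count above which a pod is flagged.

    Returns:
        A dict with a "status" key, plus counts and the pods needing attention.
    """
    if not pods:
        # A structured error gives the model something explicit to relay.
        return {"status": "error", "error_message": "No pods were provided."}

    unhealthy = [
        {"name": pod["name"], "phase": pod["phase"], "restarts": pod["restarts"]}
        for pod in pods
        if pod["phase"] not in ("Running", "Succeeded")
        or pod["restarts"] > restart_threshold
    ]
    return {
        "status": "success",
        "total": len(pods),
        "unhealthy_count": len(unhealthy),
        "unhealthy": unhealthy,
    }


if __name__ == "__main__":
    sample = [
        {"name": "api", "phase": "Running", "restarts": 0},
        {"name": "job", "phase": "Pending", "restarts": 0},
        {"name": "worker", "phase": "Running", "restarts": 12},
    ]
    print(summarize_pod_health(sample))
```

Returning the same keys on every successful call is one simple way to get more consistent answers, since the model always sees the same shape.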
What's next for Cloud Copilots
Short-term (Next Month):
- Add BigQuery billing export integration for deeper cost analysis
- Implement conversation history and multi-turn interactions
- Add support for multiple Kubernetes clusters
- Create scheduled cost reports and alerts
- Add authentication and multi-user support
Medium-term (Next Quarter):
- Add more specialized agents: Security Copilot (vulnerability scanning and compliance), Observability Copilot (logs, metrics, and traces analysis), and CI/CD Copilot (pipeline optimization and deployment automation)
- Implement agent orchestration for complex workflows
- Add voice interface for hands-free operations
- Build Slack/Teams integration for conversational DevOps
Long-term Vision:
- Create an agent marketplace where teams can publish and share custom agents
- Support for multi-cloud environments (AWS, Azure)
- Advanced agent collaboration with hierarchical task delegation
- ML-powered anomaly detection and predictive analytics
- Open-source the framework for building domain-specific AI agents
Community Goals:
- Publish blog posts and tutorials on building AI agents with ADK
- Create video courses on A2A protocol implementation
- Contribute improvements back to Google ADK
- Build a community around collaborative AI agents for DevOps
Built With
- a2a-protocol
- adk
- ai-agents
- cloud-run
- devops
- docker
- gcp-billing
- gemini
- google-cloud
- kubernetes
- python
- react
- serverless
- tailwindcss
- tanstack
- typescript
