Cloud Copilots

Architecture

Inspiration

As an SRE, I see managing Kubernetes clusters and optimizing cloud costs as two of the most complex challenges facing DevOps teams today. While working with multiple GKE clusters and analyzing GCP billing data, I realized that these tasks required constant context-switching between different tools, dashboards, and APIs. I wanted to explore how AI agents could simplify each of these workflows and also collaborate on problems that span both domains.

When I discovered Google's Agent Development Kit (ADK) and the A2A (Agent-to-Agent) protocol, I saw an opportunity to build specialized AI agents that work together like a team of expert copilots. The vision was simple: what if you could ask natural language questions about your infrastructure and get intelligent, actionable answers, whether about cluster health, resource usage, or cost optimization?

What it does

Cloud Copilots is a multi-agent platform consisting of three Cloud Run services:

K8s Copilot - Kubernetes Management Agent
Provides a natural language interface to Kubernetes clusters
Monitors cluster health, pod status, and resource utilization
Troubleshoots deployment issues and suggests optimizations
Manages deployments, services, and configurations through conversation
Exposes both an A2A protocol endpoint for other agents and a web UI for human interaction
Cost Optimization Copilot - Cloud Cost Analysis Agent
Analyzes GCP billing data and resource usage patterns
Identifies idle resources and cost-saving opportunities
Provides budget forecasts and spending recommendations
Detects anomalies in cloud spending
Offers rightsizing suggestions for compute resources
Copilot Dashboard - Unified Interface
Single-page application that connects both agents
Provides agent discovery and status monitoring
Routes user queries to appropriate agents
Built with TanStack Start (React) for modern UX
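
To make the Cost Optimization Copilot's rightsizing suggestions more concrete, here is a minimal sketch of one way such a suggestion could be computed: compare a workload's CPU request with its observed peak usage and propose a smaller request when the gap is large. This is an illustration, not the project's actual logic. The `Workload` type, the `headroom` and `min_saving` parameters, and their default values are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


@dataclass
class Workload:
    # Hypothetical record; the real agent may track different fields.
    name: str
    cpu_request: str        # as written in the pod spec, e.g. "1000m"
    peak_cpu_cores: float   # observed peak usage from a metrics source


def suggest_cpu_request(
    workload: Workload,
    headroom: float = 1.3,    # assumed safety margin above observed peak
    min_saving: float = 0.2,  # assumed threshold: ignore savings under 20%
) -> Optional[str]:
    """Return a smaller CPU request in millicores, or None if not worthwhile."""
    requested = parse_cpu(workload.cpu_request)
    target = workload.peak_cpu_cores * headroom
    if requested <= 0 or target >= requested * (1 - min_saving):
        return None
    return f"{round(target * 1000)}m"


if __name__ == "__main__":
    # Sample values are made up for illustration.
    for w in (Workload("api", "1000m", 0.2), Workload("db", "500m", 0.35)):
        print(w.name, suggest_cpu_request(w))
```

The `min_saving` threshold is there so that the agent does not recommend churn for marginal gains; a real implementation would also need to consider memory, limits, and how long the usage window is.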

Agent Collaboration (A2A Protocol): The agents are most useful when they work together. For example, when you ask K8s Copilot "Are there any pods consuming excessive resources that might be increasing our costs?", it automatically collaborates with Cost Copilot via the A2A protocol. K8s Copilot identifies resource-heavy pods while Cost Copilot analyzes the financial impact, and the two results are combined into a single, comprehensive answer.
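
Before one agent can delegate to another, it has to find out what the other agent offers. In this project that happens through an agent card served at /.well-known/agent-card.json. The sketch below shows roughly what reading such a card could look like using only the Python standard library. The host name and the sample card contents are hypothetical, and the fields read here (`name`, `url`, `skills[].id`) follow my understanding of the A2A agent card format rather than the project's actual cards.

```python
import json
import urllib.request

AGENT_CARD_PATH = "/.well-known/agent-card.json"


def agent_card_url(base_url: str) -> str:
    """Build the discovery URL for an agent hosted at base_url."""
    return base_url.rstrip("/") + AGENT_CARD_PATH


def summarize_card(card_json: str) -> dict:
    """Extract the fields a caller needs in order to route a query."""
    card = json.loads(card_json)
    return {
        "name": card["name"],
        "url": card["url"],
        "skills": [skill["id"] for skill in card.get("skills", [])],
    }


def fetch_agent_card(base_url: str, timeout: float = 5.0) -> dict:
    """Download and summarize an agent card (performs a network call)."""
    with urllib.request.urlopen(agent_card_url(base_url), timeout=timeout) as resp:
        return summarize_card(resp.read().decode("utf-8"))


# Hypothetical card, for illustration only.
SAMPLE_CARD = json.dumps({
    "name": "Cost Copilot",
    "description": "Analyzes GCP costs",
    "url": "https://cost-copilot.example.com",
    "version": "1.0.0",
    "skills": [
        {"id": "analyze_costs", "name": "Analyze costs"},
        {"id": "find_idle_resources", "name": "Find idle resources"},
    ],
})

if __name__ == "__main__":
    print(agent_card_url("https://cost-copilot.example.com/"))
    print(summarize_card(SAMPLE_CARD))
```

With the card summarized, K8s Copilot could check whether a cost-related skill is advertised before forwarding the cost half of a question to the card's `url`.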

How we built it

Architecture:

3 Cloud Run Services deployed in europe-west1
CI/CD Pipeline using Cloud Build and Artifact Registry
Secret Management via Google Secret Manager
Container Registry for Docker images with multi-stage builds

Tech Stack:

Backend (Python):

Google Agent Development Kit (ADK) - Agent framework
Gemini 2.5 Flash - Large language model for intelligent responses
Kubernetes Python Client - Direct cluster API access
GCP Billing API & Cloud Asset Inventory - Cost data analysis
Uvicorn - ASGI server for high-performance async handling
A2A Protocol - Agent-to-agent communication standard

Frontend (TypeScript):

TanStack Start - Full-stack React framework with SSR
React 19 - UI library
TypeScript - Type safety and better DX
Tailwind CSS - Utility-first styling
shadcn/ui - Beautiful, accessible components

Development Process:

Agent Development - Built specialized agents with ADK, defining custom tools for Kubernetes API and GCP Billing API interactions
A2A Integration - Implemented agent discovery and cross-agent communication using Google's A2A protocol specification
Dockerization - Created optimized multi-stage Dockerfiles for each service
Cloud Run Deployment - Configured auto-scaling (0-3 instances), secret mounting, and environment variables
Dashboard Development - Built static-first React dashboard for instant loading
CI/CD Pipeline - Automated builds, image pushes, and deployments with Cloud Build
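
As an illustration of the "custom tools" mentioned in the Agent Development step, the sketch below shows the general shape such a tool could take: a plain Python function with type hints and a docstring that returns a dictionary the model can read. It is not the project's code. The function names, the returned keys, and the injected `list_pods` callable are assumptions, and the Kubernetes and ADK wiring is only outlined in comments so that the example runs without a cluster.

```python
from collections import Counter
from typing import Callable

HEALTHY_PHASES = ("Running", "Succeeded")


def make_pod_health_tool(list_pods: Callable[[str], list]) -> Callable[[str], dict]:
    """Build a tool function around a pod-listing callable.

    list_pods(namespace) is assumed to return dicts with "name" and "phase".
    In a real agent it might wrap the Kubernetes Python client, for example:
        pods = client.CoreV1Api().list_namespaced_pod(namespace).items
        [{"name": p.metadata.name, "phase": p.status.phase} for p in pods]
    """

    def get_pod_health(namespace: str) -> dict:
        """Summarize pod health in a Kubernetes namespace.

        Args:
            namespace: The namespace to inspect, e.g. "default".

        Returns:
            A dict with a "status" key plus counts per phase and a list of
            pods that are neither Running nor Succeeded.
        """
        try:
            pods = list_pods(namespace)
        except Exception as exc:  # report failures to the model instead of crashing
            return {"status": "error", "message": str(exc)}
        return {
            "status": "success",
            "namespace": namespace,
            "total": len(pods),
            "by_phase": dict(Counter(pod["phase"] for pod in pods)),
            "unhealthy": sorted(
                pod["name"] for pod in pods if pod["phase"] not in HEALTHY_PHASES
            ),
        }

    return get_pod_health


# Registering the tool with ADK would then look roughly like this
# (agent name and instruction text are hypothetical):
#   from google.adk.agents import Agent
#   agent = Agent(name="k8s_copilot", model="gemini-2.5-flash",
#                 instruction="Help operate Kubernetes clusters.",
#                 tools=[get_pod_health])
```

Injecting the pod-listing callable keeps the tool testable with fake data, which matters when the real dependency is a live cluster.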

Challenges we ran into

A2A Protocol Implementation
The A2A protocol was relatively new, with limited examples beyond the official docs
Challenge: Getting agents to properly discover each other and route queries
Solution: Implemented agent card discovery at /.well-known/agent-card.json and tested cross-agent communication extensively
Secret Management in Cloud Run
Needed to securely mount Kubernetes kubeconfig and API keys
Challenge: Different secrets for different services (kubeconfig only for k8s-copilot)
Solution: Used Google Secret Manager with granular IAM permissions and selective secret mounting per service
Cost Data Analysis Without BigQuery Export
Initially planned to use BigQuery billing export for detailed cost analysis
Challenge: Setting up billing export requires organization-level permissions
Solution: Pivoted to using GCP Billing API and Cloud Asset Inventory for real-time cost data
Dashboard Static vs Dynamic Agent Discovery
First implementation used dynamic agent discovery with API calls on every page load
Challenge: Slow loading times, CORS issues, and complex error handling
Solution: Switched to static configuration via environment variables, which removed the per-load discovery calls and made page loads effectively instant while keeping agent URLs configurable. Long term, this approach has room for improvement.
Gemini API Rate Limits During Development
Challenge: Hit rate limits during testing with multiple concurrent agent requests
Solution: Implemented exponential backoff retry logic and added caching for repeated queries
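
The retry-and-cache approach described for the rate limit problem can be sketched as follows. This is a generic illustration under stated assumptions, not the project's implementation: `RateLimitError` is a hypothetical stand-in for whatever exception the model client raises on a rate limit response, and the delay values, jitter fraction, and prompt normalization are arbitrary choices.

```python
import random
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate limit exception."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return fn()
        except RateLimitError:
            attempt += 1
            if attempt >= max_attempts:
                raise  # give up after the final attempt
            # Delay doubles each retry: base, 2*base, 4*base, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter keeps concurrent agents from retrying in lockstep.
            sleep(delay + random.uniform(0, jitter * delay))


def cache_repeated_queries(ask: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap ask so that repeated prompts are answered from an in-memory cache.

    Prompts are matched case-insensitively with whitespace collapsed. The
    cache is unbounded, which is acceptable for a sketch but not for production.
    """
    cache: Dict[str, str] = {}

    def wrapper(prompt: str) -> str:
        key = " ".join(prompt.lower().split())
        if key not in cache:
            cache[key] = ask(prompt)
        return cache[key]

    return wrapper
```

The two pieces compose naturally: wrap the model call in `call_with_backoff`, then wrap that in `cache_repeated_queries` so a repeated question never reaches the API at all. Caching is only appropriate for queries whose answers do not depend on rapidly changing cluster or billing state.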

Accomplishments that we're proud of

Successfully implemented A2A protocol - Agents can genuinely collaborate on complex queries spanning Kubernetes and cost optimization
Fully serverless architecture - All 3 services auto-scale from 0 to 3 instances based on demand, minimizing costs
Production-ready deployment - Automated CI/CD pipeline with Cloud Build, proper secret management, and monitoring
Clean architecture - Separation of concerns with 3 independent services that communicate via well-defined protocols
Type-safe full-stack application - TypeScript on frontend, strong typing in Python agents
Real-world utility - Solves actual DevOps pain points with natural language interfaces to complex systems
Comprehensive documentation - README, architecture diagrams, and code comments for maintainability

What we learned

Technical Learnings:

How to build production-grade AI agents using Google ADK
Implementing agent-to-agent communication with the A2A protocol
Deploying multi-service architectures on Cloud Run with proper networking
Managing secrets and sensitive data in serverless environments
Optimizing Docker images for faster Cloud Run deployments
Building modern React applications with TanStack Start

Architectural Insights:

The power of specialized agents vs monolithic AI systems
Benefits of serverless for AI workloads (cost efficiency, auto-scaling)
Importance of proper error handling in AI agent responses
Static-first approaches for better UX in agent discovery

AI/ML Insights:

How to design effective tools for Gemini to interact with external APIs
Prompt engineering for consistent, structured responses
Handling context windows and conversation history in agents
Balancing between agent autonomy and user control

What's next for Cloud Copilots

Short-term (Next Month):

Add BigQuery billing export integration for deeper cost analysis
Implement conversation history and multi-turn interactions
Add support for multiple Kubernetes clusters
Create scheduled cost reports and alerts
Add authentication and multi-user support

Medium-term (Next Quarter):

Add more specialized agents:
Security Copilot - Vulnerability scanning and compliance
Observability Copilot - Logs, metrics, and traces analysis
CI/CD Copilot - Pipeline optimization and deployment automation
Implement agent orchestration for complex workflows
Add voice interface for hands-free operations
Build Slack/Teams integration for conversational DevOps

Long-term Vision:

Create an agent marketplace where teams can publish and share custom agents
Support for multi-cloud environments (AWS, Azure)
Advanced agent collaboration with hierarchical task delegation
ML-powered anomaly detection and predictive analytics
Open-source the framework for building domain-specific AI agents

Community Goals:

Publish blog posts and tutorials on building AI agents with ADK
Create video courses on A2A protocol implementation
Contribute improvements back to Google ADK
Build a community around collaborative AI agents for DevOps

Built With

a2a-protocol
adk
ai-agents
cloud-run
devops
docker
gcp-billing
gemini
google-cloud
kubernetes
python
react
serverless
tailwindcss
tanstack
typescript

Updates

Merrygold Odey started this project — Nov 10, 2025 07:58 PM EST
