Inspiration

We were inspired by the complexity of modern cloud-native systems. Kubernetes and microservices offer scalability, but debugging them often feels like chasing ghosts across logs, dashboards, and PromQL queries. We wanted to build something that would make operations feel conversational—an assistant that could turn raw telemetry into human-readable insights. The hackathon gave us the perfect playground to show how AI can supercharge internal developer productivity, not just consumer-facing apps.

What it does

Our project is an AI-powered DevOps agent that runs alongside the Online Boutique on GKE. It ingests metrics and logs, detects anomalies, and generates plain-English summaries that help developers quickly understand what’s wrong. Instead of combing through dashboards and queries, teams can ask natural questions like “Why is checkout failing?” and receive clear, actionable answers with suggested remediations.

How we built it

  1. Set up the environment
     • Created a GKE Autopilot cluster and deployed the Online Boutique demo.
     • Enabled core APIs: Kubernetes Engine, Monitoring, Logging, and Pub/Sub.

  2. Ingested telemetry
     • Used MCP for metrics like request rate, error rate, latency, and pod restarts.
     • Collected logs automatically via Cloud Logging.

  3. Normalized the data
     • Designed a /snapshot API that returns the key signals as JSON in a fixed schema.
     • Grouped error logs into patterns to highlight recurring issues.

  4. Built the AI agent
     • Exposed /ask and /webhook endpoints in a Cloud Run service.
     • Integrated with Vertex AI/OpenAI to generate summaries and remediation suggestions.

  5. Designed the frontend
     • Built with Next.js, Shadcn UI, and motion.dev animations.
     • Provided a dashboard, service health indicators, and an AI chat area.
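
To make the normalization step more concrete, here is a minimal sketch of how a snapshot record and the error-log grouping could look. It is not our production code: the ServiceSnapshot field names, the normalize and group_errors helpers, and the sample log lines are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class ServiceSnapshot:
    # Hypothetical field names; the real /snapshot schema may differ.
    service: str
    request_rate: float      # requests per second
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float
    pod_restarts: int
    error_patterns: list = field(default_factory=list)


def normalize(message: str) -> str:
    """Replace variable tokens so similar log lines collapse into one pattern."""
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message)  # long hex-like ids
    message = re.sub(r"\d+", "<n>", message)                 # remaining numbers
    return message.strip()


def group_errors(lines, top=3):
    """Count normalized messages and return the most frequent patterns."""
    counts = Counter(normalize(line) for line in lines)
    return [{"pattern": p, "count": c} for p, c in counts.most_common(top)]


# Made-up log lines for illustration only.
logs = [
    "payment failed for order 1001: card declined",
    "payment failed for order 1002: card declined",
    "timeout after 3000 ms calling cartservice",
]
patterns = group_errors(logs)
snapshot = ServiceSnapshot("checkoutservice", 12.5, 0.08, 840.0, 2, patterns)
print(asdict(snapshot))
```

Collapsing IDs and numbers before counting is what lets two "payment failed" lines with different order numbers show up as a single recurring pattern instead of two unrelated errors.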

Challenges we ran into

  • Prometheus data gaps: At first, PromQL queries returned nothing. We learned that on Autopilot, MCP needs a PodMonitoring resource set up for workloads in non-system namespaces before their metrics are scraped.
  • Secrets management: Passing API keys securely required careful use of Secret Manager and Cloud Run revisions.
  • Balancing scope: Timeboxing was essential—we focused on a minimal observability schema before adding stretch features.
  • Complex ecosystem: GCP has many moving parts; breaking the project into clear steps kept us on track.
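
For the Prometheus data gap, the fix comes down to a PodMonitoring resource in the workload's namespace. The sketch below builds one as a Python dictionary and prints it as JSON. The pod_monitoring helper, the resource name, the "metrics" port name, and the 30s interval are assumptions for illustration; the target pods must actually expose a Prometheus endpoint on that named port, which we do not claim every Online Boutique service does.

```python
import json


def pod_monitoring(name, namespace, app_label, port="metrics", interval="30s"):
    """Build a PodMonitoring manifest for Managed Prometheus (illustrative values)."""
    return {
        "apiVersion": "monitoring.googleapis.com/v1",
        "kind": "PodMonitoring",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects pods by label; the label value here is an assumption.
            "selector": {"matchLabels": {"app": app_label}},
            # Port name and scrape interval are assumptions, not measured settings.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


manifest = pod_monitoring("checkout-monitoring", "default", "checkoutservice")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped into `kubectl apply -f -` once the values match the real deployment.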

Accomplishments that we're proud of

  • Perseverance: Getting Google Cloud configured was a major challenge, but we kept pushing through the errors, docs, and dead ends until everything clicked.
  • Building end-to-end: From GKE setup to log ingestion, Pub/Sub event routing, AI reasoning, and a polished frontend, we created a fully working pipeline in just a few days.
  • Clarity for users: The biggest “wow moment” was seeing a teammate type a question and get a human-readable explanation instead of a wall of metrics.

What we learned

We learned how to deploy microservices on GKE Autopilot, how to use Managed Prometheus and Cloud Logging to capture signals, and how to wire those signals into an event-driven AI agent with Pub/Sub and Cloud Run. More importantly, we learned the value of planning and simplifying: breaking the project into steps kept us from getting lost in GCP’s complexity. We also discovered that pre-structuring metrics and logs into a schema made the LLM far more consistent and useful.
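
As a rough illustration of what pre-structuring means in practice, the sketch below turns a snapshot into a prompt with a fixed layout. The build_prompt helper, the prompt wording, and the snapshot fields are hypothetical, and the call to the model itself is left out on purpose.

```python
import json


def build_prompt(question: str, snapshot: dict) -> str:
    """Embed a structured telemetry snapshot in a fixed prompt template."""
    # Sorted keys keep the layout identical from one request to the next.
    context = json.dumps(snapshot, indent=2, sort_keys=True)
    return (
        "You are a DevOps assistant. Answer using only the telemetry below.\n"
        "If the data does not explain the issue, say so.\n\n"
        f"Telemetry snapshot (JSON):\n{context}\n\n"
        f"Question: {question}\n"
        "Reply with a short summary, the likely cause, and one suggested remediation."
    )


# Made-up values for illustration only.
snapshot = {
    "service": "checkoutservice",
    "error_rate": 0.08,
    "p95_latency_ms": 840.0,
    "pod_restarts": 2,
}
prompt = build_prompt("Why is checkout failing?", snapshot)
print(prompt)
```

Giving the model the same keys in the same order on every request is one plausible reason the answers became more consistent than when it was handed raw metrics and log text.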

What's next for Ahmad and Leo's AI DevOps Agent

From here, we can extend the agent to:

  • Support automated remediation (e.g., safe rollbacks or auto-scaling).
  • Integrate with additional services, such as sending alerts to Slack/Teams or hooking into CI/CD pipelines.
  • Evolve into a more general-purpose DevOps copilot that works across clusters and environments, helping teams not just detect issues, but prevent them.