🔥 Inspiration

Massive outages have shown us how one bad secret or misconfigured env var can cripple entire systems — sometimes across regions — costing billions. We wanted a self-healing layer that (a) spots issues in real time and (b) fixes them automatically, before any human sees a PagerDuty alert.

That “rebirth from failure” spirit is why we named it: Ops Phoenix.


⚙️ What it does

  • Watches GCP Secret Manager and Cloud Logging continuously
  • Detects deleted secrets, broken deploys, or malformed config
  • Decides whether to:
    • 🔁 roll back to the last successful deploy, or
    • 🔐 restore a previous secret version
  • Heals using Cloud Build + Artifact Registry
  • Notifies via the Gmail agent (email) or Slack when it recovers or needs help
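
A minimal sketch of the "decide" step above, assuming each detected problem has already been classified into an incident type. The names RecoveryDecision, Incident, and Action are hypothetical, and mapping malformed config to a rollback is our assumption for illustration, not a confirmed rule of the real system.

```java
// Hypothetical sketch: maps a classified incident to one recovery action.
// The enum values and the mapping are illustrative assumptions.
final class RecoveryDecision {

    enum Incident { SECRET_DELETED, BROKEN_DEPLOY, MALFORMED_CONFIG, UNKNOWN }

    enum Action { RESTORE_SECRET_VERSION, ROLL_BACK_DEPLOY, NOTIFY_HUMANS }

    static Action decide(Incident incident) {
        return switch (incident) {
            // A missing secret is fixed from Secret Manager's own version history.
            case SECRET_DELETED -> Action.RESTORE_SECRET_VERSION;
            // Assumption: bad deploys and bad config both go back to the last good deploy.
            case BROKEN_DEPLOY, MALFORMED_CONFIG -> Action.ROLL_BACK_DEPLOY;
            // Anything unrecognized is not auto-fixed; a human gets notified instead.
            case UNKNOWN -> Action.NOTIFY_HUMANS;
        };
    }
}
```

Calling RecoveryDecision.decide(RecoveryDecision.Incident.SECRET_DELETED) returns RESTORE_SECRET_VERSION. Keeping the decision a pure function makes it easy to unit test without touching any cloud resources.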

📈 Outcome: >99.99% uptime, MTTR in seconds, and on-call still asleep.


🧠 Architecture (per attached diagram)

Core Component: 🟨 Ops Phoenix Orchestrator (Google ADK on Cloud Run)

  • Polls for events every 2 minutes (triggered by Cloud Scheduler)
  • Routes incidents to specific agents for recovery
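
The routing step can be sketched roughly as a registry of agents keyed by incident type. This is plain Java with no ADK or Spring dependencies; IncidentRouter, RecoveryAgent, and the outcome strings are hypothetical names for illustration, not the real orchestrator's API.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of one polling cycle: each incident goes to the agent
// registered for its type, or is escalated when no agent is registered.
final class IncidentRouter {

    enum IncidentType { SECRET, DEPLOY }

    record Incident(IncidentType type, String resource) {}

    // Stand-in for a recovery agent; returns a short outcome message.
    interface RecoveryAgent {
        String recover(Incident incident);
    }

    private final Map<IncidentType, RecoveryAgent> agents = new EnumMap<>(IncidentType.class);

    void register(IncidentType type, RecoveryAgent agent) {
        agents.put(type, agent);
    }

    // Processes the incidents found in one poll and collects the outcomes in order.
    List<String> handle(List<Incident> incidents) {
        List<String> outcomes = new ArrayList<>();
        for (Incident incident : incidents) {
            RecoveryAgent agent = agents.get(incident.type());
            outcomes.add(agent == null
                    ? "escalated: " + incident.resource()
                    : agent.recover(incident));
        }
        return outcomes;
    }
}
```

In a deployment like the one described, something such as a Cloud Scheduler-triggered endpoint would call handle(...) once per cycle. That wiring is left out here because the source does not show it.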

Agents (all powered by gemini-2.0-flash):

  • 🔐 Secret Manager Agent
    • Uses listSecretVersions, updateSecret to inspect & restore secrets
  • 📬 Gmail Agent
    • Sends real-time recovery or failure notifications
  • 📦 GitHub Agent
    • Triggers workflows or rollback commits (createPR, endRollback)
  • 📜 Cloud Logs Agent
    • Reads recent deploy + infra logs (getLatestErrorLogs, etc.)
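
To make the Secret Manager Agent's "inspect & restore" idea concrete, here is a minimal in-memory sketch of picking which earlier version to restore. It makes no real Secret Manager calls. SecretRestorePlanner and its methods are hypothetical; the three states mirror Secret Manager's version states, where a disabled version can be re-enabled but a destroyed one is gone for good.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: chooses the newest usable version older than the broken one.
// Works on an in-memory list; listing and restoring via the real API is out of scope.
final class SecretRestorePlanner {

    enum State { ENABLED, DISABLED, DESTROYED }

    record Version(int number, State state) {}

    static Optional<Version> pickRestoreCandidate(List<Version> versions, int brokenVersion) {
        return versions.stream()
                // Only versions that predate the broken one are candidates.
                .filter(v -> v.number() < brokenVersion)
                // Destroyed versions no longer hold a payload, so skip them.
                .filter(v -> v.state() != State.DESTROYED)
                // Prefer the most recent remaining version.
                .max(Comparator.comparingInt(Version::number));
    }

    // A disabled candidate has to be re-enabled before it can be read again.
    static boolean needsEnable(Version candidate) {
        return candidate.state() == State.DISABLED;
    }
}
```

An empty result means there is nothing left to restore from, which is the point where notifying a human makes more sense than retrying.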

External Services:

  • GCP Secret Manager, Gmail, GitHub, and Cloud Logging each connect to their own agent
  • All traffic is routed through the central Ops Phoenix Orchestrator, with minimal added latency

🧪 Challenges we tackled

  • Monolith pain: We started with one ADK agent — too messy. Split into micro-agents to test, debug, and deploy independently.
  • Secret Manager gotchas: Restoring secrets reliably meant working around version-state quirks (a version can be enabled, disabled, or destroyed, and a destroyed version cannot be recovered).
  • Failure injection: We built a “chaos-injector” that safely simulates broken secrets without harming real projects.
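
The source does not say how the chaos-injector stays safe, so the following is only one plausible guard, sketched under our own assumptions: injection is allowed only in explicitly allowlisted sandbox projects and only against secrets carrying a "chaos-" prefix. ChaosGuard, the prefix, and the project IDs are all hypothetical.

```java
import java.util.Set;

// Hypothetical safety check for a chaos injector. The allowlist and the
// "chaos-" naming convention are illustrative assumptions, not the real tool's rules.
final class ChaosGuard {

    private static final String CHAOS_PREFIX = "chaos-";

    private final Set<String> sandboxProjects;

    ChaosGuard(Set<String> sandboxProjects) {
        // Defensive immutable copy so the allowlist cannot change after construction.
        this.sandboxProjects = Set.copyOf(sandboxProjects);
    }

    // True only when both the project and the secret are clearly marked as safe targets.
    boolean mayInject(String projectId, String secretId) {
        if (projectId == null || secretId == null) {
            return false;
        }
        return sandboxProjects.contains(projectId) && secretId.startsWith(CHAOS_PREFIX);
    }
}
```

Requiring both conditions means a typo in either the project ID or the secret name fails closed instead of breaking something real.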

✅ What we’re proud of

  • 💡 Rebuilt the entire orchestration as a multi-agent system in <48 hours
  • 🧪 Chaos-driven demo: delete a secret, watch Ops Phoenix fully recover it
  • 📦 Test harness: replay failures locally or in any GCP project
  • ☕️ Cloud Run + Java 21 + Spring Boot 3 — no VMs, just instant scale and clean infra

📚 Lessons learned

  • 💥 Fail fast, notify faster — fix silently, but always tell the humans
  • 🪛 Built-in backups (Secret Manager versions) > custom recovery infra
  • 🧩 The agent pattern wins for event-driven platforms: clean ownership, independent scaling, and a clear debug flow.

🚀 What’s next

  • 🧩 Add support for database configs, external dashboards, and .env file diffs
  • 🌍 Go multi-cloud: AWS SSM + Azure Key Vault agents
  • 🔥 Launch “attack-sim” mode for chaos drills
  • 💰 Show cost savings from downtime avoided

Built With

  • artifact-registry
  • cloud-build
  • cloud-run
  • cloud-sql-(postgresql)
  • github
  • google-agent-development-kit-(adk)
  • google-cloud
  • java-21
  • secret-manager
  • spring-boot-3
  • vertex-ai