🔥 Inspiration
Massive outages have shown us how one bad secret or misconfigured env var can cripple entire systems — sometimes across regions — costing billions. We wanted a self-healing layer that (a) spots issues in real time and (b) fixes them automatically, before any human sees a PagerDuty alert.
That “rebirth from failure” spirit is why we named it: Ops Phoenix.
⚙️ What it does
- Watches GCP Secret Manager and Cloud Logs continuously
- Detects deleted secrets, broken deploys, or malformed config
- Decides if it should:
  - 🔁 roll back to last successful deploy, or
  - 🔐 restore a previous secret version
- Heals using Cloud Build + Artifact Registry
- Notifies via Gmail agent (email) or Slack if it recovers or needs help
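To make the "decides" step concrete, here is a minimal Java sketch of a healing policy. It is not the project's actual code: the names IncidentType, HealingAction, Incident, and HealingPolicy are hypothetical, and so are two rules in it (malformed config is treated like a broken deploy, and anything without a known-good fallback is escalated to a human).

```java
/** Hypothetical incident categories, mirroring the "Detects" bullet. */
enum IncidentType { SECRET_DELETED, DEPLOY_BROKEN, CONFIG_MALFORMED }

/** Hypothetical healing actions, mirroring the "Decides" bullets. */
enum HealingAction { RESTORE_SECRET_VERSION, ROLLBACK_DEPLOY, ESCALATE_TO_HUMAN }

/** An incident plus whether a known-good version or deploy exists to fall back to. */
record Incident(IncidentType type, boolean hasKnownGoodFallback) {}

final class HealingPolicy {
    private HealingPolicy() {}

    /** Picks an action; escalates when there is nothing safe to fall back to. */
    static HealingAction decide(Incident incident) {
        if (!incident.hasKnownGoodFallback()) {
            return HealingAction.ESCALATE_TO_HUMAN; // assumption: never guess without a fallback
        }
        return switch (incident.type()) {
            case SECRET_DELETED -> HealingAction.RESTORE_SECRET_VERSION;
            // Assumption: malformed config is handled like a broken deploy.
            case DEPLOY_BROKEN, CONFIG_MALFORMED -> HealingAction.ROLLBACK_DEPLOY;
        };
    }
}
```

Keeping the policy a pure function means it can be unit-tested without touching GCP, and the "needs help" notification path falls out of the ESCALATE_TO_HUMAN result.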
📈 Outcome: >99.99% uptime, MTTR in seconds, and on-call still asleep.
🧠 Architecture (per attached diagram)
Core Component: 🟨 Ops Phoenix Orchestrator (Google ADK on Cloud Run)
- Polls events every 2 mins (via Cloud Scheduler)
- Routes incidents to specific agents for recovery
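The routing idea can be sketched in plain Java as below. This is an illustration under assumptions, not the ADK API: AgentKind, PolledEvent, RecoveryAgent, and PhoenixRouter are made-up names, and the real orchestrator would be invoked by Cloud Scheduler over HTTP rather than called directly.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical agent identifiers, one per agent in the architecture. */
enum AgentKind { SECRET_MANAGER, GITHUB, CLOUD_LOGS, GMAIL }

/** A polled event already tagged with the agent that should handle it. */
record PolledEvent(String id, AgentKind target, String detail) {}

/** Hypothetical agent contract: handle an event, return a short status string. */
interface RecoveryAgent {
    String handle(PolledEvent event);
}

final class PhoenixRouter {
    private final Map<AgentKind, RecoveryAgent> agents = new EnumMap<>(AgentKind.class);

    void register(AgentKind kind, RecoveryAgent agent) {
        agents.put(kind, agent);
    }

    /** Processes one polling cycle; events without a registered agent are reported, not dropped. */
    List<String> dispatch(List<PolledEvent> events) {
        return events.stream()
                .map(e -> {
                    RecoveryAgent agent = agents.get(e.target());
                    return agent == null ? "unrouted:" + e.id() : agent.handle(e);
                })
                .toList();
    }
}
```

Registering a lambda such as `e -> "restored:" + e.id()` for SECRET_MANAGER is enough to exercise the router in a test, which fits the goal of testing agents independently.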
Agents (all powered by genai-2.0-flows):
🔐 Secret Manager Agent
- Uses listSecretVersions and updateSecret to inspect & restore secrets
📬 Gmail Agent
- Sends real-time recovery or failure notifications
📦 GitHub Agent
- Triggers workflows or rollback commits (createPR, endRollback)
📜 Cloud Logs Agent
- Reads recent deploy + infra logs (getLatestErrorLogs, etc.)
External Services:
- GCP Secret Manager, Gmail, GitHub, GCP Logs all connect to their respective agents
- All traffic is orchestrated via a central controller with minimal latency
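For the Secret Manager Agent, picking which version to restore might look roughly like the sketch below. It works on an in-memory model rather than the real Secret Manager client. The one fact it relies on is that a version can be enabled, disabled, or destroyed, and a destroyed version's payload cannot be recovered. The names VersionState, SecretVersionInfo, and SecretRestorePlanner are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** The three states a Secret Manager version can be in. */
enum VersionState { ENABLED, DISABLED, DESTROYED }

/** Minimal in-memory view of one secret version (hypothetical type). */
record SecretVersionInfo(int number, VersionState state) {}

final class SecretRestorePlanner {
    private SecretRestorePlanner() {}

    /**
     * Returns the newest version older than the bad one whose payload still exists.
     * Destroyed versions are skipped because their data is gone for good.
     */
    static Optional<SecretVersionInfo> pickRestoreCandidate(
            List<SecretVersionInfo> versions, int badVersion) {
        return versions.stream()
                .filter(v -> v.number() < badVersion)
                .filter(v -> v.state() != VersionState.DESTROYED)
                .max(Comparator.comparingInt(SecretVersionInfo::number));
    }
}
```

With versions 1 (enabled), 2 (destroyed), 3 (disabled) and a bad version 4, the planner returns version 3. An empty result is the signal to stop healing and notify a human.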
🧪 Challenges we tackled
- Monolith pain: We started with one ADK agent — too messy. Split into micro-agents to test, debug, and deploy independently.
- Secret Manager gotchas: Restoring secrets reliably meant working around version state quirks.
- Failure injection: We built a “chaos-injector” that safely simulates broken secrets without harming real projects.
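One way a chaos injector can stay safe is to refuse any target outside an explicit sandbox. The sketch below shows that guard only. It is an assumption about how such a check could work, not a description of the project's injector: ChaosTarget, ChaosGuard, the project allowlist, and the "chaos-" secret prefix are all hypothetical.

```java
import java.util.Set;

/** A project/secret pair the injector is asked to break (hypothetical type). */
record ChaosTarget(String projectId, String secretId) {}

final class ChaosGuard {
    private final Set<String> allowedProjects;

    ChaosGuard(Set<String> allowedProjects) {
        this.allowedProjects = Set.copyOf(allowedProjects); // defensive, immutable copy
    }

    /** Safe only if the project is allowlisted AND the secret is a throwaway "chaos-" secret. */
    boolean isSafe(ChaosTarget target) {
        return allowedProjects.contains(target.projectId())
                && target.secretId().startsWith("chaos-");
    }

    /** Returns the target unchanged, or throws before any destructive call is made. */
    ChaosTarget requireSafe(ChaosTarget target) {
        if (!isSafe(target)) {
            throw new IllegalArgumentException("Refusing chaos injection on " + target);
        }
        return target;
    }
}
```

Calling requireSafe first, before any delete or disable call, keeps the destructive path unreachable for real projects even if the injector is misconfigured.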
✅ What we’re proud of
- 💡 Rebuilt entire orchestration as multi-agent in <48 hours
- 🧪 Chaos-driven demo: delete a secret, watch Ops Phoenix fully recover it
- 📦 Test harness: replay failures locally or in any GCP project
- ☕️ Cloud Run + Java 21 + Spring Boot 3 — no VMs, just instant scale and clean infra
📚 Lessons learned
- 💥 Fail fast, notify faster — fix silently, but always tell the humans
- 🪛 Built-in backups (Secret Manager versions) > custom recovery infra
- 🧩 Agent pattern wins for event-driven platforms. Clean ownership, scale, and debug flow.
🚀 What’s next
- 🧩 Add support for database configs, external dashboards, and .env file diffs
- 🌍 Go multi-cloud: AWS SSM + Azure Key Vault agents
- 🔥 Launch “attack-sim” mode for chaos drills
- 💰 Show cost savings from downtime avoided
Built With
- artifact-registry
- cloud-build
- cloud-run
- cloud-sql-(postgresql)
- github
- google-agent-development-kit-(adk)
- google-cloud
- java-21
- secret-manager
- spring-boot-3
- vertex-ai