Ops Phoenix

Architecture Diagram for Ops-Phoenix

🔥 Inspiration

Massive outages have shown us how one bad secret or misconfigured env var can cripple entire systems — sometimes across regions — costing billions. We wanted a self-healing layer that (a) spots issues in real time and (b) fixes them automatically, before any human sees a PagerDuty alert.

That “rebirth from failure” spirit is why we named it: Ops Phoenix.

⚙️ What it does

Watches GCP Secret Manager and Cloud Logs continuously
Detects deleted secrets, broken deploys, or malformed config
Decides if it should:
- 🔁 roll back to last successful deploy, or
- 🔐 restore a previous secret version
Heals using Cloud Build + Artifact Registry
Notifies via the Gmail agent (email) or Slack when it recovers or needs help
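
The "Decides" step above can be sketched as a small pure function. This is a minimal illustration, not our production code: the names `IncidentKind`, `RecoveryAction`, and `RecoveryPlanner` are hypothetical, and the mapping of malformed config to a deploy rollback is an assumption made for the example.

```java
/** Hypothetical incident kinds, mirroring the three failure types listed above. */
enum IncidentKind { SECRET_DELETED, DEPLOY_BROKEN, CONFIG_MALFORMED }

/** Hypothetical recovery actions; ESCALATE means "notify a human and stop". */
enum RecoveryAction { RESTORE_SECRET_VERSION, ROLLBACK_DEPLOY, ESCALATE }

final class RecoveryPlanner {
    private RecoveryPlanner() {}

    /**
     * Chooses a recovery action for one incident.
     * Assumption: malformed config is treated like a broken deploy and rolled back.
     * If the needed backup (secret version or good deploy) is missing, escalate.
     */
    static RecoveryAction decide(IncidentKind kind,
                                 boolean hasPreviousSecretVersion,
                                 boolean hasLastGoodDeploy) {
        return switch (kind) {
            case SECRET_DELETED ->
                hasPreviousSecretVersion ? RecoveryAction.RESTORE_SECRET_VERSION
                                         : RecoveryAction.ESCALATE;
            case DEPLOY_BROKEN, CONFIG_MALFORMED ->
                hasLastGoodDeploy ? RecoveryAction.ROLLBACK_DEPLOY
                                  : RecoveryAction.ESCALATE;
        };
    }
}
```

Keeping the decision free of I/O makes it easy to unit-test every branch, including the "needs help" path that ends in a notification instead of a fix.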

📈 Outcome: >99.99% uptime, MTTR in seconds, and on-call still asleep.

🧠 Architecture (per attached diagram)

Core Component: 🟨 Ops Phoenix Orchestrator (Google ADK on Cloud Run)

Polls for events every 2 minutes (triggered by Cloud Scheduler)
Routes incidents to specific agents for recovery
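
One polling cycle of the orchestrator might look roughly like the sketch below. It does not use the ADK API. `Incident`, `RecoveryAgent`, and `IncidentRouter` are hypothetical names, and the string-keyed routing table is an assumption about how incidents could be matched to agents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical incident record; field names are illustrative only. */
record Incident(String type, String resource) {}

/** Hypothetical agent contract: handle one incident and return a short summary. */
interface RecoveryAgent {
    String handle(Incident incident);
}

final class IncidentRouter {
    // Maps an incident type (for example "secret") to the agent that owns it.
    private final Map<String, RecoveryAgent> agentsByType = new LinkedHashMap<>();

    void register(String incidentType, RecoveryAgent agent) {
        agentsByType.put(incidentType, agent);
    }

    /**
     * Runs one polling cycle: every incident goes to its agent,
     * and incidents with no registered agent are marked as escalated.
     */
    List<String> runCycle(List<Incident> incidents) {
        List<String> results = new ArrayList<>();
        for (Incident incident : incidents) {
            RecoveryAgent agent = agentsByType.get(incident.type());
            if (agent == null) {
                results.add("escalated: no agent for " + incident.type());
            } else {
                results.add(agent.handle(incident));
            }
        }
        return results;
    }
}
```

In a setup like ours, `runCycle` would be called from whatever endpoint Cloud Scheduler hits every 2 minutes; the scheduler itself stays outside the routing logic.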

Agents (all powered by gemini-2.0-flash):

🔐 Secret Manager Agent
- Uses listSecretVersions and updateSecret to inspect and restore secrets
📬 Gmail Agent
- Sends real-time recovery or failure notifications
📦 GitHub Agent
- Triggers workflows or rollback commits (createPR, endRollback)
📜 Cloud Logs Agent
- Reads recent deploy + infra logs (getLatestErrorLogs, etc.)
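
To make the Secret Manager Agent's restore step concrete, here is a hedged sketch of how a restore candidate could be picked from a version list. Secret Manager versions can be ENABLED, DISABLED, or DESTROYED, and a destroyed version's payload cannot be recovered. Everything else here is an assumption for illustration: `SecretVersionInfo` and `SecretRestorePlanner` are hypothetical stand-ins (not client-library types), and preferring an older enabled version over a newer disabled one is a policy choice, not something the service dictates.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** The states a Secret Manager version can be in. */
enum VersionState { ENABLED, DISABLED, DESTROYED }

/** Simplified stand-in for a secret version; not the client library's type. */
record SecretVersionInfo(int number, VersionState state) {}

final class SecretRestorePlanner {
    private SecretRestorePlanner() {}

    /** Newest version in the wanted state, ignoring the known-bad version. */
    private static Optional<SecretVersionInfo> newestInState(
            List<SecretVersionInfo> versions, int badVersion, VersionState wanted) {
        return versions.stream()
            .filter(v -> v.number() != badVersion && v.state() == wanted)
            .max(Comparator.comparingInt(SecretVersionInfo::number));
    }

    /**
     * Picks the version to restore: the newest ENABLED version other than the
     * bad one, else the newest DISABLED version (which would need re-enabling).
     * DESTROYED versions are never returned because their payload is gone.
     */
    static Optional<SecretVersionInfo> pickRestoreCandidate(
            List<SecretVersionInfo> versions, int badVersion) {
        return newestInState(versions, badVersion, VersionState.ENABLED)
            .or(() -> newestInState(versions, badVersion, VersionState.DISABLED));
    }
}
```

An empty result is the signal to stop healing and notify a human, since there is nothing safe left to restore.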

External Services:

GCP Secret Manager, Gmail, GitHub, and Cloud Logs each connect to their respective agent
All traffic is routed through the central orchestrator, keeping added latency minimal

🧪 Challenges we tackled

Monolith pain: We started with one ADK agent — too messy. Split into micro-agents to test, debug, and deploy independently.
Secret Manager gotchas: Restoring secrets reliably meant working around version state quirks.
Failure injection: We built a “chaos-injector” that safely simulates broken secrets without harming real projects.
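
The safety idea behind the chaos-injector can be sketched with an in-memory store. This is not the actual injector: `FakeSecretStore` and `ChaosInjector` are hypothetical names, and the project allow-list plus backup-and-revert approach is an assumption about one reasonable way to keep injected failures away from real projects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for a secret store, used only for this illustration. */
final class FakeSecretStore {
    private final Map<String, String> secrets = new HashMap<>();

    void put(String name, String value) { secrets.put(name, value); }
    String get(String name) { return secrets.get(name); }
    void delete(String name) { secrets.remove(name); }
    boolean exists(String name) { return secrets.containsKey(name); }
}

final class ChaosInjector {
    private final Set<String> allowedProjects;
    // Values of secrets this injector has deleted, kept so they can be put back.
    private final Map<String, String> backups = new HashMap<>();

    ChaosInjector(Set<String> allowedProjects) {
        this.allowedProjects = Set.copyOf(allowedProjects);
    }

    /** Deletes a secret to simulate an outage, but only in an allow-listed project. */
    void breakSecret(String projectId, FakeSecretStore store, String name) {
        if (!allowedProjects.contains(projectId)) {
            throw new IllegalStateException("refusing to inject chaos into " + projectId);
        }
        if (!store.exists(name)) {
            throw new IllegalArgumentException("no such secret: " + name);
        }
        backups.put(name, store.get(name));
        store.delete(name);
    }

    /** Restores every secret this injector deleted, then forgets the backups. */
    void revertAll(FakeSecretStore store) {
        backups.forEach(store::put);
        backups.clear();
    }
}
```

The allow-list check runs before anything is touched, so a mistyped project ID fails loudly instead of breaking a real secret.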

✅ What we’re proud of

💡 Rebuilt the entire orchestration as a multi-agent system in under 48 hours
🧪 Chaos-driven demo: delete a secret, watch Ops Phoenix fully recover it
📦 Test harness: replay failures locally or in any GCP project
☕️ Cloud Run + Java 21 + Spring Boot 3 — no VMs, just instant scale and clean infra

📚 Lessons learned

💥 Fail fast, notify faster — fix silently, but always tell the humans
🪛 Built-in backups (Secret Manager versions) > custom recovery infra
🧩 Agent pattern wins for event-driven platforms. Clean ownership, scale, and debug flow.