Inspiration
AI agents fail in production every single day. AutoGPT spins in infinite loops, Cursor hallucinates, Replit repeats destructive actions, and Air Canada's bot gave ungrounded answers that cost the company in court. The pattern is always the same: the agent fails, a human eventually notices, debugs it, and recovers — after the money and time are already gone.
We asked a different question: what if agents could protect other agents, and break the loop before a human ever notices? An agent stuck looping is trapped in an orbit it can't escape. So we built the thing that breaks the orbit.
What it does
Orbit Agent is a Gemini-powered Multi-Agent Reliability OS that observes external AI agents (AutoGPT, CrewAI, LangGraph, MCP, custom) over OpenTelemetry — the agents never run inside Orbit Agent. Our pipeline is literally O.R.B.I.T:
- Observe — a detector swarm scores every span for loops, stalled progress, token burn, latency, context drift, and errors.
- Reason — a Gemini Monitor Model (gemini-2.5-flash on Vertex AI) fuses the features into a single risk score.
- Break — a Redis-backed circuit breaker trips the moment risk crosses threshold.
- Intervene — a remediation swarm (retry → prompt rewrite → rollback → tool fallback → human escalation) recovers the agent automatically.
- Track — Arize Phoenix, Elastic, and Dynatrace capture traces, logs, and APM.
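The Break and Intervene steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Orbit Agent's actual code: the real breaker is Redis-backed, and the `CircuitBreaker` class, the `remediate` helper, the 0.8 threshold, and the strategy names are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class BreakerState(Enum):
    CLOSED = "closed"  # agent calls flow normally
    OPEN = "open"      # agent calls are blocked until remediation succeeds


# Ordered from least to most disruptive, mirroring the Intervene step.
# The identifiers themselves are assumptions made for this sketch.
REMEDIATION_LADDER = [
    "retry",
    "prompt_rewrite",
    "rollback",
    "tool_fallback",
    "human_escalation",
]


@dataclass
class CircuitBreaker:
    threshold: float = 0.8  # hypothetical risk threshold
    state: BreakerState = BreakerState.CLOSED

    def record(self, risk: float) -> BreakerState:
        """Trip the breaker once a risk score reaches the threshold."""
        if risk >= self.threshold:
            self.state = BreakerState.OPEN
        return self.state

    def reset(self) -> None:
        """Close the breaker again, e.g. after a successful remediation."""
        self.state = BreakerState.CLOSED


def remediate(attempt: Callable[[str], bool]) -> Optional[str]:
    """Try each strategy in order and return the first one that succeeds."""
    for strategy in REMEDIATION_LADDER:
        if attempt(strategy):
            return strategy
    return None


breaker = CircuitBreaker(threshold=0.8)
states = [breaker.record(risk) for risk in (0.2, 0.5, 0.9)]

# Pretend that only a rollback fixes the agent in this run.
recovered_by = remediate(lambda strategy: strategy == "rollback")
if recovered_by is not None:
    breaker.reset()
```

Here the breaker stays closed for the first two scores and opens on the third. The ladder then walks through retry and prompt rewrite before the rollback succeeds, and the breaker is reset.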
The result, shown live and side-by-side:
| | Unprotected | Orbit Agent |
|---|---|---|
| Calls | 400 | 5 |
| Duration | 20 min | 47 sec |
| Cost | \$42 | \$0.36 |
| Outcome | FAILURE | RECOVERED |
That's a 99.1% cost reduction, computed as \( \frac{42 - 0.36}{42} \approx 0.991 \).
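The same relative-reduction formula applies to the other rows of the table. A small sketch, using only the figures from the table (the `reduction` helper is a name introduced here for illustration):

```python
def reduction(before: float, after: float) -> float:
    """Relative reduction (before - after) / before, as a fraction."""
    return (before - after) / before


cost = reduction(42.0, 0.36)       # dollars
calls = reduction(400, 5)          # number of calls
duration = reduction(20 * 60, 47)  # seconds (20 min vs 47 sec)

for name, value in (("cost", cost), ("calls", calls), ("duration", duration)):
    print(f"{name}: {value:.2%}")
```

By this measure the protected run also makes about 98.75% fewer calls and finishes in roughly 96% less time.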
How we built it
- Google Cloud: Gemini + Vertex AI as the reasoning brain, Agent Builder, 7 Cloud Run services, Pub/Sub event fabric, Secret Manager, Memorystore Redis, Artifact Registry, and Cloud Logging.
- Backbone: Python 3.11 + FastAPI OTLP collector on port 4318, six detector agents plus feature fusion, a hybrid Gemini risk classifier, a circuit-breaker state machine, and a remediation orchestrator.
- Partners: Arize Phoenix (primary — LLM traces, sessions, evals), MongoDB Atlas (incident history + policies), Elastic (logs), Dynatrace (APM), GitLab (tool-fallback example).
- Frontend: React + TypeScript + Vite + Tailwind, with a clean light-theme two-pane demo plus real-time /health and /ops dashboards.
- Scale: 82 reliability capabilities organized into 10 autonomous agent domains across 6 services, communicating asynchronously over Pub/Sub.
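To make the detector and feature-fusion idea in the Backbone bullet concrete, here is a rough sketch of two detectors and a fusion step. It is an illustration under stated assumptions, not the project's implementation: the `Span` shape, the 10-span window, and the weights are hypothetical, and a fixed weighted sum stands in for the Gemini-based risk classifier.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified view of one agent span."""
    tool: str
    arguments: str
    is_error: bool = False


def loop_score(spans: list[Span], window: int = 10) -> float:
    """Share of recent spans that repeat the most common (tool, arguments) call."""
    recent = spans[-window:]
    if len(recent) < 2:
        return 0.0
    counts = Counter((s.tool, s.arguments) for s in recent)
    most_common = counts.most_common(1)[0][1]
    # 0.0 when every call is distinct, 1.0 when every call is identical.
    return (most_common - 1) / (len(recent) - 1)


def error_score(spans: list[Span], window: int = 10) -> float:
    """Fraction of recent spans that ended in an error."""
    recent = spans[-window:]
    if not recent:
        return 0.0
    return sum(s.is_error for s in recent) / len(recent)


# Hypothetical weights; a plain weighted sum replaces the Gemini classifier here.
WEIGHTS = {"loop": 0.7, "error": 0.3}


def fuse(features: dict[str, float]) -> float:
    """Combine detector scores into one risk value clamped to [0, 1]."""
    total = sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())
    return max(0.0, min(1.0, total))


# An agent that keeps issuing the same search call.
stuck = [Span("search", "q=orbit")] * 5
risk = fuse({"loop": loop_score(stuck), "error": error_score(stuck)})
```

A risk value like this is what a circuit breaker would compare against its threshold. In a real deployment the spans would come from the OTLP collector rather than a hand-built list.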
Challenges we ran into
Mid-hackathon we hit billing walls on our original project, along with VPC connector failures, Memorystore provisioning that took 15+ minutes, and Cloud Build burning credits while our Dockerfiles silently left out whole folders. The dashboard image didn't even include collector/, so /health/agents kept returning 500s.
We migrated to a fresh project (orchestraos-498316), re-bootstrapped everything from Cloud Shell, rebuilt our images three times, and redeployed until all 7 Cloud Run services were green. An org policy blocked downloadable service-account keys, so we switched to Application Default Credentials. There were hours where the live URL just timed out. We almost didn't finish — and we shipped anyway.
Accomplishments that we're proud of
- A live, end-to-end demo where the circuit breaker trips on the 3rd loop and the agent self-heals — not a mockup.
- 99.1% cost reduction, proven side-by-side, with real Gemini reasoning on Vertex AI.
- All 7 Cloud Run services green after three rebuilds and a full project migration under deadline pressure.
- A genuinely new category: an immune system for AI agents — agents protecting agents.
- A 3-person team that split the system cleanly and merged through disciplined Git branching, finishing under the wire.
What we learned
Reliability is a category, not a feature. Treating agent failures like an SRE problem — telemetry, circuit breakers, self-healing — turns flaky demos into production systems. We also learned that shipping under real infrastructure pressure (billing limits, broken images, timeouts) teaches more than any tutorial: resilience is something you build into both the product and the team.
What's next for Orbit Agent
- Self-evolving policies learned from clustered incident history in MongoDB.
- More live failure scenarios beyond the AutoGPT loop (state corruption, retry storms, destructive actions).
- A marketplace of pluggable remediation strategies.
- The long-term vision: Orbit Agent becomes Kubernetes + Datadog + SRE for AI agents — the reliability layer every agent runs behind.
The Team
- Niket Patil: Detection backbone & core engine (collector, detectors, circuit breaker, Arize).
- Rutuja Kulkarni: AI & monitoring lead (Gemini Monitor Model, remediation, Vertex AI).
- Ayush Patel: Frontend & infrastructure (React dashboard, Cloud Run, deployment).

Built together, shipped together.
Built With
- agent-builder
- arize-phoenix
- artifact-registry
- cloud-logging
- cloud-run
- docker
- dynatrace
- elasticsearch
- fastapi
- gemini
- gitlab
- memorystore
- mongodb-atlas
- opentelemetry
- pub-sub
- python
- react
- redis
- secret-manager
- tailwindcss
- typescript
- vertex-ai
- vite

