Orbit Agent

Inspiration

AI agents fail in production every single day. AutoGPT spins in infinite loops, Cursor hallucinates, Replit repeats destructive actions, and Air Canada's bot gave ungrounded answers that cost the company in court. The pattern is always the same: the agent fails, a human eventually notices, debugs it, and recovers — after the money and time are already gone.

We asked a different question: what if agents could protect other agents, and break the loop before a human ever notices? An agent stuck looping is trapped in an orbit it can't escape. So we built the thing that breaks the orbit.

What it does

Orbit Agent is a Gemini-powered Multi-Agent Reliability OS that observes external AI agents (AutoGPT, CrewAI, LangGraph, MCP, custom) over OpenTelemetry — the agents never run inside Orbit Agent. Our pipeline is literally O.R.B.I.T:

Observe — a detector swarm scores every span for loops, stalled progress, token burn, latency, context drift, and errors.
Reason — a Gemini Monitor Model (gemini-2.5-flash on Vertex AI) fuses the features into a single risk score.
Break — a Redis-backed circuit breaker trips the moment risk crosses threshold.
Intervene — a remediation swarm (retry → prompt rewrite → rollback → tool fallback → human escalation) recovers the agent automatically.
Track — Arize Phoenix, Elastic, and Dynatrace capture traces, logs, and APM.
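
The Observe, Break, and Intervene steps above can be sketched in a few lines. This is a minimal in-memory illustration, not the project's code: the span shape (a dict with `tool` and `args` keys), the `LoopBreaker` class, the `intervene` helper, and the strategy names are all hypothetical, and a plain counter stands in for the Redis-backed breaker. The threshold of 3 mirrors the demo, where the breaker trips on the 3rd loop.

```python
import hashlib
import json
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical strategy names, in the order the pipeline lists them.
REMEDIATION_LADDER = ["retry", "prompt_rewrite", "rollback", "tool_fallback", "human_escalation"]


def fingerprint(span: dict) -> str:
    """Hash the tool name and arguments so identical calls share a fingerprint."""
    payload = json.dumps({"tool": span["tool"], "args": span["args"]}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


@dataclass
class LoopBreaker:
    trip_after: int = 3  # assumed threshold: trip on the 3rd identical call
    seen: Counter = field(default_factory=Counter)
    tripped: bool = False

    def observe(self, span: dict) -> bool:
        """Record one span and return True once the breaker has tripped."""
        fp = fingerprint(span)
        self.seen[fp] += 1
        if self.seen[fp] >= self.trip_after:
            self.tripped = True
        return self.tripped


def intervene(attempt) -> str:
    """Return the first strategy for which attempt(strategy) is truthy.

    If none succeeds, the last rung (human escalation) is returned.
    """
    for strategy in REMEDIATION_LADDER:
        if attempt(strategy):
            return strategy
    return REMEDIATION_LADDER[-1]


# Demo: an agent repeating the same tool call is stopped on the third repeat.
breaker = LoopBreaker()
calls_before_trip = 0
for span in [{"tool": "search", "args": {"q": "orbit"}}] * 5:
    calls_before_trip += 1
    if breaker.observe(span):
        break

# Pretend that only the prompt rewrite fixes this particular agent.
recovered_by = intervene(lambda strategy: strategy == "prompt_rewrite")
print(calls_before_trip, recovered_by)
```

Hashing with `sort_keys=True` means two calls with the same arguments in a different key order still count as the same loop iteration.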

The result, shown live and side-by-side:

Metric	Unprotected	Orbit Agent
Calls	400	5
Duration	20 min	47 sec
Cost	$42	$0.36
Outcome	FAILURE	RECOVERED

That's a 99.1% cost reduction, computed as $ \frac{42 - 0.36}{42} \approx 0.991 $.
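
The same arithmetic can be applied to every numeric row of the comparison. The sketch below assumes the 20-minute duration is converted to seconds. The `reduction` helper and the dictionary keys are illustrative names, and the call and duration reductions are derived here rather than quoted from the demo.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


# Figures taken from the side-by-side comparison.
unprotected = {"calls": 400, "duration_s": 20 * 60, "cost_usd": 42.00}
protected = {"calls": 5, "duration_s": 47, "cost_usd": 0.36}

for metric, before in unprotected.items():
    pct = 100 * reduction(before, protected[metric])
    print(f"{metric}: {pct:.1f}% reduction")
```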

How we built it

Google Cloud: Gemini + Vertex AI as the reasoning brain, Agent Builder, 7 Cloud Run services, Pub/Sub event fabric, Secret Manager, Memorystore Redis, Artifact Registry, and Cloud Logging.
Backbone: Python 3.11 + FastAPI OTLP collector on port 4318, six detector agents plus feature fusion, a hybrid Gemini risk classifier, a circuit-breaker state machine, and a remediation orchestrator.
Partners: Arize Phoenix (primary — LLM traces, sessions, evals), MongoDB Atlas (incident history + policies), Elastic (logs), Dynatrace (APM), GitLab (tool-fallback example).
Frontend: React + TypeScript + Vite + Tailwind — a clean light-theme two-pane demo plus real-time /health and /ops dashboards.
Scale: 82 reliability capabilities organized into 10 autonomous agent domains across 6 services, communicating asynchronously over Pub/Sub.
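
To make the feature fusion and circuit-breaker state machine concrete, here is a minimal sketch under stated assumptions. A fixed weighted average stands in for the hybrid Gemini risk classifier, and the weights, the 0.7 threshold, the 30-second cooldown, and the CLOSED / OPEN / HALF_OPEN state names are illustrative choices rather than values from the project. State lives in memory here instead of Redis.

```python
import time
from enum import Enum

# Assumed weights for the six detector scores (each in [0, 1]); they sum to 1.
WEIGHTS = {"loop": 0.30, "progress": 0.20, "token": 0.15,
           "latency": 0.10, "context": 0.15, "error": 0.10}


def fuse(scores: dict[str, float]) -> float:
    """Weighted average of detector scores; missing detectors count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


class State(Enum):
    CLOSED = "closed"        # agent calls flow normally
    OPEN = "open"            # calls are blocked while remediation runs
    HALF_OPEN = "half_open"  # cooldown elapsed; the next score acts as a probe


class CircuitBreaker:
    def __init__(self, threshold: float = 0.7, cooldown_s: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the demo does not have to sleep
        self.state = State.CLOSED
        self.opened_at = 0.0

    def record(self, risk: float) -> State:
        """Update the breaker from the latest risk score and return the new state."""
        now = self.clock()
        if self.state is State.OPEN and now - self.opened_at >= self.cooldown_s:
            self.state = State.HALF_OPEN
        if risk >= self.threshold and self.state is not State.OPEN:
            self.state = State.OPEN  # trip, or re-trip after a failed probe
            self.opened_at = now
        elif self.state is State.HALF_OPEN:
            self.state = State.CLOSED  # the probe score was below threshold
        return self.state


healthy = fuse({"loop": 0.0, "progress": 0.1})
looping = fuse({"loop": 1.0, "progress": 1.0, "token": 0.9,
                "latency": 0.5, "context": 0.6, "error": 0.0})

fake_now = [0.0]
breaker = CircuitBreaker(clock=lambda: fake_now[0])
states = [breaker.record(healthy), breaker.record(looping), breaker.record(healthy)]
fake_now[0] = 31.0  # simulate the cooldown elapsing
states.append(breaker.record(healthy))
print([s.value for s in states])  # ['closed', 'open', 'open', 'closed']
```

While the breaker is open, low scores do not close it early; it only closes after the cooldown and a healthy probe, which prevents flapping while remediation is still running.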

Challenges we ran into

Mid-hackathon we hit billing walls on our original project. On top of that came VPC connector failures, Memorystore provisioning that took 15+ minutes, and Cloud Build burning credits while our Dockerfiles silently left out whole folders. The dashboard image didn't even include collector/, so /health/agents kept returning 500s.

We migrated to a fresh project (orchestraos-498316), re-bootstrapped everything from Cloud Shell, rebuilt our images three times, and redeployed until all 7 Cloud Run services were green. An org policy blocked downloadable service-account keys, so we switched to Application Default Credentials. There were hours where the live URL just timed out. We almost didn't finish — and we shipped anyway.

Accomplishments that we're proud of

A live, end-to-end demo where the circuit breaker trips on the 3rd loop and the agent self-heals — not a mockup.
99.1% cost reduction, proven side-by-side, with real Gemini reasoning on Vertex AI.
All 7 Cloud Run services green after three rebuilds and a full project migration under deadline pressure.
A genuinely new category: an immune system for AI agents — agents protecting agents.
A 3-person team that split the system cleanly and merged through disciplined Git branching, finishing under the wire.

What we learned

Reliability is a category, not a feature. Treating agent failures like an SRE problem — telemetry, circuit breakers, self-healing — turns flaky demos into production systems. We also learned that shipping under real infrastructure pressure (billing limits, broken images, timeouts) teaches more than any tutorial: resilience is something you build into both the product and the team.

What's next for Orbit Agent

Self-evolving policies learned from clustered incident history in MongoDB.
More live failure scenarios beyond the AutoGPT loop (state corruption, retry storms, destructive actions).
A marketplace of pluggable remediation strategies.
The long-term vision: Orbit Agent becomes Kubernetes + Datadog + SRE for AI agents — the reliability layer every agent runs behind.

The Team

Niket Patil: Detection backbone & core engine (collector, detectors, circuit breaker, Arize).
Rutuja Kulkarni: AI & monitoring lead (Gemini Monitor Model, remediation, Vertex AI).
Ayush Patel: Frontend & infrastructure (React dashboard, Cloud Run, deployment).

Built together, shipped together.

Built With

agent-builder
arize-phoenix
artifact-registry
cloud-logging
cloud-run
docker
dynatrace
elasticsearch
fastapi
gemini
gitlab
memorystore
mongodb-atlas
opentelemetry
pub-sub
python
react
redis
secret-manager
tailwindcss
typescript
vertex-ai
vite

Submitted to

Google Cloud Rapid Agent Hackathon

Created by

I led the detection backbone and core engine. I built the FastAPI OTLP collector, the full detector swarm (loop fingerprinter, progress, token, latency, context, and error agents) with feature fusion, and the Redis-backed circuit breaker that trips on the loop. I wired the Arize Phoenix integration, set up the repo and Git branching for the team, and led the GCP project migration when we hit billing walls — taking the system from an empty workspace to a working, tested backbone.

Niket Patil
Tech entrepreneur & full-stack developer specializing in ML, multi-agent systems, and digital infrastructure. Always building.
Ayush Patel
Rutuja Kulkarni

Updates

Niket Patil posted an update — Jun 11, 2026 03:20 PM EDT

Orbit Agent is live — submitted to the Google Cloud Rapid Agent Hackathon!

The problem: AI agents fail every day. They get stuck in loops, burn money, and nobody notices until it's too late. So we built the immune system for AI agents.

The result, shown live and side-by-side:
• Unprotected agent: 400 calls · 20 min · $42 · FAILURE
• Orbit Agent: 5 calls · 47 sec · $0.36 · RECOVERED
• 99.1% cost reduction

How it works, the O.R.B.I.T pipeline:
Observe: a detector swarm scores every span for loops, stalls, token burn, and drift
Reason: Gemini on Vertex AI fuses features into a risk score
Break: a Redis-backed circuit breaker trips on the loop
Intervene: a remediation swarm recovers the agent automatically
Track: Arize Phoenix, Elastic, and Dynatrace

Built across 7 Cloud Run services by Niket Patil, Rutuja Kulkarni, and Ayush Patel. We hit billing walls, broken Docker images, and timeouts mid-build — migrated projects, rebuilt three times, and shipped anyway.

Try it live
Demo: https://orchestraos-dashboard-ew3uwemnxq-uc.a.run.app
Code: https://github.com/N-i-k-e-t/orchestraos

Every other platform tells you your agent failed — Orbit Agent makes sure it doesn't.


Niket Patil started this project — Jun 11, 2026 03:18 PM EDT
