Inspiration

Every AI agent we've built has the same vulnerability: it assumes infrastructure works perfectly. The moment an LLM provider returns a 429 or an MCP server times out, the whole experience breaks: users get cryptic errors, agents stall, and trust is lost. We started thinking about what a truly production-grade agent looks like, one that behaves the way experienced engineers design distributed systems, with circuit breakers, fallback chains, and graceful degradation baked in from day one. That question became AgentArmor.

What it does

AgentArmor is a resilient AI agent gateway that sits between your application and your AI infrastructure. It keeps your agent alive and your users informed even when things break underneath.

Core capabilities:

LLM Circuit Breaker: Wraps each LLM provider in a state machine (CLOSED → OPEN → HALF_OPEN → CLOSED). After a configurable number of failures, the breaker opens and requests are rerouted automatically, with no manual intervention and no downtime.

Automatic Fallback Routing: Gemini 2.5 Flash is the primary provider. If it fails or its circuit opens, AgentArmor immediately routes the request to Llama 3.1 (via Groq) as the fallback. The user never sees an error; they see a response, every time.

MCP Server Health Management: AgentArmor monitors registered MCP tool servers in real time. When a server goes down, the agent is told which tools are degraded and responds gracefully.

Live Observability Dashboard: A real-time, WebSocket-fed dashboard shows every failure event, breaker state transition, routing decision, and recovery as it happens. Every message includes a full routing trace: which providers were tried, which one succeeded, and how long it took.

Chaos Injection Panel: Any provider or MCP server can be killed mid-conversation with one click, for demos, testing, or validation. This lets you see exactly how AgentArmor behaves under real failure conditions before you ship.

How we built it

AgentArmor is built as two decoupled services:

Backend - Node.js + Express

The core gateway runs on Express with a WebSocket server for real-time event streaming. The architecture is built around four modules:

CircuitBreaker: A per-provider state machine implementing the classic circuit breaker pattern. Configurable failure threshold, recovery timeout, and success threshold for re-closing.
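A minimal sketch of the state machine described above, not the project's actual code. The class shape, method names, and the injectable `now` clock are assumptions made for illustration; the three options correspond to the failure threshold, recovery timeout, and success threshold named in the text.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface BreakerOptions {
  failureThreshold: number;  // failures in CLOSED before the breaker opens
  recoveryTimeoutMs: number; // time spent OPEN before a probe is allowed
  successThreshold: number;  // successes in HALF_OPEN before re-closing
  now?: () => number;        // hypothetical injectable clock, useful in tests
}

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
  }

  // Returns the current state; an OPEN breaker moves to HALF_OPEN
  // once the recovery timeout has elapsed.
  getState(): BreakerState {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.opts.recoveryTimeoutMs) {
      this.state = "HALF_OPEN";
      this.successes = 0;
    }
    return this.state;
  }

  // Callers check this before attempting a provider call.
  canRequest(): boolean {
    return this.getState() !== "OPEN";
  }

  recordSuccess(): void {
    if (this.getState() === "HALF_OPEN") {
      this.successes += 1;
      if (this.successes >= this.opts.successThreshold) {
        this.state = "CLOSED";
        this.failures = 0;
      }
    } else {
      this.failures = 0; // a success in CLOSED resets the failure count
    }
  }

  recordFailure(): void {
    const state = this.getState();
    this.failures += 1;
    // Any failure while probing, or too many failures in CLOSED, opens the breaker.
    if (state === "HALF_OPEN" || this.failures >= this.opts.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}
```

With the values reported under "Challenges we ran into", a breaker would be constructed as `new CircuitBreaker({ failureThreshold: 2, recoveryTimeoutMs: 15000, successThreshold: 2 })`. In this sketch a single failure during HALF_OPEN re-opens the breaker, which is the conventional choice but is not confirmed by the text.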

LLM Router: Orchestrates the fallback chain. Checks each provider's circuit breaker before attempting a call. On failure, records it, updates the breaker state, and tries the next provider. Returns structured routing metadata with every response.
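The fallback loop could look roughly like the following. This is an illustrative sketch: the `Provider`, `Breaker`, `Attempt`, and `RoutedResponse` shapes and the `routeWithFallback` name are hypothetical, and throwing when every provider is unavailable is an assumption, since the text does not say how a full outage is handled.

```typescript
// Minimal breaker surface the router depends on (hypothetical interface).
interface Breaker {
  canRequest(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

interface Provider {
  name: string;
  breaker: Breaker;
  call(prompt: string): Promise<string>; // stands in for a real SDK call
}

interface Attempt {
  provider: string;
  outcome: "success" | "failed" | "skipped_open_circuit";
  latencyMs: number;
  error?: string;
}

interface RoutedResponse {
  text: string;
  servedBy: string;
  usedFallback: boolean;
  trace: Attempt[]; // routing metadata returned with every response
}

async function routeWithFallback(providers: Provider[], prompt: string): Promise<RoutedResponse> {
  const trace: Attempt[] = [];

  for (const [index, provider] of providers.entries()) {
    // Skip providers whose circuit is open instead of waiting on a doomed call.
    if (!provider.breaker.canRequest()) {
      trace.push({ provider: provider.name, outcome: "skipped_open_circuit", latencyMs: 0 });
      continue;
    }

    const started = Date.now();
    try {
      const text = await provider.call(prompt);
      provider.breaker.recordSuccess();
      trace.push({ provider: provider.name, outcome: "success", latencyMs: Date.now() - started });
      return { text, servedBy: provider.name, usedFallback: index > 0, trace };
    } catch (err) {
      // Record the failure, update the breaker, and fall through to the next provider.
      provider.breaker.recordFailure();
      trace.push({
        provider: provider.name,
        outcome: "failed",
        latencyMs: Date.now() - started,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }

  // Assumption: with no provider left, surface a single error to the caller.
  throw new Error("All providers failed or are unavailable");
}
```

Passing the providers in priority order (primary first, fallback second) keeps the routing policy in data rather than in control flow, so adding a third provider would not change the loop.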

MCP Server Manager: Maintains a registry of MCP tool servers with health status. Supports chaos injection (kill/restore) and enriches the agent's context with live tool availability before each LLM call.
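As a rough illustration of the registry and context enrichment, the sketch below keeps server status in memory and renders it into the system prompt rather than the user message. All names (`McpServerManager`, `buildSystemPrompt`, the prompt wording) are hypothetical, and a real manager would presumably derive status from health checks rather than only from manual kill/restore calls.

```typescript
type ServerStatus = "healthy" | "down";

interface McpServer {
  name: string;
  description: string;
  status: ServerStatus;
}

class McpServerManager {
  private servers = new Map<string, McpServer>();

  register(name: string, description: string): void {
    this.servers.set(name, { name, description, status: "healthy" });
  }

  // Chaos injection: mark a server as down or bring it back.
  kill(name: string): void {
    this.setStatus(name, "down");
  }

  restore(name: string): void {
    this.setStatus(name, "healthy");
  }

  private setStatus(name: string, status: ServerStatus): void {
    const server = this.servers.get(name);
    if (!server) throw new Error(`Unknown MCP server: ${name}`);
    server.status = status;
  }

  // Appends live tool availability to the system prompt, so the user's
  // message is passed to the LLM unchanged.
  buildSystemPrompt(basePrompt: string): string {
    const lines = Array.from(this.servers.values()).map((s) => {
      const label = s.status === "healthy" ? "AVAILABLE" : "UNAVAILABLE";
      return `- ${s.name} (${s.description}): ${label}`;
    });
    return [
      basePrompt,
      "",
      "Tool availability:",
      ...lines,
      "If a tool is unavailable, say so briefly and help without it.",
    ].join("\n");
  }
}
```

Calling `buildSystemPrompt` immediately before each LLM call means a server killed mid-conversation is reflected in the very next response.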

Event Bus: A central pub/sub system that captures every system event (failures, breaker transitions, recoveries, chaos injections) and broadcasts them to connected dashboard clients via WebSocket.
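A pub/sub core of this kind can be sketched in a few lines. The version below is an assumption-laden illustration: the event shape, the `EventBus` API, and the bounded history buffer are all hypothetical, and the WebSocket transport is left out. In the gateway, each connected dashboard client would be one subscriber whose listener serializes the event and sends it over its socket.

```typescript
type EventType = "failure" | "breaker_transition" | "recovery" | "chaos_injection";

interface SystemEvent {
  type: EventType;
  source: string;    // e.g. a provider or MCP server name
  detail: string;
  timestamp: number; // epoch milliseconds
}

type Listener = (event: SystemEvent) => void;

class EventBus {
  private listeners = new Set<Listener>();
  private history: SystemEvent[] = [];
  private readonly maxHistory: number;

  constructor(maxHistory = 100) {
    this.maxHistory = maxHistory;
  }

  // Registers a listener and returns a function that unsubscribes it.
  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  publish(type: EventType, source: string, detail: string): SystemEvent {
    const event: SystemEvent = { type, source, detail, timestamp: Date.now() };

    // Keep a bounded history (an assumption) so a newly connected
    // dashboard could be sent recent events.
    this.history.push(event);
    if (this.history.length > this.maxHistory) this.history.shift();

    // One misbehaving listener must not block delivery to the others.
    this.listeners.forEach((listener) => {
      try {
        listener(event);
      } catch (err) {
        // Ignored in this sketch; a real bus would likely log it.
      }
    });
    return event;
  }

  recent(): SystemEvent[] {
    return this.history.slice();
  }
}
```

Routing every component's events through one bus keeps the circuit breaker, router, and MCP manager unaware of the dashboard: they publish, and the transport layer subscribes.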

Frontend - Next.js

The dashboard renders the full agent experience: a chat interface with per-message provider badges and fallback indicators, a live chaos control panel, a real-time event stream, and a per-request routing trace viewer.

LLM Stack:
Primary: Gemini 2.5 Flash (Google AI)
Fallback: Llama 3.1 8B Instant (via Groq)

Challenges we ran into

The hardest design decision was determining when a circuit breaker should open. Too sensitive, and transient errors cause unnecessary fallbacks that erode trust in the primary provider. Too lenient, and real outages bleed through before the system reacts. We settled on a two-failure threshold with a 15-second recovery window and a two-success requirement to re-close.

We also ran into real-world rate limiting. Gemini 2.0 Flash kept returning 429 errors during development, which validated our circuit breaker in a live scenario before we had even finished building it. Switching to Gemini 2.5 Flash resolved the quota issues and improved response quality significantly.

The third challenge was MCP context enrichment. Naively injecting tool status into the user message created a bad UX, because the raw system context leaked into responses. The fix was moving tool availability into the system prompt layer, which keeps user messages clean while still giving the LLM accurate tool context before each call.

The fourth was hydration mismatches in Next.js. Timestamps generated on the server differed from those generated at client render time, causing React hydration errors on every page load. The fix was deferring timestamp generation to a client-side useEffect, so that server and client render identical initial HTML.

Accomplishments that we're proud of

The routing trace feature. Every single message returns a full audit log of which providers were attempted, which succeeded or failed, and why — all rendered live in the dashboard. This is what makes AgentArmor a debugging and observability tool, not just a resilience layer.

Watching Gemini fail mid-conversation, the circuit breaker trip, Llama take over, and the user receive a coherent response — all within the same session, without a page refresh or error message — is genuinely satisfying to demo.

What we learned

We learned that MCP server health is a first-class concern that most agent frameworks treat as an afterthought. Knowing which tools are available before constructing a prompt is as important as knowing which LLM will answer it. We also learned that users should see as few error messages as possible.

What's next for AgentArmor — Resilient AI Agent Gateway

AgentArmor is designed to become infrastructure, the resilience layer that every production AI agent needs but almost none have today.

Near-term roadmap:
Real MCP server integrations (web search, code execution, calendar) via TrueFoundry's MCP Gateway
Request queuing during full outages, with automatic retry on recovery
Provider cost routing: not just resilience, but smart routing based on latency, cost, and capability
SDK packaging so developers can wrap any LLM call with AgentArmor in three lines of code

The business model is B2B SaaS: $49/month per application for teams running production AI agents who cannot afford downtime. Primary market: AI-native startups and enterprise DevOps teams adopting agentic workflows.
