Inspiration

Production AI agents break in five predictable ways: their LLM provider has a brown-out, their MCP tool server dies mid-call, they hit a rate limit, a tool rejects malformed arguments, or a streaming connection drops halfway through a response. Most demo agents handle none of these. We wanted to build one that survives all five — and prove it live on stage.

What it does

Resilient Agent is a TypeScript agent built on the TrueFoundry AI Gateway. It chats over Server-Sent Events, calls MCP tools, and transparently recovers from five categories of infrastructure failure:

  1. MCP server timeout / kill — supervised reconnect with p-retry, agent loop never sees the gap.
  2. LLM provider outage — cascading failover from Anthropic (via TrueFoundry Gateway) to a local Ollama fallback.
  3. Rate limits (HTTP 429) — Retry-After-aware backoff via bottleneck per-provider queues.
  4. Tool call errors — zod validation + LLM-reflection repair loop, max two repair passes before surfacing.
  5. SSE connection loss — Last-Event-ID resume buffer keeps the response stream consistent across reconnects.

The accompanying Chaos Dashboard lets you inject any of these failures with one click and watch the agent route around it live. No mocks — real provider calls, real failures, real recovery.

How we built it

  • TrueFoundry AI Gateway (anthropic/claude-sonnet-4-6) as the primary LLM rail, with a local Ollama instance as the ultimate failsafe.
  • TrueFoundry MCP Gateway SDK for tool plumbing, plus a local Chaos MCP server we can deterministically break for demo.
  • Resilience layer: cockatiel circuit breakers per provider × tool, p-retry for exponential backoff that respects Retry-After, bottleneck for rate-limit queues, zod for tool-arg repair, an SSE buffer keyed on Last-Event-ID.
  • Hono server with native SSE, React + Vite chaos dashboard with real-time provider lights and an event feed.
  • 29 vitest cases pinned to the failover contract.

Challenges we ran into

The trickiest mode was streaming-resume: once an SSE connection drops, you have to reconstruct the in-flight delta from a Last-Event-ID without re-emitting tokens. We built a small per-session ring buffer, and the dashboard's manual "kill SSE" button became its own regression test.

The second non-obvious one was honesty: our first plan had xAI as a third provider, but the account wasn't cleanly registered in our TrueFoundry tenant. Rather than ship a permanently-failing slot, we dropped to a two-provider chain and told the truer story — cloud primary, local failsafe — which production teams actually run.

Accomplishments that we're proud of

The whole thing routes through TrueFoundry's gateway. Swapping in a new provider is a config change, not a code change. And every failure mode is one button-click reproducible — the dashboard is also our test harness.

What we learned

Resilience is mostly not clever retry logic. It's strict per-call typing, per-provider isolation, and refusing to handle two failure classes in the same code path.

What's next for Resilient Agent

  • Re-add a third provider once we have an additional registered upstream — the chain is configuration-driven.
  • Per-tool circuit breakers with adaptive thresholds.
  • A small fleet of synthetic failure scenarios for CI regression.

Built with

TypeScript · Node 22 · Hono · Vercel AI SDK · TrueFoundry AI Gateway · TrueFoundry MCP Gateway SDK · Anthropic Claude · Ollama · cockatiel · p-retry · bottleneck · zod · React · Vite · vitest

Built With

Share this project:

Updates