Inspiration

Every production AI deployment shares a dirty secret: agents are only as reliable as their weakest LLM provider. We've all seen it — an OpenAI brownout at 2 AM brings down a customer support bot, an Anthropic rate limit crashes a data pipeline, or an MCP server timeout leaves users staring at a 500 error page. During a recent incident where three major LLM providers experienced simultaneous degradation, we watched an enterprise agent system fail catastrophically — not because the code was wrong, but because there was no resilience layer between the agent logic and the infrastructure. That moment crystallized a question: why do we build circuit breakers, retries, and fallback chains for every other microservice, but treat LLM calls as sacred?

The DevNetwork AI/ML Hackathon 2026 and TrueFoundry's Resilient Agents challenge gave us the perfect canvas to answer that question. TrueFoundry's AI Gateway already solved the infrastructure-level problem — virtual models with load balancing, retry, and fallback across 1,000+ LLMs. We wanted to build the application-level resilience on top: circuit breakers that learn, caches that rescue, and a user experience that never breaks.

What It Does

ResilientAgent is a fault-tolerant multi-agent orchestration system where no request ever results in an unhandled error — even when every single LLM provider is simultaneously down.

The system exposes a simple REST API (POST /run) where users submit natural-language business requests. An orchestrator agent (Claude Sonnet) decomposes the request into sub-tasks and delegates them to specialist agents — a Data Analyst (GPT-4o) for insights and reporting, and an Action Executor (Gemini Flash) for ticket creation, notifications, and workflow automation.

Every LLM call passes through five resilience layers:

  1. TrueFoundry Virtual Models: provider-level retry and cross-provider failover
  2. Circuit Breaker: stops hammering a failing provider (threshold: 3 failures, recovery: 30s)
  3. Multi-Level Fallback Chain: cascades through 4 models (orchestrator → analyst → executor → fallback)
  4. Semantic Cache: serves content-hash-matched responses instantly when all providers are down
  5. Degraded Mode UX: returns clear status messages instead of raw stack traces
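As a rough illustration of how these layers compose at runtime, the sketch below wraps a single LLM call. The helper names (call_with_resilience, breaker.allow_request, cache.rescue) and the response fields are assumptions, not the actual implementation; it assumes an async OpenAI-compatible client pointed at the gateway.

```python
# Illustrative composition of the five resilience layers around one LLM call.
# Helper names and response fields are assumptions; `client` is an AsyncOpenAI
# client pointed at the TrueFoundry gateway.
async def call_with_resilience(prompt, chain, breakers, cache, client):
    """Try each model in the fallback chain, then the cache, then degraded mode."""
    for model in chain:                                    # Layer 3: fallback chain
        breaker = breakers[model]
        if not breaker.allow_request():                    # Layer 2: circuit breaker
            continue
        try:
            resp = await client.chat.completions.create(   # Layer 1: gateway retries and
                model=model,                               # fails over across providers
                messages=[{"role": "user", "content": prompt}],
            )
            breaker.record_success()
            text = resp.choices[0].message.content
            cache.put(model, prompt, text)
            return {"status": "ok", "model": model, "content": text}
        except Exception:
            breaker.record_failure()

    rescued = cache.rescue(prompt, chain)                  # Layer 4: cross-model cache rescue
    if rescued is not None:
        return {"status": "degraded_cached", "content": rescued}

    return {                                               # Layer 5: degraded-mode UX
        "status": "degraded",
        "content": "All AI providers are temporarily unavailable. Please retry shortly.",
    }
```

The gateway's own retry and failover (layer 1) stays invisible at this level: the application only observes whether the call to the requested model ultimately succeeded.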

The system also includes a Chaos Simulator (POST /chaos) that injects failures into any model or MCP server, and a Metrics endpoint (GET /metrics) that exposes real-time circuit breaker states, fallback counts, cache hit rates, and per-request latency logs — full observability into how resilience is performing.
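A minimal sketch of what those two endpoints might look like in FastAPI; the payload fields and the in-memory stores are illustrative assumptions, not the real schema.

```python
# Hypothetical shape of the chaos and metrics endpoints (field names assumed).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
chaos_flags: dict[str, bool] = {}        # model or MCP server name -> inject failures?
metrics = {"breakers": {}, "fallbacks": 0, "cache_hits": 0, "requests": []}

class ChaosRequest(BaseModel):
    target: str                          # model or MCP server to break
    enabled: bool = True

@app.post("/chaos")
async def set_chaos(req: ChaosRequest):
    """Toggle failure injection for a specific model or MCP server."""
    chaos_flags[req.target] = req.enabled
    return {"target": req.target, "chaos_enabled": req.enabled}

@app.get("/metrics")
async def get_metrics():
    """Expose circuit breaker states, fallback counts, cache hit rate, latencies."""
    return metrics
```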

The cost function the system optimizes is not just uptime, but user-perceived reliability:

R = 1 − N(errors seen by user) / N(total requests)

Our target: R ≥ 0.999

How We Built It

Architecture: We chose a planner-orchestrator pattern over a single monolithic agent. The orchestrator (Claude Sonnet via Bedrock) receives the user's request, generates a JSON execution plan with parallel sub-tasks, and fans out to specialist agents. Each specialist uses a different LLM through the same TrueFoundry gateway endpoint — one base_url, one api_key, four models.
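A simplified sketch of the fan-out step, assuming the orchestrator returns a JSON plan with a tasks array; the task schema and the Gemini model id are assumptions, and call_model stands in for any resilient LLM-call coroutine.

```python
# Simplified fan-out: parse the orchestrator's JSON plan and run sub-tasks
# concurrently against the specialist models. Task schema and the Gemini
# model id are assumptions.
import asyncio
import json

SPECIALISTS = {
    "analysis": "openai-main/gpt-4o",   # Data Analyst
    "action": "gemini-flash",           # Action Executor (model id illustrative)
}

async def execute_plan(plan_json: str, call_model):
    # e.g. plan_json = '{"tasks": [{"type": "analysis", "prompt": "..."}]}'
    plan = json.loads(plan_json)
    coros = [
        call_model(SPECIALISTS[task["type"]], task["prompt"])
        for task in plan["tasks"]
    ]
    return await asyncio.gather(*coros)  # fan out sub-tasks in parallel
```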

TrueFoundry Integration: All LLM calls use the OpenAI-compatible SDK pointed at https://gateway.truefoundry.ai/v1. Model names follow the provider/model-id format (e.g., bedrock/global.anthropic.claude-sonnet-4-20250514, openai-main/gpt-4o). This means provider failover happens transparently at the gateway level, while our application code adds circuit-breaking and caching on top.
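In practice this looks roughly like the snippet below; the base_url and model-name format come from the description above, while the environment variable name and prompt are illustrative.

```python
# One client, one base_url, one API key; models are selected per call in
# provider/model-id format. The environment variable name is an assumption.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.truefoundry.ai/v1",
    api_key=os.environ["TFY_API_KEY"],   # gateway key, not a provider key
)

resp = client.chat.completions.create(
    model="bedrock/global.anthropic.claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Summarize last week's incidents."}],
)
print(resp.choices[0].message.content)
```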

Circuit Breaker Implementation: We implemented a classic three-state circuit breaker (CLOSED → OPEN → HALF_OPEN) per model.
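A minimal sketch of that per-model breaker, using the threshold (3 consecutive failures) and recovery window (30 s) described above; the class and method names are ours, not the project's exact API.

```python
# Minimal per-model circuit breaker sketch (CLOSED -> OPEN -> HALF_OPEN).
# Threshold and recovery values mirror the text; names are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # let one probe request through
                return True
            return False
        return True                        # CLOSED or HALF_OPEN

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

# One breaker per model, keyed on the *requested* model (see challenge 1 below);
# the model names here are shorthand, not the full gateway ids.
breakers = {m: CircuitBreaker() for m in ("claude-sonnet", "gpt-4o", "gemini-flash")}
```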

Semantic Cache: A content-hash cache (SHA-256 of model:prompt) with configurable TTL and LRU eviction. During the "total failure" scenario, the cache checks all model keys — not just the primary — to find any usable cached response. This provides zero-latency rescue even when the request was originally handled by a different model.
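A condensed sketch of that cache, assuming an in-memory OrderedDict store; the default size, TTL, and method names are our own choices rather than the project's exact values.

```python
# Content-hash cache sketch (SHA-256 of "model:prompt") with TTL, LRU eviction,
# and cross-model rescue. Defaults and method names are assumptions.
import hashlib
import time
from collections import OrderedDict

class ResponseCache:
    def __init__(self, max_entries: int = 1000, ttl: float = 3600.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._store: OrderedDict[str, tuple[float, str]] = OrderedDict()

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def put(self, model: str, prompt: str, response: str) -> None:
        key = self._key(model, prompt)
        self._store[key] = (time.time(), response)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)        # evict least-recently used

    def get(self, model: str, prompt: str):
        key = self._key(model, prompt)
        entry = self._store.get(key)
        if entry and time.time() - entry[0] <= self.ttl:
            self._store.move_to_end(key)
            return entry[1]
        return None

    def rescue(self, prompt: str, models: list[str]):
        """Total-failure path: accept a hit cached under *any* model."""
        for model in models:
            hit = self.get(model, prompt)
            if hit is not None:
                return hit
        return None
```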

API & Docker: FastAPI with auto-generated Swagger docs, health checks, and CORS middleware. Containerized with a multi-stage Dockerfile and docker-compose for one-command deployment. The chaos simulator is a built-in monkey-patch system that injects exceptions into specific model calls.
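The monkey-patch idea can be sketched roughly as below: wrap the client's create call so that models flagged via POST /chaos raise an exception instead of reaching the gateway. The function and flag names are illustrative, not the actual implementation.

```python
# Illustrative chaos injection: models flagged via /chaos fail before the
# request leaves the process. Names are assumptions.
def install_chaos(client, chaos_flags: dict[str, bool]):
    real_create = client.chat.completions.create

    def chaotic_create(*args, **kwargs):
        model = kwargs.get("model", "")
        if chaos_flags.get(model):
            raise RuntimeError(f"[chaos] simulated outage for {model}")
        return real_create(*args, **kwargs)

    client.chat.completions.create = chaotic_create   # patch in place
    return client
```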

Demo Video: We built a 3-minute narrated demo using 9 slides that walk through each chaos scenario — from normal operation through cascade failure to recovery — showing real-time metrics and circuit breaker state transitions.

Challenges We Ran Into

  1. Layering Application Resilience on Top of Gateway Resilience. TrueFoundry's virtual models already handle retry and fallback at the infrastructure level. Our circuit breaker and fallback chain operate at the application level. The challenge was ensuring these layers don't fight each other — for example, the gateway might successfully failover to a backup provider, but our circuit breaker for the primary provider should still track the original failure. We solved this by having the circuit breaker monitor the requested model, not the responding model, while the gateway handles transparent provider routing.

  2. Circuit Breaker Threshold Tuning. Too aggressive (threshold = 1) and transient errors trip the circuit unnecessarily, forcing requests through slower fallback paths. Too lenient (threshold = 10) and users wait through multiple timeouts before the circuit opens. We settled on θ=3 consecutive failures with τ=30s recovery, which balanced sensitivity against stability in our testing.

  3. The "Total Failure" User Experience. When all four LLM providers are simultaneously down, there's no AI left to generate a helpful error message. We pre-wrote degraded-mode messages that explain the situation clearly, suggest retry timing, and — critically — offer to serve cached responses if available. The challenge was making this feel like graceful degradation rather than a system failure (a sketch of such a payload follows this list).

  4. Cross-Model Cache Rescue. A user's request might originally be handled by Claude Sonnet, but when they retry during an outage, Claude is down and only Gemini is up. The cache key includes the model name, so a direct cache lookup fails. We solved this by scanning all model keys during cache rescue, trading accuracy for availability.
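To make challenge 3 concrete, a pre-written degraded-mode payload might look roughly like this; the field names and wording are illustrative, not the exact messages the system returns.

```python
# Hypothetical degraded-mode response; fields and wording are illustrative.
DEGRADED_RESPONSE = {
    "status": "degraded",
    "message": (
        "All AI providers are temporarily unavailable. "
        "Your request was not lost; please retry in about 30 seconds."
    ),
    "retry_after_seconds": 30,
    "cached_answer_available": True,   # set from the cross-model cache rescue
}
```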

Accomplishments That We're Proud Of

  - Zero unhandled errors across all scenarios — from single-provider outages to total cascade failure, the system always returns a meaningful response.
  - Sub-millisecond cache rescue — the semantic cache delivers responses in <1 ms, compared to 200–1200 ms for live LLM calls.
  - Full observability — the /metrics endpoint exposes everything: circuit states, fallback counts, latency distributions, and a rolling request log for post-incident analysis.
  - Chaos simulator as a first-class feature — not just a test tool, but a live API endpoint that makes resilience demonstrable and tangible for judges and users.
  - Clean separation of concerns — TrueFoundry handles infrastructure resilience (provider failover, load balancing), our code handles application resilience (circuit breaking, caching, UX), with zero coupling between the two layers.

What We Learned

  - Resilience is a feature, not an afterthought. Most AI agent frameworks assume providers are always available. Building resilience from day one — rather than bolting it on after an incident — resulted in cleaner architecture and more predictable behavior.
  - Circuit breakers need per-model granularity. A single shared circuit breaker doesn't work when different models from the same provider have different failure patterns: Claude Sonnet might be healthy while Claude Haiku is rate-limited.
  - Users don't care why the system is slow — they care that it works. The degraded-mode UX was initially an afterthought, but user testing showed that a clear "running in degraded mode" indicator with an explanation built more trust than a fast but cryptic error.
  - TrueFoundry's gateway abstraction is powerful. Routing to four different LLMs through a single OpenAI-compatible endpoint, with unified auth and observability, dramatically simplified our codebase. We didn't need provider-specific SDKs or retry logic.

What's Next for ResilientAgent

  - MCP Server Resilience: extend the circuit breaker and fallback pattern to MCP tool calls. When an MCP server goes down, queue the action for retry rather than failing the entire agent workflow.
  - Semantic Similarity Cache: replace exact-hash matching with embedding-based similarity search (cos(θ) > 0.95), so semantically equivalent queries hit the cache even when phrased differently (see the sketch after this list).
  - Cost-Aware Routing: integrate token counting and per-model pricing into the fallback chain. Prefer Gemini Flash ($0.075/M tokens) for simple extraction tasks, escalate to Claude Sonnet ($3/M tokens) only for complex reasoning — and track total spend per request.
  - Real-Time Observability Dashboard: a live web dashboard showing circuit breaker states (green/yellow/red), request flow visualization, latency heatmaps, and cost tracking — making resilience visible to operators in real time.
  - Multi-Tenant Isolation: per-user circuit breakers and rate limits, so one user's chaos scenario doesn't affect another user's experience.
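For the planned similarity cache, the lookup could be sketched as below; the embedding source, storage layout, and function names are assumptions about future work, not code that exists today.

```python
# Sketch of the planned similarity lookup: compare the prompt's embedding
# against cached embeddings and accept matches above cos(theta) > 0.95.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_lookup(prompt_emb: list[float], cache: dict, threshold: float = 0.95):
    """cache maps cached-prompt-id -> (embedding, response)."""
    best_score, best_response = 0.0, None
    for emb, response in cache.values():
        score = cosine(prompt_emb, emb)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score > threshold else None
```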

Built With

docker · fastapi · openai-sdk · python · truefoundry
