What we built
A client-side resilience layer that wraps both legs of the challenge prompt — LLM calls and MCP tool calls — in three concentric layers of fault tolerance: retry with exponential backoff, circuit breaker per target, and a priority-ordered fallback chain. The chain ends in a raw provider target so the agent can still answer even if the TrueFoundry Gateway itself goes down.
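As a rough illustration of how the three layers nest, here is a minimal sketch and not the project's actual code. The names `Breaker`, `Target`, and `resilient_call`, along with the threshold, cooldown, and delay values, are assumptions made for this example.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence, Tuple


class AllTargetsFailed(Exception):
    """Raised when every target in the chain has been exhausted."""


@dataclass
class Breaker:
    """Opens after `threshold` consecutive failures; allows a probe after `cooldown_s`."""
    threshold: int = 3
    cooldown_s: float = 5.0
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        return now - self.opened_at >= self.cooldown_s  # half-open probe

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # open, or re-open after a failed probe


@dataclass
class Target:
    name: str
    call: Callable[[str], str]
    breaker: Breaker = field(default_factory=Breaker)


def resilient_call(
    targets: Sequence[Target],
    prompt: str,
    retries: int = 2,
    base_delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, str]:
    """Outer loop: fallback chain. Inner loop: retries, gated by the breaker."""
    for target in targets:  # priority order, last resort at the end
        for attempt in range(retries + 1):
            if not target.breaker.allow(clock()):
                break  # breaker open: skip straight to the next target
            try:
                result = target.call(prompt)
            except Exception:
                target.breaker.record(False, clock())
                if attempt < retries:
                    # exponential backoff with a little jitter
                    sleep(base_delay_s * 2 ** attempt + random.uniform(0, base_delay_s))
                continue
            target.breaker.record(True, clock())
            return target.name, result
    raise AllTargetsFailed(prompt)
```

Placing the raw provider target last in `targets` is what would let a call succeed when every gateway target has an open breaker.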
How it answers the challenge
The TrueFoundry track asks: "How does your agent behave when an MCP server starts erroring out? An LLM server goes down? OpenAI or Claude errors out or browns out?"
We answer with measurable behaviour for both failure modes:
- ResilientLLM wraps chat completions across TrueFoundry Gateway + raw provider targets.
- ResilientMCP wraps ClientSession.call_tool across multiple MCP servers using the same retry / breaker / chain primitives.
- Both share the same Scorecard so the operator sees one unified view of MTTR, p50/p95, fallback trigger rate, and per-target serve counts.
- chaos.py exposes typed fault hooks for both layers: BurstFault / RandomFault / BrownoutFault for the LLM side, MCPToolFault / MCPTimeoutFault for the MCP side.
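To give a feel for what typed fault hooks could look like, here is a hedged sketch. The class names match those listed above, but the fields, the `before_call` method, the `InjectedFault` exception, and the `with_fault` wrapper are assumptions for illustration and may differ from the real chaos.py.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable


class InjectedFault(Exception):
    """Raised by a fault hook in place of a real upstream error."""


@dataclass
class BurstFault:
    """Fails the first `count` calls, then lets traffic through."""
    count: int
    seen: int = 0

    def before_call(self) -> None:
        self.seen += 1
        if self.seen <= self.count:
            raise InjectedFault("burst 503")


@dataclass
class RandomFault:
    """Fails each call independently with probability `rate`."""
    rate: float
    rng: random.Random = field(default_factory=random.Random)

    def before_call(self) -> None:
        if self.rng.random() < self.rate:
            raise InjectedFault("random failure")


@dataclass
class BrownoutFault:
    """Adds latency instead of failing, to simulate a degraded upstream."""
    delay_s: float
    sleep: Callable[[float], None] = time.sleep

    def before_call(self) -> None:
        self.sleep(self.delay_s)


def with_fault(fault, fn):
    """Wrap `fn` so the fault hook runs before every call."""
    def wrapped(*args, **kwargs):
        fault.before_call()
        return fn(*args, **kwargs)
    return wrapped
```

Keeping each fault as a small typed object means a scenario can be described by which fault is attached to which target, without touching the resilience code under test.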
Live results
3 LLM scenarios (demo.py) and 3 MCP scenarios (demo_mcp.py) all hit 100% success rate under chaos:
| Scenario | Layer | Fault | p95 latency | MTTR |
|---|---|---|---|---|
| Clean baseline | LLM | none | 1457 ms | — |
| Burst 503 on primary | LLM | breaker opens | 1947 ms | 682 ms |
| 60% random fail on both gateway targets | LLM | gateway brownout | 1575 ms | 304 ms |
| Primary MCP errors out | MCP | breaker opens | 1601 ms | 283 ms |
| Primary MCP brownout | MCP | 2 s injected latency | 2298 ms | — |
Every LLM attempt — successes and failures — flows through TrueFoundry's AI Gateway and shows up in AI Monitoring → Request Traces.
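For readers wondering how figures like p95 latency and MTTR might be derived from a stream of attempts, here is a minimal sketch. It assumes MTTR means the time from the first failure of an outage to the next success, and it uses nearest-rank percentiles over successful calls. Both definitions, and the `Scorecard` fields and methods shown, are assumptions rather than the project's confirmed implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Scorecard:
    # Each event is (timestamp_s, ok, latency_ms), appended in time order.
    events: List[Tuple[float, bool, float]] = field(default_factory=list)

    def record(self, timestamp_s: float, ok: bool, latency_ms: float) -> None:
        self.events.append((timestamp_s, ok, latency_ms))

    def percentile(self, p: float) -> Optional[float]:
        """Nearest-rank percentile of latency over successful calls."""
        latencies = sorted(lat for _, ok, lat in self.events if ok)
        if not latencies:
            return None
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    def mttr_ms(self) -> Optional[float]:
        """Mean time from the first failure of an outage to the next success."""
        recoveries = []
        outage_start = None
        for timestamp_s, ok, _ in self.events:
            if not ok and outage_start is None:
                outage_start = timestamp_s  # outage begins
            elif ok and outage_start is not None:
                recoveries.append((timestamp_s - outage_start) * 1000)
                outage_start = None  # recovered
        if not recoveries:
            return None  # no outage observed, shown as a dash in a table
        return sum(recoveries) / len(recoveries)
```

Under these definitions, a run with no failures has no MTTR at all, which would match the dash in the clean-baseline row.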
How we built it
- Python 3.14.3 + OpenAI SDK with max_retries=0 so our resilience layer is the only control plane (no double-counting against the SDK's built-in retries).
- TrueFoundry AI Gateway configured with Groq as a provider, exposing groq/llama-3.1-8b-instant and groq/llama-3.3-70b-versatile.
- MCP transport: official mcp Python client over streamable-http to two local FastMCP servers.
- ~450 lines total across ResilientLLM, ResilientMCP, _Breaker, Scorecard, and the chaos hooks.
Challenges we ran into
The free TrueFoundry Developer Plan doesn't include weight/priority/latency routing or built-in fallbacks (those are paid features). So we built the resilience layer on the client side and use the gateway purely for the unified API and traces. That turned out to be the right shape for the challenge anyway: the gateway is the primary path, but the agent isn't bound to it.
What we're proud of
The agent survives when the gateway itself goes down (raw-groq-8b last target) and when an MCP server starts erroring out (mcp-fallback second target) — both legs of the challenge prompt covered by the same primitive instead of two separate try/except blocks.
What's next
A scheduled chaos engineering harness, OpenTelemetry traces, and a /scorecard HTTP endpoint that returns live MTTR. We also want to add a real per-call timeout to ResilientMCP so brownout faults can trip the breaker the same way error faults do.
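One possible shape for the per-call timeout is sketched below, using `asyncio.wait_for` from the standard library. The `call_with_timeout` helper and the `ToolTimeout` exception are hypothetical names. `call_tool` stands in for any awaitable tool call, such as a bound `ClientSession.call_tool`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class ToolTimeout(Exception):
    """A tool call that exceeded its deadline, reported as an ordinary failure."""


async def call_with_timeout(
    call_tool: Callable[[str, Dict[str, Any]], Awaitable[Any]],
    name: str,
    arguments: Dict[str, Any],
    timeout_s: float,
) -> Any:
    """Await the tool call, converting a slow response into a failure."""
    try:
        return await asyncio.wait_for(call_tool(name, arguments), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        # Raising here lets retry and breaker logic count a brownout
        # exactly like an error response.
        raise ToolTimeout(f"{name} exceeded {timeout_s}s") from exc
```

Because the timeout surfaces as an exception, a breaker that counts consecutive failures would open on sustained brownouts without any extra code path.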