What we built

A client-side resilience layer that wraps both legs of the challenge prompt, LLM calls and MCP tool calls, in three nested fault-tolerance mechanisms: retry with exponential backoff, a circuit breaker per target, and a priority-ordered fallback chain. The chain ends in a raw provider target so the agent can still answer even if the TrueFoundry Gateway itself goes down.
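To show how the three mechanisms can nest, here is a minimal sketch, not the project's actual code. The names Breaker, Target, and call_with_resilience, along with the thresholds and delays, are assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Breaker:
    """Hypothetical per-target circuit breaker (stand-in for the project's _Breaker)."""
    failure_threshold: int = 3   # assumed value
    cooldown_s: float = 30.0     # assumed value
    failures: int = 0
    opened_at: Optional[float] = None

    def allows_call(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


@dataclass
class Target:
    name: str
    call: Callable[[str], str]
    breaker: Breaker = field(default_factory=Breaker)


def call_with_resilience(targets, prompt, max_attempts=3,
                         base_delay_s=0.2, sleep=time.sleep):
    """Walk targets in priority order; return (target_name, result)."""
    last_error = None
    for target in targets:
        if not target.breaker.allows_call():
            continue  # breaker open: skip straight to the next target
        for attempt in range(max_attempts):
            try:
                result = target.call(prompt)
            except Exception as exc:
                last_error = exc
                target.breaker.record_failure()
                if not target.breaker.allows_call():
                    break  # breaker just opened: fall through the chain
                if attempt < max_attempts - 1:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    sleep(base_delay_s * 2 ** attempt)
            else:
                target.breaker.record_success()
                return target.name, result
    raise RuntimeError("all targets failed or have open breakers") from last_error
```

In this sketch the last entry in targets plays the role of the raw provider target: it is only reached once every earlier target has exhausted its retries or has an open breaker.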

How it answers the challenge

The TrueFoundry track asks: "How does your agent behave when an MCP server starts erroring out? An LLM server goes down? OpenAI or Claude errors out or browns out?"

We answer with measurable behaviour for both failure modes:

  • ResilientLLM wraps chat completions across TrueFoundry Gateway + raw provider targets.
  • ResilientMCP wraps ClientSession.call_tool across multiple MCP servers using the same retry / breaker / chain primitives.
  • Both share the same Scorecard so the operator sees one unified view of MTTR, p50/p95, fallback trigger rate, and per-target serve counts.
  • chaos.py exposes typed fault hooks for both layers: BurstFault / RandomFault / BrownoutFault for the LLM side, MCPToolFault / MCPTimeoutFault for the MCP side.
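As an illustration of what a typed fault hook can look like, the sketch below injects failures in front of any callable. It is not the contents of chaos.py: the class names BurstFaultSketch and RandomFaultSketch, the before_call method, and the with_fault helper are all hypothetical.

```python
import random
from dataclasses import dataclass, field


class InjectedFault(Exception):
    """Raised by a fault hook in place of a real upstream error."""


@dataclass
class BurstFaultSketch:
    """Fail the first `count` calls, then let everything through."""
    count: int
    seen: int = 0

    def before_call(self) -> None:
        self.seen += 1
        if self.seen <= self.count:
            raise InjectedFault(f"burst failure {self.seen}/{self.count}")


@dataclass
class RandomFaultSketch:
    """Fail each call independently with probability `rate`."""
    rate: float
    rng: random.Random = field(default_factory=random.Random)

    def before_call(self) -> None:
        if self.rng.random() < self.rate:
            raise InjectedFault("random failure")


def with_fault(fault, fn):
    """Wrap `fn` so the fault hook runs before every call."""
    def wrapped(*args, **kwargs):
        fault.before_call()
        return fn(*args, **kwargs)
    return wrapped
```

Because the hook sits in front of the wrapped call rather than inside it, the same fault object could be placed in front of an LLM call or an MCP tool call without either knowing about the other.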

Live results

All 3 LLM scenarios (demo.py) and 3 MCP scenarios (demo_mcp.py) hit a 100% success rate under chaos:

Scenario                                 Layer  Fault                 p95 latency  MTTR
Clean baseline                           LLM    none                  1457 ms
Burst 503 on primary                     LLM    breaker opens         1947 ms      682 ms
60% random fail on both gateway targets  LLM    gateway brownout      1575 ms      304 ms
Primary MCP errors out                   MCP    breaker opens         1601 ms      283 ms
Primary MCP brownout                     MCP    2 s injected latency  2298 ms
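For readers who want to reproduce figures like the p95 latency and MTTR above, here is one way such metrics could be computed. This is a sketch under assumptions, not the project's Scorecard: it uses the nearest-rank percentile method and defines recovery time as the gap between the first failure of an outage and the next success, and the write-up does not say which definitions the real Scorecard uses.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


@dataclass
class ScorecardSketch:
    """Hypothetical metrics collector; timestamps are in milliseconds."""
    latencies_ms: list = field(default_factory=list)
    recoveries_ms: list = field(default_factory=list)
    _outage_started_ms: Optional[float] = None

    def record(self, at_ms, latency_ms, ok):
        if ok:
            self.latencies_ms.append(latency_ms)
            if self._outage_started_ms is not None:
                # First success after an outage closes the recovery window.
                self.recoveries_ms.append(at_ms - self._outage_started_ms)
                self._outage_started_ms = None
        elif self._outage_started_ms is None:
            # Only the first failure of an outage starts the clock.
            self._outage_started_ms = at_ms

    def mttr_ms(self):
        """Mean time to recovery, or None if no outage has recovered yet."""
        if not self.recoveries_ms:
            return None
        return sum(self.recoveries_ms) / len(self.recoveries_ms)
```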

Every LLM attempt — successes and failures — flows through TrueFoundry's AI Gateway and shows up in AI Monitoring → Request Traces.

How we built it

  • Python 3.14.3 + OpenAI SDK with max_retries=0 so our resilience layer is the only control plane (no double-counting against the SDK's built-in retries).
  • TrueFoundry AI Gateway configured with Groq as a provider, exposing groq/llama-3.1-8b-instant and groq/llama-3.3-70b-versatile.
  • MCP transport: official mcp Python client over streamable-http to two local FastMCP servers.
  • ~450 lines total across ResilientLLM, ResilientMCP, _Breaker, Scorecard, and the chaos hooks.

Challenges we ran into

The free TrueFoundry Developer Plan doesn't include weight/priority/latency routing or built-in fallbacks (those are paid features). So we built the resilience layer on the client side and use the gateway purely for the unified API and traces. That turned out to be the right shape for the challenge anyway: the gateway is the primary path, but the agent isn't bound to it.

What we're proud of

The agent survives when the gateway itself goes down (raw-groq-8b last target) and when an MCP server starts erroring out (mcp-fallback second target) — both legs of the challenge prompt covered by the same primitive instead of two separate try/except blocks.

What's next

A scheduled chaos engineering harness, OpenTelemetry traces, and a /scorecard HTTP endpoint that returns live MTTR. We also want to add a real per-call timeout to ResilientMCP so brownout faults can trip the breaker the same way error faults do.
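One possible shape for the per-call timeout is sketched below using asyncio.wait_for from the standard library. The function call_tool_with_timeout and the ToolCallTimeout exception are hypothetical names, and call_tool stands for any awaitable callable that takes a tool name and an arguments dict.

```python
import asyncio


class ToolCallTimeout(Exception):
    """Hypothetical error type a breaker could count like any other failure."""


async def call_tool_with_timeout(call_tool, name, arguments, timeout_s):
    """Await call_tool(name, arguments), failing if it exceeds timeout_s."""
    try:
        return await asyncio.wait_for(call_tool(name, arguments),
                                      timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        # Surface the slow call as an error so retry and breaker logic
        # can treat a brownout the same way as a hard failure.
        raise ToolCallTimeout(f"{name} exceeded {timeout_s}s") from exc
```

Converting the timeout into an ordinary exception is the design point here: a slow server then looks like a failing server to the rest of the chain.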

Built With

  • chaos-engineering
  • groq
  • llama
  • mcp
  • openai
  • python
  • truefoundry