Resilient Agent

What we built

A client-side resilience layer that wraps both legs of the challenge prompt — LLM calls and MCP tool calls — in three concentric layers of fault tolerance: retry with exponential backoff, circuit breaker per target, and a priority-ordered fallback chain. The chain ends in a raw provider target so the agent can still answer even if the TrueFoundry Gateway itself goes down.
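
The two inner layers, retry with exponential backoff and a per-target circuit breaker, can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Breaker` class, the `call_with_retry` helper, and all thresholds and delays are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class Breaker:
    """Hypothetical breaker: opens after `threshold` consecutive failures and
    lets calls through again once `cooldown_s` has elapsed."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # (re)open and restart the cooldown


def call_with_retry(fn: Callable[[], T], breaker: Breaker, attempts: int = 3,
                    base_delay_s: float = 0.2,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn` up to `attempts` times, doubling the delay between tries."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        if not breaker.allow():
            break  # breaker is open: stop hammering this target
        try:
            result = fn()
        except Exception as exc:
            breaker.record_failure()
            last_error = exc
            if attempt < attempts - 1 and breaker.allow():
                delay = base_delay_s * (2 ** attempt)
                sleep(delay + random.uniform(0, delay / 2))  # backoff plus jitter
        else:
            breaker.record_success()
            return result
    raise RuntimeError("target unavailable") from last_error
```

When `call_with_retry` raises, the outer layer (the fallback chain) would move on to the next target. Injecting `clock` and `sleep` keeps the sketch testable without real waiting.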

How it answers the challenge

The TrueFoundry track asks: "How does your agent behave when an MCP server starts erroring out? An LLM server goes down? OpenAI or Claude errors out or browns out?"

We answer with measurable behaviour for both failure modes:

ResilientLLM wraps chat completions across TrueFoundry Gateway + raw provider targets.
ResilientMCP wraps ClientSession.call_tool across multiple MCP servers using the same retry / breaker / chain primitives.
Both share the same Scorecard so the operator sees one unified view of MTTR, p50/p95, fallback trigger rate, and per-target serve counts.
chaos.py exposes typed fault hooks for both layers: BurstFault / RandomFault / BrownoutFault for the LLM side, MCPToolFault / MCPTimeoutFault for the MCP side.
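
To give a feel for what typed fault hooks on the LLM side might look like, here is a small sketch. The class names are taken from the list above, but the constructor parameters, the `before_call` hook, and the `InjectedFault` exception are assumptions made for illustration; the real chaos.py may be shaped differently.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable


class InjectedFault(Exception):
    """Hypothetical exception a hook raises to simulate an upstream error."""


@dataclass
class BurstFault:
    """Fail the first `count` calls, then let everything through."""
    count: int
    seen: int = 0

    def before_call(self) -> None:
        self.seen += 1
        if self.seen <= self.count:
            raise InjectedFault("simulated 503")


@dataclass
class RandomFault:
    """Fail each call independently with probability `rate`."""
    rate: float
    rng: random.Random = field(default_factory=random.Random)

    def before_call(self) -> None:
        if self.rng.random() < self.rate:
            raise InjectedFault("simulated random failure")


@dataclass
class BrownoutFault:
    """Add `delay_s` of latency to every call without failing it."""
    delay_s: float
    sleep: Callable[[float], None] = time.sleep

    def before_call(self) -> None:
        self.sleep(self.delay_s)
```

In this sketch a wrapper would invoke `fault.before_call()` just before the real request, so the resilience layer sees injected errors and delays exactly as it would see real ones.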

Live results

Three LLM scenarios (demo.py) and three MCP scenarios (demo_mcp.py) all reached a 100% success rate under chaos. The table reports p95 latency and MTTR for five of them:

Scenario	Layer	Fault	p95 latency	MTTR
Clean baseline	LLM	none	1457 ms	—
Burst 503 on primary	LLM	breaker opens	1947 ms	682 ms
60% random fail on both gateway targets	LLM	gateway brownout	1575 ms	304 ms
Primary MCP errors out	MCP	breaker opens	1601 ms	283 ms
Primary MCP brownout	MCP	2 s injected latency	2298 ms	—
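
For readers who want to reproduce the two metric columns, one plausible way to compute them is sketched below. The definitions are assumptions, not necessarily what Scorecard does: p95 uses the nearest-rank method, and MTTR is taken as the mean time from the first failure of an outage to the next successful attempt.

```python
import math
from typing import Optional, Sequence, Tuple


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def mttr_ms(events: Sequence[Tuple[float, bool]]) -> Optional[float]:
    """Mean time to recovery over (timestamp_ms, succeeded) pairs in time order.

    Returns None when no outage was followed by a success.
    """
    recoveries = []
    outage_start: Optional[float] = None
    for ts, ok in events:
        if not ok and outage_start is None:
            outage_start = ts  # first failure of a new outage
        elif ok and outage_start is not None:
            recoveries.append(ts - outage_start)  # outage ended here
            outage_start = None
    return sum(recoveries) / len(recoveries) if recoveries else None
```

Under these definitions a scenario with no failed attempts has no MTTR, which is consistent with the dash in the clean-baseline row.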

Every LLM attempt — successes and failures — flows through TrueFoundry's AI Gateway and shows up in AI Monitoring → Request Traces.

How we built it

Python 3.14.3 + OpenAI SDK with max_retries=0 so our resilience layer is the only control plane (no double-counting against the SDK's built-in retries).
TrueFoundry AI Gateway configured with Groq as a provider, exposing groq/llama-3.1-8b-instant and groq/llama-3.3-70b-versatile.
MCP transport: official mcp Python client over streamable-http to two local FastMCP servers.
~450 lines total across ResilientLLM, ResilientMCP, _Breaker, Scorecard, and the chaos hooks.
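
A target list along these lines would match the setup described above. Treat it as a sketch under assumptions: the `LLMTarget` dataclass, the target names other than raw-groq-8b, the environment variable names, the placeholder gateway URL, and the order of the two gateway targets are all hypothetical. The `max_retries=0` argument is the setting mentioned above, and the Groq URL is Groq's OpenAI-compatible endpoint.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMTarget:
    name: str
    base_url: str
    api_key_env: str  # name of the env var holding the key (assumed)
    model: str


# Placeholder only: substitute the base URL of your own gateway deployment.
GATEWAY_URL = os.environ.get("TFY_GATEWAY_URL", "https://gateway.example.invalid/v1")

# Priority order: gateway targets first, raw provider last.
TARGETS = [
    LLMTarget("tfy-groq-8b", GATEWAY_URL, "TFY_API_KEY", "groq/llama-3.1-8b-instant"),
    LLMTarget("tfy-groq-70b", GATEWAY_URL, "TFY_API_KEY", "groq/llama-3.3-70b-versatile"),
    LLMTarget("raw-groq-8b", "https://api.groq.com/openai/v1", "GROQ_API_KEY",
              "llama-3.1-8b-instant"),
]


def make_client(target: LLMTarget):
    """Build an OpenAI SDK client with the SDK's own retries switched off."""
    from openai import OpenAI  # lazy import: the config stays inspectable without the SDK

    return OpenAI(
        base_url=target.base_url,
        api_key=os.environ[target.api_key_env],
        max_retries=0,  # the resilience layer owns all retry decisions
    )
```

Keeping one client per target means the raw provider path shares no connection state with the gateway path, which matters when the gateway is the thing that failed.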

Challenges we ran into

The free TrueFoundry Developer Plan doesn't include weight/priority/latency routing or built-in fallbacks (those are paid features). So we built the resilience layer on the client side and use the gateway purely for the unified API and traces. That turned out to be the right shape for the challenge anyway: the gateway is the primary path, but the agent isn't bound to it.

What we're proud of

The agent survives when the gateway itself goes down (raw-groq-8b is the last target in the LLM chain) and when an MCP server starts erroring out (mcp-fallback is the second target in the MCP chain). Both legs of the challenge prompt are covered by the same primitive instead of two separate try/except blocks.
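
The "same primitive" idea can be illustrated with a generic fallback chain that does not care whether a target is an LLM endpoint or an MCP server. This is a simplified sketch that omits retries and breakers: `Target`, `run_chain`, and `AllTargetsFailed` are hypothetical names, not the project's API.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class Target(Generic[T]):
    name: str
    call: Callable[[], T]  # zero-argument callable that performs the request


class AllTargetsFailed(Exception):
    """Raised when every target in the chain has failed."""


def run_chain(targets: Sequence[Target[T]]) -> Tuple[str, T]:
    """Try targets in priority order; return (serving target name, result)."""
    errors: List[str] = []
    for target in targets:
        try:
            return target.name, target.call()
        except Exception as exc:
            errors.append(f"{target.name}: {exc}")  # remember why, then fall through
    raise AllTargetsFailed("; ".join(errors))
```

An LLM chain would end in a target such as raw-groq-8b, while an MCP chain would list mcp-fallback after the primary server; both go through the same `run_chain` call.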

What's next

A scheduled chaos engineering harness, OpenTelemetry traces, and a /scorecard HTTP endpoint that returns live MTTR. We also want to add a real per-call timeout to ResilientMCP so brownout faults can trip the breaker the same way error faults do.
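
One possible shape for the per-call timeout is to wrap the tool call in `asyncio.wait_for` and convert a timeout into an ordinary failure the breaker can count. This is a sketch of the planned feature, not existing code: `ToolSession`, `ToolTimeout`, and `call_tool_with_timeout` are hypothetical names, and the protocol only assumes a `call_tool(name, arguments)` coroutine like the one on ClientSession.

```python
import asyncio
from typing import Any, Dict, Protocol


class ToolSession(Protocol):
    """Anything exposing an async call_tool(name, arguments) method."""

    async def call_tool(self, name: str, arguments: Dict[str, Any]) -> Any: ...


class ToolTimeout(Exception):
    """Raised when a tool call exceeds its deadline."""


async def call_tool_with_timeout(session: ToolSession, name: str,
                                 arguments: Dict[str, Any],
                                 timeout_s: float) -> Any:
    try:
        # wait_for cancels the pending call once the deadline passes.
        return await asyncio.wait_for(session.call_tool(name, arguments),
                                      timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        # Surface the brownout as an error so retry and breaker logic can react.
        raise ToolTimeout(f"{name} exceeded {timeout_s}s") from exc
```

With a timeout below the 2 s injected latency, the brownout scenario would register as a failure on the primary server rather than a slow success.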

Built With

chaos-engineering
groq
llama
mcp
openai
python
truefoundry

Updates

run58669-maker Q started this project — May 16, 2026 09:30 AM EDT
