Agent Resilience Desk

Inspiration

Most AI agent demos look good when every service is healthy. Real work is different: MCP servers error out, LLM gateways brown out, search gets stale, storage fails, and users still need a clear answer about what happened and what is safe to do next.

TrueFoundry's Resilient Agents challenge asks how an agent behaves when infrastructure chaos happens. Agent Resilience Desk treats that chaos as a user experience problem, not just a backend routing problem.

What it does

Agent Resilience Desk is an interactive web demo of how an AI agent behaves when its infrastructure fails. A judge can turn on faults for the LLM gateway, MCP tool layer, search layer, result store, and external-action approval gate. The agent then runs a workflow and shows how it responds. It:

switches from the primary model route to a cached policy planner,
marks evidence freshness when search is unavailable,
queues external tool actions instead of pretending they succeeded,
exports a summary when storage is down,
and stops risky external actions until a human approves.

The goal is to make resilience visible from the user side, not hidden in backend logs.
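
The fault responses listed above can be pictured as a simple lookup from the failing service to the behavior the user sees. The following is a minimal sketch, not the demo's actual source: the service keys (llmGateway, mcpTools, and so on) and the planResponses function are hypothetical names chosen for illustration.

```javascript
// Hypothetical service keys mapped to the user-visible responses described above.
// The key names are assumptions, not identifiers taken from the demo.
const RESPONSES = {
  llmGateway: "Switch from the primary model route to the cached policy planner",
  search: "Mark evidence freshness because search is unavailable",
  mcpTools: "Queue external tool actions instead of reporting success",
  resultStore: "Export a summary because storage is down",
  approvalGate: "Stop risky external actions until a human approves",
};

// Return one response entry per active fault, in the fixed order of RESPONSES.
function planResponses(activeFaults) {
  return Object.keys(RESPONSES)
    .filter((service) => activeFaults.includes(service))
    .map((service) => ({ service, response: RESPONSES[service] }));
}

// Example: two faults are active, so two responses are planned.
console.log(planResponses(["search", "llmGateway"]));
```

Keeping the mapping in one table means every fault has exactly one stated response, which makes it easy to show the same text in the timeline and in the plan.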

How we built it

The demo is a static HTML, CSS, and JavaScript app with a small resilience state machine. Each service fault changes the agent plan, timeline, held-task queue, and operational scorecard.
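
A state machine of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the submitted code: the service names, the state shape, and the rule that each injected fault holds exactly one task are simplifications introduced here.

```javascript
// Hypothetical service identifiers; the real demo may name them differently.
const SERVICES = ["llmGateway", "mcpTools", "search", "resultStore", "approvalGate"];

// Initial state: no faults, empty timeline, nothing held.
function createState() {
  return { faults: new Set(), timeline: [], heldTasks: [] };
}

// Pure transition: toggle one service fault and return a new state object.
function toggleFault(state, service) {
  if (!SERVICES.includes(service)) {
    throw new Error(`Unknown service: ${service}`);
  }
  const faults = new Set(state.faults);
  const injected = !faults.has(service);
  if (injected) faults.add(service);
  else faults.delete(service);

  const event = injected ? "fault-injected" : "recovered";
  const timeline = [...state.timeline, { service, event }];

  // Assumption: a fault holds one task for that service; recovery releases it.
  const heldTasks = injected
    ? [...state.heldTasks, { service, task: `pending work for ${service}` }]
    : state.heldTasks.filter((t) => t.service !== service);

  return { faults, timeline, heldTasks };
}

// Derive the scorecard from the state instead of storing it separately.
function scorecard(state) {
  return {
    healthy: SERVICES.length - state.faults.size,
    degraded: state.faults.size,
    held: state.heldTasks.length,
  };
}

let state = createState();
state = toggleFault(state, "search");
state = toggleFault(state, "mcpTools");
console.log(scorecard(state));
```

Because the scorecard is derived from the fault set and the held-task queue, the UI cannot show a count that disagrees with the timeline.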

The product maps directly to TrueFoundry AI Gateway concepts: model fallback and retries, gateway-level observability, MCP tool failure isolation, approval gates, and clear user-facing incident state. The current submission uses deterministic local fault injection so judges can test every failure path without credentials. A future adapter can replace local service status with live TrueFoundry AI Gateway and MCP health checks.
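
One way to keep that future adapter cheap is to have the agent logic read service status through a single small interface. The sketch below assumes such an interface; createLocalStatusSource, getStatus, and chooseModelRoute are hypothetical names, and no real TrueFoundry API call is shown because the submission does not use one.

```javascript
// Local status source: answers from the injected fault list only.
// No network access and no credentials are involved.
function createLocalStatusSource(injectedFaults) {
  const down = new Set(injectedFaults);
  return {
    getStatus: (service) => (down.has(service) ? "down" : "healthy"),
  };
}

// Agent logic depends only on getStatus(service), so a live source
// exposing the same method could replace the local one later.
function chooseModelRoute(statusSource) {
  return statusSource.getStatus("llmGateway") === "healthy"
    ? "primary-model"
    : "cached-policy-planner";
}

const healthySource = createLocalStatusSource([]);
const brownoutSource = createLocalStatusSource(["llmGateway"]);
console.log(chooseModelRoute(healthySource));
console.log(chooseModelRoute(brownoutSource));
```

A live adapter would only need to return the same "healthy" or "down" values from real health checks; the routing function would stay unchanged.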

What we learned

A resilient agent needs more than fallback routing. The user must know whether the answer is complete or degraded, which source is stale, which action was not executed, what is waiting for recovery, and where human approval is required.

Challenges we ran into

The main challenge was avoiding a generic status dashboard. The app had to show a live agent decision, not just service health. The timeline and held-task queue are therefore built around user-facing outcomes: completed, degraded, held, or approval-required.
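
As a rough illustration of those four outcomes, each workflow step can be classified with a short priority rule. The step fields used here (needsApproval, approved, blockedBy, usedFallback, staleEvidence) are assumptions made for this sketch, not the demo's data model.

```javascript
// Classify one workflow step into a user-facing outcome.
// Priority: approval first, then holds, then degradation, then success.
function classifyStep(step) {
  if (step.needsApproval && !step.approved) return "approval-required";
  if (step.blockedBy) return "held";
  if (step.usedFallback || step.staleEvidence) return "degraded";
  return "completed";
}

// Hypothetical steps from one run with several faults active.
const steps = [
  { name: "plan", usedFallback: true },
  { name: "gather evidence", staleEvidence: true },
  { name: "create ticket", blockedBy: "mcpTools" },
  { name: "send email", needsApproval: true, approved: false },
  { name: "summarize" },
];

for (const step of steps) {
  console.log(`${step.name}: ${classifyStep(step)}`);
}
```

Checking approval before anything else reflects the idea that a risky action stays stopped even when every service it needs is healthy.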

A second challenge was staying honest about integration state. This is a local, deterministic fault-injection demo built around TrueFoundry AI Gateway design patterns. It uses no credentials, so it makes no claim of running against a production TrueFoundry deployment.

Accomplishments

Built an interactive failure injection UI.
Implemented multiple agent scenarios.
Implemented gateway, MCP, search, storage, and approval failure paths.
Added recovery replay and held-task review.
Kept the demo small enough to run locally with no external secrets.

What's next

Connect service health to a real TrueFoundry AI Gateway workspace.
Add OpenTelemetry event export for each fallback and hold decision.
Add a team policy editor for deciding which actions can auto-retry.
Add post-incident reports for agent operations teams.

Built With

css
html
javascript
mcp
truefoundry

Updates

Yoshiyuki Hongoh started this project — May 14, 2026 07:58 PM EDT
