Resilient Agent Control Tower

Inspiration

Modern AI products often fail in invisible ways: a provider times out, an MCP tool returns malformed data, or a backup model responds differently. The user usually sees only a spinner or a vague error. Resilient Agent Control Tower turns that failure path into a visible, explainable workflow.

What it does

Resilient Agent Control Tower is a browser-based simulator for AI agent reliability incidents. A judge can choose scenarios such as LLM timeout, MCP outage, malformed tool payload, rate-limit pressure, or cascading failure. The app then shows the agent's recovery path, active circuit breakers, fallback route, incident timeline, confidence level, and the final user-facing answer.

The key idea is that resilience is more than backend retries. A good agent should tell the user what failed, which fallback was used, how confident the result is, and when a human should verify the answer.
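As a rough illustration of that idea, the sketch below assembles a user-facing message from an incident outcome. The function name `composeUserAnswer`, the outcome fields, and the 0.6 verification threshold are all assumptions made for this example, not the project's actual code.

```javascript
// Hypothetical outcome shape: { answer, failedDependency, fallbackUsed, confidence }.
// Confidence is assumed to be a number between 0 and 1.
function composeUserAnswer(outcome) {
  const parts = [outcome.answer];

  // Say what failed, if anything did.
  if (outcome.failedDependency) {
    parts.push(`Note: ${outcome.failedDependency} was unavailable.`);
  }

  // Say which fallback produced the answer.
  if (outcome.fallbackUsed) {
    parts.push(`I used ${outcome.fallbackUsed} instead.`);
  }

  // Always report confidence as a percentage.
  const percent = Math.round(outcome.confidence * 100);
  parts.push(`Confidence: ${percent}%.`);

  // Below the assumed threshold, ask for human verification.
  if (outcome.confidence < 0.6) {
    parts.push("Please verify this answer with a human before acting on it.");
  }

  return parts.join(" ");
}

const message = composeUserAnswer({
  answer: "Your refund was approved.",
  failedDependency: "the primary LLM provider",
  fallbackUsed: "a backup model",
  confidence: 0.55,
});
console.log(message);
```

Keeping the message builder as a pure function of the outcome would make it easy to test each scenario's wording without rendering the dashboard.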

How I built it

The project is a dependency-free static web app built with HTML, CSS, and JavaScript modules. The simulator logic lives in src/resilience.js, the topology renderer lives in src/topology.js, and the UI wiring lives in src/app.js. A small Node test suite checks that every scenario produces a non-empty user response, a bounded confidence score, and a useful incident timeline.
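The three checks described above could be expressed roughly as follows. This is a minimal sketch: the result fields (`userResponse`, `confidence`, `timeline`), the 0 to 1 confidence range, and the reading of "useful timeline" as "at least one labeled event" are assumptions, and the real suite in the repository may look different.

```javascript
// Returns a list of problems found in one scenario result (empty list = pass).
// The result shape is hypothetical: { userResponse, confidence, timeline }.
function checkScenarioResult(result) {
  const problems = [];

  // 1. The user must always get a non-empty answer.
  if (typeof result.userResponse !== "string" || result.userResponse.trim() === "") {
    problems.push("empty user response");
  }

  // 2. Confidence must be a finite number in the assumed range [0, 1].
  const c = result.confidence;
  if (!(Number.isFinite(c) && c >= 0 && c <= 1)) {
    problems.push("confidence out of range");
  }

  // 3. The timeline must contain at least one event, and every event needs a label.
  const timeline = Array.isArray(result.timeline) ? result.timeline : [];
  if (timeline.length === 0 || timeline.some((event) => !event || !event.label)) {
    problems.push("timeline missing or unlabeled");
  }

  return problems;
}

const goodResult = {
  userResponse: "Answered from cached context.",
  confidence: 0.7,
  timeline: [{ label: "LLM timeout" }, { label: "Fallback to cache" }],
};
const badResult = { userResponse: " ", confidence: 1.4, timeline: [] };

console.log(checkScenarioResult(goodResult)); // no problems reported
console.log(checkScenarioResult(badResult)); // all three problems reported
```

A checker like this can be looped over every scenario inside Node's built-in test runner, so that adding a scenario automatically extends the coverage.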

What makes it useful

Tests multiple incident scenarios without needing real production outages
Shows LLM and MCP dependency status in one dashboard
Demonstrates retries, provider fallback, circuit breakers, and cached context
Produces a user-facing answer that explains degraded reliability
Exports an incident report for support or postmortem review
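To make the circuit breaker and fallback behavior in this list concrete, here is a deliberately small, synchronous sketch. The class name, the failure threshold of 2, and the helper `answerWithFallback` are hypothetical, and the sketch omits the half-open state and reset timer that a production breaker would normally have.

```javascript
// Minimal circuit breaker: opens after N consecutive failures.
class CircuitBreaker {
  constructor(failureThreshold = 2) {
    this.failureThreshold = failureThreshold;
    this.failures = 0;
  }

  get isOpen() {
    return this.failures >= this.failureThreshold;
  }

  call(fn) {
    // While open, fail fast without touching the dependency.
    if (this.isOpen) throw new Error("circuit open");
    try {
      const value = fn();
      this.failures = 0; // a success resets the failure count
      return value;
    } catch (err) {
      this.failures += 1;
      throw err;
    }
  }
}

// Try the primary route through the breaker; on any error, use the fallback
// and record why, so the UI can explain the degraded path.
function answerWithFallback(breaker, primary, fallback) {
  try {
    return { source: "primary", value: breaker.call(primary) };
  } catch (err) {
    return { source: "fallback", value: fallback(), reason: err.message };
  }
}

const breaker = new CircuitBreaker(2);
const failingPrimary = () => {
  throw new Error("LLM timeout");
};
const cachedFallback = () => "cached summary";

const first = answerWithFallback(breaker, failingPrimary, cachedFallback);
const second = answerWithFallback(breaker, failingPrimary, cachedFallback);
const third = answerWithFallback(breaker, failingPrimary, cachedFallback);
console.log(first.reason, "|", second.reason, "|", third.reason);
```

In this run the first two calls reach the failing provider, and the third is short-circuited by the open breaker. The `reason` field is what lets the final answer say why a fallback was used.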

Challenges

The hardest part was making infrastructure reliability understandable from the user's point of view. The demo avoids turning this into a backend-only diagram and instead focuses on what the user sees: delay, confidence, fallback path, and escalation guidance.

Accomplishments

Built a working simulator with six incident scenarios
Added a topology canvas that changes based on dependency health
Added active policy toggles for retry, fallback, circuit breaker, and cache
Added an exportable JSON incident report
Deployed a live GitHub Pages demo
Added automated scenario tests
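For readers curious what an exportable JSON incident report might contain, the following sketch serializes one simulated incident. The field names and the `buildIncidentReport` helper are assumptions for illustration; the actual report schema is not described on this page.

```javascript
// Builds a pretty-printed JSON report from a simulated incident.
// All field names here are hypothetical.
function buildIncidentReport({ scenario, policies, timeline, confidence, generatedAt }) {
  const report = {
    scenario,
    generatedAt,
    policies,
    confidence,
    // Number the events so the report reads in order.
    timeline: timeline.map((event, index) => ({ step: index + 1, ...event })),
  };
  return JSON.stringify(report, null, 2);
}

const reportJson = buildIncidentReport({
  scenario: "mcp-outage",
  generatedAt: "2026-05-13T11:07:00Z", // passed in so the output is reproducible
  policies: { retry: true, fallback: true, circuitBreaker: true, cache: false },
  confidence: 0.62,
  timeline: [
    { label: "MCP tool call failed" },
    { label: "Circuit breaker opened" },
    { label: "Answered via fallback provider" },
  ],
});
console.log(reportJson);
```

In a browser, a string like this could be wrapped in a `Blob` and offered as a download, which suits a static app with no backend.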

What's next

Add live integrations with TrueFoundry AI Gateway routing
Store incident traces for historical reliability analytics
Add team-level SLO dashboards
Add real MCP server health checks
Support policy presets for support, sales, research, and operations agents

Built With

ai-agents
canvas
css
github
html
javascript
mcp
node.js
resilience-engineering

Updates

재훈 이 started this project — May 13, 2026 07:07 AM EDT
