Inspiration
Most AI agent demos assume perfect infrastructure.
But real production systems are messy. LLM providers time out. Gateways brown out. MCP servers return malformed payloads. Tool calls fail halfway through an incident. And when that happens, the user usually sees the worst possible experience: a blank failure, a generic apology, or an agent that confidently pretends nothing went wrong.
The TrueFoundry Resilient Agents challenge asked the right question: what should an agent do when the infrastructure underneath it starts failing?
ContinuityOps is our answer.
We built a production-style AI agent resilience control plane that does not just show an agent responding. It shows the operational layer around the agent: gateway routing, fallback policy, MCP failure handling, degraded mode, observability, recovery timelines, and incident reports.
What it does
ContinuityOps helps teams test, observe, and recover AI agent workflows when model or tool infrastructure fails.
The platform includes:
- A polished control plane for monitoring AI agent health
- A live TrueFoundry AI Gateway integration
- Chaos testing for LLM, gateway, and MCP-style failures
- Automatic fallback model selection
- Retry and timeout handling
- Cached tool response recovery
- Invalid tool response handling
- Human approval gates for risky write actions
- Real-time recovery timelines
- User-facing degraded mode explanations
- Clean incident reports showing what failed, how recovery happened, and what the user experienced
The key idea is simple: the user should still get a useful answer even when the infrastructure is unhealthy.
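To make the cached-response and invalid-response items above concrete, here is a minimal sketch, not the project's actual code. The `LatencySample` shape, the `readToolEvidence` name, and the in-memory `Map` cache are all hypothetical. The sketch validates a raw tool payload and, if it is malformed, falls back to the last known good value while telling the user the data is degraded.

```typescript
// Hypothetical shape of the evidence an MCP-style latency tool returns.
interface LatencySample {
  service: string;
  p95Ms: number;
}

interface ToolEvidence {
  source: "live" | "cache";
  data: LatencySample;
  note?: string; // user-facing degraded-mode explanation
}

// Runtime check so malformed JSON never reaches the agent as "evidence".
function isLatencySample(value: unknown): value is LatencySample {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["service"] === "string" &&
    typeof v["p95Ms"] === "number" &&
    Number.isFinite(v["p95Ms"])
  );
}

// Returns live evidence when the payload is valid, cached evidence when it
// is not, and null when there is nothing trustworthy to show.
function readToolEvidence(
  raw: string,
  cache: Map<string, LatencySample>,
  key: string,
): ToolEvidence | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (isLatencySample(parsed)) {
      cache.set(key, parsed); // refresh the last known good value
      return { source: "live", data: parsed };
    }
  } catch {
    // Malformed JSON: fall through to the cache lookup below.
  }
  const cached = cache.get(key);
  if (cached === undefined) return null;
  return {
    source: "cache",
    data: cached,
    note: "Live tool response was invalid; showing last known good data.",
  };
}
```

Returning `null` instead of a guess when the cache is empty forces the caller to say that evidence is missing, which is the behavior degraded mode is meant to guarantee.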
How it works
ContinuityOps simulates a production incident where an AI agent is helping an SRE team investigate checkout latency.
During the incident, the platform can inject failures such as:
- Claude unavailable
- LLM timeout
- Provider rate limit
- AI gateway brownout
- MCP server crash
- Invalid MCP tool response
- Partial tool outage
- Permission-denied tool write
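The injectable failures above could be modeled as a small taxonomy. The following is a sketch under assumptions: the identifier names, the layer labels, and especially the mapping from each failure to a first recovery action are illustrative and are not taken from the project.

```typescript
// Hypothetical identifiers for the eight injectable failures listed above.
type FailureKind =
  | "model_unavailable"
  | "llm_timeout"
  | "provider_rate_limit"
  | "gateway_brownout"
  | "mcp_server_crash"
  | "invalid_mcp_response"
  | "partial_tool_outage"
  | "tool_write_denied";

type Layer = "model" | "gateway" | "tool";
type Recovery = "retry" | "fallback_model" | "cached_evidence" | "human_approval";

interface FailureClass {
  layer: Layer;
  recovery: Recovery; // first recovery action to try (assumed mapping)
}

// Record<FailureKind, ...> makes the compiler reject a missing failure kind.
const FAILURE_CLASSES: Record<FailureKind, FailureClass> = {
  model_unavailable: { layer: "model", recovery: "fallback_model" },
  llm_timeout: { layer: "model", recovery: "retry" },
  provider_rate_limit: { layer: "model", recovery: "fallback_model" },
  gateway_brownout: { layer: "gateway", recovery: "retry" },
  mcp_server_crash: { layer: "tool", recovery: "cached_evidence" },
  invalid_mcp_response: { layer: "tool", recovery: "cached_evidence" },
  partial_tool_outage: { layer: "tool", recovery: "cached_evidence" },
  tool_write_denied: { layer: "tool", recovery: "human_approval" },
};

function classify(kind: FailureKind): FailureClass {
  return FAILURE_CLASSES[kind];
}
```

Keeping the mapping in one exhaustive table means a newly added chaos scenario cannot ship without an explicit recovery decision.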
The agent then executes a resilience policy:
- Detect the failure condition
- Record the event in the audit ledger
- Route through TrueFoundry AI Gateway
- Retry or fall back when the primary model path fails
- Use cached or repaired MCP-style tool evidence when tools degrade
- Preserve a user-facing response through degraded mode
- Generate an incident report with recovery details
The report includes request ID, gateway route, failed components, recovery duration, confidence score, and what the end user experienced.
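The policy steps and report fields above can be sketched roughly as follows. This is an illustration, not the project's implementation: `runWithPolicy`, `ModelCall`, and the field names are hypothetical, the model call is injected so that no real gateway is contacted, and the confidence score is left out because the write-up does not say how it is computed.

```typescript
interface TimelineEvent {
  step: "failure" | "completed" | "degraded";
  detail: string;
}

interface IncidentReport {
  requestId: string;
  gatewayRoute: string; // model that served the request, or "none"
  failedComponents: string[];
  recoveryMs: number;
  degraded: boolean;
  answer: string; // what the end user saw
  timeline: TimelineEvent[];
}

// Injected caller; in a real system this would go through the gateway.
type ModelCall = (model: string) => Promise<string>;

async function runWithPolicy(
  requestId: string,
  models: string[], // primary first, then fallbacks in order
  call: ModelCall,
  retriesPerModel = 1,
): Promise<IncidentReport> {
  const started = Date.now();
  const timeline: TimelineEvent[] = [];
  const failedComponents: string[] = [];

  for (const model of models) {
    for (let attempt = 0; attempt <= retriesPerModel; attempt++) {
      try {
        const answer = await call(model);
        timeline.push({ step: "completed", detail: model });
        return {
          requestId,
          gatewayRoute: model,
          failedComponents,
          recoveryMs: Date.now() - started,
          degraded: false,
          answer,
          timeline,
        };
      } catch (err) {
        // Record every failed attempt before retrying or falling back.
        const reason = err instanceof Error ? err.message : String(err);
        timeline.push({
          step: "failure",
          detail: `${model} attempt ${attempt + 1}: ${reason}`,
        });
      }
    }
    failedComponents.push(model); // retries exhausted for this model
  }

  // Every route failed: keep a user-facing answer instead of a raw error.
  timeline.push({ step: "degraded", detail: "all model routes failed" });
  return {
    requestId,
    gatewayRoute: "none",
    failedComponents,
    recoveryMs: Date.now() - started,
    degraded: true,
    answer: "Model routes are currently unavailable. Showing cached evidence only.",
    timeline,
  };
}
```

Because the function always resolves with a report and never rejects, the caller renders either a normal or a degraded answer from the same object.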
TrueFoundry integration
ContinuityOps uses a live TrueFoundry AI Gateway path for model execution.
The deployed project is configured with:
TRUEFOUNDRY_BASE_URL=https://gateway.truefoundry.ai
TRUEFOUNDRY_MODEL=google-gemini/gemini-3.1-flash-lite
In production testing, the deployed app successfully completed a live model call through the gateway and reported:
"Primary virtual model completed through the gateway."
This makes the demo more than a static simulation. The model gateway path is live, while chaos controls demonstrate how the surrounding resilience layer behaves when parts of the agent stack fail.
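As a rough illustration of how the two settings above might be consumed, the sketch below builds a chat request from them without sending it. Several things here are assumptions rather than facts from the project: that the gateway exposes an OpenAI-style `/chat/completions` path under the base URL, that it accepts a bearer token, and the names `GatewayConfig`, `buildChatRequest`, and `apiKey`.

```typescript
interface GatewayConfig {
  baseUrl: string; // value of TRUEFOUNDRY_BASE_URL
  model: string;   // value of TRUEFOUNDRY_MODEL
  apiKey: string;  // hypothetical credential; not named in the write-up
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds, but does not send, a chat request for the configured model.
function buildChatRequest(cfg: GatewayConfig, userMessage: string): PreparedRequest {
  const base = cfg.baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    // Assumed OpenAI-compatible path; check the gateway docs before relying on it.
    url: `${base}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${cfg.apiKey}`,
      },
      body: JSON.stringify({
        model: cfg.model,
        messages: [{ role: "user", content: userMessage }],
      }),
    },
  };
}

// Possible usage in a Node or Next.js server context:
//   const req = buildChatRequest(config, "Why is checkout latency high?");
//   const res = await fetch(req.url, req.init);
```

Separating request construction from the network call makes it easy for chaos controls to intercept or replace the call without touching configuration.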
Why it matters
As AI agents move from prototypes into production, reliability becomes a product feature.
A customer support agent cannot stop working because a provider is slow. A DevOps agent cannot hallucinate because an MCP tool returned malformed JSON. A security analyst agent cannot silently skip evidence because a tool server failed. An enterprise AI platform cannot expose raw provider errors to users and call that “resilience.”
ContinuityOps treats AI agent failures like production incidents: observable, recoverable, explainable, and auditable.
Challenges we ran into
The hardest part was making resilience visible.
A normal chatbot demo hides the infrastructure. For this challenge, we needed the opposite: we needed judges to immediately understand what failed, what fallback policy ran, what data was degraded, and what the user ultimately experienced.
We also had to balance reliability and realism. Live AI infrastructure can be unpredictable during a hackathon demo, so ContinuityOps combines:
- A live TrueFoundry gateway path
- Deterministic chaos controls
- Simulated MCP-style adapters
- Clear incident reporting
That keeps the demo reproducible and easy for judges to evaluate while still proving the core resilience architecture.
Accomplishments that we're proud of
We are proud that ContinuityOps feels like a real startup product, not just a hackathon prototype.
It has:
- A production-style landing page
- A polished AI infrastructure dashboard
- Live gateway mode
- Chaos testing
- Recovery visualization
- Incident reporting
- Clean deployment
- A clear story around resilient agents
Most importantly, it communicates the challenge theme quickly: this is not another chatbot. It is a fault-tolerant control plane for AI agents.
What we learned
We learned that agent resilience is not just about retries.
True resilience needs:
- User experience design
- Gateway policy
- Tool governance
- Failure classification
- Observability
- Human approval paths
- Clear degraded-mode messaging
- Post-incident reporting
The technical system matters, but so does the trust layer around it. Users need to know that the agent handled failure safely and transparently.
What's next
ContinuityOps could evolve into a full AI reliability platform for enterprise agent teams.
Next steps would include:
- Connecting real external MCP servers through a managed MCP gateway
- Adding per-tenant resilience policies
- Supporting multiple gateway providers and model pools
- Adding historical reliability analytics
- Streaming live agent traces
- Exporting incident reports to Slack, Jira, PagerDuty, or Linear
- Adding policy simulation before production rollout
- Measuring user-impact scores for degraded responses
The long-term vision is for ContinuityOps to become the reliability and observability layer for production AI agents.
Built With
- github
- lucide-react
- motion
- next.js
- react
- tailwind-css-v4
- truefoundry-ai-gateway
- typescript
- vercel