ContinuityOps

Inspiration

Most AI agent demos assume perfect infrastructure.

But real production systems are messy. LLM providers time out. Gateways brown out. MCP servers return malformed payloads. Tool calls fail halfway through an incident. And when that happens, the user usually sees the worst possible experience: a blank failure, a generic apology, or an agent that confidently pretends nothing went wrong.

The TrueFoundry Resilient Agents challenge asked the right question: what should an agent do when the infrastructure underneath it starts failing?

ContinuityOps is our answer.

We built a production-style resilience control plane for AI agents. Rather than only showing an agent responding, it exposes the operational layer around the agent: gateway routing, fallback policy, MCP failure handling, degraded mode, observability, recovery timelines, and incident reports.

What it does

ContinuityOps helps teams test, observe, and recover AI agent workflows when model or tool infrastructure fails.

The platform includes:

A polished control plane for monitoring AI agent health
A live TrueFoundry AI Gateway integration
Chaos testing for LLM, gateway, and MCP-style failures
Automatic fallback model selection
Retry and timeout handling
Cached tool response recovery
Invalid tool response handling
Human approval gates for risky write actions
Real-time recovery timelines
User-facing degraded mode explanations
Clean incident reports showing what failed, how recovery happened, and what the user experienced

The key idea is simple: the user should still get a useful answer even when the infrastructure is unhealthy.
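As an illustration of the cached tool response recovery and invalid tool response handling listed above, here is a minimal sketch. It is not taken from the ContinuityOps codebase: the `LatencyEvidence` shape, the `resolveEvidence` helper, and the in-memory `Map` cache are all hypothetical names chosen for the example.

```typescript
// Hypothetical evidence shape for a latency tool; the real payload may differ.
interface LatencyEvidence {
  service: string;
  p95Ms: number;
}

type ToolResult =
  | { mode: "live"; evidence: LatencyEvidence }
  | { mode: "degraded"; evidence: LatencyEvidence; reason: string }
  | { mode: "unavailable"; reason: string };

// Parse and validate a raw tool payload. Returns null when it is malformed.
function parseEvidence(raw: string): LatencyEvidence | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) return null;
    const rec = data as Record<string, unknown>;
    if (typeof rec.service !== "string" || typeof rec.p95Ms !== "number") return null;
    return { service: rec.service, p95Ms: rec.p95Ms };
  } catch {
    return null;
  }
}

// Prefer live evidence, fall back to the last valid cached copy,
// and say so explicitly instead of pretending nothing went wrong.
function resolveEvidence(
  raw: string,
  cache: Map<string, LatencyEvidence>,
  key: string,
): ToolResult {
  const live = parseEvidence(raw);
  if (live) {
    cache.set(key, live);
    return { mode: "live", evidence: live };
  }
  const cached = cache.get(key);
  if (cached) {
    return {
      mode: "degraded",
      evidence: cached,
      reason: "invalid tool response; served cached evidence",
    };
  }
  return { mode: "unavailable", reason: "invalid tool response and no cached evidence" };
}
```

The `mode` field is what lets the UI tell the user whether an answer rests on live or cached evidence.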

How it works

ContinuityOps simulates a production incident where an AI agent is helping an SRE team investigate checkout latency.

During the incident, the platform can inject failures such as:

Claude unavailable
LLM timeout
Provider rate limit
AI gateway brownout
MCP server crash
Invalid MCP tool response
Partial tool outage
Permission-denied tool write
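One way to model deterministic injection of the failures above is a small controller that each component checks before doing real work. This is a sketch only: the identifiers, the `ChaosController` class, and the mapping of each failure to an LLM, gateway, or MCP component are assumptions made for illustration, not the project's actual code.

```typescript
// Failure modes mirror the list above; the identifiers are illustrative.
type FailureMode =
  | "claude_unavailable"
  | "llm_timeout"
  | "provider_rate_limit"
  | "gateway_brownout"
  | "mcp_server_crash"
  | "invalid_mcp_response"
  | "partial_tool_outage"
  | "permission_denied_write";

type Component = "llm" | "gateway" | "mcp";

// Assumed grouping of each failure mode under the component it affects.
const AFFECTED: Record<FailureMode, Component> = {
  claude_unavailable: "llm",
  llm_timeout: "llm",
  provider_rate_limit: "llm",
  gateway_brownout: "gateway",
  mcp_server_crash: "mcp",
  invalid_mcp_response: "mcp",
  partial_tool_outage: "mcp",
  permission_denied_write: "mcp",
};

const ALL_MODES = Object.keys(AFFECTED) as FailureMode[];

class ChaosController {
  private active = new Set<FailureMode>();

  inject(mode: FailureMode): void {
    this.active.add(mode);
  }

  clear(mode: FailureMode): void {
    this.active.delete(mode);
  }

  // Returns the first active failure affecting the component, or null.
  check(component: Component): FailureMode | null {
    const hit = ALL_MODES.find((m) => this.active.has(m) && AFFECTED[m] === component);
    return hit ?? null;
  }
}

type Guarded<T> = { ok: true; value: T } | { ok: false; failure: FailureMode };

// Run a call only if no injected failure targets its component.
function guarded<T>(chaos: ChaosController, component: Component, call: () => T): Guarded<T> {
  const failure = chaos.check(component);
  if (failure) return { ok: false, failure };
  return { ok: true, value: call() };
}
```

Because failures are toggled explicitly rather than sampled at random, the same incident can be replayed identically on every run.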

The agent then executes a resilience policy:

Detect the failure condition
Record the event in the audit ledger
Route through TrueFoundry AI Gateway
Retry or fall back when the primary model path fails
Use cached or repaired MCP-style tool evidence when tools degrade
Preserve a user-facing response through degraded mode
Generate an incident report with recovery details
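The retry, fallback, and degraded-mode steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `Route`, `runWithFallback`, the per-route retry count, and the degraded-mode message are hypothetical, and each route's `invoke` stands in for a model call made through the gateway.

```typescript
interface Route {
  name: string;
  invoke: (prompt: string) => Promise<string>; // stand-in for a gateway model call
}

interface Attempt {
  route: string;
  attempt: number;
  error: string;
}

interface Outcome {
  answer: string;
  route: string;
  usedFallback: boolean;
  degraded: boolean;
  attempts: Attempt[]; // failed attempts, suitable for an audit ledger
}

// Reject if the work does not settle within `ms`; always clear the timer.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try each route in order, retrying each up to `retriesPerRoute` extra times.
async function runWithFallback(
  prompt: string,
  routes: Route[],
  retriesPerRoute: number,
  timeoutMs: number,
): Promise<Outcome> {
  const attempts: Attempt[] = [];
  let index = 0;
  for (const route of routes) {
    for (let attempt = 1; attempt <= retriesPerRoute + 1; attempt++) {
      try {
        const answer = await withTimeout(route.invoke(prompt), timeoutMs);
        return { answer, route: route.name, usedFallback: index > 0, degraded: false, attempts };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        attempts.push({ route: route.name, attempt, error });
      }
    }
    index++;
  }
  // Every route failed: keep a user-facing response instead of a blank error.
  return {
    answer: "Degraded mode: model routes are unavailable, so this reply is limited to cached evidence.",
    route: "none",
    usedFallback: false,
    degraded: true,
    attempts,
  };
}
```

Returning the failed attempts alongside the answer keeps the recovery path explainable after the fact.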

The report includes request ID, gateway route, failed components, recovery duration, confidence score, and what the end user experienced.
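To make the report fields concrete, here is one possible shape, offered as a sketch rather than the actual schema. The field names, the `TimelineEvent` type, the 0 to 1 confidence scale, and the definition of recovery duration (first failure to last recovery event) are assumptions for illustration.

```typescript
interface TimelineEvent {
  atMs: number; // milliseconds since the request started
  component: string;
  kind: "failure" | "recovery";
}

interface IncidentReport {
  requestId: string;
  gatewayRoute: string;
  failedComponents: string[];
  recoveryDurationMs: number;
  confidenceScore: number; // assumed scale: 0 (no confidence) to 1 (full)
  userExperience: string;
}

function buildIncidentReport(
  requestId: string,
  gatewayRoute: string,
  events: TimelineEvent[],
  confidenceScore: number,
  userExperience: string,
): IncidentReport {
  const failures = events.filter((e) => e.kind === "failure");
  const recoveries = events.filter((e) => e.kind === "recovery");

  // Unique component names, in order of first failure.
  const failedComponents = failures
    .map((e) => e.component)
    .filter((name, i, all) => all.indexOf(name) === i);

  // Assumed definition: time from the first failure to the last recovery.
  const firstFailure = failures.length > 0 ? Math.min(...failures.map((e) => e.atMs)) : 0;
  const lastRecovery =
    recoveries.length > 0 ? Math.max(...recoveries.map((e) => e.atMs)) : firstFailure;

  return {
    requestId,
    gatewayRoute,
    failedComponents,
    recoveryDurationMs: Math.max(0, lastRecovery - firstFailure),
    confidenceScore: Math.min(1, Math.max(0, confidenceScore)),
    userExperience,
  };
}
```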

TrueFoundry integration

ContinuityOps uses a live TrueFoundry AI Gateway path for model execution.

The deployed project is configured with:

TRUEFOUNDRY_BASE_URL=https://gateway.truefoundry.ai
TRUEFOUNDRY_MODEL=google-gemini/gemini-3.1-flash-lite

In production testing, the deployed app successfully completed a live model call through the gateway:

Primary virtual model completed through the gateway.

This makes the demo more than a static simulation. The model gateway path is live, while chaos controls demonstrate how the surrounding resilience layer behaves when parts of the agent stack fail.
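For readers who want to picture the live path, the sketch below builds a request from the two settings shown above. It assumes the gateway accepts an OpenAI-style chat completions request at `/chat/completions` with a bearer token; that path, the `TRUEFOUNDRY_API_KEY` variable name, and the `buildGatewayRequest` helper are assumptions made for the example, so check the gateway documentation before relying on them.

```typescript
interface GatewayEnv {
  TRUEFOUNDRY_BASE_URL?: string;
  TRUEFOUNDRY_MODEL?: string;
  TRUEFOUNDRY_API_KEY?: string; // hypothetical variable name
}

interface GatewayRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) a chat request for the configured gateway model.
function buildGatewayRequest(env: GatewayEnv, userMessage: string): GatewayRequest {
  const baseUrl = env.TRUEFOUNDRY_BASE_URL;
  const model = env.TRUEFOUNDRY_MODEL;
  const apiKey = env.TRUEFOUNDRY_API_KEY;
  if (!baseUrl || !model || !apiKey) {
    throw new Error("missing gateway configuration");
  }
  return {
    // Assumed OpenAI-style path; trailing slashes on the base URL are trimmed.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: userMessage }],
    }),
  };
}
```

Keeping request construction separate from sending makes it straightforward to wrap the actual network call in timeout, retry, and fallback handling.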

Why it matters

As AI agents move from prototypes into production, reliability becomes a product feature.

A customer support agent cannot stop working because a provider is slow. A DevOps agent cannot hallucinate because an MCP tool returned malformed JSON. A security analyst agent cannot silently skip evidence because a tool server failed. An enterprise AI platform cannot expose raw provider errors to users and call that “resilience.”

ContinuityOps treats AI agent failures like production incidents: observable, recoverable, explainable, and auditable.

Challenges we ran into

The hardest part was making resilience visible.

A normal chatbot demo hides the infrastructure. For this challenge, we needed the opposite: judges had to see immediately what failed, which fallback policy ran, which data was degraded, and what the user ultimately experienced.

We also had to balance reliability and realism. Live AI infrastructure can be unpredictable during a hackathon demo, so ContinuityOps combines:

A live TrueFoundry gateway path
Deterministic chaos controls
Simulated MCP-style adapters
Clear incident reporting

That keeps the demo reproducible and easy to judge while still proving the core resilience architecture.

Accomplishments that we're proud of

We are proud that ContinuityOps feels like a real startup product, not just a hackathon prototype.

It has:

A production-style landing page
A polished AI infrastructure dashboard
Live gateway mode
Chaos testing
Recovery visualization
Incident reporting
Clean deployment
A clear story around resilient agents

Most importantly, it communicates the challenge theme quickly: this is not another chatbot. It is a fault-tolerant control plane for AI agents.

What we learned

We learned that agent resilience is not just about retries.

True resilience needs:

User experience design
Gateway policy
Tool governance
Failure classification
Observability
Human approval paths
Clear degraded-mode messaging
Post-incident reporting

The technical system matters, but so does the trust layer around it. Users need to know that the agent handled failure safely and transparently.

What's next

ContinuityOps could evolve into a full AI reliability platform for enterprise agent teams.

Next steps would include:

Connecting real external MCP servers through a managed MCP gateway
Adding per-tenant resilience policies
Supporting multiple gateway providers and model pools
Adding historical reliability analytics
Streaming live agent traces
Exporting incident reports to Slack, Jira, PagerDuty, or Linear
Adding policy simulation before production rollout
Measuring user-impact scores for degraded responses

The long-term vision is to become the reliability and observability layer for production AI agents.

Built With

github
lucide-react
motion
next.js
react
tailwind-css-v4
truefoundry-ai-gateway
typescript
vercel

Updates

green cat started this project — May 28, 2026 12:56 PM EDT
