Home page chat window that sends messages to a FastAPI server

Inspiration

Modern AI systems are incredibly capable — but also surprisingly fragile. Most AI demos assume ideal conditions where APIs respond instantly, retrieval systems never fail, and infrastructure is always healthy. In reality, production AI systems face outages, rate limits, MCP failures, latency spikes, malformed responses, and degraded services every day.

We built SentinelFlow AI to explore a critical question:

How do we build AI agents users can actually trust when infrastructure inevitably breaks?

The project was inspired by real-world reliability engineering principles used in distributed systems, cloud infrastructure, and observability platforms. Instead of focusing purely on model intelligence, we focused on resilience, graceful degradation, transparency, and production-readiness.

What it does

SentinelFlow AI is a resilient multi-agent orchestration platform designed to maintain stable and trustworthy AI interactions during infrastructure failures.

The system continuously monitors provider health and dynamically adapts when failures occur. If an LLM provider times out, an MCP server crashes, or latency becomes unacceptable, SentinelFlow automatically reroutes requests, retries intelligently, activates fallback providers, and preserves conversation continuity.
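The failover behavior described above can be sketched roughly as follows. This is a minimal illustration, not SentinelFlow's actual code: `complete_with_failover`, the `Provider` signature, and the two stub providers are hypothetical names, and the 5-second timeout is an assumed value.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical provider signature: takes the full message history, returns a reply.
Provider = Callable[[list[dict]], Awaitable[str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain timed out or errored."""


async def complete_with_failover(
    history: list[dict],
    providers: list[tuple[str, Provider]],
    timeout_s: float = 5.0,  # assumed value, tune per provider
) -> tuple[str, str]:
    """Try providers in order and return (provider_name, reply).

    The same `history` is passed to every provider, so the conversation
    context survives a switch from one provider to the next.
    """
    errors: dict[str, str] = {}
    for name, call in providers:
        try:
            reply = await asyncio.wait_for(call(history), timeout=timeout_s)
            return name, reply
        except (asyncio.TimeoutError, ConnectionError) as exc:
            errors[name] = repr(exc)  # remember why this provider was skipped
    raise AllProvidersFailed(errors)


# Stub providers standing in for real API clients (illustration only).
async def failing_primary(history: list[dict]) -> str:
    raise ConnectionError("primary provider is down")


async def working_backup(history: list[dict]) -> str:
    return f"echo: {history[-1]['content']}"
```

Calling `asyncio.run(complete_with_failover(history, [("primary", failing_primary), ("backup", working_backup)]))` skips the failing primary and returns the backup's reply along with its name, which lets the caller tell the user which provider answered.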

Key capabilities include:

Multi-provider failover between OpenAI and Claude
Brownout detection for degraded or slow services
MCP failure recovery and backup retrieval strategies
Graceful user-facing recovery messaging
Real-time observability dashboards
Chaos engineering controls for simulating outages
Health-aware routing and retry orchestration
Conversation continuity during provider failovers

Rather than exposing raw infrastructure errors to users, SentinelFlow communicates system state transparently and continues operating in degraded mode whenever possible.
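One way to keep raw errors away from users is a small translation layer between exceptions and chat messages. The sketch below is an assumption about how that could look: `RECOVERY_MESSAGES`, `DEFAULT_MESSAGE`, and `user_facing_message` are hypothetical names, and only the first message is taken from the project's own example. The other wording is illustrative.

```python
import asyncio

# Ordered (exception type, user-facing message) pairs; first match wins.
RECOVERY_MESSAGES: list[tuple[type, str]] = [
    (
        asyncio.TimeoutError,
        "Primary provider is experiencing elevated latency. "
        "Switching to backup mode to maintain continuity.",
    ),
    (
        ConnectionError,  # illustrative wording, not from the project
        "Primary provider is unreachable. Continuing with a backup provider.",
    ),
]

# Fallback for anything unexpected (illustrative wording).
DEFAULT_MESSAGE = "Something went wrong on our side. Running in degraded mode."


def user_facing_message(exc: Exception) -> str:
    """Translate an infrastructure exception into a message safe to show users."""
    for exc_type, message in RECOVERY_MESSAGES:
        if isinstance(exc, exc_type):
            return message
    return DEFAULT_MESSAGE
```

The raw exception would still go to the event log for debugging. Only the translated message reaches the chat window.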

How we built it

We built SentinelFlow AI using a modular orchestration architecture focused on resilience and observability.

Backend

Python
FastAPI
asyncio-based orchestration engine

AI Infrastructure

TrueFoundry Resilient Agents
OpenAI API
Claude API

Frontend

React / Next.js
Tailwind CSS

Core System Components

Orchestrator Agent
Provider Routing Layer
Health Monitoring Engine
Retry + Failover System
MCP Simulation Layer
Event Logging Pipeline
Observability Dashboard
Chaos Testing Controls

The orchestration layer continuously tracks:

provider latency
timeout frequency
retry attempts
error rates
degraded states
failover events
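A per-provider tracker for these signals might look something like the sketch below. `ProviderHealth` and its fields are hypothetical names, and the rolling-window size is an assumed default rather than a value from the project.

```python
from collections import deque


class ProviderHealth:
    """Rolling window of recent calls for one provider (illustrative sketch)."""

    def __init__(self, window: int = 50) -> None:
        # Each sample is (latency_s, ok); old samples fall off automatically.
        self._samples: deque = deque(maxlen=window)
        self.timeouts = 0
        self.retries = 0
        self.failovers = 0

    def record(self, latency_s: float, ok: bool, timed_out: bool = False) -> None:
        """Store the outcome of one provider call."""
        self._samples.append((latency_s, ok))
        if timed_out:
            self.timeouts += 1

    @property
    def error_rate(self) -> float:
        if not self._samples:
            return 0.0
        failed = sum(1 for _, ok in self._samples if not ok)
        return failed / len(self._samples)

    @property
    def avg_latency_s(self) -> float:
        if not self._samples:
            return 0.0
        return sum(latency for latency, _ in self._samples) / len(self._samples)

    def snapshot(self) -> dict:
        """Plain dict suitable for a dashboard or an event log entry."""
        return {
            "samples": len(self._samples),
            "error_rate": self.error_rate,
            "avg_latency_s": self.avg_latency_s,
            "timeouts": self.timeouts,
            "retries": self.retries,
            "failovers": self.failovers,
        }
```

A bounded window keeps the numbers responsive: a provider that recovers stops looking unhealthy once its old failures age out.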

We also implemented simulated infrastructure chaos scenarios such as:

provider outages
artificial latency injection
malformed MCP responses
retrieval failures

This allowed us to stress-test the system and demonstrate resilience behavior live.
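Chaos scenarios like these can be simulated by wrapping a provider call in a fault-injecting decorator. The following is a sketch under assumptions: `ChaosConfig`, `with_chaos`, and `fake_provider` are hypothetical names, and the malformed payload is just a placeholder string.

```python
import asyncio
import random
from dataclasses import dataclass


@dataclass
class ChaosConfig:
    """Mutable switches a chaos control panel could flip at runtime."""
    outage: bool = False            # simulate a full provider outage
    extra_latency_s: float = 0.0    # artificial latency added to every call
    malformed_rate: float = 0.0     # probability of returning a broken payload


def with_chaos(call, config: ChaosConfig, rng=random.random):
    """Wrap an async callable so it misbehaves according to `config`."""

    async def wrapped(*args, **kwargs):
        if config.outage:
            raise ConnectionError("simulated provider outage")
        if config.extra_latency_s > 0:
            await asyncio.sleep(config.extra_latency_s)
        result = await call(*args, **kwargs)
        if rng() < config.malformed_rate:
            return "{not valid json"  # placeholder for a malformed response
        return result

    return wrapped


# Stub standing in for a real provider or MCP call (illustration only).
async def fake_provider() -> str:
    return '{"answer": 42}'
```

Because the wrapper reads `config` on every call, flipping `config.outage = True` takes effect immediately, without restarting the service.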

Challenges we ran into

One of the biggest challenges was designing graceful degradation behavior instead of simple error handling.

It’s relatively easy to detect failures. It’s much harder to:

preserve user trust,
maintain conversation continuity,
and recover transparently without creating confusing experiences.

Another major challenge was balancing retry logic and failover timing. Aggressive retries increased latency, while early failovers sometimes caused unnecessary provider switching. Tuning health thresholds and brownout detection required careful orchestration design.
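One common way to bound that trade-off is to cap both the number of attempts and the total time spent retrying before handing off to a fallback. The sketch below assumes exponential backoff and a fixed time budget. `retry_with_budget`, `RetryBudgetExhausted`, `FlakyCall`, and all default values are hypothetical, not the project's tuned settings.

```python
import asyncio


class RetryBudgetExhausted(Exception):
    """Signals the caller to stop retrying and fail over to another provider."""


async def retry_with_budget(
    call,
    max_attempts: int = 3,      # assumed default
    base_delay_s: float = 0.2,  # assumed default
    budget_s: float = 2.0,      # total time allowed across all attempts
):
    """Retry `call` with exponential backoff, bounded by attempts and time."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    last_exc = None
    for attempt in range(max_attempts):
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            # Never let a single attempt outlive the overall budget.
            return await asyncio.wait_for(call(), timeout=remaining)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            delay = base_delay_s * (2 ** attempt)
            is_last = attempt == max_attempts - 1
            if is_last or loop.time() + delay >= deadline:
                break  # another retry would not fit in the budget
            await asyncio.sleep(delay)
    raise RetryBudgetExhausted("giving up, fail over instead") from last_exc


class FlakyCall:
    """Stub that fails `failures` times, then succeeds (illustration only)."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("transient failure")
        return "ok"
```

The time budget addresses the latency side of the trade-off, and the attempt cap keeps a transient blip from triggering an immediate provider switch.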

Handling partial failures was also difficult. Some providers didn’t fully fail — they became intermittently unreliable or significantly slower. Designing adaptive routing logic for degraded services became one of the most interesting engineering problems in the project.
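A brownout-aware router needs a state between "healthy" and "down". A minimal sketch of that idea, assuming three states and simple thresholds, is shown here. `State`, `classify`, `pick_provider`, and every threshold value are hypothetical choices for illustration.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # brownout: slow or intermittently failing
    DOWN = "down"


def classify(
    error_rate: float,
    p95_latency_s: float,
    latency_slo_s: float = 3.0,        # assumed threshold
    degraded_error_rate: float = 0.1,  # assumed threshold
    down_error_rate: float = 0.5,      # assumed threshold
) -> State:
    """Map recent metrics for one provider to a routing state."""
    if error_rate >= down_error_rate:
        return State.DOWN
    if error_rate >= degraded_error_rate or p95_latency_s > latency_slo_s:
        return State.DEGRADED
    return State.HEALTHY


def pick_provider(states: dict, preference: list) -> Optional[str]:
    """Prefer any healthy provider, then a degraded one, in preference order."""
    for wanted in (State.HEALTHY, State.DEGRADED):
        for name in preference:
            if states.get(name) is wanted:
                return name
    return None  # everything is down; caller should enter degraded mode
```

With these rules a slow primary is demoted below a healthy backup, but it is still used when it is the only provider not fully down.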

We also learned that observability is essential. Without detailed event tracing and health monitoring, debugging orchestration behavior during simulated outages became extremely difficult.

Accomplishments that we're proud of

We’re especially proud that SentinelFlow AI feels like a real production system rather than a typical AI demo.

Some highlights include:

Seamless failover between providers without losing conversation context
Real-time provider health monitoring
Brownout-aware intelligent routing
Human-centered recovery UX instead of raw error messages
Interactive chaos testing controls
Live observability dashboards showing retries, latency, and failovers
Stable degraded-mode operation during infrastructure failures

One of our favorite moments was intentionally disabling the primary provider during a live test and watching the system reroute automatically while continuing the conversation uninterrupted.

We’re also proud that the project demonstrates infrastructure maturity, reliability engineering, and systems thinking — not just prompt engineering.

What we learned

This project taught us that reliability is one of the most important unsolved problems in AI systems.

We learned:

graceful degradation is a UX problem as much as an infrastructure problem
observability is critical for debugging AI orchestration systems
latency and brownouts are often more dangerous than full outages
resilient AI systems require dynamic routing and adaptive recovery strategies
user trust depends heavily on transparency during failures

We also gained a deeper understanding of production engineering concepts such as:

health-aware routing
retry orchestration
distributed system resilience
chaos engineering
fault tolerance patterns
failure-aware UX design

Most importantly, we learned that building trustworthy AI systems requires designing for failure from the beginning — not treating it as an edge case.

What's next for SentinelFlow-AI

We see SentinelFlow AI evolving into a broader resilience platform for production AI systems.

Future improvements include:

predictive outage detection
intelligent provider scoring
cost-aware dynamic model selection
semantic response caching
persistent memory across failovers
distributed multi-agent coordination
self-healing orchestration policies
adaptive recovery strategies based on workload type

We also want to expand the observability layer with:

reliability scoring
recovery-time analytics
orchestration replay debugging
infrastructure tracing visualizations

Long term, we believe resilient orchestration and trustworthy degradation handling will become foundational requirements for real-world AI infrastructure.

SentinelFlow AI is our exploration of what production-grade AI reliability could look like in the future.

Built With

claude
fastapi
openai
python
react
truefoundry
vercel

Submitted to

DevNetwork [AI + ML] Hackathon 2026

Created by

My contribution to SentinelFlow AI has focused on building the resilience orchestration layer behind the system.

So far I’ve worked on:

* designing the multi-provider failover architecture
* implementing provider abstraction layers
* integrating TrueFoundry + Gemini routing
* building fallback handling for provider outages
* setting up FastAPI backend orchestration
* implementing retry/recovery flows
* handling degraded infrastructure states gracefully
* configuring deployment pipelines and cloud hosting
* improving observability and recovery transparency

One thing I’ve learned quickly: building reliable AI systems is as much an infrastructure problem as it is an AI problem.

A lot of AI apps work well under ideal conditions. Making them recover cleanly when systems fail is a very different engineering challenge.

Still early, but excited to keep evolving SentinelFlow AI.

chryl f
AI engineer building production LLM systems, RAG pipelines, and agentic workflows with full-stack expertise in scalable apps.

Updates

chryl f posted an update — May 28, 2026 02:33 AM EDT

Building SentinelFlow AI has shifted how I think about AI systems.

Most demos assume perfect infrastructure:

providers always respond
retrieval always works
latency stays low
agents never fail

Production systems don’t behave that way.

Over the past few days I’ve been building:

multi-provider failover
graceful degradation flows
retry + recovery orchestration
provider abstraction layers
infrastructure-aware routing
resilience testing with simulated outages

One of the most interesting parts has been designing the UX around failure recovery instead of exposing raw infrastructure errors.

Example: instead of “500 Internal Server Error”, SentinelFlow can respond with “⚠ Primary provider is experiencing elevated latency. Switching to backup mode to maintain continuity.”

I’m curious how others are thinking about this:

As AI systems become more embedded into real workflows, will resilience and recovery become just as important as model quality?

#AI #LLM #Engineering #FastAPI #AIInfrastructure #Resilience #OpenAI #Gemini #TrueFoundry

chryl f started this project — May 28, 2026 02:31 AM EDT
