Inspiration
In March 2025, OpenAI experienced a major outage that lasted over four hours. During that window, a CFO at a mid-stage startup watched their entire AI-powered financial planning pipeline go dark — cash-flow projections, compliance risk assessments, and strategic recommendations all vanished simultaneously. There was no fallback, no graceful degradation, just silence.
That story stuck with us because it exposed a fundamental blind spot in the agent ecosystem: everyone is building agents that are smart, but almost nobody is building agents that are resilient. We had already been developing a multi-agent CFO operating system (Multi-Agent CFO OS) with three specialized agents — Finance, Strategy, and Compliance — and we'd integrated EGIS AI for runtime governance and a Consensus Hardening Protocol (CHP) for decision integrity. But when we asked ourselves the hard question — "what happens when the LLM just... stops?" — the answer was uncomfortably quiet.
When DevNetwork announced the AI/ML Hack 2026 with the TrueFoundry Resilient Agents track, it felt like the perfect forcing function. We had the domain expertise from months of building CFO agents; now we needed to systematically answer the infrastructure chaos problem. The track's challenge — "How does your agent behave when infrastructure goes wrong?" — became our north star.
What it does
CFO Resilience Matrix routes every LLM call through a 5-layer resilience stack built on top of the TrueFoundry AI Gateway, so that CFO agents degrade gracefully instead of failing catastrophically.
┌────────────┐   ┌───────────┐   ┌───────────────┐   ┌──────────────────┐   ┌────────────────────┐
│ 1. GATEWAY │──▶│ 2. PARITY │──▶│ 3. GOVERNANCE │──▶│ 4. STATE MACHINE │──▶│ 5. USER EXPERIENCE │
│ (failover) │   │ (quality) │   │     (PII)     │   │   (CHP states)   │   │   (degradation)    │
└────────────┘   └───────────┘   └───────────────┘   └──────────────────┘   └────────────────────┘
Layer 1 — Gateway handles provider failover. When the primary LLM returns a 503, the system automatically routes to a fallback model with exponential backoff ($\text{delay} = 100 \cdot 2^{n-1} \cdot \text{U}(0,1)$ ms, where $n$ is the retry number and $\text{U}(0,1)$ is uniform jitter). Per-provider health tracking uses a circuit-breaker pattern: after 3 consecutive errors or an error rate exceeding 50%, the provider is marked unhealthy.
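The backoff formula and the circuit-breaker rule can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `backoff_delay_ms` and `ProviderHealth`, the sliding window of 10 calls, and the minimum of 4 samples before the error rate is evaluated are all assumptions made for the example.

```python
import random
from collections import deque


def backoff_delay_ms(n: int, rng: random.Random, base_ms: float = 100.0) -> float:
    """Delay before retry n (1-based): base_ms * 2^(n-1) * U(0,1) milliseconds."""
    if n < 1:
        raise ValueError("retry number starts at 1")
    return base_ms * (2 ** (n - 1)) * rng.random()


class ProviderHealth:
    """Per-provider health tracker (hypothetical name).

    window and min_samples are assumptions; the source only states the
    "3 consecutive errors" and "error rate above 50%" rules.
    """

    def __init__(self, max_consecutive: int = 3, max_error_rate: float = 0.5,
                 window: int = 10, min_samples: int = 4) -> None:
        self.max_consecutive = max_consecutive
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.recent = deque(maxlen=window)  # True marks an error
        self.consecutive_errors = 0

    def record(self, ok: bool) -> None:
        """Record the outcome of one call to this provider."""
        self.recent.append(not ok)
        self.consecutive_errors = 0 if ok else self.consecutive_errors + 1

    @property
    def healthy(self) -> bool:
        if self.consecutive_errors >= self.max_consecutive:
            return False
        if len(self.recent) < self.min_samples:
            return True  # too few samples to judge the error rate
        error_rate = sum(self.recent) / len(self.recent)
        return error_rate <= self.max_error_rate
```

Without a minimum sample count, a single failed call would already produce a 100% error rate, which is why the sketch delays the rate check until a few calls have been observed.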
Layer 2 — Parity runs a quality comparison against a second model. It computes a composite score for each response from domain-specific key-phrase density, structural markers (numbered lists, headers), and length normalization. If the two scores differ by more than a configured threshold, a PARITY_MISMATCH event fires, indicating possible model drift.
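One way such a composite score could look is sketched below. The phrase list, the weights (0.5 / 0.3 / 0.2), the target length of 400 characters, and the 0.25 threshold are illustrative assumptions, as are the function names `quality_score` and `parity_mismatch`.

```python
import re

# Illustrative phrase list; a real deployment would use a per-agent vocabulary.
FINANCE_PHRASES = ("cash flow", "runway", "burn rate")

# Lines that start with "1.", "#", "-" or "*" count as structural markers.
_MARKER = re.compile(r"^[ \t]*(?:\d+\.|#+|[-*])[ \t]+", re.MULTILINE)


def quality_score(text: str, phrases=FINANCE_PHRASES, target_len: int = 400) -> float:
    """Composite score in [0, 1] from phrase density, structure, and length."""
    lowered = text.lower()
    words = max(len(lowered.split()), 1)
    hits = sum(lowered.count(p) for p in phrases)
    density = min(hits / words * 10.0, 1.0)      # key-phrase density, capped
    structure = min(len(_MARKER.findall(text)) / 3.0, 1.0)
    length = min(len(text) / target_len, 1.0)    # length normalization
    return 0.5 * density + 0.3 * structure + 0.2 * length


def parity_mismatch(primary: str, secondary: str, threshold: float = 0.25) -> bool:
    """True when the two responses differ in quality by more than threshold."""
    return abs(quality_score(primary) - quality_score(secondary)) > threshold
```

A score like this only measures surface features, so it can flag a thin or unstructured fallback answer but cannot detect a well-formatted answer with wrong numbers.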
Layer 3 — Governance scans every response for 9 categories of PII using regex patterns, including SSNs (\d{3}-\d{2}-\d{4}), email addresses, credit card numbers, bank account and routing numbers, dates of birth, phone numbers, and physical addresses. Minor detections (1-2 matches) are redacted in place; excessive PII (3 or more matches) triggers a full BLOCK.
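A reduced sketch of the redact-or-block logic, covering only three of the categories. The SSN pattern is the one quoted above and the 3-detection block threshold follows the rule described under Challenges; the email and routing-number patterns, the `[REDACTED:…]` placeholder format, and the `scan` function itself are assumptions for illustration.

```python
import re

# Three illustrative categories; the email and routing patterns are assumptions.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "routing_number": re.compile(r"\b\d{9}\b"),
}


def scan(text: str, block_at: int = 3):
    """Return (verdict, text): PASS, REDACT with placeholders, or BLOCK."""
    detections = 0
    redacted = text
    for name, pattern in PII_PATTERNS.items():
        # subn returns the rewritten text and the number of substitutions.
        redacted, count = pattern.subn(f"[REDACTED:{name.upper()}]", redacted)
        detections += count
    if detections >= block_at:
        return "BLOCK", ""          # too much PII: suppress the response
    if detections:
        return "REDACT", redacted   # minor PII: redact in place
    return "PASS", text
```

Counting matches across all categories keeps the rule simple; a stricter variant could weight categories differently, for example blocking on a single SSN.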
Layer 4 — State Machine manages agent decisions through a formal CHP lifecycle: EXPLORING → PROVISIONAL → LOCKED, with HALT and RECOVER states for degraded conditions. Confidence ($c$) and degradation level ($d$) drive transitions automatically — if $c < 0.4$ or $d \geq 2$, the system transitions to HALT.
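The transition rule can be written as a small pure function. Only the HALT condition ($c < 0.4$ or $d \geq 2$) and the state names come from the description above; the promotion threshold of 0.7 for LOCKED and the HALT → RECOVER → EXPLORING recovery path are assumptions made so the sketch is complete.

```python
from enum import Enum


class State(Enum):
    EXPLORING = "EXPLORING"
    PROVISIONAL = "PROVISIONAL"
    LOCKED = "LOCKED"
    HALT = "HALT"
    RECOVER = "RECOVER"


def next_state(state: State, confidence: float, degradation: int) -> State:
    """One step of the lifecycle, driven by confidence c and degradation d."""
    # Rule from the text: low confidence or high degradation forces HALT.
    if confidence < 0.4 or degradation >= 2:
        return State.HALT
    # Assumed recovery path once conditions are healthy again.
    if state is State.HALT:
        return State.RECOVER
    if state is State.RECOVER:
        return State.EXPLORING
    if state is State.EXPLORING:
        return State.PROVISIONAL
    # Assumed promotion rule: lock only a confident, undegraded decision.
    if state is State.PROVISIONAL and confidence >= 0.7 and degradation == 0:
        return State.LOCKED
    return state
```

Keeping the function free of side effects makes every transition easy to cover with a table-driven test.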
Layer 5 — User Experience formats the final response with structured degradation notices, resilience metadata, and actionable guidance. The user always knows what happened and what to do next, even when the system is running on fumes.
Three specialized agents — FinanceAgent (cash flow, runway, burn rate), StrategyAgent (competitive moat, growth opportunities), and ComplianceAgent (regulatory risk, data privacy) — route every single LLM call through this stack. A built-in chaos engine injects 7 fault scenarios (provider outages, rate limiting, intermittent 500s, MCP errors, slow responses, cascading failures, and full outages) by monkey-patching httpx.Client.request at the class level, making every failure mode reproducible and testable.
How we built it
We started with a deliberate architectural constraint: zero heavy dependencies. No OpenAI SDK, no LangChain, no agent framework — just httpx for HTTP. This wasn't minimalism for its own sake; it was a strategic decision that enabled our chaos engineering approach. By owning the HTTP layer directly, we could monkey-patch httpx.Client.request transparently, injecting failures without the application code knowing anything was different.
The project was built in four phases:
Phase 1 — Gateway Client (Day 1): We implemented ResilientGatewayClient as an httpx-based client that speaks the OpenAI-compatible /v1/chat/completions protocol. Key design decisions included lazy initialization of the httpx client (so the chaos engine could patch before any connections were made), structured GatewayEvent dataclasses for every request lifecycle event, and a deterministic mock mode that returns realistic finance/strategy/compliance responses when no API key is present.
Phase 2 — Resilience Layers (Day 1-2): Each layer implements a BaseResilienceLayer abstract class with a single evaluate(ctx: ResilienceContext) -> LayerVerdict method. The ResilienceContext dataclass flows through all layers, accumulating events, confidence scores, and degradation levels. The ResilienceStack orchestrator evaluates layers sequentially and short-circuits on BLOCK — once Governance flags excessive PII, no further processing occurs.
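The layer contract and the short-circuit behavior can be sketched roughly like this. The class and method names follow the description above; the fields of `ResilienceContext`, the `PASS` and `DEGRADE` verdicts, the `run` method name, and the two toy layers are assumptions added to make the example runnable.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum


class LayerVerdict(Enum):
    PASS = "PASS"
    DEGRADE = "DEGRADE"   # assumed intermediate verdict
    BLOCK = "BLOCK"


@dataclass
class ResilienceContext:
    response: str
    confidence: float = 1.0
    degradation: int = 0
    events: list = field(default_factory=list)


class BaseResilienceLayer(ABC):
    name = "base"

    @abstractmethod
    def evaluate(self, ctx: ResilienceContext) -> LayerVerdict: ...


class ResilienceStack:
    def __init__(self, layers) -> None:
        self.layers = list(layers)

    def run(self, ctx: ResilienceContext) -> LayerVerdict:
        verdict = LayerVerdict.PASS
        for layer in self.layers:
            result = layer.evaluate(ctx)
            ctx.events.append((layer.name, result.name))
            if result is LayerVerdict.BLOCK:
                return result              # short-circuit: later layers never run
            if result is LayerVerdict.DEGRADE:
                verdict = result
        return verdict


class KeywordBlockLayer(BaseResilienceLayer):
    """Toy stand-in for Governance: blocks on a marker word."""
    name = "governance"

    def evaluate(self, ctx):
        return LayerVerdict.BLOCK if "SECRET" in ctx.response else LayerVerdict.PASS


class NoticeLayer(BaseResilienceLayer):
    """Toy stand-in for a layer that degrades the response."""
    name = "ux"

    def evaluate(self, ctx):
        ctx.degradation += 1
        return LayerVerdict.DEGRADE
```

Because the context object accumulates events as it flows through the layers, the event list doubles as a trace of which layers actually ran before a BLOCK.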
Phase 3 — Agents & Chaos Engine (Day 2): The three CFO agents inherit from CFOAgent, which handles prompt construction and resilience-stack invocation. The ChaosEngine was the most technically interesting piece — it replaces httpx.Client.request with a wrapper that checks whether the URL matches a gateway path, then evaluates active scenarios to decide whether to inject a delay, return a synthetic error response, or pass through normally. Each scenario has distinct semantics: rate limiting triggers every 3rd call, intermittent errors hit 40% randomly, cascading failures persist for $N$ calls then auto-recover.
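The per-scenario semantics can be illustrated with a small decision helper. This is a sketch under assumptions: the class name `ScenarioDecider`, the scenario keys, and the injected status codes (429, 500, 503) are hypothetical; only the "every 3rd call", "40% randomly", and "persist for $N$ calls then recover" rules come from the description above.

```python
import random
from typing import Optional


class ScenarioDecider:
    """Decides, per call, which HTTP status to inject (None = pass through)."""

    def __init__(self, seed: int = 42, rate_limit_every: int = 3,
                 intermittent_rate: float = 0.4, cascade_calls: int = 0) -> None:
        self.rng = random.Random(seed)   # private RNG keeps runs reproducible
        self.rate_limit_every = rate_limit_every
        self.intermittent_rate = intermittent_rate
        self.cascade_remaining = cascade_calls
        self.calls = 0

    def decide(self, scenario: str) -> Optional[int]:
        self.calls += 1
        if scenario == "rate_limit":
            # Every 3rd call is rejected.
            return 429 if self.calls % self.rate_limit_every == 0 else None
        if scenario == "intermittent":
            # Roughly 40% of calls fail, in a seed-determined pattern.
            return 500 if self.rng.random() < self.intermittent_rate else None
        if scenario == "cascading":
            # Fail for the configured number of calls, then auto-recover.
            if self.cascade_remaining > 0:
                self.cascade_remaining -= 1
                return 503
            return None
        return None
```

Using a dedicated `random.Random` instance rather than the module-level functions means other code that touches the global random state cannot change the injection pattern.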
Phase 4 — Tests, Demo & Polish (Day 2-3): We wrote 94 tests across 4 test files covering every component — gateway client health tracking, all 7 chaos scenarios with deterministic seeding, individual layer behavior, and full-stack integration. The demo runner produces colored CLI output with confidence bars, state badges, verdict highlights, and chaos statistics. Tests run in 0.48s.
The entire codebase is ~4,900 lines of Python across 16 source files, with a single runtime dependency (httpx >= 0.27.0).
Challenges we ran into
The mock-mode design problem. We needed the full demo to run without any API keys or network connectivity — a hackathon judge shouldn't need to provision a TrueFoundry account to see the project work. This meant building a deterministic mock response generator that produces contextually appropriate responses based on the prompt content. The initial version returned generic text; we iterated to pattern-match finance/strategy/compliance keywords and return domain-specific mock data.
Chaos engine patching scope. Our first attempt patched httpx.Client.request on individual instances, but the gateway client created new instances lazily — so patches on the old instance were invisible. We solved this by patching at the class level (httpx.Client.request = ...), which affects all instances. This required careful teardown in the deactivate() method to avoid leaving side effects.
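The difference between instance-level and class-level patching, and the teardown concern, can be shown with a stand-in class. `FakeClient` is a hypothetical substitute for httpx.Client so that the sketch runs without network access or third-party packages; `patched_request` and `fail_gateway_calls` are likewise illustrative names, not the project's API.

```python
from contextlib import contextmanager


class FakeClient:
    """Stand-in for httpx.Client; request() returns a status code."""

    def request(self, method, url):
        return 200


def fail_gateway_calls(original):
    """Build a wrapper that fails gateway paths and passes everything else."""
    def wrapper(self, method, url, *args, **kwargs):
        if "/v1/chat/completions" in url:
            return 503                       # synthetic provider outage
        return original(self, method, url, *args, **kwargs)
    return wrapper


@contextmanager
def patched_request(cls, wrapper_factory):
    """Patch cls.request for the duration of the block, then restore it."""
    original = cls.request
    cls.request = wrapper_factory(original)  # class-level: affects all instances
    try:
        yield
    finally:
        cls.request = original               # teardown runs even on exceptions


early = FakeClient()                         # created before the patch
with patched_request(FakeClient, fail_gateway_calls):
    late = FakeClient()                      # created lazily, after the patch
    statuses = (
        early.request("POST", "https://gw.test/v1/chat/completions"),
        late.request("POST", "https://gw.test/v1/chat/completions"),
        late.request("GET", "https://gw.test/health"),
    )
restored = early.request("POST", "https://gw.test/v1/chat/completions")
```

Because method lookup goes through the class, both the early and the late instance see the wrapper, and the `finally` clause guarantees the original method is put back even if the code under test raises.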
Deterministic reproducibility. Chaos scenarios that depend on randomness (intermittent errors at 40%, slow responses with random delays) would produce different results each run, making test assertions fragile. We introduced a seed-based random.Random() instance per ChaosEngine, ensuring that seed=42 always produces the same injection pattern. This trade-off between realism and reproducibility was worth it — deterministic tests gave us confidence to iterate fast.
PII pattern edge cases. The governance layer's regex-based PII detection initially produced false positives on financial data — account numbers like "Account: 12345678" triggered the bank-account pattern. We refined patterns to require specific formatting (e.g., routing numbers must be exactly 9 digits, SSNs must match the XXX-XX-XXXX format) and added a threshold system: 1-2 detections get redacted, 3+ trigger a full block.
The "block vs. degrade" boundary. Deciding when to block a response entirely versus degrading it was subjective. We settled on a quantitative rule: if confidence drops below 0.4 or degradation level reaches 2, the state machine transitions to HALT (which effectively blocks). With confidence between 0.4 and 0.7 and degradation level 1, the system degrades with notices. This creates a graduated response rather than a hard cliff.
Accomplishments that we're proud of
- 94 tests passing in 0.48 seconds: the entire resilience stack, all 7 chaos scenarios, and all 3 agents are verified in under half a second.
- A single runtime dependency (httpx): no SDKs, no frameworks, no bloat. The entire system is ~4,900 lines of pure Python.
- Chaos engineering as a first-class citizen: many resilience setups are tested with manual "turn off the Wi-Fi" approaches. Our chaos engine makes failure injection deterministic, reproducible, and programmable.
- The 5-layer architecture actually works end-to-end: we didn't just design it on paper. Every layer fires real events, the state machine transitions correctly, and the degradation notices appear in the demo output.
- Mock mode makes it zero-friction: anyone can clone the repo, run pip install httpx pytest, and run the full demo with 7 chaos scenarios in under 5 seconds. No API keys, no Docker, no setup.
What we learned
The biggest insight was that resilience is not the same as redundancy. Simply having a fallback LLM provider isn't enough — you need to handle the transition between providers gracefully, track the quality of the fallback response, inform the user that quality may be degraded, and maintain decision-state integrity throughout. Our Parity Layer exists because a fallback model that contradicts the primary model is arguably worse than no response at all.
We also learned that chaos engineering for AI agents is fundamentally different from chaos engineering for web services. With web services, you inject latency and errors at the network level and measure request throughput and error rates. With AI agents, the failures manifest as content quality degradation — a 500 error that gets retried might succeed, but the retry response could be less detailed, miss key financial data, or hallucinate numbers. Our quality scoring function was born from the realization that HTTP status codes don't capture "the model gave a worse answer on retry."
Finally, we learned that state machines are underrated for agent systems. The CHP-inspired decision lifecycle (EXPLORING → PROVISIONAL → LOCKED) gives us a formal framework for reasoning about when an agent's output should be trusted. During chaos scenarios, watching the state machine transition to HALT and then RECOVER as the system stabilizes was genuinely satisfying — it made the resilience tangible and observable.
What's next for CFO Resilience Matrix
Live TrueFoundry Gateway integration. The current implementation has full client code for the TrueFoundry AI Gateway but runs in mock mode by default. The immediate next step is deploying against a live gateway with real model routing, using the TFY_API_KEY environment variable. This will let us test real provider failover between, say, GPT-4o and Claude 3.5 Sonnet.
Adaptive chaos engineering. Right now, chaos scenarios are statically configured. We want to build a "chaos scheduler" that runs scenarios continuously in production, adjusting injection rates based on observed system behavior — like Netflix's Chaos Monkey but for LLM calls.
Streaming resilience. The gateway client has a chat_stream() method stub, but the resilience layers currently operate on complete responses. Extending the governance layer to scan streaming tokens for PII in real-time (and the parity layer to compare streaming outputs) would make the system viable for interactive, token-by-token agent experiences.
Multi-agent consensus. With the state machine layer already implementing CHP states, the natural extension is running multiple agents on the same query and requiring consensus before LOCKED — if FinanceAgent says "invest" but ComplianceAgent says "regulatory risk," the system should surface the conflict rather than picking a winner.
Enterprise dashboard. The resilience events, confidence scores, and chaos statistics are currently CLI-only. A web dashboard showing real-time gateway health, per-agent resilience metrics, and historical chaos test results would make this production-ready for enterprise CFO teams.
