Aegis — A Resilient AI Agent Runtime

Aegis — the credit_balance gap every LLM gateway misses, closed.
Seven resilience layers, one signed Aegis Receipt.

The problem

On April 20, 2026, OpenAI went down: ChatGPT, Codex, and the API, all of it. On March 2 and again on March 3, 2026, Anthropic's Claude went down, two outages within 24 hours. On November 18, 2025, a Cloudflare incident took ChatGPT and Sora with it. Every major LLM provider has had at least one significant outage in the past 12 months.

But the more interesting failure is the one most "resilient" gateways still don't handle. When Anthropic returns 400 credit_balance_too_low, LiteLLM, OpenRouter, Portkey, and even TrueFoundry's default Virtual Model fallback pass it straight through — because 400 is in the 4xx range, the gateway treats it as a client problem. Result: an agent that goes silent the moment a credit card expires. This is a documented industry-wide gap (LiteLLM Issue #24320).

What Aegis does

Aegis is an OpenAI-SDK-compatible chat completion server built on top of TrueFoundry's AI Gateway, with seven resilience layers wrapping it:

Layer	Purpose
L0 Hedge	Race a duplicate request to an alternate provider after `hedgeAfterMs`; whichever returns first wins, the loser is canceled
L1 Retry	TF Gateway exponential backoff with jitter + 3-strike termination (prevents the $437 retry-loop incident)
L2 Model fallback	TF Virtual Model switches model within provider
L3 Provider fallback	TF Virtual Model switches across providers
L3 SPOF Bypass	If TF itself is unhealthy (5xx/connection-refused/timeout), Aegis calls the provider directly using locally-stored keys — TF is not a SPOF for Aegis
L4 Semantic error	Inspects `error.type` / `error.code` / message — catches `credit_balance_too_low`, `insufficient_quota`, `context_overflow`, `model_unavailable` even when status codes don't match the gateway's enum
L5 Graceful degradation	When all providers fail, returns a normal `HTTP 200` chat completion with an honest assistant message naming every failure class, instead of propagating a stack trace
L6 Continuous self-chaos	A drill scheduler injects synthetic (v0) or Toxiproxy-driven (v1) failures every 30 seconds; the response Receipt carries `last_chaos_survival: "47s ago"` as a provable freshness signal
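
To make the L0 row concrete, here is a minimal sketch of a hedged request, not the Aegis source. The names `hedgedCall`, `ProviderCall`, and `HedgeResult` are hypothetical; only `hedgeAfterMs` comes from the table above. It assumes each provider call honors an `AbortSignal`, and it leaves out the cost accounting that a real receipt would need.

```typescript
// Hypothetical shape of a provider call: it must honor the AbortSignal.
type ProviderCall<T> = (signal: AbortSignal) => Promise<T>;

interface HedgeResult<T> {
  value: T;
  winner: "primary" | "hedge";
  hedged: boolean; // true if the duplicate request was actually sent
}

async function hedgedCall<T>(
  primary: ProviderCall<T>,
  alternate: ProviderCall<T>,
  hedgeAfterMs: number,
): Promise<HedgeResult<T>> {
  const primaryCtl = new AbortController();
  const hedgeCtl = new AbortController();
  let hedged = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const primaryRun = primary(primaryCtl.signal).then((value) => ({
    value,
    winner: "primary" as const,
  }));

  // The duplicate is only sent if the primary is still pending after hedgeAfterMs.
  const hedgeRun = new Promise<{ value: T; winner: "hedge" }>((resolve, reject) => {
    timer = setTimeout(() => {
      hedged = true;
      alternate(hedgeCtl.signal).then(
        (value) => resolve({ value, winner: "hedge" }),
        reject,
      );
    }, hedgeAfterMs);
  });

  try {
    // Promise.any settles on the first success; it rejects only if both fail.
    const first = await Promise.any([primaryRun, hedgeRun]);
    return { ...first, hedged };
  } finally {
    // Cancel the pending timer and whichever request lost the race.
    clearTimeout(timer);
    primaryCtl.abort();
    hedgeCtl.abort();
  }
}
```

One consequence of using `Promise.any` in this sketch: if the primary fails before `hedgeAfterMs` elapses, the hedge still fires when the timer expires and acts as a fallback.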

Plus MCP tool execution with classification-aware resilience:

READ_HEDGE (get_/read_/search_/list_/query_/...): races two MCP servers, "prefer first OK" semantics
WRITE_TIED (create_/send_/delete_/update_/...): single fire + idempotency-key retry on timeout
UNKNOWN_TIED: conservative tied default
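
The classification above could be sketched roughly as follows. This is an illustration rather than the Aegis classifier: `classifyTool` and `ToolDescriptor` are hypothetical names, the prefix lists contain only the prefixes shown above, and the rule that an explicit `x-aegis-idempotent` annotation overrides the name pattern is an assumption.

```typescript
type ToolPolicy = "READ_HEDGE" | "WRITE_TIED" | "UNKNOWN_TIED";

// Only the prefixes named in the text; the real lists are longer.
const READ_PREFIXES = ["get_", "read_", "search_", "list_", "query_"];
const WRITE_PREFIXES = ["create_", "send_", "delete_", "update_"];

// Hypothetical minimal view of an MCP tool definition.
interface ToolDescriptor {
  name: string;
  annotations?: Record<string, unknown>;
}

function classifyTool(tool: ToolDescriptor): ToolPolicy {
  // Assumption: an explicit opt-in annotation wins over the name heuristic.
  const flag = tool.annotations?.["x-aegis-idempotent"];
  if (flag === true) return "READ_HEDGE";
  if (flag === false) return "WRITE_TIED";

  const name = tool.name.toLowerCase();
  // Check write prefixes first so an ambiguous tool is never hedged.
  if (WRITE_PREFIXES.some((p) => name.startsWith(p))) return "WRITE_TIED";
  if (READ_PREFIXES.some((p) => name.startsWith(p))) return "READ_HEDGE";
  return "UNKNOWN_TIED";
}
```

Anything that does not match a known prefix falls through to `UNKNOWN_TIED`, which keeps the default on the conservative side.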

Every response carries an Aegis Receipt — a signed JSON envelope with the full layer trace: providers tried, hedge cost, semantic match, contract compliance, TF health, chaos survival. One artifact for operators, auditors, judges, and (selectively) end users.
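
The text does not say how the Receipt is signed, so the following is only one plausible shape, assuming an HMAC-SHA256 over the serialized JSON with a shared secret. The `ReceiptBody` fields other than `last_chaos_survival` are hypothetical placeholders, as are the function names `signReceipt` and `verifyReceipt`.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical subset of receipt fields; only last_chaos_survival is named in the text.
interface ReceiptBody {
  request_id: string;
  providers_tried: string[];
  semantic_match: string | null;
  last_chaos_survival: string;
}

interface SignedReceipt {
  payload: string; // the exact JSON string that was signed
  signature: string; // hex-encoded HMAC-SHA256 of payload
}

function signReceipt(body: ReceiptBody, secret: string): SignedReceipt {
  const payload = JSON.stringify(body);
  const signature = createHmac("sha256", secret).update(payload).digest("hex");
  return { payload, signature };
}

function verifyReceipt(receipt: SignedReceipt, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(receipt.payload).digest();
  const given = Buffer.from(receipt.signature, "hex");
  // timingSafeEqual throws on length mismatch, so check the length first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Shipping the signed string itself, instead of re-serializing the object on the verifier's side, avoids disagreements over key order and whitespace.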

How we built it

Runtime: Bun >=1.3 + TypeScript (strict)
Server: Hono with streamSSE for token streaming
LLM client: OpenAI SDK pointed at the TrueFoundry AI Gateway base URL (TF proxies all providers via OpenAI-compatible API)
Agents: OpenAI Agents SDK (TypeScript) — MCP first-class
MCP: @modelcontextprotocol/sdk for tool wiring; the convention proposed here is x-aegis-idempotent: true|false as an annotation upstream MCP servers can adopt
Chaos: toxiproxy-node-client for network-level fault injection
Validation: Zod at every external boundary
Lint/format: Biome
Tests: Bun's built-in runner — 50 unit tests, 148 assertions, runs in ~700ms

Challenges we ran into

TF Virtual Model fallback_status_codes is a fixed enum. Adding 400 to the fallback list shows "Successfully updated" in the UI but is silently stripped on save. credit_balance_too_low is HTTP 400 — so it never triggers TF's built-in fallback. That gap is Aegis L4.
Hono streamSSE crashes the entire server on an unhandled throw (honojs/hono#2164). Every Aegis streaming branch is wrapped in try/catch with the Receipt emitted as a custom event before the OpenAI [DONE] sentinel.
Hedging MCP tool calls would double-fire writes. Our classifier reads name patterns and an opt-in x-aegis-idempotent annotation, then routes to a TIED policy (single fire + idempotency-key retry) for anything classified as write or unknown.
Both of our TF integrations ran out of credits during the build week. We turned this into the real demo path: every video scene shows live credit-balance errors being caught by L4 and degraded gracefully by L5, with no simulation.
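
As a rough illustration of the L5 behavior, the sketch below builds a body in the shape of an OpenAI chat completion from a list of failures. It is not the Aegis implementation: `degradedCompletion` and `ProviderFailure` are hypothetical names, and the wording of the assistant message is invented for the example.

```typescript
// Hypothetical record of one failed attempt.
interface ProviderFailure {
  provider: string;
  failureClass: string;
}

function degradedCompletion(model: string, failures: ProviderFailure[]) {
  const summary = failures
    .map((f) => `${f.provider}: ${f.failureClass}`)
    .join("; ");

  // Same top-level shape as a normal chat completion, so SDK clients parse it unchanged.
  return {
    id: `chatcmpl-aegis-degraded-${Date.now()}`,
    object: "chat.completion" as const,
    created: Math.floor(Date.now() / 1000),
    model,
    choices: [
      {
        index: 0,
        message: {
          role: "assistant" as const,
          content: `I could not reach any model provider for this request. Failures: ${summary}.`,
        },
        finish_reason: "stop" as const,
      },
    ],
    // No tokens were generated by a model, so usage is reported as zero.
    usage: { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 },
  };
}
```

A server would return this object with status 200, so the calling agent sees an ordinary assistant turn it can relay or act on.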

Accomplishments we're proud of

A genuine industry-gap fix. Aegis is the first agent runtime we know of that handles credit_balance_too_low (in L4), a known unsolved problem documented across LiteLLM, Portkey, and OpenRouter issue trackers.
Hedge for LLMs, properly. The hedged-request technique from Dean and Barroso's "The Tail at Scale", adapted to LLM calls with cost-aware cancellation and a verifiable cost-vs-latency receipt.
TF is not a SPOF for Aegis. Even the gateway we depend on has a bypass path. Most "TF-on-top" agents would die when TF dies; Aegis routes around it.
50 tests / 0 fail / ~700ms. Real production-grade test discipline, not a sketch.

What we learned

"First response" and "first useful response" are different things. Streaming hedge needs the latter.
An error-message regex is fine as a fallback, but the structured error.type / error.code fields are the load-bearing detection path. We fall back to matching on the message only when the structured fields are absent.
The most powerful submission artifact is one auditable JSON object that ties every layer's decision back to a single request. The Aegis Receipt is that object.
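
The structured-first detection described above might look something like this. It is a sketch under assumptions: `classifyError` is a hypothetical name, the code-to-class map and the message patterns are illustrative and not taken from any provider's documentation, and the sketch treats an unrecognized structured value the same as a missing one.

```typescript
type FailureClass =
  | "credit_balance_too_low"
  | "insufficient_quota"
  | "context_overflow"
  | "model_unavailable";

// Hypothetical view of a provider error body.
interface ProviderErrorBody {
  error?: { type?: string; code?: string | null; message?: string };
}

// Illustrative mappings only; a real table would be built per provider.
const STRUCTURED = new Map<string, FailureClass>([
  ["credit_balance_too_low", "credit_balance_too_low"],
  ["insufficient_quota", "insufficient_quota"],
  ["context_length_exceeded", "context_overflow"],
  ["model_not_found", "model_unavailable"],
]);

const MESSAGE_PATTERNS: Array<[RegExp, FailureClass]> = [
  [/credit balance is too low/i, "credit_balance_too_low"],
  [/exceeded your (current )?quota/i, "insufficient_quota"],
  [/context (length|window)/i, "context_overflow"],
];

function classifyError(body: ProviderErrorBody): FailureClass | null {
  const err = body.error;
  if (!err) return null;

  // Primary path: structured fields, code before type.
  for (const key of [err.code, err.type]) {
    const hit = key ? STRUCTURED.get(key) : undefined;
    if (hit) return hit;
  }

  // Fallback path: message regex, used only when the structured fields gave no match.
  for (const [pattern, cls] of MESSAGE_PATTERNS) {
    if (err.message && pattern.test(err.message)) return cls;
  }
  return null;
}
```

Note that the HTTP status code never enters the decision, which is the point of L4: a 400 that carries a billing error is still classified as a provider-side failure.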

What's next

L6 with Toxiproxy — replace synthetic drills with real network fault injection against a shadow request copy
x-aegis-idempotent — open a proposal upstream to the MCP working group
UI — a Storm Log dashboard pulling TF AI Monitoring traces + Aegis Receipts in real time
Streaming hedge with TTFT — race two streams from the start, hand the client the faster one

Built With

biome
bun
hono
model-context-protocol
openai-agents-sdk
openai-sdk
toxiproxy
truefoundry-ai-gateway
truefoundry-mcp-gateway
typescript
zod

Updates

Hokuto Torigoe started this project — May 28, 2026 02:38 AM EDT
