resilient-agent: production failure modes for AI agents

Track

"Resilient Agents" - building agents that survive real infrastructure failure instead of demoing in a clean sandbox.

What it is

A small Python reference agent that fetches a list of URLs, asks an LLM provider to summarize each one, and keeps running when the provider misbehaves. Four resilience primitives sit between the agent loop and the provider: retry, circuit breaker, budget cap, and egress allowlist. Every interesting event is written to an append-only audit log so you can read what happened after the run.

The problem

Most agent demos run on a fast network, against a stable provider, with no spending limit. Production agents never get that. They get rate-limited at 4 a.m., hit 503s the day after a release, time out on a flaky network, and burn through a budget while nobody is watching. Without guardrails, the agent loop keeps retrying until the bill arrives.

Each failure mode in this project has a named primitive that handles it. The primitive emits a trace event. The trace lets you see the run after the fact and decide whether the agent is healthy.

Four primitives

Retry. Exponential backoff with full jitter. Retryable codes are configurable per provider (429, 408, 503, 504, and friends). Non-retryable codes fail fast. A GiveUp escape hatch lets a call short-circuit the loop when the failure is permanent.
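The retry policy described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: ProviderError, call_with_retry, backoff_delay, and the default base and cap values are assumptions made for the example. Only GiveUp and the listed status codes come from the description.

```python
import random
import time


class GiveUp(Exception):
    """Raised by a call to signal a permanent failure; the loop stops at once."""


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, code, message=""):
        super().__init__(f"{code} {message}".strip())
        self.code = code


# Assumed default set; the text says this is configurable per provider.
RETRYABLE_CODES = frozenset({408, 429, 503, 504})


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: a uniform draw from [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, max_attempts=4, retryable=RETRYABLE_CODES,
                    sleep=time.sleep, rng=random):
    """Call fn until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except GiveUp:
            raise  # escape hatch: never retried
        except ProviderError as err:
            is_last = attempt == max_attempts - 1
            if err.code not in retryable or is_last:
                raise  # non-retryable codes fail fast
            sleep(backoff_delay(attempt, rng=rng))
```

Passing sleep and rng as parameters is a design choice for the sketch: it lets a test record the delays and seed the jitter instead of waiting on a real clock.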

Circuit breaker. Three states: closed, open, half-open. After N consecutive failures the breaker opens and rejects calls immediately so the agent stops paying for an upstream that is on fire. After a recovery window it transitions to half-open and lets one probe through. If the probe succeeds the breaker closes. If it fails the breaker re-opens.
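A minimal sketch of that state machine, assuming a single-threaded agent loop. The names CircuitBreaker, CircuitOpenError, failure_threshold, and recovery_seconds are illustrative and not taken from the project.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_seconds=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0          # consecutive failures seen so far
        self.opened_at = 0.0
        self.transitions = ["closed"]

    def _set_state(self, state):
        self.state = state
        self.transitions.append(state)

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_seconds:
                raise CircuitOpenError("upstream unavailable; call rejected")
            # Recovery window elapsed: this call becomes the single probe.
            self._set_state("half_open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if (self.state == "half_open"
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
                self._set_state("open")
            raise
        self.failures = 0
        if self.state == "half_open":
            self._set_state("closed")
        return result
```

With concurrent callers, the half-open state would need a lock or a probe-in-flight flag so that only one probe passes; the sketch leaves that out.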

Budget. USD cap with two-phase reserve and commit. A call reserves an estimated cost before the request goes out, then commits the actual cost on success or releases on failure. The cap is checked at both points so an over-budget run is stopped at the agent level, not at the credit-card level.
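The reserve and commit flow might look something like the sketch below. Budget, BudgetExceeded, and the method signatures are assumptions made for illustration; the real module may track reservations differently.

```python
class BudgetExceeded(Exception):
    """Raised when a reservation or a committed cost would exceed the cap."""


class Budget:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0      # committed, actual cost
        self.reserved_usd = 0.0   # estimated cost of calls in flight

    def reserve(self, estimate_usd):
        """Phase one: hold the estimated cost before the request goes out."""
        if self.spent_usd + self.reserved_usd + estimate_usd > self.cap_usd:
            raise BudgetExceeded(f"reserving {estimate_usd} would exceed cap")
        self.reserved_usd += estimate_usd
        return estimate_usd

    def commit(self, reserved_usd, actual_usd):
        """Phase two on success: swap the estimate for the actual cost."""
        self.reserved_usd -= reserved_usd
        self.spent_usd += actual_usd
        # Second check: the actual cost can be higher than the estimate.
        if self.spent_usd + self.reserved_usd > self.cap_usd:
            raise BudgetExceeded("actual cost pushed spend past the cap")

    def release(self, reserved_usd):
        """Phase two on failure: return the hold without spending anything."""
        self.reserved_usd -= reserved_usd
```

A caller would wrap each provider call as reserve, then the request, then commit on success or release in the failure path. The sketch uses floats for brevity; integer micro-dollars or decimal.Decimal would avoid rounding drift over long runs.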

Egress guard. Host allowlist with optional wildcard subdomains. Tool calls and fetches are checked against the list before they leave the process. An unlisted host fails immediately with a structured denial. This blocks prompt-injected tool calls from reaching arbitrary destinations.
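One way to sketch the allowlist check, using only the standard library. EgressGuard, EgressDenied, and the "*.example.org" pattern syntax are assumptions for the example. In this sketch a wildcard matches subdomains but not the bare apex domain, which is a design choice rather than something the description specifies.

```python
from urllib.parse import urlsplit


class EgressDenied(Exception):
    """Structured denial carrying the offending URL and host."""

    def __init__(self, url, host):
        super().__init__(f"egress denied for host {host!r}")
        self.url = url
        self.host = host


class EgressGuard:
    def __init__(self, allowed):
        self.exact = set()
        self.suffixes = []
        for pattern in allowed:
            pattern = pattern.lower()
            if pattern.startswith("*."):
                # "*.example.org" is stored as the suffix ".example.org"
                self.suffixes.append(pattern[1:])
            else:
                self.exact.add(pattern)

    def is_allowed(self, host):
        host = host.lower().rstrip(".")
        if host in self.exact:
            return True
        return any(host.endswith(suffix) for suffix in self.suffixes)

    def check(self, url):
        """Return the host if allowed, otherwise raise EgressDenied."""
        try:
            host = urlsplit(url).hostname or ""
        except ValueError:
            host = ""  # malformed URL: treat as an unlisted host
        if not self.is_allowed(host):
            raise EgressDenied(url, host)
        return host
```

Keeping the leading dot in the stored suffix is what stops a look-alike host such as evilexample.org from matching a pattern meant for subdomains of example.org.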

A fifth module, the tracer, writes every event to a JSONL file. One line per event: timestamp, kind, payload. The agent emits run_start, retry, breaker_opened, breaker_half_opened, breaker_closed, egress_denied, budget_exceeded, and run_end. After a run you can grep the file or feed it into an external observability tool.
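A tracer along these lines can be very small. The sketch below is an assumption about its shape: the Tracer class, the emit method, and the "ts" field name are made up for illustration, while the one-JSON-object-per-line layout follows the description above.

```python
import json
import time


class Tracer:
    """Append-only JSONL writer: one event per line."""

    def __init__(self, path, clock=time.time):
        self.path = path
        self.clock = clock

    def emit(self, kind, **payload):
        # "ts" is a hypothetical name for the timestamp field.
        event = {"ts": self.clock(), "kind": kind, "payload": payload}
        # Open in append mode on every write so earlier events are never
        # rewritten and a crash loses at most the event being written.
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        return event
```

A call such as tracer.emit("retry", url=url, code=429, message="rate limited") would append a single line with kind "retry" and the keyword arguments as the payload.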

Demo

examples/research_run.py runs the agent on five sample URLs plus two adversarial ones, against a fake provider that injects realistic failures. Warm mode produces a mix of retries and one denial. Storm mode raises the failure rate, drops retries to one attempt, and shows the breaker opening, rejecting subsequent calls, and protecting the budget.

examples/breaker_lifecycle.py walks the breaker through its full state machine. Phase one drives the upstream into failure. The agent pauses for the recovery window. Phase two swaps in a healthy provider and the probe closes the breaker. The transition log prints: closed, open, half_open, closed.

The trace from the warm-mode run looks like this (abridged, with the timestamp field omitted):

{"kind": "run_start", "payload": {"url_count": 7, "budget_cap_usd": 1.0}}
{"kind": "retry", "payload": {"url": ".../intro", "code": 429, "message": "rate limited"}}
{"kind": "retry", "payload": {"url": ".../intro", "code": 429, "message": "rate limited"}}
{"kind": "egress_denied", "payload": {"url": "https://random-paste-bin.example.com/abc", "host": "random-paste-bin.example.com"}}
{"kind": "run_end", "payload": {"successes": 5, "failures": 2, "retries": 4, "spent_usd": 0.25}}

That is enough to debug a run an hour after it happened.
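As a rough illustration of reading a trace after the fact, the helper below counts events by kind. The function name count_kinds is hypothetical; the only thing it relies on is the "kind" field shown in the trace lines above.

```python
import json
from collections import Counter


def count_kinds(lines):
    """Count trace events by their "kind" field, skipping blank lines."""
    kinds = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        kinds[json.loads(line)["kind"]] += 1
    return kinds


sample = [
    '{"kind": "run_start", "payload": {"url_count": 7}}',
    '{"kind": "retry", "payload": {"code": 429}}',
    '{"kind": "retry", "payload": {"code": 429}}',
    '{"kind": "egress_denied", "payload": {"host": "random-paste-bin.example.com"}}',
]
counts = count_kinds(sample)
print(counts["retry"], counts["egress_denied"])
```

Because the function accepts any iterable of strings, an open file handle on the JSONL trace can be passed in directly.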

Why each library

The primitives are inspired by libraries I have shipped to crates.io and PyPI: llm-retry for the retry policy, llm-circuit-breaker for the half-open probe pattern, token-budget-py and agentleash for the budget pool and egress allowlist, and agenttrace for the trace format. The reference agent vendors a small inline copy of each one so the demo runs with zero install dependencies and finishes end to end in under a minute.

Plugging in real Gemini

The provider contract is one method, summarize(url, text) -> SummarizeResult. Swapping in Gemini takes about twenty lines of adapter code (shown in the README). The retry, breaker, budget, and trace logic runs unchanged against real traffic.
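The one-method contract could be expressed as a typing.Protocol, as in the sketch below. The fields on SummarizeResult (summary, cost_usd) and the CallableProvider adapter are assumptions for illustration. The adapter wraps a plain generate(prompt) callable rather than any real SDK, so no vendor API is implied; the actual Gemini adapter is the one in the README.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class SummarizeResult:
    # Hypothetical fields; the real result type may carry more or less.
    summary: str
    cost_usd: float


class Provider(Protocol):
    def summarize(self, url: str, text: str) -> SummarizeResult:
        ...


class CallableProvider:
    """Adapts any text-in, text-out function to the provider contract."""

    def __init__(self, generate: Callable[[str], str],
                 cost_per_call_usd: float = 0.0):
        self._generate = generate
        self._cost_per_call_usd = cost_per_call_usd

    def summarize(self, url: str, text: str) -> SummarizeResult:
        prompt = f"Summarize the page at {url}:\n\n{text}"
        return SummarizeResult(
            summary=self._generate(prompt),
            cost_usd=self._cost_per_call_usd,
        )


def run_once(provider: Provider, url: str, text: str) -> SummarizeResult:
    """The agent side depends only on the Provider protocol."""
    return provider.summarize(url, text)
```

Since the agent only sees the protocol, a fake provider and a real one are interchangeable as long as each exposes the same summarize method.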

Tests

Forty-three tests cover each primitive in isolation, the fake provider, the tracer, and the composed agent end to end. The agent tests force specific failure profiles with seeded RNGs so the suite is deterministic. No real LLM provider is contacted.
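The seeded-RNG approach can be illustrated with a toy fake provider. FakeProvider, failure_rate, and outcomes are hypothetical names for this sketch and are not the project's test fixtures.

```python
import random


class FakeProvider:
    """Injects failures at a fixed rate using its own seeded RNG."""

    def __init__(self, failure_rate, seed):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # private RNG, not the global one

    def summarize(self, url, text):
        if self.rng.random() < self.failure_rate:
            raise RuntimeError("503 injected failure")
        return f"summary of {url}"


def outcomes(failure_rate, seed, n=10):
    """Record the ok/fail pattern for n calls against a fresh provider."""
    provider = FakeProvider(failure_rate, seed)
    pattern = []
    for i in range(n):
        try:
            provider.summarize(f"https://example.com/{i}", "text")
            pattern.append("ok")
        except RuntimeError:
            pattern.append("fail")
    return pattern
```

Two providers built with the same seed produce the same failure pattern, so a test can assert on exact retry counts. Giving each provider its own random.Random instance keeps other code that touches the global RNG from changing the sequence.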

What is shipped

Public repo at github.com/MukundaKatta/resilient-agent. MIT license. Reproducible demo. Clean trace output. The point is not to ship a framework but to show what the inside of a resilient agent should look like in around seven hundred lines of Python.