## 💡 Inspiration
What if the AI agent itself becomes the incident?

Most "AI for DevOps" demos assume the LLM is always available. But the TrueFoundry Resilient Agents Challenge asked the opposite question: what happens to the user when the AI infrastructure goes down?

I built an incident response bot that answers that question with a meta-irony: an agent that expects to die and is engineered so that the user never notices. When Groq browns out, the gateway transparently flips to Gemini. When both die, a SQLite cache returns the last good answer. When even cache misses, a static "manual procedure" reply ships — same shape, same channels, zero dropped requests.

The headline metric I wired into the product itself: user_dropped_count: 0.


## 🚀 What it does

The bot diagnoses IT incidents (DB timeouts, 503s, deploy regressions) through three entry points:

  • Discord slash commands (/incident, /chaos, /stats) — the on-call channel
  • Streamlit dashboard (color-coded responses, real-time stats, chaos toggle button) — the war-room screen
  • REST API (POST /chat/tools, GET /stats, etc.) — for integration

Under the hood, the agent:

  1. Calls tools autonomously — it picks between get_recent_commits, get_recent_pulls, sentry_get_recent_errors, logs_tail based on the incident description. I see it
    combine two tools (Sentry + Logs) for ambiguous incidents.
  2. Synthesizes a structured diagnosis: (1) one-line root cause, (2) what to check first, (3) immediate mitigation.
  3. Survives infrastructure chaos through a 4-layer resilience chain — see below.

Every response carries a source field (live / cache / graceful), color-coded in the UI so on-call engineers can see exactly which layer answered them.


## 🛠 How I built it

Stack: Python 3.11, FastAPI, TrueFoundry AI Gateway (Groq llama-3.3-70b primary + Gemini 2.5 flash-lite fallback as a virtual model), SQLite for cache, Streamlit + discord.py
for entry points.

### The 4-Layer Resilience Chain

  Layer 1+2 (TrueFoundry Gateway, automatic)
      Groq llama-3.3-70b  ──brownout/429──▶  Gemini 2.5 flash-lite

  Layer 3 (application — chat_with_tools wrapper)
      try:    call Gateway → cache.put(result)
      except: cache.get(question_hash) → return stale-but-good answer

  Layer 4 (last resort)
      cache miss + total outage → static "manual procedure" reply

The wrapper logs every fallback decision through stats.record(result), so GET /stats exposes the resilience profile in real time — what feeds the Streamlit and Discord dashboards.

### Demonstrating it live: CHAOS_MODE toggle

I injected a runtime kill switch (POST /chaos/toggle) that forces every Gateway call to raise. It flips an env var that the LLM client reads per-request — no server restart, no
redeploy. On stage, you press the dashboard button, then ask the same question again, and watch the response flip from green ("source: live") to yellow ("source: cache, age: 30s")
to red ("source: graceful") — but never an error toast to the user.

### Tools

  • GitHub (real API via httpx) — get_recent_commits, get_recent_pulls
  • Sentry (deterministic mock for demo reproducibility) — sentry_get_recent_errors
  • Logs (deterministic mock) — logs_tail

Mocks are intentional: they let me script consistent demo videos without real-service data drift on the day of judging. The registry in src/tools/__init__.py is built so
swapping a mock for a real MCP server only changes one import line.


## 🧗 Challenges I ran into

  1. Tool-calling system prompt tuning — early prompts made the LLM "answer abstractly" instead of citing tool output verbatim. The fix: list every tool by name in the system
    prompt and instruct it to combine tools for ambiguous incidents.
  2. PowerShell encoding traps — naive Add-Content appended .env lines without a newline and corrupted TFY_MODEL (TFY_MODEL=valueGITHUB_REPO=...). UTF-8 (no-BOM) .ps1
    files crashed because PowerShell decoded them as CP949. Fixes: [IO.File]::WriteAllText with explicit UTF-8 + ASCII-only .ps1 files.
  3. Discord 403 Missing Access — I copied the application ID into DISCORD_GUILD_ID instead of the server ID. I added an on_ready log that prints every guild the bot is
    actually in, with <-- MATCH markers, so the misalignment is now visible in <5 seconds.
  4. GitHub Push Protection caught a real PAT in .env.example during the first push. The fix: revoke the leaked token, replace the value with a placeholder, reset git history,
    and re-push.
  5. Distinguishing "live" from "cache" in the UI — the cache returns an identical payload to the live response (that's the whole point). I added metadata fields (source,
    cache_age_seconds, cache_stale) and color-coded them in both Streamlit and Discord embeds so the user can see the resilience working without the answer itself changing.

## 🏅 Accomplishments that I'm proud of

  • user_dropped_count: 0 through every chaos scenario I threw at it — 3 simulated outages, 2 cache-only periods, 1 graceful-fallback period.
  • LLM autonomously combining 2 tools (Sentry + Logs) for ambiguous incidents — visible in tool_calls_made: ["sentry_get_recent_errors", "logs_tail"].
  • One-click chaos demo that flips the entire pipeline's behavior at runtime, without restarts — perfect for showing investors and judges.
  • Three independent entry points (Discord, Streamlit, REST) all hitting the same resilient backend — proves the architecture is channel-agnostic.
  • Built and verified in a single intensive build cycle — FastAPI bootstrap → resilience layer → chaos toggle → multi-channel UX, collapsed into days 6–11 of the 15-day plan.

## 🎓 What I learned

  • TrueFoundry Gateway's virtual model abstraction is the cleanest API-level fallback I've ever used — defining a single <group>/<model> route that wraps Groq + Gemini means
    application code never has to know which model answered.
  • Graceful degradation has to be visible, not silent. The first version hid the fallback path from users; reviewers couldn't tell the resilience was working. Exposing source
    in the response (and color-coding it) turned the resilience from a backend feature into a product feature.
  • Deterministic mocks beat real APIs for demos. Real Sentry/GitHub data shifts hour-to-hour and breaks the demo's narrative. Mocks let the same incident script land the same
    diagnosis every time.
  • Counter metrics that name the user's pain win. total_requests is forgettable. user_dropped_count: 0 is a headline.

## 🚀 What's next for Incident Response Bot

  • B2B SaaS for SMB DevOps teams — per-team Slack/Teams workspaces, RBAC, billing.
  • Real MCP integrations — swap mock Sentry/Logs for real MCP servers (Sentry MCP, Datadog MCP), wire PagerDuty triggers.
  • Predictive incident scoring — correlate deploy events with incident frequency to surface "this deploy looks risky" before it pages anyone.
  • Multi-tenant cache — per-customer SQLite namespaces (or upgrade to Postgres + pgvector) so one customer's chaos drill doesn't pollute another's cache.
  • Postmortem audit trail — every fallback hit logged with reason and recovery time, exportable for SOC 2 compliance reviews.

Built With

Share this project:

Updates