Inspiration

Production outages can cost businesses up to $9,000 per minute, yet human triage averages 15–30 minutes just to acknowledge an alert. Autonomous AI agents are an attractive solution, but giving an LLM raw terminal access is a major security risk: one bad prompt, hallucination, or injection exploit could delete a database or expose customer secrets.

Furthermore, if the primary AI API returns rate-limit (429) or server (500) errors during a critical recovery window, the agent is interrupted mid-remediation. We built Auto-SRE to solve both problems: a resilient, zero-trust DevOps autopilot that fixes infrastructure crashes securely and keeps executing through provider failures via automated in-flight model failovers.


What it does

Auto-SRE acts as an autonomous virtual engineer in a secure operational "War-Room."

  • Stateful Remediation Loop: Listens for cluster outage alerts (such as Out-of-Memory failures or connection pool exhaustion). It then runs system diagnostics, tests proposed shell scripts in a sandboxed staging environment, and verifies health using synthetic canary traffic, in that order.
  • Resilient AI Gateway: Intercepts 429 rate limits, 500 server errors, and timeouts. If the primary model (Claude 3.5 Sonnet) fails, the gateway switches the active model to Llama 3.3 Instruct (hosted on AWS) in-flight, preserving the full context window and conversation history so remediation can complete.
  • TrueFoundry 4-Hook Guardrails: Programmatic security middleware with four hooks:
    • Input Hook: Detects and sanitizes prompt injections.
    • Output Hook: Audits raw completions.
    • Pre-Invoke Hook: Inspects shell commands and blocks destructive actions (like rm -rf /, dd, or database drops), forcing the LLM to pivot.
    • Post-Invoke Hook: Detects and redacts credentials and PII.
  • Real-time DevOps Dashboard: Displays a visual cluster topology map, telemetry performance metrics, a live terminal of the agent's internal thoughts, and a Chaos Panel allowing users to inject incidents or simulate network failures.
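
The Pre-Invoke Hook described above can be sketched roughly as a deny-list check that runs before any shell command reaches the sandbox. This is an illustrative sketch under stated assumptions, not the project's actual middleware: the names `pre_invoke_check` and `GuardVerdict` and the three patterns are hypothetical, and a real policy would need to be much broader than a short regex list.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical deny-list covering the examples named in the write-up.
# Each entry pairs a compiled pattern with a human-readable reason.
DESTRUCTIVE_PATTERNS = [
    # rm with a recursive flag aimed at the filesystem root ("rm -rf /").
    (re.compile(r"\brm\s+-\w*r\w*\s+/(?:\s|\*|$)", re.IGNORECASE),
     "recursive delete of /"),
    # dd writing to any output target.
    (re.compile(r"\bdd\b.*\bof=", re.IGNORECASE),
     "raw disk write via dd"),
    # SQL statements that drop a database, table, or schema.
    (re.compile(r"\bdrop\s+(?:database|table|schema)\b", re.IGNORECASE),
     "database drop"),
]


@dataclass(frozen=True)
class GuardVerdict:
    allowed: bool
    reason: Optional[str] = None


def pre_invoke_check(command: str) -> GuardVerdict:
    """Return a verdict for a proposed shell command before it is executed."""
    for pattern, reason in DESTRUCTIVE_PATTERNS:
        if pattern.search(command):
            # The reason string can be fed back to the LLM so it can pivot.
            return GuardVerdict(allowed=False, reason=f"blocked: {reason}")
    return GuardVerdict(allowed=True)
```

In this sketch a blocked verdict carries a reason string, which is what would be returned to the agent as the guardrail error so it can propose a safer command.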

How we built it

  • Backend: Built with FastAPI (Python), using the OpenAI SDK's AsyncOpenAI client configured to route through the TrueFoundry AI Gateway. We implemented custom middleware hooks for regex-based script safety audits and PII redaction, alongside an asynchronous, event-driven state machine.
  • Frontend: Developed with React and styled with Tailwind CSS (v3) for a premium, monospaced, dark-mode DevOps console. The frontend polls for state sync and uses Server-Sent Events (SSE) to stream the agent's terminal monologue logs.
  • Sabotage Engine: Built modular, state-mutating mock Model Context Protocol (MCP) tools that simulate real-world service output (PostgreSQL logs, JVM metrics) and sandbox run states.
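
The failover behavior can be illustrated with a small application-level sketch. It is not how the TrueFoundry gateway implements failover; `GatewayError`, `complete_with_failover`, and the `CompletionFn` signature are hypothetical names introduced here, and the sketch only shows the core idea: on a 429, a 500, or a timeout, retry the same message history against the next model.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]
# Hypothetical call signature: (model, messages) -> completion text.
CompletionFn = Callable[[str, List[Message]], Awaitable[str]]

# Status codes treated as transient, matching the ones named in the write-up.
RETRYABLE_STATUS = {429, 500}


class GatewayError(Exception):
    """Hypothetical stand-in for an upstream HTTP error."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"upstream returned {status_code}")
        self.status_code = status_code


async def complete_with_failover(
    call: CompletionFn,
    models: List[str],
    messages: List[Message],
    timeout_s: float = 30.0,
) -> Tuple[str, str]:
    """Try each model in order with the SAME history; return (model, text)."""
    last_error: Optional[BaseException] = None
    for model in models:
        try:
            text = await asyncio.wait_for(call(model, messages), timeout=timeout_s)
            return model, text
        except asyncio.TimeoutError as exc:
            last_error = exc  # Timed out: fall through to the next model.
        except GatewayError as exc:
            if exc.status_code not in RETRYABLE_STATUS:
                raise  # Non-transient errors are not masked by failover.
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

With the AsyncOpenAI client, `call` could be a thin wrapper around `client.chat.completions.create(model=model, messages=messages)` that translates the SDK's errors into `GatewayError`. Because the message list is passed through unchanged, the fallback model sees the full conversation history.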

Challenges we ran into

  • Reversible Tokenization: Redacting passwords or access keys is easy, but if the LLM receives redacted placeholders (e.g. [REDACTED_PASSWORD]), it tries to write repair scripts with that placeholder, causing the sandbox test to fail. We solved this by developing a reversible tokenization engine: mapping credentials to reference tokens (__SECRET_TOKEN_0__) in transit, and dynamically restoring the real values only at the sandbox execution step.
  • Preserving Context Window on Failover: Copying the active state, tool calls, and message array mid-execution to a completely different model (Llama) required meticulous alignment of OpenAI-compatible schemas to prevent parsing errors.
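
The reversible tokenization idea can be sketched as a small vault that swaps secrets for reference tokens before text reaches the LLM and swaps them back at execution time. This is a minimal sketch under assumptions: the class name `SecretVault`, the `KEY=value` detection pattern, and the in-memory storage are hypothetical, and only the `__SECRET_TOKEN_N__` token format comes from the description above.

```python
import re
from typing import Dict

# Hypothetical detector: KEY=value pairs whose key name ends in a
# credential-like word. Real detection would need to cover far more formats.
SECRET_PATTERN = re.compile(
    r"(\w*(?:password|secret|api_key|access_key))=(\S+)", re.IGNORECASE
)


class SecretVault:
    """Maps real credentials to reference tokens and back."""

    def __init__(self) -> None:
        self._by_token: Dict[str, str] = {}
        self._by_value: Dict[str, str] = {}

    def tokenize(self, text: str) -> str:
        """Replace detected secrets with tokens before text leaves the host."""

        def _swap(match: "re.Match[str]") -> str:
            key, value = match.group(1), match.group(2)
            if value in self._by_token:
                return match.group(0)  # Already a token: leave it untouched.
            token = self._by_value.get(value)
            if token is None:
                token = f"__SECRET_TOKEN_{len(self._by_token)}__"
                self._by_token[token] = value
                self._by_value[value] = token
            return f"{key}={token}"

        return SECRET_PATTERN.sub(_swap, text)

    def restore(self, text: str) -> str:
        """Put real values back; intended only for the sandbox execution step."""
        for token, value in self._by_token.items():
            text = text.replace(token, value)
        return text
```

The same secret always maps to the same token, so a script the LLM writes with `__SECRET_TOKEN_0__` resolves to the original credential when `restore` is called just before sandbox execution.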

Accomplishments that we're proud of

  • In-Flight Model Mutation: We simulated AWS Bedrock throttling, watched the gateway catch the 429 error, and saw the model switch seamlessly in the middle of a diagnostic step without restarting the SRE loop.
  • Zero-Trust Guardrail Blocking: We hard-blocked destructive commands and saw the agent receive the guardrail error, analyze its mistake, and rewrite a safe command that successfully restarted the database.
  • Immersive UX: We created a visual topology map with pulsing SVG status lines that react instantly to backend incidents.

What we learned

  • Prompt Engineering is Model-Specific: Claude and Llama format their reasoning differently. Prompts that must work reliably on both models during a failover need clean, explicit system constraints.
  • Structured Outputs are Key: Emitting structured JSON for final SRE reports is much cleaner than parsing raw LLM text, and it makes UI rendering more reliable.
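
As an illustration of the structured-output point, the sketch below validates a JSON report before it is handed to the UI. The schema is an assumption: the field names `incident`, `root_cause`, `actions_taken`, and `resolved`, and the names `IncidentReport` and `parse_report`, are hypothetical and not taken from the project.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical report fields; the real report schema may differ.
REQUIRED_FIELDS = {"incident", "root_cause", "actions_taken", "resolved"}


@dataclass(frozen=True)
class IncidentReport:
    incident: str
    root_cause: str
    actions_taken: List[str]
    resolved: bool


def parse_report(raw: str) -> IncidentReport:
    """Parse and validate a model-emitted JSON report; raise ValueError if bad."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError.
    if not isinstance(data, dict):
        raise ValueError("report must be a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"report is missing fields: {sorted(missing)}")
    if not isinstance(data["resolved"], bool):
        raise ValueError("'resolved' must be a boolean")
    if not isinstance(data["actions_taken"], list):
        raise ValueError("'actions_taken' must be a list")
    return IncidentReport(
        incident=str(data["incident"]),
        root_cause=str(data["root_cause"]),
        actions_taken=[str(step) for step in data["actions_taken"]],
        resolved=data["resolved"],
    )
```

Failing fast on a malformed report lets the backend ask the model to re-emit it instead of passing half-parsed text to the dashboard.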

What's next for Auto-SRE

  • Kubernetes MCP Server: Move from mock tools to a real MCP server that interfaces with a staging Kubernetes cluster to fetch real logs (kubectl logs) and scale pods.
  • Slack & PagerDuty Integration: Allow the agent to notify teams on Slack when an incident starts, request manual approval for high-risk scripts, and post final diagnostic summaries.
  • Collaborative Multi-Agent Swarms: Introduce specialist agents (e.g. a DBA Agent and a Network Proxy Agent) that negotiate recovery scripts collaboratively.
