Auto-SRE: Autonomous Site Reliability Engineering

Inspiration

Production outages are costly and stressful. The average enterprise loses thousands of dollars per minute of downtime, and the traditional response is to page a sleepy on-call engineer at 3am who then has to manually dig through logs, search Stack Overflow, and write a fix under pressure. We wanted to automate that entire workflow with modern AI agents: from detection to diagnosis to a proposed fix, with a human approving the final deploy.

What We Learned

  • How to orchestrate multiple AI and infrastructure APIs into a coherent agentic pipeline
  • The complexity of async webhook-based APIs and real-time data flows
  • How to design a system where AI assists humans rather than replacing them — the "human in the loop" approval step was a conscious architectural decision
  • How Aerospike handles high-throughput real-time ingestion and why it's suited for production monitoring workloads
  • How LLMs like Gemini can generate actionable code diffs when given the right codebase context

How We Built It

We split into three parallel workstreams:

AI Agent Service (Python/Flask): The brain of the system. Receives error logs, enriches them with codebase context via Macroscope, sends them to Gemini to generate a bug summary and git-style code diff, then triggers a Bland AI phone call to notify the on-call engineer.
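The pipeline above can be sketched as a single orchestration function. The `enrich`, `generate_fix`, and `place_call` callables are hypothetical stand-ins for the Macroscope, Gemini, and Bland AI clients, not names from those services' real SDKs.

```python
# Hypothetical sketch of the agent pipeline. The injected callables stand in
# for the Macroscope, Gemini, and Bland AI clients (assumed, not actual SDKs).

def handle_error_log(log, enrich, generate_fix, place_call):
    """Run one error log through the full agent pipeline."""
    context = enrich(log)             # Macroscope: pull relevant codebase context
    fix = generate_fix(log, context)  # Gemini: bug summary + git-style diff
    place_call(fix["summary"])        # Bland AI: phone the on-call engineer
    return {
        "summary": fix["summary"],
        "diff": fix["diff"],
        "status": "awaiting_approval",  # human-in-the-loop gate
    }
```

Injecting the clients as callables keeps the orchestration testable without any live API keys.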

Data Pipeline & Tracing (Python/Flask): The nervous system. Uses Airbyte to stream mock error logs into a local Aerospike database. A polling service detects new errors and triggers the AI agent. Overmind wraps the entire execution for real-time agentic tracing visible to judges.
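The polling service reduces to a loop like the following sketch, where `fetch_new_errors` and `dispatch` are hypothetical callables standing in for the Aerospike query and the HTTP call into the AI agent service.

```python
import time

def poll_for_errors(fetch_new_errors, dispatch, interval=2.0, max_cycles=None):
    """Repeatedly fetch error records and hand unseen ones to the agent.

    fetch_new_errors and dispatch are assumed stand-ins for an Aerospike
    query and the trigger call into the AI agent service.
    """
    seen = set()
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for record in fetch_new_errors():
            if record["id"] not in seen:  # de-duplicate across polling cycles
                seen.add(record["id"])
                dispatch(record)
        cycles += 1
        time.sleep(interval)
    return seen
```

Tracking seen IDs in memory is the simplest de-duplication strategy; a production version would persist the high-water mark alongside the records.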

Incident Dashboard (React/Tailwind): The interface. A dark-mode cyberpunk-themed command center secured with Auth0. Polls the backend every 2 seconds, displays live incident status, renders the Gemini-generated code diff, and provides an "Approve & Deploy" button for the engineer.
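The 2-second poll returns the shared JSON contract the three teams agreed on. A plausible shape (field names here are illustrative, not the actual contract) might be:

```json
{
  "incident_id": "inc-042",
  "status": "awaiting_approval",
  "summary": "Null dereference in payment handler",
  "diff": "--- a/handler.py\n+++ b/handler.py\n...",
  "call_placed": true
}
```

The dashboard only needs to branch on `status` and render `summary` and `diff`, which keeps the frontend decoupled from how the backend produced them.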

Challenges We Faced

  • Async webhook complexity: Macroscope responds asynchronously via webhook callbacks, so we used ngrok to expose localhost. ngrok's free tier inserts an interstitial page that blocked external POST requests, which took significant debugging before we resolved it with a static domain.
  • Gemini quota limits: The free tier capped at 20 requests/day, forcing us to manage API usage carefully during development and switch to a fresh project mid-hackathon.
  • JSON parsing reliability: Gemini occasionally wraps responses in markdown code fences despite being explicitly told not to, requiring defensive parsing logic.
  • Coordinating three independent services: Agreeing on a shared JSON contract early was critical — without it, the frontend and backend would have been incompatible at integration time.
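The defensive parsing mentioned above amounts to stripping any markdown fence before decoding. A minimal sketch (the helper name is ours, not from any SDK):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from an LLM response, tolerating markdown code fences.

    Gemini sometimes wraps its output in a fenced block despite instructions,
    so strip a leading fence marker (with optional language tag) and a
    trailing one before handing the payload to json.loads.
    """
    text = raw.strip()
    text = re.sub(r"^```[a-zA-Z]*\s*", "", text)  # opening fence, e.g. with a json tag
    text = re.sub(r"\s*```$", "", text)           # closing fence
    return json.loads(text)
```

Responses without fences pass through untouched, so the same parser handles both cases.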

Built With

aerospike, airbyte, auth0, bland-ai, flask, gemini, macroscope, ngrok, overmind, python, react, tailwind
