AUTO-SRE

Live Initial Dashboard
Human approval showing the logs, fixes, and code test results.
Summary of the incident.
Dashboard after breaking the target app: the Skeptic AI spawning the isolated sandbox to test the code fix.
Dashboard after deploying the fix.
Novus Dashboard
Novus MEMORY

What Inspired Us

Every company has an engineer who carries the weight of production reliability. They're the ones who get paged at midnight when something breaks. They're the ones who spend hours reading cryptic error logs, making educated guesses about root causes, writing fixes under pressure, and deploying with uncertainty.

We've all been that engineer.

The frustration isn't just the lost sleep. It's the helplessness — knowing that 80% of your time is spent on diagnosis rather than solution. It's watching incidents cascade because the initial fix was incomplete. It's the knowledge that somewhere out there, a customer is experiencing timeouts and errors while you scramble.

We realized something: This problem has been solved by humans a thousand times over. The same patterns repeat. A missing database index. A retry loop without backoff. A connection pool exhausted. Skilled engineers can diagnose these in minutes. The question became: Why can't an AI system do this faster, more reliably, and without the emotional toll?

AutoSRE is our answer. Not to replace engineers — but to give them back their nights. To handle the routine incidents with the rigor and speed that humans deserve. To transform reliability engineering from a 3-month on-call rotation into a system that scales.

We built this because we believe engineers shouldn't have to choose between career and sleep. Production reliability should be automated, auditable, and intelligent. That's AutoSRE.

What We Built

AutoSRE is a production-ready SRE automation platform built on a three-agent orchestration architecture:

The Detective: Analyzes logs and error traces in real time to identify root causes, reaching 95% confidence after a second analysis pass
The Fixer: Generates code patches that directly address the diagnosed problem
The Skeptic: Validates fixes in an isolated sandbox environment before deployment

The system achieves incident resolution in under 60 seconds, with complete human governance and audit trails through Novus AI observability.
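To make the division of labor concrete, here is a minimal sketch of one pass through the three agents. The type names, the `handleIncident` function, and the stub agents are all hypothetical; the real agents are LLM-backed and would almost certainly be asynchronous, so treat this as an illustration of the data flow only.

```typescript
// Hypothetical shapes: the write-up names the three roles but not their interfaces.
interface Diagnosis { rootCause: string; confidence: number }
interface Patch { summary: string; code: string }
interface Verdict { approved: boolean; feedback: string }

interface Agents {
  detective: (logs: string[]) => Diagnosis;
  fixer: (diagnosis: Diagnosis) => Patch;
  skeptic: (patch: Patch) => Verdict;
}

type Outcome =
  | { status: "awaiting-human-approval"; diagnosis: Diagnosis; patch: Patch }
  | { status: "rejected-by-skeptic"; diagnosis: Diagnosis; feedback: string };

// One pass through the pipeline: diagnose, patch, validate.
function handleIncident(agents: Agents, logs: string[]): Outcome {
  const diagnosis = agents.detective(logs);
  const patch = agents.fixer(diagnosis);
  const verdict = agents.skeptic(patch);
  return verdict.approved
    ? { status: "awaiting-human-approval", diagnosis, patch }
    : { status: "rejected-by-skeptic", diagnosis, feedback: verdict.feedback };
}

// Stub agents standing in for the LLM-backed ones (illustrative only).
const stubAgents: Agents = {
  detective: (logs) => ({
    rootCause: logs.some((line) => line.includes("pool exhausted"))
      ? "connection pool exhausted"
      : "unknown",
    confidence: 0.95,
  }),
  fixer: (d) => ({ summary: `fix: ${d.rootCause}`, code: "pool.max = 20;" }),
  skeptic: (p) => ({
    approved: p.code.length > 0,
    feedback: p.code.length > 0 ? "" : "empty patch",
  }),
};

const outcome = handleIncident(stubAgents, ["ERROR pool exhausted after 30s"]);
console.log(outcome.status); // prints "awaiting-human-approval"
```

Note that an approved patch does not deploy itself here: the pipeline stops at a state that waits for a human, matching the governance requirement above.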

How We Built It

Frontend: React.js (TypeScript)

Real-time dashboard showing system health, incident status, and agent activity
Live terminal streaming agent reasoning and decisions
Governance modal for human approval gates
Interactive sandbox validation feedback

Backend: Node.js + Express.js

Three independent agent orchestration engines (Detective, Fixer, Skeptic)
Groq API integration for LLM reasoning (Llama 3)
Sandbox environment management using Node's child_process
Real-time WebSocket streaming of agent logs and decisions
Railway deployment with environment variable configuration

AI/LLM Integration:

Groq API (Llama 3) for low-latency agent reasoning
Multi-turn agentic loops with structured prompts
Tool-use architecture (agents call specific analysis and code-writing tools)
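A tool-use architecture usually boils down to a registry of named handlers plus a dispatcher that routes the model's requested call to the right one. The sketch below shows that shape under stated assumptions: the tool names `search_logs` and `write_patch`, their arguments, and the `dispatch` function are invented for illustration and do not reflect the Groq API or the project's actual tools.

```typescript
type ToolArgs = Record<string, string>;
type ToolHandler = (args: ToolArgs) => string;

// Hypothetical tools; names and argument keys are assumptions.
const tools = new Map<string, ToolHandler>([
  ["search_logs", (args) => {
    const pattern = args["pattern"] ?? "";
    const lines = (args["logs"] ?? "").split("\n");
    return lines.filter((line) => line.includes(pattern)).join("\n");
  }],
  ["write_patch", (args) =>
    `--- ${args["file"] ?? "unknown"}\n+++ patched\n${args["code"] ?? ""}`],
]);

interface ToolCall { name: string; args: ToolArgs }
interface ToolResult { ok: boolean; output: string }

// Route a model-requested call to its handler; unknown tools are reported
// back as a failed result rather than thrown, so the agent loop can recover.
function dispatch(call: ToolCall): ToolResult {
  const handler = tools.get(call.name);
  if (handler === undefined) {
    return { ok: false, output: `unknown tool: ${call.name}` };
  }
  return { ok: true, output: handler(call.args) };
}

const found = dispatch({
  name: "search_logs",
  args: { pattern: "ERROR", logs: "INFO ok\nERROR timeout\nERROR pool" },
});
console.log(found.output);
```

Returning a result object for unknown tools, instead of throwing, lets the next turn of the loop show the error to the model so it can pick a valid tool.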

Observability & Compliance:

Novus AI integration for complete incident audit trails
Real-time telemetry capture at every decision point
Dashboard logging of all agent decisions and human approvals
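As an illustration of what capturing telemetry at every decision point can look like, the sketch below keeps an append-only, in-memory list of events. It does not use or imitate the Novus AI API; `createAuditTrail`, the `AuditEvent` fields, and the actor names are assumptions made for the example.

```typescript
type Actor = "detective" | "fixer" | "skeptic" | "human";

interface AuditEvent {
  seq: number;        // position in the trail, starting at 1
  at: string;         // ISO 8601 timestamp
  incidentId: string;
  actor: Actor;
  action: string;
  detail: string;
}

// The clock is injectable so the trail can be tested deterministically.
function createAuditTrail(now: () => Date = () => new Date()) {
  const events: AuditEvent[] = [];
  return {
    // Append one frozen event; existing entries are never edited.
    record(incidentId: string, actor: Actor, action: string, detail: string): AuditEvent {
      const event: AuditEvent = Object.freeze({
        seq: events.length + 1,
        at: now().toISOString(),
        incidentId,
        actor,
        action,
        detail,
      });
      events.push(event);
      return event;
    },
    // Return the ordered history for a single incident.
    forIncident(incidentId: string): AuditEvent[] {
      return events.filter((e) => e.incidentId === incidentId);
    },
  };
}

const trail = createAuditTrail();
trail.record("inc-1", "detective", "diagnosed", "connection pool exhausted");
trail.record("inc-1", "human", "approved", "patch approved for sandbox");
console.log(trail.forIncident("inc-1").map((e) => `${e.seq} ${e.actor} ${e.action}`));
```

Recording agent decisions and human approvals in the same ordered stream is what makes it possible to reconstruct later who decided what, and in which order.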

Architecture Highlights:

Sandbox testing via isolated Node.js process spawning (~200ms startup, 2-second validation)
Hot-reload deployment with zero downtime
Golden file recovery system for reliable incident re-triggering
Human-in-the-loop governance with optional sandbox validation
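The write-up does not describe how golden file recovery works. One plausible reading is that a pristine copy of the target app's source is kept aside and copied back over the live file, so the same incident can be triggered again from a known state. The sketch below shows that idea with Node's `fs.copyFileSync`; the function names and file names are hypothetical.

```typescript
import { copyFileSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Snapshot the target file once, before any incident is triggered.
function saveGolden(livePath: string, goldenPath: string): void {
  copyFileSync(livePath, goldenPath);
}

// Overwrite the live file with the snapshot to return to the known state.
function restoreGolden(goldenPath: string, livePath: string): void {
  copyFileSync(goldenPath, livePath);
}

// Demo in a temporary directory; the file names are illustrative.
const dir = mkdtempSync(join(tmpdir(), "golden-"));
const live = join(dir, "server.js");
const golden = join(dir, "server.golden.js");

writeFileSync(live, "// original target app\n");
saveGolden(live, golden);

writeFileSync(live, "// patched by the Fixer\n"); // simulate a deployed fix
restoreGolden(golden, live);                      // reset for the next run

const restored = readFileSync(live, "utf8");
console.log(restored === "// original target app\n"); // prints true
```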

Challenges We Faced

1. Agentic Loop Reliability
Initially, agents would hallucinate or propose incomplete fixes. Solution: we implemented the Skeptic as a validation layer that rejects low-confidence patches and sends them back to the Fixer with specific feedback. This created a multi-round improvement loop.
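The reject-and-retry loop between the Skeptic and the Fixer can be sketched as a bounded loop that feeds each rejection's feedback into the next proposal. The `refine` function, the 0.9 threshold, and the three-round cap are assumptions for illustration; the source does not state the actual values.

```typescript
interface Candidate { code: string }
interface Review { confidence: number; feedback: string }
interface LoopResult { accepted: boolean; rounds: number; candidate: Candidate }

// Ask for a patch, review it, and pass the feedback into the next attempt.
// Stops on the first accepted candidate or after maxRounds attempts.
function refine(
  propose: (feedback: string | null) => Candidate,
  review: (candidate: Candidate) => Review,
  threshold = 0.9, // assumed acceptance threshold
  maxRounds = 3,   // assumed cap so the loop always terminates
): LoopResult {
  let feedback: string | null = null;
  let candidate: Candidate = { code: "" };
  for (let round = 1; round <= maxRounds; round++) {
    candidate = propose(feedback);
    const verdict = review(candidate);
    if (verdict.confidence >= threshold) {
      return { accepted: true, rounds: round, candidate };
    }
    feedback = verdict.feedback;
  }
  return { accepted: false, rounds: maxRounds, candidate };
}

// Stubs: the first patch lacks backoff, the second one adds it.
const fixer = (feedback: string | null): Candidate =>
  feedback === null
    ? { code: "retry(fetchOrders)" }
    : { code: "retry(fetchOrders, { backoffMs: 200 })" };

const skeptic = (c: Candidate): Review =>
  c.code.includes("backoffMs")
    ? { confidence: 0.95, feedback: "" }
    : { confidence: 0.6, feedback: "retry has no backoff" };

const result = refine(fixer, skeptic);
console.log(result.accepted, result.rounds); // prints: true 2
```

Capping the number of rounds matters: without it, a Fixer that never satisfies the Skeptic would loop forever instead of escalating to a human.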

2. Real vs. Simulated Execution
Early versions only simulated the recovery of broken systems. We rebuilt the pipeline to execute actual code patches in containers, run synthetic traffic against the patched app, and deploy fixes that genuinely work. This added complexity but made the system enterprise-grade.

3. Sandbox Testing Performance
Running full Docker containers for each validation was too slow (~30 seconds per test). We optimized by using Node's native child_process instead, reducing sandbox startup and test execution to under 2 seconds.
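A minimal sketch of child_process-based validation, assuming the candidate patch is a standalone script whose exit code signals pass or fail. The `runInSandbox` helper and the 2000 ms default are illustrative; only `spawnSync`, its `timeout` option, and the `fs` calls are real Node APIs. A child process gives separation at the process level, not the stronger isolation a container provides.

```typescript
import { spawnSync } from "node:child_process";
import { mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface SandboxResult {
  passed: boolean;
  timedOut: boolean;
  stdout: string;
  stderr: string;
}

// Write the candidate to a throwaway directory and run it in a separate
// Node process; exit code 0 within the time limit counts as a pass.
function runInSandbox(patchedSource: string, timeoutMs = 2000): SandboxResult {
  const dir = mkdtempSync(join(tmpdir(), "sandbox-"));
  const file = join(dir, "candidate.js");
  try {
    writeFileSync(file, patchedSource);
    const result = spawnSync(process.execPath, [file], {
      cwd: dir,
      timeout: timeoutMs, // the child is killed if it runs longer than this
      encoding: "utf8",
    });
    const timedOut =
      result.error !== undefined &&
      (result.error as { code?: string }).code === "ETIMEDOUT";
    return {
      passed: result.status === 0 && !timedOut,
      timedOut,
      stdout: result.stdout ?? "",
      stderr: result.stderr ?? "",
    };
  } finally {
    rmSync(dir, { recursive: true, force: true }); // always clean up
  }
}

const good = runInSandbox('console.log("ok");');
const bad = runInSandbox('throw new Error("boom");');
console.log(good.passed, bad.passed);
```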

4. Agent Confidence & Root Cause Analysis
Agents would initially miss root causes (60% confidence). We implemented a two-phase detective system:

Phase 1: Quick pattern matching on error logs
Phase 2: If rejected by Skeptic, deeper trace analysis with specialized tools
Result: 95% confidence and correct fixes on second iteration
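The two phases can be sketched as a cheap pattern table followed by a slower fallback. Everything here is an assumption made for illustration: the regular expressions, the "most frequent stack frame" heuristic standing in for deeper trace analysis, and the `investigate` function. The real Detective relies on an LLM and specialized tools.

```typescript
interface Finding { rootCause: string; phase: 1 | 2 }

// Phase 1: known failure signatures (illustrative patterns only).
const KNOWN_PATTERNS: ReadonlyArray<{ pattern: RegExp; rootCause: string }> = [
  { pattern: /connection pool exhausted/i, rootCause: "connection pool exhausted" },
  { pattern: /seq(uential)? scan/i, rootCause: "missing database index" },
  { pattern: /retry(ing)? .*attempt \d+/i, rootCause: "retry loop without backoff" },
];

function quickMatch(logs: string[]): Finding | null {
  for (const { pattern, rootCause } of KNOWN_PATTERNS) {
    if (logs.some((line) => pattern.test(line))) return { rootCause, phase: 1 };
  }
  return null;
}

// Phase 2 stand-in: report the function that appears most often in stack frames.
function deepTrace(logs: string[]): Finding | null {
  const counts = new Map<string, number>();
  for (const line of logs) {
    const match = /^\s*at (\S+) \(/.exec(line);
    if (match === null) continue;
    const frame = match[1] ?? "";
    counts.set(frame, (counts.get(frame) ?? 0) + 1);
  }
  const ranked = Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
  const top = ranked[0];
  return top === undefined ? null : { rootCause: `hot frame: ${top[0]}`, phase: 2 };
}

// Run phase 1 first; fall back to phase 2 when the Skeptic rejects the
// quick finding (or, as an added assumption, when phase 1 finds nothing).
function investigate(
  logs: string[],
  skepticAccepts: (finding: Finding) => boolean,
): Finding | null {
  const quick = quickMatch(logs);
  if (quick !== null && skepticAccepts(quick)) return quick;
  return deepTrace(logs);
}

const sampleLogs = [
  "ERROR connection pool exhausted",
  "    at queryOrders (db.js:42:11)",
  "    at handleRequest (server.js:10:3)",
  "    at queryOrders (db.js:42:11)",
];
console.log(investigate(sampleLogs, () => true));
console.log(investigate(sampleLogs, () => false));
```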

5. Human Governance Without Slowing Down
Keeping humans in control while maintaining speed was critical. We designed approval gates at strategic points (after the initial Skeptic safety check, before sandbox deployment) rather than requiring approval at every step.
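One way to keep the gate enforceable is a small state machine in which the only path to the sandbox stage goes through an explicit human decision. The stage and event names below are hypothetical; the sketch simply encodes the ordering described above (Skeptic check first, human approval second).

```typescript
type Stage = "skeptic-check" | "awaiting-approval" | "approved-for-sandbox" | "rejected";

type GateEvent =
  | { type: "skeptic-passed" }
  | { type: "skeptic-failed" }
  | { type: "human-approved"; by: string }
  | { type: "human-rejected"; by: string };

// Return the next stage, or throw if the event is not allowed in this stage.
// There is deliberately no transition that skips "awaiting-approval".
function advance(stage: Stage, event: GateEvent): Stage {
  if (stage === "skeptic-check") {
    if (event.type === "skeptic-passed") return "awaiting-approval";
    if (event.type === "skeptic-failed") return "rejected";
  }
  if (stage === "awaiting-approval") {
    if (event.type === "human-approved") return "approved-for-sandbox";
    if (event.type === "human-rejected") return "rejected";
  }
  throw new Error(`illegal transition: ${event.type} while in ${stage}`);
}

let stage: Stage = "skeptic-check";
stage = advance(stage, { type: "skeptic-passed" });
stage = advance(stage, { type: "human-approved", by: "on-call engineer" });
console.log(stage); // prints "approved-for-sandbox"
```

Because a human approval arriving before the Skeptic check throws instead of being silently accepted, a bug elsewhere in the orchestrator cannot bypass the gate.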

6. Deployment to Production
Getting a full orchestrator + target app + frontend all running on Railway with proper port management required a monorepo start script that spawns and coordinates multiple processes.
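A start script of this kind might look roughly like the sketch below. The entry-point paths, the choice of which service receives the public port, and the "stop everything if one service exits" policy are all assumptions; only `spawn` and the convention of passing the port through a `PORT` environment variable are taken as given.

```typescript
import { spawn, ChildProcess } from "node:child_process";

interface Service { name: string; command: string; args: string[]; port: number }

// Assign the public port to one service and consecutive ports to the rest.
// The paths are hypothetical placeholders for the monorepo's entry points.
function planServices(publicPort: number): Service[] {
  return [
    { name: "frontend", command: "node", args: ["frontend/server.js"], port: publicPort },
    { name: "orchestrator", command: "node", args: ["orchestrator/index.js"], port: publicPort + 1 },
    { name: "target-app", command: "node", args: ["target-app/index.js"], port: publicPort + 2 },
  ];
}

// Spawn every service with its own PORT and share the parent's stdio.
function startAll(services: Service[]): ChildProcess[] {
  const children = services.map((svc) =>
    spawn(svc.command, svc.args, {
      stdio: "inherit",
      env: { ...process.env, PORT: String(svc.port) },
    }),
  );
  // If any service exits, stop the others so the whole unit restarts together.
  children.forEach((child) =>
    child.on("exit", () => children.forEach((other) => other.kill())),
  );
  return children;
}

// Usage (not executed here): startAll(planServices(3000));
const plan = planServices(3000);
console.log(plan.map((svc) => `${svc.name}:${svc.port}`).join(", "));
// prints "frontend:3000, orchestrator:3001, target-app:3002"
```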

What We Learned

Agents work best in teams. A single LLM is unreliable. Three agents with different roles (analysis, generation, validation) catch mistakes the others miss.
Speed matters, but safety matters more. Validating in isolation before production is non-negotiable. It's worth the 2 extra seconds.
Humans are still essential. Fully autonomous systems fail. A human approval gate makes the system trustworthy and enterprise-ready.
Iteration beats perfection. Agents don't need to get it right the first time. A feedback loop (Skeptic rejects → Fixer improves) produces better results.
Observability is the product. Being able to explain why the system made a decision (Novus audit trail) is as important as the decision itself.