NeuroScale Ops

AI-powered incident triage and auto-remediation for platform teams


Inspiration

Platform teams lose 45 to 90 minutes per incident to coordination, not debugging.

The typical flow looks like this: an alert fires, an SRE manually runs kubectl, posts in Slack for approval, waits, and then applies a fix. Every step is manual, and every minute of delay compounds.

The question we asked: what if AI triaged incidents with confidence scoring, routed them to the right person with SLAs, and executed fixes automatically with a full audit trail?

That's NeuroScale Ops.


What It Does

A 7-stage UiPath Maestro Case that handles the full incident lifecycle end to end:

Prometheus Alert
    -> Groq AI Triage (94% confidence)
    -> Cost Impact Analysis
    -> One-click SRE Approval
    -> Auto-Remediation (kubectl, ArgoCD, Kyverno)
    -> Post-Mortem Auto-Generated

Results

| Metric | Before | After |
|---|---|---|
| MTTR | 45 min | 15 min |
| SRE hands-on work | 60 min | 2 min |
| Incident types handled | 0 automated | 5 |
| Audit trail | None | Full, compliance-ready |

Incident types covered: OOMKill, CrashLoop, Policy Violation, Cost Spike, Deployment Failure
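
As a quick sanity check, the reductions implied by the before/after figures in the Results table can be computed directly. The helper name `percent_reduction` is ours for illustration; the inputs are the minute values quoted in the table.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Minute values taken from the Results table.
mttr_reduction = percent_reduction(45, 15)
hands_on_reduction = percent_reduction(60, 2)

print(f"MTTR reduction: {mttr_reduction:.1f}%")              # 66.7%
print(f"SRE hands-on reduction: {hands_on_reduction:.1f}%")  # 96.7%
```

The hands-on figure rounds to roughly 97%.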


How We Built It

| Layer | Technology | Role |
|---|---|---|
| Orchestration | UiPath Maestro | Stateful case management, SLAs, and escalation (v1.0.0) |
| Agents | Python | Detector, triage, and remediation with structured state passing |
| AI / LLM | Groq llama-3.3-70b | Root cause analysis with an 85% confidence gate before any action |
| Cost analysis | OpenCost API | Real per-namespace cost impact calculated per incident |
| Human approval | UiPath Apps | In-context approval forms surfaced at Stage 4 and Stage 6 |
| GitOps execution | ArgoCD, kubectl, Kyverno | Actual remediation actions against the cluster |
| Testing | pytest | 17/17 tests passing across detector, triage, cost, remediation, and E2E |
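
To make the structured state passing and the 85% gate concrete, here is a minimal sketch of how a triage reply might be validated before anything downstream trusts it. This is not the project's actual code: the `TriageResult` fields, the JSON keys, the incident type identifiers, and the choice of `>=` at the threshold are all assumptions for illustration. Passing the gate only makes an incident eligible to continue; the SRE approval step still applies.

```python
import json
from dataclasses import dataclass

CONFIDENCE_GATE = 0.85  # the 85% gate from the table above

# Hypothetical identifiers for the five incident types named in this write-up.
KNOWN_INCIDENT_TYPES = {
    "OOMKill", "CrashLoop", "PolicyViolation", "CostSpike", "DeploymentFailure",
}


@dataclass(frozen=True)
class TriageResult:
    """Structured state handed from the triage agent to later stages (assumed shape)."""
    incident_type: str
    root_cause: str
    confidence: float  # 0.0 to 1.0


def parse_triage(raw: str) -> TriageResult:
    """Parse and validate the model's JSON reply; raise ValueError on bad data."""
    data = json.loads(raw)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if data["incident_type"] not in KNOWN_INCIDENT_TYPES:
        raise ValueError(f"unknown incident type: {data['incident_type']}")
    return TriageResult(data["incident_type"], data["root_cause"], confidence)


def passes_confidence_gate(result: TriageResult) -> bool:
    """True when the result is confident enough to continue toward remediation."""
    return result.confidence >= CONFIDENCE_GATE


reply = '{"incident_type": "OOMKill", "root_cause": "memory limit too low", "confidence": 0.94}'
result = parse_triage(reply)
print(passes_confidence_gate(result))  # True
```

Rejecting malformed replies at the parsing boundary keeps a bad model output from ever reaching the remediation agent.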

Challenges We Ran Into

  1. Groq prompt design - Getting reliable, nuanced confidence scores across 5 incident types required careful prompt iteration and structured output schemas.
  2. Maestro SLA logic - Wiring 15-minute approval gates with automatic escalation paths inside Maestro's case model took significant trial and error.
  3. Real cost impact modeling - Each incident type has different cost implications; building accurate per-namespace models without over-simplifying was non-trivial.
  4. Circuit breaker safety - CrashLoop escalates at MEDIUM confidence by design, which required careful logic to avoid both false positives (remediating when we should escalate) and false negatives (escalating when a fix is safe).
  5. E2E test coverage - Achieving full scenario coverage across all five incident types with realistic simulated state required a custom test harness.
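
The 15-minute approval gate from item 2 can be expressed as a small decision function. This is plain Python illustrating the logic only; it is not Maestro configuration or a UiPath API, and the names `ApprovalStatus` and `approval_status` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

APPROVAL_SLA = timedelta(minutes=15)  # approval window described above


class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    ESCALATED = "escalated"


def approval_status(
    requested_at: datetime,
    approved_at: Optional[datetime],
    now: datetime,
) -> ApprovalStatus:
    """Classify an approval request against the SLA deadline."""
    deadline = requested_at + APPROVAL_SLA
    if approved_at is not None and approved_at <= deadline:
        return ApprovalStatus.APPROVED
    if now > deadline:
        # No in-time approval: hand the case to the escalation path.
        return ApprovalStatus.ESCALATED
    return ApprovalStatus.PENDING


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(approval_status(t0, None, t0 + timedelta(minutes=16)).value)  # escalated
```

Passing `now` in as a parameter, rather than reading the clock inside the function, makes the deadline logic easy to exercise in tests with simulated time.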

Accomplishments We're Proud Of

  1. Published a real Maestro Case on Automation Cloud (v1.0.0, not a prototype or demo)
  2. All 5 incident types handled within a single orchestrated case
  3. 97% reduction in manual SRE work per incident
  4. Safety-first design with an 85% confidence gate required before any auto-remediation runs
  5. 17/17 tests passing with a full audit trail that is compliance-ready out of the box
  6. Clean multi-layer architecture with no duct-tape integrations between components

What We Learned

  1. Maestro is the primitive SREs actually need. Stateful, audited, and SLA-bound case management fits incident response far better than ad-hoc scripts or chat-ops workflows.
  2. LLM confidence scoring matters more than raw accuracy. Knowing when the model is uncertain is more valuable than getting it right most of the time.
  3. Cost analysis changes remediation decisions. Without financial context, the system would make technically correct but operationally wrong choices.
  4. Human gates need full context. Approval forms must surface AI reasoning, incident details, cost impact, and SLA countdown together or approvers make worse decisions.
  5. Circuit breakers beat blind auto-remediation. Escalating on uncertainty is always safer than acting on low-confidence signals.
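
The circuit-breaker idea in item 5 can be sketched as a routing function over confidence bands. Only the 85% gate comes from this write-up; the lower bound of the MEDIUM band (`MEDIUM_FLOOR`), the band names as an enum, and the action names are assumptions made for illustration.

```python
from enum import Enum

AUTO_REMEDIATION_GATE = 0.85  # from the write-up
MEDIUM_FLOOR = 0.60           # hypothetical: the MEDIUM band's lower bound is not stated


class Band(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class Action(Enum):
    PROCEED = "proceed_to_approval"
    ESCALATE = "escalate_to_human"


def band_for(confidence: float) -> Band:
    """Map a 0.0-1.0 confidence score to a band."""
    if confidence >= AUTO_REMEDIATION_GATE:
        return Band.HIGH
    if confidence >= MEDIUM_FLOOR:
        return Band.MEDIUM
    return Band.LOW


def route(confidence: float) -> Action:
    """Only HIGH confidence continues; anything less trips the breaker."""
    if band_for(confidence) is Band.HIGH:
        return Action.PROCEED
    return Action.ESCALATE


print(route(0.94).value)  # proceed_to_approval
print(route(0.70).value)  # escalate_to_human
```

Under these assumed bands, a MEDIUM-confidence CrashLoop diagnosis is escalated to a human rather than remediated automatically.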

What's Next

  • Multi-cluster management for broader operational control across environments
  • Custom runbooks tailored to individual teams, services, and operational requirements
  • Predictive alerting to identify potential incidents before they fully materialize
  • Slack integration for in-thread approvals, notifications, and collaboration
  • Cost forecasting with proactive budget alerts and optimization recommendations
  • On-call rotation integration to streamline incident ownership and escalation
  • Open-source reference implementation to encourage adoption and community contributions

Built With

  • 1.29
  • 3.11+
  • api
  • apps
  • argocd
  • groq
  • kubernetes
  • kyverno
  • llama-3.3-70b
  • llm
  • maestro
  • opencost
  • prometheus
  • python
  • slack/pagerduty
  • uipath