NeuroScale Ops
AI-powered incident triage and auto-remediation for platform teams
Inspiration
Platform teams waste 45-90 minutes per incident in coordination, not debugging.
The typical flow looks like this: alert fires, an SRE manually runs kubectl, posts in Slack for approval, waits, then applies a fix. Every step is manual and every minute compounds.
The question we asked: what if AI triaged incidents with confidence scoring, routed them to the right person with SLAs, and executed fixes automatically with a full audit trail?
That's NeuroScale Ops.
What It Does
A 7-stage UiPath Maestro Case that handles the full incident lifecycle end to end:
Prometheus Alert
-> Groq AI Triage (94% confidence)
-> Cost Impact Analysis
-> One-click SRE Approval
-> Auto-Remediation (kubectl, ArgoCD, Kyverno)
-> Post-Mortem Auto-Generated
Results
| Metric | Before | After |
|---|---|---|
| MTTR | 45 min | 15 min |
| SRE hands-on work | 60 min | 2 min |
| Incident types handled | 0 automated | 5 |
| Audit trail | None | Full, compliance-ready |
Incident types covered: OOMKill, CrashLoop, Policy Violation, Cost Spike, Deployment Failure
How We Built It
| Layer | Technology | Role |
|---|---|---|
| Orchestration | UiPath Maestro | Stateful case management, SLAs, and escalation (v1.0.0) |
| Agents | Python | Detector, triage, and remediation with structured state passing |
| AI / LLM | Groq llama-3.3-70b |
Root cause analysis with an 85% confidence gate before any action |
| Cost analysis | OpenCost API | Real per-namespace cost impact calculated per incident |
| Human approval | UiPath Apps | In-context approval forms surfaced at Stage 4 and Stage 6 |
| GitOps execution | ArgoCD, kubectl, Kyverno | Actual remediation actions against the cluster |
| Testing | pytest | 17/17 tests passing across detector, triage, cost, remediation, and E2E |
Challenges We Ran Into
- Groq prompt design - Getting reliable, nuanced confidence scores across 5 incident types required careful prompt iteration and structured output schemas.
- Maestro SLA logic - Wiring 15-minute approval gates with automatic escalation paths inside Maestro's case model took significant trial and error.
- Real cost impact modeling - Each incident type has different cost implications; building accurate per-namespace models without over-simplifying was non-trivial.
- Circuit breaker safety - CrashLoop intentionally escalates at MEDIUM confidence by design, which required careful logic to avoid false positives in both directions.
- E2E test coverage - Achieving full scenario coverage across all five incident types with realistic simulated state required a custom test harness.
Accomplishments We're Proud Of
- Published a real Maestro Case on Automation Cloud (v1.0.0, not a prototype or demo)
- All 5 incident types handled within a single orchestrated case
- 97% reduction in manual SRE work per incident
- Safety-first design with an 85% confidence gate required before any auto-remediation runs
- 17/17 tests passing with a full audit trail that is compliance-ready out of the box
- Clean multi-layer architecture with no duct-tape integrations between components
What We Learned
- Maestro is the primitive SREs actually need. Stateful, audited, and SLA-bound case management fits incident response far better than ad-hoc scripts or chat-ops workflows.
- LLM confidence scoring matters more than raw accuracy. Knowing when the model is uncertain is more valuable than getting it right most of the time.
- Cost analysis changes remediation decisions. Without financial context, the system would make technically correct but operationally wrong choices.
- Human gates need full context. Approval forms must surface AI reasoning, incident details, cost impact, and SLA countdown together or approvers make worse decisions.
- Circuit breakers beat blind auto-remediation. Escalating on uncertainty is always safer than acting on low-confidence signals.
What's Next
- Multi-cluster management for broader operational control across environments
- Custom runbooks tailored to individual teams, services, and operational requirements
- Predictive alerting to identify potential incidents before they fully materialize
- Slack integration for in-thread approvals, notifications, and collaboration
- Cost forecasting with proactive budget alerts and optimization recommendations
- On-call rotation integration to streamline incident ownership and escalation
- Open-source reference implementation to encourage adoption and community contributions ```
Built With
- 1.29
- 3.11+
- api
- apps
- argocd
- groq
- kubernetes
- kyverno
- llama-3.3-70b
- llm
- maestro
- opencost
- prometheus
- python
- slack/pagerduty
- uipath
Log in or sign up for Devpost to join the conversation.