Inspiration
It's 3 AM. A PagerDuty alert fires. An engineer wakes up, bleary-eyed, staring at a wall of logs trying to figure out why the payment service is timing out. By the time they identify a database lock and kill the offending query, it's 4:30 AM—and thousands of transactions have failed.
This scenario plays out every night across thousands of engineering teams. The modern microservices landscape has created an explosion of complexity that has outpaced human cognitive capacity. We're drowning in metrics, logs, and traces, yet Mean Time to Resolution (MTTR) keeps climbing.
We asked ourselves: What if the system could heal itself?
Not just alert a human. Not just suggest a fix. But actually observe, reason, decide, and act—autonomously, safely, and verifiably.
That question became Vigil.
What it does
Vigil is an Autonomous Reliability Engine that transforms incident response from "Human-in-the-Loop" to "Human-on-the-Loop." It doesn't just watch your infrastructure—it understands it, reasons about it, and heals it.
The Autonomic Loop (9-Stage Incident Lifecycle):
- TRIGGERED → Alert received from monitoring systems via PagerDuty
- INVESTIGATING → Cleric AI begins root cause analysis
- HYPOTHESIS_RECEIVED → Investigation findings posted to incident timeline
- CONTEXT_RETRIEVED → Sanity CMS returns relevant runbooks and past incidents
- SYNTHESIZING → Claude generates remediation using Chain-of-Thought reasoning
- EXECUTING → Coder provisions isolated Kubernetes pod to run the fix
- VERIFYING → Lightpanda verification swarm confirms the fix works
- RESOLVED → Incident automatically closed with full audit trail
- ESCALATED → Human intervention requested when confidence is low
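As a sketch, the lifecycle above can be modeled as a guarded transition table. The state names come from the list; the allowed-transition map is our illustrative assumption, not Vigil's actual code:

```javascript
// Hypothetical sketch of the 9-stage incident lifecycle as a transition table.
// State names come from the list above; the allowed-transition map is illustrative.
const TRANSITIONS = {
  TRIGGERED: ["INVESTIGATING"],
  INVESTIGATING: ["HYPOTHESIS_RECEIVED", "ESCALATED"],
  HYPOTHESIS_RECEIVED: ["CONTEXT_RETRIEVED", "ESCALATED"],
  CONTEXT_RETRIEVED: ["SYNTHESIZING"],
  SYNTHESIZING: ["EXECUTING", "ESCALATED"],
  EXECUTING: ["VERIFYING"],
  VERIFYING: ["RESOLVED", "ESCALATED"],
  RESOLVED: [],
  ESCALATED: [],
};

function advance(incident, nextState) {
  const allowed = TRANSITIONS[incident.state] || [];
  if (!allowed.includes(nextState)) {
    throw new Error(`Illegal transition ${incident.state} -> ${nextState}`);
  }
  return { ...incident, state: nextState };
}
```

Encoding the legal transitions as data makes out-of-order webhook deliveries fail loudly instead of silently corrupting an incident's state.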
The Confidence Protocol:
Vigil doesn't blindly execute. It implements a multi-layer safety system:
- Auto-execute: Cleric confidence > 90% AND runbook match > 85% AND risk = LOW
- Human approval: High-risk operations or low-confidence scenarios trigger Slack approval workflow
- PII Protection: Skyflow redacts sensitive data before any LLM processing
The result? Engineers wake up to resolved incidents, not active fires.
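The gate above can be sketched as a single predicate. The thresholds come from the list; the function shape and field names are our assumptions:

```javascript
// Sketch of the Confidence Protocol gate. Thresholds mirror the list above;
// the function signature and field names are illustrative, not Vigil's API.
function decideAction({ clericConfidence, runbookMatch, risk }) {
  const autoOk =
    clericConfidence > 0.90 &&
    runbookMatch > 0.85 &&
    risk === "LOW";
  return autoOk ? "AUTO_EXECUTE" : "REQUEST_HUMAN_APPROVAL";
}
```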
How we built it
We architected Vigil as three parallel development streams, integrated through a custom Orchestration Control Plane (OCP).
Technology Stack:
| Layer | Technology | Purpose |
|---|---|---|
| Runtime | Node.js 20, Express.js | Core OCP server |
| State | Redis 7 (ioredis) | Incident state machine, caching |
| Database | PostgreSQL 15 | Coder workspace metadata |
| Incident Management | PagerDuty | Webhooks, alerts, annotations |
| Knowledge Base | Sanity CMS | Runbooks, GROQ queries, semantic search |
| Reasoning | Anthropic Claude (claude-3-sonnet) | Chain-of-Thought remediation synthesis |
| Fallback AI | Google Gemini | Alternative LLM via factory pattern |
| Privacy | Skyflow | PII detection and redaction |
| Execution | Coder v2.28.3 + Terraform | Ephemeral Kubernetes workspaces |
| Verification | Lightpanda | Zig-based headless browser (10x faster than Chrome) |
| Approval | Slack | Human-in-the-loop workflow |
| Infrastructure | Docker, Kubernetes, Minikube | Container orchestration |
Architecture:
```
                [Datadog Alert]
                       ↓
                 [PagerDuty] ──webhook──→ [OCP: Express Server]
                                                   ↓
                                        [Redis: State Machine]
                                                   ↓
                                  ┌────────────────┼────────────────┐
                                  ↓                ↓                ↓
                            [Cleric: RCA]  [Sanity: Context]  [Skyflow: PII]
                                  └────────────────┼────────────────┘
                                                   ↓
                                          [Claude: Reasoning]
                                                   ↓
                                        ┌──────────┴──────────┐
                                        ↓                     ↓
                               [High Confidence]      [Low Confidence]
                                        ↓                     ↓
                               [Coder: Execute]       [Slack: Approve]
                                        ↓                     ↓
                             [Lightpanda: Verify] ←───────────┘
                                        ↓
                              [PagerDuty: Resolve]
```
Three Parallel Workstreams:
- Stream A (Signal & Control): OCP server, PagerDuty webhooks, HMAC signature verification, Redis state management, Cleric hypothesis parsing
- Stream B (Cognition & Memory): Sanity integration, runbook ingestion pipeline, Claude prompt engineering, edge case detection, Lightpanda verification library
- Stream C (Action & Environment): Coder deployment on Kubernetes, Terraform templates, NetworkPolicy isolation, SSH-based execution
Challenges we ran into
1. The Confidence Problem
How do you know when an AI is hallucinating versus when it has found the true root cause? An overconfident system creates more noise than signal.
Solution: We implemented multi-source validation with configurable thresholds. Cleric's hypothesis must correlate with Sanity's runbooks. We built an edge case detection system (edge-cases.js - 369 lines) that identifies ambiguous scenarios and automatically escalates to humans.
2. PII in Production Logs
Production incidents often contain sensitive data—API keys, customer emails, session tokens. Sending raw logs to an LLM is a privacy nightmare.
Solution: We integrated Skyflow's Detect API to identify and redact PII before any data reaches Claude. The original sensitive values never leave our infrastructure.
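Skyflow's Detect API does the real work here; as a stand-in, a regex scrubber illustrates the redact-before-LLM flow. The patterns and placeholder tokens below are entirely our own, not Skyflow's:

```javascript
// Regex-based stand-in for the "redact before LLM" step. This is NOT the
// Skyflow Detect API, just an illustration of the data flow; the patterns
// and placeholder tokens are our own.
const PATTERNS = [
  { name: "EMAIL", re: /[\w.+-]+@[\w-]+\.[\w.]+/g },
  { name: "API_KEY", re: /\b(?:sk|pk)_[A-Za-z0-9]{16,}\b/g },
  { name: "BEARER", re: /Bearer\s+[A-Za-z0-9._-]+/g },
];

function redact(text) {
  let out = text;
  for (const { name, re } of PATTERNS) {
    out = out.replace(re, `[REDACTED_${name}]`);
  }
  return out;
}
```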
3. Asynchronous Orchestration Hell
Cleric posts findings as PagerDuty annotations (not a synchronous API). Coder workspaces take time to provision. The entire system is event-driven across 5+ services.
Solution: We built a 9-stage state machine in Redis with fallback to in-memory storage. Each incident maintains its lifecycle state as webhooks arrive asynchronously. The OCP is stateless but the state is persistent.
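The fallback behavior can be sketched as a small store class. The shape is illustrative and assumes an ioredis-style client with `set`/`get`:

```javascript
// Sketch of the state store with in-memory fallback. If the Redis client is
// missing or errors, incident state degrades to a process-local Map.
// (Class shape is illustrative; Vigil's actual module may differ.)
class IncidentStore {
  constructor(redisClient = null) {
    this.redis = redisClient;
    this.memory = new Map(); // fallback when Redis is unavailable
  }
  async setState(incidentId, state) {
    if (this.redis) {
      try {
        await this.redis.set(`incident:${incidentId}:state`, state);
        return;
      } catch (_) { /* fall through to memory */ }
    }
    this.memory.set(incidentId, state);
  }
  async getState(incidentId) {
    if (this.redis) {
      try {
        return await this.redis.get(`incident:${incidentId}:state`);
      } catch (_) { /* fall through to memory */ }
    }
    return this.memory.get(incidentId) ?? null;
  }
}
```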
4. Blast Radius Containment
Executing AI-generated code on production is terrifying. One wrong command could cascade into catastrophe.
Solution: Every remediation runs in an ephemeral Coder workspace, a Kubernetes pod with:
- NetworkPolicy denying all ingress and limiting egress
- Non-root security context
- Short-lived credentials injected via Terraform rich_parameter_values
- Automatic destruction after execution
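An illustrative default-deny policy for those pods might look like the following; the namespace, pod labels, and egress allowance are our assumptions, not Vigil's actual manifests:

```yaml
# Illustrative default-deny policy for remediation workspaces. The namespace
# and pod label are assumptions, not Vigil's actual manifests.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vigil-workspace-lockdown
  namespace: vigil-exec
spec:
  podSelector:
    matchLabels:
      app: vigil-workspace
  policyTypes: ["Ingress", "Egress"]
  ingress: []            # empty list: deny all ingress
  egress:
    - to:                # allow DNS only, as an example of "limited egress"
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
```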
5. Knowledge Poisoning Prevention
If a bad fix "works" temporarily (e.g., a restart that masks a memory leak), the system might learn it as valid.
Solution: Our incident-learner.js implements delayed ingestion with a stability period. If the alert re-fires within 24 hours, the previous resolution is marked as "ineffective" and won't be suggested again.
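The stability-period logic can be sketched as a small classifier. The 24-hour window comes from the text; the function shape is our assumption:

```javascript
// Sketch of the delayed-ingestion check in the learning loop. A resolution
// is only learned as valid if the same alert does not re-fire within the
// stability window (24h per the text). Function shape is illustrative.
const STABILITY_WINDOW_MS = 24 * 60 * 60 * 1000;

function classifyResolution(resolvedAt, refiredAt = null, now = Date.now()) {
  if (refiredAt !== null && refiredAt - resolvedAt < STABILITY_WINDOW_MS) {
    return "INEFFECTIVE"; // masked symptom; never suggest this fix again
  }
  if (now - resolvedAt >= STABILITY_WINDOW_MS) {
    return "VALIDATED";   // stable: safe to ingest into the knowledge base
  }
  return "PENDING";       // still within the stability period
}
```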
Accomplishments that we're proud of
- 2,172+ lines of production-grade service code across 12 integrated services
- End-to-end autonomous remediation with a 9-stage lifecycle
- Chain-of-Thought reasoning that forces Claude to analyze before acting—parseable confidence levels and risk assessment
- "First, Do No Harm" prompting with built-in safety rules (never DROP TABLE, never rm -rf, never disable auth)
- Verification Swarm using Lightpanda—hundreds of concurrent browser sessions at 10x the speed and 1/10th the memory of Chrome
- Privacy-first architecture with Skyflow PII redaction before any LLM processing
- Multi-LLM support via factory pattern (Claude primary, Gemini fallback)
- Zero-trust execution in isolated Kubernetes pods with network policies
What we learned
Integration > Innovation: The power isn't in any single tool—it's in the orchestration. Making PagerDuty, Sanity, Claude, Coder, Skyflow, and Lightpanda work as a unified system required deep API understanding and careful state management.
Confidence scoring is non-negotiable: An autonomous system without confidence handling creates chaos. The "Human-on-the-Loop" model only works when the system knows its own limitations.
Prompt engineering is systems engineering: Claude's effectiveness depended entirely on structured prompts—the 5-step Chain-of-Thought process, the "Cautious Senior SRE" persona, and explicit safety rules embedded in system-prompts.js.
Async is hard, state machines help: Event-driven architectures across multiple services need rigorous state tracking. Our 9-stage Redis state machine became the source of truth.
Security can't be an afterthought: HMAC signature verification, PII redaction, non-root containers, and network policies had to be designed in from day one.
Fallbacks are essential: Redis unavailable? Fall back to in-memory. Claude rate-limited? Fall back to Gemini. Every integration needs a graceful degradation path.
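The LLM-layer degradation pattern, as a sketch; the provider objects are assumed to expose a `complete(prompt)` method, which the real Anthropic and Gemini clients would be wrapped to match:

```javascript
// Sketch of the multi-LLM fallback (Claude primary, Gemini fallback).
// Provider objects are assumed to expose a `complete(prompt)` method;
// the actual SDK clients would be wrapped to fit this shape.
function makeLlm(primary, fallback) {
  return {
    async complete(prompt) {
      try {
        return await primary.complete(prompt);
      } catch (err) {
        // e.g. rate limit or outage on the primary: degrade to fallback
        return await fallback.complete(prompt);
      }
    },
  };
}
```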
What's next for Vigil
Short-term:
- Complete Stream C Week 3-4: Full incident-triggered automation and integration testing
- Expand runbook coverage (currently: high-latency-api, memory-exhaustion, database-connection-pool)
- Production hardening and performance optimization
Medium-term:
- Predictive healing: Use historical patterns to remediate before alerts fire
- Multi-cloud execution: Extend Coder templates for AWS EKS, GCP GKE, Azure AKS
- Feedback learning loop: Engineer ratings on remediations to continuously improve prompts
- OneUptime integration: Alternative to PagerDuty for self-hosted monitoring
Long-term:
- Autonomous chaos engineering: Vigil proactively tests its own healing capabilities
- Cross-organization learning: Anonymized incident patterns (with consent) to build collective intelligence
- Custom LLM fine-tuning: Train specialized models on SRE remediation patterns
Vigil isn't just a tool—it's a shift in philosophy. From reactive to proactive. From firefighting to oversight. From "Human-in-the-Loop" to "Human-on-the-Loop."
Always watching. Always healing.
Built With
- anthropic-claude-api
- coder
- docker
- express.js
- google-gemini-api
- kubernetes
- lightpanda
- node.js
- pagerduty-api
- postgresql
- puppeteer
- redis
- sanity-cms
- skyflow
- slack-api
- terraform