Reflex: The Self-Correcting On-Call Engineer
Reflex turns on-call from “page someone and pray they’ve seen it before” into a system that investigates, validates a fix safely, and learns from every incident.
Inspiration
For decades, “on-call” has meant sleepless nights and 3 AM panic. Tools like PagerDuty helped teams route alerts, but the actual work still depends on institutional memory: someone remembers that this crash loop usually means a bad env var, or that connection spikes go away after a pool reset.
In 2026, that feels backwards. Why should a human be the first line of defense for patterns we’ve already seen?
PagerDuty started with a simple model: computers detect incidents, then humans investigate and fix them. We think that split should flip: computers should investigate and validate the remediation, and humans should supervise only the final step.
So we built Reflex: an AI-native reliability engineer. Instead of acting like a notification hose, Reflex treats incidents as a repeatable loop: detect, diagnose, test a mitigation safely, verify, deploy in the right environment, and remember, so the next time it sees the same signature, it responds faster and with less guesswork.
What it does
Reflex monitors real-time logs and automatically spins up an investigation agent when something turns critical. It only escalates to a human when it can’t prove a safe fix.
Unlike coordination-first incident tools, Reflex is the first responder: it performs the investigation, proves a safe mitigation, and only then asks for a human decision.
Core capabilities:
- Real-time incident detection (logs-first): Watches a live log stream and flags critical signatures.
- Autonomous triage and root cause hypothesis: Extracts the relevant stack trace / failing component and proposes likely causes.
- Safe remediation in an isolated sandbox: Tests candidate fixes in a secure environment before anything touches production.
- Human-in-the-loop deployment: If validation passes, Reflex asks for approval (or runs in autopilot for the demo).
- Memory that compounds: Stores the signature and the validated remediation as reusable, machine-executable behavior, so repeats resolve faster without human recall.
- Postmortem draft: Converts the incident trace into a structured postmortem template of timeline, root cause, remediation, and follow-ups so humans review instead of starting from nothing.
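The "signature" idea above can be sketched as a normalize-then-hash step: strip volatile details so repeat incidents collapse to the same memory key. The specific regex rules below are assumptions for illustration, not Reflex's actual matcher:

```python
import hashlib
import re

def extract_signature(log_line: str) -> str:
    """Normalize a critical log line into a stable incident signature.

    Timestamps, hex addresses, and numeric IDs are replaced with
    placeholders so two occurrences of the same failure hash identically.
    """
    normalized = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.]+Z?", "<ts>", log_line)
    normalized = re.sub(r"0x[0-9a-fA-F]+", "<addr>", normalized)
    normalized = re.sub(r"\b\d+\b", "<n>", normalized)
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

# Two instances of the same crash, days apart, resolve to one signature.
a = extract_signature("2026-02-14T03:12:09Z worker-7 CRITICAL OOMKilled pid 4821")
b = extract_signature("2026-02-15T11:40:33Z worker-2 CRITICAL OOMKilled pid 977")
assert a == b
```

With a stable key like this, memory lookup for "have we seen this before?" becomes a dictionary hit instead of a human trying to remember a past incident.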
What this means for software engineers:
- Fewer wake-ups: humans get paged only for high-signal edge cases.
- Faster MTTR: by the time a human sees it, Reflex has already narrowed the diagnosis and tested a mitigation.
- Less postmortem fatigue: the “lesson learned” becomes reusable behavior, not just a doc people forget to read.
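The postmortem draft mentioned above can be sketched as a template fill over the incident trace, so the human starts from a structured draft rather than a blank page. The field names here are illustrative assumptions, not Reflex's actual schema:

```python
from datetime import datetime, timezone

def draft_postmortem(incident: dict) -> str:
    """Render an incident trace into a pre-filled postmortem draft
    with timeline, root cause, remediation, and follow-ups."""
    lines = [
        f"# Postmortem: {incident['signature']}",
        f"Generated: {datetime.now(timezone.utc).date().isoformat()}",
        "",
        "## Timeline",
        *[f"- {ts}: {event}" for ts, event in incident["timeline"]],
        "",
        "## Root cause",
        incident["root_cause"],
        "",
        "## Remediation",
        incident["remediation"],
        "",
        "## Follow-ups",
        *[f"- [ ] {item}" for item in incident["follow_ups"]],
    ]
    return "\n".join(lines)

doc = draft_postmortem({
    "signature": "crash-loop: bad env var",
    "timeline": [("03:12", "crash loop detected"), ("03:14", "fix validated in sandbox")],
    "root_cause": "Missing DATABASE_URL after config rollout.",
    "remediation": "Restored env var; redeployed service.",
    "follow_ups": ["Add config validation to CI"],
})
assert "## Root cause" in doc
```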
How it works (the cycle)
- Detect: Reflex ingests logs and triggers on a critical status or known error patterns.
- Diagnose: It isolates the error signature (stack trace + service context + recent changes when available).
- Retrieve memory: Checks whether this signature (or a close match) has worked before.
- Execute in sandbox: Spins up an isolated environment to reproduce and test a mitigation safely.
- Validate: Runs a verification step (tests, health checks, or a replay of the failure).
- Deploy (with guardrails): If validation passes, Reflex requests approval, then applies the remediation.
- Store the lesson: Saves the signature, the actions tried, their outcomes, and the best action with its confidence.
- Strengthen policy: Updates the action-selection policy so repeat incidents converge faster.
- Draft the postmortem: Auto-generates a pre-filled postmortem from the incident transcript, so humans edit instead of writing from scratch.
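The cycle above can be condensed into a single pass. Everything injectable here is a stand-in: `propose` for the LLM agent, `validate_in_sandbox` for the E2B step, `approve` for the human-in-the-loop gate; this is a sketch of the control flow, not the real orchestration code:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Maps incident signatures to the best validated remediation so far."""
    best_action: dict = field(default_factory=dict)

    def recall(self, signature):
        return self.best_action.get(signature)

    def store(self, signature, action):
        self.best_action[signature] = action

def handle_incident(signature, memory, propose, validate_in_sandbox, approve, deploy):
    """One pass through the Reflex cycle:
    recall -> propose -> validate -> approve -> deploy -> remember."""
    action = memory.recall(signature) or propose(signature)  # retrieve memory / diagnose
    if not validate_in_sandbox(action):                      # execute + validate
        return "escalate"                                    # no proven fix: page a human
    if approve(action):                                      # human-in-the-loop gate
        deploy(action)
        memory.store(signature, action)                      # store the lesson
        return "resolved"
    return "declined"

mem = Memory()
result = handle_incident(
    "sig-123", mem,
    propose=lambda s: "restart-connection-pool",
    validate_in_sandbox=lambda a: True,
    approve=lambda a: True,
    deploy=lambda a: None,
)
assert result == "resolved"
```

The second time `sig-123` fires, `memory.recall` short-circuits the diagnosis step, which is where the "repeats resolve faster" claim comes from.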
How we built it
Our tech stack combines modern tools to deliver a full “mission control” on-call experience:
- Frontend: A Streamlit dashboard that unifies live logs, a collapsible investigation timeline, sandbox output, deploy approval with autopilot, and a memory view showing what Reflex has learned and is reusing in real time.
- Backend: Node.js orchestrates ingestion, incident state, and agent coordination, structuring raw signals into an incident workflow and continuously updating and retrieving memory across runs.
- Infrastructure + Memory: Modal provides scalable compute for agent workloads and persistent storage so successful mitigations carry forward instead of resetting after each page.
- Sandbox: E2B enables safe validation by reproducing issues, applying patches, and running checks in isolation before escalation or deployment.
- Agent: An LLM handles investigation and patch proposals, while a lightweight contextual bandit selects the most reliable action for each incident signature and improves with experience.
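A minimal sketch of the "lightweight contextual bandit" idea: epsilon-greedy action selection keyed by incident signature, where successes reinforce the action that actually fixed the incident. The action space and reward here are illustrative assumptions:

```python
import random
from collections import defaultdict

class SignatureBandit:
    """Epsilon-greedy contextual bandit keyed by incident signature."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Per (signature, action): running success and attempt counts.
        self.wins = defaultdict(int)
        self.tries = defaultdict(int)

    def select(self, signature):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)     # explore a different fix
        def score(action):
            t = self.tries[(signature, action)]
            return self.wins[(signature, action)] / t if t else 0.0
        return max(self.actions, key=score)          # exploit best-known fix

    def update(self, signature, action, success):
        self.tries[(signature, action)] += 1
        self.wins[(signature, action)] += int(success)

bandit = SignatureBandit(["restart-pool", "rollback-deploy", "scale-up"])
for _ in range(20):
    action = bandit.select("db-conn-spike")
    # Simulated environment: only the pool restart resolves this signature.
    bandit.update("db-conn-spike", action, success=(action == "restart-pool"))
```

After a handful of incidents with the same signature, the greedy branch converges on the remediation that has actually worked, which is the "improves with experience" behavior described above.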
Challenges we ran into
- Making AI behavior trustworthy: We had to turn “agent work” into something visible: clear steps, a sandbox transcript, and explicit validation before deploy.
- Balancing density with clarity: Incident tooling gets overwhelming fast. We iterated on a split-screen layout so logs, investigation, and deployment decisions can be monitored without cognitive overload.
- Simulating real infrastructure constraints: Not every incident is a code bug. Some are database issues, configuration drift, or external dependency failures. We designed Reflex so it escalates intelligently when it can’t validate a safe remediation path.
Accomplishments that we’re proud of
- Built a full on-call experience that moves from alerting through reasoning to validated remediation.
- Designed a human-in-the-loop workflow where the human approves a proven fix instead of starting from zero.
- Implemented a memory + policy layer so the system learns from incidents instead of re-solving the same outage repeatedly.
- Spent a wonderful Valentine's Day with the guys.
What we learned
- Context is king. The quality of an incident response is limited by what the system can remember and retrieve quickly.
- Trust requires visibility. People don’t trust “autonomous fixes” unless they can see validation, constraints, and exactly what changed.
- Hybrid is the real future. The best on-call setup isn't fully human or fully AI; it's an AI that knows when to act and when to pull in the right human with the right context.
What’s next for Reflex
- Deeper Modal integration: connect the UI to live inference endpoints and richer incident workloads.
- Autonomous remediation with stronger guardrails: expand from diagnosis to action, with sandboxed verification and scoped permissions.
- Broader incident coverage: integrate beyond logs into deploy history, runbooks, metrics, and database/dependency checks.
- Better memory structure: separate short-term "incident context" from long-term "policy memory," so Reflex stays grounded during active incidents while still compounding over time.
Repos:
Built With
- cloudflare
- contextual-bandit-(custom-rl-lite)
- e2b
- modal
- node.js
- openai-api-(llm)
- render
- streamlit
- sveltekit