Inspiration

Every SRE knows the drill: your pager goes off, you scramble to check logs, trace the root cause across multiple services, figure out the fix, and execute it — all while half-asleep. Most infrastructure incidents follow predictable patterns: a cache goes down, dependent services start failing, and the fix is often a restart or a config change. Yet it's still handled manually, every single time.

I asked myself: what if an AI agent could do what an on-call engineer does — observe logs, reason about cascading failures, plan a fix, and execute it — autonomously?

That's how Nightwatch was born.

What it does

Nightwatch is an autonomous SRE agent that monitors Docker infrastructure in real time. It:

  • Observes container logs continuously, filtering noise (health checks, graceful shutdowns, startup messages) from real errors using keyword and pattern matching
  • Diagnoses incidents by building incident graphs — structured representations with nodes (affected components), edges (causal chains), and root cause identification
  • Assesses feasibility before planning — a separate capability that determines whether safe remediation is even possible, and can ask the user for missing information
  • Plans remediation as a sequence of Docker commands, with full context about the infrastructure and incident
  • Validates every command against safety rules (no shell injection, no destructive patterns, single container per command)
  • Requests approval — the user reviews the plan and can approve or reject it with feedback
  • Executes the fix and verifies the system is healthy again
  • Learns across sessions — when the agent asks questions (during feasibility assessment or escalation), answers are persisted to a knowledge file and consulted in future incidents
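The noise filtering in the first bullet can be sketched as cheap pattern matching that runs before any LLM call. The patterns below are illustrative, not the actual rule set:

```typescript
// Illustrative sketch of the log noise filter: routine chatter is
// dropped and only lines matching error keywords are kept.
const NOISE_PATTERNS: RegExp[] = [
  /health\s*check/i,
  /graceful(ly)?\s+shut(ting)?\s*down/i,
  /listening on port/i,
];

const ERROR_KEYWORDS = /\b(error|exception|fatal|panic|refused|timeout|oom)\b/i;

function isActionableLogLine(line: string): boolean {
  // Known-benign patterns win first; everything else must look like an error.
  if (NOISE_PATTERNS.some((p) => p.test(line))) return false;
  return ERROR_KEYWORDS.test(line);
}
```

Filtering this early keeps the expensive diagnosis step focused on lines that actually indicate trouble.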

It uses Gemini as its reasoning engine — not as a chatbot, but as an orchestrator that decides which capability to invoke at each step of the incident lifecycle via function calling.

Key differentiators

Escalation system with circuit breaker — When the agent gets stuck (infeasible remediation, repeated failures, missing information), it escalates to the user with a specific reason and what context it needs. If maximum replanning attempts are exhausted, the same mechanism kicks in — the agent never hard-fails, it asks for help. The user can provide context to unblock progress (resetting the attempt counter), or dismiss the incident entirely.
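The circuit-breaker behavior can be sketched in a few lines. The names and the attempt limit here are illustrative, not the actual implementation:

```typescript
// Sketch: after a bounded number of failed replans the agent escalates
// instead of hard-failing; user-provided context resets the counter.
const MAX_ATTEMPTS = 3; // illustrative limit

type Breaker = { attempts: number };

type Next =
  | { kind: "replan" }
  | { kind: "escalate"; reason: string };

function onFailure(b: Breaker): Next {
  b.attempts += 1;
  if (b.attempts >= MAX_ATTEMPTS) {
    return { kind: "escalate", reason: "max replanning attempts exhausted" };
  }
  return { kind: "replan" };
}

// User context unblocks progress by resetting the attempt counter.
function onUserContext(b: Breaker): void {
  b.attempts = 0;
}
```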

Replanning with failure context — When execution fails, verification fails, or the user rejects a plan, the failure context (including user feedback) is fed back into the planner. The planner maintains conversation history across attempts, so it builds on what it already tried rather than starting from scratch.
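The history-threading idea can be sketched as an immutable append. The type names are hypothetical:

```typescript
// Sketch: failure context is appended to the planner's conversation
// history, so each replanning attempt sees everything tried before.
type Turn = { role: "user" | "model"; text: string };

type PlannerState = { history: readonly Turn[] };

function withFailureContext(s: PlannerState, failure: string): PlannerState {
  // Return a new state rather than mutating the old one.
  return {
    history: [...s.history, { role: "user", text: `Previous attempt failed: ${failure}` }],
  };
}
```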

Knowledge persistence — Facts learned from user interactions are saved to knowledge.md and automatically injected into future feasibility assessments and planning sessions. The agent genuinely learns from experience.
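The persistence loop can be sketched as an append-and-inject cycle. The markdown format and function names below are illustrative:

```typescript
import { appendFileSync, readFileSync, existsSync } from "node:fs";

// Sketch: answers the user gives are appended as markdown facts and
// prepended to future prompts so the agent stops re-asking questions.
function rememberFact(file: string, question: string, answer: string): void {
  appendFileSync(file, `- **${question}** ${answer}\n`);
}

function knowledgeForPrompt(file: string): string {
  if (!existsSync(file)) return "";
  return `Known facts about this environment:\n${readFileSync(file, "utf8")}`;
}
```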

Two operating modes

  • Remediate: Full pipeline — analyze, assess feasibility, plan, validate, get user approval, execute, verify. The agent resolves the incident end-to-end.
  • Observe: Diagnosis only — analyze and assess feasibility, then report findings. No planning or execution. Useful for monitoring without intervention.

How I built it

Nightwatch follows a state machine architecture where Gemini acts as the decision-making core:

Logs → Analyze → Assess Feasibility → Plan → Validate → Approve → Execute → Verify → Resolved

Each step is a capability — a self-contained function that takes the current incident state and returns an updated state. Gemini selects which capability to invoke via function calling, making the system modular and predictable.
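The capability contract can be sketched as a pure function over an immutable state. Field names here are illustrative, not the actual types:

```typescript
// Sketch of the capability contract: each step takes a readonly incident
// state and returns a new one instead of mutating in place.
type Phase =
  | "detected" | "analyzed" | "assessed" | "planned"
  | "approved" | "executed" | "verified" | "resolved";

type IncidentResolutionState = Readonly<{
  phase: Phase;
  incidentId: string;
  plan: readonly string[];
}>;

type CapabilityFn = (state: IncidentResolutionState) => IncidentResolutionState;

// A trivial capability: spread into a fresh object, never mutate.
const markResolved: CapabilityFn = (state) => ({ ...state, phase: "resolved" });
```

Because every transition produces a new object, any point in an incident's lifecycle can be inspected or replayed.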

Key technical decisions:

  • TypeScript for type safety across the entire pipeline — incident states, capability contracts, and command validation are all statically typed
  • Gemini function calling for orchestration — the LLM doesn't generate free-form text, it selects structured actions from a capability registry
  • Two-tier LLM architecture — the orchestrator decides which capability to invoke, while individual capability agents (analysis, planning, feasibility) handle the execution of each step with their own system prompts and tool access
  • Docker API tools exposed to Gemini during analysis and planning — the agent can inspect containers, check states, and list services to ground its reasoning in real data
  • Sliding window log batching with backpressure — logs are debounced so related errors from a cascading failure arrive in a single batch, with a hard cap that force-flushes to prevent memory exhaustion
  • Safety-first command validation — every generated command is validated against a strict blocklist before execution: Docker-only, no shell operators, no destructive patterns, single container per command
  • Immutable state — the IncidentResolutionState is a readonly type. Capabilities return new state objects rather than mutating, making the data flow predictable and debuggable
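The safety-first validation in the list above can be sketched as a small gate. The rules shown illustrate the approach, not the full blocklist:

```typescript
// Sketch: generated commands must be docker-only, contain no shell
// operators, and match no destructive patterns before execution.
const SHELL_OPERATORS = /[;&|><`$]/;
const DESTRUCTIVE = /\b(rm|rmi|prune|volume\s+rm)\b/;

function validateCommand(cmd: string): { ok: boolean; reason?: string } {
  const trimmed = cmd.trim();
  if (!trimmed.startsWith("docker ")) return { ok: false, reason: "not a docker command" };
  if (SHELL_OPERATORS.test(trimmed)) return { ok: false, reason: "shell operator detected" };
  if (DESTRUCTIVE.test(trimmed)) return { ok: false, reason: "destructive pattern" };
  return { ok: true };
}
```

A strict allowlist-plus-blocklist gate like this is what makes handing execution to an LLM defensible.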

Gemini Integration

Nightwatch is fundamentally driven by Gemini 3 and would not function without it. Gemini is used as a reasoning and orchestration engine, not as a conversational interface or text generator.

Nightwatch uses gemini-3-pro-preview via the Gemini GenAI SDK, which provides native support for multi-turn conversation history, structured function calling, and reliable agent runtime behavior. The SDK is used directly rather than raw HTTP calls to ensure consistent state handling across incident resolution and replanning cycles.

Orchestration via function calling

At the top level, Gemini orchestrates the entire incident lifecycle using function calling. Each capability in the system — such as analysis, feasibility assessment, planning, validation, execution, escalation, and verification — is registered as a high-level tool. Gemini receives the full incident state as structured JSON and selects exactly one capability to invoke at each step, driving state transitions in a deterministic, inspectable way. Gemini does not generate free-form control logic; all orchestration occurs through structured tool invocation.
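The "exactly one capability per step" contract can be sketched as below. The SDK response shape is simplified to a plain object, and the capability names are illustrative:

```typescript
// Sketch: capabilities are exposed as function declarations; the model's
// response is expected to contain exactly one function call, which is
// checked against the registry before dispatch.
type FunctionDeclaration = { name: string; description: string };
type FunctionCall = { name: string; args?: Record<string, unknown> };

const capabilityDeclarations: FunctionDeclaration[] = [
  { name: "analyze_incident", description: "Build an incident graph from buffered logs" },
  { name: "assess_feasibility", description: "Decide whether safe remediation is possible" },
  { name: "plan_remediation", description: "Produce a validated sequence of docker commands" },
  { name: "escalate", description: "Ask the user for missing context" },
];

// Enforce the structured-invocation contract before driving a transition.
function selectCapability(calls: FunctionCall[]): string {
  if (calls.length !== 1) throw new Error(`expected exactly one function call, got ${calls.length}`);
  if (!capabilityDeclarations.some((d) => d.name === calls[0].name)) {
    throw new Error(`model selected unknown capability: ${calls[0].name}`);
  }
  return calls[0].name;
}
```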

Hierarchical tool usage

Within each capability, Gemini can invoke lower-level primitive tools as needed to ground its reasoning. These include Docker API tools (e.g., container inspection and state queries) as well as other purpose-built tools exposed to the agent. This results in multi-layer tool calling: Gemini reasons at a high level to choose the next capability, and at a lower level to gather real-time system information required to complete that capability.

Reasoning configuration

Gemini is configured with thinking mode enabled (medium to high, depending on the capability) for tasks that require deep reasoning, such as incident diagnosis, feasibility assessment, remediation planning, and replanning after failures. The model reasons over the complete incident state, prior attempts, and system constraints before selecting actions, while still returning only structured outputs.

Structured outputs and validation

All critical Gemini outputs are returned as schema-enforced JSON rather than free-form text. Incident graphs, feasibility assessments, remediation plans, and escalation requests are all validated against predefined schemas before being accepted. Invalid or unsafe outputs trigger replanning, ensuring correctness and safety.
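The accept-or-replan decision can be sketched with a hand-rolled check. The plan shape below is illustrative, not the actual schema:

```typescript
// Sketch: a remediation plan is accepted only if the raw model output
// parses as JSON and matches the expected shape; anything else returns
// null, which would trigger replanning.
type RemediationPlan = { summary: string; commands: string[] };

function parsePlan(raw: string): RemediationPlan | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not even JSON -> replan
  }
  if (typeof data !== "object" || data === null) return null;
  const plan = data as Partial<RemediationPlan>;
  if (typeof plan.summary !== "string") return null;
  if (!Array.isArray(plan.commands) || !plan.commands.every((c) => typeof c === "string")) return null;
  return plan as RemediationPlan;
}
```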

Persistent context and replanning

Conversation history is preserved across planning and replanning attempts. When execution fails, verification fails, or a user rejects a plan, Gemini receives the full history of prior attempts and feedback, allowing it to adapt incrementally instead of restarting from scratch.

In Nightwatch, Gemini is essential for inferring causal chains from noisy logs, deciding whether remediation is safe or feasible, adapting plans based on failures and human feedback, and orchestrating recovery without hard failures.

I also built Clipper, a realistic multi-service video processing platform (API, PostgreSQL, Redis, LocalStack S3, transcoder, notifier, frontend) as a test bed. It comes with a chaos.ps1 script that can simulate failure scenarios — cache failures, Redis OOM, connection limit exhaustion, network partitions, multi-service cascading failures, and full infrastructure meltdowns.

Challenges I ran into

Getting cascading failures into a single analysis batch. In real infrastructure, failures don't happen simultaneously — they cascade over seconds or minutes. I designed a sliding window (debounce) log buffer with a processing lock and backpressure limits so that related errors from a cascading failure arrive as a single batch rather than triggering separate incidents.
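The buffer can be sketched as a debounce with a hard cap. Timing values and names are illustrative:

```typescript
// Sketch: each new line resets a quiet-period timer so cascading errors
// land in one batch; a hard cap force-flushes to bound memory.
class LogBatcher {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private onBatch: (lines: string[]) => void,
    private quietMs = 5_000,
    private maxLines = 500,
  ) {}

  push(line: string): void {
    this.buffer.push(line);
    if (this.buffer.length >= this.maxLines) {
      this.flush(); // backpressure: force-flush before memory grows unbounded
      return;
    }
    if (this.timer) clearTimeout(this.timer); // debounce: restart the quiet period
    this.timer = setTimeout(() => this.flush(), this.quietMs);
  }

  flush(): void {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.onBatch(batch);
  }
}
```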

Making the agent recover instead of fail. Early on, when the agent couldn't fix something, it would just stop. I redesigned the system so that every dead end routes through a user interaction — escalation, approval rejection, or circuit breaker — giving the human a chance to provide context and letting the agent continue with new information.

Keeping the planner's memory across retries. When a plan fails and the agent replans, it needs to remember what it already tried. I solved this by threading the planner's conversation history through the state object, so each replanning attempt builds on the full context of previous attempts rather than starting blind.

Accomplishments that I'm proud of

  • The agent can resolve a full cascading failure — from detecting error logs across multiple containers, through root cause identification, planning, user approval, execution, and verification — in a single automated pass
  • The escalation and circuit breaker system means the agent never silently fails; it always either resolves the incident or brings a human into the loop with specific, actionable context
  • Knowledge persistence creates a genuine learning loop — the agent gets better at handling an environment the more it interacts with it
  • The safety validation layer is strict enough to prevent real damage while flexible enough to allow the agent to do useful work — no shell injection, no destructive patterns, single container targeting per command

What I learned

  • LLMs as orchestrators, not chatbots: Using Gemini for structured decision-making (function calling over a capability registry) produces far more reliable results than free-form generation
  • The importance of grounding: Giving the agent tools to inspect real container state (not just logs) dramatically improved diagnosis accuracy
  • User interaction is a capability, not an interruption: Treating escalation, approval, and questions as first-class capabilities in the state machine made the system more resilient — it asks for help instead of failing silently
  • Knowledge compounds: Persisting facts across sessions means the agent handles recurring scenarios faster each time, without re-asking the same questions
  • Safety is a feature, not a constraint: The validation layer isn't overhead — it's what makes autonomous execution trustworthy

What's next for Nightwatch: Autonomous SRE Agent

  • Incident history and analytics — Persist every incident lifecycle (detection, diagnosis, resolution, outcome) to a database so the agent builds institutional memory — what incidents recur, which remediations succeed, and where it tends to escalate
  • Metrics-based detection — Incorporate container metrics (CPU, memory, network) from the Docker stats API alongside logs to catch performance degradation before it becomes an outage — shifting from reactive to proactive monitoring
  • Alerting integrations — Send notifications to Slack, PagerDuty, or webhooks when incidents are detected, resolved, or escalated — making Nightwatch usable in real on-call workflows instead of just a terminal session
  • Multi-incident handling — Run concurrent resolution pipelines when multiple unrelated failures occur simultaneously, so the agent doesn't have to finish one incident before starting the next
  • Kubernetes support — Adapt the capability system to work with Kubernetes primitives (pods, deployments, services) alongside Docker, extending Nightwatch to orchestrated environments
