Inspiration:

Every minute an infrastructure incident goes unresolved, businesses bleed money and trust. On-call engineers burn out juggling alerts, runbooks, and post-mortems — often recreating context that was documented months ago. We asked: what if an AI team could handle the entire incident lifecycle autonomously, remembering every past incident to fight the next one smarter? That question became Incident Commander — a multi-agent system with true agentic memory, built for the CockroachDB × AWS Hackathon.

What it does:

Incident Commander is an AI-powered incident response platform that autonomously manages the full incident lifecycle through four specialized agents:

* Triage Agent: classifies incoming incidents by severity, assigns impact levels, and identifies affected services
* Investigation Agent: performs root cause analysis by cross-referencing the incident against historical data stored in a vector knowledge base
* Resolution Agent: generates actionable fix steps, referencing relevant runbooks and past resolutions
* Post-Mortem Agent: produces detailed post-incident reports, stores lessons learned as vector embeddings, and uploads artifacts to S3

The system uses RAG (Retrieval-Augmented Generation) over CockroachDB's distributed pgvector indexes, so every resolved incident makes the system smarter. Users interact through a real-time dashboard where they can create incidents, trigger agent pipelines, search the knowledge base, and review AI-generated insights, all within a single unified interface.

How we built it:

Frontend: Next.js 16 (App Router) with TypeScript, Tailwind CSS 4, and shadcn/ui components for a clean, responsive dashboard experience.

Database: CockroachDB Cloud cluster ("chosen-hare") with pgvector-compatible vector support. We use VECTOR(1536) columns with distributed vector indexes for cosine similarity search. Three core tables power the system:

* incidents: full incident records with status, severity, agent actions, and post-mortem data
* runbooks: operational runbooks embedded as vectors for semantic search
* incident_embeddings: vector store linking incidents to their embeddings for RAG retrieval

AI Reasoning: AWS Bedrock with Claude Sonnet 4 handles all LLM reasoning. Each agent sends structured prompts with context retrieved from CockroachDB, and parses structured JSON responses back.

Storage: AWS S3 stores post-mortem artifacts and report files, with pre-signed URLs for secure access.

Embeddings: For the demo, we use deterministic hash-based 1536-dimensional embeddings. In production, this swaps to OpenAI text-embedding-ada-002 or Bedrock's Titan embeddings with zero code changes.
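A deterministic hash-based embedding of this kind can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names, the FNV-1a style seed hash, and the mulberry32 generator are all assumptions chosen to keep the example dependency-free. Vectors produced this way are reproducible but carry no semantic meaning, so only identical text maps to the same vector.

```typescript
const EMBEDDING_DIM = 1536; // matches the VECTOR(1536) columns

// FNV-1a style 32-bit hash over UTF-16 code units, used only to derive a seed.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// mulberry32: a small seeded generator returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: the same text always yields the same unit-length vector.
function hashEmbedding(text: string, dim: number = EMBEDDING_DIM): number[] {
  const next = mulberry32(fnv1a(text));
  const values: number[] = [];
  for (let i = 0; i < dim; i++) {
    values.push(next() * 2 - 1); // spread values over [-1, 1)
  }
  // Normalise so cosine similarity behaves as it would with real embeddings.
  const norm = Math.sqrt(values.reduce((sum, v) => sum + v * v, 0));
  return values.map((v) => v / norm);
}
```

Because the output has the same shape as a real 1536-dimensional embedding, a function like this can sit behind the same interface that a hosted embedding model would later fill.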

API Layer: Next.js API routes handle all CRUD operations and agent orchestration. Each agent endpoint triggers Bedrock invocation, RAG search against CockroachDB vector indexes, and S3 artifact storage.
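The RAG search step inside an agent endpoint might look roughly like the sketch below. Only the `incident_embeddings` table name comes from this write-up; the column names (`incident_id`, `embedding`), the helper names, and the use of the pgvector-style `<=>` cosine distance operator with a `::vector` cast are assumptions about the schema and should be checked against the real one.

```typescript
interface SqlQuery {
  text: string;
  values: (string | number)[];
}

// pgvector's text format for a vector is a bracketed, comma-separated list.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0 || embedding.some((v) => !Number.isFinite(v))) {
    throw new Error("embedding must be a non-empty array of finite numbers");
  }
  return `[${embedding.join(",")}]`;
}

// Hypothetical helper: builds a parameterised nearest-neighbour query.
// Smaller cosine distance means a more similar past incident.
function buildSimilarIncidentsQuery(embedding: number[], limit = 5): SqlQuery {
  return {
    text: `
      SELECT incident_id, embedding <=> $1::vector AS cosine_distance
      FROM incident_embeddings
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    values: [toVectorLiteral(embedding), limit],
  };
}
```

With the `pg` library, the result could be passed along as `pool.query(query.text, query.values)`, and the returned rows would then be formatted into the agent's prompt as retrieved context.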

Challenges we ran into:

CockroachDB pgvector dimension requirement: CockroachDB requires explicit VECTOR(1536) dimensions, unlike PostgreSQL with pgvector, which accepts a bare VECTOR. Our initial schema used undimensioned vectors, so we had to drop and recreate all tables.

Node.js pg SSL connection failure: CockroachDB Cloud requires SSL with a root certificate. Passing the sslrootcert as a query parameter in the connection string silently failed in Node.js's pg library. We solved this by using explicit connection parameters with ssl: { ca: fs.readFileSync(certPath) } instead.
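The workaround can be sketched as a small helper that turns a connection string into explicit parameters and attaches the certificate separately. The helper name, the config interface, and the example host are hypothetical; the sketch takes the certificate contents as a string so it can be exercised without a file, whereas the fix described above reads it with `fs.readFileSync(certPath)`.

```typescript
interface ExplicitPgConfig {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
  ssl: { ca: string; rejectUnauthorized: boolean };
}

// Hypothetical helper: drops the query string (sslmode, sslrootcert, ...)
// and carries the CA certificate in the ssl option instead.
function buildPgConfig(connectionString: string, caPem: string): ExplicitPgConfig {
  const url = new URL(connectionString);
  return {
    host: url.hostname,
    port: Number(url.port || 26257), // 26257 is CockroachDB's default SQL port
    user: decodeURIComponent(url.username),
    password: decodeURIComponent(url.password),
    database: url.pathname.replace(/^\//, ""),
    ssl: { ca: caPem, rejectUnauthorized: true },
  };
}
```

The resulting object has the field names that `new Pool(config)` from `pg` accepts, so the certificate no longer depends on how the library parses the connection string.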

Agent prompt engineering for structured output: Getting Claude to return parseable JSON consistently required careful system prompt design with explicit output schemas and retry logic with invokeClaudeJSON() — a typed wrapper that validates and retries on malformed responses.
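The validate-and-retry idea can be illustrated with the sketch below. It is not the real `invokeClaudeJSON()`: the function names, the `TriageResult` shape, and the injected `invoke` callback (standing in for the Bedrock call) are assumptions made so the example stays self-contained.

```typescript
type Validator<T> = (value: unknown) => value is T;
type Invoke = (prompt: string) => Promise<string>;

// Pulls the outermost {...} span out of a reply, tolerating code fences
// or extra commentary around the JSON.
function extractJson(reply: string): unknown {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object in reply");
  return JSON.parse(reply.slice(start, end + 1));
}

// Hypothetical wrapper: re-invokes the model until the reply parses and
// passes the validator, or the attempts run out.
async function invokeJsonWithRetry<T>(
  invoke: Invoke,
  prompt: string,
  isValid: Validator<T>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const parsed = extractJson(await invoke(prompt));
      if (isValid(parsed)) return parsed;
      lastError = new Error("reply did not match the expected schema");
    } catch (err) {
      lastError = err; // malformed JSON: fall through and try again
    }
  }
  throw lastError;
}

// Illustrative schema for a triage reply (field names are assumptions).
interface TriageResult {
  severity: "low" | "medium" | "high" | "critical";
  affectedServices: string[];
}

function isTriageResult(value: unknown): value is TriageResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.severity === "string" &&
    ["low", "medium", "high", "critical"].includes(v.severity) &&
    Array.isArray(v.affectedServices) &&
    v.affectedServices.every((s) => typeof s === "string")
  );
}
```

Passing the model call in as a parameter keeps the retry logic testable with a fake that returns a malformed reply first and a valid one second.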

Balancing demo reproducibility with production readiness: We needed deterministic behavior for hackathon demos while keeping the architecture production-grade. Our hash-based embedding approach solved this: it produces consistent vectors without an external embedding service, while the vector search pipeline remains identical to what a production deployment would use.

Accomplishments that we're proud of:

* Fully functional end-to-end pipeline: from incident creation through triage, investigation, resolution, and post-mortem, all AI-driven and all connected to real infrastructure
* Distributed vector search: RAG over CockroachDB's distributed pgvector indexes actually retrieves semantically relevant past incidents and runbooks
* Multi-agent architecture: four distinct AI agents with specialized roles, each with tailored prompts and context injection
* Real AWS integration: live Claude Sonnet 4 reasoning via Bedrock and actual S3 artifact storage, not mocked
* Clean, deployable codebase: well-structured Next.js app with proper TypeScript types, API routes, and separation of concerns

What we learned:

* CockroachDB's pgvector support is powerful but has subtle differences from vanilla PostgreSQL: dimension requirements, index syntax, and connection handling all differ
* Multi-agent systems need careful state management: each agent must receive precisely the right context without overwhelming the LLM's context window
* RAG quality depends heavily on how you chunk and store knowledge: raw incident dumps perform far worse than structured embeddings with metadata
* AWS Bedrock's API is remarkably clean for structured output, but error handling and retry logic are essential for production use
* The combination of distributed SQL (CockroachDB) + serverless AI (Bedrock) + object storage (S3) creates a naturally scalable architecture that grows with your incident history

What's next for Incident Commander:

* Real embeddings: swap the hash-based demo embeddings for Bedrock Titan or OpenAI embeddings to get genuine semantic search
* WebSocket streaming: stream agent reasoning in real time to the dashboard as Claude processes each step
* Slack/Teams integration: create and triage incidents directly from chat, with agents posting updates to channels
* Multi-tenant support: CockroachDB's distributed architecture makes it natural to isolate teams and organizations
* Auto-escalation: if an agent's confidence in its resolution falls below a threshold, automatically page a human with full AI-generated context
* Historical analytics: trend analysis across incidents, MTTR tracking, and prediction of recurring failure patterns
