Inspiration

On-call firefights are messy: context hides in runbooks, Slack threads, and log silos. Even when you find similar past incidents, you still need to summarize impact, pick a severity, and open the right tickets. We wanted an agent that doesn't just answer: it retrieves, reasons, and acts. TiDB Serverless was a perfect fit: vectors + SQL in one place, HTAP for both incident artifacts and analytics, scale-to-zero for demos, and dead-simple MySQL compatibility.

What it does

Incident Copilot ingests runbooks and logs into TiDB Serverless with embeddings. When you ask a question or an incident hits, it:

  • runs hybrid retrieval (vector + keyword + light recency boost),

  • summarizes likely root cause, classifies SEV1/2/3, and estimates confidence,

  • proposes actionable next steps and can execute them (Slack update, GitHub issue),

  • persists everything (sources, actions, audit) back to TiDB for an end-to-end trail.
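The retrieve → reason → act → persist loop above can be sketched in a few lines of Java. All type and method names here are illustrative assumptions, not the project's real classes:

```java
import java.util.List;

// Minimal sketch of the agent loop: hybrid retrieval, one structured LLM
// call, side-effecting actions, then an audit write. Names are hypothetical.
public class CopilotFlow {
    record Evidence(String source, double score) {}
    record Assessment(String summary, String severity, double confidence,
                      List<String> actions) {}

    interface Retriever { List<Evidence> retrieve(String query); }
    interface Reasoner { Assessment assess(String query, List<Evidence> evidence); }
    interface Actor { void execute(String action); }            // Slack, GitHub, ...
    interface AuditLog { void persist(String query, List<Evidence> evidence,
                                      Assessment assessment); } // back to TiDB

    static Assessment handle(String query, Retriever retriever, Reasoner reasoner,
                             Actor actor, AuditLog audit) {
        List<Evidence> evidence = retriever.retrieve(query); // hybrid retrieval
        Assessment a = reasoner.assess(query, evidence);     // summary + severity + actions
        a.actions().forEach(actor::execute);                 // execute proposed actions
        audit.persist(query, evidence, a);                   // end-to-end trail
        return a;
    }
}
```

Keeping each stage behind its own interface is what makes the pipeline testable and the audit step unavoidable.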

How we built it

  • Stack: Java 21 + Spring Boot, TiDB Serverless (VECTOR + SQL), OpenAI (embeddings + chat), Slack + GitHub APIs.

  • Schema: documents, doc_chunks (VECTOR(1536)), incidents, incident_evidence, actions.

  • Ingestion: chunk (≈3.6k chars, 400-char overlap) → text-embedding-3-small (1536-D) → insert the vector as a text literal ("[x,y,...]") into TiDB.

  • Retrieval: VEC_COSINE_DISTANCE(embedding, '[...]') top-K + keyword candidates; merge with weights (α=0.75, β=0.2, γ=0.05).

  • Reasoning: one chat call returns JSON {summary, suspected_cause, severity, confidence, actions[]}.

  • Actions: Slack chat.postMessage, GitHub Issues; statuses recorded with refs; idempotent inserts.

  • Ops: Dockerfile, .github/workflows/ci.yml (Maven verify + multi-arch image push), Postman collection for a 60-second demo.
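A few of the mechanics above (the chunk/overlap sizes, TiDB's text vector literal, and the α/β/γ fusion) can be sketched in plain Java. This is a hedged sketch with assumed names; the real service wraps these steps around JDBC and the OpenAI client:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Ingestion and ranking helpers matching the write-up's numbers:
// ~3600-char chunks with 400-char overlap, "[x,y,...]" vector literals,
// and a 0.75 / 0.2 / 0.05 weighted merge. Names are illustrative.
public class HybridRag {
    static final int CHUNK = 3600, OVERLAP = 400;

    // Split a document into overlapping chunks before embedding.
    static List<String> chunk(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start += CHUNK - OVERLAP) {
            out.add(text.substring(start, Math.min(start + CHUNK, text.length())));
            if (start + CHUNK >= text.length()) break;
        }
        return out;
    }

    // TiDB accepts vectors as a text literal like "[0.1,0.2,...]",
    // usable in INSERTs and in VEC_COSINE_DISTANCE(embedding, '[...]').
    static String vectorLiteral(float[] v) {
        StringJoiner j = new StringJoiner(",", "[", "]");
        for (float x : v) j.add(Float.toString(x));
        return j.toString();
    }

    // Merge vector similarity, keyword score, and a recency boost with the
    // weights from the write-up (alpha=0.75, beta=0.2, gamma=0.05).
    // All three inputs are assumed normalized to [0, 1].
    static double fuse(double vectorSim, double keywordScore, double recency) {
        return 0.75 * vectorSim + 0.2 * keywordScore + 0.05 * recency;
    }
}
```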

Challenges we ran into

  • Vector gotchas: aligning embedding dimension (1536 vs 3072) and TiDB’s vector literal ("[...]") + distance functions.

  • Hybrid ranking: tuning α/β/γ so keyword noise doesn’t drown semantic matches.

  • LLM determinism: enforcing JSON outputs and guarding against partial/actionless replies.

  • Action reliability: handling missing tokens, retries, and capturing external refs for audit.

  • Throughput vs cost: batching embeddings, caching by content hash, and keeping end-to-end latency around 3–6s including external calls.
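Of these, the content-hash embedding cache is the easiest to show. A minimal sketch, assuming a `ConcurrentHashMap` in front of the embedding call (names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache embeddings by SHA-256 of the chunk text, so identical content
// never hits the embedding API twice. Illustrative sketch only; the real
// service also batches requests.
public class EmbeddingCache {
    private final Map<String, float[]> byHash = new ConcurrentHashMap<>();

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // 'embed' stands in for the OpenAI call; it only runs on a cache miss.
    float[] embedCached(String text, Function<String, float[]> embed) {
        return byHash.computeIfAbsent(sha256(text), h -> embed.apply(text));
    }
}
```

Keying on a content hash rather than the raw text keeps map keys bounded in size and makes the same key usable for idempotent inserts downstream.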

Accomplishments that we’re proud of

  • A true agentic path: ingest → retrieve → reason → act, fully auditable in TiDB.

  • <3s p95 for retrieve+summarize on a modest dataset; <6s with Slack/GitHub actions enabled (demo conditions).

  • Clean developer UX: Swagger, Postman, Docker, and CI so judges can run it instantly.

  • Minimal but solid Java code with clear seams (Retrieval/LLM/Actions) and schema that scales.

What we learned

  • TiDB’s VECTOR type and functions make hybrid RAG feel native to SQL—no extra vector service needed.

  • A small keyword signal + recency boost improves relevance over pure k-NN.

  • JSON-schema-like prompts reduce action errors and make downstream automation safer.

  • Auditability matters: storing sources, decisions, and external refs turns an LLM into an operable system.

What’s next for Incident Copilot (TiDB Serverless + Agentic RAG)

  • More tools: Jira tickets, Google Calendar post-mortems, PagerDuty paging.

  • Better retrieval: full-text/BM25 index + RRF fusion; ANN index when available; per-service priors.

  • Evals & safety: unit tests for prompts, offline evals on incident sets, stricter JSON schema validation.

  • Semantic cache: reuse answers for near-duplicate queries; dedupe actions by idempotency keys.

  • Multi-tenant & RBAC: workspace isolation, key scoping, and audit dashboards.

  • UI polish: lightweight React console with source highlighting and action toggles.

  • Post-mortems: auto-generate timeline & RCA docs from the persisted evidence/actions.
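Of the retrieval upgrades listed, RRF fusion is simple enough to sketch now: each ranked list contributes 1/(k + rank) per document, with k = 60 as the conventional constant. A minimal version (names are illustrative):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Reciprocal Rank Fusion: merge several ranked result lists by summing
// 1/(k + rank) per document, then sort by fused score, best first.
public class Rrf {
    static LinkedHashMap<String, Double> fuse(int k, List<List<String>> rankings) {
        Map<String, Double> score = new HashMap<>();
        for (List<String> ranking : rankings)
            for (int rank = 0; rank < ranking.size(); rank++)
                score.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);

        LinkedHashMap<String, Double> out = new LinkedHashMap<>();
        score.entrySet().stream()
             .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
             .forEach(e -> out.put(e.getKey(), e.getValue()));
        return out;
    }
}
```

Unlike the current weighted sum, RRF only needs ranks, so vector and BM25 scores never have to share a scale.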

Built With

  • jackson
  • java
  • maven
  • okhttp
  • serverless
  • springboot
  • springdoc-openapi
  • tidb