Inspiration

On-call firefights are messy: context hides in runbooks, Slack threads, and log silos. Even when you find similar past incidents, you still need to summarize impact, pick a severity, and open the right tickets. We wanted an agent that doesn't just answer: it retrieves, reasons, and acts. TiDB Serverless was a perfect fit: vectors + SQL in one place, HTAP for both incident artifacts and analytics, scale-to-zero for demos, and dead-simple MySQL compatibility.

What it does

Incident Copilot ingests runbooks and logs into TiDB Serverless with embeddings. When you ask a question or an incident hits, it:

  • runs hybrid retrieval (vector + keyword + light recency boost),

  • summarizes likely root cause, classifies SEV1/2/3, and estimates confidence,

  • proposes actionable next steps and can execute them (Slack update, GitHub issue),

  • persists everything (sources, actions, audit) back to TiDB for an end-to-end trail.
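The retrieve → reason → act → persist loop above can be sketched in a few lines of Java. All type and method names here are illustrative assumptions, not the project's real classes:

```java
import java.util.List;

// Minimal sketch of the agent loop: hybrid retrieval, one structured LLM
// call, side-effecting actions, then an audit write. Names are hypothetical.
public class CopilotFlow {
    record Evidence(String source, double score) {}
    record Assessment(String summary, String severity, double confidence,
                      List<String> actions) {}

    interface Retriever { List<Evidence> retrieve(String query); }
    interface Reasoner { Assessment assess(String query, List<Evidence> evidence); }
    interface Actor { void execute(String action); }            // Slack, GitHub, ...
    interface AuditLog { void persist(String query, List<Evidence> evidence,
                                      Assessment assessment); } // back to TiDB

    static Assessment handle(String query, Retriever retriever, Reasoner reasoner,
                             Actor actor, AuditLog audit) {
        List<Evidence> evidence = retriever.retrieve(query); // hybrid retrieval
        Assessment a = reasoner.assess(query, evidence);     // summary + severity + actions
        a.actions().forEach(actor::execute);                 // execute proposed actions
        audit.persist(query, evidence, a);                   // end-to-end trail
        return a;
    }
}
```

Keeping each stage behind its own interface is what makes the pipeline testable and the audit step unavoidable.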

How we built it

  • Stack: Java 21 + Spring Boot, TiDB Serverless (VECTOR + SQL), OpenAI (embeddings + chat), Slack + GitHub APIs.

  • Schema: documents, doc_chunks (VECTOR(1536)), incidents, incident_evidence, actions.

  • Ingestion: chunk (≈3.6k chars, 400-char overlap) → text-embedding-3-small (1536-D) → insert the vector as a text literal ("[x,y,...]") into TiDB.

  • Retrieval: VEC_COSINE_DISTANCE(embedding, '[...]') top-K + keyword candidates; merge with weights (α=0.75, β=0.2, γ=0.05).

  • Reasoning: one chat call returns JSON {summary, suspected_cause, severity, confidence, actions[]}.

  • Actions: Slack chat.postMessage, GitHub Issues; statuses recorded with refs; idempotent inserts.

  • Ops: Dockerfile, .github/workflows/ci.yml (Maven verify + multi-arch image push), Postman collection for a 60-second demo.
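A few of the mechanics above (the chunk/overlap sizes, TiDB's text vector literal, and the α/β/γ fusion) can be sketched in plain Java. This is a hedged sketch with assumed names; the real service wraps these steps around JDBC and the OpenAI client:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Ingestion and ranking helpers matching the write-up's numbers:
// ~3600-char chunks with 400-char overlap, "[x,y,...]" vector literals,
// and a 0.75 / 0.2 / 0.05 weighted merge. Names are illustrative.
public class HybridRag {
    static final int CHUNK = 3600, OVERLAP = 400;

    // Split a document into overlapping chunks before embedding.
    static List<String> chunk(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start += CHUNK - OVERLAP) {
            out.add(text.substring(start, Math.min(start + CHUNK, text.length())));
            if (start + CHUNK >= text.length()) break;
        }
        return out;
    }

    // TiDB accepts vectors as a text literal like "[0.1,0.2,...]",
    // usable in INSERTs and in VEC_COSINE_DISTANCE(embedding, '[...]').
    static String vectorLiteral(float[] v) {
        StringJoiner j = new StringJoiner(",", "[", "]");
        for (float x : v) j.add(Float.toString(x));
        return j.toString();
    }

    // Merge vector similarity, keyword score, and a recency boost with the
    // weights from the write-up (alpha=0.75, beta=0.2, gamma=0.05).
    // All three inputs are assumed normalized to [0, 1].
    static double fuse(double vectorSim, double keywordScore, double recency) {
        return 0.75 * vectorSim + 0.2 * keywordScore + 0.05 * recency;
    }
}
```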

Challenges we ran into

  • Vector gotchas: aligning embedding dimension (1536 vs 3072) and TiDB’s vector literal ("[...]") + distance functions.

  • Hybrid ranking: tuning α/β/γ so keyword noise doesn’t drown semantic matches.

  • LLM determinism: enforcing JSON outputs and guarding against partial/actionless replies.

  • Action reliability: handling missing tokens, retries, and capturing external refs for audit.

  • Throughput vs cost: batching embeddings, caching by content hash, and keeping end-to-end latency around 3–6s including external calls.
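Of these, the content-hash embedding cache is the easiest to show. A minimal sketch, assuming a `ConcurrentHashMap` in front of the embedding call (names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache embeddings by SHA-256 of the chunk text, so identical content
// never hits the embedding API twice. Illustrative sketch only; the real
// service also batches requests.
public class EmbeddingCache {
    private final Map<String, float[]> byHash = new ConcurrentHashMap<>();

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // 'embed' stands in for the OpenAI call; it only runs on a cache miss.
    float[] embedCached(String text, Function<String, float[]> embed) {
        return byHash.computeIfAbsent(sha256(text), h -> embed.apply(text));
    }
}
```

Keying on a content hash rather than the raw text keeps map keys bounded in size and makes the same key usable for idempotent inserts downstream.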

Accomplishments that we’re proud of

  • A true agentic path: ingest → retrieve → reason → act, fully auditable in TiDB.

  • <3s p95 for retrieve+summarize on a modest dataset; <6s with Slack/GitHub actions enabled (demo conditions).

  • Clean developer UX: Swagger, Postman, Docker, and CI so judges can run it instantly.

  • Minimal but solid Java code with clear seams (Retrieval/LLM/Actions) and schema that scales.

What we learned

  • TiDB’s VECTOR type and functions make hybrid RAG feel native to SQL—no extra vector service needed.

  • A small keyword signal + recency boost improves relevance over pure k-NN.

  • JSON-schema-like prompts reduce action errors and make downstream automation safer.

  • Auditability matters: storing sources, decisions, and external refs turns an LLM into an operable system.

What’s next for Incident Copilot (TiDB Serverless + Agentic RAG)

  • More tools: Jira tickets, Google Calendar post-mortems, PagerDuty paging.

  • Better retrieval: full-text/BM25 index + RRF fusion; ANN index when available; per-service priors.

  • Evals & safety: unit tests for prompts, offline evals on incident sets, stricter JSON schema validation.

  • Semantic cache: reuse answers for near-duplicate queries; dedupe actions by idempotency keys.

  • Multi-tenant & RBAC: workspace isolation, key scoping, and audit dashboards.

  • UI polish: lightweight React console with source highlighting and action toggles.

  • Post-mortems: auto-generate timeline & RCA docs from the persisted evidence/actions.
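Of the retrieval upgrades listed, RRF fusion is simple enough to sketch now: each ranked list contributes 1/(k + rank) per document, with k = 60 as the conventional constant. A minimal version (names are illustrative):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Reciprocal Rank Fusion: merge several ranked result lists by summing
// 1/(k + rank) per document, then sort by fused score, best first.
public class Rrf {
    static LinkedHashMap<String, Double> fuse(int k, List<List<String>> rankings) {
        Map<String, Double> score = new HashMap<>();
        for (List<String> ranking : rankings)
            for (int rank = 0; rank < ranking.size(); rank++)
                score.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);

        LinkedHashMap<String, Double> out = new LinkedHashMap<>();
        score.entrySet().stream()
             .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
             .forEach(e -> out.put(e.getKey(), e.getValue()));
        return out;
    }
}
```

Unlike the current weighted sum, RRF only needs ranks, so vector and BM25 scores never have to share a scale.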

Built With

  • jackson
  • java
  • maven
  • okhttp
  • serverless
  • springboot
  • springdoc-openapi
  • tidb