Inspiration

Ops tools often feel like a black box during incidents. You get alerts, but no clear story of why things broke. I wanted to build something transparent, trustworthy, and demo-ready — where every AI-generated insight is backed by real evidence and clear visuals.


What it does

RootCause AI is an intelligent AIOps assistant that:

  • Diagnoses incidents with an LLM-guided, schema-validated RCA.
  • Links every causal step to concrete logs, metrics, commits, or bug reports.
  • Renders interactive causal chains in a no-code Streamlit UI.
  • Detects anomalies in real time and predicts CPU, memory, or response-time issues before they escalate.
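
Schema validation is what keeps the RCA auditable: any hypothesis without linked evidence is rejected before it reaches the UI. A minimal sketch of that check (the field names `cause`, `confidence`, and `evidence` are illustrative, not the project's actual schema):

```python
# Minimal validator for an RCA payload (hypothetical schema): every
# hypothesis must name a cause, carry a confidence score, and link at
# least one piece of evidence, so no claim ships without backing data.
REQUIRED_KEYS = {"cause", "confidence", "evidence"}

def validate_rca(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means usable."""
    errors = []
    for i, hyp in enumerate(payload.get("hypotheses", [])):
        missing = REQUIRED_KEYS - hyp.keys()
        if missing:
            errors.append(f"hypothesis {i}: missing {sorted(missing)}")
        elif not hyp["evidence"]:
            errors.append(f"hypothesis {i}: no evidence linked")
    return errors

llm_output = {
    "hypotheses": [
        {"cause": "DB deadlock", "confidence": 0.82,
         "evidence": ["log:2024-05-01T12:03Z", "commit:abc123"]},
        {"cause": "Cache miss storm", "confidence": 0.4, "evidence": []},
    ]
}
print(validate_rca(llm_output))  # the evidence-free hypothesis is flagged
```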

How we built it

  • Analyzer Core → Normalizes events, builds a RAG prompt, validates JSON, ranks hypotheses, and maps them to evidence.
  • Connectors → Logs, GitHub commits, metrics (CSV/JSON), bug reports, and Datadog live metrics.
  • No-Code UI → Streamlit app for provider selection, demo mode, anomaly & prediction panels, and causal chain visualization.
  • Simulation Mode → Prebuilt incidents (e.g., DB deadlock) for demoing without external data.
  • Visualizations → NetworkX + Plotly interactive graphs with PNG export.
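
The causal chain itself is just a directed graph where every edge carries its supporting evidence; a topological sort then recovers the incident narrative root-cause first. A sketch with NetworkX (node names and evidence strings are invented for illustration; the Plotly rendering layer is omitted):

```python
import networkx as nx

# Causal-chain graph (hypothetical incident): each edge is one causal
# step, annotated with the evidence that supports it.
G = nx.DiGraph()
G.add_edge("deploy abc123", "connection pool exhausted",
           evidence="commit diff: pool_size 50 -> 5")
G.add_edge("connection pool exhausted", "DB deadlock",
           evidence="log: Deadlock found at 12:03Z")
G.add_edge("DB deadlock", "checkout latency spike",
           evidence="metric: p95 > 2s")

# Topological order reads as the incident story, root cause first.
print(list(nx.topological_sort(G)))
```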

Challenges we ran into

  • Getting consistent JSON outputs from different LLMs.
  • Normalizing timestamps across logs, metrics, and commits.
  • Balancing anomaly sensitivity: early warnings vs. false positives.
  • Designing graphs that are both clear and information-dense.
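
The timestamp problem comes from each connector speaking a different dialect: metrics arrive as epoch seconds, logs as ISO 8601 strings, commits in git's offset-bearing format. One way to collapse them onto a single UTC timeline with pandas (sample values are made up):

```python
import pandas as pd

# Hypothetical raw timestamps as they arrive from different connectors.
raw = {
    "metric": 1714564980,                  # epoch seconds
    "log":    "2024-05-01T12:03:00.451Z",  # ISO 8601 with milliseconds
    "commit": "2024-05-01 11:58:00 +0200", # git-style date with offset
}

# Normalize everything to timezone-aware UTC so events can be ordered
# across sources.
normalized = {
    "metric": pd.to_datetime(raw["metric"], unit="s", utc=True),
    "log":    pd.to_datetime(raw["log"], utc=True),
    "commit": pd.to_datetime(raw["commit"], utc=True),
}
for source, ts in normalized.items():
    print(source, ts.isoformat())
```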

Accomplishments we’re proud of

  • Built an end-to-end RCA demo that anyone can run in minutes.
  • Delivered auditable, evidence-linked AI outputs instead of black-box guesses.
  • Made anomaly detection and predictions simple, fast, and explainable without heavy ML.
  • Created a forkable, extensible reference project for any DevOps team.

What we learned

  • Schema-validated JSON makes LLM outputs reliable.
  • RCA prompts must include timestamps, severities, and diffs for accuracy.
  • Simple stats (z-scores, trends) can outperform ML for clarity and speed.
  • Transparency builds trust — teams adopt AI faster when they see why it reached a conclusion.
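
The z-score point is easy to make concrete. A minimal sketch of rolling-window anomaly detection over a CPU series (synthetic data; the threshold of 3 and window of 5 are illustrative choices, not the project's tuned values):

```python
import pandas as pd

# Synthetic CPU-utilization series with one obvious spike.
cpu = pd.Series([42, 41, 43, 40, 42, 44, 41, 43, 95, 42])

window = 5
mean = cpu.rolling(window).mean()
std = cpu.rolling(window).std()

# Score each point against the *previous* window (shift by 1) so the
# spike itself does not inflate the baseline it is compared to.
z = (cpu - mean.shift(1)) / std.shift(1)

anomalies = cpu[z.abs() > 3]
print(anomalies)  # flags index 8 (the spike to 95)
```

No model training, no feature engineering, and the "why" is a single arithmetic sentence: this point is more than three standard deviations from its trailing mean.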

What’s next for RootCause AI

  • Add connectors for Kubernetes, cloud logs, and CI/CD pipelines.
  • Improve multi-incident correlation to uncover cross-outage patterns.
  • Enhance prediction models with lightweight ML for long-term forecasting.
  • Package as a plug-and-play open-source tool for easy adoption.

Built With

  • requests
  • datadog
  • github
  • matplotlib
  • networkx
  • numpy
  • openai
  • pandas
  • python
  • streamlit-ui
  • z-scores