OpsGuardian: The Statistical SRE Agent

Inspiration

We were tired of midnight alerts and messy dashboards. We tried using generic AI chatbots to help, but they failed: they hallucinated numbers and guessed error rates. In SRE, a guess is dangerous.

We wanted to build an AI that calculates instead of guessing: an agent that combines the reasoning of an LLM with the mathematical rigor of Elasticsearch.

How we built it

We built OpsGuardian using the Elastic Agent Builder. We designed a structured "Triad of Truth" architecture:

  1. The Calculator (Math): We used ES|QL to perform real-time math (e.g., EVAL error_rate). The agent calculates the exact error percentage directly from the logs.
  2. The Historian (Patterns): We used Elastic's search capabilities to find historical logs that match current incidents.
  3. The Fixer (Knowledge): We used Semantic Search (RAG) to retrieve the correct Standard Operating Procedure (SOP) to fix the issue.
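
As a rough illustration of the Calculator step, here is a minimal sketch of what such an error-rate query could look like, with a small Python function that mirrors the same arithmetic on in-memory data. The index pattern `logs-*`, the field `http.response.status_code`, the 15-minute window, and the "status >= 500 is an error" rule are all assumptions for the example, not details taken from the project.

```python
# Hypothetical ES|QL that a Calculator-style tool might run. The index
# pattern, field name, and time window are assumptions for illustration.
ERROR_RATE_ESQL = """
FROM logs-*
| WHERE @timestamp > NOW() - 15 minutes
| EVAL is_error = CASE(http.response.status_code >= 500, 1, 0)
| STATS total = COUNT(*), errors = SUM(is_error)
| EVAL error_rate = ROUND(100.0 * errors / total, 2)
"""


def error_rate(status_codes):
    """Mirror the query locally: percentage of responses with status >= 500."""
    total = len(status_codes)
    if total == 0:
        return None  # no traffic in the window, so there is no rate to report
    errors = sum(1 for code in status_codes if code >= 500)
    return round(100.0 * errors / total, 2)


if __name__ == "__main__":
    sample = [200, 200, 503, 200, 500, 404, 200, 200]
    print(error_rate(sample))  # 2 errors out of 8 requests -> 25.0
```

The point of the sketch is that the percentage is produced by arithmetic over counted rows, so the LLM only has to report the number, not estimate it.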

🚧 Challenges we faced

The biggest challenge was stopping the AI from guessing. Initially, the LLM tried to "read" raw logs line-by-line, which was slow and inaccurate. We had to refine the System Prompt to force the agent to rely strictly on our custom ES|QL tools for data. We learned that providing specific, hard-coded tools is much safer than open-ended prompts.
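
One way to picture the "hard-coded tools" idea is a fixed catalogue of query templates where the agent can only choose a tool name and supply validated parameters. The sketch below is an assumption about how that could be structured in plain Python; it does not use or describe the Elastic Agent Builder API, and the tool name, field names, and validation limits are hypothetical.

```python
import re

# Hypothetical tool catalogue: the agent picks a tool name and fills in
# validated parameters; it never writes query text itself.
TOOLS = {
    "error_rate_by_service": (
        "FROM logs-*\n"
        '| WHERE service.name == "{service}"\n'
        "| WHERE @timestamp > NOW() - {window_minutes} minutes\n"
        "| EVAL is_error = CASE(http.response.status_code >= 500, 1, 0)\n"
        "| STATS total = COUNT(*), errors = SUM(is_error)\n"
        "| EVAL error_rate = ROUND(100.0 * errors / total, 2)"
    ),
}

# Only short lowercase identifiers are accepted, so no quotes or pipes
# can reach the query text.
SERVICE_NAME = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")


def build_query(tool, service, window_minutes):
    """Return the fixed query for `tool`, rejecting anything unexpected."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    if not isinstance(service, str) or not SERVICE_NAME.fullmatch(service):
        raise ValueError("service must be a short lowercase identifier")
    if isinstance(window_minutes, bool) or not isinstance(window_minutes, int):
        raise ValueError("window_minutes must be an integer")
    if not 1 <= window_minutes <= 1440:
        raise ValueError("window_minutes must be between 1 and 1440")
    return TOOLS[tool].format(service=service, window_minutes=window_minutes)


if __name__ == "__main__":
    print(build_query("error_rate_by_service", "checkout", 15))
```

Because the query shape is fixed in code, a confused or manipulated model can at worst ask for a different service or window; it cannot change what is being measured.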

What we learned

  • ES|QL is a Superpower: It turns the database into a calculation engine, so every number the agent reports comes from a query result rather than from the model's guess.
  • Tools > Prompts: Giving the agent robust tools is more effective than writing long instructions.
  • Accuracy First: An SRE agent must be deterministic. If the data says the system is healthy, the agent must trust the data, not the user's panic.
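
The "Accuracy First" rule can be sketched as a verdict function whose only input is the measured value. The 1.0% threshold and the three status labels are hypothetical choices for the example, not values from the project.

```python
def assess(error_rate_pct, threshold_pct=1.0):
    """Classify health from the measured error rate alone.

    The user's message is deliberately not a parameter: the same measurement
    always yields the same verdict. The default threshold is an assumption.
    """
    if error_rate_pct is None:
        return "unknown"  # no data in the window, so health cannot be confirmed
    return "healthy" if error_rate_pct < threshold_pct else "degraded"


if __name__ == "__main__":
    print(assess(0.2))   # healthy
    print(assess(25.0))  # degraded
    print(assess(None))  # unknown
```

Keeping the threshold as an explicit argument makes the decision rule visible and reviewable instead of leaving it implicit in a prompt.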

What's next for OpsGuardian

  • True Vector Search: Upgrading the knowledge retrieval tool to fully utilize ELSER (Elastic Learned Sparse EncodeR) for zero-shot semantic understanding.
  • Active Remediation: Giving OpsGuardian the ability to execute write operations (e.g., "Block IP", "Restart Pod") via MCP (Model Context Protocol) integration with infrastructure tools.
  • Proactive Alerting: Moving from a reactive chat interface to a proactive observer that pushes alerts when statistical anomalies are detected.
