OpsGuardian: The Statistical SRE Agent
Inspiration
We were tired of midnight alerts and messy dashboards. We tried using generic AI chatbots to help, but they failed: they hallucinated numbers and guessed error rates. In SRE, a guess is dangerous.
We wanted to build an AI that doesn't guess, but calculates. An agent that combines the reasoning of an LLM with the mathematical rigor of Elasticsearch.
How we built it
We built OpsGuardian using the Elastic Agent Builder. We designed a structured "Triad of Truth" architecture:
- The Calculator (Math): We used ES|QL to perform real-time math (e.g., EVAL error_rate). The agent calculates the exact error percentage directly from the logs.
- The Historian (Patterns): We used Elastic's search capabilities to find historical logs that match current incidents.
- The Fixer (Knowledge): We used Semantic Search (RAG) to retrieve the correct Standard Operating Procedure (SOP) to fix the issue.
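As an illustration of the Calculator idea, here is a minimal Python sketch that builds an ES|QL query of the kind described above. The index pattern `logs-*`, the field `http.response.status_code`, the 15-minute window, and both function names are assumptions made for this example, not the project's actual tool definitions. The `error_rate` helper only mirrors the `EVAL` arithmetic locally so the math can be checked by hand.

```python
def build_error_rate_query(index_pattern: str = "logs-*",
                           window: str = "15 minutes") -> str:
    """Return an ES|QL query that computes the error percentage server-side.

    Index pattern, field names, and window are illustrative assumptions.
    """
    return (
        f"FROM {index_pattern}\n"
        f"| WHERE @timestamp > NOW() - {window}\n"
        "| STATS total = COUNT(*), "
        "errors = SUM(CASE(http.response.status_code >= 500, 1, 0))\n"
        "| EVAL error_rate = ROUND(100.0 * errors / total, 2)"
    )


def error_rate(total: int, errors: int) -> float:
    """Local mirror of the EVAL step: errors as a percentage of total."""
    if total == 0:
        # No traffic in the window: report 0.0 rather than dividing by zero.
        return 0.0
    return round(100.0 * errors / total, 2)


if __name__ == "__main__":
    print(build_error_rate_query())
    print(error_rate(total=200, errors=5))
```

The point of the design is that the percentage is computed by the query engine, so the LLM only has to report a number it was given.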
🚧 Challenges we faced
The biggest challenge was stopping the AI from guessing. Initially, the LLM tried to "read" raw logs line-by-line, which was slow and inaccurate. We had to refine the System Prompt to force the agent to rely strictly on our custom ES|QL tools for data. We learned that providing specific, hard-coded tools is much safer than open-ended prompts.
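The "hard-coded tools" idea can be sketched roughly as follows. This is not the Elastic Agent Builder API. It is a hypothetical Python registry (`TOOLS`, `error_rate_tool`, and `plan_query` are invented names) showing the principle: the agent may only pick a named tool and supply validated parameters, and it never writes free-form queries or reads raw logs.

```python
from typing import Callable, Dict


def error_rate_tool(service: str) -> str:
    """Fill a fixed ES|QL template; only the service name varies."""
    # Reject anything that is not a plain identifier so the template
    # cannot be altered by the model's output.
    if not service.replace("-", "").replace("_", "").isalnum():
        raise ValueError(f"invalid service name: {service!r}")
    return (
        "FROM logs-*\n"
        f'| WHERE service.name == "{service}"\n'
        "| STATS total = COUNT(*), "
        "errors = SUM(CASE(http.response.status_code >= 500, 1, 0))\n"
        "| EVAL error_rate = ROUND(100.0 * errors / total, 2)"
    )


# Hypothetical registry of the only queries the agent is allowed to run.
TOOLS: Dict[str, Callable[[str], str]] = {"error_rate": error_rate_tool}


def plan_query(tool_name: str, service: str) -> str:
    """Return the query for a registered tool; refuse anything else."""
    if tool_name not in TOOLS:
        raise KeyError(f"no such tool: {tool_name!r}")
    return TOOLS[tool_name](service)
```

With this shape, a request for an unregistered tool fails loudly instead of falling back to the model's own estimate.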
What we learned
- ES|QL is a Superpower: It turns the database into a calculation engine, so the numbers the agent reports come from query results rather than from the model's guesses.
- Tools > Prompts: Giving the agent robust tools is more effective than writing long instructions.
- Accuracy First: An SRE agent must be deterministic. If the data says the system is healthy, the agent must trust the data, not the user's panic.
What's next for OpsGuardian
- True Vector Search: Upgrading the knowledge retrieval tool to fully utilize ELSER (Elastic Learned Sparse EncodeR) for zero-shot semantic understanding.
- Active Remediation: Giving OpsGuardian the ability to execute write operations (e.g., "Block IP", "Restart Pod") via MCP (Model Context Protocol) integration with infrastructure tools.
- Proactive Alerting: Moving from a reactive chat interface to a proactive observer that pushes alerts when statistical anomalies are detected.
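One simple way the proactive alerting idea could work is a z-score check on recent error rates. The sketch below is an assumption about how "statistical anomalies" might be defined, not a planned implementation. The function name `is_anomalous` and the default threshold of 3 standard deviations are illustrative choices.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag `current` if it lies more than `z_threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        # Not enough data to estimate spread; do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

In practice the history values would come from the same kind of ES|QL error-rate calculation the agent already uses, so the alert remains a calculation rather than a judgment call.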
Built With
- elasticsearch
- esql
- json
- kibana