Inspiration

DevOps teams suffer from "alert fatigue" and the stress of 3 AM incidents. Sifting through thousands of log lines in Kibana to find a root cause is time-consuming and prone to human error. We wanted to build an agent that doesn't just "chat" but actually investigates like a human SRE would: looking for spikes, reading errors, and checking runbooks.

What it does

Elastic SRE Guardian is a web-based agent that investigates production incidents.

  1. Ingests Logs: We simulate a microservices environment sending logs to Elasticsearch.
  2. Autonomous Investigation: The user acts as an Incident Commander, asking questions such as "Why is service X failing?" The agent queries Elasticsearch to find error rate spikes and correlates them with specific log messages.
  3. Hybrid Intelligence: It uses a LangChain-based agent for flexible reasoning but also includes a Deterministic Fallback Engine. If the LLM is unavailable, hard-coded heuristics take over so the user still gets a useful response.
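
To make the "error rate spike" step concrete, here is a minimal sketch of one way a spike check could work. It is not the project's actual heuristic: the function name find_spikes, the median baseline, and the factor and min_errors parameters are all assumptions for illustration. The input is assumed to be a list of (timestamp, error_count) pairs, for example one pair per minute.

```python
from statistics import median


def find_spikes(buckets, factor=3.0, min_errors=5):
    """Return the (timestamp, count) buckets that look like error spikes.

    Hypothetical heuristic: a bucket is a spike when its error count is
    at least `min_errors` and more than `factor` times the median count.
    """
    counts = [count for _, count in buckets]
    if not counts:
        return []
    # The median is less sensitive to the spike itself than the mean.
    # Clamp to 1 so a quiet service (median 0) does not flag every bucket.
    baseline = max(median(counts), 1)
    return [
        (ts, count)
        for ts, count in buckets
        if count >= min_errors and count > factor * baseline
    ]


if __name__ == "__main__":
    # Made-up per-minute error counts for a single service.
    sample = [("10:00", 2), ("10:01", 3), ("10:02", 2), ("10:03", 40), ("10:04", 3)]
    print(find_spikes(sample))
```

A check like this needs no LLM, so the same function could serve both as an agent tool and as part of a deterministic fallback.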

How we built it

  • Backend: Python Flask.
  • Data: Elasticsearch (using the Python client) storing structured JSON logs.
  • AI: LangChain (OpenAI) for the reasoning loop.
  • Frontend: Vanilla HTML/JS for a lightweight, fast UI.
  • Resilience: We wrap the agent call in a try/except block that switches from the AI Agent to the Deterministic Engine if API calls fail, so the app keeps answering even when the LLM is unreachable.
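
The resilience pattern above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: investigate, run_agent, and run_rules are hypothetical names, and the stubs stand in for the LangChain agent and the rule-based engine.

```python
def investigate(question, agent, fallback):
    """Try the LLM agent first; use the rule-based engine if it fails.

    `agent` and `fallback` are hypothetical callables that take the
    user's question and return an answer string.
    """
    try:
        return {"source": "agent", "answer": agent(question)}
    except Exception as exc:
        # Broad catch on purpose in this sketch: a missing API key,
        # a timeout, or a rate limit should all trigger the fallback.
        return {
            "source": "fallback",
            "answer": fallback(question),
            "reason": str(exc),
        }


def run_agent(question):
    # Stub that simulates an unavailable LLM API.
    raise RuntimeError("LLM API unavailable")


def run_rules(question):
    # Stub for the deterministic engine.
    return "Rule-based summary for: " + question


if __name__ == "__main__":
    print(investigate("Why is service X failing?", run_agent, run_rules))
```

Returning the source alongside the answer lets the UI tell the user which engine produced the response.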

Challenges we ran into

Integrating the "Agentic" workflow with the strict structure of Elasticsearch queries was tricky. LLMs often hallucinate query syntax. We solved this by creating strict Tools (search_logs, get_spikes) that abstract the complex JSON DSL away from the LLM, letting it focus on the "why" rather than the "how".
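
As a rough sketch of this idea, a tool can accept a few simple arguments and build the query body itself, so the LLM never writes DSL. The function name build_search_logs_query, its parameters, and the field names service, level, and @timestamp are assumptions for illustration; the term filters also assume those fields are mapped as keywords.

```python
def build_search_logs_query(service, level="ERROR", minutes=15, size=20):
    """Build an Elasticsearch search body from plain arguments.

    Hypothetical helper behind a `search_logs` tool: the LLM supplies
    only a service name and a time window, and the JSON is built here.
    """
    if minutes <= 0 or size <= 0:
        raise ValueError("minutes and size must be positive")
    return {
        "size": size,
        # Newest log lines first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    # Date math: everything from the last `minutes` minutes.
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_search_logs_query("checkout", minutes=30), indent=2))
```

The resulting dict would then be passed to the Elasticsearch client by the tool. Validating the arguments in one place also means a malformed request from the LLM fails early with a clear error instead of producing an invalid query.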

Accomplishments that we're proud of

The Fallback Engine. Most AI hackathon projects break if the API key fails. Ours degrades gracefully into a powerful, rule-based log analyzer. This "Hybrid AI" approach is the future of reliable software.

What's next for Elastic SRE Guardian

  • Auto-Remediation: Giving the agent write access to Kubernetes so it can restart pods.
  • RAG for Runbooks: Indexing our internal Confluence pages so the agent can cite specific company policies for fixing issues.
  • Slack Bot: Bringing the agent directly into the war-room channel.