ElasticSeer: The Autonomous SRE

Inspiration

Modern cloud architectures move faster than human SREs can type. We noticed a critical gap: traditional observability platforms tell you what is broken, but leave the "Why" and "How to fix it" to burnt-out engineers.

We built ElasticSeer to bridge the gap between raw telemetry and autonomous action. We wanted a platform that doesn't just watch your system—it actively repairs it.

What it does

ElasticSeer is an Autonomous AI SRE that monitors your production stack 24/7. It handles the entire incident lifecycle in four phases:

  • Observe: The Observer Engine continuously scans Elasticsearch indices and flags any metric that drifts more than three standard deviations (3-sigma) from its recent baseline.
  • Analyze: When a spike occurs, Gemini 1.5 Flash (via Agent Builder) performs a "Rich Analysis," correlating logs, metrics, and traces to find the root cause.
  • Remediate: The agent identifies the bug in the codebase, creates a GitHub PR with the patch, and notifies the team via Slack and Jira—all in under 60 seconds.
  • Visualize: A stunning "Command Center" dashboard shows live reasoning traces, KPI counters, and the system's "thought process" in real-time.
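The Observe phase above boils down to a rolling 3-sigma check. A minimal sketch (the real Observer Engine pulls its window from Elasticsearch; the window size and sample values here are hypothetical):

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, sigma: float = 3.0) -> bool:
    """Flag `latest` if it deviates more than `sigma` standard
    deviations from the mean of the recent history window."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return latest != mu  # flat baseline: any change is a spike
    return abs(latest - mu) > sigma * sd

# e.g. p95 latency samples (ms) pulled from a metrics index
window = [120, 118, 125, 122, 119, 121, 124, 120]
print(is_anomalous(window, 450))  # latency spike -> True
print(is_anomalous(window, 123))  # normal reading -> False
```

The same predicate works for error rates, queue depths, or any scalar metric the engine watches.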

How we built it

  • Intelligence: Gemini 1.5 Flash orchestrated via the Model Context Protocol (MCP), allowing the AI to query internal Elasticsearch data securely through the Elastic Agent Builder.
  • Data Engine: Elasticsearch Serverless for high-performance log storage, service metrics, and incident history.
  • The Core: A FastAPI (Python) backend handling persistent monitoring loops and autonomous multi-agent workflows.
  • The UI: A React/Vite/Tailwind frontend featuring "Elastic Aesthetics"—glassmorphism, animated data flows, and a "Reasoning Trace" feed.
  • Infrastructure: A distributed setup with the backend on Vultr (VPS) and the frontend on Vercel.
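The "persistent monitoring loop" at the core can be sketched as a plain asyncio poller (a minimal sketch with an in-memory stub standing in for the Elasticsearch query; the function names, threshold, and service names are hypothetical):

```python
import asyncio

async def fetch_latest_metric(service: str) -> float:
    """Stub for an Elasticsearch query; the real engine would search a
    metrics index for the service's most recent data point."""
    return 120.0

async def observer_loop(services: list[str], interval: float, cycles: int) -> list[str]:
    """Poll each service every `interval` seconds and collect alerts."""
    alerts = []
    for _ in range(cycles):  # the real loop runs forever (while True)
        for svc in services:
            value = await fetch_latest_metric(svc)
            if value > 400:  # fixed threshold stands in for the 3-sigma check
                alerts.append(f"{svc}: anomaly at {value}")
        await asyncio.sleep(interval)
    return alerts

alerts = asyncio.run(observer_loop(["checkout", "payments"], interval=0.01, cycles=2))
print(alerts)  # stub always returns a normal value -> []
```

In the real backend this loop runs as a FastAPI background task, and an alert hands off to the Analyze phase instead of appending to a list.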

Challenges we ran into

  • Mixed Content Hurdles: Our cross-cloud architecture (a Vercel HTTPS frontend calling a Vultr HTTP VPS) triggered the browser's mixed-content blocking. We solved this with a server-side proxy (Vercel Rewrites), so the browser only ever talks HTTPS.
  • Reasoning at Scale: Making complex AI "thoughts" feel real-time in the UI required a dedicated Reasoning-Trace architecture that streams each step over Server-Sent Events (SSE), which also keeps long-running analyses from hitting API timeouts.
  • Data Consistency: Aligning complex Pydantic data models between the AI's reasoning engine and the React UI to ensure zero-crash reliability during high-pressure incident simulations.
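The SSE-based reasoning trace reduces to framing each step in the `text/event-stream` wire format as it happens (a sketch of the framing only; the real backend serves this from FastAPI as a streaming response, and the event payloads below are made up):

```python
import json

def sse_event(data: dict, event: str = "reasoning") -> str:
    """Frame one reasoning step as a Server-Sent Events message:
    an `event:` line, a `data:` line, and a blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def reasoning_stream(steps):
    """Generator the HTTP layer iterates to flush steps as they happen,
    instead of holding the connection for one big response."""
    for step in steps:
        yield sse_event(step)

chunks = list(reasoning_stream([
    {"phase": "analyze", "thought": "error rate correlates with latest deploy"},
    {"phase": "remediate", "thought": "opening PR with rollback patch"},
]))
print(chunks[0])
```

Because each event flushes immediately, the dashboard renders thoughts the moment they are produced rather than after the full analysis completes.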

Accomplishments we're proud of

  • One-Prompt Remediation: Achieving a full "One Prompt to PR" flow where the AI fixes real production code based on live telemetry.
  • The Command Center: Building a premium, "wow-factor" dashboard that feels like a professional enterprise SaaS product.
  • ES|QL Mastery: Implementing complex ES|QL queries through the MCP tools to perform cross-index correlation that would normally take hours of manual filtering.
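An illustrative ES|QL query of the kind described (index pattern and field names are hypothetical; the real queries run through the MCP tools):

```esql
FROM logs-*
| WHERE @timestamp > NOW() - 1 hour AND log.level == "error"
| STATS error_count = COUNT(*) BY service.name
| SORT error_count DESC
| LIMIT 5
```

One pipeline ranks the noisiest services across every matching index, replacing the manual filter-and-compare loop.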

What we learned

  • The Value of MCP: Grounding an LLM in real-world observability data via the Model Context Protocol is a game-changer for agent reliability.
  • Agentic UX: We learned that a "Black Box" AI can be unsettling. Showing the AI's "Reasoning Trace" builds user trust and makes the autonomous experience feel magical rather than mysterious.

What's next for ElasticSeer

  • Historical Learning: Implementing vector-based "Similar Fix" retrieval using Elasticsearch Vector Database to let the agent learn from previous incidents.
  • Human-in-the-Loop 2.0: Expanding Slack interactivity to allow engineers to approve or edit AI patches directly via interactive Slack buttons.
  • Multi-Cloud Discovery: Extending the Observer Engine to auto-discover and monitor resources across AWS, GCP, and Azure.
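The planned "Similar Fix" retrieval could look roughly like this, using Elasticsearch's kNN search over incident embeddings (a sketch only; the index mapping, field names, and returned fields are hypothetical):

```python
def similar_fix_query(incident_embedding: list[float], k: int = 3) -> dict:
    """Build an Elasticsearch kNN search body that finds past incidents
    whose embedding is closest to the current one."""
    return {
        "knn": {
            "field": "incident_vector",        # dense_vector field on the index
            "query_vector": incident_embedding,
            "k": k,
            "num_candidates": 50,              # candidates scanned per shard
        },
        "_source": ["summary", "fix_pr_url"],  # what the agent needs back
    }

body = similar_fix_query([0.1, 0.2, 0.3])
print(body["knn"]["k"])  # -> 3
```

The agent would embed the new incident's summary, run this search, and feed the top matches into its remediation prompt as prior art.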
