Inspiration

Every SRE knows the 3 AM nightmare: production is down, customers are affected, and you're frantically scrolling through thousands of logs in Kibana trying to find what broke. Then you search Git history, manually correlate errors to recent commits, create a revert PR, and notify the team. This process takes 3-4 hours of stressful manual work.

I wanted to build an agent that could do all of this automatically - detect the problem, find the root cause, fix it, and notify the team - all from a single prompt.

What it does

The Elastic SRE Agent automates the complete incident response lifecycle:

DETECT - Uses ES|QL to query application logs and identify error spikes by service

ANALYZE - Uses semantic search to match error messages to recent commits, even when they use completely different words

FIX - Creates GitHub issues and triggers GitHub Actions via Elastic Workflows to automatically create a revert PR

NOTIFY - Sends Slack alerts with full incident context to the on-call team

Result: What used to take 3-4 hours now takes under 3 minutes.
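As a rough sketch of the DETECT step, an ES|QL query for error spikes might look like the one built below. The field names (log.level, service.name, @timestamp), the 15-minute window, and the helper function names are assumptions for illustration; the project's actual schema and query are not shown in this write-up.

```python
import json


def build_error_spike_query(index="application-logs", window_minutes=15, limit=10):
    """Return an ES|QL query that counts recent ERROR logs per service.

    Field names (log.level, service.name, @timestamp) are assumed here.
    """
    return (
        f"FROM {index} "
        f'| WHERE log.level == "ERROR" AND @timestamp > NOW() - {window_minutes} minutes '
        "| STATS error_count = COUNT(*) BY service.name "
        "| SORT error_count DESC "
        f"| LIMIT {limit}"
    )


def build_esql_request_body(query):
    """Wrap the query in the JSON body sent to the ES|QL endpoint (POST /_query)."""
    return json.dumps({"query": query})


def services_over_threshold(rows, threshold):
    """Keep only services whose error count exceeds the threshold.

    Each row is assumed to be [error_count, service_name].
    """
    return [service for count, service in rows if count > threshold]


if __name__ == "__main__":
    query = build_error_spike_query()
    print(query)
    print(build_esql_request_body(query))
```

The threshold filter is a deliberately simple stand-in for spike detection; comparing against a baseline window would be a natural refinement.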

How we built it

No code required! I built this entirely using Elastic's Agent Builder platform:

Created two Elasticsearch indices with semantic_text fields:
  • application-logs - stores error logs from services
  • github-commits - stores commit messages and metadata from our repo

Built custom agent tools:
  • ES|QL tool for error detection queries
  • Index search tool with semantic similarity for root cause analysis
  • Workflow tools for Slack and GitHub integration

Created Elastic Workflows that make HTTP calls to:
  • Slack webhooks for notifications
  • GitHub API to create Issues and trigger Actions

Set up a GitHub Action that runs git revert and creates PRs when triggered by the workflow
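To make the HTTP side of this concrete, here is a minimal Python sketch of the three requests the workflows need to send. It is not the Elastic Workflow definition itself. It assumes the GitHub Action is triggered through a repository_dispatch event (the write-up does not say which trigger is used), and the event type "revert-commit", the payload keys, and the function names are all hypothetical.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def _json_post(url, payload, headers=None):
    """Build (but do not send) a JSON POST request."""
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )


def _github_headers(token):
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }


def slack_alert_request(webhook_url, service, error_count, commit_sha):
    """Slack incoming webhooks accept a JSON body with a 'text' field."""
    text = f"Incident in {service}: {error_count} errors, suspect commit {commit_sha}"
    return _json_post(webhook_url, {"text": text})


def github_issue_request(owner, repo, token, title, body):
    """Create an issue via POST /repos/{owner}/{repo}/issues."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues"
    return _json_post(url, {"title": title, "body": body}, _github_headers(token))


def github_dispatch_request(owner, repo, token, commit_sha):
    """Trigger an Action via repository_dispatch (assumed trigger type)."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/dispatches"
    # "revert-commit" is a hypothetical event type the Action would listen for.
    payload = {"event_type": "revert-commit", "client_payload": {"sha": commit_sha}}
    return _json_post(url, payload, _github_headers(token))
```

Each function only builds the request; passing the result to urllib.request.urlopen would actually send it.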

Challenges we ran into

GitHub PR Creation: Creating a true revert PR requires actual git operations (not just API calls). I solved this by having the Elastic Workflow trigger a GitHub Action that runs git revert - elegantly bridging the gap between HTTP APIs and git commands.
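The git side of that bridge can be sketched as the command sequence the Action would run. This is an illustration in Python rather than the actual workflow file; the branch naming scheme, the use of the gh CLI to open the PR, and the function names are assumptions.

```python
import subprocess


def revert_pr_commands(commit_sha, base_branch="main"):
    """Return the commands that revert a commit on a new branch and open a PR."""
    short = commit_sha[:7]
    branch = f"revert-{short}"  # hypothetical branch naming scheme
    return [
        ["git", "checkout", "-b", branch, f"origin/{base_branch}"],
        ["git", "revert", "--no-edit", commit_sha],
        ["git", "push", "origin", branch],
        [
            "gh", "pr", "create",
            "--base", base_branch,
            "--head", branch,
            "--title", f"Revert {short}",
            "--body", f"Automated revert of {commit_sha} triggered by the SRE agent.",
        ],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in revert_pr_commands("0123456789abcdef"):
        print(" ".join(cmd))
```

One caveat: reverting a merge commit also requires git revert's -m option to choose the parent, which this sketch does not handle.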

Workflow Data Flow: Learning the correct Liquid templating syntax (steps.<name>.output.data) for passing data between workflow steps took some debugging. The Elastic execution logs were invaluable for understanding the actual response structure.

Accomplishments that we're proud of

The semantic search magic: The agent matched "NullPointerException in PaymentProcessor" to a commit saying "Removed null safety checks" - these share ZERO keywords but the agent understood they're related. This is the power of meaning-based search!
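For readers curious what sits behind this, here is a minimal sketch of an index mapping with a semantic_text field and a semantic query against it, written as the JSON bodies Elasticsearch would receive. Every field name here (message, sha, author, committed_at) is an assumption, since the project's real mapping is not shown. Depending on the Elasticsearch version, the semantic_text field may also need an explicit inference_id.

```python
def commits_index_mapping():
    """Mapping body for a github-commits style index (field names assumed)."""
    return {
        "mappings": {
            "properties": {
                # semantic_text fields are embedded at index time by an inference endpoint.
                "message": {"type": "semantic_text"},
                "sha": {"type": "keyword"},
                "author": {"type": "keyword"},
                "committed_at": {"type": "date"},
            }
        }
    }


def semantic_commit_query(error_message, size=5):
    """Search body that ranks commits by meaning rather than shared keywords."""
    return {
        "size": size,
        "query": {
            "semantic": {
                "field": "message",
                "query": error_message,
            }
        },
    }


if __name__ == "__main__":
    print(semantic_commit_query("NullPointerException in PaymentProcessor"))
```

Because the query text is embedded with the same model as the indexed commit messages, a match does not depend on the two sharing any literal terms.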

End-to-end automation: The agent doesn't just find problems - it fixes them. Automatically creating a GitHub PR from a chat prompt feels like the future of SRE.

What we learned

  • How to use semantic_text fields for meaning-based search
  • ES|QL for powerful log aggregation queries
  • Elastic Workflows for orchestrating external APIs like Slack and GitHub
  • How to bridge workflows with GitHub Actions for git operations
  • The importance of good index design for agent capabilities

What's next for Elastic SRE Agent: The Self-Healing Incident Commander

-Multi-repo support: Detect which repository caused the issue based on service mapping

-PagerDuty integration: Automatic incident creation and escalation

-Rollback verification: Agent monitors if the revert actually fixed the error rate

-Learning from incidents: Store incident patterns to predict future issues

-MCP (Model Context Protocol) support: Let developers query the agent directly from VS Code.

-Expanding the knowledge base: Include Jira tickets and Confluence docs for better context.

Built With

  • agent-builder
  • elastic-workflows
  • elasticsearch
  • elser
  • esql
  • github-actions
  • github-issues
  • slack-api
  • vector-search