Inspiration
Every SRE knows the 3 AM nightmare: production is down, customers are affected, and you're frantically scrolling through thousands of logs in Kibana trying to find what broke. Then you search Git history, manually correlate errors with recent commits, create a revert PR, and notify the team. This process takes 3-4 hours of stressful manual work.
I wanted to build an agent that could do all of this automatically - detect the problem, find the root cause, fix it, and notify the team - all from a single prompt.
What it does
The Elastic SRE Agent automates the complete incident response lifecycle:
DETECT - Uses ES|QL to query application logs and identify error spikes by service
ANALYZE - Uses semantic search to match error messages to recent commits, even when they use completely different words
FIX - Creates GitHub issues and triggers GitHub Actions via Elastic Workflows to automatically create a revert PR
NOTIFY - Sends Slack alerts with full incident context to the on-call team
Result: What used to take 3-4 hours now takes under 3 minutes.
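To make the DETECT step concrete, here is a minimal sketch of the aggregation it performs, written in plain Python over in-memory records. The field names `service` and `level`, the threshold, and the sample data are all assumptions for illustration; the ES|QL string shows roughly what an equivalent query could look like, not the exact query the agent runs.

```python
from collections import Counter

# Hypothetical ES|QL equivalent of the function below.
# The field names `level` and `service` are assumptions, not the real mapping.
ESQL_SKETCH = """
FROM application-logs
| WHERE level == "ERROR"
| STATS error_count = COUNT(*) BY service
| SORT error_count DESC
"""


def error_spikes(logs, threshold):
    """Return (service, error_count) pairs at or above threshold, highest first."""
    counts = Counter(
        entry["service"] for entry in logs if entry["level"] == "ERROR"
    )
    return [(svc, n) for svc, n in counts.most_common() if n >= threshold]


# Made-up sample records, for illustration only.
sample_logs = [
    {"service": "payments", "level": "ERROR"},
    {"service": "payments", "level": "ERROR"},
    {"service": "payments", "level": "ERROR"},
    {"service": "checkout", "level": "ERROR"},
    {"service": "checkout", "level": "INFO"},
]

print(error_spikes(sample_logs, threshold=2))  # [('payments', 3)]
```

In the real setup the counting happens inside Elasticsearch; the Python version only mirrors the logic so the shape of the result is easy to see.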
How we built it
No code required! I built this entirely using Elastic's Agent Builder platform:
Created two Elasticsearch indices with semantic_text fields:
- application-logs - stores error logs from services
- github-commits - stores commit messages and metadata from our repo
Built custom agent tools:
- ES|QL tool for error detection queries
- Index search tool with semantic similarity for root cause analysis
- Workflow tools for Slack and GitHub integration
Created Elastic Workflows that make HTTP calls to:
- Slack webhooks for notifications
- GitHub API to create Issues and trigger Actions
Set up a GitHub Action that runs git revert and creates PRs when triggered by the workflow
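The HTTP calls made by the workflows can be sketched in Python as follows. This is an illustration under assumptions, not the workflow definition itself: it assumes the Action is triggered through GitHub's repository dispatch endpoint, and the event type `revert-commit`, the `sha` payload key, and the message wording are hypothetical names chosen for the example. The requests are only built here, not sent.

```python
import json
import urllib.request


def slack_alert_request(webhook_url, service, error_count, commit_sha):
    """Build (but do not send) a POST to a Slack incoming webhook."""
    body = {
        "text": f"Incident: {service} logged {error_count} errors; "
                f"suspect commit {commit_sha}"
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def revert_dispatch_request(owner, repo, token, commit_sha):
    """Build (but do not send) a repository dispatch that a GitHub Action can listen for."""
    url = f"https://api.github.com/repos/{owner}/{repo}/dispatches"
    body = {
        "event_type": "revert-commit",          # hypothetical event name
        "client_payload": {"sha": commit_sha},  # hypothetical payload key
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = revert_dispatch_request("my-org", "my-repo", "TOKEN", "abc1234")
print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payloads easy to inspect before anything touches Slack or GitHub.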
Challenges we ran into
GitHub PR Creation: Creating a true revert PR requires actual git operations (not just API calls). I solved this by having the Elastic Workflow trigger a GitHub Action that runs git revert - elegantly bridging the gap between HTTP APIs and git commands.
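The git side of that bridge can be sketched as a short command sequence. The Python below is only an illustration of the steps a runner might execute; the branch naming scheme and the `origin` remote are assumptions, and opening the PR itself is left out.

```python
import subprocess


def revert_commands(commit_sha, remote="origin"):
    """Return the git commands for reverting one commit on a fresh branch."""
    branch = f"revert-{commit_sha[:7]}"  # hypothetical branch naming scheme
    return [
        ["git", "checkout", "-b", branch],
        ["git", "revert", "--no-edit", commit_sha],
        ["git", "push", remote, branch],
    ]


def run_revert(commit_sha, cwd="."):
    """Run the commands in order, stopping at the first failure."""
    for cmd in revert_commands(commit_sha):
        subprocess.run(cmd, cwd=cwd, check=True)


for cmd in revert_commands("0123456789abcdef"):
    print(" ".join(cmd))
```

Reverting a merge commit would also need a parent number via `git revert -m`, which this sketch does not handle.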
Workflow Data Flow: Learning the correct Liquid templating syntax (steps.<name>.output.data) for passing data between workflow steps took some debugging. The Elastic execution logs were invaluable for understanding the actual response structure.
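For intuition, a dotted path like steps.<name>.output.data can be read as a walk through nested objects. The sketch below is a plain-Python illustration of that idea, not how Elastic Workflows or Liquid resolve templates; the step name `create_issue` and the response fields are hypothetical.

```python
def resolve_path(context, path):
    """Follow a dotted path through nested dicts, raising KeyError if a key is missing."""
    value = context
    for key in path.split("."):
        value = value[key]
    return value


# Hypothetical step output, shaped like an HTTP step that returns a JSON body.
context = {
    "steps": {
        "create_issue": {
            "output": {
                "data": {"number": 42, "title": "Error spike in payments"},
            }
        }
    }
}

issue = resolve_path(context, "steps.create_issue.output.data")
print(issue["number"], issue["title"])
```

A missing key fails loudly here, which matches the debugging experience: checking the execution logs for the real response structure tells you which path actually exists.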
Accomplishments that we're proud of
The semantic search magic: The agent matched "NullPointerException in PaymentProcessor" to a commit saying "Removed null safety checks" - these share ZERO keywords but the agent understood they're related. This is the power of meaning-based search!
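A quick way to see why plain keyword matching misses this pair is to compare their tokens. The tokenizer below is a deliberately simple assumption (lowercase, split on non-alphanumerics), not the analyzer Elasticsearch uses, but it shows that the two strings have no whole word in common.

```python
import re


def keywords(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


error = "NullPointerException in PaymentProcessor"
commit = "Removed null safety checks"

print(sorted(keywords(error)))   # ['in', 'nullpointerexception', 'paymentprocessor']
print(sorted(keywords(commit)))  # ['checks', 'null', 'removed', 'safety']
print(keywords(error) & keywords(commit))  # set()
```

The word "null" appears inside "NullPointerException" but never as a token of its own, so a term-based match scores zero, while a meaning-based search can still connect the two.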
End-to-end automation: The agent doesn't just find problems - it fixes them. Automatically creating a GitHub PR from a chat prompt feels like the future of SRE.
What we learned
- How to use semantic_text fields for meaning-based search
- ES|QL for powerful log aggregation queries
- Elastic Workflows for orchestrating external APIs like Slack and GitHub
- How to bridge workflows with GitHub Actions for git operations
- The importance of good index design for agent capabilities
What's next for Elastic SRE Agent: The Self-Healing Incident Commander
- Multi-repo support: Detect which repository caused the issue based on service mapping
- PagerDuty integration: Automatic incident creation and escalation
- Rollback verification: Agent monitors whether the revert actually fixed the error rate
- Learning from incidents: Store incident patterns to predict future issues
- MCP (Model Context Protocol) support: So developers can query the agent directly from VS Code
- Expanding the knowledge base: Include Jira tickets and Confluence docs for better context
Built With
- agent-builder
- elastic-workflows
- elasticsearch
- elser
- esql
- github-actions
- github-issues
- slack-api
- vector-search