Inspiration
Manual incident triage is the ultimate "SRE Toil." When a service fails, engineers waste the first 30 minutes jumping between dashboards and GitHub PR histories to figure out "What changed?" We built Argus-SRE to automate this entire investigative phase, acting as a tireless 24/7 triage partner.
What it does
Argus-SRE is a multi-agent system that reacts to service failures:
- Triage Agent: Uses ES|QL to scan logs and identify the specific error signature (e.g., a NullPointerException).
- Specialist Agent: Takes that signature and correlates it with service metadata and recent GitHub Pull Requests.
- The "Smoking Gun": It identifies the exact PR that likely caused the failure, tags the developer responsible, and provides the relevant Runbook for the fix.
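The handoff between the two agents can be illustrated with a minimal in-memory sketch. This is not the Agent Builder configuration itself: the function names, the field names (`service`, `error_type`, `merged_at`), the 24-hour window, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def triage(logs):
    """Triage step: return the most frequent (service, error_type) among ERROR logs."""
    errors = [
        (log["service"], log["error_type"])
        for log in logs
        if log["level"] == "ERROR"
    ]
    if not errors:
        return None
    return Counter(errors).most_common(1)[0][0]


def find_suspect_pr(service, failure_time, prs, window=timedelta(hours=24)):
    """Specialist step: latest PR merged for the service within `window` before the failure."""
    candidates = [
        pr for pr in prs
        if pr["service"] == service
        and failure_time - window <= pr["merged_at"] <= failure_time
    ]
    return max(candidates, key=lambda pr: pr["merged_at"], default=None)


# Hypothetical sample data standing in for the log and PR indices.
failure_time = datetime(2024, 1, 10, 12, 0)
logs = [
    {"service": "checkout", "level": "ERROR", "error_type": "NullPointerException"},
    {"service": "checkout", "level": "ERROR", "error_type": "NullPointerException"},
    {"service": "search", "level": "ERROR", "error_type": "TimeoutException"},
    {"service": "checkout", "level": "INFO", "error_type": None},
]
prs = [
    {"service": "checkout", "number": 101, "author": "alice",
     "merged_at": datetime(2024, 1, 10, 9, 0)},
    {"service": "checkout", "number": 97, "author": "bob",
     "merged_at": datetime(2024, 1, 8, 9, 0)},
    {"service": "search", "number": 102, "author": "carol",
     "merged_at": datetime(2024, 1, 10, 11, 0)},
]

service, signature = triage(logs)
suspect = find_suspect_pr(service, failure_time, prs)
print(f"{signature} in {service}: suspect PR #{suspect['number']} by {suspect['author']}")
```

Picking the most recently merged PR is the simplest possible heuristic; the actual Specialist agent reasons over the candidates rather than just taking the latest one.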
How we built it
- Elastic Agent Builder & Workflows: This is the "brain" of Argus-SRE. We used the Agent Builder to define specialized Triage and Specialist agents, orchestrating a "Chain of Thought" workflow.
- ES|QL: Served as our high-speed correlation engine for cross-index lookups, linking real-time error logs with our custom GitHub PR Index.
- Elasticsearch Alerts: Configured to detect service failures (like 500 errors), acting as the "starting gun" for the investigation.
- Slack Integration: Pushes an actionable report, complete with the "Smoking Gun" PR link, directly to the SRE channel.
- Python: Built the ingestion pipeline to index GitHub deployment data and simulate failure patterns.
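As a rough sketch of the ingestion step, the snippet below maps a GitHub pull request payload to a flat document and serializes a batch in the newline-delimited format that the Elasticsearch bulk API accepts. The input keys (`number`, `title`, `html_url`, `merged_at`, `user.login`) follow the GitHub REST API; the index name `github-prs`, the output field names, and the `service` tag are assumptions, not the project's actual schema.

```python
import json


def pr_to_doc(pr, service):
    """Flatten a GitHub PR payload into a document for the (hypothetical) PR index."""
    return {
        "service": service,
        "pr_number": pr["number"],
        "title": pr["title"],
        "author": pr["user"]["login"],
        "url": pr["html_url"],
        "merged_at": pr["merged_at"],
    }


def to_bulk_body(docs, index="github-prs"):
    """Serialize docs as NDJSON: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        doc_id = f"{doc['service']}-{doc['pr_number']}"
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical PR payload, trimmed to the fields used above.
sample_pr = {
    "number": 101,
    "title": "Refactor cart total calculation",
    "html_url": "https://github.com/example-org/checkout/pull/101",
    "merged_at": "2024-01-10T09:00:00Z",
    "user": {"login": "alice"},
}

doc = pr_to_doc(sample_pr, service="checkout")
bulk_body = to_bulk_body([doc])
```

Using a deterministic `_id` built from service and PR number means re-running the ingestion overwrites existing documents instead of duplicating them.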
Challenges we ran into
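Balancing accuracy against context size was tricky. In our demo, the PR lookup uses LIMIT 3 so the agent only has to reason over three candidate PRs. A fixed limit can miss the culprit when many PRs merge between the bad deploy and the alert, so the design is meant to carry over to production: swapping that limit for a time-based ES|QL filter lets the lookup cover high deployment volumes.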
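The difference between the two lookups can be sketched as a small query builder. The index name `github-prs` and the fields `service` and `merged_at` are assumptions; the queries in the actual project may differ.

```python
def build_pr_lookup(service, hours=None, limit=3, index="github-prs"):
    """Build an ES|QL query string for candidate PRs of one service.

    With hours=None the query keeps the demo behaviour (a fixed LIMIT).
    With hours set, the limit is replaced by a time window before now.
    """
    # Interpolation is for illustration only; a real pipeline should pass
    # the service name as a query parameter instead.
    parts = [f"FROM {index}", f'WHERE service == "{service}"']
    if hours is not None:
        parts.append(f"WHERE merged_at >= NOW() - {hours} hours")
    parts.append("SORT merged_at DESC")
    if hours is None:
        parts.append(f"LIMIT {limit}")
    return " | ".join(parts)


demo_query = build_pr_lookup("checkout")
windowed_query = build_pr_lookup("checkout", hours=24)
print(demo_query)
print(windowed_query)
```

The windowed form returns every PR merged in the last 24 hours for the service, so the number of candidates grows with deployment volume and the agent's context budget becomes the thing to watch.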
Accomplishments that we're proud of
We successfully created a seamless "handoff" where one AI agent identifies the problem and the second finds the person and code responsible. It transforms a chaotic wall of logs into a single actionable name.
What we learned
We mastered using ES|QL for complex correlation across logs and metadata. We also gained deep insight into using Elastic Agent Builder to automate SRE logic with precision.
What's next for Argus-SRE
- CI/CD Integration with Governance: Integrating with CI/CD pipelines to trigger automated rollbacks.
- Human-in-the-Loop Workflows: Implementing an "Approve/Reject" flow in Slack where an SRE can trigger a rollback or open a Jira ticket with a single click after reviewing the AI's findings.
- Multi-Provider Support: Expanding beyond GitHub to support GitLab and Bitbucket.
Built With
- agent-builder
- elastic-workflow
- elasticsearch
- es|ql
- python
- slack