Inspiration

Manual incident triage is the ultimate "SRE Toil." When a service fails, engineers often spend the first 30 minutes jumping between dashboards and GitHub PR histories to figure out "What changed?" We built Argus-SRE to automate this investigative phase, acting as a tireless 24/7 triage partner.

What it does

Argus-SRE is a multi-agent system that reacts to service failures:

  • Triage Agent: Uses ES|QL to scan logs and identify the specific error signature (e.g., a NullPointerException).
  • Specialist Agent: Takes that signature and correlates it with service metadata and recent GitHub Pull Requests.
  • The "Smoking Gun": The final report names the PR that most likely caused the failure, tags the developer who authored it, and links the relevant Runbook for the fix.
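
The handoff between the two agents can be sketched in plain Python. This is only an illustration of the idea, not the Agent Builder implementation: the LogEvent and PullRequest types, the field names, and the heuristic itself (pick the most frequent error, then the latest PR merged into that service before the error first appeared) are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogEvent:  # hypothetical shape of one error log record
    timestamp: datetime
    service: str
    error_type: str  # e.g. "NullPointerException"


@dataclass
class PullRequest:  # hypothetical shape of one indexed PR
    number: int
    service: str
    author: str
    merged_at: datetime


def triage(events: list[LogEvent]) -> tuple[str, str, datetime]:
    """Triage step: return (service, error_type, first_seen) for the most frequent error."""
    counts = Counter((e.service, e.error_type) for e in events)
    (service, error_type), _ = counts.most_common(1)[0]
    first_seen = min(
        e.timestamp
        for e in events
        if e.service == service and e.error_type == error_type
    )
    return service, error_type, first_seen


def find_suspect(
    prs: list[PullRequest], service: str, first_seen: datetime
) -> Optional[PullRequest]:
    """Specialist step (assumed heuristic): latest PR merged into the service before the error began."""
    candidates = [
        p for p in prs if p.service == service and p.merged_at <= first_seen
    ]
    return max(candidates, key=lambda p: p.merged_at, default=None)
```

In this sketch the only data passed between the two steps is the service name and the time the error first appeared, which keeps the second step's search small.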

How we built it

  • Elastic Agent Builder & Workflows: This is the "brain" of Argus-SRE. We used the Agent Builder to define specialized Triage and Specialist agents, orchestrating a "Chain of Thought" workflow.
  • ES|QL: Served as our high-speed correlation engine for cross-index lookups, linking real-time error logs with our custom GitHub PR Index.
  • Elasticsearch Alerts: Configured to detect service failures (like 500 errors), acting as the "starting gun" for the investigation.
  • Slack Integration: Pushes an actionable report—complete with the "Smoking Gun" PR link—directly to the SRE channel.
  • Python: Built the ingestion pipeline to index GitHub deployment data and simulate failure patterns.
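
As a rough sketch of what the ingestion step might look like, the snippet below flattens a GitHub pull request payload into a document for the PR index. The input keys (number, title, html_url, merged_at, user.login) are fields of the GitHub REST API pull request object; the index name github_prs, the output field names, and the service tag are assumptions for illustration. The yielded dictionaries use the _index / _id / _source action shape accepted by the Elasticsearch Python client's bulk helper, but no client call is made here.

```python
from typing import Any, Iterable, Iterator

PR_INDEX = "github_prs"  # hypothetical index name


def pr_to_doc(pr: dict[str, Any], service: str) -> dict[str, Any]:
    """Flatten a GitHub PR payload into a document that is easy to correlate on."""
    return {
        "pr_number": pr["number"],
        "title": pr["title"],
        "url": pr["html_url"],
        "author": pr["user"]["login"],
        "merged_at": pr["merged_at"],  # ISO 8601 string from GitHub
        "service": service,  # assumed tag linking the PR to a service
    }


def bulk_actions(
    prs: Iterable[dict[str, Any]], service: str
) -> Iterator[dict[str, Any]]:
    """Yield one bulk action per merged PR; unmerged PRs are skipped."""
    for pr in prs:
        if pr.get("merged_at") is None:
            continue  # never merged, so it cannot have been deployed
        yield {
            "_index": PR_INDEX,
            "_id": f"{service}-{pr['number']}",  # stable ID makes re-ingestion idempotent
            "_source": pr_to_doc(pr, service),
        }
```

Using a deterministic _id means re-running the pipeline overwrites existing documents instead of duplicating them.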

Challenges we ran into

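Balancing accuracy with context was tricky: too few candidate PRs and the culprit can be missed, too many and the AI loses focus. In our demo, we implemented a LIMIT 3 lookup to keep the AI focused. The design is meant to carry over to production: swapping that limit for a time-based ES|QL filter lets the lookup cover every deployment in the incident window, however large the deployment volume.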
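
To make the trade-off concrete, here is a minimal sketch of the two lookup styles as ES|QL query strings built in Python. These are not the queries from our demo: the github_prs index and the service and merged_at fields are assumed names, and the row cap in the windowed version is an extra safeguard added for the example.

```python
import re

# Allow only simple service names so the value is safe to embed in a query string.
_SERVICE_RE = re.compile(r"^[A-Za-z0-9_.-]+$")


def _check_service(service: str) -> str:
    if not _SERVICE_RE.match(service):
        raise ValueError(f"unexpected service name: {service!r}")
    return service


def demo_query(service: str) -> str:
    """Demo-style lookup: a fixed, small number of recent PRs."""
    service = _check_service(service)
    return "\n".join([
        "FROM github_prs",
        f'| WHERE service == "{service}"',
        "| SORT merged_at DESC",
        "| LIMIT 3",
    ])


def windowed_query(service: str, hours: int, max_rows: int = 50) -> str:
    """Time-based lookup: every PR merged in the last `hours` hours, capped at max_rows."""
    service = _check_service(service)
    if hours <= 0 or max_rows <= 0:
        raise ValueError("hours and max_rows must be positive")
    return "\n".join([
        "FROM github_prs",
        f'| WHERE service == "{service}" AND merged_at >= NOW() - {hours} hours',
        "| SORT merged_at DESC",
        f"| LIMIT {max_rows}",
    ])
```

For example, windowed_query("checkout", 6) asks for every PR merged into the checkout service in the last six hours, so the number of candidates follows deployment activity rather than a fixed count.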

Accomplishments that we're proud of

We successfully created a seamless "handoff" where one AI agent identifies the problem and the second finds the person and code responsible. It transforms a chaotic wall of logs into a single actionable name.

What we learned

We learned how to use ES|QL for correlation across logs and metadata, joining error signatures to deployment history in a single query pipeline. We also gained hands-on insight into using Elastic Agent Builder to automate SRE triage logic.

What's next for Argus-SRE

  • CI/CD Integration with Governance: Integrating with CI/CD pipelines to trigger automated rollbacks.
  • Human-in-the-Loop Workflows: Implementing an "Approve/Reject" flow in Slack where an SRE can trigger a rollback or open a Jira ticket with a single click after reviewing the AI's findings.
  • Multi-Provider Support: Expanding beyond GitHub to support GitLab and Bitbucket.
