đź§ Inspiration
Modern operations teams are drowning in telemetry but starving for coherence.
Logs say one thing. Metrics say another. Slack is guessing. Jira is late.
Meanwhile the clock is burning money.
Systems like aerospace and nuclear engineering simulate before acting. DevOps still improvises mid-crisis. That gap is structural.
SanctumOps was inspired by a simple reframing:
What if incidents weren’t emergencies — but deterministic search problems with evidence-backed action paths?
Elastic already unifies logs, metrics, and vectors. SanctumOps turns that unified data layer into an action engine.
This is Atlas Sanctum thinking applied to operations: reduce the cost of moral and operational error under uncertainty.
⚙️ What it does
SanctumOps is a multi-step Incident-to-Resolution Agent (IR-AutoPilot) powered by Elasticsearch.
It:
Detects anomalies using ES|QL and time-window comparisons Builds a ranked “Context Pack” from logs, prior incidents, runbooks, and changes Forms a tool-driven execution plan Executes safe, conditional workflows Verifies outcomes Produces an auditable incident report with receipts
Instead of:
“Something’s wrong.”
You get:
“Latency increased 230% in service payments-api 3 minutes after deploy 42f8a. Similar incident 3 weeks ago caused by misconfigured timeout. Confidence: 0.81. Proposed rollback initiated. Monitoring recovery window.”
That’s not AI theater. That’s operational leverage.
🏗 How we built it
The architecture is intentionally judge-friendly and modular.
Core Elasticsearch Indices:
logs-* (structured time-series)
metrics-* (latency, error rates)
incidents-* (prior writeups with embeddings)
runbooks-* (procedural docs with embeddings)
changes-* (deploy metadata, flags, PRs)
Agent Pattern:
Planner Agent Forms the investigation plan and selects tools.
Executor Agent Runs ES|QL queries, hybrid search, vector retrieval.
Verifier Agent Checks if recovery conditions are met or escalation is needed.
Reporter Agent Generates structured incident report with evidence links.
Tools Used:
Search tool → hybrid (BM25 + vector similarity) ES|QL tool → aggregations, time windows, correlations Elastic Workflows → Jira, Slack, PagerDuty, config rollbacks
This is not prompt-only reasoning. It is tool-mediated, query-backed decision execution.
The agent doesn’t guess. It retrieves, reasons, verifies, acts.
⚠️ Challenges we ran into
The hard part was not detection.
The hard part was safe action.
AI agents are dangerous when they hallucinate confidence. So we introduced:
Confidence thresholds for actions
Verifier loops post-action
Conditional workflows (execute only if correlated evidence passes threshold)
Single targeted human-in-loop question when ambiguity is high
Another challenge: narrative ranking.
Hybrid retrieval had to balance:
Time proximity
Vector similarity
Service overlap
Change correlation
Too much vector weighting leads to semantic similarity without operational relevance. Too much keyword weighting misses contextual nuance.
The trick was tuning retrieval scoring so relevance reflected causality, not just textual similarity.
🏆 Accomplishments we're proud of
We transformed:
“Where do I even start?”
into
“Here’s what changed. Here’s what’s correlated. Here’s what we’re doing.”
Measured impact (internal testing scenarios):
Triage time reduced from ~40 minutes to under 10
Incident context completeness increased dramatically (runbooks + prior cases auto-attached)
Audit trail generation reduced postmortem drafting time by ~70%
Escalation noise reduced due to verifier loop filtering false positives
Most importantly:
Every action has a query trail.
This makes the agent enterprise-deployable, not demo-flashy.
đź§Ş What we learned
Incidents are rarely random. They are temporally adjacent to change events.
The most powerful pattern was simple:
“What changed in the 10 minutes before degradation?”
That ES|QL window comparison is devastatingly effective.
We also learned something philosophical:
Operational chaos often persists because systems lack memory.
When you index prior incidents with embeddings, the system develops institutional recall. That alone changes culture.
🚀 What’s next for SanctumOps
Geo-aware incident reasoning. Region-based anomaly detection and failover suggestions.
Predictive mode: Pre-incident simulation using historical drift patterns.
Cross-service causal graph construction using event correlation.
Long-term vision:
SanctumOps becomes the execution layer for reliability in AI-driven enterprises — fintech, robotics, healthcare, climate infrastructure.
An operating nervous system for complex organizations.
That’s not sci-fi. It’s indexable.
Built With
- agent
- api
- apis
- as
- builder
- change
- cloud
- containers
- data
- databases
- demo
- deployment
- docker
- elastic
- elasticsearch
- es|ql
- fastapi
- for
- frameworks
- github
- hybrid
- integrations
- interface
- jira
- kibana
- layer
- llm
- openai-compatible
- pagerduty
- primary
- python
- retrieval
- search
- service
- services
- slack
- stack
- typescript
- vector
- workflows

Log in or sign up for Devpost to join the conversation.