Inspiration
Elasticsearch clusters degrade silently at scale: wildcard queries, script filters, and unbounded aggregations accumulate in slowlogs until they cause production incidents, while SRE teams spend repetitive hours diagnosing the same anti-patterns and hand-writing reindex plans. This agent automates that entire detection-to-execution cycle. It reads the slowlog, diagnoses the root structural cause, retrieves the relevant Elastic documentation fix, benchmarks it on a throwaway cluster, checks the cloud cost impact, and routes a ready-to-approve reindex plan to the operator. The full cycle runs automatically on a schedule, with no human in the loop until the final decision.
Authenticity
Most Elasticsearch monitoring tools detect slow queries and alert. This project goes further: it diagnoses the structural cause via static analysis of the query JSON, retrieves best-practice documentation through vector search, validates the fix with a live benchmark on an ephemeral cluster before human approval, and executes the full mapping migration as a sequenced, reversible workflow. The combination of structural query analysis, RAG-backed documentation research, LLM planning, simulation, and gated execution in a single automated loop is not present in existing Elasticsearch tooling.
Focused Users
The target users are platform engineers, SREs, and Elasticsearch cluster administrators managing production clusters with active query workloads. The agent runs as a scheduled Python process connected to a monitoring Elasticsearch cluster. It requires no manual trigger once deployed — the scheduler runs the full optimization cycle at the configured interval, escalating to the operator only at the approval gate.
- Manual SRE toil elimination — SRE teams spend significant time manually reading slowlogs, identifying query anti-patterns, writing reindex plans, and executing index changes. This project automates that entire cycle.
- Query performance degradation at scale — As Elasticsearch indices grow to tens or hundreds of millions of documents, wildcard and regexp queries on keyword fields cause cluster-wide latency spikes. The agent detects and resolves these before they breach SLAs.
- Circuit breaker saturation — Unbounded aggregations and fielddata cache overuse trigger circuit breaker exceptions. The diagnosis layer detects these patterns and proposes heap or mapping-level fixes.
- Operational risk from undocumented changes — Reindexing at scale without a structured plan, cost check, benchmark, and approval gate is a common source of production incidents. This project enforces a structured, gated pipeline for every change.
- Documentation gap between engineers and best practices — Elastic documentation is extensive and frequently updated. The vector knowledge base ensures the agent always references current best-practice documentation before proposing any change.
- Cloud cost accountability — Teams scaling nodes to compensate for query inefficiency waste cloud budget. The cost gate forces a comparison between optimization (often free) and scaling (monthly recurring cost).
Industry: Platform engineering, site reliability engineering, search infrastructure, cloud-native operations, FinTech/eCommerce/media platforms running Elasticsearch at scale.
What it does
The agent monitors an Elasticsearch cluster's slowlog on a schedule, runs six structural anti-pattern detectors against the slowest queries, searches a vector index of Elastic documentation for the relevant fix, generates a three-step optimization plan using an LLM, validates it through cost and simulation gates, and executes the approved changes against the production cluster.

SRE teams currently operate a manual loop: alert fires → engineer reads slowlog → identifies pattern → consults documentation → writes reindex plan → reviews in team → executes → verifies. This project collapses that loop into an automated cycle where the engineer only participates at the approval step. Over time, the approval history becomes a record of all cluster optimization decisions with their rationale, telemetry context, simulation results, and cost justification — a structured audit trail that does not exist in current practice. As model quality improves and more anti-pattern detectors are added, the gap between detected issue and applied fix shrinks from days to minutes. Teams that adopt the agent shift from reactive incident response to scheduled preventive optimization. The simulation gate in particular changes the risk profile of Elasticsearch schema migrations: engineers currently apply reindexes speculatively; the agent runs them on a throwaway cluster first.

Traditionally, Elasticsearch performance optimization is tribal knowledge — senior engineers know the anti-patterns, junior engineers do not, and there is no systematic process for catching them before they cause incidents. This project encodes that knowledge as detectors and retrieves current best practices from documentation automatically. Reindex operations, which today require careful manual planning to avoid downtime, are reduced to an approved three-step workflow with a simulation-confirmed outcome. The cost gate also changes how infrastructure teams justify optimization work: instead of arguing for engineering time, the comparison between a free query fix and a recurring node scaling cost makes the business case automatically. Teams running multiple Elasticsearch clusters can apply the same agent across all of them with different configuration, scaling the optimization process without scaling the team. The approval webhook integration means the pipeline fits into existing incident management and change control workflows rather than requiring a new process.
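The Identify step issues an ES|QL query against the slowlog data in the monitoring cluster. A sketch of what that query might look like — the index pattern and field names here are illustrative and depend on how slowlog events are shipped, not taken from the project's code:

```esql
FROM .ds-elasticsearch-slowlog-*
| WHERE log.level == "WARN"
| STATS avg_took_ms = AVG(elasticsearch.slowlog.took_millis),
        executions  = COUNT(*)
  BY elasticsearch.slowlog.source
| SORT avg_took_ms DESC
| LIMIT 10
```

Grouping by the query source (the JSON body) rather than by index is what lets the diagnosis layer analyze each distinct query shape once, however many times it executed.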
How we built it
The Self-Optimizing Elastic Infra Agent is an autonomous SRE system that monitors Elasticsearch clusters, identifies query performance bottlenecks, diagnoses their structural causes, researches documentation-backed fixes, proposes and validates optimization plans, and executes approved changes through a gated pipeline — all without requiring manual SRE intervention per cycle.
The setup layer uses `bootstrap.py` to enable X-Pack Stack Monitoring, configure slowlog thresholds on all production indices, and populate a custom index-metadata Elasticsearch index by crawling all production index mappings. `docs_indexer.py` discovers and crawls elastic.co documentation via sitemap XML, strips the HTML, chunks the text, embeds the chunks with all-MiniLM-L6-v2, and stores the vectors in a FAISS IndexFlatIP for retrieval.
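The chunking step in `docs_indexer.py` could be as simple as the following sketch — a word-window chunker with overlap; the function name and parameters are ours, not the project's actual API:

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split cleaned documentation text into overlapping word windows.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from either neighbouring chunk.
    """
    words = text.split()
    if not words:
        return []
    step = max(1, max_words - overlap)  # guard against overlap >= max_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final window already reaches the end of the text
    return chunks
```

The embedding step then encodes each chunk with all-MiniLM-L6-v2 and adds the L2-normalized vectors to the FAISS `IndexFlatIP`; on normalized vectors, inner product equals cosine similarity, which is why `IndexFlatIP` is the right index type here.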
The toolset provides three instruments to the agent: an ES|QL tool that queries the monitoring cluster for slowlog events, node metrics, and circuit breaker stats; a knowledge tool that performs vector similarity search against the FAISS documentation index; and an execution tool that wraps all Elasticsearch mutation endpoints (reindex, settings, templates, aliases) behind an approval gate.
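A minimal sketch of the approval gating around the execution tool — class and method names are hypothetical; the real tool wraps elasticsearch-py mutation calls rather than arbitrary callables:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ApprovalGate:
    """Blocks cluster mutations until an operator decision is recorded."""
    decisions: dict = field(default_factory=dict)  # action_id -> approved?
    pending: list = field(default_factory=list)    # (action_id, description)

    def request(self, action_id: str, description: str) -> None:
        """Queue an action for operator review (CLI prompt or webhook)."""
        self.pending.append((action_id, description))

    def record(self, action_id: str, approved: bool) -> None:
        """Persist the operator's decision for this action."""
        self.decisions[action_id] = approved

    def dispatch(self, action_id: str, call: Callable[[], dict]) -> dict:
        # No recorded decision -> hard refusal; never touch the cluster.
        if action_id not in self.decisions:
            raise PermissionError(f"{action_id}: no operator decision recorded")
        if not self.decisions[action_id]:
            return {"status": "rejected", "action": action_id}
        return call()
```

The key property, mirrored from the description above, is that `dispatch` cannot reach the cluster call without an explicit recorded decision — absence of a decision is an error, not an implicit approval.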
The agent loop runs on an APScheduler interval. Each cycle calls run_optimization_cycle() in the orchestrator, which sequences five steps: Identify (ES|QL query), Diagnose (six structural anti-pattern detectors + cardinality estimation), Research (FAISS search), Propose (LLM-generated Plan A→B→C), and Execute (gated dispatch). The reasoning model receives a composed context message containing all telemetry, diagnosis findings, and recommended documentation queries, then produces the structured plan.
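The cycle's sequencing can be sketched as follows, with each step passed in as a callable so the orchestration stays independent of the tool implementations. The names and signatures are illustrative, not the project's actual ones:

```python
def run_optimization_cycle(identify, diagnose, research, propose, execute):
    """Sequence the five agent steps; each step feeds the next.

    identify()        -> slow queries from the monitoring cluster (ES|QL)
    diagnose(queries) -> anti-pattern findings + cardinality estimates
    research(findings)-> documentation chunks from the FAISS index
    propose(context)  -> structured three-step plan from the LLM
    execute(plan)     -> gated dispatch (cost gate, simulation, approval)
    """
    slow_queries = identify()
    if not slow_queries:
        return {"status": "idle", "reason": "no slow queries this cycle"}
    findings = diagnose(slow_queries)
    docs = research(findings)
    plan = propose({"queries": slow_queries, "findings": findings, "docs": docs})
    return execute(plan)
```

Wiring this function to an APScheduler interval trigger is then a one-liner, and a cycle with an empty slowlog short-circuits before any model call is made.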
Before any destructive action is dispatched, two automated gates run: the cost reasoner compares the action's cost against the current node scaling cost using cloud billing API data, and the simulation engine provisions an ephemeral Elasticsearch cluster in Docker, seeds it with data, and benchmarks the proposed change before and after to confirm the improvement. Only after both gates pass does the action enter the approval queue, where the operator decides via CLI or webhook. The React UI exposes all 24 functional panels, each calling the Anthropic API with the exact telemetry and system prompt its corresponding backend module uses, making every agent decision visible and interactive.
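The cost gate's comparison might reduce to something like this sketch; in the real system the node price comes from the cloud billing API rather than a parameter, and the function name and return shape are ours:

```python
def cost_gate(optimization_cost_usd: float,
              node_monthly_cost_usd: float,
              horizon_months: int = 12) -> dict:
    """Compare a one-off optimization against recurring node scaling.

    Returns a pass/fail decision plus the figures the approval request
    shows the operator, so the business case travels with the action.
    """
    scaling_cost = node_monthly_cost_usd * horizon_months
    return {
        "passes": optimization_cost_usd < scaling_cost,
        "optimization_cost_usd": optimization_cost_usd,
        "scaling_cost_usd_over_horizon": scaling_cost,
        "savings_usd": scaling_cost - optimization_cost_usd,
    }
```

For a free query rewrite versus a $450/month node, the gate passes with the full $5,400 annualized saving attached to the approval request — the figure the operator actually sees.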
In short, the backend is Python with elasticsearch-py for all cluster interactions, ES|QL for analytical queries against the monitoring cluster, sentence-transformers with FAISS for documentation retrieval, and APScheduler for the timed agent loop. The LLM layer supports both the Anthropic and OpenAI APIs via swappable adapters. Docker provisions ephemeral single-node Elasticsearch clusters for pre-approval simulation. The UI is React (JSX) calling the Anthropic API directly to produce live agent reasoning output per panel. Data stores: the production Elasticsearch cluster, the monitoring Elasticsearch cluster, the custom index-metadata index, and a FAISS IndexFlatIP file.
Core Features
- Slowlog-based query identification — Runs ES|QL against the monitoring cluster to surface the top-N slowest queries by average duration and execution count.
- Structural query diagnosis — Six detectors analyze each slow query's JSON body for wildcards on high-cardinality fields, leading wildcards, unbounded term aggregations, deep nested queries, Painless script queries, and regexp queries. Cardinality estimation runs a live aggregation to confirm field uniqueness before flagging.
- Documentation-backed research — A FAISS vector index of elastic.co documentation is searched using sentence embeddings. The agent retrieves the relevant documentation chunks before proposing any fix.
- LLM-driven optimization planning — The reasoning model (Claude 3.5 Sonnet or GPT-4o) receives the composed telemetry context, diagnosis, and documentation findings, then produces a structured three-step plan: create index template → trigger reindex → update alias.
- Cost gate — Before any action is dispatched, the cost reasoner compares the action cost against the cost of scaling a node using real cloud billing data.
- Simulation gate — A Docker ephemeral single-node Elasticsearch cluster is provisioned, test data is seeded, and the proposed change is benchmarked before/after to confirm latency improvement.
- Approval gate — Every destructive action is blocked until an operator approves via CLI stdin prompt or HTTP webhook. The execution tool does not call any Elasticsearch endpoint without a recorded decision.
- Scheduled loop — APScheduler runs the full Identify→Diagnose→Research→Propose→Execute cycle at a configurable interval automatically.
- Reindex workflow — A three-step workflow (create template, reindex, swap alias) is executed as a sequenced plan where each step generates its own approval request.
- Full observability UI — A React interface with 24 panels surfaces every backend function, showing real agent reasoning output by calling the Anthropic API with the exact context each module uses.
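As an illustration of the structural detectors, a leading-wildcard detector might walk the query JSON like this — a sketch under our own names; the project's actual detector interface may differ:

```python
def find_leading_wildcards(query, path=""):
    """Walk a query body and flag wildcard clauses whose pattern starts
    with '*' or '?' -- these defeat the inverted index and force a scan
    over every term in the field."""
    findings = []
    if isinstance(query, dict):
        for key, value in query.items():
            here = f"{path}.{key}" if path else key
            if key == "wildcard" and isinstance(value, dict):
                for fld, spec in value.items():
                    # Clause may be {"field": "*x"} or {"field": {"value": "*x"}}.
                    pattern = spec.get("value") if isinstance(spec, dict) else spec
                    if isinstance(pattern, str) and pattern[:1] in ("*", "?"):
                        findings.append({"field": fld, "pattern": pattern, "path": here})
            else:
                findings.extend(find_leading_wildcards(value, here))
    elif isinstance(query, list):
        for i, item in enumerate(query):
            findings.extend(find_leading_wildcards(item, f"{path}[{i}]"))
    return findings
```

A trailing wildcard (`abc*`) is deliberately not flagged by this detector, since prefix patterns can still use the index; only leading `*`/`?` patterns force the expensive scan.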
Challenges we ran into
Composing the `build_initial_message()` context to include slowlog data, diagnosis output, node metrics, and recommended documentation queries in a single pass, without exceeding the model context window, required careful sequencing. Designing the approval gate to block `execution_tool` calls without coupling the workflow layer directly to the approval mechanism was the main architectural constraint. Differentiating `run_simulation()` (mapping comparison) from `simulate_cluster_settings_change()` (before/after settings benchmark) required separate Docker lifecycle flows.
Accomplishments that we're proud of
A fully automated Identify→Diagnose→Research→Propose→Cost Gate→Simulation Gate→Approve→Execute pipeline with no hardcoded rules — every optimization decision is derived from live cluster telemetry and documentation retrieval. The React UI surfaces every backend function with real agent reasoning, making the agent's decision process fully auditable.
What we learned
Structural ES|QL queries against slowlog data streams are more reliable than parsing raw log text for identifying query anti-patterns. Cardinality estimation via a live aggregation before flagging a wildcard query eliminates false positives on low-cardinality fields. LLM reasoning quality for optimization planning is directly proportional to how precisely the telemetry context is structured before the first model call.
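The cardinality check described above maps onto Elasticsearch's real `cardinality` aggregation, which returns an approximate (HyperLogLog++-based) distinct count. A sketch of the request body and the flagging decision, with our own helper names and thresholds:

```python
def cardinality_request(field: str, threshold: int = 40000) -> dict:
    """Build the search body for a live cardinality check on `field`.

    `precision_threshold` trades memory for accuracy up to its value
    (40000 is the accepted maximum); size=0 skips hit collection.
    """
    return {
        "size": 0,
        "aggs": {
            "field_cardinality": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": threshold,
                }
            }
        },
    }

def is_high_cardinality(agg_response: dict, cutoff: int = 10000) -> bool:
    """Decide from the aggregation response whether a wildcard on this
    field is worth flagging -- wildcards on low-cardinality fields are
    cheap, so flagging them would be a false positive."""
    return agg_response["aggregations"]["field_cardinality"]["value"] >= cutoff
```

Running this one aggregation before flagging is what turns the wildcard detector from a syntactic lint into a performance claim grounded in the live data.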
What's next for Self-Optimizing Elastic Infra Agent
- Automated rollback on post-deployment latency regression detection
- Multi-cluster support across environments
- ILM policy optimization recommendations
- Integration with Elastic's Fleet API for agent-level configuration changes
- ML-based anomaly detection on node metrics as a pre-diagnostic signal before the ES|QL slowlog query runs
Known Limitations
- Slowlog threshold dependency — Queries below the configured slowlog threshold are not detected. Adjust `index.search.slowlog.threshold.query.warn` to match operational requirements.
- Static anti-pattern detectors — Detectors analyze query JSON structure. Performance issues caused by data skew, shard imbalance, or hardware-level resource contention are not detectable by structural analysis alone.
- Simulation fidelity — The Docker ephemeral cluster is a single-node, minimal-memory instance. Benchmark results are directionally valid but do not replicate multi-node replica overhead or production query concurrency.
- FAISS index staleness — The knowledge base reflects elastic.co documentation at the time of the last `build-kb` run. Re-run periodically after major Elasticsearch releases.
- No automatic rollback — If an approved change degrades production performance post-deployment, reversal requires manual operator action. A future version will add regression detection with automated rollback routed through the same approval gate.
- Billing API dependency — The cost gate requires a valid `BILLING_API_KEY` and a correctly configured `CLOUD_COST_PER_NODE_MONTHLY`. An incorrect cost reference produces incorrect gate decisions.
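For reference, the slowlog thresholds mentioned above are dynamic index settings and can be adjusted without a restart; in Kibana Dev Tools console syntax (the index name and durations are examples):

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "1s",
  "index.search.slowlog.threshold.query.debug": "500ms"
}
```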
Built With
- all-minilm-l6-v2
- anthropic
- anthropic-messages-api
- apscheduler
- argparse
- aws/gcp/azure-billing-apis
- beautifulsoup4
- claude-3.5-sonnet
- claude-sonnet-4-20250514
- diagnosis-agent
- docker
- docker-sdk
- elasticsearch
- elasticsearch-alias
- elasticsearch-cluster-settings
- elasticsearch-esql
- elasticsearch-index-template
- elasticsearch-reindex
- elasticsearch-tasks
- elasticsearch-py
- faiss-(faiss-cpu)
- faiss-indexflatip
- gpt-4o
- javascript-(jsx)
- openai
- openai-chat-completions-api
- pytest
- python
- react
- reasoning-agent
- requests
- sentence-transformers-(all-minilm-l6-v2)
- sre-orchestrator-agent