- Figure: AstralMart service topology. Edges represent runtime HTTP/gRPC calls; grey nodes are external dependencies.
- Figure: The AstralMart knowledge base, six Elasticsearch indices that give the agent the operational context a senior engineer carries in their head.
- Figure: Agentic architecture.
- Figure: Request flow for an investigation session. The MCP server is the bridge between the Orchestrator and the specialist agents.
- Figure: Request flow for a tutoring session. The Teaching Agent holds the lab state across turns via the conversation ID.
- Figure: MCP tools.
- Figure: Task Solver tools.
- Figure: Teaching Agent tools.
- Figure: Orchestrator agent tools.
Inspiration
New engineers on observability teams can see the data but can't yet read it. The knowledge that makes metrics meaningful — what's normal, who owns what, what the runbook says — lives in senior engineers' heads and transfers slowly. We wanted to build something that closes that gap faster: an assistant that investigates incidents and teaches engineers how to do it themselves, using the same company systems and data they'll eventually need to master.
We don't just hand over the fish; we teach engineers how to fish.
What it does
AstralMart Assistant is a three-agent onboarding system for SRE teams built around a single idea: solve the incident and teach the engineer how to solve the next one.
- Orchestrator — the engineer's only point of contact. Understands the intent of each message, delegates silently to specialists, and composes a single coherent response.
- Task Solver — the investigator. Queries live APM telemetry, identifies the root cause, and cross-references company runbooks and postmortems to return a structured diagnosis with an action card.
- Teaching Agent — the coach. Runs a step-by-step guided lab tied to the exact scenario just investigated, asking questions at each step so the engineer learns the investigation path using real company data and tools.
How we built it
All three agents are built in Kibana Agent Builder (Anthropic Claude Sonnet 4.5). The knowledge base lives in Elasticsearch: 11 service profiles, 4 runbooks, 4 postmortems, 10 norms, 5 training labs. Live telemetry comes from the OpenTelemetry Demo — 28 containers exporting real traces and metrics to Elastic Cloud APM via OTLP.
A custom MCP server (FastAPI + FastMCP) bridges the agents. The Orchestrator calls invoke_task_solver, invoke_teaching_agent, and log_progress as tools. The MCP server handles Kibana API round-trips and threads conversation IDs so the Teaching Agent holds lab state across turns.
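The conversation-ID threading described above can be sketched in a few lines. This is a minimal illustration, not the project's actual server code: `ConversationThreader`, `FakeAgentBackend`, and the `(agent_id, message, conversation_id)` transport signature are all hypothetical names, and the fake backend stands in for the Kibana API round-trip so the sketch runs offline.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport signature: (agent_id, message, conversation_id)
# -> (reply_text, conversation_id). In a real deployment this would wrap
# a Kibana API call; here it is injected so the example needs no network.
Transport = Callable[[str, str, Optional[str]], Tuple[str, str]]


class ConversationThreader:
    """Remembers one conversation ID per (session, agent) pair."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport
        self._conversations: Dict[Tuple[str, str], str] = {}

    def invoke(self, session_id: str, agent_id: str, message: str) -> str:
        key = (session_id, agent_id)
        # Pass the stored ID (or None on the first turn) so the agent
        # resumes the same conversation instead of starting a new one.
        reply, conversation_id = self._transport(
            agent_id, message, self._conversations.get(key)
        )
        self._conversations[key] = conversation_id
        return reply


class FakeAgentBackend:
    """Stand-in backend that counts turns per conversation ID."""

    def __init__(self) -> None:
        self._turns: Dict[str, int] = {}
        self._next_id = 0

    def __call__(
        self, agent_id: str, message: str, conversation_id: Optional[str]
    ) -> Tuple[str, str]:
        if conversation_id is None:
            self._next_id += 1
            conversation_id = f"{agent_id}-conv-{self._next_id}"
        self._turns[conversation_id] = self._turns.get(conversation_id, 0) + 1
        return f"{agent_id} turn {self._turns[conversation_id]}", conversation_id
```

With this shape, a second call for the same session reaches the same conversation, which is what lets a multi-step lab keep its place between messages.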
We chose three separate agents over one large one because shorter, focused system prompts performed noticeably better in our testing.
Challenges we ran into
Monolithic agents don't scale. Early versions packed investigation, tutoring, routing, and progress tracking into a single agent. The prompts grew long and defensive — and performance dropped. Splitting into three agents with focused responsibilities fixed it. Shorter prompts, clearer behaviour, easier debugging.
The 60-second timeout wall. The MCP connector cuts off at ~60s. Our Task Solver was making two Kibana API round-trips per investigation and frequently hit the limit. A two-path prompt solved it: if the service name is already in the message, skip discovery and go straight to the ES|QL query. One architectural decision, one full round-trip saved.
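In the project this branching lives in the Task Solver's prompt, not in code. The sketch below only restates the decision in Python to make the two paths explicit. The service names, the step labels, and the `plan_investigation` helper are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Hypothetical service names, used only for illustration.
KNOWN_SERVICES = ("checkout", "cart", "payment")


def find_service(message: str, known_services: Iterable[str]) -> Optional[str]:
    """Return the first known service mentioned as a whole word, if any."""
    lowered = message.lower()
    for name in known_services:
        if re.search(rf"\b{re.escape(name.lower())}\b", lowered):
            return name
    return None


def plan_investigation(
    message: str, known_services: Iterable[str] = KNOWN_SERVICES
) -> Tuple[List[str], Optional[str]]:
    """Choose the fast path when the service is already named."""
    service = find_service(message, known_services)
    if service is not None:
        # Fast path: skip discovery, go straight to the telemetry query.
        return ["query_telemetry"], service
    # Slow path: one extra round-trip to discover candidate services.
    return ["discover_services", "query_telemetry"], None
```

For a message such as "checkout latency is spiking", the plan contains a single step, which is the round-trip saving described above.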
Prompt-level state machines. Without explicit sequencing rules, agents would occasionally collapse or reorder steps. A strict contract in each system prompt — "after Step N, the next output must be Step N+1, no exceptions" — fixed the behaviour completely. Every challenge in this project had a prompt engineering answer once the architecture was right.
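The sequencing contract is enforced in the prompt itself, but the same rule is easy to express as a check on agent output. The sketch below assumes each turn begins with a label of the form "Step N"; that label format and the helper names are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed output format: each agent turn starts with "Step <number>".
STEP_PATTERN = re.compile(r"^\s*Step\s+(\d+)\b", re.IGNORECASE)


def parse_step(output: str) -> Optional[int]:
    """Extract the leading step number, or None if the label is missing."""
    match = STEP_PATTERN.match(output)
    return int(match.group(1)) if match else None


def follows_contract(previous_step: int, output: str) -> bool:
    """True only if the output is labelled as exactly the next step."""
    return parse_step(output) == previous_step + 1
```

A check like this could run in tests or logging to catch skipped or reordered steps without changing the prompt.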
Accomplishments that we're proud of
A multi-agent architecture that actually works in harmony. Three agents with clean, non-overlapping responsibilities — each knowing exactly what it owns and nothing more. The Orchestrator routes, the Task Solver investigates, the Teaching Agent coaches. No agent steps on another. The engineer sees one seamless conversation.
Constraints pushed us toward better design. The timeout wall and prompt reliability issues forced us to strip out everything unnecessary — fewer round-trips, shorter prompts, explicit contracts. The result is a system that does more with less: minimal tool calls, maximum signal, configurations that are predictable and easy to reason about.
What we learned
Systems fail at the boundaries. The agents were the easy part. The handoffs — tool call ordering, conversation threading, state preservation — were where the real complexity lived.
Prompt length is a proxy for trust. Short, confident prompts with clear state machines outperformed long defensive ones every time.
Simulation quality determines output quality. Time spent on realistic seed data was never wasted.
What's next for AstralMart Onboarding Assistant
The architecture is domain-agnostic — swap the knowledge base for a security team's detection playbooks or a fintech compliance team's policy docs and the same system works. Next: a progress dashboard inside Kibana, richer branching training labs, and team-level onboarding analytics.
Built With
- anthropic-claude-sonnet-4.5
- docker
- elastic-agent-builder
- elastic-cloud
- elasticsearch
- es|ql
- fastapi
- fastmcp
- kibana
- mcp-(model-context-protocol)
- opentelemetry
- otlp
- python