Muntjac

Muntjac is a self-healing observability agent: it detects, diagnoses, repairs, and audits incidents via Splunk MCP + Gemini, in under 6 seconds in the fastest measured run.
A watchdog polls 6 SPL rules → Gemini diagnoses via Splunk MCP → bounded repair (safe actions run automatically; risky ones escalate) → audit events written back to Splunk.
Error spike recovered in 5.5s. All 5 chaos scenarios verified. Traditional MTTD: hours. Muntjac: seconds.

Inspiration

At 3 AM, the same five symptoms wake the same on-call engineers to run the same five SPL queries. Splunk already has all the data — what's missing is an agent that knows the playbook and is allowed to execute it.

What it does

Muntjac is a self-healing observability framework. It continuously polls Splunk for anomalies (error spikes, latency drift, heartbeat gaps, memory leaks, silent failures) and dispatches an AI agent that diagnoses the root cause via Splunk MCP Server. Muntjac then executes bounded repairs (safe actions auto-run; risky ones await approval) and writes the full audit trail back to Splunk, where it is queryable with a single stats command.
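
As a rough illustration of the loop described above, the sketch below runs one detect → diagnose → repair → audit pass. The function name run_once, the four stage callables, and the event fields (stage, incident_id, elapsed_s) are assumptions made for illustration, not Muntjac's actual code. The point it tries to show is that tagging every event with one shared incident id is what would let a single stats query reconstruct the whole trail.

```python
import time
import uuid
from typing import Callable, Optional


def run_once(
    detect: Callable[[], Optional[dict]],
    diagnose: Callable[[dict], dict],
    repair: Callable[[dict], dict],
    audit: Callable[[dict], None],
) -> Optional[str]:
    """Run one detect -> diagnose -> repair -> audit pass.

    Returns the incident id, or None when nothing anomalous was found.
    All stage callables are hypothetical stand-ins for the real components.
    """
    detection = detect()
    if detection is None:
        return None  # nothing fired on this poll

    incident_id = uuid.uuid4().hex
    started = time.monotonic()

    # Every audit event carries the same incident_id so the stages can be joined later.
    audit({**detection, "stage": "detection", "incident_id": incident_id})

    diagnosis = diagnose(detection)
    audit({**diagnosis, "stage": "diagnosis", "incident_id": incident_id})

    outcome = repair(diagnosis)
    audit({**outcome, "stage": "repair", "incident_id": incident_id})

    audit({
        "stage": "summary",
        "incident_id": incident_id,
        "elapsed_s": round(time.monotonic() - started, 3),
    })
    return incident_id
```

In a real deployment the audit callable would post to Splunk HEC and detect would run the SPL rules; here they are left as injected functions so the control flow can be exercised without a Splunk instance.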

How we built it

Splunk Cloud Platform as the log backbone
Splunk MCP Server (Splunkbase 7931) as the AI-readable interface to Splunk data
Python watchdog polling 6 SPL detection rules covering multi-layer failure modes
AI agent (Claude CLI / Gemini CLI) connected to Splunk MCP via mcp-remote bridge
ACTION_ROUTES table enforcing safety boundaries at the dispatcher level — not relying on LLM judgment
Splunk HEC for writing muntjac:detection, muntjac:diagnosis, muntjac:repair, and muntjac:summary events back to Splunk
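
To make the ACTION_ROUTES idea concrete, here is a minimal sketch of a dispatcher that enforces the safe/risky boundary in code. The table name comes from the project; the action names, the route labels ("auto", "escalate"), and the functions dispatch and build_action_prompt are hypothetical, since the write-up does not list them.

```python
# Hypothetical entries: the project names the table but not its contents.
ACTION_ROUTES = {
    "restart_service": "auto",      # safe: may run without a human
    "clear_cache": "auto",          # safe: may run without a human
    "rollback_deploy": "escalate",  # risky: needs approval first
}


def dispatch(action: str) -> str:
    """Decide what happens to an action proposed by the LLM."""
    route = ACTION_ROUTES.get(action)
    if route is None:
        # Action names the model invented never reach a repair handler.
        return "rejected"
    if route == "auto":
        # A real dispatcher would call the repair handler here.
        return "executed"
    return "awaiting_approval"


def build_action_prompt() -> str:
    """Build the allowed-action list for the diagnosis prompt from the same table."""
    names = ", ".join(sorted(ACTION_ROUTES))
    return f"Respond with exactly one action from this list: {names}"
```

Because both functions read the same dictionary, adding or removing an action changes what the model is offered and what the dispatcher accepts in one place.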

Challenges we ran into

Silent field-name mismatch: HEC wrote event= but the detection rules searched msg=, so 3 of 6 rules silently never fired. We caught it only by inspecting the field names in the _raw events.
Action vocabulary drift: The LLM invented non-existent action names. Fixed by making ACTION_ROUTES the single source of truth for both the diagnosis prompt and the repair dispatcher.
HEC backpressure under chaos injection: Solved with 1/5 sampling for routine traffic and a 10,000-event queue depth.
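
The backpressure fix can be sketched as a bounded buffer sitting in front of the HEC sender. The 1-in-5 sampling and the 10,000-event depth come from the description above; the class name HecBuffer, the deterministic every-fifth-event sampling, and the choice to drop (rather than block) when the queue is full are assumptions, since the write-up does not specify them.

```python
import queue

QUEUE_DEPTH = 10_000  # queue depth stated in the write-up
SAMPLE_EVERY = 5      # keep 1 of every 5 routine events


class HecBuffer:
    """Bounded, sampling buffer in front of a (hypothetical) HEC sender."""

    def __init__(self, depth: int = QUEUE_DEPTH, sample_every: int = SAMPLE_EVERY):
        self._q = queue.Queue(maxsize=depth)
        self._sample_every = sample_every
        self._routine_seen = 0
        self.dropped = 0  # events lost because the queue was full

    def offer(self, event: dict, routine: bool = True) -> bool:
        """Try to enqueue an event; return True if it was accepted."""
        if routine:
            self._routine_seen += 1
            if self._routine_seen % self._sample_every != 0:
                return False  # sampled out, not counted as dropped
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_items: int) -> list:
        """Remove up to max_items events for one batched HEC post."""
        batch = []
        while len(batch) < max_items:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

Audit events would be offered with routine=False so they bypass sampling; only ordinary traffic is thinned out.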

Accomplishments that we're proud of

5/5 chaos scenarios pass end to end on a live Splunk Cloud trial cluster, with audit events verified by querying them back through Splunk MCP
Safety boundaries enforced by code (ACTION_ROUTES), not by hoping the LLM behaves correctly
Fastest full loop (chaos → detect → diagnose → repair → audit in Splunk): 5.5 seconds

What we learned

Splunk MCP Server turns Splunk from a "human reads dashboards" tool into an "AI queries and acts" platform. The hardest bugs were silent ones — field mismatches that made everything look fine while half the detection rules were dead.
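
One way to surface this kind of silent mismatch early is a static check that compares the fields each rule filters on with the fields the writer actually emits. The sketch below is an assumption-laden illustration, not the project's tooling: the rule strings are made up, the regex only looks at field comparisons in the base search (the text before the first pipe), and the list of built-in Splunk fields is deliberately short.

```python
import re

# Matches field names used in comparisons such as msg=, status!=, latency>=
FIELD_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*[!<>]?=")

# Fields and time modifiers Splunk provides itself (illustrative, not exhaustive).
BUILTIN = frozenset({"index", "sourcetype", "source", "host", "earliest", "latest"})


def referenced_fields(spl: str) -> set[str]:
    """Return field names filtered on in the base search of an SPL string."""
    unquoted = re.sub(r'"[^"]*"', '""', spl)  # ignore '=' inside quoted values
    base_search = unquoted.split("|", 1)[0]   # only the part before the first pipe
    return set(FIELD_RE.findall(base_search))


def dead_rule_fields(rules: dict[str, str], written: set[str]) -> dict[str, set[str]]:
    """Map each rule name to the fields it searches that nothing writes."""
    missing = {}
    for name, spl in rules.items():
        unknown = referenced_fields(spl) - written - BUILTIN
        if unknown:
            missing[name] = unknown
    return missing
```

For example, with hypothetical rules where one filters on msg= while the writer only emits event and level, dead_rule_fields would flag that rule and leave the others alone. A check like this could run in CI, although it cannot replace inspecting real events in Splunk.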

What's next for Muntjac

Multi-tenant support with per-team safety policies
Feedback loop: agent learns from past incident resolutions stored in Splunk
Integration with Splunk SOAR for enterprise approval workflows

Built With

claude-ai
docker
gemini
mcp-remote
python
splunk-cloud
splunk-hec
splunk-mcp-server

Updates

cyh7789 DannyHuang started this project — Jun 04, 2026 06:33 AM EDT
