Find Evill

SplunkOps Project Story

What It Does

SplunkOps is an autonomous incident-response agent for SANS SIFT and Splunk. Given an alert or analyst prompt, it investigates endpoint, identity, network, memory, and disk evidence; correlates the results; records cited findings; and produces a concise incident brief with recommended containment steps.

The key behavior is self-correction. The agent does not simply run one search and summarize it. It checks whether the evidence sources agree, looks for gaps between Splunk telemetry and SIFT artifacts, and revisits the case when a finding is unsupported or contradicted.

How We Built It

We built SplunkOps around a custom MCP server instead of generic shell access. The server exposes typed functions for:

- Splunk searches.
- Splunk index discovery.
- SIFT memory forensics.
- SIFT disk and timeline tooling.

The agent loop is intentionally small and inspectable. It keeps an append-only event log and treats that log as the source of truth. Tool calls, tool results, evidence rows, findings, reflexion passes, and final decisions are all events.

Findings are Pydantic objects, not free-form paragraphs. Every finding must cite at least one evidence row. The final RCA brief is generated only from those typed findings and the accumulated evidence list.

Challenges

The hardest design problem was balancing speed with evidence integrity. A generic agent can move quickly if it has a shell, but that also gives it the ability to destroy evidence, run expensive or irrelevant commands, and drown itself in raw output.

We solved that by making the MCP server a real security boundary:

- SPL write and egress commands are blocked before they reach Splunk.
- SIFT commands are allow-listed.
- Original case artifacts are read from SIFT_CASES_DIR.
- Derived artifacts are written only under SIFT_OUTPUT_DIR.
- Raw outputs are capped before entering the model context.

Another challenge was hallucination control.
Instead of asking the model to "be careful," the data model requires citations. A claim cannot become a finding unless it points back to exact evidence rows.

What We Learned

The most useful defensive AI pattern is not a bigger prompt. It is a narrower tool boundary. Once the agent cannot mutate evidence and cannot assert uncited findings, the remaining work becomes analyst workflow design: sequencing queries, comparing sources, and deciding when to ask a human.

We also learned that reflexion works best when it is grounded in the event log. The agent can self-correct more reliably when it reviews concrete tool calls and evidence rows rather than a vague summary of its own prior thoughts.

What's Next

The next step is a broader benchmark suite:

- More public DFIR datasets with documented ground truth.
- Automated scoring for false positives, missed artifacts, and hallucinated claims.
- More SIFT wrappers, especially for registry, browser, and persistence artifacts.
- A richer UI for inspecting citation chains from finding to raw evidence.
- Policy-controlled containment adapters for EDR, Slack, Jira, and ticketing systems.
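To make the "SPL write and egress commands are blocked" boundary from the Challenges section concrete, here is a minimal sketch of a pre-flight check. It is not the project's actual code: the deny-list contents and the names BLOCKED_SPL_COMMANDS, blocked_commands, and guard_spl are assumptions chosen for illustration, and a production guard would need a real SPL parser rather than string splitting.

```python
import re

# Illustrative deny-list (an assumption, not the project's real list):
# SPL commands that write data, delete events, or send data out of Splunk.
BLOCKED_SPL_COMMANDS = frozenset({
    "delete", "collect", "mcollect", "meventcollect", "tscollect",
    "outputlookup", "outputcsv", "sendemail", "script",
})

# Matches a double-quoted string literal, including escaped quotes.
_QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')


def blocked_commands(spl: str) -> list[str]:
    """Return the blocked commands that start any pipeline stage or subsearch."""
    # Mask string literals so a "|" or "[" inside quotes is not treated
    # as the start of a new stage.
    masked = _QUOTED.sub('""', spl)
    found = []
    for stage in re.split(r"[|\[]", masked):
        tokens = stage.strip().split()
        if tokens and tokens[0].lower() in BLOCKED_SPL_COMMANDS:
            found.append(tokens[0].lower())
    return found


def guard_spl(spl: str) -> str:
    """Return the query unchanged, or raise before it can reach Splunk."""
    hits = blocked_commands(spl)
    if hits:
        raise PermissionError(f"blocked SPL commands: {sorted(set(hits))}")
    return spl
```

A wrapper like this would sit inside the typed search function, so the model never gets a path to Splunk that skips the check. A deny-list is the simplest form; an allow-list of read-only commands would be stricter.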
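The append-only event log and the "every finding must cite at least one evidence row" rule can be sketched together. The write-up says findings are Pydantic objects; to stay dependency-free, this sketch swaps in standard-library dataclasses. The class and field names (Event, Finding, EventLog, evidence_ids) are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int        # position in the log, assigned on append
    kind: str       # e.g. "tool_call", "evidence", "finding"
    payload: dict


@dataclass(frozen=True)
class Finding:
    summary: str
    evidence_ids: tuple[int, ...]   # seq numbers of cited evidence events

    def __post_init__(self):
        # Structural rule: a finding with no citations cannot be constructed.
        if not self.evidence_ids:
            raise ValueError("a finding must cite at least one evidence row")


class EventLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self):
        self._events: list[Event] = []

    def append(self, kind: str, payload: dict) -> Event:
        event = Event(seq=len(self._events), kind=kind, payload=payload)
        self._events.append(event)
        return event

    def events(self) -> tuple[Event, ...]:
        # Return a tuple so callers cannot mutate the underlying list.
        return tuple(self._events)

    def record_finding(self, finding: Finding) -> Event:
        # Every citation must resolve to an evidence event already in the log.
        known = {e.seq for e in self._events if e.kind == "evidence"}
        unknown = sorted(set(finding.evidence_ids) - known)
        if unknown:
            raise ValueError(f"finding cites unknown evidence rows: {unknown}")
        return self.append(
            "finding",
            {"summary": finding.summary, "evidence_ids": finding.evidence_ids},
        )
```

Under these assumptions, an uncited claim fails at construction time and a citation to a row that was never logged fails at record time, so the brief generator only ever sees findings that resolve to logged evidence.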

Built With

html5

Updates

Arya Ariya started this project — Jun 15, 2026 11:43 PM EDT
