evtx-sentinel

Claude Code drives the MCP server over stdio. Six read-only tools call Hayabusa and EvtxECmd; every call is logged with a SHA-256 hash.
Benchmark on 39 real attack files: 53 findings, 17 hallucinations caught and corrected automatically, 0 false positives, 20 MITRE techniques.
The complete tool surface the agent can access. No shell, write, or delete exists — destroying evidence is structurally impossible.
The agent claimed lsass.exe; the raw record held winlogon.exe. verify_finding flags it HALLUCINATED, then auto-corrects in Phase 3.
Every tool call lands in an append-only log with a SHA-256 of its result. Across 202 entries, any finding traces to the exact call that made it.

Inspiration

AI-powered attackers now move at machine speed, while defenders still investigate manually. The natural fix is an autonomous AI agent — and tools like Protocol SIFT already do this, connecting an LLM to the full SANS SIFT forensic toolkit. But when we tested Protocol SIFT on real attack data, we found a quiet, dangerous problem: it produced findings where the field values were simply wrong. The right event, but the wrong process name. The right record, but the wrong PID. Nothing in the system caught it, because its only guardrail was an instruction telling the model not to hallucinate. We built evtx-sentinel to replace that instruction with something stronger: architecture.

What it does

evtx-sentinel is a read-only, typed MCP server for autonomous Windows Event Log (.evtx) analysis on the SANS SIFT Workstation. An AI agent connects to it and can do exactly six things: list evidence files, register evidence, run Hayabusa sigma scans, look up individual records, summarize logon patterns, and verify a finding against the raw EVTX record. That is the complete surface area. There is no tool to run a shell command, no tool to write a file, and no tool that could alter or destroy evidence, even if the agent tried.

Every finding generated during triage is passed through verify_finding() before it can enter the final report. The server re-reads the actual record and compares every reported field against ground truth. On our benchmark run across 39 credential-theft EVTX files, the system generated 53 findings, automatically caught and corrected 17 field-level hallucinations, and finished with 0 false positives in the confirmed set.

How we built it

The MCP server is written in Python using the official mcp SDK (v1.27.2) over stdio transport. It exposes six typed read-only tools: list_evtx_files, register_evidence, run_sigma_scan, get_event_detail, get_logon_summary, and verify_finding. Each tool shells out to Hayabusa v3.9.0 in analysis-only mode — no tool can write, rename, move, or delete any evidence file. Every call appends a structured JSON line to logs/execution_log.jsonl containing the ISO 8601 UTC timestamp, tool name, arguments, a SHA-256 digest of the result, and wall-clock duration. This audit trail makes every inference traceable to a specific raw tool invocation.
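The audit entry described above could be produced by a wrapper along these lines. This is a sketch under assumptions: the key names (`result_sha256`, `duration_s`), the canonical JSON serialization used for hashing, and the `log_tool_call` helper itself are illustrative, not the project's actual schema.

```python
import hashlib
import json
import time
from datetime import datetime, timezone
from pathlib import Path


def log_tool_call(log_path, tool, arguments, call):
    """Run `call()`, then append one JSON line describing it to `log_path`."""
    started = time.perf_counter()
    result = call()
    duration_s = time.perf_counter() - started

    # Serialize canonically so the same result always yields the same digest.
    payload = json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "tool": tool,
        "arguments": arguments,
        "result_sha256": hashlib.sha256(payload).hexdigest(),
        "duration_s": round(duration_s, 6),
    }

    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only appends; earlier entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return result, entry
```

Hashing a canonical serialization rather than the raw stdout is a design choice made for this sketch: it keeps the digest stable if key order changes, at the cost of not matching a hash of the tool's literal output bytes.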

The agent runs a four-phase protocol:

Phase 1 — Triage: run_sigma_scan across all registered files surfaces candidate findings.
Phase 2 — Validate: each finding passes through verify_finding(), which re-reads the record and compares every field against ground truth.
Phase 3 — Correct: get_event_detail() fetches the real field values for any failed finding and rewrites it with a CORRECTED annotation.
Phase 4 — Report: a structured report with CONFIRMED, CORRECTED, and DISCARDED sections so reviewers see exactly what changed and why.
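Phases 2 and 3 reduce to a field-by-field comparison followed by a rewrite from ground truth. The sketch below shows that idea only; the `Verdict` type, the function names `verify_fields` and `correct_finding`, the finding layout, the exact-string comparison, and the `TargetProcess` field name are all assumptions made for illustration, not the server's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    status: str                                   # "CONFIRMED" or "HALLUCINATED"
    mismatches: dict = field(default_factory=dict)


def verify_fields(claimed: dict, raw_record: dict) -> Verdict:
    """Phase 2: compare every claimed field against the raw record."""
    mismatches = {}
    for name, claimed_value in claimed.items():
        actual = raw_record.get(name)
        # A field absent from the record counts as a mismatch (actual is None).
        if actual is None or str(actual) != str(claimed_value):
            mismatches[name] = {"claimed": claimed_value, "actual": actual}
    return Verdict("HALLUCINATED" if mismatches else "CONFIRMED", mismatches)


def correct_finding(finding: dict, verdict: Verdict) -> dict:
    """Phase 3: overwrite mismatched fields with ground truth and annotate."""
    fields = dict(finding["fields"])
    for name, diff in verdict.mismatches.items():
        fields[name] = diff["actual"]
    # Return a new dict so the original claim is preserved for the report.
    return {**finding, "fields": fields, "status": "CORRECTED",
            "corrections": verdict.mismatches}


# Usage, modelled on the lsass.exe / winlogon.exe mismatch:
raw = {"TargetProcess": "winlogon.exe"}
finding = {"fields": {"TargetProcess": "lsass.exe"}}
verdict = verify_fields(finding["fields"], raw)
fixed = correct_finding(finding, verdict)
```

Keeping the claimed value alongside the actual one in `corrections` is what lets a report show reviewers what changed and why.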

Our methodology was baseline-first. We ran stock Protocol SIFT on the same 39-file dataset, audited its output by hand, and catalogued every error. The most instructive was the PPLdump detection at EventRecordID 564592, where the agent reported lsass.exe as the target process but the actual record contained winlogon.exe. That single field-attribution error defined the threat model we designed against.

Challenges we ran into

Hayabusa outputs pretty-printed multi-line JSON, not JSONL. Splitting on newlines produced broken records. We fixed it by feeding raw stdout through Python's json.JSONDecoder.raw_decode() in a streaming loop, advancing the cursor after each decoded object.
Hayabusa refuses to overwrite existing output files. Reruns exited early rather than clobbering the path. Fixed with the -C clobber flag so runs are idempotent.
MCP server registration didn't follow the agent into case subdirectories. When the agent changed working directory, the project-scoped registration fell out of scope. Fixed by registering with an absolute path and always launching from the evtx-sentinel root.
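The streaming decode from the first challenge above can be sketched as follows. `json.JSONDecoder.raw_decode()` is the standard-library call named there; the `iter_json_objects` helper and the sample input are illustrative and not taken from the project.

```python
import json


def iter_json_objects(text: str):
    """Yield each top-level JSON value from concatenated, pretty-printed JSON."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so advance past it first.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the decoded value and the index just past it in `text`.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Two pretty-printed objects back to back, as a newline split would mangle them.
sample = '{\n  "a": 1\n}\n{\n  "b": [1,\n 2]\n}\n'
records = list(iter_json_objects(sample))
```

Malformed input still raises `json.JSONDecodeError`, so a truncated scan output fails loudly instead of yielding partial records.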

Accomplishments that we're proud of

We caught and corrected 17 hallucinations automatically across 39 real attack files, with zero false positives — no human in the loop.
We made the safety property architectural, not advisory: the agent physically cannot modify evidence because the tools to do so do not exist.
Every single finding is traceable to a raw tool call via a SHA-256-stamped execution log — full chain of custody across 20 MITRE ATT&CK techniques spanning 2017–2022.

What we learned

Architectural guardrails are categorically stronger than prompt-based ones. A typed function that exposes no destructive surface cannot be argued, jailbroken, or misunderstood into causing harm. "Do not hallucinate" can be — and routinely is — ignored.
The difference between a hallucination and a field-attribution error matters for detection. The agent found the correct event but reported the value from the wrong field. Review that only asks "did the event exist?" passes this error; field-level verification against the raw record catches it. That insight is the core of verify_finding().

What's next for evtx-sentinel

Expand beyond Credential Access to all MITRE ATT&CK tactics. The MCP architecture is tactic-agnostic; adding Persistence, Lateral Movement, and Defense Evasion is a matter of broadening rule selection and verification heuristics.
Add EvtxECmd Maps integration for richer field extraction, giving the correction phase the full parsed field set for any event.
Build a multi-agent variant where a Memory agent and a Disk agent cross-correlate findings — catching anti-forensic evasion that event logs alone cannot detect.