Inspiration

Every DFIR analyst knows the moment: you have Volatility output in one terminal, MFTECmd CSV in another, and AmCache results in a third, and you're manually scanning across all three trying to find the process that appears in memory but has no corresponding disk artifact. That cross-referencing is the hardest, most cognitively demanding part of triage — and it's exactly the part that gets rushed at 2am during an active incident.

The question that drove this project was simple: what if the agent did that cross-referencing for you, and kept doing it until it had no open questions left?

We were also frustrated by how most "AI for security" demos work — a chatbot that answers questions about logs you paste into it. That's not how incident response works. A real investigation is iterative: you find one artifact, it points to three more, you follow each thread, you hit a dead end, you backtrack. We wanted to build something that modeled that investigative reasoning loop in code, not just in a prompt.

The SANS SIFT Workstation was the perfect foundation because it already has every tool an analyst needs. The gap was an intelligence layer that could orchestrate those tools systematically rather than waiting for a human to decide what to run next.

What it does

AEGIS (Autonomous Evidence & Guided Investigation System) is a self-correcting DFIR agent that performs digital forensic triage autonomously on SANS SIFT Workstation images.

A human analyst's triage workflow looks like this: run Volatility to list processes, run AmcacheParser to see what executed, run MFTECmd to check the filesystem, then manually cross-reference all three to find the contradiction that proves something was injected. AEGIS does that cross-referencing automatically, in a loop, and keeps going until it has no open contradictions and no unexplored evidence gaps.

It exposes the full SIFT Workstation toolchain (AmcacheParser, PECmd, MFTECmd, EvtxECmd, vol3, srum_dump2, rip.pl, LECmd) as a structured MCP server with 19 read-only tools. Claude (claude-sonnet-4-6) calls those tools to gather evidence across disk, memory, and log artifacts; at the end of each iteration the agent runs a deterministic conflict detector and gap analyzer. If conflicts or gaps remain, it self-corrects with targeted re-queries. It exits the loop only when both counters hit zero or the configurable iteration ceiling is reached.

The final output is a structured report with a kill chain narrative, IOC list, confidence-tiered findings, and a JSONL execution log with SHA-256 hashes on every data point for chain-of-custody.

How we built it

MCP Server (server/) — Built on FastMCP (Python). Every tool is a thin, read-only wrapper over a SIFT CLI call. A typed parser layer (server/parsers/) converts raw TSV/CSV/XML output into Python dataclasses before any data reaches the model. The tool registry (server/tools/registry.py) has a startup assertion that enumerates all exposed tools and verifies none are write operations — a server that could modify evidence will refuse to start.
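The startup check described above could look roughly like the following. This is a minimal sketch, not the project's registry code: ToolSpec, its read_only flag, and assert_all_read_only are hypothetical names, and the real server is built on FastMCP rather than a plain dict.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical registry entry; the read_only flag is an assumption."""
    name: str
    handler: Callable[..., object]
    read_only: bool


def assert_all_read_only(registry: Dict[str, ToolSpec]) -> None:
    """Refuse to start if any exposed tool is declared write-capable."""
    offenders = sorted(name for name, spec in registry.items() if not spec.read_only)
    if offenders:
        raise RuntimeError(f"refusing to start, write-capable tools: {offenders}")


# Illustrative tool names, not the project's actual 19 tools.
registry = {
    "amcache_list": ToolSpec("amcache_list", lambda: [], read_only=True),
    "prefetch_list": ToolSpec("prefetch_list", lambda: [], read_only=True),
}
assert_all_read_only(registry)  # passes silently when every tool is read-only
```

Running the check once at import or server start means a misregistered tool fails loudly before any evidence is touched, rather than at call time.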

Agent Loop (agent/loop.py) — Uses the Anthropic Python SDK with tool_use API calls to Claude. Each iteration: gather evidence → detect conflicts → identify gaps → decide whether to continue. LoopConfig controls max iterations, confidence thresholds, and output directory. Pagination is enforced at the MCP layer signature level — mft_timeline, evtx, usnjrnl, and srum require limit and offset parameters so unbounded data never reaches context.
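The control flow of one run can be sketched as follows. The LoopConfig name comes from the description above, but its max_iterations field, the run_loop function, and the three stub callbacks are assumptions made for illustration; the real loop calls Claude through the Anthropic SDK where the gather stub sits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LoopConfig:
    max_iterations: int = 5  # hypothetical field name for the iteration ceiling


def run_loop(
    gather: Callable[[Dict, int], None],
    detect_conflicts: Callable[[Dict], List[str]],
    find_gaps: Callable[[Dict, List[str]], List[str]],
    config: LoopConfig,
) -> Tuple[int, str]:
    """Iterate until no conflicts and no gaps remain, or the ceiling is hit."""
    evidence: Dict = {}
    for iteration in range(1, config.max_iterations + 1):
        gather(evidence, iteration)
        conflicts = detect_conflicts(evidence)
        gaps = find_gaps(evidence, conflicts)
        if not conflicts and not gaps:
            return iteration, "resolved"
    return config.max_iterations, "ceiling"


# Stubs: a process stays an open conflict until malfind has examined it.
def gather(evidence: Dict, iteration: int) -> None:
    evidence.setdefault("pslist", ["beacon.exe"])
    if iteration >= 2:
        evidence["malfind"] = ["beacon.exe"]


def detect_conflicts(evidence: Dict) -> List[str]:
    return [p for p in evidence["pslist"] if p not in evidence.get("malfind", [])]


def find_gaps(evidence: Dict, conflicts: List[str]) -> List[str]:
    return ["malfind"] if conflicts and "malfind" not in evidence else []


outcome = run_loop(gather, detect_conflicts, find_gaps, LoopConfig())
```

With these stubs the first iteration leaves one conflict and one gap open, and the second iteration closes both, so the loop exits before reaching the ceiling.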

Conflict Detector (agent/scoring/conflicts.py) — Six deterministic checks that cross-reference findings across sources: phantom processes (in memory, absent from Prefetch/AmCache), execution-without-disk evidence, deleted-file-in-memory disagreement, network connections without owning process, persistence paths missing from MFT, and Prefetch/AmCache timestamp disagreement.
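As an illustration of the first check, a phantom-process test reduces to a set difference. This sketch assumes process and artifact names have already been normalized to bare executable names; the function name and signature are hypothetical and not taken from conflicts.py.

```python
from typing import Iterable, List


def phantom_processes(
    memory_procs: Iterable[str],
    prefetch_names: Iterable[str],
    amcache_names: Iterable[str],
) -> List[str]:
    """Return processes seen in memory with no Prefetch or AmCache record."""
    # Case-insensitive comparison, since Windows artifact sources differ in casing.
    known = {n.lower() for n in prefetch_names} | {n.lower() for n in amcache_names}
    return sorted({p for p in memory_procs if p.lower() not in known})


phantoms = phantom_processes(
    memory_procs=["explorer.exe", "beacon.exe"],
    prefetch_names=["EXPLORER.EXE"],
    amcache_names=[],
)
```

Because the check is a pure function of its inputs, the same evidence always yields the same conflict list, which is what makes it usable as a loop exit condition.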

Gap Analyzer (agent/scoring/gaps.py) — Nine gap types that identify what hasn't been queried yet relative to what's been found. A phantom process conflict with no malfind run is a CRITICAL gap. A network connection to an external IP with no registry or persistence check is a HIGH gap.
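The two example rules above can be written as plain conditionals. This is a sketch under stated assumptions: the Gap dataclass, the string identifiers for conflicts and tools, and the reading of the second rule as "neither a registry nor a persistence check has run" are all illustrative choices, and the other seven gap types are not shown.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Gap:
    severity: str
    reason: str


def find_gaps(conflict_types: Set[str], tools_run: Set[str], has_external_conn: bool) -> List[Gap]:
    """Apply two of the gap rules to the current evidence state."""
    gaps: List[Gap] = []
    if "phantom_process" in conflict_types and "malfind" not in tools_run:
        gaps.append(Gap("CRITICAL", "phantom process conflict with no malfind run"))
    # Assumed interpretation: the gap is open only if neither check has run.
    if has_external_conn and not ({"registry", "persistence"} & tools_run):
        gaps.append(Gap("HIGH", "external connection with no registry or persistence check"))
    return gaps


open_gaps = find_gaps({"phantom_process"}, {"pslist", "netscan"}, has_external_conn=True)
```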

Confidence Scoring (agent/scoring/confidence.py) — Each finding starts at a base score determined by tool reliability, then receives corroboration boosts if multiple independent sources agree, contradiction penalties if sources conflict, and traceability penalties if a finding has no raw_hash in the execution log.
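The scoring rule described above is additive, which a short sketch makes concrete. The weights, the clamping to the range 0 to 1, and the function name are assumptions chosen for illustration; the source does not state the actual values.

```python
def confidence(
    base: float,
    corroborating_sources: int,
    contradictions: int,
    has_raw_hash: bool,
    boost: float = 0.1,               # hypothetical weight per agreeing source
    penalty: float = 0.2,             # hypothetical weight per contradiction
    untraceable_penalty: float = 0.3, # hypothetical cost of a missing raw_hash
) -> float:
    """Combine base reliability, corroboration, contradiction, and traceability."""
    score = base + boost * corroborating_sources - penalty * contradictions
    if not has_raw_hash:
        score -= untraceable_penalty
    # Clamp so that stacked boosts or penalties stay within [0, 1].
    return max(0.0, min(1.0, score))


corroborated = confidence(0.6, corroborating_sources=2, contradictions=0, has_raw_hash=True)
untraceable = confidence(0.6, corroborating_sources=2, contradictions=0, has_raw_hash=False)
```

Under these assumed weights, the same finding drops a full tier's worth of confidence when it cannot be traced to a logged tool response.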

Reporting (reporting/synthesize.py) — Jinja2 template renders a structured markdown report. The JSON report is machine-readable for downstream SIEM ingestion.

Benchmark Harness (benchmarks/harness.py) — Compares agent output against a ground truth JSON with known IOCs and expected conflicts, scoring TP/FP/FN and counting hallucinations (findings without a raw_hash traceable to the execution log).
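The TP/FP/FN part of that comparison is three set operations. The sketch below covers only that part and assumes IOCs are compared as exact strings; the function name and sample values are hypothetical.

```python
from typing import Dict, Set


def score_iocs(reported: Set[str], ground_truth: Set[str]) -> Dict[str, int]:
    """Count true positives, false positives, and false negatives."""
    return {
        "tp": len(reported & ground_truth),
        "fp": len(reported - ground_truth),
        "fn": len(ground_truth - reported),
    }


# Illustrative data only; 203.0.113.7 is a documentation-range address.
truth = {"203.0.113.7", "beacon.exe", "certutil.exe"}
reported = {"203.0.113.7", "beacon.exe", "notepad.exe"}
result = score_iocs(reported, truth)
```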

Challenges we ran into

Context overflow. Early prototype runs on large forensic images caused the MFT timeline (200,000+ entries on a real image) to fill the entire context window. The solution was enforcing pagination at the MCP tool signature level — not as a prompt instruction, but as a required parameter. The model cannot call mft_timeline() without specifying limit and offset. This architectural constraint is more reliable than any system prompt instruction.
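In plain Python, the equivalent of a required tool parameter is a keyword-only argument with no default. The following sketch shows the idea; the rows argument stands in for parsed MFT output, and MAX_PAGE is a hypothetical cap that the source does not mention.

```python
from typing import List, Sequence

MAX_PAGE = 500  # hypothetical upper bound on page size


def mft_timeline(rows: Sequence[dict], *, limit: int, offset: int) -> List[dict]:
    """Return one page of entries; limit and offset have no defaults."""
    if not 1 <= limit <= MAX_PAGE:
        raise ValueError(f"limit must be between 1 and {MAX_PAGE}")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return list(rows[offset:offset + limit])


rows = [{"entry": i} for i in range(1000)]
page = mft_timeline(rows, limit=100, offset=200)
```

Calling mft_timeline(rows) without the two keyword arguments raises a TypeError before any data is read, so an unbounded request fails at the call boundary instead of relying on the caller to remember a convention.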

Hallucination without a ground truth anchor. When an LLM summarizes findings across multiple tool calls, it can introduce artifacts that were never in any tool output. We addressed this by logging the SHA-256 of every raw tool response to the JSONL execution log, then cross-referencing every finding in the final report against those hashes. A finding without a traceable hash is flagged as a potential hallucination and excluded from the high-confidence tier.
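A minimal sketch of that logging and cross-referencing step, using only the standard library. The record fields (tool, raw_hash) follow the naming used in this writeup, but the function names and the exact JSONL layout are assumptions.

```python
import hashlib
import io
import json
from typing import Dict, Iterable, List, TextIO


def log_tool_response(log: TextIO, tool: str, raw: bytes) -> str:
    """Append one JSONL record with the SHA-256 of the raw tool output."""
    digest = hashlib.sha256(raw).hexdigest()
    log.write(json.dumps({"tool": tool, "raw_hash": digest}) + "\n")
    return digest


def untraceable_findings(findings: Iterable[Dict], log_text: str) -> List[Dict]:
    """Return findings whose raw_hash does not appear in the execution log."""
    known = {json.loads(line)["raw_hash"] for line in log_text.splitlines() if line.strip()}
    return [f for f in findings if f.get("raw_hash") not in known]


log = io.StringIO()  # stands in for the JSONL file on disk
digest = log_tool_response(log, "pslist", b"PID\tName\n4242\tbeacon.exe\n")
findings = [
    {"claim": "beacon.exe running as PID 4242", "raw_hash": digest},
    {"claim": "mimikatz.exe executed", "raw_hash": None},
]
flagged = untraceable_findings(findings, log.getvalue())
```

Here the second finding has no hash that maps to a logged response, so it is the only one flagged.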

Evidence integrity vs. prompt-based integrity. Initially, "don't modify evidence" was a system prompt instruction. That's not sufficient — a sufficiently long conversation can drift away from system prompt constraints. The fix was architectural: disk images are mounted read-only at the OS level by install.sh before the server starts, and the MCP server exposes zero write operations. You cannot violate evidence integrity by any prompt path.

Windows encoding on the demo runner. The Jinja2 report template uses Unicode characters (→, ✓) that the default Windows CP1252 encoding cannot represent, either in terminal output or in file writes. Fixed with $env:PYTHONIOENCODING="utf-8" for terminal output and explicit encoding="utf-8" on every file write in the synthesis layer.
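The file-write half of that fix amounts to never relying on the platform default encoding. A small sketch, with write_report as a hypothetical helper name:

```python
import tempfile
from pathlib import Path


def write_report(path: Path, text: str) -> None:
    """Write the report as UTF-8 regardless of the platform's default encoding."""
    path.write_text(text, encoding="utf-8")


report = "phishing → macro → beacon ✓\n"
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "report.md"
    write_report(out, report)
    round_trip = out.read_text(encoding="utf-8")
```

The same string cannot be encoded as CP1252, which is why a write that falls back to that default raises UnicodeEncodeError.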

False-positive phantom process conflicts. The conflict detector initially flagged standard Windows binaries (explorer.exe, cmd.exe, powershell.exe) as phantom processes because the demo MFT dataset only contained attack-path artifacts. Resolved by ensuring the MFT demo data includes normal system binary paths — the conflict detector now only fires on truly missing entries.

Accomplishments that we're proud of

The self-correction loop actually works. The most satisfying moment in building AEGIS was watching it detect a phantom process conflict in Iteration 1, autonomously decide to run malfind() in Iteration 2, confirm shellcode injection in the target PID, and then report zero remaining conflicts — without any human instruction between those two iterations. The loop doing exactly what it was designed to do, end-to-end, felt like the architecture paying off.

Evidence integrity is architectural, not advisory. We're proud that "don't modify evidence" is not a sentence in a system prompt — it's a property guaranteed by construction. Disk images are mounted read-only at the OS level. The MCP server has a startup assertion that enumerates every exposed tool and verifies zero write operations exist. You cannot violate evidence integrity through any prompt path, regardless of what the model says or does.

The hallucination detection benchmark. Building a quantitative measure of LLM hallucination in a DFIR context — cross-referencing every finding against SHA-256 hashes of raw tool output in the execution log — is something we haven't seen done in this domain before. It gives the system a verifiable accuracy claim rather than a vibes-based one.

Typed parsers as a reasoning multiplier. Replacing raw TSV/CSV tool output with structured, named Python dataclasses before any data reaches the model produced a noticeable and immediate improvement in reasoning quality. The parser layer ended up being one of the highest-leverage pieces of the entire system, and it's the part we're most likely to keep building on.

Zero false negatives on the core APT scenario. The demo scenario — phishing → macro → certutil download → Cobalt Strike beacon → persistence → lateral movement — produces every major artifact class a real investigation would encounter. AEGIS surfaces all of them, correctly tiered by confidence, with no fabricated IOCs in the final report. For a system built in a hackathon window, that accuracy benchmark clears a bar we're genuinely proud of.

What we learned

Deterministic post-processing beats prompt engineering for verification. Asking Claude "check your own work for gaps" in a prompt produces inconsistent results. Replacing that with a deterministic Python gap analyzer that checks a fixed set of conditions against the current evidence dict produces reliable, reproducible coverage guarantees. The model's job is evidence gathering; the code's job is verification.

The MCP boundary is the right place for safety constraints. Every safety property we tried to enforce in prompts was either fragile (drifts over long conversations) or unverifiable (you can't prove a prompt instruction was followed). Every property enforced at the MCP server boundary is guaranteed by construction. Read-only tools, mandatory pagination, typed output — these are architectural properties, not behavioral hopes.

Tool output structure determines reasoning quality more than prompt wording. When the model received raw Volatility tab-separated output, the reasoning was shallow and inconsistent. When it received typed dataclasses with named fields, confidence scores, and explicit is_deleted / is_suspicious flags, the reasoning improved substantially. The parser layer is doing as much work as the prompt.

What's next for AEGIS — Autonomous Evidence & Guided Investigation System

Live SIFT Workstation integration. The demo runs against pre-baked data. The MCP server is written for real SIFT tool calls — the next step is end-to-end testing against an actual SIFT VM with a CTF memory image and disk image.

Streaming iteration output. Currently the loop runs synchronously. Adding a websocket or SSE stream would allow a SOC analyst to watch the agent reason in real time and inject guidance between iterations — a human-in-the-loop triage workflow.

Extended conflict types. The current detector covers six conflict types. Planned additions: DLL hijacking detection (process loads DLL from unusual path), NTFS ADS detection (alternate data stream in MFT not referenced by any process), and timestomping detection (MFT $STANDARD_INFORMATION vs $FILE_NAME timestamp delta).
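For the timestomping check, one common heuristic is to flag a file whose $STANDARD_INFORMATION creation time is earlier than its $FILE_NAME creation time. The sketch below assumes that heuristic; the function name and the one-second tolerance are hypothetical, and this feature is planned rather than implemented.

```python
from datetime import datetime, timedelta


def looks_timestomped(
    si_created: datetime,
    fn_created: datetime,
    tolerance: timedelta = timedelta(seconds=1),  # hypothetical threshold
) -> bool:
    """Flag entries whose $SI creation time predates $FN by more than the tolerance."""
    return (fn_created - si_created) > tolerance


fn_time = datetime(2024, 3, 1, 12, 0, 0)
backdated = looks_timestomped(datetime(2019, 1, 1), fn_time)
unmodified = looks_timestomped(fn_time, fn_time)
```

A delta alone is not proof of tampering, since some legitimate file operations also produce mismatched timestamps, so a result like this would more plausibly feed the confidence score than stand as a finding on its own.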

Automated IOC enrichment. Post-triage, automatically query VirusTotal and MISP for each extracted IOC hash and IP, and fold the verdict back into the confidence score before the final report is written.

Multi-image case correlation. A ransomware investigation often spans 20+ endpoint images. The next architecture step is a case-level correlation layer that aggregates findings across multiple AEGIS runs, identifies lateral movement paths, and generates a unified attack timeline.
