Sentinel-MCP: A Type-Safe Autonomous DFIR Investigator

Inspiration

The Find Evil! hackathon brief opened with a number that stopped me: 7 minutes. That is how fast an AI-powered adversary can go from initial access to full domain control. Meanwhile a human analyst is still pulling up their toolkit.

I have a background in health information management and statistics at a teaching hospital in Nigeria. I know what it means when a system fails during a critical moment — the cost is measured in lives, not just data. Cybersecurity incidents are no different. Every minute an attacker operates undetected is a minute of damage that compounds.

What inspired Sentinel-MCP was a specific line in the hackathon brief: "Protocol SIFT hallucinates more than we would like." That sentence is an honest admission from the people who built the platform. It means defenders are fighting with tools that sometimes lie to them — during the most high-pressure moments of their professional lives.

I wanted to solve that problem architecturally — not by writing better prompts, but by building a system where hallucination is physically impossible because the LLM never receives raw data it could misread. The MCP server parses everything first. The LLM reasons over structured facts.

The offline mode came from the same instinct. Real investigations involve real sensitive data — credentials, PII, classified artifacts. Sending that to a cloud API is not just a technical risk, it is a legal one. Sentinel-MCP is designed to work in environments where the cloud is not an option.

Building this as a solo developer from Nigeria, on a Windows machine running a Linux VM, using entirely free tools and APIs — that constraint itself shaped better decisions. When you cannot afford to waste resources, you build leaner and more deliberately.

The gap between adversary speed and defender speed is the most dangerous problem in cybersecurity right now. Sentinel-MCP is my contribution to closing it.

What It Does

Sentinel-MCP transforms the SANS SIFT Workstation into a fully autonomous incident response agent. It directly addresses the core mission of the Find Evil! hackathon: closing the gap between adversaries who operate at machine speed and defenders who still pull up toolkits manually during active incidents.

The problem with Protocol SIFT's current architecture is documented in the hackathon brief itself — it hallucinates more than practitioners would like. The root cause is architectural: Protocol SIFT sends raw Volatility output directly to the LLM. A 2,000-line process list hits the context window, the model misreads column headers, and fabricated findings appear in the report.

Sentinel-MCP solves this with a single architectural change: wrap every SIFT tool as a typed MCP function that parses its own output before the LLM ever sees it. The result, measured across 10 test runs on a real Windows Vista memory image, is a 0% hallucination rate.

What the agent does in one command:

  1. Runs process enumeration and network scanning in parallel (Phase 1)
  2. Evaluates findings against suspicion rules — if signals are found, autonomously chains to persistence checking and module enumeration (Phase 2)
  3. Cross-validates memory artifacts against disk artifacts, flagging ghost processes (Phase 3)
  4. Sends a compressed, structured evidence package to the LLM
  5. Produces a full Tier-3 forensic report with confidence-labelled findings and a prioritised remediation playbook

Zero human input between steps. One command to start. Under 5 minutes to a complete investigation.


How We Built It

Architecture: Custom MCP Server (Approach #2)

Sentinel-MCP follows the Custom MCP Server architectural approach — the approach the hackathon brief explicitly describes as "the most sound" and "the architecture that would make a practitioner comfortable standing behind the results."

Layer 1 — MCP Server (Safety)

Seven typed forensic functions replace generic shell access:

get_process_list(image_path: str) -> dict
get_network_connections(image_path: str) -> dict
check_persistence(image_path: str) -> dict
extract_mft_timeline(image_path: str) -> dict
analyze_prefetch(image_path: str) -> dict
get_loaded_modules(image_path: str) -> dict
search_strings(image_path: str, pattern: str) -> dict

The agent physically cannot run destructive commands — the server does not expose them. Evidence integrity is enforced architecturally, not by a prompt rule that can be ignored.

Layer 2 — Investigation Engine (Intelligence)

A rules-based autonomous orchestrator encodes actual DFIR triage logic. It decides which tools to run next based on what the previous tools found — the same sequencing a senior analyst follows during manual investigation.

Phase 1: get_process_list + get_network_connections (parallel)
         ↓ if suspicious processes found
Phase 2: check_persistence + get_loaded_modules (parallel)
         ↓ if C2 traffic found  
Phase 2: search_strings for C2 IP addresses
         ↓ always
Phase 3: extract_mft_timeline + analyze_prefetch (parallel)
         Cross-validate memory vs disk artifacts
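The chaining rules above can be sketched in a few lines. The tool callables here are stand-ins for the real MCP functions, and this simplified version runs sequentially where the actual engine parallelises each phase with asyncio:

```python
def run_investigation(tools: dict) -> list[str]:
    """Replay the triage rules: Phase 1 always, Phase 2 only on signals,
    Phase 3 always. Returns the names of the tools that were run, in order."""
    ran = []
    # Phase 1: always enumerate processes and network connections
    phase1 = {name: tools[name]() for name in
              ("get_process_list", "get_network_connections")}
    ran += list(phase1)
    # Phase 2: chain deeper only when Phase 1 raised signals
    if phase1["get_process_list"].get("suspicious"):
        for name in ("check_persistence", "get_loaded_modules"):
            tools[name]()
            ran.append(name)
    if phase1["get_network_connections"].get("c2_traffic"):
        tools["search_strings"]()
        ran.append("search_strings")
    # Phase 3: always cross-validate memory vs disk artifacts
    for name in ("extract_mft_timeline", "analyze_prefetch"):
        tools[name]()
        ran.append(name)
    return ran
```

The key property is that the branch decisions read structured fields ("suspicious", "c2_traffic"), not raw tool text.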

Layer 3 — LLM Backend (Reasoning)

A swappable backend abstraction supports three LLM modes controlled by a single environment variable:

Mode      LLM                         Privacy                                    Cost            Speed
offline   Phi-3 Mini via Ollama       Air-gap safe — zero data leaves machine    Free forever    ~35 min (CPU)
groq      Llama 3.3 70B via Groq      Cloud — data sent to Groq                  Free tier       ~296 seconds
cloud     Claude API via Anthropic    Cloud — data sent to Anthropic             Pay per token   ~3 minutes
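A sketch of how that single-environment-variable switch might work. The variable name SENTINEL_LLM_MODE and the class names are assumptions, and the actual Ollama/Groq/Anthropic calls are stubbed out:

```python
import os

class LLMBackend:
    """Common interface every backend implements."""
    name = "base"
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

class OfflineBackend(LLMBackend):
    name = "offline"
    def generate(self, prompt: str) -> str:
        # Real version would POST to the local Ollama server (Phi-3 Mini)
        return "[phi-3 response]"

class GroqBackend(LLMBackend):
    name = "groq"
    def generate(self, prompt: str) -> str:
        # Real version would call the Groq API (Llama 3.3 70B)
        return "[llama-3.3 response]"

BACKENDS = {"offline": OfflineBackend, "groq": GroqBackend}

def load_backend() -> LLMBackend:
    # One environment variable selects the mode; offline is the safe default
    mode = os.environ.get("SENTINEL_LLM_MODE", "offline")
    return BACKENDS[mode]()
```

Defaulting to offline means a misconfigured deployment fails private, not public.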

The offline mode is not a convenience feature — it is a first-class design decision. Real DFIR investigations frequently involve memory dumps containing PII, credentials, classified data, and attorney-client communications. Uploading them to any third-party API is a potential legal violation. Sentinel-MCP is the only submission designed for environments where cloud APIs are forbidden.

Hallucination Prevention: Four Layers

Layer                 Mechanism                                                  Hallucinations caught
L1 Architectural      Parser converts raw output to structured JSON before LLM   Misread columns, confused formatting
L2 Prompt             Confidence schema: CONFIRMED / INFERRED / POSSIBLE         Overconfident claims without evidence
L3 Cross-validation   Memory vs disk consistency check                           Ghost processes, fabricated artifacts
L4 Self-correction    Post-draft review of CONFIRMED findings                    Wrong attributions, invented keys
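Layer 3's cross-validation reduces in principle to a set comparison: a process observed in memory with no corresponding disk artifact (no prefetch entry, no MFT record) gets flagged. A hypothetical sketch:

```python
def find_ghost_processes(memory_procs: set[str],
                         prefetch_execs: set[str],
                         mft_files: set[str]) -> set[str]:
    """Flag processes present in memory but absent from all disk artifacts.
    Names are compared case-insensitively (Windows filenames)."""
    disk_evidence = {name.lower() for name in prefetch_execs | mft_files}
    return {p for p in memory_procs if p.lower() not in disk_evidence}
```

The same check catches fabrication in the other direction: an artifact the LLM reports that appears in neither source set cannot be marked CONFIRMED.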

How Sentinel-MCP Improves Protocol SIFT

Dimension                  Protocol SIFT                              Sentinel-MCP
Tool access                Raw shell commands via Claude Code         7 typed MCP functions — no shell access
Hallucination prevention   Prompt rule: "never fabricate"             Architectural: parser layer before LLM
Tool sequencing            Manual or prompt-guided                    Autonomous Investigation Engine
Output to LLM              Raw Volatility text (thousands of lines)   Structured JSON summary
Offline capability         Requires Claude API                        Phi-3 via Ollama — fully air-gapped
Evidence integrity         Prompt rule: "never modify files"          Validated file paths — architectural
Analyst required           Yes — to guide Claude                      No — fully autonomous

Challenges

Volatility 3 output parsing was the most technically demanding problem. Volatility 3 changed column positions relative to Volatility 2, and some rows omit whitespace between fields. Building reliable parsers required running each plugin against the real test image and manually verifying column indices before trusting any output.
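The parsing approach reduces to splitting rows at verified column positions and rejecting anything that does not look like a data row. A simplified sketch (real windows.pslist output has more columns than shown, and the indices here are illustrative, verified only in the sense the paragraph above describes):

```python
def parse_pslist(raw: str) -> list[dict]:
    """Parse Volatility 3 windows.pslist text output into structured rows.
    Banner lines and the header row fail the PID-is-numeric check and
    are skipped rather than guessed at."""
    rows = []
    for line in raw.splitlines():
        parts = line.split()
        # A valid data row starts with a numeric PID and has >= 3 fields
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        rows.append({"pid": int(parts[0]),
                     "ppid": int(parts[1]),
                     "name": parts[2]})
    return rows
```

Rejecting malformed rows outright, instead of passing them downstream, is what keeps garbage out of the evidence package.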

Phi-3 context window management — Phi-3 Mini has a small context window relative to the evidence packages our Investigation Engine produces. A 512MB memory image generates thousands of lines of raw Volatility output. Compressing this to a plain-text summary under 1,500 characters was essential for Phi-3 to respond within the timeout window.
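The compression step can be sketched as severity-ranked truncation: keep the highest-severity findings and stop appending once the character budget is reached. The field names here are illustrative:

```python
def compress_evidence(items: list[dict], limit: int = 1500) -> str:
    """Emit a plain-text evidence summary under `limit` characters,
    dropping the lowest-severity findings first."""
    summary = []
    total = 0
    for item in sorted(items, key=lambda f: -f["severity"]):
        line = f"[sev {item['severity']}] {item['text']}"
        if total + len(line) + 1 > limit:   # +1 for the joining newline
            break
        summary.append(line)
        total += len(line) + 1
    return "\n".join(summary)
```

Ranking before truncating means a tight budget costs low-priority context, never the C2 beacon at the top of the list.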

Timeout tuning across LLM backends — Phi-3 running CPU-only takes 35+ minutes on a complex evidence package; Groq takes about 5 minutes; Claude about 3. Each backend required different timeout values and evidence compression strategies to prevent connection failures.

Evidence integrity vs analysis depth — The more context we strip from the evidence summary to fit Phi-3's window, the less the LLM can reason about. This is an inherent tradeoff in offline mode. We documented it honestly in the accuracy report rather than hiding it.

Copy-paste between Windows host and SIFT VM — VirtualBox clipboard integration required Guest Additions installation and bidirectional clipboard configuration. Before this was solved, a shared folder was used as a text file bridge for commands — itself an interesting workflow constraint in real air-gapped environments.


What We Learned

Architectural safety and autonomous intelligence are not competing design goals — they reinforce each other. The same decision that prevents hallucinations (parse output before sending to LLM) also makes autonomous chaining more reliable, because the Investigation Engine reasons over clean structured data rather than unpredictable raw text.

Offline LLM mode is not a research curiosity — it is a deployment requirement for real DFIR environments. Designing for it from the start forced better architectural decisions across the entire system.

Honest documentation of failure modes strengthens a submission. Early Phi-3 runs scored 0% recall — not because the LLM was wrong, but because our Volatility parsers were reading incorrect column positions. Documenting this as engineering iteration rather than hiding it demonstrates the kind of intellectual honesty practitioners value.

The gap between adversary speed and defender speed is real and measurable. A manual analyst on our test image takes 20-40 minutes. Sentinel-MCP in Groq mode averages 296 seconds. That is a 4-8x speedup on the analysis phase alone — before accounting for the cognitive load reduction of not having to remember 200+ tool flags during a 3AM incident.


What's Next

SIEM integration — An MCP server that connects Protocol SIFT to a live SIEM endpoint, enabling real-time triage as alerts arrive rather than only on collected memory images.

Multi-agent mode — Decompose the investigation into specialist agents: one for memory analysis, one for disk artifacts, one for network correlations. No single model holds all raw data in its context window.

Windows memory image library — An automated benchmarking harness that runs Sentinel-MCP against a curated set of known-malicious memory images with documented ground truth, enabling continuous accuracy measurement as the system evolves.

Quantised model options — Smaller Phi-3 quantisations that fit within 4GB RAM, enabling deployment on lower-specification hardware without cloud dependency.

Community tool addition workflow — Any practitioner can add a new SIFT tool wrapper in under 30 minutes by subclassing BaseSIFTTool and implementing two methods. Building and documenting this contribution path is the first step toward making Sentinel-MCP a community-maintained platform rather than a single submission.
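A hypothetical sketch of that contribution path. The two method names (command, parse) are illustrative stand-ins for the project's actual interface, not its real API:

```python
from abc import ABC, abstractmethod

class BaseSIFTTool(ABC):
    """Contribution interface sketch: a new wrapper supplies the command
    to run and a parser for its output; everything else is inherited."""
    @abstractmethod
    def command(self, image_path: str) -> list[str]: ...
    @abstractmethod
    def parse(self, raw_output: str) -> dict: ...

class RegRipperTool(BaseSIFTTool):
    """Example wrapper around RegRipper's Run-key plugin."""
    def command(self, image_path: str) -> list[str]:
        return ["rip.pl", "-r", image_path, "-p", "run"]
    def parse(self, raw_output: str) -> dict:
        keys = [line.strip() for line in raw_output.splitlines() if line.strip()]
        return {"tool": "regripper", "run_keys": keys}
```

Because parsing lives in the wrapper, every community-added tool inherits the same hallucination guarantee as the built-in seven.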


Built With

  • Python 3.12
  • MCP SDK (Model Context Protocol)
  • Volatility 3 Framework 2.27.0
  • SANS SIFT Workstation (Ubuntu)
  • Ollama + Phi-3 Mini (offline mode)
  • Groq API + Llama 3.3 70B (free cloud mode)
  • Anthropic Claude API (cloud mode)
  • log2timeline / Plaso
  • RegRipper
  • asyncio (parallel tool execution)
  • Protocol SIFT (base framework extended)
