Inspiration

Modern attacks move faster than traditional incident response. Public reporting has shown attacker breakout times measured in minutes, while defenders often still need to collect evidence, choose tools, run plugins, correlate artifacts, and write a reliable narrative.

The problem is not only speed. It is trust.

Many AI-assisted DFIR workflows perform single-pass analysis. They may flag legitimate software as malicious, miss cross-source evidence, or present weak assumptions as conclusions.

A senior incident responder does not work that way. They challenge findings, demand corroboration, clear false positives, and explain why a conclusion is justified.

That gap inspired PHANTOM DFIR:

What if a forensic AI agent could argue with itself before presenting findings?

PHANTOM DFIR was built to behave less like a raw tool runner and more like an autonomous investigative workflow: collect evidence, generate hypotheses, validate them, challenge them, correct mistakes, and produce traceable reports.


What it does

PHANTOM DFIR stands for Parallel Hypothesis Analysis with Multi-agent Threat Hunting Overlay Network.

It is an autonomous digital forensics and incident response framework for memory, disk, and network evidence. PHANTOM routes evidence to the correct analysis engine, runs forensic tooling, validates findings, performs self-correction, and generates reports with traceable evidence.

PHANTOM supports:

  • Memory images
  • Disk and E01 images
  • PCAP / PCAPNG network captures
  • Offline LLM-assisted reasoning with Ollama
  • Rule-based fallback when no LLM is available
  • Read-only MCP forensic tooling
  • Ground-truth benchmark validation

Core Workflow

PHANTOM uses an adversarial investigation model:

| Component | Responsibility |
|---|---|
| Router | Detects evidence type and routes memory, disk, or PCAP evidence to the correct engine |
| Collector | Runs forensic collection using Volatility 2/3, Sleuth Kit, tshark, strings, YARA-style rules, and other tools |
| Investigator | Generates hypotheses from deterministic rules and optional LLM reasoning |
| Evidence Agent | Re-queries specific processes, IOCs, identities, services, files, or network artifacts |
| Skeptic | Challenges weak findings, clears false positives, and requires corroboration |
| Evidence Gap Controller | Detects missing evidence and decides whether to rerun, continue, or accept |
| Reporter | Produces JSON, Markdown, execution logs, reasoning traces, and benchmark-ready output |
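As an illustration of the Router step, evidence type can often be inferred from the first bytes of a file. The sketch below is not PHANTOM's actual router: the function names, the returned labels, and the extension fallback are assumptions made for the example. Only the file signatures (classic pcap, pcapng, and the EWF/E01 segment header) are standard values.

```python
from pathlib import Path

# Standard file signatures; the mapping to engine labels is hypothetical.
PCAP_MAGICS = {
    b"\xa1\xb2\xc3\xd4", b"\xd4\xc3\xb2\xa1",  # classic pcap, microsecond timestamps
    b"\xa1\xb2\x3c\x4d", b"\x4d\x3c\xb2\xa1",  # classic pcap, nanosecond timestamps
}
PCAPNG_MAGIC = b"\x0a\x0d\x0d\x0a"              # pcapng Section Header Block type
E01_MAGIC = b"EVF\x09\x0d\x0a\xff\x00"          # EWF/E01 segment file signature


def detect_evidence_type(header: bytes, suffix: str = "") -> str:
    """Return 'pcap', 'disk', 'memory', or 'unknown' from a file's leading bytes."""
    if header[:4] in PCAP_MAGICS or header[:4] == PCAPNG_MAGIC:
        return "pcap"
    if header[:8] == E01_MAGIC:
        return "disk"
    # Raw memory dumps have no universal signature, so fall back to the extension.
    if suffix.lower() in {".mem", ".vmem", ".lime", ".dmp"}:
        return "memory"
    if suffix.lower() in {".dd", ".img"}:
        return "disk"
    return "unknown"


def route(path: str) -> str:
    """Read only the first 8 bytes of the evidence file and classify it."""
    p = Path(path)
    with p.open("rb") as fh:
        return detect_evidence_type(fh.read(8), p.suffix)
```

Checking signatures before extensions means a renamed capture is still routed by its content. The extension list is only a last resort and would need tuning for real cases.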

Key Capabilities

  • Memory analysis using Volatility 2 and Volatility 3
  • Disk artifact extraction from filesystem, registry, browser, email, malware, crypto, prefetch, UserAssist, and Shimcache evidence
  • PCAP analysis with HTTP objects, webmail attribution, identity extraction, and communication role ranking
  • Harassment attribution logic for sender, victim, alias, and internal IP correlation
  • Crypto evidence recovery for AES, GPG, and BitLocker workflows
  • MITRE ATT&CK-style mapping from validated indicators
  • MCP-based read-only forensic tool access
  • SHA256 evidence integrity verification
  • Self-correction and false-positive clearing
  • Benchmark validation against known ground truth

Benchmark Results

PHANTOM was tested across memory, disk, crypto, and PCAP evidence using a ground-truth benchmark framework.

| Case | Evidence Type | Result |
|---|---|---|
| Ali Hadi memory | Memory | Fully reproduced |
| SysInternals challenge | Disk | Fully reproduced |
| CFReDS data leakage | Disk | Fully reproduced |
| Ali Hadi Encrypt Them All | Disk / Crypto | Fully reproduced |
| Nitroba harassment attribution | PCAP | Fully reproduced |

Overall benchmark summary:

| Metric | Result |
|---|---|
| Cases scored | 5 |
| Fully reproduced | 5 / 5 |
| Average adjusted score | 97% |
| Verdict matches | 5 / 5 |

The benchmark framework validates findings against known ground truth instead of relying only on subjective report quality.
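To make the idea of ground-truth validation concrete, here is a minimal sketch of comparing a report against an answer key. It does not reproduce PHANTOM's adjusted-score formula, which is not described here. The `CaseScore` type, the `score_case` function, and the dictionary layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseScore:
    matched: set
    missed: set
    unexpected: set
    verdict_match: bool

    @property
    def recall(self) -> float:
        # Share of ground-truth findings that the report reproduced.
        total = len(self.matched) + len(self.missed)
        return len(self.matched) / total if total else 1.0


def score_case(expected: dict, reported: dict) -> CaseScore:
    """Compare reported findings with the answer key, ignoring letter case."""
    exp = {f.lower() for f in expected["findings"]}
    rep = {f.lower() for f in reported["findings"]}
    return CaseScore(
        matched=exp & rep,
        missed=exp - rep,
        unexpected=rep - exp,
        verdict_match=expected["verdict"] == reported["verdict"],
    )
```

Keeping `missed` and `unexpected` as separate sets lets a benchmark run say which findings were wrong as well as how many, which is what makes a score auditable.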


Self-Correction Example

In memory analysis, PHANTOM may initially identify a suspicious executable or IOC from a single source.

Instead of immediately escalating it, the workflow performs adversarial validation:

Investigator:
“This process or artifact may indicate suspicious activity.”

Evidence Agent:
Re-query process path, service metadata, command line, network state, and supporting plugins.

Skeptic:
“This finding is weak or benign unless corroborated by independent evidence.”

If the evidence does not support the claim, PHANTOM clears or downgrades the finding.

This prevents the system from treating every suspicious string, YARA hit, or tool name as a confirmed compromise.
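The exchange above can be sketched as a small corroboration check. This is an illustrative model only: the `Finding` type, the status labels, and the two-source threshold are assumptions, not PHANTOM's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    indicator: str
    sources: set = field(default_factory=set)  # e.g. {"yara", "pslist", "netscan"}
    status: str = "suspected"


def skeptic_review(finding: Finding, min_sources: int = 2) -> Finding:
    """Escalate only findings backed by enough independent sources."""
    n = len(finding.sources)
    if n >= min_sources:
        finding.status = "corroborated"
    elif n == 1:
        finding.status = "downgraded"  # single-source hit: keep it, but do not escalate
    else:
        finding.status = "cleared"     # nothing supports it after re-query
    return finding
```

Storing sources as a set means repeated hits from the same plugin count once, so volume alone cannot stand in for independent evidence.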


Case Example: Nitroba PCAP Attribution

PHANTOM’s PCAP engine extracted and correlated webmail identity evidence from network traffic.

It identified:

  • Suspect account: jcoachj@gmail.com
  • Threat alias: the_whole_world_is_watching@nitroba.org
  • Victim: lilytuckrige@yahoo.com
  • Internal IP: 192.168.15.4
  • Case classification: harassment attribution
  • Verdict: high-confidence attribution

PHANTOM also corrected its own earlier over-attribution by separating extracted identities into distinct categories:

  • Primary suspects
  • Primary victims
  • Communication participants
  • Background identities
  • Session/tracking/request tokens

This reduced noise from unrelated browsing artifacts and ranked identities by evidentiary proximity instead of raw frequency.
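One way to picture ranking by proximity rather than frequency is shown below. The context labels and weights are invented for illustration, and PHANTOM's real scoring may differ. The ranking uses each identity's strongest single observation and uses frequency only to break ties.

```python
from collections import defaultdict

# Hypothetical weights: how close an observation sits to the act under investigation.
PROXIMITY = {
    "authenticated_session": 1.0,
    "message_sender": 0.9,
    "message_recipient": 0.6,
    "page_content": 0.2,
    "tracking_token": 0.0,
}


def rank_by_proximity(observations):
    """observations: iterable of (identity, context) pairs. Returns identities, best first."""
    best = defaultdict(float)
    count = defaultdict(int)
    for identity, context in observations:
        # Unknown contexts score 0.0 so they can never outrank a known role.
        best[identity] = max(best[identity], PROXIMITY.get(context, 0.0))
        count[identity] += 1
    # Strongest single observation first; frequency only breaks ties.
    return sorted(best, key=lambda i: (-best[i], -count[i], i))
```

With this ordering, an address seen fifty times in page content still ranks below one seen once as a message sender.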


How we built it

Core Stack

  • Python 3.10 for orchestration and forensic pipelines
  • LangGraph for multi-agent memory investigation
  • Ollama / qwen2.5:14b for optional offline reasoning
  • Volatility 2 and Volatility 3 for memory forensics
  • Sleuth Kit for disk image analysis
  • tshark for PCAP extraction
  • ClamAV, YARA-style rules, GPG, libbde/dislocker for malware and crypto workflows
  • FastAPI + Uvicorn for the MCP server
  • JSON / Markdown reports for traceable outputs
  • Ground-truth benchmark framework for validation

Architecture Guardrails

PHANTOM uses structural safety controls rather than relying only on prompts.

1. Read-Only Evidence Handling

Evidence is treated as read-only. The tool hashes evidence and avoids modifying original images.

2. MCP Trust Boundary

The MCP server exposes typed read-only forensic tools. It does not provide destructive shell actions.
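A trust boundary of this kind can be approximated with an allowlist of fixed command templates. The sketch below is an assumption about how such a boundary could look, not the project's server code. The registry, tool names, and `build_command` helper are hypothetical, while the underlying commands (`vol -f … windows.pslist`, `fls -r`, and `tshark -r … -Y … -T fields -e …`) are ordinary read-only invocations of those tools.

```python
# Hypothetical registry: each entry is a fixed argv template for a read-only query.
READ_ONLY_TOOLS = {
    "memory_pslist": ["vol", "-f", "{evidence}", "windows.pslist"],
    "disk_list_files": ["fls", "-r", "{evidence}"],
    "pcap_http_hosts": ["tshark", "-r", "{evidence}", "-Y", "http.request",
                        "-T", "fields", "-e", "http.host"],
}


def build_command(tool: str, evidence: str) -> list:
    """Return the argv for an allowlisted tool; anything else is refused."""
    if tool not in READ_ONLY_TOOLS:
        raise ValueError(f"tool not exposed: {tool}")
    if evidence.startswith("-"):
        raise ValueError("evidence path must not look like an option")
    # Substituting into a list (never a shell string) keeps the path a single argument.
    return [part.replace("{evidence}", evidence) for part in READ_ONLY_TOOLS[tool]]
```

The resulting list would be passed to `subprocess.run(argv, shell=False)`. Because callers choose a tool name rather than supplying a command line, there is no path from a model's output to an arbitrary shell action.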

3. SHA256 Integrity Verification

Evidence is hashed during analysis so reports can reference evidence identity and integrity.
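A minimal sketch of the hashing step, using Python's standard `hashlib`. The function names and the 1 MiB chunk size are choices made for this example. Reading in chunks keeps memory use flat for multi-gigabyte images.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:  # read-only mode; the evidence is never opened for writing
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_unchanged(path: str, recorded: str) -> bool:
    """Compare the current hash with the one recorded at intake."""
    return sha256_file(path) == recorded.lower()
```

Recording the digest at intake and re-checking it before the report is written gives a simple before-and-after integrity statement.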

4. Bounded Self-Correction

Self-correction loops have maximum iteration limits to avoid runaway analysis.
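The bounded loop can be sketched as follows. The callback names, the default limit of three, and the outcome labels are assumptions for illustration, because the actual limits are not stated here.

```python
def run_self_correction(state, investigate, has_gaps, max_iterations=3):
    """Re-run investigation while evidence gaps remain, up to a hard limit.

    Returns (final_state, iterations_used, outcome).
    """
    for iteration in range(1, max_iterations + 1):
        state = investigate(state)
        if not has_gaps(state):
            return state, iteration, "accepted"
    # The cap was hit with gaps still open; report that instead of looping forever.
    return state, max_iterations, "limit_reached"
```

Returning the outcome label alongside the state lets the report distinguish a finding that converged from one that simply ran out of iterations.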

5. Rule-Based Fallback

If the LLM is unavailable, PHANTOM still runs deterministic collection, validation, skepticism, and reporting.


Challenges we ran into

1. False Positives from Weak Indicators

Early versions treated isolated suspicious strings or tool names as stronger evidence than they deserved.

We fixed this by requiring corroboration from multiple sources and by adding a skeptic stage that downgrades under-supported findings.

2. Identity Over-Attribution in PCAPs

The first PCAP attribution engine extracted many webmail identities but ranked unrelated browsing artifacts too highly.

We improved this by adding identity validation, role separation, recipient/sender logic, and attribution chains.

3. Sender vs Victim Confusion

Threat recipients were initially ranked as suspects because they appeared frequently in the threatening communications.

We added role precedence:

  • Sender/authenticated account evidence increases suspect confidence
  • Recipient/target evidence increases victim confidence
  • Recipient-only identities cannot outrank sender identities
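The three precedence rules above can be expressed as a sort key. This is a sketch under assumed names (`Identity`, `sender_hits`, `recipient_hits`, and the role labels are hypothetical), not the project's attribution engine.

```python
from dataclasses import dataclass


@dataclass
class Identity:
    address: str
    sender_hits: int = 0     # seen as sender or authenticated account
    recipient_hits: int = 0  # seen as recipient or target

    @property
    def role(self) -> str:
        if self.sender_hits > 0:
            return "suspect_candidate"
        if self.recipient_hits > 0:
            return "victim_candidate"
        return "background"


def rank_suspects(identities):
    """Order identities for the suspect list, strongest sender evidence first."""
    # recipient_hits is deliberately absent from the key, so a frequently
    # targeted address cannot climb the suspect ranking.
    return sorted(identities, key=lambda i: (i.sender_hits == 0, -i.sender_hits, i.address))
```

An address with forty recipient observations and no sender evidence still sorts after an address with a single sender observation, which is the behavior the third rule asks for.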

4. Crypto Case Runtime

Deep disk and crypto analysis can be expensive, especially when scanning large extracted filesystem caches.

We added clearer progress messages, skip options for expensive modules, and benchmark workflows so long runs are explainable and reproducible.

5. LLM Output Reliability

LLMs can return vague hypotheses or malformed structured output.

PHANTOM validates LLM-generated indicators, rejects vague IOCs, and falls back to rule-based investigation when needed.
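A minimal sketch of that validation gate, assuming the model is asked for a JSON object with an `indicators` list. The expected schema, the function names, and the accepted indicator shapes (IP address, MD5/SHA-1/SHA-256 hex digest, or domain-like string) are assumptions for this example.

```python
import ipaddress
import json
import re

HASH_RE = re.compile(r"^[0-9a-fA-F]{32}$|^[0-9a-fA-F]{40}$|^[0-9a-fA-F]{64}$")
DOMAIN_RE = re.compile(r"^(?=.{4,253}$)([a-z0-9-]{1,63}\.)+[a-z]{2,}$", re.I)


def is_concrete_ioc(value: str) -> bool:
    """Accept only values with a checkable shape; free-text descriptions fail."""
    value = value.strip()
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        pass
    return bool(HASH_RE.match(value) or DOMAIN_RE.match(value))


def parse_llm_indicators(raw: str):
    """Return validated IOCs, or None to signal that the rule-based path should run."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed structured output
    if not isinstance(data, dict):
        return None
    items = data.get("indicators")
    if not isinstance(items, list):
        return None
    kept = [v.strip() for v in items if isinstance(v, str) and is_concrete_ioc(v)]
    return kept or None
```

Returning `None` for both malformed output and an empty validated list gives the caller one condition to check before falling back to deterministic rules.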


Accomplishments that we're proud of

  • Built a working autonomous DFIR pipeline across memory, disk, crypto, and PCAP evidence
  • Added self-correction and skeptic review instead of single-pass conclusions
  • Fully reproduced 5 benchmark cases with a 97% average adjusted score
  • Correctly attributed the Nitroba harassment case to jcoachj@gmail.com
  • Recovered Ali Hadi Encrypt Them All crypto evidence, including AES, GPG, and BitLocker artifacts
  • Added execution logs, reasoning traces, and benchmark validation
  • Built a read-only MCP forensic tool server
  • Added graceful no-LLM mode for offline deterministic analysis
  • Preserved evidence safety with read-only workflows and SHA256 integrity checks

What we learned

  1. Forensic AI needs skepticism
    Detection is not enough. The system must challenge its own conclusions before reporting them.

  2. Evidence proximity matters
    The best identity is not always the most frequent identity. It is the identity closest to the action being investigated.

  3. Ground truth matters
    Benchmarking against known cases makes accuracy measurable and prevents overclaiming.

  4. Rules and LLMs work best together
    Deterministic forensic logic provides stability, while LLM reasoning helps with narrative, gap analysis, and hypothesis generation.

  5. Read-only architecture is essential
    Forensic systems must protect evidence integrity by design.


What's next

  • Add Zeek and Suricata enrichment for deeper network analysis
  • Expand multi-source correlation across memory, disk, and PCAP from the same incident
  • Add more benchmark cases from public forensic datasets
  • Improve timeline fusion across filesystem, registry, browser, memory, and network artifacts
  • Add richer report provenance so every finding links directly to the tool output that produced it
  • Improve runtime controls for very large disk and crypto evidence

Built With

  • fastapi
  • json
  • langgraph
  • mitre-att&ck
  • model-context-protocol-(mcp)
  • ollama
  • python
  • qwen-2.5
  • sans-sift-workstation
  • sha256
  • sleuthkit
  • subprocess
  • threadpoolexecutor
  • uvicorn
  • volatility-3