find-evil!

Adversarial Multi-Agent DFIR Correlation Engine

Find Evil! Hackathon submission — SANS SIFT Workstation · Protocol SIFT · LangGraph Multi-Agent Framework track

What it does

find-evil closes the hallucination gap in autonomous DFIR by running every analyst finding through a second adversarial agent whose only job is to attack it.

Three core innovations — none of which exists in the Protocol SIFT baseline:

1. Adversarial Self-Verification Engine

Every claim the analyst produces must cite exact text from raw tool output. A second Claude instance (the Adversary) receives both the claims and the original tool output, and attacks each claim with specific, typed challenges:

hallucinated_citation — the cited text is not in the raw output → claim suppressed
unsupported_inference — evidence exists but doesn't support the conclusion → confidence penalty
false_positive_malfind — .NET JIT / packed legitimate binary → confidence penalty
alternative_explanation — benign explanation ignored → reinvestigation triggered

Only claims that survive adversarial review reach the final report. Hallucination rate is measured and compared against Protocol SIFT baseline (12%).
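The challenge types above map onto a small amount of bookkeeping. The following is a minimal sketch, not the project's code: the `Claim` fields, the `CONFIDENCE_PENALTY` value, and the sample tool output are all assumptions made for illustration. It treats `hallucinated_citation` as a verbatim substring check; the README does not say whether the project performs that check in code or leaves it to the Adversary model.

```python
from dataclasses import dataclass

# Hypothetical penalty per non-fatal challenge; the README does not state the real value.
CONFIDENCE_PENALTY = 0.25


@dataclass
class Claim:
    """Illustrative stand-in for the project's Claim / EvidenceCitation types."""
    text: str
    cited_text: str            # excerpt the analyst says appears in the tool output
    confidence: float = 1.0
    suppressed: bool = False
    reinvestigate: bool = False


def citation_exists(claim: Claim, raw_output: str) -> bool:
    """True only if the cited excerpt occurs verbatim in the raw tool output."""
    return bool(claim.cited_text) and claim.cited_text in raw_output


def apply_challenge(claim: Claim, challenge_type: str) -> Claim:
    """Apply the outcome the README lists for each challenge type."""
    if challenge_type == "hallucinated_citation":
        claim.suppressed = True
    elif challenge_type in ("unsupported_inference", "false_positive_malfind"):
        claim.confidence = max(0.0, claim.confidence - CONFIDENCE_PENALTY)
    elif challenge_type == "alternative_explanation":
        claim.reinvestigate = True
    else:
        raise ValueError(f"unknown challenge type: {challenge_type}")
    return claim


# Made-up tool output and claims, for illustration only.
raw = "PID 1484 explorer.exe\nPID 1640 reader_sl.exe PPID 1484"
good = Claim("reader_sl.exe is a child of explorer.exe", "reader_sl.exe PPID 1484")
bad = Claim("svchost.exe was injected", "svchost.exe PAGE_EXECUTE_READWRITE")

for claim in (good, bad):
    if not citation_exists(claim, raw):
        apply_challenge(claim, "hallucinated_citation")

# Only unsuppressed claims would be promoted to the report.
survivors = [c for c in (good, bad) if not c.suppressed]
```

Here `bad` cites text that never appears in `raw`, so it is suppressed and only `good` survives.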

2. Algorithmic Timestomp Detection

Pure code, zero LLM. Compares $STANDARD_INFORMATION vs $FILE_NAME timestamps from MFT records. Detects:

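$SI.created < $FN.created — not expected for a file created in place, so a strong indicator of timestamp manipulation (cross-volume moves and archive extraction can also produce it)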
$SI.modified < $FN.created — file modified before it existed
All timestamps with microsecond = 0 — characteristic of timestomping tools
Suspiciously round timestamps — attackers often use exact values

Zero hallucination possible. Every detection is a deterministic computation.
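The four checks can be expressed as plain comparisons. This is a rough sketch under stated assumptions rather than the project's `TimestompDetector`: the `MftTimestamps` record and the flag names are hypothetical, the zero-fraction check is applied to the $SI values only, and "round" is interpreted here as aligned to the exact hour.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MftTimestamps:
    """Hypothetical record holding the timestamps needed from one MFT entry."""
    si_created: datetime
    si_modified: datetime
    fn_created: datetime


def timestomp_flags(ts: MftTimestamps) -> list[str]:
    """Return the names of every anomaly rule that fires for this record."""
    flags = []
    if ts.si_created < ts.fn_created:
        flags.append("si_created_before_fn_created")
    if ts.si_modified < ts.fn_created:
        flags.append("si_modified_before_fn_created")
    # Assumption: only the $SI values are tested for a zero sub-second fraction.
    if ts.si_created.microsecond == 0 and ts.si_modified.microsecond == 0:
        flags.append("zero_subsecond")
    # Assumption: "round" means aligned to the exact hour.
    c = ts.si_created
    if c.minute == 0 and c.second == 0 and c.microsecond == 0:
        flags.append("round_timestamp")
    return flags


# Made-up records for illustration.
stomped = MftTimestamps(
    si_created=datetime(2019, 1, 1, 0, 0, 0),
    si_modified=datetime(2019, 1, 1, 0, 0, 0),
    fn_created=datetime(2024, 3, 5, 14, 22, 31, 418000),
)
clean = MftTimestamps(
    si_created=datetime(2024, 3, 5, 14, 22, 31, 418000),
    si_modified=datetime(2024, 3, 6, 9, 10, 11, 5000),
    fn_created=datetime(2024, 3, 5, 14, 22, 31, 418000),
)
```

With these inputs, `timestomp_flags(stomped)` fires all four rules and `timestomp_flags(clean)` returns an empty list.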

3. Live Evidence Graph

Every artifact is a NetworkX node. Every relationship is a typed edge. Agents run graph queries instead of reading JSON blobs:

"Find processes with no disk executable AND active network connection" — one graph traversal
"Find orphaned network connections with no process" — rootkit indicator
"Find suspicious parent-child chains" — Word spawning PowerShell, etc.
PageRank on suspicious nodes surfaces the highest-priority pivot points
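The first query in the list could look roughly like the following with NetworkX. The node identifiers, the `kind` attribute, and the edge labels `backed_by`, `spawned`, and `owns_connection` are assumptions for this sketch; the README does not show the project's actual graph schema.

```python
import networkx as nx

# Tiny made-up evidence graph; attribute and edge names are hypothetical.
g = nx.DiGraph()
g.add_node("proc:1484", kind="process", name="explorer.exe")
g.add_node("proc:1640", kind="process", name="reader_sl.exe")
g.add_node("file:explorer", kind="file", path=r"C:\Windows\explorer.exe")
g.add_node("conn:1", kind="connection", remote="203.0.113.7:8080")

g.add_edge("proc:1484", "file:explorer", rel="backed_by")
g.add_edge("proc:1484", "proc:1640", rel="spawned")
g.add_edge("proc:1640", "conn:1", rel="owns_connection")


def neighbors_by_rel(graph: nx.DiGraph, node: str, rel: str) -> list[str]:
    """Targets of outgoing edges from `node` whose type is `rel`."""
    return [dst for _, dst, data in graph.out_edges(node, data=True)
            if data.get("rel") == rel]


def diskless_networked_processes(graph: nx.DiGraph) -> list[str]:
    """Processes with no backing file on disk that own a network connection."""
    hits = []
    for node, data in graph.nodes(data=True):
        if data.get("kind") != "process":
            continue
        if neighbors_by_rel(graph, node, "backed_by"):
            continue  # has a disk executable, not interesting for this query
        if neighbors_by_rel(graph, node, "owns_connection"):
            hits.append(node)
    return hits
```

In this toy graph only `proc:1640` matches, because it owns a connection and has no `backed_by` edge.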

Case data ──► [Ingest: all SIFT tools] ──► [Evidence Graph + Timeline Resolver]
                                                        │
                                              [Analyst Agent: cited claims]
                                                        │
                                           [Adversary Agent: attacks claims]
                                                        │
                                          ┌─── disputed? ───┐
                                          │                  │
                                    [Targeted         [Promote: filter
                                    Reinvestigation]   by confidence]
                                          │                  │
                                          └────► analyst ◄───┘
                                                        │
                                           [Correlate: graph queries]
                                                        │
                                              [Finalize: report + metrics]

Architecture

Pattern: Multi-Agent Framework (LangGraph StateGraph)

Security boundaries:

_assert_read_only() called before every SIFT tool → architectural enforcement
ALLOWED_BINARIES set → no arbitrary shell execution
All tools run via subprocess.run() with an argument list and no shell (capture_output=True) → no shell injection
LLM has no run_shell() function → prompt injection cannot cause spoliation
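A wrapper that combines these boundaries might be sketched as follows. This is an illustration only: the contents of `ALLOWED_BINARIES`, the `run_tool` signature, and the permission-bit interpretation of the read-only check are assumptions, since the README does not show how `_assert_read_only()` is implemented.

```python
import os
import stat
import subprocess

# Hypothetical allow-list; the project's real set is not shown in the README.
ALLOWED_BINARIES = frozenset({"vol", "fls", "mmls"})


def assert_read_only(evidence_path: str) -> None:
    """Refuse to proceed if any write permission bit is set on the evidence file."""
    mode = os.stat(evidence_path).st_mode
    if mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH):
        raise PermissionError(f"{evidence_path} is writable; make it read-only first")


def run_tool(binary: str, args: list[str], evidence_path: str, timeout: int = 600) -> str:
    """Run one allow-listed tool against read-only evidence and return its stdout."""
    if binary not in ALLOWED_BINARIES:
        raise PermissionError(f"{binary!r} is not an allowed binary")
    assert_read_only(evidence_path)
    # Passing a list with the default shell=False means no shell ever parses
    # the arguments, so metacharacters in them are treated as literal text.
    result = subprocess.run(
        [binary, *args, evidence_path],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout
```

Both checks run before `subprocess.run`, so a disallowed binary or writable evidence file is rejected without anything being executed.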

Context window management:

Smart truncation: first 60 + last 20 lines per tool (tail often has summary data)
Agents exchange typed Claim objects with EvidenceCitation — not raw text dumps
Evidence graph queries return typed node lists — not JSON blobs
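The head-plus-tail truncation is simple to sketch. The function name and the wording of the omission marker below are assumptions; only the 60/20 split comes from the description above.

```python
def truncate_output(text: str, head: int = 60, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines, marking what was dropped."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short enough, pass through unchanged
    omitted = len(lines) - head - tail
    marker = f"[... {omitted} lines omitted ...]"
    # Slice from an explicit index so tail=0 keeps no trailing lines.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])
```

Keeping an explicit marker line tells the model that output was cut, which makes it less likely to treat the excerpt as complete.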

Self-correction:

Adversary flags claims with reinvestigate=True and specifies which tools to re-run
Targeted reinvestigation re-runs only the disputed tools, not the full pipeline
Hard max_iterations cap prevents runaway loops
Full execution trace logged to JSONL for every iteration
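The loop described above can be outlined in plain Python, independent of LangGraph. Everything here is a sketch under assumptions: `review_loop` and its callable parameters are hypothetical names, and the adversary is modelled as a function returning the set of tool names it wants re-run.

```python
import json
from typing import Callable


def review_loop(
    run_tools: Callable[[list[str]], dict[str, str]],
    analyse: Callable[[dict[str, str]], list[str]],
    attack: Callable[[list[str], dict[str, str]], set[str]],
    all_tools: list[str],
    log_path: str,
    max_iterations: int = 2,
) -> list[str]:
    """Analyse, attack, and re-run only disputed tools until clean or capped."""
    outputs = run_tools(all_tools)          # full ingest happens exactly once
    claims: list[str] = []
    with open(log_path, "a", encoding="utf-8") as log:
        for iteration in range(1, max_iterations + 1):
            claims = analyse(outputs)
            disputed = attack(claims, outputs)
            event = {
                "iteration": iteration,
                "claims": len(claims),
                "disputed_tools": sorted(disputed),
            }
            log.write(json.dumps(event) + "\n")   # one JSON event per line
            if not disputed or iteration == max_iterations:
                break                              # converged, or hard cap reached
            # Targeted reinvestigation: refresh only the disputed tool outputs.
            outputs.update(run_tools(sorted(disputed)))
    return claims
```

The cap is checked before any re-run, so the final iteration never triggers tool executions whose results would go unanalysed.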

Try it out

Prerequisites

SIFT Workstation — download from sans.org/tools/sift-workstation
Protocol SIFT: curl -fsSL https://raw.githubusercontent.com/teamdfir/protocol-sift/main/install.sh | bash
Python 3.11+
ANTHROPIC_API_KEY set in environment

Install

git clone https://github.com/<handle>/find-evil
cd find-evil
pip install -r requirements.txt

Option 1 — CLI (fastest path for judges)

# Disk + memory
python cli.py \
  --case CASE001 \
  --disk   /mnt/evidence/disk.E01 \
  --memory /mnt/evidence/mem.vmem \
  --max-iter 2

# Memory only (works with Volatility public samples)
python cli.py --case DEMO_001 --memory /path/to/cridex.vmem

# Verbose debug output
python cli.py --case CASE001 --disk /mnt/evidence/disk.E01 --verbose

Option 2 — API + React dashboard

# Terminal 1 — API server
python -m server.api

# Terminal 2 — React UI  
cd ui && npm install && npm run dev
# Open http://localhost:5173

Mount evidence read-only (recommended)

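# An E01 file is an Expert Witness Format container and cannot be loop-mounted directly.
# Expose it as a raw image with ewfmount (libewf) first, then mount that read-only.
sudo mkdir -p /mnt/ewf /mnt/evidence
sudo ewfmount /path/to/disk.E01 /mnt/ewf
sudo mount -o ro,loop /mnt/ewf/ewf1 /mnt/evidence
python cli.py --case CASE001 --disk /mnt/evidence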

Using Volatility Foundation public sample (no SIFT needed for memory-only)

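# Download the Cridex public memory sample.
# The wiki URL below is an HTML index page, not the image itself: follow the
# Cridex link listed there and save the extracted image as ./cridex.vmem
# https://github.com/volatilityfoundation/volatility/wiki/Memory-Samples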
python cli.py --case CRIDEX_001 --memory ./cridex.vmem

Output files

output/<case_id>/
  triage_report.json     ← full structured findings with all claims
  graph.json             ← Cytoscape.js evidence graph

logs/<case_id>/
  execution.jsonl        ← full agent trace (one JSON event per line)
  summary.json           ← human-readable run summary

benchmark/
  <case_id>_history.jsonl ← hallucination rate over iterations

Reading execution logs

# All adversarial results with hallucination rates
jq 'select(.event_type == "adversarial_result")' logs/CASE001/execution.jsonl

# All tool calls with timing
jq 'select(.event_type == "tool_call") | {tool, duration_ms, error}' logs/CASE001/execution.jsonl

# Claims that triggered reinvestigation
jq 'select(.event_type == "node_start" and .node == "reinvestigate")' logs/CASE001/execution.jsonl

# Total token usage
jq -s '[.[].tokens // 0] | add' logs/CASE001/execution.jsonl

# Timestomping anomalies
jq '.timestamp_anomalies[]' output/CASE001/triage_report.json
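Where jq is not available, the same JSONL file can be summarised from Python. This is a small sketch that relies only on the `event_type` and `tokens` fields used in the jq commands above; the function name `summarise_log` is an assumption.

```python
import json
from collections import Counter
from typing import Iterable


def summarise_log(lines: Iterable[str]) -> tuple[Counter, int]:
    """Count events per event_type and sum token usage across a JSONL trace."""
    counts: Counter = Counter()
    tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        counts[event.get("event_type", "unknown")] += 1
        tokens += event.get("tokens") or 0   # missing or null counts as zero
    return counts, tokens
```

An open file handle on logs/CASE001/execution.jsonl can be passed directly as `lines`, since iterating a text file yields one line at a time.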

Project structure

find-evil/
├── core/
│   ├── schema.py          ← Claim, EvidenceCitation, AdversarialAttack data model
│   ├── adversarial.py     ← Analyst + Adversary agents, reinvestigation logic
│   ├── timeline.py        ← MFT parser, TimestompDetector, TimelineResolver
│   └── graph.py           ← NetworkX evidence graph, typed queries, PageRank
├── agents/
│   └── orchestrator.py    ← LangGraph StateGraph pipeline
├── tools/
│   └── sift_tools.py      ← SIFT CLI wrappers (read-only enforced)
├── logs/
│   └── execution_logger.py ← JSONL audit trail + SSE streaming
├── benchmark/
│   └── harness.py         ← Hallucination rate measurement, ground truth scoring
├── server/
│   └── api.py             ← FastAPI + SSE streaming
├── ui/src/
│   └── App.jsx            ← React dashboard (graph viz, claim explorer, benchmark)
├── data/ground_truth/
│   └── demo_001.json      ← Cridex ground truth for benchmark scoring
├── docs/
│   ├── accuracy_report.md
│   └── dataset.md
├── cli.py                 ← Rich terminal CLI
└── requirements.txt

Every submission lives on as a community tool.

Updates

Anshuman Bahekar started this project — Jun 15, 2026 11:24 PM EDT
