wolfIR

Security Overview
Remediation Engine
Risk Analysis
Lyra AI
Threat Intel
Documentation
Incident Timeline
What if?
Blast Radius Simulator
Export Report
AI Security Posture
IR Protocol (NIST)
Compliance Mapping
AI Security Graph
Architecture & STRIDE
Autonomous Agent
Attack Path
ChangeSet Risk
Incident History
AI Compliance
AWS Organizations
Cost Impact
Simulation
Architecture Overview
5 Agent IR Pipeline

Inspiration

Every security platform today runs on AI. GuardDuty uses ML for behavioral anomaly detection. Security Hub correlates findings across Inspector, Macie, and dozens of partner tools. CloudTrail captures every API call. IAM Access Analyzer flags overly permissive policies. Everyone's using AI to detect threats. The signal infrastructure AWS has built is genuinely exceptional.

But I kept asking myself one question while building wolfir: what happens when someone attacks the AI?

If an attacker can inject instructions into the data my models process — abuse my Bedrock inference API, trigger runaway invocations, extract sensitive account patterns through model outputs — my security tool becomes the attack surface. MITRE built the ATLAS framework specifically to catalogue adversarial ML threats. Almost nobody deploys it in production. wolfir became the exception.

That's the problem I built wolfir to solve. Not just cloud security. AI security too — a platform that secures your cloud, and secures itself.

960 alerts a day. 40% go uninvestigated.

That number — from Prophet Security's 2025 AI in SOC Survey — captures the second motivation. So why does the journey from "something happened" to "we fixed it and documented it" still take 45 minutes of manual work at 2am?

The problem isn't detection. It's the time cost of turning detection into action. Correlating events, building a timeline, figuring out root cause, mapping the blast radius, writing a remediation plan with real AWS CLI commands, and documenting everything for the team — that part is still almost entirely manual. An analyst who should be making decisions is instead building a spreadsheet.

I explore both problems in depth on the blog: Why I Built wolfir and Who Watches the AI?.

The name is deliberate. wolf + IR (Incident Response). A wolf hunts in a pack — coordinated, patient, each member with a role, sharing context, moving from signal to resolution together. That's exactly how wolfir works: seven Amazon Nova capabilities, each specialized, sharing structured state, going from raw CloudTrail events to a complete incident response package.

What It Does

wolfir is an autonomous cloud security and AI security platform for AWS, built on two architecturally connected pillars. Try the live demo → — explore the full platform with built-in demo scenarios, then connect your own AWS account for real analysis against your live environment.

Pillar 1 — Cloud Security: End-to-End Incident Response

You feed it CloudTrail events — from a live AWS account or one of three built-in attack scenarios — and a five-agent Nova pipeline fires automatically:

Detect → Investigate → Classify → Remediate → Document

The full pipeline architecture is explained in Seven Nova Capabilities, One Pipeline: A Deep Dive.

The output is a complete incident response package, not a report. Here is everything it produces:

Attack Analysis

Chronological attack timeline with attack chain reconstruction and root cause identification
Interactive Attack Path Diagram — a React Flow graph where every event is a clickable node showing its risk score, MITRE ATT&CK technique, source IP, timestamp, and the exact IAM control that would have prevented it
Per-event risk scores with confidence intervals (e.g. 77/100, CI: 70–84) using three parallel Nova Micro calls via asyncio.gather()
MITRE ATT&CK technique mapping per finding

Blast Radius Simulator

Given the compromised identity, runs IAM policy simulation to map every AWS resource an attacker can reach
Results tiered as CRITICAL / HIGH / MEDIUM / LOW with estimated financial impact per resource
Turns "we had an IAM incident" into "here's exactly what was at risk and what to lock down first"
Covered in detail in When AI Should (and Shouldn't) Click the Button

AWS Organizations Dashboard

Full org tree — Management Account → OUs → Member Accounts — with real-time threat level indicators
Cross-account lateral movement detection: identifies when compromise in one account creates vectors into others
SCP gap analysis: which Service Control Policies are missing and what they would have blocked

Remediation — Actions, Not Suggestions

Real, executable AWS CLI commands generated for every finding
Three-tier human-in-the-loop approval system:
- AUTO — low-risk actions execute immediately (tagging, monitoring changes)
- APPROVAL — policy modifications, security group changes, key operations require explicit sign-off
- MANUAL — destructive actions generate CLI commands for analyst execution
Before/after state snapshots and one-click rollback for every executed step
Every action is CloudTrail-logged and stored as execution proof with timestamp

Compliance Mapping — 6 Frameworks, Automatic

Every finding auto-mapped to specific control IDs across:
- CIS AWS Foundations Benchmark
- NIST 800-53 rev5 (AC, AU, IA, SC controls)
- SOC 2 Type II (CC6, CC7, A1)
- PCI-DSS v4.0 (Requirements 7, 10)
- SOX IT General Controls (ITGC-04, ITGC-06)
- HIPAA (§164.312, §164.308)
No manual auditing — the compliance report generates automatically with every incident

Cost Impact Analysis

Financial exposure estimate using the IBM Cost of Data Breach 2024 methodology
Covers direct compute cost, data exposure per PII record, incident response labor, and regulatory fine exposure (GDPR, CCPA, HIPAA)
Low/mid/high scenario ranges with confidence intervals — executive-ready financial framing alongside the technical timeline

SLA Tracker

Real-time P1/P2 SLA compliance monitoring across every incident
P1 (CRITICAL): 15-minute detection-to-acknowledgement, 1-hour to remediation
P2 (HIGH): 1-hour acknowledgement, 4-hour remediation
Breach prediction for in-flight incidents based on pipeline stage and historical resolution time
All SLA events logged to DynamoDB for automatic compliance reporting

ChangeSet Analysis

Pre-deployment security review of CloudFormation change sets
Evaluates IAM policy changes, security group modifications, S3 configuration changes, encryption changes, and network topology changes
Each finding rated by risk tier and mapped to compliance controls it would violate if deployed
15-second review instead of a manual audit that takes hours

Documentation

JIRA tickets, Slack alerts, and Confluence postmortems generated automatically
Nova Canvas generates incident-specific PDF report cover art — a crypto mining incident looks different from a data exfiltration incident
Nova Act generates executable browser automation steps for AWS Console navigation and JIRA ticket creation

Cross-Incident Memory

Every incident persisted to DynamoDB with behavioral fingerprints
Four-signal correlation on every new incident:
- SHA-256 fingerprint matching (attack type + sorted MITRE technique list)
- MITRE technique overlap (≥2 shared techniques triggers a match)
- IOC matching — shared IP addresses and IAM ARNs across incidents
- Cosine semantic similarity via Nova Multimodal Embeddings (384 dimensions)
Run scenario 1, then scenario 2 — wolfir surfaces "78% probability this is the same attacker"

Agentic Query

A genuine Strands Agents SDK autonomous agent — not a fixed pipeline
19 registered Strands tools across 6 AWS service modules: CloudTrail, IAM, CloudWatch, Security Hub, Nova Canvas, AI Security
Ask it anything. It plans its own investigation, picks its own tools, executes multi-step workflows
Extended thinking enabled for complex multi-tool queries
Backed by Bedrock Knowledge Base (S3 Vectors) for RAG-powered playbook retrieval

Voice Interface — Aria

Nova 2 Sonic speech-to-speech streaming
Ask "what happened in the last 24 hours?" and get spoken incident analysis
Hands-free SOC investigation when you have four browser tabs open

Visual Architecture Analysis

Upload an architecture diagram (PNG/JPG) and Nova Pro performs STRIDE threat assessment
Reads actual VPC topology, identifies security groups, load balancers, databases, API gateways
50+ check types in under 30 seconds — text models cannot do this; multimodal is non-negotiable

Three Built-In Demo Scenarios — No AWS Account Required

IAM Privilege Escalation — Contractor misuses an AssumeRole chain to gain AdministratorAccess. MITRE T1098, T1078. 9 events. CRITICAL.
AWS Organizations Cross-Account Breach — Compromised Dev account role pivots via STS into Production and Security accounts across 3 OUs, 12 member accounts. 18 events. CRITICAL.
Shadow AI / Unauthorized LLM Use — Ungoverned Bedrock InvokeModel calls combined with a prompt injection attempt. Exercises the MITRE ATLAS self-monitoring pipeline — the AI security pillar catching what the cloud pillar cannot. 7 events. CRITICAL.

See Demo Mode That Actually Works for how I built the offline-first architecture that powers all three scenarios.

Pillar 2 — AI Security: wolfir Watches Itself

Every Bedrock invocation powering the analysis above is simultaneously monitored against the MITRE ATLAS framework. This is the part nobody else is building.

6 MITRE ATLAS Techniques — Runtime Checks, Not Labels

AML.T0051 — Prompt Injection: Pattern scanning on every user input AND CloudTrail event data fields before reaching the model. 12 injection signatures including role-override patterns, data exfiltration probes, and instruction injection via resource names.
AML.T0016 — Capability Theft: Every Bedrock invocation is recorded with the model ID. If any non-approved model outside the Nova allowlist is invoked, status flips to WARNING and names the offending model.
AML.T0040 — ML Inference API Access: Invocation rate monitoring with baseline comparison. Alert threshold: >3× baseline. Expected pipeline spikes annotated as PIPELINE_RUN — unexpected spikes flag immediately.
AML.T0043 — Adversarial Data: Input validation on CloudTrail event structure integrity — manipulated timestamps, anomalous field values, structurally malformed events flagged before reaching the temporal agent.
AML.T0024 — Data Exfiltration via Inference: Output scanning on every model response for AWS account IDs, access key patterns (AKIA/ASIA prefixes), private IP ranges, and secrets patterns. Flagged before reaching the frontend.
AML.T0048 — Model Poisoning: Explicitly N/A. wolfir uses Bedrock foundation models without fine-tuning. Documented honestly — honest non-applicability is more credible than a fake detection.

Beyond MITRE ATLAS

OWASP LLM Top 10 posture dashboard (LLM01–LLM10) with live radar chart — posture score 87/100
Shadow AI detection — CloudTrail InvokeModel monitoring for ungoverned Bedrock calls from non-approved principals
NIST AI RMF alignment — GOVERN, MAP, MEASURE, MANAGE function coverage
EU AI Act readiness assessment
AI-BOM — Bedrock model inventory for the account
Bedrock Guardrails coverage audit — which guardrails are active vs. not configured

When you run an incident analysis, the AI Security dashboard updates based on the actual Bedrock invocations that just ran. Cloud security and AI security share a live data plane. The two pillars are connected, not bolted together.

Read the full AI security architecture in Who Watches the AI? Why wolfir Monitors Its Own Intelligence Pipeline.

7 Amazon Nova Capabilities — Each Doing Non-Fungible Work

The full model selection reasoning is in Why I Chose Each Amazon Nova Model.

Nova Pro → Visual STRIDE threat modeling on uploaded architecture diagrams. The only multimodal Nova — text models cannot read a VPC diagram.
Nova 2 Lite → Timeline reconstruction, remediation plan generation, documentation, Agentic Query orchestration with extended thinking, Aria voice responses. The reasoning workhorse.
Nova Micro → Per-event risk classification at temperature=0.1 for determinism. 3 parallel calls via asyncio.gather() = confidence interval output. Speed over capability for classification.
Nova 2 Sonic → Aria voice assistant with WebSocket streaming. Speech-to-speech, not TTS bolted onto an LLM.
Nova Canvas → Incident-specific PDF report cover image generation. Each incident type gets a unique visual — not decoration, a signal this is a real security deliverable.
Nova Act → Executable browser automation plans for AWS Console navigation and JIRA ticket creation. Generates actions, not just instructions.
Nova Multimodal Embeddings → 384-dimensional behavioral vectors stored in DynamoDB for cross-incident semantic similarity. Finds campaigns described differently each time.

How I Built It

Backend: Python + FastAPI + Strands Agents SDK + boto3 + uvicorn

Frontend: React + TypeScript + Vite + Tailwind CSS + Framer Motion + React Flow, deployed on Vercel

Infrastructure: Terraform (S3 + Bedrock Knowledge Base + playbook upload), Docker + docker-compose (single docker compose up for the full stack)

Memory: DynamoDB (PAY_PER_REQUEST, auto-table-creation on first use, in-memory fallback when unavailable)

Knowledge: Bedrock Knowledge Base (S3 Vectors) + AWS Knowledge MCP — dual source with fallback chain for RAG-powered playbook retrieval

Safety: Bedrock Guardrails (content filters, prompt attack blocking, PII masking) + MITRE ATLAS runtime monitoring

The key architectural decision: treat every pipeline step as a specialized agent, not a general-purpose model.

The first version was a single Nova 2 Lite call that received CloudTrail events and was asked to produce a timeline, risk scores, remediation steps, and documentation simultaneously. It failed fast. Not because the model lacked capability — because a model trying to do four genuinely different cognitive tasks at once produces mediocre output on all four.

Context pruning at every handoff was the engineering breakthrough. The naive approach — passing full agent output to the next agent — collapses at any realistic incident size. A 50-event CloudTrail incident produces ~40K tokens by the time you reach the remediation agent. The fix:

# Each agent receives only what it needs — nothing more
timeline_handoff = {
    "attack_pattern": result["identified_pattern"],
    "root_cause":     result["root_cause"],
    "affected_resources": result["affected_resources"][:10],
    "risk_signals":   [e for e in events if e.get("flagged")],
    "confidence":     result["confidence_score"],
}
# ~800 tokens to next agent, not 12K

Token consumption dropped from ~40K to ~12K per pipeline run. Hallucinations from context bloat disappeared.

The MCP architecture creates explicit typed contracts. Six FastMCP server modules expose 19 Strands tools to the agent layer. The alternative — direct boto3 calls from within agent prompts — creates implicit coupling. When AWS API behavior changes, agent behavior changes unpredictably. MCP servers create a clean seam: mock the server in tests, never touch agent logic.

The persistent async worker solved the Strands SDK event loop problem. Strands @tool functions are synchronous. The agent implementations use async Bedrock calls. The naive fix (new event loop per tool call) added 200–400ms of overhead per call. The solution:

_WORKER_LOOP = asyncio.new_event_loop()
_WORKER_THREAD = threading.Thread(target=_WORKER_LOOP.run_forever, daemon=True)
_WORKER_THREAD.start()

def _run_async(coro):
    future = asyncio.run_coroutine_threadsafe(coro, _WORKER_LOOP)
    return future.result(timeout=180)
# Overhead: 200–400ms → under 5ms per call

Credentials never leave the user's machine. wolfir uses local AWS CLI profiles or mounted ~/.aws volumes. Nothing is stored or transmitted. In a security product, this is not optional.

Full technical deep-dives:

Challenges We Ran Into

Full engineering post-mortem: The Bugs That Taught Me Everything

Token bloat collapsed the first architecture. Passing full agent outputs sequentially hit context limits before the remediation agent started reasoning. The pruning layer fixed this: 60% reduction in token consumption, hallucinations from context bloat eliminated.

Strands SDK isn't designed for deterministic multi-step handoffs. It excels at autonomous single-agent reasoning. For a security pipeline requiring auditable, deterministic execution order and inspectable intermediate state, you have to architect around it. Running specialized models as @tool functions inside a Strands orchestrator gave me the best of both worlds.

The persistent async worker bridge. Strands tool functions are synchronous. FastAPI owns an event loop. Creating a new event loop per tool call works but adds hundreds of milliseconds per call. The single persistent worker thread with asyncio.run_coroutine_threadsafe() reduced per-call overhead from 200–400ms to under 5ms.

Hallucinated threat narratives from routine events. Feed 50 routine CloudTrail events — PutLogEvents, DescribeInstances, service-to-service AssumeRole — and ask "what's the attack pattern?" The model invents one. Confidently. With specific MITRE techniques. The fix was filter_interesting_events(): 60+ routine event names stripped before the temporal agent sees anything. When all events after filtering are routine, the pipeline returns "no threat detected" — worth more than a convincing-sounding false incident.

Risk score miscalibration at temperature=0.1. Nova Micro consistently over-called certain events — GetCallerIdentity scored HIGH (it's also a routine SDK health check). Hard calibration overrides applied after model scoring, combined with three parallel calls and confidence intervals, fixed this.

The model outputs fake access key IDs in remediation steps. When Nova 2 Lite generates a remediation plan involving a compromised access key, it outputs placeholders like AKIAEXAMPLE. The fix: runtime discovery in every executor function — if the model-generated key doesn't match the real format, look it up via iam.list_access_keys() before executing.

Session revocation requires a non-obvious IAM pattern. "Revoke all active sessions for a role" sounds straightforward. AWS doesn't have an API for this. The only correct approach is a DateLessThan: aws:TokenIssueTime deny policy condition — any token issued before a specific timestamp is denied all actions.

Bedrock Guardrails blocked legitimate security queries. Asking "which IAM role was used in the attack?" trips content filters because the word "attack" matches a sensitive phrase filter. Security tooling is inherently full of words content filters are trained to block. The fix: tuning the guardrail to allow security operations terminology, plus graceful degradation with suggested rephrase rather than a 500 error.

CloudTrail event field names are inconsistent. The same event time might be eventTime, EventTime, or buried inside a serialized CloudTrailEvent JSON string. Every parsing function needed defensive extraction with multiple fallback paths. This cost days of debugging.

DynamoDB table auto-creation race condition. Under concurrent requests, both find the table missing and both try to create it. The second throws ResourceInUseException. Fix: catch it and continue — the table was created by the first request.

Demo mode had to mirror real execution exactly. Every ExecutionResult object, every risk score shape, every timeline format must be identical whether the pipeline ran live or returned pre-computed outputs. The discipline enforced clean, typed API contracts between every layer.

Accomplishments I'm Proud Of

The two-pillar architecture is genuinely new. Cloud security platforms exist. AI security tools are emerging. A platform where one pillar monitors the other in real time — where running an incident analysis updates the AI security posture dashboard — is a new category. I haven't seen another project that monitors its own Bedrock pipeline against MITRE ATLAS in production.

Seven Nova capabilities, each doing work that can't be substituted. Nova Pro because you literally cannot analyze an uploaded architecture diagram for STRIDE threats without multimodal capability. Nova Micro at temperature=0.1 for determinism — more reliable for classification than Nova 2 Lite at default temperature. Nova Embeddings for behavioral similarity — finds campaigns described differently each time. See Why I Chose Each Amazon Nova Model.

The Blast Radius Simulator changes the incident response question. Not "what happened?" but "what could have happened, and what do we lock down right now?" IAM policy simulation as a first-class incident response tool gives security teams the prioritized list they actually need.

The cross-incident memory correlation. Running two demo scenarios and watching wolfir surface "78% probability this is the same attacker" — with overlapping IOCs, MITRE technique overlap, and semantic similarity from Nova Embeddings — that's the moment that made the engineering feel worth it.

Remediations actually execute. Before/after state snapshots, CloudTrail confirmation of every action, one-click rollback. This is not a report generator. It takes action with accountability. Full reasoning in When AI Should (and Shouldn't) Click the Button.

Infrastructure as Code from day one. terraform apply creates the full AWS environment. terraform destroy removes everything. docker compose up runs the full stack in under 2 minutes. These aren't aspirational — they're the baseline.

Credentials never leave the user's machine. In a security product, this isn't a nice-to-have. wolfir uses local AWS CLI profiles and mounted credentials volumes. Nothing is stored or transmitted.

What I Learned

Multi-agent systems are less about AI capability than about the seams between agents. The hard problems were: what format does the handoff use, what gets pruned, what gets carried forward, and how does the orchestrator know when a step has produced usable output. Get those seams wrong and no amount of model capability compensates.

Model selection matters more than model capability. The breakthrough wasn't using a more powerful model — it was matching each task to the right model. Nova Micro at temperature=0.1 for classification. Nova 2 Lite for long-context reasoning. Nova Pro for vision. Using the wrong model for a task creates quality problems that better prompts can't fix.

Hallucination prevention is a design pattern, not a prompt trick. Event filtering before the temporal agent, calibration overrides for the risk scorer, confidence intervals instead of single scores, "no threat detected" as a valid output, context pruning at every handoff — these are code patterns. Prompts alone couldn't achieve this.

Demo engineering is product engineering. Making the full five-agent pipeline work offline, client-side, with pre-computed real Nova outputs forced me to define clean API contracts, unified data shapes, and graceful degradation across every layer. The discipline of keeping demo mode and real mode on the same API contract made the real backend better. Full explanation: Demo Mode That Actually Works.

Infrastructure decisions compound. Terraform from day one meant the Knowledge Base setup was three commands. Docker from day one meant the demo setup was one command. These aren't things you add later cleanly.

The most important realization: securing AI is the next frontier in cloud security. SIEM and SOAR tools are converging on AI for detection and triage. The attack surface shifts to those AI systems. MITRE ATLAS exists. Most teams don't deploy it. I built wolfir as proof of concept for a monitoring layer that every AI-powered security platform will eventually need.

What's Next for wolfir

Real-time streaming via EventBridge — EventBridge rules on CloudTrail write events → SQS FIFO with priority tiers → wolfir workers on ECS Fargate. Shifts wolfir from batch analysis to live alert stream.
Multi-account correlation across AWS Organizations — cross-incident memory system extended to detect lateral movement campaigns across OUs, not just within a single account.
AI Red Teaming module — automated prompt injection testing against the user's own Bedrock pipelines. If you deploy Nova in production, you should verify that your guardrails actually hold against real adversarial inputs.
Deeper JIRA / Slack / PagerDuty integration via Nova Act — full browser automation for ticket creation, incident channel posting, and postmortem generation in one flow.
Community playbook library — the Bedrock Knowledge Base integration already supports custom playbooks in S3. The vision: community-contributed SOPs for common AWS incident types — curated, versioned, searchable via RAG.
Enterprise deployment template — CloudFormation template for VPC-isolated backend, private Bedrock endpoints, WAF in front of the FastAPI layer, and multi-region DynamoDB Global Tables for cross-region incident correlation.
SOC integration — direct Slack, PagerDuty, and JIRA webhook output with on-call escalation logic built into the remediation approval flow.

Live Demo: wolfir.vercel.app

Source Code: github.com/bhavikam28/wolfir

Blog:

Built With

act
agents
amazon
amazon-web-services
atlas
bedrock
boto3
canvas
cloudtrail
cloudwatch
css
dynamodb
embeddings
fastapi
fastmcp
framer
hub
iam
lite
llm
micro
mitre
motion
multimodal
nova
owasp
pro
python
react
s3
sdk
security
sonic
strands
tailwind
typescript
vite