ClawGuardian: AI Security Through Adversarial Threat Modeling

Inspiration

The inspiration for ClawGuardian emerged from a critical realization: as AI agents become increasingly autonomous and capable of interacting across multiple channels (from WhatsApp and Discord to Slack, Signal, and beyond), the attack surface grows exponentially.

We observed that:

  • Prompt injection attacks (both direct and indirect) represent a fundamental threat to autonomous AI systems
  • Supply chain vulnerabilities in AI skill/plugin ecosystems remain poorly understood
  • Trust boundaries in multi-channel agent deployments lack formal documentation
  • Most AI security frameworks focus on model behavior, not operational deployment threats

The MITRE ATLAS framework provided the ideal foundation: a community-driven knowledge base of adversarial tactics and techniques built specifically for AI/ML systems. We decided to build OpenClaw's threat model on this framework, making security transparent, community-driven, and actionable.


What it does

ClawGuardian is a comprehensive threat modeling and security governance system for OpenClaw that:

| Capability | Description | Impact |
|------------|-------------|--------|
| MITRE ATLAS Mapping | Maps all identified threats to industry-standard ATLAS tactics and techniques | Enables comparison with other AI/ML security initiatives |
| Trust Boundary Analysis | Documents 5 critical trust boundaries across the OpenClaw architecture | Clarifies which components must defend against which threats |
| Risk Matrix | Quantifies likelihood × impact for 25+ distinct threats | Prioritizes mitigation work (P0/Critical -> P2/Medium) |
| Attack Chain Analysis | Models realistic multi-step attack scenarios (e.g., skill injection -> moderation bypass -> credential theft) | Reveals systemic weaknesses and cascading failures |
| Moderation Pattern Registry | Catalogs known-bad patterns and detection limits in ClawHub | Identifies evasion techniques and guides improvements |
| Living Documentation | Community-editable threat model with structured contribution workflows | Scales security expertise across the OpenClaw ecosystem |

Core Components

  • 8 ATLAS Tactics covering reconnaissance through impact
  • 25+ Documented Threats with attack vectors, mitigations, and residual risk
  • 3 Critical Attack Chains showing realistic exploitation paths
  • Supply Chain Analysis of ClawHub security controls
  • Cryptographic & Protocol Guardrails for channel access

AWS Architecture

ClawGuardian's backend is built on AWS using a security-first architecture across five pillars: identity, encryption, compute, edge delivery, and network isolation. The system is designed to minimize attack surface, enforce least privilege, and maintain low operational cost.

Core Services

  • Authentication: Amazon Cognito

    • Managed identity with enforced TOTP MFA
    • No credential handling in application code
    • JWT-based access scoped to IAM roles
  • Key Management: AWS Key Management Service

    • Asymmetric signing for blockchain transactions (keys never leave HSM)
    • Envelope encryption for sensitive data (AES-256 + wrapped DEKs)
    • Strict IAM isolation for signing operations
  • Secrets: AWS Secrets Manager

    • Stores API keys, RPC endpoints, internal tokens
    • Automatic rotation with audit trail via CloudTrail
  • Compute Layer: AWS Fargate + Amazon API Gateway

    • Fargate runs persistent services (RPC subscriptions, API handlers)
    • API Gateway enforces SigV4 auth, WAF rules, and rate limiting
    • Deployed inside private VPC subnets
  • Edge & Frontend: Amazon S3 + Amazon CloudFront

    • SPA hosted on private S3 bucket
    • CloudFront provides global caching with restricted origin access
  • AI Integration: Amazon Bedrock (Claude Haiku)

    • Runs fully inside AWS via VPC endpoints (no internet egress)
    • Used for threat analysis, incident response, and audit summarization
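The envelope-encryption pattern above (AES-256 data keys wrapped by KMS) can be sketched locally. This is a minimal sketch, not the production implementation: `wrapWithKms` is a stand-in for the real AWS KMS call (GenerateDataKey / Encrypt), and the DEK is returned only so the round trip can be exercised.

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

// Stand-in for KMS wrapping (assumption): KMS would return the DEK encrypted
// under a non-exportable customer master key inside the HSM.
function wrapWithKms(dek: Buffer): Buffer {
  return Buffer.from(dek); // NOT real wrapping - illustration only
}

function encryptEnvelope(plaintext: Buffer) {
  const dek = randomBytes(32); // 256-bit data encryption key
  const iv = randomBytes(12);  // GCM nonce
  const cipher = createCipheriv("aes-256-gcm", dek, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store the wrapped DEK alongside the ciphertext. The plaintext `dek` is
  // returned here only so this sketch is testable; real code discards it.
  return { wrappedDek: wrapWithKms(dek), iv, ciphertext, tag, dek };
}

function decryptEnvelope(dek: Buffer, iv: Buffer, ciphertext: Buffer, tag: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", dek, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}
```

Because only the small DEK touches KMS, bulk data encryption happens locally, which is the "reduces KMS calls" optimization noted in the cost profile.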

Network & Isolation

  • All services run inside a private VPC
  • Communication to AWS services via VPC Endpoints (PrivateLink)
  • No public internet exposure for internal services
  • Eliminates NAT gateway requirement and reduces data exfiltration risk

Security Model

  • Zero wildcard IAM policies (strict resource-level permissions)
  • Non-exportable cryptographic keys (KMS-backed signing)
  • Full auditability via CloudTrail and CloudWatch
  • Encrypted data at rest and in transit
  • Separation of concerns:

    • AWS → control plane
    • Blockchain → data integrity layer

Cost Profile

  • Total infrastructure: ~$200/month
  • Key optimizations:

    • VPC endpoints instead of NAT gateway
    • Fargate over EC2 (pay-per-use)
    • Claude Haiku (low-cost model)
    • Envelope encryption (reduces KMS calls)

How we built it

Phase 1: Framework Selection & Architecture

We began by studying MITRE ATLAS, the industry standard for adversarial threats to AI systems. Unlike traditional threat models (STRIDE, OWASP), ATLAS is specifically designed for ML/AI contexts, including:

  • Prompt injection techniques
  • Supply chain compromises in AI ecosystems
  • Model poisoning and data manipulation
  • Evasion and defense circumvention

We then mapped OpenClaw's architecture to 5 trust boundaries:

Untrusted Channels -> Gateway Auth -> Sessions -> Tools -> External Content

Phase 2: Threat Enumeration & Analysis

For each ATLAS tactic, we identified OpenClaw-specific threats:

// Example: T-EXEC-001: Direct Prompt Injection
{
  id: "T-EXEC-001",
  atlasId: "AML.T0051.000",
  description: "Attacker sends crafted prompts to manipulate agent behavior",
  attackVector: "Channel messages containing adversarial instructions",
  mitigations: ["Pattern detection", "External content wrapping"],
  residualRisk: "Critical",
  recommendations: ["Multi-layer defense", "Output validation"]
}
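A typed schema keeps these records consistent across contributors. A minimal sketch (field names taken from the example above; the `Risk` levels and the helper are our assumptions, not part of the shipped threat model):

```typescript
type Risk = "Low" | "Medium" | "High" | "Critical";

// Shape of one documented threat, mirroring the record above.
interface Threat {
  id: string;              // e.g. "T-EXEC-001"
  atlasId: string;         // MITRE ATLAS technique, e.g. "AML.T0051.000"
  description: string;
  attackVector: string;
  mitigations: string[];
  residualRisk: Risk;
  recommendations: string[];
}

// Illustrative helper: threats with Critical residual risk surface as P0 candidates.
function isP0Candidate(t: Threat): boolean {
  return t.residualRisk === "Critical";
}
```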

Each threat was documented with:

  • Attack vector: How does it happen?
  • Affected components: What gets hit?
  • Current mitigations: What's already protecting us?
  • Residual risk: What's still exposed?
  • Recommendations: What should we do?

Phase 3: Risk Quantification & Prioritization

We created a risk matrix balancing likelihood and impact:

| Threat | Likelihood | Impact | Priority |
|--------|------------|--------|----------|
| T-EXEC-001 (Direct Prompt Injection) | High | Critical | P0 |
| T-PERSIST-001 (Malicious Skill) | High | Critical | P0 |
| T-IMPACT-002 (DoS) | High | Medium | P1 |
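The likelihood × impact mapping can be made mechanical. A sketch of one way to derive priority tiers (the ordinal scores and cut-offs here are our illustration, not the published matrix):

```typescript
type Level = "Low" | "Medium" | "High" | "Critical";

// Assumed ordinal scale for both likelihood and impact.
const score: Record<Level, number> = { Low: 1, Medium: 2, High: 3, Critical: 4 };

// Map the combined score to a priority tier; thresholds are illustrative,
// chosen so the matrix rows above land in their stated tiers.
function priority(likelihood: Level, impact: Level): "P0" | "P1" | "P2" {
  const s = score[likelihood] * score[impact];
  if (s >= 12) return "P0"; // e.g. High x Critical
  if (s >= 6) return "P1";  // e.g. High x Medium
  return "P2";
}
```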

Phase 4: Attack Chain Modeling

We traced realistic multi-step attack scenarios:

Chain 1: Skill-Based Data Theft

Publish Malicious Skill
  -> Evade Moderation Patterns
  -> Harvest Credentials from Agent Context
  -> Data Exfiltration
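Chains like this can be represented as ordered threat IDs, which makes it easy to check whether fully mitigating any single step severs the chain. A sketch (the mapping of chain steps to the document's threat IDs is our reading; the helper name is ours):

```typescript
// An attack chain as an ordered list of threat IDs (illustrative mapping
// of Chain 1's steps onto IDs used elsewhere in the threat model).
const skillDataTheft: string[] = [
  "T-PERSIST-001", // publish malicious skill
  "T-EVADE-001",   // evade moderation patterns
  "T-EXFIL-003",   // harvest credentials and exfiltrate
];

// A chain is broken if at least one of its steps is fully mitigated.
function chainBroken(chain: string[], mitigated: Set<string>): boolean {
  return chain.some((step) => mitigated.has(step));
}
```

This is why chain analysis reveals leverage points: one solid mitigation anywhere in the sequence defeats the whole scenario.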

Phase 5: Documentation & Community Process

  • Created THREAT-MODEL-ATLAS.md as the living source of truth
  • Defined CONTRIBUTING-THREAT-MODEL.md for community submissions
  • Mapped all recommendations to implementation PRs

Challenges we ran into

Challenge 1: Prompt Injection - The "Unsolvable" Problem

The Problem: Direct prompt injection (AML.T0051.000) is fundamentally difficult because LLMs are designed to follow instructions in their input.

Our Approach:

  • Admitted that blocking all injections is impossible
  • Documented realistic attack sophistication levels
  • Layered defenses: pattern detection -> content wrapping -> output validation
  • Prioritized user confirmation for sensitive actions

Result: Residual risk = Critical, but with explicit compensating controls
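The layering above (pattern detection -> content wrapping -> output validation) can be sketched as a pipeline where each stage flags or transforms a message. All function names, patterns, and the wrapping format here are our illustration, not OpenClaw's actual implementation:

```typescript
type Verdict = { text: string; flagged: boolean };

// Stage 1: cheap pattern detection on inbound text (pattern illustrative).
function detectPatterns(text: string): boolean {
  return /ignore (all )?previous instructions/i.test(text);
}

// Stage 2: wrap external content so the model treats it as data, not commands.
function wrapExternal(text: string): string {
  return `<external_content>\n${text}\n</external_content>`;
}

// Stage 3: validate model output before acting on it (stub check).
function validateOutput(output: string): boolean {
  return !/rm -rf|curl .*\|.*sh/i.test(output);
}

function inboundPipeline(text: string): Verdict {
  const flagged = detectPatterns(text);
  return { text: wrapExternal(text), flagged };
}
```

No single stage is sufficient; the design goal is that an injection must defeat every layer, and sensitive actions still require user confirmation.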

Challenge 2: Supply Chain Complexity

The Problem: The ClawHub skill ecosystem has minimal barriers (GitHub account age verification + regex patterns), and we couldn't sandbox skills without major refactoring.

Our Approach:

  • Documented T-PERSIST-001 (Malicious Skill) as P0 critical
  • Built moderation pattern registry to track known-bad signatures
  • Planned VirusTotal integration for behavioral code analysis
  • Accepted transitional risk while engineering long-term solutions

Result: Roadmap clear; mitigations incremental but measurable
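The moderation pattern registry can be modeled as a versioned list of regexes plus notes on known evasions. This sketch (entries and field names are illustrative) also shows concretely why regexes alone set a low bar:

```typescript
interface ModerationPattern {
  id: string;
  pattern: RegExp;
  knownEvasions: string[]; // documented bypass techniques for this pattern
}

// Illustrative registry entry; not ClawHub's actual pattern set.
const registry: ModerationPattern[] = [
  {
    id: "MP-001",
    pattern: /child_process|eval\(/,
    // Trivial string splitting defeats naive patterns - hence the plan
    // for behavioral analysis (e.g. VirusTotal) on top of regexes.
    knownEvasions: ["require('child' + '_process')", "Function constructor"],
  },
];

function flagsSource(source: string): boolean {
  return registry.some((p) => p.pattern.test(source));
}
```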

Challenge 3: Trust Boundary Clarity

The Problem: OpenClaw has multiple execution contexts (sandbox vs. host, different channels), making it unclear where responsibility shifts.

Our Approach:

  • Drew explicit 5-boundary diagram showing data flow
  • Assigned mitigations to specific boundary layers
  • Clarified that user judgment (exec approvals) is sometimes the boundary

Result: Architecture more defensible; trade-offs explicit

Challenge 4: MITRE ATLAS Mapping

The Problem: ATLAS taxonomy is broad; OpenClaw doesn't map cleanly to all techniques.

Our Approach:

  • Only mapped threats we actually identified
  • Used ATLAS IDs for external referenceability
  • Hyperlinked to atlas.mitre.org for transparency

Result: Credible, peer-reviewable threat model


Accomplishments that we're proud of

1. Industry-First Threat Model for a Multi-Channel AI Gateway

OpenClaw's threat model is the first to systematically document:

  • Multi-channel authentication attacks (T-ACCESS-002: AllowFrom spoofing)
  • Session isolation under prompt injection (T-DISC-002: Session data extraction)
  • Plugin supply chain compromise (T-PERSIST-001, T-PERSIST-002)
  • Indirect injection via fetched content (T-EXEC-002)

Shipped as: docs/security/THREAT-MODEL-ATLAS.md - peer-reviewable and community-editable

2. Actionable Risk Prioritization

We quantified 25+ threats and grouped them into 3 tiers:

  • P0 (Critical): T-EXEC-001, T-PERSIST-001, T-EXFIL-003 - immediate action
  • P1 (High): T-IMPACT-001, T-EXEC-002, T-ACCESS-003 - 2-4 week window
  • P2 (Medium): T-EVADE-001, T-ACCESS-001, T-PERSIST-002 - backlog

Impact: Prevents security whack-a-mole; focuses engineering on real leverage points

3. Attack Chain Modeling

We identified 3 critical attack chains showing how threats cascade:

Chain 2: Prompt Injection -> RCE

T-EXEC-001 (Direct Injection)
  -> T-EXEC-004 (Bypass Exec Approval)
  -> T-IMPACT-001 (Arbitrary Command Execution)
  -> System Compromise

Value: Reveals that exec approvals are a critical single point of failure

4. Data Flow Transparency

Created detailed data flow diagrams showing what data goes where:

| Flow | Path | Protection |
|------|------|------------|
| F1 | Channel -> Gateway | TLS, AllowFrom |
| F2 | Gateway -> Agent | Session isolation |
| F3 | Agent -> Tools | Policy enforcement |
| F4 | Agent -> External | SSRF blocking |

Result: Security reviewers can trace data and spot gaps

5. Community-Driven Documentation

Process for contributing:

  • Report a new threat following CONTRIBUTING-THREAT-MODEL.md
  • Include ATLAS mapping + attack vector + mitigation
  • PR review + merge to main
  • Auto-linked from docs site

Impact: Scales security expertise beyond core team

6. Credibility Through MITRE ATLAS Alignment

By using ATLAS and cross-referencing atlas.mitre.org for each technique, we:

  • Enable security researchers to validate our findings
  • Link OpenClaw to broader AI security ecosystem
  • Support MITRE ATLAS case study submissions

What we learned

1. AI Security is About Trade-Offs, Not Perfection

Blocking all threats is impossible, which means explicit risk acceptance plus layered defense.

  • We can't block all prompt injections
  • We can't sandbox all skills without breaking extensibility
  • We can document what we can't protect and compensate elsewhere

Lesson: Transparency about residual risk is more credible than false claims of perfection

2. Trust Boundaries Are the Core Unit of Analysis

Rather than asking "Is OpenClaw secure?", ask:

  • "Which boundary is responsible for this threat?"
  • "What's the attack surface between boundaries?"
  • "Where does user judgment have to fill gaps?"

Result: More precise security conversations

3. Multi-Channel Amplifies Attack Surface

Every integrated channel (Discord, Telegram, Signal, Slack, etc.) is a potential initial access vector. But different channels have different spoofing/identity verification properties:

  • Signal: Strong identity (phone number, sealed sender)
  • Telegram: Username only (spoofable)
  • Discord: Username + no central identity (vulnerable to homoglyphs)

Lesson: Channel-specific security models required
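These per-channel identity properties can be encoded so gateway policy can branch on them. A sketch (the strength labels and policy rule are our illustration; the channel assessments come from the list above):

```typescript
type IdentityStrength = "strong" | "weak" | "none";

// Per-channel identity verification properties, per the analysis above.
const channelIdentity: Record<string, IdentityStrength> = {
  signal: "strong",  // phone number + sealed sender
  telegram: "weak",  // username only, spoofable
  discord: "weak",   // no central identity, homoglyph risk
};

// Illustrative policy: require extra confirmation for sensitive actions
// unless the channel provides strong identity; unknown channels get none.
function requiresConfirmation(channel: string): boolean {
  return (channelIdentity[channel] ?? "none") !== "strong";
}
```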

4. Supply Chain Risk Scales Poorly

Every ClawHub skill installation is a potential compromise vector, and the current controls set a low bar:

  • GitHub account age verification: Medium bar
  • Regex patterns: Easy to bypass
  • No sandboxing: Skills run with agent privileges

Lesson: VirusTotal + behavioral analysis + sandboxing all needed; none sufficient alone

5. MITRE ATLAS is the Gold Standard for AI Threats

MITRE ATLAS provided:

  • Peer-reviewed technique taxonomy
  • Cross-industry reference frame
  • Connection to broader AI security community
  • Extensible for OpenClaw-specific threats

Lesson: Building on established frameworks beats inventing proprietary models

6. Living Documentation Beats Yearly Assessments

A threat model that is:

  • Version-controlled
  • Community-editable
  • Linked to implementation PRs
  • Hyperlinked to MITRE ATLAS

...beats a static annual security audit

Lesson: Security is a continuous process, not a checkbox


What's next for ClawGuardian

Phase 2: Implementation & Mitigation

| Priority | Recommendation | Timeline | Impact |
|----------|----------------|----------|--------|
| P0 | VirusTotal Code Insight integration for ClawHub | 2-4 weeks | Catches behavioral malware (not just patterns) |
| P0 | Skill sandboxing / capability-based security | 4-8 weeks | Prevents credential theft (T-EXFIL-003) |
| P0 | Output validation layer for LLM responses | 2 weeks | Mitigates T-EXEC-001, T-IMPACT-003 |
| P1 | Per-sender rate limiting + cost budgets | 1-2 weeks | Prevents T-IMPACT-002 (DoS) |
| P1 | Token encryption at rest | 2-3 weeks | Raises the bar for T-ACCESS-003 |
| P1 | Exec approval bypass testing + hardening | 3-4 weeks | Strengthens T-EXEC-004 defense |
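The P1 per-sender rate limiting could be as simple as a token bucket keyed by sender ID. A minimal sketch (class name, capacity, and refill rate are illustrative, not the planned implementation):

```typescript
// Token bucket per sender: `capacity` bounds bursts, `refillPerSec`
// bounds the sustained message rate (mitigates T-IMPACT-002 DoS).
class SenderRateLimiter {
  private buckets = new Map<string, { tokens: number; last: number }>();

  constructor(private capacity = 5, private refillPerSec = 1) {}

  allow(sender: string, now: number = Date.now() / 1000): boolean {
    const b = this.buckets.get(sender) ?? { tokens: this.capacity, last: now };
    // Refill proportionally to elapsed time, capped at capacity.
    b.tokens = Math.min(this.capacity, b.tokens + (now - b.last) * this.refillPerSec);
    b.last = now;
    if (b.tokens < 1) {
      this.buckets.set(sender, b);
      return false; // out of budget: drop or defer the message
    }
    b.tokens -= 1;
    this.buckets.set(sender, b);
    return true;
  }
}
```

Keying the bucket by sender (rather than by channel) keeps one noisy or malicious sender from exhausting the agent's budget for everyone else on that channel.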

Phase 3: Community & Governance

  • MITRE ATLAS Case Study Submission: Document OpenClaw findings for broader AI security community
  • Security Documentation Hub: Centralized security guides, best practices, and configuration hardening
  • Security Advisory Process: Structured vulnerability disclosure workflow
  • Threat Model Contributors: Recognize security researchers who contribute threats/mitigations

Phase 4: Advanced Mitigations

Short-term (P2, 2-3 months):

  • Channel-specific identity verification (cryptographic proofs for Telegram, etc.)
  • Config integrity verification + audit logging
  • Update signing + version pinning for skills

Medium-term (P3, 3-6 months):

  • Prompt injection detection using ML anomaly detection
  • Sensitive data classification + context redaction
  • Formal verification of exec approval bypasses

Long-term (Research direction):

  • Federated threat intelligence with other AI platforms
  • Automated attack chain simulation
  • Zero-trust agent architecture (fine-grained capability delegation)

Phase 5: Ecosystem Integration

ClawGuardian will become:

  1. Reference Implementation: How to build threat models for AI agents
  2. Community Standard: CONTRIBUTING-THREAT-MODEL.md for third-party plugins
  3. Continuous Validation: Automated security regression testing in CI
  4. Open Research: Published white papers on multi-channel AI security

Key Metrics

By end of Q2 2026:

  • All P0 mitigations (VirusTotal, sandboxing, output validation) shipped
  • 100 percent of attack chains have defined mitigations
  • Community threat contributions from 3+ external security researchers
  • Zero high-risk residuals accepted without explicit compensating controls
  • MITRE ATLAS case study published

Resources

  • Living Threat Model: docs/security/THREAT-MODEL-ATLAS.md
  • Contributing Guide: docs/security/CONTRIBUTING-THREAT-MODEL.md
  • MITRE ATLAS: atlas.mitre.org
  • Security Report: security@openclaw.ai

ClawGuardian Philosophy

"Security is not a destination; it's a journey of transparent risk management."
