ClawGuardian: AI Security Through Adversarial Threat Modeling
Inspiration
The inspiration for ClawGuardian emerged from a critical realization: as AI agents become increasingly autonomous and capable of interacting across multiple channels (from WhatsApp and Discord to Slack, Signal, and beyond), the attack surface grows exponentially.
We observed that:
- Prompt injection attacks (both direct and indirect) represent a fundamental threat to autonomous AI systems
- Supply chain vulnerabilities in AI skill/plugin ecosystems remain poorly understood
- Trust boundaries in multi-channel agent deployments lack formal documentation
- Most AI security frameworks focus on model behavior, not operational deployment threats
The MITRE ATLAS framework provided the ideal foundation: a collaborative, adversarial threat landscape specifically designed for AI/ML systems. We decided to build OpenClaw's threat model using this framework, making security transparent, community-driven, and actionable.
What it does
ClawGuardian is a comprehensive threat modeling and security governance system for OpenClaw that:
| Capability | Description | Impact |
|---|---|---|
| MITRE ATLAS Mapping | Maps all identified threats to industry-standard ATLAS tactics and techniques | Enables comparison with other AI/ML security initiatives |
| Trust Boundary Analysis | Documents 5 critical trust boundaries across the OpenClaw architecture | Clarifies which components must defend against which threats |
| Risk Matrix | Quantifies likelihood × impact for 25+ distinct threats | Prioritizes mitigation work (P0/Critical -> P2/Medium) |
| Attack Chain Analysis | Models realistic multi-step attack scenarios (e.g., skill injection -> moderation bypass -> credential theft) | Reveals systemic weaknesses and cascading failures |
| Moderation Pattern Registry | Catalogs known-bad patterns and detection limits in ClawHub | Identifies evasion techniques and guides improvements |
| Living Documentation | Community-editable threat model with structured contribution workflows | Scales security expertise across the OpenClaw ecosystem |
Core Components
- 8 ATLAS Tactics covering reconnaissance through impact
- 25+ Documented Threats with attack vectors, mitigations, and residual risk
- 3 Critical Attack Chains showing realistic exploitation paths
- Supply Chain Analysis of ClawHub security controls
- Cryptographic & Protocol Guardrails for channel access
AWS Architecture
ClawGuardian's backend is built on AWS using a security-first architecture across five pillars: identity, encryption, compute, edge delivery, and network isolation. The system is designed to minimize attack surface, enforce least privilege, and maintain low operational cost.
Core Services
Authentication: Amazon Cognito
- Managed identity with enforced TOTP MFA
- No credential handling in application code
- JWT-based access scoped to IAM roles
Key Management: AWS Key Management Service
- Asymmetric signing for blockchain transactions (keys never leave HSM)
- Envelope encryption for sensitive data (AES-256 + wrapped DEKs)
- Strict IAM isolation for signing operations
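The envelope pattern can be sketched as follows: a fresh data key (DEK) encrypts the payload, and only the small DEK is wrapped by the key encryption key (KEK). Here a local buffer stands in for the KMS-held KEK; in production the wrap/unwrap steps would be KMS Encrypt/Decrypt calls, so the KEK never leaves the HSM and KMS is invoked once per object rather than once per byte. This is a minimal illustration, not our production code.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface Envelope {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
  wrappedDek: Buffer; // DEK ciphertext + auth tag, recoverable only via the KEK
}

function seal(plaintext: string, kek: Buffer): Envelope {
  const dek = randomBytes(32); // fresh data key per object
  const iv = randomBytes(12);
  const c = createCipheriv("aes-256-gcm", dek, iv);
  const data = Buffer.concat([c.update(plaintext, "utf8"), c.final()]);
  const tag = c.getAuthTag();
  // Wrap the DEK under the KEK (KMS Encrypt in production). A zero IV is
  // acceptable here only because each DEK is wrapped exactly once.
  const w = createCipheriv("aes-256-gcm", kek, Buffer.alloc(12));
  const wrappedDek = Buffer.concat([w.update(dek), w.final(), w.getAuthTag()]);
  return { iv, tag, data, wrappedDek };
}

function open(env: Envelope, kek: Buffer): string {
  // Unwrap the DEK (KMS Decrypt in production), then decrypt the payload.
  const u = createDecipheriv("aes-256-gcm", kek, Buffer.alloc(12));
  u.setAuthTag(env.wrappedDek.subarray(env.wrappedDek.length - 16));
  const dek = Buffer.concat([
    u.update(env.wrappedDek.subarray(0, env.wrappedDek.length - 16)),
    u.final(),
  ]);
  const d = createDecipheriv("aes-256-gcm", dek, env.iv);
  d.setAuthTag(env.tag);
  return Buffer.concat([d.update(env.data), d.final()]).toString("utf8");
}
```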
Secrets: AWS Secrets Manager
- Stores API keys, RPC endpoints, internal tokens
- Automatic rotation with audit trail via CloudTrail
Compute Layer: AWS Fargate + Amazon API Gateway
- Fargate runs persistent services (RPC subscriptions, API handlers)
- API Gateway enforces SigV4 auth, WAF rules, and rate limiting
- Deployed inside private VPC subnets
Edge & Frontend: Amazon S3 + Amazon CloudFront
- SPA hosted on private S3 bucket
- CloudFront provides global caching with restricted origin access
AI Integration: Amazon Bedrock (Claude Haiku)
- Runs fully inside AWS via VPC endpoints (no internet egress)
- Used for threat analysis, incident response, and audit summarization
Network & Isolation
- All services run inside a private VPC
- Communication to AWS services via VPC Endpoints (PrivateLink)
- No public internet exposure for internal services
- Eliminates NAT gateway requirement and reduces data exfiltration risk
Security Model
- Zero wildcard IAM policies (strict resource-level permissions)
- Non-exportable cryptographic keys (KMS-backed signing)
- Full auditability via CloudTrail and CloudWatch
- Encrypted data at rest and in transit
Separation of concerns:
- AWS → control plane
- Blockchain → data integrity layer
Cost Profile
- Total infrastructure: ~$200/month
Key optimizations:
- VPC endpoints instead of NAT gateway
- Fargate over EC2 (pay-per-use)
- Claude Haiku (low-cost model)
- Envelope encryption (reduces KMS calls)
How we built it
Phase 1: Framework Selection & Architecture
We began by studying MITRE ATLAS, the industry-standard framework for adversarial threats to AI systems. Unlike traditional frameworks (STRIDE, the OWASP Top 10), ATLAS is specifically designed for ML/AI contexts, including:
- Prompt injection techniques
- Supply chain compromises in AI ecosystems
- Model poisoning and data manipulation
- Evasion and defense circumvention
We then mapped OpenClaw's architecture to 5 trust boundaries:
Untrusted Channels -> Gateway Auth -> Sessions -> Tools -> External Content
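The boundaries above can be captured as data, so each documented threat can be tagged with the boundary responsible for defending against it. A minimal sketch; the two example assignments are illustrative, not the model's full mapping:

```typescript
// The five trust boundaries, ordered as in the diagram above.
const TRUST_BOUNDARIES = [
  "UntrustedChannels",
  "GatewayAuth",
  "Sessions",
  "Tools",
  "ExternalContent",
] as const;

type TrustBoundary = (typeof TRUST_BOUNDARIES)[number];

// Tag each threat with the boundary that owns its defense (examples only).
const boundaryOwner: Record<string, TrustBoundary> = {
  "T-EXEC-001": "UntrustedChannels", // crafted messages enter here
  "T-ACCESS-002": "GatewayAuth",     // AllowFrom spoofing is an auth problem
};

function ownerOf(threatId: string): TrustBoundary | undefined {
  return boundaryOwner[threatId];
}
```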
Phase 2: Threat Enumeration & Analysis
For each ATLAS tactic, we identified OpenClaw-specific threats:
// Example: T-EXEC-001 — Direct Prompt Injection
const threat = {
  id: "T-EXEC-001",
  atlasId: "AML.T0051.000",
  description: "Attacker sends crafted prompts to manipulate agent behavior",
  attackVector: "Channel messages containing adversarial instructions",
  mitigations: ["Pattern detection", "External content wrapping"],
  residualRisk: "Critical",
  recommendations: ["Multi-layer defense", "Output validation"],
};
Each threat was documented with:
- Attack vector: How does it happen?
- Affected components: What gets hit?
- Current mitigations: What's already protecting us?
- Residual risk: What's still exposed?
- Recommendations: What should we do?
Phase 3: Risk Quantification & Prioritization
We created a risk matrix balancing likelihood and impact:
| Threat | Likelihood | Impact | Priority |
|---|---|---|---|
| T-EXEC-001 (Direct Prompt Injection) | High | Critical | P0 |
| T-PERSIST-001 (Malicious Skill) | High | Critical | P0 |
| T-IMPACT-002 (DoS) | High | Medium | P1 |
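The tiering above can be sketched as a simple likelihood × impact lookup. The numeric scale and cutoffs below are illustrative assumptions of ours, not values taken from the threat model itself:

```typescript
// Illustrative likelihood x impact scoring (scale and thresholds assumed).
type Level = "Low" | "Medium" | "High" | "Critical";

const SCORE: Record<Level, number> = { Low: 1, Medium: 2, High: 3, Critical: 4 };

function priority(likelihood: Level, impact: Level): "P0" | "P1" | "P2" {
  const risk = SCORE[likelihood] * SCORE[impact];
  if (risk >= 12) return "P0"; // High x Critical and above
  if (risk >= 6) return "P1";  // e.g. High x Medium
  return "P2";
}
```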
Phase 4: Attack Chain Modeling
We traced realistic multi-step attack scenarios:
Chain 1: Skill-Based Data Theft
Publish Malicious Skill
-> Evade Moderation Patterns
-> Harvest Credentials from Agent Context
-> Data Exfiltration
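A chain like this can be represented as an ordered list of steps, so tooling can check whether at least one link has a reliable mitigation (breaking any link breaks the chain). The threat IDs follow the model; the `mitigated` flags are illustrative:

```typescript
interface ChainStep {
  threatId: string;
  description: string;
  mitigated: boolean;
}

const skillBasedDataTheft: ChainStep[] = [
  { threatId: "T-PERSIST-001", description: "Publish malicious skill", mitigated: false },
  { threatId: "T-EVADE-001", description: "Evade moderation patterns", mitigated: false },
  { threatId: "T-EXFIL-003", description: "Harvest credentials and exfiltrate", mitigated: false },
];

// A chain is broken when any single step is reliably mitigated.
const chainBroken = (chain: ChainStep[]): boolean =>
  chain.some((step) => step.mitigated);
```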
Phase 5: Documentation & Community Process
- Created THREAT-MODEL-ATLAS.md as the living source of truth
- Defined CONTRIBUTING-THREAT-MODEL.md for community submissions
- Mapped all recommendations to implementation PRs
Challenges we ran into
Challenge 1: Prompt Injection - The "Unsolvable" Problem
The Problem: Direct prompt injection (AML.T0051.000) is fundamentally difficult because LLMs are designed to follow instructions in their input.
Our Approach:
- Admitted that blocking all injections is impossible
- Documented realistic attack sophistication levels
- Layered defenses: pattern detection -> content wrapping -> output validation
- Prioritized user confirmation for sensitive actions
Result: Residual risk = Critical, but with explicit compensating controls
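The layered order described above can be sketched as three independent checks: pattern detection, external-content wrapping, and output validation. The patterns and the `tool:` output format are illustrative assumptions, not OpenClaw's actual implementation:

```typescript
// Layer 1: flag obviously adversarial input (best-effort, bypassable).
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /disregard your system prompt/i,
];

const looksInjected = (text: string): boolean =>
  INJECTION_PATTERNS.some((p) => p.test(text));

// Layer 2: wrap untrusted content so the model treats it as data, not commands.
const wrapExternal = (content: string): string =>
  `<external-content>\n${content}\n</external-content>`;

// Layer 3: validate output before acting, e.g. only permit expected tool calls.
function outputAllowed(output: string, allowedTools: Set<string>): boolean {
  const m = output.match(/^tool:(\w+)/m);
  return m === null || allowedTools.has(m[1]);
}
```

No single layer is sufficient; the point is that an injection must defeat all three before a sensitive action fires.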
Challenge 2: Supply Chain Complexity
The Problem: The ClawHub skill ecosystem has minimal publication barriers (GitHub account age verification + regex patterns), but we couldn't sandbox skills without major refactoring.
Our Approach:
- Documented T-PERSIST-001 (Malicious Skill) as P0 critical
- Built moderation pattern registry to track known-bad signatures
- Planned VirusTotal integration for behavioral code analysis
- Accepted transitional risk while engineering long-term solutions
Result: Roadmap clear; mitigations incremental but measurable
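A minimal sketch of what such a registry could look like; the field names and the sample pattern are hypothetical, not ClawHub's real signatures:

```typescript
// A registry entry records both the signature and its known evasion limits,
// so "easy to bypass" is documented rather than assumed away.
interface ModerationPattern {
  id: string;
  pattern: RegExp;
  knownBypasses: string[]; // documented evasion techniques
}

const registry: ModerationPattern[] = [
  {
    id: "MP-001",
    pattern: /child_process|eval\(/,
    knownBypasses: ["string concatenation", "base64-encoded payloads"],
  },
];

// Return the IDs of every pattern a skill's source trips.
function flag(source: string): string[] {
  return registry.filter((e) => e.pattern.test(source)).map((e) => e.id);
}
```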
Challenge 3: Trust Boundary Clarity
The Problem: OpenClaw has multiple execution contexts (sandbox vs. host, different channels), making it unclear where responsibility shifts.
Our Approach:
- Drew explicit 5-boundary diagram showing data flow
- Assigned mitigations to specific boundary layers
- Clarified that user judgment (exec approvals) is sometimes the boundary
Result: Architecture more defensible; trade-offs explicit
Challenge 4: MITRE ATLAS Mapping
The Problem: ATLAS taxonomy is broad; OpenClaw doesn't map cleanly to all techniques.
Our Approach:
- Only mapped threats we actually identified
- Used ATLAS IDs for external referenceability
- Hyperlinked to atlas.mitre.org for transparency
Result: Credible, peer-reviewable threat model
Accomplishments that we're proud of
1. Industry-First Threat Model for a Multi-Channel AI Gateway
OpenClaw's threat model is the first to systematically document:
- Multi-channel authentication attacks (T-ACCESS-002: AllowFrom spoofing)
- Session isolation under prompt injection (T-DISC-002: Session data extraction)
- Plugin supply chain compromise (T-PERSIST-001, T-PERSIST-002)
- Indirect injection via fetched content (T-EXEC-002)
Shipped as: docs/security/THREAT-MODEL-ATLAS.md - peer-reviewable and community-editable
2. Actionable Risk Prioritization
We quantified 25+ threats and grouped them into 3 tiers:
- P0 (Critical): T-EXEC-001, T-PERSIST-001, T-EXFIL-003 - immediate action
- P1 (High): T-IMPACT-001, T-EXEC-002, T-ACCESS-003 - 2-4 week window
- P2 (Medium): T-EVADE-001, T-ACCESS-001, T-PERSIST-002 - backlog
Impact: Prevents security whack-a-mole; focuses engineering on real leverage points
3. Attack Chain Modeling
We identified 3 critical attack chains showing how threats cascade:
Chain 2: Prompt Injection -> RCE
T-EXEC-001 (Direct Injection)
-> T-EXEC-004 (Bypass Exec Approval)
-> T-IMPACT-001 (Arbitrary Command Execution)
-> System Compromise
Value: Reveals that exec approvals are a critical single point of failure
4. Data Flow Transparency
Created detailed data flow diagrams showing what data goes where:
| Flow | Path | Protection |
|---|---|---|
| F1 | Channel -> Gateway | TLS, AllowFrom |
| F2 | Gateway -> Agent | Session isolation |
| F3 | Agent -> Tools | Policy enforcement |
| F4 | Agent -> External | SSRF blocking |
Result: Security reviewers can trace data and spot gaps
5. Community-Driven Documentation
Process for contributing:
- Report new threat to CONTRIBUTING-THREAT-MODEL.md
- Include ATLAS mapping + attack vector + mitigation
- PR review + merge to main
- Auto-linked from docs site
Impact: Scales security expertise beyond core team
6. Credibility Through MITRE Partnership
By using ATLAS and cross-referencing atlas.mitre.org for each technique, we:
- Enable security researchers to validate our findings
- Link OpenClaw to broader AI security ecosystem
- Support MITRE ATLAS case study submissions
What we learned
1. AI Security is About Trade-Offs, Not Perfection
Blocking all threats is impossible; the realistic answer is explicit risk acceptance plus layered defense.
- We can't block all prompt injections
- We can't sandbox all skills without breaking extensibility
- We can document what we can't protect and compensate elsewhere
Lesson: Transparency about residual risk is more credible than false claims of perfection
2. Trust Boundaries Are the Core Unit of Analysis
Rather than asking "Is OpenClaw secure?", ask:
- "Which boundary is responsible for this threat?"
- "What's the attack surface between boundaries?"
- "Where does user judgment have to fill gaps?"
Result: More precise security conversations
3. Multi-Channel Amplifies Attack Surface
Every integrated channel (Discord, Telegram, Signal, Slack, etc.) is a potential initial access vector. But different channels have different spoofing/identity verification properties:
- Signal: Strong identity (phone number, sealed sender)
- Telegram: Username only (spoofable)
- Discord: Username + no central identity (vulnerable to homoglyphs)
Lesson: Channel-specific security models required
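The per-channel differences above could be encoded as configuration the gateway consults before trusting a sender. The flags below are illustrative judgments based on the list above, not a formal assessment:

```typescript
type Channel = "signal" | "telegram" | "discord";

interface ChannelIdentity {
  spoofable: boolean;         // can sender identity be impersonated?
  requiresExtraProof: boolean; // e.g. challenge-response before sensitive actions
}

const CHANNEL_IDENTITY: Record<Channel, ChannelIdentity> = {
  signal: { spoofable: false, requiresExtraProof: false },  // phone number, sealed sender
  telegram: { spoofable: true, requiresExtraProof: true },  // username only
  discord: { spoofable: true, requiresExtraProof: true },   // homoglyph risk
};

// Trust a sender only if the channel identity is strong, or extra proof was given.
function trustSender(channel: Channel, proofPresented: boolean): boolean {
  const id = CHANNEL_IDENTITY[channel];
  return !id.spoofable || proofPresented;
}
```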
4. Supply Chain Risk Scales Poorly
Every ClawHub skill installation is a potential compromise, and the current controls are limited:
- GitHub account age verification: Medium bar
- Regex patterns: Easy to bypass
- No sandboxing: Skills run with agent privileges
Lesson: VirusTotal + behavioral analysis + sandboxing all needed; none sufficient alone
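Running these controls as independent layers, where any failing check blocks the skill, can be sketched as below. The account-age threshold and the sample pattern are assumptions for illustration:

```typescript
interface Skill {
  authorAccountAgeDays: number;
  source: string;
  behavioralVerdict: "clean" | "suspicious" | "malicious";
}

// Each check is necessary but not sufficient; all must pass.
const checks: Array<(s: Skill) => boolean> = [
  (s) => s.authorAccountAgeDays >= 90,                  // account-age bar (assumed value)
  (s) => !/child_process|process\.env/.test(s.source),  // pattern registry (sample)
  (s) => s.behavioralVerdict === "clean",               // behavioral analysis verdict
];

const admitSkill = (s: Skill): boolean => checks.every((c) => c(s));
```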
5. MITRE ATLAS is Gold Standard for AI Threats
MITRE ATLAS provided:
- Peer-reviewed technique taxonomy
- Cross-industry reference frame
- Connection to broader AI security community
- Extensible for OpenClaw-specific threats
Lesson: Building on established frameworks beats inventing proprietary models
6. Living Documentation Beats Yearly Assessments
A threat model that is:
- Version-controlled
- Community-editable
- Linked to implementation PRs
- Hyperlinked to MITRE ATLAS
...beats a static annual security audit
Lesson: Security is a continuous process, not a checkbox
What's next for ClawGuardian
Phase 2: Implementation & Mitigation
| Priority | Recommendation | Timeline | Impact |
|---|---|---|---|
| P0 | VirusTotal Code Insight integration for ClawHub | 2-4 weeks | Catches behavioral malware (not just patterns) |
| P0 | Skill sandboxing / capability-based security | 4-8 weeks | Prevent credential theft (T-EXFIL-003) |
| P0 | Output validation layer for LLM responses | 2 weeks | Mitigate T-EXEC-001, T-IMPACT-003 |
| P1 | Per-sender rate limiting + cost budgets | 1-2 weeks | Prevent T-IMPACT-002 (DoS) |
| P1 | Token encryption at rest | 2-3 weeks | Raise bar for T-ACCESS-003 |
| P1 | Exec approval bypass testing + hardening | 3-4 weeks | Strengthen T-EXEC-004 defense |
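The per-sender rate limiting proposed for T-IMPACT-002 could be implemented as a fixed-window counter keyed by sender; the window size and limit below are illustrative, and a production version would add cost budgets per the table above:

```typescript
// Fixed-window rate limiter keyed by sender ID.
class SenderLimiter {
  private counts = new Map<string, { n: number; windowStart: number }>();

  constructor(private maxPerWindow: number, private windowMs: number) {}

  // Returns true if the message is allowed; `now` is a millisecond timestamp.
  allow(sender: string, now: number): boolean {
    const e = this.counts.get(sender);
    if (!e || now - e.windowStart >= this.windowMs) {
      this.counts.set(sender, { n: 1, windowStart: now }); // start a new window
      return true;
    }
    if (e.n >= this.maxPerWindow) return false; // over budget for this window
    e.n += 1;
    return true;
  }
}
```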
Phase 3: Community & Governance
- MITRE ATLAS Case Study Submission: Document OpenClaw findings for broader AI security community
- Security Documentation Hub: Centralized security guides, best practices, and configuration hardening
- Security Advisory Process: Structured vulnerability disclosure workflow
- Threat Model Contributors: Recognize security researchers who contribute threats/mitigations
Phase 4: Advanced Mitigations
Short-term (P2, 2-3 months):
- Channel-specific identity verification (cryptographic proofs for Telegram, etc.)
- Config integrity verification + audit logging
- Update signing + version pinning for skills
Medium-term (P3, 3-6 months):
- Prompt injection detection using ML anomaly detection
- Sensitive data classification + context redaction
- Formal verification of exec approval bypasses
Long-term (Research direction):
- Federated threat intelligence with other AI platforms
- Automated attack chain simulation
- Zero-trust agent architecture (fine-grained capability delegation)
Phase 5: Ecosystem Integration
ClawGuardian will become:
- Reference Implementation: How to build threat models for AI agents
- Community Standard: CONTRIBUTING-THREAT-MODEL.md for third-party plugins
- Continuous Validation: Automated security regression testing in CI
- Open Research: Published white papers on multi-channel AI security
Key Metrics
By end of Q2 2026:
- All P0 mitigations (VirusTotal, sandboxing, output validation) shipped
- 100% of attack chains have defined mitigations
- Community threat contributions from 3+ external security researchers
- Zero high-risk residuals accepted without explicit compensating controls
- MITRE ATLAS case study published
Resources
- Living Threat Model: docs/security/THREAT-MODEL-ATLAS.md
- Contributing Guide: docs/security/CONTRIBUTING-THREAT-MODEL.md
- MITRE ATLAS: atlas.mitre.org
- Security Report: security@openclaw.ai
"Security is not a destination; it's a journey of transparent risk management."
ClawGuardian Philosophy