Inspiration

AI agents are being wired into real businesses faster than anyone is securing them. Prompt injection is now OWASP's #1 security risk for AI applications — two years running — and in recent tests, hidden-instruction attacks hijacked live agents more than 79% of the time.

The scary part is the lethal trifecta: the moment an agent can (1) read your private data, (2) ingest untrusted content, and (3) act in the world — send, pay, post — a single malicious instruction buried in an email or a product review can turn the agent against its owner. No malware. No breach alert. The agent just does what it was told.

Enterprises can throw Amazon Bedrock Guardrails and a security team at this. The 14-person Shopify merchant turning on an AI agent to run support? They have nothing. We built ToolFence for them.

What it does

ToolFence is an ingestion-first security control plane for AI agents, delivered as a gateway MCP that every connector routes through. It watches what the agent reads and gates what it does, across three levels:

  • L1 — Screens. Inspects everything the agent ingests and quarantines prompt-injection payloads on the way in.
  • L2 — Reasons. Propagates a taint flag through the session and triages each action by context — so an individually-innocent action gets caught when the session is poisoned.
  • L3 — Holds. Routes the genuinely risky few to a single break-glass approval, then writes the decision to a tamper-evident, hash-chained audit log.

The result: enterprise-grade agent safety without an enterprise security team — and it kills approval-decision fatigue, because the owner only ever sees the handful of actions that actually matter.

How we built it

  • Gateway MCP (FastMCP, Python) on :8000 — the single entry point every tool call passes through, exposing screened get_inbox / get_reviews and a gated send_email.
  • Screen-service (FastAPI + Uvicorn) — the L1/L2 engine that scores ingested content and emits decisions.
  • Audit layer (Supabase / Postgres) — every decision persisted with a prev_hash → hash chain so the log is tamper-evident and verifiable.
  • Live dashboard (React + Vite + TypeScript) — real-time decision feed, approval queue, and hash-chain viewer.
  • Demo arc: one injection, three beats — L1 quarantines the malicious email, L2 blocks a totally innocent follow-up action because the session is tainted, L3 routes to break-glass sign-off and logs it to a chain you can verify.

Challenges we ran into

  • Instructions and data share one channel. That's the root of prompt injection — the model can't tell the attacker's text from the user's. Our whole design (screen-first + taint propagation) exists to add that boundary from the outside.
  • Catching context-dependent attacks. The hard case isn't the obviously-malicious email — it's the innocent action that's only dangerous because of what happened earlier in the session. Session taint tracking was the key idea.
  • Plumbing under a deadline. Reconciling the gateway↔screen-service auth handshake, a uvicorn/websockets version clash on Python 3.14, and connector schema differences — all while keeping a verifiable audit chain intact.

What we learned

Securing agents isn't about a better model — it's about architecture around the model: least-privilege tools, screening untrusted input, tracking risk across a session, and keeping a human in the loop only where it counts. Defense-in-depth beats any single filter, and the truly defensible layer is L2 contextual reasoning — the part that does an analyst's triage so the owner doesn't drown in prompts.

What's next for ToolFence

  • Real-time action replay so a held action executes on approval.
  • More connectors and a tunable policy engine for high-risk rules.
  • Enterprise tier: SSO, SOC 2, and a compliance story — without ever assuming the SMB has a security team.

Built With

Share this project:

Updates