Inspiration
AI agents are being wired into real businesses faster than anyone is securing them. Prompt injection is now OWASP's #1 security risk for AI applications — two years running — and in recent tests, hidden-instruction attacks hijacked live agents more than 79% of the time.
The scary part is the lethal trifecta: the moment an agent can (1) read your private data, (2) ingest untrusted content, and (3) act in the world — send, pay, post — a single malicious instruction buried in an email or a product review can turn the agent against its owner. No malware. No breach alert. The agent just does what it was told.
Enterprises can throw Amazon Bedrock Guardrails and a security team at this. The 14-person Shopify merchant turning on an AI agent to run support? They have nothing. We built ToolFence for them.
What it does
ToolFence is an ingestion-first security control plane for AI agents, delivered as a gateway MCP that every connector routes through. It watches what the agent reads and gates what it does, across three levels:
- L1 — Screens. Inspects everything the agent ingests and quarantines prompt-injection payloads on the way in.
- L2 — Reasons. Propagates a taint flag through the session and triages each action by context — so an individually-innocent action gets caught when the session is poisoned.
- L3 — Holds. Routes the genuinely risky few to a single break-glass approval, then writes the decision to a tamper-evident, hash-chained audit log.
The result: enterprise-grade agent safety without an enterprise security team — and it kills approval-decision fatigue, because the owner only ever sees the handful of actions that actually matter.
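To make the L2 idea concrete, here is a minimal sketch of session taint propagation. It is an illustration under stated assumptions, not ToolFence's actual code: the `Session` class, the `Decision` enum, and the `SIDE_EFFECT_TOOLS` set are hypothetical names, and the only rule modelled is "hold any action that acts in the world once the session has ingested quarantined content".

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    HOLD = "hold"  # would be routed to break-glass approval (L3)


# Assumption: the tools that act in the world. Only send_email is named in the write-up.
SIDE_EFFECT_TOOLS = {"send_email"}


@dataclass
class Session:
    """Hypothetical per-session state carrying the taint flag."""
    tainted: bool = False
    taint_sources: list[str] = field(default_factory=list)

    def record_ingest(self, source: str, quarantined: bool) -> None:
        # Once any ingested item is quarantined, the whole session stays tainted.
        if quarantined:
            self.tainted = True
            self.taint_sources.append(source)

    def triage(self, tool: str) -> Decision:
        # An action that is harmless in a clean session is held in a tainted one.
        if self.tainted and tool in SIDE_EFFECT_TOOLS:
            return Decision.HOLD
        return Decision.ALLOW
```

In this sketch a `send_email` call is allowed in a clean session and held after `record_ingest("get_inbox", quarantined=True)`, while read-only calls such as `get_reviews` still pass. A real triage step would presumably weigh more context than a single boolean.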
How we built it
- Gateway MCP (FastMCP, Python) on :8000, the single entry point every tool call passes through, exposing screened get_inbox / get_reviews and a gated send_email.
- Screen-service (FastAPI + Uvicorn): the L1/L2 engine that scores ingested content and emits decisions.
- Audit layer (Supabase / Postgres): every decision persisted with a prev_hash → hash chain so the log is tamper-evident and verifiable.
- Live dashboard (React + Vite + TypeScript): real-time decision feed, approval queue, and hash-chain viewer.
- Demo arc: one injection, three beats. L1 quarantines the malicious email, L2 blocks a totally innocent follow-up action because the session is tainted, L3 routes to break-glass sign-off and logs it to a chain you can verify.
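The prev_hash → hash chain in the audit layer can be sketched in a few lines. This is a simplified in-memory illustration, not the project's Supabase schema: the row layout, the all-zero genesis value, the choice of SHA-256 over canonical JSON, and the function names `entry_hash`, `append_decision`, and `verify_chain` are all assumptions made for the example.

```python
import hashlib
import json

# Assumption: the first row links to a fixed all-zero sentinel.
GENESIS = "0" * 64


def entry_hash(prev_hash: str, decision: dict) -> str:
    """Hash the previous row's hash together with a canonical JSON form of the decision."""
    payload = json.dumps(decision, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_decision(log: list, decision: dict) -> dict:
    """Append a row whose prev_hash points at the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    row = {
        "decision": decision,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, decision),
    }
    log.append(row)
    return row


def verify_chain(log: list) -> bool:
    """Recompute every link; an edited, reordered, or removed interior row breaks the chain."""
    prev_hash = GENESIS
    for row in log:
        if row["prev_hash"] != prev_hash:
            return False
        if row["hash"] != entry_hash(prev_hash, row["decision"]):
            return False
        prev_hash = row["hash"]
    return True
```

Because each row's hash covers the previous row's hash, changing any stored decision invalidates that row and every row after it. Note that a chain like this only makes tampering evident; detecting truncation of the newest rows would additionally require anchoring the latest hash somewhere outside the log.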
Challenges we ran into
- Instructions and data share one channel. That's the root of prompt injection — the model can't tell the attacker's text from the user's. Our whole design (screen-first + taint propagation) exists to add that boundary from the outside.
- Catching context-dependent attacks. The hard case isn't the obviously-malicious email — it's the innocent action that's only dangerous because of what happened earlier in the session. Session taint tracking was the key idea.
- Plumbing under a deadline. Reconciling the gateway↔screen-service auth handshake, a uvicorn/websockets version clash on Python 3.14, and connector schema differences, all while keeping a verifiable audit chain intact.
What we learned
Securing agents isn't about a better model — it's about architecture around the model: least-privilege tools, screening untrusted input, tracking risk across a session, and keeping a human in the loop only where it counts. Defense-in-depth beats any single filter, and the truly defensible layer is L2 contextual reasoning — the part that does an analyst's triage so the owner doesn't drown in prompts.
What's next for ToolFence
- Real-time action replay so a held action executes on approval.
- More connectors and a tunable policy engine for high-risk rules.
- Enterprise tier: SSO, SOC 2, and a compliance story — without ever assuming the SMB has a security team.
Built With
- anthropic-claude
- claude-code
- fastapi
- fastmcp
- hugging-face
- model-context-protocol
- ngrok
- postgresql
- python
- react
- supabase
- typescript
- uvicorn
- vite