AgentSheriff — Devpost
Inspiration
In 2026 every major lab is shipping AI agents that can send email, push code, place trades, and run shell commands without a human in the loop. The safety tooling is a generation behind. The status quo is to ask the model nicely not to misbehave, which falls apart the first time it reads a prompt injection on the open web or wanders into a market it doesn't understand.
We wanted to build the missing piece: a separate, deterministic layer that sits in front of the agent and enforces what it can and cannot do, independent of the model, the prompt, or the day's jailbreak. The gateway is the Sheriff, standing between the agent and the damage it could do.
What it does
AgentSheriff intercepts every tool call an AI agent attempts and runs three passes:
- Detect. The request is scanned for prompt-injection patterns, exfiltration combos, destructive shell commands, credential-shaped strings, and skill-specific risk markers — all by deterministic regex, before any model runs.
- Decide. Static rules generated from the agent's skill settle what they can. Borderline calls go to an LLM judge with a skill-tuned prompt. Truly ambiguous ones escalate to a human "Sheriff" via an approval queue — surfaced in the dashboard and pushed to Telegram with inline approve / deny buttons.
- Record. Every action — allow, deny, or approval-required — lands in the Sheriff's Ledger with the matched rule, the rationale, and the full request payload. Replayable against a proposed policy change before it ships.
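The three passes can be sketched as a tiny gateway loop. Everything below is illustrative rather than the production code: the `ToolCall` shape, the pattern list, and the `place_trade` rule are invented for the sketch.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; the real deterministic rule set is much larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"rm\s+-rf\s+/", re.I),
]

@dataclass
class ToolCall:
    command: str
    args: str

@dataclass
class Verdict:
    decision: str   # "allow" | "deny" | "approval_required"
    rationale: str

LEDGER: list[dict] = []

def detect(call: ToolCall) -> int:
    """Pass 1: deterministic regex scan -- no model runs here."""
    payload = f"{call.command} {call.args}"
    return sum(1 for p in INJECTION_PATTERNS if p.search(payload))

def decide(call: ToolCall, hits: int) -> Verdict:
    """Pass 2: static rules settle what they can; in the real system,
    borderline calls go to the LLM judge, ambiguous ones to a human."""
    if hits > 0:
        return Verdict("deny", f"matched {hits} threat pattern(s)")
    if call.command == "place_trade":   # illustrative high-risk command
        return Verdict("approval_required", "real-money trade needs a human")
    return Verdict("allow", "no rule matched; within static policy")

def record(call: ToolCall, verdict: Verdict) -> Verdict:
    """Pass 3: every outcome lands in the ledger with its rationale."""
    LEDGER.append({"call": call, "verdict": verdict})
    return verdict

def gateway(call: ToolCall) -> Verdict:
    return record(call, decide(call, detect(call)))
```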
The key idea: policies are generated from SKILL.md files, not
hand-authored. An agent ships a SKILL.md declaring what it does
(reads markets, places trades, etc.); AgentSheriff parses the command
table, extracts the vocabulary of commands and flags, and runs an LLM
law generator that can only emit rules over that vocabulary — no
hallucinated flags, no over-broad globs. Operators tune from a
generated baseline instead of writing YAML from scratch.
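Assuming a SKILL.md whose command table is plain markdown with a flags column (the table contents below are invented for illustration), the vocabulary-extraction step might look like:

```python
SKILL_MD = """\
# Kalshi Trading Skill
| Command | Flags                     | Description       |
|---------|---------------------------|-------------------|
| markets | --ticker, --limit         | List open markets |
| order   | --ticker, --side, --count | Place an order    |
"""

def parse_command_table(text: str) -> dict[str, set[str]]:
    """Extract the command -> flags vocabulary from a SKILL.md table."""
    vocab: dict[str, set[str]] = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or not cells[0]:
            continue
        if set(cells[0]) <= {"-", " "} or cells[0].lower() == "command":
            continue  # separator or header row
        vocab[cells[0]] = {f.strip() for f in cells[1].split(",") if f.strip()}
    return vocab
```

This vocabulary is what the law generator is constrained to: any rule mentioning a command or flag outside it gets rejected.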
For the hackathon demo we wired up a Kalshi prediction-market trading skill as the headline use case: real money, real risk, real consequences if the agent goes off the rails.
How we built it
Backend (Python). FastAPI gateway, SQLAlchemy + SQLite for the
ledger, Starlette session middleware for auth (Google OAuth via
Authlib), WebSocket broadcast on /v1/stream for live dashboard
updates. Threat classifier is two-stage: deterministic regex
heuristics score the request first, then a Claude Sonnet judge runs
only on borderline calls or rules that explicitly delegate. Skill
registry parses SKILL.md files into a structured policy seed.
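A minimal sketch of the two-stage idea: the thresholds and scoring stub are illustrative, and the real second stage is a Claude Sonnet call rather than the `judge` callable stubbed here.

```python
# Illustrative thresholds: the LLM judge is only consulted in the grey band.
ALLOW_BELOW, DENY_ABOVE = 0.2, 0.8

def heuristic_score(payload: str) -> float:
    """Stage 1: cheap deterministic scoring (stub for the regex heuristics)."""
    score = 0.0
    if "rm -rf" in payload:
        score += 0.9  # destructive shell command
    if "api_key" in payload.lower():
        score += 0.5  # credential-shaped string
    return min(score, 1.0)

def classify(payload: str, judge=None) -> str:
    score = heuristic_score(payload)
    if score < ALLOW_BELOW:
        return "allow"
    if score > DENY_ABOVE:
        return "deny"
    # Stage 2: only borderline calls pay the latency/cost of an LLM judge.
    return judge(payload) if judge else "approval_required"
```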
LLM law generator. Reads the parsed skill, the agent's declared concerns, and the operator's risk posture, then asks Claude (Sonnet) to produce a starter policy: a list of static rules, a judge prompt, and human-readable rationale strings. The output is sanitized against the skill's actual command vocabulary before it lands — any rule that references a flag the skill doesn't expose is rejected. A coverage pass auto-adds an approval gate for any command the model didn't cover. The result is editable in the dashboard before publishing.
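The sanitization and coverage passes reduce to set operations over the skill's vocabulary. A sketch, with an invented vocabulary and rule shape:

```python
# Invented vocabulary and rule shape for illustration.
SKILL_VOCAB = {
    "markets": {"--ticker", "--limit"},
    "order": {"--ticker", "--side", "--count"},
}

def sanitize(rules: list[dict]) -> tuple[list[dict], list[dict]]:
    """Reject any generated rule referencing a command or flag
    the skill does not actually expose."""
    kept, rejected = [], []
    for rule in rules:
        flags = SKILL_VOCAB.get(rule["command"])
        if flags is not None and set(rule.get("flags", [])) <= flags:
            kept.append(rule)
        else:
            rejected.append(rule)  # hallucinated command or flag
    return kept, rejected

def add_coverage_gates(kept: list[dict]) -> list[dict]:
    """Coverage pass: any command the model left uncovered gets an approval gate."""
    covered = {r["command"] for r in kept}
    return kept + [
        {"command": c, "flags": [], "action": "approval_required"}
        for c in SKILL_VOCAB if c not in covered
    ]
```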
Approvals. Pending approvals fan out to two surfaces: the dashboard approval queue (live over WebSocket) and a Telegram notifier that posts the request as a card with inline approve / deny buttons and edits the message in place once it's resolved.
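The Telegram card is built on the standard Bot API `sendMessage` fields (`reply_markup` with an `inline_keyboard`); the helper name and the `callback_data` scheme below are our own illustration, not the exact notifier code.

```python
import json

def approval_card(chat_id: int, approval_id: str, summary: str) -> dict:
    """Payload for Telegram's sendMessage with inline approve/deny buttons.
    The field names are the Bot API's; the callback_data scheme is ours."""
    return {
        "chat_id": chat_id,
        "text": f"Approval needed:\n{summary}",
        "reply_markup": json.dumps({
            "inline_keyboard": [[
                {"text": "Approve", "callback_data": f"approve:{approval_id}"},
                {"text": "Deny", "callback_data": f"deny:{approval_id}"},
            ]]
        }),
    }
```

Once the Sheriff taps a button, the callback handler resolves the approval and edits the message in place (Bot API `editMessageText`), so both surfaces show the same final state.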
Frontend (Next.js 16 + Turbopack). Tailwind, Zustand for live
state, framer-motion for the Wanted Poster slam-in. Custom Old-West
theme: parchment #f3e9d2, brass #b8864b, stamp-ink red #a4161a,
Rye headings, Inter body. Marketing landing has a hand-drawn sepia
desert backdrop with the AgentSheriff mascot logo as the hero.
Agent harness. Real autonomous agent runs through OpenClaw
with exec-policy preset yolo (its built-in approval prompts
disabled — AgentSheriff owns approvals now). An OpenClaw plugin and
hook bridge every shell tool call into the gateway; a translator
normalizes OpenClaw's envelope into our ToolCallRequest shape and
routes Kalshi commands to the right skill. Deputy Dusty, our
deterministic simulator, is the backup demo path.
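The translator idea, sketched under the assumption of a hypothetical OpenClaw envelope carrying a raw `command` string; the `ToolCallRequest` fields shown are simplified from the real shape.

```python
from dataclasses import dataclass

@dataclass
class ToolCallRequest:
    skill: str
    command: str
    args: list[str]

def translate(envelope: dict) -> ToolCallRequest:
    """Normalize a (hypothetical) OpenClaw shell-exec envelope into our shape,
    routing Kalshi CLI invocations to the trading skill."""
    argv = envelope.get("command", "").split()
    if len(argv) >= 2 and argv[0] == "kalshi":
        return ToolCallRequest("kalshi-trading", argv[1], argv[2:])
    return ToolCallRequest("shell", argv[0] if argv else "", argv[1:])
```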
Challenges we ran into
- Skill-to-policy distance. A SKILL.md describes capability; a policy describes restriction. Bridging them via LLM-generated laws meant a lot of prompt iteration to get rules that were enforceable (not vibes), specific (not over-broad), and explicable (so the rationale didn't read as machine soup). The vocabulary-sanitization step is what made it actually trustworthy.
- One Sheriff in town. OpenClaw ships its own approval prompts, and ours can't fight with theirs. We landed on disabling theirs entirely via the yolo preset and routing every gate through AgentSheriff.
Accomplishments we're proud of
- A real autonomous agent (OpenClaw) running behind a real policy gateway, with shell calls translated, threat-scanned, judged, and audited end-to-end.
- Skills ship the policy. Drop in a new SKILL.md and the gateway generates a starter rule set automatically, refusing to let the LLM invent flags that don't exist. New verticals don't need a YAML wizard.
- Approvals where the human actually is. Dashboard and Telegram, same approval, edited in place when resolved.
What we learned
- A deterministic policy engine in front of a model is fundamentally more defensible than asking a model to police itself. Better models make the gateway more necessary, not less.
- LLM-generated policies are a great seed but a poor source of truth — the human-in-the-loop edit step, and a vocabulary check that rejects hallucinated rules, are non-negotiable.
Built With
- fastapi
- openclaw
- python
- sqlalchemy
- sqlite
- tailwind
- typescript