prompt-shield

Inspiration

Every LLM app that takes free-form user input is one paste away from a prompt-injection attack. I kept seeing teams reach for a heavy model-based guard, a paid API, or a regex they maintained themselves. Each of those is too much friction for a problem that wants a small, boring, dependency-free middle ground.

What it does

prompt-shield runs five focused pattern checks against any input string before it reaches your LLM. Each check is a pure function, returns structured findings with character spans, and feeds a single Shield facade that can either redact high-risk spans or hard-block the request.

The five rules:

role_override: instruction-override and persona-flip language
tool_call_inject: forged tool-call JSON or XML pasted into user input
secret_extract: system-prompt and tool-list enumeration probes
format_break: chat-template control tokens and closing role tags
delimiter_smuggle: unicode bidi overrides and zero-width characters

Output is deterministic. You can snapshot-test it and run it in front of every request without changing latency or cost.

How I built it

Pure Python 3.10+. Zero runtime dependencies. The Shield facade composes rule modules from src/prompt_shield/rules/. Each rule scans the input, returns Finding objects, and the facade merges overlapping high-risk spans before redaction. RiskLevel is an IntEnum so callers can compare risk numerically. 79 tests pass including a corpus of 20 known injection strings drawn from public lists plus 7 benign controls plus per-rule edge cases.

Challenges I ran into

The hardest case was the disregard family of overrides. The pattern catches "ignore previous instructions" cleanly, but "disregard everything above" has no trailing anchor word like instructions or rules, so the first pass missed it. Fixed with a second pattern in role_override that matches the disregard-or-ignore-or-forget plus everything-or-all plus above-or-prior shape.

The bidi smuggling rule also had a unicode-versioning trap. Different Python builds normalize bidi controls inconsistently, so the rule operates on raw code points rather than normalized strings.

Accomplishments I'm proud of

Five rules, zero deps, 79 tests, deterministic output. The whole library is small enough to read in one sitting and adopt the same afternoon. That was the design constraint.

What I learned

Most prompt-injection defense literature is about model-based guards. The pattern-based corner is undersold because it covers only the obvious injections, but those obvious injections are exactly what shows up in production traffic. Boring catches a lot.

What's next for prompt-shield

A second rules pack for tool-poisoning attacks (where the agent's own tool output carries injected instructions). A small CLI that scans a transcript file and prints findings. A reference integration with the sibling library agentleash so the same risk signal can hard-stop a budget-constrained run.

Built With

ai-safety
anthropic
bedrock
guardrails
llm
openai
prompt-injection
pytest
python
security

Updates

Mukunda Katta started this project — May 24, 2026 11:48 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.