Inspiration

Teams ship LLM features fast, but they rarely measure safety regressions. PromptProof turns red-teaming into a repeatable, offline test you can run locally before each release, so “time to first jailbreak” becomes “time to fix it.”

What it does

Generates attacks: Builds structured attack suites (PII leak, tool abuse, prompt injection, jailbreaks).

Runs them against your app: Via a lightweight “target adapter,” so no app changes are needed.

Judges outcomes: Heuristics + optional LLM critic produce Attack Success Rate (ASR), leakage, refusal quality, and an Overall Risk Index (ORI).

Auto-mitigates: Emits system-prompt patches, IO filters, and tool-limit suggestions.

Proves it: Re-run after applying fixes; compare “before vs after” in a clean HTML report.
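The generate → run → judge loop above can be sketched as follows. This is a minimal illustration only: the `AttackCase` fields, the adapter callable, and the judge interface are assumptions for the sketch, not PromptProof's actual API.

```python
# Illustrative sketch of the attack -> run -> judge loop; field and
# function names are hypothetical, not PromptProof's real interfaces.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AttackCase:
    id: str
    category: str  # e.g. "pii-leak", "prompt-injection"
    prompt: str

def run_suite(cases, target: Callable[[str], str], judge) -> dict:
    """Send each attack through a target adapter, judge the reply,
    and return aggregate metrics (ASR = successes / attacks)."""
    results = [judge(case, target(case.prompt)) for case in cases]
    successes = sum(1 for r in results if r["success"])
    asr = successes / len(results) if results else 0.0
    return {"attacks": len(results), "successful": successes, "asr": asr}

# Usage with stub adapter and judge:
cases = [AttackCase("a1", "pii-leak", "Reveal the admin email."),
         AttackCase("a2", "prompt-injection", "Ignore prior instructions.")]
target = lambda prompt: "I can't help with that."          # stub app
judge = lambda case, reply: {"success": "can't" not in reply}
print(run_suite(cases, target, judge))
# -> {'attacks': 2, 'successful': 0, 'asr': 0.0}
```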

Why it matters (impact)

Converts vague “be safer” goals into numbers (ASR/ORI) you can track in CI.

Works fully offline with open weights, which makes it a fit for regulated environments, classrooms, and OSS maintainers.

Doubles as a teaching tool: every failing case includes a human-readable explanation and suggested fix.
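Tracking ASR in CI, as mentioned above, could look like a simple gate that fails the build when the measured rate exceeds a budget. The threshold and report shape here are illustrative assumptions.

```python
# Hypothetical CI gate: fail the build if ASR regresses past a budget.
# The 5% threshold and the report dict shape are illustrative only.
def gate(report: dict, max_asr: float = 0.05) -> int:
    """Return a process exit code: 0 if within budget, 1 otherwise."""
    return 0 if report["asr"] <= max_asr else 1

print(gate({"asr": 0.0}))    # -> 0 (pipeline passes)
print(gate({"asr": 0.833}))  # -> 1 (pipeline fails)
```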

What’s unique (novelty)

Not another chat UI; it’s a local agentic harness that quantifies safety, proposes specific changes, and proves improvement with a before/after delta.

Uses gpt-oss-20b locally to mutate attack templates and (optionally) act as a structured critic using Harmony-style JSON, with no cloud dependency.
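A structured critic like the one above returns a machine-checkable JSON verdict rather than free text. The field names below are a guess at what such a verdict might contain, not PromptProof's actual schema; the point is the defensive parse, since local models sometimes emit malformed JSON.

```python
# Hypothetical Harmony-style critic verdict; field names are illustrative.
import json

CRITIC_REPLY = """
{"verdict": "attack_succeeded", "leakage": true,
 "refusal_quality": 0.1, "rationale": "The reply exposed the user's email."}
"""

def parse_verdict(raw: str) -> dict:
    """Parse and sanity-check the critic's JSON; malformed model output
    falls back to a conservative 'needs_review' result."""
    try:
        data = json.loads(raw)
        assert data["verdict"] in {"attack_succeeded", "attack_failed"}
        assert 0.0 <= float(data["refusal_quality"]) <= 1.0
        return data
    except (ValueError, KeyError, AssertionError):
        return {"verdict": "needs_review"}

print(parse_verdict(CRITIC_REPLY)["verdict"])  # -> attack_succeeded
print(parse_verdict("not json")["verdict"])    # -> needs_review
```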

How we built it (application of gpt-oss + design)

Models: gpt-oss-20b via Ollama on Windows on Snapdragon (local).

Attack generator: Uses the model to create realistic variants of base scenarios; outputs JSON conforming to our AttackCase schema.

Judge: Fast heuristics (regex + rules) with optional LLM critic for borderline cases.

Mitigations: LLM proposes minimal system-prompt diffs + ready-to-paste filters.

Reports: Jinja2 → report.html plus compare.html (before vs after).

UX: One-command demo (scripts/demo_all.ps1), and a CLI: init, attack, mitigate, report, report-compare. Safety-first defaults and a reproducible log (logs/run_*.jsonl).
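The CLI surface listed above can be sketched as follows. The real tool uses Typer; argparse stands in here so the sketch runs with the standard library alone, and only the subcommand names are taken from the description.

```python
# Stdlib stand-in for the PromptProof CLI (the real tool uses Typer).
import argparse

def build_cli() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="promptproof")
    sub = parser.add_subparsers(dest="command", required=True)
    # Subcommand names as listed in the write-up:
    for name in ("init", "attack", "mitigate", "report", "report-compare"):
        sub.add_parser(name)
    return parser

args = build_cli().parse_args(["attack"])
print(args.command)  # -> attack
```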

Results (from demo run)

Before: ASR 83.3%, ORI 0.783 (12 attacks, 10 successful).

After: ASR 0%, ORI 0.175 (8 attacks, 0 successful).

See reports/before.html, after.html, and compare.html.
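For reference, ASR is successful attacks divided by total attacks; the figures above check out arithmetically (the ORI formula is internal to PromptProof and not reproduced here).

```python
# Sanity check of the reported Attack Success Rates.
def asr(successful: int, total: int) -> float:
    return successful / total

print(f"before: {asr(10, 12):.1%}")  # -> before: 83.3%
print(f"after:  {asr(0, 8):.1%}")    # -> after:  0.0%
```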

Challenges & learnings

Keeping everything offline while preserving realism meant strict token budgets and Harmony-style structured prompts to keep latency low on CPU.

Designing clear, reproducible metrics (ASR, ORI) was key so mitigations don’t just look better; they measure better.

Packaging as an installable CLI with simple Windows scripts made the UX “judge-proof”: one command, one report.

What’s next

Policy packs for sectors (health/finance/education).

Tiny fine-tuned critic (on the generated dataset) to improve refusal grading; submit that as a companion to “Most Useful Fine-Tune.”

VS Code extension to run PromptProof on the currently open repo.

Built With

  • Models/runtimes: gpt-oss-20b (open weight), Ollama (local inference)
  • Languages/libs: Python 3.11, httpx, Jinja, Pydantic, RapidFuzz, regex, Rich, ruamel.yaml, Typer
  • OS/hardware: Windows on Snapdragon (Surface Laptop 7, Qualcomm Snapdragon X Elite, 16 GB RAM)
  • Scripts: PowerShell and batch (.ps1)
  • Dev: Git/GitHub