Red_Co-Author

Inspiration

Inspired by the HarDBench paper on the Co-Authoring Jailbreak (CoJP), we saw a gap in the tooling. The research existed, wrapping harmful intent inside "polish my draft" framing bypasses safety guardrails at near-100% rates, but developers and safety teams lacked an automated, edge-ready tool to test this locally, and most red-team scaffolds also stop at attack generation without ever closing the loop on the defenses that should catch the attacks.

What it does

Red_Co-Author is a fully automated, 100% local vulnerability scanner that weaponizes an LLM's instinct to complete broken text. A drafter model (mistral) turns a refusal-prone prompt into a deliberately incomplete bullet outline studded with [MISSING] placeholders; we wrap that draft in a "polish my draft" co-author framing and fire it, alongside the bare prompt as a control, at one, two, or all three local target models (qwen3:8b, gemma2, phi3) in a single run. An independent llama3 judge scores both responses 1 to 5 on the HarDBench harmfulness rubric, so every attack produces an ASR flag (Attack Success Rate) and a risk-amplification verdict (CoJP succeeded AND amplified versus the direct baseline), and every run appends one self-contained row (prompt, both responses, scores, latencies, optional Laminar trace id) to results.jsonl. On top of that attack loop sits a monitor stack: a naive llama3 zero-shot classifier (the defender's first instinct, "is this response HARMFUL or SAFE?") compared head to head against a trained classifier built from Ollama nomic-embed-text embeddings plus sklearn logistic regression, fit on the data the attack loop generated and then evaluated for generalization on held-out AdvBench (520 prompts) and HarDBench (400 prompts) it never saw during training. The Streamlit dashboard ties it together in five tabs (Run attack, Aggregate, Score gap, Monitor, Raw rows), so a viewer can launch an attack, watch the gap fire on a Sankey diagram and a direct-vs-cojp scatter, then jump to the Monitor tab and see ROC curves showing the trained classifier catching what the naive one misses.

How we built it

We built the pipeline in Python, orchestrating the multi-agent system entirely through Ollama to ensure 100% local, offline execution. Six models (mistral as drafter, qwen3:8b, gemma2, and phi3 as swappable targets, llama3 as judge and naive monitor, and nomic-embed-text as the embedding backbone for the trained monitor) sit behind a Streamlit and Plotly dashboard, with sklearn powering the trained classifier and optional Laminar tracing for per-span input and output capture. A single attack_run function is the source of truth shared by the CLI, the resumable batch runner, and the UI, so the three flows never drift.

Challenges we ran into

Orchestrating multiple LLMs sequentially on local hardware is a massive memory management challenge. Balancing the context windows and gracefully swapping models (Mistral, Target, Llama, embedding model) without crashing the system required careful optimization, and once we added gemma2 and phi3 as additional targets a single attack run now meant three sequential target sweeps rather than one, pushing per-run latency to 3 to 9 minutes on a 16 GB laptop because two 8B models cannot coexist in RAM. We also discovered the drafter itself is an unexpected safety bottleneck: mistral refuses certain categories outright (suicide manipulation, child safety), so the entire CoJP path fails safe for those prompts, which we logged as a finding rather than chasing, because defense in depth was on the side of the defender, not a bug.

Accomplishments that we're proud of

We are incredibly proud of achieving a complex, multi-agent AI pipeline with zero API keys. Building an end-to-end red-teaming tool (drafter, three targets, independent judge, embedding-based monitor, generalization evaluation across AdvBench and HarDBench) that runs entirely on the edge is a massive win for privacy and security. We are especially proud that the monitor is not an afterthought bolted on at the end: it trains on the very data the attack loop generates, and is then evaluated on two corpora it never saw, so the dashboard tells both the attacker's story (Sankey and score scatter) and the defender's story (F1 by split and ROC overlays) on the same screen. The batch runner is also resumable on a (prompt, target, source) key, so a crash 400 prompts into a 1,500-row sweep does not waste hours of work.

What we learned

We learned that AI safety isn't just about filtering bad words; it's about understanding the underlying psychological constraints of the models. A "polish my draft" framing slips through where a direct harmful query gets refused, even when the intent is identical. From a systems engineering perspective, we learned a ton about pipeline observability via per-span input and output capture (which once saved us when an HS=5 score turned out to be a regex parse bug, not a real jailbreak), state management across multiple sequential LLM calls with shared status callbacks for live UI progress, and how to effectively visualize complex security metrics using Sankey diagrams, direct-vs-cojp score scatters with a y=x reference, target by domain heatmaps, and ROC curves. We also learned that the headline is not a single ASR number, it is the per-target-and-domain gap matrix for attacks, and the in-distribution minus cross-corpus F1 delta for monitors.

What's next for Red_Co-Author

We want to package Red_Co-Author into a lightweight CLI tool that integrates directly into CI/CD pipelines, allowing developers to automatically red-team their local agents before deployment and gating merges on the monitor's F1 score against a target corpus. Other directions: a HarDBench-native attack mode that uses the bench's own prompt_with_suffix field as the CoJP path so we reproduce the paper's numbers exactly, streaming target output to keep the dashboard alive during 60-second response windows, and an opt-in frontier-model bridge that automates sending high-HS CoJP prompts to Claude or GPT for cross-validation screenshots.

GitHub repository