MUTANTE — Forensic-Grade LLM Red-Teaming Platform
Inspiration
Modern AI safety tools rely on brittle keyword filtering — they fail the moment an attacker rephrases a request. We built MUTANTE to answer a harder question: can we measure how close a model gets to being compromised, even when it ultimately refuses? We wanted a system that treats jailbreak attempts as communicative acts, not just malicious strings — and produces audit-ready, mathematically reproducible verdicts. The name is intentional. MUTANTE doesn't just detect attacks. It becomes them — iteratively, intelligently, with surgical precision.
Simulator: https://annatchijova.github.io/vigia/mutante.html
What It Does MUTANTE is a real-time forensic diagnostic platform for LLM security teams. It continuously stress-tests alignment boundaries by:
Mutating adversarial prompts across five semantic transformation vectors (ROT13, Base64, mirror, scramble, zigzag) Firing them at a target model on Vertex AI via the google-genai SDK Evaluating every response through a 4-layer deterministic semiotic engine Scoring each interaction with a Jailbreak Confidence Score (JCS) — a rational-arithmetic value from 0 (fully blocked) to 5 (complete bypass) — with a forensic breakdown of exactly which safety layers degraded and why Streaming all verdicts in real time to Elastic Cloud for operational observability and semantic search Persisting audit logs to BigQuery for longitudinal analysis
The system tells a security team not just whether a model was jailbroken, but which attack vector worked, which semiotic layer failed first, and where to invest in fine-tuning. That distinction makes MUTANTE a diagnostic tool, not just a detector. Validated against 11,000+ adversarial sequences from msoedov/agentic_security and a curated balanced dataset of 1,044 evaluated prompts (527 jailbreak / 517 benign), with zero false positives on benign inputs confirmed in the demo run. The Streamlit dashboard provides five operational views:
Command Center — live KPIs and system pulse Forensic Feed — per-event breakdown with threat tags Analytics — JCS timeline, mutation heatmaps, and ES|QL-driven cluster analytics Attack Families — PCA 2D semantic map of attack clusters, family cards ranked by bypass rate, kNN similarity explorer Agent Console — interactive live red-teaming via the ADK agent
How We Built It The full stack runs on Google Cloud, deployed to Cloud Run at https://mutante-core-engine-430944783183.us-central1.run.app, with the ADK agent hosted on Vertex AI Agent Engine. Core architecture:
Mutation Engine — five syntactic/semantic transformation vectors applied via MutationEngine Bayesian Bandit Orchestrator — Thompson Sampling continuously reprioritizes mutation vectors based on live bypass success rates; by probe 32, it had already identified mirror as the highest-yield vector Deterministic Semiotic Core — 4-layer engine (Syntax, Semantic, Pragmatic, Synergy) using fractions.Fraction exclusively; zero floating-point drift, 100% reproducible verdicts Hybrid Score Fusion Layer — fuses the deterministic JCS with a Gemini LLM-as-Judge layer (pragmatic compliance drift, semantic accommodation, safety boundary erosion, contextual reframing sensitivity) on a unified 0–5 scale with graceful fallback to deterministic-only when the judge is unavailable Bypass Quality Gate — statistical noise filter that validates whether a batch constitutes a genuine vulnerability signal vs. random fluctuation Elastic Cloud semantic layer — dense vector indexing (gemini-embedding-001, 768-dim, cosine), kNN search, attack family clustering, and native ES|QL analytics replacing all match_all + pandas patterns BigQuery sink — daily-partitioned tables for longitudinal audit persistence MCP server — all engine tools exposed via FastMCP, enabling seamless agent interoperability Google ADK 2.0 agent (mutante_security_agent) with FunctionTool cycle: mutate → attack → evaluate → index
Credentials for Elastic Cloud ID (which contains : and = characters) are managed via Google Secret Manager with --set-secrets in deployment — a hard-learned lesson about env var corruption.
Challenges The negation problem was the hardest to solve. A model responding "I will NOT provide step-by-step instructions for synthesizing that compound" contains every bypass-adjacent keyword in the corpus — but it's a refusal. Our bidirectional negation handler attenuates scoring within a contextual window around each matched pattern, distinguishing genuine compliance from resistance that merely mirrors adversarial vocabulary. Architectural integrity under the Elastic migration: upgrading from a log sink with match_all queries to a full dense vector semantic index required redesigning analytics from the ground up without breaking the deterministic verdict layer. The two must remain strictly decoupled — the probabilistic retrieval layer cannot contaminate the rational-arithmetic evidence layer. SDK deprecation during the judging window: the original vertexai.generative_models SDK is being eliminated June 24, 2026 — during active judging. We completed a full migration to google-genai before submission.
Accomplishments
Full end-to-end pipeline running on Google Cloud: mutation → Vertex AI inference → deterministic evaluation → real-time Elastic indexing → live Streamlit dashboard 1,044-row balanced evaluation dataset with confirmed zero false positives on benign inputs Elastic semantic layer: dense vector indexing, kNN attack family clustering, PCA 2D attack map (numpy SVD, no sklearn dependency), ES|QL analytics ADK 2.0 agent with complete FunctionTool registration and Vertex AI Agent Engine deployment Zero ML in the verdict core: every JCS is mathematically reproducible from identical inputs MCP integration making the entire engine composable as an agent toolchain Validated throughput against 11,000+ adversarial sequences from msoedov/agentic_security
What We Learned That "blocked" and "safe" are not synonyms. A model can refuse a request while exhibiting measurable semantic accommodation — adopting the attacker's framing, terminology, or persona — without crossing the hard refusal threshold. The JCS captures that gray zone. Binary evaluators miss it entirely. We also learned that you cannot audit variance with more variance. Deploying a probabilistic classifier to evaluate a probabilistic target creates a loop of stochastic feedback where the auditor inherits the exact vulnerabilities it is trying to measure. Strict architectural decoupling — deterministic evidence layer below, probabilistic enrichment layer above — is not just a design preference; it is the only path to forensically admissible results.
What's Next
Exportable PDF audit reports — the single highest-impact addition for enterprise adoption Expand the semiotic knowledge base with multilingual refusal and bypass pattern matrices (ES, RU, ZH) Kibana dashboards on top of the Elastic index for richer operational visualization Publish the forensic verdict schema as an open standard for LLM safety auditing Continuous red-teaming pipelines as a managed service, triggered on model version updates
UI Update (June 2026): The forensic dashboard was redesigned with a precision instrument aesthetic. The Elasticsearch indices shown in the live demo are populated with sample telemetry generated by the project's own deterministic evaluator (reseed_elastic.py). Production validation results (+1,100 probes, 5 bypasses, 0.42% bypass rate against gemini-2.5-pro/Gemini 3.1 pro preview) are documented in batch_results.jsonl
Built With
- bigquery
- elastic-cloud
- elasticsearch
- fractions
- gemini
- google-adk
- google-cloud-run
- google-genai
- mcp
- plotly
- python
- streamlit
- thompson-sampling
- vertex-ai

Log in or sign up for Devpost to join the conversation.