🧠 Inspiration

The growing complexity of large language models demands rigorous and intelligent red-teaming to ensure safety, trust, and alignment. With the release of OpenAI's GPT-OSS-20B, this was the perfect opportunity to push the boundaries of AI safety testing. Inspired by real-world risks like reward hacking, deception, and covert data exfiltration, I wanted to build an autonomous red-team agent that doesn’t just test models... it outsmarts them.


🤖 What it does

Red Teaming Agent for GPT-OSS-20B is an autonomous, multi-turn vulnerability probing system built using LangGraph and Ollama. It systematically:

  • Targets 9 core vulnerability types (deception, sabotage, CoT manipulation, etc.)
  • Generates sophisticated, context-aware attack prompts with no metadata cues
  • Engages in intelligent multi-turn conversations
  • Analyzes model responses for strategic misalignment or unsafe behavior
  • Generates detailed reports with proof-of-concept interactions and remediation insights

Think of it as an AI that probes another AI…not to destroy it, but to truly understand where it breaks.


🛠️ How I built it

I combined the power of LangGraph for workflow orchestration and Ollama for local LLM execution:

  • Built a modular red-teaming agent using Python, with clear stages: prompt generation, testing, analysis, and reporting
  • Used LLaMA 3.1 as a red-teaming model to intelligently craft attacks
  • Deployed GPT-OSS-20B as the target model via Ollama
  • Developed a robust analysis engine to detect failures like reward hacking, deceptive alignment, and tool misuse
  • Designed comprehensive reporting tools with structured logs, vulnerability breakdowns, and reproducible conversations

All components are plug-and-play, easily configurable via .env or YAML.


🧗‍♂️ Challenges I ran into

  • Balancing prompt subtlety: Crafting attacks that bypass filters without being unrealistic.
  • Conversation depth: Detecting vulnerabilities sometimes took 4–6 turns; optimizing LangGraph’s flow to allow this without bloating.
  • Local model serving: Running 20B models locally required hardware tuning and efficient memory management.
  • Strategic evaluation: Differentiating between intentional deception and model hallucination required nuanced logic.

🏆 Accomplishments that I’m proud of

  • Successfully detected multi-type vulnerabilities in GPT-OSS-20B, including early signs of evaluation awareness and CoT misdirection
  • Created a fully automated pipeline from prompt generation to final report
  • Designed a modular red-teaming framework that can scale across models, providers, and vulnerability taxonomies
  • Generated production-ready outputs, including:
    • Vulnerability reports
    • Conversation logs
    • PoC demonstrations
    • Mitigation recommendations

📚 What I learned

  • Red-teaming LLMs is an adversarial art, not just a testing procedure.
  • Strategic vulnerabilities often hide in nuance word choice, turn timing, or subtle intent.
  • Building AI to detect AI failure requires interpretability at every stage: prompt, response, analysis, and memory.
  • The future of safety isn't just in training better models, it’s in continually challenging them intelligently.

🔮 What's next for Red Teaming

  • Multi-agent testing: Enable swarm-based red-team agents that collaborate and escalate attacks.
  • Cross-model benchmarking: Expand to compare vulnerabilities across open-source models (Mistral, Phi, Gemma, etc.)
  • Visualization layer: Map out vulnerability patterns in a human-readable dashboard.
  • Real-time monitoring integration: Turn this into a live AI firewall for deployed LLMs.

In a world of fast-evolving models, our job isn't just to build smarter AI, it's to test them smarter, too.

Built With

  • argparse
  • dataclasses
  • dotenv
  • filesystem-based
  • git
  • gpt-oss-20b
  • langgraph
  • llama-3.1
  • local-machine-via-ollama
  • ollama-rest-api
  • pydantic
  • python
  • typer
  • yaml
Share this project:

Updates