IGGY


Project Introduction

Dear Grandma is all about pushing AI agents to their limits, responsibly. Our goal is to break, probe, and really understand how these models behave when challenged, so they can be made safer before reaching the real world.

Project Aims

We work as a responsible red team to tackle the hidden weaknesses of advanced AI. Our main goal is to find and understand critical risks and behaviours in modern models, including black-box ones.

Given seven black-box animal agents, we probe their behaviour with adversarial queries. Each agent is mapped to its technical framework, underlying model, and architectural design based on observable responses, latency, and error patterns.
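The mapping step above can be sketched as a simple fingerprinting heuristic. Everything below is illustrative: the framework names, latency values, and error signatures are hypothetical stand-ins, not measurements from the actual agents.

```python
import statistics

# Hypothetical fingerprint table: mean response latency (seconds) and a
# characteristic error substring per candidate stack (illustrative only).
FINGERPRINTS = {
    "langchain-llama3.3": {"latency": 2.1, "error": "OutputParserException"},
    "bedrock-novapremier": {"latency": 0.9, "error": "ThrottlingException"},
}

def fingerprint(latencies, error_text):
    """Guess an agent's stack from observed latencies and error messages.

    An error-signature match wins outright; otherwise fall back to the
    candidate whose expected mean latency is closest to what we observed.
    """
    for name, fp in FINGERPRINTS.items():
        if fp["error"] in error_text:
            return name
    mean = statistics.mean(latencies)
    return min(FINGERPRINTS, key=lambda n: abs(FINGERPRINTS[n]["latency"] - mean))
```

In practice the table would be built from repeated probes of each agent, with the error column populated from whatever stack traces or provider errors leak through.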

Methodologies

Our methodology emphasises comprehensive coverage and measurable outcomes. We conduct repeated attacks across all relevant categories and document the Attack Success Rate (ASR) for each agent, ensuring findings are quantifiable and reproducible.
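The per-agent ASR bookkeeping described above reduces to a simple ratio: successful jailbreaks over total attempts, tracked per (agent, attack category) pair. A minimal sketch (the agent and category names are placeholders):

```python
def attack_success_rate(outcomes):
    """ASR = successful jailbreaks / total attack attempts."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# Illustrative bookkeeping: one boolean per attack attempt,
# keyed by (agent, attack category).
results = {("agent_1", "role-play"): [True, False, True, True]}
asr = {key: attack_success_rate(v) for key, v in results.items()}
# asr[("agent_1", "role-play")] → 0.75
```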

Red-Team AI Security Testing: uncovering vulnerabilities with the following methods:

  • PAIR
  • Role-play
  • STI Jinja
  • DAN
  • Please Attack

Success Rate Calculation

Responses are parsed in the following order to determine whether a prompt successfully jailbroke the system:

  • RegEx: key-phrase detection as an initial heuristic pass over the response
  • Semantic Analysis: the Sentence Transformer all-MiniLM-L6-v2 detects refusal semantics, e.g. "I'm sorry, I can't..."
  • LLM as a Judge: Gemini-2.0-flash-001 is called with the attack prompt and response to determine whether the jailbreak succeeded

Built With

  • all-minilm-l6-v2
  • bedrock
  • gemini-2.0-flash-001
  • llama3.3
  • novapremier