Inspiration

Our team was inspired by Google DeepMind's AlphaEvolve paper which demonstrated how AI could autonomously discover and optimize algorithms through evolutionary processes. We wanted to leverage the open-source implementation of this and apply it to red-teaming to improve the safety of models against jailbreak prompts.

There is a demonstrable need to discover and understand AI safety vulnerabilities through evolutionary computation. Traditional red-teaming approaches are often manual and limited in scope, but by applying evolutionary algorithms to jailbreak prompt generation, we can automatically discover novel attack vectors that human testers might miss.

What it does

HackEvolve is an evolutionary coding agent that uses Large Language Models (LLMs) to automatically optimize a dataset of jailbreak prompts. The system:

  1. Evolves jailbreak prompts. We use an evolutionary algorithm to iteratively improve jailbreak prompts, starting from a basic initial prompt and evolving it over hundreds of generations.
  2. Leverages a multi-model approach. We use an ensemble of LLMs (Gemini and Claude) to generate diverse prompt variations (mutations).
  3. Monitors generation progress. We built a web dashboard with interactive visualizations showing evolution progress, safety metrics, and prompt effectiveness across generations and iterations.
  4. Tracks key safety metrics. We measure multiple dimensions including danger level, novelty, answer rate, and refusal rate to comprehensively evaluate prompt effectiveness.

How we built it

We implemented a custom version of the AlphaEvolve architecture. We made use of existing open source forks, primarily relying on OpenEvolve as the foundation of our project.

Using a prebuilt dataset of jailbreaking prompts as the seed for Generation 0, we iteratively mutate the prompts in this dataset over many generations until it becomes a robust population of dangerous rule-violating prompts. Gemini is the model used for the high throughput of mutations. This essentially turns the problem of optimal jailbreak prompt discovery into a BFS on the entire space of possible prompts.

We combine this BFS approach with evolutionary pressure (the core of AlphaEvolve). We use claude-3.5 as an evaluator by asking it to judge the answers that claude-4 provides when given the current jailbreak prompt. We look at the similarity between the produced answer and the ideal answer (outright refusal).

  • High-scoring prompts get stored in the elite archive and are frequently selected as parents
  • Low-scoring prompts are less likely to be chosen for reproduction

This creates natural selection where "better" prompts gradually dominate the population.

Challenges we ran into:

  1. The evaluation of adversary prompts: Naive prompt-keyword checks fail. We improved by testing prompts against Claude and analyzing its response for safety bypasses. However, this direct evaluation faces cost limits and imperfect heuristics for classifying responses, an ongoing challenge in accurately scoring adversarial success.
  2. Resource Balancing for Effective Evolution: API costs for evaluating each prompt with Claude, and the time taken for these evaluations, were significant constraints. This forced a critical balance: we needed enough evolutionary cycles and evaluated prompts per cycle to demonstrate meaningful improvement in adversarial capabilities, while staying within budget. This directly influenced our choices regarding the evaluation LLM's version (cost/speed vs. capability) and the complexity of our prompt generation strategies.

Accomplishments that we're proud of

We broke claude4 and claude4-opus. Using the constructed dataset of jailbreak prompts, we also fine-tuned a version of Qwen 2.5 7b to be safer than these models.

What we learned

  • Budget enough time for model training
  • What a powerful jailbreak prompt looks like

What's next for HackEvolve

We see this approach being applied to regular hacking, such as mutating scripts and strategies that are gradually executed to hack any website or gather as much information on a given target, while generating the least noise/pain possible.

Built With

Share this project:

Updates