Inspiration

The potential boons of AI are difficult to overstate. My personal mission is to help mitigate as many risks from AI as possible so that humanity can fully reap its myriad benefits.

One risk associated with LLMs is that it is exceedingly easy to strip away a model's safety fine-tuning (e.g., RLHF) while preserving its overall usefulness. Techniques such as LoRA can be used if one has direct access to the model weights, but safety guardrails can be bypassed even with access only to a model's web interface. The llm-attacks paper demonstrated a technique for adversarially generating malicious prompt suffixes that can be appended to a prompt and cause a wide variety of LLMs to happily generate harmful output.

Today, the consequences of a jailbroken LLM are fairly modest, but as AI capabilities progress, stripping away safety fine-tuning will expose increasingly malicious capabilities and empower attackers more and more. Efforts must start today to study defensive measures for preventing jailbreaking attacks so that we are ready when the consequences of failure are much higher. My project is a contribution to this defensive effort.

What it does

My model detects prompts that contain adversarially-generated malicious prompt suffixes. Prompts to LLMs can be screened by this detector model before being passed to the LLM, reducing the likelihood that a malicious prompt will successfully jailbreak the LLM.
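
Below is a minimal sketch of how such screening could be wired up. The names here (`vectorizer`, `detector`, `call_llm`) are illustrative placeholders, not the project's actual API:

```python
# Illustrative sketch: screen a prompt with the detector before forwarding it
# to the LLM. `vectorizer`, `detector`, and `call_llm` are assumed placeholders.
def screened_completion(prompt: str, vectorizer, detector, call_llm):
    features = vectorizer.transform([prompt])      # same vectorizer used during training
    is_malicious = detector.predict(features)[0]   # 1 = adversarial suffix detected
    if is_malicious:
        return "Prompt rejected: possible adversarial jailbreaking suffix detected."
    return call_llm(prompt)                        # only prompts flagged as benign reach the LLM
```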

How I built it

I used four different adversarially generated malicious prompt suffixes from the llm-attacks paper to generate a dataset containing both malicious and benign prompts. The benign samples included prompts with suffixes created by slightly manipulating the malicious suffixes.
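As a rough sketch of this data-generation step (the base prompts and suffix strings below are placeholders, and the character-shuffle perturbation is only an assumption about how "slightly manipulating" a suffix could work, not the exact procedure used):

```python
import random

# Placeholder prompts and suffixes; the real dataset used the four adversarial
# suffixes from the llm-attacks paper and a set of ordinary user prompts.
base_prompts = ["Tell me a story about a dragon.", "Summarize this article for me."]
malicious_suffixes = ["<adversarial suffix 1>", "<adversarial suffix 2>"]

def perturb(suffix: str) -> str:
    # Lightly scramble the suffix so it is no longer the exact adversarial string;
    # these perturbed versions are labeled benign.
    chars = list(suffix)
    random.shuffle(chars)
    return "".join(chars)

samples = []  # (prompt text, label) pairs; label 1 = malicious, 0 = benign
for prompt in base_prompts:
    samples.append((prompt, 0))                              # plain benign prompt
    for suffix in malicious_suffixes:
        samples.append((f"{prompt} {suffix}", 1))            # prompt + adversarial suffix
        samples.append((f"{prompt} {perturb(suffix)}", 0))   # prompt + perturbed (benign) suffix
```
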

I created two datasets:

Dataset A: Contains only malicious prompt suffixes 1 and 2 (and their slightly modified benign versions).

Dataset B: Same as dataset A but for malicious suffixes 3 and 4.

I split dataset A into training and test sets and trained a logistic regression model on the training set (after vectorizing the textual data). I then used the model to make predictions on two datasets: (1) The test set for dataset A, and (2) All of dataset B.
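
A minimal sketch of this training and evaluation step is below. The TF-IDF vectorizer is an assumption (the write-up only says the text was vectorized), and `texts_a`/`labels_a` and `texts_b`/`labels_b` stand for the prompt strings and 0/1 labels of datasets A and B:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_and_evaluate(texts_a, labels_a, texts_b, labels_b):
    # Hold out part of dataset A for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        texts_a, labels_a, test_size=0.2, random_state=0)

    # Vectorize the text (TF-IDF here is an assumption) and fit logistic regression.
    vectorizer = TfidfVectorizer()
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(X_train), y_train)

    # (1) Held-out test split of dataset A.
    acc_a = accuracy_score(y_test, model.predict(vectorizer.transform(X_test)))
    # (2) All of dataset B, whose suffixes were never seen during training.
    acc_b = accuracy_score(labels_b, model.predict(vectorizer.transform(texts_b)))
    return acc_a, acc_b
```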

Accomplishments that I'm proud of

Dataset A: 100% accuracy on the held-out test set

Dataset B: 75% accuracy (even though the model never saw examples of malicious suffixes 3 and 4 during training!)

What's next for Detecting Adversarial Jailbreaking Prompts

I plan to experiment with various other models to see if I can improve performance even more. Ultimately, I think it will be critical to move beyond safety fine-tuning as the main alignment technique for LLMs. Creating models with qualities of both Jekyll and Hyde and then stripping away the Hyde part is much less desirable than building models that are all Jekyll. If Hyde is lurking in the depths of the model, attackers WILL pull him up to the surface. Hyde isn't that scary right now, but empirical research by Anthropic suggests that in the next 2-3 years, he could be very frightening indeed.
