LLM-Xploit

Example 1 - LLM Refusing to Scold the user
Example 1 - LLM scolds the user with adversarial string
Example 2 - LLM Refusing to Scold the user.png
Example 2 - LLM scolds the user with adversarial string

Inspiration

With the increase in the number of Large Language Model (LLM) applications being made following the rise of ChatGPT in early 2023, one should be wary of the potential security risks that are present as these applications are sometimes fine-tuned on sensitive data for specific use cases. Additionally, their potential to produce objectionable content, spread misinformation and hallucinate may cause companies that are unprepared to be caught off guard by malicious actors.

Together with the demand for some form of adversarial testing against LLMs from our clients, we have decided to delve into this field to explore ways we can better inform our clients on the risks of their LLM applications.

What it does

LLM-Xploit is an Adversarial Testing Software as a Service used to discover Vulnerabilities in LLMs. Currently, it is only able to perform prompt based attacks on LLMs. We intend to have multiple attack scenarios where attackers are able to add adversarial prompts to the user's inputs at the API level and influence downstream users. Scolding is one such scenario we have explored where adding a scolding prompt together with an optimized universal adversarial string would cause the the LLM to scold the end user.

We believe the use of reinforcement learning could benefit greatly in a black box context as the environment is unknown (in this case, it would be the LLM's weights) and there are opportunities for exploration and exploitation to optimize for an objective (in this case, the objective would be an adversarial suffix which fulfills the attacker's harmful intent).

How we built it

Components

Reinforcement Learning Adversarial Agent
Target Large Language Model
A disruptive attacker scenario

REINFORCE Algorithm with Baseline

A policy gradient method which uses the classical REINFORCE algorithm with a baseline function to decrease the variance and remove bias from the rewards.
Probabilities of actions yielding rewards better than the baseline would be increased for faster convergence
The reward function uses a pre-trained sentiment classifier to judge the output of the LLM
Entropy regularization is used to encourage action diversity and reduce chances of convergence to a local optimum

Adversarial Testing Pipeline

Once we obtain a client’s API key, we run several adversarial prompt experiments to test how robust their model is to prompt injection exploits.
In each experiment, we begin with the adversarial reinforcement learning agent which is a deep neural net.
It generates and appends an adversarial suffix to user input (training prompts). This action would be passed to the environment and the critic which are the language model and reward function respectively.
The output from the language model together with the user input (training prompts) and the adversarial suffix forms the state which provides the Agent with information about the environment at the current step.
Additionally, the reward function places a value to the current state reached by taking the action. This reward feedback together with the agent’s policy guides it to optimize for higher rewards.
This process repeats until the rewards have converged.

Challenges we ran into

Quality of training prompts

The training prompts were mostly handcrafted and sourced from ShareGPT conversations. Apart of handpicking the training prompts, we also had to ensure that the attacker scenarios which we have chosen must require an adversarial prompt to break (for example, llama-2-13b was already able to scold the user for certain questions asked without the need for an adversarial prompt) .

Limit on number of user prompts that can be used for training

We could not use more than 40 training prompts to train the adversarial model as it would take too long to converge due to the limited compute resources. We made do with 40 prompts and took the opportunity to see how transferable our attacks were to other Language Models which we did not train on.

Accomplishments that we're proud of

We were able to train our reinforcement learning algorithm to converge with Llama-2 and Vicuna variants.
We have managed to show some transfer-ability of the adversarial prompts to a popular chat model, Bard AI (Google).

What we learned

Key LLM concepts - tokenizers and model inference parameters, various language model architectures, prompt engineering
LLM deployment on kubernetes (on-premises) and on mlflow (on databricks)
Framing a reinforcement learning problem and using policy gradient methods solve a black box adversarial attack problem.
Various existing adversarial attacks (& Jailbreaks) on LLMs such as white box gradient attacks and black box red teaming