Inspiration
With the increase in the number of Large Language Model (LLM) applications being made following the rise of ChatGPT in early 2023, one should be wary of the potential security risks that are present as these applications are sometimes fine-tuned on sensitive data for specific use cases. Additionally, their potential to produce objectionable content, spread misinformation and hallucinate may cause companies that are unprepared to be caught off guard by malicious actors.
Together with the demand for some form of adversarial testing against LLMs from our clients, we have decided to delve into this field to explore ways we can better inform our clients on the risks of their LLM applications.
What it does
LLM-Xploit is an Adversarial Testing Software as a Service used to discover Vulnerabilities in LLMs. Currently, it is only able to perform prompt based attacks on LLMs. We intend to have multiple attack scenarios where attackers are able to add adversarial prompts to the user's inputs at the API level and influence downstream users. Scolding is one such scenario we have explored where adding a scolding prompt together with an optimized universal adversarial string would cause the the LLM to scold the end user.
We believe the use of reinforcement learning could benefit greatly in a black box context as the environment is unknown (in this case, it would be the LLM's weights) and there are opportunities for exploration and exploitation to optimize for an objective (in this case, the objective would be an adversarial suffix which fulfills the attacker's harmful intent).
How we built it
Components
- Reinforcement Learning Adversarial Agent
- Target Large Language Model
- A disruptive attacker scenario
REINFORCE Algorithm with Baseline
- A policy gradient method which uses the classical REINFORCE algorithm with a baseline function to decrease the variance and remove bias from the rewards.
- Probabilities of actions yielding rewards better than the baseline would be increased for faster convergence
- The reward function uses a pre-trained sentiment classifier to judge the output of the LLM
- Entropy regularization is used to encourage action diversity and reduce chances of convergence to a local optimum
Adversarial Testing Pipeline
- Once we obtain a client’s API key, we run several adversarial prompt experiments to test how robust their model is to prompt injection exploits.
- In each experiment, we begin with the adversarial reinforcement learning agent which is a deep neural net.
- It generates and appends an adversarial suffix to user input (training prompts). This action would be passed to the environment and the critic which are the language model and reward function respectively.
- The output from the language model together with the user input (training prompts) and the adversarial suffix forms the state which provides the Agent with information about the environment at the current step.
- Additionally, the reward function places a value to the current state reached by taking the action. This reward feedback together with the agent’s policy guides it to optimize for higher rewards.
- This process repeats until the rewards have converged.
Challenges we ran into
Quality of training prompts
The training prompts were mostly handcrafted and sourced from ShareGPT conversations. Apart of handpicking the training prompts, we also had to ensure that the attacker scenarios which we have chosen must require an adversarial prompt to break (for example, llama-2-13b was already able to scold the user for certain questions asked without the need for an adversarial prompt) .
Limit on number of user prompts that can be used for training
We could not use more than 40 training prompts to train the adversarial model as it would take too long to converge due to the limited compute resources. We made do with 40 prompts and took the opportunity to see how transferable our attacks were to other Language Models which we did not train on.
Accomplishments that we're proud of
- We were able to train our reinforcement learning algorithm to converge with Llama-2 and Vicuna variants.
- We have managed to show some transfer-ability of the adversarial prompts to a popular chat model, Bard AI (Google).
What we learned
- Key LLM concepts - tokenizers and model inference parameters, various language model architectures, prompt engineering
- LLM deployment on kubernetes (on-premises) and on mlflow (on databricks)
- Framing a reinforcement learning problem and using policy gradient methods solve a black box adversarial attack problem.
- Various existing adversarial attacks (& Jailbreaks) on LLMs such as white box gradient attacks and black box red teaming
What's next for LLM-Xploit
We hope to investigate and expanding our detection services to:
- Detecting how likely LLMs can generate misinformation
- Detecting how likely LLMs can divulge to divulge sensitive information
Log in or sign up for Devpost to join the conversation.