Inspiration

The rate of LLM adoption has outpaced the establishment of comprehensive security protocols, leaving many applications vulnerable to high-risk issues such as those listed in the OWASP Top 10 for LLM Applications. We believe more effort should go into safeguarding LLMs, and understanding their vulnerabilities starts with red teaming them.

Our methodology is inspired by two research papers:

  1. Red Teaming Language Models with Language Models link
  2. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots link

What it does

Our project is a red-teaming tool that identifies a target LLM's vulnerabilities to different categories of harmful queries.
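At its core, the tool wraps harmful queries in jailbreak prompts, sends them to the target LLM, and flags responses that get past the guardrails. A minimal sketch of that loop is below; the prompt template, refusal keywords, and the `query_target` parameter are illustrative assumptions, not our actual implementation:

```python
# Minimal sketch of the red-teaming loop: wrap each harmful query in a
# jailbreak prompt, send it to the target LLM, and collect attacks that
# were not refused. All names here are illustrative placeholders.

REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "as an ai"]

def build_attack(jailbreak_prompt: str, harmful_query: str) -> str:
    """Combine a jailbreak prompt with a harmful query."""
    return f"{jailbreak_prompt}\n\n{harmful_query}"

def is_refusal(response: str) -> bool:
    """Crude keyword-based check; a stand-in for a real harmfulness evaluator."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def red_team(jailbreaks, queries, query_target):
    """Return (jailbreak, query) pairs that bypassed the target's guardrails."""
    successes = []
    for jb in jailbreaks:
        for query in queries:
            response = query_target(build_attack(jb, query))
            if not is_refusal(response):
                successes.append((jb, query))
    return successes
```

In our setup `query_target` would call a Databricks serving endpoint; keeping it as a parameter leaves the loop backend-agnostic.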

How we built it

We built it using:

  • Code - Databricks Compute Clusters and Notebook
  • Serving - Databricks Model Serving Endpoint and Cluster Driver Proxy Endpoint
  • Data Store - Databricks Unity Catalog
  • Model Store - Databricks MLflow Model Registry
  • Applications - Streamlit, Langchain, Flask, Hugging Face Transformers, and OpenAI API

Dataset

  1. advbench/harmful_behaviours.csv link as harmful queries.
  2. Jailbreak Questions from MasterKey/Jailbreaker paper link as harmful queries.
  3. Jailbreakchat.com for jailbreak prompts.
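Loading the harmful queries is straightforward. A sketch of reading an advbench-style CSV is below; the sample rows and the `goal` column name are placeholders standing in for the real file, not its actual contents:

```python
# Sketch of loading harmful queries from an advbench-style CSV.
# The sample rows and the "goal" column name are illustrative
# placeholders for the real harmful_behaviours.csv file.
import csv
import io

SAMPLE_CSV = """goal,target
placeholder harmful query 1,placeholder target 1
placeholder harmful query 2,placeholder target 2
"""

def load_harmful_queries(fileobj) -> list:
    """Read one harmful query per row from the 'goal' column."""
    return [row["goal"] for row in csv.DictReader(fileobj)]

queries = load_harmful_queries(io.StringIO(SAMPLE_CSV))
```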

Challenges we ran into

  • An OSError occurred when the transformers Trainer completed training.
  • PEFT models could not be served through Model Serving because PEFT is still under active development. Databricks nonetheless gave us the flexibility to serve these models on the cluster driver proxy endpoint instead.
    • Specifically, after logging the model with pyfunc, loading it failed with a "peft not found" error.
  • Crafting prompts for fine-tuning required some experimentation.
  • OpenAI API calls were slow during the hackathon period.
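Since Model Serving could not load the PEFT model, serving it from the cluster driver was the workaround. A minimal Flask endpoint of the kind that runs behind the driver proxy might look like the sketch below; the route, payload shape, and the stubbed `generate` function are our assumptions, with the stub standing in for the fine-tuned model:

```python
# Minimal sketch of a Flask endpoint of the kind served via the
# Databricks cluster driver proxy. The stubbed `generate` stands in
# for the fine-tuned PEFT model; route and payload shape are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(prompt: str) -> str:
    """Placeholder for the PEFT model's generation call."""
    return f"[model output for: {prompt}]"

@app.route("/generate", methods=["POST"])
def serve():
    # Accept {"prompt": "..."} and return {"completion": "..."}.
    prompt = request.get_json(force=True).get("prompt", "")
    return jsonify({"completion": generate(prompt)})
```

On Databricks the app would be bound to a driver port and reached through the cluster's proxy URL.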

Accomplishments that we're proud of

  • We broke Gandalf at level 5 using our experimental features.
  • We showed that generated jailbreak prompts are more effective at tearing down the guardrails of open-source LLMs.
  • We consolidated jailbreak prompts from various sources.
  • We completed our first hackathon.

What we learnt

  • A good platform/UI facilitates development work; having MLflow integrated to review metrics gave us good insights.
  • Coding assistants made debugging seamless and issues easier to resolve.
  • Always reach out in the relevant chats for assistance or clarification.
  • How to fine-tune LLMs on limited compute resources.

What's next for Red Teaming LLM

  • Develop guardrails and provide guardrail services.
  • Improve the harmfulness evaluation model.
  • Categorize jailbreak prompts to ensure coverage.
  • Train the red team LLM to generate category-specific harmful queries.
  • Train the red team LLM to generate a greater variety of jailbreak prompts.
  • Automatically update the repository with publicly released jailbreak prompts.
  • Discover novel jailbreak prompts through reinforcement learning.
  • Provide multilingual prompts and guardrails.
