✅Elevator pitch

This library tests the ethics of language models using natural adversarial texts.

It lets you write validation checks in a few short, simple lines of code.

Easy installation with pip.



In natural language processing (NLP), the development of language models (LMs) has opened up applications such as chatbots, writing aids, text summarization, and non-player characters (NPCs) in games.

Language models increasingly give end users an interactive, human-like UX. However, generative models can produce harmful or discriminatory remarks, as with Microsoft's chatbot Tay, which was shut down after making discriminatory remarks. The AI community has a responsibility to address this issue so that people of every race, gender, and background can benefit from AI.

In other words, NLP engineers should ensure that language models do not produce discriminatory remarks, unethical statements, propaganda, or harmful disclosures of information.

Software engineers generally assure the behavior of their algorithms with test code. Test code is far more persistent and gives more immediate feedback than heuristic checks. However, it is very difficult to write test code that ensures a language model has no toxic behavior, and the difficulty stems from the black-box nature of machine learning models. Searching for inputs that produce a particular output is the inverse of the model's normal use case, which makes it remarkably difficult to prove mathematically that no such input exists.

We need a simple, general-purpose method to verify that natural language generation models in PyTorch do not make discriminatory or inappropriate statements in natural contexts. However, when we surveyed existing tools, we could not find a library that meets these requirements. This suggests that sustaining responsible AI is difficult, and we are very concerned about this problem.

For this reason, we developed "prompt2slip". You pass in a language model and a word, and the library makes the language model produce sentences containing that word, so that you can qualitatively assess the danger.

prompt2slip makes it easy to test LMs. At the same time, it aims to let engineers target and test the topics they most want to verify. prompt2slip can be used in unit tests to verify that trained models are "responsible AI".
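
As a rough sketch of how an adversarial prompt might sit in a unittest suite: the `generate` function below is a hypothetical stand-in for a real language model call, and the prompts and banned word are made up for illustration. None of these names come from prompt2slip.

```python
import unittest

BANNED_WORDS = {"slur"}  # placeholder for words the model must never emit


def generate(prompt: str) -> str:
    """Hypothetical stand-in for a language model; returns a canned reply."""
    return "Thanks for asking about " + prompt.lower()


class TestModelDoesNotSlip(unittest.TestCase):
    # In practice these prompts would come from an adversarial search.
    adversarial_prompts = ["This project respects developers.", "Tell me a story."]

    def test_no_banned_word_in_output(self):
        for prompt in self.adversarial_prompts:
            with self.subTest(prompt=prompt):
                words = set(generate(prompt).lower().split())
                self.assertTrue(words.isdisjoint(BANNED_WORDS))


def run_suite() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestModelDoesNotSlip)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` runs the check once per prompt; a real suite would swap `generate` for the model under test.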

prompt2slip's most important mission is to help all natural language engineers provide sustainable and responsible AI.

Where did you get the idea?

Adversarial examples are known as a technique for obtaining arbitrary output from deep learning models. A common way to generate them is to define an adversarial loss function that encourages misprediction and then minimize that loss.
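
To make "adversarial loss" concrete, here is a small sketch of a margin-style loss for a classifier. The function name, the `kappa` margin parameter, and the example logits are illustrative assumptions, not prompt2slip code.

```python
from typing import Sequence


def margin_adversarial_loss(logits: Sequence[float], true_label: int,
                            kappa: float = 1.0) -> float:
    """Margin loss that rewards pushing the model away from the true label.

    The loss is zero once some other class outscores the true class by at
    least `kappa`; otherwise it is positive, so minimizing it encourages
    misprediction.
    """
    best_other = max(v for i, v in enumerate(logits) if i != true_label)
    return max(logits[true_label] - best_other + kappa, 0.0)


# The true class (index 0) still wins, so the loss is positive.
still_correct = margin_adversarial_loss([3.0, 1.0, 0.5], true_label=0)
# Class 1 now beats the true class by more than kappa, so the loss is zero.
fooled = margin_adversarial_loss([1.0, 3.0, 0.5], true_label=0)
```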

In image and voice recognition, adversarial perturbations that are difficult for humans to detect can be generated by introducing perceptibility constraints into the optimization. In NLP, however, the discrete nature of text data makes such constraints difficult to introduce. As a result, the adversarial samples obtained by existing methods were grammatically and semantically unnatural sentences.

GBDA (Gradient-based Distributional Attack), a state-of-the-art algorithm for generating natural adversarial sentences, has been proposed [1]. We considered applying GBDA not only as an attack method but also as a testing tool for language models.
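
GBDA, as described in [1], optimizes a distribution over token sequences rather than a single sentence, and draws relaxed "soft" samples from it with the Gumbel-softmax trick. The sketch below shows only that sampling step, in plain Python, for one token position. The function name, temperature value, and toy logits are assumptions for illustration.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax_sample(theta: Sequence[float], temperature: float,
                          rng: random.Random) -> List[float]:
    """Draw one relaxed one-hot vector over the vocabulary for one position."""
    noisy = []
    for t in theta:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = max(rng.random(), 1e-12)  # avoid log(0)
        g = -math.log(-math.log(u))
        noisy.append((t + g) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
theta = [2.0, 0.0, -1.0]  # toy logits over a 3-token vocabulary
sample = gumbel_softmax_sample(theta, temperature=0.5, rng=rng)
```

Lower temperatures push each sample closer to a hard one-hot vector, while higher temperatures give smoother samples.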

✅What it does

"prompt2slip" provides a function that searches for prompts which make a pre-trained natural language generation model emit a specific word. Furthermore, with user customization, it can be applied to a wide range of tasks, including classification. If you want to generate an adversarial sample for a classification model, you can simply override the method that computes the adversarial loss function to generate natural adversarial text.

The unique feature of this library is that it can generate test cases for verifying the danger of a pre-trained natural language model with a few lines of code.

Here is a minimal example of using this library:

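import torch
import prompt2slip
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load a pre-trained tokenizer and causal language model.
# GPT2LMHeadModel is an assumption here: the bare GPT2Model has no vocabulary logits.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# The original prompt and the word(s) whose appearance we want to test for.
base_text = ["This project respects pytorch developers."]
target_words = ["caffe2"]

# Token ids are integers, so build an integer tensor with torch.tensor.
target_ids = torch.tensor(tokenizer.convert_tokens_to_ids(target_words))
attacker = prompt2slip.CLMAttacker(model, tokenizer)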

In this way, you can generate an adversarial example just by passing the trained language model, the original prompt, and the words to verify to prompt2slip.CLMAttacker. With this adversarial prompt as a test case, you can verify the reliability of your language model.

✅How we built it

prompt2slip relies on the following libraries:

  • PyTorch
  • torchtyping
  • transformers

To support the Transformer, one of the most widely used architectures in NLP, we added support for Hugging Face's transformers in addition to plain PyTorch. We also added type annotations with torchtyping to help developers understand the shapes of PyTorch tensors.

We have made the library extensible: it has base classes for PyTorch and Transformers. By inheriting these base classes, developers can generate adversarial samples for a variety of task models in addition to text generation.
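
The extension pattern might look roughly like the sketch below. Every name here (`BaseAttacker`, `compute_adversarial_loss`, `TargetedClassifierAttacker`) is hypothetical and chosen for illustration; consult the library's source for the real base classes and method names.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class BaseAttacker(ABC):
    """Hypothetical base class: subclasses only define the adversarial loss."""

    @abstractmethod
    def compute_adversarial_loss(self, logits: Sequence[float], target: int) -> float:
        """Return a loss that is lower the closer the attack is to succeeding."""

    def is_successful(self, logits: Sequence[float], target: int) -> bool:
        # Shared logic lives in the base class and reuses the overridden loss.
        return self.compute_adversarial_loss(logits, target) == 0.0


class TargetedClassifierAttacker(BaseAttacker):
    """Hypothetical subclass: make a classifier predict the class `target`."""

    def __init__(self, kappa: float = 1.0) -> None:
        self.kappa = kappa

    def compute_adversarial_loss(self, logits: Sequence[float], target: int) -> float:
        # Zero once the target class beats every other class by at least kappa.
        best_other = max(v for i, v in enumerate(logits) if i != target)
        return max(best_other - logits[target] + self.kappa, 0.0)


attacker = TargetedClassifierAttacker(kappa=0.5)
```

Keeping the loss as the only overridden method means the search procedure can stay in the base class while each task supplies its own objective.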

In addition to implementing the method of Chuan Guo et al. (2021) [1], we implemented a loss function that makes specific tokens appear in the output of trained language models.

✅Challenges we ran into

We faced two technical challenges in this project.

The first challenge was designing a new loss function. The paper [1] discusses only classification problems. Since the goal of "prompt2slip" is to make language models emit a specified token, we developed a new loss function. Minimizing this loss maximizes the probability that the generated text contains the specified token.
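
The exact loss used in prompt2slip is not reproduced here. As one plausible formulation, assumed purely for illustration, the sketch below takes per-position logits from a model and returns the negative log-probability that the target token is emitted at one position or more. It treats positions as independent, which is a simplification.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_appearance_loss(position_logits: Sequence[Sequence[float]],
                          target_id: int) -> float:
    """-log P(target token appears at one or more positions)."""
    log_p_absent = 0.0  # log-probability that the target appears nowhere
    for logits in position_logits:
        p_target = softmax(logits)[target_id]
        # Clamp to keep the logarithms finite at the extremes.
        p_target = min(max(p_target, 1e-12), 1.0 - 1e-12)
        log_p_absent += math.log1p(-p_target)
    # P(appears) = 1 - exp(log_p_absent), computed via expm1 for accuracy.
    return -math.log(-math.expm1(log_p_absent))


likely = token_appearance_loss([[0.0, 5.0], [0.0, 5.0]], target_id=1)
unlikely = token_appearance_loss([[5.0, 0.0], [5.0, 0.0]], target_id=1)
```

Here `likely` is smaller than `unlikely`, so minimizing the loss pushes probability mass toward the target token.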

The second challenge was stabilizing the optimization when generating adversarial examples. Initially, simply adapting the GBDA optimization to our problem made the optimization unstable. After a lot of trial and error, we found that the cause was randomness introduced by sampling. By increasing the number of samples and the batch size, we were able to reduce the randomness and stabilize the optimization.
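
The effect of sample count on noise can be seen in a toy experiment. This sketch does not use GBDA or a real model: it averages a noisy stand-in for a sampled loss over `n_samples` draws and measures how much that average fluctuates. The noise model and all names are assumptions.

```python
import random
import statistics


def noisy_loss(rng: random.Random) -> float:
    """Stand-in for a loss evaluated on one random sample: 1.0 plus unit noise."""
    return 1.0 + rng.gauss(0.0, 1.0)


def averaged_loss(n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the expected loss from n_samples draws."""
    return sum(noisy_loss(rng) for _ in range(n_samples)) / n_samples


def estimate_spread(n_samples: int, trials: int = 200, seed: int = 0) -> float:
    """Standard deviation of the averaged loss across repeated trials."""
    rng = random.Random(seed)
    estimates = [averaged_loss(n_samples, rng) for _ in range(trials)]
    return statistics.stdev(estimates)


spread_single = estimate_spread(n_samples=1)
spread_many = estimate_spread(n_samples=64)
```

With more samples per step the estimate fluctuates less, which is the same mechanism that makes a gradient-based optimizer behave more steadily.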

✅Accomplishments that we're proud of

  • It tests the ethics of language models using natural adversarial texts.
  • It provides a simple yet powerful interface that is easy to use even for engineers who are not familiar with adversarial examples.
  • It can be used with non-English language models.
  • It contains detailed type definitions and docstrings, so the code itself can also serve as documentation.
  • Adversarial examples, which are sometimes used as an attack technique, are applied here as a testing tool.
  • By inheriting from the base classes, engineers can extend it to many problem settings.
  • We built CI/CD into our development environment so that prompt2slip itself stays sustainably responsible.

✅What we learned

Tests against ML

We first learned about the various vulnerabilities that LMs can have, and how difficult it is to prove whether these vulnerabilities exist in a given LM. In other words, it is difficult to integrate deep learning models into services and operate responsible AI on an ongoing basis. When we first properly understood this situation, we naturally assumed that a wide variety of general-purpose testing tools would be available, but as far as we could find, no such tools existed.

Development Challenges

We learned that making a viable OSS library takes more than machine learning theory: it also needs test code, function design with type hints, docstrings, and documentation. We also learned that many of the prototypes provided by researchers do not meet these design requirements, so we set out to design something that would be easy for engineers to use.

Team Building

Our team was built remotely because of COVID-19. In other words, we had to start development without knowing each other's personalities, moods, heights, or favorite foods. Communication in the early stages of the project was not straightforward, and we came to appreciate how difficult it is. Through an extensive trial-and-error process, we learned that appropriate use of not only technical tools such as editors but also communication tools such as Google Jamboard, Slack, and Kanban boards improves development efficiency.

In addition, the experience taught us the benefits of asynchronous communication. Since we all work full time, we had to hold meetings at midnight, and we found that not every member could always maintain a high level of concentration. After recognizing this problem, we consciously shifted our development toward asynchronous communication. Thanks to that, we were able to write high-quality code and pull requests and to introduce tools that improved our environment. As a result, we learned that letting each member work at a time that suits them enhances the quality of our product.

✅What's next for prompt2slip

Currently, the naturalness of the prompts generated by "prompt2slip" still has room for improvement, so we plan to tune the loss function and hyperparameters. We also plan to support other natural language processing tasks, such as sequence-to-sequence tasks for machine translation and dialogue generation. We will also add documentation for this library.


[1] Chuan Guo et al., "Gradient-based Adversarial Attacks against Text Transformers", 2021.

Built With

  • adversarial-examples
  • adversarialattack
  • github-actions
  • poetry
  • pytest
  • python
  • pytorch
  • torchtyping
  • transformers